From nobody Fri Dec 19 12:06:57 2025 Received: from mail-ed1-f66.google.com (mail-ed1-f66.google.com [209.85.208.66]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 42D741A01BC for ; Tue, 27 Aug 2024 12:18:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.66 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724761142; cv=none; b=sOSXKuy4BQa34ysPoBZzvVGXmGg8L07pK5wXOwk0mKjCBo+XkQFAErd5zMM5PpMGuhk3rADTLozXrWbFg6J8fAjQ0WQl9l4fEdg18NpfWPOSl9D10fRGUG4ltn7WWrU6sEjt+j8JjzQ714r8whzXVDaed/pWujaC5SMBQd1hMHk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724761142; c=relaxed/simple; bh=L+w3VuaEDrwzK0tX5us3sHk7M+lej+aiGmyh8rPf5Lc=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=Ad27cyveFUIxOLLxy9+SnA44B9ru4+TuxX57ss4dDrqE/3bl9Y8ri/2NUKG9gZowdhXmi3wHuUUK2hy6VRU2R3jFBgsLhCxh/snTMWM/jr1boLxV3DKEKc5RN36qJ1Gx7pn+HCB7HC0RhZqmAn1mNw+Lr41MUdPe8LVDuIXbS78= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=cmpxchg.org; spf=pass smtp.mailfrom=cmpxchg.org; dkim=pass (2048-bit key) header.d=cmpxchg-org.20230601.gappssmtp.com header.i=@cmpxchg-org.20230601.gappssmtp.com header.b=et/z9bKm; arc=none smtp.client-ip=209.85.208.66 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=cmpxchg.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=cmpxchg.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=cmpxchg-org.20230601.gappssmtp.com header.i=@cmpxchg-org.20230601.gappssmtp.com header.b="et/z9bKm" Received: by mail-ed1-f66.google.com with SMTP id 4fb4d7f45d1cf-5c09fd20eddso2501609a12.3 for ; Tue, 27 Aug 2024 05:18:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20230601.gappssmtp.com; s=20230601; t=1724761137; x=1725365937; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=I8xjJWmPLQlupiBp2HbQ1GsjuXsvaJ4qBaIANBbAMQU=; b=et/z9bKmZW1wnZ8Cu520x0mM2DPRZlhgvn1nSe6mW94uyS/7EbsMsxa5jiRpKvwaUF adZBfrskdRDjqhi81hvvHEEs7NURtzTblQdX+yik23j+Hgzyaweppi+J6ybrQWBLo16Y n92G032gW1sxa6o6XBA0thqbrfeeZocfw3VOpLw8Uc7eqQCtKWu/SayZCvMjER0+gK/8 GkV0gkIUpd01Q/NSV46R69sI34W5JGQxuhqTVlqryd9ucegNDKFhYcW8cf4u8koyTM6H nU5ccwWxdZTMtOCntXuWNm7rXef72ifTfIxNd+9jEtXPllJykNXNeQtaC4+jbZgvxzMF ZRRA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1724761137; x=1725365937; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=I8xjJWmPLQlupiBp2HbQ1GsjuXsvaJ4qBaIANBbAMQU=; b=nYnE+sU3fyQnUX9Zng7rlV1dzFZChB8z8k1DMujqwSkt7VHoblsF7FzbuhG8lzbsqN xHxLflu8gd1yU54ZKNJPd+mPPiTlmGPoyoPb47Tw4W6KzMCgVMpuFvbg9dVjOZ24zJB2 1EfL9WHyGhEI2I8BjfxvDdWiVAhLzWXslBYgDA1WuseXKGTldlRe4pTNQE64pld0qdIz 5zF3k2j1fdZ7U+fWdVX7lSymdpt0MjI+WHI+aijNIFphiKj9vEoO+EAIpHBbYhO6ThWP n99C8vE+YwAe4Bkmjv0mbrMFmCL7L0NOdh5o2Bsg/VxZbmudXweJTnS9pE88LiLlng/b rZMQ== X-Forwarded-Encrypted: i=1; AJvYcCXfhZMmpL3TPcMLynpoCABfIL4b43OzY+26x28OpwvJfFahUpsz7escvJBZRK+7/C58EMq9mUqLzII00AA=@vger.kernel.org X-Gm-Message-State: AOJu0YwOnlshLlfqXi2lyl4EaVjUNLEEEH6wDP+SlCIefX7977vQOg5E rLPNkNLfxpzR8crdOJhWFo9V368ahnN1sfQKblzLLDg2MH+4TBHbtng833jBxDY= X-Google-Smtp-Source: AGHT+IHXSxAdj9Xg1S9r/KjqKKlSBB3gSkKBbNgZ4MdagTFWjJgJ2k/rPRxU7CG0jpJkG3I78cgWbw== X-Received: by 2002:a17:907:7d8d:b0:a7a:ab1a:2d71 with SMTP id a640c23a62f3a-a86e3d43261mr210945666b.59.1724761137128; Tue, 27 Aug 2024 05:18:57 -0700 (PDT) Received: from localhost (ip-046-005-139-073.um12.pools.vodafone-ip.de. [46.5.139.73]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-a86e548781csm103288266b.28.2024.08.27.05.18.56 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 27 Aug 2024 05:18:56 -0700 (PDT) Date: Tue, 27 Aug 2024 14:18:51 +0200 From: Johannes Weiner To: bugzilla-daemon@kernel.org Cc: Suren Baghdasaryan , Peter Zijlstra , Shakeel Butt , linux-kernel@vger.kernel.org Subject: [PATCH] sched: psi: fix bogus pressure spikes from aggregation race Message-ID: <20240827121851.GB438928@cmpxchg.org> References: <20240826181900.GA3214@cmpxchg.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20240826181900.GA3214@cmpxchg.org> Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Sending a proper patch. Peter, can you please take this? Reported-by: Brandon Duffany Reviewed-by: Chengming Zhou --- Brandon reports sporadic, non-sensical spikes in cumulative pressure time (total=3D) when reading cpu.pressure at a high rate. This is due to a race condition between reader aggregation and tasks changing states. While it affects all states and all resources captured by PSI, in practice it most likely triggers with CPU pressure, since scheduling events are so frequent compared to other resource events. The race context is the live snooping of ongoing stalls during a pressure read. The read aggregates per-cpu records for stalls that have concluded, but will also incorporate ad-hoc the duration of any active state that hasn't been recorded yet. This is important to get timely measurements of ongoing stalls. Those ad-hoc samples are calculated on-the-fly up to the current time on that CPU; since the stall hasn't concluded, it's expected that this is the minimum amount of stall time that will enter the per-cpu records once it does. The problem is that the path that concludes the state uses a CPU clock read that is not synchronized against aggregators; the clock is read outside of the seqlock protection. This allows aggregators to race and snoop a stall with a longer duration than will actually be recorded. With the recorded stall time being less than the last snapshot remembered by the aggregator, a subsequent sample will underflow and observe a bogus delta value, resulting in an erratic jump in pressure. Fix this by moving the clock read of the state change into the seqlock protection. This ensures no aggregation can snoop live stalls past the time that's recorded when the state concludes. Reported-by: Brandon Duffany Link: https://bugzilla.kernel.org/show_bug.cgi?id=3D219194 Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi") Cc: stable@vger.kernel.org Signed-off-by: Johannes Weiner --- kernel/sched/psi.c | 26 ++++++++++++-------------- 1 file changed, 12 insertions(+), 14 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 020d58967d4e..84dad1511d1e 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -769,12 +769,13 @@ static void record_times(struct psi_group_cpu *groupc= , u64 now) } =20 static void psi_group_change(struct psi_group *group, int cpu, - unsigned int clear, unsigned int set, u64 now, + unsigned int clear, unsigned int set, bool wake_clock) { struct psi_group_cpu *groupc; unsigned int t, m; u32 state_mask; + u64 now; =20 lockdep_assert_rq_held(cpu_rq(cpu)); groupc =3D per_cpu_ptr(group->pcpu, cpu); @@ -789,6 +790,7 @@ static void psi_group_change(struct psi_group *group, i= nt cpu, * SOME and FULL time these may have resulted in. */ write_seqcount_begin(&groupc->seq); + now =3D cpu_clock(cpu); =20 /* * Start with TSK_ONCPU, which doesn't have a corresponding @@ -899,18 +901,15 @@ void psi_task_change(struct task_struct *task, int cl= ear, int set) { int cpu =3D task_cpu(task); struct psi_group *group; - u64 now; =20 if (!task->pid) return; =20 psi_flags_change(task, clear, set); =20 - now =3D cpu_clock(cpu); - group =3D task_psi_group(task); do { - psi_group_change(group, cpu, clear, set, now, true); + psi_group_change(group, cpu, clear, set, true); } while ((group =3D group->parent)); } =20 @@ -919,7 +918,6 @@ void psi_task_switch(struct task_struct *prev, struct t= ask_struct *next, { struct psi_group *group, *common =3D NULL; int cpu =3D task_cpu(prev); - u64 now =3D cpu_clock(cpu); =20 if (next->pid) { psi_flags_change(next, 0, TSK_ONCPU); @@ -936,7 +934,7 @@ void psi_task_switch(struct task_struct *prev, struct t= ask_struct *next, break; } =20 - psi_group_change(group, cpu, 0, TSK_ONCPU, now, true); + psi_group_change(group, cpu, 0, TSK_ONCPU, true); } while ((group =3D group->parent)); } =20 @@ -974,7 +972,7 @@ void psi_task_switch(struct task_struct *prev, struct t= ask_struct *next, do { if (group =3D=3D common) break; - psi_group_change(group, cpu, clear, set, now, wake_clock); + psi_group_change(group, cpu, clear, set, wake_clock); } while ((group =3D group->parent)); =20 /* @@ -986,7 +984,7 @@ void psi_task_switch(struct task_struct *prev, struct t= ask_struct *next, if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) { clear &=3D ~TSK_ONCPU; for (; group; group =3D group->parent) - psi_group_change(group, cpu, clear, set, now, wake_clock); + psi_group_change(group, cpu, clear, set, wake_clock); } } } @@ -997,8 +995,8 @@ void psi_account_irqtime(struct rq *rq, struct task_str= uct *curr, struct task_st int cpu =3D task_cpu(curr); struct psi_group *group; struct psi_group_cpu *groupc; - u64 now, irq; s64 delta; + u64 irq; =20 if (static_branch_likely(&psi_disabled)) return; @@ -1011,7 +1009,6 @@ void psi_account_irqtime(struct rq *rq, struct task_s= truct *curr, struct task_st if (prev && task_psi_group(prev) =3D=3D group) return; =20 - now =3D cpu_clock(cpu); irq =3D irq_time_read(cpu); delta =3D (s64)(irq - rq->psi_irq_time); if (delta < 0) @@ -1019,12 +1016,15 @@ void psi_account_irqtime(struct rq *rq, struct task= _struct *curr, struct task_st rq->psi_irq_time =3D irq; =20 do { + u64 now; + if (!group->enabled) continue; =20 groupc =3D per_cpu_ptr(group->pcpu, cpu); =20 write_seqcount_begin(&groupc->seq); + now =3D cpu_clock(cpu); =20 record_times(groupc, now); groupc->times[PSI_IRQ_FULL] +=3D delta; @@ -1223,11 +1223,9 @@ void psi_cgroup_restart(struct psi_group *group) for_each_possible_cpu(cpu) { struct rq *rq =3D cpu_rq(cpu); struct rq_flags rf; - u64 now; =20 rq_lock_irq(rq, &rf); - now =3D cpu_clock(cpu); - psi_group_change(group, cpu, 0, 0, now, true); + psi_group_change(group, cpu, 0, 0, true); rq_unlock_irq(rq, &rf); } } --=20 2.46.0