Message-ID: <20250702121158.350561696@infradead.org>
User-Agent: quilt/0.68
Date: Wed, 02 Jul 2025 13:49:25 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, Johannes Weiner
Subject: [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage
References: <20250702114924.091581796@infradead.org>

Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus
pressure spikes from aggregation race") caused a regression for him on
a high context switch rate benchmark (schbench) due to the now
repeating cpu_clock() calls.

In particular the problem is that get_recent_times() will extrapolate
the current state to 'now'. But if an update uses a timestamp from
before the start of the update, it is possible to get two reads with
inconsistent results. It is effectively back-dating an update.

(Note that this all hard-relies on the clock being synchronized across
CPUs -- if this is not the case, all bets are off.)

Combined with the fact that there are per-group-per-cpu seqcounts, the
commit in question pushed the clock read into the group iteration,
causing tree-depth cpu_clock() calls. On architectures where
cpu_clock() has appreciable overhead, this hurts.

Instead move to a per-cpu seqcount, which allows us to have a single
clock read for all group updates, increasing internal consistency and
lowering update overhead. This comes at the cost of a longer update
side (proportional to the tree depth), which can cause the read side to
retry more often.

Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
Reported-by: Dietmar Eggemann
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Johannes Weiner
Tested-by: Dietmar Eggemann
Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net
Reported-by: Chris Mason
Suggested-by: Beata Michalska
Tested-by: K Prateek Nayak
Tested-by: Srikanth Aithal
---
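[Editor's note, not for the changelog: to make the locking scheme easy to
see at a glance, below is a minimal, self-contained userspace sketch of
the pattern this patch adopts -- one seqcount per CPU, a single clock read
inside the write section, and readers that retry across updates. The
hand-rolled C11-atomics seqcount merely stands in for the kernel's
seqcount_t (which has stricter lockless rules), and all names here
(update_groups(), read_groups(), group_time[]) are invented for the
example.]

/*
 * Userspace sketch of a per-CPU seqcount guarding per-group state.
 * Illustrative only -- not kernel code.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NR_GROUPS 4			/* stand-in for the cgroup tree depth */

static atomic_uint seq;			/* one instance per CPU in the real patch */
static uint64_t group_time[NR_GROUPS];	/* per-group-per-CPU state */

static uint64_t clock_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Write side: one clock read now covers the whole group hierarchy. */
static void update_groups(void)
{
	atomic_fetch_add(&seq, 1);	/* odd: update in progress */
	uint64_t now = clock_ns();	/* single, shared timestamp */
	for (int g = 0; g < NR_GROUPS; g++)
		group_time[g] = now;	/* every group sees the same 'now' */
	atomic_fetch_add(&seq, 1);	/* even: stable again */
}

/* Read side: retry whenever the snapshot raced with an update. */
static void read_groups(uint64_t snap[NR_GROUPS])
{
	unsigned int begin;

	for (;;) {
		begin = atomic_load(&seq);
		if (begin & 1)
			continue;	/* writer active, spin */
		for (int g = 0; g < NR_GROUPS; g++)
			snap[g] = group_time[g];
		if (atomic_load(&seq) == begin)
			return;		/* no update raced; snapshot is coherent */
	}
}

int main(void)
{
	uint64_t snap[NR_GROUPS];

	update_groups();
	read_groups(snap);
	for (int g = 0; g < NR_GROUPS; g++)
		printf("group %d: %llu ns\n", g, (unsigned long long)snap[g]);
	return 0;
}

The trade-off called out in the changelog is visible here as well: the
write section now spans the whole group loop, so a reader may have to
retry across a full hierarchy update rather than a single group's.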
 include/linux/psi_types.h |    6 --
 kernel/sched/psi.c        |  121 +++++++++++++++++++++++++----------------------
 2 files changed, 68 insertions(+), 59 deletions(-)

--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -84,11 +84,9 @@ enum psi_aggregators {
 struct psi_group_cpu {
 	/* 1st cacheline updated by the scheduler */
 
-	/* Aggregator needs to know of concurrent changes */
-	seqcount_t seq ____cacheline_aligned_in_smp;
-
 	/* States of the tasks belonging to this group */
-	unsigned int tasks[NR_PSI_TASK_COUNTS];
+	unsigned int tasks[NR_PSI_TASK_COUNTS]
+			____cacheline_aligned_in_smp;
 
 	/* Aggregate pressure state derived from the tasks */
 	u32 state_mask;
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -176,6 +176,28 @@ struct psi_group psi_system = {
 	.pcpu = &system_group_pcpu,
 };
 
+static DEFINE_PER_CPU(seqcount_t, psi_seq);
+
+static inline void psi_write_begin(int cpu)
+{
+	write_seqcount_begin(per_cpu_ptr(&psi_seq, cpu));
+}
+
+static inline void psi_write_end(int cpu)
+{
+	write_seqcount_end(per_cpu_ptr(&psi_seq, cpu));
+}
+
+static inline u32 psi_read_begin(int cpu)
+{
+	return read_seqcount_begin(per_cpu_ptr(&psi_seq, cpu));
+}
+
+static inline bool psi_read_retry(int cpu, u32 seq)
+{
+	return read_seqcount_retry(per_cpu_ptr(&psi_seq, cpu), seq);
+}
+
 static void psi_avgs_work(struct work_struct *work);
 
 static void poll_timer_fn(struct timer_list *t);
@@ -186,7 +208,7 @@ static void group_init(struct psi_group
 
 	group->enabled = true;
 	for_each_possible_cpu(cpu)
-		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
+		seqcount_init(per_cpu_ptr(&psi_seq, cpu));
 	group->avg_last_update = sched_clock();
 	group->avg_next_update = group->avg_last_update + psi_period;
 	mutex_init(&group->avgs_lock);
@@ -266,14 +288,14 @@ static void get_recent_times(struct psi_
 
 	/* Snapshot a coherent view of the CPU state */
 	do {
-		seq = read_seqcount_begin(&groupc->seq);
+		seq = psi_read_begin(cpu);
 		now = cpu_clock(cpu);
 		memcpy(times, groupc->times, sizeof(groupc->times));
 		state_mask = groupc->state_mask;
 		state_start = groupc->state_start;
 		if (cpu == current_cpu)
 			memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
-	} while (read_seqcount_retry(&groupc->seq, seq));
+	} while (psi_read_retry(cpu, seq));
 
 	/* Calculate state time deltas against the previous snapshot */
 	for (s = 0; s < NR_PSI_STATES; s++) {
@@ -772,31 +794,21 @@ static void record_times(struct psi_grou
 	groupc->times[PSI_NONIDLE] += delta;
 }
 
+#define for_each_group(iter, group) \
+	for (typeof(group) iter = group; iter; iter = iter->parent)
+
 static void psi_group_change(struct psi_group *group, int cpu,
 			     unsigned int clear, unsigned int set,
-			     bool wake_clock)
+			     u64 now, bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
 	unsigned int t, m;
 	u32 state_mask;
-	u64 now;
 
 	lockdep_assert_rq_held(cpu_rq(cpu));
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
 	/*
-	 * First we update the task counts according to the state
-	 * change requested through the @clear and @set bits.
-	 *
-	 * Then if the cgroup PSI stats accounting enabled, we
-	 * assess the aggregate resource states this CPU's tasks
-	 * have been in since the last change, and account any
-	 * SOME and FULL time these may have resulted in.
-	 */
-	write_seqcount_begin(&groupc->seq);
-	now = cpu_clock(cpu);
-
-	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
 	 * task count - it's just a boolean flag directly encoded in
 	 * the state mask. Clear, set, or carry the current state if
@@ -847,7 +859,6 @@ static void psi_group_change(struct psi_
 
 		groupc->state_mask = state_mask;
 
-		write_seqcount_end(&groupc->seq);
 		return;
 	}
 
@@ -868,8 +879,6 @@ static void psi_group_change(struct psi_
 
 	groupc->state_mask = state_mask;
 
-	write_seqcount_end(&groupc->seq);
-
 	if (state_mask & group->rtpoll_states)
 		psi_schedule_rtpoll_work(group, 1, false);
 
@@ -904,24 +913,29 @@ static void psi_flags_change(struct task
 void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
-	struct psi_group *group;
+	u64 now;
 
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
 
-	group = task_psi_group(task);
-	do {
-		psi_group_change(group, cpu, clear, set, true);
-	} while ((group = group->parent));
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
+	for_each_group(group, task_psi_group(task))
+		psi_group_change(group, cpu, clear, set, now, true);
+	psi_write_end(cpu);
}
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep)
 {
-	struct psi_group *group, *common = NULL;
+	struct psi_group *common = NULL;
 	int cpu = task_cpu(prev);
+	u64 now;
+
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
 
 	if (next->pid) {
 		psi_flags_change(next, 0, TSK_ONCPU);
@@ -930,16 +944,15 @@ void psi_task_switch(struct task_struct
 		 * ancestors with @prev, those will already have @prev's
 		 * TSK_ONCPU bit set, and we can stop the iteration there.
 		 */
-		group = task_psi_group(next);
-		do {
-			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
-			    PSI_ONCPU) {
+		for_each_group(group, task_psi_group(next)) {
+			struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+
+			if (groupc->state_mask & PSI_ONCPU) {
 				common = group;
 				break;
 			}
-
-			psi_group_change(group, cpu, 0, TSK_ONCPU, true);
-		} while ((group = group->parent));
+			psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
+		}
 	}
 
 	if (prev->pid) {
@@ -972,12 +985,11 @@ void psi_task_switch(struct task_struct
 
 		psi_flags_change(prev, clear, set);
 
-		group = task_psi_group(prev);
-		do {
+		for_each_group(group, task_psi_group(prev)) {
 			if (group == common)
 				break;
-			psi_group_change(group, cpu, clear, set, wake_clock);
-		} while ((group = group->parent));
+			psi_group_change(group, cpu, clear, set, now, wake_clock);
+		}
 
 		/*
 		 * TSK_ONCPU is handled up to the common ancestor. If there are
@@ -987,20 +999,21 @@ void psi_task_switch(struct task_struct
 		 */
 		if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
 			clear &= ~TSK_ONCPU;
-			for (; group; group = group->parent)
-				psi_group_change(group, cpu, clear, set, wake_clock);
+			for_each_group(group, common)
+				psi_group_change(group, cpu, clear, set, now, wake_clock);
 		}
 	}
+	psi_write_end(cpu);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev)
 {
 	int cpu = task_cpu(curr);
-	struct psi_group *group;
 	struct psi_group_cpu *groupc;
 	s64 delta;
 	u64 irq;
+	u64 now;
 
 	if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
 		return;
@@ -1009,8 +1022,7 @@ void psi_account_irqtime(struct rq *rq,
 		return;
 
 	lockdep_assert_rq_held(rq);
-	group = task_psi_group(curr);
-	if (prev && task_psi_group(prev) == group)
+	if (prev && task_psi_group(prev) == task_psi_group(curr))
 		return;
 
 	irq = irq_time_read(cpu);
@@ -1019,25 +1031,22 @@ void psi_account_irqtime(struct rq *rq,
 		return;
 	rq->psi_irq_time = irq;
 
-	do {
-		u64 now;
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
 
+	for_each_group(group, task_psi_group(curr)) {
 		if (!group->enabled)
 			continue;
 
 		groupc = per_cpu_ptr(group->pcpu, cpu);
 
-		write_seqcount_begin(&groupc->seq);
-		now = cpu_clock(cpu);
-
 		record_times(groupc, now);
 		groupc->times[PSI_IRQ_FULL] += delta;
 
-		write_seqcount_end(&groupc->seq);
-
 		if (group->rtpoll_states & (1 << PSI_IRQ_FULL))
 			psi_schedule_rtpoll_work(group, 1, false);
-	} while ((group = group->parent));
+	}
+	psi_write_end(cpu);
 }
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
 
@@ -1225,12 +1234,14 @@ void psi_cgroup_restart(struct psi_group
 		return;
 
 	for_each_possible_cpu(cpu) {
-		struct rq *rq = cpu_rq(cpu);
-		struct rq_flags rf;
+		u64 now;
 
-		rq_lock_irq(rq, &rf);
-		psi_group_change(group, cpu, 0, 0, true);
-		rq_unlock_irq(rq, &rf);
+		guard(rq_lock_irq)(cpu_rq(cpu));
+
+		psi_write_begin(cpu);
+		now = cpu_clock(cpu);
+		psi_group_change(group, cpu, 0, 0, now, true);
+		psi_write_end(cpu);
 	}
 }
 #endif /* CONFIG_CGROUPS */
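[Editor's note on the final hunk: guard(rq_lock_irq)(cpu_rq(cpu)) is the
scope-based cleanup helper from <linux/cleanup.h>; the run-queue lock is
dropped automatically when the loop body is left, which is why no explicit
unlock remains. For readers unfamiliar with the mechanism, a rough
userspace analogue built on the compiler's cleanup attribute (a GCC/Clang
extension; the MUTEX_GUARD macro and names are invented, not the
kernel's):]

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Called automatically when the guarded variable goes out of scope. */
static void mutex_unlock_cleanup(pthread_mutex_t **m)
{
	pthread_mutex_unlock(*m);
}

/* Lock now, unlock at end of scope -- no explicit unlock on any path. */
#define MUTEX_GUARD(m) \
	pthread_mutex_t *_guard __attribute__((cleanup(mutex_unlock_cleanup))) = \
		(pthread_mutex_lock(m), (m))

int main(void)
{
	{
		MUTEX_GUARD(&lock);
		printf("critical section\n");
	}	/* mutex_unlock_cleanup() fires here, like guard() above */
	return 0;
}

The kernel's guard() adds type-safe macro plumbing (DEFINE_GUARD() and
friends) on top of the same idea; the sketch shows only the core
mechanism.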