[PATCH v4] sched/psi: Skip CPUs with zero non-idle delta in per-CPU aggregation

Posted by Zhan Xusheng 1 month, 2 weeks ago
collect_percpu_times() iterates over every possible CPU to build a
non-idle-weighted average of the PSI state times. When a CPU has
no PSI_NONIDLE delta for the current sampling interval:
  nonidle  = nsecs_to_jiffies(times[PSI_NONIDLE]) = 0
  deltas[s] += times[s] * nonidle               /* += 0 */

so the weighted accumulation contributes nothing.

get_recent_times() already sets the PSI_NONIDLE bit in
cpu_changed_states iff the PSI_NONIDLE delta is non-zero. Use that
bit to skip such CPUs early, as suggested by Johannes, avoiding the
nsecs_to_jiffies() call.

No functional change intended.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
---
v4:
 - Drop the incorrect Reviewed-by added in v2/v3; replace with
   Suggested-by. Johannes' "Makes sense." on v1 was an
   acknowledgement and an implementation suggestion, not a review
   tag.
 - Reword the commit message to describe "PSI_NONIDLE delta"
   rather than "non-idle jiffies", matching the actual check.
v3: https://lore.kernel.org/all/20260313034847.1422-1-zhanxusheng@xiaomi.com/
 - Resend of v2.
v2: https://lore.kernel.org/all/20260204022328.23938-1-zhanxusheng@xiaomi.com/
 - Use cpu_changed_states & (1 << PSI_NONIDLE) per Johannes'
   suggestion, saving the nsecs_to_jiffies() call.
v1: https://lore.kernel.org/all/20260203100007.22044-1-zhanxusheng@xiaomi.com/
---
 kernel/sched/psi.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index d9c9d9480a45..cd1174f0b5e5 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -384,6 +384,13 @@ static void collect_percpu_times(struct psi_group *group,
 
 		get_recent_times(group, cpu, aggregator, times,
 				&cpu_changed_states);
+		/*
+		 * If this CPU's PSI_NONIDLE delta is zero, it contributes
+		 * nothing to nonidle_total or to any deltas[] entry below,
+		 * so skip it early.
+		 */
+		if (!(cpu_changed_states & (1 << PSI_NONIDLE)))
+			continue;
 		changed_states |= cpu_changed_states;
 
 		nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]);
-- 
2.43.0
Re: [PATCH v4] sched/psi: Skip CPUs with zero non-idle delta in per-CPU aggregation
Posted by Peter Zijlstra 1 month, 1 week ago
On Wed, Apr 29, 2026 at 06:05:55PM +0800, Zhan Xusheng wrote:
> collect_percpu_times() iterates over every possible CPU to build a
> non-idle-weighted average of the PSI state times. When a CPU has
> no PSI_NONIDLE delta for the current sampling interval:
>   nonidle  = nsecs_to_jiffies(times[PSI_NONIDLE]) = 0
>   deltas[s] += times[s] * nonidle               /* += 0 */
> 
> so the weighted accumulation contributes nothing.
> 
> get_recent_times() already sets the PSI_NONIDLE bit in
> cpu_changed_states iff the PSI_NONIDLE delta is non-zero. Use that
> bit to skip such CPUs early, as suggested by Johannes, avoiding the
> nsecs_to_jiffies() call.
> 
> No functional change intended.

So presumably this is an optimization. Where is the data that justifies
this?
[PATCH v5] sched/psi: Skip CPUs with zero non-idle delta in per-CPU aggregation
Posted by Zhan Xusheng 1 month, 1 week ago
collect_percpu_times() iterates over every possible CPU to build a
non-idle-weighted average of the PSI state times. When a CPU has no
PSI_NONIDLE delta for the current sampling interval:
  nonidle     = nsecs_to_jiffies(times[PSI_NONIDLE]) = 0
  deltas[s]  += times[s] * nonidle               /* += 0 */

so the weighted accumulation contributes nothing.

get_recent_times() already sets the PSI_NONIDLE bit in
cpu_changed_states iff the PSI_NONIDLE delta is non-zero. Use that
bit to skip such CPUs early, as suggested by Johannes, avoiding the
nsecs_to_jiffies() call and the u64 multiply-adds that follow (one
per state below PSI_NONIDLE).

No functional change: on the skipped path the old code adds zero to
deltas[] and zero to nonidle_total, which is exactly the result of
not iterating.

Measured on an i7-8700 (6C/12T), with the same mainline base and the
same build flags for both kernels. The reader is a pinned userspace
loop of open()+read()+close() on /proc/pressure/cpu, run for 100k
iterations inside a KVM guest with -smp matching the host's logical
CPU count (12):
  load             pctl     baseline    patched     diff
  idle             p50       2438 ns    2270 ns    -6.9%
  idle             p99       2598 ns    2449 ns    -5.7%
  1 busy / 12      p50       2479 ns    2281 ns    -8.0%
  all 12 busy      p50       3738 ns    3537 ns    -5.4%

The all-busy improvement shows the skip also kicks in when the box
is hot: between two samples, many CPUs record no PSI_NONIDLE state
transition even if they've been 100% utilised.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
---
 kernel/sched/psi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index d9c9d9480a45..f220debc3fe0 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -386,6 +386,9 @@ static void collect_percpu_times(struct psi_group *group,
 				&cpu_changed_states);
 		changed_states |= cpu_changed_states;
 
+		if (!(cpu_changed_states & (1 << PSI_NONIDLE)))
+			continue;
+
 		nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]);
 		nonidle_total += nonidle;
 
-- 
2.43.0