nohz.nr_cpus was observed to be a contended cacheline when running
an enterprise workload on large systems.
The fundamental scalability challenge with nohz.idle_cpus_mask
and nohz.nr_cpus is the following:
(1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
(or nohz.idle_cpus_mask) and nohz.has_blocked to see whether there's
any nohz balancing work to do, on every scheduler tick.
(2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
(the latter through nohz_balancer_kick() via sched_tick()) modify (write)
nohz.nr_cpus (and/or nohz.idle_cpus_mask) and nohz.has_blocked.
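Schematically (a simplified sketch of the pre-patch access pattern;
the exact code is in the diff below):

    /* (1) reader: every scheduler tick, on every busy CPU: */
    nohz_balancer_kick():
        read nohz.nr_cpus (or nohz.idle_cpus_mask), nohz.has_blocked

    /* (2) writers: every idle entry/exit, on any CPU: */
    nohz_balance_enter_idle():
        cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
        atomic_inc(&nohz.nr_cpus);
    nohz_balance_exit_idle():
        cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
        atomic_dec(&nohz.nr_cpus);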
The characteristic frequencies are the following:
(1) nohz_balancer_kick() happens at scheduler (busy) tick frequency
on CPUs which have not gone idle. This is a relatively constant
frequency in the ~1 kHz range or lower.
(2) happens at idle enter/exit frequency on every CPU that goes idle.
This is workload dependent, but can easily reach hundreds of kHz for
IO-bound loads and high CPU counts. I.e. it can be orders of magnitude
higher than (1), in which case a cache miss at every invocation of (1)
is almost inevitable. An idle exit will also trigger (1) on the CPU
that is coming out of idle. (See the rough arithmetic sketched below.)
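A back-of-the-envelope illustration, assuming HZ=1000 and a hypothetical
~1 kHz idle enter/exit rate per CPU (the actual rate is entirely workload
dependent):

    reads,  (1): ~1 kHz per busy CPU (one per tick)
    writes, (2): 480 CPUs * ~1 kHz enter/exit each
                 ~= 480 kHz of writes to the nohz
                 cacheline(s), system-wide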
There are two types of costs from these functions:
(A) Scheduler tick cost via (1): this happens on busy CPUs too, and is
thus a primary scalability cost. But the rate here is constant and
typically much lower than that of (B), hence the absolute benefit to
workload scalability will be lower as well.
(B) Idle cost via (2): going-to-idle and coming-from-idle costs are
secondary concerns, because they impact power efficiency more than
they impact scalability. But in terms of absolute cost this scales
up with nr_cpus as well, at a much faster rate, and thus may also
approach and negatively impact system limits like
memory bus/fabric bandwidth.
The above fundamental scalability challenge remains true for
nohz.idle_cpus_mask even after this patch. But nr_cpus can be derived from
the mask itself, and its usage doesn't require a precisely up-to-date
value: the read can race with updates, and at worst an additional load
balance may be attempted. So, derive the value from idle_cpus_mask.
This helps to save some bus bandwidth w.r.t. that nohz cacheline
(approx 50%), which in turn helps to improve enterprise workload
throughput.
This holds true for CONFIG_CPUMASK_OFFSTACK=y and mostly true for
CONFIG_CPUMASK_OFFSTACK=n (depending on NR_CPUS, the last few bits of
the mask could be in the same cacheline as nr_cpus); see the sketch
below.
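For reference, the type behind nohz.idle_cpus_mask, slightly simplified
from include/linux/cpumask.h:

    #ifdef CONFIG_CPUMASK_OFFSTACK
    /* Mask storage is allocated separately; only the pointer
     * shares a cacheline with the other nohz fields. */
    typedef struct cpumask *cpumask_var_t;
    #else
    /* All NR_CPUS bits are embedded directly in the nohz struct. */
    typedef struct cpumask cpumask_var_t[1];
    #endif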
On a system with 480 CPUs, running hackbench 40 process 10000 loops
(avg of 3 runs):
baseline:
0.81% hackbench [k] nohz_balance_exit_idle
0.21% hackbench [k] nohz_balancer_kick
0.09% swapper [k] nohz_run_idle_balance
With patch:
0.35% hackbench [k] nohz_balance_exit_idle
0.09% hackbench [k] nohz_balancer_kick
0.07% swapper [k] nohz_run_idle_balance
[Ingo Molnar: scalability analysis changelog]
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/fair.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c03f963f6216..3408a5beb95b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7144,7 +7144,6 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
static struct {
cpumask_var_t idle_cpus_mask;
- atomic_t nr_cpus;
int has_blocked_load; /* Idle CPUS has blocked load */
int needs_update; /* Newly idle CPUs need their next_balance collated */
unsigned long next_balance; /* in jiffy units */
@@ -12466,7 +12465,7 @@ static void nohz_balancer_kick(struct rq *rq)
* None are in tickless mode and hence no need for NOHZ idle load
* balancing
*/
- if (unlikely(!atomic_read(&nohz.nr_cpus)))
+ if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
return;
if (rq->nr_running >= 2) {
@@ -12579,7 +12578,6 @@ void nohz_balance_exit_idle(struct rq *rq)
rq->nohz_tick_stopped = 0;
cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
- atomic_dec(&nohz.nr_cpus);
set_cpu_sd_state_busy(rq->cpu);
}
@@ -12637,7 +12635,6 @@ void nohz_balance_enter_idle(int cpu)
rq->nohz_tick_stopped = 1;
cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
- atomic_inc(&nohz.nr_cpus);
/*
* Ensures that if nohz_idle_balance() fails to observe our
--
2.47.3
On 07/01/26 12:21, Shrikanth Hegde wrote:
> nohz.nr_cpus was observed as contended cacheline when running
> enterprise workload on large systems.
>
> Fundamental scalability challenge with nohz.idle_cpus_mask
> and nohz.nr_cpus is the following:
>
> (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
> (or nohz.idle_cpu_mask) and nohz.has_blocked to see whether there's
> any nohz balancing work to do, in every scheduler tick.
>
> (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
> (through nohz_balancer_kick() via sched_tick()) modify (write)
> nohz.nr_cpus (and/or nohz.idle_cpu_mask) and nohz.has_blocked.
>

My first reaction on reading the whole changelog was: "but .nr_cpus and
.idle_cpus_mask are in the same cacheline?!", which as Ingo pointed out
somewhere down [1] isn't true for CPUMASK_OFFSTACK, so this change
effectively gets rid of the dirtying of one extra cacheline during idle
entry/exit.

[1]: http://lore.kernel.org/r/aS3za7X9BLS5rg65@gmail.com

I'd suggest adding something like so in this part of the changelog:

"""
Note that nohz.idle_cpus_mask and nohz.nr_cpus reside in the same
cacheline, however under CONFIG_CPUMASK_OFFSTACK the backing storage for
nohz.idle_cpus_mask will be elsewhere. This implies two separate cachelines
being dirtied upon idle entry / exit.
"""
Hi Valentin. Thanks for going through.

On 1/9/26 8:14 PM, Valentin Schneider wrote:
> On 07/01/26 12:21, Shrikanth Hegde wrote:
>> nohz.nr_cpus was observed as contended cacheline when running
>> enterprise workload on large systems.
>>
>> Fundamental scalability challenge with nohz.idle_cpus_mask
>> and nohz.nr_cpus is the following:
>>
>> (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
>> (or nohz.idle_cpu_mask) and nohz.has_blocked to see whether there's
>> any nohz balancing work to do, in every scheduler tick.
>>
>> (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
>> (through nohz_balancer_kick() via sched_tick()) modify (write)
>> nohz.nr_cpus (and/or nohz.idle_cpu_mask) and nohz.has_blocked.
>>
>
> My first reaction on reading the whole changelog was: "but .nr_cpus and
> .idle_cpus_mask are in the same cacheline?!", which as Ingo pointed out
> somewhere down [1] isn't true for CPUMASK_OFFSTACK, so this change
> effectively gets rid of the dirtying of one extra cacheline during idle
> entry/exit.
>
> [1]: http://lore.kernel.org/r/aS3za7X9BLS5rg65@gmail.com
>
> I'd suggest adding something like so in this part of the changelog:
>
> """
> Note that nohz.idle_cpus_mask and nohz.nr_cpus reside in the same
> cacheline, however under CONFIG_CPUMASK_OFFSTACK the backing storage for
> nohz.idle_cpus_mask will be elsewhere. This implies two separate cachelines
> being dirtied upon idle entry / exit.
> """
>

ok. Will do that. Thanks.

Even for CONFIG_CPUMASK_OFFSTACK=n, the usual configuration is NR_CPUS of
512/1024/2048 or higher. With a 64-byte cacheline, one cacheline can hold
512 CPUs. So idle_cpus_mask and the rest of the nohz fields, including
nr_cpus, will be in different cachelines.

Even for powerpc (128-byte cacheline), where CONFIG_CPUMASK_OFFSTACK=n,
the default is NR_CPUS=2048. That means idle_cpus_mask will take 2
cachelines and the rest of the nohz fields will be in a third cacheline.

So in most cases, this implies dirtying one less cacheline.

Data points with CONFIG_CPUMASK_OFFSTACK=y/n are at [1].

[1]: https://lore.kernel.org/all/fdb378e7-7797-4aeb-a79f-12af4cb1b81a@linux.ibm.com/
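To spell out that arithmetic for the CONFIG_CPUMASK_OFFSTACK=n case
(assuming the nohz struct starts on a cacheline boundary, which it does
via ____cacheline_aligned):

    NR_CPUS=2048  ->  mask = 2048 bits = 256 bytes
    64-byte lines:   256/64  = 4 cachelines for the mask;
                     nr_cpus and the other fields start on the next line
    128-byte lines:  256/128 = 2 cachelines for the mask (powerpc);
                     the remaining nohz fields land on a 3rd line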