[PATCH v4 3/3] sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead

Shrikanth Hegde posted 3 patches 3 weeks, 6 days ago
There is a newer version of this series
Posted by Shrikanth Hegde 3 weeks, 6 days ago
nohz.nr_cpus was observed to be a contended cacheline when running an
enterprise workload on large systems.

The fundamental scalability challenge with nohz.idle_cpus_mask
and nohz.nr_cpus is the following:

 (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
     (or nohz.idle_cpus_mask) and nohz.has_blocked to see whether there is
     any nohz balancing work to do, on every scheduler tick.

 (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
     (through nohz_balancer_kick() via sched_tick()) modify (write)
     nohz.nr_cpus (and/or nohz.idle_cpus_mask) and nohz.has_blocked.

The characteristic frequencies are the following:

 (1) nohz_balancer_kick() happens at scheduler (busy) tick frequency
     on CPUs which have not gone idle. This is a relatively constant
     frequency in the ~1 kHz range or lower.

 (2) happens at idle enter/exit frequency on every CPU that goes idle.
     This is workload dependent, but can easily be hundreds of kHz for
     IO-bound loads and high CPU counts, i.e. orders of magnitude higher
     than (1), in which case a cache miss at every invocation of (1) is
     almost inevitable. Idle exit will also trigger (1) on the CPU which
     is coming out of idle.
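
For illustration only (made-up but plausible numbers, assuming a 1000 Hz
tick and an IO-bound load on the 480-CPU system used below):

  reads via (1):  ~1 kHz per busy CPU, so at most ~480k reads/sec
                  system-wide;
  writes via (2): each CPU entering and exiting idle a few hundred times
                  a second gives several hundred thousand updates/sec
                  system-wide, each of which invalidates the cacheline(s)
                  for the next read in (1).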

There are two types of costs from these functions:

 (A) scheduler tick cost via (1): this happens on busy CPUs too, and is
     thus a primary scalability cost. But the rate here is constant and
     typically much lower than (B), hence the absolute benefit to workload
     scalability will be lower as well.

 (B) idle cost via (2): going-to-idle and coming-from-idle costs are
     secondary concerns, because they impact power efficiency more than
     they impact scalability. But in terms of absolute cost this scales
     up with nr_cpus as well, and at a much faster rate, and thus may
     also approach and negatively impact system limits like memory
     bus/fabric bandwidth.

Note that nohz.idle_cpus_mask and nohz.nr_cpus may appear to reside in the
same cacheline; however, under CONFIG_CPUMASK_OFFSTACK=y the backing storage
for nohz.idle_cpus_mask lives elsewhere. With CONFIG_CPUMASK_OFFSTACK=n,
nohz.idle_cpus_mask and the rest of the nohz fields end up in different
cachelines for typical NR_CPUS=512/2048. Either way, two separate cachelines
are dirtied upon idle entry/exit.
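
As a rough illustration only (not part of the patch): a minimal userspace
sketch of the layout, assuming CONFIG_CPUMASK_OFFSTACK=n, NR_CPUS=512 and
64-byte cachelines, with a plain int standing in for atomic_t:

  #include <stdio.h>
  #include <stddef.h>

  #define NR_CPUS 512

  /* Sketch only: mimics the nohz struct with an embedded cpumask */
  struct nohz_like {
          unsigned long idle_cpus_mask[NR_CPUS / (8 * sizeof(unsigned long))];
          int nr_cpus;                    /* stand-in for atomic_t */
          int has_blocked_load;
          int needs_update;
          unsigned long next_balance;
  };

  int main(void)
  {
          /* 64 bytes: the mask alone fills one whole 64-byte cacheline */
          printf("sizeof idle_cpus_mask: %zu\n",
                 sizeof(((struct nohz_like *)0)->idle_cpus_mask));
          /* 64: nr_cpus and the other fields start on the next cacheline */
          printf("offsetof nr_cpus     : %zu\n",
                 offsetof(struct nohz_like, nr_cpus));
          return 0;
  }

Both prints show 64 here, i.e. the counter and flag fields sit on the
cacheline right after the one(s) holding the mask; with NR_CPUS=2048 the
embedded mask grows to 256 bytes and the gap only widens.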

nohz.nr_cpus can be derived from the mask itself, and its only reader (the
emptiness check in nohz_balancer_kick()) does not require an exact,
synchronized value. Dropping it means one less cacheline being dirtied in
the idle entry/exit path, which saves some bus bandwidth w.r.t. those nohz
functions (approx. 50%). This in turn helps improve enterprise workload
throughput.
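
Back-of-the-envelope, assuming 64-byte cachelines (illustrative only): on
the idle-exit path the shared nohz writes go from two cachelines (the mask
line plus the line holding nr_cpus) down to one, i.e. from 128 to 64
dirtied bytes per exit, consistent with the ~50% figure above.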

On a system with 480 CPUs, running "hackbench 40 process 10000 loops"
(avg of 3 runs):
Baseline:
     0.81%  hackbench          [k] nohz_balance_exit_idle
     0.21%  hackbench          [k] nohz_balancer_kick
     0.09%  swapper            [k] nohz_run_idle_balance

With patch:
     0.35%  hackbench          [k] nohz_balance_exit_idle
     0.09%  hackbench          [k] nohz_balancer_kick
     0.07%  swapper            [k] nohz_run_idle_balance

[Ingo Molnar: scalability analysis changelog]
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c03f963f6216..3408a5beb95b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7144,7 +7144,6 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
 
 static struct {
 	cpumask_var_t idle_cpus_mask;
-	atomic_t nr_cpus;
 	int has_blocked_load;		/* Idle CPUS has blocked load */
 	int needs_update;		/* Newly idle CPUs need their next_balance collated */
 	unsigned long next_balance;     /* in jiffy units */
@@ -12466,7 +12465,7 @@ static void nohz_balancer_kick(struct rq *rq)
 	 * None are in tickless mode and hence no need for NOHZ idle load
 	 * balancing
 	 */
-	if (unlikely(!atomic_read(&nohz.nr_cpus)))
+	if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
 		return;
 
 	if (rq->nr_running >= 2) {
@@ -12579,7 +12578,6 @@ void nohz_balance_exit_idle(struct rq *rq)
 
 	rq->nohz_tick_stopped = 0;
 	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
-	atomic_dec(&nohz.nr_cpus);
 
 	set_cpu_sd_state_busy(rq->cpu);
 }
@@ -12637,7 +12635,6 @@ void nohz_balance_enter_idle(int cpu)
 	rq->nohz_tick_stopped = 1;
 
 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
-	atomic_inc(&nohz.nr_cpus);
 
 	/*
 	 * Ensures that if nohz_idle_balance() fails to observe our
-- 
2.47.3
Re: [PATCH v4 3/3] sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead
Posted by Vincent Guittot 3 weeks, 5 days ago
On Mon, 12 Jan 2026 at 06:05, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> On system with 480 CPUs, running "hackbench 40 process 10000 loops"
> (Avg of 3 runs)
> baseline:
>      0.81%  hackbench          [k] nohz_balance_exit_idle
>      0.21%  hackbench          [k] nohz_balancer_kick
>      0.09%  swapper            [k] nohz_run_idle_balance
>
> With patch:
>      0.35%  hackbench          [k] nohz_balance_exit_idle
>      0.09%  hackbench          [k] nohz_balancer_kick
>      0.07%  swapper            [k] nohz_run_idle_balance
>
> [Ingo Molnar: scalability analysis changlog]
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>

This change makes sense to me but I'm not convinced by patch 1.
You wrote in patch 1 that it doesn't provide any real benefit; what
are the figures with only patch 3?

Re: [PATCH v4 3/3] sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead
Posted by Shrikanth Hegde 3 weeks, 5 days ago
Hi Vincent.

>> On system with 480 CPUs, running "hackbench 40 process 10000 loops"
>> (Avg of 3 runs)
>> baseline:
>>       0.81%  hackbench          [k] nohz_balance_exit_idle
>>       0.21%  hackbench          [k] nohz_balancer_kick
>>       0.09%  swapper            [k] nohz_run_idle_balance
>>
>> With patch:
>>       0.35%  hackbench          [k] nohz_balance_exit_idle
>>       0.09%  hackbench          [k] nohz_balancer_kick
>>       0.07%  swapper            [k] nohz_run_idle_balance
>>
>> [Ingo Molnar: scalability analysis changlog]
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> 
> This change makes sense to me but I'm not convinced by patch1.
> You wrote in patch 1 that It doesn't provide any real benefit, what
> are the figures with only patch 3 ?
> 

The whole point of patch 1 is (assuming the normal case, i.e. the system
won't be 100% busy):

- we read the value and then do the time check;
- we bail out if the time is not due.

Why bother reading, if the time is not due?

I don't expect patch 1 to make any major difference, but it seemed like the
right thing to do, considering that most of the time the system won't be
100% busy.

So the numbers will be the same without patch 1.

Re: [PATCH v4 3/3] sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead
Posted by Valentin Schneider 3 weeks, 6 days ago
On 12/01/26 10:34, Shrikanth Hegde wrote:
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>

Reviewed-by: Valentin Schneider <vschneid@redhat.com>