[PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask instead
Posted by Shrikanth Hegde 2 months, 1 week ago
nohz_balance_enter_idle:
	cpumask_set_cpu(cpu, nohz.idle_cpus_mask)
	atomic_inc(&nohz.nr_cpus)

nohz_balance_exit_idle:
	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask)
	atomic_dec(&nohz.nr_cpus)

nohz_balancer_kick:
	if (unlikely(!atomic_read(&nohz.nr_cpus)))
		goto out;

So idle_cpus_mask already contains the same information. Instead of
doing a costly atomic operation on large systems, it is better to check
whether the cpumask is empty to make the same decision about triggering
idle load balance.

There might be a race between the cpumask_empty check and a remote CPU
setting its bit in the cpumask. In that case the idle load balance will
be triggered at the next tick instead. A race with clearing the bit is
not a concern, since _nohz_idle_balance() checks whether a CPU is idle
before balancing it.
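
An illustrative interleaving of that benign race (not new code, just
for clarity):

	CPU B (tick):	cpumask_empty(nohz.idle_cpus_mask) -> true,
			skip the kick on this tick
	CPU A (idle):	cpumask_set_cpu(cpu, nohz.idle_cpus_mask)
	CPU B (tick+1):	cpumask_empty() -> false, idle load balance
			is kicked as usual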

cpumask_empty() boils down to a find-first-set (ffs) scan, so it should
not be very costly.
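
For reference, the empty check is roughly the following (paraphrased
from include/linux/cpumask.h and include/linux/bitmap.h; exact helper
names vary between kernel versions):

	static inline bool cpumask_empty(const struct cpumask *srcp)
	{
		/* A single find-first-bit scan over at most NR_CPUS bits. */
		return bitmap_empty(cpumask_bits(srcp), small_cpumask_bits);
	}

	static inline bool bitmap_empty(const unsigned long *src, unsigned int nbits)
	{
		if (small_const_nbits(nbits))
			return !(*src & BITMAP_LAST_WORD_MASK(nbits));

		return find_first_bit(src, nbits) == nbits;
	}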

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5534822fd754..3afa79d733dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7144,7 +7144,6 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
 
 static struct {
 	cpumask_var_t idle_cpus_mask;
-	atomic_t nr_cpus;
 	int has_blocked;		/* Idle CPUS has blocked load */
 	int needs_update;		/* Newly idle CPUs need their next_balance collated */
 	unsigned long next_balance;     /* in jiffy units */
@@ -12450,7 +12449,7 @@ static void nohz_balancer_kick(struct rq *rq)
 	 * None are in tickless mode and hence no need for NOHZ idle load
 	 * balancing, do stats update if its due
 	 */
-	if (unlikely(!atomic_read(&nohz.nr_cpus)))
+	if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
 		goto out;
 
 	if (rq->nr_running >= 2) {
@@ -12563,7 +12562,6 @@ void nohz_balance_exit_idle(struct rq *rq)
 
 	rq->nohz_tick_stopped = 0;
 	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
-	atomic_dec(&nohz.nr_cpus);
 
 	set_cpu_sd_state_busy(rq->cpu);
 }
@@ -12621,7 +12619,6 @@ void nohz_balance_enter_idle(int cpu)
 	rq->nohz_tick_stopped = 1;
 
 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
-	atomic_inc(&nohz.nr_cpus);
 
 	/*
 	 * Ensures that if nohz_idle_balance() fails to observe our
-- 
2.43.0
Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask instead
Posted by Ingo Molnar 2 months, 1 week ago
* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> nohz_balance_enter_idle:
> 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask)
> 	atomic_inc(&nohz.nr_cpus)
> 
> nohz_balance_exit_idle:
> 	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask)
> 	atomic_dec(&nohz.nr_cpus)
> 
> kick_ilb:
> 	if (likely(!atomic_read(&nohz.nr_cpus)))
> 		return;
> 
> So, idle_cpus_mask contains the same information. Instead of doing
> costly atomic in large systems, its better to check if cpumask is empty
> or not to make the same decision to trigger idle load balance.
> 
> There might be race between cpumask_empty check and set of cpumask in
> the remote CPUs. In such case at next tick idle load balance will be
> triggered. Race of clearing the bit is not a concern, since _nohz_idle_balance
> checks if CPU is idle or not before doing the balance.
> 
> cpumask_empty uses ffs. So should not be very costly.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>

>  static struct {
>  	cpumask_var_t idle_cpus_mask;
> -	atomic_t nr_cpus;
>  	int has_blocked;		/* Idle CPUS has blocked load */
>  	int needs_update;		/* Newly idle CPUs need their next_balance collated */
>  	unsigned long next_balance;     /* in jiffy units */
> @@ -12450,7 +12449,7 @@ static void nohz_balancer_kick(struct rq *rq)
>  	 * None are in tickless mode and hence no need for NOHZ idle load
>  	 * balancing, do stats update if its due
>  	 */
> -	if (unlikely(!atomic_read(&nohz.nr_cpus)))
> +	if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
>  		goto out;

So the thing is, if the goal is to avoid cacheline 
bouncing, this won't fundamentally change the 
situation:

>  	rq->nohz_tick_stopped = 0;
>  	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
> -	atomic_dec(&nohz.nr_cpus);

nohz.idle_cpus_mask will be on a single 64-byte 
cacheline even on 512 CPU systems, and the 
cpumask_clear_cpu() and cpumask_set_cpu() calls will 
dirty the cacheline and make it bounce with exactly the 
same frequency as the atomic_inc/dec() of nohz.nr_cpus 
does today.

From the 0/4 boilerplate description:

 > It was noted when running on large systems 
 > nohz.nr_cpus cacheline was bouncing quite often. 
 > There is atomic inc/dec and read happening on many 
 > CPUs at a time and it is possible for this line to 
 > bounce often.

That the nr_cpus modification is an atomic op doesn't 
change the situation much in terms of cacheline 
bouncing, because the cacheline dirtying will still 
cause comparable levels of bouncing on modern CPUs with 
modern cache coherency protocols.
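
( For reference, the mask updates themselves are atomic RMW bitops 
  on the mask's cacheline - paraphrased from 
  include/linux/cpumask.h:

	static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
	{
		set_bit(cpumask_check(cpu), cpumask_bits(dstp));
	}

	static inline void cpumask_clear_cpu(int cpu, struct cpumask *dstp)
	{
		clear_bit(cpumask_check(cpu), cpumask_bits(dstp));
	}

  so a set/clear dirties the target cacheline just like the 
  atomic_inc/dec does. )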

If idle_cpus_mask and nohz.nr_cpus are in separate cachelines, 
then this patch might eliminate about half of the 
bounces - but AFAICS they are right next to each other, 
so unless it's off-stack cpumasks, they should be in 
the same cacheline. Half of 'bad bouncing' is still 
kinda 'bad bouncing'. :-)

I'm not really objecting to the patch, because it would 
reduce cacheline bouncing in the offstack-mask case, 
but the explanation isn't very clear about these 
details.

Thanks,

	Ingo
Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask instead
Posted by Shrikanth Hegde 2 months, 1 week ago
Hi Ingo,

Thanks for taking a look at this.

On 12/2/25 1:28 AM, Ingo Molnar wrote:
> 
> * Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> 
>> nohz_balance_enter_idle:
>> 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask)
>> 	atomic_inc(&nohz.nr_cpus)
>>
>> nohz_balance_exit_idle:
>> 	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask)
>> 	atomic_dec(&nohz.nr_cpus)
>>
>> kick_ilb:
>> 	if (likely(!atomic_read(&nohz.nr_cpus)))
>> 		return;
>>
>> So, idle_cpus_mask contains the same information. Instead of doing
>> costly atomic in large systems, its better to check if cpumask is empty
>> or not to make the same decision to trigger idle load balance.
>>
>> There might be race between cpumask_empty check and set of cpumask in
>> the remote CPUs. In such case at next tick idle load balance will be
>> triggered. Race of clearing the bit is not a concern, since _nohz_idle_balance
>> checks if CPU is idle or not before doing the balance.
>>
>> cpumask_empty uses ffs. So should not be very costly.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> 
>>   static struct {
>>   	cpumask_var_t idle_cpus_mask;
>> -	atomic_t nr_cpus;
>>   	int has_blocked;		/* Idle CPUS has blocked load */
>>   	int needs_update;		/* Newly idle CPUs need their next_balance collated */
>>   	unsigned long next_balance;     /* in jiffy units */
>> @@ -12450,7 +12449,7 @@ static void nohz_balancer_kick(struct rq *rq)
>>   	 * None are in tickless mode and hence no need for NOHZ idle load
>>   	 * balancing, do stats update if its due
>>   	 */
>> -	if (unlikely(!atomic_read(&nohz.nr_cpus)))
>> +	if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
>>   		goto out;
> 
> So the thing is, if the goal is to avoid cacheline
> bouncing, this won't fundamentally change the
> situation:
> 
>>   	rq->nohz_tick_stopped = 0;
>>   	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
>> -	atomic_dec(&nohz.nr_cpus);
> 
> nohz.idle_cpus_mask will be on a single 64-byte
> cacheline even on 512 CPU systems, and the
> cpumask_clear_cpu() and cpumask_set_cpu() calls will
> dirty the cacheline and make it bounce with exactly the
> same frequency as the atomic_inc/dec() of nohz.nr_cpus
> does today.
> 
>  From the 0/4 boilerplate description:
> 
>   > It was noted when running on large systems
>   > nohz.nr_cpus cacheline was bouncing quite often.
>   > There is atomic inc/dec and read happening on many
>   > CPUs at a time and it is possible for this line to
>   > bounce often.
> 
> That the nr_cpus modification is an atomic op doesn't
> change the situation much in terms of cacheline
> bouncing, because the cacheline dirtying will still
> cause comparable levels of bouncing on modern CPUs with
> modern cache coherency protocols.
> 
> If nr_cpus and nohz.nr_cpus are in separate cachelines,
> then this patch might eliminate about half of the
> bounces - but AFAICS they are right next to each other,
> so unless it's off-stack cpumasks, they should be in
> the same cacheline. Half of 'bad bouncing' is still
> kinda 'bad bouncing'. :-)
> 

You are right. If we have to get rid of cacheline bouncing then
we need to fix nohz.idle_cpus_mask too.

I forgot about CPUMASK_OFFSTACK.

If CPUMASK_OFFSTACK=y, then both idle_cpus_mask and nr_cpus are in the
same cacheline, right? The data in the cover-letter is with =y. In that
case, switching to cpumask_empty will give minimal gains, by removing
the additional atomic inc/dec operations.

If CPUMASK_OFFSTACK=n, then they could be in different cachelines.
In that case the gains should be better. Very likely our performance
team would have run with =n.
IIRC, on powerpc we select it based on NR_CPUS. On x86 the NR_CPUS
default depends on it.

arm64/Kconfig:	select CPUMASK_OFFSTACK if NR_CPUS > 256
powerpc/Kconfig:	select CPUMASK_OFFSTACK			if NR_CPUS >= 8192
x86/Kconfig:	select CPUMASK_OFFSTACK
x86/Kconfig:	default 8192 if  SMP && CPUMASK_OFFSTACK
x86/Kconfig:	default  512 if  SMP && !CPUMASK_OFFSTACK



In either case, if we assume
	nohz.nr_cpus == cpumask_weight(nohz.idle_cpus_mask)

then since this is not a correctness issue, at worst we will lose one
chance to do an idle load balance, and it will happen at the next tick
instead. Looking at the code, that can happen even today: first we
set/clear the mask and then we do the inc/dec, so if the mask was set
but the inc hadn't happened yet when the read completed, we would lose
a chance (though the window is very slim).
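
An illustrative interleaving of that existing (pre-patch) window, just
for clarity:

	CPU A (entering idle):	cpumask_set_cpu(cpu, nohz.idle_cpus_mask)
	CPU B (tick):		atomic_read(&nohz.nr_cpus) == 0
				-> skips the kick on this tick
	CPU A:			atomic_inc(&nohz.nr_cpus)
	CPU B (next tick):	sees nr_cpus > 0 and kicks as usual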


> I'm not really objecting to the patch, because it would
> reduce cacheline bouncing in the offstack-mask case,
> but the explanation isn't very clear about these
> details.
>

Let me rewrite the changelog, and also look a bit more into it.


  
> Thanks,
> 
> 	Ingo
Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask instead
Posted by Ingo Molnar 2 months, 1 week ago
* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> > That the nr_cpus modification is an atomic op 
> > doesn't change the situation much in terms of 
> > cacheline bouncing, because the cacheline dirtying 
> > will still cause comparable levels of bouncing on 
> > modern CPUs with modern cache coherency protocols.
> > 
> > If nr_cpus and nohz.nr_cpus are in separate 
> > cachelines, then this patch might eliminate about 
> > half of the bounces - but AFAICS they are right 
> > next to each other, so unless it's off-stack 
> > cpumasks, they should be in the same cacheline. 
> > Half of 'bad bouncing' is still kinda 'bad 
> > bouncing'. :-)
> > 
> 
> You are right. If we have to get rid of cacheline 
> bouncing then we need to fix nohz.idle_cpus_mask too.
> 
> I forgot about CPUMASK_OFFSTACK.
> 
> If CPUMASK_OFFSTACK=y, then both idle_cpus_mask and 
> nr_cpus are in same cacheline Right?. That data in 
> cover-letter is with =y. In that case, getting it to 
> cpumask_empty will give minimal gains by remvong an 
> additional atomic inc/dec operations.
> 
> If CPUMASK_OFFSTACK=n, then they could be in 
> different cacheline. In that case gains should be 
> better. Very likely our performance team would have 
> done with =n. IIRC, on powerpc, based on NR_CPU we 
> change it. On x86 it chooses NR_CPUs.

Well, it's the other way around: in the 'off stack' 
case the cpumask_var_t is moved "off the stack" because 
it's too large - i.e. we allocate it separately, in a 
separate cacheline as a side effect. Even if the main 
cpumask pointer is next to nohz.nr_cpus, the mask 
itself is behind an indirect pointer, see 
<linux/cpumask_types.h>:

  #ifdef CONFIG_CPUMASK_OFFSTACK
  typedef struct cpumask *cpumask_var_t;
  #else

Note that idle_cpus_mask is defined as a cpumask_var_t 
and is thus affected by CONFIG_CPUMASK_OFFSTACK and may 
be allocated dynamically:

  kernel/sched/fair.c:    cpumask_var_t idle_cpus_mask;
  ...
  kernel/sched/fair.c:    zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);

So I think it's quite possible that the performance 
measurements were done with CONFIG_CPUMASK_OFFSTACK=y: 
it's commonly enabled in server/enterprise distros - 
but even Ubuntu enables it on their desktop kernel, so 
the reduction in cacheline ping-pong is probably real 
and the change makes sense in that context.

But even with OFFSTACK=n, if NR_CPUS=512 it's possible 
that a fair chunk of the cpumask ends up on the 
previous (separate) cacheline from where nr_cpus is, 
with a resulting observable reduction of the cache 
bouncing rate:

  static struct {
          cpumask_var_t idle_cpus_mask;
          atomic_t nr_cpus;

Note that since 'nohz' is ____cacheline_aligned, in 
this case idle_cpus_mask will take a full cacheline in 
the NR_CPUS=512 case, and nr_cpus will always be on a 
separate cacheline.

If CONFIG_NR_CPUS is 128 or smaller, then 
idle_cpus_mask and nr_cpus will be on the same 
cacheline.
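
( A rough layout sketch of that NR_CPUS=512, 
  CONFIG_CPUMASK_OFFSTACK=n case - the offsets are 
  illustrative, not measured:

	static struct {
		cpumask_var_t idle_cpus_mask;	/* 512 bits = 64 bytes: fills cacheline 0 */
		atomic_t nr_cpus;		/* byte offset 64: starts on cacheline 1  */
		int has_blocked;
		int needs_update;
		unsigned long next_balance;
		...
	} nohz ____cacheline_aligned;
)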

Anyway, if the reduction in cache ping-pong is higher 
than 50%, then either something weird is going on, or 
I'm missing something. :-)

But the measurement data you provided:

   baseline: tip sched/core at 3eb593560146
   1.01%  [k] nohz_balance_exit_idle
   0.31%  [k] nohz_balancer_kick
   0.05%  [k] nohz_balance_enter_idle

   With series:
   0.45%  [k] nohz_balance_exit_idle
   0.18%  [k] nohz_balancer_kick
   0.01%  [k] nohz_balance_enter_idle

... is roughly in the 50% reduction range, if profiled 
overhead is a good proxy for cache bounce overhead 
(which it may be), which supports my hypothesis that 
the tests were run with CONFIG_CPUMASK_OFFSTACK=y and 
the cache ping-pong rate in these functions got roughly 
halved.

BTW., I'd expect _nohz_idle_balance() to show up too in 
the profile.


> arm64/Kconfig:	select CPUMASK_OFFSTACK if NR_CPUS > 256
> powerpc/Kconfig:	select CPUMASK_OFFSTACK			if NR_CPUS >= 8192
> x86/Kconfig:	select CPUMASK_OFFSTACK
> x86/Kconfig:	default 8192 if  SMP && CPUMASK_OFFSTACK
> x86/Kconfig:	default  512 if  SMP && !CPUMASK_OFFSTACK

Yeah, we make the cpumask a direct mask up to 512 bits 
(64 bytes) - it's allocated indirectly from that point 
onwards.

> In either case, if we think,
> 	nohz.nr_cpus == cpumask_weight(nohz.idle_cpus_mask)
> 
> Since it is not a correctness stuff here, at worst we 
> will lose a chance to do idle load balance.

Yeah, I don't think it's a correctness issue, removing 
nr_cpus should not change the ordering of modifications 
to nohz.idle_cpus_mask and nohz.has_blocked.

( The nohz.nr_cpus and nohz.idle_cpus_mask 
  modifications were not ordered versus each other 
  previously to begin with - they are only ordered 
  versus nohz.has_blocked. )

> Let me re-write changelog. Also see a bit more into it.

Thank you!

Note that the fundamental scalability challenge with 
the nohz_balancer_kick(), nohz_balance_enter_idle() and 
nohz_balance_exit_idle() functions is the following:

 (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus 
     (or nohz.idle_cpus_mask) and nohz.has_blocked to 
     see whether there's any nohz balancing work to do, 
     in every scheduler tick.

 (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
     modify (write) nohz.nr_cpus (and/or nohz.idle_cpus_mask)
     and nohz.has_blocked.

The characteristic frequencies are the following:

 (1) happens at scheduler (busy-)tick frequency on 
     every CPU. This is a relatively constant frequency 
     in the ~1 kHz range or lower.

 (2) happens at idle enter/exit frequency on every CPU 
     that goes to idle. This is workload dependent, but 
     can easily be hundreds of kHz for IO-bound loads 
     and high CPU counts. Ie. can be orders of 
     magnitude higher than (1), in which case a 
     cachemiss at every invocation of (1) is almost 
     inevitable. [ Ie. the cost of getting really long 
     NOHZ idling times is the extra overhead of the 
     exit/enter nohz cycles for partially idle CPUs on 
     high-rate IO workloads. ]

There's two types of costs from these functions:

  (A) scheduler tick cost via (1): this happens on busy 
      CPUs too, and is thus a primary scalability cost. 
      But the rate here is constant and typically much 
      lower than (B), hence the absolute benefit to 
      workload scalability will be lower as well.

  (B) idle cost via (2): going-to-idle and 
      coming-from-idle costs are secondary concerns, 
      because they impact power efficiency more than 
      they impact scalability. (Ie. while 'wasting idle 
      time' isn't good, it often doesn't hurt 
      scalability, at least as long as it's done for a 
      good reason and done in moderation.)

      But in terms of absolute cost this scales up with 
      nr_cpus as well, at a much faster rate, and thus 
      may also approach and negatively impact system 
      limits like memory bus/fabric bandwidth.

So I'd argue that reductions in both (A) and (B) are 
useful, but for different reasons.

The *real* breakthrough in this area would be to reduce 
the unlimited upwards frequency of (2), by 
fundamentally changing the model of NOHZ idle 
balancing:

For example by measuring the rate (frequency) of idle 
cycles on each CPU (this can be done without any 
cross-CPU logic), we would turn off NOHZ-idle for that 
CPU when the rate goes beyond a threshold.

The resulting regular idle load-balancing passes will 
be rate-limited by balance intervals and won't be as 
aggressive as nohz_balance_enter+exit_idle(). (I hope...)

Truly idle CPUs would go into NOHZ mode automatically, 
as their measured rate of idling drops below the 
threshold.
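
A very rough sketch of where such a rate check could sit - the helper 
name below is hypothetical, purely to illustrate the idea, not an 
existing kernel API:

	void nohz_balance_enter_idle(int cpu)
	{
		...
		/*
		 * Hypothetical: a purely CPU-local estimate of how often
		 * this CPU has been entering idle recently.
		 */
		if (idle_enter_rate_too_high(cpu))
			return;	/* skip the shared NOHZ-idle bookkeeping */

		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
		...
	}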

Thoughts?

	Ingo
Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask instead
Posted by Shrikanth Hegde 2 months, 1 week ago

On 12/2/25 1:24 PM, Ingo Molnar wrote:
> 
> * Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> 
>>> That the nr_cpus modification is an atomic op
>>> doesn't change the situation much in terms of
>>> cacheline bouncing, because the cacheline dirtying
>>> will still cause comparable levels of bouncing on
>>> modern CPUs with modern cache coherency protocols.
>>>
>>> If nr_cpus and nohz.nr_cpus are in separate
>>> cachelines, then this patch might eliminate about
>>> half of the bounces - but AFAICS they are right
>>> next to each other, so unless it's off-stack
>>> cpumasks, they should be in the same cacheline.
>>> Half of 'bad bouncing' is still kinda 'bad
>>> bouncing'. :-)
>>>
>>
>> You are right. If we have to get rid of cacheline
>> bouncing then we need to fix nohz.idle_cpus_mask too.
>>
>> I forgot about CPUMASK_OFFSTACK.
>>
>> If CPUMASK_OFFSTACK=y, then both idle_cpus_mask and
>> nr_cpus are in same cacheline Right?. That data in
>> cover-letter is with =y. In that case, getting it to
>> cpumask_empty will give minimal gains by remvong an
>> additional atomic inc/dec operations.
>>
>> If CPUMASK_OFFSTACK=n, then they could be in
>> different cacheline. In that case gains should be
>> better. Very likely our performance team would have
>> done with =n. IIRC, on powerpc, based on NR_CPU we
>> change it. On x86 it chooses NR_CPUs.
> 
> Well, it's the other way around: in the 'off stack'
> case the cpumask_var_t is moved "off the stack" because
> it's too large - i.e. we allocate it separately, in a
> separate cacheline as a side effect. Even if the main
> cpumask pointer is next to nohz.nr_cpus, the mask
> itself is behind an indirect pointer, see
> <linux/cpumask_types.h>:
> 
>    #ifdef CONFIG_CPUMASK_OFFSTACK
>    typedef struct cpumask *cpumask_var_t;
>    #else
> 
> Note that idle_cpus_mask is defined as a cpumask_var_t
> and is thus affected by CONFIG_CPUMASK_OFFSTACK and may
> be allocated dynamically:
> 
>    kernel/sched/fair.c:    cpumask_var_t idle_cpus_mask;
>    ...
>    kernel/sched/fair.c:    zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
> 
> So I think it's quite possible that the performance
> measurements were done with CONFIG_CPUMASK_OFFSTACK=y:
> it's commonly enabled in server/enterprise distros -
> but even Ubuntu enables it on their desktop kernel, so
> the reduction in cacheline ping-pong is probably real
> and the change makes sense in that context.
> 

Yes. The numbers in the cover-letter were collected with
CONFIG_CPUMASK_OFFSTACK=y, running hackbench.

What I was saying is that I am not sure whether the initial report of
cacheline contention from our performance team had it enabled or not.
That was running an enterprise workload.

> But even with OFFSTACK=n, if NR_CPUS=512 it's possible
> that a fair chunk of the cpumask ends up on the
> previous (separate) cacheline from where nr_cpus is,
> with a resulting observable reduction of the cache
> bouncing rate:
> 
>    static struct {
>            cpumask_var_t idle_cpus_mask;
>            atomic_t nr_cpus;
> 
> Note that since 'nohz' is ____cacheline_aligned, in
> this case idle_cpus_mask will take a full cacheline in
> the NR_CPUS=512 case, and nr_cpus will always be on a
> separate cacheline.
> 
> If CONFIG_NR_CPUS is 128 or smaller, then
> idle_cpus_mask and nr_cpus will be on the same
> cacheline.
> 

I set NR_CPUS=2048, which makes CONFIG_CPUMASK_OFFSTACK=n on powerpc.

baseline:
    0.97%  [k] nohz_balance_exit_idle        -      -
    0.40%  [k] nohz_balancer_kick            -      -
    0.07%  [k] nohz_run_idle_balance         -      -

with patch:
    0.39%  [k] nohz_balance_exit_idle        -      -
    0.14%  [k] nohz_balancer_kick            -      -
    0.08%  [k] nohz_run_idle_balance         -      -

It is in the 50% reduction range.


> Anyway, if the reduction in cache ping-pong is higher
> than 50%, then either something weird is going on, or
> I'm missing something. :-)
>

Yes, in both cases the reduction seems to be about 50% when running
hackbench, as mentioned.

  
> But the measurement data you provided:
> 
>     baseline: tip sched/core at 3eb593560146
>     1.01%  [k] nohz_balance_exit_idle
>     0.31%  [k] nohz_balancer_kick
>     0.05%  [k] nohz_balance_enter_idle
> 
>     With series:
>     0.45%  [k] nohz_balance_exit_idle
>     0.18%  [k] nohz_balancer_kick
>     0.01%  [k] nohz_balance_enter_idle
> 
> ... is roughly in the 50% reduction range, if profiled
> overhead is a good proxy for cache bounce overhead
> (which it may be), which supports my hypothesis that
> the tests were run with CONFIG_CPUMASK_OFFSTACK=y and
> the cache pong-pong rate in these functions got roughly
> halved.
> 

Yes. This data is with CONFIG_CPUMASK_OFFSTACK=y

> BTW., I'd expect _nohz_idle_balance() to show up too in
> the profile.
> 
> 

It is rate limited by sd_balance_interval, no? It likely won't happen
as aggressively as idle enter/exit.

>> arm64/Kconfig:	select CPUMASK_OFFSTACK if NR_CPUS > 256
>> powerpc/Kconfig:	select CPUMASK_OFFSTACK			if NR_CPUS >= 8192
>> x86/Kconfig:	select CPUMASK_OFFSTACK
>> x86/Kconfig:	default 8192 if  SMP && CPUMASK_OFFSTACK
>> x86/Kconfig:	default  512 if  SMP && !CPUMASK_OFFSTACK
> 
> Yeah, we make the cpumask a direct mask up to 512 bits
> (64 bytes) - it's allocated indirectly from that point
> onwards.
> 
>> In either case, if we think,
>> 	nohz.nr_cpus == cpumask_weight(nohz.idle_cpus_mask)
>>
>> Since it is not a correctness stuff here, at worst we
>> will lose a chance to do idle load balance.
> 
> Yeah, I don't think it's a correctness issue, removing
> nr_cpus should not change the ordering of modifications
> to nohz.idle_cpus_mask and nohz.has_blocked.
> 
> ( The nohz.nr_cpus and nohz.idle_cpus_mask
>    modifications were not ordered versus each other
>    previously to begin with - they are only ordered
>    versus nohz.has_blocked. )
> 

Yes. This is true.

>> Let me re-write changelog. Also see a bit more into it.
> 
> Thank you!
> 
> Note that the fundamental scalability challenge with
> the nohz_balancer_kick(), nohz_balance_enter_idle() and
> nohz_balance_exit_idle() functions is the following:
> 
>   (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
>       (or nohz.idle_cpu_mask) and nohz.has_blocked to
>       see whether there's any nohz balancing work to do,
>       in every scheduler tick.
> 
>   (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
>       modify (write) nohz.nr_cpus (and/or nohz.idle_cpu_mask)
>       and nohz.has_blocked.
> 
> The characteristic frequencies are the following:
> 
>   (1) happens at scheduler (busy-)tick frequency on
>       every CPU. This is a relatively constant frequency
>       in the ~1 kHz range or lower.
> 
>   (2) happens at idle enter/exit frequency on every CPU
>       that goes to idle. This is workload dependent, but
>       can easily be hundreds of kHz for IO-bound loads
>       and high CPU counts. Ie. can be orders of
>       magnitude higher than (1), in which case a
>       cachemiss at every invocation of (1) is almost
>       inevitable. [ Ie. the cost of getting really long
>       NOHZ idling times is the extra overhead of the
>       exit/enter nohz cycles for partially idle CPUs on
>       high-rate IO workloads. ]
> 
> There's two types of costs from these functions:
> 
>    (A) scheduler tick cost via (1): this happens on busy
>        CPUs too, and is thus a primary scalability cost.
>        But the rate here is constant and typically much
>        lower than (A), hence the absolute benefit to
>        workload scalability will be lower as well.
> 
>    (B) idle cost via (2): going-to-idle and
>        coming-from-idle costs are secondary concerns,
>        because they impact power efficiency more than
>        they impact scalability. (Ie. while 'wasting idle
>        time' isn't good, but it often doesn't hurt
>        scalability, at least as long as it's done for a
>        good reason and done in moderation.)
> 
>        but in terms of absolute cost this scales up with
>        nr_cpus as well, and a much faster rate, and thus
>        may also approach and negatively impact system
>        limits like memory bus/fabric bandwidth.
> 

Thank you. I am going to mostly copy this into the next version :)

> So I'd argue that reductions in both (A) and (B) are
> useful, but for different reasons.
> 
> The *real* breakthrough in this area would be to reduce
> the unlimited upwards frequency of (2), by
> fundamentally changing the model of NOHZ idle
> balancing:
> 
> For example by measuring the rate (frequency) of idle
> cycles on each CPU (this can be done without any
> cross-CPU logic), we would turn off NOHZ-idle for that
> CPU when the rate goes beyond a threshold.
> 
> The resulting regular idle load-balancing passes will
> be rate-limited by balance intervals and won't be as
> aggressive as nohz_balance_enter+exit_idle(). (I hope...)
> 
> Truly idle CPUs would go into NOHZ mode automatically,
> as their measured rate of idling drops below the
> threshold.
> 
> Thoughts?

Interesting.

Let me see if I get this right.

So track the idle duration over a certain past interval.
If it is below a certain threshold, mark those CPUs in nohz state
while doing idle entry/exit.
If not, reset their bits in the nohz mask and don't update the mask?

I think rq->avg_idle is there already, and we do similar checks for newidle balance:
sched_balance_newidle
...
         if (!get_rd_overloaded(this_rq->rd) ||
             this_rq->avg_idle < sd->max_newidle_lb_cost) {

                 update_next_balance(sd, &next_balance);
                 rcu_read_unlock();
                 goto out;
         }
Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask instead
Posted by Ingo Molnar 2 months, 1 week ago
* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> > So I'd argue that reductions in both (A) and (B) 
> > are useful, but for different reasons.
> > 
> > The *real* breakthrough in this area would be to 
> > reduce the unlimited upwards frequency of (2), by 
> > fundamentally changing the model of NOHZ idle 
> > balancing:
> > 
> > For example by measuring the rate (frequency) of 
> > idle cycles on each CPU (this can be done without 
> > any cross-CPU logic), we would turn off NOHZ-idle 
> > for that CPU when the rate goes beyond a threshold.
> > 
> > The resulting regular idle load-balancing passes 
> > will be rate-limited by balance intervals and won't 
> > be as aggressive as nohz_balance_enter+exit_idle(). 
> > (I hope...)
> > 
> > Truly idle CPUs would go into NOHZ mode 
> > automatically, as their measured rate of idling 
> > drops below the threshold.
> > 
> > Thoughts?
> 
> Interesting.
> 
> Let me see if i get this right.
> 
> So track the idle duration over certain past 
> interval. If is below certain threshould mark those 
> CPUs in nohz state while doing idle entry/exit. If 
> not, reset their bits in nohz mask and don't update 
> the mask?
> 
> I think rq->avg_idle there already and we do similar checks for newidle_balance.
> sched_balance_newidle
> ...
>         if (!get_rd_overloaded(this_rq->rd) ||
>             this_rq->avg_idle < sd->max_newidle_lb_cost) {
> 
>                 update_next_balance(sd, &next_balance);
>                 rcu_read_unlock();
>                 goto out;
>         }

Yeah, seems so - but I haven't put much thought into 
the idea, so caveat emptor. :-)

Thanks,

	Ingo