[PATCH 3/4] sched/fair: Check for blocked task after time check

Posted by Shrikanth Hegde 2 months, 1 week ago
nohz.has_blocked can be updated often, as and when CPUs enter idle.
The blocked-load stats, however, are updated only at regular
intervals, usually every LOAD_AVG_PERIOD (32ms).

Read nohz.has_blocked only after the time check succeeds, so the
common case avoids a cache reference to the frequently written flag.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 55746274af06..5534822fd754 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12440,8 +12440,8 @@ static void nohz_balancer_kick(struct rq *rq)
 	 */
 	nohz_balance_exit_idle(rq);
 
-	if (READ_ONCE(nohz.has_blocked) &&
-	    time_after(now, READ_ONCE(nohz.next_blocked)))
+	if (time_after(now, READ_ONCE(nohz.next_blocked)) &&
+	    READ_ONCE(nohz.has_blocked))
 		flags = NOHZ_STATS_KICK;
 
 	if (time_before(now, nohz.next_balance))
-- 
2.43.0
Re: [PATCH 3/4] sched/fair: Check for blocked task after time check
Posted by Ingo Molnar 2 months, 1 week ago
* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> nohz.has_blocked can be updated often, as and when CPUs enter idle.
> The blocked-load stats, however, are updated only at regular
> intervals, usually every LOAD_AVG_PERIOD (32ms).
> 
> Read nohz.has_blocked only after the time check succeeds, so the
> common case avoids a cache reference to the frequently written flag.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  kernel/sched/fair.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 55746274af06..5534822fd754 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12440,8 +12440,8 @@ static void nohz_balancer_kick(struct rq *rq)
>  	 */
>  	nohz_balance_exit_idle(rq);
>  
> -	if (READ_ONCE(nohz.has_blocked) &&
> -	    time_after(now, READ_ONCE(nohz.next_blocked)))
> +	if (time_after(now, READ_ONCE(nohz.next_blocked)) &&
> +	    READ_ONCE(nohz.has_blocked))
>  		flags = NOHZ_STATS_KICK;

So this patch makes no sense, as the two fields [1] and 
[2] are almost next to each other:

  static struct {
        cpumask_var_t idle_cpus_mask;                                                                           // 0
        atomic_t nr_cpus;                                                                                       // 8
        int has_blocked;                /* Idle CPUS has blocked load */                  <========== [1]       // 12
        int needs_update;               /* Newly idle CPUs need their next_balance collated */                  // 16
        unsigned long next_balance;     /* in jiffy units */                                                    // 24
        unsigned long next_blocked;     /* Next update of blocked load in jiffies */      <========== [2]       // 32
  } nohz ____cacheline_aligned;

... and thus they very likely share the same cacheline 
and there can be no reduction in cacheline bouncing 
from this change.

In fact with OFFSTACK=y the cpumask_var_t is 8 bytes 
and thus the offset of the two fields will be 12 and 32 
within the same 64-byte cacheline, guaranteed. I've 
marked the field offsets in the rightmost column for 
this case.
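
For reference, the layout is easy to double-check from userspace with
a mock of the struct. A minimal sketch (not the kernel code itself;
it assumes CPUMASK_OFFSTACK=y, i.e. cpumask_var_t is a single
pointer, and a 64-bit build where atomic_t wraps a 4-byte int):

  #include <stdio.h>
  #include <stddef.h>

  /* Userspace mock of the nohz struct above, same member order */
  struct nohz_mock {
          void *idle_cpus_mask;   /* cpumask_var_t: a pointer w/ OFFSTACK=y */
          int nr_cpus;            /* atomic_t wraps a 4-byte int            */
          int has_blocked;
          int needs_update;
          unsigned long next_balance;
          unsigned long next_blocked;
  };

  int main(void)
  {
          size_t a = offsetof(struct nohz_mock, has_blocked);
          size_t b = offsetof(struct nohz_mock, next_blocked);

          /* prints 12 and 32 -- both land in the first 64-byte line */
          printf("has_blocked at %zu, next_blocked at %zu: %s cacheline\n",
                 a, b, (a / 64 == b / 64) ? "same" : "different");
          return 0;
  }

(pahole on a vmlinux with debug info will show the real layout, too.)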

Thanks,

	Ingo
Re: [PATCH 3/4] sched/fair: Check for blocked task after time check
Posted by Shrikanth Hegde 2 months, 1 week ago

On 12/2/25 11:56 AM, Ingo Molnar wrote:
> 
> * Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> 
>> nohz.has_blocked can be updated often, as and when CPUs enter idle.
>> The blocked-load stats, however, are updated only at regular
>> intervals, usually every LOAD_AVG_PERIOD (32ms).
>>
>> Read nohz.has_blocked only after the time check succeeds, so the
>> common case avoids a cache reference to the frequently written flag.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   kernel/sched/fair.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 55746274af06..5534822fd754 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12440,8 +12440,8 @@ static void nohz_balancer_kick(struct rq *rq)
>>   	 */
>>   	nohz_balance_exit_idle(rq);
>>   
>> -	if (READ_ONCE(nohz.has_blocked) &&
>> -	    time_after(now, READ_ONCE(nohz.next_blocked)))
>> +	if (time_after(now, READ_ONCE(nohz.next_blocked)) &&
>> +	    READ_ONCE(nohz.has_blocked))
>>   		flags = NOHZ_STATS_KICK;
> 
> So this patch makes no sense, as the two fields [1] and
> [2] are almost next to each other:
> 
>    static struct {
>          cpumask_var_t idle_cpus_mask;                                                                           // 0
>          atomic_t nr_cpus;                                                                                       // 8
>          int has_blocked;                /* Idle CPUS has blocked load */                  <========== [1]       // 12
>          int needs_update;               /* Newly idle CPUs need their next_balance collated */                  // 16
>          unsigned long next_balance;     /* in jiffy units */                                                    // 24
>          unsigned long next_blocked;     /* Next update of blocked load in jiffies */      <========== [2]       // 32
>    } nohz ____cacheline_aligned;
> 
> ... and thus they very likely share the same cacheline
> and there can be no reduction in cacheline bouncing
> from this change.
> 
> In fact with OFFSTACK=y the cpumask_var_t is 8 bytes
> and thus the offset of the two fields will be 12 and 32
> within the same 64-byte cacheline, guaranteed. I've
> marked the field offsets in the rightmost column for
> this case.
> 
> Thanks,
> 
> 	Ingo

Ok. Since we fetch the cache line in either case, the read should be
minimal overhead; at best we may be saving one load. Likely not
worth it.

I got a bit carried away. We can ignore this change.