[PATCH] sched/fair: Avoid false sharing in nohz struct
Posted by Wangyang Guo 1 month, 3 weeks ago
There are two potential false sharing issues in the nohz struct:
1. idle_cpus_mask is a read-mostly field, but shares the same cacheline
   with the frequently updated nr_cpus.
2. Data placed after nohz still shares the same cacheline and has a
   potential false sharing issue.

This patch resolves both problems by isolating the frequently updated
fields in a single cacheline.

Reported-by: Benjamin Lei <benjamin.lei@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
---
 kernel/sched/fair.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b752324270b..bcc2766b7986 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7193,13 +7193,14 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
 #ifdef CONFIG_NO_HZ_COMMON
 
 static struct {
-	cpumask_var_t idle_cpus_mask;
-	atomic_t nr_cpus;
+	/* Isolate frequently updated fields in a cacheline to avoid false sharing issue. */
+	atomic_t nr_cpus ____cacheline_aligned;
 	int has_blocked;		/* Idle CPUS has blocked load */
 	int needs_update;		/* Newly idle CPUs need their next_balance collated */
 	unsigned long next_balance;     /* in jiffy units */
 	unsigned long next_blocked;	/* Next update of blocked load in jiffies */
-} nohz ____cacheline_aligned;
+	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
+} nohz;
 
 #endif /* CONFIG_NO_HZ_COMMON */
 
-- 
2.47.3
Re: [PATCH] sched/fair: Avoid false sharing in nohz struct
Posted by Shrikanth Hegde 1 month, 2 weeks ago
Hi Wangyang,

On 12/11/25 11:26 AM, Wangyang Guo wrote:
> There are two potential false sharing issue in nohz struct:
> 1. idle_cpus_mask is a read-mostly field, but share the same cacheline
>     with frequently updated nr_cpus.

Updates to idle_cpus_mask are not in the same cacheline, even though it is updated alongside nr_cpus.

With CPUMASK_OFFSTACK=y, idle_cpus_mask is a pointer to the actual mask,
so updates to the mask happen in another cacheline.

With CPUMASK_OFFSTACK=n, idle_cpus_mask is the bitmap itself and its length
depends on NR_CPUS. With typical values of 512/2048/8192 it can span a few
cachelines, so updates to it likely land in a different cacheline than nr_cpus.

see  https://lore.kernel.org/all/aS6bK4ad-wO2fsoo@gmail.com/
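
Roughly, the two cases look like this (paraphrased from include/linux/cpumask.h,
not a verbatim copy):

#ifdef CONFIG_CPUMASK_OFFSTACK
/*
 * Only the pointer lives inside nohz; the bitmap itself is allocated
 * separately, so setting/clearing idle bits never dirties nohz's cacheline.
 */
typedef struct cpumask *cpumask_var_t;
#else
/*
 * The whole NR_CPUS bitmap is embedded in nohz; at 512/2048/8192 bits it
 * spans several cachelines, so most bit updates still miss nr_cpus' line.
 */
typedef struct cpumask cpumask_var_t[1];
#endif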


Likely in your case the nr_cpus updates are the costly ones.
Try the patch below and see if it fixes your issue too:
https://lore.kernel.org/all/20251201183146.74443-1-sshegde@linux.ibm.com/
I should send out a new version soon.

> 2. Data followed by nohz still share the same cacheline and has
>     potential false sharing issue.
> 

How does your patch handle this?
I don't see any other struct apart from nohz being changed.


> This patch tries to resolve the above two problems by isolating the
> frequently updated fields in a single cacheline.
> 
> Reported-by: Benjamin Lei <benjamin.lei@intel.com>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Reviewed-by: Tianyou Li <tianyou.li@intel.com>
> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> ---
>   kernel/sched/fair.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5b752324270b..bcc2766b7986 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7193,13 +7193,14 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
>   #ifdef CONFIG_NO_HZ_COMMON
>   
>   static struct {
> -	cpumask_var_t idle_cpus_mask;
> -	atomic_t nr_cpus;
> +	/* Isolate frequently updated fields in a cacheline to avoid false sharing issue. */
> +	atomic_t nr_cpus ____cacheline_aligned;
>   	int has_blocked;		/* Idle CPUS has blocked load */
>   	int needs_update;		/* Newly idle CPUs need their next_balance collated */
>   	unsigned long next_balance;     /* in jiffy units */
>   	unsigned long next_blocked;	/* Next update of blocked load in jiffies */
> -} nohz ____cacheline_aligned;
> +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
> +} nohz;
>

This can cause a lot of space wastage.
For example, powerpc has a 128-byte cacheline.
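
Rough arithmetic with a throwaway userspace sketch (my own mock of the two
layouts, assuming CPUMASK_OFFSTACK=y so the mask is just a pointer; the old
variable was already ____cacheline_aligned, so it occupied one line, while
the patched layout occupies two):

#include <stdio.h>
#include <stdatomic.h>

#define CACHELINE 128			/* e.g. powerpc */

struct nohz_old {			/* variable was ____cacheline_aligned */
	void *idle_cpus_mask;		/* OFFSTACK=y: just a pointer */
	atomic_int nr_cpus;
	int has_blocked;
	int needs_update;
	unsigned long next_balance;
	unsigned long next_blocked;
};

struct nohz_new {
	_Alignas(CACHELINE) atomic_int nr_cpus;
	int has_blocked;
	int needs_update;
	unsigned long next_balance;
	unsigned long next_blocked;
	_Alignas(CACHELINE) void *idle_cpus_mask;
};

int main(void)
{
	/* prints "old: 40 new: 256" on a typical LP64 target */
	printf("old: %zu new: %zu\n",
	       sizeof(struct nohz_old), sizeof(struct nohz_new));
	return 0;
}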
Re: [PATCH] sched/fair: Avoid false sharing in nohz struct
Posted by Guo, Wangyang 1 month, 2 weeks ago
On 12/21/2025 9:05 PM, Shrikanth Hegde wrote:
> Hi Wangyang,
> 
> On 12/11/25 11:26 AM, Wangyang Guo wrote:
>> There are two potential false sharing issue in nohz struct:
>> 1. idle_cpus_mask is a read-mostly field, but share the same cacheline
>>     with frequently updated nr_cpus.
> 
> Updates to idle_cpus_mask is not same cacheline. it is updated alongside 
> nr_cpus.
> 
> with CPUMASK_OFFSTACK=y, idle_cpus_mask is a pointer to the actual mask.
> Updates to it happen in another cacheline.
> 
> with CPUMASK_OFFSTACK=n, idle_cpus_mask is on the stack and its length
> depends on NR_CPUS. typical value being 512/2048/8192 it can span a few
> cachelines. So updates to it likely in different cacheline compared to 
> nr_cpus.
> 
> see  https://lore.kernel.org/all/aS6bK4ad-wO2fsoo@gmail.com/
> 
This patch mainly targets the case where idle_cpus_mask is a pointer,
which is the default for many distro kernels.

> 
> Likely in your case, nr_cpus updates are the costly ones.
> Try below and see if it helps to fix your issue too.
> https://lore.kernel.org/all/20251201183146.74443-1-sshegde@linux.ibm.com/
> I Should send out new version soon.
> 
>> 2. Data followed by nohz still share the same cacheline and has
>>     potential false sharing issue.
>>
> 
> How does your patch handle this?
> I don't see any other struct apart from nohz being changed.

The data that follows nohz is implicit and determined by the compiler.
For example, this is the layout from /proc/kallsyms on my machine:
ffffffff88600d40 b nohz
ffffffff88600d68 B arch_needs_tick_broadcast
ffffffff88600d6c b __key.264
ffffffff88600d6c b __key.265
ffffffff88600d70 b dl_generation
ffffffff88600d78 b sched_clock_irqtime

What we can do is place the read-mostly `idle_cpus_mask` pointer in a new
cacheline, so the data that follows nohz is not affected by nr_cpus.
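(Assuming 64-byte cachelines: nohz at ...d40 starts a line that runs through
...d7f, so arch_needs_tick_broadcast, dl_generation and sched_clock_irqtime
above all sit in the line that nr_cpus dirties with the current layout.)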

> 
>> This patch tries to resolve the above two problems by isolating the
>> frequently updated fields in a single cacheline.
>>
>> Reported-by: Benjamin Lei <benjamin.lei@intel.com>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Reviewed-by: Tianyou Li <tianyou.li@intel.com>
>> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
>> ---
>>   kernel/sched/fair.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5b752324270b..bcc2766b7986 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7193,13 +7193,14 @@ static DEFINE_PER_CPU(cpumask_var_t, 
>> should_we_balance_tmpmask);
>>   #ifdef CONFIG_NO_HZ_COMMON
>>   static struct {
>> -    cpumask_var_t idle_cpus_mask;
>> -    atomic_t nr_cpus;
>> +    /* Isolate frequently updated fields in a cacheline to avoid 
>> false sharing issue. */
>> +    atomic_t nr_cpus ____cacheline_aligned;
>>       int has_blocked;        /* Idle CPUS has blocked load */
>>       int needs_update;        /* Newly idle CPUs need their 
>> next_balance collated */
>>       unsigned long next_balance;     /* in jiffy units */
>>       unsigned long next_blocked;    /* Next update of blocked load in 
>> jiffies */
>> -} nohz ____cacheline_aligned;
>> +    cpumask_var_t idle_cpus_mask ____cacheline_aligned;
>> +} nohz;
>>
> 
> This can cause a lot of space wastage.
> for exp: powerpc has 128 byte cacheline.
> 

nohz is global, so only one instance exists. The size inflation is minimal,
less than one cacheline.

BR
Wangyang
Re: [PATCH] sched/fair: Avoid false sharing in nohz struct
Posted by Shrikanth Hegde 1 month, 2 weeks ago

On 12/22/25 7:51 AM, Guo, Wangyang wrote:
> On 12/21/2025 9:05 PM, Shrikanth Hegde wrote:
>> Hi Wangyang,
>>
>> On 12/11/25 11:26 AM, Wangyang Guo wrote:
>>> There are two potential false sharing issue in nohz struct:
>>> 1. idle_cpus_mask is a read-mostly field, but share the same cacheline
>>>     with frequently updated nr_cpus.
>>
>> Updates to idle_cpus_mask is not same cacheline. it is updated 
>> alongside nr_cpus.
>>
>> with CPUMASK_OFFSTACK=y, idle_cpus_mask is a pointer to the actual mask.
>> Updates to it happen in another cacheline.
>>
>> with CPUMASK_OFFSTACK=n, idle_cpus_mask is on the stack and its length
>> depends on NR_CPUS. typical value being 512/2048/8192 it can span a few
>> cachelines. So updates to it likely in different cacheline compared to 
>> nr_cpus.
>>
>> see  https://lore.kernel.org/all/aS6bK4ad-wO2fsoo@gmail.com/
>>
> This patch is mainly target for idle_cpus_mask as a pointer, which is 
> default for many distro OS.
> 

Not all archs.

>>
>> Likely in your case, nr_cpus updates are the costly ones.
>> Try below and see if it helps to fix your issue too.
>> https://lore.kernel.org/all/20251201183146.74443-1-sshegde@linux.ibm.com/
>> I Should send out new version soon.
>>
>>> 2. Data followed by nohz still share the same cacheline and has
>>>     potential false sharing issue.
>>>
>>
>> How does your patch handle this?
>> I don't see any other struct apart from nohz being changed.
> 
> The data follow by nohz is implicit and determined by compiler.
> For example, this is the layout from /proc/kallsyms in my machine:
> ffffffff88600d40 b nohz
> ffffffff88600d68 B arch_needs_tick_broadcast
> ffffffff88600d6c b __key.264
> ffffffff88600d6c b __key.265
> ffffffff88600d70 b dl_generation
> ffffffff88600d78 b sched_clock_irqtime
> 
> What we can do is placing read-mostly `idle_cpus_mask` pointer in a new 
> cacheline, so data followed by nohz would not be affected by nr_cpus.
> 

That's a concern. If it is compiler dependent, then sometimes it helps, sometimes it won't.

It should be done the other way around rather than changing nohz.
If there is a structure which gets a lot of reads/updates, it should go
into its own cacheline instead.

i.e. in your case sched_clock_irqtime should go into its own cacheline.

---
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 4f97896887ec..29f9438f9f03 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -25,7 +25,7 @@
   */
  DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
  
-int sched_clock_irqtime;
+int sched_clock_irqtime __cacheline_aligned;
  
  void enable_sched_clock_irqtime(void)
  {

Re: [PATCH] sched/fair: Avoid false sharing in nohz struct
Posted by Guo, Wangyang 1 month, 2 weeks ago
On 12/23/2025 3:27 PM, Shrikanth Hegde wrote:
>>>
>>> Likely in your case, nr_cpus updates are the costly ones.
>>> Try below and see if it helps to fix your issue too.
>>> https://lore.kernel.org/all/20251201183146.74443-1- 
>>> sshegde@linux.ibm.com/
>>> I Should send out new version soon.
>>>
>>>> 2. Data followed by nohz still share the same cacheline and has
>>>>     potential false sharing issue.
>>>>
>>>
>>> How does your patch handle this?
>>> I don't see any other struct apart from nohz being changed.
>>
>> The data follow by nohz is implicit and determined by compiler.
>> For example, this is the layout from /proc/kallsyms in my machine:
>> ffffffff88600d40 b nohz
>> ffffffff88600d68 B arch_needs_tick_broadcast
>> ffffffff88600d6c b __key.264
>> ffffffff88600d6c b __key.265
>> ffffffff88600d70 b dl_generation
>> ffffffff88600d78 b sched_clock_irqtime
>>
>> What we can do is placing read-mostly `idle_cpus_mask` pointer in a 
>> new cacheline, so data followed by nohz would not be affected by nr_cpus.
>>
> 
> That's a concern. If it is compiler dependent, then sometime it helps, 
> sometime it wont.

This patch wants to make it compiler independent. Sometimes it may be
sched_clock_irqtime, sometimes it may be baba_vars. Changing nohz makes
sure it does not affect others, no matter what data follows.

> It should done other way around rather than changing the nohz.
> If there is structure which has a lot of read/updates, it should go into 
> its
> own cacheline rather.
> 
> i.e in your case sched_clock_irqtime should go into its own cacheline.

If that is preferred in the kernel, I can resubmit the patch so that it
only adds alignment to sched_clock_irqtime.

> ---
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 4f97896887ec..29f9438f9f03 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -25,7 +25,7 @@
>    */
>   DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
> 
> -int sched_clock_irqtime;
> +int sched_clock_irqtime __cacheline_aligned;
> 
>   void enable_sched_clock_irqtime(void)
>   {
> 
> 

BR
Wangyang