[v4] sched/fair: Improve nohz fields for large systems

[PATCH v4 0/3] sched/fair: Improve nohz fields for large systems

Posted by Shrikanth Hegde 3 weeks, 6 days ago

On large systems, the cacheline holding nohz.nr_cpus was seen to be
contended. Many CPUs do atomic inc/dec and reads on it at the same time,
so this line can bounce often.

The 1st and 2nd patches are minor ones. They look like the correct things
to do, but are not very important.

3rd patch: The main patch, which gets rid of nr_cpus. Instead, use the
cpumask, which is always updated alongside it. Functionally it should serve
the same purpose. The rest of the fields aren't updated that often, so this
line shouldn't bounce that often.
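
As a rough illustration of the idea behind patch 3 (this is not the actual
fair.c change), here is a minimal userspace sketch. It assumes at most 64
CPUs, and struct nohz_sketch and the nohz_*() helpers are hypothetical names
made up for this example. The point is that "is any CPU idle?" can be
answered from the mask that idle entry/exit already updates, so no separate
atomic counter has to be kept in sync (and dirtied) alongside it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the nohz state: one bit per CPU, max 64 CPUs. */
struct nohz_sketch {
	_Atomic uint64_t idle_cpus_mask;
	/* Note: no separate nr_cpus counter is kept next to the mask. */
};

/* Idle entry: only the mask is written. */
static void nohz_enter_idle(struct nohz_sketch *n, unsigned int cpu)
{
	assert(cpu < 64);
	atomic_fetch_or(&n->idle_cpus_mask, UINT64_C(1) << cpu);
}

/* Idle exit: again, only the mask is written. */
static void nohz_exit_idle(struct nohz_sketch *n, unsigned int cpu)
{
	assert(cpu < 64);
	atomic_fetch_and(&n->idle_cpus_mask, ~(UINT64_C(1) << cpu));
}

/* Replaces a "counter != 0" style check with an emptiness test on the mask. */
static bool nohz_has_idle_cpus(struct nohz_sketch *n)
{
	return atomic_load(&n->idle_cpus_mask) != 0;
}

/* Weight of the mask, for callers that really need the count. */
static unsigned int nohz_nr_idle_cpus(struct nohz_sketch *n)
{
	uint64_t mask = atomic_load(&n->idle_cpus_mask);
	unsigned int weight = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		weight++;
	}
	return weight;
}
```

Readers of nohz_has_idle_cpus() still touch the mask's cacheline, which is
why contention on the mask itself is a separate problem.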

The contention issue with nohz.idle_cpus_mask still remains. Mostly it is
in a separate cacheline from nohz. There are ongoing efforts to mitigate
it; it is not addressed by this series.

v3 -> v4:
- Added a note to the changelog about one less cacheline being dirtied on
  idle entry/exit (Valentin Schneider)

v2 -> v3:
- Converted "out" to a return when no CPU is in tickless mode, since
  find_ilb_cpu returns anyway (K Prateek Nayak)

v1 -> v2:
- Dropped the patch that checked has_blocked based on time.
- More detailed changelog for removing nr_cpus (Thanks to Ingo Molnar)

v1: https://lore.kernel.org/all/20251201183146.74443-1-sshegde@linux.ibm.com/
v2: https://lore.kernel.org/all/20260102124744.360872-1-sshegde@linux.ibm.com/
v3: https://lore.kernel.org/all/20260107065125.669668-1-sshegde@linux.ibm.com/

Shrikanth Hegde (3):
  sched/fair: Move checking for nohz cpus after time check
  sched/fair: Change likelyhood of nohz.nr_cpus
  sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead

 kernel/sched/fair.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

-- 
2.47.3

Re: [PATCH v4 0/3] sched/fair: Improve nohz fields for large systems

Posted by K Prateek Nayak 3 weeks, 5 days ago

Hello Shrikanth,

On 1/12/2026 10:34 AM, Shrikanth Hegde wrote:
> Running on large systems nohz.nr_cpus cacheline was seen as contended.
> There is atomic inc/dec and read happening on many
> CPUs at a time and it is possible for this line to bounce often.
> 
> 1st and 2nd patch are minor ones. Looks like correct things to do.
> Not very important ones.
> 
> 3rd patch: Main patch which is to get rid of nr_cpus.Instead, use the cpumask
> which is always updated alongside with it. Functionally it should serve
> the same purpose. Rest of the fields aren't updated that often. So this
> line shouldn't bounce that often.
> 
> Contention issue with nohz.idle_cpus_mask still remains. Mostly it is in
> separate cacheline than nohz. There are ongoing efforts to mitigate it. It
> is not addressed by this series.
> 
> v3 -> v4:
> - Added to changelog on one less cacheline being dirtied on idle
>   entry/exit (Valentin Schneider)

I tested v3 over the weekend and didn't spot any regressions
(at least none that I can reproduce consistently), so feel free to
include:

Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

If anyone is curious, following are results from my setup
(3rd Generation EPYC, 2 sockets x 64C/128T, boost on, C2 disabled):

Note: tbench hit some insane luck on higher utilization runs. I
haven't been able to reproduce those regressions reliably.
Most data points that show a regression also have high run-to-run
variance on both tip and tip + patch, making them unreliable.

  ==================================================================
  Test          : hackbench
  Units         : Normalized time in seconds
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  Case:           tip[pct imp](CV)    nohz_no_nr_cpus[pct imp](CV)
   1-groups     1.00 [ -0.00]( 6.43)     1.04 [ -3.60](15.15)
   2-groups     1.00 [ -0.00]( 5.42)     1.02 [ -2.17]( 3.57)
   4-groups     1.00 [ -0.00]( 2.72)     0.99 [  0.84]( 3.11)
   8-groups     1.00 [ -0.00]( 3.65)     1.00 [  0.31]( 2.50)
  16-groups     1.00 [ -0.00]( 2.26)     1.02 [ -1.67]( 2.92)
  
  
  ==================================================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:    tip[pct imp](CV)    nohz_no_nr_cpus[pct imp](CV)
      1     1.00 [  0.00]( 0.40)     1.00 [ -0.25]( 1.22)
      2     1.00 [  0.00]( 1.33)     0.99 [ -0.57]( 0.37)
      4     1.00 [  0.00]( 0.27)     1.00 [  0.07]( 0.89)
      8     1.00 [  0.00]( 0.53)     0.99 [ -0.83]( 0.32)
     16     1.00 [  0.00]( 1.39)     1.00 [  0.11]( 1.92)
     32     1.00 [  0.00]( 1.85)     0.99 [ -1.44]( 3.08)
     64     1.00 [  0.00]( 1.55)     0.98 [ -2.17]( 2.51)
    128     1.00 [  0.00]( 1.05)     0.94 [ -6.11]( 0.28)
    256     1.00 [  0.00]( 0.68)     0.94 [ -5.58]( 3.77)
    512     1.00 [  0.00]( 0.30)     0.95 [ -4.91]( 0.22)
   1024     1.00 [  0.00]( 0.19)     0.95 [ -4.86]( 0.21)
  
  
  ==================================================================
  Test          : stream-10
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)    nohz_no_nr_cpus[pct imp](CV)
   Copy     1.00 [  0.00]( 8.08)     1.03 [  2.91]( 4.84)
  Scale     1.00 [  0.00]( 5.43)     1.04 [  3.56]( 3.32)
    Add     1.00 [  0.00]( 5.96)     1.04 [  4.10]( 2.96)
  Triad     1.00 [  0.00]( 6.36)     0.99 [ -1.23]( 5.83)
  
  
  ==================================================================
  Test          : stream-100
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)    nohz_no_nr_cpus[pct imp](CV)
   Copy     1.00 [  0.00]( 3.78)     1.03 [  3.17]( 1.90)
  Scale     1.00 [  0.00]( 4.17)     1.02 [  1.79]( 0.91)
    Add     1.00 [  0.00]( 1.97)     1.01 [  0.52]( 1.66)
  Triad     1.00 [  0.00]( 2.28)     0.99 [ -1.49]( 4.44)
  
  
  ==================================================================
  Test          : schbench
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)    nohz_no_nr_cpus[pct imp](CV)
    1     1.00 [ -0.00](33.02)     1.21 [-20.59]( 8.06)
    2     1.00 [ -0.00](14.30)     1.14 [-14.29]( 6.45)
    4     1.00 [ -0.00]( 2.22)     0.98 [  2.22]( 4.55)
    8     1.00 [ -0.00]( 4.63)     0.94 [  5.56]( 1.96)
   16     1.00 [ -0.00]( 1.67)     1.07 [ -6.67]( 1.82)
   32     1.00 [ -0.00]( 5.58)     0.99 [  1.04]( 2.11)
   64     1.00 [ -0.00]( 6.03)     0.99 [  0.52]( 5.25)
  128     1.00 [ -0.00]( 7.09)     1.00 [ -0.49]( 5.11)
  256     1.00 [ -0.00]( 3.14)     0.94 [  6.06](13.53)
  512     1.00 [ -0.00]( 0.86)     0.98 [  2.23]( 1.53)
  
  
  ==================================================================
  Test          : new-schbench-requests-per-second
  Units         : Normalized Requests per second
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)    nohz_no_nr_cpus[pct imp](CV)
    1     1.00 [  0.00]( 0.14)     1.00 [  0.00]( 0.52)
    2     1.00 [  0.00]( 0.14)     1.00 [  0.28]( 0.00)
    4     1.00 [  0.00]( 0.14)     1.00 [  0.00]( 0.00)
    8     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.14)
   16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
   32     1.00 [  0.00]( 5.05)     0.97 [ -3.11]( 1.91)
   64     1.00 [  0.00](10.41)     1.06 [  5.60]( 3.79)
  128     1.00 [  0.00]( 0.30)     0.98 [ -2.38]( 0.31)
  256     1.00 [  0.00]( 1.43)     0.98 [ -1.73]( 1.38)
  512     1.00 [  0.00]( 1.45)     0.97 [ -3.33]( 1.48)
  
  
  ==================================================================
  Test          : new-schbench-wakeup-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)    nohz_no_nr_cpus[pct imp](CV)
    1     1.00 [ -0.00](24.99)     1.08 [ -8.33](16.90)
    2     1.00 [ -0.00]( 0.00)     1.40 [-40.00](18.20)
    4     1.00 [ -0.00](12.06)     1.27 [-27.27]( 7.75)
    8     1.00 [ -0.00](14.13)     0.90 [ 10.00](23.66)
   16     1.00 [ -0.00](15.96)     1.09 [ -9.09]( 7.45)
   32     1.00 [ -0.00](12.06)     0.91 [  9.09](18.23)
   64     1.00 [ -0.00](15.78)     1.06 [ -6.25](13.18)
  128     1.00 [ -0.00](10.57)     1.03 [ -3.41]( 5.15)
  256     1.00 [ -0.00]( 0.32)     1.00 [ -0.00]( 0.21)
  512     1.00 [ -0.00]( 0.00)     1.00 [  0.38]( 0.20)
  
  
  ==================================================================
  Test          : new-schbench-request-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)    nohz_no_nr_cpus[pct imp](CV)
    1     1.00 [ -0.00]( 0.00)     1.00 [ -0.27]( 1.79)
    2     1.00 [ -0.00]( 0.74)     0.96 [  4.07]( 1.90)
    4     1.00 [ -0.00]( 0.37)     0.96 [  3.83]( 1.91)
    8     1.00 [ -0.00]( 1.02)     1.00 [ -0.28]( 1.52)
   16     1.00 [ -0.00]( 1.61)     1.00 [  0.28]( 1.86)
   32     1.00 [ -0.00]( 9.22)     1.04 [ -3.52]( 6.84)
   64     1.00 [ -0.00]( 6.39)     1.06 [ -5.96](22.58)
  128     1.00 [ -0.00]( 1.08)     1.12 [-12.43]( 4.61)
  256     1.00 [ -0.00]( 6.10)     1.01 [ -0.77]( 4.87)
  512     1.00 [ -0.00]( 1.41)     1.01 [ -1.03]( 1.27) 
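
For anyone reading the tables above: the bracketed "pct imp" values appear
to be the percent improvement of the normalized result over tip (1.00),
with the sign flipped for lower-is-better metrics so that a positive number
is always an improvement. CV is presumably the coefficient of variation in
percent. A small sketch under that assumption (pct_imp() and enum direction
are names made up here, not part of any benchmark tooling):

```c
#include <assert.h>

/* Which way is "better" for a given benchmark metric. */
enum direction {
	HIGHER_IS_BETTER,	/* e.g. throughput, bandwidth */
	LOWER_IS_BETTER,	/* e.g. time, latency */
};

/*
 * Percent improvement of a normalized result relative to the baseline,
 * which is 1.00 by construction. Positive means the patched kernel did
 * better, negative means it did worse. This formula is an assumption
 * inferred from the table values, not taken from the test harness.
 */
static double pct_imp(double normalized, enum direction dir)
{
	assert(normalized > 0.0);
	if (dir == LOWER_IS_BETTER)
		return (1.0 - normalized) * 100.0;
	return (normalized - 1.0) * 100.0;
}
```

The normalized column is rounded to two decimals, so the hackbench 1-groups
entry shown as 1.04 [ -3.60] corresponds to an unrounded value of about
1.0360, and the tbench 128-client entry shown as 0.94 [ -6.11] to about
0.9389.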

-- 
Thanks and Regards,
Prateek