[RESEND RFC PATCH v2 00/29] sched/fair: Push-based load balancing

Posted by K Prateek Nayak 1 week, 4 days ago
Resending the series with the correct Email IDs this time. Sorry for the
noise.

This is the combined successor to the following two series:

https://lore.kernel.org/lkml/20250904041516.3046-1-kprateek.nayak@amd.com/
https://lore.kernel.org/lkml/20250409111539.23791-1-kprateek.nayak@amd.com/

Most bits are the same except for more initial cleanups. The changelog is
attached towards the end. This topic will be discussed at LPC'25 in the
"Scheduler and Real-Time MC"; jump to the "LPC 2025" section below for
what will be discussed.


Problem statement
=================

This series tackles three problems:

1. Busy load balancing always uses the first CPU of
   group_balance_mask(), putting all the load balancing responsibility
   on a single CPU.

2. The "nohz.nr_idle" and "nohz.idle_cpus" are global system-wide
   variables that can run into scalability bottlenecks on a large
   core-count system.

3. Periodic balance can take a long time to even out imbalance on
   systems with a mostly large and flat sched-domain hierarchy.
   Preempted tasks may wait for a long time behind other runnable tasks,
   increasing tail latencies.

This series aims at addressing the combined problems listed above.


Implementation details
======================

Note: Sections marked EXPERIMENTAL are known to introduce regressions
for certain benchmarks. These have been discussed in detail in the next
section. These patches may also be incomplete from a schedstats
accounting standpoint.


o Busy balancing optimization

The busy balance CPU is always the first CPU of
"sched_group->sgc->group_balance_mask". The "sgc" object is shared by all
the CPUs on the "group_balance_mask", even for overlapping domains.

To keep overheads minimal, a simple "busy_balance_cpu" is maintained in
the shared "sgc" object. A CPU is nominated to handle the busy
balancing. Once the CPU is done with its turn, it nominates the next CPU
on the group_balance_mask.

  - Advantages: The responsibility of busy balance is rotated among the
    CPUs on the group_balance_mask. Maintaining the "busy_balance_cpu"
    only requires READ_ONCE() / WRITE_ONCE() accesses, making it
    relatively cheap.

  - Disadvantages: The currently nominated "busy_balance_cpu" can run
    for a long time with bh disabled, which can prevent balancing work
    from running; however, it is no worse than the current state where
    the first CPU continues running with bh disabled for a prolonged
    period of time.
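
As a rough illustration, a minimal sketch of the rotation could look like
the one below. It assumes a new "busy_balance_cpu" field in
"sched_group_capacity"; the function name and placement are illustrative
and not the exact patch.

  static int can_do_busy_balance(struct sched_group *sg, int this_cpu)
  {
          struct sched_group_capacity *sgc = sg->sgc;
          const struct cpumask *gbm = group_balance_mask(sg);
          int nominated = READ_ONCE(sgc->busy_balance_cpu);
          int next;

          /* Fall back to the first CPU if nothing valid was nominated yet. */
          if (nominated >= nr_cpu_ids || !cpumask_test_cpu(nominated, gbm))
                  nominated = cpumask_first(gbm);

          if (this_cpu != nominated)
                  return 0;

          /* Hand the next round to the next CPU on the mask, wrapping around. */
          next = cpumask_next(this_cpu, gbm);
          if (next >= nr_cpu_ids)
                  next = cpumask_first(gbm);
          WRITE_ONCE(sgc->busy_balance_cpu, next);

          return 1;
  }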


o Centralized "nohz" accounting optimizations

The centralized "nohz" tracking maintains the number and list of CPUs
that are in nohz idle state. These are done via atomic operations on
variables shared across the system which is less than ideal.

Peter suggested breaking the mask down and embedding it into the
sched_domain hierarchy which would minimize atomic operations on the
global variables.

There are two possible ways to implement this:

1. Maintain the idle CPUs mask in sched_domain_shared. Also construct a
   hierarchy of the sched_domain_shared objects which can be used to
   propagate a signal up to the topmost domain.

   - Advantages: Distributed tracking. Fewer atomic operations on the
     global variables.

   - Disadvantages: The number of atomic ops can scale with the depth of
     the hierarchy, with multiple cache lines possibly being shared
     between multiple NUMA domains.

2. [Implemented in this series] Maintain the idle CPUs mask in
   sched_domain_shared. String all the sched_domain_shared objects in a
   global list which is used for traversing all nohz idle CPUs during
   load balancing.

   Maintain a global "nr_doms" indicator that is only updated when the
   first CPU is added to the LLC-local mask / the last CPU leaves the
   LLC-local mask.

   - Advantages: Distributed tracking. Simpler implementation.

   - Disadvantages: The number of atomic ops on the global "nr_doms" can
     scale with the number of LLC domains; however, changes at these
     boundary conditions are still less frequent than with the current
     global scheme.

The nohz_idle_cpus mask is also inherently optimized: a CPU is retained
on the mask until its first tick and is not immediately cleared when
ticks are enabled again / at idle exit.
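
For illustration, the idle-entry side of the scheme above could look
roughly like the sketch below. The "nohz_idle_cpus_mask" field in
"sched_domain_shared", the "nohz_nr_doms" counter and the helper name
are assumptions based on the description here, not the exact patch.

  static void nohz_enter_idle_sketch(int cpu)
  {
          struct sched_domain_shared *sds;

          rcu_read_lock();
          sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
          if (!sds)
                  goto unlock;

          /* Track the idle CPU in its LLC-local mask - no global atomics. */
          if (cpumask_test_and_set_cpu(cpu, sds->nohz_idle_cpus_mask))
                  goto unlock;

          /*
           * Only the first CPU of an LLC entering nohz idle touches the
           * global counter, keeping global atomics at the boundary
           * conditions.
           */
          if (cpumask_weight(sds->nohz_idle_cpus_mask) == 1)
                  atomic_inc(&nohz_nr_doms);
  unlock:
          rcu_read_unlock();
  }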


o [EXPERIMENTAL] Push-based load balancing

Proactively push tasks to idle CPUs within an LLC domain. Push-based
load balancing is found to be a delicate balancing act where delaying
running the tasks, especially if their runtime is small, can lead to
performance regressions.

There are cases, especially with larger benchmarks, where pushing the
tasks more proactively helps with performance; however, a number of
microbenchmarks suffer as a result of the additional work the busy CPU
has to do to push a preempted task.
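
A conceptual sketch of a single push operation under this scheme is
shown below. The "pick_next_pushable_fair_task()" and
"llc_nohz_idle_mask()" helpers are placeholders and the double rq
locking is elided; this is a sketch of the idea, not the actual patch.

  static void push_fair_task_sketch(struct rq *rq)
  {
          struct task_struct *p = pick_next_pushable_fair_task(rq);
          int target;

          if (!p)
                  return;

          /* Find an idle CPU in the same LLC that the task is allowed on. */
          target = cpumask_first_and(llc_nohz_idle_mask(rq->cpu), p->cpus_ptr);
          if (target >= nr_cpu_ids)
                  return;

          /* Move the preempted task instead of waiting for periodic balance. */
          deactivate_task(rq, p, 0);
          set_task_cpu(p, target);
          activate_task(cpu_rq(target), p, 0);
          resched_curr(cpu_rq(target));
  }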


o [EXPERIMENTAL] Optimizing Intra-NUMA newidle balancing

On a CONFIG_PREEMPTION-enabled kernel, newidle balance only pulls one
task to keep the balancing latency low. Despite this effort, newidle
balance ends up computing a great deal of stats just to pull a single
task at best.

Instead of following the usual path, directly traverse CPUs for newidle
balance in search of CPUs to pull load from.

This too is found to have interesting effects on benchmarks where CPUs
can converge on a single target to pull tasks from, causing some amount
of lock contention.

More interestingly, a number of benchmarks seem to regress if the
newidle balance yields on spotting (nr_running > 1 || ttwu_pending)
instead of just proceeding to scan the entire domain and bailing at the
end.
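
A minimal sketch of such a direct scan, including the kind of early
yield discussed above, is shown below. The "pull_one_task()" helper and
the locking details are placeholders, not the actual patch.

  static int newidle_scan_sketch(struct rq *this_rq)
  {
          int this_cpu = cpu_of(this_rq);
          int cpu;

          for_each_cpu_wrap(cpu, cpumask_of_node(cpu_to_node(this_cpu)), this_cpu + 1) {
                  struct rq *src_rq = cpu_rq(cpu);

                  if (cpu == this_cpu || src_rq->nr_running <= 1)
                          continue;

                  /* Pull a single task; detach/attach and locking elided. */
                  if (pull_one_task(this_rq, src_rq))
                          return 1;

                  /* Yield early if work has already arrived on this CPU. */
                  if (this_rq->nr_running || this_rq->ttwu_pending)
                          break;
          }

          return 0;
  }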


Benchmark results
=================

Results for some variants are incomplete as a result of setup issues
(and my sheer incompetence in reverting some of the changes I made when
analyzing the benchmarks).

I'll update these as and when the runs (and re-runs) complete but at the
moment, this is how the different [EXPERIMENTAL] bits stack up from a
benchmarking perspective on a dual-socket 3rd Generation EPYC system
(2 x 64C/128T):

  ==================================================================
  Test          : hackbench
  Units         : Normalized time in seconds
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  Case:           tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV) push+newidle[pct imp](CV)
   1-groups     1.00 [ -0.00]( 5.58)     1.01 [ -1.10](14.77)     1.01 [ -0.66]( 8.80)     1.03 [ -2.85]( 6.13)
   2-groups     1.00 [ -0.00]( 9.58)     1.02 [ -2.41]( 3.09)     1.00 [  0.22]( 5.62)     1.02 [ -1.97]( 4.54)
   4-groups     1.00 [ -0.00]( 2.11)     0.99 [  1.48]( 2.30)     1.00 [ -0.21]( 2.60)     1.03 [ -2.54]( 2.82)
   8-groups     1.00 [ -0.00]( 2.07)     1.02 [ -2.31]( 2.98)     1.15 [-14.79]( 2.15)     1.13 [-12.63]( 2.57)
  16-groups     1.00 [ -0.00]( 3.55)     1.09 [ -8.57]( 7.80)     1.04 [ -3.64]( 3.89)     1.04 [ -4.33]( 1.36)


  ==================================================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:    tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)  push+newidle[pct imp](CV)
      1     1.00 [  0.00]( 0.29)     1.01 [  0.63]( 0.68)     1.00 [ -0.15]( 0.96)       0.99 [ -1.46]( 0.25)
      2     1.00 [  0.00]( 0.55)     1.00 [ -0.09]( 0.21)     1.00 [  0.47]( 0.46)       0.99 [ -1.36]( 0.54)
      4     1.00 [  0.00]( 0.33)     0.99 [ -0.83]( 0.54)     1.01 [  0.76]( 0.36)       0.98 [ -1.51]( 0.20)
      8     1.00 [  0.00]( 0.75)     1.00 [ -0.42]( 1.14)     1.01 [  0.96]( 0.49)       0.99 [ -0.64]( 0.34)
     16     1.00 [  0.00]( 0.98)     0.99 [ -0.70]( 1.23)     0.97 [ -2.55]( 0.73)       0.98 [ -1.80]( 1.62)
     32     1.00 [  0.00]( 0.04)     0.98 [ -2.32]( 1.14)     0.98 [ -1.94]( 0.86)       0.98 [ -2.02]( 0.64)
     64     1.00 [  0.00]( 1.27)     0.94 [ -5.51]( 3.69)     0.97 [ -3.45]( 1.28)       0.99 [ -1.49]( 1.68)
    128     1.00 [  0.00]( 0.69)     1.00 [ -0.05]( 2.34)     1.01 [  0.79]( 0.93)       0.99 [ -1.16]( 0.68)
    256     1.00 [  0.00]( 5.60)     0.97 [ -2.67]( 5.28)     1.00 [  0.34]( 1.23)       0.98 [ -2.16]( 7.10)
    512     1.00 [  0.00]( 0.90)     1.00 [ -0.38]( 0.86)     1.01 [  0.53]( 0.10)       0.98 [ -1.88]( 0.09)
   1024     1.00 [  0.00]( 0.25)     0.99 [ -1.01]( 0.37)     1.01 [  0.91]( 0.53)       0.98 [ -1.58]( 0.32)


  ==================================================================
  Test          : stream-10
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)      push_only[pct imp](CV)   newidle_only[pct imp](CV)  push+newidle[pct imp](CV)
   Copy     1.00 [  0.00]( 4.37)     0.97 [ -2.82]( 8.57)     0.99 [ -1.31]( 6.75)       0.97 [ -3.34]( 6.18)
  Scale     1.00 [  0.00]( 2.75)     0.99 [ -0.73]( 3.62)     0.99 [ -0.86]( 3.73)       0.99 [ -1.49]( 5.39)
    Add     1.00 [  0.00]( 3.54)     0.98 [ -2.40]( 3.99)     0.98 [ -1.51]( 4.12)       0.97 [ -3.27]( 6.28)
  Triad     1.00 [  0.00]( 4.41)     0.98 [ -1.71]( 7.00)     1.01 [  0.55]( 3.77)       0.96 [ -4.32]( 7.49)


  ==================================================================
  Test          : stream-100
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)   push+newidle[pct imp](CV)
   Copy     1.00 [  0.00]( 3.25)     0.96 [ -4.08]( 3.07)     0.98 [ -1.56]( 3.45)       0.97 [ -2.74]( 2.00)
  Scale     1.00 [  0.00]( 1.49)     0.98 [ -2.25]( 4.13)     0.98 [ -1.86]( 4.32)       0.99 [ -1.19]( 1.43)
    Add     1.00 [  0.00]( 1.75)     1.00 [ -0.47]( 2.17)     1.00 [ -0.14]( 1.31)       0.99 [ -0.81]( 2.26)
  Triad     1.00 [  0.00]( 1.95)     0.97 [ -2.82]( 4.63)     0.95 [ -4.65]( 6.59)       0.97 [ -2.80]( 4.84)


  ==================================================================
  Test          : netperf
  Units         : Normalized Throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:         tip[pct imp](CV)       push_only[pct imp](CV)  newidle_only[pct imp](CV)   push+newidle[pct imp](CV)
   1-clients     1.00 [  0.00]( 0.25)     0.98 [ -1.51]( 0.56)      0.99 [ -1.37]( 0.32)       0.98 [ -1.91]( 0.38)
   2-clients     1.00 [  0.00]( 0.39)     0.99 [ -1.26]( 1.05)      0.99 [ -0.99]( 0.75)       0.98 [ -2.16]( 0.57)
   4-clients     1.00 [  0.00]( 0.67)     0.99 [ -0.73]( 0.68)      1.00 [ -0.22]( 0.46)       0.98 [ -1.70]( 0.30)
   8-clients     1.00 [  0.00]( 0.46)     0.99 [ -1.09]( 0.50)      1.00 [ -0.27]( 0.44)       0.98 [ -1.84]( 0.59)
  16-clients     1.00 [  0.00]( 0.76)     0.99 [ -0.79]( 0.48)      1.00 [ -0.24]( 1.35)       0.99 [ -1.31]( 0.74)
  32-clients     1.00 [  0.00]( 0.82)     0.99 [ -0.91]( 0.80)      1.00 [ -0.04]( 1.16)       0.99 [ -1.27]( 0.83)
  64-clients     1.00 [  0.00]( 1.63)     0.99 [ -0.97]( 1.37)      1.00 [  0.13]( 1.47)       0.99 [ -1.17]( 1.60)
  128-clients    1.00 [  0.00]( 1.30)     0.99 [ -1.07]( 1.42)      0.99 [ -0.92]( 1.41)       0.98 [ -1.77]( 1.19)
  256-clients    1.00 [  0.00]( 5.43)     1.02 [  1.53]( 6.74)      1.02 [  1.54]( 3.40)       1.00 [  0.25]( 6.01)
  512-clients    1.00 [  0.00](55.62)     1.00 [ -0.25](54.85)      0.98 [ -1.91](52.43)       0.98 [ -1.88](51.45)


  ==================================================================
  Test          : schbench
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)    push+newidle[pct imp](CV)
    1     1.00 [ -0.00]( 2.50)     1.00 [ -0.00](35.19)     0.88 [ 12.50](31.97)         0.95 [  5.00](33.07)
    2     1.00 [ -0.00]( 8.58)     1.02 [ -2.44]( 6.45)     1.02 [ -2.44]( 9.52)         1.00 [ -0.00]( 2.44)
    4     1.00 [ -0.00]( 7.36)     1.02 [ -2.22]( 3.30)     0.98 [  2.22]( 8.29)         1.02 [ -2.22](13.95)
    8     1.00 [ -0.00]( 8.73)     1.10 [ -9.62]( 9.02)     1.06 [ -5.77]( 6.68)         1.04 [ -3.85]( 6.46)
   16     1.00 [ -0.00]( 4.34)     1.05 [ -4.84]( 4.01)     1.03 [ -3.23]( 1.82)         1.06 [ -6.45]( 4.07)
   32     1.00 [ -0.00]( 3.27)     1.06 [ -6.19]( 4.01)     0.99 [  1.03]( 2.08)         1.00 [ -0.00]( 2.06)
   64     1.00 [ -0.00]( 2.05)     1.01 [ -1.02]( 1.27)     1.01 [ -1.02]( 5.11)         0.91 [  9.18]( 6.51)
  128     1.00 [ -0.00]( 6.08)     0.95 [  5.49]( 4.91)     1.09 [ -8.59]( 8.22)         1.08 [ -7.88](11.81)
  256     1.00 [ -0.00]( 3.28)     0.94 [  6.24]( 4.22)     1.04 [ -3.72]( 6.18)         1.04 [ -4.10]( 3.62)
  512     1.00 [ -0.00]( 2.23)     0.98 [  2.29]( 1.92)     0.98 [  1.90]( 6.93)         1.02 [ -1.90]( 1.51)


  ==================================================================
  Test          : new-schbench-requests-per-second
  Units         : Normalized Requests per second
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)  push+newidle[pct imp](CV)
    1     1.00 [  0.00]( 0.14)     1.00 [  0.00]( 0.29)     1.00 [  0.00]( 0.14)       0.99 [ -0.56]( 0.91)
    2     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.14)       1.00 [  0.00]( 0.00)
    4     1.00 [  0.00]( 0.14)     1.00 [  0.00]( 0.14)     1.00 [  0.28]( 0.14)       1.00 [  0.28]( 0.14)
    8     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)       1.00 [  0.00]( 0.00)
   16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)       1.00 [  0.00]( 0.00)
   32     1.00 [  0.00]( 4.75)     1.01 [  1.13]( 0.29)     1.00 [  0.00]( 3.77)       0.99 [ -0.57]( 0.51)
   64     1.00 [  0.00]( 1.17)     1.01 [  0.69](13.90)     1.00 [  0.00](13.33)       1.01 [  0.69](13.35)
  128     1.00 [  0.00]( 0.00)     1.00 [  0.34]( 0.18)     1.01 [  0.68]( 0.00)       1.00 [  0.34]( 0.18)
  256     1.00 [  0.00]( 0.56)     1.00 [ -0.49]( 1.24)     1.00 [  0.25]( 1.47)       1.01 [  0.99]( 1.20)
  512     1.00 [  0.00]( 0.96)     1.00 [ -0.37]( 0.88)     1.00 [ -0.37]( 1.58)       1.00 [ -0.37]( 0.88)


  ==================================================================
  Test          : new-schbench-wakeup-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)   push+newidle[pct imp](CV)
    1     1.00 [ -0.00](24.81)     0.75 [ 25.00](24.12)     0.67 [ 33.33]( 6.74)        0.67 [ 33.33](11.18)
    2     1.00 [ -0.00]( 4.08)     0.77 [ 23.08]( 9.68)     0.92 [  7.69](21.56)        0.77 [ 23.08]( 8.94)
    4     1.00 [ -0.00]( 0.00)     1.08 [ -7.69](10.00)     0.85 [ 15.38]( 9.99)        0.85 [ 15.38]( 8.13)
    8     1.00 [ -0.00](12.91)     1.09 [ -9.09]( 4.43)     0.82 [ 18.18](19.99)        0.82 [ 18.18](23.66)
   16     1.00 [ -0.00](12.06)     1.18 [-18.18]( 8.37)     1.18 [-18.18](15.10)        1.18 [-18.18](15.10)
   32     1.00 [ -0.00](22.13)     1.00 [ -0.00]( 5.00)     1.10 [-10.00](19.86)        1.00 [ -0.00]( 5.34)
   64     1.00 [ -0.00](11.07)     1.00 [ -0.00](16.90)     0.92 [  7.69](15.49)        1.00 [ -0.00](13.62)
  128     1.00 [ -0.00]( 9.04)     0.98 [  2.48]( 3.01)     0.99 [  1.49]( 6.96)        0.98 [  1.98]( 5.42)
  256     1.00 [ -0.00]( 0.24)     1.00 [ -0.00]( 0.00)     1.00 [ -0.24]( 0.12)        1.00 [ -0.24]( 0.32)
  512     1.00 [ -0.00]( 0.34)     1.00 [ -0.00]( 0.40)     1.00 [  0.38]( 0.34)        0.99 [  1.15]( 0.20)


  ==================================================================
  Test          : new-schbench-request-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)      push_only[pct imp](CV)   newidle_only[pct imp](CV)  push+newidle[pct imp](CV)
    1     1.00 [ -0.00]( 0.90)     0.99 [  0.84]( 1.82)     0.99 [  0.56]( 1.10)        1.03 [ -2.53]( 1.88)
    2     1.00 [ -0.00]( 0.00)     1.01 [ -0.57]( 0.29)     1.01 [ -0.86]( 0.81)        1.02 [ -2.28]( 1.04)
    4     1.00 [ -0.00]( 1.02)     0.98 [  1.69]( 0.15)     0.99 [  0.84]( 1.02)        1.01 [ -0.84]( 1.67)
    8     1.00 [ -0.00]( 0.15)     1.01 [ -0.57]( 0.51)     1.00 [ -0.00]( 0.26)        1.00 [ -0.00]( 0.39)
   16     1.00 [ -0.00]( 0.53)     1.01 [ -0.57]( 0.64)     1.00 [ -0.29]( 0.39)        1.01 [ -0.86]( 0.81)
   32     1.00 [ -0.00](35.40)     0.98 [  1.62]( 0.49)     0.99 [  0.81](10.03)        1.00 [ -0.00]( 0.48)
   64     1.00 [ -0.00]( 5.24)     0.92 [  7.82](26.28)     1.03 [ -2.52]( 6.65)        0.62 [ 38.02](32.78)
  128     1.00 [ -0.00]( 2.02)     0.99 [  0.75]( 1.40)     1.16 [-16.14]( 2.15)        1.17 [-16.89]( 3.16)
  256     1.00 [ -0.00]( 3.41)     0.96 [  4.08]( 3.32)     1.07 [ -7.13]( 2.60)        1.10 [ -9.94]( 4.96)
  512     1.00 [ -0.00]( 1.45)     1.00 [  0.43]( 2.77)     0.99 [  1.06]( 0.73)        0.98 [  1.92]( 0.40)


  ==================================================================
  Test          : Various longer running benchmarks
  Units         : %diff in throughput reported
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  Benchmarks:                push_only    newidle_only  push+newidle
  ycsb-cassandra              -3%             -3%          -1%
  ycsb-mongodb                -2%             -2%          -1%
  deathstarbench-1x           24%             16%
  deathstarbench-2x           12%             14%
  deathstarbench-3x           17%             14%
  deathstarbench-6x


LPC 2025
========

Further experiments carried out will be discussed at LPC'25 in the
"Scheduler and Real-Time MC" between 11:08AM and 11:30AM on 11th
December, 2025 in Hall B4.

Discussion points include:

o "sd->shared" assignment optimization.
o "nohz.idle_cpus" mask optimization
o Busy balance CPU rotation.
o Effective detection of when it is favorable to push tasks.
o The overheads of maintaining masks (even with optimizations).
o The delicate dance of newidle balance.

Please do drop by, or reach out to me directly if this work interests
you.


Changelog
=========

This series is based on tip:sched/core at commit 3eb593560146 ("Merge
tag 'v6.18-rc7' into sched/core, to pick up fixes"). All the comparisons
above are done with the same.

o rfc v1.. rfc v2

- Collected tags on Patch 1 from Shrikanth (Thanks a ton for the review).

- Added a for_each_cpu_and_wrap() and cleaned up a couple of sites using
  the newly introduced macro.

- Simplified conditions that referenced per-CPU "llc_size" and
  "sd->shared" using the fact that only sd_llc has sd->shared assigned.

- Merged the two series; however, the idea is largely the same.
  Push-based load balancing is guarded behind CONFIG_NO_HZ_COMMON since
  a bunch of NO_HZ_COMMON specific bits were put behind the config
  option.

- Idea of overloaded_mask was dropped since the overhead to maintain
  the mask (without any usage) was visible in many benchmark results.

- Idea of shallow_idle_cpus mask was dropped since the overhead to
  maintain the mask (without any usage) was visible in benchmarks like
  tbench that left the CPUs idle for a very short duration.

- Added the patch to rotate the "busy_balance_cpu".

- Renamed "idle_cpus_mask" to "nohz_idle_cpus_mask" in anticipation of
  adding the "shallow_idle_cpus" mask which didn't pan out.


Note: Patches marked EXPERIMENTAL may be incomplete from a schedstats
accounting standpoint.

---
K Prateek Nayak (28):
  sched/fair: Simplify set_cpu_sd_state_*() with guards
  sched/fair: Use rq->nohz_tick_stopped in update_nohz_stats()
  sched/topology: Optimize sd->shared allocation and assignment
  sched/fair: Simplify the entry condition for update_idle_cpu_scan()
  sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
  cpumask: Introduce for_each_cpu_and_wrap() and bitfield helpers
  sched/fair: Use for_each_cpu_and_wrap() in select_idle_capacity()
  sched/fair: Use for_each_cpu_and_wrap() in select_idle_cpu()
  sched/fair: Rotate the CPU responsible for busy load balancing
  sched/fair: Use xchg() to set sd->nohz_idle state
  sched/topology: Attach new hierarchy in rq_attach_root()
  sched/fair: Fixup sd->nohz_idle state during hotplug / cpuset
  sched/fair: Account idle cpus instead of busy cpus in sd->shared
  sched/topology: Introduce fallback sd->shared assignment
  sched/topology: Introduce percpu sd_nohz for nohz state tracking
  sched/topology: Introduce "nohz_idle_cpus_mask" in sd->shared
  sched/topology: Introduce "nohz_shared_list" to keep track of
    sd->shared
  sched/fair: Reorder the barrier in nohz_balance_enter_idle()
  sched/fair: Extract the main _nohz_idle_balance() loop into a helper
  sched/fair: Convert find_new_ilb() to use nohz_shared_list
  sched/fair: Introduce sched_asym_prefer_idle() for ILB kick
  sched/fair: Convert sched_balance_nohz_idle() to use nohz_shared_list
  sched/fair: Remove "nohz.idle_cpus_mask"
  sched/fair: Optimize global "nohz.nr_cpus" tracking
  sched/topology: Add basic debug information for "nohz_shared_list"
  [EXPERIMENTAL] sched/fair: Proactive idle balance using push mechanism
  [EXPERIMENTAL] sched/fair: Add a local counter to rate limit task push
  [EXPERIMENTAL] sched/fair: Faster alternate for intra-NUMA newidle
    balance

Vincent Guittot (1):
  [EXPERIMENTAL] sched/fair: Add push task framework

 include/linux/cpumask.h        |  20 +
 include/linux/find.h           |  37 ++
 include/linux/sched/topology.h |  18 +-
 kernel/sched/core.c            |   4 +-
 kernel/sched/fair.c            | 828 ++++++++++++++++++++++++++-------
 kernel/sched/sched.h           |  10 +-
 kernel/sched/topology.c        | 386 +++++++++++++--
 7 files changed, 1076 insertions(+), 227 deletions(-)


base-commit: 3eb59356014674fa1b506a122cc59b57089a4d08
-- 
2.43.0
Re: [RESEND RFC PATCH v2 00/29] sched/fair: Push-based load balancing
Posted by Shrikanth Hegde 1 week, 3 days ago

On 12/8/25 2:56 PM, K Prateek Nayak wrote:
> Resending the series with the correct Email IDs this time. Sorry for the
> noise.
> 
> This is the combined successor to the following two series:
> 
> https://lore.kernel.org/lkml/20250904041516.3046-1-kprateek.nayak@amd.com/
> https://lore.kernel.org/lkml/20250409111539.23791-1-kprateek.nayak@amd.com/
> 
> Most bits are same except for more initial cleanups. Changelog is
> attached towards the end. This topic will be discussed at LPC'25 in the
> "Scheduler and Real-Time MC" - jump to "LPC 2025" to know what will be
> discussed.
> 
> 
> Problem statement
> =================
> 
> This series tackles three problems:
> 
> 1. The busy load balancing always uses the first CPU of
>     group_balance_mask() for load balancing which puts all the load
>     balancing responsibility on a single CPU.
> 
> 2. The "nohz.nr_idle" and "nohz.idle_cpus" are global system-wide
>     variables that can run into scalability bottlenecks on a large
>     core-count system.
> 

I think they are nohz.nr_cpus and nohz.idle_cpus_mask. isn't it?


[..snip..]

Re: [RESEND RFC PATCH v2 00/29] sched/fair: Push-based load balancing
Posted by K Prateek Nayak 1 week, 3 days ago
Hello Shrikanth,

On 12/8/2025 7:34 PM, Shrikanth Hegde wrote:
>> 2. The "nohz.nr_idle" and "nohz.idle_cpus" are global system-wide
>>     variables that can run into scalability bottlenecks on a large
>>     core-count system.
>>
> 
> I think they are nohz.nr_cpus and nohz.idle_cpus_mask. isn't it?

Ack! Sorry, my brain is all mush at this point from staring at all the
new and old "nohz.*" variable names. Thank you for catching that.

-- 
Thanks and Regards,
Prateek