Cache aware scheduling enhancements

[Patch v4 00/16] Cache aware scheduling enhancements

Posted by Tim Chen 4 weeks, 1 day ago

This patch set contains cache-aware scheduling enhancements
and bug fixes on top of Peter's sched/cache branch:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/cache

Patches 1 to 6 resolve the over-aggregation issue, which is the remaining
part of v4 that has not yet been merged into sched/cache. Patches 7 to 15
fix bugs reported by Sashiko (online and local).

Compared with cache-aware v4, the major change in the first part is
storing the LLC effective size in the per-CPU bottom sched_domain. This
allows checking whether a task's memory footprint exceeds the threshold
by fetching the value directly from the corresponding sched_domain,
instead of recalculating it every time. In addition, the NUMA balancing
page-fault statistics are used instead of RSS to estimate the working
set. We also picked up Jianyong's optimization patch to reduce CPU scan
overhead. However, if NUMA balancing is not enabled, we will not have
this working set estimate. Perhaps using RSS would be appropriate for
such scenarios.
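
As a rough userspace illustration of the check described above (not the
code in the series), the footprint comparison against a cached LLC size
might look like the sketch below. The struct, the function names, and the
scale_pct knob are all hypothetical; the RSS fallback is only the
possibility mentioned above, not something the patches implement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the bottom sched_domain; not the kernel struct. */
struct demo_sched_domain {
	uint64_t llc_bytes;	/* effective LLC size, computed once at build time */
};

/*
 * Pick a working-set estimate in pages. The RSS fallback is only the
 * possibility floated in the cover letter, not something the series does.
 */
static uint64_t demo_footprint_pages(bool numa_balancing_on,
				     uint64_t fault_pages, uint64_t rss_pages)
{
	return numa_balancing_on ? fault_pages : rss_pages;
}

/*
 * True when the footprint exceeds scale_pct percent of the cached LLC size.
 * Dividing before multiplying avoids overflow at the cost of rounding down.
 */
static bool demo_exceeds_llc(const struct demo_sched_domain *sd,
			     uint64_t footprint_pages, uint64_t page_size,
			     unsigned int scale_pct)
{
	uint64_t threshold;

	assert(sd != NULL);
	threshold = sd->llc_bytes / 100 * scale_pct;
	return footprint_pages * page_size > threshold;
}
```

The point of the sketch is only that the LLC size is read from a
precomputed field rather than derived on every check.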

Gengkun's CPU scan optimization is not included for now and will be
revisited after further tuning.

Most patches in the second part address race conditions. Each patch fixes
one independent issue to facilitate easier review.

Test results show that the current version keeps the same performance
as v4 for workloads and platforms we tested.

Future plans are to introduce fine-grained control over which tasks use
cache-aware scheduling, after the load-balance-based cache-aware
scheduling is merged:

- Look into task tagging (e.g. with the schedqos framework or cgroups)
  for non-process-based grouping of tasks to an LLC.
- Evaluate fast cache-aware aggregation in the wakeup path.

I will be on sabbatical from mid May to mid June. Chen Yu will still be
following up on these patches.

Thanks.

Tim

Chen Yu (15):
  sched/cache: Disable cache aware scheduling for processes with high
    thread counts
  sched/cache: Skip cache-aware scheduling for single-threaded processes
  sched/cache: Calculate the LLC size and store it in sched_domain
  sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  sched/cache: Add user control to adjust the aggressiveness of
    cache-aware scheduling
  sched/cache: Fix rcu warning when accessing sd_llc domain
  sched/cache: Fix potential NULL mm pointer access
  sched/cache: Annotate lockless accesses to mm->sc_stat.cpu
  sched/cache: Fix unpaired account_llc_enqueue/dequeue
  sched/cache: Fix checking active load balance by only considering the
    CFS task
  sched/cache: Fix race condition during sched domain rebuild
  sched/cache: Fix cache aware scheduling enabling for multi LLCs system
  sched/cache: Fix has_multi_llcs iff at least one partition has
    multiple LLCs
  sched/cache: Fix possible overflow when invalidating the preferred CPU
  sched/cache: Fix stale preferred_llc for a new task

Jianyong Wu (1):
  sched/cache: Allow only 1 thread of the process to calculate the LLC
    occupancy

 drivers/base/cacheinfo.c       |  23 +++
 include/linux/cacheinfo.h      |   1 +
 include/linux/sched.h          |   5 +
 include/linux/sched/topology.h |   7 +
 init/init_task.c               |   1 +
 kernel/exit.c                  |  29 ++++
 kernel/sched/debug.c           |  14 +-
 kernel/sched/fair.c            | 256 +++++++++++++++++++++++++++++----
 kernel/sched/sched.h           |   7 +-
 kernel/sched/topology.c        | 240 +++++++++++++++++++++++++------
 10 files changed, 509 insertions(+), 74 deletions(-)

-- 
2.32.0

Re: [Patch v4 00/16] Cache aware scheduling enhancements

Posted by K Prateek Nayak 3 weeks, 2 days ago

Hello Tim, Chenyu,

On 5/14/2026 2:09 AM, Tim Chen wrote:
> This patch set contains cache-aware scheduling enhancements
> and bug fixes on top of Peter's sched/cache branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/cache

I took the latest queue:sched/core for a spin before and after
the sched/cache merge and everything is now looking fine.

I've temporarily lost access to my usual test machine so I could only grab
the microbenchmark data but they are mostly positive to unaffected as
expected. I'll update if I see anything funky with longer running
benchmarks if and when I get a chance.

Following is the data from a dual socket Zen4c system (2 x 128C/256T)
with 32 LLCs in total:

o Kernels:

tip:         queue:sched/core at commit dd29c017aed6 ("sched/rt: Have
             RT_PUSH_IPI be default off for non PREEMPT_RT")

sched-cache: queue:sched/core at commit a26d9208c137 ("Merge branch
             'sched/cache'")


o Benchmark results

  ==================================================================
  Test          : hackbench
  Units         : Normalized time in seconds
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  Case:           tip[pct imp](CV)      sched_cache[pct imp](CV)
   1-groups     1.00 [ -0.00]( 9.66)     0.92 [  8.04](14.93)
   2-groups     1.00 [ -0.00]( 9.22)     0.88 [ 11.96](12.53)
   4-groups     1.00 [ -0.00]( 2.14)     0.99 [  0.93]( 1.55)
   8-groups     1.00 [ -0.00]( 2.80)     1.00 [  0.22]( 3.96)
  16-groups     1.00 [ -0.00]( 5.54)     1.00 [ -0.49]( 2.76)


  ==================================================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:    tip[pct imp](CV)     sched_cache[pct imp](CV)
      1     1.00 [  0.00]( 0.03)     1.00 [  0.30]( 0.29)
      2     1.00 [  0.00]( 0.32)     1.00 [ -0.45]( 1.86)
      4     1.00 [  0.00]( 0.34)     1.00 [  0.38]( 0.14)
      8     1.00 [  0.00]( 0.24)     1.01 [  0.56]( 0.34)
     16     1.00 [  0.00]( 0.45)     1.00 [  0.12]( 0.05)
     32     1.00 [  0.00]( 0.58)     1.01 [  1.27]( 0.58)
     64     1.00 [  0.00]( 0.81)     1.01 [  1.32]( 0.16)
    128     1.00 [  0.00]( 0.53)     1.03 [  3.27]( 1.15)
    256     1.00 [  0.00]( 0.30)     1.02 [  2.14]( 0.64)
    512     1.00 [  0.00]( 3.73)     1.01 [  1.00]( 2.73)
   1024     1.00 [  0.00]( 0.23)     0.99 [ -0.53]( 0.29)
   2048     1.00 [  0.00]( 0.14)     0.99 [ -0.73]( 0.37)


  ==================================================================
  Test          : stream-10
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)     sched_cache[pct imp](CV)
   Copy     1.00 [  0.00]( 0.66)     1.00 [  0.04]( 0.43)
  Scale     1.00 [  0.00]( 0.89)     1.00 [  0.17]( 0.70)
    Add     1.00 [  0.00]( 0.73)     1.00 [  0.08]( 0.73)
  Triad     1.00 [  0.00]( 0.70)     1.00 [  0.04]( 0.75)


  ==================================================================
  Test          : stream-100
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)     sched_cache[pct imp](CV)
   Copy     1.00 [  0.00]( 0.32)     1.00 [ -0.25]( 1.49)
  Scale     1.00 [  0.00]( 0.26)     0.99 [ -0.50]( 1.56)
    Add     1.00 [  0.00]( 0.29)     0.99 [ -0.69]( 1.22)
  Triad     1.00 [  0.00]( 0.27)     0.99 [ -0.71]( 1.24)


  ==================================================================
  Test          : netperf
  Units         : Normalized Throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:           tip[pct imp](CV)     sched_cache[pct imp](CV)
     1-clients     1.00 [  0.00]( 0.10)     1.00 [ -0.08]( 0.13)
     2-clients     1.00 [  0.00]( 0.29)     1.00 [ -0.01]( 0.16)
     4-clients     1.00 [  0.00]( 0.36)     1.00 [ -0.25]( 0.21)
     8-clients     1.00 [  0.00]( 0.32)     1.00 [ -0.28]( 0.16)
    16-clients     1.00 [  0.00]( 0.24)     1.00 [ -0.38]( 0.24)
    32-clients     1.00 [  0.00]( 0.42)     1.00 [ -0.46]( 0.49)
    64-clients     1.00 [  0.00]( 0.94)     1.00 [ -0.40]( 0.65)
   128-clients     1.00 [  0.00]( 1.10)     1.00 [ -0.08]( 0.89)
   256-clients     1.00 [  0.00]( 1.06)     1.00 [ -0.10]( 0.97)
   512-clients     1.00 [  0.00]( 4.68)     0.98 [ -1.56]( 4.53)
   768-clients     1.00 [  0.00](34.35)     0.98 [ -2.03](32.96)
  1024-clients     1.00 [  0.00](42.76)     0.98 [ -1.74](43.29)


  ==================================================================
  Test          : schbench
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:  tip[pct imp](CV)     sched_cache[pct imp](CV)
     1     1.00 [ -0.00](18.94)     0.39 [ 61.36]( 8.81)
     2     1.00 [ -0.00]( 1.67)     0.91 [  8.57](12.48)
     4     1.00 [ -0.00]( 9.79)     0.70 [ 29.73](11.76)
     8     1.00 [ -0.00]( 2.27)     0.82 [ 18.18]( 6.19)
    16     1.00 [ -0.00]( 0.00)     0.98 [  1.79]( 1.82)
    32     1.00 [ -0.00]( 1.92)     1.00 [ -0.00]( 0.72)
    64     1.00 [ -0.00]( 1.19)     1.02 [ -1.56]( 0.77)
   128     1.00 [ -0.00]( 0.67)     1.00 [ -0.00]( 0.44)
   256     1.00 [ -0.00]( 0.46)     1.01 [ -0.88]( 1.08)
   512     1.00 [ -0.00]( 0.33)     0.97 [  2.64]( 2.07)
   768     1.00 [ -0.00]( 4.69)     1.02 [ -1.55]( 2.51)
  1024     1.00 [ -0.00]( 2.71)     1.05 [ -4.72]( 1.36)


  ==================================================================
  Test          : new-schbench-requests-per-second
  Units         : Normalized Requests per second
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  #workers:  tip[pct imp](CV)     sched_cache[pct imp](CV)
     1     1.00 [  0.00]( 0.15)     0.99 [ -0.59]( 0.15)
     2     1.00 [  0.00]( 0.00)     0.99 [ -0.59]( 0.15)
     4     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)
     8     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.00)
    16     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.00)
    32     1.00 [  0.00]( 0.15)     1.00 [ -0.29]( 0.00)
    64     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
   128     1.00 [  0.00](12.53)     0.99 [ -0.59](13.81)
   256     1.00 [  0.00]( 0.15)     1.00 [ -0.28]( 0.51)
   512     1.00 [  0.00]( 0.84)     1.01 [  0.75]( 1.02)
   768     1.00 [  0.00]( 2.05)     1.01 [  1.18]( 1.25)
  1024     1.00 [  0.00]( 2.90)     0.98 [ -1.62]( 1.25)


  ==================================================================
  Test          : new-schbench-wakeup-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:  tip[pct imp](CV)     sched_cache[pct imp](CV)
     1     1.00 [ -0.00](12.99)     1.33 [-33.33](31.03)
     2     1.00 [ -0.00]( 4.08)     0.77 [ 23.08]( 5.34)
     4     1.00 [ -0.00]( 0.00)     0.82 [ 18.18]( 5.53)
     8     1.00 [ -0.00]( 0.00)     0.91 [  9.09]( 0.00)
    16     1.00 [ -0.00]( 4.56)     1.00 [ -0.00]( 4.84)
    32     1.00 [ -0.00]( 0.00)     0.91 [  9.09]( 0.00)
    64     1.00 [ -0.00]( 5.00)     1.00 [ -0.00]( 5.00)
   128     1.00 [ -0.00]( 7.45)     1.17 [-16.67](18.75)
   256     1.00 [ -0.00]( 2.70)     1.02 [ -2.49]( 5.07)
   512     1.00 [ -0.00]( 0.00)     1.00 [ -0.00]( 0.00)
   768     1.00 [ -0.00]( 1.66)     1.02 [ -2.44]( 1.30)
  1024     1.00 [ -0.00]( 3.32)     1.01 [ -1.19]( 1.92)


  ==================================================================
  Test          : new-schbench-request-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:  tip[pct imp](CV)     sched_cache[pct imp](CV)
     1     1.00 [ -0.00]( 0.14)     1.01 [ -0.80]( 0.41)
     2     1.00 [ -0.00]( 0.14)     1.02 [ -1.60]( 0.27)
     4     1.00 [ -0.00]( 0.00)     1.01 [ -1.07]( 0.68)
     8     1.00 [ -0.00]( 0.14)     1.01 [ -0.80]( 0.00)
    16     1.00 [ -0.00]( 1.49)     0.98 [  1.82]( 0.00)
    32     1.00 [ -0.00]( 0.89)     0.99 [  0.53]( 0.27)
    64     1.00 [ -0.00]( 1.43)     1.00 [ -0.26]( 1.22)
   128     1.00 [ -0.00]( 2.78)     1.01 [ -0.89]( 3.06)
   256     1.00 [ -0.00]( 0.13)     1.00 [ -0.00]( 0.13)
   512     1.00 [ -0.00]( 6.72)     1.07 [ -6.59]( 8.20)
   768     1.00 [ -0.00]( 3.42)     1.05 [ -4.61]( 2.67)
  1024     1.00 [ -0.00]( 4.37)     0.99 [  1.43]( 2.40)
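
For reading the tables above: the "[pct imp]" column appears to be the
percentage improvement relative to tip, with the sign flipped for
lower-is-better metrics so that a positive number is always a gain, and
"(CV)" is presumably the coefficient of variation in percent. A minimal
sketch of that arithmetic, with a hypothetical function name and assuming
this interpretation is right:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Percentage improvement of value over baseline. For lower-is-better
 * metrics a smaller value is a gain, so the difference is reversed.
 */
static double demo_pct_improvement(double baseline, double value,
				   bool lower_is_better)
{
	double delta;

	assert(baseline > 0.0);
	delta = lower_is_better ? baseline - value : value - baseline;
	return 100.0 * delta / baseline;
}
```

Recomputing from the two-decimal normalized values gives slightly
different numbers than the tables (for example 61.00 rather than 61.36
for schbench with 1 worker), presumably because the reported percentages
were derived from the unrounded data.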
---

Thanks a ton! And sorry for not having been the most responsive on the
latest iterations.

-- 
Thanks and Regards,
Prateek

Re: [Patch v4 00/16] Cache aware scheduling enhancements

Posted by Chen, Yu C 3 weeks, 2 days ago

Hi Prateek,

On 5/20/2026 11:05 AM, K Prateek Nayak wrote:
> Hello Tim, Chenyu,
> 
> On 5/14/2026 2:09 AM, Tim Chen wrote:
>> This patch set contains cache-aware scheduling enhancements
>> and bug fixes on top of Peter's sched/cache branch:
>> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/cache
> 
> I took the latest queue:sched/core for a spin before and after
> the sched/cache merge and everything is now looking fine.
> 
> I've temporarily lost access to my usual test machine so I could only grab
> the microbenchmark data but they are mostly positive to unaffected as
> expected. I'll update if I see anything funky with longer running
> benchmarks if and when I get a chance.
> 
[ ... ]
> 
> Thanks a ton! And sorry for not having been the most responsive on the
> latest iterations.
> 

Thanks very much for your help with this series!

I have kicked off the server-side tests based on sched/core.
I suppose cache-aware scheduling can still leverage sd_llc_share
after applying Peter’s patch and yours, allowing the sched domain
share data for SD_SHARE_LLC and SD_ASYM_CPUCAPACITY to coexist.

Additionally, since sched/core includes optimizations for selecting
SMT cores during asymmetric-capacity wakeups, I have also set up
big.LITTLE systems for testing. One testbed features P-cores with
SMT enabled while E-cores run without SMT support. I am simply curious
to see how the benchmark data turns out.

thanks,
Chenyu