[PATCH v2 00/23] Cache aware scheduling
Posted by Tim Chen 2 weeks, 1 day ago
This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data on the
same Last Level Cache (LLC) domain. By improving cache locality, the
scheduler can reduce cache bouncing and cache misses, ultimately
improving data access efficiency. The design builds on the initial
prototype from Peter [1].
 
In this initial implementation, threads within the same process are
treated as entities that are likely to share data. During load
balancing, the scheduler attempts to aggregate these threads onto the
same LLC domain whenever possible.
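
For readers new to the series, the basic mechanism can be modeled in a
few lines of user-space C. This is only a conceptual sketch with
made-up names (llc_occupancy, pick_preferred_llc), not the kernel code
in these patches: roughly, the kernel records how much of a process's
recent thread activity falls on each LLC, picks the busiest one as the
process's preferred LLC, and load balancing then favors pulling the
remaining threads toward that LLC.

/*
 * Conceptual user-space model only -- NOT the kernel implementation.
 * All names are made up for illustration.
 */
#include <stdio.h>

#define NR_LLCS 2

/* Recent occupancy of one process's threads on each LLC (arbitrary units). */
static unsigned long llc_occupancy[NR_LLCS] = { 120, 480 };

/* The preferred LLC is simply the one with the highest occupancy. */
static int pick_preferred_llc(void)
{
	int llc, best = 0;

	for (llc = 1; llc < NR_LLCS; llc++)
		if (llc_occupancy[llc] > llc_occupancy[best])
			best = llc;
	return best;
}

int main(void)
{
	/* With the sample numbers above, LLC 1 is preferred. */
	printf("preferred LLC: %d\n", pick_preferred_llc());
	return 0;
}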
 
We would like to thank everyone who provided feedback on the v1
series [1]. Most of the comments have been addressed in this revision.
Several broader suggestions surfaced during review, and we believe
they are best approached in follow-up work once the foundational
cache-aware scheduling infrastructure is merged:
 
1. **Generalizing task grouping beyond processes.**
   While v2 focuses on grouping threads within a single process, other
   classes of workloads naturally share data and could benefit from LLC
   co-location, such as:
   a) Tasks from different processes that operate on shared data.
   b) Tasks belonging to the same NUMA group.
   c) Tasks with strong waker/wakee relationships.
   d) User-defined groups via cgroups or other user interfaces.
 
2. **Configurable cache-aware scheduling policies.**
   The current iteration implements a global cache-aware scheduling
   policy. Future work may introduce per-process or per-task-group
   policies, exposed through prctl() or other mechanisms.
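
As a purely hypothetical illustration of what such a per-process knob
could look like if it were exposed through prctl(): the option name and
value below (PR_SET_CACHE_AWARE) are invented for this sketch and do
not exist in any kernel.

/*
 * Hypothetical sketch only -- PR_SET_CACHE_AWARE is invented here and
 * is not a real prctl option; the call is expected to fail with EINVAL.
 */
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_CACHE_AWARE
#define PR_SET_CACHE_AWARE	0x43414348	/* made-up option value */
#endif

int main(void)
{
	/* Opt the calling process in to cache-aware aggregation (hypothetically). */
	if (prctl(PR_SET_CACHE_AWARE, 1, 0, 0, 0))
		perror("prctl(PR_SET_CACHE_AWARE)");
	return 0;
}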
 
**v2 Changes:**
1. Align NUMA balancing and cache affinity by
   prioritizing NUMA balancing when their decisions differ.
2. Dynamically resize per-LLC statistics structures based on the LLC
   size.
3. Switch to a contiguous LLC-ID space so these IDs can be used
   directly as array indices for LLC statistics (see the sketch after
   this list).
4. Add clarification comments.
5. Add 3 debug patches (not meant for merging).
6. Other changes to address feedback from the review of the v1 patch
   set (see individual patch change logs).
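
To illustrate change 3 above: once LLC IDs form a dense 0..nr_llcs-1
space, per-LLC statistics can live in one flat allocation indexed
directly by the ID, with no sparse lookup. A minimal user-space sketch
(struct and field names are illustrative, not the kernel structures):

#include <stdio.h>
#include <stdlib.h>

struct llc_stats {
	unsigned long util;
	unsigned long nr_pref_tasks;
};

int main(void)
{
	int nr_llcs = 2;	/* e.g. the two LLCs of the SPR test box */
	struct llc_stats *stats;

	/* One contiguous allocation, sized by the number of LLCs. */
	stats = calloc(nr_llcs, sizeof(*stats));
	if (!stats)
		return 1;

	/* A contiguous LLC id (0..nr_llcs-1) indexes the array directly. */
	stats[1].nr_pref_tasks++;
	printf("llc 1: %lu task(s) preferring it\n", stats[1].nr_pref_tasks);

	free(stats);
	return 0;
}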
 
Test results:

The patch series was applied and tested on v6.18-rc7.
See: https://github.com/timcchen1298/linux/commits/cache_aware_v2

The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.

The second test platform is an AMD Genoa. There are 4 nodes and 32 CPUs
per node. Each node has 2 CCXs and each CCX has 16 CPUs.

hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on
these two platforms.

[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when the number of
different active threads is below the capacity of an LLC.
schbench shows overall wakeup latency improvement.
ChaCha20-xiangshan shows good throughput improvement.

Genoa:
ChaCha20-xiangshan shows huge throughput improvement.
No obvious difference is observed in
hackbench/schbench/netperf/stream/stress-ng.
Phoronix tested v1 and reported good improvements in 33 cases [2].

Details:
Due to length constraints, only part of the data is presented.

Sapphire Rapids:

hackbench thread pipes
                           baseline            sched_cache
       groups
Amean     1      38.8224 (   0.00%)     26.4582 *  31.85%*
Amean     3      38.2358 (   0.00%)     38.0758 (   0.42%)
Amean     5      40.7282 (   0.00%)     41.1568 (  -1.05%)
Amean     7      51.1720 (   0.00%)     50.6646 (   0.99%)
Amean     12     63.1562 (   0.00%)     63.3516 (  -0.31%)
Amean     16     73.9584 (   0.00%)     75.5596 (  -2.17%)
Max       1      39.4140 (   0.00%)     26.7590 (  32.11%)
Max       3      40.8310 (   0.00%)     39.8000 (   2.53%)
Max       5      42.2150 (   0.00%)     42.4860 (  -0.64%)
Max       7      52.1800 (   0.00%)     51.9370 (   0.47%)
Max       12     63.9430 (   0.00%)     64.2820 (  -0.53%)
Max       16     74.3710 (   0.00%)     76.4170 (  -2.75%)

Further hackbench tests with other numbers of fds:

case         fd          groups         baseline(std%)  compare%( std%)
threads-pipe-2          1-groups         1.00 (  1.25)  +38.52 (  1.33)
threads-pipe-2          2-groups         1.00 ( 12.52)  +12.74 (  1.31)
threads-pipe-2          4-groups         1.00 (  7.91)  +12.29 (  1.86)
threads-pipe-4          1-groups         1.00 (  0.55)  +34.99 (  0.45)
threads-pipe-4          2-groups         1.00 ( 16.00)  +27.32 (  0.75)
threads-pipe-4          4-groups         1.00 ( 17.37)  +25.75 (  0.20)
threads-pipe-8          1-groups         1.00 (  0.74)  +27.13 (  0.44)
threads-pipe-8          2-groups         1.00 (  8.82)  +23.79 (  0.32)
threads-pipe-8          4-groups         1.00 (  1.30)  +27.64 (  0.51)
threads-pipe-16         1-groups         1.00 (  1.03)  +30.55 (  0.27)
threads-pipe-16         2-groups         1.00 (  6.43)  +29.52 (  0.20)
threads-pipe-16         4-groups         1.00 (  1.36)   -1.85 (  1.43)
threads-pipe-20         1-groups         1.00 (  0.45)  +30.88 (  0.42)
threads-pipe-20         2-groups         1.00 (  1.95)   -0.81 (  5.84)
threads-pipe-20         4-groups         1.00 (  2.09)   -1.77 (  7.57)

stream:
                              baseline            sched_cache
GB/sec copy-2        36.48 (   0.00%)       36.55 (   0.18%)
GB/sec scale-2       36.83 (   0.00%)       36.97 (   0.38%)
GB/sec add-2         37.92 (   0.00%)       38.03 (   0.31%)
GB/sec triad-2       37.83 (   0.00%)       37.97 (   0.37%)

stress-ng context switch:
                                    baseline            sched_cache
Min       context-1       2957.81 (   0.00%)     2966.17 (   0.28%)
Min       context-2       5931.68 (   0.00%)     5930.17 (  -0.03%)
Min       context-4      11874.20 (   0.00%)    11875.68 (   0.01%)
Min       context-8      23755.30 (   0.00%)    23762.43 (   0.03%)
Min       context-16     47535.14 (   0.00%)    47526.46 (  -0.02%)
Min       context-32     95078.66 (   0.00%)    94356.39 (  -0.76%)
Min       context-64    190074.62 (   0.00%)   190042.93 (  -0.02%)
Min       context-128   371107.12 (   0.00%)   371008.10 (  -0.03%)
Min       context-256   578443.73 (   0.00%)   579037.86 (   0.10%)
Min       context-480   580203.34 (   0.00%)   580499.43 (   0.05%)
Hmean     context-1       2964.59 (   0.00%)     2967.69 (   0.10%)
Hmean     context-2       5936.41 (   0.00%)     5935.51 (  -0.02%)
Hmean     context-4      11879.56 (   0.00%)    11881.70 (   0.02%)
Hmean     context-8      23771.92 (   0.00%)    23770.28 (  -0.01%)
Hmean     context-16     47552.23 (   0.00%)    47538.01 (  -0.03%)
Hmean     context-32     95102.67 (   0.00%)    94969.43 (  -0.14%)
Hmean     context-64    190129.74 (   0.00%)   190088.68 (  -0.02%)
Hmean     context-128   371291.95 (   0.00%)   371114.82 (  -0.05%)
Hmean     context-256   578907.96 (   0.00%)   579338.99 (   0.07%)
Hmean     context-480   580541.78 (   0.00%)   580726.13 (   0.03%)
Max       context-1       2967.93 (   0.00%)     2968.90 (   0.03%)
Max       context-2       5942.37 (   0.00%)     5940.40 (  -0.03%)
Max       context-4      11885.25 (   0.00%)    11886.43 (   0.01%)
Max       context-8      23784.17 (   0.00%)    23783.31 (  -0.00%)
Max       context-16     47576.84 (   0.00%)    47561.42 (  -0.03%)
Max       context-32     95139.03 (   0.00%)    95094.86 (  -0.05%)
Max       context-64    190180.08 (   0.00%)   190123.31 (  -0.03%)
Max       context-128   371451.73 (   0.00%)   371240.25 (  -0.06%)
Max       context-256   579355.24 (   0.00%)   579731.37 (   0.06%)
Max       context-480   580750.44 (   0.00%)   581118.33 (   0.06%)
BHmean-50 context-1       2966.80 (   0.00%)     2968.82 (   0.07%)
BHmean-50 context-2       5939.32 (   0.00%)     5939.49 (   0.00%)
BHmean-50 context-4      11883.02 (   0.00%)    11886.08 (   0.03%)
BHmean-50 context-8      23778.40 (   0.00%)    23775.90 (  -0.01%)
BHmean-50 context-16     47568.31 (   0.00%)    47546.19 (  -0.05%)
BHmean-50 context-32     95125.84 (   0.00%)    95087.06 (  -0.04%)
BHmean-50 context-64    190165.37 (   0.00%)   190117.94 (  -0.02%)
BHmean-50 context-128   371405.28 (   0.00%)   371168.75 (  -0.06%)
BHmean-50 context-256   579137.11 (   0.00%)   579609.35 (   0.08%)
BHmean-50 context-480   580646.72 (   0.00%)   580920.46 (   0.05%)
BHmean-95 context-1       2965.72 (   0.00%)     2967.94 (   0.07%)
BHmean-95 context-2       5937.20 (   0.00%)     5936.40 (  -0.01%)
BHmean-95 context-4      11880.45 (   0.00%)    11882.71 (   0.02%)
BHmean-95 context-8      23774.69 (   0.00%)    23771.59 (  -0.01%)
BHmean-95 context-16     47555.08 (   0.00%)    47539.93 (  -0.03%)
BHmean-95 context-32     95106.67 (   0.00%)    95072.38 (  -0.04%)
BHmean-95 context-64    190138.93 (   0.00%)   190096.30 (  -0.02%)
BHmean-95 context-128   371322.78 (   0.00%)   371132.61 (  -0.05%)
BHmean-95 context-256   578985.41 (   0.00%)   579389.21 (   0.07%)
BHmean-95 context-480   580598.22 (   0.00%)   580763.93 (   0.03%)
BHmean-99 context-1       2965.72 (   0.00%)     2967.94 (   0.07%)
BHmean-99 context-2       5937.20 (   0.00%)     5936.40 (  -0.01%)
BHmean-99 context-4      11880.45 (   0.00%)    11882.71 (   0.02%)
BHmean-99 context-8      23774.69 (   0.00%)    23771.59 (  -0.01%)
BHmean-99 context-16     47555.08 (   0.00%)    47539.93 (  -0.03%)
BHmean-99 context-32     95106.67 (   0.00%)    95072.38 (  -0.04%)
BHmean-99 context-64    190138.93 (   0.00%)   190096.30 (  -0.02%)
BHmean-99 context-128   371322.78 (   0.00%)   371132.61 (  -0.05%)
BHmean-99 context-256   578985.41 (   0.00%)   579389.21 (   0.07%)
BHmean-99 context-480   580598.22 (   0.00%)   580763.93 (   0.03%)

schbench thread = 1
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        10.71(0.76)          9.86(1.46)           +7.94%    
Request Latencies 99.0th       4036.00(6.53)        4054.29(10.03)       -0.45%    
RPS 50.0th                     267.29(0.49)         266.86(0.38)         -0.16%    
Average RPS                    268.42(0.16)         267.86(0.31)         -0.21%    

schbench thread = 2
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        11.43(1.13)          8.00(2.00)           +30.01%   
Request Latencies 99.0th       4007.43(34.52)       3967.43(70.03)       +1.00%    
RPS 50.0th                     536.71(0.76)         536.14(1.57)         -0.11%    
Average RPS                    536.59(0.55)         535.33(1.34)         -0.23%    

schbench thread = 4
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        9.57(0.79)           6.14(1.46)           +35.84%   
Request Latencies 99.0th       3789.14(31.47)       3810.86(48.97)       -0.57%    
RPS 50.0th                     1074.00(0.00)        1073.43(2.76)        -0.05%    
Average RPS                    1075.03(1.07)        1072.93(2.13)        -0.20%    

schbench thread = 8
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        9.29(0.49)           6.57(1.81)           +29.28%   
Request Latencies 99.0th       3756.00(19.60)       3769.71(23.87)       -0.37%    
RPS 50.0th                     2152.57(4.28)        2152.57(4.28)        0.00%     
Average RPS                    2151.07(2.71)        2150.58(3.41)        -0.02%    

schbench thread = 16
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        9.43(0.53)           6.86(0.90)           +27.25%   
Request Latencies 99.0th       3780.00(32.98)       3774.29(11.04)       +0.15%    
RPS 50.0th                     4305.14(8.55)        4307.43(7.81)        +0.05%    
Average RPS                    4303.47(5.74)        4301.71(4.35)        -0.04%    

schbench thread = 32
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        10.14(0.38)          6.86(0.69)           +32.35%   
Request Latencies 99.0th       3764.00(21.66)       3806.29(32.24)       -1.12%    
RPS 50.0th                     8624.00(0.00)        8619.43(12.09)       -0.05%    
Average RPS                    8607.36(5.29)        8602.69(7.08)        -0.05%    

schbench thread = 64
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        11.71(0.49)          8.43(1.81)           +28.01%   
Request Latencies 99.0th       3796.00(62.48)       3860.25(147.35)      -1.69%  
RPS 50.0th                     17238.86(24.19)      16411.43(88.95)      -4.80%    
Average RPS                    17209.02(10.18)      16389.73(100.27)     -4.76%    

schbench thread = 128
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        13.29(0.49)          12.00(0.00)          +9.71%    
Request Latencies 99.0th       7893.71(11.04)       7909.71(17.10)       -0.20%    
RPS 50.0th                     32013.71(194.52)     32068.57(50.35)      +0.17%    
Average RPS                    31762.03(238.18)     31884.81(300.85)     +0.39%    

schbench thread = 239
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        13.29(0.49)          14.43(0.53)          -8.58%    
Request Latencies 99.0th       8174.86(8.55)        8244.57(12.09)       -0.85%    
RPS 50.0th                     30624.00(0.00)       30614.86(24.19)      -0.03%    
Average RPS                    30695.86(11.03)      30673.35(17.31)      -0.07%    

chacha20:
baseline:
Host time spent: 66,320ms
sched_cache:
Host time spent: 53,859ms
Time reduced by 18%, throughput increased by 23%

Genoa:
chacha20
baseline:
Host time spent: 51,848ms
sched_cache:
Host time spent: 28,439ms

Time reduced by 45%, throughput increased by 82%
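
As a quick arithmetic check (assuming throughput is inversely
proportional to host time), the throughput gains above can be
reproduced directly from the measured times:

#include <stdio.h>

int main(void)
{
	double spr_base = 66320, spr_cache = 53859;	/* Sapphire Rapids, ms */
	double gen_base = 51848, gen_cache = 28439;	/* Genoa, ms */

	printf("SPR:   time -%.1f%%, throughput +%.1f%%\n",
	       100.0 * (1.0 - spr_cache / spr_base),
	       100.0 * (spr_base / spr_cache - 1.0));
	printf("Genoa: time -%.1f%%, throughput +%.1f%%\n",
	       100.0 * (1.0 - gen_cache / gen_base),
	       100.0 * (gen_base / gen_cache - 1.0));
	return 0;
}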

[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
[2] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin

Chen Yu (10):
  sched/cache: Record per-LLC utilization to guide cache-aware
    scheduling decisions
  sched/cache: Introduce helper functions to enforce LLC migration
    policy
  sched/cache: Introduce sched_cache_present to enable cache aware
    scheduling for multi LLCs NUMA node
  sched/cache: Record the number of active threads per process for
    cache-aware scheduling
  sched/cache: Disable cache aware scheduling for processes with high
    thread counts
  sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  sched/cache: Add user control to adjust the parameters of cache-aware
    scheduling
  -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware
    load balancing
  -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load
    balance statistics
  -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy
    for each process via proc fs

Peter Zijlstra (Intel) (1):
  sched/cache: Introduce infrastructure for cache-aware load balancing

Tim Chen (12):
  sched/cache: Make LLC id continuous
  sched/cache: Assign preferred LLC ID to processes
  sched/cache: Track LLC-preferred tasks per runqueue
  sched/cache: Introduce per runqueue task LLC preference counter
  sched/cache: Calculate the per runqueue task LLC preference
  sched/cache: Count tasks preferring destination LLC in a sched group
  sched/cache: Check local_group only once in update_sg_lb_stats()
  sched/cache: Prioritize tasks preferring destination LLC during
    balancing
  sched/cache: Add migrate_llc_task migration type for cache-aware
    balancing
  sched/cache: Handle moving single tasks to/from their preferred LLC
  sched/cache: Consider LLC preference when selecting tasks for load
    balancing
  sched/cache: Respect LLC preference in task migration and detach

 fs/proc/base.c                 |   22 +
 include/linux/cacheinfo.h      |   21 +-
 include/linux/mm_types.h       |   60 ++
 include/linux/sched.h          |   19 +
 include/linux/sched/topology.h |    5 +
 include/trace/events/sched.h   |   31 +
 init/Kconfig                   |   11 +
 init/init_task.c               |    4 +
 kernel/fork.c                  |    6 +
 kernel/sched/core.c            |   12 +
 kernel/sched/debug.c           |   62 ++
 kernel/sched/fair.c            | 1034 +++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |   39 ++
 kernel/sched/stats.c           |    5 +-
 kernel/sched/topology.c        |  239 +++++++-
 15 files changed, 1543 insertions(+), 27 deletions(-)

-- 
2.32.0

Re: [PATCH v2 00/23] Cache aware scheduling
Posted by Aaron Lu 14 hours ago
On Wed, Dec 03, 2025 at 03:07:19PM -0800, Tim Chen wrote:
... ...
> Test results:
> 
> The patch series was applied and tested on v6.18-rc7.
> See: https://github.com/timcchen1298/linux/commits/cache_aware_v2
> 
> The first test platform is a 2 socket Intel Sapphire Rapids with 30
> cores per socket. The DRAM interleaving is enabled in the BIOS so it
> essential has one NUMA node with two last level caches. There are 60
> CPUs associated with each last level cache.
> 
> The second test platform is a AMD Genoa. There are 4 Nodes and 32 CPUs
> per node. Each node has 2 CCXs and each CCX has 16 CPUs.
> 
> hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on
> these two platforms.
> 
> [TL;DR]
> Sappire Rapids:
> hackbench shows significant improvement when the number of
> different active threads is below the capacity of a LLC.
> schbench shows overall wakeup latency improvement.
> ChaCha20-xiangshan shows good throughput improvement.
> 
> Genoa:
> ChaCha20-xiangshan shows huge throughput improvement.
> No obvious difference is observed in hackbench/schbench

I think for hackbench runs with a small number of tasks, there should
be some improvement.

I tried thread/pipe/2fds/1group, i.e. 4 tasks on Genoa:
./hackbench -T -f 2 -g 1 -p -l 2000000
And I noticed performance improved a lot:
(Result in seconds, less is better)

       llc_off       llc_on          diff
time   4.755±1.6%    2.684±6.25%    +43.6%

llc_off means /sys/kernel/debug/sched/llc_enabled set to 0 while
llc_on means /sys/kernel/debug/sched/llc_enabled set to 1; other
tunables are left unchanged.
Turbo is disabled and cpufreq is set to performance.

I also tried redis and noticed that when I set io-threads to 4 in
redis.conf, there is also some improvement on AMD Genoa:

                 llc_off        manual      diff     llc_on      diff
throughput      1536727±0%     1737619±0%  +13.1%   1737720±0%  +13.1%

Client cmdline:
numactl -N 1 redis-benchmark --threads 4 -t set -r 100000 -P 16 -n 10000000
Server cmdline: numactl -N 0 redis-server ./redis.conf
I also tried manually binding all tasks of the redis server to a single
LLC to see if this workload benefits from aggregation; that is what
'manual' means: taskset -c 8-15,200-207 redis-server ./redis.conf

According to these results, I think this 'cache aware scheduling'
works as expected, in that its performance matches manual binding, and
both beat llc_off.
Re: [PATCH v2 00/23] Cache aware scheduling
Posted by Chen, Yu C 4 hours ago
On 12/19/2025 11:19 AM, Aaron Lu wrote:
> On Wed, Dec 03, 2025 at 03:07:19PM -0800, Tim Chen wrote:
> ... ...
>> Test results:
>>
>> The patch series was applied and tested on v6.18-rc7.
>> See: https://github.com/timcchen1298/linux/commits/cache_aware_v2
>>
>> The first test platform is a 2 socket Intel Sapphire Rapids with 30
>> cores per socket. The DRAM interleaving is enabled in the BIOS so it
>> essential has one NUMA node with two last level caches. There are 60
>> CPUs associated with each last level cache.
>>
>> The second test platform is a AMD Genoa. There are 4 Nodes and 32 CPUs
>> per node. Each node has 2 CCXs and each CCX has 16 CPUs.
>>
>> hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on
>> these two platforms.
>>
>> [TL;DR]
>> Sappire Rapids:
>> hackbench shows significant improvement when the number of
>> different active threads is below the capacity of a LLC.
>> schbench shows overall wakeup latency improvement.
>> ChaCha20-xiangshan shows good throughput improvement.
>>
>> Genoa:
>> ChaCha20-xiangshan shows huge throughput improvement.
>> No obvious difference is observed in hackbench/schbench
> 
> I think for small task number hackbench run, there should be some
> improvement.
> 
> I tried thread/pipe/2fds/1group, i.e. 4 tasks on Genoa:
> ./hackbench -T -f 2 -g 1 -p -l 2000000
> And I noticed performance improved a lot:
> (Result in seconds, less is better)
> 
>         llc_off       llc_on          diff
> time   4.755±1.6%    2.684±6.25%    +43.6%
> 
> llc_off means /sys/kernel/debug/sched/llc_enabled set to 0 while
> llc_on means /sys/kernel/debug/sched/llc_enabled set to 1, other
> tunnables are left unchanged.
> Turbo is disabled and cpufreq set to performance.
> 
> I also tried redis and noticed when I set io-threads to 4 in redis.conf,
> there is also some improvement on AMD Genoa:
> 
>                   llc_off        manual      diff     llc_on      diff
> throughput      1536727±0%     1737619±0%  +13.1%   1737720±0%  +13.1%
> 
> Client cmdline:
> numactl -N 1 redis-benchmark --threads 4 -t set -r 100000 -P 16 -n 10000000
> server cmdline: numactl -N 0 redis-server ./redis.conf
> I also tried to manually bind all tasks of redis server to a single LLC
> to see if this workload benefits from aggregation and that's what manual
> means: taskset -c 8-15,200-207 redis-server ./redis.conf
> 
> According to the result, I think this 'cache aware scheduling' works
> as expected in that its performance is the same as manual binding; and
> they all beat llc_off.

Thanks Aaron for the test!

thanks,
Chenyu