Cache aware scheduling

[RFC patch v3 00/20] Cache aware scheduling

Posted by Tim Chen 3 months, 3 weeks ago

This is the third revision of the cache aware scheduling patches,
based on the original patch proposed by Peter[1].
 
The goal of the patch series is to aggregate tasks sharing data
to the same cache domain, thereby reducing cache bouncing and
cache misses, and improve data access efficiency. In the current
implementation, threads within the same process are considered
as entities that potentially share resources.
 
In previous versions, aggregation of tasks were done in the
wake up path, without making load balancing paths aware of
LLC (Last-Level-Cache) preference. This led to the following
problems:

1) Aggregation of tasks during wake up led to load imbalance
   between LLCs
2) Load balancing tried to even out the load between LLCs
3) Wake up tasks aggregation happened at a faster rate and
   load balancing moved tasks in opposite directions, leading
   to continuous and excessive task migrations and regressions
   in benchmarks like schbench.

In this version, load balancing is made cache-aware. The main
idea of cache-aware load balancing consists of two parts:

1) Identify tasks that prefer to run on their hottest LLC and
   move them there.
2) Prevent generic load balancing from moving a task out of
   its hottest LLC.

By default, LLC task aggregation during wake-up is disabled.
Conversely, cache-aware load balancing is enabled by default.
For easier comparison, two scheduler features are introduced:
SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
wake up and cache-aware load balancing, respectively. By default,
NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
is only done on load balancing.

With above default settings, task migrations occur less frequently
and no longer happen in the latency-sensitive wake-up path.

The load balancing and migration policy are now implemented in
a single location within the function _get_migrate_hint().
Debugfs knobs are also introduced to fine-tune the
_get_migrate_hint() function. Please refer to patch 7 for
detail.

Improvements in performance for hackbench are observed in the
lower load ranges when tested on a 2 socket sapphire rapids with
30 cores per socket. The DRAM interleaving is enabled in the
BIOS so it essential has one NUMA node with two last level
caches. Hackbench benefits from having all the threads
in the process running in the same LLC. There are some small
regressions for the heavily loaded case when not all threads can
fit in a LLC.

Hackbench is run with one process, and pairs of threads ping
ponging message off each other via command with increasing number
of thread pairs, each test runs for 10 cycles:

hackbench -g 1 --thread --pipe(socket) -l 1000000 -s 100 -f <pairs>

case                    load            baseline(std%)  compare%( std%)
threads-pipe-8          1-groups         1.00 (  2.70)  +24.51 (  0.59)
threads-pipe-15         1-groups         1.00 (  1.42)  +28.37 (  0.68)
threads-pipe-30         1-groups         1.00 (  2.53)  +26.16 (  0.11)
threads-pipe-45         1-groups         1.00 (  0.48)  +35.38 (  0.18)
threads-pipe-60         1-groups         1.00 (  2.13)  +13.46 ( 12.81)
threads-pipe-75         1-groups         1.00 (  1.57)  +16.71 (  0.20)
threads-pipe-90         1-groups         1.00 (  0.22)   -0.57 (  1.21)
threads-sockets-8       1-groups         1.00 (  2.82)  +23.04 (  0.83)
threads-sockets-15      1-groups         1.00 (  2.57)  +21.67 (  1.90)
threads-sockets-30      1-groups         1.00 (  0.75)  +18.78 (  0.09)
threads-sockets-45      1-groups         1.00 (  1.63)  +18.89 (  0.43)
threads-sockets-60      1-groups         1.00 (  0.66)  +10.10 (  1.91)
threads-sockets-75      1-groups         1.00 (  0.44)  -14.49 (  0.43)
threads-sockets-90      1-groups         1.00 (  0.15)   -8.03 (  3.88)

Similar tests were also experimented on schbench on the system.
Overall latency improvement is observed when underloaded and
regression when overloaded. The regression is significantly
smaller than the previous version because cache aware aggregation
is in load balancing rather than in wake up path. Besides, it is
found that the schbench seems to have large run-to-run variance,
so the result of schbench might be only used as reference.

schbench:
                                   baseline              nowake_lb
Lat 50.0th-qrtle-1          5.00 (   0.00%)        5.00 (   0.00%)
Lat 90.0th-qrtle-1          9.00 (   0.00%)        8.00 (  11.11%)
Lat 99.0th-qrtle-1         15.00 (   0.00%)       15.00 (   0.00%)
Lat 99.9th-qrtle-1         32.00 (   0.00%)       23.00 (  28.12%)
Lat 20.0th-qrtle-1        267.00 (   0.00%)      266.00 (   0.37%)
Lat 50.0th-qrtle-2          8.00 (   0.00%)        4.00 (  50.00%)
Lat 90.0th-qrtle-2          9.00 (   0.00%)        7.00 (  22.22%)
Lat 99.0th-qrtle-2         18.00 (   0.00%)       11.00 (  38.89%)
Lat 99.9th-qrtle-2         26.00 (   0.00%)       25.00 (   3.85%)
Lat 20.0th-qrtle-2        535.00 (   0.00%)      537.00 (  -0.37%)
Lat 50.0th-qrtle-4          6.00 (   0.00%)        4.00 (  33.33%)
Lat 90.0th-qrtle-4          8.00 (   0.00%)        5.00 (  37.50%)
Lat 99.0th-qrtle-4         13.00 (   0.00%)       10.00 (  23.08%)
Lat 99.9th-qrtle-4         20.00 (   0.00%)       14.00 (  30.00%)
Lat 20.0th-qrtle-4       1066.00 (   0.00%)     1050.00 (   1.50%)
Lat 50.0th-qrtle-8          5.00 (   0.00%)        4.00 (  20.00%)
Lat 90.0th-qrtle-8          7.00 (   0.00%)        5.00 (  28.57%)
Lat 99.0th-qrtle-8         11.00 (   0.00%)        8.00 (  27.27%)
Lat 99.9th-qrtle-8         17.00 (   0.00%)       18.00 (  -5.88%)
Lat 20.0th-qrtle-8       2140.00 (   0.00%)     2156.00 (  -0.75%)
Lat 50.0th-qrtle-16         6.00 (   0.00%)        4.00 (  33.33%)
Lat 90.0th-qrtle-16         7.00 (   0.00%)        6.00 (  14.29%)
Lat 99.0th-qrtle-16        11.00 (   0.00%)       11.00 (   0.00%)
Lat 99.9th-qrtle-16        18.00 (   0.00%)       18.00 (   0.00%)
Lat 20.0th-qrtle-16      4296.00 (   0.00%)     4216.00 (   1.86%)
Lat 50.0th-qrtle-32         6.00 (   0.00%)        4.00 (  33.33%)
Lat 90.0th-qrtle-32         7.00 (   0.00%)        5.00 (  28.57%)
Lat 99.0th-qrtle-32        11.00 (   0.00%)        9.00 (  18.18%)
Lat 99.9th-qrtle-32        17.00 (   0.00%)       14.00 (  17.65%)
Lat 20.0th-qrtle-32      8496.00 (   0.00%)     8624.00 (  -1.51%)
Lat 50.0th-qrtle-64         5.00 (   0.00%)        5.00 (   0.00%)
Lat 90.0th-qrtle-64         7.00 (   0.00%)        7.00 (   0.00%)
Lat 99.0th-qrtle-64        11.00 (   0.00%)       11.00 (   0.00%)
Lat 99.9th-qrtle-64        17.00 (   0.00%)       18.00 (  -5.88%)
Lat 20.0th-qrtle-64     17120.00 (   0.00%)    15728.00 (   8.13%)
Lat 50.0th-qrtle-128        6.00 (   0.00%)        6.00 (   0.00%)
Lat 90.0th-qrtle-128        9.00 (   0.00%)        8.00 (  11.11%)
Lat 99.0th-qrtle-128       13.00 (   0.00%)       14.00 (  -7.69%)
Lat 99.9th-qrtle-128       20.00 (   0.00%)       26.00 ( -30.00%)
Lat 20.0th-qrtle-128    19488.00 (   0.00%)    18784.00 (   3.61%)
Lat 50.0th-qrtle-239        8.00 (   0.00%)        8.00 (   0.00%)
Lat 90.0th-qrtle-239       16.00 (   0.00%)       14.00 (  12.50%)
Lat 99.0th-qrtle-239       45.00 (   0.00%)       41.00 (   8.89%)
Lat 99.9th-qrtle-239      137.00 (   0.00%)      225.00 ( -64.23%)
Lat 20.0th-qrtle-239    30432.00 (   0.00%)    29920.00 (   1.68%)

AMD Milan is also tested. There are 4 Nodes and 32 CPUs per node.
Each node has 4 CCX(shared LLC) and each CCX has 8 CPUs. Hackbench
with 1 group test scenario benefits from cache aware load balance
too:

hackbench(1 group and fd ranges in [1,6]:
case                    load            baseline(std%)  compare%( std%)
threads-pipe-1          1-groups         1.00 (  1.22)   +2.84 (  0.51)
threads-pipe-2          1-groups         1.00 (  5.82)  +42.82 ( 43.61)
threads-pipe-3          1-groups         1.00 (  3.49)  +17.33 ( 18.68)
threads-pipe-4          1-groups         1.00 (  2.49)  +12.49 (  5.89)
threads-pipe-5          1-groups         1.00 (  1.46)   +8.62 (  4.43)
threads-pipe-6          1-groups         1.00 (  2.83)  +12.73 (  8.94)
threads-sockets-1       1-groups         1.00 (  1.31)  +28.68 (  2.25)
threads-sockets-2       1-groups         1.00 (  5.17)  +34.84 ( 36.90)
threads-sockets-3       1-groups         1.00 (  1.57)   +9.15 (  5.52)
threads-sockets-4       1-groups         1.00 (  1.99)  +16.51 (  6.04)
threads-sockets-5       1-groups         1.00 (  2.39)  +10.88 (  2.17)
threads-sockets-6       1-groups         1.00 (  1.62)   +7.22 (  2.00)

Besides a single instance of hackbench, four instances of hackbench are
also tested on Milan. The test results show that different instances of
hackbench are aggregated to dedicated LLCs, and performance improvement
is observed.

schbench mmtests(unstable)
                                  baseline              nowake_lb
Lat 50.0th-qrtle-1         9.00 (   0.00%)        8.00 (  11.11%)
Lat 90.0th-qrtle-1        12.00 (   0.00%)       10.00 (  16.67%)
Lat 99.0th-qrtle-1        16.00 (   0.00%)       14.00 (  12.50%)
Lat 99.9th-qrtle-1        22.00 (   0.00%)       21.00 (   4.55%)
Lat 20.0th-qrtle-1       759.00 (   0.00%)      759.00 (   0.00%)
Lat 50.0th-qrtle-2         9.00 (   0.00%)        7.00 (  22.22%)
Lat 90.0th-qrtle-2        12.00 (   0.00%)       12.00 (   0.00%)
Lat 99.0th-qrtle-2        16.00 (   0.00%)       15.00 (   6.25%)
Lat 99.9th-qrtle-2        22.00 (   0.00%)       21.00 (   4.55%)
Lat 20.0th-qrtle-2      1534.00 (   0.00%)     1510.00 (   1.56%)
Lat 50.0th-qrtle-4         8.00 (   0.00%)        9.00 ( -12.50%)
Lat 90.0th-qrtle-4        12.00 (   0.00%)       12.00 (   0.00%)
Lat 99.0th-qrtle-4        15.00 (   0.00%)       16.00 (  -6.67%)
Lat 99.9th-qrtle-4        21.00 (   0.00%)       23.00 (  -9.52%)
Lat 20.0th-qrtle-4      3076.00 (   0.00%)     2860.00 (   7.02%)
Lat 50.0th-qrtle-8        10.00 (   0.00%)        9.00 (  10.00%)
Lat 90.0th-qrtle-8        12.00 (   0.00%)       13.00 (  -8.33%)
Lat 99.0th-qrtle-8        17.00 (   0.00%)       17.00 (   0.00%)
Lat 99.9th-qrtle-8        22.00 (   0.00%)       24.00 (  -9.09%)
Lat 20.0th-qrtle-8      6232.00 (   0.00%)     5896.00 (   5.39%)
Lat 50.0th-qrtle-16        9.00 (   0.00%)        9.00 (   0.00%)
Lat 90.0th-qrtle-16       13.00 (   0.00%)       13.00 (   0.00%)
Lat 99.0th-qrtle-16       17.00 (   0.00%)       18.00 (  -5.88%)
Lat 99.9th-qrtle-16       23.00 (   0.00%)       26.00 ( -13.04%)
Lat 20.0th-qrtle-16    10096.00 (   0.00%)    10352.00 (  -2.54%)
Lat 50.0th-qrtle-32       15.00 (   0.00%)       15.00 (   0.00%)
Lat 90.0th-qrtle-32       25.00 (   0.00%)       26.00 (  -4.00%)
Lat 99.0th-qrtle-32       49.00 (   0.00%)       50.00 (  -2.04%)
Lat 99.9th-qrtle-32      945.00 (   0.00%)     1005.00 (  -6.35%)
Lat 20.0th-qrtle-32    11600.00 (   0.00%)    11632.00 (  -0.28%)

Netperf/Tbench have not been tested yet. As they are single-process
benchmarks that are not the target of this cache-aware scheduling.
Additionally, client and server components should be tested on
different machines or bound to different nodes. Otherwise,
cache-aware scheduling might harm their performance: placing client
and server in the same LLC could yield higher throughput due to
improved cache locality in the TCP/IP stack, whereas cache-aware
scheduling aims to place them in dedicated LLCs.

This patch set is applied on v6.15 kernel.
 
There are some further work needed for future versions in this
patch set.  We will need to align NUMA balancing with LLC aggregations
such that LLC aggregation will align with the preferred NUMA node.

Comments and tests are much appreciated.

[1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/

The patches are grouped as follow:
Patch 1:     Peter's original patch.
Patch 2-5:   Various fixes and tuning of the original v1 patch.
Patch 6-12:  Infrastructure and helper functions for load balancing to be cache aware.
Patch 13-18: Add logic to load balancing for preferred LLC aggregation.
Patch 19:    Add process LLC aggregation in load balancing sched feature.
Patch 20:    Add Process LLC aggregation in wake up sched feature (turn off by default).

v1:
https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
v2:
https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/


Chen Yu (3):
  sched: Several fixes for cache aware scheduling
  sched: Avoid task migration within its preferred LLC
  sched: Save the per LLC utilization for better cache aware scheduling

K Prateek Nayak (1):
  sched: Avoid calculating the cpumask if the system is overloaded

Peter Zijlstra (1):
  sched: Cache aware load-balancing

Tim Chen (15):
  sched: Add hysteresis to switch a task's preferred LLC
  sched: Add helper function to decide whether to allow cache aware
    scheduling
  sched: Set up LLC indexing
  sched: Introduce task preferred LLC field
  sched: Calculate the number of tasks that have LLC preference on a
    runqueue
  sched: Introduce per runqueue task LLC preference counter
  sched: Calculate the total number of preferred LLC tasks during load
    balance
  sched: Tag the sched group as llc_balance if it has tasks prefer other
    LLC
  sched: Introduce update_llc_busiest() to deal with groups having
    preferred LLC tasks
  sched: Introduce a new migration_type to track the preferred LLC load
    balance
  sched: Consider LLC locality for active balance
  sched: Consider LLC preference when picking tasks from busiest queue
  sched: Do not migrate task if it is moving out of its preferred LLC
  sched: Introduce SCHED_CACHE_LB to control cache aware load balance
  sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
    up

 include/linux/mm_types.h       |  44 ++
 include/linux/sched.h          |   8 +
 include/linux/sched/topology.h |   3 +
 init/Kconfig                   |   4 +
 init/init_task.c               |   3 +
 kernel/fork.c                  |   5 +
 kernel/sched/core.c            |  25 +-
 kernel/sched/debug.c           |   4 +
 kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
 kernel/sched/features.h        |   3 +
 kernel/sched/sched.h           |  23 +
 kernel/sched/topology.c        |  29 ++
 12 files changed, 982 insertions(+), 28 deletions(-)

-- 
2.32.0

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by K Prateek Nayak 3 months, 2 weeks ago

Hello Tim,

On 6/18/2025 11:57 PM, Tim Chen wrote:
> AMD Milan is also tested. There are 4 Nodes and 32 CPUs per node.
> Each node has 4 CCX(shared LLC) and each CCX has 8 CPUs. Hackbench
> with 1 group test scenario benefits from cache aware load balance
> too:
> 
> hackbench(1 group and fd ranges in [1,6]:
> case                    load            baseline(std%)  compare%( std%)
> threads-pipe-1          1-groups         1.00 (  1.22)   +2.84 (  0.51)
> threads-pipe-2          1-groups         1.00 (  5.82)  +42.82 ( 43.61)
> threads-pipe-3          1-groups         1.00 (  3.49)  +17.33 ( 18.68)
> threads-pipe-4          1-groups         1.00 (  2.49)  +12.49 (  5.89)
> threads-pipe-5          1-groups         1.00 (  1.46)   +8.62 (  4.43)
> threads-pipe-6          1-groups         1.00 (  2.83)  +12.73 (  8.94)
> threads-sockets-1       1-groups         1.00 (  1.31)  +28.68 (  2.25)
> threads-sockets-2       1-groups         1.00 (  5.17)  +34.84 ( 36.90)
> threads-sockets-3       1-groups         1.00 (  1.57)   +9.15 (  5.52)
> threads-sockets-4       1-groups         1.00 (  1.99)  +16.51 (  6.04)
> threads-sockets-5       1-groups         1.00 (  2.39)  +10.88 (  2.17)
> threads-sockets-6       1-groups         1.00 (  1.62)   +7.22 (  2.00)
> 
> Besides a single instance of hackbench, four instances of hackbench are
> also tested on Milan. The test results show that different instances of
> hackbench are aggregated to dedicated LLCs, and performance improvement
> is observed.
> 
> schbench mmtests(unstable)
>                                    baseline              nowake_lb
> Lat 50.0th-qrtle-1         9.00 (   0.00%)        8.00 (  11.11%)
> Lat 90.0th-qrtle-1        12.00 (   0.00%)       10.00 (  16.67%)
> Lat 99.0th-qrtle-1        16.00 (   0.00%)       14.00 (  12.50%)
> Lat 99.9th-qrtle-1        22.00 (   0.00%)       21.00 (   4.55%)
> Lat 20.0th-qrtle-1       759.00 (   0.00%)      759.00 (   0.00%)
> Lat 50.0th-qrtle-2         9.00 (   0.00%)        7.00 (  22.22%)
> Lat 90.0th-qrtle-2        12.00 (   0.00%)       12.00 (   0.00%)
> Lat 99.0th-qrtle-2        16.00 (   0.00%)       15.00 (   6.25%)
> Lat 99.9th-qrtle-2        22.00 (   0.00%)       21.00 (   4.55%)
> Lat 20.0th-qrtle-2      1534.00 (   0.00%)     1510.00 (   1.56%)
> Lat 50.0th-qrtle-4         8.00 (   0.00%)        9.00 ( -12.50%)
> Lat 90.0th-qrtle-4        12.00 (   0.00%)       12.00 (   0.00%)
> Lat 99.0th-qrtle-4        15.00 (   0.00%)       16.00 (  -6.67%)
> Lat 99.9th-qrtle-4        21.00 (   0.00%)       23.00 (  -9.52%)
> Lat 20.0th-qrtle-4      3076.00 (   0.00%)     2860.00 (   7.02%)
> Lat 50.0th-qrtle-8        10.00 (   0.00%)        9.00 (  10.00%)
> Lat 90.0th-qrtle-8        12.00 (   0.00%)       13.00 (  -8.33%)
> Lat 99.0th-qrtle-8        17.00 (   0.00%)       17.00 (   0.00%)
> Lat 99.9th-qrtle-8        22.00 (   0.00%)       24.00 (  -9.09%)
> Lat 20.0th-qrtle-8      6232.00 (   0.00%)     5896.00 (   5.39%)
> Lat 50.0th-qrtle-16        9.00 (   0.00%)        9.00 (   0.00%)
> Lat 90.0th-qrtle-16       13.00 (   0.00%)       13.00 (   0.00%)
> Lat 99.0th-qrtle-16       17.00 (   0.00%)       18.00 (  -5.88%)
> Lat 99.9th-qrtle-16       23.00 (   0.00%)       26.00 ( -13.04%)
> Lat 20.0th-qrtle-16    10096.00 (   0.00%)    10352.00 (  -2.54%)
> Lat 50.0th-qrtle-32       15.00 (   0.00%)       15.00 (   0.00%)
> Lat 90.0th-qrtle-32       25.00 (   0.00%)       26.00 (  -4.00%)
> Lat 99.0th-qrtle-32       49.00 (   0.00%)       50.00 (  -2.04%)
> Lat 99.9th-qrtle-32      945.00 (   0.00%)     1005.00 (  -6.35%)
> Lat 20.0th-qrtle-32    11600.00 (   0.00%)    11632.00 (  -0.28%)
> 
> Netperf/Tbench have not been tested yet. As they are single-process
> benchmarks that are not the target of this cache-aware scheduling.
> Additionally, client and server components should be tested on
> different machines or bound to different nodes. Otherwise,
> cache-aware scheduling might harm their performance: placing client
> and server in the same LLC could yield higher throughput due to
> improved cache locality in the TCP/IP stack, whereas cache-aware
> scheduling aims to place them in dedicated LLCs.

I have similar observation from my testing.

tl;dr

o Benchmark that prefer co-location and run in threaded mode see
   a benefit including hackbench at high utilization and schbench
   at low utilization.

o schbench (both new and old but particularly the old) regresses
   quite a bit on the tial latency metric when #workers cross the
   LLC size.

o client-server benchmarks where client and servers are threads
   from different processes (netserver-netperf, tbench_srv-tbench,
   services of DeathStarBench) seem to noticeably regress due to
   lack of co-location between the communicating client and server.

   Not sure if WF_SYNC can be an indicator to temporarily ignore
   the preferred LLC hint.

o stream regresses in some runs where the occupancy metrics trip
   and assign a preferred LLC for all the stream threads bringing
   down performance in !50% of the runs.

Full data from my testing is as follows:

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel details

tip:	  tip:sched/core at commit 914873bc7df9 ("Merge tag
            'x86-build-2025-05-25' of
            git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

llc-aware-lb-v3: tip + this series as is

o Benchmark results

     ==================================================================
     Test          : hackbench
     Units         : Normalized time in seconds
     Interpretation: Lower is better
     Statistic     : AMean
     ==================================================================
     Case:           tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
      1-groups     1.00 [ -0.00](13.74)     1.03 [ -2.77](12.01)
      2-groups     1.00 [ -0.00]( 9.58)     1.02 [ -1.78]( 6.12)
      4-groups     1.00 [ -0.00]( 2.10)     1.01 [ -0.87]( 0.91)
      8-groups     1.00 [ -0.00]( 1.51)     1.03 [ -3.31]( 2.06)
     16-groups     1.00 [ -0.00]( 1.10)     0.95 [  5.36]( 1.67)


     ==================================================================
     Test          : tbench
     Units         : Normalized throughput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:    tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
         1     1.00 [  0.00]( 0.82)     0.96 [ -3.68]( 1.23)
         2     1.00 [  0.00]( 1.13)     0.98 [ -2.30]( 0.51)
         4     1.00 [  0.00]( 1.12)     0.96 [ -4.14]( 0.22)
         8     1.00 [  0.00]( 0.93)     0.96 [ -3.61]( 0.46)
        16     1.00 [  0.00]( 0.38)     0.95 [ -4.98]( 1.26)
        32     1.00 [  0.00]( 0.66)     0.93 [ -7.12]( 2.22)
        64     1.00 [  0.00]( 1.18)     0.95 [ -5.44]( 0.37)
       128     1.00 [  0.00]( 1.12)     0.93 [ -6.78]( 0.64)
       256     1.00 [  0.00]( 0.42)     0.94 [ -6.45]( 0.47)
       512     1.00 [  0.00]( 0.14)     0.93 [ -7.26]( 0.27)
      1024     1.00 [  0.00]( 0.26)     0.92 [ -7.57]( 0.31)


     ==================================================================
     Test          : stream-10
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:       tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
      Copy     1.00 [  0.00]( 8.37)     0.39 [-61.05](44.88)
     Scale     1.00 [  0.00]( 2.85)     0.43 [-57.26](40.60)
       Add     1.00 [  0.00]( 3.39)     0.40 [-59.88](42.02)
     Triad     1.00 [  0.00]( 6.39)     0.41 [-58.93](42.98)


     ==================================================================
     Test          : stream-100
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:       tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
      Copy     1.00 [  0.00]( 3.91)     0.36 [-63.95](51.04)
     Scale     1.00 [  0.00]( 4.34)     0.40 [-60.31](43.12)
       Add     1.00 [  0.00]( 4.14)     0.38 [-62.46](43.40)
     Triad     1.00 [  0.00]( 1.00)     0.36 [-64.38](43.12)


     ==================================================================
     Test          : netperf
     Units         : Normalized Througput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:         tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
      1-clients     1.00 [  0.00]( 0.41)     0.97 [ -3.26]( 1.30)
      2-clients     1.00 [  0.00]( 0.58)     0.96 [ -4.24]( 0.71)
      4-clients     1.00 [  0.00]( 0.35)     0.96 [ -4.19]( 0.67)
      8-clients     1.00 [  0.00]( 0.48)     0.95 [ -5.41]( 1.36)
     16-clients     1.00 [  0.00]( 0.66)     0.95 [ -5.31]( 0.93)
     32-clients     1.00 [  0.00]( 1.15)     0.94 [ -6.43]( 1.44)
     64-clients     1.00 [  0.00]( 1.38)     0.93 [ -7.14]( 1.63)
     128-clients    1.00 [  0.00]( 0.87)     0.89 [-10.62]( 0.78)
     256-clients    1.00 [  0.00]( 5.36)     0.92 [ -8.04]( 2.64)
     512-clients    1.00 [  0.00](54.39)     0.88 [-12.12](48.87)


     ==================================================================
     Test          : schbench
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
       1     1.00 [ -0.00]( 8.54)     0.54 [ 45.65](28.79)
       2     1.00 [ -0.00]( 1.15)     0.56 [ 44.00]( 2.09)
       4     1.00 [ -0.00](13.46)     0.67 [ 33.33](35.68)
       8     1.00 [ -0.00]( 7.14)     0.63 [ 36.84]( 4.28)
      16     1.00 [ -0.00]( 3.49)     1.05 [ -5.08]( 9.13)
      32     1.00 [ -0.00]( 1.06)    32.04 [-3104.26](81.31)
      64     1.00 [ -0.00]( 5.48)    24.51 [-2351.16](81.18)
     128     1.00 [ -0.00](10.45)    14.56 [-1356.07]( 5.35)
     256     1.00 [ -0.00](31.14)     0.95 [  4.80](20.88)
     512     1.00 [ -0.00]( 1.52)     1.00 [ -0.25]( 1.26)


     ==================================================================
     Test          : new-schbench-requests-per-second
     Units         : Normalized Requests per second
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
       1     1.00 [  0.00]( 1.07)     0.97 [ -3.24]( 0.98)
       2     1.00 [  0.00]( 0.00)     0.99 [ -1.17]( 0.15)
       4     1.00 [  0.00]( 0.00)     0.96 [ -3.50]( 0.56)
       8     1.00 [  0.00]( 0.15)     0.98 [ -1.76]( 0.31)
      16     1.00 [  0.00]( 0.00)     0.94 [ -6.13]( 1.93)
      32     1.00 [  0.00]( 3.41)     0.97 [ -3.18]( 2.10)
      64     1.00 [  0.00]( 1.05)     0.82 [-18.14](18.41)
     128     1.00 [  0.00]( 0.00)     0.98 [ -2.27]( 0.20)
     256     1.00 [  0.00]( 0.72)     1.01 [  1.23]( 0.31)
     512     1.00 [  0.00]( 0.57)     1.00 [  0.00]( 0.12)


     ==================================================================
     Test          : new-schbench-wakeup-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
       1     1.00 [ -0.00]( 9.11)     0.88 [ 12.50](11.92)
       2     1.00 [ -0.00]( 0.00)     0.86 [ 14.29](11.92)
       4     1.00 [ -0.00]( 3.78)     0.93 [  7.14]( 4.08)
       8     1.00 [ -0.00]( 0.00)     0.83 [ 16.67]( 5.34)
      16     1.00 [ -0.00]( 7.56)     0.85 [ 15.38]( 0.00)
      32     1.00 [ -0.00](15.11)     0.80 [ 20.00]( 4.19)
      64     1.00 [ -0.00]( 9.63)     1.05 [ -5.00](24.47)
     128     1.00 [ -0.00]( 4.86)     1.57 [-56.78](68.52)
     256     1.00 [ -0.00]( 2.34)     1.00 [ -0.00]( 0.57)
     512     1.00 [ -0.00]( 0.40)     1.00 [ -0.00]( 0.34)


     ==================================================================
     Test          : new-schbench-request-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
       1     1.00 [ -0.00]( 2.73)     1.06 [ -5.71]( 0.25)
       2     1.00 [ -0.00]( 0.87)     1.08 [ -8.37]( 0.78)
       4     1.00 [ -0.00]( 1.21)     1.09 [ -9.15]( 0.79)
       8     1.00 [ -0.00]( 0.27)     1.06 [ -6.31]( 0.51)
      16     1.00 [ -0.00]( 4.04)     1.85 [-84.55]( 5.11)
      32     1.00 [ -0.00]( 7.35)     1.52 [-52.16]( 0.83)
      64     1.00 [ -0.00]( 3.54)     1.06 [ -5.77]( 2.62)
     128     1.00 [ -0.00]( 0.37)     1.09 [ -9.18](28.47)
     256     1.00 [ -0.00]( 9.57)     0.99 [  0.60]( 0.48)
     512     1.00 [ -0.00]( 1.82)     1.03 [ -2.80]( 1.16)


     ==================================================================
     Test          : Various longer running benchmarks
     Units         : %diff in throughput reported
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     Benchmarks:                  %diff
     ycsb-cassandra              -0.99%
     ycsb-mongodb                -0.96%
     deathstarbench-1x           -2.09%
     deathstarbench-2x           -0.26%
     deathstarbench-3x           -3.34%
     deathstarbench-6x           -3.03%
     hammerdb+mysql 16VU         -2.15%
     hammerdb+mysql 64VU         -3.77%

> 
> This patch set is applied on v6.15 kernel.
>   
> There are some further work needed for future versions in this
> patch set.  We will need to align NUMA balancing with LLC aggregations
> such that LLC aggregation will align with the preferred NUMA node.
> 
> Comments and tests are much appreciated.

I'll rerun the test once with the SCHED_FEAT() disabled just to make
sure I'm not regressing because of some other factors. For the major
regressions, I'll get the "perf sched stats" data to see if anything
stands out.

I'm also planning on getting the data from a Zen5c system with larger
LLC to see if there is any difference in the trend (I'll start with the
microbenchmarks since setting the larger ones will take some time)

Sorry for the lack of engagement on previous versions but I plan on
taking a better look at the series this time around. If you need any
specific data from my setup, please do let me know.

-- 
Thanks and Regards,
Prateek

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Shrikanth Hegde 3 months, 1 week ago

> 
> tl;dr
> 
> o Benchmark that prefer co-location and run in threaded mode see
>    a benefit including hackbench at high utilization and schbench
>    at low utilization.
> 
> o schbench (both new and old but particularly the old) regresses
>    quite a bit on the tial latency metric when #workers cross the
>    LLC size.
> 
> o client-server benchmarks where client and servers are threads
>    from different processes (netserver-netperf, tbench_srv-tbench,
>    services of DeathStarBench) seem to noticeably regress due to
>    lack of co-location between the communicating client and server.
> 
>    Not sure if WF_SYNC can be an indicator to temporarily ignore
>    the preferred LLC hint.
> 
> o stream regresses in some runs where the occupancy metrics trip
>    and assign a preferred LLC for all the stream threads bringing
>    down performance in !50% of the runs.
> 

- When you have SMT systems, threads will go faster if they run in ST mode.
If aggregation happens in a LLC, they might end up with lower IPC.

> Full data from my testing is as follows:
> 
> o Machine details
> 
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
> 
> o Kernel details
> 
> tip:      tip:sched/core at commit 914873bc7df9 ("Merge tag
>             'x86-build-2025-05-25' of
>             git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
> 
> llc-aware-lb-v3: tip + this series as is
> 
>

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Chen, Yu C 3 months ago

On 7/4/2025 4:00 AM, Shrikanth Hegde wrote:
> 
>>
>> tl;dr
>>
>> o Benchmark that prefer co-location and run in threaded mode see
>>    a benefit including hackbench at high utilization and schbench
>>    at low utilization.
>>
>> o schbench (both new and old but particularly the old) regresses
>>    quite a bit on the tial latency metric when #workers cross the
>>    LLC size.
>>
>> o client-server benchmarks where client and servers are threads
>>    from different processes (netserver-netperf, tbench_srv-tbench,
>>    services of DeathStarBench) seem to noticeably regress due to
>>    lack of co-location between the communicating client and server.
>>
>>    Not sure if WF_SYNC can be an indicator to temporarily ignore
>>    the preferred LLC hint.
>>
>> o stream regresses in some runs where the occupancy metrics trip
>>    and assign a preferred LLC for all the stream threads bringing
>>    down performance in !50% of the runs.
>>
> 
> - When you have SMT systems, threads will go faster if they run in ST mode.
> If aggregation happens in a LLC, they might end up with lower IPC.
> 

OK, the number of SMT within a core should also be considered to
control how aggressive the aggregation is.

Regarding the regression from the stream, it was caused by the working
set size. When the working set size is 2.9G in Prateek's test scenario,
there is a regression with task aggregation. If we reduce it to a lower
value, say 512MB, the regression disappears. Therefore, we are trying to
tweak this by comparing the process's RSS with the L3 cache size.

thanks,
Chenyu

thanks,
Chenyu

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Tim Chen 3 months, 2 weeks ago

On Tue, 2025-06-24 at 10:30 +0530, K Prateek Nayak wrote:
> Hello Tim,
> 
> I have similar observation from my testing.
> 
> 
Prateek,

Thanks for the testing that you did. Much appreciated.
Some follow up to Chen, Yu's comments.

> 
> o Benchmark that prefer co-location and run in threaded mode see
>    a benefit including hackbench at high utilization and schbench
>    at low utilization.
> 
> o schbench (both new and old but particularly the old) regresses
>    quite a bit on the tial latency metric when #workers cross the
>    LLC size.

Will take closer look at the cases where #workers just exceed LLC size.
Perhaps adjusting the threshold to spread the load earlier at a
lower LLC utilization will help.

> 
> o client-server benchmarks where client and servers are threads
>    from different processes (netserver-netperf, tbench_srv-tbench,
>    services of DeathStarBench) seem to noticeably regress due to
>    lack of co-location between the communicating client and server.
> 
>    Not sure if WF_SYNC can be an indicator to temporarily ignore
>    the preferred LLC hint.

Currently we do not aggregate tasks from different processes.
The case where client and server actually reside on the same
system I think is the exception rather than the rule for real
workloads where clients and servers reside on different systems.

But I do see tasks from different processes talking to each
other via pipe/socket in real workload.  Do you know of good
use cases for such scenario that would justify extending task
aggregation to multi-processes?
 
> 
> o stream regresses in some runs where the occupancy metrics trip
>    and assign a preferred LLC for all the stream threads bringing
>    down performance in !50% of the runs.
> 

Yes, stream does not have cache benefit from co-locating threads, and
get hurt from sharing common resource like memory controller.


> Full data from my testing is as follows:
> 
> o Machine details
> 
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
> 
> 
>      ==================================================================
>      Test          : Various longer running benchmarks
>      Units         : %diff in throughput reported
>      Interpretation: Higher is better
>      Statistic     : Median
>      ==================================================================
>      Benchmarks:                  %diff
>      ycsb-cassandra              -0.99%
>      ycsb-mongodb                -0.96%
>      deathstarbench-1x           -2.09%
>      deathstarbench-2x           -0.26%
>      deathstarbench-3x           -3.34%
>      deathstarbench-6x           -3.03%
>      hammerdb+mysql 16VU         -2.15%
>      hammerdb+mysql 64VU         -3.77%
> 

The clients and server of the benchmarks are co-located on the same
system, right?

> > 
> > This patch set is applied on v6.15 kernel.
> >   
> > There are some further work needed for future versions in this
> > patch set.  We will need to align NUMA balancing with LLC aggregations
> > such that LLC aggregation will align with the preferred NUMA node.
> > 
> > Comments and tests are much appreciated.
> 
> I'll rerun the test once with the SCHED_FEAT() disabled just to make
> sure I'm not regressing because of some other factors. For the major
> regressions, I'll get the "perf sched stats" data to see if anything
> stands out.
> 
> I'm also planning on getting the data from a Zen5c system with larger
> LLC to see if there is any difference in the trend (I'll start with the
> microbenchmarks since setting the larger ones will take some time)
> 
> Sorry for the lack of engagement on previous versions but I plan on
> taking a better look at the series this time around. If you need any
> specific data from my setup, please do let me know.
> 

Will do.  Thanks.

Tim

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by K Prateek Nayak 3 months, 2 weeks ago

Hello Tim,

On 6/25/2025 6:00 AM, Tim Chen wrote:
>> o Benchmark that prefer co-location and run in threaded mode see
>>     a benefit including hackbench at high utilization and schbench
>>     at low utilization.
>>
>> o schbench (both new and old but particularly the old) regresses
>>     quite a bit on the tial latency metric when #workers cross the
>>     LLC size.
> 
> Will take closer look at the cases where #workers just exceed LLC size.
> Perhaps adjusting the threshold to spread the load earlier at a
> lower LLC utilization will help.

I too will test with different number of fd pairs to see if I can
spot a trend.

> 
>>
>> o client-server benchmarks where client and servers are threads
>>     from different processes (netserver-netperf, tbench_srv-tbench,
>>     services of DeathStarBench) seem to noticeably regress due to
>>     lack of co-location between the communicating client and server.
>>
>>     Not sure if WF_SYNC can be an indicator to temporarily ignore
>>     the preferred LLC hint.
> 
> Currently we do not aggregate tasks from different processes.
> The case where client and server actually reside on the same
> system I think is the exception rather than the rule for real
> workloads where clients and servers reside on different systems.
> 
> But I do see tasks from different processes talking to each
> other via pipe/socket in real workload.  Do you know of good
> use cases for such scenario that would justify extending task
> aggregation to multi-processes?

We've seen cases with Kubernetes deployments where co-locating
processes of different services from the same pod can help with
throughput and latency. Perhaps it can happen indirectly where
co-location on WF_SYNC can actually help increase the cache
occupancy for a the other process and they both arrive at the
same preferred LLC. I'll see if I can get my hands on a setup
which is closer to these real world deployment.

>   
>>
>> o stream regresses in some runs where the occupancy metrics trip
>>     and assign a preferred LLC for all the stream threads bringing
>>     down performance in !50% of the runs.
>>
> 
> Yes, stream does not have cache benefit from co-locating threads, and
> get hurt from sharing common resource like memory controller.
> 
> 
>> Full data from my testing is as follows:
>>
>> o Machine details
>>
>> - 3rd Generation EPYC System
>> - 2 sockets each with 64C/128T
>> - NPS1 (Each socket is a NUMA node)
>> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
>>
>>
>>       ==================================================================
>>       Test          : Various longer running benchmarks
>>       Units         : %diff in throughput reported
>>       Interpretation: Higher is better
>>       Statistic     : Median
>>       ==================================================================
>>       Benchmarks:                  %diff
>>       ycsb-cassandra              -0.99%
>>       ycsb-mongodb                -0.96%
>>       deathstarbench-1x           -2.09%
>>       deathstarbench-2x           -0.26%
>>       deathstarbench-3x           -3.34%
>>       deathstarbench-6x           -3.03%
>>       hammerdb+mysql 16VU         -2.15%
>>       hammerdb+mysql 64VU         -3.77%
>>
> 
> The clients and server of the benchmarks are co-located on the same
> system, right?

Yes that is correct. I'm using a 2P systems and our runner scripts
pin the workload to the first socket, and the workload driver runs
from the second socket. One side effect of this is that changes can
influence the placement of workload driver and that can lead to
some inconsistencies. I'll check if the the stats for the workload
driver is way off between the baseline and with this series.

-- 
Thanks and Regards,
Prateek

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Chen, Yu C 3 months, 2 weeks ago

On 6/24/2025 1:00 PM, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 6/18/2025 11:57 PM, Tim Chen wrote:
>> AMD Milan is also tested. There are 4 Nodes and 32 CPUs per node.
>> Each node has 4 CCX(shared LLC) and each CCX has 8 CPUs. Hackbench
>> with 1 group test scenario benefits from cache aware load balance
>> too:
>>
>> hackbench(1 group and fd ranges in [1,6]:
>> case                    load            baseline(std%)  compare%( std%)
>> threads-pipe-1          1-groups         1.00 (  1.22)   +2.84 (  0.51)
>> threads-pipe-2          1-groups         1.00 (  5.82)  +42.82 ( 43.61)
>> threads-pipe-3          1-groups         1.00 (  3.49)  +17.33 ( 18.68)
>> threads-pipe-4          1-groups         1.00 (  2.49)  +12.49 (  5.89)
>> threads-pipe-5          1-groups         1.00 (  1.46)   +8.62 (  4.43)
>> threads-pipe-6          1-groups         1.00 (  2.83)  +12.73 (  8.94)
>> threads-sockets-1       1-groups         1.00 (  1.31)  +28.68 (  2.25)
>> threads-sockets-2       1-groups         1.00 (  5.17)  +34.84 ( 36.90)
>> threads-sockets-3       1-groups         1.00 (  1.57)   +9.15 (  5.52)
>> threads-sockets-4       1-groups         1.00 (  1.99)  +16.51 (  6.04)
>> threads-sockets-5       1-groups         1.00 (  2.39)  +10.88 (  2.17)
>> threads-sockets-6       1-groups         1.00 (  1.62)   +7.22 (  2.00)
>>
>> Besides a single instance of hackbench, four instances of hackbench are
>> also tested on Milan. The test results show that different instances of
>> hackbench are aggregated to dedicated LLCs, and performance improvement
>> is observed.
>>
>> schbench mmtests(unstable)
>>                                    baseline              nowake_lb
>> Lat 50.0th-qrtle-1         9.00 (   0.00%)        8.00 (  11.11%)
>> Lat 90.0th-qrtle-1        12.00 (   0.00%)       10.00 (  16.67%)
>> Lat 99.0th-qrtle-1        16.00 (   0.00%)       14.00 (  12.50%)
>> Lat 99.9th-qrtle-1        22.00 (   0.00%)       21.00 (   4.55%)
>> Lat 20.0th-qrtle-1       759.00 (   0.00%)      759.00 (   0.00%)
>> Lat 50.0th-qrtle-2         9.00 (   0.00%)        7.00 (  22.22%)
>> Lat 90.0th-qrtle-2        12.00 (   0.00%)       12.00 (   0.00%)
>> Lat 99.0th-qrtle-2        16.00 (   0.00%)       15.00 (   6.25%)
>> Lat 99.9th-qrtle-2        22.00 (   0.00%)       21.00 (   4.55%)
>> Lat 20.0th-qrtle-2      1534.00 (   0.00%)     1510.00 (   1.56%)
>> Lat 50.0th-qrtle-4         8.00 (   0.00%)        9.00 ( -12.50%)
>> Lat 90.0th-qrtle-4        12.00 (   0.00%)       12.00 (   0.00%)
>> Lat 99.0th-qrtle-4        15.00 (   0.00%)       16.00 (  -6.67%)
>> Lat 99.9th-qrtle-4        21.00 (   0.00%)       23.00 (  -9.52%)
>> Lat 20.0th-qrtle-4      3076.00 (   0.00%)     2860.00 (   7.02%)
>> Lat 50.0th-qrtle-8        10.00 (   0.00%)        9.00 (  10.00%)
>> Lat 90.0th-qrtle-8        12.00 (   0.00%)       13.00 (  -8.33%)
>> Lat 99.0th-qrtle-8        17.00 (   0.00%)       17.00 (   0.00%)
>> Lat 99.9th-qrtle-8        22.00 (   0.00%)       24.00 (  -9.09%)
>> Lat 20.0th-qrtle-8      6232.00 (   0.00%)     5896.00 (   5.39%)
>> Lat 50.0th-qrtle-16        9.00 (   0.00%)        9.00 (   0.00%)
>> Lat 90.0th-qrtle-16       13.00 (   0.00%)       13.00 (   0.00%)
>> Lat 99.0th-qrtle-16       17.00 (   0.00%)       18.00 (  -5.88%)
>> Lat 99.9th-qrtle-16       23.00 (   0.00%)       26.00 ( -13.04%)
>> Lat 20.0th-qrtle-16    10096.00 (   0.00%)    10352.00 (  -2.54%)
>> Lat 50.0th-qrtle-32       15.00 (   0.00%)       15.00 (   0.00%)
>> Lat 90.0th-qrtle-32       25.00 (   0.00%)       26.00 (  -4.00%)
>> Lat 99.0th-qrtle-32       49.00 (   0.00%)       50.00 (  -2.04%)
>> Lat 99.9th-qrtle-32      945.00 (   0.00%)     1005.00 (  -6.35%)
>> Lat 20.0th-qrtle-32    11600.00 (   0.00%)    11632.00 (  -0.28%)
>>
>> Netperf/Tbench have not been tested yet. As they are single-process
>> benchmarks that are not the target of this cache-aware scheduling.
>> Additionally, client and server components should be tested on
>> different machines or bound to different nodes. Otherwise,
>> cache-aware scheduling might harm their performance: placing client
>> and server in the same LLC could yield higher throughput due to
>> improved cache locality in the TCP/IP stack, whereas cache-aware
>> scheduling aims to place them in dedicated LLCs.
> 
> I have similar observation from my testing.
> 

Prateek, thanks for your test.

> tl;dr
> 
> o Benchmark that prefer co-location and run in threaded mode see
>    a benefit including hackbench at high utilization and schbench
>    at low utilization.
> 

Previously, we tested hackbench with one group using different
fd pairs. The number of fds (1–6) was lower than the number
of CPUs (8) within one CCX. If I understand correctly, the
default number of fd pairs in hackbench is 20. We might need
to handle cases where the number of threads (nr_thread)
exceeds the number of CPUs per LLC—perhaps by
skipping task aggregation in such scenarios.

> o schbench (both new and old but particularly the old) regresses
>    quite a bit on the tial latency metric when #workers cross the
>    LLC size.
> 

As mentioned above, maybe re-consider the nr_thread vs nr_cpus_per_llc
could mitigate the issue. Besides, maybe introduce a rate limit
for cache aware aggregation would help.

> o client-server benchmarks where client and servers are threads
>    from different processes (netserver-netperf, tbench_srv-tbench,
>    services of DeathStarBench) seem to noticeably regress due to
>    lack of co-location between the communicating client and server.
> 
>    Not sure if WF_SYNC can be an indicator to temporarily ignore
>    the preferred LLC hint.

WF_SYNC is used in wakeup path, the current v3 version does the
task aggregation in the load balance path. We'll look into this
C/S scenario.

> 
> o stream regresses in some runs where the occupancy metrics trip
>    and assign a preferred LLC for all the stream threads bringing
>    down performance in !50% of the runs.
> 

May I know if you tested the stream with mmtests under OMP mode,
and what do stream-10 and stream-100 mean? Stream is an example
where all threads have their private memory buffers—no
interaction with each other. For this benchmark, spreading
them across different Nodes gets higher memory bandwidth because
stream allocates the buffer to be at least 4X the L3 cache size.
We lack a metric that can indicate when threads share a lot of
data (e.g., both Thread 1 and Thread 2 read from the same
buffer). In such cases, we should aggregate the threads;
otherwise, do not aggregate them (as in the stream case).
On the other hand, stream-omp seems like an unrealistic
scenario—if threads do not share buffer, why create them
in the same process?


> Full data from my testing is as follows:
> 
> o Machine details
> 
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
> 
> o Kernel details
> 
> tip:      tip:sched/core at commit 914873bc7df9 ("Merge tag
>             'x86-build-2025-05-25' of
>             git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
> 
> llc-aware-lb-v3: tip + this series as is
> 
> o Benchmark results
> 
>      ==================================================================
>      Test          : hackbench
>      Units         : Normalized time in seconds
>      Interpretation: Lower is better
>      Statistic     : AMean
>      ==================================================================
>      Case:           tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>       1-groups     1.00 [ -0.00](13.74)     1.03 [ -2.77](12.01)
>       2-groups     1.00 [ -0.00]( 9.58)     1.02 [ -1.78]( 6.12)
>       4-groups     1.00 [ -0.00]( 2.10)     1.01 [ -0.87]( 0.91)
>       8-groups     1.00 [ -0.00]( 1.51)     1.03 [ -3.31]( 2.06)
>      16-groups     1.00 [ -0.00]( 1.10)     0.95 [  5.36]( 1.67)
> 
> 
>      ==================================================================
>      Test          : tbench
>      Units         : Normalized throughput
>      Interpretation: Higher is better
>      Statistic     : AMean
>      ==================================================================
>      Clients:    tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>          1     1.00 [  0.00]( 0.82)     0.96 [ -3.68]( 1.23)
>          2     1.00 [  0.00]( 1.13)     0.98 [ -2.30]( 0.51)
>          4     1.00 [  0.00]( 1.12)     0.96 [ -4.14]( 0.22)
>          8     1.00 [  0.00]( 0.93)     0.96 [ -3.61]( 0.46)
>         16     1.00 [  0.00]( 0.38)     0.95 [ -4.98]( 1.26)
>         32     1.00 [  0.00]( 0.66)     0.93 [ -7.12]( 2.22)
>         64     1.00 [  0.00]( 1.18)     0.95 [ -5.44]( 0.37)
>        128     1.00 [  0.00]( 1.12)     0.93 [ -6.78]( 0.64)
>        256     1.00 [  0.00]( 0.42)     0.94 [ -6.45]( 0.47)
>        512     1.00 [  0.00]( 0.14)     0.93 [ -7.26]( 0.27)
>       1024     1.00 [  0.00]( 0.26)     0.92 [ -7.57]( 0.31)
> 
> 
>      ==================================================================
>      Test          : stream-10
>      Units         : Normalized Bandwidth, MB/s
>      Interpretation: Higher is better
>      Statistic     : HMean
>      ==================================================================
>      Test:       tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>       Copy     1.00 [  0.00]( 8.37)     0.39 [-61.05](44.88)
>      Scale     1.00 [  0.00]( 2.85)     0.43 [-57.26](40.60)
>        Add     1.00 [  0.00]( 3.39)     0.40 [-59.88](42.02)
>      Triad     1.00 [  0.00]( 6.39)     0.41 [-58.93](42.98)
> 
> 
>      ==================================================================
>      Test          : stream-100
>      Units         : Normalized Bandwidth, MB/s
>      Interpretation: Higher is better
>      Statistic     : HMean
>      ==================================================================
>      Test:       tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>       Copy     1.00 [  0.00]( 3.91)     0.36 [-63.95](51.04)
>      Scale     1.00 [  0.00]( 4.34)     0.40 [-60.31](43.12)
>        Add     1.00 [  0.00]( 4.14)     0.38 [-62.46](43.40)
>      Triad     1.00 [  0.00]( 1.00)     0.36 [-64.38](43.12)
> 
> 
>      ==================================================================
>      Test          : netperf
>      Units         : Normalized Througput
>      Interpretation: Higher is better
>      Statistic     : AMean
>      ==================================================================
>      Clients:         tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>       1-clients     1.00 [  0.00]( 0.41)     0.97 [ -3.26]( 1.30)
>       2-clients     1.00 [  0.00]( 0.58)     0.96 [ -4.24]( 0.71)
>       4-clients     1.00 [  0.00]( 0.35)     0.96 [ -4.19]( 0.67)
>       8-clients     1.00 [  0.00]( 0.48)     0.95 [ -5.41]( 1.36)
>      16-clients     1.00 [  0.00]( 0.66)     0.95 [ -5.31]( 0.93)
>      32-clients     1.00 [  0.00]( 1.15)     0.94 [ -6.43]( 1.44)
>      64-clients     1.00 [  0.00]( 1.38)     0.93 [ -7.14]( 1.63)
>      128-clients    1.00 [  0.00]( 0.87)     0.89 [-10.62]( 0.78)
>      256-clients    1.00 [  0.00]( 5.36)     0.92 [ -8.04]( 2.64)
>      512-clients    1.00 [  0.00](54.39)     0.88 [-12.12](48.87)
> 
> 
>      ==================================================================
>      Test          : schbench
>      Units         : Normalized 99th percentile latency in us
>      Interpretation: Lower is better
>      Statistic     : Median
>      ==================================================================
>      #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>        1     1.00 [ -0.00]( 8.54)     0.54 [ 45.65](28.79)
>        2     1.00 [ -0.00]( 1.15)     0.56 [ 44.00]( 2.09)
>        4     1.00 [ -0.00](13.46)     0.67 [ 33.33](35.68)
>        8     1.00 [ -0.00]( 7.14)     0.63 [ 36.84]( 4.28)
>       16     1.00 [ -0.00]( 3.49)     1.05 [ -5.08]( 9.13)
>       32     1.00 [ -0.00]( 1.06)    32.04 [-3104.26](81.31)
>       64     1.00 [ -0.00]( 5.48)    24.51 [-2351.16](81.18)
>      128     1.00 [ -0.00](10.45)    14.56 [-1356.07]( 5.35)
>      256     1.00 [ -0.00](31.14)     0.95 [  4.80](20.88)
>      512     1.00 [ -0.00]( 1.52)     1.00 [ -0.25]( 1.26)
> 
> 
>      ==================================================================
>      Test          : new-schbench-requests-per-second
>      Units         : Normalized Requests per second
>      Interpretation: Higher is better
>      Statistic     : Median
>      ==================================================================
>      #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>        1     1.00 [  0.00]( 1.07)     0.97 [ -3.24]( 0.98)
>        2     1.00 [  0.00]( 0.00)     0.99 [ -1.17]( 0.15)
>        4     1.00 [  0.00]( 0.00)     0.96 [ -3.50]( 0.56)
>        8     1.00 [  0.00]( 0.15)     0.98 [ -1.76]( 0.31)
>       16     1.00 [  0.00]( 0.00)     0.94 [ -6.13]( 1.93)
>       32     1.00 [  0.00]( 3.41)     0.97 [ -3.18]( 2.10)
>       64     1.00 [  0.00]( 1.05)     0.82 [-18.14](18.41)
>      128     1.00 [  0.00]( 0.00)     0.98 [ -2.27]( 0.20)
>      256     1.00 [  0.00]( 0.72)     1.01 [  1.23]( 0.31)
>      512     1.00 [  0.00]( 0.57)     1.00 [  0.00]( 0.12)
> 
> 
>      ==================================================================
>      Test          : new-schbench-wakeup-latency
>      Units         : Normalized 99th percentile latency in us
>      Interpretation: Lower is better
>      Statistic     : Median
>      ==================================================================
>      #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>        1     1.00 [ -0.00]( 9.11)     0.88 [ 12.50](11.92)
>        2     1.00 [ -0.00]( 0.00)     0.86 [ 14.29](11.92)
>        4     1.00 [ -0.00]( 3.78)     0.93 [  7.14]( 4.08)
>        8     1.00 [ -0.00]( 0.00)     0.83 [ 16.67]( 5.34)
>       16     1.00 [ -0.00]( 7.56)     0.85 [ 15.38]( 0.00)
>       32     1.00 [ -0.00](15.11)     0.80 [ 20.00]( 4.19)
>       64     1.00 [ -0.00]( 9.63)     1.05 [ -5.00](24.47)
>      128     1.00 [ -0.00]( 4.86)     1.57 [-56.78](68.52)
>      256     1.00 [ -0.00]( 2.34)     1.00 [ -0.00]( 0.57)
>      512     1.00 [ -0.00]( 0.40)     1.00 [ -0.00]( 0.34)
> 
> 
>      ==================================================================
>      Test          : new-schbench-request-latency
>      Units         : Normalized 99th percentile latency in us
>      Interpretation: Lower is better
>      Statistic     : Median
>      ==================================================================
>      #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>        1     1.00 [ -0.00]( 2.73)     1.06 [ -5.71]( 0.25)
>        2     1.00 [ -0.00]( 0.87)     1.08 [ -8.37]( 0.78)
>        4     1.00 [ -0.00]( 1.21)     1.09 [ -9.15]( 0.79)
>        8     1.00 [ -0.00]( 0.27)     1.06 [ -6.31]( 0.51)
>       16     1.00 [ -0.00]( 4.04)     1.85 [-84.55]( 5.11)
>       32     1.00 [ -0.00]( 7.35)     1.52 [-52.16]( 0.83)
>       64     1.00 [ -0.00]( 3.54)     1.06 [ -5.77]( 2.62)
>      128     1.00 [ -0.00]( 0.37)     1.09 [ -9.18](28.47)
>      256     1.00 [ -0.00]( 9.57)     0.99 [  0.60]( 0.48)
>      512     1.00 [ -0.00]( 1.82)     1.03 [ -2.80]( 1.16)
> 
> 
>      ==================================================================
>      Test          : Various longer running benchmarks
>      Units         : %diff in throughput reported
>      Interpretation: Higher is better
>      Statistic     : Median
>      ==================================================================
>      Benchmarks:                  %diff
>      ycsb-cassandra              -0.99%
>      ycsb-mongodb                -0.96%
>      deathstarbench-1x           -2.09%
>      deathstarbench-2x           -0.26%
>      deathstarbench-3x           -3.34%
>      deathstarbench-6x           -3.03%
>      hammerdb+mysql 16VU         -2.15%
>      hammerdb+mysql 64VU         -3.77%
> 
>>
>> This patch set is applied on v6.15 kernel.
>> There are some further work needed for future versions in this
>> patch set.  We will need to align NUMA balancing with LLC aggregations
>> such that LLC aggregation will align with the preferred NUMA node.
>>
>> Comments and tests are much appreciated.
> 
> I'll rerun the test once with the SCHED_FEAT() disabled just to make
> sure I'm not regressing because of some other factors. For the major
> regressions, I'll get the "perf sched stats" data to see if anything
> stands out.

It seems that task migration and task bouncing between its preferred
LLC and non-preferred LLC is one symptom that caused regression.

thanks,
Chenyu

> 
> I'm also planning on getting the data from a Zen5c system with larger
> LLC to see if there is any difference in the trend (I'll start with the
> microbenchmarks since setting the larger ones will take some time)
> 
> Sorry for the lack of engagement on previous versions but I plan on
> taking a better look at the series this time around. If you need any
> specific data from my setup, please do let me know.
>

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by K Prateek Nayak 3 months, 2 weeks ago

Hello Chenyu,

On 6/24/2025 5:46 PM, Chen, Yu C wrote:

[..snip..]

>> tl;dr
>>
>> o Benchmark that prefer co-location and run in threaded mode see
>>    a benefit including hackbench at high utilization and schbench
>>    at low utilization.
>>
> 
> Previously, we tested hackbench with one group using different
> fd pairs. The number of fds (1–6) was lower than the number
> of CPUs (8) within one CCX. If I understand correctly, the
> default number of fd pairs in hackbench is 20.

Yes that is correct. I'm using the default configuration with
20 messengers are 20 receivers over 20 fd pairs. I'll check
if changing this to nr_llc and nr_llc / 2 makes a difference.

> We might need
> to handle cases where the number of threads (nr_thread)
> exceeds the number of CPUs per LLC—perhaps by
> skipping task aggregation in such scenarios.
> 
>> o schbench (both new and old but particularly the old) regresses
>>    quite a bit on the tial latency metric when #workers cross the
>>    LLC size.
>>
> 
> As mentioned above, maybe re-consider the nr_thread vs nr_cpus_per_llc
> could mitigate the issue. Besides, maybe introduce a rate limit
> for cache aware aggregation would help.
> 
>> o client-server benchmarks where client and servers are threads
>>    from different processes (netserver-netperf, tbench_srv-tbench,
>>    services of DeathStarBench) seem to noticeably regress due to
>>    lack of co-location between the communicating client and server.
>>
>>    Not sure if WF_SYNC can be an indicator to temporarily ignore
>>    the preferred LLC hint.
> 
> WF_SYNC is used in wakeup path, the current v3 version does the
> task aggregation in the load balance path. We'll look into this
> C/S scenario.
> 
>>
>> o stream regresses in some runs where the occupancy metrics trip
>>    and assign a preferred LLC for all the stream threads bringing
>>    down performance in !50% of the runs.
>>
> 
> May I know if you tested the stream with mmtests under OMP mode,
> and what do stream-10 and stream-100 mean?

I'm using STREAM in OMP mode. The "10" and "100" refer to the
"NTIMES" argument. I'm passing this during the time of binary
creation as:

     gcc -DSTREAM_ARRAY_SIZE=$ARRAY_SIZE -DNTIMES=$NUM_TIMES -fopenmp -O2 stream.c -o stream

This repeats the main loop of stream benchmark NTIMES. 10 runs
is used to spot any imbalances for shorter runs of b/w intensive
tasks and 100 runs are used to spot trends / ability to correct
an incorrect placement over a longer run.

> Stream is an example
> where all threads have their private memory buffers—no
> interaction with each other. For this benchmark, spreading
> them across different Nodes gets higher memory bandwidth because
> stream allocates the buffer to be at least 4X the L3 cache size.
> We lack a metric that can indicate when threads share a lot of
> data (e.g., both Thread 1 and Thread 2 read from the same
> buffer). In such cases, we should aggregate the threads;
> otherwise, do not aggregate them (as in the stream case).
> On the other hand, stream-omp seems like an unrealistic
> scenario—if threads do not share buffer, why create them
> in the same process?

Not very sure why that is the case but from what I know, HPC
heavily relies on OMP and I believe using threads can reduce
the overhead of fork + join when amount of parallelism in
OMP loops vary?

[..snip..]

>>
>>>
>>> This patch set is applied on v6.15 kernel.
>>> There are some further work needed for future versions in this
>>> patch set.  We will need to align NUMA balancing with LLC aggregations
>>> such that LLC aggregation will align with the preferred NUMA node.
>>>
>>> Comments and tests are much appreciated.
>>
>> I'll rerun the test once with the SCHED_FEAT() disabled just to make
>> sure I'm not regressing because of some other factors. For the major
>> regressions, I'll get the "perf sched stats" data to see if anything
>> stands out.
> 
> It seems that task migration and task bouncing between its preferred
> LLC and non-preferred LLC is one symptom that caused regression.

That could be the case! I'll also include some migration data to see
if it reveals anything.

-- 
Thanks and Regards,
Prateek

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Madadi Vineeth Reddy 3 months, 2 weeks ago

Hi Tim,

On 18/06/25 23:57, Tim Chen wrote:
> This is the third revision of the cache aware scheduling patches,
> based on the original patch proposed by Peter[1].
>  
> The goal of the patch series is to aggregate tasks sharing data
> to the same cache domain, thereby reducing cache bouncing and
> cache misses, and improve data access efficiency. In the current
> implementation, threads within the same process are considered
> as entities that potentially share resources.
>  
> In previous versions, aggregation of tasks were done in the
> wake up path, without making load balancing paths aware of
> LLC (Last-Level-Cache) preference. This led to the following
> problems:
> 
> 1) Aggregation of tasks during wake up led to load imbalance
>    between LLCs
> 2) Load balancing tried to even out the load between LLCs
> 3) Wake up tasks aggregation happened at a faster rate and
>    load balancing moved tasks in opposite directions, leading
>    to continuous and excessive task migrations and regressions
>    in benchmarks like schbench.
> 
> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:
> 
> 1) Identify tasks that prefer to run on their hottest LLC and
>    move them there.
> 2) Prevent generic load balancing from moving a task out of
>    its hottest LLC.
> 
> By default, LLC task aggregation during wake-up is disabled.
> Conversely, cache-aware load balancing is enabled by default.
> For easier comparison, two scheduler features are introduced:
> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
> wake up and cache-aware load balancing, respectively. By default,
> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
> is only done on load balancing.

Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
LLC on this platform spans 4 threads.

schbench:
                        baseline (sd%)        baseline+cacheaware (sd%)      %change
Lat 50.0th-worker-1        6.33 (24.12%)           6.00 (28.87%)               5.21%
Lat 90.0th-worker-1        7.67 ( 7.53%)           7.67 (32.83%)               0.00%
Lat 99.0th-worker-1        8.67 ( 6.66%)           9.33 (37.63%)              -7.61%
Lat 99.9th-worker-1       21.33 (63.99%)          12.33 (28.47%)              42.19%

Lat 50.0th-worker-2        4.33 (13.32%)           5.67 (10.19%)             -30.95%
Lat 90.0th-worker-2        5.67 (20.38%)           7.67 ( 7.53%)             -35.27%
Lat 99.0th-worker-2        7.33 ( 7.87%)           8.33 ( 6.93%)             -13.64%
Lat 99.9th-worker-2       11.67 (24.74%)          10.33 (11.17%)              11.48%

Lat 50.0th-worker-4        5.00 ( 0.00%)           7.00 ( 0.00%)             -40.00%
Lat 90.0th-worker-4        7.00 ( 0.00%)           9.67 ( 5.97%)             -38.14%
Lat 99.0th-worker-4        8.00 ( 0.00%)          11.33 (13.48%)             -41.62%
Lat 99.9th-worker-4       10.33 ( 5.59%)          14.00 ( 7.14%)             -35.53%

Lat 50.0th-worker-8        4.33 (13.32%)           5.67 (10.19%)             -30.95%
Lat 90.0th-worker-8        6.33 (18.23%)           8.67 ( 6.66%)             -36.99%
Lat 99.0th-worker-8        7.67 ( 7.53%)          10.33 ( 5.59%)             -34.69%
Lat 99.9th-worker-8       10.00 (10.00%)          12.33 ( 4.68%)             -23.30%

Lat 50.0th-worker-16       4.00 ( 0.00%)           5.00 ( 0.00%)             -25.00%
Lat 90.0th-worker-16       6.33 ( 9.12%)           7.67 ( 7.53%)             -21.21%
Lat 99.0th-worker-16       8.00 ( 0.00%)          10.33 ( 5.59%)             -29.13%
Lat 99.9th-worker-16      12.00 ( 8.33%)          13.33 ( 4.33%)             -11.08%

Lat 50.0th-worker-32       5.00 ( 0.00%)           5.33 (10.83%)              -6.60%
Lat 90.0th-worker-32       7.00 ( 0.00%)           8.67 (17.63%)             -23.86%
Lat 99.0th-worker-32      10.67 (14.32%)          12.67 ( 4.56%)             -18.75%
Lat 99.9th-worker-32      14.67 ( 3.94%)          19.00 (13.93%)             -29.49%

Lat 50.0th-worker-64       5.33 (10.83%)           6.67 ( 8.66%)             -25.14%
Lat 90.0th-worker-64      10.00 (17.32%)          14.33 ( 4.03%)             -43.30%
Lat 99.0th-worker-64      14.00 ( 7.14%)          16.67 ( 3.46%)             -19.07%
Lat 99.9th-worker-64      55.00 (56.69%)          47.00 (61.92%)              14.55%

Lat 50.0th-worker-128      8.00 ( 0.00%)           8.67 (13.32%)              -8.38%
Lat 90.0th-worker-128     13.33 ( 4.33%)          14.33 ( 8.06%)              -7.50%
Lat 99.0th-worker-128     16.00 ( 0.00%)          20.00 ( 8.66%)             -25.00%
Lat 99.9th-worker-128   2258.33 (83.80%)        2974.67 (21.82%)             -31.72%

Lat 50.0th-worker-256     47.67 ( 2.42%)          45.33 ( 3.37%)               4.91%
Lat 90.0th-worker-256   3470.67 ( 1.88%)        3558.67 ( 0.47%)              -2.54%
Lat 99.0th-worker-256   9040.00 ( 2.76%)        9050.67 ( 0.41%)              -0.12%
Lat 99.9th-worker-256  13824.00 (20.07%)       13104.00 ( 6.84%)               5.21%

The above data shows mostly regression both in the lesser and
higher load cases.


Hackbench pipe:

Pairs   Baseline Avg (s) (Std%)     Patched Avg (s) (Std%)      % Change
2       2.987 (1.19%)               2.414 (17.99%)              24.06%
4       7.702 (12.53%)              7.228 (18.37%)               6.16%
8       14.141 (1.32%)              13.109 (1.46%)               7.29%
15      27.571 (6.53%)              29.460 (8.71%)              -6.84%
30      65.118 (4.49%)              61.352 (4.00%)               5.78%
45      105.086 (9.75%)             97.970 (4.26%)               6.77%
60      149.221 (6.91%)             154.176 (4.17%)             -3.32%
75      199.278 (1.21%)             198.680 (1.37%)              0.30%

A lot of run to run variation is seen in hackbench runs. So hard to tell
on the performance but looks better than schbench.

In Power 10 and Power 11, The LLC size is relatively smaller (4 CPUs)
when compared to platforms like sapphire rapids and Milan. Didn't go
through this series yet. Will go through and try to understand why
schbench is not happy on Power systems.

Meanwhile, Wanted to know your thoughts on how does smaller LLC
size get impacted with this patch?

Thanks,
Madadi Vineeth Reddy


> 
> With above default settings, task migrations occur less frequently
> and no longer happen in the latency-sensitive wake-up path.
> 

[..snip..]

> 
> Chen Yu (3):
>   sched: Several fixes for cache aware scheduling
>   sched: Avoid task migration within its preferred LLC
>   sched: Save the per LLC utilization for better cache aware scheduling
> 
> K Prateek Nayak (1):
>   sched: Avoid calculating the cpumask if the system is overloaded
> 
> Peter Zijlstra (1):
>   sched: Cache aware load-balancing
> 
> Tim Chen (15):
>   sched: Add hysteresis to switch a task's preferred LLC
>   sched: Add helper function to decide whether to allow cache aware
>     scheduling
>   sched: Set up LLC indexing
>   sched: Introduce task preferred LLC field
>   sched: Calculate the number of tasks that have LLC preference on a
>     runqueue
>   sched: Introduce per runqueue task LLC preference counter
>   sched: Calculate the total number of preferred LLC tasks during load
>     balance
>   sched: Tag the sched group as llc_balance if it has tasks prefer other
>     LLC
>   sched: Introduce update_llc_busiest() to deal with groups having
>     preferred LLC tasks
>   sched: Introduce a new migration_type to track the preferred LLC load
>     balance
>   sched: Consider LLC locality for active balance
>   sched: Consider LLC preference when picking tasks from busiest queue
>   sched: Do not migrate task if it is moving out of its preferred LLC
>   sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>   sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>     up
> 
>  include/linux/mm_types.h       |  44 ++
>  include/linux/sched.h          |   8 +
>  include/linux/sched/topology.h |   3 +
>  init/Kconfig                   |   4 +
>  init/init_task.c               |   3 +
>  kernel/fork.c                  |   5 +
>  kernel/sched/core.c            |  25 +-
>  kernel/sched/debug.c           |   4 +
>  kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
>  kernel/sched/features.h        |   3 +
>  kernel/sched/sched.h           |  23 +
>  kernel/sched/topology.c        |  29 ++
>  12 files changed, 982 insertions(+), 28 deletions(-)
>

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Tim Chen 3 months, 2 weeks ago

On Sat, 2025-06-21 at 00:55 +0530, Madadi Vineeth Reddy wrote:
> Hi Tim,
> 
> On 18/06/25 23:57, Tim Chen wrote:
> > This is the third revision of the cache aware scheduling patches,
> > based on the original patch proposed by Peter[1].
> >  
> > The goal of the patch series is to aggregate tasks sharing data
> > to the same cache domain, thereby reducing cache bouncing and
> > cache misses, and improve data access efficiency. In the current
> > implementation, threads within the same process are considered
> > as entities that potentially share resources.
> >  
> > In previous versions, aggregation of tasks were done in the
> > wake up path, without making load balancing paths aware of
> > LLC (Last-Level-Cache) preference. This led to the following
> > problems:
> > 
> > 1) Aggregation of tasks during wake up led to load imbalance
> >    between LLCs
> > 2) Load balancing tried to even out the load between LLCs
> > 3) Wake up tasks aggregation happened at a faster rate and
> >    load balancing moved tasks in opposite directions, leading
> >    to continuous and excessive task migrations and regressions
> >    in benchmarks like schbench.
> > 
> > In this version, load balancing is made cache-aware. The main
> > idea of cache-aware load balancing consists of two parts:
> > 
> > 1) Identify tasks that prefer to run on their hottest LLC and
> >    move them there.
> > 2) Prevent generic load balancing from moving a task out of
> >    its hottest LLC.
> > 
> > By default, LLC task aggregation during wake-up is disabled.
> > Conversely, cache-aware load balancing is enabled by default.
> > For easier comparison, two scheduler features are introduced:
> > SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
> > wake up and cache-aware load balancing, respectively. By default,
> > NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
> > is only done on load balancing.
> 
> Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
> LLC on this platform spans 4 threads.

Hi Madadi,

Thank you for testing this patch series.

If I understand correctly, the Power 11 you tested has 8 threads per core.
My suspicion is we benefit much more from utilizing more cores
than aggregating the load on less cores but sharing the cache
more in this case.


> 
> schbench:
>                         baseline (sd%)        baseline+cacheaware (sd%)      %change
> Lat 50.0th-worker-1        6.33 (24.12%)           6.00 (28.87%)               5.21%
> Lat 90.0th-worker-1        7.67 ( 7.53%)           7.67 (32.83%)               0.00%
> Lat 99.0th-worker-1        8.67 ( 6.66%)           9.33 (37.63%)              -7.61%
> Lat 99.9th-worker-1       21.33 (63.99%)          12.33 (28.47%)              42.19%
> 
> Lat 50.0th-worker-2        4.33 (13.32%)           5.67 (10.19%)             -30.95%
> Lat 90.0th-worker-2        5.67 (20.38%)           7.67 ( 7.53%)             -35.27%
> Lat 99.0th-worker-2        7.33 ( 7.87%)           8.33 ( 6.93%)             -13.64%
> Lat 99.9th-worker-2       11.67 (24.74%)          10.33 (11.17%)              11.48%
> 
> Lat 50.0th-worker-4        5.00 ( 0.00%)           7.00 ( 0.00%)             -40.00%
> Lat 90.0th-worker-4        7.00 ( 0.00%)           9.67 ( 5.97%)             -38.14%
> Lat 99.0th-worker-4        8.00 ( 0.00%)          11.33 (13.48%)             -41.62%
> Lat 99.9th-worker-4       10.33 ( 5.59%)          14.00 ( 7.14%)             -35.53%
> 
> Lat 50.0th-worker-8        4.33 (13.32%)           5.67 (10.19%)             -30.95%
> Lat 90.0th-worker-8        6.33 (18.23%)           8.67 ( 6.66%)             -36.99%
> Lat 99.0th-worker-8        7.67 ( 7.53%)          10.33 ( 5.59%)             -34.69%
> Lat 99.9th-worker-8       10.00 (10.00%)          12.33 ( 4.68%)             -23.30%
> 
> Lat 50.0th-worker-16       4.00 ( 0.00%)           5.00 ( 0.00%)             -25.00%
> Lat 90.0th-worker-16       6.33 ( 9.12%)           7.67 ( 7.53%)             -21.21%
> Lat 99.0th-worker-16       8.00 ( 0.00%)          10.33 ( 5.59%)             -29.13%
> Lat 99.9th-worker-16      12.00 ( 8.33%)          13.33 ( 4.33%)             -11.08%
> 
> Lat 50.0th-worker-32       5.00 ( 0.00%)           5.33 (10.83%)              -6.60%
> Lat 90.0th-worker-32       7.00 ( 0.00%)           8.67 (17.63%)             -23.86%
> Lat 99.0th-worker-32      10.67 (14.32%)          12.67 ( 4.56%)             -18.75%
> Lat 99.9th-worker-32      14.67 ( 3.94%)          19.00 (13.93%)             -29.49%
> 
> Lat 50.0th-worker-64       5.33 (10.83%)           6.67 ( 8.66%)             -25.14%
> Lat 90.0th-worker-64      10.00 (17.32%)          14.33 ( 4.03%)             -43.30%
> Lat 99.0th-worker-64      14.00 ( 7.14%)          16.67 ( 3.46%)             -19.07%
> Lat 99.9th-worker-64      55.00 (56.69%)          47.00 (61.92%)              14.55%
> 
> Lat 50.0th-worker-128      8.00 ( 0.00%)           8.67 (13.32%)              -8.38%
> Lat 90.0th-worker-128     13.33 ( 4.33%)          14.33 ( 8.06%)              -7.50%
> Lat 99.0th-worker-128     16.00 ( 0.00%)          20.00 ( 8.66%)             -25.00%
> Lat 99.9th-worker-128   2258.33 (83.80%)        2974.67 (21.82%)             -31.72%
> 
> Lat 50.0th-worker-256     47.67 ( 2.42%)          45.33 ( 3.37%)               4.91%
> Lat 90.0th-worker-256   3470.67 ( 1.88%)        3558.67 ( 0.47%)              -2.54%
> Lat 99.0th-worker-256   9040.00 ( 2.76%)        9050.67 ( 0.41%)              -0.12%
> Lat 99.9th-worker-256  13824.00 (20.07%)       13104.00 ( 6.84%)               5.21%
> 
> The above data shows mostly regression both in the lesser and
> higher load cases.
> 
> 
> Hackbench pipe:
> 
> Pairs   Baseline Avg (s) (Std%)     Patched Avg (s) (Std%)      % Change
> 2       2.987 (1.19%)               2.414 (17.99%)              24.06%
> 4       7.702 (12.53%)              7.228 (18.37%)               6.16%
> 8       14.141 (1.32%)              13.109 (1.46%)               7.29%
> 15      27.571 (6.53%)              29.460 (8.71%)              -6.84%
> 30      65.118 (4.49%)              61.352 (4.00%)               5.78%
> 45      105.086 (9.75%)             97.970 (4.26%)               6.77%
> 60      149.221 (6.91%)             154.176 (4.17%)             -3.32%
> 75      199.278 (1.21%)             198.680 (1.37%)              0.30%
> 
> A lot of run to run variation is seen in hackbench runs. So hard to tell
> on the performance but looks better than schbench.
> 
> In Power 10 and Power 11, The LLC size is relatively smaller (4 CPUs)
> when compared to platforms like sapphire rapids and Milan. Didn't go
> through this series yet. Will go through and try to understand why
> schbench is not happy on Power systems.

My guess is having 8 threads per core, LLC aggregation may have
been too aggressive in consolidating tasks on fewer cores and may have left some
cpu cycles unused. Doing experiments by running one thread per core on Power11
may give us some insights if this conjecture is true.

> 
> Meanwhile, Wanted to know your thoughts on how does smaller LLC
> size get impacted with this patch?
> 

This patch series is currently tuned for systems with single threaded core,
and having many cores and large cache per LLC.  

With only 4 cores and 32 threads per LLC as in Power 11, we run out of cores quickly
and have more cache contention between the tasks consolidated.
We may have to set aggregation threshold (sysctl_llc_aggr_cap) less
than 50% utilization (default), so we consolidate less aggressively
and spread the tasks much sooner. 


Tim

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Chen, Yu C 3 months, 2 weeks ago

On 6/21/2025 3:25 AM, Madadi Vineeth Reddy wrote:
> Hi Tim,
> 
> On 18/06/25 23:57, Tim Chen wrote:
>> This is the third revision of the cache aware scheduling patches,
>> based on the original patch proposed by Peter[1].
>>   
>> The goal of the patch series is to aggregate tasks sharing data
>> to the same cache domain, thereby reducing cache bouncing and
>> cache misses, and improve data access efficiency. In the current
>> implementation, threads within the same process are considered
>> as entities that potentially share resources.
>>   
>> In previous versions, aggregation of tasks were done in the
>> wake up path, without making load balancing paths aware of
>> LLC (Last-Level-Cache) preference. This led to the following
>> problems:
>>
>> 1) Aggregation of tasks during wake up led to load imbalance
>>     between LLCs
>> 2) Load balancing tried to even out the load between LLCs
>> 3) Wake up tasks aggregation happened at a faster rate and
>>     load balancing moved tasks in opposite directions, leading
>>     to continuous and excessive task migrations and regressions
>>     in benchmarks like schbench.
>>
>> In this version, load balancing is made cache-aware. The main
>> idea of cache-aware load balancing consists of two parts:
>>
>> 1) Identify tasks that prefer to run on their hottest LLC and
>>     move them there.
>> 2) Prevent generic load balancing from moving a task out of
>>     its hottest LLC.
>>
>> By default, LLC task aggregation during wake-up is disabled.
>> Conversely, cache-aware load balancing is enabled by default.
>> For easier comparison, two scheduler features are introduced:
>> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
>> wake up and cache-aware load balancing, respectively. By default,
>> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
>> is only done on load balancing.
> 
> Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
> LLC on this platform spans 4 threads.
> 
> schbench:
>                          baseline (sd%)        baseline+cacheaware (sd%)      %change
> Lat 50.0th-worker-1        6.33 (24.12%)           6.00 (28.87%)               5.21%
> Lat 90.0th-worker-1        7.67 ( 7.53%)           7.67 (32.83%)               0.00%
> Lat 99.0th-worker-1        8.67 ( 6.66%)           9.33 (37.63%)              -7.61%
> Lat 99.9th-worker-1       21.33 (63.99%)          12.33 (28.47%)              42.19%
> 
> Lat 50.0th-worker-2        4.33 (13.32%)           5.67 (10.19%)             -30.95%
> Lat 90.0th-worker-2        5.67 (20.38%)           7.67 ( 7.53%)             -35.27%
> Lat 99.0th-worker-2        7.33 ( 7.87%)           8.33 ( 6.93%)             -13.64%
> Lat 99.9th-worker-2       11.67 (24.74%)          10.33 (11.17%)              11.48%
> 
> Lat 50.0th-worker-4        5.00 ( 0.00%)           7.00 ( 0.00%)             -40.00%
> Lat 90.0th-worker-4        7.00 ( 0.00%)           9.67 ( 5.97%)             -38.14%
> Lat 99.0th-worker-4        8.00 ( 0.00%)          11.33 (13.48%)             -41.62%
> Lat 99.9th-worker-4       10.33 ( 5.59%)          14.00 ( 7.14%)             -35.53%
> 
> Lat 50.0th-worker-8        4.33 (13.32%)           5.67 (10.19%)             -30.95%
> Lat 90.0th-worker-8        6.33 (18.23%)           8.67 ( 6.66%)             -36.99%
> Lat 99.0th-worker-8        7.67 ( 7.53%)          10.33 ( 5.59%)             -34.69%
> Lat 99.9th-worker-8       10.00 (10.00%)          12.33 ( 4.68%)             -23.30%
> 
> Lat 50.0th-worker-16       4.00 ( 0.00%)           5.00 ( 0.00%)             -25.00%
> Lat 90.0th-worker-16       6.33 ( 9.12%)           7.67 ( 7.53%)             -21.21%
> Lat 99.0th-worker-16       8.00 ( 0.00%)          10.33 ( 5.59%)             -29.13%
> Lat 99.9th-worker-16      12.00 ( 8.33%)          13.33 ( 4.33%)             -11.08%
> 
> Lat 50.0th-worker-32       5.00 ( 0.00%)           5.33 (10.83%)              -6.60%
> Lat 90.0th-worker-32       7.00 ( 0.00%)           8.67 (17.63%)             -23.86%
> Lat 99.0th-worker-32      10.67 (14.32%)          12.67 ( 4.56%)             -18.75%
> Lat 99.9th-worker-32      14.67 ( 3.94%)          19.00 (13.93%)             -29.49%
> 
> Lat 50.0th-worker-64       5.33 (10.83%)           6.67 ( 8.66%)             -25.14%
> Lat 90.0th-worker-64      10.00 (17.32%)          14.33 ( 4.03%)             -43.30%
> Lat 99.0th-worker-64      14.00 ( 7.14%)          16.67 ( 3.46%)             -19.07%
> Lat 99.9th-worker-64      55.00 (56.69%)          47.00 (61.92%)              14.55%
> 
> Lat 50.0th-worker-128      8.00 ( 0.00%)           8.67 (13.32%)              -8.38%
> Lat 90.0th-worker-128     13.33 ( 4.33%)          14.33 ( 8.06%)              -7.50%
> Lat 99.0th-worker-128     16.00 ( 0.00%)          20.00 ( 8.66%)             -25.00%
> Lat 99.9th-worker-128   2258.33 (83.80%)        2974.67 (21.82%)             -31.72%
> 
> Lat 50.0th-worker-256     47.67 ( 2.42%)          45.33 ( 3.37%)               4.91%
> Lat 90.0th-worker-256   3470.67 ( 1.88%)        3558.67 ( 0.47%)              -2.54%
> Lat 99.0th-worker-256   9040.00 ( 2.76%)        9050.67 ( 0.41%)              -0.12%
> Lat 99.9th-worker-256  13824.00 (20.07%)       13104.00 ( 6.84%)               5.21%
> 
> The above data shows mostly regression both in the lesser and
> higher load cases.
> 
> 
> Hackbench pipe:
> 
> Pairs   Baseline Avg (s) (Std%)     Patched Avg (s) (Std%)      % Change
> 2       2.987 (1.19%)               2.414 (17.99%)              24.06%
> 4       7.702 (12.53%)              7.228 (18.37%)               6.16%
> 8       14.141 (1.32%)              13.109 (1.46%)               7.29%
> 15      27.571 (6.53%)              29.460 (8.71%)              -6.84%
> 30      65.118 (4.49%)              61.352 (4.00%)               5.78%
> 45      105.086 (9.75%)             97.970 (4.26%)               6.77%
> 60      149.221 (6.91%)             154.176 (4.17%)             -3.32%
> 75      199.278 (1.21%)             198.680 (1.37%)              0.30%
> 
> A lot of run to run variation is seen in hackbench runs. So hard to tell
> on the performance but looks better than schbench.

May I know if the cpu frequency was set at a fixed level and deep
cpu idle states were disabled(I assume on power system it is called
stop states?)

> 
> In Power 10 and Power 11, The LLC size is relatively smaller (4 CPUs)
> when compared to platforms like sapphire rapids and Milan. Didn't go
> through this series yet. Will go through and try to understand why
> schbench is not happy on Power systems.
> 
> Meanwhile, Wanted to know your thoughts on how does smaller LLC
> size get impacted with this patch?
> 

task aggregation on smaller LLC domain(both in terms of the
number of CPUs and the size of LLC) might bring cache contention
and hurt performance IMO. May I know what is the cache size on
your system:
lscpu | grep "L3 cache"

May I know if you tested it with:
echo NO_SCHED_CACHE > /sys/kernel/debug/sched/features
echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
echo NO_SCHED_CACHE_LB > /sys/kernel/debug/sched/features

vs

echo SCHED_CACHE > /sys/kernel/debug/sched/features
echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
echo SCHED_CACHE_LB > /sys/kernel/debug/sched/features

And could you help check if setting /sys/kernel/debug/sched/llc_aggr_cap
from 50 to some smaller values(25, etc) would help?

thanks,
Chenyu

> Thanks,
> Madadi Vineeth Reddy
> 
> 
>>
>> With above default settings, task migrations occur less frequently
>> and no longer happen in the latency-sensitive wake-up path.
>>
> 
> [..snip..]
> 
>>
>> Chen Yu (3):
>>    sched: Several fixes for cache aware scheduling
>>    sched: Avoid task migration within its preferred LLC
>>    sched: Save the per LLC utilization for better cache aware scheduling
>>
>> K Prateek Nayak (1):
>>    sched: Avoid calculating the cpumask if the system is overloaded
>>
>> Peter Zijlstra (1):
>>    sched: Cache aware load-balancing
>>
>> Tim Chen (15):
>>    sched: Add hysteresis to switch a task's preferred LLC
>>    sched: Add helper function to decide whether to allow cache aware
>>      scheduling
>>    sched: Set up LLC indexing
>>    sched: Introduce task preferred LLC field
>>    sched: Calculate the number of tasks that have LLC preference on a
>>      runqueue
>>    sched: Introduce per runqueue task LLC preference counter
>>    sched: Calculate the total number of preferred LLC tasks during load
>>      balance
>>    sched: Tag the sched group as llc_balance if it has tasks prefer other
>>      LLC
>>    sched: Introduce update_llc_busiest() to deal with groups having
>>      preferred LLC tasks
>>    sched: Introduce a new migration_type to track the preferred LLC load
>>      balance
>>    sched: Consider LLC locality for active balance
>>    sched: Consider LLC preference when picking tasks from busiest queue
>>    sched: Do not migrate task if it is moving out of its preferred LLC
>>    sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>>    sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>>      up
>>
>>   include/linux/mm_types.h       |  44 ++
>>   include/linux/sched.h          |   8 +
>>   include/linux/sched/topology.h |   3 +
>>   init/Kconfig                   |   4 +
>>   init/init_task.c               |   3 +
>>   kernel/fork.c                  |   5 +
>>   kernel/sched/core.c            |  25 +-
>>   kernel/sched/debug.c           |   4 +
>>   kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
>>   kernel/sched/features.h        |   3 +
>>   kernel/sched/sched.h           |  23 +
>>   kernel/sched/topology.c        |  29 ++
>>   12 files changed, 982 insertions(+), 28 deletions(-)
>>
>

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Madadi Vineeth Reddy 3 months, 2 weeks ago

Hi Chen,

On 22/06/25 06:09, Chen, Yu C wrote:
> On 6/21/2025 3:25 AM, Madadi Vineeth Reddy wrote:
>> Hi Tim,
>>
>> On 18/06/25 23:57, Tim Chen wrote:
>>> This is the third revision of the cache aware scheduling patches,
>>> based on the original patch proposed by Peter[1].
>>>   The goal of the patch series is to aggregate tasks sharing data
>>> to the same cache domain, thereby reducing cache bouncing and
>>> cache misses, and improve data access efficiency. In the current
>>> implementation, threads within the same process are considered
>>> as entities that potentially share resources.
>>>   In previous versions, aggregation of tasks were done in the
>>> wake up path, without making load balancing paths aware of
>>> LLC (Last-Level-Cache) preference. This led to the following
>>> problems:
>>>
>>> 1) Aggregation of tasks during wake up led to load imbalance
>>>     between LLCs
>>> 2) Load balancing tried to even out the load between LLCs
>>> 3) Wake up tasks aggregation happened at a faster rate and
>>>     load balancing moved tasks in opposite directions, leading
>>>     to continuous and excessive task migrations and regressions
>>>     in benchmarks like schbench.
>>>
>>> In this version, load balancing is made cache-aware. The main
>>> idea of cache-aware load balancing consists of two parts:
>>>
>>> 1) Identify tasks that prefer to run on their hottest LLC and
>>>     move them there.
>>> 2) Prevent generic load balancing from moving a task out of
>>>     its hottest LLC.
>>>
>>> By default, LLC task aggregation during wake-up is disabled.
>>> Conversely, cache-aware load balancing is enabled by default.
>>> For easier comparison, two scheduler features are introduced:
>>> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
>>> wake up and cache-aware load balancing, respectively. By default,
>>> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
>>> is only done on load balancing.
>>
>> Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
>> LLC on this platform spans 4 threads.
>>
>> schbench:
>>                          baseline (sd%)        baseline+cacheaware (sd%)      %change
>> Lat 50.0th-worker-1        6.33 (24.12%)           6.00 (28.87%)               5.21%
>> Lat 90.0th-worker-1        7.67 ( 7.53%)           7.67 (32.83%)               0.00%
>> Lat 99.0th-worker-1        8.67 ( 6.66%)           9.33 (37.63%)              -7.61%
>> Lat 99.9th-worker-1       21.33 (63.99%)          12.33 (28.47%)              42.19%
>>
>> Lat 50.0th-worker-2        4.33 (13.32%)           5.67 (10.19%)             -30.95%
>> Lat 90.0th-worker-2        5.67 (20.38%)           7.67 ( 7.53%)             -35.27%
>> Lat 99.0th-worker-2        7.33 ( 7.87%)           8.33 ( 6.93%)             -13.64%
>> Lat 99.9th-worker-2       11.67 (24.74%)          10.33 (11.17%)              11.48%
>>
>> Lat 50.0th-worker-4        5.00 ( 0.00%)           7.00 ( 0.00%)             -40.00%
>> Lat 90.0th-worker-4        7.00 ( 0.00%)           9.67 ( 5.97%)             -38.14%
>> Lat 99.0th-worker-4        8.00 ( 0.00%)          11.33 (13.48%)             -41.62%
>> Lat 99.9th-worker-4       10.33 ( 5.59%)          14.00 ( 7.14%)             -35.53%
>>
>> Lat 50.0th-worker-8        4.33 (13.32%)           5.67 (10.19%)             -30.95%
>> Lat 90.0th-worker-8        6.33 (18.23%)           8.67 ( 6.66%)             -36.99%
>> Lat 99.0th-worker-8        7.67 ( 7.53%)          10.33 ( 5.59%)             -34.69%
>> Lat 99.9th-worker-8       10.00 (10.00%)          12.33 ( 4.68%)             -23.30%
>>
>> Lat 50.0th-worker-16       4.00 ( 0.00%)           5.00 ( 0.00%)             -25.00%
>> Lat 90.0th-worker-16       6.33 ( 9.12%)           7.67 ( 7.53%)             -21.21%
>> Lat 99.0th-worker-16       8.00 ( 0.00%)          10.33 ( 5.59%)             -29.13%
>> Lat 99.9th-worker-16      12.00 ( 8.33%)          13.33 ( 4.33%)             -11.08%
>>
>> Lat 50.0th-worker-32       5.00 ( 0.00%)           5.33 (10.83%)              -6.60%
>> Lat 90.0th-worker-32       7.00 ( 0.00%)           8.67 (17.63%)             -23.86%
>> Lat 99.0th-worker-32      10.67 (14.32%)          12.67 ( 4.56%)             -18.75%
>> Lat 99.9th-worker-32      14.67 ( 3.94%)          19.00 (13.93%)             -29.49%
>>
>> Lat 50.0th-worker-64       5.33 (10.83%)           6.67 ( 8.66%)             -25.14%
>> Lat 90.0th-worker-64      10.00 (17.32%)          14.33 ( 4.03%)             -43.30%
>> Lat 99.0th-worker-64      14.00 ( 7.14%)          16.67 ( 3.46%)             -19.07%
>> Lat 99.9th-worker-64      55.00 (56.69%)          47.00 (61.92%)              14.55%
>>
>> Lat 50.0th-worker-128      8.00 ( 0.00%)           8.67 (13.32%)              -8.38%
>> Lat 90.0th-worker-128     13.33 ( 4.33%)          14.33 ( 8.06%)              -7.50%
>> Lat 99.0th-worker-128     16.00 ( 0.00%)          20.00 ( 8.66%)             -25.00%
>> Lat 99.9th-worker-128   2258.33 (83.80%)        2974.67 (21.82%)             -31.72%
>>
>> Lat 50.0th-worker-256     47.67 ( 2.42%)          45.33 ( 3.37%)               4.91%
>> Lat 90.0th-worker-256   3470.67 ( 1.88%)        3558.67 ( 0.47%)              -2.54%
>> Lat 99.0th-worker-256   9040.00 ( 2.76%)        9050.67 ( 0.41%)              -0.12%
>> Lat 99.9th-worker-256  13824.00 (20.07%)       13104.00 ( 6.84%)               5.21%
>>
>> The above data shows mostly regression both in the lesser and
>> higher load cases.
>>
>>
>> Hackbench pipe:
>>
>> Pairs   Baseline Avg (s) (Std%)     Patched Avg (s) (Std%)      % Change
>> 2       2.987 (1.19%)               2.414 (17.99%)              24.06%
>> 4       7.702 (12.53%)              7.228 (18.37%)               6.16%
>> 8       14.141 (1.32%)              13.109 (1.46%)               7.29%
>> 15      27.571 (6.53%)              29.460 (8.71%)              -6.84%
>> 30      65.118 (4.49%)              61.352 (4.00%)               5.78%
>> 45      105.086 (9.75%)             97.970 (4.26%)               6.77%
>> 60      149.221 (6.91%)             154.176 (4.17%)             -3.32%
>> 75      199.278 (1.21%)             198.680 (1.37%)              0.30%
>>
>> A lot of run to run variation is seen in hackbench runs. So hard to tell
>> on the performance but looks better than schbench.
> 
> May I know if the cpu frequency was set at a fixed level and deep
> cpu idle states were disabled(I assume on power system it is called
> stop states?)

Deep cpu idle state is called 'cede' in PowerVM LPAR. I have not disabled
it.

> 
>>
>> In Power 10 and Power 11, The LLC size is relatively smaller (4 CPUs)
>> when compared to platforms like sapphire rapids and Milan. Didn't go
>> through this series yet. Will go through and try to understand why
>> schbench is not happy on Power systems.
>>
>> Meanwhile, Wanted to know your thoughts on how does smaller LLC
>> size get impacted with this patch?
>>
> 
> task aggregation on smaller LLC domain(both in terms of the
> number of CPUs and the size of LLC) might bring cache contention
> and hurt performance IMO. May I know what is the cache size on
> your system:
> lscpu | grep "L3 cache"

L3 cache: 224 MiB (56 instances)

> 
> May I know if you tested it with:
> echo NO_SCHED_CACHE > /sys/kernel/debug/sched/features
> echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
> echo NO_SCHED_CACHE_LB > /sys/kernel/debug/sched/features
> 
> vs
> 
> echo SCHED_CACHE > /sys/kernel/debug/sched/features
> echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
> echo SCHED_CACHE_LB > /sys/kernel/debug/sched/features
> 

I have tested with and without this patch series. Didn't change
any sched feature. So, the patched kernel was running with the default
settings:
SCHED_CACHE, NO_SCHED_CACHE_WAKE, and SCHED_CACHE_LB.


> And could you help check if setting /sys/kernel/debug/sched/llc_aggr_cap
> from 50 to some smaller values(25, etc) would help?

Will give it a try.

Thanks,
Madadi Vineeth Reddy

> 
> thanks,
> Chenyu
> 
>> Thanks,
>> Madadi Vineeth Reddy
>>
>>
>>>
>>> With above default settings, task migrations occur less frequently
>>> and no longer happen in the latency-sensitive wake-up path.
>>>
>>
>> [..snip..]
>>
>>>
>>> Chen Yu (3):
>>>    sched: Several fixes for cache aware scheduling
>>>    sched: Avoid task migration within its preferred LLC
>>>    sched: Save the per LLC utilization for better cache aware scheduling
>>>
>>> K Prateek Nayak (1):
>>>    sched: Avoid calculating the cpumask if the system is overloaded
>>>
>>> Peter Zijlstra (1):
>>>    sched: Cache aware load-balancing
>>>
>>> Tim Chen (15):
>>>    sched: Add hysteresis to switch a task's preferred LLC
>>>    sched: Add helper function to decide whether to allow cache aware
>>>      scheduling
>>>    sched: Set up LLC indexing
>>>    sched: Introduce task preferred LLC field
>>>    sched: Calculate the number of tasks that have LLC preference on a
>>>      runqueue
>>>    sched: Introduce per runqueue task LLC preference counter
>>>    sched: Calculate the total number of preferred LLC tasks during load
>>>      balance
>>>    sched: Tag the sched group as llc_balance if it has tasks prefer other
>>>      LLC
>>>    sched: Introduce update_llc_busiest() to deal with groups having
>>>      preferred LLC tasks
>>>    sched: Introduce a new migration_type to track the preferred LLC load
>>>      balance
>>>    sched: Consider LLC locality for active balance
>>>    sched: Consider LLC preference when picking tasks from busiest queue
>>>    sched: Do not migrate task if it is moving out of its preferred LLC
>>>    sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>>>    sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>>>      up
>>>
>>>   include/linux/mm_types.h       |  44 ++
>>>   include/linux/sched.h          |   8 +
>>>   include/linux/sched/topology.h |   3 +
>>>   init/Kconfig                   |   4 +
>>>   init/init_task.c               |   3 +
>>>   kernel/fork.c                  |   5 +
>>>   kernel/sched/core.c            |  25 +-
>>>   kernel/sched/debug.c           |   4 +
>>>   kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
>>>   kernel/sched/features.h        |   3 +
>>>   kernel/sched/sched.h           |  23 +
>>>   kernel/sched/topology.c        |  29 ++
>>>   12 files changed, 982 insertions(+), 28 deletions(-)
>>>
>>

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Yangyu Chen 3 months, 3 weeks ago

Nice work!

I've tested your patch based on commit fb4d33ab452e and found it
incredibly helpful for Verilator with large RTL simulations like
XiangShan [1] on AMD EPYC Geona.

I've created a simple benchmark [2] using a static build of an
8-thread Verilator of XiangShan. Simply clone the repository and
run `make run`.

In a static allocated 8-CCX KVM (with a total of 128 vCPUs) on EPYC
9T24, before the patch, we have a simulation time of 49.348ms. This
was because each thread was distributed across every CCX, resulting
in extremely high core-to-core latency. However, after applying the
patch, the entire 8-thread Verilator is allocated to a single CCX.
Consequently, the simulation time was reduced to 24.196ms, which
is a remarkable 2.03x faster than before. We don't need numactl
anymore!

[1] https://github.com/OpenXiangShan/XiangShan
[2] https://github.com/cyyself/chacha20-xiangshan

Tested-by: Yangyu Chen <cyy@cyyself.name>

Thanks,
Yangyu Chen

On 19/6/2025 02:27, Tim Chen wrote:
> This is the third revision of the cache aware scheduling patches,
> based on the original patch proposed by Peter[1].
>  The goal of the patch series is to aggregate tasks sharing data
> to the same cache domain, thereby reducing cache bouncing and
> cache misses, and improve data access efficiency. In the current
> implementation, threads within the same process are considered
> as entities that potentially share resources.
>  In previous versions, aggregation of tasks were done in the
> wake up path, without making load balancing paths aware of
> LLC (Last-Level-Cache) preference. This led to the following
> problems:
> 1) Aggregation of tasks during wake up led to load imbalance
>    between LLCs
> 2) Load balancing tried to even out the load between LLCs
> 3) Wake up tasks aggregation happened at a faster rate and
>    load balancing moved tasks in opposite directions, leading
>    to continuous and excessive task migrations and regressions
>    in benchmarks like schbench.
> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:
> 1) Identify tasks that prefer to run on their hottest LLC and
>    move them there.
> 2) Prevent generic load balancing from moving a task out of
>    its hottest LLC.
> By default, LLC task aggregation during wake-up is disabled.
> Conversely, cache-aware load balancing is enabled by default.
> For easier comparison, two scheduler features are introduced:
> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
> wake up and cache-aware load balancing, respectively. By default,
> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
> is only done on load balancing.
> With above default settings, task migrations occur less frequently
> and no longer happen in the latency-sensitive wake-up path.
> The load balancing and migration policy are now implemented in
> a single location within the function _get_migrate_hint().
> Debugfs knobs are also introduced to fine-tune the
> _get_migrate_hint() function. Please refer to patch 7 for
> detail.
> Improvements in performance for hackbench are observed in the
> lower load ranges when tested on a 2 socket sapphire rapids with
> 30 cores per socket. The DRAM interleaving is enabled in the
> BIOS so it essential has one NUMA node with two last level
> caches. Hackbench benefits from having all the threads
> in the process running in the same LLC. There are some small
> regressions for the heavily loaded case when not all threads can
> fit in a LLC.
> Hackbench is run with one process, and pairs of threads ping
> ponging message off each other via command with increasing number
> of thread pairs, each test runs for 10 cycles:
> hackbench -g 1 --thread --pipe(socket) -l 1000000 -s 100 -f <pairs>
> case                    load            baseline(std%)  compare%( std%)
> threads-pipe-8          1-groups         1.00 (  2.70)  +24.51 (  0.59)
> threads-pipe-15         1-groups         1.00 (  1.42)  +28.37 (  0.68)
> threads-pipe-30         1-groups         1.00 (  2.53)  +26.16 (  0.11)
> threads-pipe-45         1-groups         1.00 (  0.48)  +35.38 (  0.18)
> threads-pipe-60         1-groups         1.00 (  2.13)  +13.46 ( 12.81)
> threads-pipe-75         1-groups         1.00 (  1.57)  +16.71 (  0.20)
> threads-pipe-90         1-groups         1.00 (  0.22)   -0.57 (  1.21)
> threads-sockets-8       1-groups         1.00 (  2.82)  +23.04 (  0.83)
> threads-sockets-15      1-groups         1.00 (  2.57)  +21.67 (  1.90)
> threads-sockets-30      1-groups         1.00 (  0.75)  +18.78 (  0.09)
> threads-sockets-45      1-groups         1.00 (  1.63)  +18.89 (  0.43)
> threads-sockets-60      1-groups         1.00 (  0.66)  +10.10 (  1.91)
> threads-sockets-75      1-groups         1.00 (  0.44)  -14.49 (  0.43)
> threads-sockets-90      1-groups         1.00 (  0.15)   -8.03 (  3.88)
> Similar tests were also experimented on schbench on the system.
> Overall latency improvement is observed when underloaded and
> regression when overloaded. The regression is significantly
> smaller than the previous version because cache aware aggregation
> is in load balancing rather than in wake up path. Besides, it is
> found that the schbench seems to have large run-to-run variance,
> so the result of schbench might be only used as reference.
> schbench:
>                                    baseline              nowake_lb
> Lat 50.0th-qrtle-1          5.00 (   0.00%)        5.00 (   0.00%)
> Lat 90.0th-qrtle-1          9.00 (   0.00%)        8.00 (  11.11%)
> Lat 99.0th-qrtle-1         15.00 (   0.00%)       15.00 (   0.00%)
> Lat 99.9th-qrtle-1         32.00 (   0.00%)       23.00 (  28.12%)
> Lat 20.0th-qrtle-1        267.00 (   0.00%)      266.00 (   0.37%)
> Lat 50.0th-qrtle-2          8.00 (   0.00%)        4.00 (  50.00%)
> Lat 90.0th-qrtle-2          9.00 (   0.00%)        7.00 (  22.22%)
> Lat 99.0th-qrtle-2         18.00 (   0.00%)       11.00 (  38.89%)
> Lat 99.9th-qrtle-2         26.00 (   0.00%)       25.00 (   3.85%)
> Lat 20.0th-qrtle-2        535.00 (   0.00%)      537.00 (  -0.37%)
> Lat 50.0th-qrtle-4          6.00 (   0.00%)        4.00 (  33.33%)
> Lat 90.0th-qrtle-4          8.00 (   0.00%)        5.00 (  37.50%)
> Lat 99.0th-qrtle-4         13.00 (   0.00%)       10.00 (  23.08%)
> Lat 99.9th-qrtle-4         20.00 (   0.00%)       14.00 (  30.00%)
> Lat 20.0th-qrtle-4       1066.00 (   0.00%)     1050.00 (   1.50%)
> Lat 50.0th-qrtle-8          5.00 (   0.00%)        4.00 (  20.00%)
> Lat 90.0th-qrtle-8          7.00 (   0.00%)        5.00 (  28.57%)
> Lat 99.0th-qrtle-8         11.00 (   0.00%)        8.00 (  27.27%)
> Lat 99.9th-qrtle-8         17.00 (   0.00%)       18.00 (  -5.88%)
> Lat 20.0th-qrtle-8       2140.00 (   0.00%)     2156.00 (  -0.75%)
> Lat 50.0th-qrtle-16         6.00 (   0.00%)        4.00 (  33.33%)
> Lat 90.0th-qrtle-16         7.00 (   0.00%)        6.00 (  14.29%)
> Lat 99.0th-qrtle-16        11.00 (   0.00%)       11.00 (   0.00%)
> Lat 99.9th-qrtle-16        18.00 (   0.00%)       18.00 (   0.00%)
> Lat 20.0th-qrtle-16      4296.00 (   0.00%)     4216.00 (   1.86%)
> Lat 50.0th-qrtle-32         6.00 (   0.00%)        4.00 (  33.33%)
> Lat 90.0th-qrtle-32         7.00 (   0.00%)        5.00 (  28.57%)
> Lat 99.0th-qrtle-32        11.00 (   0.00%)        9.00 (  18.18%)
> Lat 99.9th-qrtle-32        17.00 (   0.00%)       14.00 (  17.65%)
> Lat 20.0th-qrtle-32      8496.00 (   0.00%)     8624.00 (  -1.51%)
> Lat 50.0th-qrtle-64         5.00 (   0.00%)        5.00 (   0.00%)
> Lat 90.0th-qrtle-64         7.00 (   0.00%)        7.00 (   0.00%)
> Lat 99.0th-qrtle-64        11.00 (   0.00%)       11.00 (   0.00%)
> Lat 99.9th-qrtle-64        17.00 (   0.00%)       18.00 (  -5.88%)
> Lat 20.0th-qrtle-64     17120.00 (   0.00%)    15728.00 (   8.13%)
> Lat 50.0th-qrtle-128        6.00 (   0.00%)        6.00 (   0.00%)
> Lat 90.0th-qrtle-128        9.00 (   0.00%)        8.00 (  11.11%)
> Lat 99.0th-qrtle-128       13.00 (   0.00%)       14.00 (  -7.69%)
> Lat 99.9th-qrtle-128       20.00 (   0.00%)       26.00 ( -30.00%)
> Lat 20.0th-qrtle-128    19488.00 (   0.00%)    18784.00 (   3.61%)
> Lat 50.0th-qrtle-239        8.00 (   0.00%)        8.00 (   0.00%)
> Lat 90.0th-qrtle-239       16.00 (   0.00%)       14.00 (  12.50%)
> Lat 99.0th-qrtle-239       45.00 (   0.00%)       41.00 (   8.89%)
> Lat 99.9th-qrtle-239      137.00 (   0.00%)      225.00 ( -64.23%)
> Lat 20.0th-qrtle-239    30432.00 (   0.00%)    29920.00 (   1.68%)
> AMD Milan is also tested. There are 4 Nodes and 32 CPUs per node.
> Each node has 4 CCX(shared LLC) and each CCX has 8 CPUs. Hackbench
> with 1 group test scenario benefits from cache aware load balance
> too:
> hackbench(1 group and fd ranges in [1,6]:
> case                    load            baseline(std%)  compare%( std%)
> threads-pipe-1          1-groups         1.00 (  1.22)   +2.84 (  0.51)
> threads-pipe-2          1-groups         1.00 (  5.82)  +42.82 ( 43.61)
> threads-pipe-3          1-groups         1.00 (  3.49)  +17.33 ( 18.68)
> threads-pipe-4          1-groups         1.00 (  2.49)  +12.49 (  5.89)
> threads-pipe-5          1-groups         1.00 (  1.46)   +8.62 (  4.43)
> threads-pipe-6          1-groups         1.00 (  2.83)  +12.73 (  8.94)
> threads-sockets-1       1-groups         1.00 (  1.31)  +28.68 (  2.25)
> threads-sockets-2       1-groups         1.00 (  5.17)  +34.84 ( 36.90)
> threads-sockets-3       1-groups         1.00 (  1.57)   +9.15 (  5.52)
> threads-sockets-4       1-groups         1.00 (  1.99)  +16.51 (  6.04)
> threads-sockets-5       1-groups         1.00 (  2.39)  +10.88 (  2.17)
> threads-sockets-6       1-groups         1.00 (  1.62)   +7.22 (  2.00)
> Besides a single instance of hackbench, four instances of hackbench are
> also tested on Milan. The test results show that different instances of
> hackbench are aggregated to dedicated LLCs, and performance improvement
> is observed.
> schbench mmtests(unstable)
>                                   baseline              nowake_lb
> Lat 50.0th-qrtle-1         9.00 (   0.00%)        8.00 (  11.11%)
> Lat 90.0th-qrtle-1        12.00 (   0.00%)       10.00 (  16.67%)
> Lat 99.0th-qrtle-1        16.00 (   0.00%)       14.00 (  12.50%)
> Lat 99.9th-qrtle-1        22.00 (   0.00%)       21.00 (   4.55%)
> Lat 20.0th-qrtle-1       759.00 (   0.00%)      759.00 (   0.00%)
> Lat 50.0th-qrtle-2         9.00 (   0.00%)        7.00 (  22.22%)
> Lat 90.0th-qrtle-2        12.00 (   0.00%)       12.00 (   0.00%)
> Lat 99.0th-qrtle-2        16.00 (   0.00%)       15.00 (   6.25%)
> Lat 99.9th-qrtle-2        22.00 (   0.00%)       21.00 (   4.55%)
> Lat 20.0th-qrtle-2      1534.00 (   0.00%)     1510.00 (   1.56%)
> Lat 50.0th-qrtle-4         8.00 (   0.00%)        9.00 ( -12.50%)
> Lat 90.0th-qrtle-4        12.00 (   0.00%)       12.00 (   0.00%)
> Lat 99.0th-qrtle-4        15.00 (   0.00%)       16.00 (  -6.67%)
> Lat 99.9th-qrtle-4        21.00 (   0.00%)       23.00 (  -9.52%)
> Lat 20.0th-qrtle-4      3076.00 (   0.00%)     2860.00 (   7.02%)
> Lat 50.0th-qrtle-8        10.00 (   0.00%)        9.00 (  10.00%)
> Lat 90.0th-qrtle-8        12.00 (   0.00%)       13.00 (  -8.33%)
> Lat 99.0th-qrtle-8        17.00 (   0.00%)       17.00 (   0.00%)
> Lat 99.9th-qrtle-8        22.00 (   0.00%)       24.00 (  -9.09%)
> Lat 20.0th-qrtle-8      6232.00 (   0.00%)     5896.00 (   5.39%)
> Lat 50.0th-qrtle-16        9.00 (   0.00%)        9.00 (   0.00%)
> Lat 90.0th-qrtle-16       13.00 (   0.00%)       13.00 (   0.00%)
> Lat 99.0th-qrtle-16       17.00 (   0.00%)       18.00 (  -5.88%)
> Lat 99.9th-qrtle-16       23.00 (   0.00%)       26.00 ( -13.04%)
> Lat 20.0th-qrtle-16    10096.00 (   0.00%)    10352.00 (  -2.54%)
> Lat 50.0th-qrtle-32       15.00 (   0.00%)       15.00 (   0.00%)
> Lat 90.0th-qrtle-32       25.00 (   0.00%)       26.00 (  -4.00%)
> Lat 99.0th-qrtle-32       49.00 (   0.00%)       50.00 (  -2.04%)
> Lat 99.9th-qrtle-32      945.00 (   0.00%)     1005.00 (  -6.35%)
> Lat 20.0th-qrtle-32    11600.00 (   0.00%)    11632.00 (  -0.28%)
> Netperf/Tbench have not been tested yet. As they are single-process
> benchmarks that are not the target of this cache-aware scheduling.
> Additionally, client and server components should be tested on
> different machines or bound to different nodes. Otherwise,
> cache-aware scheduling might harm their performance: placing client
> and server in the same LLC could yield higher throughput due to
> improved cache locality in the TCP/IP stack, whereas cache-aware
> scheduling aims to place them in dedicated LLCs.
> This patch set is applied on v6.15 kernel.
>  There are some further work needed for future versions in this
> patch set.  We will need to align NUMA balancing with LLC aggregations
> such that LLC aggregation will align with the preferred NUMA node.
> Comments and tests are much appreciated.
> [1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> The patches are grouped as follow:
> Patch 1:     Peter's original patch.
> Patch 2-5:   Various fixes and tuning of the original v1 patch.
> Patch 6-12:  Infrastructure and helper functions for load balancing to be cache aware.
> Patch 13-18: Add logic to load balancing for preferred LLC aggregation.
> Patch 19:    Add process LLC aggregation in load balancing sched feature.
> Patch 20:    Add Process LLC aggregation in wake up sched feature (turn off by default).
> v1:
> https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> v2:
> https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/
> Chen Yu (3):
>   sched: Several fixes for cache aware scheduling
>   sched: Avoid task migration within its preferred LLC
>   sched: Save the per LLC utilization for better cache aware scheduling
> K Prateek Nayak (1):
>   sched: Avoid calculating the cpumask if the system is overloaded
> Peter Zijlstra (1):
>   sched: Cache aware load-balancing
> Tim Chen (15):
>   sched: Add hysteresis to switch a task's preferred LLC
>   sched: Add helper function to decide whether to allow cache aware
>     scheduling
>   sched: Set up LLC indexing
>   sched: Introduce task preferred LLC field
>   sched: Calculate the number of tasks that have LLC preference on a
>     runqueue
>   sched: Introduce per runqueue task LLC preference counter
>   sched: Calculate the total number of preferred LLC tasks during load
>     balance
>   sched: Tag the sched group as llc_balance if it has tasks prefer other
>     LLC
>   sched: Introduce update_llc_busiest() to deal with groups having
>     preferred LLC tasks
>   sched: Introduce a new migration_type to track the preferred LLC load
>     balance
>   sched: Consider LLC locality for active balance
>   sched: Consider LLC preference when picking tasks from busiest queue
>   sched: Do not migrate task if it is moving out of its preferred LLC
>   sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>   sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>     up
>  include/linux/mm_types.h       |  44 ++
>  include/linux/sched.h          |   8 +
>  include/linux/sched/topology.h |   3 +
>  init/Kconfig                   |   4 +
>  init/init_task.c               |   3 +
>  kernel/fork.c                  |   5 +
>  kernel/sched/core.c            |  25 +-
>  kernel/sched/debug.c           |   4 +
>  kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
>  kernel/sched/features.h        |   3 +
>  kernel/sched/sched.h           |  23 +
>  kernel/sched/topology.c        |  29 ++
>  12 files changed, 982 insertions(+), 28 deletions(-)

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Chen, Yu C 3 months, 3 weeks ago

On 6/19/2025 2:39 PM, Yangyu Chen wrote:
> Nice work!
> 
> I've tested your patch based on commit fb4d33ab452e and found it
> incredibly helpful for Verilator with large RTL simulations like
> XiangShan [1] on AMD EPYC Geona.
> 
> I've created a simple benchmark [2] using a static build of an
> 8-thread Verilator of XiangShan. Simply clone the repository and
> run `make run`.
> 
> In a static allocated 8-CCX KVM (with a total of 128 vCPUs) on EPYC
> 9T24, before the patch, we have a simulation time of 49.348ms. This
> was because each thread was distributed across every CCX, resulting
> in extremely high core-to-core latency. However, after applying the
> patch, the entire 8-thread Verilator is allocated to a single CCX.
> Consequently, the simulation time was reduced to 24.196ms, which
> is a remarkable 2.03x faster than before. We don't need numactl
> anymore!
> 
> [1] https://github.com/OpenXiangShan/XiangShan
> [2] https://github.com/cyyself/chacha20-xiangshan
> 
> Tested-by: Yangyu Chen <cyy@cyyself.name>
> 

Thanks Yangyu for your test. May I know if these 8-threads have any
data sharing with each other, or each thread has their dedicated
data? Or, there is 1 main thread, the other 7 threads do the
chacha20 rotate and put the result to the main thread?
Anyway I tested it on a Xeon EMR with turbo-disabled and saw ~20%
reduction in the total time.

Thanks,
Chenyu

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Yangyu Chen 3 months, 3 weeks ago

> On 19 Jun 2025, at 21:21, Chen, Yu C <yu.c.chen@intel.com> wrote:
> 
> On 6/19/2025 2:39 PM, Yangyu Chen wrote:
>> Nice work!
>> I've tested your patch based on commit fb4d33ab452e and found it
>> incredibly helpful for Verilator with large RTL simulations like
>> XiangShan [1] on AMD EPYC Geona.
>> I've created a simple benchmark [2] using a static build of an
>> 8-thread Verilator of XiangShan. Simply clone the repository and
>> run `make run`.
>> In a static allocated 8-CCX KVM (with a total of 128 vCPUs) on EPYC
>> 9T24, before the patch, we have a simulation time of 49.348ms. This
>> was because each thread was distributed across every CCX, resulting
>> in extremely high core-to-core latency. However, after applying the
>> patch, the entire 8-thread Verilator is allocated to a single CCX.
>> Consequently, the simulation time was reduced to 24.196ms, which
>> is a remarkable 2.03x faster than before. We don't need numactl
>> anymore!
>> [1] https://github.com/OpenXiangShan/XiangShan
>> [2] https://github.com/cyyself/chacha20-xiangshan
>> Tested-by: Yangyu Chen <cyy@cyyself.name>
> 
> Thanks Yangyu for your test. May I know if these 8-threads have any
> data sharing with each other, or each thread has their dedicated
> data? Or, there is 1 main thread, the other 7 threads do the
> chacha20 rotate and put the result to the main thread?

Ah, I had forgotten to mention the benchmark. The workload is not
about chacha20 itself. This benchmark utilizes a RTL-level simulator
[1] that runs an Open Source OoO CPU core called XiangShan [2]. The
chacha20 algorithm is executed on the guest CPU within this simulator.

The verilator partitions a large RTL design into multiple blocks
of functions and distributes them to each thread. These signals
require synchronization every guest cycle, and synchronization is
also necessary when a dependency exists. Given that we have
approximately 5K guest cycles per second, there is a significant
amount of data that needs to be transferred between each thread.
If there are signal dependencies, this could lead to latency-bound
performance.

[1] https://github.com/verilator/verilator
[2] https://github.com/OpenXiangShan/XiangShan

Thanks,
Yangyu Chen

> Anyway I tested it on a Xeon EMR with turbo-disabled and saw ~20%
> reduction in the total time.

Nice result!

> 
> Thanks,
> Chenyu

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Madadi Vineeth Reddy 3 months ago

On 18/06/25 23:57, Tim Chen wrote:
> This is the third revision of the cache aware scheduling patches,
> based on the original patch proposed by Peter[1].
>  
> The goal of the patch series is to aggregate tasks sharing data
> to the same cache domain, thereby reducing cache bouncing and
> cache misses, and improve data access efficiency. In the current
> implementation, threads within the same process are considered
> as entities that potentially share resources.

[..snip..]

> 
> Comments and tests are much appreciated.

When running ebizzy as below:
ebizzy -t 8 -S 10

I see ~24% degradation on the patched kernel, due to higher SMT2 and
SMT4 cycles compared to the baseline. ST cycles decreased.

Since both P10 and P11 have LLC shared at the SMT4 level, even spawning
fewer threads easily crowds the LLC with the default llc_aggr_cap value
of 50. Increasing this value would likely make things worse, while
decreasing it to 25 effectively disables cache-aware scheduling
(as it limits selection to just one CPU).

I understand that ebizzy itself doesn't benefit from cache sharing, so
it might not improve but here it actually *regresses*, and the impact
may be even larger on P10 /P11 because of its smaller LLC shared by 4
CPUs, even with fewer threads. IPC drops.

By default, the SCHED_CACHE feature is enabled. Given these results for
workloads that don't share cache and on systems with smaller LLCs, I think
the default value should be revisited.

Thanks,
Madadi Vineeth Reddy

> 
> [1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> 
> The patches are grouped as follow:
> Patch 1:     Peter's original patch.
> Patch 2-5:   Various fixes and tuning of the original v1 patch.
> Patch 6-12:  Infrastructure and helper functions for load balancing to be cache aware.
> Patch 13-18: Add logic to load balancing for preferred LLC aggregation.
> Patch 19:    Add process LLC aggregation in load balancing sched feature.
> Patch 20:    Add Process LLC aggregation in wake up sched feature (turn off by default).
> 
> v1:
> https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> v2:
> https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/
> 
> 
> Chen Yu (3):
>   sched: Several fixes for cache aware scheduling
>   sched: Avoid task migration within its preferred LLC
>   sched: Save the per LLC utilization for better cache aware scheduling
> 
> K Prateek Nayak (1):
>   sched: Avoid calculating the cpumask if the system is overloaded
> 
> Peter Zijlstra (1):
>   sched: Cache aware load-balancing
> 
> Tim Chen (15):
>   sched: Add hysteresis to switch a task's preferred LLC
>   sched: Add helper function to decide whether to allow cache aware
>     scheduling
>   sched: Set up LLC indexing
>   sched: Introduce task preferred LLC field
>   sched: Calculate the number of tasks that have LLC preference on a
>     runqueue
>   sched: Introduce per runqueue task LLC preference counter
>   sched: Calculate the total number of preferred LLC tasks during load
>     balance
>   sched: Tag the sched group as llc_balance if it has tasks prefer other
>     LLC
>   sched: Introduce update_llc_busiest() to deal with groups having
>     preferred LLC tasks
>   sched: Introduce a new migration_type to track the preferred LLC load
>     balance
>   sched: Consider LLC locality for active balance
>   sched: Consider LLC preference when picking tasks from busiest queue
>   sched: Do not migrate task if it is moving out of its preferred LLC
>   sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>   sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>     up
> 
>  include/linux/mm_types.h       |  44 ++
>  include/linux/sched.h          |   8 +
>  include/linux/sched/topology.h |   3 +
>  init/Kconfig                   |   4 +
>  init/init_task.c               |   3 +
>  kernel/fork.c                  |   5 +
>  kernel/sched/core.c            |  25 +-
>  kernel/sched/debug.c           |   4 +
>  kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
>  kernel/sched/features.h        |   3 +
>  kernel/sched/sched.h           |  23 +
>  kernel/sched/topology.c        |  29 ++
>  12 files changed, 982 insertions(+), 28 deletions(-)
>

Re: [RFC patch v3 00/20] Cache aware scheduling

Posted by Chen, Yu C 3 months ago

On 7/10/2025 3:39 AM, Madadi Vineeth Reddy wrote:
> On 18/06/25 23:57, Tim Chen wrote:
>> This is the third revision of the cache aware scheduling patches,
>> based on the original patch proposed by Peter[1].
>>   
>> The goal of the patch series is to aggregate tasks sharing data
>> to the same cache domain, thereby reducing cache bouncing and
>> cache misses, and improve data access efficiency. In the current
>> implementation, threads within the same process are considered
>> as entities that potentially share resources.
> 
> [..snip..]
> 
>>
>> Comments and tests are much appreciated.
> 
> When running ebizzy as below:
> ebizzy -t 8 -S 10
> 
> I see ~24% degradation on the patched kernel, due to higher SMT2 and
> SMT4 cycles compared to the baseline. ST cycles decreased.
> 
> Since both P10 and P11 have LLC shared at the SMT4 level, even spawning
> fewer threads easily crowds the LLC with the default llc_aggr_cap value
> of 50. Increasing this value would likely make things worse, while
> decreasing it to 25 effectively disables cache-aware scheduling
> (as it limits selection to just one CPU).
> 
> I understand that ebizzy itself doesn't benefit from cache sharing, so
> it might not improve but here it actually *regresses*, and the impact
> may be even larger on P10 /P11 because of its smaller LLC shared by 4
> CPUs, even with fewer threads. IPC drops.
> 
> By default, the SCHED_CACHE feature is enabled. Given these results for
> workloads that don't share cache and on systems with smaller LLCs, I think
> the default value should be revisited.
> 

Thanks for the test. I agree with you. The SMT number,
the L3 cache size, the workload's working set size should
all be considered to find a proper threshold to enable/disable
task aggregation.

thanks,
Chenyu

> Thanks,
> Madadi Vineeth Reddy
>