[RFC PATCH 0/2] sched/fair: Reorder scheduling related structs to reduce cache misses
Posted by Zecheng Li 10 months, 1 week ago
Reorder the fields within `struct cfs_rq` and `struct sched_entity` to
improve cache locality. This reduces cache misses and improves the
performance of CFS scheduling-related operations, particularly on
servers with hundreds of cores and ~1000 cgroups.

The reordering is based on kernel data-type profiling
(https://lwn.net/Articles/955709/), which identifies hot fields and
fields that are frequently accessed together.

This reordering aims to optimize cache utilization and improve the
performance of scheduling-related functions, particularly
`tg_throttle_down`, `tg_unthrottle_up`, and `__update_load_avg_cfs_rq`.
The reordering mainly targets the case where `CONFIG_FAIR_GROUP_SCHED`
is enabled. When it is disabled, there is no CFS bandwidth control and
only a single `cfs_rq` exists per CPU, so its layout does not
significantly impact performance.
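
For illustration, the general shape of such a grouping with the
cacheline-group helpers from <linux/cache.h> is sketched below; the
field selection is only a hypothetical example, not the exact layout
used in these patches:

/*
 * Illustrative fragment only: the field selection below is hypothetical;
 * the actual patches pick the groups from the data-type profile.
 */
#include <linux/cache.h>	/* __cacheline_group_{begin,end}_aligned() */

struct cfs_rq {
	/* Frequently accessed fields packed into one cache line group */
	__cacheline_group_begin_aligned(hot);
	struct load_weight	load;
	unsigned int		nr_running;
	u64			min_vruntime;
	struct sched_avg	avg;
	__cacheline_group_end_aligned(hot);

	/* ... colder fields follow ... */
};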

We use a benchmark with multiple cgroup levels to simulate real server
load. The benchmark constructs a tree-structured cgroup hierarchy, with
“width” and “depth” parameters controlling the number of children per
node and the depth of the tree. Each leaf cgroup runs a `schbench`
workload and is given a quota of 80% of the total CPU time divided by
the number of leaf cgroups (in other words, the target CPU load is
80%), to exercise the throttling functions. The bandwidth control
period is set to 10ms. We run the benchmark on Intel and AMD machines;
each machine has hundreds of threads.
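
A simplified sketch of the hierarchy construction is shown below
(cgroup v2 assumed). The paths, helper, and parameters are illustrative
only; enabling the cpu controller via cgroup.subtree_control, error
handling, and launching `schbench` in each leaf are omitted:

/*
 * Rough sketch of the benchmark cgroup setup; not the actual benchmark code.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define PERIOD_US 10000			/* 10ms bandwidth control period */

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF)
		exit(1);
	fclose(f);
}

/* Build a tree with `width` children per node and `depth` levels. */
static void make_tree(const char *parent, int depth, int width,
		      long leaf_quota_us)
{
	char path[4096], buf[64];

	for (int i = 0; i < width; i++) {
		snprintf(path, sizeof(path), "%s/c%d", parent, i);
		mkdir(path, 0755);
		if (depth > 1) {
			make_tree(path, depth - 1, width, leaf_quota_us);
			continue;
		}
		/* Leaf cgroup: cpu.max takes "<quota_us> <period_us>" */
		strcat(path, "/cpu.max");
		snprintf(buf, sizeof(buf), "%ld %d", leaf_quota_us, PERIOD_US);
		write_str(path, buf);
	}
}

int main(void)
{
	int depth = 3, width = 10;	/* "d3 w10": 10^3 = 1000 leaves */
	long nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
	long nr_leaves = 1;

	for (int i = 0; i < depth; i++)
		nr_leaves *= width;

	/* Per-leaf quota: 80% of total CPU time per period / nr_leaves */
	long leaf_quota_us = nr_cpus * PERIOD_US * 8 / 10 / nr_leaves;

	mkdir("/sys/fs/cgroup/bench", 0755);
	make_tree("/sys/fs/cgroup/bench", depth, width, leaf_quota_us);
	return 0;
}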

The table below shows kernel LLC load misses collected over 30 seconds.
d3 w10 (wider tree) denotes a cgroup hierarchy of 3 levels with 10
children per level, totaling 1000 leaf cgroups; d5 w4 denotes a deeper
tree with more levels (1024 leaf cgroups). Each benchmark is run 10
times; the table shows 95% confidence intervals of the kernel LLC
misses in millions.


| Kernel LLC Misses | d3 w10            | d5 w4             |
+-------------------+-------------------+-------------------+
| AMD-orig          | [3025.5, 3344.1]M | [3382.4, 3607.8]M |
| AMD-opt           | [2410.7, 2556.9]M | [2565.4, 2931.2]M |
| Change            | -22.01%           | -21.37%           |
| Intel-orig        | [1157.2, 1249.0]M | [1343.7, 1630.7]M |
| Intel-opt         | [960.2, 1023.0]M  | [1092.7, 1350.7]M |
| Change            | -17.59%           | -17.86%           |

Since the benchmark caps the CPU quota, the RPS reported by `schbench`
did not show a statistically significant improvement, as it does not
reflect the reduction in kernel overhead.

Perf data shows the reduction in the percentage of LLC misses within
the kernel for the depth-5, width-4 workload. The symbols listed are
the union of the top 10 symbols in the original and optimized profiles.

| Symbol                                | Intel-orig | Intel-opt |
+---------------------------------------+------------+-----------+
| worker_thread                         | 75.41%     | 78.95%    |
| tg_unthrottle_up                      | 3.21%      | 1.61%     |
| tg_throttle_down                      | 2.42%      | 1.77%     |
| __update_load_avg_cfs_rq              | 1.95%      | 1.60%     |
| walk_tg_tree_from                     | 1.23%      | 0.91%     |
| sched_balance_update_blocked_averages | 1.09%      | 1.13%     |
| sched_balance_rq                      | 1.03%      | 1.08%     |
| _raw_spin_lock                        | 1.01%      | 1.23%     |
| task_mm_cid_work                      | 0.87%      | 1.09%     |
| __update_load_avg_se                  | 0.78%      | 0.48%     |

| Symbol                                | AMD-orig | AMD-opt |
+---------------------------------------+----------+---------+
| worker_thread                         | 53.97%   | 61.49%  |
| sched_balance_update_blocked_averages | 3.94%    | 2.48%   |
| __update_load_avg_cfs_rq              | 3.52%    | 2.62%   |
| update_load_avg                       | 2.66%    | 2.19%   |
| tg_throttle_down                      | 1.99%    | 1.57%   |
| tg_unthrottle_up                      | 1.98%    | 1.34%   |
| __update_load_avg_se                  | 1.89%    | 1.32%   |
| walk_tg_tree_from                     | 1.79%    | 1.37%   |
| sched_clock_noinstr                   | 1.59%    | 1.01%   |
| sched_balance_rq                      | 1.53%    | 1.26%   |
| _raw_spin_lock                        | 1.47%    | 1.41%   |
| task_mm_cid_work                      | 1.34%    | 1.42%   |

The share of LLC misses attributed to these scheduling-related
functions is reduced.

Zecheng Li (2):
  sched/fair: Reorder struct cfs_rq
  sched/fair: Reorder struct sched_entity

 include/linux/sched.h | 37 +++++++++++---------
 kernel/sched/core.c   | 81 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h  | 70 +++++++++++++++++++++++--------------
 3 files changed, 144 insertions(+), 44 deletions(-)


base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
-- 
2.49.0
Re: [RFC PATCH 0/2] sched/fair: Reorder scheduling related structs to reduce cache misses
Posted by Madadi Vineeth Reddy 10 months ago
Hi Zecheng Li,

On 03/04/25 02:59, Zecheng Li wrote:
> Reorder the fields within `struct cfs_rq` and `struct sched_entity` to
> improve cache locality. This reduces cache misses and improves the
> performance of CFS scheduling-related operations, particularly on
> servers with hundreds of cores and ~1000 cgroups.
> 
> The reordering is based on kernel data-type profiling
> (https://lwn.net/Articles/955709/), which identifies hot fields and
> fields that are frequently accessed together.

This patch series is based on reordering optimized for systems with
64-byte cache lines. On 128-byte L1 D-cache systems like Power10, this
might or might not be beneficial. Moreover, a lot of space (almost
half) would be wasted on the cache line due to APIs like
`__cacheline_group_begin_aligned` and `__cacheline_group_end_aligned`
that may restrict the size to 64 bytes.

Since this is in generic code, any ideas on how to make sure that
other architectures with a different cache line size don't suffer?

[..snip..]

> 
> 
> | Kernel LLC Misses | d3 w10            | d5 w4             |
> +-------------------+-------------------+-------------------+
> | AMD-orig          | [3025.5, 3344.1]M | [3382.4, 3607.8]M |
> | AMD-opt           | [2410.7, 2556.9]M | [2565.4, 2931.2]M |
> | Change            | -22.01%           | -21.37%           |
> | Intel-orig        | [1157.2, 1249.0]M | [1343.7, 1630.7]M |
> | Intel-opt         | [960.2, 1023.0]M  | [1092.7, 1350.7]M |
> | Change            | -17.59%           | -17.86%           |
> 
> Since the benchmark caps the CPU quota, the RPS reported by `schbench`
> did not show a statistically significant improvement, as it does not
> reflect the reduction in kernel overhead.
> 
> Perf data shows the reduction in the percentage of LLC misses within
> the kernel for the depth-5, width-4 workload. The symbols listed are
> the union of the top 10 symbols in the original and optimized profiles.
> 
> | Symbol                                | Intel-orig | Intel-opt |
> +---------------------------------------+------------+-----------+
> | worker_thread                         | 75.41%     | 78.95%    |
> | tg_unthrottle_up                      | 3.21%      | 1.61%     |
> | tg_throttle_down                      | 2.42%      | 1.77%     |
> | __update_load_avg_cfs_rq              | 1.95%      | 1.60%     |
> | walk_tg_tree_from                     | 1.23%      | 0.91%     |
> | sched_balance_update_blocked_averages | 1.09%      | 1.13%     |
> | sched_balance_rq                      | 1.03%      | 1.08%     |
> | _raw_spin_lock                        | 1.01%      | 1.23%     |
> | task_mm_cid_work                      | 0.87%      | 1.09%     |
> | __update_load_avg_se                  | 0.78%      | 0.48%     |
> 
> | Symbol                                | AMD-orig | AMD-opt |
> +---------------------------------------+----------+---------+
> | worker_thread                         | 53.97%   | 61.49%  |
> | sched_balance_update_blocked_averages | 3.94%    | 2.48%   |
> | __update_load_avg_cfs_rq              | 3.52%    | 2.62%   |
> | update_load_avg                       | 2.66%    | 2.19%   |
> | tg_throttle_down                      | 1.99%    | 1.57%   |
> | tg_unthrottle_up                      | 1.98%    | 1.34%   |
> | __update_load_avg_se                  | 1.89%    | 1.32%   |
> | walk_tg_tree_from                     | 1.79%    | 1.37%   |
> | sched_clock_noinstr                   | 1.59%    | 1.01%   |
> | sched_balance_rq                      | 1.53%    | 1.26%   |
> | _raw_spin_lock                        | 1.47%    | 1.41%   |
> | task_mm_cid_work                      | 1.34%    | 1.42%   |
> 
> The share of LLC misses attributed to these scheduling-related
> functions is reduced.

Due to the reordering of the fields, there might be some workloads
that take a hit. Maybe try running workloads of different kinds
(latency- and throughput-oriented) and make sure the regression is
not high.

Thanks,
Madadi Vineeth Reddy

> 
> Zecheng Li (2):
>   sched/fair: Reorder struct cfs_rq
>   sched/fair: Reorder struct sched_entity
> 
>  include/linux/sched.h | 37 +++++++++++---------
>  kernel/sched/core.c   | 81 ++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/sched.h  | 70 +++++++++++++++++++++++--------------
>  3 files changed, 144 insertions(+), 44 deletions(-)
> 
> 
> base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
Re: [RFC PATCH 0/2] sched/fair: Reorder scheduling related structs to reduce cache misses
Posted by ZECHENG LI 9 months, 3 weeks ago
Hi Madadi Vineeth Reddy,

> This patch series is based on reordering optimized for systems with
> 64-byte cache lines. On 128-byte L1 D-cache systems like Power10, this
> might or might not be beneficial. Moreover, a lot of space (almost
> half) would be wasted on the cache line due to APIs like
> `__cacheline_group_begin_aligned` and `__cacheline_group_end_aligned`
> that may restrict the size to 64 bytes.
>
> Since this is in generic code, any ideas on how to make sure that
> other architectures with a different cache line size don't suffer?

We propose to conditionally align to the cacheline boundary only when
the cacheline size is 64 bytes, since this is the most common size.

For architectures with 128-byte cachelines (like PowerPC), this
approach will still collocate the hot fields, providing some
performance benefit from improved locality, but it will not enforce
alignment to the larger 128-byte boundary. This avoids wasting cache
space on those architectures due to padding introduced by the
alignment, while still gaining benefits from collocating frequently
accessed fields.

> Due to the reordering of the fields, there might be some workloads
> that take a hit. Maybe try running workloads of different kinds
> (latency- and throughput-oriented) and make sure the regression is
> not high.

For workloads running without a cgroup hierarchy, we expect a small
performance impact. This is because there is only one cfs_rq per CPU
in this scenario, which is likely in cache due to frequent access.

For workloads with a cgroup hierarchy, I have tested sysbench threads
and hackbench --thread; there is no obvious regression.

Heavy load on 1024 instances of sysbench:
Latency (ms), after-patch, original
avg avg: 2133.51, 2150.97
avg min: 21.9629, 20.9413
avg max: 5955.8, 5966.78

Avg runtime for 1024 instances of ./hackbench --thread -g 2 -l 1000
in a cgroup hierarchy:
After-patch: 34.9458s, Original: 36.8647s

We plan to include more benchmark results in the v2 patch. Do you have
suggestions for other benchmarks you would like us to test?

Regards,
Zecheng
Re: [RFC PATCH 0/2] sched/fair: Reorder scheduling related structs to reduce cache misses
Posted by Madadi Vineeth Reddy 9 months, 3 weeks ago
On 19/04/25 02:28, ZECHENG LI wrote:
> Hi Madadi Vineeth Reddy,
> 
>> This patch series is based on reordering optimized for systems with
>> 64-byte cache lines. On 128-byte L1 D-cache systems like Power10, this
>> might or might not be beneficial. Moreover, a lot of space (almost
>> half) would be wasted on the cache line due to APIs like
>> `__cacheline_group_begin_aligned` and `__cacheline_group_end_aligned`
>> that may restrict the size to 64 bytes.
>>
>> Since this is in generic code, any ideas on how to make sure that
>> other architectures with a different cache line size don't suffer?
> 
> We propose to conditionally align to the cacheline boundary only when
> the cacheline size is 64 bytes, since this is the most common size.
> 
> For architectures with 128-byte cachelines (like PowerPC), this
> approach will still collocate the hot fields, providing some
> performance benefit from improved locality, but it will not enforce
> alignment to the larger 128-byte boundary. This avoids wasting cache

I don't see the check to enforce the alignment only for 64 bytes.
IIUC, the macros seem to apply the alignment unconditionally based on
the arch-specific cacheline size. I might be missing something; could
you clarify?

> space on those architectures due to padding introduced by the
> alignment, while still gaining benefits from collocating frequently
> accessed fields.
> 
>> Due to the reordering of the fields, there might be some workloads
>> that take a hit. Maybe try running workloads of different kinds
>> (latency- and throughput-oriented) and make sure the regression is
>> not high.
> 
> For workloads running without a cgroup hierarchy, we expect a small
> performance impact. This is because there is only one cfs_rq per CPU
> in this scenario, which is likely in cache due to frequent access.
> 
> For workloads with a cgroup hierarchy, I have tested sysbench threads
> and hackbench --thread; there is no obvious regression.
> 
> Heavy load on 1024 instances of sysbench:
> Latency (ms), after-patch, original
> avg avg: 2133.51, 2150.97
> avg min: 21.9629, 20.9413
> avg max: 5955.8, 5966.78
> 
> Avg runtime for 1024 instances of ./hackbench --thread -g 2 -l 1000
> in a cgroup hierarchy:
> After-patch: 34.9458s, Original: 36.8647s
> 
> We plan to include more benchmark results in the v2 patch. Do you have
> suggestions for other benchmarks you would like us to test?

Maybe some throughput-oriented workloads like ebizzy and sysbench, and
also some real-life workloads, would be good to include.

Thanks,
Madadi Vineeth Reddy

> 
> Regards,
> Zecheng