Reorder the fields within `struct cfs_rq` and `struct sched_entity` to
improve cache locality. This can reduce cache misses and improve the
performance of CFS scheduling-related operations, particularly on
servers with hundreds of cores and ~1000 cgroups.

The reordering is based on kernel data-type profiling
(https://lwn.net/Articles/955709/), which indicates the hot fields and
the fields that are frequently accessed together.

This reordering aims to optimize cache utilization and improve the
performance of scheduling-related functions, particularly
`tg_throttle_down`, `tg_unthrottle_up`, and `__update_load_avg_cfs_rq`.

The reordering mainly considers performance when
`CONFIG_FAIR_GROUP_SCHED` is enabled. When it is disabled, there is no
CFS bandwidth control and only a single `cfs_rq` exists per CPU, so its
layout would not significantly impact performance.
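As an illustration of the kind of layout change involved (a sketch
only: the specific fields and group names below are illustrative and
not the exact layout in the patches; the `__cacheline_group_*` helpers
come from <linux/cache.h>), the idea is to pull fields that hot paths
touch together into named cacheline groups:

  /*
   * Illustrative sketch (kernel context, CONFIG_FAIR_GROUP_SCHED /
   * CONFIG_CFS_BANDWIDTH): group fields that tg_throttle_down() and
   * tg_unthrottle_up() touch together on every (un)throttle walk of
   * the cgroup tree so they share a cache line.
   */
  struct cfs_rq {
          __cacheline_group_begin_aligned(hot_throttle);
          int                     throttle_count;
          int                     on_list;
          u64                     throttled_clock_pelt;
          u64                     throttled_clock_pelt_time;
          struct list_head        leaf_cfs_rq_list;
          struct rq               *rq;
          __cacheline_group_end_aligned(hot_throttle);

          /* ... remaining (colder) fields ... */
  };

Compile-time checks such as CACHELINE_ASSERT_GROUP_SIZE() from
<linux/cache.h>, invoked from an init function, can then be used to
catch a hot group accidentally growing beyond one cache line.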
We use a benchmark with multiple cgroup levels to simulate real server
load. The benchmark constructs a tree-structured hierarchy of cgroups,
with "width" and "depth" parameters controlling the number of children
per node and the depth of the tree. Each leaf cgroup runs a `schbench`
workload and is given a CPU quota of 80% of the total CPU time divided
by the number of leaf cgroups (in other words, the aggregate target CPU
load is 80%) to exercise the throttling functions. The bandwidth
control period is set to 10 ms.

We run the benchmark on Intel and AMD machines, each with hundreds of
CPU threads, and collect kernel LLC load misses over 30 seconds.
"d3 w10" (wider tree) means a cgroup hierarchy of 3 levels where each
node has 10 children, totaling 1000 leaf cgroups; "d5 w4" represents a
deeper tree with more hierarchy levels. Each benchmark is run 10 times;
the table shows 95% confidence intervals of the kernel LLC misses in
millions.

| Kernel LLC Misses | d3 w10            | d5 w4             |
+-------------------+-------------------+-------------------+
| AMD-orig          | [3025.5, 3344.1]M | [3382.4, 3607.8]M |
| AMD-opt           | [2410.7, 2556.9]M | [2565.4, 2931.2]M |
| Change            | -22.01%           | -21.37%           |
| Intel-orig        | [1157.2, 1249.0]M | [1343.7, 1630.7]M |
| Intel-opt         | [960.2, 1023.0]M  | [1092.7, 1350.7]M |
| Change            | -17.59%           | -17.86%           |

Since the benchmark limits the CPU quota, the RPS results reported by
`schbench` did not show a statistically significant improvement, as
they do not reflect the kernel overhead reduction.

Perf data shows the reduction of the LLC-miss percentage within the
kernel for the depth-5, width-4 workload. The symbols are taken from
the union of the top 10 symbols in the original and optimized profiles.

| Symbol                                | Intel-orig | Intel-opt |
+---------------------------------------+------------+-----------+
| worker_thread                         | 75.41%     | 78.95%    |
| tg_unthrottle_up                      | 3.21%      | 1.61%     |
| tg_throttle_down                      | 2.42%      | 1.77%     |
| __update_load_avg_cfs_rq              | 1.95%      | 1.60%     |
| walk_tg_tree_from                     | 1.23%      | 0.91%     |
| sched_balance_update_blocked_averages | 1.09%      | 1.13%     |
| sched_balance_rq                      | 1.03%      | 1.08%     |
| _raw_spin_lock                        | 1.01%      | 1.23%     |
| task_mm_cid_work                      | 0.87%      | 1.09%     |
| __update_load_avg_se                  | 0.78%      | 0.48%     |

| Symbol                                | AMD-orig | AMD-opt |
+---------------------------------------+----------+---------+
| worker_thread                         | 53.97%   | 61.49%  |
| sched_balance_update_blocked_averages | 3.94%    | 2.48%   |
| __update_load_avg_cfs_rq              | 3.52%    | 2.62%   |
| update_load_avg                       | 2.66%    | 2.19%   |
| tg_throttle_down                      | 1.99%    | 1.57%   |
| tg_unthrottle_up                      | 1.98%    | 1.34%   |
| __update_load_avg_se                  | 1.89%    | 1.32%   |
| walk_tg_tree_from                     | 1.79%    | 1.37%   |
| sched_clock_noinstr                   | 1.59%    | 1.01%   |
| sched_balance_rq                      | 1.53%    | 1.26%   |
| _raw_spin_lock                        | 1.47%    | 1.41%   |
| task_mm_cid_work                      | 1.34%    | 1.42%   |

The share of system LLC misses attributed to these scheduler functions
is reduced.

Zecheng Li (2):
  sched/fair: Reorder struct cfs_rq
  sched/fair: Reorder struct sched_entity

 include/linux/sched.h | 37 +++++++++++---------
 kernel/sched/core.c   | 81 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h  | 70 +++++++++++++++++++++++--------------
 3 files changed, 144 insertions(+), 44 deletions(-)


base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
--
2.49.0
Hi Zecheng Li,

On 03/04/25 02:59, Zecheng Li wrote:
> Reorder the fields within `struct cfs_rq` and `struct sched_entity` to
> improve cache locality. This can reduce cache misses and improve the
> performance of CFS scheduling-related operations, particularly on
> servers with hundreds of cores and ~1000 cgroups.
>
> The reordering is based on kernel data-type profiling
> (https://lwn.net/Articles/955709/), which indicates the hot fields and
> the fields that are frequently accessed together.

This patch is based on reordering optimized for 64-byte cache lines.
On systems with a 128-byte L1 D-cache line, like Power10, this might or
might not be beneficial. Moreover, a lot of space (almost half) would
be wasted in the cache line due to APIs like
`__cacheline_group_begin_aligned` and `__cacheline_group_end_aligned`
that may restrict the size to 64 bytes.

Since this is in generic code, any ideas on how to make sure that other
architectures with a different cache line size don't suffer?

[..snip..]

>
> | Kernel LLC Misses | d3 w10            | d5 w4             |
> +-------------------+-------------------+-------------------+
> | AMD-orig          | [3025.5, 3344.1]M | [3382.4, 3607.8]M |
> | AMD-opt           | [2410.7, 2556.9]M | [2565.4, 2931.2]M |
> | Change            | -22.01%           | -21.37%           |
> | Intel-orig        | [1157.2, 1249.0]M | [1343.7, 1630.7]M |
> | Intel-opt         | [960.2, 1023.0]M  | [1092.7, 1350.7]M |
> | Change            | -17.59%           | -17.86%           |
>
> Since the benchmark limits the CPU quota, the RPS results reported by
> `schbench` did not show a statistically significant improvement, as
> they do not reflect the kernel overhead reduction.
>
> Perf data shows the reduction of the LLC-miss percentage within the
> kernel for the depth-5, width-4 workload. The symbols are taken from
> the union of the top 10 symbols in the original and optimized profiles.
>
> | Symbol                                | Intel-orig | Intel-opt |
> +---------------------------------------+------------+-----------+
> | worker_thread                         | 75.41%     | 78.95%    |
> | tg_unthrottle_up                      | 3.21%      | 1.61%     |
> | tg_throttle_down                      | 2.42%      | 1.77%     |
> | __update_load_avg_cfs_rq              | 1.95%      | 1.60%     |
> | walk_tg_tree_from                     | 1.23%      | 0.91%     |
> | sched_balance_update_blocked_averages | 1.09%      | 1.13%     |
> | sched_balance_rq                      | 1.03%      | 1.08%     |
> | _raw_spin_lock                        | 1.01%      | 1.23%     |
> | task_mm_cid_work                      | 0.87%      | 1.09%     |
> | __update_load_avg_se                  | 0.78%      | 0.48%     |
>
> | Symbol                                | AMD-orig | AMD-opt |
> +---------------------------------------+----------+---------+
> | worker_thread                         | 53.97%   | 61.49%  |
> | sched_balance_update_blocked_averages | 3.94%    | 2.48%   |
> | __update_load_avg_cfs_rq              | 3.52%    | 2.62%   |
> | update_load_avg                       | 2.66%    | 2.19%   |
> | tg_throttle_down                      | 1.99%    | 1.57%   |
> | tg_unthrottle_up                      | 1.98%    | 1.34%   |
> | __update_load_avg_se                  | 1.89%    | 1.32%   |
> | walk_tg_tree_from                     | 1.79%    | 1.37%   |
> | sched_clock_noinstr                   | 1.59%    | 1.01%   |
> | sched_balance_rq                      | 1.53%    | 1.26%   |
> | _raw_spin_lock                        | 1.47%    | 1.41%   |
> | task_mm_cid_work                      | 1.34%    | 1.42%   |
>
> The share of system LLC misses attributed to these scheduler functions
> is reduced.

Due to the reordering of the fields, there might be some workloads that
could take a hit. Maybe try running workloads of different kinds
(latency- and throughput-oriented) and make sure that the regression is
not high.
Thanks,
Madadi Vineeth Reddy

>
> Zecheng Li (2):
>   sched/fair: Reorder struct cfs_rq
>   sched/fair: Reorder struct sched_entity
>
>  include/linux/sched.h | 37 +++++++++++---------
>  kernel/sched/core.c   | 81 ++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/sched.h  | 70 +++++++++++++++++++++++--------------
>  3 files changed, 144 insertions(+), 44 deletions(-)
>
>
> base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
Hi Madadi Vineeth Reddy,

> This patch is based on reordering optimized for 64-byte cache lines.
> On systems with a 128-byte L1 D-cache line, like Power10, this might
> or might not be beneficial. Moreover, a lot of space (almost half)
> would be wasted in the cache line due to APIs like
> `__cacheline_group_begin_aligned` and `__cacheline_group_end_aligned`
> that may restrict the size to 64 bytes.
>
> Since this is in generic code, any ideas on how to make sure that
> other architectures with a different cache line size don't suffer?

We propose to conditionally align to the cacheline boundary only when
the cacheline size is 64 bytes, since this is the most common size.

For architectures with 128-byte cachelines (like PowerPC), this
approach will still collocate the hot fields, providing some
performance benefit from improved locality, but it will not enforce
alignment to the larger 128-byte boundary. This avoids wasting cache
space on those architectures due to padding introduced by the
alignment, while still gaining benefits from collocating frequently
accessed fields.
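To make the intent concrete, here is a rough sketch of one possible way
to express this; the wrapper macro names are made up for illustration,
and only the `__cacheline_group_*` helpers from <linux/cache.h> are
existing API:

  /*
   * Sketch only: wrapper names are illustrative, not from the patch.
   * Request the group alignment only when the cache line is 64 bytes;
   * on larger cache lines (e.g. 128-byte Power10) the plain markers
   * still keep the hot fields adjacent but add no alignment padding.
   */
  #if SMP_CACHE_BYTES == 64
  #define __cfs_group_begin(GROUP)   __cacheline_group_begin_aligned(GROUP)
  #define __cfs_group_end(GROUP)     __cacheline_group_end_aligned(GROUP)
  #else
  #define __cfs_group_begin(GROUP)   __cacheline_group_begin(GROUP)
  #define __cfs_group_end(GROUP)     __cacheline_group_end(GROUP)
  #endif

Whether keying this off SMP_CACHE_BYTES is acceptable for generic code
is exactly the open question, so please treat the above as a starting
point for discussion rather than the planned implementation.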
> Due to the reordering of the fields, there might be some workloads
> that could take a hit. Maybe try running workloads of different kinds
> (latency- and throughput-oriented) and make sure that the regression
> is not high.

For workloads running without a cgroup hierarchy, we expect a small
performance impact. This is because there is only one cfs_rq per CPU
in this scenario, and it is likely to stay in cache due to frequent
access.

For workloads with a cgroup hierarchy, I have tested sysbench threads
and hackbench --thread; there is no obvious regression.

Heavy load on 1024 instances of sysbench:

  Latency (ms)    after-patch    original
  avg avg:        2133.51        2150.97
  avg min:        21.9629        20.9413
  avg max:        5955.8         5966.78

Average runtime for 1024 instances of ./hackbench --thread -g 2 -l 1000
in a cgroup hierarchy:

  After-patch: 34.9458s, Original: 36.8647s

We plan to include more benchmark results in the v2 patch. Do you have
suggestions for other benchmarks you would like us to test?

Regards,
Zecheng

On 19/04/25 02:28, ZECHENG LI wrote:
> Hi Madadi Vineeth Reddy,
>
>> This patch is based on reordering optimized for 64-byte cache lines.
>> On systems with a 128-byte L1 D-cache line, like Power10, this might
>> or might not be beneficial. Moreover, a lot of space (almost half)
>> would be wasted in the cache line due to APIs like
>> `__cacheline_group_begin_aligned` and `__cacheline_group_end_aligned`
>> that may restrict the size to 64 bytes.
>>
>> Since this is in generic code, any ideas on how to make sure that
>> other architectures with a different cache line size don't suffer?
>
> We propose to conditionally align to the cacheline boundary only when
> the cacheline size is 64 bytes, since this is the most common size.
>
> For architectures with 128-byte cachelines (like PowerPC), this
> approach will still collocate the hot fields, providing some
> performance benefit from improved locality, but it will not enforce
> alignment to the larger 128-byte boundary. This avoids wasting cache
> space on those architectures due to padding introduced by the
> alignment, while still gaining benefits from collocating frequently
> accessed fields.

I don't see the check to enforce the alignment only for 64 bytes.
IIUC, the macros seem to apply the alignment unconditionally based on
the arch-specific cacheline size. I might be missing something, could
you clarify this?

>> Due to the reordering of the fields, there might be some workloads
>> that could take a hit. Maybe try running workloads of different kinds
>> (latency- and throughput-oriented) and make sure that the regression
>> is not high.
>
> For workloads running without a cgroup hierarchy, we expect a small
> performance impact. This is because there is only one cfs_rq per CPU
> in this scenario, and it is likely to stay in cache due to frequent
> access.
>
> For workloads with a cgroup hierarchy, I have tested sysbench threads
> and hackbench --thread; there is no obvious regression.
>
> Heavy load on 1024 instances of sysbench:
>
>   Latency (ms)    after-patch    original
>   avg avg:        2133.51        2150.97
>   avg min:        21.9629        20.9413
>   avg max:        5955.8         5966.78
>
> Average runtime for 1024 instances of ./hackbench --thread -g 2 -l 1000
> in a cgroup hierarchy:
>
>   After-patch: 34.9458s, Original: 36.8647s
>
> We plan to include more benchmark results in the v2 patch. Do you have
> suggestions for other benchmarks you would like us to test?

Maybe some throughput-oriented workloads like ebizzy and sysbench, and
also some real-life workloads, would be good to include.

Thanks,
Madadi Vineeth Reddy

>
> Regards,
> Zecheng