[RFC PATCH 0/2] sched/fair: Reorder scheduling related structs to reduce cache misses

Reorder the fields within `struct cfs_rq` and `struct sched_entity` to
improve cache locality. This reduces cache misses and improves the
performance of CFS scheduling-related operations, particularly on
servers with hundreds of cores and on the order of 1000 cgroups.

The reordering is based on kernel data-type profiling
(https://lwn.net/Articles/955709/), which identifies hot fields and
fields that are frequently accessed together.

This reordering aims to optimize cache utilization and improve the
performance of scheduling-related functions, particularly
`tg_throttle_down`, `tg_unthrottle_up`, and `__update_load_avg_cfs_rq`.
The reordering mainly targets the case where `CONFIG_FAIR_GROUP_SCHED`
is enabled. When it is disabled, there is no CFS bandwidth control and
only a single `cfs_rq` exists per CPU, so its layout has little impact
on performance.
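
The general pattern is to co-locate fields that the hot paths touch
together and to keep frequently written fields away from read-mostly
ones, so that each hot function ideally touches a single cache line.
As a rough sketch only (a hypothetical grouping over a subset of the
real `cfs_rq` fields, not the ordering chosen in these patches):

```c
/*
 * Hypothetical sketch of the idea, not the layout from the patches.
 * Fields written together on the throttle/unthrottle path share one
 * cache line, the PELT averages get their own line, and read-mostly
 * pointers are pushed to the end.  Kernel context is assumed
 * (____cacheline_aligned from <linux/cache.h>).
 */
struct cfs_rq_sketch {
	/* hot: written by tg_throttle_down()/tg_unthrottle_up() */
	int			throttle_count;
	int			throttled;
	u64			throttled_clock;
	s64			runtime_remaining;

	/* hot: written by __update_load_avg_cfs_rq(); own cache line */
	struct sched_avg	avg ____cacheline_aligned;

	/* read-mostly: set at creation, only dereferenced afterwards */
	struct rq		*rq ____cacheline_aligned;
	struct task_group	*tg;
};
```

A layout like this can be checked against the cache-line boundaries
with `pahole -C cfs_rq vmlinux` before and after the change.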

We use a benchmark with multiple cgroup levels to simulate real server
load. The benchmark constructs a tree-shaped hierarchy of cgroups, with
“width” and “depth” parameters controlling the number of children per
node and the depth of the tree. Each leaf cgroup runs a `schbench`
workload and receives a quota equal to 80% of the total CPU time
divided by the number of leaf cgroups (in other words, the target CPU
load is 80%) to exercise the throttling functions. The bandwidth
control period is set to 10ms. We run the benchmark on Intel and AMD
machines, each with hundreds of hardware threads.
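
For concreteness, the per-leaf quota works out to
0.8 * nr_cpus * period / nr_leaves. A minimal sketch of that
calculation, assuming a 256-CPU machine and cgroup v2's `cpu.max`
interface (the CPU count and cgroup path are illustrative, not the
exact benchmark harness):

```c
/* Illustrative only: derive the per-leaf cpu.max value described above. */
#include <stdio.h>

int main(void)
{
	const unsigned long nr_cpus   = 256;   /* assumed machine size */
	const unsigned long nr_leaves = 1000;  /* d3 w10 => 10^3 leaf cgroups */
	const unsigned long period_us = 10000; /* 10ms bandwidth period */

	/* 80% of total CPU time, split evenly across the leaves */
	unsigned long quota_us = nr_cpus * period_us * 8 / 10 / nr_leaves;

	/* cgroup v2 cpu.max takes "<quota> <period>" in microseconds */
	printf("echo \"%lu %lu\" > /sys/fs/cgroup/<leaf>/cpu.max\n",
	       quota_us, period_us);
	return 0;
}
```

With these numbers each leaf gets 2048us per 10ms period, i.e. roughly
0.2 CPUs, so the leaves regularly hit the throttling path.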

The table below shows kernel LLC load misses collected over 30 seconds.
d3 w10 (wider tree) denotes a cgroup hierarchy 3 levels deep with 10
children per level, totaling 1000 leaf cgroups; d5 w4 represents a
deeper tree with more levels. Each benchmark is run 10 times, and the
table shows 95% confidence intervals of the kernel LLC misses in
millions.


| Kernel LLC Misses | d3 w10            | d5 w4             |
+-------------------+-------------------+-------------------+
| AMD-orig          | [3025.5, 3344.1]M | [3382.4, 3607.8]M |
| AMD-opt           | [2410.7, 2556.9]M | [2565.4, 2931.2]M |
| Change            | -22.01%           | -21.37%           |
| Intel-orig        | [1157.2, 1249.0]M | [1343.7, 1630.7]M |
| Intel-opt         | [960.2, 1023.0]M  | [1092.7, 1350.7]M |
| Change            | -17.59%           | -17.86%           |

Since the benchmark caps the CPU quota, the RPS reported by `schbench`
does not reflect the reduction in kernel overhead and showed no
statistically significant improvement.

Perf data shows the reduction in the percentage of LLC misses within
the kernel for the depth-5, width-4 workload. The symbols listed are
the union of the top 10 symbols in the original and optimized profiles.

| Symbol                                | Intel-orig | Intel-opt |
+---------------------------------------+------------+-----------+
| worker_thread                         | 75.41%     | 78.95%    |
| tg_unthrottle_up                      | 3.21%      | 1.61%     |
| tg_throttle_down                      | 2.42%      | 1.77%     |
| __update_load_avg_cfs_rq              | 1.95%      | 1.60%     |
| walk_tg_tree_from                     | 1.23%      | 0.91%     |
| sched_balance_update_blocked_averages | 1.09%      | 1.13%     |
| sched_balance_rq                      | 1.03%      | 1.08%     |
| _raw_spin_lock                        | 1.01%      | 1.23%     |
| task_mm_cid_work                      | 0.87%      | 1.09%     |
| __update_load_avg_se                  | 0.78%      | 0.48%     |

| Symbol                                | AMD-orig | AMD-opt |
+---------------------------------------+----------+---------+
| worker_thread                         | 53.97%   | 61.49%  |
| sched_balance_update_blocked_averages | 3.94%    | 2.48%   |
| __update_load_avg_cfs_rq              | 3.52%    | 2.62%   |
| update_load_avg                       | 2.66%    | 2.19%   |
| tg_throttle_down                      | 1.99%    | 1.57%   |
| tg_unthrottle_up                      | 1.98%    | 1.34%   |
| __update_load_avg_se                  | 1.89%    | 1.32%   |
| walk_tg_tree_from                     | 1.79%    | 1.37%   |
| sched_clock_noinstr                   | 1.59%    | 1.01%   |
| sched_balance_rq                      | 1.53%    | 1.26%   |
| _raw_spin_lock                        | 1.47%    | 1.41%   |
| task_mm_cid_work                      | 1.34%    | 1.42%   |

Across both machines, the share of system LLC misses attributed to the
scheduling-related functions is reduced.

Zecheng Li (2):
  sched/fair: Reorder struct cfs_rq
  sched/fair: Reorder struct sched_entity

 include/linux/sched.h | 37 +++++++++++---------
 kernel/sched/core.c   | 81 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h  | 70 +++++++++++++++++++++++--------------
 3 files changed, 144 insertions(+), 44 deletions(-)


base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
-- 
2.49.0