[v10] sched/fair: Optimize cfs_rq and sched_entity allocation for better data locality

[PATCH v10 0/3] sched/fair: Optimize cfs_rq and sched_entity allocation for better data locality
Posted by Zecheng Li 2 days, 4 hours ago
Hi all,

This patch series enhances CFS cache efficiency by co-locating cfs_rq
and sched_entity within a unified per-cpu memory region. Furthermore,
it replaces the existing task_group pointer arrays (specifically
tg->cfs_rq[] and tg->se[]) with a streamlined single per-cpu offset.

Performance evaluation
======================

Summary:

0.5% CPU throughput increase and 0.5% memory utilization reduction in
a Google production environment.

~1% TPS gain for MySQL, with a ~10% drop in LLC misses.

30% reduction in LLC misses on Intel and 10% reduction in LLC misses
on AMD for schbench.

1) Google Production Environment

Internal metrics indicate a 0.5±0.02% increase in CPU throughput
alongside a 0.5±0.02% reduction in RAM utilization.

2) MySQL OLTP under a deep cgroup hierarchy

Setup: To stress-test the cgroup subsystem, we build a deep cgroup
hierarchy: a width-4, depth-5 tree (4^5 = 1024 leaves). Each leaf runs
one mysqld instance. 75% of the MySQL servers are in throttled cgroups
that may use at most 50% of machine resources, simulating batch jobs
in data centers. The remaining 25% are placed in unlimited cgroups
with no resource constraints, simulating latency-sensitive jobs. A
parameter selects how many of the 1024 servers are actively targeted
by sysbench oltp_read_only. The following table shows the change
(delta) when this patch series is applied. We observe a consistent
improvement: roughly a 1% gain in TPS and in IPC, and a roughly 10%
drop in last-level cache misses.

Active servers |    TPS (%)   |    IPC (%)   |  LLC misses (%)
---------------+--------------+--------------+----------------
            16 | +0.99 ± 0.10 | +0.85 ± 0.10 |  -14.87 ± 0.20
            32 | +0.27 ± 0.13 | +0.28 ± 0.12 |  -10.81 ± 0.24
            64 | +1.79 ± 0.31 | +1.77 ± 0.31 |   -7.06 ± 1.56
           128 | +0.52 ± 0.41 | +0.29 ± 0.53 |   -4.52 ± 2.26
           256 | +0.77 ± 0.48 | +1.19 ± 0.33 |  -10.84 ± 3.21
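
For readers reproducing the setup, the tree size follows directly from
the two parameters. The sketch below is only an illustration: the helper
names are hypothetical, and it assumes "depth" counts levels below the
root, which matches the 1024 leaves quoted above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Leaves of a complete cgroup tree in which every non-leaf cgroup has
 * `width` children and the leaves sit `depth` levels below the root.
 * Assumption: this is how "width" and "depth" are meant in the setup.
 */
static uint64_t cgroup_tree_leaves(unsigned int width, unsigned int depth)
{
	uint64_t leaves = 1;

	assert(width > 0);
	for (unsigned int level = 0; level < depth; level++)
		leaves *= width;
	return leaves;
}

/* All cgroups created below the root: width + width^2 + ... + width^depth. */
static uint64_t cgroup_tree_cgroups(unsigned int width, unsigned int depth)
{
	uint64_t total = 0, at_level = 1;

	assert(width > 0);
	for (unsigned int level = 0; level < depth; level++) {
		at_level *= width;
		total += at_level;
	}
	return total;
}
```

Calling cgroup_tree_leaves(4, 5) gives the 1024 leaf cgroups used here;
the tree as a whole then holds 1364 cgroups below the root, each with
its own per-CPU cfs_rq and sched_entity.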

3) schbench in a deep cgroup tree with bandwidth control

The cgroup tree is built with "width" and "depth" parameters; each
leaf runs a schbench instance at 80% of its CPU quota to exercise
throttling. The bandwidth period is set to 10ms. The benchmark runs on
Intel and AMD machines with hundreds of cores. We observe about a 10%
drop in kernel LLC misses on AMD and about a 30% drop on Intel.

Kernel LLC Misses | depth 3 width 10    | depth 5 width 4
------------------+---------------------+--------------------
AMD-orig          | [2218.98, 2241.89]M | [2599.80, 2645.16]M
AMD-opt           | [1957.62, 1981.55]M | [2380.47, 2431.86]M
Change            | -11.69%             |  -8.25%
Intel-orig        | [1580.53, 1604.90]M | [2125.37, 2208.68]M
Intel-opt         | [1066.94, 1100.19]M | [1543.77, 1570.83]M
Change            | -31.96%             | -28.13%
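
The letter does not state how the "Change" rows are derived from the
bracketed intervals. As a rough cross-check, and purely as an
assumption, the sketch below compares interval midpoints; that lands
within about 0.01 percentage points of the table. The struct and
function names are hypothetical.

```c
#include <assert.h>

/* A measured [low, high] range, in millions of LLC misses. */
struct interval {
	double lo;
	double hi;
};

static double interval_mid(struct interval iv)
{
	return (iv.lo + iv.hi) / 2.0;
}

/*
 * Relative change of the optimized midpoint against the original
 * midpoint, in percent. Negative values mean fewer misses.
 */
static double percent_change(struct interval orig, struct interval opt)
{
	double base = interval_mid(orig);

	assert(base > 0.0);
	return (interval_mid(opt) - base) / base * 100.0;
}
```

For example, percent_change() on the AMD "depth 3 width 10" column
takes {2218.98, 2241.89} and {1957.62, 1981.55} as inputs.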

Approach
========

A cfs_rq instance is per-CPU and per-task_group; each non-root cfs_rq
has a matching sched_entity in its parent. Currently both are allocated
separately with kzalloc_node, and tg->cfs_rq / tg->se are arrays of
pointers, so loading the parent sched_entity from a cfs_rq incurs
pointer chasing (cfs_rq->tg->se[cpu]) and the per-CPU iterations in hot
paths repeatedly walk those pointer arrays.

Original memory layout:

	tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
	tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);

	+----+       +-----------------+
	| tg | ----> | cfs_rq pointers |
	+----+       +-----------------+
	                |     |     |
	                v     v     v
	            cfs_rq cfs_rq cfs_rq

	+----+       +--------------------+
	| tg | ----> | sched_entity ptrs  |
	+----+       +--------------------+
	                |     |     |
	                v     v     v
	                se    se    se

Layout after optimization:

	+--------+      | CPU 0  |	| CPU 1  |	| CPU 2  |
	|   tg   |      | percpu |	| percpu |	| percpu |
	|        |         ...             ...             ...
	| percpu |  ->  | cfs_rq |	| cfs_rq |	| cfs_rq |
	| offset |      |   se   |	|   se   |	|   se   |
	+--------+      +--------+	+--------+	+--------+
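
To make the diagram concrete, here is a small userspace model of
offset-based per-CPU addressing. It is not the kernel's percpu
allocator (which hands out pointers dereferenced with per_cpu_ptr());
every name below is hypothetical and the bump allocator is a
deliberate simplification. The point it shows: one offset per task
group resolves to a different object on each CPU, and objects of
different task groups on the same CPU sit next to each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NR_CPUS	4
#define MODEL_AREA_SIZE	4096

struct percpu_model {
	unsigned char *area[MODEL_NR_CPUS];	/* one private region per CPU */
	size_t next_free;			/* bump cursor shared by all regions */
};

static void percpu_model_free(struct percpu_model *pm)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		free(pm->area[cpu]);
		pm->area[cpu] = NULL;
	}
}

static int percpu_model_init(struct percpu_model *pm)
{
	pm->next_free = 0;
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		pm->area[cpu] = NULL;
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		pm->area[cpu] = calloc(1, MODEL_AREA_SIZE);
		if (!pm->area[cpu]) {
			percpu_model_free(pm);
			return -1;
		}
	}
	return 0;
}

/* Reserve `size` bytes at the same offset in every CPU's region. */
static ptrdiff_t percpu_model_alloc(struct percpu_model *pm, size_t size,
				    size_t align)
{
	size_t off;

	assert(align > 0);
	off = (pm->next_free + align - 1) / align * align;
	if (off + size > MODEL_AREA_SIZE)
		return -1;
	pm->next_free = off + size;
	return (ptrdiff_t)off;
}

/* Resolve a task group's offset to its object on one CPU. */
static void *percpu_model_ptr(struct percpu_model *pm, ptrdiff_t off, int cpu)
{
	assert(off >= 0 && cpu >= 0 && cpu < MODEL_NR_CPUS);
	return pm->area[cpu] + off;
}
```

In this model a task group stores only the value returned by
percpu_model_alloc(), in place of two nr_cpu_ids-sized pointer arrays.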

The optimization includes two parts:

1) Co-allocate cfs_rq and its sched_entity for non-root task groups, so
   loading the parent se from a cfs_rq becomes a simple offset rather
   than a pointer dereference through tg.

2) Allocate the combined struct with the percpu allocator. Hot paths
   iterate task_groups for the same CPU, so a shared per-CPU base
   pointer keeps these accesses cache-resident and removes the
   per-tg pointer arrays.
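
Part 1 can be sketched in plain C as follows. This is a userspace
illustration, not the code from the patches: the three inner structs
are tiny stand-ins with placeholder fields, the helper names are
hypothetical, and the member order after cfs_rq is an assumption (the
changelog only states that cfs_rq is at offset 0).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel structs; fields are placeholders only. */
struct cfs_rq { unsigned int nr_queued; };
struct sched_entity { unsigned long weight; };
struct sched_statistics { unsigned long long wait_sum; };

/* Combined per-(task_group, CPU) state, allocated as one object. */
struct cfs_tg_state {
	struct cfs_rq cfs_rq;		/* kept at offset 0 */
	struct sched_entity se;
	struct sched_statistics stats;
};

_Static_assert(offsetof(struct cfs_tg_state, cfs_rq) == 0,
	       "cfs_rq must stay the first member");

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Parent-side sched_entity of a non-root cfs_rq: a constant offset
 * from the cfs_rq itself, with no load of tg or a pointer array.
 * Must not be used on a root cfs_rq, which has no such entity.
 */
static struct sched_entity *cfs_rq_to_se(struct cfs_rq *cfs_rq)
{
	struct cfs_tg_state *state =
		container_of(cfs_rq, struct cfs_tg_state, cfs_rq);

	return &state->se;
}

/* The reverse direction works the same way. */
static struct cfs_rq *se_to_cfs_rq(struct sched_entity *se)
{
	struct cfs_tg_state *state =
		container_of(se, struct cfs_tg_state, se);

	return &state->cfs_rq;
}
```

Compared with cfs_rq->tg->se[cpu], the lookup above touches no memory
beyond the combined object, which is where the locality gain in part 1
comes from.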

Workloads without bandwidth control running in O(1000)-cgroup trees
(sysbench, hackbench, ebizzy) show no regression signals.

Changes
=======

v10:
- Rewrote the cover letter, adding more benchmark results. The series
  applies cleanly to tip/sched/core.

v9:
- Rebased on latest tip/sched/core (v7.0.0-rc4), fixing minor conflicts
  from upstream using kzalloc_objs instead of the original kcalloc. Tested
  with selftests/cgroup in QEMU.
- Kept Reviewed-by tags from v8, but dropped the Tested-by tags due to
  the major version bump.

v8:
- Simplified free_fair_sched_group() in patch 1 to use a direct kfree()
  instead of container_of(), since cfs_rq is at offset 0 (Prateek)
- Moved variable declarations to top of function scope (Prateek)

v7:
- Removed struct sched_entity_stats, using cfs_tg_state to contain all
  three structs: cfs_rq, sched_entity, and sched_statistics since they
  are always allocated together in alloc_fair_sched_group. (Josh)

v5-v6:
- Rebased on tip/sched/core.
- Added my personal email as a secondary Signed-off-by after losing
  access to zecheng@google.com.

v4:
https://lore.kernel.org/all/20250903194503.1679687-1-zecheng@google.com/
- Rebased on tip/sched/core
- Intel kernel test robot results
https://lore.kernel.org/all/202507161052.ed3213f4-lkp@intel.com/

v3:
https://lore.kernel.org/all/20250701210230.2985885-1-zecheng@google.com/
- Rebased on top of 6.16-rc4.
- Minor wording and comment updates.

v2:
https://lore.kernel.org/lkml/20250609193834.2556866-1-zecheng@google.com/
- Allocate cfs_rq and sched_entity together for non-root task group
  instead of embedding sched_entity into cfs_rq to avoid increasing the
  size of struct rq based on the feedback from Peter Zijlstra.

Zecheng Li (3):
  sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state
  sched/fair: Remove task_group->se pointer array
  sched/fair: Allocate cfs_tg_state with percpu allocator

 kernel/sched/core.c  | 40 ++++++++-------------
 kernel/sched/debug.c |  2 +-
 kernel/sched/fair.c  | 83 ++++++++++++++++----------------------------
 kernel/sched/sched.h | 55 ++++++++++++++++++++++++-----
 kernel/sched/stats.h |  9 +----
 5 files changed, 93 insertions(+), 96 deletions(-)

-- 
2.54.0