Hi all,
This patch series enhances CFS cache efficiency by co-locating cfs_rq
and sched_entity within a unified per-cpu memory region. Furthermore,
it replaces the existing task_group pointer arrays (specifically
tg->cfs_rq[] and tg->se[]) with a streamlined single per-cpu offset.
Performance evaluation
=======
Summary:
0.5% CPU throughput increase and 0.5% memory utilization reduction in
Google production environment.
1% performance gain for MySQL, with ~10% drop in LLC misses.
30% reduction in LLC misses on Intel and 10% reduction in LLC misses
on AMD for schbench.
1) Google Production Environment
Internal metrics indicate a 0.5±0.02% increase in CPU throughput
alongside a 0.5±0.02% reduction in RAM utilization.
2) MySQL OLTP under a deep cgroup hierarchy
Setup: To stress-test the cgroup subsystem, we set up a deep cgroup
hierarchy: a width-4, depth-5 cgroup tree (1024 leaves). Each leaf
runs one mysqld instance. 75% of the MySQL servers are in throttled
cgroups that may use at most 50% of machine resources, which simulates
batch jobs in data centers. The remaining 25% are placed in unlimited
cgroups with no resource constraints, which simulates latency-sensitive
jobs. A parameter selects how many of those 1024 servers are actively
targeted by sysbench oltp_read_only. The following table shows the
improvements (delta) when this patch series is applied. In summary, we
observe consistent improvement: a ~1% gain in TPS, coupled with a ~1%
gain in IPC, and a ~10% drop in last-level cache misses.
Active servers | TPS (%)      | IPC (%)      | LLC misses (%)
---------------+--------------+--------------+----------------
            16 | +0.99 ± 0.10 | +0.85 ± 0.10 | -14.87 ± 0.20
            32 | +0.27 ± 0.13 | +0.28 ± 0.12 | -10.81 ± 0.24
            64 | +1.79 ± 0.31 | +1.77 ± 0.31 |  -7.06 ± 1.56
           128 | +0.52 ± 0.41 | +0.29 ± 0.53 |  -4.52 ± 2.26
           256 | +0.77 ± 0.48 | +1.19 ± 0.33 | -10.84 ± 3.21
3) schbench in a deep cgroup tree with bandwidth control
The cgroup tree is built with "width" and "depth" parameters; each
leaf runs a schbench instance at 80% of its CPU quota to exercise
throttling. Bandwidth period is set to 10ms. The benchmark runs on Intel
and AMD machines with hundreds of cores. We observe about a 10% drop in
LLC misses on AMD and about a 30% drop on Intel.
Kernel LLC Misses | depth 3 width 10    | depth 5 width 4
------------------+---------------------+--------------------
AMD-orig          | [2218.98, 2241.89]M | [2599.80, 2645.16]M
AMD-opt           | [1957.62, 1981.55]M | [2380.47, 2431.86]M
Change            | -11.69%             | -8.25%
Intel-orig        | [1580.53, 1604.90]M | [2125.37, 2208.68]M
Intel-opt         | [1066.94, 1100.19]M | [1543.77, 1570.83]M
Change            | -31.96%             | -28.13%
Approach
========
A cfs_rq instance is per-CPU and per-task_group; each non-root cfs_rq
has a matching sched_entity in its parent. Currently both are allocated
separately with kzalloc_node, and tg->cfs_rq / tg->se are arrays of
pointers, so loading the parent sched_entity from a cfs_rq incurs
pointer chasing (cfs_rq->tg->se[cpu]) and the per-CPU iterations in hot
paths repeatedly walk those pointer arrays.
Original memory layout:
tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);
  +----+       +-----------------+
  | tg | ----> | cfs_rq pointers |
  +----+       +-----------------+
                 |      |      |
                 v      v      v
               cfs_rq cfs_rq cfs_rq

  +----+       +--------------------+
  | tg | ----> | sched_entity ptrs  |
  +----+       +--------------------+
                 |      |      |
                 v      v      v
                 se     se     se
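
For illustration, the pointer chain above can be modelled in userspace C.
This is a minimal sketch, not the kernel code: the struct fields are
reduced stand-ins, and tg_alloc_separate() and cfs_rq_se_old() are
hypothetical helper names.

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced stand-ins for the kernel structs; fields are illustrative only. */
struct task_group;

struct sched_entity {
	unsigned long weight;
};

struct cfs_rq {
	struct task_group *tg;		/* owning task group */
	unsigned int nr_queued;
};

struct task_group {
	struct cfs_rq **cfs_rq;		/* array of per-CPU pointers */
	struct sched_entity **se;	/* array of per-CPU pointers */
};

/*
 * Hypothetical helper: every cfs_rq and se is a separate allocation,
 * reached through two pointer arrays. Error unwinding is omitted for
 * brevity.
 */
static int tg_alloc_separate(struct task_group *tg, int nr_cpus)
{
	int cpu;

	tg->cfs_rq = calloc(nr_cpus, sizeof(*tg->cfs_rq));
	tg->se = calloc(nr_cpus, sizeof(*tg->se));
	if (!tg->cfs_rq || !tg->se)
		return -1;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		tg->cfs_rq[cpu] = calloc(1, sizeof(struct cfs_rq));
		tg->se[cpu] = calloc(1, sizeof(struct sched_entity));
		if (!tg->cfs_rq[cpu] || !tg->se[cpu])
			return -1;
		tg->cfs_rq[cpu]->tg = tg;
	}
	return 0;
}

/* Three dependent loads: cfs_rq->tg, then tg->se, then se[cpu]. */
static struct sched_entity *cfs_rq_se_old(struct cfs_rq *cfs_rq, int cpu)
{
	return cfs_rq->tg->se[cpu];
}
```

Each of the three loads in cfs_rq_se_old() may touch a different cache
line, which is the cost the series aims to remove.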
Layout after Optimization:
  +--------+      | CPU 0  |    | CPU 1  |    | CPU 2  |
  |   tg   |      | percpu |    | percpu |    | percpu |
  |        |         ...           ...           ...
  | percpu |  ->  | cfs_rq |    | cfs_rq |    | cfs_rq |
  | offset |      |   se   |    |   se   |    |   se   |
  +--------+      +--------+    +--------+    +--------+
The optimization includes two parts:
1) Co-allocate cfs_rq and its sched_entity for non-root task groups, so
loading the parent se from a cfs_rq becomes a simple offset rather
than a pointer dereference through tg.
2) Allocate the combined struct with the percpu allocator. Hot paths
iterate task_groups for the same CPU, so a shared per-CPU base
pointer keeps these accesses cache-resident and removes the
per-tg pointer arrays.
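
The two parts can be sketched in userspace C as follows. This is a toy
model under stated assumptions, not the code in the patches: the struct
fields are reduced stand-ins, the per-CPU areas are plain heap buffers
standing in for the kernel's percpu allocator, and state_off,
percpu_areas_init(), toy_alloc_percpu(), tg_state() and cfs_rq_se() are
hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS		4	/* toy CPU count */
#define AREA_SIZE	4096	/* toy size of each per-CPU area */
#define CACHE_LINE	64	/* assumed cache line size */

/* Reduced stand-ins for the kernel structs; fields are illustrative. */
struct cfs_rq { unsigned int nr_queued; };
struct sched_entity { unsigned long weight; };
struct sched_statistics { unsigned long wait_count; };

/* Part 1: one object per (tg, CPU), with cfs_rq at offset 0. */
struct cfs_tg_state {
	struct cfs_rq cfs_rq;
	struct sched_entity se;
	struct sched_statistics stats;
};

/* Part 2: tg keeps a single offset instead of two pointer arrays. */
struct task_group {
	long state_off;		/* hypothetical field name */
};

static char *percpu_area[NR_CPUS];	/* toy per-CPU base pointers */
static size_t percpu_used;		/* bytes handed out so far */

static int percpu_areas_init(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		percpu_area[cpu] = calloc(1, AREA_SIZE);
		if (!percpu_area[cpu])
			return -1;
	}
	percpu_used = 0;
	return 0;
}

/* Reserve the same offset (a multiple of CACHE_LINE) in every CPU's area. */
static long toy_alloc_percpu(size_t size)
{
	size_t len = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
	size_t off = percpu_used;

	if (off + len > AREA_SIZE)
		return -1;
	percpu_used = off + len;
	return (long)off;
}

/* Per-CPU lookup: shared per-CPU base plus the tg's offset. */
static struct cfs_tg_state *tg_state(const struct task_group *tg, int cpu)
{
	return (struct cfs_tg_state *)(percpu_area[cpu] + tg->state_off);
}

/* The parent se sits at a fixed offset from the cfs_rq; no load via tg. */
static struct sched_entity *cfs_rq_se(struct cfs_rq *cfs_rq)
{
	struct cfs_tg_state *state = (struct cfs_tg_state *)
		((char *)cfs_rq - offsetof(struct cfs_tg_state, cfs_rq));

	return &state->se;
}
```

In this model, task groups allocated back to back occupy neighbouring
slots within each CPU's area, so a loop over task groups for one CPU
stays within that CPU's region rather than following a separate pointer
per group.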
Workloads without bandwidth control running in O(1000)-cgroup trees
(sysbench, hackbench, ebizzy) show no regression signals.
Changes
=======
v10:
- Rewrote the cover letter, adding more benchmark results. The series
applies cleanly to tip/sched/core.
v9:
- Rebased on latest tip/sched/core (v7.0.0-rc4), fixing minor conflicts
from upstream using kzalloc_objs instead of the original kcalloc. Tested
with selftests/cgroup in QEMU.
- Kept Reviewed-by tags from v8, but dropped the Tested-by tags due to
major version bump.
v8:
- Simplified free_fair_sched_group() in patch 1 to use a direct kfree
instead of container_of, since cfs_rq is at offset 0 (Prateek)
- Moved variable declarations to top of function scope (Prateek)
v7:
- Removed struct sched_entity_stats, using cfs_tg_state to contain all
three structs: cfs_rq, sched_entity, and sched_statistics since they
are always allocated together in alloc_fair_sched_group. (Josh)
v5-v6:
- Rebased on tip/sched/core.
- Added my personal email as a secondary Signed-off-by after losing
access to zecheng@google.com.
v4:
https://lore.kernel.org/all/20250903194503.1679687-1-zecheng@google.com/
- Rebased on tip/sched/core
- Intel kernel test robot results
https://lore.kernel.org/all/202507161052.ed3213f4-lkp@intel.com/
v3:
https://lore.kernel.org/all/20250701210230.2985885-1-zecheng@google.com/
- Rebased on top of 6.16-rc4.
- Minor wording and comment updates.
v2:
https://lore.kernel.org/lkml/20250609193834.2556866-1-zecheng@google.com/
- Allocate cfs_rq and sched_entity together for non-root task group
instead of embedding sched_entity into cfs_rq to avoid increasing the
size of struct rq based on the feedback from Peter Zijlstra.
Zecheng Li (3):
sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state
sched/fair: Remove task_group->se pointer array
sched/fair: Allocate cfs_tg_state with percpu allocator
kernel/sched/core.c | 40 ++++++++-------------
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 83 ++++++++++++++++----------------------------
kernel/sched/sched.h | 55 ++++++++++++++++++++++++-----
kernel/sched/stats.h | 9 +----
5 files changed, 93 insertions(+), 96 deletions(-)
--
2.54.0