[PATCH v4 0/3] sched/fair: Optimize cfs_rq and sched_entity allocation for better data locality
Posted by Zecheng Li 4 weeks, 1 day ago
Hi all,

This patch series improves CFS cache performance by allocating cfs_rq
and sched_entity together from the per-cpu allocator. This allows the
pointer arrays in task_group to be replaced with a single per-cpu
offset.

v4:
- Rebased on tip/sched/core
- Intel kernel test robot results
https://lore.kernel.org/all/202507161052.ed3213f4-lkp@intel.com/

v3:
https://lore.kernel.org/all/20250701210230.2985885-1-zecheng@google.com/
- Rebased on top of 6.16-rc4.
- Minor wording and comment updates.

v2:
https://lore.kernel.org/lkml/20250609193834.2556866-1-zecheng@google.com/
- Allocate cfs_rq and sched_entity together for non-root task groups,
  instead of embedding sched_entity into cfs_rq, to avoid increasing
  the size of struct rq (based on feedback from Peter Zijlstra).

v1:
https://lore.kernel.org/lkml/20250604195846.193159-1-zecheng@google.com/

Accessing cfs_rq and sched_entity instances incurs many cache misses;
this series aims to reduce them. A struct cfs_rq instance exists per
CPU and per task_group: each task_group instance (and the root
runqueue) holds one cfs_rq instance per CPU. Additionally, there is a
corresponding struct sched_entity instance for each cfs_rq instance
(except the root). Currently, both cfs_rq and sched_entity instances
are allocated from NUMA-local memory using kzalloc_node(), and
tg->cfs_rq and tg->se are arrays of pointers.

Original memory layout:

	tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
	tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);

	+----+       +-----------------+
	| tg | ----> | cfs_rq pointers |
	+----+       +-----------------+
	                |     |     |
	                v     v     v
	            cfs_rq cfs_rq cfs_rq

	+----+       +--------------------+
	| tg | ----> | sched_entity ptrs  |
	+----+       +--------------------+
	                |     |     |
	                v     v     v
	                se    se    se

Layout after Optimization:

	+--------+      | CPU 0  |	| CPU 1  |	| CPU 2  |
	|   tg   |      | percpu |	| percpu |	| percpu |
	|        |         ...             ...             ...
	| percpu |  ->  | cfs_rq |	| cfs_rq |	| cfs_rq |
	| offset |      |   se   |	|   se   |	|   se   |
	+--------+      +--------+	+--------+	+--------+
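
A minimal sketch of this layout (illustrative only; the struct and
field names used here, e.g. cfs_rq_with_se and tg->cfs_rq_pcpu, are
placeholders and may differ from the actual patches):

	struct cfs_rq_with_se {
		struct cfs_rq		cfs_rq;
		/* entity that represents this cfs_rq on its parent runqueue */
		struct sched_entity	se;
	};

	struct task_group {
		/* ... */
		/* one combined instance per CPU, replacing tg->cfs_rq and tg->se */
		struct cfs_rq_with_se __percpu	*cfs_rq_pcpu;
	};

	/* allocation for a non-root task group: a single per-cpu allocation
	 * replaces the per-CPU kzalloc_node() calls and the pointer arrays */
	tg->cfs_rq_pcpu = alloc_percpu(struct cfs_rq_with_se);
	if (!tg->cfs_rq_pcpu)
		return -ENOMEM;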

The optimization includes two parts:

1) Co-allocate cfs_rq and sched_entity for non-root task groups.

- This benefits loading the sched_entity that represents the cfs_rq on
  its parent runqueue. Currently that access incurs pointer chasing,
  i.e., cfs_rq->tg->se[cpu]. After co-locating, the sched_entity fields
  can be loaded with a simple offset computation from cfs_rq (see the
  access sketch after this list).

2) Allocate the combined cfs_rq/se struct using percpu allocator.

- Accesses to cfs_rq instances in hot paths mostly iterate over
  multiple task_groups for the same CPU. With the new percpu layout,
  these accesses reuse the same per-CPU base pointer, which is more
  likely to reside in the CPU cache than the per-task_group pointer
  arrays.

- This optimization also saves the memory previously needed for the
  pointer arrays.
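
A sketch of the resulting access paths, continuing with the
placeholder names from the layout sketch above (the helpers in the
actual patches may differ):

	/* parent entity: a fixed offset from cfs_rq, instead of chasing
	 * cfs_rq->tg->se[cpu] */
	static inline struct sched_entity *cfs_rq_se(struct cfs_rq *cfs_rq)
	{
		struct cfs_rq_with_se *combined =
			container_of(cfs_rq, struct cfs_rq_with_se, cfs_rq);

		return &combined->se;
	}

	/* per-CPU lookup: the task_group's per-cpu offset plus the CPU's
	 * per-cpu base, with no per-task_group pointer array in between */
	static inline struct cfs_rq *tg_cfs_rq(struct task_group *tg, int cpu)
	{
		return &per_cpu_ptr(tg->cfs_rq_pcpu, cpu)->cfs_rq;
	}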

To measure the impact of the patch series, we construct a tree-shaped
cgroup hierarchy, with “width” and “depth” parameters controlling the
number of children per node and the depth of the tree. Each leaf
cgroup runs a schbench workload and is given a CPU quota equal to 80%
of the total CPU capacity divided by the number of leaf cgroups (in
other words, the target CPU load is set to 80%), so that the
throttling code paths are exercised. The bandwidth control period is
set to 10ms. We run the benchmark on Intel and AMD machines; each
machine has hundreds of hardware threads.
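For example, on a hypothetical 256-thread machine with the depth-3,
width-10 tree (1000 leaf cgroups), each leaf would receive roughly
0.8 * 256 / 1000 ≈ 0.2 CPUs of quota, i.e., about 2ms of runtime per
10ms period.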

Tests were conducted on 6.15.

| Kernel LLC Misses | depth 3 width 10    | depth 5 width 4     |
+-------------------+---------------------+---------------------+
| AMD-orig          | [2218.98, 2241.89]M | [2599.80, 2645.16]M |
| AMD-opt           | [1957.62, 1981.55]M | [2380.47, 2431.86]M |
| Change            | -11.69%             | -8.25%              |
| Intel-orig        | [1580.53, 1604.90]M | [2125.37, 2208.68]M |
| Intel-opt         | [1066.94, 1100.19]M | [1543.77, 1570.83]M |
| Change            | -31.96%             | -28.13%             |

There is also a 25% improvement in kernel IPC on the AMD system. On
Intel, the IPC improvement is 3%, despite the larger LLC miss
reduction.

Other workloads without CPU share limits, while also running in a cgroup
hierarchy with O(1000) instances, show no obvious regression:

sysbench, hackbench - lower is better; ebizzy - higher is better.

workload  | base                  | opt                   | metric
----------+-----------------------+-----------------------+------------
sysbench  | 63.55, [63.04, 64.05] | 64.36, [62.97, 65.75] | avg latency
hackbench | 36.95, [35.45, 38.45] | 37.12, [35.81, 38.44] | time
ebizzy    | 610.7, [569.8, 651.6] | 613.5, [592.1, 635.0] | record/s

Zecheng Li (3):
  sched/fair: Co-locate cfs_rq and sched_entity
  sched/fair: Remove task_group->se pointer array
  sched/fair: Allocate both cfs_rq and sched_entity with per-cpu

 kernel/sched/core.c  | 40 ++++++++-------------
 kernel/sched/debug.c |  2 +-
 kernel/sched/fair.c  | 83 ++++++++++++++++----------------------------
 kernel/sched/sched.h | 48 ++++++++++++++++++++-----
 4 files changed, 85 insertions(+), 88 deletions(-)


base-commit: 5b726e9bf9544a349090879a513a5e00da486c14
-- 
2.51.0.338.gd7d06c2dae-goog
Re: [PATCH v4 0/3] sched/fair: Optimize cfs_rq and sched_entity allocation for better data locality
Posted by Zecheng Li 2 weeks ago
Hi all,

Gentle ping on this patch series. Any feedback would be appreciated.

Thanks,
Zecheng
