Hi all,

This patch series improves CFS cache performance by allocating cfs_rq
and sched_entity together in the per-cpu allocator. This allows the
pointer arrays in task_group to be replaced with a per-cpu offset.

v4:
- Rebased on tip/sched/core
- Intel kernel test robot results:
  https://lore.kernel.org/all/202507161052.ed3213f4-lkp@intel.com/

v3: https://lore.kernel.org/all/20250701210230.2985885-1-zecheng@google.com/
- Rebased on top of 6.16-rc4.
- Minor wording and comment updates.

v2: https://lore.kernel.org/lkml/20250609193834.2556866-1-zecheng@google.com/
- Allocate cfs_rq and sched_entity together for non-root task groups
  instead of embedding sched_entity into cfs_rq, to avoid increasing
  the size of struct rq, based on feedback from Peter Zijlstra.

v1: https://lore.kernel.org/lkml/20250604195846.193159-1-zecheng@google.com/

Accessing cfs_rq and sched_entity instances incurs many cache misses.
This series aims to reduce these cache misses. A struct cfs_rq
instance is per CPU and per task_group: each task_group instance (and
the root runqueue) holds one cfs_rq instance per CPU. Additionally,
there is a corresponding struct sched_entity instance for each cfs_rq
instance (except the root). Currently, both cfs_rq and sched_entity
instances are allocated in NUMA-local memory using kzalloc_node(), and
tg->cfs_rq and tg->se are arrays of pointers.

Original memory layout:

tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
tg->se     = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);

+----+       +-----------------+
| tg | ----> | cfs_rq pointers |
+----+       +-----------------+
               |      |      |
               v      v      v
             cfs_rq cfs_rq cfs_rq

+----+       +-------------------+
| tg | ----> | sched_entity ptrs |
+----+       +-------------------+
               |       |       |
               v       v       v
               se      se      se

Layout after optimization:

+--------+    | CPU 0  |  | CPU 1  |  | CPU 2  |
|   tg   |    | percpu |  | percpu |  | percpu |
|        |       ...        ...        ...
| percpu | -> | cfs_rq |  | cfs_rq |  | cfs_rq |
| offset |    |   se   |  |   se   |  |   se   |
+--------+    +--------+  +--------+  +--------+

The optimization has two parts (a hypothetical C sketch of the
resulting layout follows the benchmark description below):

1) Co-allocate cfs_rq and sched_entity for non-root task groups.

   - This speeds up loading the sched_entity of the parent runqueue.
     Currently that load incurs pointer chasing, i.e.
     cfs_rq->tg->se[cpu]. After co-locating, the sched_entity fields
     can be loaded with a simple offset computation from cfs_rq.

2) Allocate the combined cfs_rq/se struct with the percpu allocator.

   - Accesses to cfs_rq instances in hot paths mostly iterate over
     multiple task_groups for the same CPU. With the new percpu
     layout these accesses reuse the per-cpu base pointer, which is
     more likely to reside in the CPU cache than the per-task_group
     pointer arrays.

   - This optimization also reduces the memory needed for the arrays
     of pointers.

To measure the impact of the series, we construct a tree-structured
hierarchy of cgroups, with "width" and "depth" parameters controlling
the number of children per node and the depth of the tree. Each leaf
cgroup runs a schbench workload and is given a CPU quota of 80% of the
machine's total CPU divided by the number of leaf cgroups (in other
words, the target CPU load is 80%), to exercise the throttling
functions. The bandwidth control period is set to 10ms. We run the
benchmark on Intel and AMD machines; each machine has hundreds of
hardware threads. Tests were conducted on 6.15.
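To make the intended layout concrete, here is a minimal C sketch of
the combined structure and its accessors. All identifiers below except
struct cfs_rq, struct sched_entity and the percpu API are made up for
illustration; the actual patches may name and arrange things
differently.

/*
 * Illustrative sketch only, not the actual patch. Hypothetical names:
 * cfs_rq_with_se, cfs_rq_pcpu, tg_cfs_rq(), tg_se(), cfs_rq_se().
 */
#include <linux/percpu.h>
#include "sched.h"	/* struct cfs_rq, struct sched_entity */

/* Part 1: co-locate each non-root cfs_rq with its owning sched_entity. */
struct cfs_rq_with_se {
	struct cfs_rq		cfs_rq;
	struct sched_entity	se;	/* enqueued in the parent's cfs_rq */
};

/*
 * Part 2: one per-cpu allocation per task_group replaces the two
 * nr_cpu_ids-sized pointer arrays tg->cfs_rq[] and tg->se[].
 */
struct task_group_sketch {
	struct cfs_rq_with_se __percpu *cfs_rq_pcpu;
	/* ... remaining task_group fields ... */
};

static int alloc_tg_cfs_rq(struct task_group_sketch *tg)
{
	/* The percpu allocator provides NUMA-local per-cpu chunks. */
	tg->cfs_rq_pcpu = alloc_percpu(struct cfs_rq_with_se);
	return tg->cfs_rq_pcpu ? 0 : -ENOMEM;
}

/* Replaces tg->cfs_rq[cpu]: an offset from the per-cpu base pointer. */
static inline struct cfs_rq *tg_cfs_rq(struct task_group_sketch *tg, int cpu)
{
	return &per_cpu_ptr(tg->cfs_rq_pcpu, cpu)->cfs_rq;
}

/* Replaces tg->se[cpu]. */
static inline struct sched_entity *tg_se(struct task_group_sketch *tg, int cpu)
{
	return &per_cpu_ptr(tg->cfs_rq_pcpu, cpu)->se;
}

/*
 * The owning entity of a cfs_rq is now a fixed offset away, so the
 * cfs_rq->tg->se[cpu] pointer chase disappears.
 */
static inline struct sched_entity *cfs_rq_se(struct cfs_rq *cfs_rq)
{
	return &container_of(cfs_rq, struct cfs_rq_with_se, cfs_rq)->se;
}

In hot paths that walk many task_groups on the same CPU, per_cpu_ptr()
reuses the already-cached per-cpu base for that CPU instead of
dereferencing a different pointer array per task_group, which is where
the cache-miss reduction is expected to come from.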
| Kernel LLC Misses | depth 3 width 10    | depth 5 width 4     |
+-------------------+---------------------+---------------------+
| AMD-orig          | [2218.98, 2241.89]M | [2599.80, 2645.16]M |
| AMD-opt           | [1957.62, 1981.55]M | [2380.47, 2431.86]M |
| Change            | -11.69%             | -8.248%             |
| Intel-orig        | [1580.53, 1604.90]M | [2125.37, 2208.68]M |
| Intel-opt         | [1066.94, 1100.19]M | [1543.77, 1570.83]M |
| Change            | -31.96%             | -28.13%             |

There is also a 25% improvement in kernel IPC on the AMD system. On
Intel, the IPC improvement is 3% despite the larger LLC miss
reduction.

Other workloads without CPU share limits, also running in a cgroup
hierarchy with O(1000) instances, show no obvious regression
(sysbench, hackbench: lower is better; ebizzy: higher is better):

workload  | base                  | opt                   | metric
----------+-----------------------+-----------------------+------------
sysbench  | 63.55, [63.04, 64.05] | 64.36, [62.97, 65.75] | avg latency
hackbench | 36.95, [35.45, 38.45] | 37.12, [35.81, 38.44] | time
ebizzy    | 610.7, [569.8, 651.6] | 613.5, [592.1, 635.0] | record/s

Zecheng Li (3):
  sched/fair: Co-locate cfs_rq and sched_entity
  sched/fair: Remove task_group->se pointer array
  sched/fair: Allocate both cfs_rq and sched_entity with per-cpu

 kernel/sched/core.c  | 40 ++++++++-------------
 kernel/sched/debug.c |  2 +-
 kernel/sched/fair.c  | 83 ++++++++++++++++----------------------------
 kernel/sched/sched.h | 48 ++++++++++++++++++++-----
 4 files changed, 85 insertions(+), 88 deletions(-)


base-commit: 5b726e9bf9544a349090879a513a5e00da486c14
--
2.51.0.338.gd7d06c2dae-goog
Hi all,

Gentle ping on this patch series. Any feedback would be appreciated.

Thanks,
Zecheng