Hi!
So cgroup scheduling has always been a pain in the arse. The problems start
with weight distribution and end with hierarchical picks and it all sucks.
The problems with weight distribution are related to that infernal global
fraction:
                  tg->w * grq_i->w
        ge_i->w = ----------------
                  \Sum_j grq_j->w
which we've approximated reasonably well by now. However, the immediate
consequence of this fraction is that the total group weight (tg->w) gets
fragmented across all your CPUs. At 64 CPUs that means your per-CPU cgroup
weight ends up being worth about one nice 19 task, and with more CPUs it gets
tinier still. Combined with the fact that 256 CPU systems are relatively common
these days, this becomes painful.
The common 'solution' is to inflate the group weight by 'nr_cpus'; the
immediate problem with that is that when all load of a group gets concentrated
on a single CPU, the per-cpu cgroup weight becomes insanely large, easily
exceeding nice -20.
Additionally there are numerical limits on the max weight you can have before
the math starts suffering overflows. As such there is a definite limit on the
total group weight. Which has annoyed people ;-)
The first few patches add a knob /debug/sched/cgroup_mode and a few different
options on how to deal with this. My favourite is 'concur', but obviously that
is also the most expensive one :-/ It adds a tg->tasks counter which makes the
update_tg_load_avg() thing more expensive.
I have some ideas but I figured I ought to share these things before sinking
more time into it.
On to the hierarchical pick; this has been causing trouble for a very long
time. So once again an attempt at flattening it. The basic idea is to keep the
full hierarchical load tracking as-is, but keep all the runnable entities in a
single level. The immediate consequence of all this is of course that we need to
constantly re-compute the effective weight of each entity as things progress.
Reweight is done on:
- enqueue
- pick -- or rather set_next_entity(.first=true)
- tick
So while the {en,de}queue operations are still O(depth) due to the full
accounting mess, the pick is now a single level. Removing the intermediate
levels that obscure runnability etc.
For testing, I've done a little experiment: I dug out what is colloquially known
as a potato, a trusty old Sandy Bridge i7-2600K with an RX 580, and ran a game
on it. From GOG, I had available 'Shadows: Awakening', a fun title that normally
runs really well on this machine (provided you stick to 1080p).
To make it interesting, I added 8 (one for each logical CPU) copies of: 'nice
spin.sh'; this results in the game becoming almost unplayable, as in proper
terrible.
I used MangoHUD to record a few minutes of playtime for statistics, and then
quit the game and re-started it with a shorter slice set (base/10). This
results in the game being entirely playable -- not great, but definitely
playable.
Lutris / GE-Proton10-34 / Steam Runtime 3 (sniper)
Intel Core i7-2600K
AMD Radeon RX 580
Shadows Awakening (GOG)
                default   slice(*)

  FPS   min       3.8       20.6
        avg      48.0       57.2
        max      87.4       80.3

  FT    min       9.4        8.4
        avg      34.5       19.5
        max     107.4       37.2

  FPS (Frames Per Second)
  FT  (FrameTime)

  (*) Command prefix: 'chrt -o --sched-runtime 280000 0'
      effectively setting 'base_slice_ns/10'
I have not compared to a kernel without flat on, just wanted to run non trivial
workloads and play with slice to make sure everything 'works'.
Can also be had:
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/flat
include/linux/cpuset.h | 6
include/linux/sched.h | 1
kernel/cgroup/cpuset.c | 15
kernel/sched/core.c | 47 --
kernel/sched/debug.c | 171 +++++---
kernel/sched/fair.c | 1038 ++++++++++++++++++++++---------------------------
kernel/sched/pelt.c | 6
kernel/sched/sched.h | 44 --
8 files changed, 672 insertions(+), 656 deletions(-)
---
Change since v1 ( https://patch.msgid.link/20260317095113.387450089@infradead.org ):
- various Sashiko thingies
- rebase atop current -tip
On 05/11/26 13:31, Peter Zijlstra wrote:
> Hi!
>
> So cgroup scheduling has always been a pain in the arse. The problems start
> with weight distribution and end with hierarchical picks and it all sucks.

It does..

Not that it is useful info, but we talked briefly about it at OSPM, so I
thought I'd report back. I gave this a go with my test case from the schedqos
announcement [1] of running schbench with a kernel build as BACKGROUND noise,
but the fairness imposed at group level is preserved (as expected) even if the
pick is flattened.

I do actually want a totally flat system, ie: disable this whole thing :-)

The problem is that once you want to introduce smart tagging based on tasks,
group scheduling becomes a big problem. If a task is set as background or
interactive, the tag has to be global to be enforced, otherwise it loses its
meaning. My test case stresses two long running tasks, one interactive and the
other background, and group scheduling imposes fairness that breaks the task
level tagging. Managing deadline via runtime doesn't help here since they are
both always-busy tasks, so one must use nice values to manage bandwidth. But
nice values are local when autogroup/cgroups are present.

I need to find a simple way to turn this thing off at runtime and properly
flatten it ;-)

/runs away

[1] https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/
On Mon, 11 May 2026 at 14:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hi!
>
> So cgroup scheduling has always been a pain in the arse. The problems start
> with weight distribution and end with hierarchical picks and it all sucks.
>
> The problems with weight distribution are related to that infernal global
> fraction:
>
>                   tg->w * grq_i->w
>         ge_i->w = ----------------
>                   \Sum_j grq_j->w
>
> which we've approximated reasonably well by now. However, the immediate
> consequence of this fraction is that the total group weight (tg->w) gets
> fragmented across all your CPUs. At 64 CPUs that means your per-CPU cgroup
> weight ends up being worth about one nice 19 task, and with more CPUs it gets
> tinier still. Combined with the fact that 256 CPU systems are relatively common
> these days, this becomes painful.
>
> The common 'solution' is to inflate the group weight by 'nr_cpus'; the
> immediate problem with that is that when all load of a group gets concentrated
> on a single CPU, the per-cpu cgroup weight becomes insanely large, easily
> exceeding nice -20.
>
> Additionally there are numerical limits on the max weight you can have before
> the math starts suffering overflows. As such there is a definite limit on the
> total group weight. Which has annoyed people ;-)
>
> The first few patches add a knob /debug/sched/cgroup_mode and a few different
> options on how to deal with this. My favourite is 'concur', but obviously that
> is also the most expensive one :-/ It adds a tg->tasks counter which makes the
> update_tg_load_avg() thing more expensive.
>
> I have some ideas but I figured I ought to share these things before sinking
> more time into it.
>
>
> On to the hierarchical pick; this has been causing trouble for a very long
> time. So once again an attempt at flattening it. The basic idea is to keep the
> full hierarchical load tracking as-is, but keep all the runnable entities in a
> single level. The immediate consequence of all this is of course that we need to
> constantly re-compute the effective weight of each entity as things progress.
>
> Reweight is done on:
> - enqueue
> - pick -- or rather set_next_entity(.first=true)
> - tick
>
> So while the {en,de}queue operations are still O(depth) due to the full
> accounting mess, the pick is now a single level. Removing the intermediate
> levels that obscure runnability etc.
>
>
> For testing, I've done a little experiment: I dug out what is colloquially known
> as a potato, a trusty old Sandy Bridge i7-2600K with an RX 580, and ran a game
> on it. From GOG, I had available 'Shadows: Awakening', a fun title that normally
> runs really well on this machine (provided you stick to 1080p).
>
> To make it interesting, I added 8 (one for each logical CPU) copies of: 'nice
> spin.sh'; this results in the game becoming almost unplayable, as in proper
> terrible.
>
> I used MangoHUD to record a few minutes of playtime for statistics, and then
> quit the game and re-started it with a shorter slice set (base/10). This
> results in the game being entirely playable -- not great, but definitely
> playable.
>
> Lutris / GE-Proton10-34 / Steam Runtime 3 (sniper)
> Intel Core i7-2600K
> AMD Radeon RX 580
>
> Shadows Awakening (GOG)
>
>                 default   slice(*)
>
>   FPS   min       3.8       20.6
>         avg      48.0       57.2
>         max      87.4       80.3
>
>   FT    min       9.4        8.4
>         avg      34.5       19.5
>         max     107.4       37.2
>
>   FPS (Frames Per Second)
>   FT  (FrameTime)
>
>   (*) Command prefix: 'chrt -o --sched-runtime 280000 0'
>       effectively setting 'base_slice_ns/10'
>
> I have not compared to a kernel without flat on, just wanted to run non trivial
> workloads and play with slice to make sure everything 'works'.
I haven't reviewed the patches yet, but I ran some tests with them while
testing sched latency related changes for short slice wakeup
preemption. I see some large hackbench regressions with this series
on HMP systems, both with and without EAS. Those figures are unexpected
because the benchmarks run in the root cfs.
One example with hackbench 8 groups thread pipe:

                          tip/sched/core          +this patchset
dragonboard rb5 with EAS
  slice 2.8ms             0.748(+/-4.6%)          1.915(+/-7.9%) -156%
  slice 16ms              0.621(+/-3.6%) +17%     0.689(+/-9.1%)   +8%

radxa orion6 HMP without EAS
  slice 2.8ms             0.588(+/-5.8%)          1.505(+/-10%)  -156%
  slice 16ms              0.677(+/-5.9%) -15%     1.071(+/-5.9%)  -82%

Increasing the slice partly removes the regressions, but this is surprising
because the bench runs in the root cfs and I thought that the results would not
change in such a case.
I will review the patchset and try to find out what is going wrong.
>
>
> Can also be had:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/flat
>
> include/linux/cpuset.h | 6
> include/linux/sched.h | 1
> kernel/cgroup/cpuset.c | 15
> kernel/sched/core.c | 47 --
> kernel/sched/debug.c | 171 +++++---
> kernel/sched/fair.c | 1038 ++++++++++++++++++++++---------------------------
> kernel/sched/pelt.c | 6
> kernel/sched/sched.h | 44 --
> 8 files changed, 672 insertions(+), 656 deletions(-)
>
> ---
> Change since v1 ( https://patch.msgid.link/20260317095113.387450089@infradead.org ):
> - various Sashiko thingies
> - rebase atop current -tip
>
>
On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:

> Increasing the slice partly removes regressions but this is surprising
> because the bench runs at root cfs and I thought that results will not
> change in such a case

D'oh :/

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e54da4c6c945..77d0e1937f2c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9071,7 +9071,7 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
 	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
 	struct task_struct *donor = rq->donor;
 	struct sched_entity *nse, *se = &donor->se, *pse = &p->se;
-	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	int cse_is_idle, pse_is_idle;
 
 	/*
On Wed, 13 May 2026 at 13:35, Peter Zijlstra <peterz@infradead.org> wrote:
>
> D'oh :/
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e54da4c6c945..77d0e1937f2c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9071,7 +9071,7 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
>         enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
>         struct task_struct *donor = rq->donor;
>         struct sched_entity *nse, *se = &donor->se, *pse = &p->se;
> -       struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> +       struct cfs_rq *cfs_rq = &rq->cfs;

I tested this patch on top of the series but it doesn't fix the perf
regression on rb5.

hackbench 8 groups thread pipe is still at 1.907(+/-7.6%) with the default
slice duration.

>         int cse_is_idle, pse_is_idle;
>
>         /*
On Mon, May 18, 2026 at 03:34:51PM +0200, Vincent Guittot wrote:

> I tested this patch on top of the series but it doesn't fix the perf
> regression on rb5
>
> hackbench 8 groups thread pipe is still at 1.907(+/-7.6%) with default
> slice duration

Weird, I can't reproduce anymore with this fixed :/

I'll try more hackbench variants tomorrow I suppose.
On Mon, 18 May 2026 at 23:12, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Weird, I can't reproduce anymore with this fixed :/
>
> I'll try more hackbench variants tomorrow I suppose.

I tried several configurations:
- HMP with EAS enabled
- HMP without EAS enabled (perf cpufreq gov)
- SMP (only the 4 little cores)

All of them show large regressions with hackbench, which are almost
recovered when increasing the slice from 2.8ms to 16ms.
On Tue, 19 May 2026 at 12:13, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> I tried several conf :
> - HMP with EAS enabled
> - HMP without EAS enabled (perf cpufreq gov)
> - SMP (only the 4 little cores)
>
> All of them show large regressions with hackbench which are almost
> recovered when increasing the slice from 2.8 to 16ms

With patch 10 the vlag value is very often set to the max of 3.8ms (the clamp
value of 2.8ms slice + 1ms tick), whereas it is usually less than 1ms without
patch 10.
On Wed, May 13, 2026 at 01:35:10PM +0200, Peter Zijlstra wrote:
> On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:
>
> > I haven't reviewed the patches yet but I ran some tests with it while
> > testing sched latency related changes for short slice wakeup
> > preemption. I have some large hackbench regressions with this series
> > on HMP system with and without EAS. those figures are unexpected
> > because the benchs run on root cfs
> >
> > One example with hackbench 8 groups thread pipe:
> >
> >                           tip/sched/core          +this patchset
> > dragonboard rb5 with EAS
> >   slice 2.8ms             0.748(+/-4.6%)          1.915(+/-7.9%) -156%
> >   slice 16ms              0.621(+/-3.6%) +17%     0.689(+/-9.1%)   +8%
> >
> > radxa orion6 HMP without EAS
> >   slice 2.8ms             0.588(+/-5.8%)          1.505(+/-10%)  -156%
> >   slice 16ms              0.677(+/-5.9%) -15%     1.071(+/-5.9%)  -82%
> >
> > Increasing the slice partly removes regressions but this is surprising
> > because the bench runs at root cfs and I thought that results will not
> > change in such a case
>
> D'oh :/
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e54da4c6c945..77d0e1937f2c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9071,7 +9071,7 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
> enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
> struct task_struct *donor = rq->donor;
> struct sched_entity *nse, *se = &donor->se, *pse = &p->se;
> - struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> + struct cfs_rq *cfs_rq = &rq->cfs;
> int cse_is_idle, pse_is_idle;
>
> /*
With that fixed, I now get:
                vanilla   slice(*)

  FPS   min       3.0       11.1
        avg      44.7       57.3
        max      88.1       96.2

  FT    min       9.1        8.0
        avg      41.4       21.0
        max     157.2       53.9

  FPS (Frames Per Second)
  FT  (FrameTime)

Which I suppose shows we now preempt less. It's still significantly
better with reduced slice, but not as good as it was.
On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:

> Increasing the slice partly removes regressions but this is surprising
> because the bench runs at root cfs and I thought that results will not
> change in such a case
>
> I will review the patchset and try to get what is going wrong

Yeah, that is unexpected. Let me go have another look too.
On Tue, May 12, 2026 at 11:20:40AM +0200, Peter Zijlstra wrote:

> Yeah, that is unexpected. Let me go have another look too.

So I can reproduce even without the last patch applied. I suspect it is
in the cgroup mode patches somewhere. My first suspect is that concur
mode thing doing bad things to track the 'global' nr_running thing.
On Tue, May 12, 2026 at 08:24:39PM +0200, Peter Zijlstra wrote:

> So I can reproduce even without the last patch applied. I suspect it is
> in the cgroup mode patches somewhere. My first suspect is that concur
> mode thing doing bad things to track the 'global' nr_running thing.

Argh, n/m PEBKAC. I'll try this again in the morning :/
On Tue, 12 May 2026 at 20:25, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Argh, n/m PEBKAC. I'll try this again in the morning :/

Reverting the last patch is enough to recover performance.
On Tue, May 12, 2026 at 08:32:12PM +0200, Vincent Guittot wrote:

> Reverting the last patch is enough to recover performance

Yeah, I was on a fail-streak yesterday. I forgot to copy the kernel
image before reboot ...

Let's see if today is better :-)
Hello, Peter.

On Mon, May 11, 2026 at 01:31:04PM +0200, Peter Zijlstra wrote:
> The first few patches add a knob /debug/sched/cgroup_mode and a few different
> options on how to deal with this. My favourite is 'concur', but obviously that
> is also the most expensive one :-/ It adds a tg->tasks counter which makes the
> update_tg_load_avg() thing more expensive.

Ignoring fixed math accuracy problems, isn't the root problem here that
every thread in the root cgroup competes as if each is its own cgroup? ie.
Isn't the canonical solution here to create an enveloping group, at least
for share calculation purposes, for root threads and then assign them some
weight so that they compete in the same way that other cgroups do? Then, the
different modes go away or rather whatever the user wants can be expressed
via root's weight if that's to be made configurable.

Thanks.

-- 
tejun
On Mon, May 11, 2026 at 09:23:45AM -1000, Tejun Heo wrote:
> Hello, Peter.
>
> On Mon, May 11, 2026 at 01:31:04PM +0200, Peter Zijlstra wrote:
> > So cgroup scheduling has always been a pain in the arse. The problems start
> > with weight distribution and end with hierarchical picks and it all sucks.
> >
> > The problems with weight distribution are related to that infernal global
> > fraction:
> >
> >               tg->w * grq_i->w
> >   ge_i->w = ----------------
> >              \Sum_j grq_j->w
> >
> > which we've approximated reasonably well by now. However, the immediate
> > consequence of this fraction is that the total group weight (tg->w) gets
> > fragmented across all your CPUs. And at 64 CPUs that means your per-cpu cgroup
> > weight ends up being a nice 19 task worth. And more CPUs more tiny. Combine
> > with the fact that 256 CPU systems are relatively common these days, this
> > becomes painful.
> >
> > The common 'solution' is to inflate the group weight by 'nr_cpus'; the
> > immediate problem with that is that when all load of a group gets concentrated
> > on a single CPU, the per-cpu cgroup weight becomes insanely large, easily
> > exceeding nice -20.
> >
> > Additionally there are numerical limits on the max weight you can have before
> > the math starts suffering overflows. As such there is a definite limit on the
> > total group weight. Which has annoyed people ;-)
> >
> > The first few patches add a knob /debug/sched/cgroup_mode and a few different
> > options on how to deal with this. My favourite is 'concur', but obviously that
> > is also the most expensive one :-/ It adds a tg->tasks counter which makes the
> > update_tg_load_avg() thing more expensive.
>
> Ignoring fixed math accuracy problems, isn't the root problem here that
> every thread in the root cgroup competes as if each is its own cgroup? ie.
> Isn't the canonical solution here to create an enveloping group, at least
> for share calculation purposes, for root threads and then assign them some
> weight so that they compete in the same way that other cgroups do? Then, the
> different modes go away or rather whatever the user wants can be expressed
> via root's weight if that's to be made configurable.

As long as the total group weight is a fraction; and it sorta has to be.
You can run into trouble by stacking that fraction.

Take 256 CPUs and a group weight of 1024. Then each CPU gets a weight of
1/256 or 4. Even if we increase the internal accuracy to 20 bits (we do
on 64bit) then this becomes 4096, do this for 2 more levels in the
hierarchy and you're down to scraping the barrel again.

So if each level runs at a fraction f of the level above, then level n
runs at f^n.

Moving root into a phantom group at level 1, only solves the problem
against other tasks at level 1, but then you have the same problem again
at level 2 and below.

Both the numerical problems and the scale problem of the root group can
be avoided if we can get the average/nominal fraction to be near 1.

The 'normal' way around this is to ensure the group weight is nr_cpus *
1024, then, when everybody is running, the per CPU weight is 1024 or 1
and the continued fraction is also 1-ish. This is why people like to
increase the max group weight.

Trouble is of course that if not all CPUs are busy, with the extreme
being only a single CPU carrying that weight of nr_cpus*1024, this then
causes trouble because that one CPU gets overloaded.

One of the options is to simply put a max on the single CPU load; which
is the crudest option to just make it 'work'. The one I favour though is
the one where we scale the group weight by: 'min(cpumask, nr_tasks)'.

Anyway, this is why I've been looking at these alternative weight
schemes, to get the nominal fraction near 1 and make these problems go
away. It is both the numerical issues and the disparity between levels
(with root being at level 0 being the most obvious).

Does that make sense?
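
The stacking argument above can be checked with a few lines of exact
arithmetic. This is only an illustrative sketch, not scheduler code: the
helper names are hypothetical, load is assumed to be spread evenly over
the busy CPUs, and 'min(cpumask, nr_tasks)' is read here as min(number of
CPUs in the group's mask, number of runnable tasks), which is an
assumption about the intended meaning.

```python
from fractions import Fraction

NICE_0 = 1024  # nominal weight of a single nice-0 task


def per_cpu_weight(group_weight, busy_cpus):
    """Per-CPU share of a group weight, assuming an even spread of load."""
    return Fraction(group_weight, busy_cpus)


def nominal_fraction(group_weight, busy_cpus, levels):
    """f = per-CPU weight / NICE_0, compounded over `levels` nesting levels."""
    f = per_cpu_weight(group_weight, busy_cpus) / NICE_0
    return f ** levels


def scaled_group_weight(group_weight, cpus_in_mask, nr_tasks):
    """Hypothetical reading of scaling tg->w by min(cpumask, nr_tasks)."""
    return group_weight * min(cpus_in_mask, nr_tasks)


# 256 CPUs and a group weight of 1024: each CPU sees 1024/256.
print(per_cpu_weight(1024, 256))                    # 4
# With 20 bits of internal precision the same share is 2^20/256.
print(per_cpu_weight(1 << 20, 256))                 # 4096
# Three nested levels at f = 1/256 each.
print(nominal_fraction(1024, 256, 3))               # 1/16777216
# Even 20 bits of precision cannot represent that without loss.
print((1 << 20) * nominal_fraction(1024, 256, 3))   # 1/16

# Scaled weight, all 256 CPUs busy with 256 tasks: f stays at 1 per level.
w_full = scaled_group_weight(1024, 256, 256)
print(nominal_fraction(w_full, 256, 3))             # 1
# Scaled weight, one task on one CPU: 1024, not 256 * 1024.
w_single = scaled_group_weight(1024, 256, 1)
print(per_cpu_weight(w_single, 1))                  # 1024
```

Fraction is used instead of floats so that the point where fixed-point
precision runs out (the 1/16 above) shows up exactly.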
Hello, Peter.

On Tue, May 12, 2026 at 10:10:00AM +0200, Peter Zijlstra wrote:
...
> Anyway, this is why I've been looking at these alternative weight
> schemes, to get the nominal fraction near 1 and make these problems go
> away. It is both the numerical issues and the disparity between levels
> (with root being at level 0 being the most obvious).

I see. I think what bothers me is that I'm unsure what the weight config
would mean when the shares are scaled by the number of active cpus in that
cgroup.

Here's a simple example:

- There are 256 cpus.
- /cgroup-A has weight 100 and 128 active threads. No pinning.
- /cgroup-B has weight 100 and 256 active threads. No pinning.

In the current code, assuming math holds up, cgroup-A and B would get about
the same shares - ~128 CPUs each. However, if we scale the share by active
CPUs in each cgroup, B's tasks would end up with the same weight as A's on
CPUs that they end up competing on, which would lead to ~ 1:3 distribution.
Is that the right reading of the code?

Thanks.

--
tejun
On Tue, May 12, 2026 at 08:45:21AM -1000, Tejun Heo wrote:
> Hello, Peter.
>
> On Tue, May 12, 2026 at 10:10:00AM +0200, Peter Zijlstra wrote:
> ...
> > Anyway, this is why I've been looking at these alternative weight
> > schemes, to get the nominal fraction near 1 and make these problems go
> > away. It is both the numerical issues and the disparity between levels
> > (with root being at level 0 being the most obvious).
>
> I see. I think what bothers me is that I'm unsure what the weight config
> would mean when the shares are scaled by the number of active cpus in that
> cgroup.

Relative weight per active cpu :-), but yes, that is a somewhat more
difficult concept I suppose.

> Here's a simple example:
>
> - There are 256 cpus.
> - /cgroup-A has weight 100 and 128 active threads. No pinning.
> - /cgroup-B has weight 100 and 256 active threads. No pinning.
>
> In the current code, assuming math holds up, cgroup-A and B would get about
> the same shares - ~128 CPUs each. However, if we scale the share by active
> CPUs in each cgroup, B's tasks would end up with the same weight as A's on
> CPUs that they end up competing on, which would lead to ~ 1:3 distribution.
> Is that the right reading of the code?

Indeed. So both A and B will get ~1024 weight per (active) CPU, such
that on the CPUs they contend they will get 1:1 and then B will get the
full CPU on the uncontested CPUs, resulting in a total of 1:3
distribution.

This can of course be compensated by increasing the relative weight of
A, if that is so desired.

But the alternative view is that for those 128 CPUs they overlap, A and
B will get equal parts, it is just that B consumes another 128 CPUs and
will not have contention there.

So the current scheme will inflate the part of A to be double the weight
(of B), giving them 2 out of 3 parts on the contended CPUs, but then B
will still get complete / uncontested access to those extra 128 CPUs,
resulting in a 2:4 weight distribution.

Which also isn't as straight forward as one might think.

So perhaps 'weight on the CPUs you contest on' isn't as unintuitive as
it seems at first glance, it's just different. And it has tremendous
advantages as outlined before; it is naturally normalized -- the
disparity between nesting levels goes away, and the edge case of a
single CPU active will be sane.

Eg. consider your example except now A will have 1 active thread. Then A
will get the full group weight (1024) on its one CPU, while B will get
(1024/256=4) on each CPU.

So for the one contended CPU A gets 256 out of 257 parts, while B gets
the full CPU for the remaining 255 CPUs, for a:

  256    1        257
  --- : --- + 255*--- = 256:65535 ~ 1:256
  257   257       257

distribution. While with the new scheme it would be:

  1   1       2
  - : - + 255*- = 1:511
  2   2       2

Which, realistically isn't all that different, except the old scheme has
this really large weight to deal with.

So from where I'm sitting, yes different, but it behaves better.
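
The four ratios discussed above (2:4, 1:3, ~1:256 and 1:511) can be
reproduced with a small model. This is a sketch under stated assumptions,
not the scheduler's algorithm: one thread per CPU, the smaller group's
CPUs fully overlapping the larger group's, a group weight of 1024 for
both, and `old_scheme` / `new_scheme` as hypothetical names for "group
weight divided across busy CPUs" versus "full group weight on every
active CPU".

```python
from fractions import Fraction


def cpu_time(per_cpu_weights, placement, nr_cpus):
    """Total CPU time per group, in units of whole CPUs.

    per_cpu_weights: {group: weight it carries on each CPU it runs on}
    placement: {group: n}, meaning the group occupies CPUs 0..n-1
    """
    totals = {g: Fraction(0) for g in per_cpu_weights}
    for cpu in range(nr_cpus):
        present = [g for g, n in placement.items() if cpu < n]
        total_w = sum(per_cpu_weights[g] for g in present)
        for g in present:
            # Each group gets its weight's share of this CPU.
            totals[g] += Fraction(per_cpu_weights[g]) / total_w
    return totals


def old_scheme(tg_weight, busy_cpus):
    """Group weight is split evenly over the CPUs the group is busy on."""
    return Fraction(tg_weight, busy_cpus)


def new_scheme(tg_weight, busy_cpus):
    """Assumed model: the group carries its full weight on each active CPU."""
    return Fraction(tg_weight)


def b_over_a(scheme, threads_a, threads_b, nr_cpus=256, tg_weight=1024):
    """CPU time of group B relative to group A under the given scheme."""
    placement = {"A": threads_a, "B": threads_b}
    weights = {g: scheme(tg_weight, n) for g, n in placement.items()}
    t = cpu_time(weights, placement, nr_cpus)
    return t["B"] / t["A"]


print(b_over_a(old_scheme, 128, 256))  # 2   (the 2:4 case)
print(b_over_a(new_scheme, 128, 256))  # 3   (the 1:3 case)
print(b_over_a(old_scheme, 1, 256))    # 256
print(b_over_a(new_scheme, 1, 256))    # 511
```

In this model the old-scheme single-thread case comes out at exactly
1:256, since B's share is 1/257 + 255 = 65536/257, which agrees with the
~1:256 quoted above.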
Hello, Peter.

On Mon, May 18, 2026 at 09:14:56AM +0200, Peter Zijlstra wrote:
...
> So the current scheme will inflate the part of A to be double the weight
> (of B), giving them 2 out of 3 parts on the contended CPUs, but then B
> will still get complete / uncontested access to those extra 128 CPUs,
> resulting in a 2:4 weight distribution.
>
> Which also isn't as straight forward as one might think.

Right, the current behavior isn't quite what people would expect intuitively
either.

...
> So for the one contended CPU A gets 256 out of 257 parts, while B gets
> the full CPU for the remaining 255 CPUs, for a:
>
>   256    1        257
>   --- : --- + 255*--- = 256:65535 ~ 1:256
>   257   257       257
>
> distribution. While with the new scheme it would be:
>
>   1   1       2
>   - : - + 255*- = 1:511
>   2   2       2
>
> Which, realistically isn't all that different, except the old scheme has
> this really large weight to deal with.
>
> So from where I'm sitting, yes different, but it behaves better.

I see. Thread cardinality and affinity problems make weight based
distribution such a pain. I wonder whether this can be better solved by
turning it into a two-layer allocation problem - groups to CPUs and then
timeshare on CPUs as necessary. That comes with a lot of its own problems
but it can, aspirationally at least, approximate global weight distribution
and would have better locality properties.

Thanks.

--
tejun
On Mon, May 18, 2026 at 09:11:03AM -1000, Tejun Heo wrote:
> Hello, Peter.
>
> On Mon, May 18, 2026 at 09:14:56AM +0200, Peter Zijlstra wrote:
> ...
> > So the current scheme will inflate the part of A to be double the weight
> > (of B), giving them 2 out of 3 parts on the contended CPUs, but then B
> > will still get complete / uncontested access to those extra 128 CPUs,
> > resulting in a 2:4 weight distribution.
> >
> > Which also isn't as straight forward as one might think.
>
> Right, the current behavior isn't quite what people would expect intuitively
> either.
>
> ...
> > So for the one contended CPU A gets 256 out of 257 parts, while B gets
> > the full CPU for the remaining 255 CPUs, for a:
> >
> >   256    1        257
> >   --- : --- + 255*--- = 256:65535 ~ 1:256
> >   257   257       257
> >
> > distribution. While with the new scheme it would be:
> >
> >   1   1       2
> >   - : - + 255*- = 1:511
> >   2   2       2
> >
> > Which, realistically isn't all that different, except the old scheme has
> > this really large weight to deal with.
> >
> > So from where I'm sitting, yes different, but it behaves better.

FWIW if the workload was single threads per CPU; the above is also the
exact behaviour we'd have without cgroups.

> I see. Thread cardinality and affinity problems make weight based
> distribution such a pain. I wonder whether this can be better solved by
> turning it into a two-layer allocation problem - groups to CPUs and then
> timeshare on CPUs as necessary. That comes with a lot of its own problems
> but it can, aspirationally at least, approximate global weight distribution
> and would have better locality properties.

If people want, they can already do this today. I don't see a reason to
mandate something like that. That is, combine cpuset and cpu in a v2
hierarchy and you get this.

The main problem with doing something like that is of course that it
isn't always clear how many CPUs will be needed for a particular 'job'.
So assigning groups to CPUs isn't a straight forward thing.

If I remember, Meta was actually doing some of this. It was dynamically
resizing cpusets based on load predictions and the like in order to
separate various workloads on the same large machine, right?

Anyway, while it is somewhat tedious to change behaviour, I do think it
is worth doing in this case.