This is a subset of [1] (sched/fair: Rework EAS to handle more cases)

[1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/

The current Energy Aware Scheduler has some known limitations which have
become more and more visible with features like uclamp. This series tries
to fix some of those issues:
- tasks stacked on the same CPU of a PD (performance domain)
- tasks stuck on the wrong CPU

Patch 1 fixes the case where a CPU is wrongly classified as overloaded
whereas it is capped to a lower compute capacity. This wrong classification
can prevent the periodic load balancer from selecting a group_misfit_task
CPU because group_overloaded has higher priority.

Patch 2 removes the need to test uclamp_min in cpu_overutilized to
trigger the active migration of a task to another CPU.

Patch 3 prepares select_task_rq_fair() to be called without the TTWU, Fork
or Exec flags when we just want to look for a possibly better CPU.

Patch 4 adds a push callback mechanism to the fair scheduler but doesn't
enable it.

Patch 5 enables has_idle_core for !SMT systems to track whether there may
be an idle CPU in the LLC.

Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
- when a task is stuck on a CPU and the system is not overutilized
- if there is a possible idle CPU when the system is overutilized

More test results will come later, as I wanted to send the patchset before
LPC.

Tbench on dragonboard rb5,
schedutil and EAS enabled:

# process        tip                +patchset
 1       29.1(+/-4.1%)    124.7(+/-12.3%)   +329%
 2       60.0(+/-0.9%)    216.1(+/- 7.9%)   +260%
 4      255.8(+/-1.9%)    421.4(+/- 2.0%)    +65%
 8     1317.3(+/-4.6%)   1396.1(+/- 3.0%)     +6%
16      958.2(+/-4.6%)    979.6(+/- 2.0%)     +2%

Hackbench didn't show any difference.


Vincent Guittot (6):
  sched/fair: Filter false overloaded_group case for EAS
  sched/fair: Update overutilized detection
  sched/fair: Prepare select_task_rq_fair() to be called for new cases
  sched/fair: Add push task mechanism for fair
  sched/fair: Enable idle core tracking for !SMT
  sched/fair: Add EAS and idle cpu push trigger

 kernel/sched/fair.c     | 350 +++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h    |  46 ++++--
 kernel/sched/topology.c |   3 +
 3 files changed, 346 insertions(+), 53 deletions(-)

--
2.43.0
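To make the Patch 1 and Patch 2 reasoning above easier to follow, here is a
minimal, self-contained sketch; it is not kernel code and not part of this
series. It illustrates two mechanisms the cover letter relies on: the
group-type ordering that lets group_overloaded outrank group_misfit_task when
the load balancer picks the busiest group, and the roughly 80% capacity-fit
margin behind the overutilized/misfit detection. The enum is abridged
(intermediate group types are omitted), the 1280/1024 margin mirrors what
mainline fair.c uses at the time of writing, and util_fits() plus the example
numbers are purely illustrative assumptions.

#include <stdio.h>
#include <stdbool.h>

/*
 * Abridged mirror of the scheduler's group classification order: when the
 * load balancer compares candidate groups, a numerically higher group_type
 * wins, so a group flagged group_overloaded is picked over one flagged
 * group_misfit_task.
 */
enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,
	group_overloaded,
};

/*
 * Illustrative capacity-fit check: utilization "fits" a CPU while it stays
 * below roughly 80% of the CPU's capacity (the 1280/1024 margin).
 */
static bool util_fits(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}

int main(void)
{
	/* Hypothetical numbers: a CPU capped to capacity 300 with 400 of utilization. */
	unsigned long capped_capacity = 300, cpu_util = 400;
	enum group_type capped_group = group_overloaded;  /* the wrong classification */
	enum group_type misfit_group = group_misfit_task; /* the CPU that needs help */

	printf("util %lu fits capacity %lu: %s\n", cpu_util, capped_capacity,
	       util_fits(cpu_util, capped_capacity) ? "yes" : "no");
	printf("load balancer picks: %s\n",
	       capped_group > misfit_group ? "the (falsely) overloaded group"
					   : "the misfit group");
	return 0;
}

With these simplified rules, a CPU capped to a lower compute capacity can end
up classified as overloaded, and because group_overloaded compares higher than
group_misfit_task the periodic load balancer selects that group instead of the
one carrying the misfit task; this is the misclassification Patch 1 filters out.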
On 12/1/25 09:13, Vincent Guittot wrote:
> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>
> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>
> The current Energy Aware Scheduler has some known limitations which have
> became more and more visible with features like uclamp as an example. This
> serie tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent periodic load balancer to select a group_misfit_task CPU
> because group_overloaded has higher priority.
>
> Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
> trigger the active migration of a task on another CPU.
>
> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> Exec flags when we just want to look for a possible better CPU.
>
> Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
> it.
>
> Patch 5 enable has_idle_core for !SMP system to track if there may be an
> idle CPU in the LLC.
>
> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> - when a task is stuck on a CPU and the system is not overutilized.
> - if there is a possible idle CPU when the system is overutilized.
>
> More tests results will come later as I wanted to send the pachtset before
> LPC.
>
> Tbench on dragonboard rb5
> schedutil and EAS enabled
>
> # process        tip                +patchset
>  1       29.1(+/-4.1%)    124.7(+/-12.3%)   +329%
>  2       60.0(+/-0.9%)    216.1(+/- 7.9%)   +260%
>  4      255.8(+/-1.9%)    421.4(+/- 2.0%)    +65%
>  8     1317.3(+/-4.6%)   1396.1(+/- 3.0%)     +6%
> 16      958.2(+/-4.6%)    979.6(+/- 2.0%)     +2%

Just so I understand, there's no uclamp in the workload here?
Could you expand on the workload a little, what were the parameters/settings?
So the significant increase is really only for nr_proc < nr_cpus, with the
observed throughput increase it'll probably be something like "always running
on little CPUs" vs "always running on big CPUs", is that what's happening?
Also shouldn't tbench still have plenty of wakeup events? It issues plenty of
TCP anyway.

>
> Hackbench didn't show any difference
>
>
> Vincent Guittot (6):
>   sched/fair: Filter false overloaded_group case for EAS
>   sched/fair: Update overutilized detection
>   sched/fair: Prepare select_task_rq_fair() to be called for new cases
>   sched/fair: Add push task mechanism for fair
>   sched/fair: Enable idle core tracking for !SMT
>   sched/fair: Add EAS and idle cpu push trigger
>
>  kernel/sched/fair.c     | 350 +++++++++++++++++++++++++++++++++++-----
>  kernel/sched/sched.h    |  46 ++++--
>  kernel/sched/topology.c |   3 +
>  3 files changed, 346 insertions(+), 53 deletions(-)
>
On Mon, 1 Dec 2025 at 14:31, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 12/1/25 09:13, Vincent Guittot wrote:
> > This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
> >
> > [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
> >
> > The current Energy Aware Scheduler has some known limitations which have
> > became more and more visible with features like uclamp as an example. This
> > serie tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
> >
> > Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> > whereas it is capped to a lower compute capacity. This wrong classification
> > can prevent periodic load balancer to select a group_misfit_task CPU
> > because group_overloaded has higher priority.
> >
> > Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
> > trigger the active migration of a task on another CPU.
> >
> > Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> > Exec flags when we just want to look for a possible better CPU.
> >
> > Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
> > it.
> >
> > Patch 5 enable has_idle_core for !SMP system to track if there may be an
> > idle CPU in the LLC.
> >
> > Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> > - when a task is stuck on a CPU and the system is not overutilized.
> > - if there is a possible idle CPU when the system is overutilized.
> >
> > More tests results will come later as I wanted to send the pachtset before
> > LPC.
> >
> > Tbench on dragonboard rb5
> > schedutil and EAS enabled
> >
> > # process        tip                +patchset
> >  1       29.1(+/-4.1%)    124.7(+/-12.3%)   +329%
> >  2       60.0(+/-0.9%)    216.1(+/- 7.9%)   +260%
> >  4      255.8(+/-1.9%)    421.4(+/- 2.0%)    +65%
> >  8     1317.3(+/-4.6%)   1396.1(+/- 3.0%)     +6%
> > 16      958.2(+/-4.6%)    979.6(+/- 2.0%)     +2%
>
> Just so I understand, there's no uclamp in the workload here?
Yes, no uclamp
> Could you expand on the workload a little, what were the parameters/settings?
for g in 1 2 4 8 16; do
    for i in {0..8}; do
        sync
        sleep 3.777
        tbench -t 10 $g
    done
done
> So the significant increase is really only for nr_proc < nr_cpus, with the
yes
> observed throughput increase it'll probably be something like "always running
> on little CPUs" vs "always running on big CPUs", is that what's happening?
I have looked at the details. These results are part of the benchmarks that
I'm running along with hackbench, but the gain most probably comes from
migrating the task to a better CPU.
> Also shouldn't tbench still have plenty of wakeup events? It issues plenty of
> TCP anyway.
Yes
>
> >
> > Hackbench didn't show any difference
> >
> >
> > Vincent Guittot (6):
> > sched/fair: Filter false overloaded_group case for EAS
> > sched/fair: Update overutilized detection
> > sched/fair: Prepare select_task_rq_fair() to be called for new cases
> > sched/fair: Add push task mechanism for fair
> > sched/fair: Enable idle core tracking for !SMT
> > sched/fair: Add EAS and idle cpu push trigger
> >
> > kernel/sched/fair.c | 350 +++++++++++++++++++++++++++++++++++-----
> > kernel/sched/sched.h | 46 ++++--
> > kernel/sched/topology.c | 3 +
> > 3 files changed, 346 insertions(+), 53 deletions(-)
> >
>
Nit in the title: mechanism, handle

On 12/1/25 13:31, Christian Loehle wrote:
> On 12/1/25 09:13, Vincent Guittot wrote:
>> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>>
>> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>>
>> The current Energy Aware Scheduler has some known limitations which have
>> became more and more visible with features like uclamp as an example. This
>> serie tries to fix some of those issues:
>> - tasks stacked on the same CPU of a PD
>> - tasks stuck on the wrong CPU.
>>
>> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
>> whereas it is capped to a lower compute capacity. This wrong classification
>> can prevent periodic load balancer to select a group_misfit_task CPU
>> because group_overloaded has higher priority.
>>
>> Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
>> trigger the active migration of a task on another CPU.
>>
>> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
>> Exec flags when we just want to look for a possible better CPU.
>>
>> Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
>> it.
>>
>> Patch 5 enable has_idle_core for !SMP system to track if there may be an
>> idle CPU in the LLC.
>>
>> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
>> - when a task is stuck on a CPU and the system is not overutilized.
>> - if there is a possible idle CPU when the system is overutilized.
>>
>> More tests results will come later as I wanted to send the pachtset before
>> LPC.
>>
>> Tbench on dragonboard rb5
>> schedutil and EAS enabled
>>
>> # process        tip                +patchset
>>  1       29.1(+/-4.1%)    124.7(+/-12.3%)   +329%
>>  2       60.0(+/-0.9%)    216.1(+/- 7.9%)   +260%
>>  4      255.8(+/-1.9%)    421.4(+/- 2.0%)    +65%
>>  8     1317.3(+/-4.6%)   1396.1(+/- 3.0%)     +6%
>> 16      958.2(+/-4.6%)    979.6(+/- 2.0%)     +2%
>
> Just so I understand, there's no uclamp in the workload here?
> Could you expand on the workload a little, what were the parameters/settings?
> So the significant increase is really only for nr_proc < nr_cpus, with the
> observed throughput increase it'll probably be something like "always running
> on little CPUs" vs "always running on big CPUs", is that what's happening?
> Also shouldn't tbench still have plenty of wakeup events? It issues plenty of
> TCP anyway.

... or if not why does OU not trigger on tip?

>
>>
>> Hackbench didn't show any difference
>>
>>
>> Vincent Guittot (6):
>>   sched/fair: Filter false overloaded_group case for EAS
>>   sched/fair: Update overutilized detection
>>   sched/fair: Prepare select_task_rq_fair() to be called for new cases
>>   sched/fair: Add push task mechanism for fair
>>   sched/fair: Enable idle core tracking for !SMT
>>   sched/fair: Add EAS and idle cpu push trigger
>>
>>  kernel/sched/fair.c     | 350 +++++++++++++++++++++++++++++++++++-----
>>  kernel/sched/sched.h    |  46 ++++--
>>  kernel/sched/topology.c |   3 +
>>  3 files changed, 346 insertions(+), 53 deletions(-)
>>

I can't apply this on yesterday's released 6.18 and not on tip/sched-core, what's
this based on? Can I get a branch or a 6.18 rebase?
On Mon, 1 Dec 2025 at 14:57, Christian Loehle <christian.loehle@arm.com> wrote:
>
> Nit in the title: mechanism, handle
>
> On 12/1/25 13:31, Christian Loehle wrote:
> > On 12/1/25 09:13, Vincent Guittot wrote:
> >> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
> >>
> >> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
> >>
> >> The current Energy Aware Scheduler has some known limitations which have
> >> became more and more visible with features like uclamp as an example. This
> >> serie tries to fix some of those issues:
> >> - tasks stacked on the same CPU of a PD
> >> - tasks stuck on the wrong CPU.
> >>
> >> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> >> whereas it is capped to a lower compute capacity. This wrong classification
> >> can prevent periodic load balancer to select a group_misfit_task CPU
> >> because group_overloaded has higher priority.
> >>
> >> Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
> >> trigger the active migration of a task on another CPU.
> >>
> >> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> >> Exec flags when we just want to look for a possible better CPU.
> >>
> >> Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
> >> it.
> >>
> >> Patch 5 enable has_idle_core for !SMP system to track if there may be an
> >> idle CPU in the LLC.
> >>
> >> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> >> - when a task is stuck on a CPU and the system is not overutilized.
> >> - if there is a possible idle CPU when the system is overutilized.
> >>
> >> More tests results will come later as I wanted to send the pachtset before
> >> LPC.
> >>
> >> Tbench on dragonboard rb5
> >> schedutil and EAS enabled
> >>
> >> # process        tip                +patchset
> >>  1       29.1(+/-4.1%)    124.7(+/-12.3%)   +329%
> >>  2       60.0(+/-0.9%)    216.1(+/- 7.9%)   +260%
> >>  4      255.8(+/-1.9%)    421.4(+/- 2.0%)    +65%
> >>  8     1317.3(+/-4.6%)   1396.1(+/- 3.0%)     +6%
> >> 16      958.2(+/-4.6%)    979.6(+/- 2.0%)     +2%
> >
> > Just so I understand, there's no uclamp in the workload here?
> > Could you expand on the workload a little, what were the parameters/settings?
> > So the significant increase is really only for nr_proc < nr_cpus, with the
> > observed throughput increase it'll probably be something like "always running
> > on little CPUs" vs "always running on big CPUs", is that what's happening?
> > Also shouldn't tbench still have plenty of wakeup events? It issues plenty of
> > TCP anyway.
>
> ... or if not why does OU not trigger on tip?
>
> >
> >>
> >> Hackbench didn't show any difference
> >>
> >>
> >> Vincent Guittot (6):
> >> sched/fair: Filter false overloaded_group case for EAS
> >> sched/fair: Update overutilized detection
> >> sched/fair: Prepare select_task_rq_fair() to be called for new cases
> >> sched/fair: Add push task mechanism for fair
> >> sched/fair: Enable idle core tracking for !SMT
> >> sched/fair: Add EAS and idle cpu push trigger
> >>
> >> kernel/sched/fair.c | 350 +++++++++++++++++++++++++++++++++++-----
> >> kernel/sched/sched.h | 46 ++++--
> >> kernel/sched/topology.c | 3 +
> >> 3 files changed, 346 insertions(+), 53 deletions(-)
> >>
>
> I can't apply this on yesterday's released 6.18 and not on tip/sched-core, what's
> this based on? Can I get a branch or a 6.18 rebase?
The patchset is based on tip/sched/core commit 33cf66d88306
("sched/fair: Proportional newidle balance")