This is a subset of [1] (sched/fair: Rework EAS to handle more cases)

[1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/

The current Energy Aware Scheduler has some known limitations which have
become more and more visible with features like uclamp. This series tries
to fix some of those issues:
- tasks stacked on the same CPU of a PD (performance domain)
- tasks stuck on the wrong CPU

Patch 1 fixes the case where a CPU is wrongly classified as overloaded
whereas it is capped to a lower compute capacity. This wrong classification
can prevent the periodic load balancer from selecting a group_misfit_task
CPU because group_overloaded has higher priority.

Patch 2 removes the need to test uclamp_min in cpu_overutilized to
trigger the active migration of a task to another CPU.

Patch 3 prepares select_task_rq_fair() to be called without the TTWU, Fork
or Exec flags when we just want to look for a possibly better CPU.

Patch 4 adds a push callback mechanism to the fair scheduler but doesn't
enable it.

Patch 5 enables has_idle_core for !SMT systems to track whether there may
be an idle CPU in the LLC.

Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
- when a task is stuck on a CPU and the system is not overutilized
- if there is a possible idle CPU when the system is overutilized

More test results will come later, as I wanted to send the patchset before
LPC.

Tbench on dragonboard rb5,
schedutil and EAS enabled:

# process        tip                +patchset
 1       29.1(+/-4.1%)    124.7(+/-12.3%)   +329%
 2       60.0(+/-0.9%)    216.1(+/- 7.9%)   +260%
 4      255.8(+/-1.9%)    421.4(+/- 2.0%)    +65%
 8     1317.3(+/-4.6%)   1396.1(+/- 3.0%)     +6%
16      958.2(+/-4.6%)    979.6(+/- 2.0%)     +2%

Hackbench didn't show any difference.


Vincent Guittot (6):
  sched/fair: Filter false overloaded_group case for EAS
  sched/fair: Update overutilized detection
  sched/fair: Prepare select_task_rq_fair() to be called for new cases
  sched/fair: Add push task mechanism for fair
  sched/fair: Enable idle core tracking for !SMT
  sched/fair: Add EAS and idle cpu push trigger

 kernel/sched/fair.c     | 350 +++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h    |  46 ++++--
 kernel/sched/topology.c |   3 +
 3 files changed, 346 insertions(+), 53 deletions(-)

--
2.43.0
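To make the Patch 1 and Patch 2 reasoning above easier to follow, here is a
minimal, self-contained sketch; it is not kernel code and not part of this
series. It illustrates two mechanisms the cover letter relies on: the
group-type ordering that lets group_overloaded outrank group_misfit_task when
the load balancer picks the busiest group, and the roughly 80% capacity-fit
margin behind the overutilized/misfit detection. The enum is abridged
(intermediate group types are omitted), the 1280/1024 margin mirrors what
mainline fair.c uses at the time of writing, and util_fits() plus the example
numbers are purely illustrative assumptions.

#include <stdio.h>
#include <stdbool.h>

/*
 * Abridged mirror of the scheduler's group classification order: when the
 * load balancer compares candidate groups, a numerically higher group_type
 * wins, so a group flagged group_overloaded is picked over one flagged
 * group_misfit_task.
 */
enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,
	group_overloaded,
};

/*
 * Illustrative capacity-fit check: utilization "fits" a CPU while it stays
 * below roughly 80% of the CPU's capacity (the 1280/1024 margin).
 */
static bool util_fits(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}

int main(void)
{
	/* Hypothetical numbers: a CPU capped to capacity 300 with 400 of utilization. */
	unsigned long capped_capacity = 300, cpu_util = 400;
	enum group_type capped_group = group_overloaded;  /* the wrong classification */
	enum group_type misfit_group = group_misfit_task; /* the CPU that needs help */

	printf("util %lu fits capacity %lu: %s\n", cpu_util, capped_capacity,
	       util_fits(cpu_util, capped_capacity) ? "yes" : "no");
	printf("load balancer picks: %s\n",
	       capped_group > misfit_group ? "the (falsely) overloaded group"
					   : "the misfit group");
	return 0;
}

With these simplified rules, a CPU capped to a lower compute capacity can end
up classified as overloaded, and because group_overloaded compares higher than
group_misfit_task the periodic load balancer selects that group instead of the
one carrying the misfit task; this is the misclassification Patch 1 filters out.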
On 12/1/25 09:13, Vincent Guittot wrote:
> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>
> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>
> The current Energy Aware Scheduler has some known limitations which have
> became more and more visible with features like uclamp as an example. This
> serie tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent periodic load balancer to select a group_misfit_task CPU
> because group_overloaded has higher priority.
>
> Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
> trigger the active migration of a task on another CPU.
>
> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> Exec flags when we just want to look for a possible better CPU.
>
> Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
> it.
>
> Patch 5 enable has_idle_core for !SMP system to track if there may be an
> idle CPU in the LLC.
>
> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> - when a task is stuck on a CPU and the system is not overutilized.
> - if there is a possible idle CPU when the system is overutilized.
>
> More tests results will come later as I wanted to send the pachtset before
> LPC.
>
> Tbench on dragonboard rb5
> schedutil and EAS enabled
>
> # process        tip                +patchset
>  1       29.1(+/-4.1%)    124.7(+/-12.3%)   +329%
>  2       60.0(+/-0.9%)    216.1(+/- 7.9%)   +260%
>  4      255.8(+/-1.9%)    421.4(+/- 2.0%)    +65%
>  8     1317.3(+/-4.6%)   1396.1(+/- 3.0%)     +6%
> 16      958.2(+/-4.6%)    979.6(+/- 2.0%)     +2%

Just so I understand, there's no uclamp in the workload here?
Could you expand on the workload a little, what were the parameters/settings?
So the significant increase is really only for nr_proc < nr_cpus, with the
observed throughput increase it'll probably be something like "always running
on little CPUs" vs "always running on big CPUs", is that what's happening?
Also shouldn't tbench still have plenty of wakeup events? It issues plenty of
TCP anyway.

>
> Hackbench didn't show any difference
>
>
> Vincent Guittot (6):
>   sched/fair: Filter false overloaded_group case for EAS
>   sched/fair: Update overutilized detection
>   sched/fair: Prepare select_task_rq_fair() to be called for new cases
>   sched/fair: Add push task mechanism for fair
>   sched/fair: Enable idle core tracking for !SMT
>   sched/fair: Add EAS and idle cpu push trigger
>
>  kernel/sched/fair.c     | 350 +++++++++++++++++++++++++++++++++++-----
>  kernel/sched/sched.h    |  46 ++++--
>  kernel/sched/topology.c |   3 +
>  3 files changed, 346 insertions(+), 53 deletions(-)
>
On Mon, 1 Dec 2025 at 14:31, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 12/1/25 09:13, Vincent Guittot wrote:
> > This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
> >
> > [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
> >
> > The current Energy Aware Scheduler has some known limitations which have
> > became more and more visible with features like uclamp as an example. This
> > serie tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
> >
> > Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> > whereas it is capped to a lower compute capacity. This wrong classification
> > can prevent periodic load balancer to select a group_misfit_task CPU
> > because group_overloaded has higher priority.
> >
> > Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
> > trigger the active migration of a task on another CPU.
> >
> > Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> > Exec flags when we just want to look for a possible better CPU.
> >
> > Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
> > it.
> >
> > Patch 5 enable has_idle_core for !SMP system to track if there may be an
> > idle CPU in the LLC.
> >
> > Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> > - when a task is stuck on a CPU and the system is not overutilized.
> > - if there is a possible idle CPU when the system is overutilized.
> >
> > More tests results will come later as I wanted to send the pachtset before
> > LPC.
> >
> > Tbench on dragonboard rb5
> > schedutil and EAS enabled
> >
> > # process        tip                +patchset
> >  1       29.1(+/-4.1%)    124.7(+/-12.3%)   +329%
> >  2       60.0(+/-0.9%)    216.1(+/- 7.9%)   +260%
> >  4      255.8(+/-1.9%)    421.4(+/- 2.0%)    +65%
> >  8     1317.3(+/-4.6%)   1396.1(+/- 3.0%)     +6%
> > 16      958.2(+/-4.6%)    979.6(+/- 2.0%)     +2%
>
> Just so I understand, there's no uclamp in the workload here?
Yes, no uclamp
> Could you expand on the workload a little, what were the parameters/settings?
for g in 1 2 4 8 16; do
    for i in {0..8}; do
        sync
        sleep 3.777
        tbench -t 10 $g
    done
done
> So the significant increase is really only for nr_proc < nr_cpus, with the
yes
> observed throughput increase it'll probably be something like "always running
> on little CPUs" vs "always running on big CPUs", is that what's happening?
I have looked at the details. These results are part of the benchmarks that
I'm running along with hackbench, but the gain most probably comes from
migrating the task to a better CPU.
> Also shouldn't tbench still have plenty of wakeup events? It issues plenty of
> TCP anyway.
Yes
>
> >
> > Hackbench didn't show any difference
> >
> >
> > Vincent Guittot (6):
> > sched/fair: Filter false overloaded_group case for EAS
> > sched/fair: Update overutilized detection
> > sched/fair: Prepare select_task_rq_fair() to be called for new cases
> > sched/fair: Add push task mechanism for fair
> > sched/fair: Enable idle core tracking for !SMT
> > sched/fair: Add EAS and idle cpu push trigger
> >
> > kernel/sched/fair.c | 350 +++++++++++++++++++++++++++++++++++-----
> > kernel/sched/sched.h | 46 ++++--
> > kernel/sched/topology.c | 3 +
> > 3 files changed, 346 insertions(+), 53 deletions(-)
> >
>
Nit in the title: mechanism, handle

On 12/1/25 13:31, Christian Loehle wrote:
> On 12/1/25 09:13, Vincent Guittot wrote:
>> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>>
>> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>>
>> The current Energy Aware Scheduler has some known limitations which have
>> became more and more visible with features like uclamp as an example. This
>> serie tries to fix some of those issues:
>> - tasks stacked on the same CPU of a PD
>> - tasks stuck on the wrong CPU.
>>
>> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
>> whereas it is capped to a lower compute capacity. This wrong classification
>> can prevent periodic load balancer to select a group_misfit_task CPU
>> because group_overloaded has higher priority.
>>
>> Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
>> trigger the active migration of a task on another CPU.
>>
>> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
>> Exec flags when we just want to look for a possible better CPU.
>>
>> Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
>> it.
>>
>> Patch 5 enable has_idle_core for !SMP system to track if there may be an
>> idle CPU in the LLC.
>>
>> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
>> - when a task is stuck on a CPU and the system is not overutilized.
>> - if there is a possible idle CPU when the system is overutilized.
>>
>> More tests results will come later as I wanted to send the pachtset before
>> LPC.
>>
>> Tbench on dragonboard rb5
>> schedutil and EAS enabled
>>
>> # process        tip                +patchset
>>  1       29.1(+/-4.1%)    124.7(+/-12.3%)   +329%
>>  2       60.0(+/-0.9%)    216.1(+/- 7.9%)   +260%
>>  4      255.8(+/-1.9%)    421.4(+/- 2.0%)    +65%
>>  8     1317.3(+/-4.6%)   1396.1(+/- 3.0%)     +6%
>> 16      958.2(+/-4.6%)    979.6(+/- 2.0%)     +2%
>
> Just so I understand, there's no uclamp in the workload here?
> Could you expand on the workload a little, what were the parameters/settings?
> So the significant increase is really only for nr_proc < nr_cpus, with the
> observed throughput increase it'll probably be something like "always running
> on little CPUs" vs "always running on big CPUs", is that what's happening?
> Also shouldn't tbench still have plenty of wakeup events? It issues plenty of
> TCP anyway.

... or if not why does OU not trigger on tip?

>
>>
>> Hackbench didn't show any difference
>>
>>
>> Vincent Guittot (6):
>>   sched/fair: Filter false overloaded_group case for EAS
>>   sched/fair: Update overutilized detection
>>   sched/fair: Prepare select_task_rq_fair() to be called for new cases
>>   sched/fair: Add push task mechanism for fair
>>   sched/fair: Enable idle core tracking for !SMT
>>   sched/fair: Add EAS and idle cpu push trigger
>>
>>  kernel/sched/fair.c     | 350 +++++++++++++++++++++++++++++++++++-----
>>  kernel/sched/sched.h    |  46 ++++--
>>  kernel/sched/topology.c |   3 +
>>  3 files changed, 346 insertions(+), 53 deletions(-)
>>

I can't apply this on yesterday's released 6.18 and not on tip/sched-core, what's
this based on? Can I get a branch or a 6.18 rebase?
On Mon, 1 Dec 2025 at 14:57, Christian Loehle <christian.loehle@arm.com> wrote:
>
> Nit in the title: mechanism, handle
>
> On 12/1/25 13:31, Christian Loehle wrote:
> > On 12/1/25 09:13, Vincent Guittot wrote:
> >> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
> >>
> >> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
> >>
> >> The current Energy Aware Scheduler has some known limitations which have
> >> became more and more visible with features like uclamp as an example. This
> >> serie tries to fix some of those issues:
> >> - tasks stacked on the same CPU of a PD
> >> - tasks stuck on the wrong CPU.
> >>
> >> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> >> whereas it is capped to a lower compute capacity. This wrong classification
> >> can prevent periodic load balancer to select a group_misfit_task CPU
> >> because group_overloaded has higher priority.
> >>
> >> Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
> >> trigger the active migration of a task on another CPU.
> >>
> >> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> >> Exec flags when we just want to look for a possible better CPU.
> >>
> >> Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
> >> it.
> >>
> >> Patch 5 enable has_idle_core for !SMP system to track if there may be an
> >> idle CPU in the LLC.
> >>
> >> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> >> - when a task is stuck on a CPU and the system is not overutilized.
> >> - if there is a possible idle CPU when the system is overutilized.
> >>
> >> More tests results will come later as I wanted to send the pachtset before
> >> LPC.
> >>
> >> Tbench on dragonboard rb5
> >> schedutil and EAS enabled
> >>
> >> # process        tip                +patchset
> >>  1       29.1(+/-4.1%)    124.7(+/-12.3%)   +329%
> >>  2       60.0(+/-0.9%)    216.1(+/- 7.9%)   +260%
> >>  4      255.8(+/-1.9%)    421.4(+/- 2.0%)    +65%
> >>  8     1317.3(+/-4.6%)   1396.1(+/- 3.0%)     +6%
> >> 16      958.2(+/-4.6%)    979.6(+/- 2.0%)     +2%
> >
> > Just so I understand, there's no uclamp in the workload here?
> > Could you expand on the workload a little, what were the parameters/settings?
> > So the significant increase is really only for nr_proc < nr_cpus, with the
> > observed throughput increase it'll probably be something like "always running
> > on little CPUs" vs "always running on big CPUs", is that what's happening?
> > Also shouldn't tbench still have plenty of wakeup events? It issues plenty of
> > TCP anyway.
>
> ... or if not why does OU not trigger on tip?
>
> >
> >>
> >> Hackbench didn't show any difference
> >>
> >>
> >> Vincent Guittot (6):
> >> sched/fair: Filter false overloaded_group case for EAS
> >> sched/fair: Update overutilized detection
> >> sched/fair: Prepare select_task_rq_fair() to be called for new cases
> >> sched/fair: Add push task mechanism for fair
> >> sched/fair: Enable idle core tracking for !SMT
> >> sched/fair: Add EAS and idle cpu push trigger
> >>
> >> kernel/sched/fair.c | 350 +++++++++++++++++++++++++++++++++++-----
> >> kernel/sched/sched.h | 46 ++++--
> >> kernel/sched/topology.c | 3 +
> >> 3 files changed, 346 insertions(+), 53 deletions(-)
> >>
>
> I can't apply this on yesterday's released 6.18 and not on tip/sched-core, what's
> this based on? Can I get a branch or a 6.18 rebase?
The patchset is based on tip/sched/core commit 33cf66d88306
("sched/fair: Proportional newidle balance")