This is a subset of [1] (sched/fair: Rework EAS to handle more cases)

[1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/

The current Energy Aware Scheduler has some known limitations which have
become more and more visible with features like uclamp, for example. This
series tries to fix some of those issues:
- tasks stacked on the same CPU of a PD
- tasks stuck on the wrong CPU.

Patch 1 fixes the case where a CPU is wrongly classified as overloaded
whereas it is capped to a lower compute capacity. This wrong classification
can prevent the periodic load balancer from selecting a group_misfit_task
CPU because group_overloaded has higher priority.

Patch 2 removes the need to test uclamp_min in cpu_overutilized to
trigger the active migration of a task to another CPU.

Patch 3 prepares select_task_rq_fair() to be called without the TTWU, Fork
or Exec flags when we just want to look for a possibly better CPU.

Patch 4 adds a push callback mechanism to the fair scheduler but doesn't
enable it.

Patch 5 enables has_idle_core for !SMT systems to track whether there may
be an idle CPU in the LLC.

Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
- when a task is stuck on a CPU and the system is not overutilized.
- if there is a possible idle CPU when the system is overutilized.

More test results will come later, as I wanted to send the patchset before
LPC.

I have kept the Tbench figures as I added them in v7, but results are the
same with the corrected patch 6.
Tbench on dragonboard rb5
schedutil and EAS enabled

# process       tip              +patchset
1               29.3(+/-0.3%)    29.2(+/-0.2%)    +0%
2               61.1(+/-1.8%)    61.7(+/-3.2%)    +1%
4               260.0(+/-1.7%)   258.8(+/-2.8%)   -1%
8               1361.2(+/-3.1%)  1377.1(+/-1.9%)  +1%
16              981.5(+/-0.6%)   958.0(+/-1.7%)   -2%

Hackbench didn't show any difference

Changes since v7:
- Rebased on latest tip/sched/core
- Fix some typos
- Fix patch 6 mess

Vincent Guittot (6):
  sched/fair: Filter false overloaded_group case for EAS
  sched/fair: Update overutilized detection
  sched/fair: Prepare select_task_rq_fair() to be called for new cases
  sched/fair: Add push task mechanism for fair
  sched/fair: Enable idle core tracking for !SMT
  sched/fair: Add EAS and idle cpu push trigger

 kernel/sched/fair.c     | 350 +++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h    |  46 ++++--
 kernel/sched/topology.c |   2 +
 3 files changed, 345 insertions(+), 53 deletions(-)

-- 
2.43.0
- hongyan.xia2@arm.com
- luis.machado@arm.com

On 02.12.25 19:12, Vincent Guittot wrote:
> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>
> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>
> The current Energy Aware Scheduler has some known limitations which have
> became more and more visible with features like uclamp as an example. This
> serie tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent periodic load balancer to select a group_misfit_task CPU
> because group_overloaded has higher priority.
>
> Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
> trigger the active migration of a task on another CPU.
>
> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> Exec flags when we just want to look for a possible better CPU.
>
> Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
> it.
>
> Patch 5 enable has_idle_core for !SMP system to track if there may be an
> idle CPU in the LLC.
>
> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> - when a task is stuck on a CPU and the system is not overutilized.
> - if there is a possible idle CPU when the system is overutilized.
>
> More tests results will come later as I wanted to send the pachtset before
> LPC.
>
> I have kept Tbench figures as I added them in v7 but results are the same
> with the correct patch 6.
>
> Tbench on dragonboard rb5
> schedutil and EAS enabled
>
> # process       tip              +patchset
> 1               29.3(+/-0.3%)    29.2(+/-0.2%)    +0%
> 2               61.1(+/-1.8%)    61.7(+/-3.2%)    +1%
> 4               260.0(+/-1.7%)   258.8(+/-2.8%)   -1%
> 8               1361.2(+/-3.1%)  1377.1(+/-1.9%)  +1%
> 16              981.5(+/-0.6%)   958.0(+/-1.7%)   -2%
>
> Hackbench didn't show any difference

I guess the overall idea here is:

(1) Push runnable tasks

    [pick_next|put_prev]_task_fair() -> fair_add_pushable_task()
                                        -> fair_push_task() (*)

    __set_next_task_fair() -> fair_queue_pushable_tasks()
                              -> queue_balance_callback(..., push_fair_tasks)

    push_fair_task() -> strf(), move_queued_task() (or similar)

(2) Push single running task

    tick() -> check_pushable_task() -> fair_push_task() (*), strf(),
              active_balance

strf() ... select_task_rq_fair(..., 0)

(1) & (2) are invoked when the policy fair_push_task() (2 parts according
to the OverUtilized (OU) scenario) says the task should be moved:

fair_push_task() (*)
    sched_energy_push_task() - non-OU
    sched_idle_push_task()   - OU

Pretty complex to reason about where this could be beneficial. I'm
thinking about the interaction of (1) and (2) with wakeup & MF handling
in non-OU and with load-balance in OU.

You mentioned that you will show more test results next to tbench soon.
I don't know right now how to interpret the tbench results above.

IMHO, a set of rt-app files (customisable to specific asymmetric CPU
capacity systems, potentially with uclamp max settings) with scenarios
to provoke the new functionality would help with the
understanding/evaluating here.
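A starting point for such a scenario could be an rt-app json fragment
along these lines: a couple of uclamp-max-capped heavy tasks that should
get stuck stacked on little CPUs, plus background small tasks. The
property names (`util_max`, `run`, `timer`/`period`) are quoted from
memory and should be checked against rt-app's doc/tutorial.txt; the
values are placeholders to be tuned per platform:

```json
{
    "global" : {
        "duration" : 10,
        "default_policy" : "SCHED_OTHER"
    },
    "tasks" : {
        "capped_heavy" : {
            "instance" : 2,
            "util_max" : 256,
            "run" : 8000,
            "timer" : { "ref" : "tick_heavy", "period" : 16000 }
        },
        "small" : {
            "instance" : 4,
            "run" : 2000,
            "timer" : { "ref" : "tick_small", "period" : 16000 }
        }
    }
}
```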
On 12/2/25 18:12, Vincent Guittot wrote:
> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>
> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>
> The current Energy Aware Scheduler has some known limitations which have
> became more and more visible with features like uclamp as an example. This
> serie tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD

This needs elaboration IMO, as "tasks stacked on the same CPU of a PD"
isn't really an issue per se? What's the scenario being fixed here?

> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent periodic load balancer to select a group_misfit_task CPU
> because group_overloaded has higher priority.
>
> Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
> trigger the active migration of a task on another CPU.
>
> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> Exec flags when we just want to look for a possible better CPU.
>
> Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
> it.

nit: here's still the "mecanism" typo :)

> Patch 5 enable has_idle_core for !SMP system to track if there may be an
> idle CPU in the LLC.

s/!SMP/!SMT/

> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> - when a task is stuck on a CPU and the system is not overutilized.
> - if there is a possible idle CPU when the system is overutilized.

I'd find it helpful to have the motivation spelled out more verbosely
here. Why are there tasks stuck? UCLAMP_MAX? Temporarily reduced
capacity?

Would be nice to have a very concrete list of scenarios/issues in mind
that are being fixed and a description of how they're fixed by this
patchset (e.g. current behaviour, new behaviour, reason why this
behaviour is the 'more' correct one).
>
> More tests results will come later as I wanted to send the pachtset before
> LPC.
>
> I have kept Tbench figures as I added them in v7 but results are the same
> with the correct patch 6.

Ah, I was confused by this sentence at first. So now for v8 both hackbench
and tbench are the same for baseline and patchset.

> Tbench on dragonboard rb5
> schedutil and EAS enabled
>
> # process       tip              +patchset
> 1               29.3(+/-0.3%)    29.2(+/-0.2%)    +0%
> 2               61.1(+/-1.8%)    61.7(+/-3.2%)    +1%
> 4               260.0(+/-1.7%)   258.8(+/-2.8%)   -1%
> 8               1361.2(+/-3.1%)  1377.1(+/-1.9%)  +1%
> 16              981.5(+/-0.6%)   958.0(+/-1.7%)   -2%

So I've done some analysis on tbench in the meantime, at least for the
1-process case, because I was puzzled by your v7 result. Indeed there are
plenty of wakeups; in particular, in a 10s run I see 62806 tbench wakeups
with a distribution like so (time from one wakeup to the next):

 0 ms -  1 ms: 62157
 1 ms -  2 ms:    44
 2 ms -  3 ms:    32
 3 ms -  4 ms:     5
 4 ms -  5 ms:    10
 5 ms -  6 ms:     6
 6 ms -  7 ms:     2
 7 ms -  8 ms:     2
 8 ms -  9 ms:     3
12 ms - 13 ms:     2
15 ms - 16 ms:     1
16 ms - 17 ms:     1
24 ms - 25 ms:     1
95 ms - 96 ms:     1

> Hackbench didn't show any difference

hackbench is always OU once it has ramped up anyway, right? So this is
expected.

If I'm not mistaken, neither of the workloads is then likely to exercise
the changes of the series. (Both have more than enough wakeup events, and
hackbench is additionally OU, so EAS is mostly skipped.)

It would be helpful for reviewing then to have a workload that benefits
from this push mechanism, maybe at least one with and one without
UCLAMP_MAX?
>
> Changes since v7:
> - Rebased on latest tip/sched/core
> - Fix some typos
> - Fix patch 6 mess
>
> Vincent Guittot (6):
>   sched/fair: Filter false overloaded_group case for EAS
>   sched/fair: Update overutilized detection
>   sched/fair: Prepare select_task_rq_fair() to be called for new cases
>   sched/fair: Add push task mechanism for fair
>   sched/fair: Enable idle core tracking for !SMT
>   sched/fair: Add EAS and idle cpu push trigger
>
>  kernel/sched/fair.c     | 350 +++++++++++++++++++++++++++++++++++-----
>  kernel/sched/sched.h    |  46 ++++--
>  kernel/sched/topology.c |   2 +
>  3 files changed, 345 insertions(+), 53 deletions(-)
>