This is a subset of [1] (sched/fair: Rework EAS to handle more cases)

[1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/

The current Energy Aware Scheduler has some known limitations which have
become more and more visible with features like uclamp, for example. This
series tries to fix some of those issues:
- tasks stacked on the same CPU of a PD
- tasks stuck on the wrong CPU.

Patch 1 fixes the case where a CPU is wrongly classified as overloaded
whereas it is capped to a lower compute capacity. This wrong classification
can prevent the periodic load balancer from selecting a group_misfit_task
CPU because group_overloaded has higher priority.

Patch 2 removes the need to test uclamp_min in cpu_overutilized to
trigger the active migration of a task to another CPU.

Patch 3 prepares select_task_rq_fair() to be called without the TTWU, Fork
or Exec flags when we just want to look for a possibly better CPU.

Patch 4 adds a push callback mechanism to the fair scheduler but doesn't
enable it.

Patch 5 enables has_idle_core for !SMT systems to track whether there may
be an idle CPU in the LLC.

Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
- when a task is stuck on a CPU and the system is not overutilized.
- if there is a possible idle CPU when the system is overutilized.

More test results will come later, as I wanted to send the patchset before
LPC.

I have kept the Tbench figures as I added them in v7, but results are the
same with the corrected patch 6.
Tbench on dragonboard rb5
schedutil and EAS enabled

# process       tip              +patchset
1               29.3(+/-0.3%)    29.2(+/-0.2%)    +0%
2               61.1(+/-1.8%)    61.7(+/-3.2%)    +1%
4               260.0(+/-1.7%)   258.8(+/-2.8%)   -1%
8               1361.2(+/-3.1%)  1377.1(+/-1.9%)  +1%
16              981.5(+/-0.6%)   958.0(+/-1.7%)   -2%

Hackbench didn't show any difference

Changes since v7:
- Rebased on latest tip/sched/core
- Fix some typos
- Fix patch 6 mess

Vincent Guittot (6):
  sched/fair: Filter false overloaded_group case for EAS
  sched/fair: Update overutilized detection
  sched/fair: Prepare select_task_rq_fair() to be called for new cases
  sched/fair: Add push task mechanism for fair
  sched/fair: Enable idle core tracking for !SMT
  sched/fair: Add EAS and idle cpu push trigger

 kernel/sched/fair.c     | 350 +++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h    |  46 ++++--
 kernel/sched/topology.c |   2 +
 3 files changed, 345 insertions(+), 53 deletions(-)

-- 
2.43.0
- hongyan.xia2@arm.com
- luis.machado@arm.com

On 02.12.25 19:12, Vincent Guittot wrote:
> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>
> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>
> The current Energy Aware Scheduler has some known limitations which have
> became more and more visible with features like uclamp as an example. This
> serie tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent periodic load balancer to select a group_misfit_task CPU
> because group_overloaded has higher priority.
>
> Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
> trigger the active migration of a task on another CPU.
>
> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> Exec flags when we just want to look for a possible better CPU.
>
> Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
> it.
>
> Patch 5 enable has_idle_core for !SMP system to track if there may be an
> idle CPU in the LLC.
>
> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> - when a task is stuck on a CPU and the system is not overutilized.
> - if there is a possible idle CPU when the system is overutilized.
>
> More tests results will come later as I wanted to send the pachtset before
> LPC.
>
> I have kept Tbench figures as I added them in v7 but results are the same
> with the correct patch 6.
>
> Tbench on dragonboard rb5
> schedutil and EAS enabled
>
> # process       tip              +patchset
> 1               29.3(+/-0.3%)    29.2(+/-0.2%)    +0%
> 2               61.1(+/-1.8%)    61.7(+/-3.2%)    +1%
> 4               260.0(+/-1.7%)   258.8(+/-2.8%)   -1%
> 8               1361.2(+/-3.1%)  1377.1(+/-1.9%)  +1%
> 16              981.5(+/-0.6%)   958.0(+/-1.7%)   -2%
>
> Hackbench didn't show any difference

I guess the overall idea here is:

(1) Push runnable tasks

    [pick_next|put_prev]_task_fair() -> fair_add_pushable_task()
                                        -> fair_push_task() (*)

    __set_next_task_fair() -> fair_queue_pushable_tasks()
                              -> queue_balance_callback(..., push_fair_tasks)

    push_fair_task() -> strf(), move_queued_task() (or similar)

(2) Push single running task

    tick() -> check_pushable_task() -> fair_push_task() (*), strf(),
              active_balance

strf() ... select_task_rq_fair(..., 0)

(1) & (2) are invoked when the policy fair_push_task() (2 parts according
to the OverUtilized (OU) scenario) says the task should be moved:

fair_push_task() (*)
    sched_energy_push_task() - non-OU
    sched_idle_push_task()   - OU

Pretty complex to reason about where this could be beneficial. I'm
thinking about the interaction of (1) and (2) with wakeup & MF handling
in non-OU and with load-balance in OU.

You mentioned that you will show more test results next to tbench soon.
I don't know right now how to interpret the tbench results above.

IMHO, a set of rt-app files (customisable to specific asymmetric CPU
capacity systems, potentially with uclamp max settings) with scenarios
to provoke the new functionality would help with the
understanding/evaluating here.
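A starting point for such a scenario could be an rt-app json fragment
along these lines: a couple of uclamp-max-capped heavy tasks that should
get stuck stacked on little CPUs, plus background small tasks. The
property names (`util_max`, `run`, `timer`/`period`) are quoted from
memory and should be checked against rt-app's doc/tutorial.txt; the
values are placeholders to be tuned per platform:

```json
{
    "global" : {
        "duration" : 10,
        "default_policy" : "SCHED_OTHER"
    },
    "tasks" : {
        "capped_heavy" : {
            "instance" : 2,
            "util_max" : 256,
            "run" : 8000,
            "timer" : { "ref" : "tick_heavy", "period" : 16000 }
        },
        "small" : {
            "instance" : 4,
            "run" : 2000,
            "timer" : { "ref" : "tick_small", "period" : 16000 }
        }
    }
}
```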
On 12/2/25 18:12, Vincent Guittot wrote:
> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>
> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>
> The current Energy Aware Scheduler has some known limitations which have
> became more and more visible with features like uclamp as an example. This
> serie tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD

This needs elaboration IMO, as "tasks stacked on the same CPU of a PD"
isn't really an issue per se? What's the scenario being fixed here?

> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent periodic load balancer to select a group_misfit_task CPU
> because group_overloaded has higher priority.
>
> Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
> trigger the active migration of a task on another CPU.
>
> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> Exec flags when we just want to look for a possible better CPU.
>
> Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
> it.

nit: here's still the "mecanism" typo :)

> Patch 5 enable has_idle_core for !SMP system to track if there may be an
> idle CPU in the LLC.

s/!SMP/!SMT/

> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> - when a task is stuck on a CPU and the system is not overutilized.
> - if there is a possible idle CPU when the system is overutilized.

I'd find it helpful to have the motivation spelled out more verbosely
here. Why are there tasks stuck? UCLAMP_MAX? Temporarily reduced
capacity?

Would be nice to have a very concrete list of scenarios/issues in mind
that are being fixed and a description of how they're fixed by this
patchset (e.g. current behaviour, new behaviour, reason why this
behaviour is the 'more' correct one).
>
> More tests results will come later as I wanted to send the pachtset before
> LPC.
>
> I have kept Tbench figures as I added them in v7 but results are the same
> with the correct patch 6.

Ah, I was confused by this sentence at first. So now for v8 both hackbench
and tbench are the same for baseline and patchset.

> Tbench on dragonboard rb5
> schedutil and EAS enabled
>
> # process       tip              +patchset
> 1               29.3(+/-0.3%)    29.2(+/-0.2%)    +0%
> 2               61.1(+/-1.8%)    61.7(+/-3.2%)    +1%
> 4               260.0(+/-1.7%)   258.8(+/-2.8%)   -1%
> 8               1361.2(+/-3.1%)  1377.1(+/-1.9%)  +1%
> 16              981.5(+/-0.6%)   958.0(+/-1.7%)   -2%

So I've done some analysis on tbench in the meantime, at least for the
1-process case, because I was puzzled by your v7 result. Indeed there are
plenty of wakeups; in particular, in a 10s run I see 62806 tbench wakeups
with a distribution like so (time from one wakeup to the next):

 0 ms -  1 ms: 62157
 1 ms -  2 ms:    44
 2 ms -  3 ms:    32
 3 ms -  4 ms:     5
 4 ms -  5 ms:    10
 5 ms -  6 ms:     6
 6 ms -  7 ms:     2
 7 ms -  8 ms:     2
 8 ms -  9 ms:     3
12 ms - 13 ms:     2
15 ms - 16 ms:     1
16 ms - 17 ms:     1
24 ms - 25 ms:     1
95 ms - 96 ms:     1

> Hackbench didn't show any difference

hackbench is always OU once it has ramped up anyway, right? So this is
expected.

If I'm not mistaken, neither of the workloads is then likely to exercise
the changes of the series. (Both have more than enough wakeup events, and
hackbench is additionally OU, so EAS is mostly skipped.)

It would be helpful for reviewing then to have a workload that benefits
from this push mechanism, maybe at least one with and one without
UCLAMP_MAX?
>
> Changes since v7:
> - Rebased on latest tip/sched/core
> - Fix some typos
> - Fix patch 6 mess
>
> Vincent Guittot (6):
>   sched/fair: Filter false overloaded_group case for EAS
>   sched/fair: Update overutilized detection
>   sched/fair: Prepare select_task_rq_fair() to be called for new cases
>   sched/fair: Add push task mechanism for fair
>   sched/fair: Enable idle core tracking for !SMT
>   sched/fair: Add EAS and idle cpu push trigger
>
>  kernel/sched/fair.c     | 350 +++++++++++++++++++++++++++++++++++-----
>  kernel/sched/sched.h    |  46 ++++--
>  kernel/sched/topology.c |   2 +
>  3 files changed, 345 insertions(+), 53 deletions(-)
>