[PATCH 0/3] sched: Generalize misfit load balance
Posted by Qais Yousef 2 years ago
Misfit load balance was added to help handle HMP systems, where we can make
a wrong decision at wake up, thinking a task can run on a smaller core, but
its characteristics then change and it requires migrating to a bigger core to
meet its performance demands.

With the addition of uclamp, we can encounter more cases where such wrong
placement decisions can be made and require the load balancer to take
corrective action.

Specifically, if a big task capped by uclamp_max was placed on a big core at
wake up because EAS thought it was the most energy efficient core at the time,
the dynamics of the system might change: other uncapped tasks might wake up on
the cluster, and a new, more energy efficient placement could become available
for the capped task(s).

We can generalize the misfit load balance to handle different types of misfit
(whatever they may be) by simply giving it a reason. The reason then decides
the type of action required.

The current misfit implementation is considered MISFIT_PERF, which means we
need to move a task to a better CPU to meet its performance requirements.

For UCLAMP_MAX I propose MISFIT_POWER, where we need to find a better
placement to control the task's impact on power.

Once we have an API to annotate latency sensitive tasks, it is anticipated
that a MISFIT_LATENCY load balance will be required to handle oversubscription,
distributing latency sensitive tasks more evenly to reduce their wake up
latency.
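
As a rough sketch, the reason could be carried in an enum next to the existing
misfit state in kernel/sched/sched.h. This is illustrative only; the real
definitions are in patch 2, and MISFIT_LATENCY is not part of this series:

	/* Illustrative sketch; see patch 2 for the actual definitions. */
	typedef enum misfit_reason {
		MISFIT_PERF,	/* needs a bigger CPU to meet perf demands */
		MISFIT_POWER,	/* capped by uclamp_max; hurting power */
	} misfit_reason_t;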

Patch 1 splits the misfit status update from misfit detection by adding a new
function, is_misfit_task().
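
For illustration, based on the current upstream update_misfit_status(), the
split could look roughly like this (see patch 1 for the actual code):

	/* Sketch: detection factored out of the status update. */
	static inline bool is_misfit_task(struct task_struct *p, struct rq *rq)
	{
		if (!p || p->nr_cpus_allowed == 1)
			return false;

		return !task_fits_cpu(p, cpu_of(rq));
	}

	static void update_misfit_status(struct task_struct *p, struct rq *rq)
	{
		if (!sched_asym_cpucap_active())
			return;

		if (!is_misfit_task(p, rq)) {
			rq->misfit_task_load = 0;
			return;
		}

		/* Keep a non-zero load even if task_h_load() returns 0. */
		rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
	}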

Patch 2 implements the generalization logic by adding a misfit reason,
propagating it correctly, and guarding the current misfit code with the
MISFIT_PERF reason.
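
Roughly, is_misfit_task() grows a reason out-parameter so callers can filter
on why the task is a misfit (sketch only; the exact code is in patch 2):

	/* Sketch: report why the task is a misfit. */
	static inline bool is_misfit_task(struct task_struct *p, struct rq *rq,
					  misfit_reason_t *reason)
	{
		if (!p || p->nr_cpus_allowed == 1)
			return false;

		if (!task_fits_cpu(p, cpu_of(rq))) {
			*reason = MISFIT_PERF;
			return true;
		}

		return false;
	}

The rq then records the reason alongside misfit_task_load, and the existing
misfit balance paths act only when the reason is MISFIT_PERF.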

Patch 3 is an RFC on a potential implementation for MISFIT_POWER.

Patches 1 and 2 were tested standalone with no regression observed. They
should not introduce a functional change and can be considered for merging if
they make sense after any review comments are addressed.

Patch 3 was only tested to verify that it does what I expected it to do; no
real power/perf testing was done. Mainly this is because I was expecting to
remove uclamp max-aggregation [1], and the RFC I currently have (which I wrote
many months ago) is tied to detecting a task being uncapped by
max-aggregation. I need to rethink the detection mechanism.

Besides that, the logic relies on using find_energy_efficient_cpu() to find
the best potential new placement for the task. To do that, though, we need to
force every CPU to do the MISFIT_POWER load balance, as we don't know in
advance which CPU should do the pull. There might be better ways to handle
this, so feedback and thoughts would be appreciated.
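
To make that concrete, here is a hedged sketch of the pull decision.
should_pull_misfit_power() is a hypothetical helper name, not the patch 3
code; find_energy_efficient_cpu() is the existing EAS wake-up path:

	/*
	 * Sketch: every CPU runs the MISFIT_POWER balance pass, but only
	 * the CPU that feec() nominates actually does the pull.
	 */
	static bool should_pull_misfit_power(struct task_struct *p, int dst_cpu)
	{
		int new_cpu = find_energy_efficient_cpu(p, task_cpu(p));

		return new_cpu == dst_cpu && new_cpu != task_cpu(p);
	}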

[1] https://lore.kernel.org/lkml/20231208015242.385103-1-qyousef@layalina.io/

Thanks!

--
Qais Yousef

Qais Yousef (3):
  sched/fair: Add is_misfit_task() function
  sched/fair: Generalize misfit lb by adding a misfit reason
  sched/fair: Implement new type of misfit MISFIT_POWER

 kernel/sched/fair.c  | 115 +++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |   9 ++++
 2 files changed, 110 insertions(+), 14 deletions(-)

-- 
2.34.1
Re: [PATCH 0/3] sched: Generalize misfit load balance
Posted by Pierre Gondois 2 years ago
Hello Qais,

On 12/9/23 02:17, Qais Yousef wrote:
> [...]
> 
> Patch 3 was only tested to verify that it does what I expected it to do; no
> real power/perf testing was done. Mainly this is because I was expecting to
> remove uclamp max-aggregation [1], and the RFC I currently have (which I
> wrote many months ago) is tied to detecting a task being uncapped by
> max-aggregation. I need to rethink the detection mechanism.

I tried to trigger the MISFIT_POWER misfit reason without success so far.
Would it be possible to provide a workload/test to reliably trigger the
condition?

Regards,
Pierre

Re: [PATCH 0/3] sched: Generalize misfit load balance
Posted by Qais Yousef 1 year, 12 months ago
On 12/21/23 16:26, Pierre Gondois wrote:
> Hello Qais,
> 
> On 12/9/23 02:17, Qais Yousef wrote:
> > [...]
> 
> I tried to trigger the MISFIT_POWER misfit reason without success so far.
> Would it be possible to provide a workload/test to reliably trigger the
> condition?

I spawn a busy loop like

	cat /dev/zero > /dev/null

Then use

	uclampset -M 0 -p $PID

to toggle uclamp_max between 0 and 1024, back and forth.
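
If uclampset isn't available, the same toggle can be done with the raw
sched_setattr() syscall. A minimal sketch (assumes CONFIG_UCLAMP_TASK; the
sched_attr layout matches include/uapi/linux/sched/types.h):

	#include <stdint.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <sys/types.h>

	#define SCHED_FLAG_KEEP_ALL		0x18	/* keep policy + params */
	#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

	struct sched_attr {
		uint32_t size;
		uint32_t sched_policy;
		uint64_t sched_flags;
		int32_t  sched_nice;
		uint32_t sched_priority;
		uint64_t sched_runtime;
		uint64_t sched_deadline;
		uint64_t sched_period;
		uint32_t sched_util_min;
		uint32_t sched_util_max;
	};

	/* Set only uclamp_max, leaving policy and other params untouched. */
	static int set_uclamp_max(pid_t pid, unsigned int util_max)
	{
		struct sched_attr attr = {
			.size		= sizeof(attr),
			.sched_flags	= SCHED_FLAG_KEEP_ALL |
					  SCHED_FLAG_UTIL_CLAMP_MAX,
			.sched_util_max	= util_max,
		};

		return syscall(SYS_sched_setattr, pid, &attr, 0);
	}

	int main(int argc, char **argv)
	{
		pid_t pid = argc > 1 ? atoi(argv[1]) : 0;

		/* Toggle uclamp_max between 0 and 1024 every 5 seconds. */
		for (;;) {
			set_uclamp_max(pid, 0);
			sleep(5);
			set_uclamp_max(pid, 1024);
			sleep(5);
		}
	}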

Try to load the system with some workload and you should see something like
the attached picture. Red boxes are periods where uclamp_max is 0; the rest is
for uclamp_max = 1024. Note how the task is constantly moved between CPUs
while capped.


Cheers

--
Qais Yousef