[PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases

Posted by Vincent Guittot 1 year ago
The current Energy Aware Scheduler has some known limitations which have
become more and more visible with features such as uclamp. This
series tries to fix some of those issues:
- tasks stacked on the same CPU of a PD
- tasks stuck on the wrong CPU.

Patch 1 fixes the case where a CPU is wrongly classified as overloaded
when it is actually capped to a lower compute capacity. This
misclassification can prevent the periodic load balancer from selecting a
group_misfit_task CPU because group_overloaded has a higher priority.
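
As a conceptual illustration (a hedged sketch, not the actual patch; the
helper name and parameters are hypothetical): a CPU whose utilization only
exceeds its capped capacity should be reported as misfit rather than
overloaded:

/*
 * Hypothetical helper, for illustration only: a CPU looks overloaded
 * when compared against its capped capacity (e.g. clamped by uclamp_max)
 * but would still fit within its original capacity. Classifying it as
 * overloaded hides the group_misfit_task case from the load balancer.
 */
static inline bool cpu_falsely_overloaded(unsigned long util,
					  unsigned long capped_capacity,
					  unsigned long orig_capacity)
{
	/* Not overloaded at all against the capped capacity. */
	if (util <= capped_capacity)
		return false;

	/* Over the cap, but the uncapped CPU could still absorb it. */
	return util <= orig_capacity;
}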

Patch 2 creates a new EM interface that will be used by Patch 3.

Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD
while other CPUs might be a better choice. feec() looks for the CPU with
the highest spare capacity in a PD, assuming it will be the best CPU from
an energy efficiency PoV because it requires the smallest increase of OPP.
This is often but not always true; this policy filters out other CPUs
which would be just as efficient, for example because they end up using
the same OPP but have fewer running tasks.
In fact, we only care about the cost of the new OPP that will be
selected to handle the waking task. In many cases, several CPUs will end
up selecting the same OPP and, as a result, have the same energy cost. In
such cases, we can use other metrics to select the best CPU among those
with the same energy cost. Patch 3 reworks feec() to look first for the
lowest cost in a PD and then for the most performant CPU among the
candidates. For now, this only tries to evenly spread the number of
runnable tasks across CPUs, but this can be improved with other metrics
like the sched slice duration in a follow-up series.
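
For illustration, here is a hedged sketch of this two-step selection (not
the actual feec() code; the structure and names are made up for this
example): first keep the lowest-cost candidates, then break ties with the
number of runnable tasks:

/*
 * Illustrative only: pick, among the candidate CPUs of a PD, the one
 * with the lowest energy cost for the OPP it would end up running at,
 * and break ties by preferring the CPU with the fewest runnable tasks.
 */
struct eas_candidate {
	int cpu;
	unsigned long cost;		/* energy cost of the resulting OPP */
	unsigned int nr_runnable;	/* runnable tasks already on the CPU */
};

static int pick_best_candidate(const struct eas_candidate *c, int nr)
{
	int i, best = -1;

	for (i = 0; i < nr; i++) {
		if (best < 0 ||
		    c[i].cost < c[best].cost ||
		    (c[i].cost == c[best].cost &&
		     c[i].nr_runnable < c[best].nr_runnable))
			best = i;
	}

	return best < 0 ? -1 : c[best].cpu;
}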

perf sched pipe on a dragonboard rb5 has been used to compare the overhead
of the new feec() vs the current implementation.

9 iterations of perf bench sched pipe -T -l 80000
                ops/sec  stdev 
tip/sched/core  13001    (+/- 1.2%)
+ patches 1-3   14349    (+/- 5.4%)  +10.4%


Patch 4 removes the now unused em_cpu_energy().

Patch 5 solves another problem: a task can be stuck on a CPU forever
because it never sleeps anymore and, as a result, never wakes up and goes
through feec(). Such a task can be detected by comparing its util_avg or
runnable_avg with the compute capacity of the CPU (see the sketch after
this list). Once detected, feec() can be called to check whether there is
a better CPU for the stuck task. The call can be done in 2 places:
- When the task is put back in the runnable list after its running slice,
  using the balance callback mechanism similarly to the rt/dl push callback.
- During the cfs tick when there is only 1 running task stuck on the CPU,
  in which case the balance callback can't be used.
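
A minimal sketch of the detection, assuming the raw PELT signals and CPU
capacity are compared directly (the helper is hypothetical and ignores
uclamp and RT/IRQ pressure for simplicity):

/*
 * Hypothetical helper, for illustration only: a task that never sleeps
 * can be considered stuck when its utilization or runnable signal no
 * longer fits the compute capacity of its current CPU.
 */
static inline bool task_is_stuck(struct task_struct *p, int cpu)
{
	unsigned long capacity = arch_scale_cpu_capacity(cpu);

	return READ_ONCE(p->se.avg.util_avg) > capacity ||
	       READ_ONCE(p->se.avg.runnable_avg) > capacity;
}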

This push callback mechanism, combined with the new feec() algorithm,
ensures that tasks always get a chance to migrate to the most suitable
CPU and don't stay stuck on a CPU which is no longer the most suitable
one (a rough sketch of the callback wiring follows the examples below).
As examples:
- A task that woke up on a big CPU and whose uclamp max prevents it from
  sleeping and waking up again can migrate to a smaller CPU once that is
  more power efficient.
- Tasks are spread across the CPUs of the PD when they target the same OPP.
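
By analogy with the rt/dl push callbacks, the wiring could look roughly
like the following (a sketch only; the per-CPU head, function names and
push logic are hypothetical and not taken from the patches):

static DEFINE_PER_CPU(struct balance_callback, fair_push_head);

/* Hypothetical: move the stuck task to the CPU chosen by feec(). */
static void push_fair_task(struct rq *rq)
{
	/* detach the stuck task, ask feec() for a better CPU, attach it there */
}

/* Queued when a stuck task is put back on the runnable list. */
static void queue_fair_push(struct rq *rq)
{
	queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu),
			       push_fair_task);
}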

Patch 6 adds the task misfit migration case to the cfs tick and push
callback mechanism to prevent waking up an idle CPU unnecessarily.

Patch 7 removes the need to test uclamp_min in cpu_overutilized() to
trigger the active migration of a task to another CPU.

Compared to v1:
- Calling feec() even when overutilized has been removed from this series
  and will be addressed in a separate series. Only the uclamp_min case has
  been kept, as it is now handled by the push callback and tick mechanism.
- The push mechanism has been cleaned up, fixed and simplified.

This series implements some of the topics discussed at OSPM [1]. Other
topics will be part of another series.

[1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp

Vincent Guittot (7):
  sched/fair: Filter false overloaded_group case for EAS
  energy model: Add a get previous state function
  sched/fair: Rework feec() to use cost instead of spare capacity
  energy model: Remove unused em_cpu_energy()
  sched/fair: Add push task callback for EAS
  sched/fair: Add misfit case to push task callback for EAS
  sched/fair: Update overutilized detection

 include/linux/energy_model.h | 112 ++----
 kernel/sched/fair.c          | 707 +++++++++++++++++++++++------------
 kernel/sched/sched.h         |   2 +
 3 files changed, 503 insertions(+), 318 deletions(-)

-- 
2.43.0
Re: [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases
Posted by Christian Loehle 12 months ago
Hi Vincent,
just some quick remarks, I won't have time to actually review and test this
in-depth until January. Sorry for that.

On 12/17/24 16:07, Vincent Guittot wrote:
> The current Energy Aware Scheduler has some known limitations which have
> become more and more visible with features such as uclamp. This
> series tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
> 
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> when it is actually capped to a lower compute capacity. This
> misclassification can prevent the periodic load balancer from selecting a
> group_misfit_task CPU because group_overloaded has a higher priority.
> 
> Patch 2 creates a new EM interface that will be used by Patch 3
> 
> Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD
> while other CPUs might be a better choice. feec() looks for the CPU with
> the highest spare capacity in a PD, assuming it will be the best CPU from
> an energy efficiency PoV because it requires the smallest increase of OPP.
> This is often but not always true; this policy filters out other CPUs
> which would be just as efficient, for example because they end up using
> the same OPP but have fewer running tasks.
> In fact, we only care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and, as a result, have the same energy cost. In
> such cases, we can use other metrics to select the best CPU among those
> with the same energy cost. Patch 3 reworks feec() to look first for the
> lowest cost in a PD and then for the most performant CPU among the
> candidates. For now, this only tries to evenly spread the number of
> runnable tasks across CPUs, but this can be improved with other metrics
> like the sched slice duration in a follow-up series.


Could you elaborate why this is the better strategy instead of max_spare_cap?
Presumably the highest max_spare_cap has to have rather small tasks if it
still has more runnable tasks than the other (higher util) CPUs of the PD.
So nr of runnable tasks should intuitively be the less stable metric (to
me anyway).

For which workloads does it make a difference?
Which benefit from nr of runnable tasks? Which for max_spare_cap?

> 
> perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> of the new feec() vs current implementation.
> 
> 9 iterations of perf bench sched pipe -T -l 80000
>                 ops/sec  stdev 
> tip/sched/core  13001    (+/- 1.2%)
> + patches 1-3   14349    (+/- 5.4%)  +10.4%

I'm confused: the feec() rework in patch 3 does more comparisons overall,
so it should be slower, but here we have a 10% improvement?
OTOH feec() shouldn't be running much in the first place, since you
don't run it when overutilized anymore (i.e. keep mainline behavior).
The difference should be negligible then, and for me it basically is (rk3399
with -l 5000 to get a roughly comparable test duration; results in seconds,
lower is better; 10 iterations):
tip/sched/core:
20.4573 +-0.0832
vingu/rework-eas-v2-patches-1-to-3:
20.7054 +-0.0411

> 
> 
> Patch 4 removed the now unused em_cpu_energy()
> 
> Patch 5 solves another problem: a task can be stuck on a CPU forever
> because it never sleeps anymore and, as a result, never wakes up and goes
> through feec(). Such a task can be detected by comparing its util_avg or
> runnable_avg with the compute capacity of the CPU. Once detected, feec()
> can be called to check whether there is a better CPU for the stuck task.
> The call can be done in 2 places:
> - When the task is put back in the runnable list after its running slice,
>   using the balance callback mechanism similarly to the rt/dl push callback.
> - During the cfs tick when there is only 1 running task stuck on the CPU,
>   in which case the balance callback can't be used.
> 
> This push callback mechanism, combined with the new feec() algorithm,
> ensures that tasks always get a chance to migrate to the most suitable
> CPU and don't stay stuck on a CPU which is no longer the most suitable
> one. As examples:
> - A task that woke up on a big CPU and whose uclamp max prevents it from
>   sleeping and waking up again can migrate to a smaller CPU once that is
>   more power efficient.
> - Tasks are spread across the CPUs of the PD when they target the same OPP.
> 
> Patch 6 adds the task misfit migration case to the cfs tick and push
> callback mechanism to prevent waking up an idle CPU unnecessarily.
> 
> Patch 7 removes the need to test uclamp_min in cpu_overutilized() to
> trigger the active migration of a task to another CPU.

Would it make sense to further split 5-7 for ease of reviewing?
Maybe even 1 and 4 as fixes, too?

Regards,
Christian
Re: [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases
Posted by Vincent Guittot 12 months ago
On Wed, 18 Dec 2024 at 15:06, Christian Loehle <christian.loehle@arm.com> wrote:
>
> Hi Vincent,
> just some quick remarks, I won't have time to actually review and test this
> in-depth until January. Sorry for that.

no problem

>
> On 12/17/24 16:07, Vincent Guittot wrote:
> > The current Energy Aware Scheduler has some known limitations which have
> > become more and more visible with features such as uclamp. This
> > series tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
> >
> > Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> > when it is actually capped to a lower compute capacity. This
> > misclassification can prevent the periodic load balancer from selecting a
> > group_misfit_task CPU because group_overloaded has a higher priority.
> >
> > Patch 2 creates a new EM interface that will be used by Patch 3
> >
> > Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD
> > while other CPUs might be a better choice. feec() looks for the CPU with
> > the highest spare capacity in a PD, assuming it will be the best CPU from
> > an energy efficiency PoV because it requires the smallest increase of OPP.
> > This is often but not always true; this policy filters out other CPUs
> > which would be just as efficient, for example because they end up using
> > the same OPP but have fewer running tasks.
> > In fact, we only care about the cost of the new OPP that will be
> > selected to handle the waking task. In many cases, several CPUs will end
> > up selecting the same OPP and, as a result, have the same energy cost. In
> > such cases, we can use other metrics to select the best CPU among those
> > with the same energy cost. Patch 3 reworks feec() to look first for the
> > lowest cost in a PD and then for the most performant CPU among the
> > candidates. For now, this only tries to evenly spread the number of
> > runnable tasks across CPUs, but this can be improved with other metrics
> > like the sched slice duration in a follow-up series.
>
>
> Could you elaborate why this is the better strategy instead of max_spare_cap?
> Presumably the highest max_spare_cap has to have rather small tasks if it
> still has more runnable tasks than the other (higher util) CPUs of the PD.

You don't always have a direct relation between nr_runnable,
max_spare_cap and task "size" because of blocked utilization. This
rework keeps the same behavior as highest max_spare_cap in a lot of
cases, including when the spare capacity leads to selecting a different
OPP, but it also covers other cases where blocked utilization,
uclamp_min, uclamp_max or cpufreq clamping of the min/max freq breaks
this relation.

While studying traces, we can often see small tasks being packed on a
CPU while another one is idle in the same PD.

> So nr of runnable tasks should intuitively be the less stable metric (to
> me anyway).

Spreading tasks helps to reduce the average scheduling latency, which
is beneficial for small tasks. This performance decision is a first
simple version which is meant to be improved with other hints like the
sched_slice.

>
> For which workloads does it make a difference?
> Which benefit from nr of runnable tasks? Which for max_spare_cap?

I have started to run some tests on an Android device but don't have
consolidated results yet, and I didn't want to delay the v2 any longer.

>
> >
> > perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> > of the new feec() vs current implementation.
> >
> > 9 iterations of perf bench sched pipe -T -l 80000
> >                 ops/sec  stdev
> > tip/sched/core  13001    (+/- 1.2%)
> > + patches 1-3   14349    (+/- 5.4%)  +10.4%
>
> I'm confused, the feec() rework in patch 3 does more comparisons overall,
> so should be slower, but here we have a 10% improvement?

TBH, I didn't expect a perf improvement but wanted to check that there
is no regression. I ran the tests several times and the results are
always in the same range.

> OTOH feec() shouldn't be running much in the first place, since you
> don't run it when overutilized anymore (i.e. keep mainline behavior).

This should not make any difference here as the system is not
overutilized anyway

> The difference should be negligible then, and for me it basically is (rk3399
> with -l 5000 to get a roughly comparable test duration; results in seconds,
> lower is better; 10 iterations):
> tip/sched/core:
> 20.4573 +-0.0832
> vingu/rework-eas-v2-patches-1-to-3:
> 20.7054 +-0.0411
>
> >
> >
> > Patch 4 removed the now unused em_cpu_energy()
> >
> > Patch 5 solves another problem: a task can be stuck on a CPU forever
> > because it never sleeps anymore and, as a result, never wakes up and goes
> > through feec(). Such a task can be detected by comparing its util_avg or
> > runnable_avg with the compute capacity of the CPU. Once detected, feec()
> > can be called to check whether there is a better CPU for the stuck task.
> > The call can be done in 2 places:
> > - When the task is put back in the runnable list after its running slice,
> >   using the balance callback mechanism similarly to the rt/dl push callback.
> > - During the cfs tick when there is only 1 running task stuck on the CPU,
> >   in which case the balance callback can't be used.
> >
> > This push callback mechanism, combined with the new feec() algorithm,
> > ensures that tasks always get a chance to migrate to the most suitable
> > CPU and don't stay stuck on a CPU which is no longer the most suitable
> > one. As examples:
> > - A task that woke up on a big CPU and whose uclamp max prevents it from
> >   sleeping and waking up again can migrate to a smaller CPU once that is
> >   more power efficient.
> > - Tasks are spread across the CPUs of the PD when they target the same OPP.
> >
> > Patch 6 adds the task misfit migration case to the cfs tick and push
> > callback mechanism to prevent waking up an idle CPU unnecessarily.
> >
> > Patch 7 removes the need to test uclamp_min in cpu_overutilized() to
> > trigger the active migration of a task to another CPU.
>
> Would it make sense to further split 5-7 for ease of reviewing?
> Maybe even 1 and 4 as fixes, too?
>
> Regards,
> Christian
>