[RFC PATCH 4/5] sched/fair: Use EAS also when overutilized

Vincent Guittot posted 5 patches 1 year, 3 months ago
[RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Vincent Guittot 1 year, 3 months ago
Keep looking for an energy-efficient CPU even when the system is
overutilized, and use the CPU returned by feec() if it has been able to
find one. Otherwise, fall back to the scheduler's default performance
and spread mode.
A system can become overutilized for a short time when workers of a
workqueue wake up for a short piece of background work like a vmstat
update. Continuing to look for an energy-efficient CPU prevents
breaking the power-oriented packing of tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2273eecf6086..e46af2416159 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		    cpumask_test_cpu(cpu, p->cpus_ptr))
 			return cpu;
 
-		if (!is_rd_overutilized(this_rq()->rd)) {
+		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
 				return new_cpu;
-- 
2.34.1
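For context, the resulting wake-up decision flow is roughly the following
(a simplified sketch around the changed check in select_task_rq_fair(),
not the exact upstream code):

	if (sched_energy_enabled()) {
		/*
		 * Always try the energy-aware placement first, even when
		 * the root domain is overutilized.
		 */
		new_cpu = find_energy_efficient_cpu(p, prev_cpu);
		if (new_cpu >= 0)
			return new_cpu;

		/*
		 * feec() found no energy-efficient candidate: fall back to
		 * the default performance/spread path (wake_affine() /
		 * select_idle_sibling() or find_idlest_cpu()).
		 */
		new_cpu = prev_cpu;
	}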
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Quentin Perret 1 year, 2 months ago
Hi Vincent,

On Friday 30 Aug 2024 at 15:03:08 (+0200), Vincent Guittot wrote:
> Keep looking for an energy efficient CPU even when the system is
> overutilized and use the CPU returned by feec() if it has been able to find
> one. Otherwise fallback to the default performance and spread mode of the
> scheduler.
> A system can become overutilized for a short time when workers of a
> workqueue wake up for a short background work like vmstat update.
> Continuing to look for a energy efficient CPU will prevent to break the
> power packing of tasks.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2273eecf6086..e46af2416159 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  		    cpumask_test_cpu(cpu, p->cpus_ptr))
>  			return cpu;
>  
> -		if (!is_rd_overutilized(this_rq()->rd)) {
> +		if (sched_energy_enabled()) {

As mentioned during LPC, when there is no idle time on a CPU, the
utilization value of the tasks running on it is no longer a good
approximation for how much the tasks want, it becomes an image of how
much CPU time they were given. That is particularly problematic in the
co-scheduling case, but not just.

IOW, when we're OU, the util values are bogus, so using feec() is frankly
wrong IMO. If we don't have a good idea of how long tasks want to run,
the EM just can't help us with anything so we should stay away from it.

I understand how just plain bailing out as we do today is sub-optimal,
but whatever we do to improve on that can't be doing utilization-based
task placement.

Have you considered making the default (non-EAS) wake-up path a little
more reluctant to migrations when EAS is enabled? That should allow us
to maintain a somewhat stable task placement when OU is only transient
(e.g. due to misfit), but without using util values when we really
shouldn't.

Thoughts?

Thanks,
Quentin

>  			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>  			if (new_cpu >= 0)
>  				return new_cpu;
> -- 
> 2.34.1
> 
>
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Vincent Guittot 1 year, 2 months ago
On Fri, 20 Sept 2024 at 18:17, Quentin Perret <qperret@google.com> wrote:
>
> Hi Vincent,
>
> On Friday 30 Aug 2024 at 15:03:08 (+0200), Vincent Guittot wrote:
> > Keep looking for an energy efficient CPU even when the system is
> > overutilized and use the CPU returned by feec() if it has been able to find
> > one. Otherwise fallback to the default performance and spread mode of the
> > scheduler.
> > A system can become overutilized for a short time when workers of a
> > workqueue wake up for a short background work like vmstat update.
> > Continuing to look for a energy efficient CPU will prevent to break the
> > power packing of tasks.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2273eecf6086..e46af2416159 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >                   cpumask_test_cpu(cpu, p->cpus_ptr))
> >                       return cpu;
> >
> > -             if (!is_rd_overutilized(this_rq()->rd)) {
> > +             if (sched_energy_enabled()) {
>
> As mentioned during LPC, when there is no idle time on a CPU, the
> utilization value of the tasks running on it is no longer a good
> approximation for how much the tasks want, it becomes an image of how
> much CPU time they were given. That is particularly problematic in the
> co-scheduling case, but not just.

Yes, this is not always true when overutilized, and only becomes true
after a certain amount of time. When a CPU is fully utilized without any
idle time anymore, feec() will not find a CPU for the task.

>
> IOW, when we're OU, the util values are bogus, so using feec() is frankly
> wrong IMO. If we don't have a good idea of how long tasks want to run,

Except that a CPU is not already fully busy, with no idle time, when
the system is overutilized. We have a ~20% margin on each CPU, which
means the system is overutilized as soon as one CPU is more than 80%
utilized, which is far from having no idle time anymore. So even when
OU, it doesn't mean that all CPUs have run out of idle time; most of the
time the opposite happens, and feec() can still make a useful decision.
Also, when there is no idle time on a CPU, the task doesn't fit and
feec() doesn't return a CPU.
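As a small standalone illustration of that ~20% margin, assuming the
fits_capacity() definition in kernel/sched/fair.c,
(cap) * 1280 < (max) * 1024 (numbers are illustrative):

	#include <stdio.h>

	/* Mirrors the ~20% margin used by fits_capacity(): utilization
	 * "fits" while it stays below roughly 80% of the capacity. */
	static int fits_capacity(unsigned long util, unsigned long capacity)
	{
		return util * 1280 < capacity * 1024;
	}

	int main(void)
	{
		unsigned long capacity = 1024;	/* CPU at full capacity */
		unsigned long utils[] = { 500, 800, 850, 1024 };

		for (int i = 0; i < 4; i++)
			printf("util=%4lu -> %s\n", utils[i],
			       fits_capacity(utils[i], capacity) ?
			       "fits" : "overutilized");
		return 0;
	}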

Then, the old way to compute invariant utilization was particularly
sensitive to the overutilized state because the utilization was capped
and asymptotically converged to the max CPU compute capacity. This is
not true with the new PELT: utilization can go above the compute
capacity of the CPU and remain correct as long as we are able to
increase the compute capacity before there is no idle time left. In
theory, the utilization "could" be correct until we reach 1024 (for
utilization or runnable) and there is no longer any way to catch up on
the temporary lack of compute capacity.

> the EM just can't help us with anything so we should stay away from it.
>
> I understand how just plain bailing out as we do today is sub-optimal,
> but whatever we do to improve on that can't be doing utilization-based
> task placement.
>
> Have you considered making the default (non-EAS) wake-up path a little
> more reluctant to migrations when EAS is enabled? That should allow us
> to maintain a somewhat stable task placement when OU is only transient
> (e.g. due to misfit), but without using util values when we really
> shouldn't.
>
> Thoughts?

As mentioned above, OU doesn't mean there is no idle time anymore, and
in that case utilization is still relevant. I would be in favor of
adding more performance-related decisions into feec(), similarly to what
is done in patch 3: for example, if a CPU doesn't fit, we could still
return a CPU chosen with more of a performance focus.


>
> Thanks,
> Quentin
>
> >                       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >                       if (new_cpu >= 0)
> >                               return new_cpu;
> > --
> > 2.34.1
> >
> >
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Quentin Perret 1 year, 2 months ago
Hi Vincent,

On Wednesday 25 Sep 2024 at 15:27:45 (+0200), Vincent Guittot wrote:
> On Fri, 20 Sept 2024 at 18:17, Quentin Perret <qperret@google.com> wrote:
> >
> > Hi Vincent,
> >
> > On Friday 30 Aug 2024 at 15:03:08 (+0200), Vincent Guittot wrote:
> > > Keep looking for an energy efficient CPU even when the system is
> > > overutilized and use the CPU returned by feec() if it has been able to find
> > > one. Otherwise fallback to the default performance and spread mode of the
> > > scheduler.
> > > A system can become overutilized for a short time when workers of a
> > > workqueue wake up for a short background work like vmstat update.
> > > Continuing to look for a energy efficient CPU will prevent to break the
> > > power packing of tasks.
> > >
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 2273eecf6086..e46af2416159 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > >                   cpumask_test_cpu(cpu, p->cpus_ptr))
> > >                       return cpu;
> > >
> > > -             if (!is_rd_overutilized(this_rq()->rd)) {
> > > +             if (sched_energy_enabled()) {
> >
> > As mentioned during LPC, when there is no idle time on a CPU, the
> > utilization value of the tasks running on it is no longer a good
> > approximation for how much the tasks want, it becomes an image of how
> > much CPU time they were given. That is particularly problematic in the
> > co-scheduling case, but not just.
> 
> Yes, this is not always true when overutilized and  true after a
> certain amount of time. When a CPU is fully utilized without any idle
> time anymore, feec() will not find a CPU for the task

Well the problem is that it might actually find a CPU for the task -- a
co-scheduled task can obviously look arbitrarily small from a util PoV.

> >
> > IOW, when we're OU, the util values are bogus, so using feec() is frankly
> > wrong IMO. If we don't have a good idea of how long tasks want to run,
> 
> Except that the CPU is not already fully busy without idle time when
> the system is overutilized. We have  ~20% margin on each CPU which
> means that system are overutilized as soon as one CPU is more than 80%
> utilized which is far from not having idle time anymore. So even when
> OU, it doesn't mean that all CPUs don't have idle time and most of the
> time the opposite happens and feec() can still make a useful decision.

My problem with the proposed change here is that it doesn't at all
distinguish the truly overloaded case (when we have more compute
demand than resources) from a system with a stable-ish utilization at
90%. If you're worried about the latter, then perhaps we should think
about redefining the OU threshold some other way (either by simply
making it higher or configurable, or by changing its nature to look at
the last time we actually got idle time in the system). But I'm still
rather opinionated that util-based placement is wrong for the former.

And for what it's worth, in my experience if any of the big CPUs get
anywhere near the top of their OPP range, given that the power/perf
curve is exponential it's being penny-wise and pound-foolish to
micro-optimise the placement of the other smaller tasks from an energy
PoV at the same time. But if we can show that it helps real use-cases,
then why not.

> Also, when there is no idle time on a CPU, the task doesn't fit and
> feec() doesn't return a CPU.

It doesn't fit on that CPU but might still (incorrectly) fit on another
CPU right?

> Then, the old way to compute invariant utilization was particularly
> sensible to the overutilized state because the utilization was capped
> and asymptotically converging to max cpu compute capacity but this is
> not true with the new pelt and we can go above compute capacity of the
> cpu and remain correct as long as we are able to increase the compute
> capacity before that there is no idle time. In theory, the utilization
> "could" be correct until we reach 1024 (for utilization or runnable)
> and there is no way to catch up the temporary under compute capacity.
> 
> > the EM just can't help us with anything so we should stay away from it.
> >
> > I understand how just plain bailing out as we do today is sub-optimal,
> > but whatever we do to improve on that can't be doing utilization-based
> > task placement.
> >
> > Have you considered making the default (non-EAS) wake-up path a little
> > more reluctant to migrations when EAS is enabled? That should allow us
> > to maintain a somewhat stable task placement when OU is only transient
> > (e.g. due to misfit), but without using util values when we really
> > shouldn't.
> >
> > Thoughts?
> 
> As mentioned above OU doesn't mean no idle time anymore and in this
> case utilization is still relevant

OK, but please distinguish this from the truly overloaded case somehow,
I really don't think we can 'break' it just to help with the corner case
when we've got 90% ish util.

> In would be in favor of adding
> more performance related decision into feec() similarly to have is
> done in patch 3 which would be for example that if a cpu doesn't fit
> we could still return  a CPU with more performance focus

Fine with me in principle as long as we stop using utilization as a
proxy for how much a task wants when it really isn't that any more.

Thanks!
Quentin

> >
> > Thanks,
> > Quentin
> >
> > >                       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> > >                       if (new_cpu >= 0)
> > >                               return new_cpu;
> > > --
> > > 2.34.1
> > >
> > >
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Vincent Guittot 1 year, 2 months ago
On Thu, 26 Sept 2024 at 11:10, Quentin Perret <qperret@google.com> wrote:
>
> Hi Vincent,
>
> On Wednesday 25 Sep 2024 at 15:27:45 (+0200), Vincent Guittot wrote:
> > On Fri, 20 Sept 2024 at 18:17, Quentin Perret <qperret@google.com> wrote:
> > >
> > > Hi Vincent,
> > >
> > > On Friday 30 Aug 2024 at 15:03:08 (+0200), Vincent Guittot wrote:
> > > > Keep looking for an energy efficient CPU even when the system is
> > > > overutilized and use the CPU returned by feec() if it has been able to find
> > > > one. Otherwise fallback to the default performance and spread mode of the
> > > > scheduler.
> > > > A system can become overutilized for a short time when workers of a
> > > > workqueue wake up for a short background work like vmstat update.
> > > > Continuing to look for a energy efficient CPU will prevent to break the
> > > > power packing of tasks.
> > > >
> > > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > > ---
> > > >  kernel/sched/fair.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 2273eecf6086..e46af2416159 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > >                   cpumask_test_cpu(cpu, p->cpus_ptr))
> > > >                       return cpu;
> > > >
> > > > -             if (!is_rd_overutilized(this_rq()->rd)) {
> > > > +             if (sched_energy_enabled()) {
> > >
> > > As mentioned during LPC, when there is no idle time on a CPU, the
> > > utilization value of the tasks running on it is no longer a good
> > > approximation for how much the tasks want, it becomes an image of how
> > > much CPU time they were given. That is particularly problematic in the
> > > co-scheduling case, but not just.
> >
> > Yes, this is not always true when overutilized and  true after a
> > certain amount of time. When a CPU is fully utilized without any idle
> > time anymore, feec() will not find a CPU for the task
>
> Well the problem is that is might actually find a CPU for the task -- a
> co-scheduled task can obviously look arbitrarily small from a util PoV.

With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
utilization"), util_est remains set to the value it had before the task
had to share the CPU with other tasks, which means util_est remains
correct even if util_avg decreases because of sharing the CPU with
other tasks. This was done to cover the cases you mention above, where
both util_avg and util_est were decreasing when tasks started to share
the CPU bandwidth with others.
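A rough paraphrase of the guard that commit adds in util_est_update()
(simplified, not the exact upstream hunk):

	/*
	 * If the task's utilization at dequeue is well below its runnable
	 * contribution, the task was waiting for the CPU rather than
	 * sleeping, so its lower util_avg is not a reliable image of how
	 * much it wanted to run: keep the previous util_est.
	 */
	if (dequeued + UTIL_EST_MARGIN < task_runnable(p))
		goto done;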

>
> > >
> > > IOW, when we're OU, the util values are bogus, so using feec() is frankly
> > > wrong IMO. If we don't have a good idea of how long tasks want to run,
> >
> > Except that the CPU is not already fully busy without idle time when
> > the system is overutilized. We have  ~20% margin on each CPU which
> > means that system are overutilized as soon as one CPU is more than 80%
> > utilized which is far from not having idle time anymore. So even when
> > OU, it doesn't mean that all CPUs don't have idle time and most of the
> > time the opposite happens and feec() can still make a useful decision.
>
> My problem with the proposed change here is that it doesn't at all
> distinguish between the truly overloaded case (when we have more compute
> demand that resources) from a system with a stable-ish utilization at
> 90%. If you're worried about the latter, then perhaps we should think
> about redefining the OU threshold some other way (either by simply
> making higher or configurable, or changing its nature to look at the

We can definitely increase the OU threshold, but we still have cases
with a truly overutilized CPU and still good utilization values.

> last time we actually got idle time in the system). But I'm still rather
> opinionated that util-based placement is wrong for the former.

And feec() will return -1 for that case because util_est remains high

>
> And for what it's worth, in my experience if any of the big CPUs get
> anywhere near the top of their OPP range, given that the power/perf
> curve is exponential it's being penny-wise and pound-foolish to
> micro-optimise the placement of the other smaller tasks from an energy
> PoV at the same time. But if we can show that it helps real use-cases,
> then why not.

Thermal mitigation and/or the power budget policy quickly reduce the
max compute capacity of such big CPUs, so they become overutilized at
lower OPPs, which reduces the difference between big/medium/little.

>
> > Also, when there is no idle time on a CPU, the task doesn't fit and
> > feec() doesn't return a CPU.
>
> It doesn't fit on that CPU but might still (incorrectly) fit on another
> CPU right?

The commit that I mentioned above covers those cases and the task will
not incorrectly fit on another, smaller CPU because its util_est is
preserved during the overutilized phase.

>
> > Then, the old way to compute invariant utilization was particularly
> > sensible to the overutilized state because the utilization was capped
> > and asymptotically converging to max cpu compute capacity but this is
> > not true with the new pelt and we can go above compute capacity of the
> > cpu and remain correct as long as we are able to increase the compute
> > capacity before that there is no idle time. In theory, the utilization
> > "could" be correct until we reach 1024 (for utilization or runnable)
> > and there is no way to catch up the temporary under compute capacity.
> >
> > > the EM just can't help us with anything so we should stay away from it.
> > >
> > > I understand how just plain bailing out as we do today is sub-optimal,
> > > but whatever we do to improve on that can't be doing utilization-based
> > > task placement.
> > >
> > > Have you considered making the default (non-EAS) wake-up path a little
> > > more reluctant to migrations when EAS is enabled? That should allow us
> > > to maintain a somewhat stable task placement when OU is only transient
> > > (e.g. due to misfit), but without using util values when we really
> > > shouldn't.
> > >
> > > Thoughts?
> >
> > As mentioned above OU doesn't mean no idle time anymore and in this
> > case utilization is still relevant
>
> OK, but please distinguish this from the truly overloaded case somehow,
> I really don't think we can 'break' it just to help with the corner case
> when we've got 90% ish util.
>
> > In would be in favor of adding
> > more performance related decision into feec() similarly to have is
> > done in patch 3 which would be for example that if a cpu doesn't fit
> > we could still return  a CPU with more performance focus
>
> Fine with me in principle as long as we stop using utilization as a
> proxy for how much a task wants when it really isn't that any more.
>
> Thanks!
> Quentin
>
> > >
> > > Thanks,
> > > Quentin
> > >
> > > >                       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> > > >                       if (new_cpu >= 0)
> > > >                               return new_cpu;
> > > > --
> > > > 2.34.1
> > > >
> > > >
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Quentin Perret 1 year, 2 months ago
On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
> utilization"), the util_est remains set the value before having to
> share the cpu with other tasks which means that the util_est remains
> correct even if its util_avg decrease because of sharing the cpu with
> other task. This has been done to cover the cases that you mention
> above whereboth util_avg and util_est where decreasing when tasks
> starts to  share  the CPU bandwidth with others

I don't think I agree about the correctness of that util_est value at
all. The above patch only makes it arbitrarily out of date in the truly
overcommitted case. All the util-based heuristics we have in the
scheduler are based around the assumption that the close future will
look like the recent past, so using an arbitrarily old util-est is still
incorrect. I can understand how this may work OK in RT-app or other
use-cases with perfectly periodic tasks for their entire lifetime and
such, but this doesn't work at all in the general case.

> And feec() will return -1 for that case because util_est remains high

And again, checking that a task fits is broken to start with if we don't
know how big the task is. When we have reasons to believe that the util
values are no longer correct (and the absence of idle time is a very
good reason for that) we just need to give up on them. The fact that we
have to resort to using out-of-date data to sort of make that work is
just another proof that this is not a good idea in the general case.

> the commit that I mentioned above covers those cases and the task will
> not incorrectly fit to another smaller CPU because its util_est is
> preserved during the overutilized phase

There are other reasons why a task may look like it fits, e.g. two tasks
coscheduled on a big CPU get 50% util each, then we migrate one away, the
CPU looks half empty. Is it half empty? We've got no way to tell until
we see idle time. The current util_avg and old util_est value are just
not helpful, they're both bad signals and we should just discard them.
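To put numbers on that example (made up, assuming SCHED_CAPACITY_SCALE
= 1024 and no pressure of any kind):

	/* Two tasks co-scheduled on a big CPU: each gets roughly half of
	 * the CPU time, so PELT reports ~512 of utilization for each. */
	unsigned long util_a = 512, util_b = 512;

	/* Migrate task B away: the source CPU now shows ~512 util and
	 * looks half empty, but task A may actually want the whole CPU.
	 * There is no way to tell until idle time is observed. */
	unsigned long src_cpu_util = util_a;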

So again I do feel like the best way forward would be to change the
nature of the OU threshold to actually ask cpuidle 'when was the last
time there was idle time?' (or possibly cache that in the idle task
directly). And then based on that we can decide whether we want to enter
feec() and do util-based decision, or to kick the push-pull mechanism in
your other patches, things like that. That would solve/avoid the problem
I mentioned in the previous paragraph and make the OU detection more
robust. We could also consider using different thresholds in different
places to re-enable load-balancing earlier, and give up on feec() a bit
later to avoid messing the entire task placement when we're only
transiently OU because of misfit. But eventually, we really need to just
give up on util values altogether when we're really overcommitted, it's
really an invariant we need to keep.
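Purely as a hypothetical sketch of that 'last time there was idle time'
idea (neither the field nor the helper below exists in mainline):

	/* Hypothetical: assume the idle task records a timestamp in
	 * rq->last_idle_stamp every time it runs on the CPU. */
	static inline bool cpu_had_recent_idle(struct rq *rq, u64 now,
					       u64 max_age)
	{
		/* Trust the CPU's util signals only if it has seen idle
		 * time recently enough. */
		return now - READ_ONCE(rq->last_idle_stamp) < max_age;
	}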

Thanks,
Quentin
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Vincent Guittot 1 year, 2 months ago
On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@google.com> wrote:
>
> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
> > With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
> > utilization"), the util_est remains set the value before having to
> > share the cpu with other tasks which means that the util_est remains
> > correct even if its util_avg decrease because of sharing the cpu with
> > other task. This has been done to cover the cases that you mention
> > above whereboth util_avg and util_est where decreasing when tasks
> > starts to  share  the CPU bandwidth with others
>
> I don't think I agree about the correctness of that util_est value at
> all. The above patch only makes it arbitrarily out of date in the truly
> overcommitted case. All the util-based heuristic we have in the
> scheduler are based around the assumption that the close future will
> look like the recent past, so using an arbitrarily old util-est is still
> incorrect. I can understand how this may work OK in RT-app or other

This fixes a real use case on Android devices.

> use-cases with perfectly periodic tasks for their entire lifetime and
> such, but this doesn't work at all in the general case.
>
> > And feec() will return -1 for that case because util_est remains high
>
> And again, checking that a task fits is broken to start with if we don't
> know how big the task is. When we have reasons to believe that the util
> values are no longer correct (and the absence of idle time is a very
> good reason for that) we just need to give up on them. The fact that we
> have to resort to using out-of-date data to sort of make that work is
> just another proof that this is not a good idea in the general case.

That's where I disagree, this is not an out-of-date value, this is the
last correct one before sharing the cpu

>
> > the commit that I mentioned above covers those cases and the task will
> > not incorrectly fit to another smaller CPU because its util_est is
> > preserved during the overutilized phase
>
> There are other reasons why a task may look like it fits, e.g. two tasks
> coscheduled on a big CPU get 50% util each, then we migrate one away, the

50% of what? Not the CPU capacity. I think you are missing one piece of
the recent PELT behavior here. I fully agree that when the system is
overcommitted, util-based task placement is not correct, but I also
think that feec() can't find a CPU in such a case.

> CPU looks half empty. Is it half empty? We've got no way to tell until

The same here: it's not, thanks to util_est.

> we see idle time. The current util_avg and old util_est value are just
> not helpful, they're both bad signals and we should just discard them.
>
> So again I do feel like the best way forward would be to change the
> nature of the OU threshold to actually ask cpuidle 'when was the last
> time there was idle time?' (or possibly cache that in the idle task
> directly). And then based on that we can decide whether we want to enter
> feec() and do util-based decision, or to kick the push-pull mechanism in
> your other patches, things like that. That would solve/avoid the problem
> I mentioned in the previous paragraph and make the OU detection more
> robust. We could also consider using different thresholds in different
> places to re-enable load-balancing earlier, and give up on feec() a bit
> later to avoid messing the entire task placement when we're only
> transiently OU because of misfit. But eventually, we really need to just
> give up on util values altogether when we're really overcommitted, it's
> really an invariant we need to keep.

For now, I will increase the OU threshold to the CPU capacity to reduce
the false overutilized state caused by misfit tasks, which is what I
really care about. The redesign of OU will come in a different series,
as this implies more rework. IIUC your point, we are more interested in
the prev CPU than in the current one.

>
> Thanks,
> Quentin
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Christian Loehle 1 year ago
On 10/3/24 07:27, Vincent Guittot wrote:
> On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@google.com> wrote:
>>
>> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
>>> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
>>> utilization"), the util_est remains set the value before having to
>>> share the cpu with other tasks which means that the util_est remains
>>> correct even if its util_avg decrease because of sharing the cpu with
>>> other task. This has been done to cover the cases that you mention
>>> above whereboth util_avg and util_est where decreasing when tasks
>>> starts to  share  the CPU bandwidth with others
>>
>> I don't think I agree about the correctness of that util_est value at
>> all. The above patch only makes it arbitrarily out of date in the truly
>> overcommitted case. All the util-based heuristic we have in the
>> scheduler are based around the assumption that the close future will
>> look like the recent past, so using an arbitrarily old util-est is still
>> incorrect. I can understand how this may work OK in RT-app or other
> 
> This fixes a real use case on android device
> 
>> use-cases with perfectly periodic tasks for their entire lifetime and
>> such, but this doesn't work at all in the general case.
>>
>>> And feec() will return -1 for that case because util_est remains high
>>
>> And again, checking that a task fits is broken to start with if we don't
>> know how big the task is. When we have reasons to believe that the util
>> values are no longer correct (and the absence of idle time is a very
>> good reason for that) we just need to give up on them. The fact that we
>> have to resort to using out-of-date data to sort of make that work is
>> just another proof that this is not a good idea in the general case.
> 
> That's where I disagree, this is not an out-of-date value, this is the
> last correct one before sharing the cpu
Just adding on this since we are discussing the correctness of util_est
value on an OU CPU since
commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task utilization").
I agree that this commit fixed the immediate false util_est drop after
coscheduling two (or more) tasks, but that's a specific one.
If one of two coscheduled tasks starts growing, its util_est can't
reflect that once its compute demand grows above the CPU capacity; that
commit doesn't change that. There is no generally sensible way of
estimating such a util_est anyway.
Even worse if both coscheduled tasks grow, which isn't uncommon,
considering they might be related.

So
"this is the last correct one before sharing the cpu" is true,
"This is not an out-of-date value" isn't true in the general case.

I agree that the OU definition can evolve, and basing it on idle time
makes sense, but given the common period of 16 ms (frame rate) we might
delay setting OU by quite a lot in the cases where it actually is true.

Regards,
Christian
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Quentin Perret 1 year, 2 months ago
On Thursday 03 Oct 2024 at 08:27:00 (+0200), Vincent Guittot wrote:
> On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@google.com> wrote:
> > And again, checking that a task fits is broken to start with if we don't
> > know how big the task is. When we have reasons to believe that the util
> > values are no longer correct (and the absence of idle time is a very
> > good reason for that) we just need to give up on them. The fact that we
> > have to resort to using out-of-date data to sort of make that work is
> > just another proof that this is not a good idea in the general case.
> 
> That's where I disagree, this is not an out-of-date value, this is the
> last correct one before sharing the cpu

This value is arbitrarily old, so of course it is out of date. This only
sort of works for tasks that don't change their behaviour. That's true
for some use-cases, yes, but absolutely not in the general case. How
can you know that the last correct value before sharing the CPU is still
valid minutes later? The fact that the system started to be
overcommitted is a good indication that something has changed, so we
really can't tell. Also, how is any of this going to work for newly
created tasks while we're overcommitted for example?

> > > the commit that I mentioned above covers those cases and the task will
> > > not incorrectly fit to another smaller CPU because its util_est is
> > > preserved during the overutilized phase
> >
> > There are other reasons why a task may look like it fits, e.g. two tasks
> > coscheduled on a big CPU get 50% util each, then we migrate one away, the
> 
> 50% of what ?

50% of SCHED_CAPACITY_SCALE (the above sentence mentions a 'big' CPU, and
for simplicity I assumed no 'pressure' of any kind).

> not the cpu capacity. I think you miss one piece of the
> recent pelt behavior here

That could very well be the case, which piece are you thinking of?

> I fullygree that when the system os
> overcommitted the util base task placement is not correct but I also
> think that feec() can't find a cpu in such case

But why are we even entering feec() then? Isn't this just looking for
trouble really? As per the example above, task migrations can cause util
'gaps' on the source CPU which may make it appear like a good candidate
from an energy standpoint, but it's all bogus really. And let's not even
talk about how wrong the EM is going be when simulating a potential task
migration in the overcommitted case.

> > CPU looks half empty. Is it half empty? We've got no way to tell until
> 
> The same here, it's not thanks to util_est

And again, an out-of-date util est value is not helpful in the general
case. It helps certain use-cases, sure, but please let's not promote it
to a load-bearing construct on top of which we build our entire
scheduling strategy :-)

> > we see idle time. The current util_avg and old util_est value are just
> > not helpful, they're both bad signals and we should just discard them.
> >
> > So again I do feel like the best way forward would be to change the
> > nature of the OU threshold to actually ask cpuidle 'when was the last
> > time there was idle time?' (or possibly cache that in the idle task
> > directly). And then based on that we can decide whether we want to enter
> > feec() and do util-based decision, or to kick the push-pull mechanism in
> > your other patches, things like that. That would solve/avoid the problem
> > I mentioned in the previous paragraph and make the OU detection more
> > robust. We could also consider using different thresholds in different
> > places to re-enable load-balancing earlier, and give up on feec() a bit
> > later to avoid messing the entire task placement when we're only
> > transiently OU because of misfit. But eventually, we really need to just
> > give up on util values altogether when we're really overcommitted, it's
> > really an invariant we need to keep.
> 
> For now, I will increase the OU threshold to cpu capacity to reduce
> the false overutilized state because of misfit tasks which is what I
> really care about.

Cool, and FWIW I am supportive of making this whole part of the code
better -- a transient OU state due to misfit does make a mess of things
and we should indeed be able to do better.

> The redesign of OU will come in a different series
> as this implies more rework.

Ack, this can be made orthogonal to this work I think.

> IIUC your point, we are more interested
> by the prev cpu than the current one

Hmm, not sure to understand that part. What do you mean?

Thanks,
Quentin
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Vincent Guittot 1 year, 2 months ago
On Thu, 3 Oct 2024 at 10:21, Quentin Perret <qperret@google.com> wrote:
>
> On Thursday 03 Oct 2024 at 08:27:00 (+0200), Vincent Guittot wrote:
> > On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@google.com> wrote:
> > > And again, checking that a task fits is broken to start with if we don't
> > > know how big the task is. When we have reasons to believe that the util
> > > values are no longer correct (and the absence of idle time is a very
> > > good reason for that) we just need to give up on them. The fact that we
> > > have to resort to using out-of-date data to sort of make that work is
> > > just another proof that this is not a good idea in the general case.
> >
> > That's where I disagree, this is not an out-of-date value, this is the
> > last correct one before sharing the cpu
>
> This value is arbitrarily old, so of course it is out of date. This only
> sort of works for tasks that don't change their behaviour. That's true
> for some use-cases, yes, but absolutely not in the general case. How
> can you know that the last correct value before sharing the CPU is still
> valid minutes later? The fact that the system started to be
> overcommitted is a good indication that something has changed, so we
> really can't tell. Also, how is any of this going to work for newly
> created tasks while we're overcommitted for example?
>
> > > > the commit that I mentioned above covers those cases and the task will
> > > > not incorrectly fit to another smaller CPU because its util_est is
> > > > preserved during the overutilized phase
> > >
> > > There are other reasons why a task may look like it fits, e.g. two tasks
> > > coscheduled on a big CPU get 50% util each, then we migrate one away, the
> >
> > 50% of what ?
>
> 50% of SCHED_CAPACITY_SCALE (the above sentence mentions a 'big' CPU, and
> for simplicity I assumed no 'pressure' of any kind).

OK, I missed the big CPU.

>
> > not the cpu capacity. I think you miss one piece of the
> > recent pelt behavior here
>
> That could very well be the case, which piece are you thinking of?

The current PELT algorithm tracks actual CPU utilization and can go
above the CPU capacity (but not above 1024), so a task's utilization can
become bigger than a little CPU's capacity.
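As an illustration with made-up numbers (using the ~20% margin check,
util * 1280 < capacity * 1024, of fits_capacity() in fair.c):

	unsigned long task_util = 700;		/* below 1024 ...         */
	unsigned long little_cap = 512, big_cap = 1024;

	/* 700 * 1280 > 512 * 1024: the task does not fit on a little CPU,
	 * so feec() will not pick one, even while the system is OU. */
	/* 700 * 1280 < 1024 * 1024: it still fits on a big CPU. */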

>
> > I fullygree that when the system os
> > overcommitted the util base task placement is not correct but I also
> > think that feec() can't find a cpu in such case
>
> But why are we even entering feec() then? Isn't this just looking for
> trouble really? As per the example above, task migrations can cause util
> 'gaps' on the source CPU which may make it appear like a good candidate
> from an energy standpoint, but it's all bogus really. And let's not even
> talk about how wrong the EM is going be when simulating a potential task
> migration in the overcommitted case.
>
> > > CPU looks half empty. Is it half empty? We've got no way to tell until
> >
> > The same here, it's not thanks to util_est
>
> And again, an out-of-date util est value is not helpful in the general
> case. It helps certain use-cases, sure, but please let's not promote it
> to a load-bearing construct on top of which we build our entire
> scheduling strategy :-)
>
> > > we see idle time. The current util_avg and old util_est value are just
> > > not helpful, they're both bad signals and we should just discard them.
> > >
> > > So again I do feel like the best way forward would be to change the
> > > nature of the OU threshold to actually ask cpuidle 'when was the last
> > > time there was idle time?' (or possibly cache that in the idle task
> > > directly). And then based on that we can decide whether we want to enter
> > > feec() and do util-based decision, or to kick the push-pull mechanism in
> > > your other patches, things like that. That would solve/avoid the problem
> > > I mentioned in the previous paragraph and make the OU detection more
> > > robust. We could also consider using different thresholds in different
> > > places to re-enable load-balancing earlier, and give up on feec() a bit
> > > later to avoid messing the entire task placement when we're only
> > > transiently OU because of misfit. But eventually, we really need to just
> > > give up on util values altogether when we're really overcommitted, it's
> > > really an invariant we need to keep.
> >
> > For now, I will increase the OU threshold to cpu capacity to reduce
> > the false overutilized state because of misfit tasks which is what I
> > really care about.
>
> Cool, and FWIW I am supportive of making this whole part of the code
> better -- a transient OU state due to misfit does make a mess of things
> and we should indeed be able to do better.
>
> > The redesign of OU will come in a different series
> > as this implies more rework.
>
> Ack, this can be made orthogonal to this work I think.
>
> > IIUC your point, we are more interested
> > by the prev cpu than the current one
>
> Hmm, not sure to understand that part. What do you mean?

As replied to Lukasz, if you want to discard the utilization of a task,
you need to check the previous CPU.

>
> Thanks,
> Quentin
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Quentin Perret 1 year, 2 months ago
On Thursday 03 Oct 2024 at 10:57:55 (+0200), Vincent Guittot wrote:
> The current pelt algorithm track actual cpu utilization and can go
> above cpu capacity (but not above 1024) so a  task utilization can
> become bigger than a little cpu capacity

Right, the time invariance thing. So yes, I still think that a mix of
co-scheduling and task migrations (which is likely common in the
overcommitted state) will cause some CPUs to appear lightly utilized at
least transiently, hence tricking feec() into thinking it can help when
it really can't.

> As replied to Lukasz, if you want to discard utilization of a trask
> you need to check the previous cpu

Please help me out here because I'm still not quite sure what we're
talking about. Could you please expand a bit?

Thanks,
Quentin
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Vincent Guittot 1 year, 2 months ago
On Thu, 3 Oct 2024 at 11:52, Quentin Perret <qperret@google.com> wrote:
>
> On Thursday 03 Oct 2024 at 10:57:55 (+0200), Vincent Guittot wrote:
> > The current pelt algorithm track actual cpu utilization and can go
> > above cpu capacity (but not above 1024) so a  task utilization can
> > become bigger than a little cpu capacity
>
> Right, the time invariance thing. So yes, I still think that a mix of
> co-scheduling and task migrations (which is likely common in the
> overcommitted state) will cause some CPUs to appear lightly utilized at
> least transiently, hence tricking feec() into thinking it can help when
> it really can't.
>
> > As replied to Lukasz, if you want to discard utilization of a trask
> > you need to check the previous cpu
>
> Please help me out here because I'm still not quite sure what we're
> talking about. Could you please expand a bit?

If you consider that the utilization of a task is meaningless because
of CPU overcommitment, then you need to check whether the prev CPU of
the waking task is/was overcommitted the last time the task ran, in
addition to checking the overcommitment of the next CPU.

>
> Thanks,
> Quentin
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Lukasz Luba 1 year, 2 months ago
Hi Vincent,

On 10/3/24 07:27, Vincent Guittot wrote:
> On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@google.com> wrote:
>>
>> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
>>> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
>>> utilization"), the util_est remains set the value before having to
>>> share the cpu with other tasks which means that the util_est remains
>>> correct even if its util_avg decrease because of sharing the cpu with
>>> other task. This has been done to cover the cases that you mention
>>> above whereboth util_avg and util_est where decreasing when tasks
>>> starts to  share  the CPU bandwidth with others
>>
>> I don't think I agree about the correctness of that util_est value at
>> all. The above patch only makes it arbitrarily out of date in the truly
>> overcommitted case. All the util-based heuristic we have in the
>> scheduler are based around the assumption that the close future will
>> look like the recent past, so using an arbitrarily old util-est is still
>> incorrect. I can understand how this may work OK in RT-app or other
> 
> This fixes a real use case on android device
> 
>> use-cases with perfectly periodic tasks for their entire lifetime and
>> such, but this doesn't work at all in the general case.
>>
>>> And feec() will return -1 for that case because util_est remains high
>>
>> And again, checking that a task fits is broken to start with if we don't
>> know how big the task is. When we have reasons to believe that the util
>> values are no longer correct (and the absence of idle time is a very
>> good reason for that) we just need to give up on them. The fact that we
>> have to resort to using out-of-date data to sort of make that work is
>> just another proof that this is not a good idea in the general case.
> 
> That's where I disagree, this is not an out-of-date value, this is the
> last correct one before sharing the cpu
> 
>>
>>> the commit that I mentioned above covers those cases and the task will
>>> not incorrectly fit to another smaller CPU because its util_est is
>>> preserved during the overutilized phase
>>
>> There are other reasons why a task may look like it fits, e.g. two tasks
>> coscheduled on a big CPU get 50% util each, then we migrate one away, the
> 
> 50% of what ? not the cpu capacity. I think you miss one piece of the
> recent pelt behavior here. I fullygree that when the system os
> overcommitted the util base task placement is not correct but I also
> think that feec() can't find a cpu in such case
> 
>> CPU looks half empty. Is it half empty? We've got no way to tell until
> 
> The same here, it's not thanks to util_est
> 
>> we see idle time. The current util_avg and old util_est value are just
>> not helpful, they're both bad signals and we should just discard them.
>>
>> So again I do feel like the best way forward would be to change the
>> nature of the OU threshold to actually ask cpuidle 'when was the last
>> time there was idle time?' (or possibly cache that in the idle task
>> directly). And then based on that we can decide whether we want to enter
>> feec() and do util-based decision, or to kick the push-pull mechanism in
>> your other patches, things like that. That would solve/avoid the problem
>> I mentioned in the previous paragraph and make the OU detection more
>> robust. We could also consider using different thresholds in different
>> places to re-enable load-balancing earlier, and give up on feec() a bit
>> later to avoid messing the entire task placement when we're only
>> transiently OU because of misfit. But eventually, we really need to just
>> give up on util values altogether when we're really overcommitted, it's
>> really an invariant we need to keep.
> 
> For now, I will increase the OU threshold to cpu capacity to reduce
> the false overutilized state because of misfit tasks which is what I
> really care about. The redesign of OU will come in a different series
> as this implies more rework. IIUC your point, we are more interested
> by the prev cpu than the current one
> 

What do you mean by that?
Is it due to OU in e.g. Little cluster?
Is it amplified by the uclamp_max usage?

You're heavily rewriting EAS+EM and I would like to understand
your motivation.

BTW, do you know that anyone who wants to improve EAS/EM
should be able to provide the power numbers?

W/o the power numbers the discussion is moot. Many times SW engineers
have wrong assumptions about HW, therefore we have to test and
measure. There are confidential power saving techniques in HW
that can be missed, and some ugly workarounds created in SW for issues
which don't exist.

That's why we have to discuss the power numbers.

This _subject_ is not different. If EAS is going to help
even in OU state - we need the numbers.

Regards,
Lukasz
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Vincent Guittot 1 year, 2 months ago
On Thu, 3 Oct 2024 at 10:14, Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Vincent,
>
> On 10/3/24 07:27, Vincent Guittot wrote:
> > On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@google.com> wrote:
> >>
> >> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
> >>> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
> >>> utilization"), the util_est remains set the value before having to
> >>> share the cpu with other tasks which means that the util_est remains
> >>> correct even if its util_avg decrease because of sharing the cpu with
> >>> other task. This has been done to cover the cases that you mention
> >>> above whereboth util_avg and util_est where decreasing when tasks
> >>> starts to  share  the CPU bandwidth with others
> >>
> >> I don't think I agree about the correctness of that util_est value at
> >> all. The above patch only makes it arbitrarily out of date in the truly
> >> overcommitted case. All the util-based heuristic we have in the
> >> scheduler are based around the assumption that the close future will
> >> look like the recent past, so using an arbitrarily old util-est is still
> >> incorrect. I can understand how this may work OK in RT-app or other
> >
> > This fixes a real use case on android device
> >
> >> use-cases with perfectly periodic tasks for their entire lifetime and
> >> such, but this doesn't work at all in the general case.
> >>
> >>> And feec() will return -1 for that case because util_est remains high
> >>
> >> And again, checking that a task fits is broken to start with if we don't
> >> know how big the task is. When we have reasons to believe that the util
> >> values are no longer correct (and the absence of idle time is a very
> >> good reason for that) we just need to give up on them. The fact that we
> >> have to resort to using out-of-date data to sort of make that work is
> >> just another proof that this is not a good idea in the general case.
> >
> > That's where I disagree, this is not an out-of-date value, this is the
> > last correct one before sharing the cpu
> >
> >>
> >>> the commit that I mentioned above covers those cases and the task will
> >>> not incorrectly fit to another smaller CPU because its util_est is
> >>> preserved during the overutilized phase
> >>
> >> There are other reasons why a task may look like it fits, e.g. two tasks
> >> coscheduled on a big CPU get 50% util each, then we migrate one away, the
> >
> > 50% of what ? not the cpu capacity. I think you miss one piece of the
> > recent pelt behavior here. I fullygree that when the system os
> > overcommitted the util base task placement is not correct but I also
> > think that feec() can't find a cpu in such case
> >
> >> CPU looks half empty. Is it half empty? We've got no way to tell until
> >
> > The same here, it's not thanks to util_est
> >
> >> we see idle time. The current util_avg and old util_est value are just
> >> not helpful, they're both bad signals and we should just discard them.
> >>
> >> So again I do feel like the best way forward would be to change the
> >> nature of the OU threshold to actually ask cpuidle 'when was the last
> >> time there was idle time?' (or possibly cache that in the idle task
> >> directly). And then based on that we can decide whether we want to enter
> >> feec() and do util-based decision, or to kick the push-pull mechanism in
> >> your other patches, things like that. That would solve/avoid the problem
> >> I mentioned in the previous paragraph and make the OU detection more
> >> robust. We could also consider using different thresholds in different
> >> places to re-enable load-balancing earlier, and give up on feec() a bit
> >> later to avoid messing the entire task placement when we're only
> >> transiently OU because of misfit. But eventually, we really need to just
> >> give up on util values altogether when we're really overcommitted, it's
> >> really an invariant we need to keep.
> >
> > For now, I will increase the OU threshold to cpu capacity to reduce
> > the false overutilized state because of misfit tasks which is what I
> > really care about. The redesign of OU will come in a different series
> > as this implies more rework. IIUC your point, we are more interested
> > by the prev cpu than the current one
> >
>
> What do you mean by that?
> Is it due to OU in e.g. Little cluster?
> Is it amplified by the uclamp_max usage?

You need to know if the prev cpu was overcommitted to know if the task
utilization is correct and usable

>
> You're re-writing heavily the EAS+EM and I would like to understand
> your motivation.

I want to cover more cases that are obviously not covered by the
current EAS implementation.

>
> BTW, do you know that if you or anyone wants to improve the EAS/EM
> should be able to provide the power numbers?

Having power numbers with the latest mainline kernel is not always
easy, as platforms don't support it. Typically, the Pixel 6 doesn't
support v6.11 or even v6.12-rc1 with enough power optimization. And
older custom kernels don't get the latest changes and are often modified
with out-of-tree code, which is out of the scope of the discussion.

>
> W/o the power numbers the discussion is moot. Many times SW engineers
> have wrong assumptions about HW, therefore we have to test and
> measure. There are confidential power saving techniques in HW
> that can be missed and some ugly workaround created in SW for issues
> which don't exist.
>
> That's why we have to discuss the power numbers.
>
> This _subject_ is not different. If EAS is going to help
> even in OU state - we need the numbers.

I don't expect EAS to help during the OU state so much as to prevent
blindly spreading everything around. I was more concerned with making
sure that EAS doesn't regress perf too much when overutilized, so it can
keep a sane task placement whenever possible.

>
> Regards,
> Lukasz
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Quentin Perret 1 year, 2 months ago
On Thursday 03 Oct 2024 at 09:15:31 (+0100), Lukasz Luba wrote:
> BTW, do you know that if you or anyone wants to improve the EAS/EM
> should be able to provide the power numbers?
> 
> W/o the power numbers the discussion is moot. Many times SW engineers
> have wrong assumptions about HW, therefore we have to test and
> measure. There are confidential power saving techniques in HW
> that can be missed and some ugly workaround created in SW for issues
> which don't exist.
> 
> That's why we have to discuss the power numbers.

And generally speaking +1 to the above, it would be nice to have power
numbers to motivate the series better. The hackbench results are nice to
show the limited overhead, but they obviously don't help in evaluating the
patches against what they claim to do (making better energy decisions in
feec() and such).

Thanks,
Quentin
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Lukasz Luba 1 year, 2 months ago
Hi Quentin and Vincent,

On 10/1/24 18:50, Quentin Perret wrote:
> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
>> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
>> utilization"), the util_est remains set the value before having to
>> share the cpu with other tasks which means that the util_est remains
>> correct even if its util_avg decrease because of sharing the cpu with
>> other task. This has been done to cover the cases that you mention
>> above whereboth util_avg and util_est where decreasing when tasks
>> starts to  share  the CPU bandwidth with others
> 
> I don't think I agree about the correctness of that util_est value at
> all. The above patch only makes it arbitrarily out of date in the truly
> overcommitted case. All the util-based heuristic we have in the
> scheduler are based around the assumption that the close future will
> look like the recent past, so using an arbitrarily old util-est is still
> incorrect. I can understand how this may work OK in RT-app or other
> use-cases with perfectly periodic tasks for their entire lifetime and
> such, but this doesn't work at all in the general case.

I remember that commit Vincent mentioned above. That was from a web
browser test, 'Speedometer', not rt-app. The browser has to run the
same 'computation problem' but with quite a lot of JavaScript
frameworks. Those frameworks mainly run in the browser main thread,
with some helper threads in the background.

So it was not purely rt-app or another perfectly periodic task.
Although, IIRC, Vincent was able to build a model based on rt-app
to tackle that issue.

That patch helped to better reflect the situation in the OS.

For this particular _subject_ I don't think it's relevant, though.
It was actually helping to show that the situation is worse, i.e.
closer to OU, because the task was bigger (and so we avoid EAS).

> 
>> And feec() will return -1 for that case because util_est remains high
> 
> And again, checking that a task fits is broken to start with if we don't
> know how big the task is. When we have reasons to believe that the util
> values are no longer correct (and the absence of idle time is a very
> good reason for that) we just need to give up on them. The fact that we
> have to resort to using out-of-date data to sort of make that work is
> just another proof that this is not a good idea in the general case.
> 
>> the commit that I mentioned above covers those cases and the task will
>> not incorrectly fit to another smaller CPU because its util_est is
>> preserved during the overutilized phase
> 
> There are other reasons why a task may look like it fits, e.g. two tasks
> coscheduled on a big CPU get 50% util each, then we migrate one away, the
> CPU looks half empty. Is it half empty? We've got no way to tell until
> we see idle time. The current util_avg and old util_est value are just
> not helpful, they're both bad signals and we should just discard them.

So would you then reset them to 0? Or leave them as they are?
What about the other signals (cpu runqueue) which are derived from them?
That sounds like a really heavy change, or an inconsistency in many places.

> 
> So again I do feel like the best way forward would be to change the
> nature of the OU threshold to actually ask cpuidle 'when was the last
> time there was idle time?' (or possibly cache that in the idle task
> directly). And then based on that we can decide whether we want to enter
> feec() and do util-based decision, or to kick the push-pull mechanism in
> your other patches, things like that. That would solve/avoid the problem
> I mentioned in the previous paragraph and make the OU detection more
> robust. We could also consider using different thresholds in different
> places to re-enable load-balancing earlier, and give up on feec() a bit
> later to avoid messing the entire task placement when we're only
> transiently OU because of misfit. But eventually, we really need to just
> give up on util values altogether when we're really overcommitted, it's
> really an invariant we need to keep.

IMHO the problem here with OU was amplified recently due to the
Uclamp_max setting + 'Max aggregation policy' + aggressive frequency
capping + fast freq switching.

Now we are in the situation where we complain about util metrics...

I've been warning Qais and Vincent that this usage of Uclamp_max
in such an environment is dangerous and might explode.

If one background task is capped hard in CPU freq, but does computation
'all the time', making that CPU have no idle time - then IMO
this is not good scheduling. This is a recipe for starvation.
You probably won't find any better metric.

I would suggest we stop making the OU situation worse and more
frequent with this 'artificial starvation with uclamp_max'.

I understand we want to save energy, but uclamp_max in its current shape
has too many side effects IMO.

Why haven't we invested in the 'Bandwidth controller', e.g. to make
it big.Little aware (if that could be a problem)(they were there for
many years)?

Regards,
Lukasz
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Quentin Perret 1 year, 2 months ago
Hey Lukasz,

On Wednesday 02 Oct 2024 at 08:11:06 (+0100), Lukasz Luba wrote:
> Hi Quentin and Vincent,
> 
> On 10/1/24 18:50, Quentin Perret wrote:
> > On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
> > > With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
> > > utilization"), the util_est remains set the value before having to
> > > share the cpu with other tasks which means that the util_est remains
> > > correct even if its util_avg decrease because of sharing the cpu with
> > > other task. This has been done to cover the cases that you mention
> > > above whereboth util_avg and util_est where decreasing when tasks
> > > starts to  share  the CPU bandwidth with others
> > 
> > I don't think I agree about the correctness of that util_est value at
> > all. The above patch only makes it arbitrarily out of date in the truly
> > overcommitted case. All the util-based heuristic we have in the
> > scheduler are based around the assumption that the close future will
> > look like the recent past, so using an arbitrarily old util-est is still
> > incorrect. I can understand how this may work OK in RT-app or other
> > use-cases with perfectly periodic tasks for their entire lifetime and
> > such, but this doesn't work at all in the general case.
> 
> I remember that commit Vincent mentioned above. That was from a web
> browser test 'Speedometer', not rt-app. The browser has to run the
> same 'computation problem' but with quite a lot of JavaScript
> frameworks. Those frameworks mainly run in the browser main thread,
> with some helper threads in background.
> 
> So it was not purely RT-app or other perfectly periodic task.
> Although, IIRC Vincent was able to build a model based on rt-app
> to tackle that issue.
> 
> That patch helped to better reflect the situation in the OS.

Sure thing, I'm absolutely ready to believe that an old util-est value
will be better in certain use-cases, but again I don't think we should
conflate this with the general case. In particular a util-est that was
measured when the system was lightly loaded is absolutely not guaranteed
to be valid while it is overcommitted. Freshness matters in many cases.

> For this particular _subject_ I don't think it's relevant, though.
> It was actually helping to show that the situation is worse, so
> closer to OU because the task was bigger (and we avoid EAS).
> 
> > 
> > > And feec() will return -1 for that case because util_est remains high
> > 
> > And again, checking that a task fits is broken to start with if we don't
> > know how big the task is. When we have reasons to believe that the util
> > values are no longer correct (and the absence of idle time is a very
> > good reason for that) we just need to give up on them. The fact that we
> > have to resort to using out-of-date data to sort of make that work is
> > just another proof that this is not a good idea in the general case.
> > 
> > > the commit that I mentioned above covers those cases and the task will
> > > not incorrectly fit to another smaller CPU because its util_est is
> > > preserved during the overutilized phase
> > 
> > There are other reasons why a task may look like it fits, e.g. two tasks
> > coscheduled on a big CPU get 50% util each, then we migrate one away, the
> > CPU looks half empty. Is it half empty? We've got no way to tell until
> > we see idle time. The current util_avg and old util_est value are just
> > not helpful, they're both bad signals and we should just discard them.
> 
> So would you then reset them to 0? Or leave them as they are?
> What about the other signals (cpu runqueue) which are derived from them?
> That sounds like really heavy change or inconsistency in many places.

I would just leave them as they are, but not look at them, pretty much
like we do today. In the overcommitted case, load is a superior signal
because it accounts for runnable time and the task weights, so we really
ought to use that instead of util.
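
For illustration, here is a crude standalone model of the two signals
(not the kernel's fixed-point PELT code, just a simple exponential
average with the same 32ms half-life) for one CPU running N
always-runnable, nice-0 tasks. The CPU util saturates around 1024
whatever N is, each task's util collapses to ~1024/N (the two
coscheduled tasks at ~50% each from the example above), while the
load/runnable sum keeps growing with N:

/* pelt_toy.c: toy model, not kernel code */
#include <stdio.h>

#define Y 0.97857206        /* ~2^(-1/32), i.e. a 32ms half-life */

int main(void)
{
        for (int n = 1; n <= 4; n++) {
                double cpu_util = 0, cpu_load = 0, task_util = 0;

                /* 1000 x 1ms periods, n equal-weight tasks on one CPU */
                for (int ms = 0; ms < 1000; ms++) {
                        /* the CPU is always running something */
                        cpu_util  = cpu_util  * Y + 1024 * (1 - Y);
                        /* all n tasks are runnable the whole time */
                        cpu_load  = cpu_load  * Y + 1024 * (1 - Y) * n;
                        /* each task only runs 1/n of the time */
                        task_util = task_util * Y + 1024 * (1 - Y) / n;
                }
                printf("%d tasks: cpu util ~%4.0f  task util ~%4.0f  cpu load ~%5.0f\n",
                       n, cpu_util, task_util, cpu_load);
        }
        return 0;
}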

> > 
> > So again I do feel like the best way forward would be to change the
> > nature of the OU threshold to actually ask cpuidle 'when was the last
> > time there was idle time?' (or possibly cache that in the idle task
> > directly). And then based on that we can decide whether we want to enter
> > feec() and do util-based decision, or to kick the push-pull mechanism in
> > your other patches, things like that. That would solve/avoid the problem
> > I mentioned in the previous paragraph and make the OU detection more
> > robust. We could also consider using different thresholds in different
> > places to re-enable load-balancing earlier, and give up on feec() a bit
> > later to avoid messing the entire task placement when we're only
> > transiently OU because of misfit. But eventually, we really need to just
> > give up on util values altogether when we're really overcommitted, it's
> > really an invariant we need to keep.
> 
> IMHO the problem here with OU was amplified recently due to the
> Uclamp_max setting

Ack.

> 'Max aggregation policy'

Ack.

> aggressive frequency capping

What do you mean by that?

> fast freq switching.

And not sure what fast switching has to do with the issue here?

> Now we are in the situation where we complain about util metrics...
> 
> I've been warning Qais and Vincent that this usage of Uclamp_max
> in such environment is dangerous and might explode.

I absolutely agree that uclamp max makes a huge mess of things, and util
in particular :-(

> If one background task is capped hard in CPU freq, but does computation
> 'all the time' making that CPU to have no idle time - then IMO
> this is not a good scheduling. This is a receipt for starvation.
> You probably won't find any better metric.
> 
> I would suggest to stop making the OU situation worse and more
> frequent with this 'artificial starvation with uclamp_max'.
> 
> I understand we want to safe energy, but uclamp_max in current shape
> has too many side effects IMO.
> 
> Why we haven't invested in the 'Bandwidth controller', e.g. to make
> it big.Little aware (if that could be a problem)(they were there for
> many years)?

Bandwidth control is a different thing really, not sure it can be used
interchangeably with uclamp_max in general. Running all the time at low
frequency is often going to be better from a power perspective than
running uncapped for a fixed period of time.

I think the intention of uclamp max is really to say 'these tasks have
low QoS, use spare cycles at low-ish frequency to run them'. What we
found was that it was best to use cpu.shares in conjunction with
uclamp.max to implement the 'use spare cycles' part of the previous
statement, but that was its own can of worms and caused a lot of
priority inversion problems. Hopefully the proxy exec stuff will solve
that...
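
For completeness, the per-task knob behind that 'low QoS' hint is
sched_setattr() with the util clamp fields. A minimal sketch (glibc has
no wrapper, so the raw syscall is used; the struct and flag values below
mirror the uapi definitions, and 256 is just an arbitrary example cap):

/* uclamp_max_demo.c: cap this task's requested utilization */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr {                     /* mirrors <linux/sched/types.h> */
        unsigned int       size;
        unsigned int       sched_policy;
        unsigned long long sched_flags;
        int                sched_nice;
        unsigned int       sched_priority;
        unsigned long long sched_runtime;
        unsigned long long sched_deadline;
        unsigned long long sched_period;
        unsigned int       sched_util_min;
        unsigned int       sched_util_max;
};

#define SCHED_FLAG_KEEP_ALL       0x18ULL /* keep current policy and params */
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40ULL

int main(void)
{
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_flags = SCHED_FLAG_KEEP_ALL | SCHED_FLAG_UTIL_CLAMP_MAX;
        attr.sched_util_max = 256;      /* ~25% of 1024: "low-ish frequency" */

        if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0)) {
                perror("sched_setattr");
                return 1;
        }
        printf("uclamp.max=%u set for this task\n", attr.sched_util_max);
        /* ... run the low-QoS background work here ... */
        return 0;
}

(The cgroup side of it is cpu.uclamp.max plus cpu.weight/cpu.shares, as
mentioned above.)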

Thanks,
Quentin
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Lukasz Luba 1 year, 2 months ago

On 10/2/24 08:55, Quentin Perret wrote:
> Hey Lukasz,
> 
> On Wednesday 02 Oct 2024 at 08:11:06 (+0100), Lukasz Luba wrote:
>> Hi Quentin and Vincent,
>>
>> On 10/1/24 18:50, Quentin Perret wrote:
>>> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
>>>> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
>>>> utilization"), the util_est remains set the value before having to
>>>> share the cpu with other tasks which means that the util_est remains
>>>> correct even if its util_avg decrease because of sharing the cpu with
>>>> other task. This has been done to cover the cases that you mention
>>>> above whereboth util_avg and util_est where decreasing when tasks
>>>> starts to  share  the CPU bandwidth with others
>>>
>>> I don't think I agree about the correctness of that util_est value at
>>> all. The above patch only makes it arbitrarily out of date in the truly
>>> overcommitted case. All the util-based heuristic we have in the
>>> scheduler are based around the assumption that the close future will
>>> look like the recent past, so using an arbitrarily old util-est is still
>>> incorrect. I can understand how this may work OK in RT-app or other
>>> use-cases with perfectly periodic tasks for their entire lifetime and
>>> such, but this doesn't work at all in the general case.
>>
>> I remember that commit Vincent mentioned above. That was from a web
>> browser test 'Speedometer', not rt-app. The browser has to run the
>> same 'computation problem' but with quite a lot of JavaScript
>> frameworks. Those frameworks mainly run in the browser main thread,
>> with some helper threads in background.
>>
>> So it was not purely RT-app or other perfectly periodic task.
>> Although, IIRC Vincent was able to build a model based on rt-app
>> to tackle that issue.
>>
>> That patch helped to better reflect the situation in the OS.
> 
> Sure thing, I'm absolutely ready to believe that an old util-est value
> will be better in certain use-cases, but again I don't think we should
> conflate this for the general case. In particular a util-est that was
> measured when the system was lightly loaded is absolutely not guaranteed
> to be valid while it is overcommitted. Freshness matters in many cases.

I think I got your point, fair enough.

> 
>> For this particular _subject_ I don't think it's relevant, though.
>> It was actually helping to show that the situation is worse, so
>> closer to OU because the task was bigger (and we avoid EAS).
>>
>>>
>>>> And feec() will return -1 for that case because util_est remains high
>>>
>>> And again, checking that a task fits is broken to start with if we don't
>>> know how big the task is. When we have reasons to believe that the util
>>> values are no longer correct (and the absence of idle time is a very
>>> good reason for that) we just need to give up on them. The fact that we
>>> have to resort to using out-of-date data to sort of make that work is
>>> just another proof that this is not a good idea in the general case.
>>>
>>>> the commit that I mentioned above covers those cases and the task will
>>>> not incorrectly fit to another smaller CPU because its util_est is
>>>> preserved during the overutilized phase
>>>
>>> There are other reasons why a task may look like it fits, e.g. two tasks
>>> coscheduled on a big CPU get 50% util each, then we migrate one away, the
>>> CPU looks half empty. Is it half empty? We've got no way to tell until
>>> we see idle time. The current util_avg and old util_est value are just
>>> not helpful, they're both bad signals and we should just discard them.
>>
>> So would you then reset them to 0? Or leave them as they are?
>> What about the other signals (cpu runqueue) which are derived from them?
>> That sounds like really heavy change or inconsistency in many places.
> 
> I would just leave them as they are, but not look at them, pretty much
> like we do today. In the overcommitted case, load is a superior signal
> because it accounts for runnable time and the task weights, so we really
> ought to use that instead of util.

OK, makes sense, thanks. Sounds like a valid plan to try then.

> 
>>>
>>> So again I do feel like the best way forward would be to change the
>>> nature of the OU threshold to actually ask cpuidle 'when was the last
>>> time there was idle time?' (or possibly cache that in the idle task
>>> directly). And then based on that we can decide whether we want to enter
>>> feec() and do util-based decision, or to kick the push-pull mechanism in
>>> your other patches, things like that. That would solve/avoid the problem
>>> I mentioned in the previous paragraph and make the OU detection more
>>> robust. We could also consider using different thresholds in different
>>> places to re-enable load-balancing earlier, and give up on feec() a bit
>>> later to avoid messing the entire task placement when we're only
>>> transiently OU because of misfit. But eventually, we really need to just
>>> give up on util values altogether when we're really overcommitted, it's
>>> really an invariant we need to keep.
>>
>> IMHO the problem here with OU was amplified recently due to the
>> Uclamp_max setting
> 
> Ack.
> 
>> 'Max aggregation policy'
> 
> Ack.
> 
>> aggressive frequency capping
> 
> What do you mean by that?
> 
>> fast freq switching.
> 
> And not sure what fast switching has to do with the issue here?

I mean that, with some recent changes flying around LKML, we are heading
towards a kind of 'per-task DVFS', i.e. switching the frequency 'just for
that task' when it's scheduled. This was concerning me. I think we used to
have a 'planning' view in the scheduler of the CPUs' requested
performance for the future. The future is hard to predict; sometimes even
this +20% CPU freq margin was helping us (when we run a bit longer than
our prediction).

With this approach of tackling all of the 'safety margins' to save
more power, I'm worried about harming normal, general scheduling
and performance.

I'm a big fan of saving energy, but not of pushing it so hard that
general scheduling concepts might suffer.
E.g. this _subject_, EAS when OU, is where I'm careful.


> 
>> Now we are in the situation where we complain about util metrics...
>>
>> I've been warning Qais and Vincent that this usage of Uclamp_max
>> in such environment is dangerous and might explode.
> 
> I absolutely agree that uclamp max makes a huge mess of things, and util
> in particular :-(
> 
>> If one background task is capped hard in CPU freq, but does computation
>> 'all the time' making that CPU to have no idle time - then IMO
>> this is not a good scheduling. This is a receipt for starvation.
>> You probably won't find any better metric.
>>
>> I would suggest to stop making the OU situation worse and more
>> frequent with this 'artificial starvation with uclamp_max'.
>>
>> I understand we want to safe energy, but uclamp_max in current shape
>> has too many side effects IMO.
>>
>> Why we haven't invested in the 'Bandwidth controller', e.g. to make
>> it big.Little aware (if that could be a problem)(they were there for
>> many years)?
> 
> Bandwidth control is a different thing really, not sure it can be used
> interchangeably with uclamp_max in general. Running all the time at low
> frequency is often going to be better from a power perspective than
> running uncapped for a fixed period of time.
> 
> I think the intention of uclamp max is really to say 'these tasks have
> low QoS, use spare cycles at low-ish frequency to run them'. What we
> found was that it was best to use cpu.shares in conjunction with
> uclamp.max to implement the 'use spare cycles' part of the previous
> statement, but that was its own can of worms and caused a lot of
> priority inversion problems. Hopefully the proxy exec stuff will solve
> that...
> 

Yes, I see your point. It looks like some new ideas are very welcome.
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Christian Loehle 1 year, 3 months ago
On 8/30/24 14:03, Vincent Guittot wrote:
> Keep looking for an energy efficient CPU even when the system is
> overutilized and use the CPU returned by feec() if it has been able to find
> one. Otherwise fallback to the default performance and spread mode of the
> scheduler.
> A system can become overutilized for a short time when workers of a
> workqueue wake up for a short background work like vmstat update.
> Continuing to look for a energy efficient CPU will prevent to break the
> power packing of tasks.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2273eecf6086..e46af2416159 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  		    cpumask_test_cpu(cpu, p->cpus_ptr))
>  			return cpu;
>  
> -		if (!is_rd_overutilized(this_rq()->rd)) {
> +		if (sched_energy_enabled()) {
>  			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>  			if (new_cpu >= 0)
>  				return new_cpu;

Super quick testing on pixel6:
for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done                                     
with patch 5/5 only:
Time: 19.433
Time: 19.657
Time: 19.851
Time: 19.789
Time: 19.857
Time: 20.092
Time: 19.973

mainline:
Time: 18.836
Time: 18.718
Time: 18.781
Time: 19.015
Time: 19.061
Time: 18.950
Time: 19.166


The reason we didn't always have this enabled is the belief that
this costs us too much performance in scenarios we most need it
while at best making subpar EAS decisions anyway (in an
overutilized state).
I'd be open for questioning that, but why the change of mind?
And why is this necessary in your series if the EAS selection
isn't 'final' (until the next sleep) anymore (Patch 5/5)?
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Vincent Guittot 1 year, 2 months ago
On Tue, 17 Sept 2024 at 22:24, Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/30/24 14:03, Vincent Guittot wrote:
> > Keep looking for an energy efficient CPU even when the system is
> > overutilized and use the CPU returned by feec() if it has been able to find
> > one. Otherwise fallback to the default performance and spread mode of the
> > scheduler.
> > A system can become overutilized for a short time when workers of a
> > workqueue wake up for a short background work like vmstat update.
> > Continuing to look for a energy efficient CPU will prevent to break the
> > power packing of tasks.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2273eecf6086..e46af2416159 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >                   cpumask_test_cpu(cpu, p->cpus_ptr))
> >                       return cpu;
> >
> > -             if (!is_rd_overutilized(this_rq()->rd)) {
> > +             if (sched_energy_enabled()) {
> >                       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >                       if (new_cpu >= 0)
> >                               return new_cpu;
>
> Super quick testing on pixel6:
> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
> with patch 5/5 only:

Do you mean 4/5 ?

> Time: 19.433
> Time: 19.657
> Time: 19.851
> Time: 19.789
> Time: 19.857
> Time: 20.092
> Time: 19.973
>
> mainline:
> Time: 18.836
> Time: 18.718
> Time: 18.781
> Time: 19.015
> Time: 19.061
> Time: 18.950
> Time: 19.166
>

As mentioned in the cover letter, patch 4/5 has an impact on performance.
Your 4.6% regression is in the range of what I see for these tests.

>
> The reason we didn't always have this enabled is the belief that
> this costs us too much performance in scenarios we most need it
> while at best making subpar EAS decisions anyway (in an
> overutilized state).
> I'd be open for questioning that, but why the change of mind?

Several reasons:
- the rework of EAS in patches 1, 2 and 3 of this patchset adds some
performance hints into the selection of an energy efficient CPU
- although an initial proposal made the overutilized state per sched
domain, to avoid destroying the whole placement if only a subpart of the
system was overutilized, the current implementation is binary: whole
system or nothing. As shown during [1], a short kworker wakeup can
destroy all task placement by putting the whole system in the
overutilized state. But even when overutilized, there are a lot of
possibilities to do correct feec() task placement. The overutilized
state is too aggressive.
- feec() has been reworked since the original version to be less
complex, as described by commit 5b77261c5510 ("sched/topology: Remove
the EM_MAX_COMPLEXITY limit")

[1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp

> And why is this necessary in your series if the EAS selection
> isn't 'final' (until the next sleep) anymore (Patch 5/5)?

To prevent destroying everything without a good reason. feec() will
select a CPU only if it can find one that fits the task; otherwise
we fall back to the full performance one.
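
Schematically (a standalone toy of the intended flow; the helper names
are made-up stand-ins for sched_energy_enabled()/feec() and the default
path, not the kernel code):

/* wakeup_fallback.c: toy model of the wakeup policy with this patch */
#include <stdio.h>

static int energy_model_enabled(void)         { return 1; }
/* feec()-like: returns a fitting CPU, or -1 when nothing fits */
static int energy_efficient_cpu(int prev_cpu) { return -1; }
/* default performance/spread path (select_idle_sibling()-like) */
static int performance_cpu(int prev_cpu)      { return prev_cpu; }

static int pick_wakeup_cpu(int prev_cpu)
{
        int cpu = -1;

        if (energy_model_enabled())
                cpu = energy_efficient_cpu(prev_cpu);

        if (cpu >= 0)           /* a fitting CPU was found: use it */
                return cpu;

        /* no fitting CPU (e.g. truly overcommitted): spread for perf */
        return performance_cpu(prev_cpu);
}

int main(void)
{
        printf("woken task goes to CPU%d\n", pick_wakeup_cpu(2));
        return 0;
}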
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Pierre Gondois 1 year, 3 months ago
Hello Vincent,
I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
with patches 1-4/5 using these workloads:
---
A.
a. 8 tasks at 2%/5%/10% during 1s
b. 1 task:
    - sleeping during 0.3s
    - at 100% during 0.3s
    - sleeping during 0.3s

b. is used to reach the overutilized state during a limited amount of time.
EAS is then operating, then the load balancer does the task placement, then EAS
is operating again.

B.
a. 8 tasks at 2%/5%/10% during 1s
b. 1 task:
    - at 100% during 1s
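
A crude, standalone C stand-in for task b. of workload A (whatever tool
actually generated it), just to make the duty cycle concrete:

/* duty_cycle.c: sleep 0.3s, spin at 100% for 0.3s, sleep 0.3s */
#include <time.h>

static void busy_for(double seconds)
{
        struct timespec start, now;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) +
                 (now.tv_nsec - start.tv_nsec) / 1e9 < seconds);
}

static void sleep_for(double seconds)
{
        struct timespec ts = {
                .tv_sec  = (time_t)seconds,
                .tv_nsec = (long)((seconds - (long)seconds) * 1e9),
        };
        nanosleep(&ts, NULL);
}

int main(void)
{
        sleep_for(0.3);         /* system can stay !overutilized */
        busy_for(0.3);          /* util ramps up, OU can be triggered */
        sleep_for(0.3);
        return 0;
}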

---
I'm seeing the energy consumption increase in some cases. This seems to be
due to feec() migrating tasks more often than what the load balancer does
for this workload. This leads to utilization 'spikes' and then frequency
'spikes', increasing the overall energy consumption.
This is not entirely related to this patch though, as the task placement seems
correct. I.e. feec() effectively does an optimal placement given the EM and
task utilization. The task placement is just a bit less stable than with
the load balancer.

---
Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
each of the little/mid/big CPUs (without the config, these groups would not exist).

I see an important regression in the results.
I replaced the condition to run feec() in select_task_rq_fair() with:
   if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0) {
     new_cpu = find_energy_efficient_cpu(p, prev_cpu);
     ...
   }
and obtained better results.

Indeed, for such an intensive workload:
- EAS would not have any energy benefit, so better prioritize performance
   (as Christian mentioned)
- EAS would not be able to find a fitting CPU, so running feec() should be
   avoided
- as you mention in the commit message, shuffling tasks when one CPU becomes
   momentarily overutilized is inefficient energy-wise (even though I don't have
   the numbers, it should make sense).
So detecting when the system is overloaded should be a better compromise I
assume. The condition in sched_balance_find_src_group() to let the load balancer
operate might also need to be updated.

Note:
- base: with patches 1-4/5
- _ou: run feec() when not overutilized
- _ol: run feec() when not overloaded
- mean: hackbench execution time in s.
- delta: negative is better. Value is in percentage.
┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
│ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
│ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │
│ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
│ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
│ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
│ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
│ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
│ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
│ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
│ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
│ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
│ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
│ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
│ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
│ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
│ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
│ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
└─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘

On 9/17/24 22:24, Christian Loehle wrote:
> On 8/30/24 14:03, Vincent Guittot wrote:
>> Keep looking for an energy efficient CPU even when the system is
>> overutilized and use the CPU returned by feec() if it has been able to find
>> one. Otherwise fallback to the default performance and spread mode of the
>> scheduler.
>> A system can become overutilized for a short time when workers of a
>> workqueue wake up for a short background work like vmstat update.
>> Continuing to look for a energy efficient CPU will prevent to break the
>> power packing of tasks.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>   kernel/sched/fair.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2273eecf6086..e46af2416159 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>   		    cpumask_test_cpu(cpu, p->cpus_ptr))
>>   			return cpu;
>>   
>> -		if (!is_rd_overutilized(this_rq()->rd)) {
>> +		if (sched_energy_enabled()) {
>>   			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>   			if (new_cpu >= 0)
>>   				return new_cpu;
> 
> Super quick testing on pixel6:
> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
> with patch 5/5 only:
> Time: 19.433
> Time: 19.657
> Time: 19.851
> Time: 19.789
> Time: 19.857
> Time: 20.092
> Time: 19.973
> 
> mainline:
> Time: 18.836
> Time: 18.718
> Time: 18.781
> Time: 19.015
> Time: 19.061
> Time: 18.950
> Time: 19.166
> 
> 
> The reason we didn't always have this enabled is the belief that
> this costs us too much performance in scenarios we most need it
> while at best making subpar EAS decisions anyway (in an
> overutilized state).
> I'd be open for questioning that, but why the change of mind?
> And why is this necessary in your series if the EAS selection
> isn't 'final' (until the next sleep) anymore (Patch 5/5)?
> 
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Vincent Guittot 1 year, 2 months ago
On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
> with patches 1-4/5 using these workloads:
> ---
> A.
> a. 8 tasks at 2%/5%/10% during 1s
> b. 1 task:
>     - sleeping during 0.3s
>     - at 100% during 0.3s
>     - sleeping during 0.3s
>
> b. is used to reach the overutilized state during a limited amount of time.
> EAS is then operating, then the load balancer does the task placement, then EAS
> is operating again.
>
> B.
> a. 8 tasks at 2%/5%/10% during 1s
> b. 1 task:
>     - at 100% during 1s
>
> ---
> I'm seeing the energy consumption increase in some cases. This seems to be
> due to feec() migrating tasks more often than what the load balancer does
> for this workload. This leads to utilization 'spikes' and then frequency
> 'spikes', increasing the overall energy consumption.
> This is not entirely related to this patch though,, as the task placement seems
> correct. I.e. feec() effectively does an optimal placement given the EM and
> task utilization. The task placement is just a bit less stable than with
> the load balancer.

Would patch 5 help to keep things better placed? In particular if
task b is misplaced at some point because of load balance?

I agree that load balance might still contribute to migrating tasks in an
energy-inefficient way.

>
> ---
> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
> each little/mid/big CPUs (without the config, these group would no exist).

Why did you do this? All CPUs are expected to be in the same sched domain
as they share their LLC.

>
> I see an important regression in the result.
> I replaced the condition to run feec() through select_task_rq_fair() by:
>    if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0)) {

overloaded is set when more than 1 task runs on a CPU, whatever the
utilization.

>      new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>      ...
>    }
> and obtained better results.
>
> Indeed, for such intensive workload:
> - EAS would not have any energy benefit, so better prioritize performance
>    (as Christian mentioned)
> - EAS would not be able to find a fitting CPU, so running feec() should be
>    avoided
> - as you mention in the commit message, shuffling tasks when one CPU becomes
>    momentarily overutilized is inefficient energy-wise (even though I don't have
>    the numbers, it should make sense).
> So detecting when the system is overloaded should be a better compromise I
> assume. The condition in sched_balance_find_src_group() to let the load balancer
> operate might also need to be updated.
>
> Note:
> - base: with patches 1-4/5
> - _ou: run feec() when not overutilized
> - _ol: run feec() when not overloaded
> - mean: hackbench execution time in s.
> - delta: negative is better. Value is in percentage.

Could you share your command line ? As explained in the cover letter I
have seen some perf regressions but not in the range that you have
below

What is your base ? tip/sched/core ?

> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
> │ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
> │ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │
> │ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
> │ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
> │ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
> │ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
> │ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
> │ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
> │ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
> │ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
> │ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
> │ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
> │ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
> │ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
> │ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
> │ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
> │ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
>
> On 9/17/24 22:24, Christian Loehle wrote:
> > On 8/30/24 14:03, Vincent Guittot wrote:
> >> Keep looking for an energy efficient CPU even when the system is
> >> overutilized and use the CPU returned by feec() if it has been able to find
> >> one. Otherwise fallback to the default performance and spread mode of the
> >> scheduler.
> >> A system can become overutilized for a short time when workers of a
> >> workqueue wake up for a short background work like vmstat update.
> >> Continuing to look for a energy efficient CPU will prevent to break the
> >> power packing of tasks.
> >>
> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> ---
> >>   kernel/sched/fair.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 2273eecf6086..e46af2416159 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >>                  cpumask_test_cpu(cpu, p->cpus_ptr))
> >>                      return cpu;
> >>
> >> -            if (!is_rd_overutilized(this_rq()->rd)) {
> >> +            if (sched_energy_enabled()) {
> >>                      new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >>                      if (new_cpu >= 0)
> >>                              return new_cpu;
> >
> > Super quick testing on pixel6:
> > for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
> > with patch 5/5 only:
> > Time: 19.433
> > Time: 19.657
> > Time: 19.851
> > Time: 19.789
> > Time: 19.857
> > Time: 20.092
> > Time: 19.973
> >
> > mainline:
> > Time: 18.836
> > Time: 18.718
> > Time: 18.781
> > Time: 19.015
> > Time: 19.061
> > Time: 18.950
> > Time: 19.166
> >
> >
> > The reason we didn't always have this enabled is the belief that
> > this costs us too much performance in scenarios we most need it
> > while at best making subpar EAS decisions anyway (in an
> > overutilized state).
> > I'd be open for questioning that, but why the change of mind?
> > And why is this necessary in your series if the EAS selection
> > isn't 'final' (until the next sleep) anymore (Patch 5/5)?
> >
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Pierre Gondois 1 year, 2 months ago
Hello Vincent,

Sorry for the delay:

On 9/25/24 15:28, Vincent Guittot wrote:
> On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Hello Vincent,
>> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
>> with patches 1-4/5 using these workloads:
>> ---
>> A.
>> a. 8 tasks at 2%/5%/10% during 1s
>> b. 1 task:
>>      - sleeping during 0.3s
>>      - at 100% during 0.3s
>>      - sleeping during 0.3s
>>
>> b. is used to reach the overutilized state during a limited amount of time.
>> EAS is then operating, then the load balancer does the task placement, then EAS
>> is operating again.
>>
>> B.
>> a. 8 tasks at 2%/5%/10% during 1s
>> b. 1 task:
>>      - at 100% during 1s
>>
>> ---
>> I'm seeing the energy consumption increase in some cases. This seems to be
>> due to feec() migrating tasks more often than what the load balancer does
>> for this workload. This leads to utilization 'spikes' and then frequency
>> 'spikes', increasing the overall energy consumption.
>> This is not entirely related to this patch though,, as the task placement seems
>> correct. I.e. feec() effectively does an optimal placement given the EM and
>> task utilization. The task placement is just a bit less stable than with
>> the load balancer.
> 
> Would patch 5 help to keep things better placed ? in particular if
> task b is misplaced at some point because of load balance ?

I assume so, it would require more testing on my side

> 
> I agree that load balance might still contribute to migrate task in a
> not energy efficient way
> 
>>
>> ---
>> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
>> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
>> each little/mid/big CPUs (without the config, these group would no exist).
> 
> Why did you do this ? All cpus are expected to be in same sched domain
> as they share their LLC

I did this to observe the load balancer a bit more carefully while reviewing
the first patch:
   sched/fair: Filter false overloaded_group case for EAS
I've left this configuration in, but effectively this should not bring anything more.


> 
>>
>> I see an important regression in the result.
>> I replaced the condition to run feec() through select_task_rq_fair() by:
>>     if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0)) {
> 
> overloaded is enable when more than 1 task runs on a cpu whatever the
> utilization

Yes right, this idea makes little sense.

> 
>>       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>       ...
>>     }
>> and obtained better results.
>>
>> Indeed, for such intensive workload:
>> - EAS would not have any energy benefit, so better prioritize performance
>>     (as Christian mentioned)
>> - EAS would not be able to find a fitting CPU, so running feec() should be
>>     avoided
>> - as you mention in the commit message, shuffling tasks when one CPU becomes
>>     momentarily overutilized is inefficient energy-wise (even though I don't have
>>     the numbers, it should make sense).
>> So detecting when the system is overloaded should be a better compromise I
>> assume. The condition in sched_balance_find_src_group() to let the load balancer
>> operate might also need to be updated.
>>
>> Note:
>> - base: with patches 1-4/5
>> - _ou: run feec() when not overutilized
>> - _ol: run feec() when not overloaded
>> - mean: hackbench execution time in s.
>> - delta: negative is better. Value is in percentage.
> 
> Could you share your command line ? As explained in the cover letter I
> have seen some perf regressions but not in the range that you have
> below
> 
> What is your base ? tip/sched/core ?

I am working on a Pixel6, with a branch based on v6.8 with some scheduler patches
to be able to apply your patches cleanly.

The mapping id -> command line is as:
(1) hackbench -l 5120 -g 1
(2) hackbench -l 1280 -g 4
(3) hackbench -l 640  -g 8
(4) hackbench -l 320  -g 16

(5) hackbench -p -l 5120 -g 1
(6) hackbench -p -l 1280 -g 4
(7) hackbench -p -l 640  -g 8
(8) hackbench -p -l 320  -g 16

(9) hackbench -T -l 5120 -g 1
(10) hackbench -T -l 1280 -g 4
(11) hackbench -T -l 640  -g 8
(12) hackbench -T -l 320  -g 16

(13) hackbench -T -p -l 5120 -g 1
(14) hackbench -T -p -l 1280 -g 4
(15) hackbench -T -p -l 640  -g 8
(16) hackbench -T -p -l 320  -g 16


> 
>> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
>> │ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
>> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
>> │ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │
>> │ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
>> │ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
>> │ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
>> │ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
>> │ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
>> │ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
>> │ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
>> │ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
>> │ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
>> │ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
>> │ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
>> │ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
>> │ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
>> │ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
>> │ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
>> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
>>
>> On 9/17/24 22:24, Christian Loehle wrote:
>>> On 8/30/24 14:03, Vincent Guittot wrote:
>>>> Keep looking for an energy efficient CPU even when the system is
>>>> overutilized and use the CPU returned by feec() if it has been able to find
>>>> one. Otherwise fallback to the default performance and spread mode of the
>>>> scheduler.
>>>> A system can become overutilized for a short time when workers of a
>>>> workqueue wake up for a short background work like vmstat update.
>>>> Continuing to look for a energy efficient CPU will prevent to break the
>>>> power packing of tasks.
>>>>
>>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>>> ---
>>>>    kernel/sched/fair.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 2273eecf6086..e46af2416159 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>>>                   cpumask_test_cpu(cpu, p->cpus_ptr))
>>>>                       return cpu;
>>>>
>>>> -            if (!is_rd_overutilized(this_rq()->rd)) {
>>>> +            if (sched_energy_enabled()) {
>>>>                       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>>>                       if (new_cpu >= 0)
>>>>                               return new_cpu;
>>>
>>> Super quick testing on pixel6:
>>> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
>>> with patch 5/5 only:
>>> Time: 19.433
>>> Time: 19.657
>>> Time: 19.851
>>> Time: 19.789
>>> Time: 19.857
>>> Time: 20.092
>>> Time: 19.973
>>>
>>> mainline:
>>> Time: 18.836
>>> Time: 18.718
>>> Time: 18.781
>>> Time: 19.015
>>> Time: 19.061
>>> Time: 18.950
>>> Time: 19.166
>>>
>>>
>>> The reason we didn't always have this enabled is the belief that
>>> this costs us too much performance in scenarios we most need it
>>> while at best making subpar EAS decisions anyway (in an
>>> overutilized state).
>>> I'd be open for questioning that, but why the change of mind?
>>> And why is this necessary in your series if the EAS selection
>>> isn't 'final' (until the next sleep) anymore (Patch 5/5)?
>>>
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Vincent Guittot 1 year, 2 months ago
Hi Pierre,

On Mon, 7 Oct 2024 at 09:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
>
> Sorry for the delay:
>
> On 9/25/24 15:28, Vincent Guittot wrote:
> > On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@arm.com> wrote:
> >>
> >> Hello Vincent,
> >> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
> >> with patches 1-4/5 using these workloads:
> >> ---
> >> A.
> >> a. 8 tasks at 2%/5%/10% during 1s
> >> b. 1 task:
> >>      - sleeping during 0.3s
> >>      - at 100% during 0.3s
> >>      - sleeping during 0.3s
> >>
> >> b. is used to reach the overutilized state during a limited amount of time.
> >> EAS is then operating, then the load balancer does the task placement, then EAS
> >> is operating again.
> >>
> >> B.
> >> a. 8 tasks at 2%/5%/10% during 1s
> >> b. 1 task:
> >>      - at 100% during 1s
> >>
> >> ---
> >> I'm seeing the energy consumption increase in some cases. This seems to be
> >> due to feec() migrating tasks more often than what the load balancer does
> >> for this workload. This leads to utilization 'spikes' and then frequency
> >> 'spikes', increasing the overall energy consumption.
> >> This is not entirely related to this patch though,, as the task placement seems
> >> correct. I.e. feec() effectively does an optimal placement given the EM and
> >> task utilization. The task placement is just a bit less stable than with
> >> the load balancer.
> >
> > Would patch 5 help to keep things better placed ? in particular if
> > task b is misplaced at some point because of load balance ?
>
> I assume so, it would require more testing on my side
>
> >
> > I agree that load balance might still contribute to migrate task in a
> > not energy efficient way
> >
> >>
> >> ---
> >> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
> >> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
> >> each little/mid/big CPUs (without the config, these group would no exist).
> >
> > Why did you do this ? All cpus are expected to be in same sched domain
> > as they share their LLC
>
> I did this to observe the loa balancer a bit more carefully while reviewing
> the first patch:
>    sched/fair: Filter false overloaded_group case for EAS
> I've let this configuration, but effectively this should not bring anything more.
>
>
> >
> >>
> >> I see an important regression in the result.
> >> I replaced the condition to run feec() through select_task_rq_fair() by:
> >>     if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0)) {
> >
> > overloaded is enable when more than 1 task runs on a cpu whatever the
> > utilization
>
> Yes right, this idea has little sense.
>
> >
> >>       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >>       ...
> >>     }
> >> and obtained better results.
> >>
> >> Indeed, for such intensive workload:
> >> - EAS would not have any energy benefit, so better prioritize performance
> >>     (as Christian mentioned)
> >> - EAS would not be able to find a fitting CPU, so running feec() should be
> >>     avoided
> >> - as you mention in the commit message, shuffling tasks when one CPU becomes
> >>     momentarily overutilized is inefficient energy-wise (even though I don't have
> >>     the numbers, it should make sense).
> >> So detecting when the system is overloaded should be a better compromise I
> >> assume. The condition in sched_balance_find_src_group() to let the load balancer
> >> operate might also need to be updated.
> >>
> >> Note:
> >> - base: with patches 1-4/5
> >> - _ou: run feec() when not overutilized
> >> - _ol: run feec() when not overloaded
> >> - mean: hackbench execution time in s.
> >> - delta: negative is better. Value is in percentage.
> >
> > Could you share your command line ? As explained in the cover letter I
> > have seen some perf regressions but not in the range that you have
> > below
> >
> > What is your base ? tip/sched/core ?
>
> I am working on a Pixel6, with a branch based on v6.8 with some scheduler patches
> to be able to apply your patches cleanly.

TBH, I'm always cautious with this kind of frankenstein kernel,
especially with all the changes that have happened in the scheduler since
v6.8 compared to tip/sched/core.

>
> The mapping id -> command line is as:

Thanks for the command details, I'm going to have a look.

> (1) hackbench -l 5120 -g 1
> (2) hackbench -l 1280 -g 4
> (3) hackbench -l 640  -g 8
> (4) hackbench -l 320  -g 16
>
> (5) hackbench -p -l 5120 -g 1
> (6) hackbench -p -l 1280 -g 4
> (7) hackbench -p -l 640  -g 8
> (8) hackbench -p -l 320  -g 16
>
> (9) hackbench -T -l 5120 -g 1
> (10) hackbench -T -l 1280 -g 4
> (11) hackbench -T -l 640  -g 8
> (12) hackbench -T -l 320  -g 16
>
> (13) hackbench -T -p -l 5120 -g 1
> (14) hackbench -T -p -l 1280 -g 4
> (15) hackbench -T -p -l 640  -g 8
> (16) hackbench -T -p -l 320  -g 16
>
>
> >
> >> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
> >> │ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
> >> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
> >> │ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │

I might have misunderstood your results above last time:
mean_base results include patches 1 to 4 and mean_ou reverts patch 4.
Does it mean that it is 55% better with patch 4? I originally thought
there was a regression with patch 4 but I'm not sure that I understood
correctly after re-reading the table.

> >> │ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
> >> │ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
> >> │ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
> >> │ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
> >> │ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
> >> │ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
> >> │ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
> >> │ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
> >> │ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
> >> │ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
> >> │ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
> >> │ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
> >> │ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
> >> │ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
> >> │ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
> >> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
> >>
> >> On 9/17/24 22:24, Christian Loehle wrote:
> >>> On 8/30/24 14:03, Vincent Guittot wrote:
> >>>> Keep looking for an energy efficient CPU even when the system is
> >>>> overutilized and use the CPU returned by feec() if it has been able to find
> >>>> one. Otherwise fallback to the default performance and spread mode of the
> >>>> scheduler.
> >>>> A system can become overutilized for a short time when workers of a
> >>>> workqueue wake up for a short background work like vmstat update.
> >>>> Continuing to look for a energy efficient CPU will prevent to break the
> >>>> power packing of tasks.
> >>>>
> >>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >>>> ---
> >>>>    kernel/sched/fair.c | 2 +-
> >>>>    1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index 2273eecf6086..e46af2416159 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >>>>                   cpumask_test_cpu(cpu, p->cpus_ptr))
> >>>>                       return cpu;
> >>>>
> >>>> -            if (!is_rd_overutilized(this_rq()->rd)) {
> >>>> +            if (sched_energy_enabled()) {
> >>>>                       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >>>>                       if (new_cpu >= 0)
> >>>>                               return new_cpu;
> >>>
> >>> Super quick testing on pixel6:
> >>> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
> >>> with patch 5/5 only:
> >>> Time: 19.433
> >>> Time: 19.657
> >>> Time: 19.851
> >>> Time: 19.789
> >>> Time: 19.857
> >>> Time: 20.092
> >>> Time: 19.973
> >>>
> >>> mainline:
> >>> Time: 18.836
> >>> Time: 18.718
> >>> Time: 18.781
> >>> Time: 19.015
> >>> Time: 19.061
> >>> Time: 18.950
> >>> Time: 19.166
> >>>
> >>>
> >>> The reason we didn't always have this enabled is the belief that
> >>> this costs us too much performance in scenarios we most need it
> >>> while at best making subpar EAS decisions anyway (in an
> >>> overutilized state).
> >>> I'd be open for questioning that, but why the change of mind?
> >>> And why is this necessary in your series if the EAS selection
> >>> isn't 'final' (until the next sleep) anymore (Patch 5/5)?
> >>>
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Pierre Gondois 1 year, 2 months ago
Hello Vincent,

On 10/9/24 10:53, Vincent Guittot wrote:
> Hi Pierre,
> 
> On Mon, 7 Oct 2024 at 09:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Hello Vincent,
>>
>> Sorry for the delay:
>>
>> On 9/25/24 15:28, Vincent Guittot wrote:
>>> On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>>>
>>>> Hello Vincent,
>>>> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
>>>> with patches 1-4/5 using these workloads:
>>>> ---
>>>> A.
>>>> a. 8 tasks at 2%/5%/10% during 1s
>>>> b. 1 task:
>>>>       - sleeping during 0.3s
>>>>       - at 100% during 0.3s
>>>>       - sleeping during 0.3s
>>>>
>>>> b. is used to reach the overutilized state during a limited amount of time.
>>>> EAS is then operating, then the load balancer does the task placement, then EAS
>>>> is operating again.
>>>>
>>>> B.
>>>> a. 8 tasks at 2%/5%/10% during 1s
>>>> b. 1 task:
>>>>       - at 100% during 1s
>>>>
>>>> ---
>>>> I'm seeing the energy consumption increase in some cases. This seems to be
>>>> due to feec() migrating tasks more often than what the load balancer does
>>>> for this workload. This leads to utilization 'spikes' and then frequency
>>>> 'spikes', increasing the overall energy consumption.
>>>> This is not entirely related to this patch though,, as the task placement seems
>>>> correct. I.e. feec() effectively does an optimal placement given the EM and
>>>> task utilization. The task placement is just a bit less stable than with
>>>> the load balancer.
>>>
>>> Would patch 5 help to keep things better placed ? in particular if
>>> task b is misplaced at some point because of load balance ?
>>
>> I assume so, it would require more testing on my side
>>
>>>
>>> I agree that load balance might still contribute to migrate task in a
>>> not energy efficient way
>>>
>>>>
>>>> ---
>>>> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
>>>> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
>>>> each little/mid/big CPUs (without the config, these group would no exist).
>>>
>>> Why did you do this ? All cpus are expected to be in same sched domain
>>> as they share their LLC
>>
>> I did this to observe the loa balancer a bit more carefully while reviewing
>> the first patch:
>>     sched/fair: Filter false overloaded_group case for EAS
>> I've let this configuration, but effectively this should not bring anything more.
>>
>>
>>>
>>>>
>>>> I see an important regression in the result.
>>>> I replaced the condition to run feec() through select_task_rq_fair() by:
>>>>      if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0)) {
>>>
>>> overloaded is enable when more than 1 task runs on a cpu whatever the
>>> utilization
>>
>> Yes right, this idea has little sense.
>>
>>>
>>>>        new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>>>        ...
>>>>      }
>>>> and obtained better results.
>>>>
>>>> Indeed, for such intensive workload:
>>>> - EAS would not have any energy benefit, so better prioritize performance
>>>>      (as Christian mentioned)
>>>> - EAS would not be able to find a fitting CPU, so running feec() should be
>>>>      avoided
>>>> - as you mention in the commit message, shuffling tasks when one CPU becomes
>>>>      momentarily overutilized is inefficient energy-wise (even though I don't have
>>>>      the numbers, it should make sense).
>>>> So detecting when the system is overloaded should be a better compromise I
>>>> assume. The condition in sched_balance_find_src_group() to let the load balancer
>>>> operate might also need to be updated.
>>>>
>>>> Note:
>>>> - base: with patches 1-4/5
>>>> - _ou: run feec() when not overutilized
>>>> - _ol: run feec() when not overloaded
>>>> - mean: hackbench execution time in s.
>>>> - delta: negative is better. Value is in percentage.
>>>
>>> Could you share your command line ? As explained in the cover letter I
>>> have seen some perf regressions but not in the range that you have
>>> below
>>>
>>> What is your base ? tip/sched/core ?
>>
>> I am working on a Pixel6, with a branch based on v6.8 with some scheduler patches
>> to be able to apply your patches cleanly.
> 
> TBH, I'm always cautious with those kind of frankenstein kernel
> especially with all changes that have happened on the scheduler since
> v6.8 compared to tip/sched/core

Yes I understand, I'll re-test it on a Juno with a newer kernel.

> 
>>
>> The mapping id -> command line is as:
> 
> Thanks for the commands details, I'm going to have a look
> 
>> (1) hackbench -l 5120 -g 1
>> (2) hackbench -l 1280 -g 4
>> (3) hackbench -l 640  -g 8
>> (4) hackbench -l 320  -g 16
>>
>> (5) hackbench -p -l 5120 -g 1
>> (6) hackbench -p -l 1280 -g 4
>> (7) hackbench -p -l 640  -g 8
>> (8) hackbench -p -l 320  -g 16
>>
>> (9) hackbench -T -l 5120 -g 1
>> (10) hackbench -T -l 1280 -g 4
>> (11) hackbench -T -l 640  -g 8
>> (12) hackbench -T -l 320  -g 16
>>
>> (13) hackbench -T -p -l 5120 -g 1
>> (14) hackbench -T -p -l 1280 -g 4
>> (15) hackbench -T -p -l 640  -g 8
>> (16) hackbench -T -p -l 320  -g 16
>>
>>
>>>
>>>> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
>>>> │ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
>>>> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
>>>> │ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │
> 
> I might have misunderstood your results above last time.
> mean_base results include patches 1 to 4 and  mean_ou revert patch 4.
> Does it mean that it is 55% better with patch 4 ? I originally thought
> there was a regression with patch 4 but I'm not sure that I understood
> correctly after re reading the table.

The columns are:
- the _base configuration disables EAS/feec() when in the overutilized state,
   i.e. patches 1-3 are applied.
- the _ou configuration keeps running EAS/feec() when in the overutilized state
   i.e. patches 1-4 are applied
- the _ol configuration should be ignored as previously established
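
To make the three configurations concrete, the wake-up condition around
feec() in select_task_rq_fair() roughly looks as follows in each case
(just a sketch of the gating condition, not the exact hunks):

	/* _base (patches 1-3): skip feec() once the root domain is overutilized */
	if (!is_rd_overutilized(this_rq()->rd))
		new_cpu = find_energy_efficient_cpu(p, prev_cpu);

	/* _ou (patches 1-4): try feec() whenever EAS is enabled */
	if (sched_energy_enabled())
		new_cpu = find_energy_efficient_cpu(p, prev_cpu);

	/* _ol (to be ignored): skip feec() only when the root domain is overloaded */
	if (!get_rd_overloaded(cpu_rq(cpu)->rd))
		new_cpu = find_energy_efficient_cpu(p, prev_cpu);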


> 
>>>> │ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
>>>> │ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
>>>> │ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
>>>> │ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
>>>> │ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
>>>> │ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
>>>> │ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
>>>> │ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
>>>> │ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
>>>> │ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
>>>> │ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
>>>> │ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
>>>> │ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
>>>> │ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
>>>> │ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
>>>> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
>>>>
>>>> On 9/17/24 22:24, Christian Loehle wrote:
>>>>> On 8/30/24 14:03, Vincent Guittot wrote:
>>>>>> Keep looking for an energy efficient CPU even when the system is
>>>>>> overutilized and use the CPU returned by feec() if it has been able to find
>>>>>> one. Otherwise fallback to the default performance and spread mode of the
>>>>>> scheduler.
>>>>>> A system can become overutilized for a short time when workers of a
>>>>>> workqueue wake up for a short background work like vmstat update.
>>>>>> Continuing to look for a energy efficient CPU will prevent to break the
>>>>>> power packing of tasks.
>>>>>>
>>>>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>>>>> ---
>>>>>>     kernel/sched/fair.c | 2 +-
>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index 2273eecf6086..e46af2416159 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>>>>>                    cpumask_test_cpu(cpu, p->cpus_ptr))
>>>>>>                        return cpu;
>>>>>>
>>>>>> -            if (!is_rd_overutilized(this_rq()->rd)) {
>>>>>> +            if (sched_energy_enabled()) {
>>>>>>                        new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>>>>>                        if (new_cpu >= 0)
>>>>>>                                return new_cpu;
>>>>>
>>>>> Super quick testing on pixel6:
>>>>> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
>>>>> with patch 5/5 only:
>>>>> Time: 19.433
>>>>> Time: 19.657
>>>>> Time: 19.851
>>>>> Time: 19.789
>>>>> Time: 19.857
>>>>> Time: 20.092
>>>>> Time: 19.973
>>>>>
>>>>> mainline:
>>>>> Time: 18.836
>>>>> Time: 18.718
>>>>> Time: 18.781
>>>>> Time: 19.015
>>>>> Time: 19.061
>>>>> Time: 18.950
>>>>> Time: 19.166
>>>>>
>>>>>
>>>>> The reason we didn't always have this enabled is the belief that
>>>>> this costs us too much performance in scenarios we most need it
>>>>> while at best making subpar EAS decisions anyway (in an
>>>>> overutilized state).
>>>>> I'd be open for questioning that, but why the change of mind?
>>>>> And why is this necessary in your series if the EAS selection
>>>>> isn't 'final' (until the next sleep) anymore (Patch 5/5)?
>>>>>
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Vincent Guittot 1 year, 2 months ago
On Fri, 11 Oct 2024 at 14:52, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
>
> On 10/9/24 10:53, Vincent Guittot wrote:
> > Hi Pierre,
> >
> > On Mon, 7 Oct 2024 at 09:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
> >>
> >> Hello Vincent,
> >>
> >> Sorry for the delay:
> >>
> >> On 9/25/24 15:28, Vincent Guittot wrote:
> >>> On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@arm.com> wrote:
> >>>>
> >>>> Hello Vincent,
> >>>> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
> >>>> with patches 1-4/5 using these workloads:
> >>>> ---
> >>>> A.
> >>>> a. 8 tasks at 2%/5%/10% during 1s
> >>>> b. 1 task:
> >>>>       - sleeping during 0.3s
> >>>>       - at 100% during 0.3s
> >>>>       - sleeping during 0.3s
> >>>>
> >>>> b. is used to reach the overutilized state during a limited amount of time.
> >>>> EAS is then operating, then the load balancer does the task placement, then EAS
> >>>> is operating again.
> >>>>
> >>>> B.
> >>>> a. 8 tasks at 2%/5%/10% during 1s
> >>>> b. 1 task:
> >>>>       - at 100% during 1s
> >>>>
> >>>> ---
> >>>> I'm seeing the energy consumption increase in some cases. This seems to be
> >>>> due to feec() migrating tasks more often than what the load balancer does
> >>>> for this workload. This leads to utilization 'spikes' and then frequency
> >>>> 'spikes', increasing the overall energy consumption.
> >>>> This is not entirely related to this patch though,, as the task placement seems
> >>>> correct. I.e. feec() effectively does an optimal placement given the EM and
> >>>> task utilization. The task placement is just a bit less stable than with
> >>>> the load balancer.
> >>>
> >>> Would patch 5 help to keep things better placed ? in particular if
> >>> task b is misplaced at some point because of load balance ?
> >>
> >> I assume so, it would require more testing on my side
> >>
> >>>
> >>> I agree that load balance might still contribute to migrate task in a
> >>> not energy efficient way
> >>>
> >>>>
> >>>> ---
> >>>> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
> >>>> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
> >>>> each little/mid/big CPUs (without the config, these group would no exist).
> >>>
> >>> Why did you do this ? All cpus are expected to be in same sched domain
> >>> as they share their LLC
> >>
> >> I did this to observe the loa balancer a bit more carefully while reviewing
> >> the first patch:
> >>     sched/fair: Filter false overloaded_group case for EAS
> >> I've let this configuration, but effectively this should not bring anything more.
> >>
> >>
> >>>
> >>>>
> >>>> I see an important regression in the result.
> >>>> I replaced the condition to run feec() through select_task_rq_fair() by:
> >>>>      if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0)) {
> >>>
> >>> overloaded is enable when more than 1 task runs on a cpu whatever the
> >>> utilization
> >>
> >> Yes right, this idea has little sense.
> >>
> >>>
> >>>>        new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >>>>        ...
> >>>>      }
> >>>> and obtained better results.
> >>>>
> >>>> Indeed, for such intensive workload:
> >>>> - EAS would not have any energy benefit, so better prioritize performance
> >>>>      (as Christian mentioned)
> >>>> - EAS would not be able to find a fitting CPU, so running feec() should be
> >>>>      avoided
> >>>> - as you mention in the commit message, shuffling tasks when one CPU becomes
> >>>>      momentarily overutilized is inefficient energy-wise (even though I don't have
> >>>>      the numbers, it should make sense).
> >>>> So detecting when the system is overloaded should be a better compromise I
> >>>> assume. The condition in sched_balance_find_src_group() to let the load balancer
> >>>> operate might also need to be updated.
> >>>>
> >>>> Note:
> >>>> - base: with patches 1-4/5
> >>>> - _ou: run feec() when not overutilized
> >>>> - _ol: run feec() when not overloaded
> >>>> - mean: hackbench execution time in s.
> >>>> - delta: negative is better. Value is in percentage.
> >>>
> >>> Could you share your command line ? As explained in the cover letter I
> >>> have seen some perf regressions but not in the range that you have
> >>> below
> >>>
> >>> What is your base ? tip/sched/core ?
> >>
> >> I am working on a Pixel6, with a branch based on v6.8 with some scheduler patches
> >> to be able to apply your patches cleanly.
> >
> > TBH, I'm always cautious with those kind of frankenstein kernel
> > especially with all changes that have happened on the scheduler since
> > v6.8 compared to tip/sched/core
>
> Yes I understand, I'll re-test it on a Juno with a newer kernel.
>
> >
> >>
> >> The mapping id -> command line is as:
> >
> > Thanks for the commands details, I'm going to have a look
> >
> >> (1) hackbench -l 5120 -g 1
> >> (2) hackbench -l 1280 -g 4
> >> (3) hackbench -l 640  -g 8
> >> (4) hackbench -l 320  -g 16
> >>
> >> (5) hackbench -p -l 5120 -g 1
> >> (6) hackbench -p -l 1280 -g 4
> >> (7) hackbench -p -l 640  -g 8
> >> (8) hackbench -p -l 320  -g 16
> >>
> >> (9) hackbench -T -l 5120 -g 1
> >> (10) hackbench -T -l 1280 -g 4
> >> (11) hackbench -T -l 640  -g 8
> >> (12) hackbench -T -l 320  -g 16
> >>
> >> (13) hackbench -T -p -l 5120 -g 1
> >> (14) hackbench -T -p -l 1280 -g 4
> >> (15) hackbench -T -p -l 640  -g 8
> >> (16) hackbench -T -p -l 320  -g 16
> >>
> >>
> >>>
> >>>> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
> >>>> │ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
> >>>> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
> >>>> │ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │
> >
> > I might have misunderstood your results above last time.
> > mean_base results include patches 1 to 4 and  mean_ou revert patch 4.
> > Does it mean that it is 55% better with patch 4 ? I originally thought
> > there was a regression with patch 4 but I'm not sure that I understood
> > correctly after re reading the table.
>
> The columns are:
> - the _base configuration disables EAS/feec() when in the overutilized state,
>    i.e. patches 1-3 are applied.

your original description
"
 - base: with patches 1-4/5
 - _ou: run feec() when not overutilized
 - _ol: run feec() when not overloaded
"
was quite confusing :-)

Thanks for the clarification

> - the _ou configuration keeps running EAS/feec() when in the overutilized state
>    i.e. patches 1-4 are applied
> - the _ol configuration should be ignored as previously established
>
>
> >
> >>>> │ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
> >>>> │ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
> >>>> │ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
> >>>> │ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
> >>>> │ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
> >>>> │ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
> >>>> │ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
> >>>> │ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
> >>>> │ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
> >>>> │ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
> >>>> │ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
> >>>> │ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
> >>>> │ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
> >>>> │ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
> >>>> │ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
> >>>> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
> >>>>
> >>>> On 9/17/24 22:24, Christian Loehle wrote:
> >>>>> On 8/30/24 14:03, Vincent Guittot wrote:
> >>>>>> Keep looking for an energy efficient CPU even when the system is
> >>>>>> overutilized and use the CPU returned by feec() if it has been able to find
> >>>>>> one. Otherwise fallback to the default performance and spread mode of the
> >>>>>> scheduler.
> >>>>>> A system can become overutilized for a short time when workers of a
> >>>>>> workqueue wake up for a short background work like vmstat update.
> >>>>>> Continuing to look for a energy efficient CPU will prevent to break the
> >>>>>> power packing of tasks.
> >>>>>>
> >>>>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >>>>>> ---
> >>>>>>     kernel/sched/fair.c | 2 +-
> >>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>>>> index 2273eecf6086..e46af2416159 100644
> >>>>>> --- a/kernel/sched/fair.c
> >>>>>> +++ b/kernel/sched/fair.c
> >>>>>> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >>>>>>                    cpumask_test_cpu(cpu, p->cpus_ptr))
> >>>>>>                        return cpu;
> >>>>>>
> >>>>>> -            if (!is_rd_overutilized(this_rq()->rd)) {
> >>>>>> +            if (sched_energy_enabled()) {
> >>>>>>                        new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >>>>>>                        if (new_cpu >= 0)
> >>>>>>                                return new_cpu;
> >>>>>
> >>>>> Super quick testing on pixel6:
> >>>>> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
> >>>>> with patch 5/5 only:
> >>>>> Time: 19.433
> >>>>> Time: 19.657
> >>>>> Time: 19.851
> >>>>> Time: 19.789
> >>>>> Time: 19.857
> >>>>> Time: 20.092
> >>>>> Time: 19.973
> >>>>>
> >>>>> mainline:
> >>>>> Time: 18.836
> >>>>> Time: 18.718
> >>>>> Time: 18.781
> >>>>> Time: 19.015
> >>>>> Time: 19.061
> >>>>> Time: 18.950
> >>>>> Time: 19.166
> >>>>>
> >>>>>
> >>>>> The reason we didn't always have this enabled is the belief that
> >>>>> this costs us too much performance in scenarios we most need it
> >>>>> while at best making subpar EAS decisions anyway (in an
> >>>>> overutilized state).
> >>>>> I'd be open for questioning that, but why the change of mind?
> >>>>> And why is this necessary in your series if the EAS selection
> >>>>> isn't 'final' (until the next sleep) anymore (Patch 5/5)?
> >>>>>
Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Posted by Pierre Gondois 1 year, 1 month ago

On 10/15/24 14:47, Vincent Guittot wrote:
> On Fri, 11 Oct 2024 at 14:52, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Hello Vincent,
>>
>> On 10/9/24 10:53, Vincent Guittot wrote:
>>> Hi Pierre,
>>>
>>> On Mon, 7 Oct 2024 at 09:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>>>
>>>> Hello Vincent,
>>>>
>>>> Sorry for the delay:
>>>>
>>>> On 9/25/24 15:28, Vincent Guittot wrote:
>>>>> On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>>>>>
>>>>>> Hello Vincent,
>>>>>> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
>>>>>> with patches 1-4/5 using these workloads:
>>>>>> ---
>>>>>> A.
>>>>>> a. 8 tasks at 2%/5%/10% during 1s
>>>>>> b. 1 task:
>>>>>>        - sleeping during 0.3s
>>>>>>        - at 100% during 0.3s
>>>>>>        - sleeping during 0.3s
>>>>>>
>>>>>> b. is used to reach the overutilized state during a limited amount of time.
>>>>>> EAS is then operating, then the load balancer does the task placement, then EAS
>>>>>> is operating again.
>>>>>>
>>>>>> B.
>>>>>> a. 8 tasks at 2%/5%/10% during 1s
>>>>>> b. 1 task:
>>>>>>        - at 100% during 1s
>>>>>>
>>>>>> ---
>>>>>> I'm seeing the energy consumption increase in some cases. This seems to be
>>>>>> due to feec() migrating tasks more often than what the load balancer does
>>>>>> for this workload. This leads to utilization 'spikes' and then frequency
>>>>>> 'spikes', increasing the overall energy consumption.
>>>>>> This is not entirely related to this patch though,, as the task placement seems
>>>>>> correct. I.e. feec() effectively does an optimal placement given the EM and
>>>>>> task utilization. The task placement is just a bit less stable than with
>>>>>> the load balancer.
>>>>>
>>>>> Would patch 5 help to keep things better placed ? in particular if
>>>>> task b is misplaced at some point because of load balance ?
>>>>
>>>> I assume so, it would require more testing on my side
>>>>
>>>>>
>>>>> I agree that load balance might still contribute to migrate task in a
>>>>> not energy efficient way
>>>>>
>>>>>>
>>>>>> ---
>>>>>> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
>>>>>> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
>>>>>> each little/mid/big CPUs (without the config, these group would no exist).
>>>>>
>>>>> Why did you do this ? All cpus are expected to be in same sched domain
>>>>> as they share their LLC
>>>>
>>>> I did this to observe the loa balancer a bit more carefully while reviewing
>>>> the first patch:
>>>>      sched/fair: Filter false overloaded_group case for EAS
>>>> I've let this configuration, but effectively this should not bring anything more.
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> I see an important regression in the result.
>>>>>> I replaced the condition to run feec() through select_task_rq_fair() by:
>>>>>>       if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0)) {
>>>>>
>>>>> overloaded is enable when more than 1 task runs on a cpu whatever the
>>>>> utilization
>>>>
>>>> Yes right, this idea has little sense.
>>>>
>>>>>
>>>>>>         new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>>>>>         ...
>>>>>>       }
>>>>>> and obtained better results.
>>>>>>
>>>>>> Indeed, for such intensive workload:
>>>>>> - EAS would not have any energy benefit, so better prioritize performance
>>>>>>       (as Christian mentioned)
>>>>>> - EAS would not be able to find a fitting CPU, so running feec() should be
>>>>>>       avoided
>>>>>> - as you mention in the commit message, shuffling tasks when one CPU becomes
>>>>>>       momentarily overutilized is inefficient energy-wise (even though I don't have
>>>>>>       the numbers, it should make sense).
>>>>>> So detecting when the system is overloaded should be a better compromise I
>>>>>> assume. The condition in sched_balance_find_src_group() to let the load balancer
>>>>>> operate might also need to be updated.
>>>>>>
>>>>>> Note:
>>>>>> - base: with patches 1-4/5
>>>>>> - _ou: run feec() when not overutilized
>>>>>> - _ol: run feec() when not overloaded
>>>>>> - mean: hackbench execution time in s.
>>>>>> - delta: negative is better. Value is in percentage.
>>>>>
>>>>> Could you share your command line ? As explained in the cover letter I
>>>>> have seen some perf regressions but not in the range that you have
>>>>> below
>>>>>
>>>>> What is your base ? tip/sched/core ?
>>>>
>>>> I am working on a Pixel6, with a branch based on v6.8 with some scheduler patches
>>>> to be able to apply your patches cleanly.
>>>
>>> TBH, I'm always cautious with those kind of frankenstein kernel
>>> especially with all changes that have happened on the scheduler since
>>> v6.8 compared to tip/sched/core
>>
>> Yes I understand, I'll re-test it on a Juno with a newer kernel.

For the record, I ran the same tests, still on a Pixel6 and supposedly with
the same setup. The results indeed show a smaller regression than the ones
I shared initially. I assume I didn't cool the Pixel6 enough in the first
experiment ...

The suffix '_w' marks the results with this patch applied, i.e. with
EAS/feec() still running in the overutilized state.
+---------------------+--------+----------+--------+----------+---------+
|                 cmd |   mean |   std    | mean_w | std_w    |   ratio |
+---------------------+--------+----------+--------+----------+---------+
|       -l 5120 -g 1  | 1.9266 | 0.044848 | 2.1028 | 0.06441  |  9.15%  |
|       -l 1280 -g 4  | 1.89   | 0.080833 | 1.9588 | 0.040227 |  3.64%  |
|       -l 640  -g 8  | 1.8882 | 0.069197 | 1.918  | 0.06837  |  1.58%  |
|       -l 320  -g 16 | 1.9324 | 0.011194 | 1.9154 | 0.044998 | -0.88%  |
|    -p -l 5120 -g 1  | 1.4012 | 0.029811 | 1.6178 | 0.04027  | 15.46%  |
|    -p -l 1280 -g 4  | 1.3432 | 0.036949 | 1.5022 | 0.073346 | 11.84%  |
|    -p -l 640  -g 8  | 1.2944 | 0.022143 | 1.4468 | 0.013882 | 11.77%  |
|    -p -l 320  -g 16 | 1.2824 | 0.045873 | 1.3668 | 0.024448 |  6.58%  |
| -T    -l 5120 -g 1  | 1.9198 | 0.054897 | 2.0318 | 0.059222 |  5.83%  |
| -T    -l 1280 -g 4  | 1.8342 | 0.089015 | 1.9572 | 0.007328 |  6.71%  |
| -T    -l 640  -g 8  | 1.8986 | 0.023469 | 1.937  | 0.068044 |  2.02%  |
| -T    -l 320  -g 16 | 1.825  | 0.060634 | 1.9278 | 0.038206 |  5.63%  |
| -T -p -l 5120 -g 1  | 1.4424 | 0.007956 | 1.6474 | 0.035536 | 14.21%  |
| -T -p -l 1280 -g 4  | 1.3796 | 0.029305 | 1.5106 | 0.031533 |  9.50%  |
| -T -p -l 640  -g 8  | 1.3306 | 0.024347 | 1.4662 | 0.064224 | 10.19%  |
| -T -p -l 320  -g 16 | 1.2886 | 0.031437 | 1.389  | 0.033083 |  7.79%  |
+---------------------+--------+----------+--------+----------+---------+
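
(For clarity, the ratio column is (mean_w - mean) / mean * 100, so a positive
value means the run with the patch is slower; e.g. for '-l 5120 -g 1':
(2.1028 - 1.9266) / 1.9266 ~= +9.15%.)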

>>
>>>
>>>>
>>>> The mapping id -> command line is as:
>>>
>>> Thanks for the commands details, I'm going to have a look
>>>
>>>> (1) hackbench -l 5120 -g 1
>>>> (2) hackbench -l 1280 -g 4
>>>> (3) hackbench -l 640  -g 8
>>>> (4) hackbench -l 320  -g 16
>>>>
>>>> (5) hackbench -p -l 5120 -g 1
>>>> (6) hackbench -p -l 1280 -g 4
>>>> (7) hackbench -p -l 640  -g 8
>>>> (8) hackbench -p -l 320  -g 16
>>>>
>>>> (9) hackbench -T -l 5120 -g 1
>>>> (10) hackbench -T -l 1280 -g 4
>>>> (11) hackbench -T -l 640  -g 8
>>>> (12) hackbench -T -l 320  -g 16
>>>>
>>>> (13) hackbench -T -p -l 5120 -g 1
>>>> (14) hackbench -T -p -l 1280 -g 4
>>>> (15) hackbench -T -p -l 640  -g 8
>>>> (16) hackbench -T -p -l 320  -g 16
>>>>
>>>>
>>>>>
>>>>>> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
>>>>>> │ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
>>>>>> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
>>>>>> │ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │
>>>
>>> I might have misunderstood your results above last time.
>>> mean_base results include patches 1 to 4 and  mean_ou revert patch 4.
>>> Does it mean that it is 55% better with patch 4 ? I originally thought
>>> there was a regression with patch 4 but I'm not sure that I understood
>>> correctly after re reading the table.
>>
>> The columns are:
>> - the _base configuration disables EAS/feec() when in the overutilized state,
>>     i.e. patches 1-3 are applied.
> 
> your original description
> "
>   - base: with patches 1-4/5
>   - _ou: run feec() when not overutilized
>   - _ol: run feec() when not overloaded
> "
> was quite confusing :-)
> 
> Thanks for the clarification
> 
>> - the _ou configuration keeps running EAS/feec() when in the overutilized state
>>     i.e. patches 1-4 are applied
>> - the _ol configuration should be ignored as previously established
>>
>>
>>>
>>>>>> │ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
>>>>>> │ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
>>>>>> │ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
>>>>>> │ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
>>>>>> │ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
>>>>>> │ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
>>>>>> │ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
>>>>>> │ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
>>>>>> │ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
>>>>>> │ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
>>>>>> │ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
>>>>>> │ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
>>>>>> │ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
>>>>>> │ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
>>>>>> │ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
>>>>>> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
>>>>>>
>>>>>> On 9/17/24 22:24, Christian Loehle wrote:
>>>>>>> On 8/30/24 14:03, Vincent Guittot wrote:
>>>>>>>> Keep looking for an energy efficient CPU even when the system is
>>>>>>>> overutilized and use the CPU returned by feec() if it has been able to find
>>>>>>>> one. Otherwise fallback to the default performance and spread mode of the
>>>>>>>> scheduler.
>>>>>>>> A system can become overutilized for a short time when workers of a
>>>>>>>> workqueue wake up for a short background work like vmstat update.
>>>>>>>> Continuing to look for a energy efficient CPU will prevent to break the
>>>>>>>> power packing of tasks.
>>>>>>>>
>>>>>>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>>>>>>> ---
>>>>>>>>      kernel/sched/fair.c | 2 +-
>>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>> index 2273eecf6086..e46af2416159 100644
>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>>>>>>>                     cpumask_test_cpu(cpu, p->cpus_ptr))
>>>>>>>>                         return cpu;
>>>>>>>>
>>>>>>>> -            if (!is_rd_overutilized(this_rq()->rd)) {
>>>>>>>> +            if (sched_energy_enabled()) {
>>>>>>>>                         new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>>>>>>>                         if (new_cpu >= 0)
>>>>>>>>                                 return new_cpu;
>>>>>>>
>>>>>>> Super quick testing on pixel6:
>>>>>>> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
>>>>>>> with patch 5/5 only:
>>>>>>> Time: 19.433
>>>>>>> Time: 19.657
>>>>>>> Time: 19.851
>>>>>>> Time: 19.789
>>>>>>> Time: 19.857
>>>>>>> Time: 20.092
>>>>>>> Time: 19.973
>>>>>>>
>>>>>>> mainline:
>>>>>>> Time: 18.836
>>>>>>> Time: 18.718
>>>>>>> Time: 18.781
>>>>>>> Time: 19.015
>>>>>>> Time: 19.061
>>>>>>> Time: 18.950
>>>>>>> Time: 19.166
>>>>>>>
>>>>>>>
>>>>>>> The reason we didn't always have this enabled is the belief that
>>>>>>> this costs us too much performance in scenarios we most need it
>>>>>>> while at best making subpar EAS decisions anyway (in an
>>>>>>> overutilized state).
>>>>>>> I'd be open for questioning that, but why the change of mind?
>>>>>>> And why is this necessary in your series if the EAS selection
>>>>>>> isn't 'final' (until the next sleep) anymore (Patch 5/5)?
>>>>>>>