[PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases
Posted by Vincent Guittot 11 months, 1 week ago
The current Energy Aware Scheduler has some known limitations which have
become more and more visible with features such as uclamp. This series
tries to fix some of those issues:
- tasks stacked on the same CPU of a PD
- tasks stuck on the wrong CPU.

Patch 1 fixes the case where a CPU is wrongly classified as overloaded
whereas it is in fact capped to a lower compute capacity. This wrong
classification can prevent the periodic load balancer from selecting a
group_misfit_task CPU because group_overloaded has a higher priority.
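
For context: the load balancer classifies sched groups with an ordered
group_type and always balances the highest classification first, which
is why a spurious group_overloaded masks a genuine group_misfit_task
group. A simplified sketch of that priority order (abbreviated; the
real enum group_type in kernel/sched/fair.c has more states):

    /*
     * Higher values mean higher priority for the load balancer, so a
     * group wrongly marked overloaded hides a misfit group.
     */
    enum group_type {
            group_has_spare = 0,
            group_fully_busy,
            group_misfit_task,      /* a task overflows the CPU's capacity */
            /* ... */
            group_overloaded,       /* highest priority, wins over misfit */
    };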

Patch 2 creates a new EM interface that will be used by Patch 3.

Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD
whereas other CPUs might be a better choice. feec() looks for the CPU
with the highest spare capacity in a PD, assuming that it will be the
best CPU from an energy efficiency PoV because it will require the
smallest increase of OPP. This is often, but not always, true: this
policy filters out other CPUs which would be just as efficient, for
example because they would use the same OPP but have fewer running
tasks.
In fact, we only care about the cost of the new OPP that will be
selected to handle the waking task. In many cases, several CPUs will
end up selecting the same OPP and, as a result, having the same energy
cost. In such cases, we can use other metrics to select the best CPU
among those with the same energy cost. Patch 3 reworks feec() to look
first for the lowest cost in a PD and then for the most performant CPU
among the candidates. For now, this only tries to evenly spread the
number of runnable tasks across CPUs, but this can be improved with
other metrics, such as the sched slice duration, in a follow-up series.
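
To make the two-stage selection concrete, here is a minimal,
self-contained C sketch of the policy described above. The struct and
helper names are illustrative only, not the kernel's; the real feec()
works on the perf_domain and energy model structures:

    /* Toy model: pick the lowest OPP cost first, then break the tie on
     * the number of runnable tasks to spread them evenly. */
    struct pd_cpu {
            int cpu;
            unsigned long cost;      /* cost of the OPP this CPU would need */
            unsigned int nr_running; /* runnable tasks currently on this CPU */
    };

    static int select_cpu_in_pd(const struct pd_cpu *cpus, int nr)
    {
            int best = 0;

            for (int i = 1; i < nr; i++) {
                    /* 1st criterion: lowest energy cost of the resulting OPP */
                    if (cpus[i].cost < cpus[best].cost)
                            best = i;
                    /* 2nd criterion: same cost, fewer runnable tasks */
                    else if (cpus[i].cost == cpus[best].cost &&
                             cpus[i].nr_running < cpus[best].nr_running)
                            best = i;
            }
            return cpus[best].cpu;
    }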

perf sched pipe on a Dragonboard RB5 has been used to compare the
overhead of the new feec() vs. the current implementation.

9 iterations of perf bench sched pipe -T -l 80000
                ops/sec  stdev 
tip/sched/core  16634    (+/- 0.5%)
+ patches 1-3   17434    (+/- 1.2%)  +4.8%


Patch 4 removes the now unused em_cpu_energy().

Patch 5 solves another problem: a task can be stuck on a CPU forever
because it never sleeps anymore and, as a result, never wakes up and
calls feec(). Such a task can be detected by comparing util_avg or
runnable_avg with the compute capacity of the CPU (see the sketch after
this list). Once detected, we can call feec() to check if there is a
better CPU for the stuck task. The call can be done in 2 places:
- When the task is put back in the runnable list after its running
  slice, with the balance callback mechanism, similarly to the rt/dl
  push callback.
- During the cfs tick when there is only 1 running task stuck on the
  CPU, in which case the balance callback can't be used.
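
A minimal sketch of the detection condition, reusing the kernel's
existing fits_capacity() margin (the exact condition used by
task_stuck_on_cpu() in patch 5 may differ in its details):

    /* A task whose util_avg or runnable_avg no longer fits the CPU's
     * compute capacity never sleeps, so it never goes through feec() at
     * wakeup and must be checked from the tick or balance callback. */
    static bool task_stuck_on_cpu(struct task_struct *p, int cpu)
    {
            unsigned long capacity = capacity_of(cpu);

            return !fits_capacity(p->se.avg.util_avg, capacity) ||
                   !fits_capacity(p->se.avg.runnable_avg, capacity);
    }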

This push callback mechanism with the new feec() algorithm ensures that
tasks always get a chance to migrate to the most suitable CPU and don't
stay stuck on a CPU which is no longer the most suitable one. As
examples:
- A task waking on a big CPU, with a uclamp max preventing it from
  sleeping and waking up, can migrate to a smaller CPU once that is
  more power efficient.
- Tasks are spread across the CPUs of the PD when they target the same
  OPP.

Patch 6 adds the task misfit migration case to the cfs tick and push
callback mechanism to prevent waking up an idle CPU unnecessarily.

Patch 7 removes the need to test uclamp_min in cpu_overutilized() to
trigger the active migration of a task to another CPU.

Compared to v4:
- Fixed check_pushable_task for !SMP

Compared to v3:
- Fixed the empty functions

Compared to v2:
- Renamed the push and tick functions to make it easier to understand
  what they do. Both are kept in the same patch as they solve the same
  problem.
- Created some helper functions
- Fixed some typos and comments
- The task_stuck_on_cpu() condition remains unchanged. Pierre suggested
  taking into account the min capacity of the CPU, but it is not
  directly available right now. It can trigger feec() when uclamp_max
  is very low compared to the min capacity of the CPU, but feec()
  should keep returning the same CPU. This can be handled in a
  follow-up patch.

Compared to v1:
- The call to feec() even when overutilized has been removed from this
  series and will be addressed in a separate series. Only the case of
  uclamp_min has been kept, as it is now handled by the push callback
  and tick mechanism.
- The push mechanism has been cleaned up, fixed and simplified.

This series implements some of the topics discussed at OSPM [1]. Other
topics will be part of another series.

[1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp

Vincent Guittot (7):
  sched/fair: Filter false overloaded_group case for EAS
  energy model: Add a get previous state function
  sched/fair: Rework feec() to use cost instead of spare capacity
  energy model: Remove unused em_cpu_energy()
  sched/fair: Add push task mechanism for EAS
  sched/fair: Add misfit case to push task mechanism for EAS
  sched/fair: Update overutilized detection

 include/linux/energy_model.h | 111 ++----
 kernel/sched/fair.c          | 721 ++++++++++++++++++++++++-----------
 kernel/sched/sched.h         |   2 +
 3 files changed, 518 insertions(+), 316 deletions(-)

-- 
2.43.0
Re: [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases
Posted by Christian Loehle 10 months, 3 weeks ago
On 3/2/25 21:05, Vincent Guittot wrote:
> The current Energy Aware Scheduler has some known limitations which have
> become more and more visible with features such as uclamp. This series
> tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.

...

Hi Vincent,
I'm giving this another go at reviewing after our OSPM discussions.
One thing which bothered me in the past is that there's just a lot going
on in this series, almost rewriting all of the EAS code in fair.c ;)

For easier reviewing I suggest splitting the series:
1. sched/fair: Filter false overloaded_group case for EAS
(Or actually just get this merged, no need to carry this around, is there?)
2. Rework feec to use more factors than just max_spare_cap to improve
responsiveness / reduce load (Patches 2,3,4)
3. Add push mechanism and make use of it for misfit migration (Patches
5,6,7)

In particular, 2 & 3 could be separated, reviewed and tested on their
own; this would make it much easier to discuss what's being tackled
here IMO.

Best regards,
Christian
Re: [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases
Posted by Christian Loehle 10 months, 1 week ago
On 3/2/25 21:05, Vincent Guittot wrote:
> The current Energy Aware Scheduler has some known limitations which have
> become more and more visible with features such as uclamp. This series
> tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.

...

Hi Vincent,
so I've invested some time into running tests with the series.
To further narrow down which patch a change in behavior can be
attributed to, I've compared the following:
- Patches 1 to 3 applied, comparing your proposed feec() (B)
only to the baseline feec() (A).
- All patches applied, using a static branch to enable (C) and
disable (D) push mechanism for misfit tasks (if disabled only
the 'tasks stuck on CPU' mechanism triggers here).

I've looked at
1) YouTube 4K video playback
2) Dr.Arm (in-house ARM game)
3) VideoScroller which loads a new video every 3s
4) Idle screen on
5) Speedometer2.0 in Chromium

The device tested is the Pixel6 with 6.12 kernel + backported
scheduler patches.
For power measurements the onboard energy-meter is used [1].

Mainline feec() A is the baseline for all. All workloads are run for
10mins with the exception of Speedometer 2.0
(one iteration each for 5 iterations with cooldowns).

1) YouTube 4K video
+4.5% power with all other tested configurations (the regression
already shows with B, no further change with C & D).
(cf. +18.5% power with CAS).
The power regression comes from increased average frequency on all
3 clusters.
No dropped frames in all tested A to D.

2)  Dr.Arm (in-house ARM game)
+9.9% power with all other tested configurations (the regression
already shows with B, no further change with C & D).
(cf. +3.7% power with CAS, new feec() performs worse than CAS here.)
The power regression comes from increased average frequency on all
3 clusters.

3) VideoScroller
No difference in terms of power for A to D.
Specifically even the push mechanism with misfit enabled/disabled
doesn't make a noticeable difference in per-cluster energy numbers.

4) Idle screen on
No difference in power for A to D.

5) Speedometer2.0 in Chromium
Both power and score comparable for A to D.

As mentioned in the thread already the push mechanism
(without misfit tasks) (D) triggers only once every 2-20 minutes,
depending on the workload (all tested here were without any
UCLAMP_MAX tasks).
I also used the device manually just to check if I'm not missing
anything here, I wasn't.
This push task mechanism shouldn't make any difference without
UCLAMP_MAX.

The increased average frequency in 1) and 2) is caused by the
deviation from max-spare-cap in feec(), which previously ensured
as much headroom as possible until we have to raise the OPP of the
cluster.

So all in all this regresses power on some crucial EAS workloads.
I couldn't find a real-world workload where the
'less co-scheduling/contention' strategy of feec() showed a benefit.
Did you have a specific workload for this in mind?

[1]
https://tooling.sites.arm.com/lisa/latest/sections/api/generated/lisa.analysis.pixel6.Pixel6Analysis.html#lisa.analysis.pixel6.Pixel6Analysis.df_power_meter
Re: [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases
Posted by Vincent Guittot 9 months, 4 weeks ago
Hi Christian,

On Thu, 3 Apr 2025 at 14:37, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 3/2/25 21:05, Vincent Guittot wrote:
> > The current Energy Aware Scheduler has some known limitations which have
> > become more and more visible with features such as uclamp. This series
> > tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
> >

...

>
> Hi Vincent,
> so I've invested some time into running tests with the series.
> To further narrow down which patch a change in behavior can be
> attributed to, I've compared the following:
> - Patches 1 to 3 applied, comparing your proposed feec() (B)
> only to the baseline feec() (A).
> - All patches applied, using a static branch to enable (C) and
> disable (D) push mechanism for misfit tasks (if disabled only
> the 'tasks stuck on CPU' mechanism triggers here).
>
> I've looked at
> 1) YouTube 4K video playback
> 2) Dr.Arm (in-house ARM game)
> 3) VideoScroller which loads a new video every 3s
> 4) Idle screen on
> 5) Speedometer2.0 in Chromium
>
> The device tested is the Pixel6 with 6.12 kernel + backported
> scheduler patches.

What do you mean by "6.12 kernel + backported scheduler patches"? Do
you mean android mainline v6.12?

I run my tests with android mainline v6.13 + scheduler patches for
v6.14 and v6.15-rc1. Do you mean the same? v6.12 misses a number of
important patches in regards to thread accounting.

> For power measurements the onboard energy-meter is used [1].

same for me

>
> Mainline feec() A is the baseline for all. All workloads are run for
> 10mins with the exception of Speedometer 2.0
> (one iteration each for 5 iterations with cooldowns).

What do you mean exactly by (one iteration each for 5 iterations with
cooldowns)?

>
> 1) YouTube 4K video

I'd like to reproduce this use case because my test with 4k video
playback shows similar or slightly better power consumption (2%) with
this patch.

Do you have details about this use case that you can share?


> +4.5% power with all other tested configurations (the regression
> already shows with B, no further change with C & D).
> (cf. +18.5% power with CAS).
> The power regression comes from increased average frequency on all
> 3 clusters.

I'm interested to understand why the average frequency increases, as
the OPP remains the 1st level of selection, and in lightly loaded use
cases we should not see much difference. That's what I see in my 4k
video playback use case.

And I will also look at why the CAS is better in your case

> No dropped frames in all tested A to D.
>
> 2)  Dr.Arm (in-house ARM game)
> +9.9% power with all other tested configurations (the regression
> already shows with B, no further change with C & D).
> (cf. +3.7% power with CAS, new feec() performs worse than CAS here.)
> The power regression comes from increased average frequency on all
> 3 clusters.

I suppose I won't be able to reproduce this one.

>
> 3) VideoScroller
> No difference in terms of power for A to D.
> Specifically even the push mechanism with misfit enabled/disabled
> doesn't make a noticeable difference in per-cluster energy numbers.
>
> 4) Idle screen on
> No difference in power for A to D.

I see a difference here, mainly for DDR power consumption, with a 7%
saving compared to mainline and 2% on the CPU clusters.

>
> 5) Speedometer2.0 in Chromium
> Both power and score comparable for A to D.
>
> As mentioned in the thread already the push mechanism
> (without misfit tasks) (D) triggers only once every 2-20 minutes,
> depending on the workload (all tested here were without any
> UCLAMP_MAX tasks).
> I also used the device manually just to check if I'm not missing
> anything here, I wasn't.
> This push task mechanism shouldn't make any difference without
> UCLAMP_MAX.

On the push mechanism side, I'm surprised that you don't get more pushes
than once every 2-20 minutes. On the speedometer, I've got around 170
push fair and 600 check pushable calls which end with a task migration
during the 75 seconds of the test, and many more calls that end with
the same CPU. This also needs to be compared with the 70% of the
75 seconds spent in the overutilized state, during which we don't push.
In lightly loaded cases, the condition is currently too conservative to
trigger the push task mechanism, but that's also expected in order to
stay conservative.

The fact that OU triggers too quickly limits the impact of the push and feec rework.

uclamp_max sees a difference with the push mechanism, which is another
argument for using it.

And this 1st step is quite conservative, before extending the cases
which can benefit from the push and feec rework, as explained at OSPM.

Re: [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases
Posted by Christian Loehle 9 months, 4 weeks ago
On 4/15/25 14:49, Vincent Guittot wrote:
> Hi Christian,
> 
> On Thu, 3 Apr 2025 at 14:37, Christian Loehle <christian.loehle@arm.com> wrote:
>>
>> On 3/2/25 21:05, Vincent Guittot wrote:
>>> The current Energy Aware Scheduler has some known limitations which have
>>> become more and more visible with features such as uclamp. This series
>>> tries to fix some of those issues:
>>> - tasks stacked on the same CPU of a PD
>>> - tasks stuck on the wrong CPU.
>>>
> 
> ...
> 
>>
>> Hi Vincent,
>> so I've invested some time into running tests with the series.
>> To further narrow down which patch a change in behavior can be
>> attributed to, I've compared the following:
>> - Patches 1 to 3 applied, comparing your proposed feec() (B)
>> only to the baseline feec() (A).
>> - All patches applied, using a static branch to enable (C) and
>> disable (D) push mechanism for misfit tasks (if disabled only
>> the 'tasks stuck on CPU' mechanism triggers here).
>>
>> I've looked at
>> 1) YouTube 4K video playback
>> 2) Dr.Arm (in-house ARM game)
>> 3) VideoScroller which loads a new video every 3s
>> 4) Idle screen on
>> 5) Speedometer2.0 in Chromium
>>
>> The device tested is the Pixel6 with 6.12 kernel + backported
>> scheduler patches.
> 
> What do you mean by "6.12 kernel + backported scheduler patches"? Do
> you mean android mainline v6.12?

Yes, in particular with the following patches backported:
(This series is here in the shortlog)
PM: EM: Add min/max available performance state limits  
sched/fair: Fix variable declaration position  
sched/fair: Do not try to migrate delayed dequeue task  
sched/fair: Rename cfs_rq.nr_running into nr_queued  
sched/fair: Remove unused cfs_rq.idle_nr_running  
sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle  
sched/fair: Removed unsued cfs_rq.h_nr_delayed  
sched/fair: Use the new cfs_rq.h_nr_runnable  
sched/fair: Add new cfs_rq.h_nr_runnable  
sched/fair: Rename h_nr_running into h_nr_queued  
sched/eevdf: More PELT vs DELAYED_DEQUEUE  
sched/fair: Fix sched_can_stop_tick() for fair tasks  
sched/fair: optimize the PLACE_LAG when se->vlag is zero  

> 
> I run my tests with android mainline v6.13 + scheduler patches for
> v6.14 and v6.15-rc1. Do you mean the same? v6.12 misses a number of
> important patches in regards to thread accounting.

Which ones in particular do you think are critical?
I'm also happy to just use your branch for testing, so we align on
a common base, if you're willing to share it.
I'm not happy about having to test on backported kernels either, but
as long as this is necessary we might as well just share branches of
Android mainline kernel backports for EAS patches; we all do the
backports anyway.

> 
>> For power measurements the onboard energy-meter is used [1].
> 
> same for me
> 
>>
>> Mainline feec() A is the baseline for all. All workloads are run for
>> 10mins with the exception of Speedometer 2.0
>> (one iteration each for 5 iterations with cooldowns).
> 
> What do you mean exactly by (one iteration each for 5 iterations with
> cooldowns)?

So for Speedometer 2.0 I do:
Run one iteration.
Wait until the device has cooled down (all temp sensors <30C).
Repeat 5x.

> 
>>
>> 1) YouTube 4K video
> 
> I'd like to reproduce this use case because my test with 4k video
> playback shows similar or slightly better power consumption (2%) with
> this patch.
> 
> Do you have details about this use case that you can share?

Sure, in that case it's just a 5-hour-long sample video without
ads in between. I then static-branch between e.g. the two feec()s
to collect the numbers.
There is 1 minute of stabilising between static branch switches, during
which energy numbers are disregarded.

> 
> 
>> +4.5% power with all other tested configurations (the regression
>> already shows with B, no further change with C & D).
>> (cf. +18.5% power with CAS).
>> The power regression comes from increased average frequency on all
>> 3 clusters.
> 
> I'm interested to understand why the average frequency increases, as
> the OPP remains the 1st level of selection, and in lightly loaded use
> cases we should not see much difference. That's what I see in my 4k
> video playback use case.

Well, the OPPs may be quite far apart, and while the max-spare-cap
strategy will optimally balance the util within the cluster, this
series deviates from that, so you will raise the OPP earlier once the
util of the CPUs in the cluster grows.
For illustration here's the OPP table for the tested Pixel 6:
CPU    Freq (kHz)    ΔFreq   Capacity   ΔCap 
cpu0   300000        0        26         0     
cpu0   574000        274000   50         24    
cpu0   738000        164000   65         15    
cpu0   930000        192000   82         17    
cpu0   1098000       168000   97         15    
cpu0   1197000       99000    106        9     
cpu0   1328000       131000   117        11    
cpu0   1401000       73000    124        7     
cpu0   1598000       197000   141        17    
cpu0   1704000       106000   151        10    
cpu0   1803000       99000    160        9  
cpu4   400000        0        88         0     
cpu4   553000        153000   122        34    
cpu4   696000        143000   153        31    
cpu4   799000        103000   176        23    
cpu4   910000        111000   201        25    
cpu4   1024000       114000   226        25    
cpu4   1197000       173000   264        38    
cpu4   1328000       131000   293        29    
cpu4   1491000       163000   329        36    
cpu4   1663000       172000   367        38    
cpu4   1836000       173000   405        38    
cpu4   1999000       163000   441        36    
cpu4   2130000       131000   470        29    
cpu4   2253000       123000   498        28   
cpu6   500000        0        182        0     
cpu6   851000        351000   311        129   
cpu6   984000        133000   359        48    
cpu6   1106000       122000   404        45    
cpu6   1277000       171000   466        62    
cpu6   1426000       149000   521        55    
cpu6   1582000       156000   578        57    
cpu6   1745000       163000   637        59    
cpu6   1826000       81000    667        30    
cpu6   2048000       222000   748        81    
cpu6   2188000       140000   799        51    
cpu6   2252000       64000    823        24    
cpu6   2401000       149000   877        54    
cpu6   2507000       106000   916        39    
cpu6   2630000       123000   961        45    
cpu6   2704000       74000    988        27    
cpu6   2802000       98000    1024       36

A hypothetical util distribution on the little cluster at OPP0
would be:
0:5 1:16 2:17 3:18
When now placing a util=2 task, max-spare-cap will obviously
pick CPU0, while you may deviate from that, also picking any
of CPU1-3. For CPU3 even a single util increase will then raise
the OPP of the cluster.
As util is never that stable, the balancing effect of
max-spare-cap helps preserve energy.

On the big cluster (CPU6), the OPP0 -> OPP1 step makes the situation
even worse, if the util numbers above are too small to be convincing.
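
To put numbers on that tipping point, here is a toy C model using the
kernel's fits_capacity() headroom rule and the little-cluster
capacities above (the OPP-selection loop itself is illustrative, not
the actual cpufreq code):

    #include <stdio.h>

    /* kernel headroom rule: util fits a capacity if util * 1.25 < capacity */
    #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

    /* capacities of the little-cluster OPPs from the table above */
    static const unsigned long little_caps[] = { 26, 50, 65, 82, 97, 106 };

    static int required_opp(unsigned long max_util)
    {
            for (int i = 0; i < 6; i++)
                    if (fits_capacity(max_util, little_caps[i]))
                            return i;
            return 5;
    }

    int main(void)
    {
            /* util per CPU after placing the util=2 task on CPU3: 5 16 17 20 */
            printf("max util 20 -> OPP%d\n", required_opp(20)); /* OPP0 */
            /* a single extra unit of util on CPU3 tips the cluster over */
            printf("max util 21 -> OPP%d\n", required_opp(21)); /* OPP1 */
            return 0;
    }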

> 
> And I will also look at why the CAS is better in your case
> 
>> No dropped frames in all tested A to D.
>>
>> 2)  Dr.Arm (in-house ARM game)
>> +9.9% power with all other tested configurations (the regression
>> already shows with B, no further change with C & D).
>> (cf. +3.7% power with CAS, new feec() performs worse than CAS here.)
>> The power regression comes from increased average frequency on all
>> 3 clusters.
> 
> I suppose I won't be able to reproduce this one.

Not really, although given that the YT case is similar I don't
think this would be a one-off. Probably any comparable 3D action
game will do (our internal one is just really nice to automate,
obviously).

> 
>>
>> 3) VideoScroller
>> No difference in terms of power for A to D.
>> Specifically even the push mechanism with misfit enabled/disabled
>> doesn't make a noticeable difference in per-cluster energy numbers.
>>
>> 4) Idle screen on
>> No difference in power for A to D.
> 
> I see a difference here, mainly for DDR power consumption, with a 7%
> saving compared to mainline and 2% on the CPU clusters.

Honestly the stddev on these is so high that something needs to go
quite badly wrong to show up as significant here; I just wanted to
include it.

> 
>>
>> 5) Speedometer2.0 in Chromium
>> Both power and score comparable for A to D.
>>
>> As mentioned in the thread already the push mechanism
>> (without misfit tasks) (D) triggers only once every 2-20 minutes,
>> depending on the workload (all tested here were without any
>> UCLAMP_MAX tasks).
>> I also used the device manually just to check if I'm not missing
>> anything here, I wasn't.
>> This push task mechanism shouldn't make any difference without
>> UCLAMP_MAX.
> 
> On the push mechanism side, I'm surprised that you don't get more pushes
> than once every 2-20 minutes. On the speedometer, I've got around 170
> push fair and 600 check pushable calls which end with a task migration
> during the 75 seconds of the test, and many more calls that end with
> the same CPU. This also needs to be compared with the 70% of the
> 75 seconds spent in the overutilized state, during which we don't push.
> In lightly loaded cases, the condition is currently too conservative to
> trigger the push task mechanism, but that's also expected in order to
> stay conservative.

Does that include misfit pushes? I'd be interested if our results
vastly differ here. Just to reiterate, this is without misfit pushes,
only the "stuck on CPU" case introduced by 5/7.

> 
> The fact that OU triggers too quickly limits the impact of the push and feec rework.

I'm working on a series here :)

> 
> uclamp_max sees a difference with the push mechanism, which is another
> argument for using it.

I don't doubt that, but there's little to test with real-world use-cases
really...

> 
> And this 1st step is quite conservative, before extending the cases
> which can benefit from the push and feec rework, as explained at OSPM.
> 

Right, I actually do see the appeal of having the push mechanism in
fair/EAS, but of course the series introducing it should also have
sufficiently convincing benefits.