[PATCH 0/5] sched/fair: Rework EAS to handle more cases
Posted by Vincent Guittot 1 year, 3 months ago
The current Energy Aware Scheduler has some known limitations which have
become more and more visible with features like uclamp, as an example. This
series tries to fix some of those issues:
- tasks being stacked on the same CPU of a PD
- tasks being stuck on the wrong CPU.

Patch 1 fixes the case where a CPU is wrongly classified as overloaded
whereas it is in fact capped to a lower compute capacity. This wrong
classification can prevent the periodic load balancer from selecting a
group_misfit_task CPU because group_overloaded has a higher priority.
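
For illustration only, a minimal sketch of the idea (my own paraphrase, not
the actual patch; effective_cpu_capacity() is a hypothetical helper returning
the capacity left after uclamp/thermal caps):

	/*
	 * Sketch: compare utilization against the capped capacity, not the
	 * original one, before classifying the CPU as overloaded. A CPU
	 * running at its capped capacity is a misfit candidate, not an
	 * overloaded one.
	 */
	static inline bool cpu_truly_overloaded(int cpu)
	{
		unsigned long capped = effective_cpu_capacity(cpu);	/* hypothetical */

		return cpu_util_cfs(cpu) > capped;
	}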


Patch 2 creates a new EM interface that will be used by patch 3.
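
The cover letter doesn't detail the interface, so the shape below is only a
guess (names included); see patch 2 for the real definition:

	/*
	 * Hypothetical prototype: return the performance state just below
	 * the one required for @max_util, so that feec() (patch 3) can
	 * compare the costs of neighbouring OPPs.
	 */
	struct em_perf_state *em_pd_get_prev_state(struct em_perf_domain *pd,
						   unsigned long max_util);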


Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD
whereas others might be a better choice. feec() looks for the CPU with the
highest spare capacity in a PD, assuming that it will be the best CPU from
an energy efficiency PoV because it will require the smallest increase of
OPP. This is often but not always true: this policy filters out other CPUs
which would be just as efficient because they use the same OPP, but have
fewer running tasks, as an example.
In fact, we only care about the cost of the new OPP that will be
selected to handle the waking task. In many cases, several CPUs will end
up selecting the same OPP and, as a result, having the same energy cost. In
such cases, we can use other metrics to select the best CPU with the same
energy cost. Patch 3 reworks feec() to look first for the lowest cost in a
PD and then for the most performant CPU among the candidates.
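
A rough sketch of the reworked selection, as I read the description above
(simplified; pd_opp_cost() and fits_better() are hypothetical helpers):

	for_each_cpu(cpu, pd_cpus) {
		/* Cost of the OPP the PD would run at with the task placed here */
		unsigned long cost = pd_opp_cost(pd, cpu, p);	/* hypothetical */

		if (cost < best_cost) {
			best_cost = cost;
			best_cpu = cpu;
		} else if (cost == best_cost && fits_better(cpu, best_cpu)) {
			/* Same energy cost: break the tie on performance */
			best_cpu = cpu;
		}
	}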

perf sched pipe on a Dragonboard RB5 has been used to compare the overhead
of the new feec() vs the current implementation.
Side note: delayed dequeue has been disabled for all tests.

9 iterations of perf bench sched pipe -T -l 80000
                ops/sec  stdev 
tip/sched/core  13490    (+/- 1.7%)
+ patches 1-3   14095    (+/- 1.7%)  +4.5%


When overutilized, the scheduler stops looking for an energy efficient CPU
and falls back to the default performance mode. Although this is the best
choice when a system is fully overutilized, it also breaks the energy
efficiency when one CPU becomes overutilized for a short time because of a
kworker and/or background activity, as an example.
Patch 4 calls feec() every time instead of skipping it when overutilized,
and falls back to the default performance mode only when feec() can't find
a suitable CPU. The main advantage is that the task placement remains more
stable, especially when there is a short and transient overutilized state.
The drawback is that the overhead can be significant for some CPU intensive
use cases.
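
Schematically, the wakeup path changes like this (a simplified sketch of my
understanding, not the actual diff):

	/*
	 * Before: feec() bails out early when the root domain is
	 * overutilized and the default performance path is always used.
	 *
	 * After (patch 4), roughly:
	 */
	if (sched_energy_enabled()) {
		new_cpu = find_energy_efficient_cpu(p, prev_cpu);
		if (new_cpu >= 0)
			return new_cpu;
		/* No suitable CPU found: fall back to performance mode */
	}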

The overhead of patch 4 has been measured with hackbench on a Dragonboard RB5

                               tip/sched/core        + patches 1-4
			       Time    stdev         Time    stdev
hackbench -l 5120 -g 1         0.724   +/-1.3%       0.765   +/-3.0% (-5.7%)
hackbench -l 1280 -g 4         0.740   +/-1.1%       0.768   +/-1.8% (-3.8%)
hackbench -l 640  -g 8         0.792   +/-1.3%       0.812   +/-1.6% (-2.6%)
hackbench -l 320  -g 16        0.847   +/-1.4%       0.852   +/-1.8% (-0.6%)

hackbench -p -l 5120 -g 1      0.878   +/-1.9%       1.115   +/-3.0% (-27%)
hackbench -p -l 1280 -g 4      0.789   +/-2.6%       0.862   +/-5.0% (-9.2%)
hackbench -p -l 640  -g 8      0.732   +/-1.9%       0.801   +/-4.3% (-9.4%)
hackbench -p -l 320  -g 16     0.710   +/-4.7%       0.767   +/-4.9% (-8.1%)

hackbench -T -l 5120 -g 1      0.756   +/-3.9%       0.772   +/-1.6% (-2.0%)
hackbench -T -l 1280 -g 4      0.725   +/-1.4%       0.737   +/-2.0% (-1.3%)
hackbench -T -l 640  -g 8      0.767   +/-1.5%       0.809   +/-2.6% (-5.5%)
hackbench -T -l 320  -g 16     0.812   +/-1.2%       0.823   +/-2.2% (-1.4%)

hackbench -T -p -l 5120 -g 1   0.941   +/-2.5%       1.190   +/-1.6% (-26%) 
hackbench -T -p -l 1280 -g 4   0.869   +/-2.5%       0.931   +/-4.9% (-7.2%)
hackbench -T -p -l 640  -g 8   0.819   +/-2.4%       0.895   +/-4.6% (-9.3%)
hackbench -T -p -l 320  -g 16  0.763   +/-2.6%       0.863   +/-5.0% (-13%)

Side note: Both the new feec() and the current feec() show similar overheads
with patch 4.

Although the highest reachable CPU throughput is not the only target of EAS,
the overhead can be significant in some cases, as shown in the hackbench
results above. That being said, I still think it's worth the benefit for the
stability of task placement and a better control of power.


Patch 5 solves another problem: tasks being stuck on a CPU forever
because they don't sleep anymore and, as a result, never wake up and call
feec(). Such a task can be detected by comparing util_avg or runnable_avg
with the compute capacity of the CPU. Once detected, we can call feec() to
check if there is a better CPU for the stuck task (a sketch of the
detection follows the list below). The call can be done in 2 places:
- When the task is put back in the runnable list after its running slice,
  with the balance callback mechanism, similarly to the rt/dl push callback.
- During the cfs tick when there is only 1 running task stuck on the CPU,
  in which case the balance callback can't be used.
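
A sketch of the detection condition, based only on the description above
(the real check is in patch 5):

	static inline bool task_maybe_stuck(struct task_struct *p, int cpu)
	{
		unsigned long cap = arch_scale_cpu_capacity(cpu);

		/*
		 * A task that never sleeps ramps its PELT signals up
		 * towards the CPU's compute capacity.
		 */
		return READ_ONCE(p->se.avg.util_avg) >= cap ||
		       READ_ONCE(p->se.avg.runnable_avg) >= cap;
	}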

This push callback doesn't replace the current misfit task mechanism which
is already implemented, but that could be considered as a follow-up series.


This push callback mechanism with the new feec() algorithm ensures that
tasks always get a chance to migrate to the most suitable CPU and don't
stay stuck on a CPU which is no longer the most suitable one. As examples:
- A task waking on a big CPU, with a uclamp max preventing it from sleeping
  and waking up, can migrate to a smaller CPU once that is more power
  efficient.
- Tasks are spread across the CPUs of the PD when they target the same OPP.

This series implements some of the topics discussed at OSPM [1]. Other
topics will be part of another series.

[1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp

Vincent Guittot (5):
  sched/fair: Filter false overloaded_group case for EAS
  energy model: Add a get previous state function
  sched/fair: Rework feec() to use cost instead of spare capacity
  sched/fair: Use EAS also when overutilized
  sched/fair: Add push task callback for EAS

 include/linux/energy_model.h |  18 +
 kernel/sched/fair.c          | 693 +++++++++++++++++++++++------------
 kernel/sched/sched.h         |   2 +
 3 files changed, 488 insertions(+), 225 deletions(-)

-- 
2.34.1
Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
Posted by Pierre Gondois 1 year, 1 month ago
Hello Vincent,
Related to feec(), but not to this patchset, I think there might be a
concurrency issue while running feec().

Feec() doesn't have any locking mechanism. This means that multiple CPUs
might run the function at the same time.
If:
- 2 tasks with approximately the same utilization wake up at the same time
- some space on an energy efficient CPU is available
feec() will likely select the same target for the 2 tasks.

Once feec() determined a target for a task, util signals are updated in
enqueue_task_fair(). The delta between running feec() <-> enqueue_task_fair()
is ~20us (on a Pixel6). This is not much, but this still allows some other
CPUs to run feec() using util signals that will be wrong in no time.

Note that it is also possible for one CPU to run feec() for 2 different tasks,
decide to migrate the 2 tasks to another target CPU, and then start enqueueing
the tasks. Meaning one single CPU will run feec() using util signals it knows
are wrong.

The issue is problematic as it creates some instability. Once a
'parallel selection' is done, the following scenarios can happen:
- the system goes overutilized, and EAS is disabled
- a frequency spike happens to handle the unexpected load.
   Then the perf. domain becomes less energy efficient compared to other
   perf. domains, and tasks are migrated out of this perf. domain

I made the following prototype to avoid 'parallel selections'. The goal here
is to tag CPUs that are under a pending migration.
A target CPU is tagged as 'eas_pending_enqueue' at the end of feec(). Other
CPUs should therefore not consider this CPU as a valid candidate.

The implementation is a bit raw, but it gives some good results. Using rt-app
workloads, and trying to avoid tasks waking up at the same time during
the whole test:
Workload1:
N tasks with a period of 16ms and a util of 4/8. Each task starts with a
4ms delay. Each workload lasts 20s and is run over 5 iterations.

Workload2:
N tasks with a period of (8 +n)ms and a util of 4/8. I.e. the first task
has a period of 8ms, the second task a period of 9ms, etc. Each workload lasts
20s and is run over 5 iterations.

The following are presented:
- the measured energy consumed, according to the Pixel6 energy meters
- the estimated energy consumed; LISA uses the util signals along with
   the CPU frequencies and the Energy Model to compute an estimate.
- the amount of time spent in the overutilized state, as a percentage.

------

Workload1:

Measured energy:
+------+-------+--------------+--------------+------------+
| util | count | without      | with         | ratio      |
+------+-------+--------------+--------------+------------+
| 4    | 8     | 3220.970324  | 3312.097508  | 2.829184   |
| 4    | 12    | 5942.486726  | 5016.106047  | -15.589108 |
| 4    | 16    | 10412.26692  | 10017.633658 | -3.79008   |
| 8    | 8     | 7524.271751  | 7479.451427  | -0.595677  |
| 8    | 12    | 14782.214144 | 14567.282266 | -1.45399   |
| 8    | 16    | 21452.863497 | 19561.143385 | -8.818031  |
+------+-------+--------------+--------------+------------+
Std:
+------+-------+-------------+-------------+
| util | count | without     | with        |
+------+-------+-------------+-------------+
| 4    | 8     | 165.563394  | 48.903514   |
| 4    | 12    | 518.609612  | 81.170952   |
| 4    | 16    | 329.729882  | 192.739245  |
| 8    | 8     | 105.144497  | 336.796522  |
| 8    | 12    | 384.615323  | 339.86986   |
| 8    | 16    | 1252.735561 | 2563.268952 |
+------+-------+-------------+-------------+

Estimated energy:
+------+-------+-----------+-----------+------------+
| util | count | without   | with      | ratio      |
+------+-------+-----------+-----------+------------+
| 4    | 8     | 1.4372e10 | 1.2791e10 | -11.000273 |
| 4    | 12    | 3.1881e10 | 2.3743e10 | -25.526193 |
| 4    | 16    | 5.7663e10 | 5.4079e10 | -6.215679  |
| 8    | 8     | 2.5622e10 | 2.5337e10 | -1.109823  |
| 8    | 12    | 6.4332e10 | 6.9335e10 | 7.776814   | [1]
| 8    | 16    | 9.5285e10 | 8.2331e10 | -13.594508 |
+------+-------+-----------+-----------+------------+
Std:
+------+-------+----------+-----------+
| util | count | without  | with      |
+------+-------+----------+-----------+
| 4    | 8     | 1.3896e9 | 5.4265e8  |
| 4    | 12    | 4.7511e9 | 5.1521e8  |
| 4    | 16    | 3.5486e9 | 1.2625e9  |
| 8    | 8     | 3.0033e8 | 2.3168e9  |
| 8    | 12    | 8.7739e9 | 3.0743e9  |
| 8    | 16    | 6.7982e9 | 2.2393e10 |
+------+-------+----------+-----------+

Overutilized ratio (in % of the 20s test):
+------+-------+-----------+-----------+------------+
| util | count | without   | with      | ratio      |
+------+-------+-----------+-----------+------------+
| 4    | 8     | 0.187941  | 0.015834  | -91.575158 |
| 4    | 12    | 0.543073  | 0.045483  | -91.624815 |
| 4    | 16    | 8.510734  | 8.389077  | -1.429448  |
| 8    | 8     | 1.056678  | 0.876095  | -17.089643 |
| 8    | 12    | 36.457757 | 9.260862  | -74.598378 | [1]
| 8    | 16    | 72.327933 | 78.693558 | 8.801061   |
+------+-------+-----------+-----------+------------+
Std:
+------+-------+-----------+-----------+
| util | count | without   | with      |
+------+-------+-----------+-----------+
| 4    | 8     | 0.232077  | 0.016531  |
| 4    | 12    | 0.338637  | 0.040252  |
| 4    | 16    | 0.729743  | 6.368214  |
| 8    | 8     | 1.702964  | 1.722589  |
| 8    | 12    | 34.436278 | 17.314564 |
| 8    | 16    | 14.540217 | 33.77831  |
+------+-------+-----------+-----------+

------

Workload2:

Measured energy:
+------+-------+--------------+--------------+-----------+
| util | count | without      | with         | ratio     |
+------+-------+--------------+--------------+-----------+
| 4    | 8     | 3357.578785  | 3324.890715  | -0.973561 |
| 4    | 12    | 5024.573746  | 4903.394533  | -2.411731 |
| 4    | 16    | 10114.715431 | 9762.803821  | -3.479204 |
| 8    | 8     | 7485.230678  | 6961.782086  | -6.993086 |
| 8    | 12    | 13720.482516 | 13374.765825 | -2.519712 |
| 8    | 16    | 24846.806317 | 24444.012805 | -1.621108 |
+------+-------+--------------+--------------+-----------+
Std:
+------+-------+------------+------------+
| util | count | without    | with       |
+------+-------+------------+------------+
| 4    | 8     | 87.450628  | 76.955783  |
| 4    | 12    | 106.062839 | 116.882891 |
| 4    | 16    | 182.525881 | 172.819307 |
| 8    | 8     | 874.292359 | 162.790237 |
| 8    | 12    | 151.830636 | 339.286741 |
| 8    | 16    | 904.751446 | 154.419644 |
+------+-------+------------+------------+

Estimated energy:
+------+-------+-----------+-----------+------------+
| util | count | without   | with      | ratio      |
+------+-------+-----------+-----------+------------+
| 4    | 8     | 1.4778e10 | 1.4805e10 | 0.184658   |
| 4    | 12    | 2.6105e10 | 2.5485e10 | -2.374486  |
| 4    | 16    | 5.8394e10 | 5.7177e10 | -2.083208  |
| 8    | 8     | 3.0275e10 | 2.5973e10 | -14.211178 |
| 8    | 12    | 7.0616e10 | 6.9085e10 | -2.168347  |
| 8    | 16    | 1.3133e11 | 1.2891e11 | -1.839725  |
+------+-------+-----------+-----------+------------+
Std:
+------+-------+----------+----------+
| util | count | without  | with     |
+------+-------+----------+----------+
| 4    | 8     | 3.5449e8 | 8.2454e8 |
| 4    | 12    | 9.4248e8 | 1.1364e9 |
| 4    | 16    | 8.3240e8 | 1.2084e9 |
| 8    | 8     | 9.0364e9 | 5.0381e8 |
| 8    | 12    | 9.9112e8 | 3.0836e9 |
| 8    | 16    | 4.9429e8 | 1.9533e9 |
+------+-------+----------+----------+

Overutilized ratio (in % of the 20s test):
+------+-------+-----------+----------+------------+
| util | count | without   | with     | ratio      |
+------+-------+-----------+----------+------------+
| 4    | 8     | 0.154992  | 0.049429 | -68.108419 |
| 4    | 12    | 0.132593  | 0.061762 | -53.420202 |
| 4    | 16    | 6.798091  | 4.606102 | -32.244179 |
| 8    | 8     | 1.360703  | 0.174626 | -87.166465 |
| 8    | 12    | 0.519704  | 0.250469 | -51.805502 |
| 8    | 16    | 12.114269 | 8.969281 | -25.961019 |
+------+-------+-----------+----------+------------+
Std:
+------+-------+----------+----------+
| util | count | without  | with     |
+------+-------+----------+----------+
| 4    | 8     | 0.212919 | 0.036856 |
| 4    | 12    | 0.069696 | 0.060257 |
| 4    | 16    | 0.63995  | 0.542028 |
| 8    | 8     | 2.158079 | 0.211775 |
| 8    | 12    | 0.089159 | 0.187436 |
| 8    | 16    | 0.798565 | 1.669003 |
+------+-------+----------+----------+

------

Analysis:

- [1]
Without the patch, 2 tasks end up on one little CPU. This consumes
less energy than using the medium/big CPU according to the energy model,
but EAS should not be able to do such a task placement, as the little
CPU becomes overutilized.
Without the patch, the system is overutilized a lot more than with the patch.

-
Looking at the overutilized ratio, being overutilized 0.5% of the time or
0.05% of the time might seem close, but it means that EAS ended up
doing a bad task placement multiple, independent times.

-
The overutilized ratio should be checked along with the energy results, as it
shows how much EAS was involved in the task placement.

-
Overall, less energy is consumed. The quantity of energy saved varies
with the workload.

------

On another note, I wanted to ask if there will be a v2 of this
patchset (sched/fair: Rework EAS to handle more cases).

Regards,
Pierre

------


diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb343136ddd0..812d5bf88875 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1592,6 +1592,12 @@ struct task_struct {
         struct user_event_mm            *user_event_mm;
  #endif
  
+       /*
+        * Keep track of the CPU feec() migrated this task to.
+        * There is a per-cpu 'eas_pending_enqueue' value to reset.
+        */
+       int eas_target_cpu;
+
         /*
          * New fields for task_struct should be added above here, so that
          * they are included in the randomized portion of task_struct.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c157d4860a3b..34911eb059cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6945,6 +6945,8 @@ requeue_delayed_entity(struct sched_entity *se)
         se->sched_delayed = 0;
  }
  
+DEFINE_PER_CPU(atomic_t, eas_pending_enqueue);
+
  /*
   * The enqueue_task method is called before nr_running is
   * increased. Here we update the fair scheduling stats and
@@ -7064,6 +7066,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
                 check_update_overutilized_status(rq);
  
  enqueue_throttle:
+       if (p->eas_target_cpu != -1) {
+               atomic_set(&per_cpu(eas_pending_enqueue, p->eas_target_cpu), 0);
+               p->eas_target_cpu = -1;
+       }
+
         assert_list_leaf_cfs_rq(rq);
  
         hrtick_update(rq);
@@ -8451,6 +8458,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
                         if (!cpumask_test_cpu(cpu, p->cpus_ptr))
                                 continue;
  
+                       /* Skip this CPU as its util signal will be invalid soon. */
+                       if (atomic_read(&per_cpu(eas_pending_enqueue, cpu)) &&
+                           cpu != prev_cpu)
+                               continue;
+
                         util = cpu_util(cpu, p, cpu, 0);
                         cpu_cap = capacity_of(cpu);
  
@@ -8560,6 +8572,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
             ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
                 target = best_energy_cpu;
  
+       /*
+        * 'Lock' the target CPU if there is a migration. Prevent other feec()
+        * calls from using the same target CPU until the util signals are updated.
+        */
+       if (prev_cpu != target) {
+               if (!atomic_cmpxchg_acquire(&per_cpu(eas_pending_enqueue, target), 0, 1))
+                       p->eas_target_cpu = target;
+               else
+                       target = prev_cpu;
+       }
+
         return target;
  
  unlock:
Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
Posted by Vincent Guittot 1 year, 1 month ago
Hi Pierre,

On Thu, 7 Nov 2024 at 11:14, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
> Related to feec(), but not to this patchset, I think there might be a
> concurrency issue while running feec().

yes, this is a known limitation

>
> Feec() doesn't have any locking mechanism. This means that multiple CPUs
> might run the function at the same time.

this is done on purpose as we don't want to lock and slow down the wakeup path

> If:
> - 2 tasks with approximately the same utilization wake up at the same time
> - some space on an energy efficient CPU is available
> feec() will likely select the same target for the 2 tasks.

yes

>
> Once feec() determined a target for a task, util signals are updated in
> enqueue_task_fair(). The delta between running feec() <-> enqueue_task_fair()
> is ~20us (on a Pixel6). This is not much, but this still allows some other

20us is quite long. Is this the worst case on a little core at the lowest freq?

> CPUs to run feec() using util signals that will be wrong in no time.
>
> Note that it is also possible for one CPU to run feec() for 2 different tasks,
> decide to migrate the 2 tasks to another target CPU, and then start enqueueing
> the tasks. Meaning one single CPU will run feec() using util signals it knows
> are wrong.

Isn't this case serialized, because CPU selection for the next task will
happen after enqueuing the 1st one?

>
> [snip: rt-app workloads, benchmark tables and analysis]
>
> On another note, I wanted to ask if there will be a v2 of this
> patchset (sched/fair: Rework EAS to handle more cases).

yes, I have been sidetracked by other stuff since LPC and haven't
been able to finalize tests on v2, but it's ongoing

Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
Posted by Pierre Gondois 1 year, 1 month ago

On 11/8/24 10:27, Vincent Guittot wrote:
> Hi Pierre,
> 
> On Thu, 7 Nov 2024 at 11:14, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Hello Vincent,
>> Related to feec(), but not to this patchset, I think there might be a
>> concurrency issue while running feec().
> 
> yes, this is a known limitation
> 
>>
>> Feec() doesn't have any locking mechanism. This means that multiple CPUs
>> might run the function at the same time.
> 
> this is done on purpose as we don't want to lock and slow down the wakeup path

Yes right, this is understandable. However there could be a way to bail out of
feec() when such a case is detected, without actually waiting for a lock (cf. the
prototype).
We already bail out of feec() when the utilization of a CPU without a task is
higher than with the task in the energy computation.

> 
>> If:
>> - 2 tasks with approximately the same utilization wake up at the same time
>> - some space on an energy efficient CPU is available
>> feec() will likely select the same target for the 2 tasks.
> 
> yes
> 
>>
>> Once feec() determined a target for a task, util signals are updated in
>> enqueue_task_fair(). The delta between running feec() <-> enqueue_task_fair()
>> is ~20us (on a Pixel6). This is not much, but this still allows some other
> 
> 20us is quite long. Is this the worst case on a little core at the lowest freq?

I only kept the occurrences where feec() ends up with a target != prev_cpu.
In these cases, enqueuing is done on the target CPU (cf. __ttwu_queue_wakelist),
which might take more time.

In the other case, the delta is effectively lower (~10us).

> 
>> CPUs to run feec() using util signals that will be wrong in no time.
>>
>> Note that it is also possible for one CPU to run feec() for 2 different tasks,
>> decide to migrate the 2 tasks to another target CPU, and then start enqueueing
>> the tasks. Meaning one single CPU will run feec() using util signals it knows
>> are wrong.
> 
> Isn't this case serialized, because CPU selection for the next task will
> happen after enqueuing the 1st one?

I'm not sure I understand the question, but if the enqueue is done on the
target CPU, the running CPU might call feec() in the meantime.

>
>> [snip: rt-app workloads, benchmark tables and analysis]
>
>> On another note, I wanted to ask if there will be a v2 of this
>> patchset (sched/fair: Rework EAS to handle more cases).
> 
> yes, I have been sidetracked by other stuff since LPC and haven't
> been able to finalize tests on v2, but it's ongoing

Ok thanks!

Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
Posted by Vincent Guittot 1 year, 1 month ago
On Fri, 8 Nov 2024 at 14:10, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
>
>
> On 11/8/24 10:27, Vincent Guittot wrote:
> > Hi Pierre,
> >
> > On Thu, 7 Nov 2024 at 11:14, Pierre Gondois <pierre.gondois@arm.com> wrote:
> >>
> >> Hello Vincent,
> >> Related to feec(), but not to this patchset, I think there might be a
> >> concurrency issue while running feec().
> >
> > yes, this is a known limitation
> >
> >>
> >> Feec() doesn't have any locking mechanism. This means that multiple CPUs
> >> might run the function at the same time.
> >
> > this is done on purpose as we don't want to lock and slow down the wakeup path
>
> Yes right, this is understandable. However there could be a way to bail out of
> feec() when such a case is detected, without actually waiting for a lock (cf. the
> prototype).
> We already bail out of feec() when the utilization of a CPU without a task is
> higher than with the task in the energy computation.
>
> >
> >> If:
> >> - 2 tasks with approximately the same utilization wake up at the same time
> >> - some space on an energy efficient CPU is available
> >> feec() will likely select the same target for the 2 tasks.
> >
> > yes
> >
> >>
> >> Once feec() determined a target for a task, util signals are updated in
> >> enqueue_task_fair(). The delta between running feec() <-> enqueue_task_fair()
> >> is ~20us (on a Pixel6). This is not much, but this still allows some other
> >
> > 20us is quite long. Is this the worst case on a little core at the lowest freq?
>
> I only kept the occurrences where feec() ends up with a target != prev_cpu.
> In these cases, enqueuing is done on the target CPU (cf. __ttwu_queue_wakelist),
> which might take more time.
>
> In the other case, the delta is effectively lower (~10us).
>
> >
> >> CPUs to run feec() using util signals that will be wrong in no time.
> >>
> >> Note that it is also possible for one CPU to run feec() for 2 different tasks,
> >> decide to migrate the 2 tasks to another target CPU, and then start enqueueing
> >> the tasks. Meaning one single CPU will run feec() using util signals it knows
> >> are wrong.
> >
> > Isn't this case serialized, because CPU selection for the next task will
> > happen after enqueuing the 1st one?
>
> I'm not sure I understand the question, but if the enqueue is done on the
> target CPU, the running CPU might call feec() in the meantime.

When CPUs share the LLC, the local CPU enqueues to the target ... unless the
target is idle, which is the case in your example above.

>
> >
> >>
> >> The issue is problematic as it creates some instability. Once a
> >> 'parallel selection' is done, the following scenarios can happen:
> >> - the system goes overutilized, and EAS is disabled
> >> - a frequency spike happen to handle the unexpected load.
> >>     Then the perf. domain becomes less energy efficient compared to other
> >>     perf. domains, and tasks are migrated out of this perf. domain
> >>
> >> I made the following prototype to avoid 'parallel selections'. The goal here
> >> is to tag CPUs that are under pending migration.
> >> A target CPU is tagged as 'eas_pending_enqueue' at the end of feec(). Other
> >> CPUs should therefore not consider this CPU as valid candidate.
> >>
> >> The implementation is a bit raw, but it gives some good results. Using rt-app
> >> workloads, and trying not to have tasks waking up at the same timing during
> >> the whole test:
> >> Workload1:
> >> N tasks with a period of 16ms and a util of 4/8. Each task starts with a
> >> 4ms delay. Each workload lasts 20s and is run over 5 iterations.
> >>
> >> Workload2:
> >> N tasks with a period of (8 +n)ms and a util of 4/8. I.e. the first task
> >> has a period of 8ms, the second task a period of 9ms, etc. Each workload lasts
> >> 20s and is run over 5 iterations.
> >>
> >> Are presented:
> >> - the measured energy consumed, according to the Pixel6 energy meters
> >> - the estimated energy consumed, lisa uses the util signals along with
> >>     the CPU frequencies and the Energy Model to do an estimation.
> >> - the amount of time spent in the overutilized state, in percentage.
> >>
> >> ------
> >>
> >> Workload1:
> >>
> >> Measured energy:
> >> +------+-------+--------------+--------------+------------+
> >> | util | count | without      | with         | ratio      |
> >> +------+-------+--------------+--------------+------------+
> >> | 4    | 8     | 3220.970324  | 3312.097508  | 2.829184   |
> >> | 4    | 12    | 5942.486726  | 5016.106047  | -15.589108 |
> >> | 4    | 16    | 10412.26692  | 10017.633658 | -3.79008   |
> >> | 8    | 8     | 7524.271751  | 7479.451427  | -0.595677  |
> >> | 8    | 12    | 14782.214144 | 14567.282266 | -1.45399   |
> >> | 8    | 16    | 21452.863497 | 19561.143385 | -8.818031  |
> >> +------+-------+--------------+--------------+------------+
> >> Std:
> >> +------+-------+-------------+-------------+
> >> | util | count | without     | with        |
> >> +------+-------+-------------+-------------+
> >> | 4    | 8     | 165.563394  | 48.903514   |
> >> | 4    | 12    | 518.609612  | 81.170952   |
> >> | 4    | 16    | 329.729882  | 192.739245  |
> >> | 8    | 8     | 105.144497  | 336.796522  |
> >> | 8    | 12    | 384.615323  | 339.86986   |
> >> | 8    | 16    | 1252.735561 | 2563.268952 |
> >> +------+-------+-------------+-------------+
> >>
> >> Estimated energy:
> >> +------+-------+-----------+-----------+------------+
> >> | util | count | without   | with      | ratio      |
> >> +------+-------+-----------+-----------+------------+
> >> | 4    | 8     | 1.4372e10 | 1.2791e10 | -11.000273 |
> >> | 4    | 12    | 3.1881e10 | 2.3743e10 | -25.526193 |
> >> | 4    | 16    | 5.7663e10 | 5.4079e10 | -6.215679  |
> >> | 8    | 8     | 2.5622e10 | 2.5337e10 | -1.109823  |
> >> | 8    | 12    | 6.4332e10 | 6.9335e10 | 7.776814   | [1]
> >> | 8    | 16    | 9.5285e10 | 8.2331e10 | -13.594508 |
> >> +------+-------+-----------+-----------+------------+
> >> Std:
> >> +------+-------+----------+-----------+
> >> | util | count | without  | with      |
> >> +------+-------+----------+-----------+
> >> | 4    | 8     | 1.3896e9 | 5.4265e8  |
> >> | 4    | 12    | 4.7511e9 | 5.1521e8  |
> >> | 4    | 16    | 3.5486e9 | 1.2625e9  |
> >> | 8    | 8     | 3.0033e8 | 2.3168e9  |
> >> | 8    | 12    | 8.7739e9 | 3.0743e9  |
> >> | 8    | 16    | 6.7982e9 | 2.2393e10 |
> >> +------+-------+----------+-----------+
> >>
> >> Overutilized ratio (in % of the 20s test):
> >> +------+-------+-----------+-----------+------------+
> >> | util | count | without   | with      | ratio      |
> >> +------+-------+-----------+-----------+------------+
> >> | 4    | 8     | 0.187941  | 0.015834  | -91.575158 |
> >> | 4    | 12    | 0.543073  | 0.045483  | -91.624815 |
> >> | 4    | 16    | 8.510734  | 8.389077  | -1.429448  |
> >> | 8    | 8     | 1.056678  | 0.876095  | -17.089643 |
> >> | 8    | 12    | 36.457757 | 9.260862  | -74.598378 | [1]
> >> | 8    | 16    | 72.327933 | 78.693558 | 8.801061   |
> >> +------+-------+-----------+-----------+------------+
> >> Std:
> >> +------+-------+-----------+-----------+
> >> | util | count | without   | with      |
> >> +------+-------+-----------+-----------+
> >> | 4    | 8     | 0.232077  | 0.016531  |
> >> | 4    | 12    | 0.338637  | 0.040252  |
> >> | 4    | 16    | 0.729743  | 6.368214  |
> >> | 8    | 8     | 1.702964  | 1.722589  |
> >> | 8    | 12    | 34.436278 | 17.314564 |
> >> | 8    | 16    | 14.540217 | 33.77831  |
> >> +------+-------+-----------+-----------+
> >>
> >> ------
> >>
> >> Workload2:
> >>
> >> Measured energy:
> >> +------+-------+--------------+--------------+-----------+
> >> | util | count | without      | with         | ratio     |
> >> +------+-------+--------------+--------------+-----------+
> >> | 4    | 8     | 3357.578785  | 3324.890715  | -0.973561 |
> >> | 4    | 12    | 5024.573746  | 4903.394533  | -2.411731 |
> >> | 4    | 16    | 10114.715431 | 9762.803821  | -3.479204 |
> >> | 8    | 8     | 7485.230678  | 6961.782086  | -6.993086 |
> >> | 8    | 12    | 13720.482516 | 13374.765825 | -2.519712 |
> >> | 8    | 16    | 24846.806317 | 24444.012805 | -1.621108 |
> >> +------+-------+--------------+--------------+-----------+
> >> Std:
> >> +------+-------+------------+------------+
> >> | util | count | without    | with       |
> >> +------+-------+------------+------------+
> >> | 4    | 8     | 87.450628  | 76.955783  |
> >> | 4    | 12    | 106.062839 | 116.882891 |
> >> | 4    | 16    | 182.525881 | 172.819307 |
> >> | 8    | 8     | 874.292359 | 162.790237 |
> >> | 8    | 12    | 151.830636 | 339.286741 |
> >> | 8    | 16    | 904.751446 | 154.419644 |
> >> +------+-------+------------+------------+
> >>
> >> Estimated energy:
> >> +------+-------+-----------+-----------+------------+
> >> | util | count | without   | with      | ratio      |
> >> +------+-------+-----------+-----------+------------+
> >> | 4    | 8     | 1.4778e10 | 1.4805e10 | 0.184658   |
> >> | 4    | 12    | 2.6105e10 | 2.5485e10 | -2.374486  |
> >> | 4    | 16    | 5.8394e10 | 5.7177e10 | -2.083208  |
> >> | 8    | 8     | 3.0275e10 | 2.5973e10 | -14.211178 |
> >> | 8    | 12    | 7.0616e10 | 6.9085e10 | -2.168347  |
> >> | 8    | 16    | 1.3133e11 | 1.2891e11 | -1.839725  |
> >> +------+-------+-----------+-----------+------------+
> >> Std:
> >> +------+-------+----------+----------+
> >> | util | count | without  | with     |
> >> +------+-------+----------+----------+
> >> | 4    | 8     | 3.5449e8 | 8.2454e8 |
> >> | 4    | 12    | 9.4248e8 | 1.1364e9 |
> >> | 4    | 16    | 8.3240e8 | 1.2084e9 |
> >> | 8    | 8     | 9.0364e9 | 5.0381e8 |
> >> | 8    | 12    | 9.9112e8 | 3.0836e9 |
> >> | 8    | 16    | 4.9429e8 | 1.9533e9 |
> >> +------+-------+----------+----------+
> >>
> >> Overutilized ratio (in % of the 20s test):
> >> +------+-------+-----------+----------+------------+
> >> | util | count | without   | with     | ratio      |
> >> +------+-------+-----------+----------+------------+
> >> | 4    | 8     | 0.154992  | 0.049429 | -68.108419 |
> >> | 4    | 12    | 0.132593  | 0.061762 | -53.420202 |
> >> | 4    | 16    | 6.798091  | 4.606102 | -32.244179 |
> >> | 8    | 8     | 1.360703  | 0.174626 | -87.166465 |
> >> | 8    | 12    | 0.519704  | 0.250469 | -51.805502 |
> >> | 8    | 16    | 12.114269 | 8.969281 | -25.961019 |
> >> +------+-------+-----------+----------+------------+
> >> Std:
> >> +------+-------+----------+----------+
> >> | util | count | without  | with     |
> >> +------+-------+----------+----------+
> >> | 4    | 8     | 0.212919 | 0.036856 |
> >> | 4    | 12    | 0.069696 | 0.060257 |
> >> | 4    | 16    | 0.63995  | 0.542028 |
> >> | 8    | 8     | 2.158079 | 0.211775 |
> >> | 8    | 12    | 0.089159 | 0.187436 |
> >> | 8    | 16    | 0.798565 | 1.669003 |
> >> +------+-------+----------+----------+
> >>
> >> ------
> >>
> >> Analysis:
> >>
> >> - [1]
> >> Without the patch, two tasks end up stacked on one little CPU. According
> >> to the energy model, this consumes less energy than using the medium/big
> >> CPU, but EAS should not be able to make such a placement, since the
> >> little CPU becomes overutilized. As a result, the system is overutilized
> >> far more often without the patch than with it.
> >>
> >> -
> >> Looking at the overutilized ratio, being overutilized 0.5% of the time
> >> versus 0.05% of the time might seem close, but over the 20s test that is
> >> 100ms versus 10ms of overutilization, i.e. EAS ended up making a bad
> >> task placement on multiple, independent occasions.
> >>
> >> -
> >> The overutilized ratio should be checked alongside the energy results,
> >> as it shows how much EAS was actually involved in the task placement.
> >>
> >> -
> >> Overall, less energy is consumed; the amount saved varies with the
> >> workload.
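> >>
> >> -
> >> For reference, the ratio columns above are the relative change of
> >> 'with' versus 'without', i.e. 100 * (with - without) / without. E.g.
> >> for the Workload2 measured energy at util=8, count=8:
> >> 100 * (6961.782086 - 7485.230678) / 7485.230678 = -6.993086,
> >> matching the table.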
> >>
> >> ------
> >>
> >> On another note, I wanted to ask whether there will be a v2 of the
> >> present patchset (sched/fair: Rework EAS to handle more cases),
> >
> > Yes, I have been sidetracked by other work since LPC and haven't been
> > able to finalize testing of v2, but it's ongoing.
>
> Ok thanks!
>
> >
> >>
> >> Regards,
> >> Pierre
> >>
> >> ------
> >>
> >>
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index bb343136ddd0..812d5bf88875 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -1592,6 +1592,12 @@ struct task_struct {
> >>           struct user_event_mm            *user_event_mm;
> >>    #endif
> >>
> >> +       /*
> >> +        * Keep track of the CPU feec() migrated this task to, so the
> >> +        * per-cpu 'eas_pending_enqueue' flag can be cleared at enqueue.
> >> +        */
> >> +       int eas_target_cpu;
> >> +
> >>           /*
> >>            * New fields for task_struct should be added above here, so that
> >>            * they are included in the randomized portion of task_struct.
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index c157d4860a3b..34911eb059cf 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6945,6 +6945,8 @@ requeue_delayed_entity(struct sched_entity *se)
> >>           se->sched_delayed = 0;
> >>    }
> >>
> >> +DEFINE_PER_CPU(atomic_t, eas_pending_enqueue);
> >> +
> >>    /*
> >>     * The enqueue_task method is called before nr_running is
> >>     * increased. Here we update the fair scheduling stats and
> >> @@ -7064,6 +7066,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>                   check_update_overutilized_status(rq);
> >>
> >>    enqueue_throttle:
> >> +       if (p->eas_target_cpu != -1) {
> >> +               atomic_set(&per_cpu(eas_pending_enqueue, p->eas_target_cpu), 0);
> >> +               p->eas_target_cpu = -1;
> >> +       }
> >> +
> >>           assert_list_leaf_cfs_rq(rq);
> >>
> >>           hrtick_update(rq);
> >> @@ -8451,6 +8458,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >>                           if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> >>                                   continue;
> >>
> >> +                       /* Skip this CPU: its util signal does not yet account for a pending enqueue. */
> >> +                       if (atomic_read(&per_cpu(eas_pending_enqueue, cpu)) &&
> >> +                           cpu != prev_cpu)
> >> +                               continue;
> >> +
> >>                           util = cpu_util(cpu, p, cpu, 0);
> >>                           cpu_cap = capacity_of(cpu);
> >>
> >> @@ -8560,6 +8572,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >>               ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
> >>                   target = best_energy_cpu;
> >>
> >> +       /*
> >> +        * 'Lock' the target CPU if there is a migration. Prevent other
> >> +        * feec() calls from using the same target CPU until its util
> >> +        * signals have been updated.
> >> +        */
> >> +       if (prev_cpu != target) {
> >> +               if (!atomic_cmpxchg_acquire(&per_cpu(eas_pending_enqueue, target), 0, 1))
> >> +                       p->eas_target_cpu = target;
> >> +               else
> >> +                       target = prev_cpu;
> >> +       }
> >> +
> >>           return target;
> >>
> >>    unlock:
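> >>
> >> ------
> >>
> >> For reference, a minimal, self-contained userspace sketch of the
> >> reservation protocol the diff above implements. It uses C11 atomics
> >> (seq_cst for simplicity, where the patch uses an acquire cmpxchg);
> >> note that the kernel's atomic_cmpxchg_acquire() returns the old value,
> >> while the C11 CAS below returns a success flag. NR_CPUS, struct task,
> >> reserve_target() and clear_reservation() are illustrative stand-ins,
> >> not the patch's actual code:
> >>
> >> #include <stdatomic.h>
> >> #include <stdio.h>
> >>
> >> #define NR_CPUS 8
> >>
> >> /* 1 while a feec()-selected migration has not been enqueued yet. */
> >> static atomic_int eas_pending_enqueue[NR_CPUS];
> >>
> >> struct task {
> >>         int eas_target_cpu;     /* CPU reserved by feec(), -1 if none */
> >> };
> >>
> >> /*
> >>  * feec() side: try to reserve @target for @p when migrating away from
> >>  * @prev_cpu; fall back to @prev_cpu if the target is already reserved.
> >>  */
> >> static int reserve_target(struct task *p, int prev_cpu, int target)
> >> {
> >>         int expected = 0;
> >>
> >>         if (prev_cpu == target)
> >>                 return target;
> >>
> >>         if (atomic_compare_exchange_strong(&eas_pending_enqueue[target],
> >>                                            &expected, 1))
> >>                 p->eas_target_cpu = target;     /* reservation taken */
> >>         else
> >>                 target = prev_cpu;              /* already reserved */
> >>
> >>         return target;
> >> }
> >>
> >> /*
> >>  * enqueue side: the task has landed and its util is now accounted on
> >>  * the CPU, so other feec() calls may consider this CPU again.
> >>  */
> >> static void clear_reservation(struct task *p)
> >> {
> >>         if (p->eas_target_cpu != -1) {
> >>                 atomic_store(&eas_pending_enqueue[p->eas_target_cpu], 0);
> >>                 p->eas_target_cpu = -1;
> >>         }
> >> }
> >>
> >> int main(void)
> >> {
> >>         struct task a = { .eas_target_cpu = -1 };
> >>         struct task b = { .eas_target_cpu = -1 };
> >>
> >>         /* a reserves CPU 3; a concurrent call for b must fall back. */
> >>         printf("a -> cpu %d\n", reserve_target(&a, 0, 3)); /* 3 */
> >>         printf("b -> cpu %d\n", reserve_target(&b, 1, 3)); /* 1 */
> >>
> >>         clear_reservation(&a);
> >>         printf("b retry -> cpu %d\n", reserve_target(&b, 1, 3)); /* 3 */
> >>         return 0;
> >> }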
Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
Posted by Hongyan Xia 1 year ago
Hi Vincent,

On 30/08/2024 14:03, Vincent Guittot wrote:
> [...]

On second look, I do wonder if this series should be split into
individual patches or mini-series. Some of the ideas, like
overloaded_groups or calling EAS from more locations than just wake-up
events, might be easier to review and merge if they were independent.
Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
Posted by Vincent Guittot 1 year ago
Hi Hongyan,

On Thu, 28 Nov 2024 at 18:24, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>
> Hi Vincent,
>
> On 30/08/2024 14:03, Vincent Guittot wrote:
> > [...]
>
> On second look, I do wonder if this series should be split into
> individual patches or mini-series. Some of the ideas, like
> overloaded_groups or calling EAS from more locations than just wake-up
> events, might be easier to review and merge if they were independent.

The series is almost ready. I was waiting for v6.12 support on a device
like the Pixel 6 to run some benchmarks, but it is not yet publicly
available, so I might send the series without those figures. I also want
to test it with delayed dequeue enabled this time, unlike the previous
version:
https://lore.kernel.org/lkml/20241129161756.3081386-1-vincent.guittot@linaro.org/