util_est signal does not decay if the task utilization is lower
than its runnable signal by a value of 10. This was done to keep
the util_est signal high in case a task shares a rq with another
task and doesn't obtain a desired running time.
However, tasks sharing a rq obtain the running time they desire
provided that the rq has some idle time. Indeed, either:
- a CPU is always running. The utilization signal of tasks reflects
the running time they obtained. This running time depends on the
niceness of the tasks. A decreasing utilization signal doesn't
reflect a decrease of the task activity and the util_est signal
should not be decayed in this case.
- a CPU is not always running (i.e. there is some idle time). Tasks
might be waiting to run, increasing their runnable signal, but
eventually run to completion. A decreasing utilization signal
does reflect a decrease of the task activity and the util_est
signal should be decayed in this case.
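In code terms, the two cases above map to the following condition change in
util_est_update() (a sketch extracted from the diff below; UTIL_EST_MARGIN
and rq_no_idle_pelt() come from it):

        /* base: skip util_est decay when runnable exceeds util by a margin */
        if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
                goto done;

        /* patch: skip util_est decay when the rq had no idle time */
        if (rq_no_idle_pelt(rq_of(cfs_rq)))
                goto done;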
------------
Running on a 1024 capacity CPU:
- TaskA:
- duty_cycle=60%, period=16ms, duration=2s
- TaskB:
- sleep=2ms (to let TaskA run first), duration=1s
- first phase: duty_cycle=20%, period=16ms, duration=1s
- second phase: duty_cycle=5%, period=16ms, duration=1s
When TaskB starts the second phase, the util_est signal can take up to
~150ms before starting to decrease. Indeed, if TaskB runs after
TaskA, its runnable signal will be higher than its util signal by more
than 10 units.
This creates unnecessary frequency spikes: upon enqueuing TaskB,
util_est_cfs is increased by the old value of util_est of TaskB,
impacting frequency selection through:
sugov_get_util()
\-cpu_util_cfs_boost()
\-cpu_util()
util_est = READ_ONCE(cfs_rq->avg.util_est);
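A standalone model of the effect (a sketch: the util values are illustrative
and the ~1.25 headroom factor mirrors schedutil's get_next_freq(); the real
code lives in kernel/sched/cpufreq_schedutil.c and kernel/sched/fair.c):

#include <stdio.h>

/* cpu_util() roughly returns max(util_avg, util_est) of the cfs_rq */
static unsigned long cpu_util_model(unsigned long util_avg,
                                    unsigned long util_est)
{
        return util_avg > util_est ? util_avg : util_est;
}

/* schedutil maps util to freq with ~25% headroom */
static unsigned long freq_model(unsigned long util, unsigned long max_freq)
{
        return (max_freq + (max_freq >> 2)) * util / 1024;
}

int main(void)
{
        unsigned long util_avg = 51;            /* TaskB in its 5% phase */
        unsigned long stale_est = 205;          /* util_est kept from the 20% phase */
        unsigned long max_freq = 2000000;       /* kHz, arbitrary */

        /* the stale util_est requests ~500 MHz where ~125 MHz would do */
        printf("stale util_est:   %lu kHz\n",
               freq_model(cpu_util_model(util_avg, stale_est), max_freq));
        printf("decayed util_est: %lu kHz\n",
               freq_model(cpu_util_model(util_avg, util_avg), max_freq));
        return 0;
}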
------------
Energy impact can also be measured as suggested by Hongyan at [1].
On a Pixel6, the following workload is run 10 times:
- TaskA:
- duty_cycle=20%, duration=0.4s
- one task per mid and big CPU (2 mid and 2 big CPUs, so 4 in total)
- Used to increase the runnable signals of TaskB
- TaskB:
- sleep=2ms (to let TaskA run first)
- first phase: duty_cycle=20%, duration=0.2s
- second phase: duty_cycle=5%, duration=0.2s
- 4 occurrences of TaskB
The duration of the workload is short (0.4s) to emphasize the impact of
continuing to run at an overestimated frequency.
Energy measured with energy counters:
┌────────────┬────────────┬───────────┬───────────┬───────────┐
│ base mean ┆ patch mean ┆ base std ┆ patch std ┆ ratio (%) │
╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
│ 536.412419 ┆ 487.232935 ┆ 68.591493 ┆ 66.862019 ┆ -9.16 │
└────────────┴────────────┴───────────┴───────────┴───────────┘
Energy computed from util signals and energy model:
┌────────────┬────────────┬───────────┬───────────┬───────────┐
│ base mean ┆ patch mean ┆ base std ┆ patch std ┆ ratio (%) │
╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
│ 4.8318e9 ┆ 4.0823e9 ┆ 5.1044e8 ┆ 7.5558e8 ┆ -15.51 │
└────────────┴────────────┴───────────┴───────────┴───────────┘
------------
The initial patch [2] aimed to solve an issue detected while running
speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
versions are compared:
- base: the current version
- patch: the new version, with this patch applied
- revert: the initial version, with commit [2] reverted
Score (higher is better):
┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
│ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
│ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┘
┌───────────┬───────────┬────────────┐
│ base std ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪════════════╡
│ 0.57 ┆ 0.49 ┆ 0.58 │
└───────────┴───────────┴────────────┘
Energy measured with energy counters:
┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
│ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
│ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┘
┌───────────┬───────────┬────────────┐
│ base std ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪════════════╡
│ 1347.13 ┆ 2431.67 ┆ 510.88 │
└───────────┴───────────┴────────────┘
Energy computed from util signals and energy model:
┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
│ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
│ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┘
┌───────────┬───────────┬────────────┐
│ base std ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪════════════╡
│ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
└───────────┴───────────┴────────────┘
OU ratio in % (ratio of time being overutilized over total time).
The test lasts ~65s:
┌────────────┬────────────┬─────────────┐
│ base mean ┆ patch mean ┆ revert mean │
╞════════════╪════════════╪═════════════╡
│ 63.39% ┆ 12.48% ┆ 12.28% │
└────────────┴────────────┴─────────────┘
┌───────────┬───────────┬─────────────┐
│ base std ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪═════════════╡
│ 0.97 ┆ 0.28 ┆ 0.88 │
└───────────┴───────────┴─────────────┘
The energy gain can be explained by the fact that the system is
overutilized during most of the test with the base version.
During the test, the base condition is evaluated to true ~40%
of the time. The new condition is evaluated to true ~2% of
the time. Preventing util_est signals from decaying with the base
condition has a significant impact on the overutilized state
due to an overestimation of the resulting utilization of tasks.
The score is impacted by the patch, but:
- slightly lower scores are expected with EAS running more often
- the base version makes the system run at higher frequencies by
overestimating task utilization, so it is expected to have higher scores
------------
Decrease util_est when the rq has some idle time instead of a delta
between util and runnable signals.
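For reference, the threshold used by rq_no_idle_pelt() in the hunk below can
be checked with a small standalone program (a sketch; the constants are
assumed to match the kernel's PELT definitions, i.e. LOAD_AVG_MAX = 47742 and
PELT_MIN_DIVIDER = LOAD_AVG_MAX - 1024):

#include <stdio.h>
#include <stdint.h>

#define LOAD_AVG_MAX            47742
#define PELT_MIN_DIVIDER        (LOAD_AVG_MAX - 1024)
#define SCHED_CAPACITY_SHIFT    10

int main(void)
{
        /* a util_sum at or above this value means 'no idle time' on the rq */
        uint32_t divider = ((uint32_t)PELT_MIN_DIVIDER << SCHED_CAPACITY_SHIFT)
                           - LOAD_AVG_MAX;

        printf("no-idle threshold on util_sum: %u\n", divider);
        /* prints 1022, i.e. just below the 1023 util_avg upper bound */
        printf("equivalent util_avg: %u\n", divider / PELT_MIN_DIVIDER);
        return 0;
}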
[1] https://lore.kernel.org/lkml/39cde23a-19d8-4e64-a1d2-f26bce264883@arm.com/
[2] commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task utilization")
https://lore.kernel.org/lkml/20231122140119.472110-1-vincent.guittot@linaro.org/
[3] https://lore.kernel.org/lkml/CAKfTPtDd-HhF-YiNTtL9i5k0PfJbF819Yxu4YquzfXgwi7voyw@mail.gmail.com/#t
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
---
kernel/sched/fair.c | 2 +-
kernel/sched/pelt.h | 23 ++++++++++++++++-------
2 files changed, 17 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38512de..d058ab29e52e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* To avoid underestimate of task utilization, skip updates of EWMA if
* we cannot grant that thread got all CPU time it wanted.
*/
- if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
+ if (rq_no_idle_pelt(rq_of(cfs_rq)))
goto done;
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index f4f6a0875c66..eaf34f95fd93 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -123,21 +123,30 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
}
/*
- * When rq becomes idle, we have to check if it has lost idle time
- * because it was fully busy. A rq is fully used when the /Sum util_sum
- * is greater or equal to:
+ * A rq is fully used (or is considered to not have enough idle time)
+ * when the /Sum util_sum is greater or equal to:
* (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
* For optimization and computing rounding purpose, we don't take into account
* the position in the current window (period_contrib) and we use the higher
* bound of util_sum to decide.
*/
-static inline void update_idle_rq_clock_pelt(struct rq *rq)
+static inline bool rq_no_idle_pelt(struct rq *rq)
{
- u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
- u32 util_sum = rq->cfs.avg.util_sum;
+ u32 util_sum, divider;
+
+ divider = (PELT_MIN_DIVIDER << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
+ util_sum = rq->cfs.avg.util_sum;
util_sum += rq->avg_rt.util_sum;
util_sum += rq->avg_dl.util_sum;
+ return util_sum >= divider;
+}
+/*
+ * When rq becomes idle, we have to check if it has lost idle time
+ * because it was fully busy.
+ */
+static inline void update_idle_rq_clock_pelt(struct rq *rq)
+{
/*
* Reflecting stolen time makes sense only if the idle
* phase would be present at max capacity. As soon as the
@@ -147,7 +156,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
* this case. We keep track of this lost idle time compare to
* rq's clock_task.
*/
- if (util_sum >= divider)
+ if (rq_no_idle_pelt(rq))
rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
_update_idle_rq_clock_pelt(rq);
--
2.25.1
On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> util_est signal does not decay if the task utilization is lower
> than its runnable signal by a value of 10. This was done to keep
The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
worth updating util_est
> the util_est signal high in case a task shares a rq with another
> task and doesn't obtain a desired running time.
>
> However, tasks sharing a rq obtain the running time they desire
> provided that the rq has some idle time. Indeed, either:
> - a CPU is always running. The utilization signal of tasks reflects
> the running time they obtained. This running time depends on the
> niceness of the tasks. A decreasing utilization signal doesn't
> reflect a decrease of the task activity and the util_est signal
> should not be decayed in this case.
> - a CPU is not always running (i.e. there is some idle time). Tasks
> might be waiting to run, increasing their runnable signal, but
> eventually run to completion. A decreasing utilization signal
> does reflect a decrease of the task activity and the util_est
> signal should be decayed in this case.
This is not always true
Run a task 40ms with a period of 100ms alone on the biggest cpu at max
compute capacity. its util_avg is up to 674 at dequeue as well as its
util_est
Then start a 2nd task with the exact same behavior on the same cpu.
The util_avg of this 2nd task will be only 496 at dequeue as well as
its util_est but there is still 20ms of idle time. Furthermore, the
util_avg of the 1st task is also around 496 at dequeue but
>
> ------------
>
> Running on a 1024 capacity CPU:
> - TaskA:
> - duty_cycle=60%, period=16ms, duration=2s
> - TaskB:
> - sleep=2ms (to let TaskA run first), duration=1s
> - first phase: duty_cycle=20%, period=16ms, duration=1s
> - second phase: duty_cycle=5%, period=16ms, duration=1s
>
> When TaskB starts the second phase, the util_est signal can take up to
> ~150ms before starting to decrease. Indeed, if TaskB runs after
> TaskA, its runnable signal will be higher than its util signal by more
> than 10 units.
> This creates unnecessary frequency spikes: upon enqueuing TaskB,
> util_est_cfs is increased by the old value of util_est of TaskB,
> impacting frequency selection through:
> sugov_get_util()
> \-cpu_util_cfs_boost()
> \-cpu_util()
> util_est = READ_ONCE(cfs_rq->avg.util_est);
>
> ------------
>
> Energy impact can also be measured as suggested by Hongyan at [1].
> On a Pixel6, the following workload is run 10 times:
> - TaskA:
> - duty_cycle=20%, duration=0.4s
> - one task per mid and big CPU (2 mid and 2 big CPUs, so 4 in total)
> - Used to increase the runnable signals of TaskB
> - TaskB:
> - sleep=2ms (to let TaskA run first)
> - first phase: duty_cycle=20%, duration=0.2s
> - second phase: duty_cycle=5%, duration=0.2s
> - 4 occurrences of TaskB
>
> The duration of the workload is short (0.4s) to emphasize the impact of
> continuing to run at an overestimated frequency.
>
> Energy measured with energy counters:
> ┌────────────┬────────────┬───────────┬───────────┬───────────┐
> │ base mean ┆ patch mean ┆ base std ┆ patch std ┆ ratio (%) │
> ╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
> │ 536.412419 ┆ 487.232935 ┆ 68.591493 ┆ 66.862019 ┆ -9.16 │
> └────────────┴────────────┴───────────┴───────────┴───────────┘
>
> Energy computed from util signals and energy model:
> ┌────────────┬────────────┬───────────┬───────────┬───────────┐
> │ base mean ┆ patch mean ┆ base std ┆ patch std ┆ ratio (%) │
> ╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
> │ 4.8318e9 ┆ 4.0823e9 ┆ 5.1044e8 ┆ 7.5558e8 ┆ -15.51 │
> └────────────┴────────────┴───────────┴───────────┴───────────┘
>
> ------------
>
> The initial patch [2] aimed to solve an issue detected while running
> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
> versions are compared:
> - base: the current version
What do you mean by current version ? tip/sched/core ?
> - patch: the new version, with this patch applied
> - revert: the initial version, with commit [2] reverted
>
> Score (higher is better):
> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> │ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> ┌───────────┬───────────┬────────────┐
> │ base std ┆ patch std ┆ revert std │
> ╞═══════════╪═══════════╪════════════╡
> │ 0.57 ┆ 0.49 ┆ 0.58 │
> └───────────┴───────────┴────────────┘
>
> Energy measured with energy counters:
> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> │ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> ┌───────────┬───────────┬────────────┐
> │ base std ┆ patch std ┆ revert std │
> ╞═══════════╪═══════════╪════════════╡
> │ 1347.13 ┆ 2431.67 ┆ 510.88 │
> └───────────┴───────────┴────────────┘
>
> Energy computed from util signals and energy model:
> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> │ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> ┌───────────┬───────────┬────────────┐
> │ base std ┆ patch std ┆ revert std │
> ╞═══════════╪═══════════╪════════════╡
> │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
> └───────────┴───────────┴────────────┘
>
> OU ratio in % (ratio of time being overutilized over total time).
> The test lasts ~65s:
> ┌────────────┬────────────┬─────────────┐
> │ base mean ┆ patch mean ┆ revert mean │
> ╞════════════╪════════════╪═════════════╡
> │ 63.39% ┆ 12.48% ┆ 12.28% │
> └────────────┴────────────┴─────────────┘
> ┌───────────┬───────────┬─────────────┐
> │ base std ┆ patch std ┆ revert std │
> ╞═══════════╪═══════════╪═════════════╡
> │ 0.97 ┆ 0.28 ┆ 0.88 │
> └───────────┴───────────┴─────────────┘
>
> The energy gain can be explained by the fact that the system is
> overutilized during most of the test with the base version.
>
> During the test, the base condition is evaluated to true ~40%
> of the time. The new condition is evaluated to true ~2% of
> the time. Preventing util_est signals from decaying with the base
> condition has a significant impact on the overutilized state
> due to an overestimation of the resulting utilization of tasks.
>
> The score is impacted by the patch, but:
> - slightly lower scores are expected with EAS running more often
> - the base version makes the system run at higher frequencies by
> overestimating task utilization, so it is expected to have higher scores
I'm not sure I get what you are trying to solve here?
>
> ------------
>
> Decrease util_est when the rq has some idle time instead of a delta
> between util and runnable signals.
>
> [1] https://lore.kernel.org/lkml/39cde23a-19d8-4e64-a1d2-f26bce264883@arm.com/
> [2] commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task utilization")
> https://lore.kernel.org/lkml/20231122140119.472110-1-vincent.guittot@linaro.org/
> [3] https://lore.kernel.org/lkml/CAKfTPtDd-HhF-YiNTtL9i5k0PfJbF819Yxu4YquzfXgwi7voyw@mail.gmail.com/#t
>
> Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
> ---
> kernel/sched/fair.c | 2 +-
> kernel/sched/pelt.h | 23 ++++++++++++++++-------
> 2 files changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3e9ca38512de..d058ab29e52e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> * To avoid underestimate of task utilization, skip updates of EWMA if
> * we cannot grant that thread got all CPU time it wanted.
> */
> - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> + if (rq_no_idle_pelt(rq_of(cfs_rq)))
You can't use here the test that is done in
update_idle_rq_clock_pelt() to detect if we lost some idle time
because this test is only relevant when the rq becomes idle which is
not the case here
With this test you skip completely the cases where the task has to
share the CPU with others. As an example on the pixel 6, the little
cpus must run more than 1.2 seconds at its max freq before detecting
that there is no idle time
> goto done;
>
>
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index f4f6a0875c66..eaf34f95fd93 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -123,21 +123,30 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
> }
>
> /*
> - * When rq becomes idle, we have to check if it has lost idle time
> - * because it was fully busy. A rq is fully used when the /Sum util_sum
> - * is greater or equal to:
> + * A rq is fully used (or is considered to not have enough idle time)
> + * when the /Sum util_sum is greater or equal to:
> * (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
> * For optimization and computing rounding purpose, we don't take into account
> * the position in the current window (period_contrib) and we use the higher
> * bound of util_sum to decide.
> */
> -static inline void update_idle_rq_clock_pelt(struct rq *rq)
> +static inline bool rq_no_idle_pelt(struct rq *rq)
> {
> - u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
> - u32 util_sum = rq->cfs.avg.util_sum;
> + u32 util_sum, divider;
> +
> + divider = (PELT_MIN_DIVIDER << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
> + util_sum = rq->cfs.avg.util_sum;
> util_sum += rq->avg_rt.util_sum;
> util_sum += rq->avg_dl.util_sum;
> + return util_sum >= divider;
> +}
>
> +/*
> + * When rq becomes idle, we have to check if it has lost idle time
> + * because it was fully busy.
> + */
> +static inline void update_idle_rq_clock_pelt(struct rq *rq)
> +{
> /*
> * Reflecting stolen time makes sense only if the idle
> * phase would be present at max capacity. As soon as the
> @@ -147,7 +156,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
> * this case. We keep track of this lost idle time compare to
> * rq's clock_task.
> */
> - if (util_sum >= divider)
> + if (rq_no_idle_pelt(rq))
> rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
>
> _update_idle_rq_clock_pelt(rq);
> --
> 2.25.1
>
On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@arm.com> wrote:
> >
> > util_est signal does not decay if the task utilization is lower
> > than its runnable signal by a value of 10. This was done to keep
>
> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
> worth updating util_est
>
> > the util_est signal high in case a task shares a rq with another
> > task and doesn't obtain a desired running time.
> >
> > However, tasks sharing a rq obtain the running time they desire
> > provided that the rq has some idle time. Indeed, either:
> > - a CPU is always running. The utilization signal of tasks reflects
> > the running time they obtained. This running time depends on the
> > niceness of the tasks. A decreasing utilization signal doesn't
> > reflect a decrease of the task activity and the util_est signal
> > should not be decayed in this case.
> > - a CPU is not always running (i.e. there is some idle time). Tasks
> > might be waiting to run, increasing their runnable signal, but
> > eventually run to completion. A decreasing utilization signal
> > does reflect a decrease of the task activity and the util_est
> > signal should be decayed in this case.
>
> This is not always true
> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
> compute capacity. its util_avg is up to 674 at dequeue as well as its
> util_est
> Then start a 2nd task with the exact same behavior on the same cpu.
> The util_avg of this 2nd task will be only 496 at dequeue as well as
> its util_est but there is still 20ms of idle time. Furthermore, the
> util_avg of the 1st task is also around 496 at dequeue but
the end of the sentence was missing...
but there is still 20ms of idle time.
>
> >
> > ------------
> >
> > Running on a 1024 capacity CPU:
> > - TaskA:
> > - duty_cycle=60%, period=16ms, duration=2s
> > - TaskB:
> > - sleep=2ms (to let TaskA run first), duration=1s
> > - first phase: duty_cycle=20%, period=16ms, duration=1s
> > - second phase: duty_cycle=5%, period=16ms, duration=1s
> >
> > When TaskB starts the second phase, the util_est signal can take up to
> > ~150ms before starting to decrease. Indeed, if TaskB runs after
> > TaskA, its runnable signal will be higher than its util signal by more
> > than 10 units.
> > This creates unnecessary frequency spikes: upon enqueuing TaskB,
> > util_est_cfs is increased by the old value of util_est of TaskB,
> > impacting frequency selection through:
> > sugov_get_util()
> > \-cpu_util_cfs_boost()
> > \-cpu_util()
> > util_est = READ_ONCE(cfs_rq->avg.util_est);
> >
> > ------------
> >
> > Energy impact can also be measured as suggested by Hongyan at [1].
> > On a Pixel6, the following workload is run 10 times:
> > - TaskA:
> > - duty_cycle=20%, duration=0.4s
> > - one task per mid and big CPU (2 mid and 2 big CPUs, so 4 in total)
> > - Used to increase the runnable signals of TaskB
> > - TaskB:
> > - sleep=2ms (to let TaskA run first)
> > - first phase: duty_cycle=20%, duration=0.2s
> > - second phase: duty_cycle=5%, duration=0.2s
> > - 4 occurrences of TaskB
> >
> > The duration of the workload is short (0.4s) to emphasize the impact of
> > continuing to run at an overestimated frequency.
> >
> > Energy measured with energy counters:
> > ┌────────────┬────────────┬───────────┬───────────┬───────────┐
> > │ base mean ┆ patch mean ┆ base std ┆ patch std ┆ ratio (%) │
> > ╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
> > │ 536.412419 ┆ 487.232935 ┆ 68.591493 ┆ 66.862019 ┆ -9.16 │
> > └────────────┴────────────┴───────────┴───────────┴───────────┘
> >
> > Energy computed from util signals and energy model:
> > ┌────────────┬────────────┬───────────┬───────────┬───────────┐
> > │ base mean ┆ patch mean ┆ base std ┆ patch std ┆ ratio (%) │
> > ╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
> > │ 4.8318e9 ┆ 4.0823e9 ┆ 5.1044e8 ┆ 7.5558e8 ┆ -15.51 │
> > └────────────┴────────────┴───────────┴───────────┴───────────┘
> >
> > ------------
> >
> > The initial patch [2] aimed to solve an issue detected while running
> > speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
> > versions are compared:
> > - base: the current version
>
> What do you mean by current version ? tip/sched/core ?
>
> > - patch: the new version, with this patch applied
> > - revert: the initial version, with commit [2] reverted
> >
> > Score (higher is better):
> > ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> > │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> > ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> > │ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
> > └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> > ┌───────────┬───────────┬────────────┐
> > │ base std ┆ patch std ┆ revert std │
> > ╞═══════════╪═══════════╪════════════╡
> > │ 0.57 ┆ 0.49 ┆ 0.58 │
> > └───────────┴───────────┴────────────┘
> >
> > Energy measured with energy counters:
> > ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> > │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> > ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> > │ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
> > └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> > ┌───────────┬───────────┬────────────┐
> > │ base std ┆ patch std ┆ revert std │
> > ╞═══════════╪═══════════╪════════════╡
> > │ 1347.13 ┆ 2431.67 ┆ 510.88 │
> > └───────────┴───────────┴────────────┘
> >
> > Energy computed from util signals and energy model:
> > ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> > │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> > ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> > │ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
> > └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> > ┌───────────┬───────────┬────────────┐
> > │ base std ┆ patch std ┆ revert std │
> > ╞═══════════╪═══════════╪════════════╡
> > │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
> > └───────────┴───────────┴────────────┘
> >
> > OU ratio in % (ratio of time being overutilized over total time).
> > The test lasts ~65s:
> > ┌────────────┬────────────┬─────────────┐
> > │ base mean ┆ patch mean ┆ revert mean │
> > ╞════════════╪════════════╪═════════════╡
> > │ 63.39% ┆ 12.48% ┆ 12.28% │
> > └────────────┴────────────┴─────────────┘
> > ┌───────────┬───────────┬─────────────┐
> > │ base std ┆ patch std ┆ revert std │
> > ╞═══════════╪═══════════╪═════════════╡
> > │ 0.97 ┆ 0.28 ┆ 0.88 │
> > └───────────┴───────────┴─────────────┘
> >
> > The energy gain can be explained by the fact that the system is
> > overutilized during most of the test with the base version.
> >
> > During the test, the base condition is evaluated to true ~40%
> > of the time. The new condition is evaluated to true ~2% of
> > the time. Preventing util_est signals from decaying with the base
> > condition has a significant impact on the overutilized state
> > due to an overestimation of the resulting utilization of tasks.
> >
> > The score is impacted by the patch, but:
> > - slightly lower scores are expected with EAS running more often
> > - the base version makes the system run at higher frequencies by
> > overestimating task utilization, so it is expected to have higher scores
>
> I'm not sure I get what you are trying to solve here?
>
> >
> > ------------
> >
> > Decrease util_est when the rq has some idle time instead of a delta
> > between util and runnable signals.
> >
> > [1] https://lore.kernel.org/lkml/39cde23a-19d8-4e64-a1d2-f26bce264883@arm.com/
> > [2] commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task utilization")
> > https://lore.kernel.org/lkml/20231122140119.472110-1-vincent.guittot@linaro.org/
> > [3] https://lore.kernel.org/lkml/CAKfTPtDd-HhF-YiNTtL9i5k0PfJbF819Yxu4YquzfXgwi7voyw@mail.gmail.com/#t
> >
> > Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
> > ---
> > kernel/sched/fair.c | 2 +-
> > kernel/sched/pelt.h | 23 ++++++++++++++++-------
> > 2 files changed, 17 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3e9ca38512de..d058ab29e52e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> > * To avoid underestimate of task utilization, skip updates of EWMA if
> > * we cannot grant that thread got all CPU time it wanted.
> > */
> > - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> > + if (rq_no_idle_pelt(rq_of(cfs_rq)))
>
> You can't use here the test that is done in
> update_idle_rq_clock_pelt() to detect if we lost some idle time
> because this test is only relevant when the rq becomes idle which is
> not the case here
>
> With this test you skip completely the cases where the task has to
> share the CPU with others. As an example on the pixel 6, the little
> cpus must run more than 1.2 seconds at its max freq before detecting
> that there is no idle time
>
> > goto done;
> >
> >
> > diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> > index f4f6a0875c66..eaf34f95fd93 100644
> > --- a/kernel/sched/pelt.h
> > +++ b/kernel/sched/pelt.h
> > @@ -123,21 +123,30 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
> > }
> >
> > /*
> > - * When rq becomes idle, we have to check if it has lost idle time
> > - * because it was fully busy. A rq is fully used when the /Sum util_sum
> > - * is greater or equal to:
> > + * A rq is fully used (or is considered to not have enough idle time)
> > + * when the /Sum util_sum is greater or equal to:
> > * (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
> > * For optimization and computing rounding purpose, we don't take into account
> > * the position in the current window (period_contrib) and we use the higher
> > * bound of util_sum to decide.
> > */
> > -static inline void update_idle_rq_clock_pelt(struct rq *rq)
> > +static inline bool rq_no_idle_pelt(struct rq *rq)
> > {
> > - u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
> > - u32 util_sum = rq->cfs.avg.util_sum;
> > + u32 util_sum, divider;
> > +
> > + divider = (PELT_MIN_DIVIDER << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
> > + util_sum = rq->cfs.avg.util_sum;
> > util_sum += rq->avg_rt.util_sum;
> > util_sum += rq->avg_dl.util_sum;
> > + return util_sum >= divider;
> > +}
> >
> > +/*
> > + * When rq becomes idle, we have to check if it has lost idle time
> > + * because it was fully busy.
> > + */
> > +static inline void update_idle_rq_clock_pelt(struct rq *rq)
> > +{
> > /*
> > * Reflecting stolen time makes sense only if the idle
> > * phase would be present at max capacity. As soon as the
> > @@ -147,7 +156,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
> > * this case. We keep track of this lost idle time compare to
> > * rq's clock_task.
> > */
> > - if (util_sum >= divider)
> > + if (rq_no_idle_pelt(rq))
> > rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> >
> > _update_idle_rq_clock_pelt(rq);
> > --
> > 2.25.1
> >
On 20/12/2024 08:47, Vincent Guittot wrote:
> On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
>>
>> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>>
>>> util_est signal does not decay if the task utilization is lower
>>> than its runnable signal by a value of 10. This was done to keep
>>
>> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
>> worth updating util_est

Might be that UTIL_EST_MARGIN is just too small for this usecase? Maybe
the mechanism is too sensitive?

It triggers already when running 10 5% tasks on a Juno-r0 (446 1024 1024
446 446 446) in cases 2 tasks are scheduled on the same little CPU:

...
task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
...

I'm not sure if the original case (Speedometer on Pix6 ?) which led to
this implementation was tested with perf/energy numbers back then?

>>> the util_est signal high in case a task shares a rq with another
>>> task and doesn't obtain a desired running time.
>>>
>>> However, tasks sharing a rq obtain the running time they desire
>>> provided that the rq has some idle time. Indeed, either:
>>> - a CPU is always running. The utilization signal of tasks reflects
>>> the running time they obtained. This running time depends on the
>>> niceness of the tasks. A decreasing utilization signal doesn't
>>> reflect a decrease of the task activity and the util_est signal
>>> should not be decayed in this case.
>>> - a CPU is not always running (i.e. there is some idle time). Tasks
>>> might be waiting to run, increasing their runnable signal, but
>>> eventually run to completion. A decreasing utilization signal
>>> does reflect a decrease of the task activity and the util_est
>>> signal should be decayed in this case.
>>
>> This is not always true
>> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
>> compute capacity. its util_avg is up to 674 at dequeue as well as its
>> util_est
>> Then start a 2nd task with the exact same behavior on the same cpu.
>> The util_avg of this 2nd task will be only 496 at dequeue as well as
>> its util_est but there is still 20ms of idle time. Furthermore, the
>> util_avg of the 1st task is also around 496 at dequeue but
>
> the end of the sentence was missing...
>
> but there is still 20ms of idle time.

But these two tasks are still able to finish their activity within this
100ms window. So why should we keep their util_est values high when
dequeuing?

[...]

>>> The initial patch [2] aimed to solve an issue detected while running
>>> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
>>> versions are compared:
>>> - base: the current version
>>
>> What do you mean by current version ? tip/sched/core ?
>>
>>> - patch: the new version, with this patch applied
>>> - revert: the initial version, with commit [2] reverted

[...]

>>> The score is impacted by the patch, but:
>>> - slightly lower scores are expected with EAS running more often
>>> - the base version makes the system run at higher frequencies by
>>> overestimating task utilization, so it is expected to have higher scores
>>
>> I'm not sure I get what you are trying to solve here?

Yeah, the question is how much perf loss we accept for energy savings?
IMHO, impossible to answer generically based on one specific
workload/platform incarnation.

[...]
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 3e9ca38512de..d058ab29e52e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>>>          * To avoid underestimate of task utilization, skip updates of EWMA if
>>>          * we cannot grant that thread got all CPU time it wanted.
>>>          */
>>> -       if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
>>> +       if (rq_no_idle_pelt(rq_of(cfs_rq)))
>>
>> You can't use here the test that is done in
>> update_idle_rq_clock_pelt() to detect if we lost some idle time
>> because this test is only relevant when the rq becomes idle which is
>> not the case here

Do you mean this test ?

util_avg = util_sum / divider

util_sum >= divider * util_avg

with 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and upper
bound of the window (+ 1024):

util_sum >= (LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT - LOAD_AVG_MAX

Why can't we use it here?

>> With this test you skip completely the cases where the task has to
>> share the CPU with others. As an example on the pixel 6, the little

True. But I assume that's anticipated here. The assumption is that as
long as there is idle time, tasks get what they want in a time frame.

>> cpus must run more than 1.2 seconds at its max freq before detecting
>> that there is no idle time

BTW, I tried to figure out where the 1.2s comes from: 323ms * 1024/160 =
2.07s (with CPU capacity of Pix5 little CPU = 160)?

[...]
On Fri, 20 Dec 2024 at 15:48, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 20/12/2024 08:47, Vincent Guittot wrote:
> > On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> >>
> >> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@arm.com> wrote:
> >>>
> >>> util_est signal does not decay if the task utilization is lower
> >>> than its runnable signal by a value of 10. This was done to keep
> >>
> >> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
> >> worth updating util_est
>
> Might be that UTIL_EST_MARGIN is just too small for this usecase? Maybe
> the mechanism is too sensitive?

The default config is to follow util_est update

>
> It triggers already when running 10 5% tasks on a Juno-r0 (446 1024 1024
> 446 446 446) in cases 2 tasks are scheduled on the same little CPU:
>
> ...
> task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
> task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
> task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
> task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
> ...
>
> I'm not sure if the original case (Speedometer on Pix6 ?) which led to
> this implementation was tested with perf/energy numbers back then?
>
> >>> the util_est signal high in case a task shares a rq with another
> >>> task and doesn't obtain a desired running time.

[...]

> >> This is not always true
> >> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
> >> compute capacity. its util_avg is up to 674 at dequeue as well as its
> >> util_est
> >> Then start a 2nd task with the exact same behavior on the same cpu.
> >> The util_avg of this 2nd task will be only 496 at dequeue as well as
> >> its util_est but there is still 20ms of idle time. Furthermore, the
> >> util_avg of the 1st task is also around 496 at dequeue but
> >
> > the end of the sentence was missing...
> >
> > but there is still 20ms of idle time.
>
> But these two tasks are still able to finish their activity within this
> 100ms window. So why should we keep their util_est values high when
> dequeuing?

But then, the util_est decreases from the original value compared to
alone whereas its utilization is the same

>
> [...]
>
> >>> The initial patch [2] aimed to solve an issue detected while running
> >>> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
> >>> versions are compared:
> >>> - base: the current version
> >>
> >> What do you mean by current version ? tip/sched/core ?
> >>
> >>> - patch: the new version, with this patch applied
> >>> - revert: the initial version, with commit [2] reverted

[...]

> >>> The score is impacted by the patch, but:
> >>> - slightly lower scores are expected with EAS running more often
> >>> - the base version makes the system run at higher frequencies by
> >>> overestimating task utilization, so it is expected to have higher scores
> >>
> >> I'm not sure I get what you are trying to solve here?
>
> Yeah, the question is how much perf loss we accept for energy savings?
> IMHO, impossible to answer generically based on one specific
> workload/platform incarnation.

I think that your problem is somewhere else, like the fact that the /Sum
of util_est is always higher (and sometimes far higher) than the actual
max util_avg because we sum the max of all tasks and, as you can see in
my example above, the runnable but not running slice decreases the
util_avg of the task.

>
> [...]
>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 3e9ca38512de..d058ab29e52e 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> >>>          * To avoid underestimate of task utilization, skip updates of EWMA if
> >>>          * we cannot grant that thread got all CPU time it wanted.
> >>>          */
> >>> -       if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> >>> +       if (rq_no_idle_pelt(rq_of(cfs_rq)))
> >>
> >> You can't use here the test that is done in
> >> update_idle_rq_clock_pelt() to detect if we lost some idle time
> >> because this test is only relevant when the rq becomes idle which is
> >> not the case here
>
> Do you mean this test ?
>
> util_avg = util_sum / divider
>
> util_sum >= divider * util_avg
>
> with 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and upper
> bound of the window (+ 1024):
>
> util_sum >= (LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT - LOAD_AVG_MAX
>
> Why can't we use it here?

because of the example below, it makes the filtering a nop for a very
large time and you will be overutilized far before

>
> >> With this test you skip completely the cases where the task has to
> >> share the CPU with others. As an example on the pixel 6, the little
>
> True. But I assume that's anticipated here. The assumption is that as
> long as there is idle time, tasks get what they want in a time frame.
>
> >> cpus must run more than 1.2 seconds at its max freq before detecting
> >> that there is no idle time
>
> BTW, I tried to figure out where the 1.2s comes from: 323ms * 1024/160 =
> 2.07s (with CPU capacity of Pix5 little CPU = 160)?

yeah, I used the wrong rb5 little capacity instead of pixel6 but that's
even worse

>
> [...]
Hello Vincent,
Thanks for the review,
On 12/20/24 16:05, Vincent Guittot wrote:
> On Fri, 20 Dec 2024 at 15:48, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 20/12/2024 08:47, Vincent Guittot wrote:
>>> On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
>>> <vincent.guittot@linaro.org> wrote:
>>>>
>>>> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>>>>
>>>>> util_est signal does not decay if the task utilization is lower
>>>>> than its runnable signal by a value of 10. This was done to keep
>>>>
>>>> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
>>>> worth updating util_est
>> Might be that UTIL_EST_MARGIN is just too small for this usecase? Maybe
>> the mechanism is too sensitive?
>
> The default config is to follow util_est update
>
>>
>> It triggers already when running 10 5% tasks on a Juno-r0 (446 1024 1024
>> 446 446 446) in cases 2 tasks are scheduled on the same little CPU:
>>
>> ...
>> task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
>> task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
>> task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
>> task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
>> ...
>>
>> I'm not sure if the original case (Speedometer on Pix6 ?) which led to
>> this implementation was tested with perf/energy numbers back then?
>>
>>>>> the util_est signal high in case a task shares a rq with another
>>>>> task and doesn't obtain a desired running time.
>>>>>
>>>>> However, tasks sharing a rq obtain the running time they desire
>>>>> provided that the rq has some idle time. Indeed, either:
>>>>> - a CPU is always running. The utilization signal of tasks reflects
>>>>> the running time they obtained. This running time depends on the
>>>>> niceness of the tasks. A decreasing utilization signal doesn't
>>>>> reflect a decrease of the task activity and the util_est signal
>>>>> should not be decayed in this case.
>>>>> - a CPU is not always running (i.e. there is some idle time). Tasks
>>>>> might be waiting to run, increasing their runnable signal, but
>>>>> eventually run to completion. A decreasing utilization signal
>>>>> does reflect a decrease of the task activity and the util_est
>>>>> signal should be decayed in this case.
>>>>
>>>> This is not always true
>>>> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
>>>> compute capacity. its util_avg is up to 674 at dequeue as well as its
>>>> util_est
>>>> Then start a 2nd task with the exact same behavior on the same cpu.
>>>> The util_avg of this 2nd task will be only 496 at dequeue as well as
>>>> its util_est but there is still 20ms of idle time. Furthermore, the
>>>> util_avg of the 1st task is also around 496 at dequeue but
>>>
>>> the end of the sentence was missing...
>>>
>>> but there is still 20ms of idle time.
>>
>> But these two tasks are still able to finish their activity within this
>> 100ms window. So why should we keep their util_est values high when
>> dequeuing?
>
> But then, the util_est decreases from the original value compared to
> alone whereas its utilization is the same
In the example with one task, it is possible to have a utilization as high
as we want by increasing the period. With a period of 200ms, the task
reaches a utilization of 750, and with a period of 300ms the max utilization
is 870.
Having a high utilization at dequeue is useful information stored in
util_est. It makes it possible to track that even though the utilization of
the task had time to decrease, the task actually represents a big quantity of
instructions to execute. The task should be handled accordingly.
On the other hand, by decreasing the period, the lowest max utilization we
can get is 40% * 1024 = 410.
------------
By having 2 tasks sharing the CPU, the utilization graph is smoothed: one
big running phase of 40ms followed by 60ms of idle time becomes:
- when the 2 tasks are running, both tasks run alternately during one sched
slice ~=4ms, so the 40ms running phase becomes a periodic phase with a period
of 8ms and a duty cycle of 50%
- the 60ms idle time is reduced to 20ms idle time for each task
The fact that these tasks could run longer than one sched slice is reflected
in the runnable signal of the tasks.
The duty cycle of the tasks in the co-scheduling phase is 50% and the duty
cycle over the 100ms period is 40%. So the utilization of the tasks can reach
40% * 1024. This is ok, tasks don't prevent each other from reaching a
utilization value corresponding to their actual duty_cycle.
This patch intends to detect when a periodic task cannot reach a utilization
value of duty_cycle * 1024 due to other tasks requiring to run.
This would be the case for instance if there were 3 tasks with:
duty_cycle=40%, period=100ms, running during 300ms
In this case, the total running time of the CPU is:
3(tasks) * 40(ms) * 3(periods) = 360ms
There is no idle time during these 360ms and the utilization of tasks reaches
at most 369 (369 < 0.4*1024).
This is different from the case where the tasks' utilization is lower than
their runnable signal. The following task:
---
To get a high util_est / low utilization value:
- Run during a long period
- Idle during a long period
Then loop n times:
- Periodic during 80ms, period=8ms, duty_cycle=51%
(note that the duty_cycle is set to 51% to be sure the running time is
greater than a sched slice of 4ms)
- Idle during 20ms
---
would:
- allow decaying util_est during the looping phase if there was one task
- not allow decaying util_est during the looping phase if there were 2 tasks.
Indeed the runnable signal of the tasks would be higher than their util
signal.
However, the looping phase doesn't represent a long and continuous amount of
instructions to execute. The profile of the task changed and the util_est
value should reflect that.
Checking the delta between the runnable and utilization signals doesn't allow
detecting that the profile of the task changed. Indeed, a task being runnable
doesn't mean that it was running all the time it was runnable.
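For reproduction, the looping phase described above can be approximated with
a minimal standalone program (a sketch; rt-app would normally be used, the
initial long run/long idle phase is omitted, and the 4ms busy slice only
approximates the 51% duty cycle):

#include <time.h>
#include <unistd.h>

static void busy_wait_ms(long ms)
{
        struct timespec start, now;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) * 1000 +
                 (now.tv_nsec - start.tv_nsec) / 1000000 < ms);
}

int main(void)
{
        for (int loop = 0; loop < 10; loop++) {         /* 'loop n times' */
                /* 80ms periodic phase: period=8ms, duty_cycle ~50% */
                for (int i = 0; i < 10; i++) {
                        busy_wait_ms(4);        /* run ~4ms */
                        usleep(4000);           /* sleep the rest of the period */
                }
                usleep(20000);                  /* 20ms idle phase */
        }
        return 0;
}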
>
>>
>> [...]
>>
>>>>> The initial patch [2] aimed to solve an issue detected while running
>>>>> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
>>>>> versions are compared:
>>>>> - base: the current version
>>>>
>>>> What do you mean by current version ? tip/sched/core ?
I meant using the following condition:
(dequeued + UTIL_EST_MARGIN) < task_runnable(p)
>>>>
>>>>> - patch: the new version, with this patch applied
>>>>> - revert: the initial version, with commit [2] reverted
>>>>>
>>>>> Score (higher is better):
>>>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>>>> │ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
>>>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>>>> ┌───────────┬───────────┬────────────┐
>>>>> │ base std ┆ patch std ┆ revert std │
>>>>> ╞═══════════╪═══════════╪════════════╡
>>>>> │ 0.57 ┆ 0.49 ┆ 0.58 │
>>>>> └───────────┴───────────┴────────────┘
>>>>>
>>>>> Energy measured with energy counters:
>>>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>>>> │ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
>>>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>>>> ┌───────────┬───────────┬────────────┐
>>>>> │ base std ┆ patch std ┆ revert std │
>>>>> ╞═══════════╪═══════════╪════════════╡
>>>>> │ 1347.13 ┆ 2431.67 ┆ 510.88 │
>>>>> └───────────┴───────────┴────────────┘
>>>>>
>>>>> Energy computed from util signals and energy model:
>>>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>>>> │ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
>>>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>>>> ┌───────────┬───────────┬────────────┐
>>>>> │ base std ┆ patch std ┆ revert std │
>>>>> ╞═══════════╪═══════════╪════════════╡
>>>>> │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
>>>>> └───────────┴───────────┴────────────┘
>>>>>
>>>>> OU ratio in % (ratio of time being overutilized over total time).
>>>>> The test lasts ~65s:
>>>>> ┌────────────┬────────────┬─────────────┐
>>>>> │ base mean ┆ patch mean ┆ revert mean │
>>>>> ╞════════════╪════════════╪═════════════╡
>>>>> │ 63.39% ┆ 12.48% ┆ 12.28% │
>>>>> └────────────┴────────────┴─────────────┘
>>>>> ┌───────────┬───────────┬─────────────┐
>>>>> │ base std ┆ patch std ┆ revert std │
>>>>> ╞═══════════╪═══════════╪═════════════╡
>>>>> │ 0.97 ┆ 0.28 ┆ 0.88 │
>>>>> └───────────┴───────────┴─────────────┘
>>>>>
>>>>> The energy gain can be explained by the fact that the system is
>>>>> overutilized during most of the test with the base version.
>>>>>
>>>>> During the test, the base condition is evaluated to true ~40%
>>>>> of the time. The new condition is evaluated to true ~2% of
>>>>> the time. Preventing util_est signals from decaying with the base
>>>>> condition has a significant impact on the overutilized state
>>>>> due to an overestimation of the resulting utilization of tasks.
>>>>>
>>>>> The score is impacted by the patch, but:
>>>>> - it is expected to have slightly lower scores with EAS running more
>>>>> often
>>>>> - the base version makes the system run at higher frequencies by
>>>>> overestimating task utilization, so it is expected to have higher scores
>>>>
>>>> I'm not sure to get what you are trying to solve here ?
>>
>> Yeah, the question is how much perf loss we accept for energy savings?
>> IMHO, impossible to answer generically based on one specific
>> workload/platform incarnation.
>
> I think that your problem is somewhere else like the fact the /Sum of
> util_est is always higher (and sometime far higher) than the actual
> max util_avg because we sum the max of all tasks and as you can see in
> my example above, the runnable but not running slice decrease the
> util_avg of the task.
>
>>
>> [...]
>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 3e9ca38512de..d058ab29e52e 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>>>>> * To avoid underestimate of task utilization, skip updates of EWMA if
>>>>> * we cannot grant that thread got all CPU time it wanted.
>>>>> */
>>>>> - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
>>>>> + if (rq_no_idle_pelt(rq_of(cfs_rq)))
>>>>
>>>> You can't use here the test that is done in
>>>> update_idle_rq_clock_pelt() to detect if we lost some idle time
>>>> because this test is only relevant when the rq becomes idle which is
>>>> not the case here
>>
>> Do you mean this test ?
>>
>> util_avg = util_sum / divider
>>
>> util_sum >= divider * util_avg
>>
>> with 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and upper
>> bound of the window (+ 1024):
>>
>> util_sum >= ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX
>>
>> Why can't we use it here?
>
> because of the example below, it makes the filtering a nop for a very
> large time and you will be overutilized far before
To estimate the amount of time a task requires to reach a certain utilization
value, I did the following:
- Computing the accumulated sum of 'pelt graph' for the first 12 * 32ms.
diff --git a/Documentation/scheduler/sched-pelt.c b/Documentation/scheduler/sched-pelt.c
index 7238b355919c..62df61f6e968 100644
--- a/Documentation/scheduler/sched-pelt.c
+++ b/Documentation/scheduler/sched-pelt.c
@@ -38,7 +38,7 @@ void calc_runnable_avg_yN_sum(void)
 	int i;
 	printf("static const u32 runnable_avg_yN_sum[] = {\n\t 0,");
-	for (i = 1; i <= HALFLIFE; i++) {
+	for (i = 1; i <= 12 * HALFLIFE; i++) {
 		if (i == 1)
 			sum *= y;
 		else
@@ -103,7 +103,7 @@ void main(void)
 	y = pow(0.5, 1/(double)HALFLIFE);
 	calc_runnable_avg_yN_inv();
-//	calc_runnable_avg_yN_sum();
+	calc_runnable_avg_yN_sum();
 	calc_converged_max();
 //	calc_accumulated_sum_32();
 }
- This yields:
static const u32 runnable_avg_yN_sum[] = {
0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
23872,24362,24842,25311,25770,26219,26659,27089,27510,27922,28325,
28720,29106,29484,29854,30216,30570,30917,31256,31588,31913,32231,
32542,32846,33144,33435,33720,33999,34272,34539,34800,35056,35306,
35551,35791,36026,36256,36481,36701,36916,37127,37333,37535,37732,
37925,38114,38299,38480,38657,38830,39000,39166,39328,39487,39642,
39794,39943,40089,40232,40371,40507,40641,40772,40900,41025,41147,
41267,41384,41499,41611,41721,41829,41934,42037,42138,42237,42334,
42428,42520,42610,42699,42786,42871,42954,43035,43114,43192,43268,
43342,43415,43486,43556,43624,43691,43756,43820,43883,43944,44004,
44063,44120,44176,44231,44285,44338,44389,44439,44488,44536,44583,
44629,44674,44718,44761,44803,44845,44886,44926,44965,45003,45040,
45076,45112,45147,45181,45214,45247,45279,45310,45341,45371,45400,
45429,45457,45485,45512,45538,45564,45589,45614,45638,45662,45685,
45708,45730,45752,45773,45794,45814,45834,45853,45872,45891,45909,
45927,45944,45961,45978,45994,46010,46026,46041,46056,46071,46085,
46099,46113,46126,46139,46152,46165,46177,46189,46201,46213,46224,
46235,46246,46257,46267,46277,46287,46297,46307,46316,46325,46334,
46343,46352,46360,46368,46376,46384,46392,46399,46406,46413,46420,
46427,46434,46441,46447,46453,46459,46465,46471,46477,46483,46489,
46494,46499,46504,46509,46514,46519,46524,46529,46534,46538,46542,
46546,46550,46554,46558,46562,46566,46570,46574,46578,46581,46584,
46587,46590,46593,46596,46599,46602,46605,46608,46611,46614,46617,
46620,46623,46626,46628,46630,46632,46634,46636,46638,46640,46642,
46644,46646,46648,46650,46652,46654,46656,46658,46660,46662,46664,
46666,46668,46670,46672,46673,46674,46675,46676,46677,46678,46679,
46680,46681,46682,46683,46684,46685,46686,46687,46688,46689,46690,
46691,46692,46693,46694,46695,46696,46697,46698,46699,46700,46701,
46702,46703,46704,46705,46706,46707,46708,46709,46710,46711,46712,
46713,46714,46715,46716,46717,46718,46718,46718,46718,46718,46718,
46718,46718,46718,46718,46718,46718,46718,46718,46718,46718,46718,
46718,46718,46718,46718,46718,46718,46718,46718,46718,46718,46718,
46718,46718,46718,46718,46718,46718,46718,46718,46718,46718,46718,
};
- Due to some approximations during the computation (I presume) the accumulated
sum doesn't converge toward 47742, but toward 46718, so I'll use 46718.
- Computing the estimated utilization after Xms of running time with, for each
value V: estimated_util = 1024 * (V / 46718)
This yields the following util estimation array. For instance, after 12ms,
the estimated utilization is 234.
0: 0 21 43 64 85 105 124 144 163 181
10: 199 217 234 251 267 284 300 315 330 345
20: 360 374 388 401 415 428 441 453 465 477
30: 489 501 512 523 533 544 554 564 574 584
40: 593 602 612 620 629 637 646 654 662 670
50: 677 685 692 699 706 713 719 726 732 739
60: 745 751 757 762 768 773 779 784 789 794
70: 799 804 809 813 818 822 827 831 835 839
80: 843 847 851 854 858 862 865 868 872 875
90: 878 881 884 887 890 893 896 899 901 904
100: 907 909 912 914 916 919 921 923 925 927
110: 929 931 933 935 937 939 941 943 945 946
120: 948 950 951 953 954 956 957 959 960 961
130: 963 964 965 967 968 969 970 971 972 974
140: 975 976 977 978 979 980 981 982 982 983
150: 984 985 986 987 988 988 989 990 991 991
160: 992 993 993 994 995 995 996 996 997 998
170: 998 999 999 1000 1000 1001 1001 1002 1002 1003
180: 1003 1004 1004 1005 1005 1005 1006 1006 1007 1007
190: 1007 1008 1008 1008 1009 1009 1009 1010 1010 1010
200: 1011 1011 1011 1011 1012 1012 1012 1012 1013 1013
210: 1013 1013 1014 1014 1014 1014 1014 1015 1015 1015
220: 1015 1015 1016 1016 1016 1016 1016 1017 1017 1017
230: 1017 1017 1017 1017 1018 1018 1018 1018 1018 1018
240: 1018 1018 1019 1019 1019 1019 1019 1019 1019 1019
250: 1019 1020 1020 1020 1020 1020 1020 1020 1020 1020
260: 1020 1020 1020 1021 1021 1021 1021 1021 1021 1021
270: 1021 1021 1021 1021 1021 1021 1021 1021 1022 1022
280: 1022 1022 1022 1022 1022 1022 1022 1022 1022 1022
290: 1022 1022 1022 1022 1022 1022 1022 1022 1022 1022
300: 1022 1023 1023 1023 1023 1023 1023 1023 1023 1023
310: 1023 1023 1023 1023 1023 1023 1023 1023 1023 1023
320: 1023 1023 1023 1023 1023 1023 1023 1023 1023 1023
330: 1023 1023 1023 1023 1023 1023 1023 1023 1023 1023
340: 1023 1023 1023 1023 1023 1023 1024 1024 1024 1024
350: 1024 1024 1024 1024 1024 1024 1024 1024 1024 1024
360: 1024 1024 1024 1024 1024 1024 1024 1024 1024 1024
370: 1024 1024 1024 1024 1024 1024 1024 1024 1024 1024
380: 1024 1024 1024 1024 1024
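As a cross-check, the table can be re-derived with a few lines of standalone
C (an illustrative sketch, not part of the patch), by accumulating one decayed
contribution per 1024us period the way the modified sched-pelt.c does; values
may differ by one unit from the table due to integer rounding:

#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* HALFLIFE = 32 periods */
	double sum = 0.0;
	int i;

	for (i = 1; i <= 384; i++) {
		/* one more fully-running 1024us (~1ms) period, decayed by y */
		sum = sum * y + 1024.0 * y;
		if (i % 10 == 0)
			printf("%3dms: %4.0f\n", i, 1024.0 * sum / 46718.0);
	}
	return 0;
}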
On a Pixel6, a little CPU reached a utilization of 169 after 52ms of
continuous running time. The capacity of little CPUs is 160. So due to time
scaling, this should match the value at (52 * (160 / 1024)) = 8.12ms.
The estimated utilization at 8ms is 163, which indeed roughly matches. Just
to be accurate:
- 52ms: wall clock time
- 8ms: pelt scaled time
The other way around, a CPU with a capacity of 1003 will reach a utilization
of 1003 (cf the array above at 180ms) after (180 * (1024 / 1003)) = 183ms
- 183ms: wall clock time
- 180ms: pelt scaled time
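These two conversions can be made explicit with trivial helpers (illustrative
userspace sketch; the 160 and 1003 capacities are the values assumed above):

#include <stdio.h>

/* with time scaling, a CPU of capacity cap advances pelt time at
 * cap/1024 of the wall clock */
static unsigned int wall_to_pelt_ms(unsigned int wall_ms, unsigned int cap)
{
	return wall_ms * cap / 1024;
}

static unsigned int pelt_to_wall_ms(unsigned int pelt_ms, unsigned int cap)
{
	return pelt_ms * 1024 / cap;
}

int main(void)
{
	/* 52ms wall clock on a 160-capacity CPU ~ 8ms of pelt time */
	printf("%u\n", wall_to_pelt_ms(52, 160));
	/* 180ms of pelt time on a 1003-capacity CPU ~ 183ms wall clock */
	printf("%u\n", pelt_to_wall_ms(180, 1003));
	return 0;
}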
------------
All of this just to highlight that:
- being overutilized already depends on the capacity of a CPU
- the lower the capacity, the easier it is to become overutilized
This is if overutilized means 'having a CPU utilization reaching 80% of a
CPU capacity'.
If overutilized means 'not having enough compute power to correctly estimate
a task utilization', then indeed it takes 2.07s for a 160-capacity CPU to
realize that. But FWIU, this is the current behaviour as CPUs have the ability
to estimate a task utilization beyond their own capacity.
I don't see why having 2 tasks instead of 1 would make a difference, their
utilization would just rise half as fast as if they were alone on the CPU,
but nothing more IIUC.
------------
Also, I think the original issue is to detect cases where tasks cannot reach
a max utilization corresponding to their duty cycle. I.e. cases where the
utilization of a task is always strictly below the value
(task_duty_cycle * 1024). This being due to other tasks preventing them from
running as much as desired.
I don't think this is what happens when 2 tasks run on a non-big CPU, as long
as there is idle time on the non-big CPU. This even though their respective
utilization goes above the CPU capacity.
On a 512-capacity CPU, 2 periodic tasks with a duty cycle of 20% and a period
of 100ms should have correct utilization values, even if the utilization of the
CPU goes above its capacity. On the Pixel6 where mid CPUs have a capacity of
498, these tasks reach a utilization of 323, and the CPU reaches a utilization
of 662.
>
>>
>>>> With this test you skip completely the cases where the task has to
>>>> share the CPU with others. As an example on the pixel 6, the little
>>
>> True. But I assume that's anticipated here. The assumption is that as
>> long as there is idle time, tasks get what they want in a time frame.
>>
>>>> cpus must run more than 1.2 seconds at its max freq before detecting
>>>> that there is no idle time
>>
>> BTW, I tried to figure out where the 1.2s comes from: 323ms * 1024/160 =
>> 2.07s (with CPU capacity of Pix5 little CPU = 160)?
>
> yeah, I use the wrong rb5 little capacity instead of pixel6 but that even worse
>
>>
>> [...]
On Thu, 9 Jan 2025 at 16:32, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
>
> Thanks for the review,
>
> On 12/20/24 16:05, Vincent Guittot wrote:
> > On Fri, 20 Dec 2024 at 15:48, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 20/12/2024 08:47, Vincent Guittot wrote:
> >>> On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
> >>> <vincent.guittot@linaro.org> wrote:
> >>>>
> >>>> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@arm.com> wrote:
> >>>>>
> >>>>> util_est signal does not decay if the task utilization is lower
> >>>>> than its runnable signal by a value of 10. This was done to keep
> >>>>
> >>>> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
> >>>> worth updating util_est
> >> Might be that UTIL_EST_MARGIN is just too small for this usecase? Maybe
> >> the mechanism is too sensitive?
> >
> > The default config is to follow util_est update
> >
> >>
> >> It triggers already when running 10 5% tasks on a Juno-r0 (446 1024 1024
> >> 446 446 446) in cases 2 tasks are scheduled on the same little CPU:
> >>
> >> ...
> >> task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
> >> task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
> >> task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
> >> task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
> >> ...
> >>
> >> I'm not sure if the original case (Speedometer on Pix6 ?) which led to
> >> this implementation was tested with perf/energy numbers back then?
> >>
> >>>>> the util_est signal high in case a task shares a rq with another
> >>>>> task and doesn't obtain a desired running time.
> >>>>>
> >>>>> However, tasks sharing a rq obtain the running time they desire
> >>>>> provided that the rq has some idle time. Indeed, either:
> >>>>> - a CPU is always running. The utilization signal of tasks reflects
> >>>>> the running time they obtained. This running time depends on the
> >>>>> niceness of the tasks. A decreasing utilization signal doesn't
> >>>>> reflect a decrease of the task activity and the util_est signal
> >>>>> should not be decayed in this case.
> >>>>> - a CPU is not always running (i.e. there is some idle time). Tasks
> >>>>> might be waiting to run, increasing their runnable signal, but
> >>>>> eventually run to completion. A decreasing utilization signal
> >>>>> does reflect a decrease of the task activity and the util_est
> >>>>> signal should be decayed in this case.
> >>>>
> >>>> This is not always true
> >>>> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
> >>>> compute capacity. its util_avg is up to 674 at dequeue as well as its
> >>>> util_est
> >>>> Then start a 2nd task with the exact same behavior on the same cpu.
> >>>> The util_avg of this 2nd task will be only 496 at dequeue as well as
> >>>> its util_est but there is still 20ms of idle time. Furthermore, The
> >>>> util_avg of the 1st task is also around 496 at dequeue but
> >>>
> >>> the end of the sentence was missing...
> >>>
> >>> but there is still 20ms of idle time.
> >>
> >> But these two tasks are still able to finish their activity within this
> >> 100ms window. So why should we keep their util_est values high when
> >> dequeuing?
> >
> > But then, the util_est decreases from the original value compared to
> > alone whereas its utilization is the same
>
> In the example with one task, it is possible to have a utilization as high
> as we want by increasing the period. With a period of 200ms, the task
> reaches a utilization of 750, and with a period of 300ms the max utilization
> is 870.
> Having a high utilization at dequeue is useful information stored in
> util_est. It allows tracking down that even though the utilization of the task
> had time to decrease, the task actually represents a big quantity of
> instructions to execute. The task should be handled accordingly.
>
> On the other side, by decreasing the period, the lowest max utilization we
> can get is 40% * 1024 = 410.
>
> ------------
>
> By having 2 tasks sharing the CPU, the utilization graph is smoothed as one
> big period of 40ms followed by 60ms of idle time becomes:
> - when the 2 tasks are running, both tasks run alternatively during one sched
> slice ~=4ms, so the 40ms running phase becomes a periodic phase with a period
> of 8ms and a duty cycle of 50%
> - the 60ms idle time is reduced to 20ms idle time for each task
> The fact that these tasks could run longer than one sched slice is reflected
> in the runnable signal of the tasks.
> The duty cycle of the tasks in the co-scheduling phase is 50% and the duty
> cycle over the 100ms period is 40%. So the utilization of the tasks can reach
> 40% * 1024. This is ok, tasks don't prevent each other from reaching a
> utilization value corresponding to their actual duty_cycle.
>
> This patch intends to detect when a periodic task cannot reach a utilization
> value of duty_cycle * 1024 due to other tasks requiring to run.
> This would be the case for instance if there were 3 tasks with:
> duty_cycle=40%, period=100ms, running during 300ms
> In this case, the total running time of the CPU is:
> 3(tasks) * 40(ms) * 3(periods) = 360ms
> There is no idle time during these 360ms and the utilization of tasks reaches
> at most 369 (369 < 0.4*1024).
>
> This is different from the case where the task utilization is lower than their
> runnable signal. The following task:
> ---
> To get a high util_est / low utilization value:
> - Run during a long period
> - Idle during a long period
> Then loop n times:
> - Periodic during 80ms, period=8ms, duty_cycle=51%
> (note that the duty_cycle is set to 51% to make sure the running time is
> longer than a sched slice of 4ms)
> - Idle during 20ms
> ---
> would:
> - allow decaying util_est during the looping phase if there was one task
> - not allow decaying util_est during the looping phase if there were 2 tasks.
> Indeed the runnable signal of the tasks would be higher than their util
> signal.
>
> However, the looping phase doesn't represent a long and continuous amount of
> instructions to execute. The profile of the task changed and the util_est
> value should reflect that.
> Checking the delta between the runnable and utilization signals doesn't allow
> detecting that the profile of the task changed. Indeed, being runnable doesn't
> mean running all the time a task is runnable.
I fully agree that the current solution is not perfect as it assumes
that when runnable_avg > util_avg, the task didn't fully run as
expected and its util_avg at dequeue might not be correct as described
in my example. I also agree that some other cases fall in this
condition whereas they should not, but your proposal fails to detect this
correctly.
>
> >
> >>
> >> [...]
> >>
> >>>>> The initial patch [2] aimed to solve an issue detected while running
> >>>>> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
> >>>>> versions are compared:
> >>>>> - base: the current version
> >>>>
> >>>> What do you mean by current version ? tip/sched/core ?
>
> I meant using the following condition:
> (dequeued + UTIL_EST_MARGIN) < task_runnable(p)
I meant: what is your base tree ? v6.12 ? v6.13-rcX ? tip/sched/core ?
I tried your patch on top of android mainline v6.12 but don't get the
same results; in particular for the Overutilized ratio.
In my tests, your patch doesn't make any real difference:
similar speedometer score 87.96 vs 87.4 (running locally and not over wifi)
similar overutilized ratio 67% vs 61%
similar energy counters 171232171 vs 166066813 (/Sum of CPU clusters counters)
These results mean the same, as the thermal environment (ambient temp
and skin temp at the beginning of the test) and the thermal mitigation
have an impact on the results.
What am I missing compared to your setup ?
>
> >>>>
> >>>>> - patch: the new version, with this patch applied
> >>>>> - revert: the initial version, with commit [2] reverted
> >>>>>
> >>>>> Score (higher is better):
> >>>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> >>>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> >>>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> >>>>> │ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
> >>>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> >>>>> ┌───────────┬───────────┬────────────┐
> >>>>> │ base std ┆ patch std ┆ revert std │
> >>>>> ╞═══════════╪═══════════╪════════════╡
> >>>>> │ 0.57 ┆ 0.49 ┆ 0.58 │
> >>>>> └───────────┴───────────┴────────────┘
> >>>>>
> >>>>> Energy measured with energy counters:
> >>>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> >>>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> >>>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> >>>>> │ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
> >>>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> >>>>> ┌───────────┬───────────┬────────────┐
> >>>>> │ base std ┆ patch std ┆ revert std │
> >>>>> ╞═══════════╪═══════════╪════════════╡
> >>>>> │ 1347.13 ┆ 2431.67 ┆ 510.88 │
> >>>>> └───────────┴───────────┴────────────┘
> >>>>>
> >>>>> Energy computed from util signals and energy model:
> >>>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> >>>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> >>>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> >>>>> │ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
> >>>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> >>>>> ┌───────────┬───────────┬────────────┐
> >>>>> │ base std ┆ patch std ┆ revert std │
> >>>>> ╞═══════════╪═══════════╪════════════╡
> >>>>> │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
> >>>>> └───────────┴───────────┴────────────┘
> >>>>>
> >>>>> OU ratio in % (ratio of time being overutilized over total time).
> >>>>> The test lasts ~65s:
> >>>>> ┌────────────┬────────────┬─────────────┐
> >>>>> │ base mean ┆ patch mean ┆ revert mean │
> >>>>> ╞════════════╪════════════╪═════════════╡
> >>>>> │ 63.39% ┆ 12.48% ┆ 12.28% │
> >>>>> └────────────┴────────────┴─────────────┘
> >>>>> ┌───────────┬───────────┬─────────────┐
> >>>>> │ base std  ┆ patch std ┆ revert std  │
> >>>>> ╞═══════════╪═══════════╪═════════════╡
> >>>>> │ 0.97 ┆ 0.28 ┆ 0.88 │
> >>>>> └───────────┴───────────┴─────────────┘
> >>>>>
[...]
> >> [...]
> >>
> >>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>>> index 3e9ca38512de..d058ab29e52e 100644
> >>>>> --- a/kernel/sched/fair.c
> >>>>> +++ b/kernel/sched/fair.c
> >>>>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> >>>>> * To avoid underestimate of task utilization, skip updates of EWMA if
> >>>>> * we cannot grant that thread got all CPU time it wanted.
> >>>>> */
> >>>>> - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> >>>>> + if (rq_no_idle_pelt(rq_of(cfs_rq)))
> >>>>
> >>>> You can't use here the test that is done in
> >>>> update_idle_rq_clock_pelt() to detect if we lost some idle time
> >>>> because this test is only relevant when the rq becomes idle which is
> >>>> not the case here
> >>
> >> Do you mean this test ?
> >>
> >> util_avg = util_sum / divider
> >>
> >> util_sum >= divider * util_avg
> >>
> >> with 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and upper
> >> bound of the window (+ 1024):
> >>
> >> util_sum >= ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX
> >>
> >> Why can't we use it here?
> >
> > because of the example below, it makes the filtering a nop for a very
> > large time and you will be overutilized far before
>
>
> To estimate the amount of time a task requires to reach a certain utilization
> value, I did the following:
> - Computing the accumulated sum of 'pelt graph' for the first 12 * 32ms.
You can also do
(1-y^r)*1024 where r is the number of 1024us periods
and
(1-y^r) / (1-y^p) when you have a task running for r periods with a task
period of p 1024us periods
Keep in mind that we track 1024us and not 1000us
>
[...]
> - Due to some approximations during the computation (I presume) the accumulated
> sum doesn't converge toward 47742, but toward 46718, so I'll use 46718.
It's not an approximation: 46718 = 47742*y
The computation is done at the end of a complete pelt period (which is
decayed) before accumulating the current period
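Both closed forms (and the 46718 value) can be checked numerically with a
few lines of standalone C (illustrative only, reusing the 40ms/100ms example
from earlier in the thread):

#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* decay per 1024us period */
	double r = 40, p = 100;

	/* task running continuously for r periods, starting from 0 */
	printf("%.0f\n", 1024.0 * (1.0 - pow(y, r)));		/* ~594 */
	/* steady state of a periodic task: r running periods every p periods */
	printf("%.0f\n",
	       1024.0 * (1.0 - pow(y, r)) / (1.0 - pow(y, p)));	/* ~670 */
	/* the accumulated sum converges to LOAD_AVG_MAX decayed by one period */
	printf("%.0f\n", 47742.0 * y);				/* ~46719 */
	return 0;
}

The ~670 steady-state value is consistent with the ~674 observed at dequeue
in the 40ms/100ms example, the small difference coming from the per-period
discretization.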
[...]
>
> ------------
>
> All of this just to highlight that:
> - being overutilized already depends on the capacity of a CPU
> - the lower the capacity, the easier it is to become overutilized
> This is if overutilized means 'having a CPU utilization reaching 80% of a
> CPU capacity'.
This is the current implementation of cpu overutilized detection
>
> If overutilized means 'not having enough compute power to correctly estimate
> a task utilization', then indeed it takes 2.07s for a 160-capacity CPU to
> realize that. But FWIU, this is the current behaviour as CPUs have the ability
> to estimate a task utilization beyond their own capacity.
After this sentence above, I'm not sure what you mean by overutilized ?
Being overutilized and being able to correctly estimate task
utilization are 2 different things.
Until we reach 1024, we can't say if the task didn't get enough cycles
to finish what it has to do. And this holds whatever the compute capacity.
When there is idle time and cpu utilization is not 1024, then it means
that there was enough compute capacity (but not that we didn't change
the behavior of the task).
Testing util_sum >= divider is relevant when the CPU becomes idle, to
know if we missed accounting some cycles. But testing util_sum < divider
at runtime doesn't mean that we didn't lose some cycles, just that we
didn't detect it yet.
> I don't see why having 2 tasks instead of 1 would make a difference, their
> utilization would just rise half as fast as if they were alone on the CPU,
> but nothing more IIUC.
There is a difference as described in my example in my previous email
because the utilization pattern in the period is not the same in this
case and PELT is not linear with time
>
> ------------
>
> Also, I think the original issue is to detect cases where tasks cannot reach
> a max utilization corresponding to their duty cycle. I.e. cases where the
> utilization of a task is always strictly below the value
> (task_duty_cycle * 1024). This being due to other tasks preventing them from
> running as much as desired.
> I don't think this is what happens when 2 tasks run on a non-big CPU, as long
> as there is idle time on the non-big CPU. This even though their respective
But this is not what your patch/test does !
> utilization goes above the CPU capacity.
Max utilization of a task is always > (task_duty_cycle * 1024) and the
longer the period is the larger the diff is
And a task with its max utilization > CPU compute capacity can fit
on this CPU. But for now we don't detect these cases, so we assume the
task doesn't fit
>
> On a 512-capacity CPU, 2 periodic tasks with a duty cycle of 20% and a period
> of 100ms should have correct utilization values, even if the utilization of the
> CPU goes above its capacity. On the Pixel6 where mid CPUs have a capacity of
> 498, these tasks reach a utilization of 323, and the CPU reaches a utilization
> of 662.
An easier solution would be to not use the /Sum of util est to know if
a CPU is overutilized or not but only to select the OPP. Something
like the below:
@@ -8069,7 +8069,7 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
 	else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
 		util += task_util(p);

-	if (sched_feat(UTIL_EST)) {
+	if (sched_feat(UTIL_EST) && boost) {
 		unsigned long util_est;

 		util_est = READ_ONCE(cfs_rq->avg.util_est);
>
>
> >
> >>
> >>>> With this test you skip completely the cases where the task has to
> >>>> share the CPU with others. As an example on the pixel 6, the little
> >>
> >> True. But I assume that's anticipated here. The assumption is that as
> >> long as there is idle time, tasks get what they want in a time frame.
> >>
> >>>> cpus must run more than 1.2 seconds at its max freq before detecting
> >>>> that there is no idle time
> >>
> >> BTW, I tried to figure out where the 1.2s comes from: 323ms * 1024/160 =
> >> 2.07s (with CPU capacity of Pix5 little CPU = 160)?
> >
> > yeah, I use the wrong rb5 little capacity instead of pixel6 but that even worse
> >
> >>
> >> [...]
On 1/10/25 10:06, Vincent Guittot wrote:
> On Thu, 9 Jan 2025 at 16:32, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Hello Vincent,
>>
>> Thanks for the review,
>>
>> On 12/20/24 16:05, Vincent Guittot wrote:
>>> On Fri, 20 Dec 2024 at 15:48, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 20/12/2024 08:47, Vincent Guittot wrote:
>>>>> On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
>>>>> <vincent.guittot@linaro.org> wrote:
>>>>>>
>>>>>> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>>>>>>
>>>>>>> util_est signal does not decay if the task utilization is lower
>>>>>>> than its runnable signal by a value of 10. This was done to keep
>>>>>>
>>>>>> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
>>>>>> worth updating util_est
>>>> Might be that UTIL_EST_MARGIN is just too small for this usecase? Maybe
>>>> the mechanism is too sensitive?
>>>
>>> The default config is to follow util_est update
>>>
>>>>
>>>> It triggers already when running 10 5% tasks on a Juno-r0 (446 1024 1024
>>>> 446 446 446) in cases 2 tasks are scheduled on the same little CPU:
>>>>
>>>> ...
>>>> task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
>>>> task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
>>>> task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
>>>> task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
>>>> ...
>>>>
>>>> I'm not sure if the original case (Speedometer on Pix6 ?) which led to
>>>> this implementation was tested with perf/energy numbers back then?
>>>>
>>>>>>> the util_est signal high in case a task shares a rq with another
>>>>>>> task and doesn't obtain a desired running time.
>>>>>>>
>>>>>>> However, tasks sharing a rq obtain the running time they desire
>>>>>>> provided that the rq has some idle time. Indeed, either:
>>>>>>> - a CPU is always running. The utilization signal of tasks reflects
>>>>>>> the running time they obtained. This running time depends on the
>>>>>>> niceness of the tasks. A decreasing utilization signal doesn't
>>>>>>> reflect a decrease of the task activity and the util_est signal
>>>>>>> should not be decayed in this case.
>>>>>>> - a CPU is not always running (i.e. there is some idle time). Tasks
>>>>>>> might be waiting to run, increasing their runnable signal, but
>>>>>>> eventually run to completion. A decreasing utilization signal
>>>>>>> does reflect a decrease of the task activity and the util_est
>>>>>>> signal should be decayed in this case.
>>>>>>
>>>>>> This is not always true
>>>>>> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
>>>>>> compute capacity. its util_avg is up to 674 at dequeue as well as its
>>>>>> util_est
>>>>>> Then start a 2nd task with the exact same behavior on the same cpu.
>>>>>> The util_avg of this 2nd task will be only 496 at dequeue as well as
>>>>>> its util_est but there is still 20ms of idle time. Furthermore, The
>>>>>> util_avg of the 1st task is also around 496 at dequeue but
>>>>>
>>>>> the end of the sentence was missing...
>>>>>
>>>>> but there is still 20ms of idle time.
>>>>
>>>> But these two tasks are still able to finish their activity within this
>>>> 100ms window. So why should we keep their util_est values high when
>>>> dequeuing?
>>>
>>> But then, the util_est decreases from the original value compared to
>>> alone whereas its utilization is the same
>>
>> In the example with one task, it is possible to have a utilization as high
>> as we want by increasing the period. With a period of 200ms, the task
>> reaches a utilization of 750, and with a period of 300ms the max utilization
>> is 870.
>> Having a high utilization at dequeue is useful information stored in
>> util_est. It allows tracking down that even though the utilization of the task
>> had time to decrease, the task actually represents a big quantity of
>> instructions to execute. The task should be handled accordingly.
>>
>> On the other side, by decreasing the period, the lowest max utilization we
>> can get is 40% * 1024 = 410.
>>
>> ------------
>>
>> By having 2 tasks sharing the CPU, the utilization graph is smoothed as one
>> big period of 40ms followed by 60ms of idle time becomes:
>> - when the 2 tasks are running, both tasks run alternatively during one sched
>> slice ~=4ms, so the 40ms running phase becomes a periodic phase with a period
>> of 8ms and a duty cycle of 50%
>> - the 60ms idle time is reduced to 20ms idle time for each task
>> The fact that these tasks could run longer than one sched slice is reflected
>> in the runnable signal of the tasks.
>> The duty cycle of the tasks in the co-scheduling phase is 50% and the duty
>> cycle over the 100ms period is 40%. So the utilization of the tasks can reach
>> 40% * 1024. This is ok, tasks don't prevent each other from reaching a
>> utilization value corresponding to their actual duty_cycle.
>>
>> This patch intends to detect when a periodic task cannot reach a utilization
>> value of duty_cycle * 1024 due to other tasks requiring to run.
>> This would be the case for instance if there were 3 tasks with:
>> duty_cycle=40%, period=100ms, running during 300ms
>> In this case, the total running time of the CPU is:
>> 3(tasks) * 40(ms) * 3(periods) = 360ms
>> There is no idle time during these 360ms and the utilization of tasks reaches
>> at most 369 (369 < 0.4*1024).
>>
>> This is different from the case where the task utilization is lower than their
>> runnable signal. The following task:
>> ---
>> To get a high util_est / low utilization value:
>> - Run during a long period
>> - Idle during a long period
>> Then loop n times:
>> - Periodic during 80ms, period=8ms, duty_cycle=51%
>> (note that the duty_cycle is set to 51% to make sure the running time is
>> longer than a sched slice of 4ms)
>> - Idle during 20ms
>> ---
>> would:
>> - allow decaying util_est during the looping phase if there was one task
>> - not allow decaying util_est during the looping phase if there were 2 tasks.
>> Indeed the runnable signal of the tasks would be higher than their util
>> signal.
>>
>> However, the looping phase doesn't represent a long and continuous amount of
>> instructions to execute. The profile of the task changed and the util_est
>> value should reflect that.
>> Checking the delta between the runnable and utilization signals doesn't allow
>> detecting that the profile of the task changed. Indeed, being runnable doesn't
>> mean running all the time a task is runnable.
>
> I fully agree that the current solution is not perfect as it assumes
> that when runnable_avg > util_avg, the task didn't fully run as
> expected and its util_avg at dequeue might not be correct as described
> in my example. I also agree that some other cases fall in this
> condition whereas they should not, but your proposal fails to detect this
> correctly.
As you said, in the example you gave:
- The util_avg a task reaches at dequeue depends on whether the task had
to share the rq or not.
- The UTIL_EST_MARGIN allows preventing the util_est signal from decreasing
if the task had to share the rq.
So the following condition acts like a filter:
if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
preventing util_est from decreasing to compensate for the fact that util_avg
couldn't reach a higher value.
However, wouldn't it be possible to:
- compute this 'actual util_avg at dequeue' independently of the presence of
other tasks
- not prevent util_est from decaying ?
I did the following experiment just to compute this 'actual util_avg at
dequeue' at runtime (see the sketch after this list):
- The information that a task is enqueued/dequeued is passed to
__update_load_avg_se().
- If the task is enqueued, a snapshot of util_sum is taken.
- Whenever accounting for pelt:
- The running time is accumulated in 'u64 running_time;'
- The non-running time is accumulated in 'u64 non_running_time;'
- Whether running or not-running, the util contribution is stored
in a scratch 'u64 util_sum_enqueue' variable.
- When the task is dequeued, it is possible to compute the 'actual util_avg at
dequeue' by:
a- Decaying the util_sum snapshot by 'running_time' periods
b- Accumulating the total running time while the task was runnable
with __accumulate_pelt_segments() for 'running_time' periods
c- Computing 'actual util_avg at dequeue' = (a + b) / get_pelt_divider()
- Finally 'util_sum' is decayed by (running_time + non_running_time) periods,
and the util_sum_enqueue is added to util_sum then cleared.
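As a standalone illustration of steps a-, b- and c- (a userspace sketch
assuming everything happens in whole 1024us periods; 'snapshot' and 'running'
are hypothetical inputs, not the kernel variables):

#include <stdio.h>
#include <math.h>

#define HALFLIFE	32
#define LOAD_AVG_MAX	47742.0

int main(void)
{
	double y = pow(0.5, 1.0 / HALFLIFE);
	double snapshot = 8518.0;	/* util_sum at enqueue (util_avg ~183) */
	double running = 40.0;		/* running periods while runnable */

	/* a- decay the snapshot by 'running' periods only, i.e. ignore the
	 *    non-running (but runnable) time */
	double a = snapshot * pow(y, running);
	/* b- accumulate 'running' contiguous periods of running time
	 *    (closed form of what __accumulate_pelt_segments() computes) */
	double b = 1024.0 * (1.0 - pow(y, running)) / (1.0 - y);
	/* c- turn the sum into an avg with the pelt divider */
	double actual = (a + b) * 1024.0 / LOAD_AVG_MAX;

	printf("actual util_avg at dequeue ~= %.0f\n", actual);	/* ~671 */
	return 0;
}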
With your example with 2 tasks running 40ms every 100ms, once both tasks
finished running after 80ms, both tasks have:
- a util_avg at dequeue of 496 (unchanged)
- an 'actual util_avg at dequeue' of ~674, just as if each task had run
alone on the rq.
It could even be possible to compute this 'actual util_avg' while the task is
still enqueued (i.e. runnable=1). This would be equivalent to doing what is
currently done for time scaling / idle time accounting in pelt, except this
would be based on running=[0|1].
------------
I think I finally understand what you mean. IIUC, the util_avg graph of a task
which shares the rq with other tasks is inexact from your perspective. The
exact graph is when the task is alone on the rq.
From that perspective, the presence of idle time effectively doesn't guarantee
a correct util_avg graph. This was not how I understood the issue described at
[4], but this might have been the issue indeed.
From what I initially thought about [4], the main thread could not reach a
high utilization value because it could not even reach a utilization such as:
max utilization > (task_duty_cycle * 1024)
To illustrate with a case, this happens if there are 2 tasks at 60% on the same
rq. Their max utilization would be 512. In that case as well, the util_avg graph
is incorrect, but this is a different type of incorrectness than in your case.
Here, checking the presence of idle time is relevant.
------------
About the overutilization topic, the notion was defined to allow EAS to make
correct task placement while not breaking niceness, cf. [5].
IMO, in your example:
- util values are incorrect (in the sense that runnable==util)
- we might have cpu_util [<|>] 80% * CPU_capacity
But util values being incorrect doesn't prevent running EAS (on which I
agree).
In the example with 2 tasks with a duty_cycle=60%:
- util values are incorrect (in the sense that their max util < duty_cycle)
- cpu_util > 80% * CPU_capacity
So the system is definitely overutilized.
However, with one single task with duty_cycle=90%:
- util values are correct (in the sense that runnable==util and
max util > duty_cycle)
- we have cpu_util > 80% * CPU_capacity
So the utilization value is correct but EAS is stopped, which seems wrong.
In that case, checking the presence of idle time might be better than checking
80% * CPU_capacity.
The energy values I advertised in the commit message are actually showing
the results of doing this modification + this patch.
[4] https://lore.kernel.org/lkml/f1b1b663-3a12-9e5d-932b-b3ffb5f02e14@arm.com/
[5] https://lore.kernel.org/all/tip-2802bf3cd936fe2c8033a696d375a4d9d3974de4@git.kernel.org/
>
>>
>>>
>>>>
>>>> [...]
>>>>
>>>>>>> The initial patch [2] aimed to solve an issue detected while running
>>>>>>> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
>>>>>>> versions are compared:
>>>>>>> - base: the current version
>>>>>>
>>>>>> What do you mean by current version ? tip/sched/core ?
>>
>> I meant using the following condition:
>> (dequeued + UTIL_EST_MARGIN) < task_runnable(p)
>
> I meant: what is your base tree ? v6.12 ? v6.13-rcX ? tip/sched/core ?
>
> I tried your patch on top of android mainline v6.12 but don't get the
> same results; in particular for the Overutilized ratio.
>
> In my tests, your patch doesn't make any real difference:
> similar speedometer score 87.96 vs 87.4 (running locally and not over wifi)
> similar overutilized ratio 67% vs 61%
> similar energy counters 171232171 vs 166066813 (/Sum of CPU clusters counters)
>
> These results mean the same, as the thermal environment (ambient temp
> and skin temp at the beginning of the test) and the thermal mitigation
> have an impact on the results.
>
> What am I missing compared to your setup ?
The base is v6.12. When testing:
- with commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task utilization")
reverted
- with this patch
I included a change modifying the overutilized condition as well,
checking rq_no_idle_pelt() (i.e. if there is idle time) instead of the current
80% * CPU_capacity condition. So the numbers advertised on speedometer for
the patchset are incorrect. Really sorry about that...
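For reference, a minimal sketch of what such an idle-time check could look
like, assuming it reuses the util_sum >= divider test of
update_idle_rq_clock_pelt() discussed above (a guess at the helper, the
actual pelt.h change is not shown in this thread):

static inline bool rq_no_idle_pelt(struct rq *rq)
{
	u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
	u32 util_sum = rq->cfs.avg.util_sum;

	util_sum += rq->avg_rt.util_sum;
	util_sum += rq->avg_dl.util_sum;

	/* no idle time seen by pelt: the CPU looks always running */
	return util_sum >= divider;
}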
Here are the correct numbers, closer to what you get:
- a slightly lower OU rate
- approximately same energy consumption and performance numbers.
Score (higher is better):
┌────────────┬────────────┬─────────────┐
│ base mean ┆ patch mean ┆ ratio_patch │
╞════════════╪════════════╪═════════════╡
│ 128.36 ┆ 127.24 ┆ -0.87% │
└────────────┴────────────┴─────────────┘
┌───────────┬───────────┐
│ base std ┆ patch std │
╞═══════════╪═══════════╡
│ 0.66 ┆ 1.91 │
└───────────┴───────────┘
Energy measured with energy counters:
┌────────────┬────────────┬─────────────┐
│ base mean ┆ patch mean ┆ ratio_patch │
╞════════════╪════════════╪═════════════╡
│ 134122 ┆ 132377 ┆ -1.30% │
└────────────┴────────────┴─────────────┘
┌───────────┬───────────┐
│ base std ┆ patch std │
╞═══════════╪═══════════╡
│ 1771 ┆ 617 │
└───────────┴───────────┘
Energy computed from util signals and energy model:
┌────────────┬────────────┬─────────────┐
│ base mean ┆ patch mean ┆ ratio_patch │
╞════════════╪════════════╪═════════════╡
│ 1.849e+12 ┆ 1.857e+12 ┆ +0.43% │
└────────────┴────────────┴─────────────┘
┌───────────┬───────────┐
│ base std ┆ patch std │
╞═══════════╪═══════════╡
│ 1.706e+10 ┆ 3.344e+10 │
└───────────┴───────────┘
OU ratio in % (ratio of time being overutilized over total time).
┌────────────┬────────────┐
│ base mean ┆ patch mean │
╞════════════╪════════════╡
│ 64.66% ┆ 57.08% │
└────────────┴────────────┘
┌───────────┬───────────┐
│ base std ┆ patch std │
╞═══════════╪═══════════╡
│ 2.41 ┆ 0.71 │
└───────────┴───────────┘
>
>
>>
>>>>>>
>>>>>>> - patch: the new version, with this patch applied
>>>>>>> - revert: the initial version, with commit [2] reverted
>>>>>>>
>>>>>>> Score (higher is better):
>>>>>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>>>>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>>>>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>>>>>> │ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
>>>>>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>>>>>> ┌───────────┬───────────┬────────────┐
>>>>>>> │ base std ┆ patch std ┆ revert std │
>>>>>>> ╞═══════════╪═══════════╪════════════╡
>>>>>>> │ 0.57 ┆ 0.49 ┆ 0.58 │
>>>>>>> └───────────┴───────────┴────────────┘
>>>>>>>
>>>>>>> Energy measured with energy counters:
>>>>>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>>>>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>>>>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>>>>>> │ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
>>>>>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>>>>>> ┌───────────┬───────────┬────────────┐
>>>>>>> │ base std ┆ patch std ┆ revert std │
>>>>>>> ╞═══════════╪═══════════╪════════════╡
>>>>>>> │ 1347.13 ┆ 2431.67 ┆ 510.88 │
>>>>>>> └───────────┴───────────┴────────────┘
>>>>>>>
>>>>>>> Energy computed from util signals and energy model:
>>>>>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>>>>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>>>>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>>>>>> │ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
>>>>>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>>>>>> ┌───────────┬───────────┬────────────┐
>>>>>>> │ base std ┆ patch std ┆ revert std │
>>>>>>> ╞═══════════╪═══════════╪════════════╡
>>>>>>> │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
>>>>>>> └───────────┴───────────┴────────────┘
>>>>>>>
>>>>>>> OU ratio in % (ratio of time being overutilized over total time).
>>>>>>> The test lasts ~65s:
>>>>>>> ┌────────────┬────────────┬─────────────┐
>>>>>>> │ base mean ┆ patch mean ┆ revert mean │
>>>>>>> ╞════════════╪════════════╪═════════════╡
>>>>>>> │ 63.39% ┆ 12.48% ┆ 12.28% │
>>>>>>> └────────────┴────────────┴─────────────┘
>>>>>>> ┌───────────┬───────────┬─────────────┐
>>>>>>> │ base std  ┆ patch std ┆ revert std  │
>>>>>>> ╞═══════════╪═══════════╪═════════════╡
>>>>>>> │ 0.97 ┆ 0.28 ┆ 0.88 │
>>>>>>> └───────────┴───────────┴─────────────┘
>>>>>>>
>
> [...]
>
>>>> [...]
>>>>
>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>> index 3e9ca38512de..d058ab29e52e 100644
>>>>>>> --- a/kernel/sched/fair.c
>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>>>>>>> * To avoid underestimate of task utilization, skip updates of EWMA if
>>>>>>> * we cannot grant that thread got all CPU time it wanted.
>>>>>>> */
>>>>>>> - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
>>>>>>> + if (rq_no_idle_pelt(rq_of(cfs_rq)))
>>>>>>
>>>>>> You can't use here the test that is done in
>>>>>> update_idle_rq_clock_pelt() to detect if we lost some idle time
>>>>>> because this test is only relevant when the rq becomes idle which is
>>>>>> not the case here
>>>>
>>>> Do you mean this test ?
>>>>
>>>> util_avg = util_sum / divider
>>>>
>>>> util_sum >= divider * util_avg
>>>>
>>>> with 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and upper
>>>> bound of the window (+ 1024):
>>>>
>>>> util_sum >= ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX
>>>>
>>>> Why can't we use it here?
>>>
>>> because of the example below, it makes the filtering a nop for a very
>>> large time and you will be overutilized far before
>>
>>
>> To estimate the amount of time a task requires to reach a certain utilization
>> value, I did the following:
>> - Computing the accumulated sum of 'pelt graph' for the first 12 * 32ms.
>
> You can also do
> (1-y^r)*1024 where r is the number of 1024us periods
> and
> (1-y^r) / (1-y^p) when you have a task running for r periods with a task
> period of p 1024us periods
>
> Keep in mind that we track 1024us and not 1000us
Yes right
>
>>
>
> [...]
>
>> - Due to some approximations during the computation (I presume) the accumulated
>> sum doesn't converge toward 47742, but toward 46718, so I'll use 46718.
>
> It's not an approximation: 46718 = 47742*y
>
> The computation is done at the end of a complete pelt period (which is
> decayed) before accumulating the current period
>
> [...]
>
>>
>> ------------
>>
>> All of this just to highlight that:
>> - being overutilized already depends on the capacity of a CPU
>> - the lower the capacity, the easier it is to become overutilized
>> This is if overutilized means 'having a CPU utilization reaching 80% of a
>> CPU capacity'.
>
> This is the current implementation of cpu overutilized detection
>
>>
>> If overutilized means 'not having enough compute power to correctly estimate
>> a task utilization', then indeed it takes 2.07s for a 160-capacity CPU to
>> realize that. But FWIU, this is the current behaviour as CPUs have the ability
>> to estimate a task utilization beyond their own capacity.
>
> After this sentence above, I'm not sure what you mean by overutilized ?
>
> Being overutilized and being able to correctly estimate task
> utilization are 2 different things.
Cf. above, I think this depends on what is meant by not being able to correctly
estimate a task utilization.
>
> Until we reach 1024, we can't say if the task didn't get enough cycles
> to finish what it has to do. And this holds whatever the compute capacity.
> When there is idle time and cpu utilization is not 1024, then it means
> that there was enough compute capacity (but not that we didn't change
> the behavior of the task).
>
> Testing util_sum >= divider is relevant when the CPU becomes idle, to
> know if we missed accounting some cycles. But testing util_sum < divider
> at runtime doesn't mean that we didn't lose some cycles, just that we
> didn't detect it yet.
>
>> I don't see why having 2 tasks instead of 1 would make a difference, their
>> utilization would just rise half as fast as if they were alone on the CPU,
>> but nothing more IIUC.
>
> There is a difference as described in my example in my previous email
> because the utilization pattern in the period is not the same in this
> case and PELT is not linear with time
>
>>
>> ------------
>>
>> Also, I think the original issue is to detect cases where tasks cannot reach
>> a max utilization corresponding to their duty cycle. I.e. cases where the
>> utilization of a task is always strictly below the value
>> (task_duty_cycle * 1024). This being due to other tasks preventing them from
>> running as much as desired.
>> I don't think this is what happens when 2 tasks run on a non-big CPU, as long
>> as there is idle time on the non-big CPU. This even though their respective
>
> But this is not what your patch/test does !
Cf. above, I think we didn't mean the same thing by 'other tasks preventing
them from running as much as desired'.
>
>> utilization goes above the CPU capacity.
>
> Max utilization of a task is always > (task_duty_cycle * 1024) and the
> longer the period is the larger the diff is
I don't think this is always the case, cf above with the example with 2 tasks
having a duty_cycle=60%
>
> And a task with its max utilization > CPU compute capacity can fit
> on this CPU. But for now we don't detect these cases, so we assume the
> task doesn't fit
>
>>
>> On a 512-capacity CPU, 2 periodic tasks with a duty cycle of 20% and a period
>> of 100ms should have correct utilization values, even if the utilization of the
>> CPU goes above its capacity. On the Pixel6 where mid CPUs have a capacity of
>> 498, these tasks reach a utilization of 323, and the CPU reaches a utilization
>> of 662.
>
> An easier solution would be to not use the /Sum of util est to know if
> a CPU is overutilized or not but only to select the OPP. Something
> like the below:
>
> @@ -8069,7 +8069,7 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
>  	else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
>  		util += task_util(p);
>
> -	if (sched_feat(UTIL_EST)) {
> +	if (sched_feat(UTIL_EST) && boost) {
>  		unsigned long util_est;
>
>  		util_est = READ_ONCE(cfs_rq->avg.util_est);
>
When trying this, I get ~similar results as with the base version regarding the
OU ratio.
>>
>>
>>>
>>>>
>>>>>> With this test you skip completely the cases where the task has to
>>>>>> share the CPU with others. As an example on the pixel 6, the little
>>>>
>>>> True. But I assume that's anticipated here. The assumption is that as
>>>> long as there is idle time, tasks get what they want in a time frame.
>>>>
>>>>>> cpus must run more than 1.2 seconds at its max freq before detecting
>>>>>> that there is no idle time
>>>>
>>>> BTW, I tried to figure out where the 1.2s comes from: 323ms * 1024/160 =
>>>> 2.07s (with CPU capacity of Pix5 little CPU = 160)?
>>>
>>> yeah, I use the wrong rb5 little capacity instead of pixel6 but that even worse
>>>
>>>>
>>>> [...]