commit 10a35e6812aa ("sched/pelt: Skip updating util_est when
utilization is higher than CPU's capacity")
prevents util_est from being updated if util_avg is higher than the
underlying CPU capacity to avoid overestimating the task when the CPU
is capped (due to thermal issue for instance). In this scenario, the
task will miss its deadlines and start overlapping its wake-up events
for instance. The task will appear as always running when the CPU is
just not powerful enough to allow having a good estimation of the
task.
commit b8c96361402a ("sched/fair/util_est: Implement faster ramp-up
EWMA on utilization increases")
sets ewma to util_avg when ewma <= util_avg, allowing ewma to grow
quickly instead of slowly converging to the new util_avg value when a
task profile changes from small to big.

However, the 2 conditions:
- Check util_avg against max CPU capacity
- Check whether util_est <= util_avg
are placed in an order such that it is possible to set util_est to a
value higher than the CPU capacity if util_est <= util_avg, but
util_est is then prevented from decaying as long as:
CPU capacity < util_avg < util_est.
Just remove the check as either:
1.
There is idle time on the CPU. In that case the util_avg value of the
task is actually correct. It is possible that the task missed a
deadline and appears bigger, but this is also the case when the
util_avg of the task is lower than the maximum CPU capacity.
2.
There is no idle time. In that case, the util_avg value might as well
be an underestimation of the size of the task.
It is possible that undesired frequency spikes will appear when the
task is later enqueued with an inflated util_est value, but the
frequency spike might as well be deserved. The absence of idle time
prevents drawing any conclusion.
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
---
kernel/sched/fair.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c798d2795243..de7687e579c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4918,13 +4918,6 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 	if (last_ewma_diff < UTIL_EST_MARGIN)
 		goto done;
 
-	/*
-	 * To avoid overestimation of actual task utilization, skip updates if
-	 * we cannot grant there is idle time in this CPU.
-	 */
-	if (dequeued > arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq))))
-		return;
-
 	/*
 	 * To avoid underestimate of task utilization, skip updates of EWMA if
 	 * we cannot grant that thread got all CPU time it wanted.
--
2.25.1
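
To make the ordering issue described in the changelog concrete, here is a
minimal user-space sketch of the check order in util_est_update(), with and
without the removed capacity check. It is a simplified model, not the kernel
code: model_util_est_update() and its keep_capacity_check flag are
hypothetical names, and the constant values (a margin of 10 and a weight
shift of 2) are assumptions chosen for illustration.

```c
#include <stdbool.h>

#define UTIL_EST_MARGIN       10 /* assumed: about 1% of a 1024 capacity scale */
#define UTIL_EST_WEIGHT_SHIFT 2  /* assumed: a new sample is weighted 1/4 */

/*
 * Hypothetical model of the check order. 'ewma' stands for util_est and
 * 'dequeued' for the task's util_avg at dequeue time. Returns the new ewma.
 */
static unsigned int model_util_est_update(unsigned int ewma,
					  unsigned int dequeued,
					  unsigned int runnable,
					  unsigned int cpu_capacity,
					  bool keep_capacity_check)
{
	unsigned int last_ewma_diff;

	/* Fast ramp-up: runs first, so ewma may end up above cpu_capacity. */
	if (ewma <= dequeued)
		return dequeued;

	/* Skip updates smaller than the margin. */
	last_ewma_diff = ewma - dequeued;
	if (last_ewma_diff < UTIL_EST_MARGIN)
		return ewma;

	/* The check removed by the patch: it blocks any decay here. */
	if (keep_capacity_check && dequeued > cpu_capacity)
		return ewma;

	/* Skip the decay if the task did not get all the CPU time it wanted. */
	if (dequeued + UTIL_EST_MARGIN < runnable)
		return ewma;

	/* EWMA decay: ewma = (3 * ewma + dequeued) / 4 with a shift of 2. */
	ewma <<= UTIL_EST_WEIGHT_SHIFT;
	ewma -= last_ewma_diff;
	ewma >>= UTIL_EST_WEIGHT_SHIFT;
	return ewma;
}
```

With cpu_capacity = 400, ewma = 800 and dequeued = runnable = 500, the model
keeps ewma at 800 while the capacity check is present and decays it to 725
once the check is dropped. With ewma = 300 and dequeued = 800, the ramp-up
sets ewma to 800 in both cases, which is how util_est can end up above the
CPU capacity in the first place.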
On Tue, 25 Mar 2025 at 16:06, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> commit 10a35e6812aa ("sched/pelt: Skip updating util_est when
> utilization is higher than CPU's capacity")
> prevents util_est from being updated if util_avg is higher than the
> underlying CPU capacity to avoid overestimating the task when the CPU
> is capped (due to thermal issue for instance). In this scenario, the
> task will miss its deadlines and start overlapping its wake-up events
> for instance. The task will appear as always running when the CPU is
> just not powerful enough to allow having a good estimation of the
> task.
>
> commit b8c96361402a ("sched/fair/util_est: Implement faster ramp-up
> EWMA on utilization increases")
> sets ewma to util_avg when ewma <= util_avg, allowing ewma to grow
> quickly instead of slowly converging to the new util_avg value when a
> task profile changes from small to big.
>
> However, the 2 conditions:
> - Check util_avg against max CPU capacity
> - Check whether util_est <= util_avg
> are placed in an order such that it is possible to set util_est to a
> value higher than the CPU capacity if util_est <= util_avg, but
> util_est is then prevented from decaying as long as:
> CPU capacity < util_avg < util_est.
>
> Just remove the check as either:
> 1.
> There is idle time on the CPU. In that case the util_avg value of the
> task is actually correct. It is possible that the task missed a
> deadline and appears bigger, but this is also the case when the
> util_avg of the task is lower than the maximum CPU capacity.
> 2.
> There is no idle time. In that case, the util_avg value might as well
> be an underestimation of the size of the task.
> It is possible that undesired frequency spikes will appear when the
> task is later enqueued with an inflated util_est value, but the
> frequency spike might as well be deserved. The absence of idle time
> prevents drawing any conclusion.
>
> Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
This change looks reasonable to me. Did you face problems related to
this in a particular use case?

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 7 -------
> 1 file changed, 7 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c798d2795243..de7687e579c2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4918,13 +4918,6 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>  	if (last_ewma_diff < UTIL_EST_MARGIN)
>  		goto done;
>
> -	/*
> -	 * To avoid overestimation of actual task utilization, skip updates if
> -	 * we cannot grant there is idle time in this CPU.
> -	 */
> -	if (dequeued > arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq))))
> -		return;
> -
>  	/*
>  	 * To avoid underestimate of task utilization, skip updates of EWMA if
>  	 * we cannot grant that thread got all CPU time it wanted.
> --
> 2.25.1
>
On 3/26/25 18:25, Vincent Guittot wrote:
> On Tue, 25 Mar 2025 at 16:06, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> commit 10a35e6812aa ("sched/pelt: Skip updating util_est when
>> utilization is higher than CPU's capacity")
>> prevents util_est from being updated if util_avg is higher than the
>> underlying CPU capacity to avoid overestimating the task when the CPU
>> is capped (due to thermal issue for instance). In this scenario, the
>> task will miss its deadlines and start overlapping its wake-up events
>> for instance. The task will appear as always running when the CPU is
>> just not powerful enough to allow having a good estimation of the
>> task.
>>
>> commit b8c96361402a ("sched/fair/util_est: Implement faster ramp-up
>> EWMA on utilization increases")
>> sets ewma to util_avg when ewma <= util_avg, allowing ewma to grow
>> quickly instead of slowly converging to the new util_avg value when a
>> task profile changes from small to big.
>>
>> However, the 2 conditions:
>> - Check util_avg against max CPU capacity
>> - Check whether util_est <= util_avg
>> are placed in an order such that it is possible to set util_est to a
>> value higher than the CPU capacity if util_est <= util_avg, but
>> util_est is then prevented from decaying as long as:
>> CPU capacity < util_avg < util_est.
>>
>> Just remove the check as either:
>> 1.
>> There is idle time on the CPU. In that case the util_avg value of the
>> task is actually correct. It is possible that the task missed a
>> deadline and appears bigger, but this is also the case when the
>> util_avg of the task is lower than the maximum CPU capacity.
>> 2.
>> There is no idle time. In that case, the util_avg value might as well
>> be an underestimation of the size of the task.
>> It is possible that undesired frequency spikes will appear when the
>> task is later enqueued with an inflated util_est value, but the
>> frequency spike might as well be deserved. The absence of idle time
>> prevents drawing any conclusion.
>>
>> Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
>
> This change looks reasonable to me. Did you face problems related to
> this in a particular use case?

I think it was more related to the fact that util_est is not decayed when:
(runnable - util_avg) > margin

This patch helps it decay slightly, but not by much.
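
For reference, a minimal sketch of that remaining guard, written as a
stand-alone predicate. The helper name decay_blocked_by_runnable() is
hypothetical and the margin value of 10 is an assumption; the comparison
mirrors the condition above.

```c
#include <stdbool.h>

#define UTIL_EST_MARGIN 10 /* assumed: about 1% of a 1024 capacity scale */

/*
 * Hypothetical helper: true when the util_est decay would be skipped
 * because runnable exceeds util_avg by more than the margin, i.e. the
 * task may not have received all the CPU time it wanted.
 */
static bool decay_blocked_by_runnable(unsigned int util_avg,
				      unsigned int runnable)
{
	return util_avg + UTIL_EST_MARGIN < runnable;
}
```

For example, util_avg = 500 with runnable = 600 blocks the decay, while
runnable = 505 does not.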
>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Thanks!
>
>
>> ---
>> kernel/sched/fair.c | 7 -------
>> 1 file changed, 7 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c798d2795243..de7687e579c2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4918,13 +4918,6 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>>  	if (last_ewma_diff < UTIL_EST_MARGIN)
>>  		goto done;
>>
>> -	/*
>> -	 * To avoid overestimation of actual task utilization, skip updates if
>> -	 * we cannot grant there is idle time in this CPU.
>> -	 */
>> -	if (dequeued > arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq))))
>> -		return;
>> -
>>  	/*
>>  	 * To avoid underestimate of task utilization, skip updates of EWMA if
>>  	 * we cannot grant that thread got all CPU time it wanted.
>> --
>> 2.25.1
>>
On 27/03/2025 10:35, Pierre Gondois wrote:
>
>
> On 3/26/25 18:25, Vincent Guittot wrote:
>> On Tue, 25 Mar 2025 at 16:06, Pierre Gondois <pierre.gondois@arm.com>
>> wrote:
>>>
>>> commit 10a35e6812aa ("sched/pelt: Skip updating util_est when
>>> utilization is higher than CPU's capacity")
>>> prevents util_est from being updated if util_avg is higher than the
>>> underlying CPU capacity to avoid overestimating the task when the CPU
>>> is capped (due to thermal issue for instance). In this scenario, the
>>> task will miss its deadlines and start overlapping its wake-up events
>>> for instance. The task will appear as always running when the CPU is
>>> just not powerful enough to allow having a good estimation of the
>>> task.
This one will be removed by your patch, right?
>>>
>>> commit b8c96361402a ("sched/fair/util_est: Implement faster ramp-up
>>> EWMA on utilization increases")
>>> sets ewma to util_avg when ewma <= util_avg, allowing ewma to grow
>>> quickly instead of slowly converging to the new util_avg value when a
>>> task profile changes from small to big.
>>>
>>> However, the 2 conditions:
>>> - Check util_avg against max CPU capacity
I assume this is the condition you remove and
>>> - Check whether util_est <= util_avg
this is:
	/*
	 * Reset EWMA on utilization increases, the moving average is used
	 * to smooth utilization decreases.
	 */
	if (ewma <= dequeued) {
		ewma = dequeued;
		goto done;
	}
which is before the condition you remove?
So maybe explain those conditions and their order more carefully? So
it's easier to grasp.
>>> are placed in an order such that it is possible to set util_est to a
>>> value higher than the CPU capacity if util_est <= util_avg, but
>>> util_est is then prevented from decaying as long as:
>>> CPU capacity < util_avg < util_est.
Maybe mentioning 'util_avg eq. dequeued' and 'util_est eq. ewma' would
help here for easier understanding.
>>> Just remove the check as either:
>>> 1.
>>> There is idle time on the CPU. In that case the util_avg value of the
>>> task is actually correct. It is possible that the task missed a
>>> deadline and appears bigger, but this is also the case when the
>>> util_avg of the task is lower than the maximum CPU capacity.
>>> 2.
>>> There is no idle time. In that case, the util_avg value might as well
>>> be an underestimation of the size of the task.
>>> It is possible that undesired frequency spikes will appear when the
>>> task is later enqueued with an inflated util_est value, but the
>>> frequency spike might as well be deserved. The absence of idle time
>>> prevents drawing any conclusion.
>>>
>>> Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
>>
>> This change looks reasonable to me. Did you face problems related to
>> this in a particular use case?
>
> I think it was more related to the fact that util_est is not decayed when:
> (runnable - util_avg) > margin
>
> This patch helps it decay slightly, but not by much.

Some of the 'stress-ng --class scheduler' stressors seem to be sensitive in
this regard. I haven't looked deeper into this.
[...]
The following commit has been merged into the sched/core branch of tip:
Commit-ID: f2d650618bc721760199ae0133c73ec32c63817e
Gitweb: https://git.kernel.org/tip/f2d650618bc721760199ae0133c73ec32c63817e
Author: Pierre Gondois <pierre.gondois@arm.com>
AuthorDate: Tue, 25 Mar 2025 16:05:41 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 08 Apr 2025 20:55:52 +02:00
sched/fair: Allow decaying util_est when util_avg > CPU capa
commit 10a35e6812aa ("sched/pelt: Skip updating util_est when
utilization is higher than CPU's capacity")
prevents util_est from being updated if util_avg is higher than the
underlying CPU capacity to avoid overestimating the task when the CPU
is capped (due to thermal issue for instance). In this scenario, the
task will miss its deadlines and start overlapping its wake-up events
for instance. The task will appear as always running when the CPU is
just not powerful enough to allow having a good estimation of the
task.
commit b8c96361402a ("sched/fair/util_est: Implement faster ramp-up
EWMA on utilization increases")
sets ewma to util_avg when ewma <= util_avg, allowing ewma to grow
quickly instead of slowly converging to the new util_avg value when a
task profile changes from small to big.

However, the 2 conditions:
- Check util_avg against max CPU capacity
- Check whether util_est <= util_avg
are placed in an order such that it is possible to set util_est to a
value higher than the CPU capacity if util_est <= util_avg, but
util_est is then prevented from decaying as long as:
CPU capacity < util_avg < util_est.
Just remove the check as either:
1.
There is idle time on the CPU. In that case the util_avg value of the
task is actually correct. It is possible that the task missed a
deadline and appears bigger, but this is also the case when the
util_avg of the task is lower than the maximum CPU capacity.
2.
There is no idle time. In that case, the util_avg value might as well
be an underestimation of the size of the task.
It is possible that undesired frequency spikes will appear when the
task is later enqueued with an inflated util_est value, but the
frequency spike might as well be deserved. The absence of idle time
prevents drawing any conclusion.
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250325150542.1077344-1-pierre.gondois@arm.com
---
kernel/sched/fair.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a..0c19459 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4933,13 +4933,6 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 		goto done;
 
 	/*
-	 * To avoid overestimation of actual task utilization, skip updates if
-	 * we cannot grant there is idle time in this CPU.
-	 */
-	if (dequeued > arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq))))
-		return;
-
-	/*
 	 * To avoid underestimate of task utilization, skip updates of EWMA if
 	 * we cannot grant that thread got all CPU time it wanted.
 	 */