sched/fair: Call update_util_est() after dequeue_entities()

[PATCH] sched/fair: Call update_util_est() after dequeue_entities()

Posted by Qais Yousef 1 month ago

update_util_est() reads task_util() at dequeue which is updated in
dequeue_entities(). To read the accurate util_avg at dequeue, make sure
to do the read after load_avg is updated in dequeue_entities().

util_est for a periodic task before

                                periodic-3114 util_est.enqueued running
   ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
 95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
   │              ▛                                                                                │
 51┤             ▐▘                                                                                │
  7┤      ▖▗▗  ▗▄▐                                                                                 │
   └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
  0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87

and after

                                 periodic-2977 util_est.enqueued running
     ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
 82.0┤             ▟                                                                               │
     │             ▌                                                                               │
 44.5┤             ▌                                                                               │
  7.0┤      ▗   ▗▖ ▌                                                                               │
     └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
    0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86

Note how the signal is noisier and can peak to 183 vs 157 now.

Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---

This is split from [1] series where I stumbled upon this problem. AFAICS it
needs backporting all the way to 6.12 LTS.

[1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/

 kernel/sched/fair.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..96ba97e5f4ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  */
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
+	int ret;
+
 	if (task_is_throttled(p)) {
 		dequeue_throttled_task(p, flags);
 		return true;
@@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!p->se.sched_delayed)
 		util_est_dequeue(&rq->cfs, p);
 
+	ret = dequeue_entities(rq, &p->se, flags);
 	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
-	if (dequeue_entities(rq, &p->se, flags) < 0)
+	if (ret < 0)
 		return false;
 
 	/*
-- 
2.34.1

Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()

Posted by Tim Chen 4 weeks ago

On Tue, 2026-05-12 at 13:46 +0100, Qais Yousef wrote:
> update_util_est() reads task_util() at dequeue which is updated in
> dequeue_entities(). To read the accurate util_avg at dequeue, make sure
> to do the read after load_avg is updated in dequeue_entities().
> 
> util_est for a periodic task before
> 
>                                 periodic-3114 util_est.enqueued running
>    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
>  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
>    │              ▛                                                                                │
>  51┤             ▐▘                                                                                │
>   7┤      ▖▗▗  ▗▄▐                                                                                 │
>    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
>   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
> 
> and after
> 
>                                  periodic-2977 util_est.enqueued running
>      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
>  82.0┤             ▟                                                                               │
>      │             ▌                                                                               │
>  44.5┤             ▌                                                                               │
>   7.0┤      ▗   ▗▖ ▌                                                                               │
>      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
>     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
> 
> Note how the signal is noisier and can peak to 183 vs 157 now.
> 
> Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> 
> This is split from [1] series where I stumbled upon this problem. AFAICS it
> needs backporting all the way to 6.12 LTS.
> 
> [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> 
>  kernel/sched/fair.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..96ba97e5f4ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>   */
>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
> +	int ret;
> +
>  	if (task_is_throttled(p)) {
>  		dequeue_throttled_task(p, flags);
>  		return true;
> @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	if (!p->se.sched_delayed)
>  		util_est_dequeue(&rq->cfs, p);
>  
> +	ret = dequeue_entities(rq, &p->se, flags);
>  	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);

I thought that util_est_update() was called intentionally before dequeue_entities
to update the utilization of task p up to this time right
before the dequeue.  Then dequeue_entities() is called later
with up to date task utilization estimate of p.

Perhaps util_est_update() should be moved before
util_est_dequeue() so the updated utilization of p
is subtracted from the rq utilization.

@@ -8002,10 +8002,10 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
                return true;
        }
 
+       util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
        if (!p->se.sched_delayed)
                util_est_dequeue(&rq->cfs, p);
 
-       util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
        if (dequeue_entities(rq, &p->se, flags) < 0)
                return false;


Tim

> -	if (dequeue_entities(rq, &p->se, flags) < 0)
> +	if (ret < 0)
>  		return false;
>  
>  	/*

Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()

Posted by Qais Yousef 3 weeks, 6 days ago

On 05/15/26 11:35, Tim Chen wrote:
> On Tue, 2026-05-12 at 13:46 +0100, Qais Yousef wrote:
> > update_util_est() reads task_util() at dequeue which is updated in
> > dequeue_entities(). To read the accurate util_avg at dequeue, make sure
> > to do the read after load_avg is updated in dequeue_entities().
> > 
> > util_est for a periodic task before
> > 
> >                                 periodic-3114 util_est.enqueued running
> >    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> > 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
> >  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
> >    │              ▛                                                                                │
> >  51┤             ▐▘                                                                                │
> >   7┤      ▖▗▗  ▗▄▐                                                                                 │
> >    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> >   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
> > 
> > and after
> > 
> >                                  periodic-2977 util_est.enqueued running
> >      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> > 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
> >  82.0┤             ▟                                                                               │
> >      │             ▌                                                                               │
> >  44.5┤             ▌                                                                               │
> >   7.0┤      ▗   ▗▖ ▌                                                                               │
> >      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> >     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
> > 
> > Note how the signal is noisier and can peak to 183 vs 157 now.
> > 
> > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > ---
> > 
> > This is split from [1] series where I stumbled upon this problem. AFAICS it
> > needs backporting all the way to 6.12 LTS.
> > 
> > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> > 
> >  kernel/sched/fair.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 728965851842..96ba97e5f4ae 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >   */
> >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  {
> > +	int ret;
> > +
> >  	if (task_is_throttled(p)) {
> >  		dequeue_throttled_task(p, flags);
> >  		return true;
> > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  	if (!p->se.sched_delayed)
> >  		util_est_dequeue(&rq->cfs, p);
> >  
> > +	ret = dequeue_entities(rq, &p->se, flags);
> >  	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> 
> I thought that util_est_update() was called intentionally before dequeue_entities
> to update the utilization of task p up to this time right
> before the dequeue.  Then dequeue_entities() is called later
> with up to date task utilization estimate of p.

No. If you look at older versions of dequeue_task_fair() you'll see it was done
at the end.

util_est is a holding function, it should remember the last util_avg value at
dequeue, so the updates must happen first.

> 
> Perhaps util_est_update() should be moved before
> util_est_dequeue() so the updated utilization of p
> is subtracted from the rq utilization.

We actually should subtract the old value always before updating it. The update
happens only at dequeue. My rampup multiplier patches introduces updates for
running tasks, but has to do the dance of subtract, update and re-add otherwise
you'll end up with weird util values at the rq.

> 
> @@ -8002,10 +8002,10 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 return true;
>         }
>  
> +       util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
>         if (!p->se.sched_delayed)
>                 util_est_dequeue(&rq->cfs, p);
>  
> -       util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
>         if (dequeue_entities(rq, &p->se, flags) < 0)
>                 return false;
> 
> 
> Tim
> 
> > -	if (dequeue_entities(rq, &p->se, flags) < 0)
> > +	if (ret < 0)
> >  		return false;
> >  
> >  	/*

Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()

Posted by John Stultz 1 month ago

On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> update_util_est() reads task_util() at dequeue which is updated in
> dequeue_entities(). To read the accurate util_avg at dequeue, make sure
> to do the read after load_avg is updated in dequeue_entities().
>
> util_est for a periodic task before
>
>                                 periodic-3114 util_est.enqueued running
>    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
>  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
>    │              ▛                                                                                │
>  51┤             ▐▘                                                                                │
>   7┤      ▖▗▗  ▗▄▐                                                                                 │
>    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
>   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
>
> and after
>
>                                  periodic-2977 util_est.enqueued running
>      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
>  82.0┤             ▟                                                                               │
>      │             ▌                                                                               │
>  44.5┤             ▌                                                                               │
>   7.0┤      ▗   ▗▖ ▌                                                                               │
>      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
>     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
>
> Note how the signal is noisier and can peak to 183 vs 157 now.
>
> Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
>
> This is split from [1] series where I stumbled upon this problem. AFAICS it
> needs backporting all the way to 6.12 LTS.
>
> [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
>
>  kernel/sched/fair.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..96ba97e5f4ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>   */
>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
> +       int ret;
> +
>         if (task_is_throttled(p)) {
>                 dequeue_throttled_task(p, flags);
>                 return true;
> @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>         if (!p->se.sched_delayed)
>                 util_est_dequeue(&rq->cfs, p);
>
> +       ret = dequeue_entities(rq, &p->se, flags);
>         util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> -       if (dequeue_entities(rq, &p->se, flags) < 0)

Hrm, so the Sashiko tool raised a reasonable concern on the earlier
version of this:
  https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12

Specifically, we shouldn't reference p after dequeue_entities() or we
risk racing with it being woken up, running, and maybe exiting on
another cpu.
And this moves the util_est_updat() call to after dequeue finishes.

Maybe there's some way to have util_est_update() compensate for the
unfinished accounting that will be done in dequeue_entities()?

thanks
-john

Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()

Posted by Vincent Guittot 1 month ago

Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit :
> On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
> >
> > update_util_est() reads task_util() at dequeue which is updated in
> > dequeue_entities(). To read the accurate util_avg at dequeue, make sure
> > to do the read after load_avg is updated in dequeue_entities().
> >
> > util_est for a periodic task before
> >
> >                                 periodic-3114 util_est.enqueued running
> >    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> > 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
> >  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
> >    │              ▛                                                                                │
> >  51┤             ▐▘                                                                                │
> >   7┤      ▖▗▗  ▗▄▐                                                                                 │
> >    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> >   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
> >
> > and after
> >
> >                                  periodic-2977 util_est.enqueued running
> >      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> > 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
> >  82.0┤             ▟                                                                               │
> >      │             ▌                                                                               │
> >  44.5┤             ▌                                                                               │
> >   7.0┤      ▗   ▗▖ ▌                                                                               │
> >      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> >     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
> >
> > Note how the signal is noisier and can peak to 183 vs 157 now.
> >
> > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > ---
> >
> > This is split from [1] series where I stumbled upon this problem. AFAICS it
> > needs backporting all the way to 6.12 LTS.
> >
> > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> >
> >  kernel/sched/fair.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 728965851842..96ba97e5f4ae 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >   */
> >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  {
> > +       int ret;
> > +
> >         if (task_is_throttled(p)) {
> >                 dequeue_throttled_task(p, flags);
> >                 return true;
> > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >         if (!p->se.sched_delayed)
> >                 util_est_dequeue(&rq->cfs, p);
> >
> > +       ret = dequeue_entities(rq, &p->se, flags);
> >         util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > -       if (dequeue_entities(rq, &p->se, flags) < 0)
> 
> Hrm, so the Sashiko tool raised a reasonable concern on the earlier
> version of this:
>   https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12

Even without sashiko, this is what the comment says just below in the function ;-)

Below is a different way to fix it which also covers cases when we have both
DEQUEUE_SLEEP and DEQUEUE_DELAYED

---
 kernel/sched/fair.c | 189 ++++++++++++++++++++++----------------------
 1 file changed, 93 insertions(+), 96 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c83fbe4e88c1..0976adc12594 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	trace_pelt_cfs_tp(cfs_rq);
 }

+#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
+
+static inline void util_est_update(struct sched_entity *se)
+{
+	unsigned int ewma, dequeued, last_ewma_diff;
+
+	if (!sched_feat(UTIL_EST))
+		return;
+
+	/* Get current estimate of utilization */
+	ewma = READ_ONCE(se->avg.util_est);
+
+	/*
+	 * If the PELT values haven't changed since enqueue time,
+	 * skip the util_est update.
+	 */
+	if (ewma & UTIL_AVG_UNCHANGED)
+		return;
+
+	/* Get utilization at dequeue */
+	dequeued = READ_ONCE(se->avg.util_avg);
+
+	/*
+	 * Reset EWMA on utilization increases, the moving average is used only
+	 * to smooth utilization decreases.
+	 */
+	if (ewma <= dequeued) {
+		ewma = dequeued;
+		goto done;
+	}
+
+	/*
+	 * Skip update of task's estimated utilization when its members are
+	 * already ~1% close to its last activation value.
+	 */
+	last_ewma_diff = ewma - dequeued;
+	if (last_ewma_diff < UTIL_EST_MARGIN)
+		goto done;
+
+	/*
+	 * To avoid underestimate of task utilization, skip updates of EWMA if
+	 * we cannot grant that thread got all CPU time it wanted.
+	 */
+	if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg))
+		goto done;
+
+
+	/*
+	 * Update Task's estimated utilization
+	 *
+	 * When *p completes an activation we can consolidate another sample
+	 * of the task size. This is done by using this value to update the
+	 * Exponential Weighted Moving Average (EWMA):
+	 *
+	 *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
+	 *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
+	 *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
+	 *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
+	 *          = w * (-last_ewma_diff +  ewma(t-1) / w)
+	 *
+	 * Where 'w' is the weight of new samples, which is configured to be
+	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
+	 */
+	ewma <<= UTIL_EST_WEIGHT_SHIFT;
+	ewma  -= last_ewma_diff;
+	ewma >>= UTIL_EST_WEIGHT_SHIFT;
+done:
+	ewma |= UTIL_AVG_UNCHANGED;
+	WRITE_ONCE(se->avg.util_est, ewma);
+
+	trace_sched_util_est_se_tp(se);
+}
+
 /*
  * Optional action to be done while updating the load average
  */
-#define UPDATE_TG	0x1
-#define SKIP_AGE_LOAD	0x2
-#define DO_ATTACH	0x4
-#define DO_DETACH	0x8
+#define UPDATE_TG	0x01
+#define SKIP_AGE_LOAD	0x02
+#define DO_ATTACH	0x04
+#define DO_DETACH	0x08
+#define UPDATE_UTIL_EST	0x10

 /* Update task and its cfs_rq load average */
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
@@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		if (flags & UPDATE_TG)
 			update_tg_load_avg(cfs_rq);
 	}
+
+	if (flags & UPDATE_UTIL_EST)
+		util_est_update(se);
 }

 /*
@@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p)
 	return READ_ONCE(p->se.avg.util_avg);
 }

-static inline unsigned long task_runnable(struct task_struct *p)
-{
-	return READ_ONCE(p->se.avg.runnable_avg);
-}
-
 static inline unsigned long _task_util_est(struct task_struct *p)
 {
 	return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
@@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
 	trace_sched_util_est_cfs_tp(cfs_rq);
 }

-#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
-
-static inline void util_est_update(struct cfs_rq *cfs_rq,
-				   struct task_struct *p,
-				   bool task_sleep)
-{
-	unsigned int ewma, dequeued, last_ewma_diff;
-
-	if (!sched_feat(UTIL_EST))
-		return;
-
-	/*
-	 * Skip update of task's estimated utilization when the task has not
-	 * yet completed an activation, e.g. being migrated.
-	 */
-	if (!task_sleep)
-		return;
-
-	/* Get current estimate of utilization */
-	ewma = READ_ONCE(p->se.avg.util_est);
-
-	/*
-	 * If the PELT values haven't changed since enqueue time,
-	 * skip the util_est update.
-	 */
-	if (ewma & UTIL_AVG_UNCHANGED)
-		return;
-
-	/* Get utilization at dequeue */
-	dequeued = task_util(p);
-
-	/*
-	 * Reset EWMA on utilization increases, the moving average is used only
-	 * to smooth utilization decreases.
-	 */
-	if (ewma <= dequeued) {
-		ewma = dequeued;
-		goto done;
-	}
-
-	/*
-	 * Skip update of task's estimated utilization when its members are
-	 * already ~1% close to its last activation value.
-	 */
-	last_ewma_diff = ewma - dequeued;
-	if (last_ewma_diff < UTIL_EST_MARGIN)
-		goto done;
-
-	/*
-	 * To avoid underestimate of task utilization, skip updates of EWMA if
-	 * we cannot grant that thread got all CPU time it wanted.
-	 */
-	if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
-		goto done;
-
-
-	/*
-	 * Update Task's estimated utilization
-	 *
-	 * When *p completes an activation we can consolidate another sample
-	 * of the task size. This is done by using this value to update the
-	 * Exponential Weighted Moving Average (EWMA):
-	 *
-	 *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
-	 *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
-	 *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
-	 *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
-	 *          = w * (-last_ewma_diff +  ewma(t-1) / w)
-	 *
-	 * Where 'w' is the weight of new samples, which is configured to be
-	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
-	 */
-	ewma <<= UTIL_EST_WEIGHT_SHIFT;
-	ewma  -= last_ewma_diff;
-	ewma >>= UTIL_EST_WEIGHT_SHIFT;
-done:
-	ewma |= UTIL_AVG_UNCHANGED;
-	WRITE_ONCE(p->se.avg.util_est, ewma);
-
-	trace_sched_util_est_se_tp(&p->se);
-}
-
 static inline unsigned long get_actual_cpu_capacity(int cpu)
 {
 	unsigned long capacity = arch_scale_cpu_capacity(cpu);
@@ -5828,7 +5818,7 @@ static bool
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	bool sleep = flags & DEQUEUE_SLEEP;
-	int action = UPDATE_TG;
+	int action = 0;

 	update_curr(cfs_rq);
 	clear_buddies(cfs_rq, se);
@@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

 		if (sched_feat(DELAY_DEQUEUE) && delay &&
 		    !entity_eligible(cfs_rq, se)) {
-			update_load_avg(cfs_rq, se, 0);
+			if (entity_is_task(se))
+				action |= UPDATE_UTIL_EST;
+			update_load_avg(cfs_rq, se, action);
 			update_entity_lag(cfs_rq, se);
 			set_delayed(se);
 			return false;
 		}
 	}

-	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
-		action |= DO_DETACH;
+	action = UPDATE_TG;
+	if (entity_is_task(se)) {
+		if (task_on_rq_migrating(task_of(se)))
+			action |= DO_DETACH;
+
+		if (sleep && !(flags & DEQUEUE_DELAYED))
+			action |= UPDATE_UTIL_EST;
+	}

 	/*
 	 * When dequeuing a sched_entity, we must:
@@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!p->se.sched_delayed)
 		util_est_dequeue(&rq->cfs, p);

-	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
 	if (dequeue_entities(rq, &p->se, flags) < 0)
 		return false;

--
2.43.0




> 
> Specifically, we shouldn't reference p after dequeue_entities() or we
> risk racing with it being woken up, running, and maybe exiting on
> another cpu.
> And this moves the util_est_updat() call to after dequeue finishes.
> 
> Maybe there's some way to have util_est_update() compensate for the
> unfinished accounting that will be done in dequeue_entities()?
> 
> thanks
> -john

Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()

Posted by Qais Yousef 4 weeks ago

On 05/13/26 14:46, Vincent Guittot wrote:
> Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit :
> > On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > update_util_est() reads task_util() at dequeue which is updated in
> > > dequeue_entities(). To read the accurate util_avg at dequeue, make sure
> > > to do the read after load_avg is updated in dequeue_entities().
> > >
> > > util_est for a periodic task before
> > >
> > >                                 periodic-3114 util_est.enqueued running
> > >    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > > 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> > > 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
> > >  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
> > >    │              ▛                                                                                │
> > >  51┤             ▐▘                                                                                │
> > >   7┤      ▖▗▗  ▗▄▐                                                                                 │
> > >    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> > >   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
> > >
> > > and after
> > >
> > >                                  periodic-2977 util_est.enqueued running
> > >      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > > 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> > > 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
> > >  82.0┤             ▟                                                                               │
> > >      │             ▌                                                                               │
> > >  44.5┤             ▌                                                                               │
> > >   7.0┤      ▗   ▗▖ ▌                                                                               │
> > >      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> > >     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
> > >
> > > Note how the signal is noisier and can peak to 183 vs 157 now.
> > >
> > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > > ---
> > >
> > > This is split from [1] series where I stumbled upon this problem. AFAICS it
> > > needs backporting all the way to 6.12 LTS.
> > >
> > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> > >
> > >  kernel/sched/fair.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 728965851842..96ba97e5f4ae 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > >   */
> > >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  {
> > > +       int ret;
> > > +
> > >         if (task_is_throttled(p)) {
> > >                 dequeue_throttled_task(p, flags);
> > >                 return true;
> > > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >         if (!p->se.sched_delayed)
> > >                 util_est_dequeue(&rq->cfs, p);
> > >
> > > +       ret = dequeue_entities(rq, &p->se, flags);
> > >         util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > > -       if (dequeue_entities(rq, &p->se, flags) < 0)
> > 
> > Hrm, so the Sashiko tool raised a reasonable concern on the earlier
> > version of this:
> >   https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12
> 
> Even without sashiko, this is what the comment says just below in the function ;-)

Yeah I read it but it didn't really register properly :-)

> 
> Below is a different way to fix it which also covers cases when we have both
> DEQUEUE_SLEEP and DEQUEUE_DELAYED

Works for me. It looks even more stable now

util_avg

                                    periodic-2737 util_avg running
   ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
140┤               ▟▟█▄▄██▙▄▟▟▙▙▟█▟▄▟▟▟▄▄██▄▟▙█▙▄▙█▙▄█▙▙▄▙▙▙█▟█▄▟▟▟▄▟█▟▟▄██▄▙█▟▟▄▟██▄              │
105┤              ▄██████████████████████████████████████████████████████████████████              │
 70┤             ▗▛▘                                                                               │
   │             ▐▘                                                                                │
 35┤             █                                                                                 │
  0┤       ▖   ▗▄▌                                                                                 │
   └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
  0.00      0.65       1.30      1.95       2.60      3.25       3.90      4.55       5.20     5.85

and util_eset

                                 periodic-2737 util_est.enqueued running
     ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
140.0┤               ▞▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▜▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▜▛▀▀▀▛▀▀▀▀▀▀▀▀▀▘             │
106.8┤              ▝▘                                                                             │
 73.5┤             ▗▘                                                                              │
     │             ▗                                                                               │
 40.2┤             ▖                                                                               │
  7.0┤       ▖   ▄▗▖                                                                               │
     └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
    0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.55      5.20     5.85

Do you want to send a proper patch? Feel free to stick my
reviewed-and-tested-by.

Thanks!

> 
> ---
>  kernel/sched/fair.c | 189 ++++++++++++++++++++++----------------------
>  1 file changed, 93 insertions(+), 96 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c83fbe4e88c1..0976adc12594 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  	trace_pelt_cfs_tp(cfs_rq);
>  }
> 
> +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> +
> +static inline void util_est_update(struct sched_entity *se)
> +{
> +	unsigned int ewma, dequeued, last_ewma_diff;
> +
> +	if (!sched_feat(UTIL_EST))
> +		return;
> +
> +	/* Get current estimate of utilization */
> +	ewma = READ_ONCE(se->avg.util_est);
> +
> +	/*
> +	 * If the PELT values haven't changed since enqueue time,
> +	 * skip the util_est update.
> +	 */
> +	if (ewma & UTIL_AVG_UNCHANGED)
> +		return;
> +
> +	/* Get utilization at dequeue */
> +	dequeued = READ_ONCE(se->avg.util_avg);
> +
> +	/*
> +	 * Reset EWMA on utilization increases, the moving average is used only
> +	 * to smooth utilization decreases.
> +	 */
> +	if (ewma <= dequeued) {
> +		ewma = dequeued;
> +		goto done;
> +	}
> +
> +	/*
> +	 * Skip update of task's estimated utilization when its members are
> +	 * already ~1% close to its last activation value.
> +	 */
> +	last_ewma_diff = ewma - dequeued;
> +	if (last_ewma_diff < UTIL_EST_MARGIN)
> +		goto done;
> +
> +	/*
> +	 * To avoid underestimate of task utilization, skip updates of EWMA if
> +	 * we cannot grant that thread got all CPU time it wanted.
> +	 */
> +	if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg))
> +		goto done;
> +
> +
> +	/*
> +	 * Update Task's estimated utilization
> +	 *
> +	 * When *p completes an activation we can consolidate another sample
> +	 * of the task size. This is done by using this value to update the
> +	 * Exponential Weighted Moving Average (EWMA):
> +	 *
> +	 *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> +	 *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
> +	 *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
> +	 *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
> +	 *          = w * (-last_ewma_diff +  ewma(t-1) / w)
> +	 *
> +	 * Where 'w' is the weight of new samples, which is configured to be
> +	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> +	 */
> +	ewma <<= UTIL_EST_WEIGHT_SHIFT;
> +	ewma  -= last_ewma_diff;
> +	ewma >>= UTIL_EST_WEIGHT_SHIFT;
> +done:
> +	ewma |= UTIL_AVG_UNCHANGED;
> +	WRITE_ONCE(se->avg.util_est, ewma);
> +
> +	trace_sched_util_est_se_tp(se);
> +}
> +
>  /*
>   * Optional action to be done while updating the load average
>   */
> -#define UPDATE_TG	0x1
> -#define SKIP_AGE_LOAD	0x2
> -#define DO_ATTACH	0x4
> -#define DO_DETACH	0x8
> +#define UPDATE_TG	0x01
> +#define SKIP_AGE_LOAD	0x02
> +#define DO_ATTACH	0x04
> +#define DO_DETACH	0x08
> +#define UPDATE_UTIL_EST	0x10
> 
>  /* Update task and its cfs_rq load average */
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> @@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  		if (flags & UPDATE_TG)
>  			update_tg_load_avg(cfs_rq);
>  	}
> +
> +	if (flags & UPDATE_UTIL_EST)
> +		util_est_update(se);
>  }
> 
>  /*
> @@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p)
>  	return READ_ONCE(p->se.avg.util_avg);
>  }
> 
> -static inline unsigned long task_runnable(struct task_struct *p)
> -{
> -	return READ_ONCE(p->se.avg.runnable_avg);
> -}
> -
>  static inline unsigned long _task_util_est(struct task_struct *p)
>  {
>  	return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
> @@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
>  	trace_sched_util_est_cfs_tp(cfs_rq);
>  }
> 
> -#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> -
> -static inline void util_est_update(struct cfs_rq *cfs_rq,
> -				   struct task_struct *p,
> -				   bool task_sleep)
> -{
> -	unsigned int ewma, dequeued, last_ewma_diff;
> -
> -	if (!sched_feat(UTIL_EST))
> -		return;
> -
> -	/*
> -	 * Skip update of task's estimated utilization when the task has not
> -	 * yet completed an activation, e.g. being migrated.
> -	 */
> -	if (!task_sleep)
> -		return;
> -
> -	/* Get current estimate of utilization */
> -	ewma = READ_ONCE(p->se.avg.util_est);
> -
> -	/*
> -	 * If the PELT values haven't changed since enqueue time,
> -	 * skip the util_est update.
> -	 */
> -	if (ewma & UTIL_AVG_UNCHANGED)
> -		return;
> -
> -	/* Get utilization at dequeue */
> -	dequeued = task_util(p);
> -
> -	/*
> -	 * Reset EWMA on utilization increases, the moving average is used only
> -	 * to smooth utilization decreases.
> -	 */
> -	if (ewma <= dequeued) {
> -		ewma = dequeued;
> -		goto done;
> -	}
> -
> -	/*
> -	 * Skip update of task's estimated utilization when its members are
> -	 * already ~1% close to its last activation value.
> -	 */
> -	last_ewma_diff = ewma - dequeued;
> -	if (last_ewma_diff < UTIL_EST_MARGIN)
> -		goto done;
> -
> -	/*
> -	 * To avoid underestimate of task utilization, skip updates of EWMA if
> -	 * we cannot grant that thread got all CPU time it wanted.
> -	 */
> -	if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> -		goto done;
> -
> -
> -	/*
> -	 * Update Task's estimated utilization
> -	 *
> -	 * When *p completes an activation we can consolidate another sample
> -	 * of the task size. This is done by using this value to update the
> -	 * Exponential Weighted Moving Average (EWMA):
> -	 *
> -	 *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> -	 *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
> -	 *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
> -	 *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
> -	 *          = w * (-last_ewma_diff +  ewma(t-1) / w)
> -	 *
> -	 * Where 'w' is the weight of new samples, which is configured to be
> -	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> -	 */
> -	ewma <<= UTIL_EST_WEIGHT_SHIFT;
> -	ewma  -= last_ewma_diff;
> -	ewma >>= UTIL_EST_WEIGHT_SHIFT;
> -done:
> -	ewma |= UTIL_AVG_UNCHANGED;
> -	WRITE_ONCE(p->se.avg.util_est, ewma);
> -
> -	trace_sched_util_est_se_tp(&p->se);
> -}
> -
>  static inline unsigned long get_actual_cpu_capacity(int cpu)
>  {
>  	unsigned long capacity = arch_scale_cpu_capacity(cpu);
> @@ -5828,7 +5818,7 @@ static bool
>  dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>  	bool sleep = flags & DEQUEUE_SLEEP;
> -	int action = UPDATE_TG;
> +	int action = 0;
> 
>  	update_curr(cfs_rq);
>  	clear_buddies(cfs_rq, se);
> @@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> 
>  		if (sched_feat(DELAY_DEQUEUE) && delay &&
>  		    !entity_eligible(cfs_rq, se)) {
> -			update_load_avg(cfs_rq, se, 0);
> +			if (entity_is_task(se))
> +				action |= UPDATE_UTIL_EST;
> +			update_load_avg(cfs_rq, se, action);
>  			update_entity_lag(cfs_rq, se);
>  			set_delayed(se);
>  			return false;
>  		}
>  	}
> 
> -	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> -		action |= DO_DETACH;
> +	action = UPDATE_TG;
> +	if (entity_is_task(se)) {
> +		if (task_on_rq_migrating(task_of(se)))
> +			action |= DO_DETACH;
> +
> +		if (sleep && !(flags & DEQUEUE_DELAYED))
> +			action |= UPDATE_UTIL_EST;
> +	}
> 
>  	/*
>  	 * When dequeuing a sched_entity, we must:
> @@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	if (!p->se.sched_delayed)
>  		util_est_dequeue(&rq->cfs, p);
> 
> -	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
>  	if (dequeue_entities(rq, &p->se, flags) < 0)
>  		return false;
> 
> --
> 2.43.0
> 
> 
> 
> 
> > 
> > Specifically, we shouldn't reference p after dequeue_entities() or we
> > risk racing with it being woken up, running, and maybe exiting on
> > another cpu.
> > And this moves the util_est_updat() call to after dequeue finishes.
> > 
> > Maybe there's some way to have util_est_update() compensate for the
> > unfinished accounting that will be done in dequeue_entities()?
> > 
> > thanks
> > -john

Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()

Posted by Vincent Guittot 4 weeks ago

On Fri, 15 May 2026 at 03:51, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 05/13/26 14:46, Vincent Guittot wrote:
> > Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit :
> > > On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
> > > >
> > > > update_util_est() reads task_util() at dequeue which is updated in
> > > > dequeue_entities(). To read the accurate util_avg at dequeue, make sure
> > > > to do the read after load_avg is updated in dequeue_entities().
> > > >
> > > > util_est for a periodic task before
> > > >
> > > >                                 periodic-3114 util_est.enqueued running
> > > >    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > > > 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> > > > 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
> > > >  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
> > > >    │              ▛                                                                                │
> > > >  51┤             ▐▘                                                                                │
> > > >   7┤      ▖▗▗  ▗▄▐                                                                                 │
> > > >    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> > > >   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
> > > >
> > > > and after
> > > >
> > > >                                  periodic-2977 util_est.enqueued running
> > > >      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > > > 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> > > > 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
> > > >  82.0┤             ▟                                                                               │
> > > >      │             ▌                                                                               │
> > > >  44.5┤             ▌                                                                               │
> > > >   7.0┤      ▗   ▗▖ ▌                                                                               │
> > > >      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> > > >     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
> > > >
> > > > Note how the signal is noisier and can peak to 183 vs 157 now.
> > > >
> > > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > > > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > > > ---
> > > >
> > > > This is split from [1] series where I stumbled upon this problem. AFAICS it
> > > > needs backporting all the way to 6.12 LTS.
> > > >
> > > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> > > >
> > > >  kernel/sched/fair.c | 5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 728965851842..96ba97e5f4ae 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > > >   */
> > > >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > >  {
> > > > +       int ret;
> > > > +
> > > >         if (task_is_throttled(p)) {
> > > >                 dequeue_throttled_task(p, flags);
> > > >                 return true;
> > > > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > >         if (!p->se.sched_delayed)
> > > >                 util_est_dequeue(&rq->cfs, p);
> > > >
> > > > +       ret = dequeue_entities(rq, &p->se, flags);
> > > >         util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > > > -       if (dequeue_entities(rq, &p->se, flags) < 0)
> > >
> > > Hrm, so the Sashiko tool raised a reasonable concern on the earlier
> > > version of this:
> > >   https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12
> >
> > Even without sashiko, this is what the comment says just below in the function ;-)
>
> Yeah I read it but it didn't really register properly :-)
>
> >
> > Below is a different way to fix it which also covers cases when we have both
> > DEQUEUE_SLEEP and DEQUEUE_DELAYED
>
> Works for me. It looks even more stable now
>
> util_avg
>
>                                     periodic-2737 util_avg running
>    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> 140┤               ▟▟█▄▄██▙▄▟▟▙▙▟█▟▄▟▟▟▄▄██▄▟▙█▙▄▙█▙▄█▙▙▄▙▙▙█▟█▄▟▟▟▄▟█▟▟▄██▄▙█▟▟▄▟██▄              │
> 105┤              ▄██████████████████████████████████████████████████████████████████              │
>  70┤             ▗▛▘                                                                               │
>    │             ▐▘                                                                                │
>  35┤             █                                                                                 │
>   0┤       ▖   ▗▄▌                                                                                 │
>    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
>   0.00      0.65       1.30      1.95       2.60      3.25       3.90      4.55       5.20     5.85
>
> and util_eset
>
>                                  periodic-2737 util_est.enqueued running
>      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> 140.0┤               ▞▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▜▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▜▛▀▀▀▛▀▀▀▀▀▀▀▀▀▘             │
> 106.8┤              ▝▘                                                                             │
>  73.5┤             ▗▘                                                                              │
>      │             ▗                                                                               │
>  40.2┤             ▖                                                                               │
>   7.0┤       ▖   ▄▗▖                                                                               │
>      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
>     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.55      5.20     5.85
>
> Do you want to send a proper patch? Feel free to stick my
> reviewed-and-tested-by.

Yes I'm going to send a proper patch with reported, reviewed and tested by You

Thanks

>
> Thanks!
>
> >
> > ---
> >  kernel/sched/fair.c | 189 ++++++++++++++++++++++----------------------
> >  1 file changed, 93 insertions(+), 96 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c83fbe4e88c1..0976adc12594 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >       trace_pelt_cfs_tp(cfs_rq);
> >  }
> >
> > +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> > +
> > +static inline void util_est_update(struct sched_entity *se)
> > +{
> > +     unsigned int ewma, dequeued, last_ewma_diff;
> > +
> > +     if (!sched_feat(UTIL_EST))
> > +             return;
> > +
> > +     /* Get current estimate of utilization */
> > +     ewma = READ_ONCE(se->avg.util_est);
> > +
> > +     /*
> > +      * If the PELT values haven't changed since enqueue time,
> > +      * skip the util_est update.
> > +      */
> > +     if (ewma & UTIL_AVG_UNCHANGED)
> > +             return;
> > +
> > +     /* Get utilization at dequeue */
> > +     dequeued = READ_ONCE(se->avg.util_avg);
> > +
> > +     /*
> > +      * Reset EWMA on utilization increases, the moving average is used only
> > +      * to smooth utilization decreases.
> > +      */
> > +     if (ewma <= dequeued) {
> > +             ewma = dequeued;
> > +             goto done;
> > +     }
> > +
> > +     /*
> > +      * Skip update of task's estimated utilization when its members are
> > +      * already ~1% close to its last activation value.
> > +      */
> > +     last_ewma_diff = ewma - dequeued;
> > +     if (last_ewma_diff < UTIL_EST_MARGIN)
> > +             goto done;
> > +
> > +     /*
> > +      * To avoid underestimate of task utilization, skip updates of EWMA if
> > +      * we cannot grant that thread got all CPU time it wanted.
> > +      */
> > +     if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg))
> > +             goto done;
> > +
> > +
> > +     /*
> > +      * Update Task's estimated utilization
> > +      *
> > +      * When *p completes an activation we can consolidate another sample
> > +      * of the task size. This is done by using this value to update the
> > +      * Exponential Weighted Moving Average (EWMA):
> > +      *
> > +      *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> > +      *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
> > +      *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
> > +      *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
> > +      *          = w * (-last_ewma_diff +  ewma(t-1) / w)
> > +      *
> > +      * Where 'w' is the weight of new samples, which is configured to be
> > +      * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> > +      */
> > +     ewma <<= UTIL_EST_WEIGHT_SHIFT;
> > +     ewma  -= last_ewma_diff;
> > +     ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > +done:
> > +     ewma |= UTIL_AVG_UNCHANGED;
> > +     WRITE_ONCE(se->avg.util_est, ewma);
> > +
> > +     trace_sched_util_est_se_tp(se);
> > +}
> > +
> >  /*
> >   * Optional action to be done while updating the load average
> >   */
> > -#define UPDATE_TG    0x1
> > -#define SKIP_AGE_LOAD        0x2
> > -#define DO_ATTACH    0x4
> > -#define DO_DETACH    0x8
> > +#define UPDATE_TG    0x01
> > +#define SKIP_AGE_LOAD        0x02
> > +#define DO_ATTACH    0x04
> > +#define DO_DETACH    0x08
> > +#define UPDATE_UTIL_EST      0x10
> >
> >  /* Update task and its cfs_rq load average */
> >  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > @@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >               if (flags & UPDATE_TG)
> >                       update_tg_load_avg(cfs_rq);
> >       }
> > +
> > +     if (flags & UPDATE_UTIL_EST)
> > +             util_est_update(se);
> >  }
> >
> >  /*
> > @@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p)
> >       return READ_ONCE(p->se.avg.util_avg);
> >  }
> >
> > -static inline unsigned long task_runnable(struct task_struct *p)
> > -{
> > -     return READ_ONCE(p->se.avg.runnable_avg);
> > -}
> > -
> >  static inline unsigned long _task_util_est(struct task_struct *p)
> >  {
> >       return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
> > @@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> >       trace_sched_util_est_cfs_tp(cfs_rq);
> >  }
> >
> > -#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> > -
> > -static inline void util_est_update(struct cfs_rq *cfs_rq,
> > -                                struct task_struct *p,
> > -                                bool task_sleep)
> > -{
> > -     unsigned int ewma, dequeued, last_ewma_diff;
> > -
> > -     if (!sched_feat(UTIL_EST))
> > -             return;
> > -
> > -     /*
> > -      * Skip update of task's estimated utilization when the task has not
> > -      * yet completed an activation, e.g. being migrated.
> > -      */
> > -     if (!task_sleep)
> > -             return;
> > -
> > -     /* Get current estimate of utilization */
> > -     ewma = READ_ONCE(p->se.avg.util_est);
> > -
> > -     /*
> > -      * If the PELT values haven't changed since enqueue time,
> > -      * skip the util_est update.
> > -      */
> > -     if (ewma & UTIL_AVG_UNCHANGED)
> > -             return;
> > -
> > -     /* Get utilization at dequeue */
> > -     dequeued = task_util(p);
> > -
> > -     /*
> > -      * Reset EWMA on utilization increases, the moving average is used only
> > -      * to smooth utilization decreases.
> > -      */
> > -     if (ewma <= dequeued) {
> > -             ewma = dequeued;
> > -             goto done;
> > -     }
> > -
> > -     /*
> > -      * Skip update of task's estimated utilization when its members are
> > -      * already ~1% close to its last activation value.
> > -      */
> > -     last_ewma_diff = ewma - dequeued;
> > -     if (last_ewma_diff < UTIL_EST_MARGIN)
> > -             goto done;
> > -
> > -     /*
> > -      * To avoid underestimate of task utilization, skip updates of EWMA if
> > -      * we cannot grant that thread got all CPU time it wanted.
> > -      */
> > -     if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> > -             goto done;
> > -
> > -
> > -     /*
> > -      * Update Task's estimated utilization
> > -      *
> > -      * When *p completes an activation we can consolidate another sample
> > -      * of the task size. This is done by using this value to update the
> > -      * Exponential Weighted Moving Average (EWMA):
> > -      *
> > -      *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> > -      *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
> > -      *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
> > -      *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
> > -      *          = w * (-last_ewma_diff +  ewma(t-1) / w)
> > -      *
> > -      * Where 'w' is the weight of new samples, which is configured to be
> > -      * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> > -      */
> > -     ewma <<= UTIL_EST_WEIGHT_SHIFT;
> > -     ewma  -= last_ewma_diff;
> > -     ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > -done:
> > -     ewma |= UTIL_AVG_UNCHANGED;
> > -     WRITE_ONCE(p->se.avg.util_est, ewma);
> > -
> > -     trace_sched_util_est_se_tp(&p->se);
> > -}
> > -
> >  static inline unsigned long get_actual_cpu_capacity(int cpu)
> >  {
> >       unsigned long capacity = arch_scale_cpu_capacity(cpu);
> > @@ -5828,7 +5818,7 @@ static bool
> >  dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  {
> >       bool sleep = flags & DEQUEUE_SLEEP;
> > -     int action = UPDATE_TG;
> > +     int action = 0;
> >
> >       update_curr(cfs_rq);
> >       clear_buddies(cfs_rq, se);
> > @@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >
> >               if (sched_feat(DELAY_DEQUEUE) && delay &&
> >                   !entity_eligible(cfs_rq, se)) {
> > -                     update_load_avg(cfs_rq, se, 0);
> > +                     if (entity_is_task(se))
> > +                             action |= UPDATE_UTIL_EST;
> > +                     update_load_avg(cfs_rq, se, action);
> >                       update_entity_lag(cfs_rq, se);
> >                       set_delayed(se);
> >                       return false;
> >               }
> >       }
> >
> > -     if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> > -             action |= DO_DETACH;
> > +     action = UPDATE_TG;
> > +     if (entity_is_task(se)) {
> > +             if (task_on_rq_migrating(task_of(se)))
> > +                     action |= DO_DETACH;
> > +
> > +             if (sleep && !(flags & DEQUEUE_DELAYED))
> > +                     action |= UPDATE_UTIL_EST;
> > +     }
> >
> >       /*
> >        * When dequeuing a sched_entity, we must:
> > @@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >       if (!p->se.sched_delayed)
> >               util_est_dequeue(&rq->cfs, p);
> >
> > -     util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> >       if (dequeue_entities(rq, &p->se, flags) < 0)
> >               return false;
> >
> > --
> > 2.43.0
> >
> >
> >
> >
> > >
> > > Specifically, we shouldn't reference p after dequeue_entities() or we
> > > risk racing with it being woken up, running, and maybe exiting on
> > > another cpu.
> > > And this moves the util_est_updat() call to after dequeue finishes.
> > >
> > > Maybe there's some way to have util_est_update() compensate for the
> > > unfinished accounting that will be done in dequeue_entities()?
> > >
> > > thanks
> > > -john