sched/fair: Update zero_vruntime after clearing on_rq in dequeue_entity()

[PATCH] sched/fair: Update zero_vruntime after clearing on_rq in dequeue_entity()

Posted by Zicheng Qu 2 weeks, 3 days ago

When dequeuing the current entity (cfs_rq->curr) in dequeue_entity(),
the cfs_rq->zero_vruntime is updated via update_entity_lag() ->
avg_vruntime() -> update_zero_vruntime() while curr->on_rq is still 1.
This means the current entity is still included in the zero_vruntime
calculation.

However, immediately after this, curr->on_rq is set to 0, which should
change the avg_vruntime() result. Without re-updating zero_vruntime, the
stale value may be used in subsequent task selection paths:

  schedule() -> ... -> pick_task_fair() -> pick_next_entity() ->
  pick_eevdf() -> vruntime_eligible()

If entity_tick() -> avg_vruntime() -> update_zero_vruntime() is not
triggered in time between dequeue and the next pick, vruntime_eligible()
may use an inaccurate cfs_rq->zero_vruntime. This can potentially cause
all tasks to appear ineligible, leading to NULL pointer dereference.

Add an explicit avg_vruntime(cfs_rq) call after clearing curr->on_rq to
ensure cfs_rq->zero_vruntime is properly updated before the next pick.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf948db905ed..f8070767c2f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5461,6 +5461,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
+	/* update the cfs_rq->zero_vruntime again after curr->on_rq = 0 */
+	if (se == cfs_rq->curr)
+		avg_vruntime(cfs_rq);
 	account_entity_dequeue(cfs_rq, se);
 
 	/* return excess runtime on last dequeue */
-- 
2.34.1

Re: [PATCH] sched/fair: Update zero_vruntime after clearing on_rq in dequeue_entity()

Posted by Vincent Guittot 2 weeks ago

On Thu, 19 Mar 2026 at 12:43, Zicheng Qu <quzicheng@huawei.com> wrote:
>
> When dequeuing the current entity (cfs_rq->curr) in dequeue_entity(),
> the cfs_rq->zero_vruntime is updated via update_entity_lag() ->
> avg_vruntime() -> update_zero_vruntime() while curr->on_rq is still 1.
> This means the current entity is still included in the zero_vruntime
> calculation.

curr is not included in zero_vruntime but added when computing
avg_vruntime so zero_vruntime is not impacted when curr is dequeued

>
> However, immediately after this, curr->on_rq is set to 0, which should
> change the avg_vruntime() result. Without re-updating zero_vruntime, the
> stale value may be used in subsequent task selection paths:
>
>   schedule() -> ... -> pick_task_fair() -> pick_next_entity() ->
>   pick_eevdf() -> vruntime_eligible()
>
> If entity_tick() -> avg_vruntime() -> update_zero_vruntime() is not
> triggered in time between dequeue and the next pick, vruntime_eligible()
> may use an inaccurate cfs_rq->zero_vruntime. This can potentially cause
> all tasks to appear ineligible, leading to NULL pointer dereference.
>
> Add an explicit avg_vruntime(cfs_rq) call after clearing curr->on_rq to
> ensure cfs_rq->zero_vruntime is properly updated before the next pick.
>
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> ---
>  kernel/sched/fair.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bf948db905ed..f8070767c2f4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5461,6 +5461,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>         if (se != cfs_rq->curr)
>                 __dequeue_entity(cfs_rq, se);
>         se->on_rq = 0;
> +       /* update the cfs_rq->zero_vruntime again after curr->on_rq = 0 */
> +       if (se == cfs_rq->curr)
> +               avg_vruntime(cfs_rq);
>         account_entity_dequeue(cfs_rq, se);
>
>         /* return excess runtime on last dequeue */
> --
> 2.34.1
>

Re: [PATCH] sched/fair: Update zero_vruntime after clearing on_rq in dequeue_entity()

Posted by Peter Zijlstra 1 week, 6 days ago

On Mon, Mar 23, 2026 at 08:52:21AM +0100, Vincent Guittot wrote:
> On Thu, 19 Mar 2026 at 12:43, Zicheng Qu <quzicheng@huawei.com> wrote:
> >
> > When dequeuing the current entity (cfs_rq->curr) in dequeue_entity(),
> > the cfs_rq->zero_vruntime is updated via update_entity_lag() ->
> > avg_vruntime() -> update_zero_vruntime() while curr->on_rq is still 1.
> > This means the current entity is still included in the zero_vruntime
> > calculation.
> 
> curr is not included in zero_vruntime but added when computing
> avg_vruntime so zero_vruntime is not impacted when curr is dequeued

It is, we explicitly add curr back in.

> > However, immediately after this, curr->on_rq is set to 0, which should
> > change the avg_vruntime() result. Without re-updating zero_vruntime, the
> > stale value may be used in subsequent task selection paths:
> >
> >   schedule() -> ... -> pick_task_fair() -> pick_next_entity() ->
> >   pick_eevdf() -> vruntime_eligible()
> >
> > If entity_tick() -> avg_vruntime() -> update_zero_vruntime() is not
> > triggered in time between dequeue and the next pick, vruntime_eligible()
> > may use an inaccurate cfs_rq->zero_vruntime. This can potentially cause
> > all tasks to appear ineligible, leading to NULL pointer dereference.

This makes no sense.

One entity worth of vruntime should not affect things to the point of
overrun. Yes, it is true that zero_vruntime != avg_vruntime() right
after a dequeue, but that doesn't matter.

vruntime_eligible() does the same math that avg_vruntime() does and
takes this difference into account.

As long as zero_vruntime is close 'enough' to avg_vruntime, all the
deltas are small and nothing overflows.

Re: [PATCH] sched/fair: Update zero_vruntime after clearing on_rq in dequeue_entity()

Posted by Vincent Guittot 1 week, 6 days ago

On Mon, 23 Mar 2026 at 10:57, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Mar 23, 2026 at 08:52:21AM +0100, Vincent Guittot wrote:
> > On Thu, 19 Mar 2026 at 12:43, Zicheng Qu <quzicheng@huawei.com> wrote:
> > >
> > > When dequeuing the current entity (cfs_rq->curr) in dequeue_entity(),
> > > the cfs_rq->zero_vruntime is updated via update_entity_lag() ->
> > > avg_vruntime() -> update_zero_vruntime() while curr->on_rq is still 1.
> > > This means the current entity is still included in the zero_vruntime
> > > calculation.
> >
> > curr is not included in zero_vruntime but added when computing
> > avg_vruntime so zero_vruntime is not impacted when curr is dequeued
>
> It is, we explicitly add curr back in.

yeah, i should have taken one more coffee before replying

>
> > > However, immediately after this, curr->on_rq is set to 0, which should
> > > change the avg_vruntime() result. Without re-updating zero_vruntime, the
> > > stale value may be used in subsequent task selection paths:
> > >
> > >   schedule() -> ... -> pick_task_fair() -> pick_next_entity() ->
> > >   pick_eevdf() -> vruntime_eligible()
> > >
> > > If entity_tick() -> avg_vruntime() -> update_zero_vruntime() is not
> > > triggered in time between dequeue and the next pick, vruntime_eligible()
> > > may use an inaccurate cfs_rq->zero_vruntime. This can potentially cause
> > > all tasks to appear ineligible, leading to NULL pointer dereference.
>
> This makes no sense.
>
> One entity worth of vruntime should not affect things to the point of
> overrun. Yes, it is true that zero_vruntime != avg_vruntime() right
> after a dequeue, but that doesn't matter.
>
> vruntime_eligible() does the same math that avg_vruntime() does and
> takes this difference into account.
>
> As long as zero_vruntime is close 'enough' to avg_vruntime, all the
> deltas are small and nothing overflows.
>