kernel/sched/fair.c | 3 +++ 1 file changed, 3 insertions(+)
When dequeuing the current entity (cfs_rq->curr) in dequeue_entity(),
the cfs_rq->zero_vruntime is updated via update_entity_lag() ->
avg_vruntime() -> update_zero_vruntime() while curr->on_rq is still 1.
This means the current entity is still included in the zero_vruntime
calculation.
However, immediately after this, curr->on_rq is set to 0, which should
change the avg_vruntime() result. Without re-updating zero_vruntime, the
stale value may be used in subsequent task selection paths:
schedule() -> ... -> pick_task_fair() -> pick_next_entity() ->
pick_eevdf() -> vruntime_eligible()
If entity_tick() -> avg_vruntime() -> update_zero_vruntime() is not
triggered in time between dequeue and the next pick, vruntime_eligible()
may use an inaccurate cfs_rq->zero_vruntime. This can potentially cause
all tasks to appear ineligible, leading to NULL pointer dereference.
Add an explicit avg_vruntime(cfs_rq) call after clearing curr->on_rq to
ensure cfs_rq->zero_vruntime is properly updated before the next pick.
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf948db905ed..f8070767c2f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5461,6 +5461,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
se->on_rq = 0;
+ /* update the cfs_rq->zero_vruntime again after curr->on_rq = 0 */
+ if (se == cfs_rq->curr)
+ avg_vruntime(cfs_rq);
account_entity_dequeue(cfs_rq, se);
/* return excess runtime on last dequeue */
--
2.34.1
On Thu, 19 Mar 2026 at 12:43, Zicheng Qu <quzicheng@huawei.com> wrote:
>
> When dequeuing the current entity (cfs_rq->curr) in dequeue_entity(),
> the cfs_rq->zero_vruntime is updated via update_entity_lag() ->
> avg_vruntime() -> update_zero_vruntime() while curr->on_rq is still 1.
> This means the current entity is still included in the zero_vruntime
> calculation.
curr is not included in zero_vruntime but added when computing
avg_vruntime so zero_vruntime is not impacted when curr is dequeued
>
> However, immediately after this, curr->on_rq is set to 0, which should
> change the avg_vruntime() result. Without re-updating zero_vruntime, the
> stale value may be used in subsequent task selection paths:
>
> schedule() -> ... -> pick_task_fair() -> pick_next_entity() ->
> pick_eevdf() -> vruntime_eligible()
>
> If entity_tick() -> avg_vruntime() -> update_zero_vruntime() is not
> triggered in time between dequeue and the next pick, vruntime_eligible()
> may use an inaccurate cfs_rq->zero_vruntime. This can potentially cause
> all tasks to appear ineligible, leading to NULL pointer dereference.
>
> Add an explicit avg_vruntime(cfs_rq) call after clearing curr->on_rq to
> ensure cfs_rq->zero_vruntime is properly updated before the next pick.
>
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> ---
> kernel/sched/fair.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bf948db905ed..f8070767c2f4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5461,6 +5461,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> if (se != cfs_rq->curr)
> __dequeue_entity(cfs_rq, se);
> se->on_rq = 0;
> + /* update the cfs_rq->zero_vruntime again after curr->on_rq = 0 */
> + if (se == cfs_rq->curr)
> + avg_vruntime(cfs_rq);
> account_entity_dequeue(cfs_rq, se);
>
> /* return excess runtime on last dequeue */
> --
> 2.34.1
>
On Mon, Mar 23, 2026 at 08:52:21AM +0100, Vincent Guittot wrote: > On Thu, 19 Mar 2026 at 12:43, Zicheng Qu <quzicheng@huawei.com> wrote: > > > > When dequeuing the current entity (cfs_rq->curr) in dequeue_entity(), > > the cfs_rq->zero_vruntime is updated via update_entity_lag() -> > > avg_vruntime() -> update_zero_vruntime() while curr->on_rq is still 1. > > This means the current entity is still included in the zero_vruntime > > calculation. > > curr is not included in zero_vruntime but added when computing > avg_vruntime so zero_vruntime is not impacted when curr is dequeued It is, we explicitly add curr back in. > > However, immediately after this, curr->on_rq is set to 0, which should > > change the avg_vruntime() result. Without re-updating zero_vruntime, the > > stale value may be used in subsequent task selection paths: > > > > schedule() -> ... -> pick_task_fair() -> pick_next_entity() -> > > pick_eevdf() -> vruntime_eligible() > > > > If entity_tick() -> avg_vruntime() -> update_zero_vruntime() is not > > triggered in time between dequeue and the next pick, vruntime_eligible() > > may use an inaccurate cfs_rq->zero_vruntime. This can potentially cause > > all tasks to appear ineligible, leading to NULL pointer dereference. This makes no sense. One entity worth of vruntime should not affect things to the point of overrun. Yes, it is true that zero_vruntime != avg_vruntime() right after a dequeue, but that doesn't matter. vruntime_eligible() does the same math that avg_vruntime() does and takes this difference into account. As long as zero_vruntime is close 'enough' to avg_vruntime, all the deltas are small and nothing overflows.
On Mon, 23 Mar 2026 at 10:57, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Mar 23, 2026 at 08:52:21AM +0100, Vincent Guittot wrote: > > On Thu, 19 Mar 2026 at 12:43, Zicheng Qu <quzicheng@huawei.com> wrote: > > > > > > When dequeuing the current entity (cfs_rq->curr) in dequeue_entity(), > > > the cfs_rq->zero_vruntime is updated via update_entity_lag() -> > > > avg_vruntime() -> update_zero_vruntime() while curr->on_rq is still 1. > > > This means the current entity is still included in the zero_vruntime > > > calculation. > > > > curr is not included in zero_vruntime but added when computing > > avg_vruntime so zero_vruntime is not impacted when curr is dequeued > > It is, we explicitly add curr back in. yeah, i should have taken one more coffee before replying > > > > However, immediately after this, curr->on_rq is set to 0, which should > > > change the avg_vruntime() result. Without re-updating zero_vruntime, the > > > stale value may be used in subsequent task selection paths: > > > > > > schedule() -> ... -> pick_task_fair() -> pick_next_entity() -> > > > pick_eevdf() -> vruntime_eligible() > > > > > > If entity_tick() -> avg_vruntime() -> update_zero_vruntime() is not > > > triggered in time between dequeue and the next pick, vruntime_eligible() > > > may use an inaccurate cfs_rq->zero_vruntime. This can potentially cause > > > all tasks to appear ineligible, leading to NULL pointer dereference. > > This makes no sense. > > One entity worth of vruntime should not affect things to the point of > overrun. Yes, it is true that zero_vruntime != avg_vruntime() right > after a dequeue, but that doesn't matter. > > vruntime_eligible() does the same math that avg_vruntime() does and > takes this difference into account. > > As long as zero_vruntime is close 'enough' to avg_vruntime, all the > deltas are small and nothing overflows. >
© 2016 - 2026 Red Hat, Inc.