Early on, when trying to get sched_ext and proxy-exec working together,
I kept tripping over a NULL pointer dereference in put_prev_task_scx()
on the line:
  if (sched_class_above(&ext_sched_class, next->sched_class)) {
This is because put_prev_task() passes a NULL next, calling:
  prev->sched_class->put_prev_task(rq, prev, NULL);
put_prev_task_scx() already guards against a NULL next in the
switch_class case, but doesn't seem to have a guard for the
sched_class_above() check.
I can't say I understand why this doesn't usually trip without
proxy-exec. And in newer kernels there are far fewer
put_prev_task() calls, and I can't easily reproduce the issue now
even with proxy-exec.
But we still have one put_prev_task() call left in core.c that
seems like it could trip this, so I wanted to send this out for
consideration.
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Changwoo Min <changwoo@igalia.com>
Cc: sched-ext@lists.linux.dev
Cc: kernel-team@android.com
---
kernel/sched/ext.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 446091cba4429..598552f58f5ec 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2402,7 +2402,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
* ops.enqueue() that @p is the only one available for this cpu,
* which should trigger an explicit follow-up scheduling event.
*/
- if (sched_class_above(&ext_sched_class, next->sched_class)) {
+ if (next && sched_class_above(&ext_sched_class, next->sched_class)) {
WARN_ON_ONCE(!(sch->ops.flags & SCX_OPS_ENQ_LAST));
do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
} else {
--
2.52.0.223.gf5cc29aaa4-goog
Applied to sched_ext/for-6.19-fixes with a note.

Thanks.

--
tejun
Hello John,

On Sat Dec 6, 2025 at 2:22 AM UTC, John Stultz wrote:
> I can't say I understand why this doesn't usually trip without
> proxy-exec. And in newer kernels there are far fewer
> put_prev_task() calls, and I can't easily reproduce the issue now
> even with proxy-exec.

That's probably because put_prev_task_scx() with next == NULL is always
preceded by a dequeue, clearing SCX_TASK_QUEUED from p->scx.flags, so we
don't reach the problematic sched_class_above() check because it only
happens when the flag is set.

> But we still have one put_prev_task() call left in core.c that
> seems like it could trip this, so I wanted to send this out for
> consideration.

I'm assuming you're referring to the one in sched_change_begin(). It
looks like it's impossible for an outside observer holding a CPU's rq
lock to observe a task that is running on that CPU and isn't queued,
i.e. 'running' implies 'queued' (I'm new to the scheduler so I may be
wrong here). That would explain why dequeue_task() is always called
before put_prev_task().

Does proxy execution break that assumption?

Best,
Kuba
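
P.S. For reference, the shape of the code around the line this patch
touches is roughly the following (a paraphrased sketch of
put_prev_task_scx(), not an exact quote of the upstream code), which is
why the next->sched_class dereference is only reachable while
SCX_TASK_QUEUED is still set:

  static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
                                struct task_struct *next)
  {
          ...
          if (p->scx.flags & SCX_TASK_QUEUED) {
                  ...
                  /* only reached while @p is still queued */
                  if (sched_class_above(&ext_sched_class, next->sched_class)) {
                          WARN_ON_ONCE(!(sch->ops.flags & SCX_OPS_ENQ_LAST));
                          do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
                  } else {
                          ...
                  }
          }
          ...
  }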
On Mon Dec 8, 2025 at 10:10 AM UTC, Kuba Piecuch wrote:
> It looks like it's impossible for an outside observer holding a CPU's rq lock
> to observe a task that is running on that CPU and isn't queued, i.e.
> 'running' implies 'queued' (I'm new to the scheduler so I may be wrong here).

A task that blocks in __schedule() can drop the rq lock while picking the next
task, which is after try_to_block_task() dequeues prev. So it's very much
possible for a task on another CPU to grab the rq lock and observe prev as
dequeued but still running.
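
Roughly, the window I have in mind looks like this (my reading of the
code, assuming prev is blocking and the pick path drops the rq lock,
e.g. for balancing):

  CPU0 (prev is blocking)               CPU1 (e.g. sched_setscheduler() on prev)
  -----------------------               ----------------------------------------
  __schedule()
    rq_lock(rq)
    try_to_block_task()
      -> dequeues prev, clears on_rq
    pick_next_task()
      -> may drop and retake rq lock
                                        rq_lock(rq)
                                        sees rq->donor == prev, but
                                        !task_on_rq_queued(prev)
                                        rq_unlock(rq)
    context_switch()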
On Mon Dec 8, 2025 at 11:15 AM UTC, Kuba Piecuch wrote:
> On Mon Dec 8, 2025 at 10:10 AM UTC, Kuba Piecuch wrote:
>> It looks like it's impossible for an outside observer holding a CPU's rq lock
>> to observe a task that is running on that CPU and isn't queued, i.e.
>> 'running' implies 'queued' (I'm new to the scheduler so I may be wrong here).
>
> A task that blocks in __schedule() can drop the rq lock while picking the next
> task, which is after try_to_block_task() dequeues prev. So it's very much
> possible for a task on another CPU to grab the rq lock and observe prev as
> dequeued but still running.

Even with that, I'm not convinced that it's possible to do a NULL deref
with the current code. In order for sched_change_begin() to do the NULL
deref in put_prev_task_scx(), we would need to have:

  * rq->donor == p (for sched_change_begin() to call put_prev_task())
  * p->on_rq != TASK_ON_RQ_QUEUED (for sched_change_begin() to not call
    dequeue_task() beforehand)
  * p->scx.flags & SCX_TASK_QUEUED (for put_prev_task_scx() to enter the
    branch with the @next deref)

From a brief survey of the code, __assuming proxy execution is
disabled__, I don't think it's possible for a remote task holding @rq's
lock to observe the second and third conditions to be true at the same
time.

Every time p->on_rq is changed away from TASK_ON_RQ_QUEUED, it happens
under the rq lock and is paired with a dequeue (see block_task(),
deactivate_task()). dequeue_task_scx() always clears SCX_TASK_QUEUED
from p->scx.flags.

Every time SCX_TASK_QUEUED is set in p->scx.flags (i.e.
enqueue_task_scx() is called), it happens under the rq lock and is
either gated by p->on_rq == TASK_ON_RQ_QUEUED (see ttwu_runnable(),
sched_change_end()) or is paired with p->on_rq being set to
TASK_ON_RQ_QUEUED (see activate_task()). It also happens in
proxy_tag_curr(), which is a no-op if proxy execution is disabled. Even
when it's enabled, proxy_tag_curr() does a dequeue-enqueue cycle while
holding the rq lock, which doesn't look dangerous.

I'm not trying to say that we shouldn't add a NULL check, all this is
just for my own understanding.
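
To make the first two conditions concrete, the relevant part of
sched_change_begin() goes roughly like this (paraphrased from my reading
of core.c, not an exact quote):

  ctx->queued  = task_on_rq_queued(p);  /* p->on_rq == TASK_ON_RQ_QUEUED? */
  ctx->running = (rq->donor == p);      /* is @p what this CPU is running? */

  if (ctx->queued)
          dequeue_task(rq, p, flags);
  if (ctx->running)
          put_prev_task(rq, p);         /* -> ->put_prev_task(rq, p, NULL) */

So the deref requires skipping the dequeue_task() (second condition)
while put_prev_task_scx() still sees SCX_TASK_QUEUED (third condition),
which is exactly what the pairing described above rules out when proxy
execution is disabled.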
On Sat, Dec 06, 2025 at 02:22:03AM +0000, John Stultz wrote:
> Early on, when trying to get sched_ext and proxy-exec working together,
> I kept tripping over a NULL pointer dereference in put_prev_task_scx()
> on the line:
>   if (sched_class_above(&ext_sched_class, next->sched_class)) {
>
> This is because put_prev_task() passes a NULL next, calling:
>   prev->sched_class->put_prev_task(rq, prev, NULL);
>
> put_prev_task_scx() already guards against a NULL next in the
> switch_class case, but doesn't seem to have a guard for the
> sched_class_above() check.
>
> I can't say I understand why this doesn't usually trip without
> proxy-exec. And in newer kernels there are far fewer
> put_prev_task() calls, and I can't easily reproduce the issue now
> even with proxy-exec.
>
> But we still have one put_prev_task() call left in core.c that
> seems like it could trip this, so I wanted to send this out for
> consideration.
>
> Signed-off-by: John Stultz <jstultz@google.com>
This looks like a valid fix to me. If the task changes any sched property
while it's running, we go through sched_change_begin(), which calls
put_prev_task(), and that always passes NULL as the next parameter:
static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
WARN_ON_ONCE(rq->donor != prev);
prev->sched_class->put_prev_task(rq, prev, NULL);
}
This should be the code path(s) to trigger the bug:

sys_setpriority() / sched_setaffinity() / sched_setscheduler()
- set_user_nice() / __sched_setaffinity() / __sched_setscheduler()
  - scoped_guard(sched_change, p, DEQUEUE_SAVE)
    - sched_change_begin(p, DEQUEUE_SAVE)
      - if (ctx->running)
            put_prev_task(rq, p)
        - prev->sched_class->put_prev_task(rq, prev, NULL)
          - put_prev_task_scx(rq, prev, NULL)
            - if (sched_class_above(&ext_sched_class, next->sched_class))
                                                      ^^^^
                                                      NULL dereference
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
> ---
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kuyo chang <kuyo.chang@mediatek.com>
> Cc: hupu <hupu.gm@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: David Vernet <void@manifault.com>
> Cc: Andrea Righi <arighi@nvidia.com>
> Cc: Changwoo Min <changwoo@igalia.com>
> Cc: sched-ext@lists.linux.dev
> Cc: kernel-team@android.com
> ---
> kernel/sched/ext.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 446091cba4429..598552f58f5ec 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2402,7 +2402,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
> * ops.enqueue() that @p is the only one available for this cpu,
> * which should trigger an explicit follow-up scheduling event.
> */
> - if (sched_class_above(&ext_sched_class, next->sched_class)) {
> + if (next && sched_class_above(&ext_sched_class, next->sched_class)) {
> WARN_ON_ONCE(!(sch->ops.flags & SCX_OPS_ENQ_LAST));
> do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
> } else {
> --
> 2.52.0.223.gf5cc29aaa4-goog
>