Early on, when trying to get sched_ext and proxy-exec working together,
I kept tripping over a NULL pointer dereference in put_prev_task_scx()
on the line:
  if (sched_class_above(&ext_sched_class, next->sched_class)) {
This is because put_prev_task() passes a NULL next, calling:
  prev->sched_class->put_prev_task(rq, prev, NULL);
put_prev_task_scx() already guards against a NULL next in the
switch_class case, but doesn't seem to have a guard for the
sched_class_above() check.
I can't say I understand why this doesn't usually trip without
proxy-exec. And in newer kernels there are far fewer
put_prev_task() calls, and I can't easily reproduce the issue now
even with proxy-exec.
But we still have one put_prev_task() call left in core.c that
seems like it could trip this, so I wanted to send this out for
consideration.
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Changwoo Min <changwoo@igalia.com>
Cc: sched-ext@lists.linux.dev
Cc: kernel-team@android.com
---
kernel/sched/ext.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 446091cba4429..598552f58f5ec 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2402,7 +2402,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
* ops.enqueue() that @p is the only one available for this cpu,
* which should trigger an explicit follow-up scheduling event.
*/
- if (sched_class_above(&ext_sched_class, next->sched_class)) {
+ if (next && sched_class_above(&ext_sched_class, next->sched_class)) {
WARN_ON_ONCE(!(sch->ops.flags & SCX_OPS_ENQ_LAST));
do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
} else {
--
2.52.0.223.gf5cc29aaa4-goog
Applied to sched_ext/for-6.19-fixes with a note.

Thanks.

--
tejun
Hello John,

On Sat Dec 6, 2025 at 2:22 AM UTC, John Stultz wrote:
> I can't say I understand why this doesn't usually trip without
> proxy-exec. And in newer kernels there are far fewer
> put_prev_task() calls, and I can't easily reproduce the issue now
> even with proxy-exec.

That's probably because put_prev_task_scx() with next == NULL is always
preceded by a dequeue, clearing SCX_TASK_QUEUED from p->scx.flags, so we
don't reach the problematic sched_class_above() check because it only
happens when the flag is set.

> But we still have one put_prev_task() call left in core.c that
> seems like it could trip this, so I wanted to send this out for
> consideration.

I'm assuming you're referring to the one in sched_change_begin(). It
looks like it's impossible for an outside observer holding a CPU's rq
lock to observe a task that is running on that CPU and isn't queued,
i.e. 'running' implies 'queued' (I'm new to the scheduler so I may be
wrong here). That would explain why dequeue_task() is always called
before put_prev_task().

Does proxy execution break that assumption?

Best,
Kuba
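
P.S. For reference, the shape of the code around the line this patch
touches is roughly the following (a paraphrased sketch of
put_prev_task_scx(), not an exact quote of the upstream code), which is
why the next->sched_class dereference is only reachable while
SCX_TASK_QUEUED is still set:

  static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
                                struct task_struct *next)
  {
          ...
          if (p->scx.flags & SCX_TASK_QUEUED) {
                  ...
                  /* only reached while @p is still queued */
                  if (sched_class_above(&ext_sched_class, next->sched_class)) {
                          WARN_ON_ONCE(!(sch->ops.flags & SCX_OPS_ENQ_LAST));
                          do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
                  } else {
                          ...
                  }
          }
          ...
  }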
On Mon Dec 8, 2025 at 10:10 AM UTC, Kuba Piecuch wrote:
> It looks like it's impossible for an outside observer holding a CPU's rq lock
> to observe a task that is running on that CPU and isn't queued, i.e.
> 'running' implies 'queued' (I'm new to the scheduler so I may be wrong here).

A task that blocks in __schedule() can drop the rq lock while picking the next
task, which is after try_to_block_task() dequeues prev. So it's very much
possible for a task on another CPU to grab the rq lock and observe prev as
dequeued but still running.
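
Roughly, the window I have in mind looks like this (my reading of the
code, assuming prev is blocking and the pick path drops the rq lock,
e.g. for balancing):

  CPU0 (prev is blocking)               CPU1 (e.g. sched_setscheduler() on prev)
  -----------------------               ----------------------------------------
  __schedule()
    rq_lock(rq)
    try_to_block_task()
      -> dequeues prev, clears on_rq
    pick_next_task()
      -> may drop and retake rq lock
                                        rq_lock(rq)
                                        sees rq->donor == prev, but
                                        !task_on_rq_queued(prev)
                                        rq_unlock(rq)
    context_switch()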
On Mon Dec 8, 2025 at 11:15 AM UTC, Kuba Piecuch wrote:
> On Mon Dec 8, 2025 at 10:10 AM UTC, Kuba Piecuch wrote:
>> It looks like it's impossible for an outside observer holding a CPU's rq lock
>> to observe a task that is running on that CPU and isn't queued, i.e.
>> 'running' implies 'queued' (I'm new to the scheduler so I may be wrong here).
>
> A task that blocks in __schedule() can drop the rq lock while picking the next
> task, which is after try_to_block_task() dequeues prev. So it's very much
> possible for a task on another CPU to grab the rq lock and observe prev as
> dequeued but still running.

Even with that, I'm not convinced that it's possible to do a NULL deref
with the current code. In order for sched_change_begin() to do the NULL
deref in put_prev_task_scx(), we would need to have:

  * rq->donor == p (for sched_change_begin() to call put_prev_task())
  * p->on_rq != TASK_ON_RQ_QUEUED (for sched_change_begin() to not call
    dequeue_task() beforehand)
  * p->scx.flags & SCX_TASK_QUEUED (for put_prev_task_scx() to enter the
    branch with the @next deref)

From a brief survey of the code, __assuming proxy execution is
disabled__, I don't think it's possible for a remote task holding @rq's
lock to observe the second and third conditions to be true at the same
time.

Every time p->on_rq is changed away from TASK_ON_RQ_QUEUED, it happens
under the rq lock and is paired with a dequeue (see block_task(),
deactivate_task()). dequeue_task_scx() always clears SCX_TASK_QUEUED
from p->scx.flags.

Every time SCX_TASK_QUEUED is set in p->scx.flags (i.e.
enqueue_task_scx() is called), it happens under the rq lock and is
either gated by p->on_rq == TASK_ON_RQ_QUEUED (see ttwu_runnable(),
sched_change_end()) or is paired with p->on_rq being set to
TASK_ON_RQ_QUEUED (see activate_task()). It also happens in
proxy_tag_curr(), which is a no-op if proxy execution is disabled. Even
when it's enabled, proxy_tag_curr() does a dequeue-enqueue cycle while
holding the rq lock, which doesn't look dangerous.

I'm not trying to say that we shouldn't add a NULL check, all this is
just for my own understanding.
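
To make the first two conditions concrete, the relevant part of
sched_change_begin() goes roughly like this (paraphrased from my reading
of core.c, not an exact quote):

  ctx->queued  = task_on_rq_queued(p);  /* p->on_rq == TASK_ON_RQ_QUEUED? */
  ctx->running = (rq->donor == p);      /* is @p what this CPU is running? */

  if (ctx->queued)
          dequeue_task(rq, p, flags);
  if (ctx->running)
          put_prev_task(rq, p);         /* -> ->put_prev_task(rq, p, NULL) */

So the deref requires skipping the dequeue_task() (second condition)
while put_prev_task_scx() still sees SCX_TASK_QUEUED (third condition),
which is exactly what the pairing described above rules out when proxy
execution is disabled.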
On Sat, Dec 06, 2025 at 02:22:03AM +0000, John Stultz wrote:
> Early on, when trying to get sched_ext and proxy-exec working together,
> I kept tripping over a NULL pointer dereference in put_prev_task_scx()
> on the line:
>   if (sched_class_above(&ext_sched_class, next->sched_class)) {
>
> This is because put_prev_task() passes a NULL next, calling:
>   prev->sched_class->put_prev_task(rq, prev, NULL);
>
> put_prev_task_scx() already guards against a NULL next in the
> switch_class case, but doesn't seem to have a guard for the
> sched_class_above() check.
>
> I can't say I understand why this doesn't usually trip without
> proxy-exec. And in newer kernels there are far fewer
> put_prev_task() calls, and I can't easily reproduce the issue now
> even with proxy-exec.
>
> But we still have one put_prev_task() call left in core.c that
> seems like it could trip this, so I wanted to send this out for
> consideration.
>
> Signed-off-by: John Stultz <jstultz@google.com>
This looks like a valid fix to me. If the task changes any sched property
while it's running, we go through sched_change_begin(), which calls
put_prev_task(), and that always passes NULL as the next parameter:
static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
WARN_ON_ONCE(rq->donor != prev);
prev->sched_class->put_prev_task(rq, prev, NULL);
}
This should be the code path(s) to trigger the bug:

sys_setpriority() / sched_setaffinity() / sched_setscheduler()
- set_user_nice() / __sched_setaffinity() / __sched_setscheduler()
  - scoped_guard(sched_change, p, DEQUEUE_SAVE)
    - sched_change_begin(p, DEQUEUE_SAVE)
      - if (ctx->running)
            put_prev_task(rq, p)
        - prev->sched_class->put_prev_task(rq, prev, NULL)
          - put_prev_task_scx(rq, prev, NULL)
            - if (sched_class_above(&ext_sched_class, next->sched_class))
                                                      ^^^^
                                                      NULL dereference
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
> ---
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kuyo chang <kuyo.chang@mediatek.com>
> Cc: hupu <hupu.gm@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: David Vernet <void@manifault.com>
> Cc: Andrea Righi <arighi@nvidia.com>
> Cc: Changwoo Min <changwoo@igalia.com>
> Cc: sched-ext@lists.linux.dev
> Cc: kernel-team@android.com
> ---
> kernel/sched/ext.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 446091cba4429..598552f58f5ec 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2402,7 +2402,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
> * ops.enqueue() that @p is the only one available for this cpu,
> * which should trigger an explicit follow-up scheduling event.
> */
> - if (sched_class_above(&ext_sched_class, next->sched_class)) {
> + if (next && sched_class_above(&ext_sched_class, next->sched_class)) {
> WARN_ON_ONCE(!(sch->ops.flags & SCX_OPS_ENQ_LAST));
> do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
> } else {
> --
> 2.52.0.223.gf5cc29aaa4-goog
>