sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()

[PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()

Posted by Andrea Righi 1 month ago

Under heavy concurrent attach/detach operations, scx_claim_exit() can
trigger a NULL pointer dereference. This can be reproduced running the
reload_loop kselftests inside a virtme-ng session:

 $ vng -v -- ./tools/testing/selftests/sched_ext/runner -t reload_loop
 ...
 BUG: kernel NULL pointer dereference, address: 0000000000000400
 ...
 RIP: 0010:scx_claim_exit+0x3b/0x120
 Call Trace:
  <TASK>
  bpf_scx_unreg+0x45/0xb0
  bpf_struct_ops_map_link_dealloc+0x39/0x50
  bpf_link_release+0x18/0x20
  __fput+0x10b/0x2e0
  __x64_sys_close+0x47/0xa0

This was introduced by commit 105dcd005be2 ("sched_ext: Introduce
scx_prog_sched()"), which:

  - Made kfuncs look up the scheduler via scx_prog_sched(aux), which
    resolves aux -> struct_ops -> ops->priv.

  - Added RCU_INIT_POINTER(ops->priv, NULL) to bpf_scx_unreg() before
    dropping the kobject reference.

Under concurrent attach/detach of the same struct_ops program, the BPF
program's aux->struct_ops association can resolve to a struct_ops whose
->priv was just cleared by a concurrent bpf_scx_unreg(), or to one where
scx_alloc_and_add_sched() has not yet completed rcu_assign_pointer().

When scx_prog_sched() observes this transient ops->priv == NULL, it
returns NULL; kfuncs like scx_bpf_create_dsq() then return -ENODEV,
which causes ops.init() to fail with -ENODEV. The failed attach enters
the disable path, and the subsequent bpf_scx_unreg() reads NULL from
ops->priv and dereferences it in scx_claim_exit().

Fix it in two places:

  - scx_prog_sched(): when ops is found but ops->priv is NULL, fall
    through to the scx_root path instead of returning NULL. For
    single-sched (the only currently supported configuration), this
    recovers the previous behavior; for sub-sched-aware schedulers the
    existing !root->ops.sub_attach guard keeps the fallback off so
    multi-sched semantics are preserved.

  - bpf_scx_unreg(): guard against ops->priv == NULL so the function is
    a no-op instead of NULL-dereferencing scx_disable(NULL, ...).

Fixes: 105dcd005be2 ("sched_ext: Introduce scx_prog_sched()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c          |  8 ++++++++
 kernel/sched/ext_internal.h | 16 +++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 4efe0099f79af..6c476ec5dcbe1 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7608,6 +7608,14 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
 	struct sched_ext_ops *ops = kdata;
 	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
 
+	/*
+	 * ops->priv can be NULL if scx_alloc_and_add_sched() failed before
+	 * assigning it, or if bpf_scx_unreg() somehow re-entered. There's
+	 * nothing to tear down in either case.
+	 */
+	if (!sch)
+		return;
+
 	scx_disable(sch, SCX_EXIT_UNREG);
 	scx_flush_disable_work(sch);
 	RCU_INIT_POINTER(ops->priv, NULL);
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index a075732d4430d..e468a7401ed83 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1433,11 +1433,21 @@ static inline bool scx_task_on_sched(struct scx_sched *sch,
 static inline struct scx_sched *scx_prog_sched(const struct bpf_prog_aux *aux)
 {
 	struct sched_ext_ops *ops;
-	struct scx_sched *root;
+	struct scx_sched *root, *sch;
 
 	ops = bpf_prog_get_assoc_struct_ops(aux);
-	if (likely(ops))
-		return rcu_dereference_all(ops->priv);
+	if (likely(ops)) {
+		sch = rcu_dereference_all(ops->priv);
+		if (likely(sch))
+			return sch;
+		/*
+		 * @aux is associated with @ops but @ops->priv is NULL. This can
+		 * be observed transiently under concurrent attach/detach (e.g.
+		 * bpf_scx_unreg() clears @ops->priv before kdata is freed).
+		 * Continue with the scx_root path so single-sched users keep
+		 * working, sub-sched users see no scheduler.
+		 */
+	}
 
 	root = rcu_dereference_all(scx_root);
 	if (root) {
-- 
2.54.0

Re: [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()

Posted by Tejun Heo 1 month ago

Hello, Andrea.

I traced reload_loop with per-CPU ring probes around all @ops->priv
and scx_root assign/clear sites. The race is a stomp:

  T2 unreg(K)                       T1 reg(K)
  -----------                       ---------
  sch = ops->priv = sch_b800
  scx_disable; flush_disable_work
    [scx_root_disable: scx_root=NULL,
     mutex_unlock, state=DISABLED]
                                    mutex_lock; state ok
                                    scx_alloc_and_add_sched:
                                      ops->priv = sch_a800
                                    scx_root = sch_a800; init=0
                                    state=ENABLED; mutex_unlock
    [flush returns]
  RCU_INIT_POINTER(ops->priv, NULL) <-- clobbers sch_a800
  kobject_put(sch_b800)

Reachable because the unreg waits on sch->helper while the next reg
runs on the global scx_enable_helper, and scx_enable_mutex is released
inside scx_root_disable() well before bpf_scx_unreg() reaches its
RCU_INIT_POINTER. My trace caught 11us between PRIV_SET sch_a800 and
the clobber; nothing bounds it.

The posted patch suppresses the deref but leaves the stomp. Each
stomp leaks one sch (the "sch's base reference will be put by
bpf_scx_unreg()" contract assumes ops->priv still points at it), and
in the case I caught, sch_a800 is already SCX_ENABLED with scx_root
pointing at it - the bpf_link is gone but state stays ENABLED, so all
future attaches fail with -EBUSY permanently.

Suggestion: make @ops->priv the lifecycle binding. In
scx_root_enable_workfn() (and scx_sub_enable_workfn()), after the
existing state check and still under scx_enable_mutex, refuse with
-EBUSY if @ops->priv is non-NULL. Unreg side keeps its current
ordering.

One question: are there other paths that write or clear @ops->priv?
I only see the rcu_assign_pointer in scx_alloc_and_add_sched and the
RCU_INIT_POINTER(NULL) in bpf_scx_unreg().

Thanks.

--
tejun

Re: [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()

Posted by Andrea Righi 1 month ago

Hi Tejun,

On Sun, May 10, 2026 at 04:55:30PM -1000, Tejun Heo wrote:
> Hello, Andrea.
> 
> I traced reload_loop with per-CPU ring probes around all @ops->priv
> and scx_root assign/clear sites. The race is a stomp:
> 
>   T2 unreg(K)                       T1 reg(K)
>   -----------                       ---------
>   sch = ops->priv = sch_b800
>   scx_disable; flush_disable_work
>     [scx_root_disable: scx_root=NULL,
>      mutex_unlock, state=DISABLED]
>                                     mutex_lock; state ok
>                                     scx_alloc_and_add_sched:
>                                       ops->priv = sch_a800
>                                     scx_root = sch_a800; init=0
>                                     state=ENABLED; mutex_unlock
>     [flush returns]
>   RCU_INIT_POINTER(ops->priv, NULL) <-- clobbers sch_a800
>   kobject_put(sch_b800)

Ah makes sense! Yes, that's the case.

> 
> Reachable because the unreg waits on sch->helper while the next reg
> runs on the global scx_enable_helper, and scx_enable_mutex is released
> inside scx_root_disable() well before bpf_scx_unreg() reaches its
> RCU_INIT_POINTER. My trace caught 11us between PRIV_SET sch_a800 and
> the clobber; nothing bounds it.
> 
> The posted patch suppresses the deref but leaves the stomp. Each
> stomp leaks one sch (the "sch's base reference will be put by
> bpf_scx_unreg()" contract assumes ops->priv still points at it), and
> in the case I caught, sch_a800 is already SCX_ENABLED with scx_root
> pointing at it - the bpf_link is gone but state stays ENABLED, so all
> future attaches fail with -EBUSY permanently.
> 
> Suggestion: make @ops->priv the lifecycle binding. In
> scx_root_enable_workfn() (and scx_sub_enable_workfn()), after the
> existing state check and still under scx_enable_mutex, refuse with
> -EBUSY if @ops->priv is non-NULL. Unreg side keeps its current
> ordering.

I'll send a new version implementing this.

> 
> One question: are there other paths that write or clear @ops->priv?
> I only see the rcu_assign_pointer in scx_alloc_and_add_sched and the
> RCU_INIT_POINTER(NULL) in bpf_scx_unreg().

AFAICS there's only the rcu_assign_pointer() in scx_alloc_and_add_sched() and
RCU_INIT_POINTER(NULL) in bpf_scx_unreg(), no other writers/clearers. So the
-EBUSY check should be sufficient to close all the races.

Thanks,
-Andrea