[v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()

Posted by zhidao su 1 month, 1 week ago

With CONFIG_EXT_SUB_SCHED enabled, scx_task_sched(p) returns
p->scx.sched instead of scx_root.  scx_root_enable_workfn() iterates
over all tasks and, for each one, drops scx_tasks_lock via
scx_task_iter_unlock() before calling scx_init_task().  A concurrent
sched_ext_dead() can run in that window.

Two bugs:

1. NULL deref: If sched_ext_dead() runs after scx_init_task() sets
   state=INIT but before the callsite sets p->scx.sched, the invariant
   "state != NONE => p->scx.sched != NULL" is broken.  sched_ext_dead()
   calls scx_disable_and_exit_task(scx_task_sched(p)=NULL, p), which
   crashes in SCX_HAS_OP(NULL, ...).

2. Resource leak: If sched_ext_dead() runs before scx_init_task(), while
   state is still NONE, it skips scx_disable_and_exit_task() because the
   state check fails.  scx_init_task() then calls ops.init_task() and
   sets state=INIT.  The enable loop never calls ops.exit_task() for @p,
   leaking whatever ops.init_task() allocated.
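
The two interleavings above can be replayed deterministically in a small
userspace model.  This is only a sketch: every name below (model_task,
model_dead(), model_init_old(), the replay helpers) is a hypothetical
stand-in rather than a kernel definition, locking is omitted, and the
"race" is reproduced by calling the functions in the racy order on one
thread.  It assumes sched_ext_dead() unlinks the task and then cleans up
only when state != NONE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's task state and task fields. */
enum model_state { MODEL_NONE, MODEL_INIT };

struct model_task {
	enum model_state state;	/* models the SCX task state */
	const void *sched;	/* models p->scx.sched */
	bool on_list;		/* models membership in scx_tasks */
	int allocs;		/* resources held by the modelled ops.init_task() */
};

enum model_dead_result {
	MODEL_DEAD_SKIPPED,	/* state was NONE, nothing cleaned up */
	MODEL_DEAD_CLEANED,	/* resources released, state reset */
	MODEL_DEAD_NULL_SCHED,	/* would dereference a NULL scheduler */
};

/* Models sched_ext_dead(): unlink, then clean up only if state != NONE. */
static enum model_dead_result model_dead(struct model_task *p)
{
	p->on_list = false;
	if (p->state == MODEL_NONE)
		return MODEL_DEAD_SKIPPED;
	if (!p->sched)
		return MODEL_DEAD_NULL_SCHED;
	p->allocs = 0;
	p->state = MODEL_NONE;
	return MODEL_DEAD_CLEANED;
}

/* Models the pre-patch scx_init_task(): allocates and sets state only. */
static void model_init_old(struct model_task *p)
{
	p->allocs++;
	p->state = MODEL_INIT;
}

/* Bug 1: the dead path runs after init but before the caller sets sched. */
static enum model_dead_result replay_bug1(void)
{
	struct model_task p = { .state = MODEL_NONE, .sched = NULL,
				.on_list = true, .allocs = 0 };

	model_init_old(&p);
	return model_dead(&p);
}

/* Bug 2: the dead path runs first; returns the leaked allocation count. */
static int replay_bug2(void)
{
	static const int sched = 0;
	struct model_task p = { .state = MODEL_NONE, .sched = NULL,
				.on_list = true, .allocs = 0 };

	if (model_dead(&p) != MODEL_DEAD_SKIPPED)
		return -1;
	model_init_old(&p);
	p.sched = &sched;	/* the caller's late pointer assignment */
	return p.allocs;	/* nothing will release this in the model */
}
```

In this model replay_bug1() reports MODEL_DEAD_NULL_SCHED and
replay_bug2() reports one leaked allocation.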

Fix both:

- Move scx_set_task_sched(p, sch) into scx_init_task(), before the
  state transition off NONE.  This restores the invariant so
  sched_ext_dead() always finds a valid scheduler pointer (fixes
  bug 1).

- After scx_init_task() returns, check under scx_tasks_lock whether
  @p is still on scx_tasks.  If not, sched_ext_dead() raced us.
  If state != NONE, sched_ext_dead() saw state=NONE and skipped cleanup
  before ops.init_task() ran, so call scx_disable_and_exit_task() with
  cancelled=true to release the resources (fixes bug 2).  If state=NONE,
  sched_ext_dead() already cleaned up.
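
The combined effect of the two changes can be sketched as a userspace
model.  As before, this is an illustration under stated assumptions, not
the kernel code: the fix_* names are hypothetical, locking is left out,
and each interleaving is replayed in order on a single thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's task state and task fields. */
enum fix_state { FIX_NONE, FIX_INIT, FIX_READY };

struct fix_task {
	enum fix_state state;
	const void *sched;	/* models p->scx.sched */
	bool on_list;		/* models membership in scx_tasks */
	int allocs;		/* resources held by the modelled ops.init_task() */
};

enum fix_when { FIX_DEAD_NEVER, FIX_DEAD_BEFORE_INIT, FIX_DEAD_AFTER_INIT };

/* Models scx_disable_and_exit_task(): release resources, reset state. */
static void fix_exit_task(struct fix_task *p)
{
	p->allocs = 0;
	p->state = FIX_NONE;
}

/* Models the patched scx_init_task(): pointer first, then state. */
static void fix_init_task(struct fix_task *p, const void *sch)
{
	p->allocs++;
	p->sched = sch;
	p->state = FIX_INIT;
}

/* Models sched_ext_dead(); returns false on a would-be NULL dereference. */
static bool fix_dead(struct fix_task *p)
{
	p->on_list = false;
	if (p->state == FIX_NONE)
		return true;
	if (!p->sched)
		return false;
	fix_exit_task(p);
	return true;
}

/* Models the post-init recheck; returns true if the task stays alive. */
static bool fix_recheck(struct fix_task *p)
{
	if (p->on_list) {
		p->state = FIX_READY;
		return true;
	}
	if (p->state != FIX_NONE)	/* dead path ran before init */
		fix_exit_task(p);
	return false;
}

/* Replays one interleaving; returns leaked allocations, or -1 on NULL sched. */
static int fix_replay(enum fix_when when)
{
	static const int sch = 0;
	struct fix_task p = { .state = FIX_NONE, .sched = NULL,
			      .on_list = true, .allocs = 0 };
	bool ok = true;

	if (when == FIX_DEAD_BEFORE_INIT)
		ok = fix_dead(&p);
	fix_init_task(&p, &sch);
	if (when == FIX_DEAD_AFTER_INIT)
		ok = fix_dead(&p);
	if (!ok)
		return -1;
	if (fix_recheck(&p))
		return 0;	/* task is live; its resources are still owned */
	return p.allocs;
}
```

Under this model, fix_replay() reports no leak and no NULL scheduler for
either racy ordering, and a task that never dies proceeds to the READY
state.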

Fixes: 88234b075c3f ("sched_ext: Introduce scx_task_sched[_rcu]()")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
v2: Rewrite as writer-side fix per Tejun's review:
 - Move scx_set_task_sched(p, sch) into scx_init_task() before the state
   transition off NONE, restoring the "state!=NONE => p->scx.sched!=NULL"
   invariant.  Bug 1 (NULL deref) is fixed without touching sched_ext_dead().
 - Handle bug 2 (resource leak) in the workfn's list_empty() path by
   calling scx_disable_and_exit_task() when state!=NONE, instead of the
   v1 reader-side branch in sched_ext_dead() that leaked resources.
 - Update Fixes: to 88234b075c3f ("sched_ext: Introduce scx_task_sched[_rcu]()")
   which is when scx_task_sched(p) started dereferencing p->scx.sched.


 kernel/sched/ext.c | 59 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 52 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f7b1b16e81a5..99560f77af81 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3583,7 +3583,15 @@ static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork
 		/*
 		 * While @p's rq is not locked. @p is not visible to the rest of
 		 * SCX yet and it's safe to update the flags and state.
+		 *
+		 * Install p->scx.sched before transitioning state off NONE so
+		 * that the invariant state!=NONE => p->scx.sched!=NULL holds as
+		 * soon as state becomes observable.  A concurrent sched_ext_dead()
+		 * that races the INIT window will then always find a valid
+		 * scheduler pointer and can call scx_disable_and_exit_task()
+		 * to release resources allocated by ops.init_task().
 		 */
+		scx_set_task_sched(p, sch);
 		p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
 		scx_set_task_state(p, SCX_TASK_INIT);
 	}
@@ -3769,8 +3777,6 @@ void scx_pre_fork(struct task_struct *p)
 
 int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 {
-	s32 ret;
-
 	percpu_rwsem_assert_held(&scx_fork_rwsem);
 
 	p->scx.tid = scx_alloc_tid();
@@ -3781,10 +3787,7 @@ int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 #else
 		struct scx_sched *sch = scx_root;
 #endif
-		ret = scx_init_task(sch, p, true);
-		if (!ret)
-			scx_set_task_sched(p, sch);
-		return ret;
+		return scx_init_task(sch, p, true);
 	}
 
 	return 0;
@@ -6937,7 +6940,49 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 			goto err_disable_unlock_all;
 		}
 
-		scx_set_task_sched(p, sch);
+		/*
+		 * sched_ext_dead() may have raced while locks were dropped in
+		 * scx_task_iter_unlock().  Two cases:
+		 *
+		 * (a) sched_ext_dead() ran after scx_init_task() set state=INIT:
+		 *     it called scx_disable_and_exit_task() (cancelled=true) and
+		 *     reset state to NONE.  ops.exit_task() already ran; skip.
+		 *
+		 * (b) sched_ext_dead() ran before scx_init_task() (state=NONE at
+		 *     the time): it skipped scx_disable_and_exit_task() because
+		 *     state was NONE.  scx_init_task() subsequently called
+		 *     ops.init_task() and set state=INIT, leaving allocated
+		 *     resources with no owner.  We must call
+		 *     scx_disable_and_exit_task() here to release them.
+		 *
+		 * Distinguish case (a) from (b) by reading state: (a) leaves
+		 * state=NONE (reset by scx_disable_and_exit_task); (b) leaves
+		 * state=INIT (set by scx_init_task, never reset).
+		 */
+		{
+			bool p_dead = false, need_exit = false;
+
+			scoped_guard(raw_spinlock_irq, &scx_tasks_lock) {
+				if (list_empty(&p->scx.tasks_node)) {
+					p_dead = true;
+					need_exit = scx_get_task_state(p) != SCX_TASK_NONE;
+				}
+			}
+
+			if (p_dead) {
+				if (need_exit) {
+					struct rq_flags rf;
+					struct rq *rq;
+
+					rq = task_rq_lock(p, &rf);
+					scx_disable_and_exit_task(sch, p);
+					task_rq_unlock(rq, p, &rf);
+				}
+				put_task_struct(p);
+				continue;
+			}
+		}
+
 		scx_set_task_state(p, SCX_TASK_READY);
 
 		/*
-- 
2.43.0

Re: [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()

Posted by Tejun Heo 1 month ago

Hello,

Thanks for the report and the patches. The same race window also
affects the analogous sub-sched paths, and the wrapper-disable paths
trip on the NONE state that scx_fail_parent() leaves behind, so I
ended up taking a more invasive route - extending the task state
machine with SCX_TASK_INIT_BEGIN and SCX_TASK_DEAD - rather than
continuing with your localized fix.

Posted as a 6-patch series:

  https://lore.kernel.org/all/20260510074113.2049514-1-tj@kernel.org/

Thanks.

--
tejun