kernel/sched/ext.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-)
@p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) in three
scenarios, causing a spurious WARN_ON_ONCE() in mark_direct_dispatch()
when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert():
1. Deferred dispatch cancellation: when a task is directly dispatched to
a remote CPU's local DSQ via ops.select_cpu() or ops.enqueue(), the
dispatch is deferred (since we can't lock the remote rq while holding
the current one). If the task is dequeued before processing the
dispatch in process_ddsp_deferred_locals(), dispatch_dequeue()
removes the task from the list leaving a stale direct dispatch state.
Fix: clear ddsp_dsq_id and ddsp_enq_flags in the !list_empty branch
of dispatch_dequeue().
2. Holding-cpu dispatch race: when dispatch_to_local_dsq() transfers a
task to another CPU's local DSQ, it sets holding_cpu and releases
DISPATCHING before locking the source rq. If dequeue wins the race
and clears holding_cpu, dispatch_enqueue() is never called and
ddsp_dsq_id is not cleared.
Fix: clear ddsp_dsq_id and ddsp_enq_flags when clearing holding_cpu
in dispatch_dequeue().
3. Cross-scheduler-instance stale state: When an SCX scheduler exits,
scx_bypass() iterates over all runnable tasks to dequeue/re-enqueue
them, but sleeping tasks are not on any runqueue and are not touched.
If a sleeping task had a deferred dispatch in flight (ddsp_dsq_id
set) at the time the scheduler exited, the state persists. When a new
scheduler instance loads and calls scx_enable_task() for all tasks,
it does not reset this leftover state. The next wakeup's
ops.select_cpu() then sees a non-INVALID ddsp_dsq_id and triggers:
WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
Fix: clear ddsp_dsq_id and ddsp_enq_flags in scx_enable_task() before
calling ops.enable(), ensuring each new scheduler instance starts
with a clean direct dispatch state per task.
With all the fixes applied the SCX_DSQ_INVALID warning doesn't seem to
happen anymore.
Fixes: 5b26f7b920f76 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches")
Cc: stable@vger.kernel.org # v6.12+
Cc: Daniel Hodges <hodgesd@meta.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 26a6ac2f88267..de827ce0ffb74 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1163,20 +1163,32 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
if (!dsq) {
/*
* If !dsq && on-list, @p is on @rq's ddsp_deferred_locals.
- * Unlinking is all that's needed to cancel.
+ * Unlink and clear the deferred dispatch state.
*/
- if (unlikely(!list_empty(&p->scx.dsq_list.node)))
+ if (unlikely(!list_empty(&p->scx.dsq_list.node))) {
list_del_init(&p->scx.dsq_list.node);
+ p->scx.ddsp_dsq_id = SCX_DSQ_INVALID;
+ p->scx.ddsp_enq_flags = 0;
+ }
+
/*
* When dispatching directly from the BPF scheduler to a local
* DSQ, the task isn't associated with any DSQ but
* @p->scx.holding_cpu may be set under the protection of
- * %SCX_OPSS_DISPATCHING.
+ * %SCX_OPSS_DISPATCHING. If we win the race and clear
+ * holding_cpu before dispatch_to_local_dsq() completes, the
+ * in-flight dispatch is cancelled and dispatch_enqueue() won't
+ * be called. Clear the stale direct dispatch state here so the
+ * next wakeup starts clean.
*/
- if (p->scx.holding_cpu >= 0)
+ if (p->scx.holding_cpu >= 0) {
p->scx.holding_cpu = -1;
+ p->scx.ddsp_dsq_id = SCX_DSQ_INVALID;
+ p->scx.ddsp_enq_flags = 0;
+ }
+
return;
}
@@ -2945,6 +2957,9 @@ static void scx_enable_task(struct task_struct *p)
p->scx.weight = sched_weight_to_cgroup(weight);
+ p->scx.ddsp_dsq_id = SCX_DSQ_INVALID;
+ p->scx.ddsp_enq_flags = 0;
+
if (SCX_HAS_OP(sch, enable))
SCX_CALL_OP_TASK(sch, SCX_KF_REST, enable, rq, p);
scx_set_task_state(p, SCX_TASK_ENABLED);
--
2.53.0
Hello, (cc'ing Patrick Somaru. This is the same issue reported on https://github.com/sched-ext/scx/pull/3482) On Wed, Apr 01, 2026 at 11:56:19PM +0200, Andrea Righi wrote: > @p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) in three > scenarios, causing a spurious WARN_ON_ONCE() in mark_direct_dispatch() > when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert(): > > 1. Deferred dispatch cancellation: when a task is directly dispatched to > a remote CPU's local DSQ via ops.select_cpu() or ops.enqueue(), the > dispatch is deferred (since we can't lock the remote rq while holding > the current one). If the task is dequeued before processing the > dispatch in process_ddsp_deferred_locals(), dispatch_dequeue() > removes the task from the list leaving a stale direct dispatch state. > > Fix: clear ddsp_dsq_id and ddsp_enq_flags in the !list_empty branch > of dispatch_dequeue(). > > 2. Holding-cpu dispatch race: when dispatch_to_local_dsq() transfers a > task to another CPU's local DSQ, it sets holding_cpu and releases > DISPATCHING before locking the source rq. If dequeue wins the race > and clears holding_cpu, dispatch_enqueue() is never called and > ddsp_dsq_id is not cleared. > > Fix: clear ddsp_dsq_id and ddsp_enq_flags when clearing holding_cpu > in dispatch_dequeue(). These two just mean that dequeue need to clear it, right? > 3. Cross-scheduler-instance stale state: When an SCX scheduler exits, > scx_bypass() iterates over all runnable tasks to dequeue/re-enqueue > them, but sleeping tasks are not on any runqueue and are not touched. > If a sleeping task had a deferred dispatch in flight (ddsp_dsq_id > set) at the time the scheduler exited, the state persists. When a new > scheduler instance loads and calls scx_enable_task() for all tasks, > it does not reset this leftover state. The next wakeup's > ops.select_cpu() then sees a non-INVALID ddsp_dsq_id and triggers: > > WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) > > Fix: clear ddsp_dsq_id and ddsp_enq_flags in scx_enable_task() before > calling ops.enable(), ensuring each new scheduler instance starts > with a clean direct dispatch state per task. I don't understand this one. If we fix the missing clearing from dequeue, where would the residual ddsp_dsq_id come from? How would a sleeping task have ddsp_dsq_id set? Note that select_cpu() + enqueue() call sequence is atomic w.r.t. dequeue as both are protected by pi_lock. It's been always a bit bothersome that ddsp_dsq_id was being cleared in dispatch_enqueue(). It was there to catch the cases where ddsp_dsq_id was overridden but it just isn't the right place. Can we do the following? - Add clear_direct_dispatch() which clears ddsp_dsq_id and ddsp_enq_flags. - Add clear_direct_dispatch() call under the enqueue: in do_enqueue_task() and remove ddsp clearing from dispatch_enqueue(). This should catch all cases that ignore ddsp. - Add clear_direct_dispatch() call after dispatch_enqueue() in direct_dispatch(). This clears it for the synchronous consumption. - Add clear_direct_dispatch() call before dispatch_to_local_dsq() call in process_ddsp_deferred_locals(). Note that the funciton has to cache and clear ddsp fields *before* calling dispatch_to_local_enq() as the function will migrate the task to another rq and we can't control what happens to it afterwrds. Even for the previous synchronous case, it may just be a better pattern to always cache dsq_id and enq_flags in local vars and clear p->scx.ddsp* before calling dispatch_enqueue(). - Add clear_direct_dispatch() call to dequeue_task_scx() after dispatch_dequeue(). I think this should capture all cases and the fields are cleared where they should be cleared (either consumed or canceled). Thanks. -- tejun
Hi Tejun, On Wed, Apr 01, 2026 at 12:46:58PM -1000, Tejun Heo wrote: > Hello, > > (cc'ing Patrick Somaru. This is the same issue reported on > https://github.com/sched-ext/scx/pull/3482) > > On Wed, Apr 01, 2026 at 11:56:19PM +0200, Andrea Righi wrote: > > @p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) in three > > scenarios, causing a spurious WARN_ON_ONCE() in mark_direct_dispatch() > > when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert(): > > > > 1. Deferred dispatch cancellation: when a task is directly dispatched to > > a remote CPU's local DSQ via ops.select_cpu() or ops.enqueue(), the > > dispatch is deferred (since we can't lock the remote rq while holding > > the current one). If the task is dequeued before processing the > > dispatch in process_ddsp_deferred_locals(), dispatch_dequeue() > > removes the task from the list leaving a stale direct dispatch state. > > > > Fix: clear ddsp_dsq_id and ddsp_enq_flags in the !list_empty branch > > of dispatch_dequeue(). > > > > 2. Holding-cpu dispatch race: when dispatch_to_local_dsq() transfers a > > task to another CPU's local DSQ, it sets holding_cpu and releases > > DISPATCHING before locking the source rq. If dequeue wins the race > > and clears holding_cpu, dispatch_enqueue() is never called and > > ddsp_dsq_id is not cleared. > > > > Fix: clear ddsp_dsq_id and ddsp_enq_flags when clearing holding_cpu > > in dispatch_dequeue(). > > These two just mean that dequeue need to clear it, right? Correct, we want to clear the state where dispatch_dequeue() cancels a pending direct dispatch without calling dispatch_enqueue(), so I could have just clear the state unconditionally in the !dsq case and simplify the code. > > > 3. Cross-scheduler-instance stale state: When an SCX scheduler exits, > > scx_bypass() iterates over all runnable tasks to dequeue/re-enqueue > > them, but sleeping tasks are not on any runqueue and are not touched. > > If a sleeping task had a deferred dispatch in flight (ddsp_dsq_id > > set) at the time the scheduler exited, the state persists. When a new > > scheduler instance loads and calls scx_enable_task() for all tasks, > > it does not reset this leftover state. The next wakeup's > > ops.select_cpu() then sees a non-INVALID ddsp_dsq_id and triggers: > > > > WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) > > > > Fix: clear ddsp_dsq_id and ddsp_enq_flags in scx_enable_task() before > > calling ops.enable(), ensuring each new scheduler instance starts > > with a clean direct dispatch state per task. > > I don't understand this one. If we fix the missing clearing from dequeue, > where would the residual ddsp_dsq_id come from? How would a sleeping task > have ddsp_dsq_id set? Note that select_cpu() + enqueue() call sequence is > atomic w.r.t. dequeue as both are protected by pi_lock. > > It's been always a bit bothersome that ddsp_dsq_id was being cleared in > dispatch_enqueue(). It was there to catch the cases where ddsp_dsq_id was > overridden but it just isn't the right place. Can we do the following? > > - Add clear_direct_dispatch() which clears ddsp_dsq_id and ddsp_enq_flags. > > - Add clear_direct_dispatch() call under the enqueue: in do_enqueue_task() > and remove ddsp clearing from dispatch_enqueue(). This should catch all > cases that ignore ddsp. > > - Add clear_direct_dispatch() call after dispatch_enqueue() in > direct_dispatch(). This clears it for the synchronous consumption. > > - Add clear_direct_dispatch() call before dispatch_to_local_dsq() call in > process_ddsp_deferred_locals(). Note that the funciton has to cache and > clear ddsp fields *before* calling dispatch_to_local_enq() as the function > will migrate the task to another rq and we can't control what happens to > it afterwrds. Even for the previous synchronous case, it may just be a > better pattern to always cache dsq_id and enq_flags in local vars and > clear p->scx.ddsp* before calling dispatch_enqueue(). > > - Add clear_direct_dispatch() call to dequeue_task_scx() after > dispatch_dequeue(). > > I think this should capture all cases and the fields are cleared where they > should be cleared (either consumed or canceled). I like this, it looks like a better design. However, I tried it, but I'm still able to trigger the warning, unless I clear the direct dispatch state in __scx_enable_task(), so we're still missing a case that doesn't properly clear the state. I think it has something to do with sleeping tasks / queued wakeups / bypass, because now I can easily reproduce the warning running a `stress-ng --sleep 1` while restarting a scheduler that uses SCX_OPS_ALLOW_QUEUED_WAKEUP, like scx_cosmos. Will keep investigating... Thanks, -Andrea
On Thu, Apr 02, 2026 at 09:40:41AM +0200, Andrea Righi wrote: > Hi Tejun, > > On Wed, Apr 01, 2026 at 12:46:58PM -1000, Tejun Heo wrote: > > Hello, > > > > (cc'ing Patrick Somaru. This is the same issue reported on > > https://github.com/sched-ext/scx/pull/3482) > > > > On Wed, Apr 01, 2026 at 11:56:19PM +0200, Andrea Righi wrote: > > > @p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) in three > > > scenarios, causing a spurious WARN_ON_ONCE() in mark_direct_dispatch() > > > when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert(): > > > > > > 1. Deferred dispatch cancellation: when a task is directly dispatched to > > > a remote CPU's local DSQ via ops.select_cpu() or ops.enqueue(), the > > > dispatch is deferred (since we can't lock the remote rq while holding > > > the current one). If the task is dequeued before processing the > > > dispatch in process_ddsp_deferred_locals(), dispatch_dequeue() > > > removes the task from the list leaving a stale direct dispatch state. > > > > > > Fix: clear ddsp_dsq_id and ddsp_enq_flags in the !list_empty branch > > > of dispatch_dequeue(). > > > > > > 2. Holding-cpu dispatch race: when dispatch_to_local_dsq() transfers a > > > task to another CPU's local DSQ, it sets holding_cpu and releases > > > DISPATCHING before locking the source rq. If dequeue wins the race > > > and clears holding_cpu, dispatch_enqueue() is never called and > > > ddsp_dsq_id is not cleared. > > > > > > Fix: clear ddsp_dsq_id and ddsp_enq_flags when clearing holding_cpu > > > in dispatch_dequeue(). > > > > These two just mean that dequeue need to clear it, right? > > Correct, we want to clear the state where dispatch_dequeue() cancels a > pending direct dispatch without calling dispatch_enqueue(), so I could have > just clear the state unconditionally in the !dsq case and simplify the > code. > > > > > > 3. Cross-scheduler-instance stale state: When an SCX scheduler exits, > > > scx_bypass() iterates over all runnable tasks to dequeue/re-enqueue > > > them, but sleeping tasks are not on any runqueue and are not touched. > > > If a sleeping task had a deferred dispatch in flight (ddsp_dsq_id > > > set) at the time the scheduler exited, the state persists. When a new > > > scheduler instance loads and calls scx_enable_task() for all tasks, > > > it does not reset this leftover state. The next wakeup's > > > ops.select_cpu() then sees a non-INVALID ddsp_dsq_id and triggers: > > > > > > WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) > > > > > > Fix: clear ddsp_dsq_id and ddsp_enq_flags in scx_enable_task() before > > > calling ops.enable(), ensuring each new scheduler instance starts > > > with a clean direct dispatch state per task. > > > > I don't understand this one. If we fix the missing clearing from dequeue, > > where would the residual ddsp_dsq_id come from? How would a sleeping task > > have ddsp_dsq_id set? Note that select_cpu() + enqueue() call sequence is > > atomic w.r.t. dequeue as both are protected by pi_lock. > > > > It's been always a bit bothersome that ddsp_dsq_id was being cleared in > > dispatch_enqueue(). It was there to catch the cases where ddsp_dsq_id was > > overridden but it just isn't the right place. Can we do the following? > > > > - Add clear_direct_dispatch() which clears ddsp_dsq_id and ddsp_enq_flags. > > > > - Add clear_direct_dispatch() call under the enqueue: in do_enqueue_task() > > and remove ddsp clearing from dispatch_enqueue(). This should catch all > > cases that ignore ddsp. > > > > - Add clear_direct_dispatch() call after dispatch_enqueue() in > > direct_dispatch(). This clears it for the synchronous consumption. > > > > - Add clear_direct_dispatch() call before dispatch_to_local_dsq() call in > > process_ddsp_deferred_locals(). Note that the funciton has to cache and > > clear ddsp fields *before* calling dispatch_to_local_enq() as the function > > will migrate the task to another rq and we can't control what happens to > > it afterwrds. Even for the previous synchronous case, it may just be a > > better pattern to always cache dsq_id and enq_flags in local vars and > > clear p->scx.ddsp* before calling dispatch_enqueue(). > > > > - Add clear_direct_dispatch() call to dequeue_task_scx() after > > dispatch_dequeue(). > > > > I think this should capture all cases and the fields are cleared where they > > should be cleared (either consumed or canceled). > > I like this, it looks like a better design. > > However, I tried it, but I'm still able to trigger the warning, unless I > clear the direct dispatch state in __scx_enable_task(), so we're still > missing a case that doesn't properly clear the state. > > I think it has something to do with sleeping tasks / queued wakeups / > bypass, because now I can easily reproduce the warning running a > `stress-ng --sleep 1` while restarting a scheduler that uses > SCX_OPS_ALLOW_QUEUED_WAKEUP, like scx_cosmos. Will keep investigating... I think I see it: waking tasks may have had ddsp_dsq_id set by the outgoing scheduler's ops.select_cpu() and then been queued on a wake_list via ttwu_queue_wakelist(), when SCX_OPS_ALLOW_QUEUED_WAKEUP is set. Such tasks are not on the runqueue (on_rq == 0) and are not iterated by scx_bypass(), so their direct dispatch state won't be cleared by dequeue_task_scx(). How about clearning the direct dispatch state in scx_disable_task()? That would catch those tasks as well and it won't leave the stale state from the previous scx scheduler instance. Thanks, -Andrea
© 2016 - 2026 Red Hat, Inc.