[v1] sched_ext: Fix stale direct dispatch state in ddsp_dsq_id

[PATCH] sched_ext: Fix stale direct dispatch state in ddsp_dsq_id

Posted by Andrea Righi 2 months, 1 week ago

@p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) in three
scenarios, causing a spurious WARN_ON_ONCE() in mark_direct_dispatch()
when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert():

1. Deferred dispatch cancellation: when a task is directly dispatched to
   a remote CPU's local DSQ via ops.select_cpu() or ops.enqueue(), the
   dispatch is deferred (since we can't lock the remote rq while holding
   the current one). If the task is dequeued before processing the
   dispatch in process_ddsp_deferred_locals(), dispatch_dequeue()
   removes the task from the list leaving a stale direct dispatch state.

   Fix: clear ddsp_dsq_id and ddsp_enq_flags in the !list_empty branch
   of dispatch_dequeue().

2. Holding-cpu dispatch race: when dispatch_to_local_dsq() transfers a
   task to another CPU's local DSQ, it sets holding_cpu and releases
   DISPATCHING before locking the source rq. If dequeue wins the race
   and clears holding_cpu, dispatch_enqueue() is never called and
   ddsp_dsq_id is not cleared.

   Fix: clear ddsp_dsq_id and ddsp_enq_flags when clearing holding_cpu
   in dispatch_dequeue().

3. Cross-scheduler-instance stale state: When an SCX scheduler exits,
   scx_bypass() iterates over all runnable tasks to dequeue/re-enqueue
   them, but sleeping tasks are not on any runqueue and are not touched.
   If a sleeping task had a deferred dispatch in flight (ddsp_dsq_id
   set) at the time the scheduler exited, the state persists. When a new
   scheduler instance loads and calls scx_enable_task() for all tasks,
   it does not reset this leftover state. The next wakeup's
   ops.select_cpu() then sees a non-INVALID ddsp_dsq_id and triggers:

     WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)

   Fix: clear ddsp_dsq_id and ddsp_enq_flags in scx_enable_task() before
   calling ops.enable(), ensuring each new scheduler instance starts
   with a clean direct dispatch state per task.

With all the fixes applied the SCX_DSQ_INVALID warning doesn't seem to
happen anymore.

Fixes: 5b26f7b920f76 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches")
Cc: stable@vger.kernel.org # v6.12+
Cc: Daniel Hodges <hodgesd@meta.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 26a6ac2f88267..de827ce0ffb74 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1163,20 +1163,32 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
 	if (!dsq) {
 		/*
 		 * If !dsq && on-list, @p is on @rq's ddsp_deferred_locals.
-		 * Unlinking is all that's needed to cancel.
+		 * Unlink and clear the deferred dispatch state.
 		 */
-		if (unlikely(!list_empty(&p->scx.dsq_list.node)))
+		if (unlikely(!list_empty(&p->scx.dsq_list.node))) {
 			list_del_init(&p->scx.dsq_list.node);
 
+			p->scx.ddsp_dsq_id = SCX_DSQ_INVALID;
+			p->scx.ddsp_enq_flags = 0;
+		}
+
 		/*
 		 * When dispatching directly from the BPF scheduler to a local
 		 * DSQ, the task isn't associated with any DSQ but
 		 * @p->scx.holding_cpu may be set under the protection of
-		 * %SCX_OPSS_DISPATCHING.
+		 * %SCX_OPSS_DISPATCHING. If we win the race and clear
+		 * holding_cpu before dispatch_to_local_dsq() completes, the
+		 * in-flight dispatch is cancelled and dispatch_enqueue() won't
+		 * be called. Clear the stale direct dispatch state here so the
+		 * next wakeup starts clean.
 		 */
-		if (p->scx.holding_cpu >= 0)
+		if (p->scx.holding_cpu >= 0) {
 			p->scx.holding_cpu = -1;
 
+			p->scx.ddsp_dsq_id = SCX_DSQ_INVALID;
+			p->scx.ddsp_enq_flags = 0;
+		}
+
 		return;
 	}
 
@@ -2945,6 +2957,9 @@ static void scx_enable_task(struct task_struct *p)
 
 	p->scx.weight = sched_weight_to_cgroup(weight);
 
+	p->scx.ddsp_dsq_id = SCX_DSQ_INVALID;
+	p->scx.ddsp_enq_flags = 0;
+
 	if (SCX_HAS_OP(sch, enable))
 		SCX_CALL_OP_TASK(sch, SCX_KF_REST, enable, rq, p);
 	scx_set_task_state(p, SCX_TASK_ENABLED);
-- 
2.53.0

Re: [PATCH] sched_ext: Fix stale direct dispatch state in ddsp_dsq_id

Posted by Tejun Heo 2 months, 1 week ago

Hello,

(cc'ing Patrick Somaru. This is the same issue reported on
https://github.com/sched-ext/scx/pull/3482)

On Wed, Apr 01, 2026 at 11:56:19PM +0200, Andrea Righi wrote:
> @p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) in three
> scenarios, causing a spurious WARN_ON_ONCE() in mark_direct_dispatch()
> when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert():
> 
> 1. Deferred dispatch cancellation: when a task is directly dispatched to
>    a remote CPU's local DSQ via ops.select_cpu() or ops.enqueue(), the
>    dispatch is deferred (since we can't lock the remote rq while holding
>    the current one). If the task is dequeued before processing the
>    dispatch in process_ddsp_deferred_locals(), dispatch_dequeue()
>    removes the task from the list leaving a stale direct dispatch state.
> 
>    Fix: clear ddsp_dsq_id and ddsp_enq_flags in the !list_empty branch
>    of dispatch_dequeue().
> 
> 2. Holding-cpu dispatch race: when dispatch_to_local_dsq() transfers a
>    task to another CPU's local DSQ, it sets holding_cpu and releases
>    DISPATCHING before locking the source rq. If dequeue wins the race
>    and clears holding_cpu, dispatch_enqueue() is never called and
>    ddsp_dsq_id is not cleared.
> 
>    Fix: clear ddsp_dsq_id and ddsp_enq_flags when clearing holding_cpu
>    in dispatch_dequeue().

These two just mean that dequeue need to clear it, right?

> 3. Cross-scheduler-instance stale state: When an SCX scheduler exits,
>    scx_bypass() iterates over all runnable tasks to dequeue/re-enqueue
>    them, but sleeping tasks are not on any runqueue and are not touched.
>    If a sleeping task had a deferred dispatch in flight (ddsp_dsq_id
>    set) at the time the scheduler exited, the state persists. When a new
>    scheduler instance loads and calls scx_enable_task() for all tasks,
>    it does not reset this leftover state. The next wakeup's
>    ops.select_cpu() then sees a non-INVALID ddsp_dsq_id and triggers:
> 
>      WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
> 
>    Fix: clear ddsp_dsq_id and ddsp_enq_flags in scx_enable_task() before
>    calling ops.enable(), ensuring each new scheduler instance starts
>    with a clean direct dispatch state per task.

I don't understand this one. If we fix the missing clearing from dequeue,
where would the residual ddsp_dsq_id come from? How would a sleeping task
have ddsp_dsq_id set? Note that select_cpu() + enqueue() call sequence is
atomic w.r.t. dequeue as both are protected by pi_lock.

It's been always a bit bothersome that ddsp_dsq_id was being cleared in
dispatch_enqueue(). It was there to catch the cases where ddsp_dsq_id was
overridden but it just isn't the right place. Can we do the following?

- Add clear_direct_dispatch() which clears ddsp_dsq_id and ddsp_enq_flags.

- Add clear_direct_dispatch() call under the enqueue: in do_enqueue_task()
  and remove ddsp clearing from dispatch_enqueue(). This should catch all
  cases that ignore ddsp.

- Add clear_direct_dispatch() call after dispatch_enqueue() in
  direct_dispatch(). This clears it for the synchronous consumption.

- Add clear_direct_dispatch() call before dispatch_to_local_dsq() call in
  process_ddsp_deferred_locals(). Note that the funciton has to cache and
  clear ddsp fields *before* calling dispatch_to_local_enq() as the function
  will migrate the task to another rq and we can't control what happens to
  it afterwrds. Even for the previous synchronous case, it may just be a
  better pattern to always cache dsq_id and enq_flags in local vars and
  clear p->scx.ddsp* before calling dispatch_enqueue().

- Add clear_direct_dispatch() call to dequeue_task_scx() after
  dispatch_dequeue().

I think this should capture all cases and the fields are cleared where they
should be cleared (either consumed or canceled).

Thanks.

-- 
tejun

Re: [PATCH] sched_ext: Fix stale direct dispatch state in ddsp_dsq_id

Posted by Andrea Righi 2 months, 1 week ago

Hi Tejun,

On Wed, Apr 01, 2026 at 12:46:58PM -1000, Tejun Heo wrote:
> Hello,
> 
> (cc'ing Patrick Somaru. This is the same issue reported on
> https://github.com/sched-ext/scx/pull/3482)
> 
> On Wed, Apr 01, 2026 at 11:56:19PM +0200, Andrea Righi wrote:
> > @p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) in three
> > scenarios, causing a spurious WARN_ON_ONCE() in mark_direct_dispatch()
> > when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert():
> > 
> > 1. Deferred dispatch cancellation: when a task is directly dispatched to
> >    a remote CPU's local DSQ via ops.select_cpu() or ops.enqueue(), the
> >    dispatch is deferred (since we can't lock the remote rq while holding
> >    the current one). If the task is dequeued before processing the
> >    dispatch in process_ddsp_deferred_locals(), dispatch_dequeue()
> >    removes the task from the list leaving a stale direct dispatch state.
> > 
> >    Fix: clear ddsp_dsq_id and ddsp_enq_flags in the !list_empty branch
> >    of dispatch_dequeue().
> > 
> > 2. Holding-cpu dispatch race: when dispatch_to_local_dsq() transfers a
> >    task to another CPU's local DSQ, it sets holding_cpu and releases
> >    DISPATCHING before locking the source rq. If dequeue wins the race
> >    and clears holding_cpu, dispatch_enqueue() is never called and
> >    ddsp_dsq_id is not cleared.
> > 
> >    Fix: clear ddsp_dsq_id and ddsp_enq_flags when clearing holding_cpu
> >    in dispatch_dequeue().
> 
> These two just mean that dequeue need to clear it, right?

Correct, we want to clear the state where dispatch_dequeue() cancels a
pending direct dispatch without calling dispatch_enqueue(), so I could have
just clear the state unconditionally in the !dsq case and simplify the
code.

> 
> > 3. Cross-scheduler-instance stale state: When an SCX scheduler exits,
> >    scx_bypass() iterates over all runnable tasks to dequeue/re-enqueue
> >    them, but sleeping tasks are not on any runqueue and are not touched.
> >    If a sleeping task had a deferred dispatch in flight (ddsp_dsq_id
> >    set) at the time the scheduler exited, the state persists. When a new
> >    scheduler instance loads and calls scx_enable_task() for all tasks,
> >    it does not reset this leftover state. The next wakeup's
> >    ops.select_cpu() then sees a non-INVALID ddsp_dsq_id and triggers:
> > 
> >      WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
> > 
> >    Fix: clear ddsp_dsq_id and ddsp_enq_flags in scx_enable_task() before
> >    calling ops.enable(), ensuring each new scheduler instance starts
> >    with a clean direct dispatch state per task.
> 
> I don't understand this one. If we fix the missing clearing from dequeue,
> where would the residual ddsp_dsq_id come from? How would a sleeping task
> have ddsp_dsq_id set? Note that select_cpu() + enqueue() call sequence is
> atomic w.r.t. dequeue as both are protected by pi_lock.
> 
> It's been always a bit bothersome that ddsp_dsq_id was being cleared in
> dispatch_enqueue(). It was there to catch the cases where ddsp_dsq_id was
> overridden but it just isn't the right place. Can we do the following?
> 
> - Add clear_direct_dispatch() which clears ddsp_dsq_id and ddsp_enq_flags.
> 
> - Add clear_direct_dispatch() call under the enqueue: in do_enqueue_task()
>   and remove ddsp clearing from dispatch_enqueue(). This should catch all
>   cases that ignore ddsp.
> 
> - Add clear_direct_dispatch() call after dispatch_enqueue() in
>   direct_dispatch(). This clears it for the synchronous consumption.
> 
> - Add clear_direct_dispatch() call before dispatch_to_local_dsq() call in
>   process_ddsp_deferred_locals(). Note that the funciton has to cache and
>   clear ddsp fields *before* calling dispatch_to_local_enq() as the function
>   will migrate the task to another rq and we can't control what happens to
>   it afterwrds. Even for the previous synchronous case, it may just be a
>   better pattern to always cache dsq_id and enq_flags in local vars and
>   clear p->scx.ddsp* before calling dispatch_enqueue().
> 
> - Add clear_direct_dispatch() call to dequeue_task_scx() after
>   dispatch_dequeue().
> 
> I think this should capture all cases and the fields are cleared where they
> should be cleared (either consumed or canceled).

I like this, it looks like a better design.

However, I tried it, but I'm still able to trigger the warning, unless I
clear the direct dispatch state in __scx_enable_task(), so we're still
missing a case that doesn't properly clear the state.

I think it has something to do with sleeping tasks / queued wakeups /
bypass, because now I can easily reproduce the warning running a
`stress-ng --sleep 1` while restarting a scheduler that uses
SCX_OPS_ALLOW_QUEUED_WAKEUP, like scx_cosmos. Will keep investigating...

Thanks,
-Andrea

Re: [PATCH] sched_ext: Fix stale direct dispatch state in ddsp_dsq_id

Posted by Andrea Righi 2 months, 1 week ago

On Thu, Apr 02, 2026 at 09:40:41AM +0200, Andrea Righi wrote:
> Hi Tejun,
> 
> On Wed, Apr 01, 2026 at 12:46:58PM -1000, Tejun Heo wrote:
> > Hello,
> > 
> > (cc'ing Patrick Somaru. This is the same issue reported on
> > https://github.com/sched-ext/scx/pull/3482)
> > 
> > On Wed, Apr 01, 2026 at 11:56:19PM +0200, Andrea Righi wrote:
> > > @p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) in three
> > > scenarios, causing a spurious WARN_ON_ONCE() in mark_direct_dispatch()
> > > when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert():
> > > 
> > > 1. Deferred dispatch cancellation: when a task is directly dispatched to
> > >    a remote CPU's local DSQ via ops.select_cpu() or ops.enqueue(), the
> > >    dispatch is deferred (since we can't lock the remote rq while holding
> > >    the current one). If the task is dequeued before processing the
> > >    dispatch in process_ddsp_deferred_locals(), dispatch_dequeue()
> > >    removes the task from the list leaving a stale direct dispatch state.
> > > 
> > >    Fix: clear ddsp_dsq_id and ddsp_enq_flags in the !list_empty branch
> > >    of dispatch_dequeue().
> > > 
> > > 2. Holding-cpu dispatch race: when dispatch_to_local_dsq() transfers a
> > >    task to another CPU's local DSQ, it sets holding_cpu and releases
> > >    DISPATCHING before locking the source rq. If dequeue wins the race
> > >    and clears holding_cpu, dispatch_enqueue() is never called and
> > >    ddsp_dsq_id is not cleared.
> > > 
> > >    Fix: clear ddsp_dsq_id and ddsp_enq_flags when clearing holding_cpu
> > >    in dispatch_dequeue().
> > 
> > These two just mean that dequeue need to clear it, right?
> 
> Correct, we want to clear the state where dispatch_dequeue() cancels a
> pending direct dispatch without calling dispatch_enqueue(), so I could have
> just clear the state unconditionally in the !dsq case and simplify the
> code.
> 
> > 
> > > 3. Cross-scheduler-instance stale state: When an SCX scheduler exits,
> > >    scx_bypass() iterates over all runnable tasks to dequeue/re-enqueue
> > >    them, but sleeping tasks are not on any runqueue and are not touched.
> > >    If a sleeping task had a deferred dispatch in flight (ddsp_dsq_id
> > >    set) at the time the scheduler exited, the state persists. When a new
> > >    scheduler instance loads and calls scx_enable_task() for all tasks,
> > >    it does not reset this leftover state. The next wakeup's
> > >    ops.select_cpu() then sees a non-INVALID ddsp_dsq_id and triggers:
> > > 
> > >      WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
> > > 
> > >    Fix: clear ddsp_dsq_id and ddsp_enq_flags in scx_enable_task() before
> > >    calling ops.enable(), ensuring each new scheduler instance starts
> > >    with a clean direct dispatch state per task.
> > 
> > I don't understand this one. If we fix the missing clearing from dequeue,
> > where would the residual ddsp_dsq_id come from? How would a sleeping task
> > have ddsp_dsq_id set? Note that select_cpu() + enqueue() call sequence is
> > atomic w.r.t. dequeue as both are protected by pi_lock.
> > 
> > It's been always a bit bothersome that ddsp_dsq_id was being cleared in
> > dispatch_enqueue(). It was there to catch the cases where ddsp_dsq_id was
> > overridden but it just isn't the right place. Can we do the following?
> > 
> > - Add clear_direct_dispatch() which clears ddsp_dsq_id and ddsp_enq_flags.
> > 
> > - Add clear_direct_dispatch() call under the enqueue: in do_enqueue_task()
> >   and remove ddsp clearing from dispatch_enqueue(). This should catch all
> >   cases that ignore ddsp.
> > 
> > - Add clear_direct_dispatch() call after dispatch_enqueue() in
> >   direct_dispatch(). This clears it for the synchronous consumption.
> > 
> > - Add clear_direct_dispatch() call before dispatch_to_local_dsq() call in
> >   process_ddsp_deferred_locals(). Note that the funciton has to cache and
> >   clear ddsp fields *before* calling dispatch_to_local_enq() as the function
> >   will migrate the task to another rq and we can't control what happens to
> >   it afterwrds. Even for the previous synchronous case, it may just be a
> >   better pattern to always cache dsq_id and enq_flags in local vars and
> >   clear p->scx.ddsp* before calling dispatch_enqueue().
> > 
> > - Add clear_direct_dispatch() call to dequeue_task_scx() after
> >   dispatch_dequeue().
> > 
> > I think this should capture all cases and the fields are cleared where they
> > should be cleared (either consumed or canceled).
> 
> I like this, it looks like a better design.
> 
> However, I tried it, but I'm still able to trigger the warning, unless I
> clear the direct dispatch state in __scx_enable_task(), so we're still
> missing a case that doesn't properly clear the state.
> 
> I think it has something to do with sleeping tasks / queued wakeups /
> bypass, because now I can easily reproduce the warning running a
> `stress-ng --sleep 1` while restarting a scheduler that uses
> SCX_OPS_ALLOW_QUEUED_WAKEUP, like scx_cosmos. Will keep investigating...

I think I see it: waking tasks may have had ddsp_dsq_id set by the outgoing
scheduler's ops.select_cpu() and then been queued on a wake_list via
ttwu_queue_wakelist(), when SCX_OPS_ALLOW_QUEUED_WAKEUP is set. Such tasks are
not on the runqueue (on_rq == 0) and are not iterated by scx_bypass(), so their
direct dispatch state won't be cleared by dequeue_task_scx().

How about clearning the direct dispatch state in scx_disable_task()? That would
catch those tasks as well and it won't leave the stale state from the previous
scx scheduler instance.

Thanks,
-Andrea