A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a
target CPU. However, task affinity can change between the dispatch
decision and its finalization in finish_dispatch(). When this happens,
the scheduler may attempt to dispatch a task to a CPU that is no longer
allowed, resulting in fatal errors such as:
EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565])
This race exists because ops.dispatch() runs without holding the task's
run queue lock, allowing a concurrent set_cpus_allowed() to update
p->cpus_ptr while the BPF scheduler is still using it. The dispatch is
then finalized using stale affinity information.
Example timeline:
  CPU0                                       CPU1
  ----                                       ----
                                             task_rq_lock(p)
  if (cpumask_test_cpu(cpu, p->cpus_ptr))
                                             set_cpus_allowed_scx(p, new_mask)
                                             task_rq_unlock(p)
  scx_bpf_dsq_insert(p,
                     SCX_DSQ_LOCAL_ON | cpu, 0)
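
For illustration only (not part of this patch), the BPF-side pattern that
trips the race might look roughly like the sketch below; pick_queued_task()
is a hypothetical stand-in for scheduler-specific task selection, and the
usual sched_ext BPF scheduler boilerplate is assumed:

/* Illustrative sketch only -- not from this patch. */
void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
{
	/* hypothetical helper: pop a task from the scheduler's own queues */
	struct task_struct *p = pick_queued_task();

	if (!p)
		return;

	/* affinity snapshot taken without p's rq lock held ... */
	if (bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
		/* ... may be stale by the time the insert is finalized */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
}
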
Fix this by extending the existing qseq invalidation mechanism to also
cover CPU affinity changes, in addition to task dequeues/re-enqueues,
occurring between dispatch decision and finalization.
When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
re-dispatched using up-to-date affinity information.
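
Conceptually, the check in finish_dispatch() can be pictured as in the sketch
below (simplified, not the literal kernel code; the helper name is made up,
while the ops_state masks are the ones visible in the diff):

/*
 * Sketch: a buffered dispatch is only acted upon if the task is still
 * QUEUED and its qseq still matches the value sampled when the BPF
 * scheduler called scx_bpf_dsq_insert().
 */
static bool dispatch_still_valid(struct task_struct *p,
				 unsigned long qseq_at_dispatch)
{
	unsigned long opss = atomic_long_read(&p->scx.ops_state);

	if ((opss & SCX_OPSS_STATE_MASK) != SCX_OPSS_QUEUED)
		return false;	/* task no longer queued for dispatch */

	/* stale if the qseq advanced since the dispatch decision */
	return (opss & SCX_OPSS_QSEQ_MASK) == qseq_at_dispatch;
}
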
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 58 +++++++++++++++++++++++++++++++++++++---------
1 file changed, 47 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0ab994180f655..6128a2529a7c7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1828,8 +1828,9 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
* will change. As @p's task_rq is locked, this function doesn't need to use the
* holding_cpu mechanism.
*
- * On return, @src_dsq is unlocked and only @p's new task_rq, which is the
- * return value, is locked.
+ * On success, @src_dsq is unlocked and only @p's new task_rq, which is the
+ * return value, is locked. On failure (affinity change invalidated the move),
+ * returns NULL with @src_dsq unlocked and task remaining in @src_dsq.
*/
static struct rq *move_task_between_dsqs(struct scx_sched *sch,
struct task_struct *p, u64 enq_flags,
@@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
if (dst_dsq->id == SCX_DSQ_LOCAL) {
dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
if (src_rq != dst_rq &&
- unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
- dst_dsq = find_global_dsq(sch, p);
- dst_rq = src_rq;
+ unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
+ /*
+ * Task affinity changed after dispatch decision:
+ * drop the dispatch, caller will handle returning
+ * the task to its original DSQ.
+ */
+ return NULL;
}
} else {
/* no need to migrate if destination is a non-local DSQ */
@@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
}
if (src_rq != dst_rq &&
- unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
- dispatch_enqueue(sch, find_global_dsq(sch, p), p,
- enq_flags | SCX_ENQ_CLEAR_OPSS);
+ unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
+ /*
+ * Task affinity changed after dispatch decision: drop the
+ * dispatch, task remains in its current state and will be
+ * dispatched again in a future cycle.
+ */
+ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED |
+ (atomic_long_read(&p->scx.ops_state) &
+ SCX_OPSS_QSEQ_MASK));
return;
}
@@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p,
struct affinity_context *ac)
{
struct scx_sched *sch = scx_root;
+ struct rq *rq = task_rq(p);
+
+ lockdep_assert_rq_held(rq);
set_cpus_allowed_common(p, ac);
if (unlikely(!sch))
return;
+ /*
+ * Affinity changes invalidate any pending dispatch decisions made
+ * with the old affinity. Increment the runqueue's ops_qseq and
+ * update the task's qseq to invalidate in-flight dispatches.
+ */
+ if (p->scx.flags & SCX_TASK_QUEUED) {
+ unsigned long opss;
+
+ rq->scx.ops_qseq++;
+ opss = atomic_long_read(&p->scx.ops_state);
+ atomic_long_set(&p->scx.ops_state,
+ (opss & SCX_OPSS_STATE_MASK) |
+ (rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT));
+ }
+
/*
* The effective cpumask is stored in @p->cpus_ptr which may temporarily
* differ from the configured one in @p->cpus_mask. Always tell the bpf
@@ -6013,14 +6042,21 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
/* execute move */
locked_rq = move_task_between_dsqs(sch, p, enq_flags, src_dsq, dst_dsq);
- dispatched = true;
+ if (locked_rq) {
+ dispatched = true;
+ } else {
+ /* Move failed: task stays in src_dsq */
+ raw_spin_unlock(&src_dsq->lock);
+ locked_rq = in_balance ? this_rq : NULL;
+ }
out:
if (in_balance) {
if (this_rq != locked_rq) {
- raw_spin_rq_unlock(locked_rq);
+ if (locked_rq)
+ raw_spin_rq_unlock(locked_rq);
raw_spin_rq_lock(this_rq);
}
- } else {
+ } else if (locked_rq) {
raw_spin_rq_unlock_irqrestore(locked_rq, flags);
}
--
2.52.0
Hello,
On Wed, Feb 04, 2026 at 12:06:39AM +0100, Andrea Righi wrote:
...
>   CPU0                                       CPU1
>   ----                                       ----
>                                              task_rq_lock(p)
>   if (cpumask_test_cpu(cpu, p->cpus_ptr))
>                                              set_cpus_allowed_scx(p, new_mask)
>                                              task_rq_unlock(p)
>   scx_bpf_dsq_insert(p,
>                      SCX_DSQ_LOCAL_ON | cpu, 0)
>
> Fix this by extending the existing qseq invalidation mechanism to also
> cover CPU affinity changes, in addition to task dequeues/re-enqueues,
> occurring between dispatch decision and finalization.
>
> When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> re-dispatched using up-to-date affinity information.
It shouldn't be returned, right? set_cpus_allowed() dequeues and
re-enqueues. What the seq invalidation detected is dequeue racing the async
dispatch and the invalidation means that the task was dequeued while on the
async buffer (to be re-enqueued once the property change is complete). It
should just be ignored.
> static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> struct task_struct *p, u64 enq_flags,
> @@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> if (dst_dsq->id == SCX_DSQ_LOCAL) {
> dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
> if (src_rq != dst_rq &&
> - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> - dst_dsq = find_global_dsq(sch, p);
> - dst_rq = src_rq;
> + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> + /*
> + * Task affinity changed after dispatch decision:
> + * drop the dispatch, caller will handle returning
> + * the task to its original DSQ.
> + */
> + return NULL;
...
> @@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> }
>
> if (src_rq != dst_rq &&
> - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> - dispatch_enqueue(sch, find_global_dsq(sch, p), p,
> - enq_flags | SCX_ENQ_CLEAR_OPSS);
> + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> + /*
> + * Task affinity changed after dispatch decision: drop the
> + * dispatch, task remains in its current state and will be
> + * dispatched again in a future cycle.
> + */
> + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED |
> + (atomic_long_read(&p->scx.ops_state) &
> + SCX_OPSS_QSEQ_MASK));
I don't quite follow why we need the above slippery behavior. The qseq
invalidation, if reliable, means that there's no race window if the BPF
scheduler correctly implements ops.dequeue() (after the kernel side fixes it
of course).
i.e. the BPF scheduler is responsible for synchronizing its
scx_bpf_dsq_insert() call against whatever it needs to do in ops.dequeue().
If ops.dequeue() wins, the task shouldn't be inserted in the first place. If
ops.dequeue() loses, the qseq invalidation should kill it while on async
buffer if it wins over finish_dispatch(). If finish_dispatch() wins, the
task will just be dequeued from the inserted DSQ or the property change will
happen while the task is running.
Now, maybe we want to allow the BPF scheduler to be lax about ops.dequeue()
synchronization and let things slide (probably optionally w/ an OPS flag),
but for that, falling back to global DSQ is fine, no?
> @@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> struct affinity_context *ac)
> {
> struct scx_sched *sch = scx_root;
> + struct rq *rq = task_rq(p);
> +
> + lockdep_assert_rq_held(rq);
>
> set_cpus_allowed_common(p, ac);
>
> if (unlikely(!sch))
> return;
>
> + /*
> + * Affinity changes invalidate any pending dispatch decisions made
> + * with the old affinity. Increment the runqueue's ops_qseq and
> + * update the task's qseq to invalidate in-flight dispatches.
> + */
> + if (p->scx.flags & SCX_TASK_QUEUED) {
> + unsigned long opss;
> +
> + rq->scx.ops_qseq++;
> + opss = atomic_long_read(&p->scx.ops_state);
> + atomic_long_set(&p->scx.ops_state,
> + (opss & SCX_OPSS_STATE_MASK) |
> + (rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT));
> + }
> +
I wonder whether we should define an invalid qseq and use that instead. The
queueing instance really is invalid after this and it would help catching
cases where BPF scheduler makes mistakes w/ synchronization. Also, wouldn't
dequeue_task_scx() or ops_dequeue() be a better place to shoot down the
enqueued instances? While the symptoms we most immediately see are through
cpumask changes, the underlying problem is dequeue not shooting down
existing enqueued tasks.
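
For concreteness, one hypothetical shape such a sentinel could take (this is
not in the patch or in the current kernel, and it assumes an all-ones qseq
never occurs as a live sequence number):

#define SCX_OPSS_QSEQ_INVALID	SCX_OPSS_QSEQ_MASK	/* reserved, never live */

/* hypothetical helper: mark any buffered dispatch for @p as stale */
static void scx_invalidate_queued_instance(struct task_struct *p)
{
	unsigned long opss = atomic_long_read(&p->scx.ops_state);

	/* keep the state bits, clobber the qseq so finish_dispatch() bails */
	atomic_long_set(&p->scx.ops_state,
			(opss & SCX_OPSS_STATE_MASK) | SCX_OPSS_QSEQ_INVALID);
}
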
Thanks.
--
tejun
On Wed, Feb 04, 2026 at 01:31:35PM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Feb 04, 2026 at 12:06:39AM +0100, Andrea Righi wrote:
> ...
> >   CPU0                                       CPU1
> >   ----                                       ----
> >                                              task_rq_lock(p)
> >   if (cpumask_test_cpu(cpu, p->cpus_ptr))
> >                                              set_cpus_allowed_scx(p, new_mask)
> >                                              task_rq_unlock(p)
> >   scx_bpf_dsq_insert(p,
> >                      SCX_DSQ_LOCAL_ON | cpu, 0)
> >
> > Fix this by extending the existing qseq invalidation mechanism to also
> > cover CPU affinity changes, in addition to task dequeues/re-enqueues,
> > occurring between dispatch decision and finalization.
> >
> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> > re-dispatched using up-to-date affinity information.
>
> It shouldn't be returned, right? set_cpus_allowed() dequeues and
> re-enqueues. What the seq invalidation detected is dequeue racing the async
> dispatch and the invalidation means that the task was dequeued while on the
> async buffer (to be re-enqueued once the property change is complete). It
> should just be ignored.
Yeah, the only downside is that the scheduler doesn't know that the task
has been re-enqueued due to a failed dispatch, but that's probably fine for
now.
>
> > static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> > struct task_struct *p, u64 enq_flags,
> > @@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> > if (dst_dsq->id == SCX_DSQ_LOCAL) {
> > dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
> > if (src_rq != dst_rq &&
> > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> > - dst_dsq = find_global_dsq(sch, p);
> > - dst_rq = src_rq;
> > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> > + /*
> > + * Task affinity changed after dispatch decision:
> > + * drop the dispatch, caller will handle returning
> > + * the task to its original DSQ.
> > + */
> > + return NULL;
> ...
> > @@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> > }
> >
> > if (src_rq != dst_rq &&
> > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> > - dispatch_enqueue(sch, find_global_dsq(sch, p), p,
> > - enq_flags | SCX_ENQ_CLEAR_OPSS);
> > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> > + /*
> > + * Task affinity changed after dispatch decision: drop the
> > + * dispatch, task remains in its current state and will be
> > + * dispatched again in a future cycle.
> > + */
> > + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED |
> > + (atomic_long_read(&p->scx.ops_state) &
> > + SCX_OPSS_QSEQ_MASK));
>
> I don't quite follow why we need the above slippery behavior. The qseq
> invalidation, if reliable, means that there's no race window if the BPF
> scheduler correctly implements ops.dequeue() (after the kernel side fixes it
> of course).
>
> i.e. the BPF scheduler is responsible for synchronizing its
> scx_bpf_dsq_insert() call against whatever it needs to do in ops.dequeue().
> If ops.dequeue() wins, the task shouldn't be inserted in the first place. If
> ops.dequeue() loses, the qseq invalidation should kill it while on async
> buffer if it wins over finish_dispatch(). If finish_dispatch() wins, the
> task will just be dequeued from the inserted DSQ or the property change will
> happen while the task is running.
>
> Now, maybe we want to allow the BPF scheduler to be lax about ops.dequeue()
> synchronization and let things slide (probably optionally w/ an OPS flag),
> but for that, falling back to global DSQ is fine, no?
I think the problem with the global DSQ fallback is that we're essentially
ignoring a request from the BPF scheduler to dispatch a task to a specific
CPU. Moreover, the global DSQ can potentially introduce starvation: if a
task is silently dispatched to the global DSQ and the BPF scheduler keeps
dispatching tasks to the local DSQs, the task waiting in the global DSQ
will never be consumed.
>
> > @@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> > struct affinity_context *ac)
> > {
> > struct scx_sched *sch = scx_root;
> > + struct rq *rq = task_rq(p);
> > +
> > + lockdep_assert_rq_held(rq);
> >
> > set_cpus_allowed_common(p, ac);
> >
> > if (unlikely(!sch))
> > return;
> >
> > + /*
> > + * Affinity changes invalidate any pending dispatch decisions made
> > + * with the old affinity. Increment the runqueue's ops_qseq and
> > + * update the task's qseq to invalidate in-flight dispatches.
> > + */
> > + if (p->scx.flags & SCX_TASK_QUEUED) {
> > + unsigned long opss;
> > +
> > + rq->scx.ops_qseq++;
> > + opss = atomic_long_read(&p->scx.ops_state);
> > + atomic_long_set(&p->scx.ops_state,
> > + (opss & SCX_OPSS_STATE_MASK) |
> > + (rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT));
> > + }
> > +
>
> I wonder whether we should define an invalid qseq and use that instead. The
> queueing instance really is invalid after this and it would help catching
> cases where BPF scheduler makes mistakes w/ synchronization. Also, wouldn't
> dequeue_task_scx() or ops_dequeue() be a better place to shoot down the
> enqueued instances? While the symptoms we most immediately see are through
> cpumask changes, the underlying problem is dequeue not shooting down
> existing enqueued tasks.
I think I like the idea of having an INVALID_QSEQ or similar, it'd also
make debugging easier.
I'm not sure about moving the logic to dequeue_task_scx(), more exactly,
I'm not sure if there're nasty locking implications. I'll do some
experiments, if it works, sure, dequeue would be a better place to cancel
invalid enqueued instances.
Thanks,
-Andrea
Hello,

On Thu, Feb 05, 2026 at 05:40:05PM +0100, Andrea Righi wrote:
...
> > It shouldn't be returned, right? set_cpus_allowed() dequeues and
> > re-enqueues. What the seq invalidation detected is dequeue racing the async
> > dispatch and the invalidation means that the task was dequeued while on the
> > async buffer (to be re-enqueued once the property change is complete). It
> > should just be ignored.
>
> Yeah, the only downside is that the scheduler doesn't know that the task
> has been re-enqueued due to a failed dispatch, but that's probably fine for
> now.

Yeah, but does that matter? Consider the following three scenarios:

A. Task gets dispatched into local DSQ, CPU mask gets updated while in async
   buffer, the dispatch is ignored and then the task gets re-enqueued later.

B. The same as A but the CPU mask update happens after the task lands in the
   local DSQ but before it starts executing.

C. Task gets dispatched into local DSQ and starts running, CPU mask gets
   updated so that the task can't run on the current CPU anymore, migration
   task preempts the task and it gets enqueued.

A and B would be indistinguishable from BPF sched's POV. C would be a bit
different in that the task would transition through ops->running/stopping().

I don't see anything significantly different across the three scenarios -
the task was dispatched but cpumask got updated and the scheduler needs to
place it again.

...
> > Now, maybe we want to allow the BPF scheduler to be lax about ops.dequeue()
> > synchronization and let things slide (probably optionally w/ an OPS flag),
> > but for that, falling back to global DSQ is fine, no?
>
> I think the problem with the global DSQ fallback is that we're essentially
> ignoring a request from the BPF scheduler to dispatch a task to a specific
> CPU. Moreover, the global DSQ can potentially introduce starvation: if a
> task is silently dispatched to the global DSQ and the BPF scheduler keeps
> dispatching tasks to the local DSQs, the task waiting in the global DSQ
> will never be consumed.

While starvation is possible, it's not very likely:

- ops.select_cpu/enqueue() usually don't direct dispatch to local CPUs
  unless they're idle.

- ops.dispatch() is only called after global DSQ is drained.

If ops.select_cpu/enqueue() keeps DD'ing to local CPUs while there are other
tasks waiting, it's gonna stall whether we fall back to global DSQ or not.

But, taking a step back, the sloppy fallback behavior is secondary. What
really matters is: once we fix ops.dequeue(), can the BPF scheduler properly
synchronize dequeue against scx_bpf_dsq_insert() to avoid triggering cpumask
or migration disabled state mismatches? If so, ops.dequeue() would be the
primary way to deal with these issues.

Maybe not implementing ops.dequeue() can enable sloppy fallbacks as that
indicates the scheduler isn't taking property changes into account at all,
but that's really secondary. Let's first focus on making ops.dequeue() work
properly so that the BPF scheduler can synchronize correctly.

...
> > I wonder whether we should define an invalid qseq and use that instead. The
> > queueing instance really is invalid after this and it would help catching
> > cases where BPF scheduler makes mistakes w/ synchronization. Also, wouldn't
> > dequeue_task_scx() or ops_dequeue() be a better place to shoot down the
> > enqueued instances? While the symptoms we most immediately see are through
> > cpumask changes, the underlying problem is dequeue not shooting down
> > existing enqueued tasks.
>
> I think I like the idea of having an INVALID_QSEQ or similar, it'd also
> make debugging easier.
>
> I'm not sure about moving the logic to dequeue_task_scx(), more exactly,
> I'm not sure if there're nasty locking implications. I'll do some
> experiments, if it works, sure, dequeue would be a better place to cancel
> invalid enqueued instances.

I was confused while writing above. All of the above is already happening.
When a task is dequeued, its OPSS is cleared and the task won't be eligible
for dispatching anymore. The only "confused" case is where the task finishes
re-enqueueing before the previous dispatch attempt is finished, which the BPF
scheduler should be able to handle once ops.dequeue() is fixed.

Thanks.

--
tejun
On Thu, Feb 05, 2026 at 12:57:04PM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Feb 05, 2026 at 05:40:05PM +0100, Andrea Righi wrote:
> ...
> > > It shouldn't be returned, right? set_cpus_allowed() dequeues and
> > > re-enqueues. What the seq invalidation detected is dequeue racing the async
> > > dispatch and the invalidation means that the task was dequeued while on the
> > > async buffer (to be re-enqueued once the property change is complete). It
> > > should just be ignored.
> >
> > Yeah, the only downside is that the scheduler doesn't know that the task
> > has been re-enqueued due to a failed dispatch, but that's probably fine for
> > now.
>
> Yeah, but does that matter? Consider the following three scenarios:
>
> A. Task gets dispatched into local DSQ, CPU mask gets updated while in async
>    buffer, the dispatch is ignored and then the task gets re-enqueued later.
>
> B. The same as A but the CPU mask update happens after the task lands in the
>    local DSQ but before it starts executing.
>
> C. Task gets dispatched into local DSQ and starts running, CPU mask gets
>    updated so that the task can't run on the current CPU anymore, migration
>    task preempts the task and it gets enqueued.
>
> A and B would be indistinguishable from BPF sched's POV. C would be a bit
> different in that the task would transition through ops->running/stopping().
>
> I don't see anything significantly different across the three scenarios -
> the task was dispatched but cpumask got updated and the scheduler needs to
> place it again.

Ack, knowing that an enqueue is coming from a failed dispatch or a regular
running/stopping transition probably doesn't matter. And the scheduler can
probably try to infer this information, if needed.

> ...
> > > Now, maybe we want to allow the BPF scheduler to be lax about ops.dequeue()
> > > synchronization and let things slide (probably optionally w/ an OPS flag),
> > > but for that, falling back to global DSQ is fine, no?
> >
> > I think the problem with the global DSQ fallback is that we're essentially
> > ignoring a request from the BPF scheduler to dispatch a task to a specific
> > CPU. Moreover, the global DSQ can potentially introduce starvation: if a
> > task is silently dispatched to the global DSQ and the BPF scheduler keeps
> > dispatching tasks to the local DSQs, the task waiting in the global DSQ
> > will never be consumed.
>
> While starvation is possible, it's not very likely:
>
> - ops.select_cpu/enqueue() usually don't direct dispatch to local CPUs
>   unless they're idle.
>
> - ops.dispatch() is only called after global DSQ is drained.
>
> If ops.select_cpu/enqueue() keeps DD'ing to local CPUs while there are other
> tasks waiting, it's gonna stall whether we fall back to global DSQ or not.

Well, if a scheduler is only using DD's with SCX_DSQ_LOCAL[_ON] and a per-CPU
task ends up in the global DSQ, the chance of starvation is not that remote.
If we re-enqueue the task in the regular enqueue path, without the global DSQ
fallback, the task will be dispatched again by the BPF scheduler via
SCX_DSQ_LOCAL[_ON] and it's less likely to be starved. I think if we just
drop the dispatch without the global DSQ fallback, everything should work.

> But, taking a step back, the sloppy fallback behavior is secondary. What
> really matters is: once we fix ops.dequeue(), can the BPF scheduler properly
> synchronize dequeue against scx_bpf_dsq_insert() to avoid triggering cpumask
> or migration disabled state mismatches? If so, ops.dequeue() would be the
> primary way to deal with these issues.

Agreed, we should handle this in ops_dequeue().

> Maybe not implementing ops.dequeue() can enable sloppy fallbacks as that
> indicates the scheduler isn't taking property changes into account at all,
> but that's really secondary. Let's first focus on making ops.dequeue() work
> properly so that the BPF scheduler can synchronize correctly.

Ack.

> ...
> > > I wonder whether we should define an invalid qseq and use that instead. The
> > > queueing instance really is invalid after this and it would help catching
> > > cases where BPF scheduler makes mistakes w/ synchronization. Also, wouldn't
> > > dequeue_task_scx() or ops_dequeue() be a better place to shoot down the
> > > enqueued instances? While the symptoms we most immediately see are through
> > > cpumask changes, the underlying problem is dequeue not shooting down
> > > existing enqueued tasks.
> >
> > I think I like the idea of having an INVALID_QSEQ or similar, it'd also
> > make debugging easier.
> >
> > I'm not sure about moving the logic to dequeue_task_scx(), more exactly,
> > I'm not sure if there're nasty locking implications. I'll do some
> > experiments, if it works, sure, dequeue would be a better place to cancel
> > invalid enqueued instances.
>
> I was confused while writing above. All of the above is already happening.
> When a task is dequeued, its OPSS is cleared and the task won't be eligible
> for dispatching anymore. The only "confused" case is where the task finishes
> re-enqueueing before the previous dispatch attempt is finished, which the BPF
> scheduler should be able to handle once ops.dequeue() is fixed.

Yeah, I think we can handle this in the dequeue path. I think I have a new
working (maybe) patch. Will run some tests and send the new version later
today.

Thanks,
-Andrea
Hello,

On Wed, Feb 04, 2026 at 01:31:35PM -1000, Tejun Heo wrote:
...
> I wonder whether we should define an invalid qseq and use that instead. The
> queueing instance really is invalid after this and it would help catching
> cases where BPF scheduler makes mistakes w/ synchronization. Also, wouldn't
> dequeue_task_scx() or ops_dequeue() be a better place to shoot down the
> enqueued instances? While the symptoms we most immediately see are through
> cpumask changes, the underlying problem is dequeue not shooting down
> existing enqueued tasks.

Hmmm... in fact, this is all happening already, right? Isn't all that's
missing the BPF scheduler's ops.dequeue() synchronizing against
scx_bpf_dsq_insert()?

Thanks.

--
tejun
On 2/3/26 23:06, Andrea Righi wrote:
> A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a
> target CPU. However, task affinity can change between the dispatch
> decision and its finalization in finish_dispatch(). When this happens,
> the scheduler may attempt to dispatch a task to a CPU that is no longer
> allowed, resulting in fatal errors such as:
>
> EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565])
>
> This race exists because ops.dispatch() runs without holding the task's
> run queue lock, allowing a concurrent set_cpus_allowed() to update
> p->cpus_ptr while the BPF scheduler is still using it. The dispatch is
> then finalized using stale affinity information.
>
> Example timeline:
>
>   CPU0                                       CPU1
>   ----                                       ----
>                                              task_rq_lock(p)
>   if (cpumask_test_cpu(cpu, p->cpus_ptr))
>                                              set_cpus_allowed_scx(p, new_mask)
>                                              task_rq_unlock(p)
>   scx_bpf_dsq_insert(p,
>                      SCX_DSQ_LOCAL_ON | cpu, 0)
>
> Fix this by extending the existing qseq invalidation mechanism to also
> cover CPU affinity changes, in addition to task dequeues/re-enqueues,
> occurring between dispatch decision and finalization.
>
> When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> re-dispatched using up-to-date affinity information.

Hi Andrea,

so this fixes the default scx_storm insert / dequeue race indeed, let me go
review in-depth and create some more tests...
Hi Andrea,
On Tue Feb 3, 2026 at 11:06 PM UTC, Andrea Righi wrote:
> A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a
> target CPU. However, task affinity can change between the dispatch
> decision and its finalization in finish_dispatch(). When this happens,
> the scheduler may attempt to dispatch a task to a CPU that is no longer
> allowed, resulting in fatal errors such as:
>
> EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565])
>
> This race exists because ops.dispatch() runs without holding the task's
> run queue lock, allowing a concurrent set_cpus_allowed() to update
> p->cpus_ptr while the BPF scheduler is still using it. The dispatch is
> then finalized using stale affinity information.
>
> Example timeline:
>
>   CPU0                                       CPU1
>   ----                                       ----
>                                              task_rq_lock(p)
>   if (cpumask_test_cpu(cpu, p->cpus_ptr))
>                                              set_cpus_allowed_scx(p, new_mask)
>                                              task_rq_unlock(p)
>   scx_bpf_dsq_insert(p,
>                      SCX_DSQ_LOCAL_ON | cpu, 0)
>
> Fix this by extending the existing qseq invalidation mechanism to also
> cover CPU affinity changes, in addition to task dequeues/re-enqueues,
> occurring between dispatch decision and finalization.
I'm not convinced this will prevent the exact race scenario you describe.
For the qseq increment inside set_cpus_allowed_scx() to be noticed, it needs
to happen _after_ scx_bpf_dsq_insert() reads the qseq and stores it in the
dispatch buffer entry.
We can still have the cpumask change and qseq increment on CPU1 happen between
cpumask_test_cpu() and scx_bpf_dsq_insert() on CPU0, and it will not be
detected because the dispatch buffer entry will contain the incremented value.
>
> When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> re-dispatched using up-to-date affinity information.
How will the scheduler know that the dispatch was dropped? Is the scheduler
expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
on CPU1?
>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/ext.c | 58 +++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 47 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 0ab994180f655..6128a2529a7c7 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1828,8 +1828,9 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
> * will change. As @p's task_rq is locked, this function doesn't need to use the
> * holding_cpu mechanism.
> *
> - * On return, @src_dsq is unlocked and only @p's new task_rq, which is the
> - * return value, is locked.
> + * On success, @src_dsq is unlocked and only @p's new task_rq, which is the
> + * return value, is locked. On failure (affinity change invalidated the move),
> + * returns NULL with @src_dsq unlocked and task remaining in @src_dsq.
> */
> static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> struct task_struct *p, u64 enq_flags,
> @@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> if (dst_dsq->id == SCX_DSQ_LOCAL) {
> dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
> if (src_rq != dst_rq &&
> - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> - dst_dsq = find_global_dsq(sch, p);
> - dst_rq = src_rq;
> + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> + /*
> + * Task affinity changed after dispatch decision:
> + * drop the dispatch, caller will handle returning
> + * the task to its original DSQ.
> + */
> + return NULL;
> }
> } else {
> /* no need to migrate if destination is a non-local DSQ */
> @@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> }
>
> if (src_rq != dst_rq &&
> - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> - dispatch_enqueue(sch, find_global_dsq(sch, p), p,
> - enq_flags | SCX_ENQ_CLEAR_OPSS);
> + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> + /*
> + * Task affinity changed after dispatch decision: drop the
> + * dispatch, task remains in its current state and will be
> + * dispatched again in a future cycle.
> + */
> + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED |
> + (atomic_long_read(&p->scx.ops_state) &
> + SCX_OPSS_QSEQ_MASK));
> return;
> }
>
> @@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> struct affinity_context *ac)
> {
> struct scx_sched *sch = scx_root;
> + struct rq *rq = task_rq(p);
> +
> + lockdep_assert_rq_held(rq);
>
> set_cpus_allowed_common(p, ac);
>
> if (unlikely(!sch))
> return;
>
> + /*
> + * Affinity changes invalidate any pending dispatch decisions made
> + * with the old affinity. Increment the runqueue's ops_qseq and
> + * update the task's qseq to invalidate in-flight dispatches.
> + */
> + if (p->scx.flags & SCX_TASK_QUEUED) {
> + unsigned long opss;
> +
> + rq->scx.ops_qseq++;
> + opss = atomic_long_read(&p->scx.ops_state);
> + atomic_long_set(&p->scx.ops_state,
> + (opss & SCX_OPSS_STATE_MASK) |
> + (rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT));
> + }
Will we ever enter this if statement? IIUC set_cpus_allowed_scx() is always
called under `scoped_guard (sched_change, p, DEQUEUE_SAVE)`, so assuming the
task is on a runqueue, set_cpus_allowed_scx() will always be preceded by
dequeue_task_scx(), which clears SCX_TASK_QUEUED.
> +
> /*
> * The effective cpumask is stored in @p->cpus_ptr which may temporarily
> * differ from the configured one in @p->cpus_mask. Always tell the bpf
> @@ -6013,14 +6042,21 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
>
> /* execute move */
> locked_rq = move_task_between_dsqs(sch, p, enq_flags, src_dsq, dst_dsq);
> - dispatched = true;
> + if (locked_rq) {
> + dispatched = true;
> + } else {
> + /* Move failed: task stays in src_dsq */
> + raw_spin_unlock(&src_dsq->lock);
> + locked_rq = in_balance ? this_rq : NULL;
> + }
> out:
> if (in_balance) {
> if (this_rq != locked_rq) {
> - raw_spin_rq_unlock(locked_rq);
> + if (locked_rq)
> + raw_spin_rq_unlock(locked_rq);
> raw_spin_rq_lock(this_rq);
> }
> - } else {
> + } else if (locked_rq) {
> raw_spin_rq_unlock_irqrestore(locked_rq, flags);
> }
>
Thanks,
Kuba
Hi Kuba,
On Wed, Feb 04, 2026 at 01:20:52PM +0000, Kuba Piecuch wrote:
> Hi Andrea,
>
> On Tue Feb 3, 2026 at 11:06 PM UTC, Andrea Righi wrote:
> > A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a
> > target CPU. However, task affinity can change between the dispatch
> > decision and its finalization in finish_dispatch(). When this happens,
> > the scheduler may attempt to dispatch a task to a CPU that is no longer
> > allowed, resulting in fatal errors such as:
> >
> > EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565])
> >
> > This race exists because ops.dispatch() runs without holding the task's
> > run queue lock, allowing a concurrent set_cpus_allowed() to update
> > p->cpus_ptr while the BPF scheduler is still using it. The dispatch is
> > then finalized using stale affinity information.
> >
> > Example timeline:
> >
> >   CPU0                                       CPU1
> >   ----                                       ----
> >                                              task_rq_lock(p)
> >   if (cpumask_test_cpu(cpu, p->cpus_ptr))
> >                                              set_cpus_allowed_scx(p, new_mask)
> >                                              task_rq_unlock(p)
> >   scx_bpf_dsq_insert(p,
> >                      SCX_DSQ_LOCAL_ON | cpu, 0)
> >
> > Fix this by extending the existing qseq invalidation mechanism to also
> > cover CPU affinity changes, in addition to task dequeues/re-enqueues,
> > occurring between dispatch decision and finalization.
>
> I'm not convinced this will prevent the exact race scenario you describe.
> For the qseq increment inside set_cpus_allowed_scx() to be noticed, it needs
> to happen _after_ scx_bpf_dsq_insert() reads the qseq and stores it in the
> dispatch buffer entry.
>
> We can still have the cpumask change and qseq increment on CPU1 happen between
> cpumask_test_cpu() and scx_bpf_dsq_insert() on CPU0, and it will not be
> detected because the dispatch buffer entry will contain the incremented value.
Yeah, I think you're right, it makes things better, but it's still racy.
>
> >
> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> > re-dispatched using up-to-date affinity information.
>
> How will the scheduler know that the dispatch was dropped? Is the scheduler
> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
> on CPU1?
The idea was that, if the dispatch is dropped, we'll see another
ops.enqueue() for the task, so at least the task is not "lost" and the
BPF scheduler gets another chance to decide what to do with it. In this case
it'd be useful to set SCX_ENQ_REENQ (or a dedicated special flag) to indicate
that the enqueue resulted from a dropped dispatch.
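
For illustration, a BPF scheduler could then tell such re-enqueues apart
roughly as sketched below (illustrative only; the SCX_DSQ_GLOBAL placement is
an arbitrary choice for the sketch):

void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
{
	if (enq_flags & SCX_ENQ_REENQ) {
		/*
		 * A previously dispatched instance of @p was dropped (or the
		 * task was re-enqueued for another reason): re-evaluate the
		 * target CPU against the task's current p->cpus_ptr here.
		 */
	}

	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}
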
>
> >
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/ext.c | 58 +++++++++++++++++++++++++++++++++++++---------
> > 1 file changed, 47 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 0ab994180f655..6128a2529a7c7 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -1828,8 +1828,9 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
> > * will change. As @p's task_rq is locked, this function doesn't need to use the
> > * holding_cpu mechanism.
> > *
> > - * On return, @src_dsq is unlocked and only @p's new task_rq, which is the
> > - * return value, is locked.
> > + * On success, @src_dsq is unlocked and only @p's new task_rq, which is the
> > + * return value, is locked. On failure (affinity change invalidated the move),
> > + * returns NULL with @src_dsq unlocked and task remaining in @src_dsq.
> > */
> > static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> > struct task_struct *p, u64 enq_flags,
> > @@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> > if (dst_dsq->id == SCX_DSQ_LOCAL) {
> > dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
> > if (src_rq != dst_rq &&
> > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> > - dst_dsq = find_global_dsq(sch, p);
> > - dst_rq = src_rq;
> > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> > + /*
> > + * Task affinity changed after dispatch decision:
> > + * drop the dispatch, caller will handle returning
> > + * the task to its original DSQ.
> > + */
> > + return NULL;
> > }
> > } else {
> > /* no need to migrate if destination is a non-local DSQ */
> > @@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> > }
> >
> > if (src_rq != dst_rq &&
> > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> > - dispatch_enqueue(sch, find_global_dsq(sch, p), p,
> > - enq_flags | SCX_ENQ_CLEAR_OPSS);
> > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> > + /*
> > + * Task affinity changed after dispatch decision: drop the
> > + * dispatch, task remains in its current state and will be
> > + * dispatched again in a future cycle.
> > + */
> > + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED |
> > + (atomic_long_read(&p->scx.ops_state) &
> > + SCX_OPSS_QSEQ_MASK));
> > return;
> > }
> >
> > @@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> > struct affinity_context *ac)
> > {
> > struct scx_sched *sch = scx_root;
> > + struct rq *rq = task_rq(p);
> > +
> > + lockdep_assert_rq_held(rq);
> >
> > set_cpus_allowed_common(p, ac);
> >
> > if (unlikely(!sch))
> > return;
> >
> > + /*
> > + * Affinity changes invalidate any pending dispatch decisions made
> > + * with the old affinity. Increment the runqueue's ops_qseq and
> > + * update the task's qseq to invalidate in-flight dispatches.
> > + */
> > + if (p->scx.flags & SCX_TASK_QUEUED) {
> > + unsigned long opss;
> > +
> > + rq->scx.ops_qseq++;
> > + opss = atomic_long_read(&p->scx.ops_state);
> > + atomic_long_set(&p->scx.ops_state,
> > + (opss & SCX_OPSS_STATE_MASK) |
> > + (rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT));
> > + }
>
> Will we ever enter this if statement? IIUC set_cpus_allowed_scx() is always
> called under `scoped_guard (sched_change, p, DEQUEUE_SAVE)`, so assuming the
> task is on a runqueue, set_cpus_allowed_scx() will always be preceded by
> dequeue_task_scx(), which clears SCX_TASK_QUEUED.
I think you're right, we can probably remove this.
>
> > +
> > /*
> > * The effective cpumask is stored in @p->cpus_ptr which may temporarily
> > * differ from the configured one in @p->cpus_mask. Always tell the bpf
> > @@ -6013,14 +6042,21 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
> >
> > /* execute move */
> > locked_rq = move_task_between_dsqs(sch, p, enq_flags, src_dsq, dst_dsq);
> > - dispatched = true;
> > + if (locked_rq) {
> > + dispatched = true;
> > + } else {
> > + /* Move failed: task stays in src_dsq */
> > + raw_spin_unlock(&src_dsq->lock);
> > + locked_rq = in_balance ? this_rq : NULL;
> > + }
> > out:
> > if (in_balance) {
> > if (this_rq != locked_rq) {
> > - raw_spin_rq_unlock(locked_rq);
> > + if (locked_rq)
> > + raw_spin_rq_unlock(locked_rq);
> > raw_spin_rq_lock(this_rq);
> > }
> > - } else {
> > + } else if (locked_rq) {
> > raw_spin_rq_unlock_irqrestore(locked_rq, flags);
> > }
> >
>
> Thanks,
> Kuba
>
Thanks,
-Andrea
On Wed Feb 4, 2026 at 3:36 PM UTC, Andrea Righi wrote:
>> >
>> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
>> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
>> > re-dispatched using up-to-date affinity information.
>>
>> How will the scheduler know that the dispatch was dropped? Is the scheduler
>> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
>> on CPU1?
>
> The idea was that, if the dispatch is dropped, we'll see another
> ops.enqueue() for the task, so at least the task is not "lost" and the
> BPF scheduler gets another chance to decide what to do with it. In this
> case it'd be useful to set SCX_ENQ_REENQ (or a dedicated special flag) to
> indicate that the enqueue resulted from a dropped dispatch.
I think SCX_ENQ_REENQ is enough for now, we can always add a dedicated flag
if a need for it arises.
I still worry about the scenario you described. In particular, I think it can
lead to tasks being forgotten (i.e. not re-enqueued) after a failed dispatch.
  CPU0                                       CPU1
  ----                                       ----
  if (cpumask_test_cpu(cpu, p->cpus_ptr))
                                             task_rq_lock(p)
                                             dequeue_task_scx(p, ...)
                                               (remove p from internal queues)
                                             set_cpus_allowed_scx(p, new_mask)
                                             enqueue_task_scx(p, ...)
                                               (add p to internal queues)
                                             task_rq_unlock(p)
  (remove p from internal queues)
  scx_bpf_dsq_insert(p,
                     SCX_DSQ_LOCAL_ON | cpu, 0)
In this scenario, the ops.enqueue() which is supposed to notify the BPF
scheduler about the failed dispatch actually happens _before_ the actual
dispatch, so once the dispatch fails, the task won't be re-enqueued.
There are two problems here:
1. CPU0 makes a scheduling decision based on stale data and it isn't detected.
2. Even if it is detected and the dispatch aborted, the task won't be
re-enqueued.
The way we deal with the first problem in ghOSt (Google's equivalent of
sched_ext) is we expose the per-task sequence number to the BPF scheduler.
On the dispatch path, when the BPF scheduler has a candidate task,
it retrieves its seqnum, re-checks the task state to ensure that it's still
eligible for dispatch, and passes the seqnum to the kernel's dispatch helper
for verification. If the kernel detects that the seqnum has changed already,
it synchronously fails the dispatch attempt (dispatch always happens
synchronously in ghOSt). In sched_ext, we could do the synchronous check, but
we also need to do the same check later in finish_dispatch(), comparing
the current qseq against the qseq passed by the BPF scheduler.
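
In sched_ext terms, that flow could look roughly like the sketch below; this
is purely hypothetical -- neither scx_bpf_task_qseq() nor a *_checked()
insert variant exists today, and pick_candidate() stands in for
scheduler-specific selection:

void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
{
	struct task_struct *p = pick_candidate();	/* hypothetical */
	u64 qseq;

	if (!p)
		return;

	qseq = scx_bpf_task_qseq(p);			/* hypothetical kfunc */

	/* re-check eligibility while the snapshot is still fresh */
	if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
		return;

	/* the kernel would reject the insert if @qseq is no longer current */
	scx_bpf_dsq_insert_checked(p, SCX_DSQ_LOCAL_ON | cpu,
				   SCX_SLICE_DFL, qseq, 0);	/* hypothetical */
}
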
To fix the second problem, we would need to explicitly call ops.enqueue()
from finish_dispatch() and the other places where we abort dispatch if the
qseq is out of date.
Either that, or just add locking to the BPF scheduler to prevent the race from
happening in the first place.
Thanks,
Kuba
On Wed, Feb 04, 2026 at 04:58:47PM +0000, Kuba Piecuch wrote:
> On Wed Feb 4, 2026 at 3:36 PM UTC, Andrea Righi wrote:
> >> >
> >> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> >> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> >> > re-dispatched using up-to-date affinity information.
> >>
> >> How will the scheduler know that the dispatch was dropped? Is the scheduler
> >> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
> >> on CPU1?
> >
> > The idea was that, if the dispatch is dropped, we'll see another
> > ops.enqueue() for the task, so at least the task is not "lost" and the
> > BPF scheduler gets another chance to decide what to do with it. In this
> > case it'd be useful to set SCX_ENQ_REENQ (or a dedicated special flag) to
> > indicate that the enqueue resulted from a dropped dispatch.
>
> I think SCX_ENQ_REENQ is enough for now, we can always add a dedicated flag
> if a need for it arises.
>
> I still worry about the scenario you described. In particular, I think it can
> lead to tasks being forgotten (i.e. not re-enqueued) after a failed dispatch.
>
>   CPU0                                       CPU1
>   ----                                       ----
>   if (cpumask_test_cpu(cpu, p->cpus_ptr))
>                                              task_rq_lock(p)
>                                              dequeue_task_scx(p, ...)
>                                                (remove p from internal queues)
>                                              set_cpus_allowed_scx(p, new_mask)
>                                              enqueue_task_scx(p, ...)
>                                                (add p to internal queues)
>                                              task_rq_unlock(p)
>   (remove p from internal queues)
>   scx_bpf_dsq_insert(p,
>                      SCX_DSQ_LOCAL_ON | cpu, 0)
>
> In this scenario, the ops.enqueue() which is supposed to notify the BPF
> scheduler about the failed dispatch actually happens _before_ the actual
> dispatch, so once the dispatch fails, the task won't be re-enqueued.
>
> There are two problems here:
>
> 1. CPU0 makes a scheduling decision based on stale data and it isn't detected.
> 2. Even if it is detected and the dispatch aborted, the task won't be
>    re-enqueued.

Right. At this point I think we can just rely on the affinity validation
via task_can_run_on_remote_rq(), where p->cpus_ptr is always stable, and
just drop invalid dispatches.

And to prevent dropped tasks, I was wondering if we could just insert the
task into a per-rq fallback DSQ, that can be consumed from balance_scx() to
re-enqueue the task (setting SCX_ENQ_REENQ). This should solve the
re-enqueue problem avoiding the locking complexity of calling ops.enqueue()
directly from finish_dispatch().

Thoughts?

> The way we deal with the first problem in ghOSt (Google's equivalent of
> sched_ext) is we expose the per-task sequence number to the BPF scheduler.
> On the dispatch path, when the BPF scheduler has a candidate task,
> it retrieves its seqnum, re-checks the task state to ensure that it's still
> eligible for dispatch, and passes the seqnum to the kernel's dispatch helper
> for verification. If the kernel detects that the seqnum has changed already,
> it synchronously fails the dispatch attempt (dispatch always happens
> synchronously in ghOSt). In sched_ext, we could do the synchronous check, but
> we also need to do the same check later in finish_dispatch(), comparing
> the current qseq against the qseq passed by the BPF scheduler.
>
> To fix the second problem, we would need to explicitly call ops.enqueue()
> from finish_dispatch() and the other places where we abort dispatch if the
> qseq is out of date.
>
> Either that, or just add locking to the BPF scheduler to prevent the race from
> happening in the first place.

Thanks,
-Andrea
On Wed Feb 4, 2026 at 5:56 PM UTC, Andrea Righi wrote:
> Right. At this point I think we can just rely on the affinity validation
> via task_can_run_on_remote_rq(), where p->cpus_ptr is always stable, and
> just drop invalid dispatches.
>
> And to prevent dropped tasks, I was wondering if we could just insert the
> task into a per-rq fallback DSQ, that can be consumed from balance_scx() to
> re-enqueue the task (setting SCX_ENQ_REENQ). This should solve the
> re-enqueue problem avoiding the locking complexity of calling ops.enqueue()
> directly from finish_dispatch().
>
> Thoughts?

How would these fallback DSQs work?

1. Would inserting the task into the fallback DSQ trigger ops.dequeue(), so
   that we can later balance it with the re-enqueue?

2. Which rq's fallback DSQ will the task be inserted into? The one belonging
   to the CPU doing the dispatch?

3. Is the re-enqueue going to happen inside the same call to balance_one()
   that tried to dispatch the task?

I'm not opposed to the idea, I'm curious to see how it works in practice.

Thanks,
Kuba
On Thu, Feb 05, 2026 at 05:20:11PM +0000, Kuba Piecuch wrote:
> On Wed Feb 4, 2026 at 5:56 PM UTC, Andrea Righi wrote:
> > Right. At this point I think we can just rely on the affinity validation
> > via task_can_run_on_remote_rq(), where p->cpus_ptr is always stable, and
> > just drop invalid dispatches.
> >
> > And to prevent dropped tasks, I was wondering if we could just insert the
> > task into a per-rq fallback DSQ, that can be consumed from balance_scx() to
> > re-enqueue the task (setting SCX_ENQ_REENQ). This should solve the
> > re-enqueue problem avoiding the locking complexity of calling ops.enqueue()
> > directly from finish_dispatch().
> >
> > Thoughts?
>
> How would these fallback DSQs work?
>
> 1. Would inserting the task into the fallback DSQ trigger ops.dequeue(), so
>    that we can later balance it with the re-enqueue?

Yeah, but ... see below.

> 2. Which rq's fallback DSQ will the task be inserted into? The one belonging
>    to the CPU doing the dispatch?

I was thinking the task's rq.

> 3. Is the re-enqueue going to happen inside the same call to balance_one()
>    that tried to dispatch the task?

balance_one().

> I'm not opposed to the idea, I'm curious to see how it works in practice.

Thinking more about this, it's a bit problematic: when set_cpus_allowed_scx()
triggers dequeue+enqueue, we get another enqueue without the SCX_ENQ_REENQ
flag, and it's a bit tricky to manage that with the fallback DSQ.

So, I'm back to the drawing board, trying to explore the qseq approach...

-Andrea