SCX_ENQ_IMMED potentially leaving dispatched tasks lingering on local DSQs

Posted by Kuba Piecuch 1 month, 3 weeks ago

Hi folks,

I recently saw that scx_qmap got rid of the sched_switch tracepoint hook,
claiming that SCX_OPS_ALWAYS_ENQ_IMMED is sufficient to keep tasks from
lingering on local DSQs.

This prompted me to think about some possible edge cases, and I think we
can end up with lingering tasks on the local DSQ in the following scenario:

Initial conditions: rq->curr == rq->idle &&
                    rq->next_class == &idle_sched_class

1. We enter schedule() for whatever reason, e.g. BPF scheduler kick from
   another CPU.

2. In __pick_next_task(), all sched classes above SCX fail to pick a task.
   We still have rq->next_class == &idle_sched_class.

3. We enter do_pick_task_scx(). rq_modified_begin() does nothing because
   sched_class_above(rq->next_class, &ext_sched_class) is false.

4. ops.dispatch() dispatches two tasks. The first one goes to the local DSQ,
   and the second one goes to a remote CPU's local DSQ. The first task is
   dispatched without interference.

5. During dispatch of the second task, while the local CPU's rq lock is dropped
   during insertion into the remote CPU's local DSQ, an RT task wakes up on the
   local CPU. Since rq->next_class is still idle, wakeup_preempt() calls
   wakeup_preempt_idle() which calls resched_curr(rq). This effectively does
   nothing since need_resched is cleared in __schedule() after pick.
   rq->next_class is set to &rt_sched_class.

6. At the end of balance_one(), we don't trigger a reenqueue because the local
   DSQ has only one task.

7. do_pick_task_scx() notices rq_modified_above(rq, &ext_sched_class) and
   returns RETRY_TASK.

8. The RT task ends up being picked and runs. SCX is not notified of the switch
   because we're switching from the idle task to an RT task.

If my understanding is correct and I didn't miss anything important, then
at no point does SCX reenqueue the first task, even though it should.

This particular scenario may not apply to scx_qmap, but I think it proves that
it's possible to have dispatched tasks lingering on the local DSQ even with
SCX_OPS_ALWAYS_ENQ_IMMED.

I was thinking we could fix this by adding a nr_immed check right before
returning RETRY_TASK:

diff --git i/kernel/sched/ext.c w/kernel/sched/ext.c
index d66fea57ee69..480627fdc203 100644
--- i/kernel/sched/ext.c
+++ w/kernel/sched/ext.c
@@ -3079,8 +3079,11 @@ do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
         * If @force_scx is true, always try to pick a SCHED_EXT task,
         * regardless of any higher-priority sched classes activity.
         */
-       if (!force_scx && rq_modified_above(rq, &ext_sched_class))
+       if (!force_scx && rq_modified_above(rq, &ext_sched_class)) {
+               if (rq->scx.nr_immed)
+                       schedule_reenq_local(rq, 0);
                return RETRY_TASK;
+       }

        keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
        if (unlikely(keep_prev &&

...but I think this only fixes the case where the RT task wakes up on the CPU
that is doing the dispatch. The other case is one where the RT task wakes up
on the remote CPU (the one the second task was dispatched to) after insertion
of the second task, assuming the remote CPU is initially idle.

To fix both cases, one potential solution that comes to mind is bumping
rq->next_class to &ext_sched_class when inserting a task into rq->scx.local_dsq.
Perhaps we should call wakeup_preempt() in dispatch_to_local_dsq()?

Let me know what you think!

Thanks,
Kuba

Re: SCX_ENQ_IMMED potentially leaving dispatched tasks lingering on local DSQs

Posted by Tejun Heo 1 month, 3 weeks ago

Hello, Kuba.

On Wed, Apr 22, 2026 at 01:21:27PM +0000, Kuba Piecuch wrote:
> diff --git i/kernel/sched/ext.c w/kernel/sched/ext.c
> index d66fea57ee69..480627fdc203 100644
> --- i/kernel/sched/ext.c
> +++ w/kernel/sched/ext.c
> @@ -3079,8 +3079,11 @@ do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
>          * If @force_scx is true, always try to pick a SCHED_EXT task,
>          * regardless of any higher-priority sched classes activity.
>          */
> -       if (!force_scx && rq_modified_above(rq, &ext_sched_class))
> +       if (!force_scx && rq_modified_above(rq, &ext_sched_class)) {
> +               if (rq->scx.nr_immed)
> +                       schedule_reenq_local(rq, 0);
>                 return RETRY_TASK;
> +       }
> 
>         keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
>         if (unlikely(keep_prev &&

Ah, good catch.

> ...but I think this only fixes the case where the RT task wakes up on the CPU
> that is doing the dispatch. The other case is one where the RT task wakes up
> on the remote CPU (the one the second task was dispatched to) after insertion
> of the second task, assuming the remote CPU is initially idle.
> 
> To fix both cases, one potential solution that comes to mind is bumping
> rq->next_class to &ext_sched_class when inserting a task into rq->scx.local_dsq.
> Perhaps we should call wakeup_preempt() in dispatch_to_local_dsq()?

I think what's missing is wakeup_preempt() call in
move_remote_task_to_local_dsq(). This is SCX's version of move_queued_task()
and we're missing wakeup_preempt() call after activate_task().

Thanks.

-- 
tejun

Re: SCX_ENQ_IMMED potentially leaving dispatched tasks lingering on local DSQs

Posted by Kuba Piecuch 1 month, 3 weeks ago

Hi Tejun,

On Wed Apr 22, 2026 at 4:50 PM UTC, Tejun Heo wrote:
> Hello, Kuba.
>
> On Wed, Apr 22, 2026 at 01:21:27PM +0000, Kuba Piecuch wrote:
>> diff --git i/kernel/sched/ext.c w/kernel/sched/ext.c
>> index d66fea57ee69..480627fdc203 100644
>> --- i/kernel/sched/ext.c
>> +++ w/kernel/sched/ext.c
>> @@ -3079,8 +3079,11 @@ do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
>>          * If @force_scx is true, always try to pick a SCHED_EXT task,
>>          * regardless of any higher-priority sched classes activity.
>>          */
>> -       if (!force_scx && rq_modified_above(rq, &ext_sched_class))
>> +       if (!force_scx && rq_modified_above(rq, &ext_sched_class)) {
>> +               if (rq->scx.nr_immed)
>> +                       schedule_reenq_local(rq, 0);
>>                 return RETRY_TASK;
>> +       }
>> 
>>         keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
>>         if (unlikely(keep_prev &&
>
> Ah, good catch.
>
>> ...but I think this only fixes the case where the RT task wakes up on the CPU
>> that is doing the dispatch. The other case is one where the RT task wakes up
>> on the remote CPU (the one the second task was dispatched to) after insertion
>> of the second task, assuming the remote CPU is initially idle.
>> 
>> To fix both cases, one potential solution that comes to mind is bumping
>> rq->next_class to &ext_sched_class when inserting a task into rq->scx.local_dsq.
>> Perhaps we should call wakeup_preempt() in dispatch_to_local_dsq()?
>
> I think what's missing is wakeup_preempt() call in
> move_remote_task_to_local_dsq(). This is SCX's version of move_queued_task()
> and we're missing wakeup_preempt() call after activate_task().

I'm not sure if we should limit ourselves to just remote tasks.
If we call wakeup_preempt(rq, p, flags) when adding @p to @rq's local DSQ
regardless of whether @p/@rq is remote, then I think that should cover all
cases and the patch above wouldn't be needed.

Perhaps local_dsq_post_enq() would be a good place to invoke wakeup_preempt()?

Thanks,
Kuba

Re: SCX_ENQ_IMMED potentially leaving dispatched tasks lingering on local DSQs

Posted by Tejun Heo 1 month, 3 weeks ago

Hello,

On Thu, Apr 23, 2026 at 09:48:11AM +0000, Kuba Piecuch wrote:
> I'm not sure if we should limit ourselves to just remote tasks.
> If we call wakeup_preempt(rq, p, flags) when adding @p to @rq's local DSQ
> regardless of whether @p/@rq is remote, then I think that should cover all
> cases and the patch above wouldn't be needed.

Sorry, I wasn't clear. I meant adding wakeup_preempt() in
move_remote_task_to_local_dsq() *in addition to* your nr_immed/RETRY_TASK
patch. That combination covers both cases:

- Local (your stepwise scenario): the task stays on the dispatching CPU, no
  migration happens, so there's no insertion-side hook to arm. Your nr_immed
  check in do_pick_task_scx() catches it on the RETRY_TASK path.

- Remote: move_remote_task_to_local_dsq() is SCX's analog of
  move_queued_task() and is missing the wakeup_preempt() call that
  move_queued_task() does after activate_task(). Once added,
  dst_rq->next_class is bumped to &ext_sched_class, and any subsequent
  higher-class wakeup on dst_rq routes through wakeup_preempt_scx() which
  already has the nr_immed reenqueue.

Adding wakeup_preempt() to local_dsq_post_enq() for every insertion would
work too, but I'd rather keep it in sync with how core sched handles
move_queued_task() and only add it where it's actually needed.

Thanks.

--
tejun

Re: SCX_ENQ_IMMED potentially leaving dispatched tasks lingering on local DSQs

Posted by Kuba Piecuch 1 month, 3 weeks ago

On Thu Apr 23, 2026 at 4:53 PM UTC, Tejun Heo wrote:
> Hello,
>
> On Thu, Apr 23, 2026 at 09:48:11AM +0000, Kuba Piecuch wrote:
>> I'm not sure if we should limit ourselves to just remote tasks.
>> If we call wakeup_preempt(rq, p, flags) when adding @p to @rq's local DSQ
>> regardless of whether @p/@rq is remote, then I think that should cover all
>> cases and the patch above wouldn't be needed.
>
> Sorry, I wasn't clear. I meant adding wakeup_preempt() in
> move_remote_task_to_local_dsq() *in addition to* your nr_immed/RETRY_TASK
> patch. That combination covers both cases:
>
> - Local (your stepwise scenario): the task stays on the dispatching CPU, no
>   migration happens, so there's no insertion-side hook to arm. Your nr_immed
>   check in do_pick_task_scx() catches it on the RETRY_TASK path.
>
> - Remote: move_remote_task_to_local_dsq() is SCX's analog of
>   move_queued_task() and is missing the wakeup_preempt() call that
>   move_queued_task() does after activate_task(). Once added,
>   dst_rq->next_class is bumped to &ext_sched_class, and any subsequent
>   higher-class wakeup on dst_rq routes through wakeup_preempt_scx() which
>   already has the nr_immed reenqueue.

There's another case we need to handle: dispatching to a remote CPU's
local DSQ, but without migrating the task. In dispatch_to_local_dsq()
this corresponds to rq != src_rq && src_rq == dst_rq. The check for the local
case on the remote CPU doesn't cover it because an RT task can swoop in and
prevent pick_task_scx() from running (the assumption is that the remote CPU
is idle).
The check for the remote case doesn't cover it either because we do
dispatch_enqueue() directly from dispatch_to_local_dsq().

Then there's the same case, but instead of dispatching we're moving the task
from a user DSQ. If my understanding of the code is correct, the path is:

  scx_dsq_move(p, dsq_id);
    raw_spin_rq_unlock(this_rq);
    raw_spin_lock(src_rq);
    dst_dsq = find_dsq_for_dispatch(dsq_id);
    move_task_between_dsqs(p, src_dsq, dst_dsq);
      dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
      task_unlink_from_dsq(p, src_dsq);
      move_local_task_to_local_dsq(p, src_dsq, dst_rq);

I think these two cases show that we don't necessarily have to migrate a task
for wakeup_preempt() to be warranted.

So to cover all cases (that I'm aware of), we need four checks:

* wakeup_preempt() in dispatch_to_local_dsq() in the case of rq != src_rq &&
  src_rq == dst_rq

* wakeup_preempt() at the end of move_remote_task_to_local_dsq()

* wakeup_preempt() somewhere on the scx_dsq_move() path, but only if we're
  moving the task to the local DSQ of a remote CPU (?)

* nr_immed check before returning RETRY_TASK

>
> Adding wakeup_preempt() to local_dsq_post_enq() for every insertion would
> work too, but I'd rather keep it in sync with how core sched handles
> move_queued_task() and only add it where it's actually needed.

Does having four separate cases to handle, not all of which involve task
migration, change that calculus somewhat?
I believe all of these cases will be handled if we add wakeup_preempt()
to local_dsq_post_enq().

Thanks,
Kuba

Re: SCX_ENQ_IMMED potentially leaving dispatched tasks lingering on local DSQs

Posted by Tejun Heo 1 month, 3 weeks ago

Hello,

On Thu, Apr 23, 2026 at 07:12:18PM +0000, Kuba Piecuch wrote:
...
> I think these two cases show that we don't necessarily have to migrate a task
> for wakeup_preempt() to be warranted.

I see.

> So to cover all cases (that I'm aware of), we need four checks:
> 
> * wakeup_preempt() in dispatch_to_local_dsq() in the case of rq != src_rq &&
>   src_rq == dst_rq
> 
> * wakeup_preempt() at the end of move_remote_task_to_local_dsq()
> 
> * wakeup_preempt() somewhere on the scx_dsq_move() path, but only if we're
>   moving the task to the local DSQ of a remote CPU (?)
> 
> * nr_immed check before returning RETRY_TASK
> 
> > Adding wakeup_preempt() to local_dsq_post_enq() for every insertion would
> > work too, but I'd rather keep it in sync with how core sched handles
> > move_queued_task() and only add it where it's actually needed.
> 
> Does having four separate cases to handle, not all of which involve task
> migration, change that calculus somewhat?
> I believe all of these cases will be handled if we add wakeup_preempt()
> to local_dsq_post_enq().

If we call wakeup_preempt() unconditionally, we'd be calling
wakeup_preempt_scx() on every dispatch, which isn't too appealing. We can
add an ENQ flag to mark these sites specifically but I'm not sure whether
that's necessarily better. The invovled code paths are inherently subtle and
finicky, so might as well just add the calls where they're necessary. What
do you think?

Thanks.

-- 
tejun

Re: SCX_ENQ_IMMED potentially leaving dispatched tasks lingering on local DSQs

Posted by Kuba Piecuch 1 month, 3 weeks ago

On Thu Apr 23, 2026 at 7:29 PM UTC, Tejun Heo wrote:
> Hello,
>
> On Thu, Apr 23, 2026 at 07:12:18PM +0000, Kuba Piecuch wrote:
> ...
>> I think these two cases show that we don't necessarily have to migrate a task
>> for wakeup_preempt() to be warranted.
>
> I see.
>
>> So to cover all cases (that I'm aware of), we need four checks:
>> 
>> * wakeup_preempt() in dispatch_to_local_dsq() in the case of rq != src_rq &&
>>   src_rq == dst_rq
>> 
>> * wakeup_preempt() at the end of move_remote_task_to_local_dsq()
>> 
>> * wakeup_preempt() somewhere on the scx_dsq_move() path, but only if we're
>>   moving the task to the local DSQ of a remote CPU (?)
>> 
>> * nr_immed check before returning RETRY_TASK
>> 
>> > Adding wakeup_preempt() to local_dsq_post_enq() for every insertion would
>> > work too, but I'd rather keep it in sync with how core sched handles
>> > move_queued_task() and only add it where it's actually needed.
>> 
>> Does having four separate cases to handle, not all of which involve task
>> migration, change that calculus somewhat?
>> I believe all of these cases will be handled if we add wakeup_preempt()
>> to local_dsq_post_enq().
>
> If we call wakeup_preempt() unconditionally, we'd be calling
> wakeup_preempt_scx() on every dispatch, which isn't too appealing. We can
> add an ENQ flag to mark these sites specifically but I'm not sure whether
> that's necessarily better. The invovled code paths are inherently subtle and
> finicky, so might as well just add the calls where they're necessary. What
> do you think?

I don't like the idea of maintaining the separate checks, precisely because
of how subtle and finnicky they are. I worry that it will be difficult to keep
all cases covered as the codebase evolves.

As an optimization, we could skip wakeup_preempt() in local_dsq_post_enq() if
rq->next_class == &ext_sched_class. That should make the performance impact
negligible in the common case, WDYT?

Thanks,
Kuba

Re: SCX_ENQ_IMMED potentially leaving dispatched tasks lingering on local DSQs

Posted by Tejun Heo 1 month, 3 weeks ago

Hello,

On Thu, Apr 23, 2026 at 08:03:51PM +0000, Kuba Piecuch wrote:
...
> I don't like the idea of maintaining the separate checks, precisely because
> of how subtle and finnicky they are. I worry that it will be difficult to keep
> all cases covered as the codebase evolves.
> 
> As an optimization, we could skip wakeup_preempt() in local_dsq_post_enq() if
> rq->next_class == &ext_sched_class. That should make the performance impact
> negligible in the common case, WDYT?

Yeah, let's do that. Please send a patch and can you also cc PeterZ?

Thanks.

-- 
tejun