sched: Make proxy execution compatible with sched_ext

[RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext

Posted by Andrea Righi 2 weeks, 4 days ago

This series enables using proxy execution with sched_ext and is based on early
work by John Stultz [1].

Background
==========

Proxy execution (proxy-exec) lets a waiting task ("donor") donate its execution
context to a mutex owner, so the owner can run while the donor stays eligible on
the runqueue.

Currently, proxy execution and sched_ext are mutually exclusive at build time:
we can't enable CONFIG_SCHED_PROXY_EXEC=y and CONFIG_SCHED_CLASS_EXT=y in the
same kernel.

This restriction can be problematic for Linux distributions and for anyone who
wants to ship one kernel and choose features at runtime.

Why are they mutually exclusive?
================================

sched_ext schedulers drive dispatch through their own interfaces. A proxy-exec
handoff can run a task that the BPF scheduler never dispatched through that
path. sched_ext callbacks then observe a "current" task that does not match what
the BPF side considers running, so kfuncs and helper state can see an
inconsistent view of the executing task.

sched_ext also tracks runnable work through Dispatch Queues (DSQs) and
BPF-chosen dispatch rules, while the core scheduler still maintains classic per-CPU
runqueues and pick paths. A proxy handoff can therefore switch the CPU to a task
that the BPF scheduler never inserted or ordered through its DSQ interface.

DSQ state, vtime, and "who is running" bookkeeping inside the BPF program can
then disagree with what the core actually executes, so helpers and kfuncs that
assume their dispatched task is current may observe stale or inconsistent state.

Default behaviour when sched_ext is in use
==========================================

The series relaxes the Kconfig coupling but keeps proxy-exec context donation
off by default whenever a sched_ext scheduler is loaded. In that mode,
mutex-blocked tasks are forced to block instead of staying as donors, the pick
path skips proxy selection, and leftover handoff state is cleared so mutex retry
paths do not trip blocked_on consistency checks.

Users who accept the semantic mismatch for their BPF scheduler can opt in at
boot:

 sched_proxy_exec_scx=0|1   (default 0)

Setting 1 allows donor->owner context switches under sched_ext as well.

This enables Linux distributions to set CONFIG_SCHED_PROXY_EXEC and
CONFIG_SCHED_CLASS_EXT together and ship kernels capable of supporting both
features.

Then users can decide, via sched_proxy_exec and sched_proxy_exec_scx, whether to
enable proxy-exec alongside sched_ext, use sched_ext without proxy-exec, or
disable proxy-exec entirely.
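
As a rough illustration of how the two boot options combine, here is a small
userspace C model of the decision described above. It is a sketch, not the code
from the series: the struct, its field names and donation_allowed() are
assumptions made for this example, and only the option names and the default of
0 for sched_proxy_exec_scx come from the text.

```c
#include <stdbool.h>

/*
 * Userspace model of the policy described above; not kernel code.
 * The struct and function names are illustrative assumptions.
 */
struct proxy_policy {
	bool sched_proxy_exec;     /* models the sched_proxy_exec boot option */
	bool sched_proxy_exec_scx; /* models sched_proxy_exec_scx (default 0) */
};

/* May a mutex-blocked task stay on the runqueue as a donor? */
bool donation_allowed(const struct proxy_policy *pol, bool scx_loaded)
{
	if (!pol->sched_proxy_exec)
		return false;                     /* proxy-exec disabled entirely */
	if (scx_loaded)
		return pol->sched_proxy_exec_scx; /* explicit opt-in under sched_ext */
	return true;                              /* no BPF scheduler loaded */
}
```

With sched_proxy_exec enabled and sched_proxy_exec_scx left at 0, this model
allows donation until a sched_ext scheduler is loaded and refuses it afterwards.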

References
==========

[1] https://lore.kernel.org/all/20251206001451.1418225-1-jstultz@google.com

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-proxy-exec

Andrea Righi (8):
      sched/core: Skip migration disabled tasks in proxy execution
      sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors
      sched_ext: Fix TOCTOU race in consume_remote_task()
      sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors
      sched_ext: Save/restore kf_tasks[] when task ops nest
      sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK
      sched/core: Disable proxy-exec context switch under sched_ext by default
      sched: Allow enabling proxy exec with sched_ext

John Stultz (2):
      sched/ext: Split curr|donor references properly
      sched/ext: Avoid migrating blocked tasks with proxy execution

 Documentation/admin-guide/kernel-parameters.txt |   6 ++
 include/linux/sched/ext.h                       |   9 ++
 init/Kconfig                                    |   2 -
 kernel/sched/core.c                             |  78 ++++++++++++--
 kernel/sched/ext.c                              | 138 ++++++++++++++++++------
 5 files changed, 193 insertions(+), 40 deletions(-)

Re: [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext

Posted by Tejun Heo 2 weeks, 2 days ago

Hello,

I'm a bit worried this is more invasive than what it buys. Even with
the full series, the cross-CPU gap Prateek raised stays open -
find_proxy_task() doesn't go through put_prev_set_next_task(), so owner
runs without ops.running(owner). Closing that seems to need yet another
protocol on top, either synthetic running/stopping events or scx core
taking over dispatch_dequeue for substitutions. The BPF scheduler ends
up dispatching tasks it didn't pick and observing callbacks for tasks
it didn't enqueue, which feels too magical and error-prone.

Maybe worth considering an alternative where, when scx is loaded, we
just turn proxy-exec off entirely and expose blocked_on to the BPF
scheduler. Schedulers that want PI can implement it themselves on top
of the relationship; ones that don't pay nothing.

scx_enable could flip the proxy_exec static branch off, after which the
existing gates in __schedule keep blocked tasks off the runqueue and
skip find_proxy_task on their own. The remaining concern is in-flight
donors at the moment of the flip - the existing scx_bypass walk already
visits every rq's runnable list during enable, and could force-block
any task it sees with blocked_on set. Mutex unlock would re-wake them
through wake_q normally after that. blocked_on itself is set and
cleared in mutex.c regardless of proxy_exec, so the signal we'd want
to surface is already there.

For the BPF side, the natural shape seems to be tagging the existing
ops.quiescent and ops.runnable callbacks with a bit indicating "this
sleep/wake was a mutex transition," plus a small kfunc that returns
the owner of the mutex p is blocked on. A scheduler that wants PI then
records the owner in its own task storage on the quiescent side, boosts
it via the existing vtime / slice / dsq_move / kick primitives, and
drops the boost when the runnable side fires. No new dispatch protocol,
the BPF scheduler stays in charge of who runs.
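
To make the shape concrete, here is a rough userspace C model of that flow. It
is only a sketch: none of these names exist in sched_ext today. The flag bit
PI_MUTEX_TRANSITION, pi_blocked_on_owner() (standing in for the proposed kfunc)
and the choice to boost by lending the waiter's slice are all assumptions made
for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define PI_MUTEX_TRANSITION (1u << 0) /* hypothetical "mutex transition" bit */

struct pi_task {
	uint64_t slice;              /* remaining time slice */
	struct pi_task *mutex_owner; /* what the proposed kfunc would report */
	struct pi_task *boosted;     /* per-task storage: owner we boosted */
	uint64_t lent;               /* per-task storage: slice we lent */
};

/* Stand-in for the proposed kfunc returning the owner of the mutex. */
struct pi_task *pi_blocked_on_owner(const struct pi_task *p)
{
	return p->mutex_owner;
}

/* Models ops.quiescent(): on a mutex sleep, record and boost the owner. */
void pi_quiescent(struct pi_task *p, unsigned int flags)
{
	struct pi_task *owner;

	if (!(flags & PI_MUTEX_TRANSITION))
		return;
	owner = pi_blocked_on_owner(p);
	if (!owner)
		return;
	p->boosted = owner;
	p->lent = p->slice;
	owner->slice += p->lent;
}

/* Models ops.runnable(): on the matching wakeup, drop the boost. */
void pi_runnable(struct pi_task *p, unsigned int flags)
{
	struct pi_task *owner = p->boosted;

	if (!(flags & PI_MUTEX_TRANSITION) || !owner)
		return;
	/* Never take back more than the owner still has left. */
	owner->slice -= owner->slice < p->lent ? owner->slice : p->lent;
	p->boosted = NULL;
	p->lent = 0;
}
```

A real scheduler would apply the boost through the vtime / slice / dsq_move /
kick primitives rather than by editing a field directly.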

Does that direction seem reasonable, or am I missing something that
makes it not work?

Thanks.
--
tejun

Re: [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext

Posted by Andrea Righi 2 weeks ago

Hi Tejun,

On Fri, May 08, 2026 at 03:00:59PM -1000, Tejun Heo wrote:
> Hello,
> 
> I'm a bit worried this is more invasive than what it buys. Even with
> the full series, the cross-CPU gap Prateek raised stays open -
> find_proxy_task() doesn't go through put_prev_set_next_task(), so owner
> runs without ops.running(owner). Closing that seems to need yet another
> protocol on top, either synthetic running/stopping events or scx core
> taking over dispatch_dequeue for substitutions. The BPF scheduler ends
> up dispatching tasks it didn't pick and observing callbacks for tasks
> it didn't enqueue, which feels too magical and error-prone.
> 
> Maybe worth considering an alternative where, when scx is loaded, we
> just turn proxy-exec off entirely and expose blocked_on to the BPF
> scheduler. Schedulers that want PI can implement it themselves on top
> of the relationship; ones that don't pay nothing.
> 
> scx_enable could flip the proxy_exec static branch off, after which the
> existing gates in __schedule keep blocked tasks off the runqueue and
> skip find_proxy_task on their own. The remaining concern is in-flight
> donors at the moment of the flip - the existing scx_bypass walk already
> visits every rq's runnable list during enable, and could force-block
> any task it sees with blocked_on set. Mutex unlock would re-wake them
> through wake_q normally after that. blocked_on itself is set and
> cleared in mutex.c regardless of proxy_exec, so the signal we'd want
> to surface is already there.
> 
> For the BPF side, the natural shape seems to be tagging the existing
> ops.quiescent and ops.runnable callbacks with a bit indicating "this
> sleep/wake was a mutex transition," plus a small kfunc that returns
> the owner of the mutex p is blocked on. A scheduler that wants PI then
> records the owner in its own task storage on the quiescent side, boosts
> it via the existing vtime / slice / dsq_move / kick primitives, and
> drops the boost when the runnable side fires. No new dispatch protocol,
> the BPF scheduler stays in charge of who runs.
> 
> Does that direction seem reasonable, or am I missing something that
> makes it not work?

Thanks for looking at this and laying this out. Let me try to address your
concerns and the alternative approach you're proposing.

On the cross-CPU gap Prateek raised: you're right that find_proxy_task()
substitutes the owner without going through put_prev_set_next_task(), so neither
ops.stopping(donor) nor ops.running(owner) fires for that substitution. But I'd
argue this is less critical than it looks:

 1) For the ops.running(owner) side specifically, I don't think skipping it is
    actually a correctness problem. With proxy-exec, the owner is not really
    "the task that is running" in any scheduling sense: what runs is the donor,
    the donor's slice is what gets consumed, and the donor is what BPF
    dispatched. The owner just happens to be the execution context the kernel
    uses to make the critical section progress; it is more like a function call
    inside the donor's quantum than a real task switch. If we frame it that way,
    ops.running(donor) + ops.stopping(donor) is the pairing the BPF scheduler
    should observe.

 2) The cases where the owner is on a different CPU don't go through the
    substitution path at all: find_proxy_task() either migrates the donor over
    (proxy_migrate_task()) or proxy_force_returns() it. In both cases the
    receiving CPU's __schedule() does pick again, so ops.running() fires
    normally on that CPU for whatever gets picked next. The "ghost owner runs
    without ops.running()" only happens when the chain resolves locally, i.e.,
    when the owner was already on the same rq's runnable list. That should
    narrow the surface considerably.

About dispatching tasks BPF didn't pick / observing callbacks for tasks BPF
didn't enqueue: point 1 above is essentially an answer to that. If we treat the
donor as the running task and the owner substitution as an internal kernel
detail (a "function call" in the donor's context), then BPF only ever sees
callbacks for tasks it actually dispatched.

That said, your alternative proposal is also appealing in that it gets sched_ext
out of the proxy-exec dispatch protocol entirely, which is essentially the part
that genuinely is invasive. But I think there are some gaps before the "BPF
rolls its own proxy-exec" model is workable.

Let's say we expose blocked_on (and a kfunc returning the mutex owner) via
tagged ops.quiescent/runnable(). The BPF scheduler now wants to boost the owner.
What's the actual way to do so? Some mechanisms that we have right now:
 - slice extension: scx_bpf_task_set_slice() works in place, but it affects
   only a running owner.
 - dsq_vtime: scx_bpf_task_set_dsq_vtime() updates the value, but for a task
   already enqueued in a PRIQ DSQ the position in the rbtree doesn't move, so
   this doesn't actually boost an already-queued owner.
 - DSQ move: scx_bpf_dsq_move() requires an iterator and the task to have been
   queued before iteration started. We don't have a kfunc today that takes a
   task pointer and atomically yanks it from wherever it is to a higher-priority
   DSQ. We also have no API exposing which DSQ a task is currently sitting in.
 - scx_bpf_dsq_insert(SCX_DSQ_LOCAL) + SCX_ENQ_HEAD|SCX_ENQ_PREEMPT: it probably
   works to run the owner immediately on its CPU, if we have a way to
   re-enqueue it.
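
To illustrate the dsq_vtime point, here is a toy userspace C model of a
priority-ordered queue. It is a sketch under stated assumptions: a sorted array
stands in for the rbtree, and every toy_priq_* name is made up for this example.
The property it shows is that ordering is decided at insert time, so changing
the key of an already-queued task changes nothing until the task is removed and
inserted again.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PRIQ_CAP 8

struct toy_priq_task {
	int id;
	uint64_t vtime; /* ordering key, lower runs first */
};

/* Tasks stay in the order chosen at insert time, like linked rbtree nodes. */
struct toy_priq {
	struct toy_priq_task *slot[TOY_PRIQ_CAP];
	size_t nr;
};

/* Place @t according to its vtime at this moment. */
int toy_priq_insert(struct toy_priq *q, struct toy_priq_task *t)
{
	size_t i, pos = 0;

	if (q->nr == TOY_PRIQ_CAP)
		return -1;
	while (pos < q->nr && q->slot[pos]->vtime <= t->vtime)
		pos++;
	for (i = q->nr; i > pos; i--)
		q->slot[i] = q->slot[i - 1];
	q->slot[pos] = t;
	q->nr++;
	return 0;
}

/* Remove @t from the queue; returns -1 if it is not queued. */
int toy_priq_remove(struct toy_priq *q, struct toy_priq_task *t)
{
	size_t i, j;

	for (i = 0; i < q->nr; i++) {
		if (q->slot[i] != t)
			continue;
		for (j = i + 1; j < q->nr; j++)
			q->slot[j - 1] = q->slot[j];
		q->nr--;
		return 0;
	}
	return -1;
}

/* The task that would be picked next, or NULL if the queue is empty. */
struct toy_priq_task *toy_priq_first(const struct toy_priq *q)
{
	return q->nr ? q->slot[0] : NULL;
}
```

Lowering the owner's vtime while it is queued leaves toy_priq_first()
unchanged in this model; only a remove followed by a fresh insert moves the
owner to the front.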

So, to make the BPF-side proxy-exec model real, I think we'd need at least:

 1) A kfunc that returns the DSQ id a task is currently enqueued on (or
    NULL/SCX_DSQ_INVALID if running), so the BPF scheduler can locate the owner.

 2) A kfunc that removes a task by pointer from its current DSQ and triggers a
    re-enqueue (or inserts the task into another DSQ).

Without these kfuncs a BPF scheduler that wants to support proxy-exec has no
concrete way to actually boost the owner.

If we add those primitives, the alternative seems reasonable: scx disables
proxy-exec, the bypass-style walk you described handles in-flight donors at flip
time, and proxy-exec with sched_ext becomes a BPF-side policy. I'm willing to
experiment in that direction if we think the primitives above are acceptable to
add.

Thanks,
-Andrea

Re: [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext

Posted by Tejun Heo 2 weeks ago

Hello,

I'll think more on enabling proxy execution as-is for sched_ext. Response on
scx_bpf_dsq_move():

On Sun, May 10, 2026 at 05:06:41PM +0200, Andrea Righi wrote:
...
> Let's say we expose blocked_on (and a kfunc returning the mutex owner) via
> tagged ops.quiescent/runnable(). The BPF scheduler now wants to boost the owner.
> What's the actual way to do so? Some mechanisms that we have right now:
>  - slice extension: scx_bpf_task_set_slice() works in place, but it affects
>    only a running owner,
>  - dsq_vtime: scx_bpf_task_set_dsq_vtime() updates the value, but for a task
>    already enqueued in a PRIQ DSQ the position in the rbtree doesn't move, so
>    this doesn't actually boost an already-queued owner.
>  - DSQ move: scx_bpf_dsq_move() requires an iterator and the task to have been
>    queued before iteration started. We don't have a kfunc today that takes a
>    task pointer and atomically yanks it from wherever it is to a higher-priority
>    DSQ. We also have no API exposing which DSQ a task is currently sitting in.

Assuming ->blocked_on() is triggered without rq lock held (if not, we just
need to tell scx_bpf_dsq_move() that it can do lock dancing in this context
too), we should already be able to move the task directly:

p->scx.dsq->id should already be accessible through BPF_CORE_READ(). Maybe
we can make it a bit nicer.

scx_bpf_dsq_move() doesn't actually need the task to come from iteration.
It's a bit odd but we're overloading the iterator for two purposes -
iteration and transaction scope definition. If a task is dequeued and
reenqueued after iteration is opened, scx_bpf_dsq_move() ignores the move as
the visit is considered stale. scx_bpf_dsq_move() only depends on this part.
The following is an excerpt from the function comment:

 * For the transfer to be successful, @p must still be on the DSQ and have been
 * queued before the DSQ iteration started. This function doesn't care whether
 * @p was obtained from the DSQ iteration. @p just has to be on the DSQ and have
 * been queued before the iteration started.
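
The rule in that comment can be modelled in a few lines of userspace C. This is
a sketch under assumptions, not the kernel implementation: the toy_seq_* names
are invented here, and the per-DSQ enqueue counter with a wrap-safe comparison
is one plausible way to realise "queued before the iteration started".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_seq_dsq {
	uint32_t seq; /* bumped on every enqueue */
};

struct toy_seq_task {
	struct toy_seq_dsq *dsq; /* DSQ the task sits on, NULL if none */
	uint32_t dsq_seq;        /* value of dsq->seq when it was enqueued */
};

struct toy_seq_iter {
	struct toy_seq_dsq *dsq;
	uint32_t seq_at_open; /* snapshot that defines the transaction scope */
};

void toy_seq_enqueue(struct toy_seq_dsq *q, struct toy_seq_task *p)
{
	p->dsq = q;
	p->dsq_seq = ++q->seq;
}

void toy_seq_dequeue(struct toy_seq_task *p)
{
	p->dsq = NULL;
}

/* Opening the iterator only records the DSQ and its current sequence. */
void toy_seq_iter_open(struct toy_seq_iter *it, struct toy_seq_dsq *q)
{
	it->dsq = q;
	it->seq_at_open = q->seq;
}

/* True if @p is still on the DSQ and was queued before @it was opened. */
bool toy_seq_move_allowed(const struct toy_seq_iter *it,
			  const struct toy_seq_task *p)
{
	if (p->dsq != it->dsq)
		return false;
	/* Wrap-safe test for "enqueued after the iterator was opened". */
	if ((int32_t)(it->seq_at_open - p->dsq_seq) < 0)
		return false;
	return true;
}
```

In this model a task that is dequeued and re-enqueued after the iterator was
opened is rejected as stale, while the same task passes again once a new
iterator is opened.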

So, for example, ->blocked_on() can do:

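  void my_sched_blocked_on(struct task_struct *p, struct task_struct *blocker)
  {
          /* ID of the DSQ @p currently sits on */
          u64 dsq_id = BPF_CORE_READ(p, scx.dsq, id);
          struct bpf_iter_scx_dsq it;

          /*
           * Opening the iterator only defines the transaction scope; no
           * iteration takes place. bpf_iter_scx_dsq_new() returns 0 on success.
           */
          if (!bpf_iter_scx_dsq_new(&it, dsq_id, 0)) {
                  /* WHATEVER_CPU_WE_PICK is a placeholder for the target CPU */
                  scx_bpf_dsq_move(&it, p, SCX_DSQ_LOCAL_ON | WHATEVER_CPU_WE_PICK,
                                   SCX_ENQ_PREEMPT);
                  bpf_iter_scx_dsq_destroy(&it);
          }
  }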

This is not the prettiest but the above should do all that's needed and
nothing else. It just looks up the dsq and remembers the insert sequence. No
actual iteration happens. Again, it'd be trivial to add BPF helpers or extra
kfuncs to make this nicer.

Thanks.

-- 
tejun