[PATCH v2] sched_ext: Document task ownership state machine
Posted by Andrea Righi 1 month ago
The task ownership state machine in sched_ext is quite hard to follow
from the code alone. The interaction of ownership states, memory
ordering rules and cross-CPU "lock dancing" makes the overall model
subtle.

Extend the documentation next to scx_ops_state to provide a more
structured and self-contained description of the state transitions and
their synchronization rules.

The new reference should make the code easier to reason about and
maintain, and should help future contributors understand the overall
task-ownership workflow.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
Changes in v2:
 - Remove DISPATCHING -> QUEUED transition (Tejun)
 - v1: https://lore.kernel.org/all/20260304153343.340285-1-arighi@nvidia.com

 kernel/sched/ext_internal.h | 114 +++++++++++++++++++++++++++++++-----
 1 file changed, 98 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index bd26811fea99d..417d3c6f02fe3 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1042,26 +1042,108 @@ static const char *scx_enable_state_str[] = {
 };
 
 /*
- * sched_ext_entity->ops_state
+ * Task Ownership State Machine (sched_ext_entity->ops_state)
  *
- * Used to track the task ownership between the SCX core and the BPF scheduler.
- * State transitions look as follows:
+ * The sched_ext core uses this state machine to track task ownership
+ * between the SCX core and the BPF scheduler. This allows the BPF
+ * scheduler to dispatch tasks without strict ordering requirements, while
+ * the SCX core safely rejects invalid dispatches.
  *
- * NONE -> QUEUEING -> QUEUED -> DISPATCHING
- *   ^              |                 |
- *   |              v                 v
- *   \-------------------------------/
+ * State Transitions
  *
- * QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call
- * sites for explanations on the conditions being waited upon and why they are
- * safe. Transitions out of them into NONE or QUEUED must store_release and the
- * waiters should load_acquire.
+ *       .------------> NONE (owned by SCX core)
+ *       |               |           ^
+ *       |       enqueue |           | direct dispatch
+ *       |               v           |
+ *       |           QUEUEING -------'
+ *       |               |
+ *       |       enqueue |
+ *       |     completes |
+ *       |               v
+ *       |            QUEUED (owned by BPF scheduler)
+ *       |               |
+ *       |      dispatch |
+ *       |               |
+ *       |               v
+ *       |          DISPATCHING
+ *       |               |
+ *       |      dispatch |
+ *       |     completes |
+ *       `---------------'
  *
- * Tracking scx_ops_state enables sched_ext core to reliably determine whether
- * any given task can be dispatched by the BPF scheduler at all times and thus
- * relaxes the requirements on the BPF scheduler. This allows the BPF scheduler
- * to try to dispatch any task anytime regardless of its state as the SCX core
- * can safely reject invalid dispatches.
+ * State Descriptions
+ *
+ * - %SCX_OPSS_NONE:
+ *     Task is owned by the SCX core. It's either on a run queue, running,
+ *     or being manipulated by the core scheduler. The BPF scheduler has no
+ *     claim on this task.
+ *
+ * - %SCX_OPSS_QUEUEING:
+ *     Transitional state while transferring a task from the SCX core to
+ *     the BPF scheduler. The task's rq lock is held during this state.
+ *     Since QUEUEING is both entered and exited under the rq lock, dequeue
+ *     can never observe this state (it would be a BUG). When finishing a
+ *     dispatch, if the task is still in %SCX_OPSS_QUEUEING the completion
+ *     path busy-waits for it to leave this state (via wait_ops_state())
+ *     before retrying.
+ *
+ * - %SCX_OPSS_QUEUED:
+ *     Task is owned by the BPF scheduler. It's on a DSQ (dispatch queue)
+ *     and the BPF scheduler is responsible for dispatching it. A QSEQ
+ *     (queue sequence number) is embedded in this state to detect
+ *     dispatch/dequeue races: if a task is dequeued and re-enqueued, the
+ *     QSEQ changes and any in-flight dispatch operations targeting the old
+ *     QSEQ are safely ignored.
+ *
+ * - %SCX_OPSS_DISPATCHING:
+ *     Transitional state while transferring a task from the BPF scheduler
+ *     back to the SCX core. This state indicates the BPF scheduler has
+ *     selected the task for execution. When dequeue needs to take the task
+ *     off a DSQ and it is still in %SCX_OPSS_DISPATCHING, the dequeue path
+ *     busy-waits for it to leave this state (via wait_ops_state()) before
+ *     proceeding. Exits to %SCX_OPSS_NONE when dispatch completes.
+ *
+ * Memory Ordering
+ *
+ * Transitions out of %SCX_OPSS_QUEUEING and %SCX_OPSS_DISPATCHING into
+ * %SCX_OPSS_NONE or %SCX_OPSS_QUEUED must use atomic_long_set_release()
+ * and waiters must use atomic_long_read_acquire(). This ensures proper
+ * synchronization between concurrent operations.
+ *
+ * Cross-CPU Task Migration
+ *
+ * When moving a task in the %SCX_OPSS_DISPATCHING state, we can't simply
+ * grab the target CPU's rq lock because a concurrent dequeue might be
+ * waiting on %SCX_OPSS_DISPATCHING while holding the source rq lock
+ * (deadlock).
+ *
+ * The sched_ext core uses a "lock dancing" protocol coordinated by
+ * p->scx.holding_cpu. When moving a task to a different rq:
+ *
+ *   1. Verify task can be moved (CPU affinity, migration_disabled, etc.)
+ *   2. Set p->scx.holding_cpu to the current CPU
+ *   3. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
+ *      is set, so clearing DISPATCHING first prevents the circular wait
+ *      (safe to lock the rq we need)
+ *   4. Unlock the current CPU's rq
+ *   5. Lock src_rq (where the task currently lives)
+ *   6. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the
+ *      race (dequeue clears holding_cpu to -1 when it takes the task), in
+ *      this case migration is aborted
+ *   7. If src_rq == dst_rq: clear holding_cpu and enqueue directly
+ *      into dst_rq's local DSQ (no lock swap needed)
+ *   8. Otherwise: call move_remote_task_to_local_dsq(), which releases
+ *      src_rq, locks dst_rq, and performs the deactivate/activate
+ *      migration cycle (dst_rq is held on return)
+ *   9. Unlock dst_rq and re-lock the current CPU's rq to restore
+ *      the lock state expected by the caller
+ *
+ * If any verification fails, abort the migration.
+ *
+ * This state tracking allows the BPF scheduler to try to dispatch any task
+ * at any time regardless of its state. The SCX core can safely
+ * reject/ignore invalid dispatches, simplifying the BPF scheduler
+ * implementation.
  */
 enum scx_ops_state {
 	SCX_OPSS_NONE,		/* owned by the SCX core */
-- 
2.53.0
Re: [PATCH v2] sched_ext: Document task ownership state machine
Posted by Kuba Piecuch 2 weeks, 6 days ago
Hi Andrea,

Sorry for the late reply; I'm catching up with mail from the past ~month.

On Thu Mar 5, 2026 at 6:29 AM UTC, Andrea Righi wrote:
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index bd26811fea99d..417d3c6f02fe3 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -1042,26 +1042,108 @@ static const char *scx_enable_state_str[] = {
>  };
>  
>  /*
> - * sched_ext_entity->ops_state
> + * Task Ownership State Machine (sched_ext_entity->ops_state)
>   *
> - * Used to track the task ownership between the SCX core and the BPF scheduler.
> - * State transitions look as follows:
> + * The sched_ext core uses this state machine to track task ownership
> + * between the SCX core and the BPF scheduler. This allows the BPF
> + * scheduler to dispatch tasks without strict ordering requirements, while
> + * the SCX core safely rejects invalid dispatches.
>   *
> - * NONE -> QUEUEING -> QUEUED -> DISPATCHING
> - *   ^              |                 |
> - *   |              v                 v
> - *   \-------------------------------/
> + * State Transitions
>   *
> - * QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call
> - * sites for explanations on the conditions being waited upon and why they are
> - * safe. Transitions out of them into NONE or QUEUED must store_release and the
> - * waiters should load_acquire.
> + *       .------------> NONE (owned by SCX core)
> + *       |               |           ^
> + *       |       enqueue |           | direct dispatch
> + *       |               v           |
> + *       |           QUEUEING -------'
> + *       |               |
> + *       |       enqueue |
> + *       |     completes |
> + *       |               v
> + *       |            QUEUED (owned by BPF scheduler)
> + *       |               |
> + *       |      dispatch |
> + *       |               |
> + *       |               v
> + *       |          DISPATCHING
> + *       |               |
> + *       |      dispatch |
> + *       |     completes |
> + *       `---------------'

We can also go directly from QUEUED to NONE when a task is dequeued for an
attribute change or picked by core-sched.

> + * State Descriptions
> + *
> + * - %SCX_OPSS_NONE:
> + *     Task is owned by the SCX core. It's either on a run queue, running,
> + *     or being manipulated by the core scheduler. The BPF scheduler has no
> + *     claim on this task.

A blocked task's ops_state is also NONE. Or are we assuming here that the task
is on_rq?
Also, a task waiting to run on a built-in DSQ is in NONE state as well.

> + *
> + * - %SCX_OPSS_QUEUEING:
> + *     Transitional state while transferring a task from the SCX core to
> + *     the BPF scheduler. The task's rq lock is held during this state.
> + *     Since QUEUEING is both entered and exited under the rq lock, dequeue
> + *     can never observe this state (it would be a BUG). When finishing a
> + *     dispatch, if the task is still in %SCX_OPSS_QUEUEING the completion
> + *     path busy-waits for it to leave this state (via wait_ops_state())
> + *     before retrying.
> + *
> + * - %SCX_OPSS_QUEUED:
> + *     Task is owned by the BPF scheduler. It's on a DSQ (dispatch queue)
> + *     and the BPF scheduler is responsible for dispatching it. A QSEQ

The task doesn't have to be on a DSQ, it can be queued on some BPF data
structure instead. Even if it is on a DSQ, its state depends on whether it's
on a user DSQ (QUEUED) or a built-in DSQ, e.g. local (NONE).

This prompted me to have a look at the logic around SCX_OPSS_QUEUED, and I
can't convince myself that it's correct in the case of direct dispatches to
non-builtin DSQs.

The only place where ops_state is set to QUEUED is at the end of
do_enqueue_task(). Notably, this assignment is skipped in the case of direct
dispatch.

direct_dispatch() will then call dispatch_enqueue() with SCX_ENQ_CLEAR_OPSS,
causing ops_state to be reset to NONE. We end up in a state where the task
is enqueued on a user DSQ, its ops_state is NONE and p->scx.flags has
SCX_TASK_IN_CUSTODY, which doesn't look like a consistent state to me.

Am I missing something here?

> + *     (queue sequence number) is embedded in this state to detect
> + *     dispatch/dequeue races: if a task is dequeued and re-enqueued, the
> + *     QSEQ changes and any in-flight dispatch operations targeting the old
> + *     QSEQ are safely ignored.

Technically speaking, the QSEQ is also embedded in QUEUEING, where it serves
the same purpose.

> + *
> + * - %SCX_OPSS_DISPATCHING:
> + *     Transitional state while transferring a task from the BPF scheduler
> + *     back to the SCX core. This state indicates the BPF scheduler has
> + *     selected the task for execution. When dequeue needs to take the task

This description only applies to the case of a task being dispatched from
ops.dispatch().

There are cases where a task is transferred from the BPF scheduler to SCX core
without going through DISPATCHING, e.g. dequeue for attribute change or
core-sched pick.

This state indicates the BPF scheduler has selected the task for execution,
but the converse doesn't always hold: when the BPF scheduler selects a task
for execution via direct dispatch, at no point does the task enter the
DISPATCHING state.

> + *     off a DSQ and it is still in %SCX_OPSS_DISPATCHING, the dequeue path
> + *     busy-waits for it to leave this state (via wait_ops_state()) before
> + *     proceeding. Exits to %SCX_OPSS_NONE when dispatch completes.
> + *
> + * Memory Ordering
> + *
> + * Transitions out of %SCX_OPSS_QUEUEING and %SCX_OPSS_DISPATCHING into
> + * %SCX_OPSS_NONE or %SCX_OPSS_QUEUED must use atomic_long_set_release()
> + * and waiters must use atomic_long_read_acquire(). This ensures proper
> + * synchronization between concurrent operations.

The transition from QUEUED to NONE in ops_dequeue() isn't covered here.
It uses atomic_long_try_cmpxchg(), which implies full ordering.

> + *
> + * Cross-CPU Task Migration
> + *
> + * When moving a task in the %SCX_OPSS_DISPATCHING state, we can't simply
> + * grab the target CPU's rq lock because a concurrent dequeue might be
> + * waiting on %SCX_OPSS_DISPATCHING while holding the source rq lock
> + * (deadlock).
> + *
> + * The sched_ext core uses a "lock dancing" protocol coordinated by
> + * p->scx.holding_cpu. When moving a task to a different rq:
> + *
> + *   1. Verify task can be moved (CPU affinity, migration_disabled, etc.)
> + *   2. Set p->scx.holding_cpu to the current CPU
> + *   3. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
> + *      is set, so clearing DISPATCHING first prevents the circular wait
> + *      (safe to lock the rq we need)
> + *   4. Unlock the current CPU's rq
> + *   5. Lock src_rq (where the task currently lives)
> + *   6. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the
> + *      race (dequeue clears holding_cpu to -1 when it takes the task), in
> + *      this case migration is aborted
> + *   7. If src_rq == dst_rq: clear holding_cpu and enqueue directly
> + *      into dst_rq's local DSQ (no lock swap needed)
> + *   8. Otherwise: call move_remote_task_to_local_dsq(), which releases
> + *      src_rq, locks dst_rq, and performs the deactivate/activate
> + *      migration cycle (dst_rq is held on return)
> + *   9. Unlock dst_rq and re-lock the current CPU's rq to restore
> + *      the lock state expected by the caller
> + *
> + * If any verification fails, abort the migration.

Maybe also mention that the same dance happens during direct dispatch, where
the task's state at the beginning of the dance is already NONE
(set in direct_dispatch()), but src_rq is guaranteed to be equal to the current
CPU's rq?

> + *
> + * This state tracking allows the BPF scheduler to try to dispatch any task
> + * at any time regardless of its state. The SCX core can safely
> + * reject/ignore invalid dispatches, simplifying the BPF scheduler
> + * implementation.
>   */
>  enum scx_ops_state {
>  	SCX_OPSS_NONE,		/* owned by the SCX core */

Thanks,
Kuba
Re: [PATCH v2] sched_ext: Document task ownership state machine
Posted by Kuba Piecuch 2 weeks, 6 days ago
On Fri Mar 20, 2026 at 1:56 PM UTC, Kuba Piecuch wrote:
> On Thu Mar 5, 2026 at 6:29 AM UTC, Andrea Righi wrote:
>> + * - %SCX_OPSS_QUEUED:
>> + *     Task is owned by the BPF scheduler. It's on a DSQ (dispatch queue)
>> + *     and the BPF scheduler is responsible for dispatching it. A QSEQ
>
> The task doesn't have to be on a DSQ, it can be queued on some BPF data
> structure instead. Even if it is on a DSQ, its state depends on whether it's
> on a user DSQ (QUEUED) or a built-in DSQ, e.g. local (NONE).

My comment here is incorrect in that a task being on a user DSQ doesn't imply
it's in QUEUED state.
Now I actually think it's impossible for a task on _any_ DSQ to be in QUEUED
state; here's why:

In order to initially get on a DSQ a task must be inserted into one via
scx_bpf_dsq_insert(). That can happen in ops.{select_cpu,enqueue}()
(direct dispatch) or in ops.dispatch() ("normal" dispatch).

In the direct dispatch case, the task either:
* stays in NONE the whole time (ops.select_cpu()), or
* goes from QUEUEING to NONE and stays there (ops.enqueue()).

In the normal dispatch case, the task must be in QUEUED state for
finish_dispatch() to proceed, meaning it can't have been direct dispatched
earlier. Then, the task state goes through QUEUED -> DISPATCHING -> NONE
and stays there.

scx_bpf_dsq_insert() isn't the only way for a task to be inserted into a DSQ.
There's also scx_bpf_dsq_move(), but the task being moved must already be on
a DSQ, so it must be in state NONE, and scx_bpf_dsq_move() doesn't affect that
state.
There are also other mechanisms like SCX automatically inserting a task with
slice left into the local DSQ on preemption, but these also preserve the NONE
state.

So, QUEUED means the task is enqueued purely on the BPF side and isn't present
on any DSQ, either user-created or built-in.
Re: [PATCH v2] sched_ext: Document task ownership state machine
Posted by Kuba Piecuch 2 weeks, 6 days ago
On Fri Mar 20, 2026 at 1:56 PM UTC, Kuba Piecuch wrote:
> This prompted me to have a look at the logic around SCX_OPSS_QUEUED and I can't
> convince myself that it's correct in the case of direct dispatches to
> non-builtin DSQs.
>
> The only place where ops_state is set to QUEUED is at the end of
> do_enqueue_task(). Notably, this assignment is skipped in the case of direct
> dispatch.
>
> direct_dispatch() will then call dispatch_enqueue() with SCX_ENQ_CLEAR_OPSS,
> causing ops_state to be reset to NONE. We end up in a state where the task
> is enqueued on a user DSQ, its ops_state is NONE and p->scx.flags has
> SCX_TASK_IN_CUSTODY, which doesn't look like a consistent state to me.
>
> Am I missing something here?

I just saw the comment in ops_dequeue() introduced by your ops.dequeue()
patchset, so I guess this is WAI. Please disregard this comment.
Re: [PATCH v2] sched_ext: Document task ownership state machine
Posted by Tejun Heo 1 month ago
Applied to sched_ext/for-7.0-fixes.

Thanks.

--
tejun