Document ops.dequeue() in the sched_ext task lifecycle now that its
semantics are well-defined.
Also update the pseudo-code to use task_is_runnable() consistently and
clarify the case where ops.dispatch() does not refill the time slice.
Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
Documentation/scheduler/sched-ext.rst | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404b4e4c33f7e..9f03650abfeba 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -422,23 +422,29 @@ by a sched_ext scheduler:
ops.runnable(); /* Task becomes ready to run */
- while (task is runnable) {
+ while (task_is_runnable(task)) {
if (task is not in a DSQ && task->scx.slice == 0) {
ops.enqueue(); /* Task can be added to a DSQ */
- /* Any usable CPU becomes available */
+ /* Task property change (i.e., affinity, nice, etc.)? */
+ if (sched_change(task)) {
+ ops.dequeue(); /* Exiting BPF scheduler custody */
+ continue;
+ }
+ }
- ops.dispatch(); /* Task is moved to a local DSQ */
+ /* Any usable CPU becomes available */
+
+ ops.dispatch(); /* Task is moved to a local DSQ */
+ ops.dequeue(); /* Exiting BPF scheduler custody */
- ops.dequeue(); /* Exiting BPF scheduler */
- }
ops.running(); /* Task starts running on its assigned CPU */
- while task_is_runnable(p) {
- while (task->scx.slice > 0 && task_is_runnable(p))
- ops.tick(); /* Called every 1/HZ seconds */
+ while (task_is_runnable(task) && task->scx.slice > 0) {
+ ops.tick(); /* Called every 1/HZ seconds */
- ops.dispatch(); /* task->scx.slice can be refilled */
+ if (task->scx.slice == 0)
+ ops.dispatch(); /* task->scx.slice can be refilled */
}
ops.stopping(); /* Task stops running (time slice expires or wait) */
--
2.53.0
Hi Andrea,
On Mon Apr 6, 2026 at 11:47 AM UTC, Andrea Righi wrote:
> Document ops.dequeue() in the sched_ext task lifecycle now that its
> semantics are well-defined.
>
> Also update the pseudo-code to use task_is_runnable() consistently and
> clarify the case where ops.dispatch() does not refill the time slice.
>
> Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> Documentation/scheduler/sched-ext.rst | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404b4e4c33f7e..9f03650abfeba 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -422,23 +422,29 @@ by a sched_ext scheduler:
>
> ops.runnable(); /* Task becomes ready to run */
>
> - while (task is runnable) {
> + while (task_is_runnable(task)) {
> if (task is not in a DSQ && task->scx.slice == 0) {
> ops.enqueue(); /* Task can be added to a DSQ */
>
> - /* Any usable CPU becomes available */
> + /* Task property change (i.e., affinity, nice, etc.)? */
> + if (sched_change(task)) {
> + ops.dequeue(); /* Exiting BPF scheduler custody */
Doesn't the task also go through quiescent -> runnable here? The full path
being dequeue -> quiescent -> (actual property change) -> runnable -> enqueue.
I guess we should be accurate here since quiescent and runnable are present
elsewhere in the pseudocode.
> + continue;
> + }
> + }
>
> - ops.dispatch(); /* Task is moved to a local DSQ */
> + /* Any usable CPU becomes available */
> +
> + ops.dispatch(); /* Task is moved to a local DSQ */
s/local/terminal/?
> + ops.dequeue(); /* Exiting BPF scheduler custody */
>
> - ops.dequeue(); /* Exiting BPF scheduler */
> - }
> ops.running(); /* Task starts running on its assigned CPU */
>
> - while task_is_runnable(p) {
> - while (task->scx.slice > 0 && task_is_runnable(p))
> - ops.tick(); /* Called every 1/HZ seconds */
> + while (task_is_runnable(task) && task->scx.slice > 0) {
> + ops.tick(); /* Called every 1/HZ seconds */
>
> - ops.dispatch(); /* task->scx.slice can be refilled */
> + if (task->scx.slice == 0)
> + ops.dispatch(); /* task->scx.slice can be refilled */
> }
>
> ops.stopping(); /* Task stops running (time slice expires or wait) */
Thanks,
Kuba
Hi Kuba,
On Tue, Apr 07, 2026 at 09:54:22AM +0000, Kuba Piecuch wrote:
> Hi Andrea,
>
> On Mon Apr 6, 2026 at 11:47 AM UTC, Andrea Righi wrote:
> > Document ops.dequeue() in the sched_ext task lifecycle now that its
> > semantics are well-defined.
> >
> > Also update the pseudo-code to use task_is_runnable() consistently and
> > clarify the case where ops.dispatch() does not refill the time slice.
> >
> > Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > Documentation/scheduler/sched-ext.rst | 24 +++++++++++++++---------
> > 1 file changed, 15 insertions(+), 9 deletions(-)
> >
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404b4e4c33f7e..9f03650abfeba 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -422,23 +422,29 @@ by a sched_ext scheduler:
> >
> > ops.runnable(); /* Task becomes ready to run */
> >
> > - while (task is runnable) {
> > + while (task_is_runnable(task)) {
> > if (task is not in a DSQ && task->scx.slice == 0) {
> > ops.enqueue(); /* Task can be added to a DSQ */
> >
> > - /* Any usable CPU becomes available */
> > + /* Task property change (i.e., affinity, nice, etc.)? */
> > + if (sched_change(task)) {
> > + ops.dequeue(); /* Exiting BPF scheduler custody */
>
> Doesn't the task also go through quiescent -> runnable here? The full path
> being dequeue -> quiescent -> (actual property change) -> runnable -> enqueue.
>
> I guess we should be accurate here since quiescent and runnable are present
> elsewhere in the pseudocode.
Ah yes, we need to add ops.quiescent() and ops.runnable() here. Tejun already
applied this patch to his branch, can you send another patch on top of this?
>
> > + continue;
> > + }
> > + }
> >
> > - ops.dispatch(); /* Task is moved to a local DSQ */
> > + /* Any usable CPU becomes available */
> > +
> > + ops.dispatch(); /* Task is moved to a local DSQ */
>
> s/local/terminal/?
Technically it'd be correct to say "terminal", but typically we use
scx_bpf_move_to_local() here, which moves the task to a local DSQ. It may then
fall back to SCX_DSQ_GLOBAL if something goes wrong, but, from a logical
perspective, the intention is to move the task to a local DSQ at this point.
So, I'm not sure if saying "terminal" here would be more confusing than
helpful... but I don't have a strong opinion on that.
Thanks,
-Andrea
>
> > + ops.dequeue(); /* Exiting BPF scheduler custody */
> >
> > - ops.dequeue(); /* Exiting BPF scheduler */
> > - }
> > ops.running(); /* Task starts running on its assigned CPU */
> >
> > - while task_is_runnable(p) {
> > - while (task->scx.slice > 0 && task_is_runnable(p))
> > - ops.tick(); /* Called every 1/HZ seconds */
> > + while (task_is_runnable(task) && task->scx.slice > 0) {
> > + ops.tick(); /* Called every 1/HZ seconds */
> >
> > - ops.dispatch(); /* task->scx.slice can be refilled */
> > + if (task->scx.slice == 0)
> > + ops.dispatch(); /* task->scx.slice can be refilled */
> > }
> >
> > ops.stopping(); /* Task stops running (time slice expires or wait) */
>
> Thanks,
> Kuba
When a queued task has one of its scheduling properties changed
(e.g. nice, affinity), it goes through dequeue() -> quiescent() ->
(property change callback, e.g. ops.set_weight()) -> runnable() ->
enqueue().
The existing documentation only mentions dequeue() and enqueue() on that
path, so add the missing callbacks.
Fixes: a4f61f0a1afd ("sched_ext: Documentation: Add ops.dequeue() to task lifecycle")
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
---
Documentation/scheduler/sched-ext.rst | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index ec594ae8086de..b5c70f4cfc352 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -429,6 +429,11 @@ by a sched_ext scheduler:
/* Task property change (i.e., affinity, nice, etc.)? */
if (sched_change(task)) {
ops.dequeue(); /* Exiting BPF scheduler custody */
+ ops.quiescent();
+
+ /* Property change callback, e.g. ops.set_weight() */
+
+ ops.runnable();
continue;
}
}
--
2.53.0.1213.gd9a14994de-goog
Hi Kuba,
On Wed, Apr 08, 2026 at 09:18:21AM +0000, Kuba Piecuch wrote:
> When a queued task has one of its scheduling properties changed
> (e.g. nice, affinity), it goes through dequeue() -> quiescent() ->
> (property change callback, e.g. ops.set_weight()) -> runnable() ->
> enqueue().
>
> The existing documentation only mentions dequeue() and enqueue() on that
> path, so add the missing callbacks.
>
> Fixes: a4f61f0a1afd ("sched_ext: Documentation: Add ops.dequeue() to task lifecycle")
> Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
> ---
> Documentation/scheduler/sched-ext.rst | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index ec594ae8086de..b5c70f4cfc352 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -429,6 +429,11 @@ by a sched_ext scheduler:
Looks good, but I noticed another issue: should we also change the condition
above as follows?
Documentation/scheduler/sched-ext.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 29d36e248f58b..99df4cc982375 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -423,7 +423,7 @@ by a sched_ext scheduler:
ops.runnable(); /* Task becomes ready to run */
while (task_is_runnable(task)) {
- if (task is not in a DSQ && task->scx.slice == 0) {
+ if (task is not in a DSQ || task->scx.slice == 0) {
ops.enqueue(); /* Task can be added to a DSQ */
/* Task property change (i.e., affinity, nice, etc.)? */
Because we trigger ops.enqueue() when the task's time slice has expired or when
it becomes runnable and has not been added to a DSQ.
This also correctly represents the sched_change() scenario: a task being
re-enqueued after sched_change() still has its time slice > 0, but we need to
call ops.enqueue() for it.
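To make the difference concrete, here is a minimal user-space sketch of the two
conditions. It is not kernel code: struct task_model and the helper names are
hypothetical, and the slice field only stands in for task->scx.slice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two things the pseudocode condition reads. */
struct task_model {
	bool in_dsq;    /* task already sits in a DSQ */
	uint64_t slice; /* models task->scx.slice */
};

/* Condition currently in the documentation. */
static bool enqueue_cond_and(const struct task_model *t)
{
	return !t->in_dsq && t->slice == 0;
}

/* Proposed condition. */
static bool enqueue_cond_or(const struct task_model *t)
{
	return !t->in_dsq || t->slice == 0;
}

/* True when only the "||" form asks for ops.enqueue(). */
static bool conditions_differ(const struct task_model *t)
{
	assert(t != NULL);
	return enqueue_cond_or(t) && !enqueue_cond_and(t);
}
```

With a task that is not in a DSQ and still has slice > 0 (the re-enqueue after
sched_change()), only the "||" form returns true. With a task that is already in
a DSQ and has slice > 0 (direct dispatch from ops.select_cpu()), both forms
return false, so ops.enqueue() is skipped.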
Thanks,
-Andrea
> /* Task property change (i.e., affinity, nice, etc.)? */
> if (sched_change(task)) {
> ops.dequeue(); /* Exiting BPF scheduler custody */
> + ops.quiescent();
> +
> + /* Property change callback, e.g. ops.set_weight() */
> +
> + ops.runnable();
> continue;
> }
> }
> --
> 2.53.0.1213.gd9a14994de-goog
>
Hi Andrea,
On Wed Apr 8, 2026 at 11:28 AM UTC, Andrea Righi wrote:
...
>
> Looks good, but I noticed another issue, should we also change the condition up
> above as following?
>
> Documentation/scheduler/sched-ext.rst | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 29d36e248f58b..99df4cc982375 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -423,7 +423,7 @@ by a sched_ext scheduler:
> ops.runnable(); /* Task becomes ready to run */
>
> while (task_is_runnable(task)) {
> - if (task is not in a DSQ && task->scx.slice == 0) {
> + if (task is not in a DSQ || task->scx.slice == 0) {
> ops.enqueue(); /* Task can be added to a DSQ */
>
> /* Task property change (i.e., affinity, nice, etc.)? */
>
> Because we trigger ops.enqueue() when the task expired its time slice or it
> becomes runnable and has not been added to a DSQ.
>
> This also represents correctly the sched_change() scenario: a task being
> re-enqueued after sched_change() still has its time slice > 0, but we need to
> call ops.enqueue() for it.
I agree that the condition should be changed, but I'm not sure that this is
what it should look like.
Is the "task is not in a DSQ" part of the condition there to handle direct
dispatch? Apart from direct dispatch from ops.select_cpu(), I wasn't able to
come up with a situation where we would reach this condition with the task
present on some DSQ.
A more general comment about the pseudocode: I think it can be useful to
introduce someone new to the general flow of the callbacks in sched_ext,
but the documentation should be clear that this is a simplified view that
makes assumptions about the behavior of the BPF scheduler itself (flags like
SCX_OPS_ENQ_LAST, whether the scheduler uses direct dispatch), as well as
the overall system (Can sched_ext be preempted by a higher-priority sched
class? Can scheduling properties of a task be changed while it's running?)
Without stating these assumptions clearly, we risk leaving the reader falsely
believing they have a complete understanding.
Thanks,
Kuba
On Wed, Apr 08, 2026 at 12:40:09PM +0000, Kuba Piecuch wrote:
> Hi Andrea,
>
> On Wed Apr 8, 2026 at 11:28 AM UTC, Andrea Righi wrote:
> ...
> >
> > Looks good, but I noticed another issue, should we also change the condition up
> > above as following?
> >
> > Documentation/scheduler/sched-ext.rst | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 29d36e248f58b..99df4cc982375 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -423,7 +423,7 @@ by a sched_ext scheduler:
> > ops.runnable(); /* Task becomes ready to run */
> >
> > while (task_is_runnable(task)) {
> > - if (task is not in a DSQ && task->scx.slice == 0) {
> > + if (task is not in a DSQ || task->scx.slice == 0) {
> > ops.enqueue(); /* Task can be added to a DSQ */
> >
> > /* Task property change (i.e., affinity, nice, etc.)? */
> >
> > Because we trigger ops.enqueue() when the task expired its time slice or it
> > becomes runnable and has not been added to a DSQ.
> >
> > This also represents correctly the sched_change() scenario: a task being
> > re-enqueued after sched_change() still has its time slice > 0, but we need to
> > call ops.enqueue() for it.
>
> I agree that the condition should be changed, but I'm not sure that this is
> what it should look like.
>
> Is the "task is not in a DSQ" part of the condition there to handle direct
> dispatch? Apart from direct dispatch from ops.select_cpu(), I wasn't able to
> come up with a situation where we would reach this condition with the task
> present on some DSQ.
The intent is to represent the direct dispatch from ops.select_cpu(), since in
that case ops.enqueue() is skipped.
Honestly I think if we change the && to || in that condition, everything should
be pretty accurate.
>
> A more general comment about the pseudocode: I think it can be useful to
> introduce someone new to the general flow of the callbacks in sched_ext,
> but the documentation should be clear that this is a simplified view that
> makes assumptions about the behavior of the BPF scheduler itself (flags like
> SCX_OPS_ENQ_LAST, whether the scheduler uses direct dispatch), as well as
> the overall system (Can sched_ext be preempted by a higher-priority sched
> class? Can scheduling properties of a task be changed while it's running?)
> Without stating these assumptions clearly, we risk leaving the reader falsely
> believing they have a complete understanding.
Of course this schema is not a complete representation of the entire sched_ext
state machine; if we put everything in, it'd become too big and complex. I think
we should just cover the most common use cases here. Maybe we can clarify this
in the description before this diagram.
Thanks,
-Andrea
On Wed Apr 8, 2026 at 1:49 PM UTC, Andrea Righi wrote:
> On Wed, Apr 08, 2026 at 12:40:09PM +0000, Kuba Piecuch wrote:
>> Hi Andrea,
>>
>> On Wed Apr 8, 2026 at 11:28 AM UTC, Andrea Righi wrote:
>> ...
>> >
>> > Looks good, but I noticed another issue, should we also change the condition up
>> > above as following?
>> >
>> > Documentation/scheduler/sched-ext.rst | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
>> > index 29d36e248f58b..99df4cc982375 100644
>> > --- a/Documentation/scheduler/sched-ext.rst
>> > +++ b/Documentation/scheduler/sched-ext.rst
>> > @@ -423,7 +423,7 @@ by a sched_ext scheduler:
>> > ops.runnable(); /* Task becomes ready to run */
>> >
>> > while (task_is_runnable(task)) {
>> > - if (task is not in a DSQ && task->scx.slice == 0) {
>> > + if (task is not in a DSQ || task->scx.slice == 0) {
>> > ops.enqueue(); /* Task can be added to a DSQ */
>> >
>> > /* Task property change (i.e., affinity, nice, etc.)? */
>> >
>> > Because we trigger ops.enqueue() when the task expired its time slice or it
>> > becomes runnable and has not been added to a DSQ.
>> >
>> > This also represents correctly the sched_change() scenario: a task being
>> > re-enqueued after sched_change() still has its time slice > 0, but we need to
>> > call ops.enqueue() for it.
>>
>> I agree that the condition should be changed, but I'm not sure that this is
>> what it should look like.
>>
>> Is the "task is not in a DSQ" part of the condition there to handle direct
>> dispatch? Apart from direct dispatch from ops.select_cpu(), I wasn't able to
>> come up with a situation where we would reach this condition with the task
>> present on some DSQ.
>
> The intent is to represent the direct dispatch from ops.select_cpu(), since in
> that case ops.enqueue() is skipped.
>
> Honestly I think if we change the && to || in that condition, everything should
> be pretty accurate.
In the case of direct dispatch from ops.select_cpu() we don't invoke
ops.dispatch() and ops.dequeue() before ops.running(), right? The current
pseudocode calls them unconditionally.
Another inaccuracy not related to direct dispatch: property changes can occur
while a task is running, while the pseudocode only allows for property changes
while a task is queued.
There's also preemption by a higher sched class, which is not covered in the
loop condition (task_is_runnable(task) && task->scx.slice > 0), unless we take
task_is_runnable() to return false if there's a higher-priority sched class
with runnable tasks on the CPU, though that would be in conflict with the
actual implementation of task_is_runnable() in include/linux/sched.h.
>
>>
>> A more general comment about the pseudocode: I think it can be useful to
>> introduce someone new to the general flow of the callbacks in sched_ext,
>> but the documentation should be clear that this is a simplified view that
>> makes assumptions about the behavior of the BPF scheduler itself (flags like
>> SCX_OPS_ENQ_LAST, whether the scheduler uses direct dispatch), as well as
>> the overall system (Can sched_ext be preempted by a higher-priority sched
>> class? Can scheduling properties of a task be changed while it's running?)
>> Without stating these assumptions clearly, we risk leaving the reader falsely
>> believing they have a complete understanding.
>
> Of course this schema is not a complete representation of the entire sched_ext
> state machine, if we put everything it'd become too big and complex. I think we
> should just cover the most common use cases here. Maybe we can clarify this in
> the description before this diagram.
Let's agree on what inaccuracies need to be fixed and I'll send a v2 with fixes
and attach an appropriate disclaimer to the pseudocode.
Hi Kuba,
On Wed, Apr 08, 2026 at 02:17:03PM +0000, Kuba Piecuch wrote:
> On Wed Apr 8, 2026 at 1:49 PM UTC, Andrea Righi wrote:
> > On Wed, Apr 08, 2026 at 12:40:09PM +0000, Kuba Piecuch wrote:
> >> Hi Andrea,
> >>
> >> On Wed Apr 8, 2026 at 11:28 AM UTC, Andrea Righi wrote:
> >> ...
> >> >
> >> > Looks good, but I noticed another issue, should we also change the condition up
> >> > above as following?
> >> >
> >> > Documentation/scheduler/sched-ext.rst | 2 +-
> >> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >> >
> >> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> >> > index 29d36e248f58b..99df4cc982375 100644
> >> > --- a/Documentation/scheduler/sched-ext.rst
> >> > +++ b/Documentation/scheduler/sched-ext.rst
> >> > @@ -423,7 +423,7 @@ by a sched_ext scheduler:
> >> > ops.runnable(); /* Task becomes ready to run */
> >> >
> >> > while (task_is_runnable(task)) {
> >> > - if (task is not in a DSQ && task->scx.slice == 0) {
> >> > + if (task is not in a DSQ || task->scx.slice == 0) {
> >> > ops.enqueue(); /* Task can be added to a DSQ */
> >> >
> >> > /* Task property change (i.e., affinity, nice, etc.)? */
> >> >
> >> > Because we trigger ops.enqueue() when the task expired its time slice or it
> >> > becomes runnable and has not been added to a DSQ.
> >> >
> >> > This also represents correctly the sched_change() scenario: a task being
> >> > re-enqueued after sched_change() still has its time slice > 0, but we need to
> >> > call ops.enqueue() for it.
> >>
> >> I agree that the condition should be changed, but I'm not sure that this is
> >> what it should look like.
> >>
> >> Is the "task is not in a DSQ" part of the condition there to handle direct
> >> dispatch? Apart from direct dispatch from ops.select_cpu(), I wasn't able to
> >> come up with a situation where we would reach this condition with the task
> >> present on some DSQ.
> >
> > The intent is to represent the direct dispatch from ops.select_cpu(), since in
> > that case ops.enqueue() is skipped.
> >
> > Honestly I think if we change the && to || in that condition, everything should
> > be pretty accurate.
>
> In the case of direct dispatch from ops.select_cpu() we don't invoke
> ops.dispatch() and ops.dequeue() before ops.running(), right? The current
> pseudocode calls them unconditionally.
We can move ops.dispatch() and ops.dequeue() inside the
if (task is not in a DSQ || task->scx.slice == 0) block.
>
> Another inaccuracy not related to direct dispatch: property changes can occur
> while a task is running, while the pseudocode only allows for property changes
> while a task is queued.
Sure... but again, modelling all the possible scenarios would make the
pseudocode completely unreadable.
IMHO it'd be better to give an overview of the most common use cases here and
clarify in the description that the diagram doesn't cover all the possible
scenarios. This one is a special use case that, personally, I wouldn't cover in
the pseudocode.
>
> There's also preemption by a higher sched class, which is not covered in the
> loop condition (task_is_runnable(task) && task->scx.slice > 0), unless we take
> task_is_runnable() to return false if there's a higher-priority sched class
> with runnable tasks on the CPU, though that would be in conflict with the
> actual implementation of task_is_runnable() in include/linux/sched.h.
Ditto.
>
> >
> >>
> >> A more general comment about the pseudocode: I think it can be useful to
> >> introduce someone new to the general flow of the callbacks in sched_ext,
> >> but the documentation should be clear that this is a simplified view that
> >> makes assumptions about the behavior of the BPF scheduler itself (flags like
> >> SCX_OPS_ENQ_LAST, whether the scheduler uses direct dispatch), as well as
> >> the overall system (Can sched_ext be preempted by a higher-priority sched
> >> class? Can scheduling properties of a task be changed while it's running?)
> >> Without stating these assumptions clearly, we risk leaving the reader falsely
> >> believing they have a complete understanding.
> >
> > Of course this schema is not a complete representation of the entire sched_ext
> > state machine, if we put everything it'd become too big and complex. I think we
> > should just cover the most common use cases here. Maybe we can clarify this in
> > the description before this diagram.
>
> Let's agree on what inaccuracies need to be fixed and I'll send a v2 with fixes
> and attach an appropriate disclaimer to the pseudocode.
If we move ops.dispatch() + ops.dequeue() inside the ops.enqueue() block I think
the pseudocode becomes "fairly" accurate. At least more accurate than what we
have right now. It won't be perfect, but it can help newer sched_ext devs get
an overview of the task lifecycle without going too much into implementation
details.
So, to recap, what do you think about this?
ops.init_task(); /* A new task is created */
ops.enable(); /* Enable BPF scheduling for the task */
while (task in SCHED_EXT) {
if (task can migrate)
ops.select_cpu(); /* Called on wakeup (optimization) */
ops.runnable(); /* Task becomes ready to run */
while (task_is_runnable(task)) {
if (task is not in a DSQ || task->scx.slice == 0) {
ops.enqueue(); /* Task can be added to a DSQ */
/* Task property change (i.e., affinity, nice, etc.)? */
if (sched_change(task)) {
ops.dequeue(); /* Exiting BPF scheduler custody */
ops.quiescent();
/* Property change callback, e.g. ops.set_weight() */
ops.runnable();
continue;
}
/* Any usable CPU becomes available */
ops.dispatch(); /* Task is moved to a local DSQ */
ops.dequeue(); /* Exiting BPF scheduler custody */
}
ops.running(); /* Task starts running on its assigned CPU */
while (task_is_runnable(task) && task->scx.slice > 0) {
ops.tick(); /* Called every 1/HZ seconds */
if (task->scx.slice == 0)
ops.dispatch(); /* task->scx.slice can be refilled */
}
ops.stopping(); /* Task stops running (time slice expires or wait) */
}
ops.quiescent(); /* Task releases its assigned CPU (wait) */
}
ops.disable(); /* Disable BPF scheduling for the task */
ops.exit_task(); /* Task is destroyed */
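As a sanity check of the recap above, the runnable -> running part can be
modelled in plain user-space C. This is only a sketch under stated assumptions:
struct trace, trace_add() and runnable_to_running() are hypothetical names, no
real callback is invoked, and "set_weight" is used as the example property
change callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define TRACE_MAX 256

/* Hypothetical trace buffer: callback names, separated by spaces. */
struct trace {
	char buf[TRACE_MAX];
};

static void trace_add(struct trace *tr, const char *name)
{
	size_t len = strlen(tr->buf);

	assert(len < sizeof(tr->buf));
	snprintf(tr->buf + len, sizeof(tr->buf) - len, "%s%s",
		 len ? " " : "", name);
}

/*
 * Walk one pass from ops.runnable() to ops.running(), following the recap
 * pseudocode. @direct_dispatch models ops.select_cpu() having already put
 * the task in a DSQ with a non-zero slice. @prop_changes is the number of
 * property changes that hit the task while it is queued.
 */
static void runnable_to_running(struct trace *tr, bool direct_dispatch,
				int prop_changes)
{
	bool in_dsq = direct_dispatch;
	unsigned long slice = direct_dispatch ? 1 : 0;

	tr->buf[0] = '\0';
	trace_add(tr, "runnable");

	for (;;) {
		if (!in_dsq || slice == 0) {
			trace_add(tr, "enqueue");

			if (prop_changes > 0) {
				prop_changes--;
				trace_add(tr, "dequeue");
				trace_add(tr, "quiescent");
				trace_add(tr, "set_weight");
				trace_add(tr, "runnable");
				continue;
			}

			trace_add(tr, "dispatch");
			trace_add(tr, "dequeue");
		}
		break;
	}

	trace_add(tr, "running");
}
```

In this model a direct-dispatched task goes straight from runnable to running,
while a task with one property change sees enqueue twice and dequeue twice
before running. Property changes while the task is running are not modelled.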
Thanks,
-Andrea
On Wed Apr 8, 2026 at 2:54 PM UTC, Andrea Righi wrote:
...
>>
>> Another inaccuracy not related to direct dispatch: property changes can occur
>> while a task is running, while the pseudocode only allows for property changes
>> while a task is queued.
>
> Sure... but again, modelling all the possible scenarios would make the
> pseudocode completely unreadable.
I'm not arguing we should cover all scenarios.
I'm ok with omitting scenarios whose existence depends on a configuration flag
or presence/absence of a callback, because:
a) Using the right configuration, one can actually write a scheduler where the
pseudocode is an accurate representation of the task lifecycle;
b) The assumptions about the configuration can be clearly stated next to the
pseudocode.
I'm less ok with omitting specific scenarios that can't be simply "turned off"
because they are triggered by the scheduled tasks themselves. A task's property
being changed while it's running is one example of such a scenario -- one can't
just prevent it from happening by setting a configuration flag, and sched_ext
schedulers implementing dequeue/quiescent/runnable/enqueue should be aware of
it.
What I especially don't like is giving the reader a partial picture that looks
like a complete one, as is the case with property changes here. We're letting
the reader know that it can happen, but the pseudocode makes it look like it
can only happen while a task is queued and not while it's running, giving the
reader a false impression that they can assume property changes apply only to
queued tasks.
>
> IMHO it'd be better to give an overview of the most common use cases here and
> clarify in the description that the diagram doesn't cover all the possible
> scenarios. This one is a special use case that, personally, I wouldn't cover in
> the pseudocode.
>
>>
>> There's also preemption by a higher sched class, which is not covered in the
>> loop condition (task_is_runnable(task) && task->scx.slice > 0), unless we take
>> task_is_runnable() to return false if there's a higher-priority sched class
>> with runnable tasks on the CPU, though that would be in conflict with the
>> actual implementation of task_is_runnable() in include/linux/sched.h.
>
> Ditto.
>
>>
>> >
>> >>
>> >> A more general comment about the pseudocode: I think it can be useful to
>> >> introduce someone new to the general flow of the callbacks in sched_ext,
>> >> but the documentation should be clear that this is a simplified view that
>> >> makes assumptions about the behavior of the BPF scheduler itself (flags like
>> >> SCX_OPS_ENQ_LAST, whether the scheduler uses direct dispatch), as well as
>> >> the overall system (Can sched_ext be preempted by a higher-priority sched
>> >> class? Can scheduling properties of a task be changed while it's running?)
>> >> Without stating these assumptions clearly, we risk leaving the reader falsely
>> >> believing they have a complete understanding.
>> >
>> > Of course this schema is not a complete representation of the entire sched_ext
>> > state machine, if we put everything it'd become too big and complex. I think we
>> > should just cover the most common use cases here. Maybe we can clarify this in
>> > the description before this diagram.
>>
>> Let's agree on what inaccuracies need to be fixed and I'll send a v2 with fixes
>> and attach an appropriate disclaimer to the pseudocode.
>
> If we move ops.dispatch() + ops.dequeue() inside the ops.enqueue() block I think
> the pseudocode becomes "fairly" accurate. At least more accurate than what we
> have right now. It won't be perfect, but it can help newer sched_ext devs having
> an overview the task lifecycle without going too much into implementation
> details.
>
> So, to recap, what do you think about this?
>
> ops.init_task(); /* A new task is created */
> ops.enable(); /* Enable BPF scheduling for the task */
>
> while (task in SCHED_EXT) {
> if (task can migrate)
> ops.select_cpu(); /* Called on wakeup (optimization) */
>
> ops.runnable(); /* Task becomes ready to run */
>
> while (task_is_runnable(task)) {
> if (task is not in a DSQ || task->scx.slice == 0) {
> ops.enqueue(); /* Task can be added to a DSQ */
>
> /* Task property change (i.e., affinity, nice, etc.)? */
> if (sched_change(task)) {
> ops.dequeue(); /* Exiting BPF scheduler custody */
> ops.quiescent();
>
> /* Property change callback, e.g. ops.set_weight() */
>
> ops.runnable();
> continue;
> }
>
> /* Any usable CPU becomes available */
>
> ops.dispatch(); /* Task is moved to a local DSQ */
> ops.dequeue(); /* Exiting BPF scheduler custody */
> }
>
> ops.running(); /* Task starts running on its assigned CPU */
>
> while (task_is_runnable(task) && task->scx.slice > 0) {
> ops.tick(); /* Called every 1/HZ seconds */
>
> if (task->scx.slice == 0)
> ops.dispatch(); /* task->scx.slice can be refilled */
> }
>
> ops.stopping(); /* Task stops running (time slice expires or wait) */
> }
>
> ops.quiescent(); /* Task releases its assigned CPU (wait) */
> }
>
> ops.disable(); /* Disable BPF scheduling for the task */
> ops.exit_task(); /* Task is destroyed */
I don't love it (and I probably never will), but I agree it's the best so far.
I'll send a v2 with the updated pseudocode and I'll put a bit of a disclaimer
before it.
Thanks,
Kuba
On 4/9/26 09:46, Kuba Piecuch wrote:
> On Wed Apr 8, 2026 at 2:54 PM UTC, Andrea Righi wrote:
> ...
>>>
>>> Another inaccuracy not related to direct dispatch: property changes can occur
>>> while a task is running, while the pseudocode only allows for property changes
>>> while a task is queued.
>>
>> Sure... but again, modelling all the possible scenarios would make the
>> pseudocode completely unreadable.
>
> I'm not arguing we should cover all scenarios.
>
> I'm ok with omitting scenarios whose existence depends on a configuration flag
> or presence/absence of a callback, because:
>
> a) Using the right configuration, one can actually write a scheduler where the
> pseudocode is an accurate representation of the task lifecycle;
>
> b) The assumptions about the configuration can be clearly stated next to the
> pseudocode.
>
> I'm less ok with omitting specific scenarios that can't be simply "turned off"
> because they are triggered by the scheduled tasks themselves. A task's property
> being changed while it's running is one example of such a scenario -- one can't
> just prevent it from happening by setting a configuration flag, and sched_ext
> schedulers implementing dequeue/quiescent/runnable/enqueue should be aware of
> it.
>
> What I especially don't like is giving the reader a partial picture that looks
> like a complete one, as is the case with property changes here. We're letting
> the reader know that it can happen, but the pseudocode makes it look like it
> can only happen while a task is queued and not while it's running, giving the
> reader a false impression that they can assume property changes apply only to
> queued tasks.
Agreed FWIW, I've implemented a few schedulers that need to track state transitions
100% accurately and it was painful to get it 100% right.
I think it's either this or we add a sample BPF scheduler that actually does
track/validate all possible transitions per-task accurately to illustrate. (Maybe a
selftest?)
But that would mean the below becoming quite a bit more complex, too.
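To make the idea concrete, here is a minimal user-space sketch of what such a
per-task validator could look like. It is not BPF and not taken from any
existing scheduler or selftest: the lc_* names, states and events are all
hypothetical, and the transition table encodes only the simplified pseudocode
quoted below, so it deliberately rejects sequences the kernel can really
produce (direct dispatch, for example).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task states, named after the last callback seen. */
enum lc_state {
	LC_QUIESCENT, LC_RUNNABLE, LC_QUEUED,
	LC_DEQUEUED, LC_RUNNING, LC_STOPPED,
};

/* Hypothetical events, one per modelled ops.*() callback. */
enum lc_event {
	LC_EV_RUNNABLE, LC_EV_ENQUEUE, LC_EV_DEQUEUE,
	LC_EV_RUNNING, LC_EV_STOPPING, LC_EV_QUIESCENT,
};

/* Apply one callback; return false if the pseudocode does not allow it here. */
static bool lc_step(enum lc_state *s, enum lc_event ev)
{
	switch (ev) {
	case LC_EV_RUNNABLE:
		if (*s != LC_QUIESCENT)
			return false;
		*s = LC_RUNNABLE;
		return true;
	case LC_EV_ENQUEUE:	/* first enqueue, or re-enqueue after the slice expired */
		if (*s != LC_RUNNABLE && *s != LC_STOPPED)
			return false;
		*s = LC_QUEUED;
		return true;
	case LC_EV_DEQUEUE:
		if (*s != LC_QUEUED)
			return false;
		*s = LC_DEQUEUED;
		return true;
	case LC_EV_RUNNING:
		if (*s != LC_DEQUEUED)
			return false;
		*s = LC_RUNNING;
		return true;
	case LC_EV_STOPPING:
		if (*s != LC_RUNNING)
			return false;
		*s = LC_STOPPED;
		return true;
	case LC_EV_QUIESCENT:	/* property change while queued, or task goes to sleep */
		if (*s != LC_DEQUEUED && *s != LC_STOPPED)
			return false;
		*s = LC_QUIESCENT;
		return true;
	}
	return false;
}

/* Replay a callback trace; return the index of the first rejected event, or n. */
static size_t lc_replay(const enum lc_event *evs, size_t n)
{
	enum lc_state s = LC_QUIESCENT;
	size_t i;

	for (i = 0; i < n; i++)
		if (!lc_step(&s, evs[i]))
			break;
	return i;
}

/* Example: a direct dispatch from ops.enqueue() is flagged at ops.running(). */
static void lc_example(void)
{
	const enum lc_event direct[] = {
		LC_EV_RUNNABLE, LC_EV_ENQUEUE, LC_EV_RUNNING,
	};

	assert(lc_replay(direct, 3) == 2);
}
```

A real validator would need extra edges for every case the pseudocode leaves
out, which is where the complexity comes from.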
>
>>
>> IMHO it'd be better to give an overview of the most common use cases here and
>> clarify in the description that the diagram doesn't cover all the possible
>> scenarios. This one is a special use case that, personally, I wouldn't cover in
>> the pseudocode.
>>
>>>
>>> There's also preemption by a higher sched class, which is not covered in the
>>> loop condition (task_is_runnable(task) && task->scx.slice > 0), unless we take
>>> task_is_runnable() to return false if there's a higher-priority sched class
>>> with runnable tasks on the CPU, though that would be in conflict with the
>>> actual implementation of task_is_runnable() in include/linux/sched.h.
>>
>> Ditto.
>>
>>>
>>>>
>>>>>
>>>>> A more general comment about the pseudocode: I think it can be useful to
>>>>> introduce someone new to the general flow of the callbacks in sched_ext,
>>>>> but the documentation should be clear that this is a simplified view that
>>>>> makes assumptions about the behavior of the BPF scheduler itself (flags like
>>>>> SCX_OPS_ENQ_LAST, whether the scheduler uses direct dispatch), as well as
>>>>> the overall system (Can sched_ext be preempted by a higher-priority sched
>>>>> class? Can scheduling properties of a task be changed while it's running?)
>>>>> Without stating these assumptions clearly, we risk leaving the reader falsely
>>>>> believing they have a complete understanding.
>>>>
>>>> Of course this schema is not a complete representation of the entire sched_ext
>>>> state machine, if we put everything it'd become too big and complex. I think we
>>>> should just cover the most common use cases here. Maybe we can clarify this in
>>>> the description before this diagram.
>>>
>>> Let's agree on what inaccuracies need to be fixed and I'll send a v2 with fixes
>>> and attach an appropriate disclaimer to the pseudocode.
>>
>> If we move ops.dispatch() + ops.dequeue() inside the ops.enqueue() block I think
>> the pseudocode becomes "fairly" accurate. At least more accurate than what we
>> have right now. It won't be perfect, but it can help newer sched_ext devs get
>> an overview of the task lifecycle without going too much into implementation
>> details.
>>
>> So, to recap, what do you think about this?
>>
>> ops.init_task(); /* A new task is created */
>> ops.enable(); /* Enable BPF scheduling for the task */
>>
>> while (task in SCHED_EXT) {
>> if (task can migrate)
>> ops.select_cpu(); /* Called on wakeup (optimization) */
>>
>> ops.runnable(); /* Task becomes ready to run */
>>
>> while (task_is_runnable(task)) {
>> if (task is not in a DSQ || task->scx.slice == 0) {
>> ops.enqueue(); /* Task can be added to a DSQ */
>>
>> /* Task property change (i.e., affinity, nice, etc.)? */
>> if (sched_change(task)) {
>> ops.dequeue(); /* Exiting BPF scheduler custody */
>> ops.quiescent();
>>
>> /* Property change callback, e.g. ops.set_weight() */
>>
>> ops.runnable();
>> continue;
>> }
>>
>> /* Any usable CPU becomes available */
>>
>> ops.dispatch(); /* Task is moved to a local DSQ */
>> ops.dequeue(); /* Exiting BPF scheduler custody */
Is this true here? Any dispatch followed by a dequeue?
>> }
>>
>> ops.running(); /* Task starts running on its assigned CPU */
>>
>> while (task_is_runnable(task) && task->scx.slice > 0) {
>> ops.tick(); /* Called every 1/HZ seconds */
>>
>> if (task->scx.slice == 0)
>> ops.dispatch(); /* task->scx.slice can be refilled */
>> }
>>
>> ops.stopping(); /* Task stops running (time slice expires or wait) */
>> }
>>
>> ops.quiescent(); /* Task releases its assigned CPU (wait) */
>> }
>>
>> ops.disable(); /* Disable BPF scheduling for the task */
>> ops.exit_task(); /* Task is destroyed */
>
> I don't love it (and I probably never will), but I agree it's the best so far.
> I'll send a v2 with the updated pseudocode and I'll put a bit of a disclaimer
> before it.
On Thu Apr 9, 2026 at 9:46 AM UTC, Christian Loehle wrote:
...
>>>
>>> ops.init_task(); /* A new task is created */
>>> ops.enable(); /* Enable BPF scheduling for the task */
>>>
>>> while (task in SCHED_EXT) {
>>> if (task can migrate)
>>> ops.select_cpu(); /* Called on wakeup (optimization) */
>>>
>>> ops.runnable(); /* Task becomes ready to run */
>>>
>>> while (task_is_runnable(task)) {
>>> if (task is not in a DSQ || task->scx.slice == 0) {
>>> ops.enqueue(); /* Task can be added to a DSQ */
>>>
>>> /* Task property change (i.e., affinity, nice, etc.)? */
>>> if (sched_change(task)) {
>>> ops.dequeue(); /* Exiting BPF scheduler custody */
>>> ops.quiescent();
>>>
>>> /* Property change callback, e.g. ops.set_weight() */
>>>
>>> ops.runnable();
>>> continue;
>>> }
>>>
>>> /* Any usable CPU becomes available */
>>>
>>> ops.dispatch(); /* Task is moved to a local DSQ */
>>> ops.dequeue(); /* Exiting BPF scheduler custody */
> Is this true here? Any dispatch followed by a dequeue?
The comment next to ops.dispatch() says the task is moved to a local DSQ,
so if we assume that, then I think it will always be followed by ops.dequeue().
Same if we move the task to the global DSQ.
Of course, you could do something weird like dispatch the task to a user DSQ,
in which case there won't be a dequeue and the task won't start running, but
that's weird enough that I don't think we need to consider it.
You could also have a property change racing with the dispatch which would make
the dispatch fail and not be followed by a dequeue, but again, we need to draw
the line somewhere.
So, in other words, any _successful_ dispatch to a _terminal_ DSQ is always
followed by a dequeue.
Another case that isn't handled here is direct dispatch to a terminal DSQ from
ops.enqueue(), where we don't get ops.dispatch() or ops.dequeue() and go
straight to ops.running(). If any of the above cases should be handled in the
pseudocode, I'd say it's this one.
>>> }
>>>
>>> ops.running(); /* Task starts running on its assigned CPU */
>>>
>>> while (task_is_runnable(task) && task->scx.slice > 0) {
>>> ops.tick(); /* Called every 1/HZ seconds */
>>>
>>> if (task->scx.slice == 0)
>>> ops.dispatch(); /* task->scx.slice can be refilled */
>>> }
>>>
>>> ops.stopping(); /* Task stops running (time slice expires or wait) */
>>> }
>>>
>>> ops.quiescent(); /* Task releases its assigned CPU (wait) */
>>> }
>>>
>>> ops.disable(); /* Disable BPF scheduling for the task */
>>> ops.exit_task(); /* Task is destroyed */
On Thu, Apr 09, 2026 at 01:30:55PM +0000, Kuba Piecuch wrote:
> On Thu Apr 9, 2026 at 9:46 AM UTC, Christian Loehle wrote:
> ...
> >>>
> >>> ops.init_task(); /* A new task is created */
> >>> ops.enable(); /* Enable BPF scheduling for the task */
> >>>
> >>> while (task in SCHED_EXT) {
> >>> if (task can migrate)
> >>> ops.select_cpu(); /* Called on wakeup (optimization) */
> >>>
> >>> ops.runnable(); /* Task becomes ready to run */
> >>>
> >>> while (task_is_runnable(task)) {
> >>> if (task is not in a DSQ || task->scx.slice == 0) {
> >>> ops.enqueue(); /* Task can be added to a DSQ */
> >>>
> >>> /* Task property change (i.e., affinity, nice, etc.)? */
> >>> if (sched_change(task)) {
> >>> ops.dequeue(); /* Exiting BPF scheduler custody */
> >>> ops.quiescent();
> >>>
> >>> /* Property change callback, e.g. ops.set_weight() */
> >>>
> >>> ops.runnable();
> >>> continue;
> >>> }
> >>>
> >>> /* Any usable CPU becomes available */
> >>>
> >>> ops.dispatch(); /* Task is moved to a local DSQ */
> >>> ops.dequeue(); /* Exiting BPF scheduler custody */
> > Is this true here? Any dispatch followed by a dequeue?
>
> The comment next to ops.dispatch() says the task is moved to a local DSQ,
> so if we assume that, then I think it will always be followed by ops.dequeue().
> Same if we move the task to the global DSQ.
So, ops.dispatch() is not a "task callback", it's a "CPU callback", invoked when
a CPU becomes available. So having ops.dispatch() here can be a bit confusing.
The intent was to describe the workflow where, once the task is enqueued to a
non-terminal DSQ, then it can be consumed by an ops.dispatch() event and, in
that case, ops.dequeue() is also invoked when the task reaches a terminal DSQ.
Not sure if there's a better way to express this concept in the pseudocode.
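One way to picture it, as a rough user-space sketch rather than real sched_ext
code (every cpuq_* name here is made up for illustration): the dispatch side is
driven by a CPU looking for work, and the dequeue is a side effect on whichever
task happens to be moved to that CPU's local DSQ.

```c
#include <assert.h>
#include <stddef.h>

#define CPUQ_MAX_TASKS 8

struct cpuq_task {
	int pid;
	int nr_dequeues;	/* times the modelled ops.dequeue() fired */
};

/* Hypothetical FIFO standing in for a non-terminal DSQ. */
struct cpuq {
	struct cpuq_task *tasks[CPUQ_MAX_TASKS];
	size_t head, tail;
};

/* Task side: the modelled ops.enqueue() parks the task in the queue. */
static int cpuq_enqueue(struct cpuq *q, struct cpuq_task *t)
{
	if (q->tail - q->head == CPUQ_MAX_TASKS)
		return -1;	/* queue full */
	q->tasks[q->tail++ % CPUQ_MAX_TASKS] = t;
	return 0;
}

/*
 * CPU side: the modelled ops.dispatch(). It runs because a CPU became
 * available, not on behalf of any particular task. Returns the task moved
 * to that CPU's local DSQ, or NULL if nothing was queued.
 */
static struct cpuq_task *cpuq_dispatch(struct cpuq *q)
{
	struct cpuq_task *t;

	if (q->head == q->tail)
		return NULL;	/* nothing queued, the CPU stays idle */
	t = q->tasks[q->head++ % CPUQ_MAX_TASKS];
	t->nr_dequeues++;	/* task reached a terminal DSQ */
	return t;
}

/* Example: one CPU becomes available, so only the first task is dequeued. */
static void cpuq_example(void)
{
	struct cpuq q = { .head = 0, .tail = 0 };
	struct cpuq_task a = { .pid = 1 }, b = { .pid = 2 };

	assert(cpuq_enqueue(&q, &a) == 0);
	assert(cpuq_enqueue(&q, &b) == 0);
	assert(cpuq_dispatch(&q) == &a);
	assert(a.nr_dequeues == 1 && b.nr_dequeues == 0);
}
```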
>
> Of course, you could do something weird like dispatch the task to a user DSQ,
> in which case there won't be a dequeue and the task won't start running, but
> that's weird enough that I don't think we need to consider it.
Right. For the record, scx_rustland_core does something similar: from
ops.dispatch() it consumes a task from a BPF user ringbuffer, inserts it into a
user DSQ and then consumes the first task from the user DSQ via
scx_bpf_dsq_move_to_local().
That's a bit of a special use case, due to the unusual user-space scheduling
part, but even there the pseudocode is still accurate, since
scx_bpf_dsq_move_to_local() triggers an ops.dequeue().
>
> You could also have a property change racing with the dispatch which would make
> the dispatch fail and not be followed by a dequeue, but again, we need to draw
> the line somewhere.
Yeah, the thing is that sched_change() introduces a lot of different edge cases
that are difficult to represent in the pseudocode. I guess the best we can do is
document in a descriptive manner the concept of "BPF scheduler's custody" and
the fact that a task can temporarily leave the custody when a sched_change()
event happens.
>
> So, in other words, any _successful_ dispatch to a _terminal_ DSQ is always
> followed by a dequeue.
>
> Another case that isn't handled here is direct dispatch to a terminal DSQ from
> ops.enqueue(), where we don't get ops.dispatch() or ops.dequeue() and go
> straight to ops.running(). If any of the above cases should be handled in the
> pseudocode, I'd say it's this one.
Right, in fact it should be: any _successful_ dispatch to a _terminal DSQ_, if
the task was in the BPF scheduler's custody. A direct dispatch to a terminal DSQ
either from ops.select_cpu() or ops.enqueue() doesn't trigger ops.dequeue(),
because the task doesn't enter the BPF scheduler's custody.
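As a rough illustration of that rule, here is a small user-space model (not
kernel code; the cust_* names and the enum are invented for this sketch). It
only covers the dispatch path and ignores the sched_change() case, where
ops.dequeue() fires without any dispatch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical DSQ kinds; local and global count as terminal here. */
enum cust_dsq {
	CUST_DSQ_NONE,		/* task kept in the scheduler's own structures */
	CUST_DSQ_USER,
	CUST_DSQ_LOCAL,
	CUST_DSQ_GLOBAL,
};

struct cust_task {
	bool in_custody;	/* true while the BPF scheduler holds the task */
	int nr_dequeues;	/* times the modelled ops.dequeue() fired */
};

static bool cust_dsq_is_terminal(enum cust_dsq dsq)
{
	return dsq == CUST_DSQ_LOCAL || dsq == CUST_DSQ_GLOBAL;
}

/*
 * Modelled ops.enqueue(): a direct dispatch to a terminal DSQ means the
 * task never enters the scheduler's custody.
 */
static void cust_enqueue(struct cust_task *t, enum cust_dsq target)
{
	t->in_custody = !cust_dsq_is_terminal(target);
}

/*
 * Modelled dispatch attempt. 'success' is false when, for instance, a
 * property change raced with the dispatch. A dequeue is counted only for a
 * successful dispatch of a task in custody to a terminal DSQ.
 */
static void cust_dispatch(struct cust_task *t, enum cust_dsq target, bool success)
{
	if (success && t->in_custody && cust_dsq_is_terminal(target)) {
		t->in_custody = false;
		t->nr_dequeues++;
	}
}

static void cust_example(void)
{
	struct cust_task t = { 0 };

	/* Direct dispatch from enqueue to a local DSQ: no custody, no dequeue. */
	cust_enqueue(&t, CUST_DSQ_LOCAL);
	cust_dispatch(&t, CUST_DSQ_LOCAL, true);
	assert(t.nr_dequeues == 0);

	/* Queued in a user DSQ, then moved to a local DSQ: exactly one dequeue. */
	cust_enqueue(&t, CUST_DSQ_USER);
	cust_dispatch(&t, CUST_DSQ_LOCAL, true);
	assert(t.nr_dequeues == 1);
}
```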
Thanks,
-Andrea
On Thu, Apr 09, 2026 at 10:46:09AM +0100, Christian Loehle wrote:
> On 4/9/26 09:46, Kuba Piecuch wrote:
> > On Wed Apr 8, 2026 at 2:54 PM UTC, Andrea Righi wrote:
> > ...
> >>>
> >>> Another inaccuracy not related to direct dispatch: property changes can occur
> >>> while a task is running, while the pseudocode only allows for property changes
> >>> while a task is queued.
> >>
> >> Sure... but again, modelling all the possible scenarios would make the
> >> pseudocode completely unreadable.
> >
> > I'm not arguing we should cover all scenarios.
> >
> > I'm ok with omitting scenarios whose existence depends on a configuration flag
> > or presence/absence of a callback, because:
> >
> > a) Using the right configuration, one can actually write a scheduler where the
> > pseudocode is an accurate representation of the task lifecycle;
> >
> > b) The assumptions about the configuration can be clearly stated next to the
> > pseudocode.
> >
> > I'm less ok with omitting specific scenarios that can't be simply "turned off"
> > because they are triggered by the scheduled tasks themselves. A task's property
> > being changed while it's running is one example of such a scenario -- one can't
> > just prevent it from happening by setting a configuration flag, and sched_ext
> > schedulers implementing dequeue/quiescent/runnable/enqueue should be aware of
> > it.
> >
> > What I especially don't like is giving the reader a partial picture that looks
> > like a complete one, as is the case with property changes here. We're letting
> > the reader know that it can happen, but the pseudocode makes it look like it
> > can only happen while a task is queued and not while it's running, giving the
> > reader a false impression that they can assume property changes apply only to
> > queued tasks.
> >
>
> Agreed FWIW, I've implemented a few schedulers that need to track state transitions
> 100% accurately and it was painful to get it 100% right.
> I think it's either this or we add a sample BPF scheduler that actually does
> track/validate all possible transitions per-task accurately to illustrate. (Maybe a
> selftest?)
One thing doesn't exclude the other, we can have an example scheduler that
implements 100% accurate state tracking (the dequeue kselftest is probably
already a valid example of that) and this slightly inaccurate high-level
overview of the task lifecycle workflow.
-Andrea
On Thu, Apr 09, 2026 at 08:46:03AM +0000, Kuba Piecuch wrote:
> On Wed Apr 8, 2026 at 2:54 PM UTC, Andrea Righi wrote:
> ...
> >>
> >> Another inaccuracy not related to direct dispatch: property changes can occur
> >> while a task is running, while the pseudocode only allows for property changes
> >> while a task is queued.
> >
> > Sure... but again, modelling all the possible scenarios would make the
> > pseudocode completely unreadable.
>
> I'm not arguing we should cover all scenarios.
>
> I'm ok with omitting scenarios whose existence depends on a configuration flag
> or presence/absence of a callback, because:
>
> a) Using the right configuration, one can actually write a scheduler where the
> pseudocode is an accurate representation of the task lifecycle;
>
> b) The assumptions about the configuration can be clearly stated next to the
> pseudocode.
>
> I'm less ok with omitting specific scenarios that can't be simply "turned off"
> because they are triggered by the scheduled tasks themselves. A task's property
> being changed while it's running is one example of such a scenario -- one can't
> just prevent it from happening by setting a configuration flag, and sched_ext
> schedulers implementing dequeue/quiescent/runnable/enqueue should be aware of
> it.
>
> What I especially don't like is giving the reader a partial picture that looks
> like a complete one, as is the case with property changes here. We're letting
> the reader know that it can happen, but the pseudocode makes it look like it
> can only happen while a task is queued and not while it's running, giving the
> reader a false impression that they can assume property changes apply only to
> queued tasks.
I agree on that, but I think the goal of this pseudocode is to find a reasonable
compromise between readability and accuracy. If such a compromise doesn't exist,
or if we're concerned that it'd introduce more confusion than benefit for the
users, then we can also consider removing it.
>
> >
> > IMHO it'd be better to give an overview of the most common use cases here and
> > clarify in the description that the diagram doesn't cover all the possible
> > scenarios. This one is a special use case that, personally, I wouldn't cover in
> > the pseudocode.
> >
> >>
> >> There's also preemption by a higher sched class, which is not covered in the
> >> loop condition (task_is_runnable(task) && task->scx.slice > 0), unless we take
> >> task_is_runnable() to return false if there's a higher-priority sched class
> >> with runnable tasks on the CPU, though that would be in conflict with the
> >> actual implementation of task_is_runnable() in include/linux/sched.h.
> >
> > Ditto.
> >
> >>
> >> >
> >> >>
> >> >> A more general comment about the pseudocode: I think it can be useful to
> >> >> introduce someone new to the general flow of the callbacks in sched_ext,
> >> >> but the documentation should be clear that this is a simplified view that
> >> >> makes assumptions about the behavior of the BPF scheduler itself (flags like
> >> >> SCX_OPS_ENQ_LAST, whether the scheduler uses direct dispatch), as well as
> >> >> the overall system (Can sched_ext be preempted by a higher-priority sched
> >> >> class? Can scheduling properties of a task be changed while it's running?)
> >> >> Without stating these assumptions clearly, we risk leaving the reader falsely
> >> >> believing they have a complete understanding.
> >> >
> >> > Of course this schema is not a complete representation of the entire sched_ext
> >> > state machine, if we put everything it'd become too big and complex. I think we
> >> > should just cover the most common use cases here. Maybe we can clarify this in
> >> > the description before this diagram.
> >>
> >> Let's agree on what inaccuracies need to be fixed and I'll send a v2 with fixes
> >> and attach an appropriate disclaimer to the pseudocode.
> >
> > If we move ops.dispatch() + ops.dequeue() inside the ops.enqueue() block I think
> > the pseudocode becomes "fairly" accurate. At least more accurate than what we
> > have right now. It won't be perfect, but it can help newer sched_ext devs get
> > an overview of the task lifecycle without going too much into implementation
> > details.
> >
> > So, to recap, what do you think about this?
> >
> > ops.init_task(); /* A new task is created */
> > ops.enable(); /* Enable BPF scheduling for the task */
> >
> > while (task in SCHED_EXT) {
> > if (task can migrate)
> > ops.select_cpu(); /* Called on wakeup (optimization) */
> >
> > ops.runnable(); /* Task becomes ready to run */
> >
> > while (task_is_runnable(task)) {
> > if (task is not in a DSQ || task->scx.slice == 0) {
> > ops.enqueue(); /* Task can be added to a DSQ */
> >
> > /* Task property change (i.e., affinity, nice, etc.)? */
> > if (sched_change(task)) {
> > ops.dequeue(); /* Exiting BPF scheduler custody */
> > ops.quiescent();
> >
> > /* Property change callback, e.g. ops.set_weight() */
> >
> > ops.runnable();
> > continue;
> > }
> >
> > /* Any usable CPU becomes available */
> >
> > ops.dispatch(); /* Task is moved to a local DSQ */
> > ops.dequeue(); /* Exiting BPF scheduler custody */
> > }
> >
> > ops.running(); /* Task starts running on its assigned CPU */
> >
> > while (task_is_runnable(task) && task->scx.slice > 0) {
> > ops.tick(); /* Called every 1/HZ seconds */
> >
> > if (task->scx.slice == 0)
> > ops.dispatch(); /* task->scx.slice can be refilled */
> > }
> >
> > ops.stopping(); /* Task stops running (time slice expires or wait) */
> > }
> >
> > ops.quiescent(); /* Task releases its assigned CPU (wait) */
> > }
> >
> > ops.disable(); /* Disable BPF scheduling for the task */
> > ops.exit_task(); /* Task is destroyed */
>
> I don't love it (and I probably never will), but I agree it's the best so far.
> I'll send a v2 with the updated pseudocode and I'll put a bit of a disclaimer
> before it.
I also don't love it, but with these changes it'd be better (or rather a bit more
accurate) than what we have right now...
Maybe we can add a "Special cases" section below the task lifecycle to better
explain all the exceptions and non-covered scenarios? Some of them are covered
in the "Scheduling Cycle" section, so we could also point to them.
Thanks,
-Andrea
> sched_ext: Documentation: Add ops.dequeue() to task lifecycle
Applied to sched_ext/for-7.1 with Emil's Reviewed-by added and the Fixes: tag
dropped per his comment.
Thanks.
--
tejun
On Mon Apr 6, 2026 at 7:47 AM EDT, Andrea Righi wrote:
> Document ops.dequeue() in the sched_ext task lifecycle now that its
> semantics are well-defined.
>
> Also update the pseudo-code to use task_is_runnable() consistently and
> clarify the case where ops.dispatch() does not refill the time slice.
>
> Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
Is the Fixes: tag appropriate here? It's not like the original patch
introduced a bug by fixing ops.dequeue().
Otherwise the state machine looks fine to me!
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> Documentation/scheduler/sched-ext.rst | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404b4e4c33f7e..9f03650abfeba 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -422,23 +422,29 @@ by a sched_ext scheduler:
>
> ops.runnable(); /* Task becomes ready to run */
>
> - while (task is runnable) {
> + while (task_is_runnable(task)) {
> if (task is not in a DSQ && task->scx.slice == 0) {
> ops.enqueue(); /* Task can be added to a DSQ */
>
> - /* Any usable CPU becomes available */
> + /* Task property change (i.e., affinity, nice, etc.)? */
> + if (sched_change(task)) {
> + ops.dequeue(); /* Exiting BPF scheduler custody */
> + continue;
> + }
> + }
>
> - ops.dispatch(); /* Task is moved to a local DSQ */
> + /* Any usable CPU becomes available */
> +
> + ops.dispatch(); /* Task is moved to a local DSQ */
> + ops.dequeue(); /* Exiting BPF scheduler custody */
>
> - ops.dequeue(); /* Exiting BPF scheduler */
> - }
> ops.running(); /* Task starts running on its assigned CPU */
>
> - while task_is_runnable(p) {
> - while (task->scx.slice > 0 && task_is_runnable(p))
> - ops.tick(); /* Called every 1/HZ seconds */
> + while (task_is_runnable(task) && task->scx.slice > 0) {
> + ops.tick(); /* Called every 1/HZ seconds */
>
> - ops.dispatch(); /* task->scx.slice can be refilled */
> + if (task->scx.slice == 0)
> + ops.dispatch(); /* task->scx.slice can be refilled */
> }
>
> ops.stopping(); /* Task stops running (time slice expires or wait) */
Hi Emil,
On Mon, Apr 06, 2026 at 10:49:18AM -0400, Emil Tsalapatis wrote:
> On Mon Apr 6, 2026 at 7:47 AM EDT, Andrea Righi wrote:
> > Document ops.dequeue() in the sched_ext task lifecycle now that its
> > semantics are well-defined.
> >
> > Also update the pseudo-code to use task_is_runnable() consistently and
> > clarify the case where ops.dispatch() does not refill the time slice.
> >
> > Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
>
> Is the Fixes: tag appropriate here? It's not like the original patch
> introduced a bug by fixing ops.dequeue().
Yeah, the intent here was to make sure this commit isn't applied without
ebf1ccff79c4 (otherwise the state machine would be inaccurate), but that
shouldn't happen, so it's probably reasonable to drop the Fixes line.
Thanks,
-Andrea
>
> Otherwise the state machine looks fine to me!
>
> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > Documentation/scheduler/sched-ext.rst | 24 +++++++++++++++---------
> > 1 file changed, 15 insertions(+), 9 deletions(-)
> >
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404b4e4c33f7e..9f03650abfeba 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -422,23 +422,29 @@ by a sched_ext scheduler:
> >
> > ops.runnable(); /* Task becomes ready to run */
> >
> > - while (task is runnable) {
> > + while (task_is_runnable(task)) {
> > if (task is not in a DSQ && task->scx.slice == 0) {
> > ops.enqueue(); /* Task can be added to a DSQ */
> >
> > - /* Any usable CPU becomes available */
> > + /* Task property change (i.e., affinity, nice, etc.)? */
> > + if (sched_change(task)) {
> > + ops.dequeue(); /* Exiting BPF scheduler custody */
> > + continue;
> > + }
> > + }
> >
> > - ops.dispatch(); /* Task is moved to a local DSQ */
> > + /* Any usable CPU becomes available */
> > +
> > + ops.dispatch(); /* Task is moved to a local DSQ */
> > + ops.dequeue(); /* Exiting BPF scheduler custody */
> >
> > - ops.dequeue(); /* Exiting BPF scheduler */
> > - }
> > ops.running(); /* Task starts running on its assigned CPU */
> >
> > - while task_is_runnable(p) {
> > - while (task->scx.slice > 0 && task_is_runnable(p))
> > - ops.tick(); /* Called every 1/HZ seconds */
> > + while (task_is_runnable(task) && task->scx.slice > 0) {
> > + ops.tick(); /* Called every 1/HZ seconds */
> >
> > - ops.dispatch(); /* task->scx.slice can be refilled */
> > + if (task->scx.slice == 0)
> > + ops.dispatch(); /* task->scx.slice can be refilled */
> > }
> >
> > ops.stopping(); /* Task stops running (time slice expires or wait) */
>