[v2] sched/deadline: Reset dl_server execution state on stop

[PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Andrea Righi 2 weeks ago

dl_server_stop() can leave a deadline server in an inconsistent internal
state across stop/start transitions, causing it to bypass its required
deferral phase when restarted. This breaks the scheduler invariant that
a restarted server must re-establish eligibility before being allowed to
execute.

When the server is stopped (e.g., because the associated task blocks),
it's expected to transition back to an inactive, initial state. However,
dl_server_stop() does not fully reset the execution state. As a result,
the server can be logically inactive while still appearing to be
running.

When the server is restarted via dl_server_start(), the following
sequence occurs:
  1. dl_server_start() calls enqueue_dl_entity(ENQUEUE_WAKEUP),
  2. enqueue_dl_entity() calls update_dl_entity(),
  3. update_dl_entity() checks (!dl_se->dl_defer_running) to decide
     whether to arm the deferral mechanism,
  4. because dl_defer_running is stale, the check fails,
  5. dl_defer_armed and dl_throttled are not set,
  6. enqueue_dl_entity() skips start_dl_timer(), because
     dl_throttled == 0,
  7. the server is enqueued via __enqueue_dl_entity(),
  8. the scheduler picks the server to run,
  9. update_curr_dl_se() detects that the server has exhausted its
     runtime (or has negative runtime), as it wasn't properly
     replenished/deferred,
 10. the server is throttled (dl_throttled set to 1) and dequeued,
 11. the server repeatedly cycles through wakeup and throttling,
     effectively receiving no usable CPU bandwidth.
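
The restart sequence above can be sketched as a small user-space model.
This is only an illustration under stated assumptions, not the kernel
code: struct toy_dl_se, toy_enqueue_wakeup() and toy_update_curr() are
hypothetical names. They mirror only the flags mentioned in the steps
and leave out locking, the hrtimer and the actual run queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the few sched_dl_entity fields used in the steps above. */
struct toy_dl_se {
	int64_t runtime;       /* remaining budget in ns; negative when overrun */
	bool dl_defer;         /* server uses the deferral mechanism */
	bool dl_defer_running; /* server believes it is past the defer phase */
	bool dl_defer_armed;
	bool dl_throttled;
};

/*
 * Steps 2-6: on a wakeup enqueue, the deferral is armed only when
 * dl_defer_running is clear. Returns true if the timer would be started,
 * which in this model is the same as dl_throttled being set.
 */
static bool toy_enqueue_wakeup(struct toy_dl_se *se)
{
	if (se->dl_defer && !se->dl_defer_running) {
		se->dl_defer_armed = true;
		se->dl_throttled = true;
	}
	return se->dl_throttled;
}

/*
 * Steps 8-10: once picked, a server with no budget left is throttled.
 * Returns true if the server was throttled on this accounting update.
 */
static bool toy_update_curr(struct toy_dl_se *se)
{
	if (se->runtime <= 0) {
		se->dl_throttled = true;
		return true;
	}
	return false;
}
```

With dl_defer_running left at 1 and a negative runtime,
toy_enqueue_wakeup() returns false (no timer) and toy_update_curr()
throttles the server right away. With the flag cleared, the same enqueue
arms the deferral instead.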

This results in starvation of the tasks serviced by the deadline server
in the presence of competing RT workloads.

This issue can be confirmed by adding debugging traces, which show that
the server skips the deferral timer and is immediately throttled upon
execution with negative runtime:

 DEBUG: dl_server_start: dl_defer_running=1 active=0
 DEBUG: enqueue_dl_entity: flags=1 dl_throttled=0 dl_defer=1
 DEBUG: update_dl_entity: dl_defer_running=1
 DEBUG: enqueue_dl_entity: SKIPPING start_dl_timer! dl_throttled=0
 ...
 DEBUG: update_curr_dl_se: THROTTLED runtime=-954758

Fix this by properly resetting dl_defer_running in dl_server_stop(),
ensuring the server correctly enters the defer phase upon restart.

This issue is quite difficult to observe when only the fair server
is present, as the required stop/start patterns are relatively rare.
However, it becomes easier to trigger with an additional deadline server
with more frequent server lifecycle transitions (such as a sched_ext
deadline server).

This change is a prerequisite for introducing a sched_ext deadline
server, as it ensures correct and predictable behavior across server
stop/start cycles.

Link: https://lore.kernel.org/all/aXEMat4IoNnGYgxw@gpd4/
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
Changes in v2:
 - Update state machine documentation
 - Link to v1: https://lore.kernel.org/all/20260122140833.1655020-1-arighi@nvidia.com/

 kernel/sched/deadline.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c509f2e7d69de..e42867061ea77 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1615,7 +1615,7 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
  *   dl_server_active = 0
  *   dl_throttled = 0
  *   dl_defer_armed = 0
- *   dl_defer_running = 0/1
+ *   dl_defer_running = 0
  *   dl_defer_idle = 0
  *
  * [B] - zero_laxity-wait
@@ -1704,6 +1704,7 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
  *       hrtimer_try_to_cancel();
  *       dl_defer_armed = 0;
  *       dl_throttled = 0;
+ *       dl_defer_running = 0;
  *       dl_server_active = 0;
  *       // [A]
  *   return p;
@@ -1813,6 +1814,7 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 	hrtimer_try_to_cancel(&dl_se->dl_timer);
 	dl_se->dl_defer_armed = 0;
 	dl_se->dl_throttled = 0;
+	dl_se->dl_defer_running = 0;
 	dl_se->dl_defer_idle = 0;
 	dl_se->dl_server_active = 0;
 }
-- 
2.52.0

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Gabriele Monaco 1 week, 4 days ago

On Fri, 2026-01-23 at 17:16 +0100, Andrea Righi wrote:
> dl_server_stop() can leave a deadline server in an inconsistent internal
> state across stop/start transitions, causing it to bypass its required
> deferral phase when restarted. This breaks the scheduler invariant that
> a restarted server must re-establish eligibility before being allowed to
> execute.
> 
> When the server is stopped (e.g., because the associated task blocks),
> it's expected to transition back to an inactive, initial state. However,
> dl_server_stop() does not fully reset the execution state. As a result,
> the server can be logically inactive while still appearing as if it was
> still running.
> 
> When the server is restarted via dl_server_start(), the following
> sequence occurs:
>   1. dl_server_start() calls enqueue_dl_entity(ENQUEUE_WAKEUP),
>   2. enqueue_dl_entity() calls update_dl_entity(),
>   3. update_dl_entity() checks (!dl_se->dl_defer_running) to decide
>      whether to arm the deferral mechanism,
>   4. because dl_defer_running is stale, the check fails,
>   5. dl_defer_armed and dl_throttled are not set,
>   6. enqueue_dl_entity() skips start_dl_timer(), because
>      dl_throttled == 0,
>   7. the server is enqueued via __enqueue_dl_entity(),
>   8. the scheduler picks the server to run,
>   9. update_curr_dl_se() detects that the server has exhausted its
>      runtime (or has negative runtime), as it wasn't properly
>      replenished/deferred,
>  10. the server is throttled (dl_throttled set to 1) and dequeued,
>  11. the server repeatedly cycles through wakeup and throttling,
>      effectively receiving no usable CPU bandwidth.

Hello,

I remember wondering why defer_running was kept after stop and Peter suggested
it's to avoid penalising tasks with short sleeps. [1]

Clearing defer_running on stop is in fact removing the edge from A:init to
D:running , isn't it? The server should be able to start as running and not only
deferred (dl_defer_armed and dl_throttled set).

In the sequence you described above, I wonder why the enqueue is never
replenishing. As far as I understand the runtime should remain <= 0 only as long
as the enqueue occurs before the deadline, after that it should simply replenish
a new period (pushing deadline and restoring runtime).

What am I missing here?

Thanks,
Gabriele

[1] -
https://lore.kernel.org/lkml/20251111111716.GL278048@noisy.programming.kicks-ass.net


Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Andrea Righi 1 week, 4 days ago

Hi Gabriele,

On Mon, Jan 26, 2026 at 03:20:12PM +0100, Gabriele Monaco wrote:
> On Fri, 2026-01-23 at 17:16 +0100, Andrea Righi wrote:
> > dl_server_stop() can leave a deadline server in an inconsistent internal
> > state across stop/start transitions, causing it to bypass its required
> > deferral phase when restarted. This breaks the scheduler invariant that
> > a restarted server must re-establish eligibility before being allowed to
> > execute.
> > 
> > When the server is stopped (e.g., because the associated task blocks),
> > it's expected to transition back to an inactive, initial state. However,
> > dl_server_stop() does not fully reset the execution state. As a result,
> > the server can be logically inactive while still appearing as if it was
> > still running.
> > 
> > When the server is restarted via dl_server_start(), the following
> > sequence occurs:
> >   1. dl_server_start() calls enqueue_dl_entity(ENQUEUE_WAKEUP),
> >   2. enqueue_dl_entity() calls update_dl_entity(),
> >   3. update_dl_entity() checks (!dl_se->dl_defer_running) to decide
> >      whether to arm the deferral mechanism,
> >   4. because dl_defer_running is stale, the check fails,
> >   5. dl_defer_armed and dl_throttled are not set,
> >   6. enqueue_dl_entity() skips start_dl_timer(), because
> >      dl_throttled == 0,
> >   7. the server is enqueued via __enqueue_dl_entity(),
> >   8. the scheduler picks the server to run,
> >   9. update_curr_dl_se() detects that the server has exhausted its
> >      runtime (or has negative runtime), as it wasn't properly
> >      replenished/deferred,
> >  10. the server is throttled (dl_throttled set to 1) and dequeued,
> >  11. the server repeatedly cycles through wakeup and throttling,
> >      effectively receiving no usable CPU bandwidth.
> 
> Hello,
> 
> I remember wondering why defer_running was kept after stop and Peter suggested
> it's to avoid penalising tasks with short sleeps. [1]

Correct, dl_defer_running was preserved across stop/start to avoid
penalizing very short sleeps. IIUC what Peter explained, this optimization
relies on the assumption that the server is stopped while its execution
context is still coherent (the remaining runtime is still usable and the
deadline has not yet expired), so that the server can resume execution
immediately instead of re-entering the full defer / zero-laxity path.

> 
> Clearing defer_running on stop is in fact removing the edge from A:init to
> D:running , isn't it? The server should be able to start as running and not only
> deferred (dl_defer_armed and dl_throttled set).

Yes, that's true in general and preserving the A:init -> D:running
transition is desirable for short sleeps. However, it's only valid as long
as the execution context is still coherent. In the failing case that I'm
experiencing, the server restarts with exhausted runtime and no
deferral/replenishment timer pending, so starting directly in D:running is
no longer a valid transition and breaks the state machine.

Maybe a way to preserve the short-sleep optimization without breaking the
state machine could be to retain dl_defer_running across stop/start only
when the execution context is still coherent (i.e., positive runtime and
deadline not expired). Otherwise clear it, so the server cleanly re-enters
the deferral/replenishment path.

> 
> In the sequence you described above, I wonder why the enqueue is never
> replenishing. As far as I understand the runtime should remain <= 0 only as long
> as the enqueue occurs before the deadline, after that it should simply replenish
> a new period (pushing deadline and restoring runtime).
> 
> What am I missing here?

Replenishment is not triggered directly by enqueueing, but by the
deferral/replenishment timer. In this case the timer is never armed: stale
dl_defer_running makes the enqueue path believe the server is already in
the running phase, which suppresses deferral arming, causing
start_dl_timer() to be skipped.

Thanks,
-Andrea


Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Gabriele Monaco 1 week, 4 days ago

2026-01-26T16:30:45Z Andrea Righi <arighi@nvidia.com>:

> Hi Gabriele,
>
> On Mon, Jan 26, 2026 at 03:20:12PM +0100, Gabriele Monaco wrote:

>> In the sequence you described above, I wonder why the enqueue is never
>> replenishing. As far as I understand the runtime should remain <= 0 only as long
>> as the enqueue occurs before the deadline, after that it should simply replenish
>> a new period (pushing deadline and restoring runtime).
>>
>> What am I missing here?
>
> Replenishment is not triggered directly by enqueueing, but by the
> deferral/replenishment timer. In this case the timer is never armed: stale
> dl_defer_running makes the enqueue path believe the server is already in
> the running phase, which suppresses deferral arming, causing
> start_dl_timer() to be skipped.
>

Hi Andrea,

thanks for the clarification, but I think I observed the enqueue/dl_server_start replenishing a new period when running.

Something like:
dl_server_start()
  enqueue_dl_entity(ENQUEUE_WAKEUP)
    update_dl_entity()
      replenish_dl_new_period()

should happen if the deadline is in the past, unless I'm missing some condition down the road.

Still, if it starts before the deadline, the server is going to get throttled as you observed. Perhaps, since the CPU isn't idle in your tests, we don't stop the server after that dequeue, and then we never replenish after the deadline (because we never start and, as you mentioned, the timer is not armed).

Can this be what you're observing?

Thanks,
Gabriele
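
Editorial sketch of the replenish-on-wakeup case described in the message
above, as a user-space model. The names toy_dl_se, toy_time_before() and
toy_update_on_wakeup() are hypothetical, and the model deliberately
ignores the bandwidth-overflow check and the deferral flags. It only
shows that a new period is started when the old deadline is already in
the past, and that nothing is replenished otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_dl_se {
	int64_t  runtime;     /* remaining budget; negative when overrun */
	uint64_t deadline;    /* absolute deadline of the current period */
	uint64_t dl_runtime;  /* budget granted per period */
	uint64_t dl_deadline; /* relative deadline */
};

/* Wraparound-safe "a is earlier than b" on unsigned 64-bit timestamps. */
static bool toy_time_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/*
 * Wakeup-time update: if the deadline has passed, start a new period
 * (push the deadline, restore the runtime). Returns true if it did.
 */
static bool toy_update_on_wakeup(struct toy_dl_se *se, uint64_t now)
{
	if (!toy_time_before(se->deadline, now))
		return false; /* old deadline still usable, keep old runtime */

	se->deadline = now + se->dl_deadline;
	se->runtime = (int64_t)se->dl_runtime;
	return true;
}
```

In this model a wakeup before the deadline keeps the exhausted runtime
untouched, which is the window in which the throttling seen in the
traces can happen.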



Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Andrea Righi 1 week, 4 days ago

On Mon, Jan 26, 2026 at 04:56:52PM +0000, Gabriele Monaco wrote:
> 2026-01-26T16:30:45Z Andrea Righi <arighi@nvidia.com>:
> 
> > Hi Gabriele,
> >
> > On Mon, Jan 26, 2026 at 03:20:12PM +0100, Gabriele Monaco wrote:
> 
> >> In the sequence you described above, I wonder why the enqueue is never
> >> replenishing. As far as I understand the runtime should remain <= 0 only as long
> >> as the enqueue occurs before the deadline, after that it should simply replenish
> >> a new period (pushing deadline and restoring runtime).
> >>
> >> What am I missing here?
> >
> > Replenishment is not triggered directly by enqueueing, but by the
> > deferral/replenishment timer. In this case the timer is never armed: stale
> > dl_defer_running makes the enqueue path believe the server is already in
> > the running phase, which suppresses deferral arming, causing
> > start_dl_timer() to be skipped.
> >
> 
> Hi Andrea,
> 
> thanks for the clarification, but I think I observed the enqueue/dl_server_start replenishing a new period when running.
> 
> Something like:
> dl_server_start()
>   enqueue_dl_entity(ENQUEUE_WAKEUP)
>     update_dl_entity()
>       replenish_dl_new_period()
> 
> should happen if the deadline is in the past, unless I'm missing some condition down the road.
> 
> Still if it starts before the deadline, the server is going to get throttled as you observed, and perhaps since in your tests the CPU isn't idle, we don't stop the server after that dequeue and then we never replenish after the deadline (because we never start and as you mentioned, the timer is not armed).
> 
> Can this be what you're observing?

Yes, I think it matches what I'm observing.

In my case the server is (re)started before the deadline, so it immediately
runs with exhausted runtime, gets throttled, and is dequeued. Since the CPU
isn't idle, we don't hit a path that would stop the server cleanly and
reset its execution state.

At that point, because dl_defer_running is still set, the restart path
assumes the server is already in the running phase and skips arming the
deferral/replenishment timer. Therefore, once the deadline passes there is
no remaining trigger to replenish a new period and the server gets stuck in
a throttled-but-running state.

Thanks,
-Andrea

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Gabriele Monaco 1 week, 4 days ago

2026-01-26T21:27:11Z Andrea Righi <arighi@nvidia.com>:
> On Mon, Jan 26, 2026 at 04:56:52PM +0000, Gabriele Monaco wrote:
>> Still if it starts before the deadline, the server is going to get throttled as you observed, and perhaps since in your tests the CPU isn't idle, we don't stop the server after that dequeue and then we never replenish after the deadline (because we never start and as you mentioned, the timer is not armed).
>>
>> Can this be what you're observing?
>
> Yes, I think it matches what I'm observing.
>
> In my case the server is (re)started before the deadline, so it immediately
> runs with exhausted runtime, gets throttled, and is dequeued. Since the CPU
> isn't idle, we don't hit a path that would stop the server cleanly and
> reset its execution state.
>
> At that point, because dl_defer_running is still set, the restart path
> assumes the server is already in the running phase and skips arming the
> deferral/replenishment timer. Therefore, once the deadline passes there is
> no remaining trigger to replenish a new period and the server gets stuck in
> a throttled-but-running state.
>

Alright, thanks. I believe your fix would work even if you reset defer_running only when the runtime is exhausted.

This way we'd still keep some of the benefit of the start-running sequence if fair/scx tasks sleep and wake up again while the server still has runtime.

We could even keep defer_running as it is and mark the server as defer_armed (with laxity timer and stuff) only if it starts in this exact condition (runtime = 0 and deadline not expired). But this may just be overly complex for little benefit.

What do you think?

Thanks,
Gabriele

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Andrea Righi 1 week, 3 days ago

On Tue, Jan 27, 2026 at 08:52:48AM +0000, Gabriele Monaco wrote:
> 2026-01-26T21:27:11Z Andrea Righi <arighi@nvidia.com>:
> > On Mon, Jan 26, 2026 at 04:56:52PM +0000, Gabriele Monaco wrote:
> >> Still if it starts before the deadline, the server is going to get throttled as you observed, and perhaps since in your tests the CPU isn't idle, we don't stop the server after that dequeue and then we never replenish after the deadline (because we never start and as you mentioned, the timer is not armed).
> >>
> >> Can this be what you're observing?
> >
> > Yes, I think it matches what I'm observing.
> >
> > In my case the server is (re)started before the deadline, so it immediately
> > runs with exhausted runtime, gets throttled, and is dequeued. Since the CPU
> > isn't idle, we don't hit a path that would stop the server cleanly and
> > reset its execution state.
> >
> > At that point, because dl_defer_running is still set, the restart path
> > assumes the server is already in the running phase and skips arming the
> > deferral/replenishment timer. Therefore, once the deadline passes there is
> > no remaining trigger to replenish a new period and the server gets stuck in
> > a throttled-but-running state.
> >
> 
> Alright thanks. I believe your fix would work even if you reset the defer_running only when the runtime is exhausted.
> 
> This way we'd still keep a bit of benefits of the start-running sequence if fair/scx tasks sleep and run back when the server still has runtime.
> 
> We could even keep the defer_running as it is and mark the server as defer_armed (with laxity timer and stuff) only if it starts in this exact condition (runtime = 0 and deadline not expired). But this may just be overly complex for little benefit.
> 
> What do you think?

I think my case should also work with something like this (I'll run some
tests later to double check):

	if (dl_se->runtime <= 0)
		dl_se->dl_defer_running = 0;

In this way:
 - short sleep + remaining runtime > 0
   - dl_defer_running stays set
   - restart can go A->D directly
   - no extra defer / zero-laxity penalty

 - stop with exhausted (or negative) runtime
   - dl_defer_running is cleared
   - restart must re-establish eligibility
   - deferral / timer is armed again
   - no stale "already running" server

However, I think the right assumption should be that both runtime **and**
deadline are still coherent, so we should probably do something like this
to be fully correct:

	if (dl_se->runtime <= 0 ||
	    dl_time_before(dl_se->deadline, rq_clock(dl_se->rq)))
		dl_se->dl_defer_running = 0;

This makes the stop path slightly more complex, so I'm not sure whether
it's preferable to go in this direction or just unconditionally clearing
dl_defer_running, which is simpler and more explicit from a state-machine
point of view.

Which one do we prefer? Happy to go with whatever approach you think makes
more sense.

Thanks,
-Andrea
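
Editorial sketch comparing the three stop-time policies discussed in the
message above: unconditional clear, clear on exhausted runtime, and clear
on exhausted runtime or expired deadline. It is a user-space model under
stated assumptions. toy_dl_se, toy_stop_policy and toy_server_stop() are
hypothetical names, and the timer cancellation and dl_defer_idle handling
of the real stop path are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_dl_se {
	int64_t  runtime;
	uint64_t deadline;
	bool dl_defer_running;
	bool dl_defer_armed;
	bool dl_throttled;
	bool dl_server_active;
};

enum toy_stop_policy {
	TOY_CLEAR_ALWAYS,        /* unconditional reset, as in the patch */
	TOY_CLEAR_IF_NO_RUNTIME, /* runtime <= 0 */
	TOY_CLEAR_IF_INCOHERENT, /* runtime <= 0 or deadline already passed */
};

/* Wraparound-safe "a is earlier than b". */
static bool toy_time_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

static void toy_server_stop(struct toy_dl_se *se, uint64_t now,
			    enum toy_stop_policy policy)
{
	bool no_runtime = se->runtime <= 0;
	bool expired = toy_time_before(se->deadline, now);

	se->dl_defer_armed = false;
	se->dl_throttled = false;

	if (policy == TOY_CLEAR_ALWAYS ||
	    (policy == TOY_CLEAR_IF_NO_RUNTIME && no_runtime) ||
	    (policy == TOY_CLEAR_IF_INCOHERENT && (no_runtime || expired)))
		se->dl_defer_running = false;

	se->dl_server_active = false;
}
```

The model only shows which stop-time states keep the A->D shortcut under
each policy. It says nothing about what happens between stop and the next
start.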

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Gabriele Monaco 1 week, 3 days ago

2026-01-27T14:18:29Z Andrea Righi <arighi@nvidia.com>:
> I think my case should work also doing something like this (I'll run some
> tests later to double check):
>
>     if (dl_se->runtime <= 0)
>         dl_se->dl_defer_running = 0;
>
> In this way:
> - short sleep + remaining runtime > 0
>    - dl_defer_running stays set
>    - restart can go A->D directly
>    - no extra defer / zero-laxity penalty
>
> - stop with exhausted (or negative) runtime
>    - dl_defer_running is cleared
>    - restart must re-establish eligibility
>    - deferral / timer is armed again
>    - no stale "already running" server

Yeah, that looks like the neatest option to me.
Fair tasks are a bit more penalised than now but won't be if they really sleep before consuming the runtime, which I think was the whole point of this logic.

> However, I think the right assumption should be that both runtime **and**
> deadline are still coherent, so we should probably do something like this
> to be fully correct:
>
>     if (dl_se->runtime <= 0 ||
>         dl_time_before(dl_se->deadline, rq_clock(dl_se->rq)))
>         dl_se->dl_defer_running = 0;
>
> This makes the stop path slightly more complex, so I'm not sure whether
> it's preferable to go in this direction or just unconditionally clearing
> dl_defer_running, which is simpler and more explicit from a state-machine
> point of view.

I think it's trickier than that. The state machine is coherent as long as the server is restarted after the deadline, no matter when it was stopped.

This check should probably be done at some point during dl_server_start().
But yeah, it's probably overkill.

Thanks,
Gabriele
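
Editorial sketch of one possible reading of the start-time check
mentioned in the message above. Nobody in the thread posted this code.
toy_dl_se and toy_start_fixup() are hypothetical names, and the model
assumes that clearing dl_defer_running at start is enough for the
existing enqueue logic to arm the deferral.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_dl_se {
	int64_t  runtime;
	uint64_t deadline;
	bool dl_defer_running;
};

/* Wraparound-safe "a is earlier than b". */
static bool toy_time_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/*
 * At start time: if the server claims to be running but has no budget and
 * its deadline has not passed yet, drop the stale flag so the deferral path
 * is taken. If the deadline has passed, leave the flag alone, since a new
 * period would be started anyway. Returns true if the flag was cleared.
 */
static bool toy_start_fixup(struct toy_dl_se *se, uint64_t now)
{
	bool expired = toy_time_before(se->deadline, now);

	if (se->dl_defer_running && se->runtime <= 0 && !expired) {
		se->dl_defer_running = false;
		return true;
	}
	return false;
}
```

Doing the check at start rather than at stop means the decision uses the
clock value at restart, which is the time the deadline comparison in the
message above refers to.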

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Andrea Righi 1 week, 3 days ago

On Tue, Jan 27, 2026 at 04:00:31PM +0000, Gabriele Monaco wrote:
> 2026-01-27T14:18:29Z Andrea Righi <arighi@nvidia.com>:
> > I think my case should work also doing something like this (I'll run some
> > tests later to double check):
> >
> >     if (dl_se->runtime <= 0)
> >         dl_se->dl_defer_running = 0;
> >
> > In this way:
> > - short sleep + remaining runtime > 0
> >    - dl_defer_running stays set
> >    - restart can go A->D directly
> >    - no extra defer / zero-laxity penalty
> >
> > - stop with exhausted (or negative) runtime
> >    - dl_defer_running is cleared
> >    - restart must re-establish eligibility
> >    - deferral / timer is armed again
> >    - no stale "already running" server
> 
> Yeah that looks like the neatest to me.
> Fair tasks are a bit more penalised than now but won't be if they really sleep before consuming the runtime, which I think was the whole point of this logic.

Unfortunately checking only runtime <= 0 isn't enough for the sched_ext DL
server case:

 # Runtime of EXT task (PID 2025) is 0.000000 seconds
 # Runtime of RT task (PID 2026) is 4.990000 seconds
 # EXT task got 0.00% of total runtime
 not ok 2 FAIL: EXT task got less than 4.00% of runtime

With the unconditional reset the EXT task gets 5% of the bandwidth. I'll
add some debugging to figure out exactly what is happening.

-Andrea

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Gabriele Monaco 1 week, 3 days ago

2026-01-27T18:55:02Z Andrea Righi <arighi@nvidia.com>:
> Unfortunately checking only runtime <= 0 isn't enough for the sched_ext DL
> server case:
>
> # Runtime of EXT task (PID 2025) is 0.000000 seconds
> # Runtime of RT task (PID 2026) is 4.990000 seconds
> # EXT task got 0.00% of total runtime
> not ok 2 FAIL: EXT task got less than 4.00% of runtime
>
> With the unconditional reset the EXT task gets 5% of the bandwidth. I'll
> add some debugging to figure out exactly what is happening.

Thanks for testing it. That's quite strange..

I ran your test on a kernel without the ext server. As far as I understand, the test also indirectly checks the fair server, and that does not fail, right?
At least that's what I get on an arm64 machine with 128 CPUs.

After letting the test continue on failure I get:

# # Runtime of FAIR task (PID 22503) is 0.240000 seconds
# # Runtime of RT task (PID 22504) is 4.750000 seconds
# # FAIR task got 4.81% of total runtime
# ok 1 PASS: FAIR task got more than 4.00% of runtime
# TAP version 13
# 1..1
# # Runtime of EXT task (PID 22511) is 0.020000 seconds
# # Runtime of RT task (PID 22512) is 4.970000 seconds
# # EXT task got 0.40% of total runtime
# not ok 2 FAIL: EXT task got less than 4.00% of runtime
# TAP version 13
# 1..1
# # Runtime of FAIR task (PID 22518) is 0.240000 seconds
# # Runtime of RT task (PID 22519) is 4.750000 seconds
# # FAIR task got 4.81% of total runtime
# ok 3 PASS: FAIR task got more than 4.00% of runtime
# TAP version 13
# 1..1
# # Runtime of EXT task (PID 22525) is 0.000000 seconds
# # Runtime of RT task (PID 22526) is 4.990000 seconds
# # EXT task got 0.00% of total runtime
# not ok 4 FAIL: EXT task got less than 4.00% of runtime
# ok 24 rt_stall #

Mind that it's expected for the ext task to starve (I didn't apply the patches enabling the server).

After adding all your patches [1], the ext task also passes the test (i.e. gets boosted just fine).

I tried disabling all CPUs but CPU0 and ran the same test, and it hung (bad sign). Then I also enabled CPU1 (2 CPUs online in total) and again I see both fair and ext getting their share.

What am I missing here?

Thanks,
Gabriele

[1] - https://lore.kernel.org/lkml/20260126100050.3854740-1-arighi@nvidia.com

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Andrea Righi 1 week, 2 days ago

On Wed, Jan 28, 2026 at 09:50:36AM +0000, Gabriele Monaco wrote:
> 2026-01-27T18:55:02Z Andrea Righi <arighi@nvidia.com>:
> > Unfortunately checking only runtime <= 0 isn't enough for the sched_ext DL
> > server case:
> >
> > # Runtime of EXT task (PID 2025) is 0.000000 seconds
> > # Runtime of RT task (PID 2026) is 4.990000 seconds
> > # EXT task got 0.00% of total runtime
> > not ok 2 FAIL: EXT task got less than 4.00% of runtime
> >
> > With the unconditional reset the EXT task gets 5% of the bandwidth. I'll
> > add some debugging to figure out exactly what is happening.
> 
> Thanks for testing it. That's quite strange..
> 
> I ran your test on a kernel without the ext server. As far as I understand, the test also indirectly checks the fair server, and that does not fail, right?
> At least that's what I get on an arm64 machine with 128 CPUs.

That's right, without the ext server, the EXT task is supposed to starve.

And yes, the test is also checking the fair server, just to make sure we're
not breaking anything in fair while we're loading / unloading the BPF
scheduler.

> 
> After letting the test continue on failure I get:
> 
> # # Runtime of FAIR task (PID 22503) is 0.240000 seconds
> # # Runtime of RT task (PID 22504) is 4.750000 seconds
> # # FAIR task got 4.81% of total runtime
> # ok 1 PASS: FAIR task got more than 4.00% of runtime
> # TAP version 13
> # 1..1
> # # Runtime of EXT task (PID 22511) is 0.020000 seconds
> # # Runtime of RT task (PID 22512) is 4.970000 seconds
> # # EXT task got 0.40% of total runtime
> # not ok 2 FAIL: EXT task got less than 4.00% of runtime
> # TAP version 13
> # 1..1
> # # Runtime of FAIR task (PID 22518) is 0.240000 seconds
> # # Runtime of RT task (PID 22519) is 4.750000 seconds
> # # FAIR task got 4.81% of total runtime
> # ok 3 PASS: FAIR task got more than 4.00% of runtime
> # TAP version 13
> # 1..1
> # # Runtime of EXT task (PID 22525) is 0.000000 seconds
> # # Runtime of RT task (PID 22526) is 4.990000 seconds
> # # EXT task got 0.00% of total runtime
> # not ok 4 FAIL: EXT task got less than 4.00% of runtime
> # ok 24 rt_stall #
> 
> Mind that it's expected for the ext task to starve (I didn't apply the patches enabling the server).

Correct, this makes sense.

> 
> After adding all your patches [1], the ext task also passes the test (i.e. gets boosted just fine).
> 
> I tried disabling all CPUs but CPU0 and ran the same test, and it hung (bad sign). Then I also enabled CPU1 (2 CPUs online in total) and again I see both fair and ext getting their share.
> 
> What am I missing here?

With the ext server patchset [1], without resetting dl_defer_running in
dl_server_stop() I get this:

$ sudo ./runner -t rt_stall
===== START =====
TEST: rt_stall
DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
OUTPUT:
TAP version 13
1..1
# Runtime of FAIR task (PID 1993) is 0.250000 seconds
# Runtime of RT task (PID 1994) is 4.740000 seconds
# FAIR task got 5.01% of total runtime
ok 1 PASS: FAIR task got more than 4.00% of runtime
[   28.494515] sched_ext: BPF scheduler "rt_stall" enabled
TAP version 13
1..1
# Runtime of EXT task (PID 1996) is 0.240000 seconds
# Runtime of RT task (PID 1997) is 4.740000 seconds
# EXT task got 4.82% of total runtime
ok 2 PASS: EXT task got more than 4.00% of runtime
[   33.538466] sched_ext: BPF scheduler "rt_stall" disabled (unregistered from user space)
TAP version 13
1..1
# Runtime of FAIR task (PID 1999) is 0.000000 seconds
# Runtime of RT task (PID 2000) is 4.990000 seconds
# FAIR task got 0.00% of total runtime
not ok 3 FAIL: FAIR task got less than 4.00% of runtime
# Planned tests != run tests (1 != 3)
# Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0

The fair server works, ext server works, but once the ext server is
unloaded, the fair server is broken.

If I apply the fix to reset dl_defer_running in dl_server_stop() on top of
[1], the test is passing:

$ sudo ./runner -t rt_stall
===== START =====
TEST: rt_stall
DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
OUTPUT:
TAP version 13
1..1
# Runtime of FAIR task (PID 1965) is 0.240000 seconds
# Runtime of RT task (PID 1966) is 4.740000 seconds
# FAIR task got 4.82% of total runtime
ok 1 PASS: FAIR task got more than 4.00% of runtime
[   25.307989] sched_ext: BPF scheduler "rt_stall" enabled
TAP version 13
1..1
[   26.257788] hrtimer: interrupt took 112519 ns
# Runtime of EXT task (PID 1968) is 0.250000 seconds
# Runtime of RT task (PID 1969) is 4.750000 seconds
# EXT task got 5.00% of total runtime
ok 2 PASS: EXT task got more than 4.00% of runtime
[   30.344700] sched_ext: BPF scheduler "rt_stall" disabled (unregistered from user space)
TAP version 13
1..1
# Runtime of FAIR task (PID 1971) is 0.250000 seconds
# Runtime of RT task (PID 1972) is 4.750000 seconds
# FAIR task got 5.00% of total runtime
ok 3 PASS: FAIR task got more than 4.00% of runtime
[   35.373585] sched_ext: BPF scheduler "rt_stall" enabled
TAP version 13
1..1
# Runtime of EXT task (PID 1975) is 0.240000 seconds
# Runtime of RT task (PID 1976) is 4.740000 seconds
# EXT task got 4.82% of total runtime
ok 4 PASS: EXT task got more than 4.00% of runtime
[   40.403697] sched_ext: BPF scheduler "rt_stall" disabled (unregistered from user space)
ok 1 rt_stall #
=====  END  =====


=============================

RESULTS:

PASSED:  1
SKIPPED: 0
FAILED:  0

> 
> Thanks,
> Gabriele
> 
> [1] - https://lore.kernel.org/lkml/20260126100050.3854740-1-arighi@nvidia.com
> 

Just to make sure we're testing the same thing, I'm currently using
https://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git, branch
scx-dl-server.

I'm running this test inside virtme-ng:
  $ vng -vb --config tools/testing/selftests/sched_ext/config
  $ vng -v -- tools/testing/selftests/sched_ext/runner -t rt_stall

Thanks,
-Andrea
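
As a side note for anyone reading the logs, the percentages printed by rt_stall are consistent with "task runtime divided by task plus RT runtime". The sketch below reproduces that arithmetic; it is not the selftest source, the helper names are hypothetical, and treating exactly 4.00% as a failure is an assumption (the output only says "more than" and "less than").

```c
#include <stdbool.h>

/*
 * Hypothetical helper: share of the combined runtime obtained by the
 * FAIR/EXT task, in percent. Returns 0 when nothing ran at all.
 */
static double runtime_share_pct(double task_sec, double rt_sec)
{
	double total = task_sec + rt_sec;

	return total > 0.0 ? 100.0 * task_sec / total : 0.0;
}

/*
 * 4.00% is the threshold printed in the test output above; the strict
 * comparison for the boundary case is an assumption.
 */
static bool share_passes(double task_sec, double rt_sec)
{
	return runtime_share_pct(task_sec, rt_sec) > 4.0;
}
```

For example, 0.24 s against 4.74 s of RT runtime gives roughly 4.82% and passes, while 0.00 s against 4.99 s gives 0% and fails, matching the lines in the logs.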

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Peter Zijlstra 1 week ago

On Wed, Jan 28, 2026 at 02:41:40PM +0100, Andrea Righi wrote:

> Just to make sure we're testing the same thing, I'm currently using
> https://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git, branch
> scx-dl-server.
> 
> I'm running this test inside virtme-ng:
>   $ vng -vb --config tools/testing/selftests/sched_ext/config
>   $ vng -v -- tools/testing/selftests/sched_ext/runner -t rt_stall

Apparently you also have to actually have that runner thing built from
that tree.

Anyway, all I seem to be able to get (on x86) is PASS: 1 :/

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Peter Zijlstra 1 week ago

On Fri, Jan 30, 2026 at 01:24:13PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 28, 2026 at 02:41:40PM +0100, Andrea Righi wrote:
> 
> > Just to make sure we're testing the same thing, I'm currently using
> > https://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git, branch
> > scx-dl-server.
> > 
> > I'm running this test inside virtme-ng:
> >   $ vng -vb --config tools/testing/selftests/sched_ext/config
> >   $ vng -v -- tools/testing/selftests/sched_ext/runner -t rt_stall
> 
> Apparently you also have to actually have that runner thing built from
> that tree.
> 
> Anyway, all I seem to be able to get (on x86) is PASS: 1 :/

Argh, that tree has the dodgy 'fix' in. Let me go revert that.

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Peter Zijlstra 1 week ago

On Fri, Jan 30, 2026 at 01:26:20PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 30, 2026 at 01:24:13PM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 28, 2026 at 02:41:40PM +0100, Andrea Righi wrote:
> > 
> > > Just to make sure we're testing the same thing, I'm currently using
> > > https://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git, branch
> > > scx-dl-server.
> > > 
> > > I'm running this test inside virtme-ng:
> > >   $ vng -vb --config tools/testing/selftests/sched_ext/config
> > >   $ vng -v -- tools/testing/selftests/sched_ext/runner -t rt_stall
> > 
> > Apparently you also have to actually have that runner thing built from
> > that tree.
> > 
> > Anyway, all I seem to be able to get (on x86) is PASS: 1 :/
> 
> Argh, that tree has the dodgy 'fix' in. Let me go revert that.

This seems to work?

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 80c9559a3e30..aa3da4d3b8e3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1036,6 +1036,12 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 			return;
 		}
 
+		/*
+		 * When [4] D->A is followed by [1] A->B, dl_defer_running
+		 * needs to be cleared, otherwise it will fail to properly
+		 * start the zero-laxity timer.
+		 */
+		dl_se->dl_defer_running = 0;
 		replenish_dl_new_period(dl_se, rq);
 	} else if (dl_server(dl_se) && dl_se->dl_defer) {
 		/*
@@ -1654,6 +1660,12 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
  *   dl_server_active = 1;
  *   enqueue_dl_entity()
  *     update_dl_entity(WAKEUP)
+ *       if (dl_time_before() || dl_entity_overflow)
+ *         dl_defer_running = 0;
+ *         replenish_dl_new_period();
+ *           // fwd period
+ *           dl_throttled = 1;
+ *           dl_defer_armed = 1;
  *       if (!dl_defer_running)
  *         dl_defer_armed = 1;
  *         dl_throttled = 1;

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Andrea Righi 1 week ago

Hi Peter,

On Fri, Jan 30, 2026 at 01:41:00PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 30, 2026 at 01:26:20PM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 30, 2026 at 01:24:13PM +0100, Peter Zijlstra wrote:
> > > On Wed, Jan 28, 2026 at 02:41:40PM +0100, Andrea Righi wrote:
> > > 
> > > > Just to make sure we're testing the same thing, I'm currently using
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git, branch
> > > > scx-dl-server.
> > > > 
> > > > I'm running this test inside virtme-ng:
> > > >   $ vng -vb --config tools/testing/selftests/sched_ext/config
> > > >   $ vng -v -- tools/testing/selftests/sched_ext/runner -t rt_stall
> > > 
> > > Apparently you also have to actually have that runner thing built from
> > > that tree.
> > > 
> > > Anyway, all I seem to be able to get (on x86) is PASS: 1 :/
> > 
> > Argh, that tree has the dodgy 'fix' in. Let me go revert that.
> 
> This seems to work?

Great! Makes sense to me, I re-ran all my stress tests and everything looks
good on my side. FWIW,

Tested-by: Andrea Righi <arighi@nvidia.com>

Can we route this through your branch / want me to send a new patch?

Thanks!
-Andrea

> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 80c9559a3e30..aa3da4d3b8e3 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1036,6 +1036,12 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
>  			return;
>  		}
>  
> +		/*
> +		 * When [4] D->A is followed by [1] A->B, dl_defer_running
> +		 * needs to be cleared, otherwise it will fail to properly
> +		 * start the zero-laxity timer.
> +		 */
> +		dl_se->dl_defer_running = 0;
>  		replenish_dl_new_period(dl_se, rq);
>  	} else if (dl_server(dl_se) && dl_se->dl_defer) {
>  		/*
> @@ -1654,6 +1660,12 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
>   *   dl_server_active = 1;
>   *   enqueue_dl_entity()
>   *     update_dl_entity(WAKEUP)
> + *       if (dl_time_before() || dl_entity_overflow)
> + *         dl_defer_running = 0;
> + *         replenish_dl_new_period();
> + *           // fwd period
> + *           dl_throttled = 1;
> + *           dl_defer_armed = 1;
>   *       if (!dl_defer_running)
>   *         dl_defer_armed = 1;
>   *         dl_throttled = 1;

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Peter Zijlstra 1 week ago

On Fri, Jan 30, 2026 at 05:25:56PM +0100, Andrea Righi wrote:

> > This seems to work?
> 
> Great! Makes sense to me, I re-ran all my stress tests and everything looks
> good on my side. FWIW,
> 
> Tested-by: Andrea Righi <arighi@nvidia.com>

Excellent!

> Can we route this through your branch / want me to send a new patch?

Yeah, I'll write a Changelog tonight somewhere and stuff it in tip.

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Andrea Righi 1 week ago

On Fri, Jan 30, 2026 at 05:40:32PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 30, 2026 at 05:25:56PM +0100, Andrea Righi wrote:
> 
> > > This seems to work?
> > 
> > Great! Makes sense to me, I re-ran all my stress tests and everything looks
> > good on my side. FWIW,
> > 
> > Tested-by: Andrea Righi <arighi@nvidia.com>
> 
> Excellent!
> 
> > Can we route this through your branch / want me to send a new patch?
> 
> Yeah, I'll write a Changelog tonight somewhere and stuff it in tip.

Awesome, thanks!

-Andrea

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Juri Lelli 1 week ago

On 30/01/26 13:41, Peter Zijlstra wrote:
> On Fri, Jan 30, 2026 at 01:26:20PM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 30, 2026 at 01:24:13PM +0100, Peter Zijlstra wrote:
> > > On Wed, Jan 28, 2026 at 02:41:40PM +0100, Andrea Righi wrote:
> > > 
> > > > Just to make sure we're testing the same thing, I'm currently using
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git, branch
> > > > scx-dl-server.
> > > > 
> > > > I'm running this test inside virtme-ng:
> > > >   $ vng -vb --config tools/testing/selftests/sched_ext/config
> > > >   $ vng -v -- tools/testing/selftests/sched_ext/runner -t rt_stall
> > > 
> > > Apparently you also have to actually have that runner thing built from
> > > that tree.
> > > 
> > > Anyway, all I seem to be able to get (on x86) is PASS: 1 :/
> > 
> > Argh, that tree has the dodgy 'fix' in. Let me go revert that.
> 
> This seems to work?

Makes sense to me. It also handles the CBS wakeup rule nicely.

Thanks,
Juri

[tip: sched/urgent] sched/deadline: Fix 'stuck' dl_server

Posted by tip-bot2 for Peter Zijlstra 1 week ago

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     115135422562e2f791e98a6f55ec57b2da3b3a95
Gitweb:        https://git.kernel.org/tip/115135422562e2f791e98a6f55ec57b2da3b3a95
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 30 Jan 2026 13:41:00 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 30 Jan 2026 23:06:06 +01:00

sched/deadline: Fix 'stuck' dl_server

Andrea reported the dl_server getting stuck for him. He tracked it
down to a state where dl_server_start() saw dl_defer_running==1, but
the dl_server's job is no longer valid at the time of
dl_server_start().

In the state diagram this corresponds to [4] D->A (or dl_server_stop()
due to no more runnable tasks) followed by [1], which in case of a
lapsed deadline must then be A->B.

Now our A has dl_defer_running==1, while B demands
dl_defer_running==0, therefore it must get cleared when the CBS wakeup
rules demand a replenish.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Reported-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Andrea Righi <arighi@nvidia.com>
Link: https://lkml.kernel.org/r/20260123161645.2181752-1-arighi@nvidia.com
Link: https://patch.msgid.link/20260130124100.GC1079264@noisy.programming.kicks-ass.net
---
 kernel/sched/deadline.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c509f2e..7bcde71 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1034,6 +1034,12 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 			return;
 		}
 
+		/*
+		 * When [4] D->A is followed by [1] A->B, dl_defer_running
+		 * needs to be cleared, otherwise it will fail to properly
+		 * start the zero-laxity timer.
+		 */
+		dl_se->dl_defer_running = 0;
 		replenish_dl_new_period(dl_se, rq);
 	} else if (dl_server(dl_se) && dl_se->dl_defer) {
 		/*
@@ -1655,6 +1661,12 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
  *   dl_server_active = 1;
  *   enqueue_dl_entity()
  *     update_dl_entity(WAKEUP)
+ *       if (dl_time_before() || dl_entity_overflow())
+ *         dl_defer_running = 0;
+ *         replenish_dl_new_period();
+ *           // fwd period
+ *           dl_throttled = 1;
+ *           dl_defer_armed = 1;
  *       if (!dl_defer_running)
  *         dl_defer_armed = 1;
  *         dl_throttled = 1;
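
The effect of the hunk above on the wakeup path can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel implementation: `struct toy_dl_server`, `toy_wakeup()` and `toy_timer_started()` are hypothetical names, `needs_replenish` stands in for the `dl_time_before() || dl_entity_overflow()` check, and the replenish and defer branches are collapsed into a single test because both arm the timer only when `dl_defer_running` is clear.

```c
#include <stdbool.h>

/* Toy model (assumption): only the flags discussed in this thread. */
struct toy_dl_server {
	bool dl_defer_running;
	bool dl_defer_armed;
	bool dl_throttled;
};

/*
 * Simplified wakeup step. With the fix, a wakeup that requires a
 * replenish clears dl_defer_running first, so the deferral logic
 * below is no longer skipped because of a stale flag.
 */
static void toy_wakeup(struct toy_dl_server *s, bool needs_replenish,
		       bool with_fix)
{
	if (needs_replenish && with_fix)
		s->dl_defer_running = false;

	if (!s->dl_defer_running) {
		s->dl_defer_armed = true;
		s->dl_throttled = true;
	}
}

/* Returns true when the defer timer would be started (dl_throttled set). */
static bool toy_timer_started(bool stale_defer_running, bool needs_replenish,
			      bool with_fix)
{
	struct toy_dl_server s = { .dl_defer_running = stale_defer_running };

	toy_wakeup(&s, needs_replenish, with_fix);
	return s.dl_throttled;
}
```

In this model a stale `dl_defer_running` with a lapsed deadline skips the timer without the fix and arms it with the fix, while a wakeup that does not need a replenish keeps the flag, so the short-sleep case is unaffected.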

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by gmonaco@redhat.com 1 week, 1 day ago

On Wed, 2026-01-28 at 14:41 +0100, Andrea Righi wrote:
> Just to make sure we're testing the same thing, I'm currently using
> https://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git,
> branch
> scx-dl-server.
> 
> I'm running this test inside virtme-ng:
>   $ vng -vb --config tools/testing/selftests/sched_ext/config
>   $ vng -v -- tools/testing/selftests/sched_ext/runner -t rt_stall

Well, that's a fun one, I could reproduce the same failure you
described in vng on another x86 box.

The arm box (bare metal) I used initially still passes just fine all 4
iterations of the test.

On the x86 box (vng) I tried different orders of iterations (where the
original is fair-ext-fair-ext) with and without the ext server active.

No ext-server: the ext iteration fails and breaks also fair (unlike the
arm64 box where the fair was intact)
ext-server active: a sequence fair-ext breaks both (like you observe).

I don't have time to look further into this right now, but it looks
like an interesting pattern.

Thanks,
Gabriele

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Andrea Righi 1 week, 1 day ago

Hi Gabriele,

On Thu, Jan 29, 2026 at 12:48:35PM +0100, gmonaco@redhat.com wrote:
> On Wed, 2026-01-28 at 14:41 +0100, Andrea Righi wrote:
> > Just to make sure we're testing the same thing, I'm currently using
> > https://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git,
> > branch
> > scx-dl-server.
> > 
> > I'm running this test inside virtme-ng:
> >   $ vng -vb --config tools/testing/selftests/sched_ext/config
> >   $ vng -v -- tools/testing/selftests/sched_ext/runner -t rt_stall
> 
> Well, that's a fun one, I could reproduce the same failure you
> described in vng on another x86 box.
> 
> The arm box (bare metal) I used initially still passes just fine all 4
> iterations of the test.
> 
> 
> On the x86 box (vng) I tried different orders of iterations (where the
> original is fair-ext-fair-ext) with and without the ext server active.
> 
> No ext-server: the ext iteration fails and breaks also fair (unlike the
> arm64 box where the fair was intact)
> ext-server active: a sequence fair-ext breaks both (like you observe).
> 
> I don't have time to look further into this right now, but it looks
> like an interesting pattern.

Thanks for checking and reproducing it.

The fact that these issues around DL server stop/start transitions can be
triggered by introducing an additional DL server (EXT) makes me wonder
whether this could become even more problematic as we add more DL servers
(hierarchical DL servers?).

Considering that unconditionally clearing dl_defer_running in
dl_server_stop() seems to re-establish a clear state-machine workflow,
I think we should go with that fix for now, so we can unblock the EXT DL
server patch set. With that change in place, all the server combinations
and sequences I've tested seem to behave consistently.

We can always revisit preserving the short-sleep optimization later if we
find a way to do it with stronger guarantees (and I'll keep investigating
on this), but for now the unconditional reset seems like the most robust
fix to me.

Opinions? Peter / Juri?

Thanks,
-Andrea

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Juri Lelli 1 week, 1 day ago

Hello,

On 29/01/26 18:32, Andrea Righi wrote:
> Hi Gabriele,
> 
> On Thu, Jan 29, 2026 at 12:48:35PM +0100, gmonaco@redhat.com wrote:
> > On Wed, 2026-01-28 at 14:41 +0100, Andrea Righi wrote:
> > > Just to make sure we're testing the same thing, I'm currently using
> > > https://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git,
> > > branch
> > > scx-dl-server.
> > > 
> > > I'm running this test inside virtme-ng:
> > >   $ vng -vb --config tools/testing/selftests/sched_ext/config
> > >   $ vng -v -- tools/testing/selftests/sched_ext/runner -t rt_stall
> > 
> > Well, that's a fun one, I could reproduce the same failure you
> > described in vng on another x86 box.
> > 
> > The arm box (bare metal) I used initially still passes just fine all 4
> > iterations of the test.
> > 
> > 
> > On the x86 box (vng) I tried different orders of iterations (where the
> > original is fair-ext-fair-ext) with and without the ext server active.
> > 
> > No ext-server: the ext iteration fails and breaks also fair (unlike the
> > arm64 box where the fair was intact)
> > ext-server active: a sequence fair-ext breaks both (like you observe).
> > 
> > I don't have time to look further into this right now, but it looks
> > like an interesting pattern.
> 
> Thanks for checking and reproducing it.
> 
> The fact that these issues around DL server stop/start transitions can be
> triggered by introducing an additional DL server (EXT) makes me wonder
> whether this could become even more problematic as we add more DL servers
> (hierarchical DL servers?).
> 
> Considering that unconditionally clearing dl_defer_running in
> dl_server_stop() seems to re-establish a clear state-machine workflow,
> I think we should go with that fix for now, so we can unblock the EXT DL
> server patch set. With that change in place, all the server combinations
> and sequences I've tested seem to behave consistently.
> 
> We can always revisit preserving the short-sleep optimization later if we
> find a way to do it with stronger guarantees (and I'll keep investigating
> on this), but for now the unconditional reset seems like the most robust
> fix to me.
> 
> Opinions? Peter / Juri?

Hummm, I now however fear that always cleaning on stop would reintroduce
the issue John Stultz reported a while ago where boosted tasks would
need to wait for an entire new period after sleeping briefly. Would it?

Would a hybrid approach be feasible? Can we do "the right thing" (what
Gabriele suggests?) during normal operation and cleanup state only on
server unload/load?

Thanks,
Juri

Re: [PATCH v2] sched/deadline: Reset dl_server execution state on stop

Posted by Juri Lelli 2 weeks ago

Hello,

On 23/01/26 17:16, Andrea Righi wrote:
> dl_server_stop() can leave a deadline server in an inconsistent internal
> state across stop/start transitions, causing it to bypass its required
> deferral phase when restarted. This breaks the scheduler invariant that
> a restarted server must re-establish eligibility before being allowed to
> execute.
> 
> When the server is stopped (e.g., because the associated task blocks),
> it's expected to transition back to an inactive, initial state. However,
> dl_server_stop() does not fully reset the execution state. As a result,
> the server can be logically inactive while still appearing as if it was
> still running.
> 
> When the server is restarted via dl_server_start(), the following
> sequence occurs:
>   1. dl_server_start() calls enqueue_dl_entity(ENQUEUE_WAKEUP),
>   2. enqueue_dl_entity() calls update_dl_entity(),
>   3. update_dl_entity() checks (!dl_se->dl_defer_running) to decide
>      whether to arm the deferral mechanism,
>   4. because dl_defer_running is stale, the check fails,
>   5. dl_defer_armed and dl_throttled are not set,
>   6. enqueue_dl_entity() skips start_dl_timer(), because
>      dl_throttled == 0,
>   7. the server is enqueued via __enqueue_dl_entity(),
>   8. the scheduler picks the server to run,
>   9. update_curr_dl_se() detects that the server has exhausted its
>      runtime (or has negative runtime), as it wasn't properly
>      replenished/deferred,
>  10. the server is throttled (dl_throttled set to 1) and dequeued,
>  11. the server repeatedly cycles through wakeup and throttling,
>      effectively receiving no usable CPU bandwidth.
> 
> This results in starvation of the tasks serviced by the deadline server
> in the presence of competing RT workloads.
> 
> This issue can be confirmed adding debugging traces, which show that the
> server skips the deferral timer and is immediately throttled upon
> execution with negative runtime:
> 
>  DEBUG: dl_server_start: dl_defer_running=1 active=0
>  DEBUG: enqueue_dl_entity: flags=1 dl_throttled=0 dl_defer=1
>  DEBUG: update_dl_entity: dl_defer_running=1
>  DEBUG: enqueue_dl_entity: SKIPPING start_dl_timer! dl_throttled=0
>  ...
>  DEBUG: update_curr_dl_se: THROTTLED runtime=-954758
> 
> Fix this by properly resetting dl_defer_running in dl_server_stop(),
> ensuring the server correctly enters the defer phase upon restart.
> 
> This issue is quite difficult to observe when only the fair server
> is present, as the required stop/start patterns are relatively rare.
> However, it becomes easier to trigger with an additional deadline server
> with more frequent server lifecycle transitions (such as a sched_ext
> deadline server).
> 
> This change is a prerequisite for introducing a sched_ext deadline
> server, as it ensures correct and predictable behavior across server
> stop/start cycles.
> 
> Link: https://lore.kernel.org/all/aXEMat4IoNnGYgxw@gpd4/
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---

Looks good to me!

Acked-by: Juri Lelli <juri.lelli@redhat.com>

Thanks,
Juri