[PATCH] sched/proxy_exec: Handle sched_delayed owner in find_proxy_task()

soolaugust@gmail.com posted 1 patch 1 month, 1 week ago
From: zhidao su <suzhidao@xiaomi.com>

The blocked-owner check at the top of the inner loop unconditionally
lumps two distinct states into one:

  1. !on_rq        -- the owner has fully left the runqueue; PE cannot
                      proceed and proxy_deactivate() is the right action.
  2. sched_delayed -- EEVDF deferred-dequeue: the owner called schedule()
                      but was kept physically in the RB-tree because its
                      lag was still positive (entity_eligible() == true).

Case 2 is transient.  The owner will resolve to one of two outcomes:

  * A wakeup arrives  --> sched_delayed cleared, on_rq stays 1,
                          owner eligible for PE on the next cycle.
  * Dequeue completes --> on_rq drops to 0, caught by case 1 above.

Calling proxy_deactivate() in case 2 is unnecessarily aggressive: it
removes the high-priority donor from the runqueue and clears its
blocked_on, discarding valid PE state for a single missed cycle.

A task that enters the mutex slowpath sets blocked_on before calling
schedule(), and try_to_block_task() is only reached via the explicit
DEQUEUE_DELAYED path -- not the sched_delayed shortcut.  Therefore a
sched_delayed owner never has blocked_on set and the chain cannot be
followed further regardless.

Split the check: keep proxy_deactivate() for !on_rq, and switch to
proxy_resched_idle() for sched_delayed.  This mirrors the existing
handling of task_on_rq_migrating() owners (see proxy_resched_idle()
call below), which also uses a yield-to-idle to handle a transient
per-owner condition without disturbing the donor.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
 kernel/sched/core.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7f77c165a6..dc9f17b35e4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6625,10 +6625,31 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			return p;
 		}
 
-		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
-			/* XXX Don't handle blocked owners/delayed dequeue yet */
+		if (!READ_ONCE(owner->on_rq)) {
+			/*
+			 * Owner is off the runqueue; proxy execution cannot
+			 * proceed through it. Deactivate the donor so it will
+			 * be properly re-enqueued when the owner eventually
+			 * wakes and releases the mutex.
+			 */
 			return proxy_deactivate(rq, donor);
 		}
+		if (owner->se.sched_delayed) {
+			/*
+			 * The owner is in EEVDF's deferred-dequeue state: it
+			 * called schedule() but the scheduler kept it physically
+			 * on the runqueue because its lag was still positive.
+			 * This is a transient condition -- the owner will either
+			 * be woken (clearing sched_delayed) or fully dequeued
+			 * (clearing on_rq) very shortly.
+			 *
+			 * Unlike the !on_rq case the donor is still valid; do
+			 * not deactivate it.  Yield to idle so the owner can
+			 * complete its state transition, then retry PE on the
+			 * next scheduling cycle.
+			 */
+			return proxy_resched_idle(rq);
+		}
 
 		if (task_cpu(owner) != this_cpu) {
 			/* XXX Don't handle migrations yet */
-- 
2.43.0
Re: [PATCH] sched/proxy_exec: Handle sched_delayed owner in find_proxy_task()
Posted by K Prateek Nayak 1 month, 1 week ago
(+ John)

Hello Zhidao,

On 3/2/2026 3:42 PM, soolaugust@gmail.com wrote:
> From: zhidao su <suzhidao@xiaomi.com>
> 
> The blocked-owner check at the top of the inner loop unconditionally
> lumps two distinct states into one:
> 
>   1. !on_rq        -- the owner has fully left the runqueue; PE cannot
>                       proceed and proxy_deactivate() is the right action.
>   2. sched_delayed -- EEVDF deferred-dequeue: the owner called schedule()
>                       but was kept physically in the RB-tree because its
>                       lag was still positive (entity_eligible() == true).
> 
> Case 2 is transient.  The owner will resolve to one of two outcomes:
> 
>   * A wakeup arrives  --> sched_delayed cleared, on_rq stays 1,
>                           owner eligible for PE on the next cycle.
>   * Dequeue completes --> on_rq drops to 0, caught by case 1 above.
> 
> Calling proxy_deactivate() in case 2 is unnecessarily aggressive: it
> removes the high-priority donor from the runqueue and clears its
> blocked_on, discarding valid PE state for a single missed cycle.

These bits are still in development and will get sorted later in the
series with the blocked owner handling. See
https://github.com/johnstultz-work/linux-dev/commit/e39257424cf2edb17b6be9be3cd50796d6650b1b

> 
> A task that enters the mutex slowpath sets blocked_on before calling
> schedule(), and try_to_block_task() is only reached via the explicit
> DEQUEUE_DELAYED path -- not the sched_delayed shortcut.  Therefore a
> sched_delayed owner never has blocked_on set and the chain cannot be
> followed further regardless.
> 
> Split the check: keep proxy_deactivate() for !on_rq, and switch to
> proxy_resched_idle() for sched_delayed.  This mirrors the existing
> handling of task_on_rq_migrating() owners (see proxy_resched_idle()
> call below), which also uses a yield-to-idle to handle a transient
> per-owner condition without disturbing the donor.

Just switching to idle will not alter the EEVDF state and the pick
will still converge on the same task whose owner will still be
delayed.

Until a wakeup or a full dequeue (and the owner could also be on a
remote CPU at this point), the pick would just be spinning in
__schedule(), continuously calling proxy_resched_idle(), since
nothing has changed in the wait chain of this CPU, no?

  next = pick_next_task(); /* Gets a blocked donor */
  if (task_is_blocked(next))
    find_proxy_task(rq, next)
      ...
      if (owner->se.sched_delayed) /* Finds a delayed owner. */
        next = proxy_resched_idle(rq)

  /*
   * Switched to rq->idle with NEED_RESCHED set.
   * Comes back into __schedule().
   */

  next = pick_next_task();
  if (task_is_blocked(next)) /* Same blocked task! */
    find_proxy_task(rq, next)
      ...
      if (owner->se.sched_delayed) /* Owner still delayed */
        next = proxy_resched_idle(rq) /* Again switched to idle. */

  ... And the cycle repeats with preemption disabled !!!


This is terrible since a blocked owner can be delayed on a busy
runqueue for more than a few ticks - sure, it is a transient state,
but it can last for a while depending on the state of the cfs_rq
where it is delayed, up to a few tens of milliseconds in a practical
worst case scenario.

> 
> Signed-off-by: zhidao su <suzhidao@xiaomi.com>
> ---
>  kernel/sched/core.c | 25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b7f77c165a6..dc9f17b35e4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6625,10 +6625,31 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  			return p;
>  		}
>  
> -		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
> -			/* XXX Don't handle blocked owners/delayed dequeue yet */
> +		if (!READ_ONCE(owner->on_rq)) {
> +			/*
> +			 * Owner is off the runqueue; proxy execution cannot
> +			 * proceed through it. Deactivate the donor so it will
> +			 * be properly re-enqueued when the owner eventually
> +			 * wakes and releases the mutex.
> +			 */
>  			return proxy_deactivate(rq, donor);
>  		}
> +		if (owner->se.sched_delayed) {
> +			/*
> +			 * The owner is in EEVDF's deferred-dequeue state: it
> +			 * called schedule() but the scheduler kept it physically
> +			 * on the runqueue because its lag was still positive.
> +			 * This is a transient condition -- the owner will either
> +			 * be woken (clearing sched_delayed) or fully dequeued
> +			 * (clearing on_rq) very shortly.
> +			 *
> +			 * Unlike the !on_rq case the donor is still valid; do
> +			 * not deactivate it.  Yield to idle so the owner can
> +			 * complete its state transition, then retry PE on the
> +			 * next scheduling cycle.
> +			 */
> +			return proxy_resched_idle(rq);

proxy_deactivate() is correct to do for now until we get to the
blocked owner handling.

> +		}
>  
>  		if (task_cpu(owner) != this_cpu) {
>  			/* XXX Don't handle migrations yet */

-- 
Thanks and Regards,
Prateek
Re: [PATCH] sched/proxy_exec: Handle sched_delayed owner in find_proxy_task()
Posted by soolaugust@gmail.com 1 month, 1 week ago
From: zhidao su <suzhidao@xiaomi.com>

On 3/3/2026 11:29 AM, K Prateek Nayak wrote:
> Just switching to idle will not alter the EEVDF state and the pick
> will still converge on the same task whose owner will still be
> delayed.
>
> [...]
> ... And the cycle repeats with preemption disabled !!!
>
> This is terrible since blocked owner can be delayed on a busy runqueue
> for more than a few tick - sure it is a transient state but it can last
> for a while depending on the state of the cfs_rq where it is delayed,
> up to few 10s of milliseconds in a practical worst case scenario.

Agreed - the spinloop analysis is correct. pick_next_task() keeps
returning the same blocked donor; proxy_resched_idle() bounces off
idle each cycle until sched_delayed clears. Tens of milliseconds of
that is clearly unacceptable.

> proxy_deactivate() is correct to do for now until we get to the
> blocked owner handling.

Right. I'll send a v2 that keeps the two checks as separate blocks
(the conceptual distinction is worth preserving for when the blocked-
owner series lands), but uses proxy_deactivate() for both:

    if (!READ_ONCE(owner->on_rq)) {
        return proxy_deactivate(rq, donor);
    }
    if (owner->se.sched_delayed) {
        /*
         * Owner is in EEVDF deferred-dequeue: still physically on
         * the runqueue but has called schedule(). A sched_delayed
         * task never has blocked_on set, so the chain cannot be
         * followed further. Deactivate the donor for now; proper
         * handling will come with the blocked-owner series.
         *
         * XXX: Don't handle sched_delayed owners yet.
         */
        return proxy_deactivate(rq, donor);
    }

The comment replaces the shared "XXX" with per-case rationale, making
it clearer why they are handled differently once proper support lands.

Does that work, or would you prefer the two conditions be collapsed
back into one?
Re: [PATCH] sched/proxy_exec: Handle sched_delayed owner in find_proxy_task()
Posted by K Prateek Nayak 1 month, 1 week ago
Hello Zhidao,

On 3/3/2026 12:00 PM, soolaugust@gmail.com wrote:
> Right. I'll send a v2 that keeps the two checks as separate blocks
> (the conceptual distinction is worth preserving for when the blocked-
> owner series lands), but uses proxy_deactivate() for both:
> 
>     if (!READ_ONCE(owner->on_rq)) {
>         return proxy_deactivate(rq, donor);
>     }
>     if (owner->se.sched_delayed) {
>         /*
>          * Owner is in EEVDF deferred-dequeue: still physically on
>          * the runqueue but has called schedule(). A sched_delayed
>          * task never has blocked_on set, so the chain cannot be
>          * followed further. Deactivate the donor for now; proper
>          * handling will come with the blocked-owner series.
>          *
>          * XXX: Don't handle sched_delayed owners yet.
>          */
>         return proxy_deactivate(rq, donor);
>     }
> 
> The comment replaces the shared "XXX" with per-case rationale, making
> it clearer why they are handled differently once proper support lands.
> 
> Does that work, or would you prefer the two conditions be collapsed
> back into one?

I suppose we can expand on that comment but all of this will go away
soon(ish) anyways. Do we need to modify it?

I'll let John, Peter answer what they think is best. Personally, I
feel the current transient comment is good enough indicating that it'll
be handled with the upcoming changes similar to owner on remote CPU.

-- 
Thanks and Regards,
Prateek
Re: [PATCH] sched/proxy_exec: Handle sched_delayed owner in find_proxy_task()
Posted by John Stultz 1 month ago
On Mon, Mar 2, 2026 at 10:43 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 3/3/2026 12:00 PM, soolaugust@gmail.com wrote:
> > Right. I'll send a v2 that keeps the two checks as separate blocks
> > (the conceptual distinction is worth preserving for when the blocked-
> > owner series lands), but uses proxy_deactivate() for both:
> >
> >     if (!READ_ONCE(owner->on_rq)) {
> >         return proxy_deactivate(rq, donor);
> >     }
> >     if (owner->se.sched_delayed) {
> >         /*
> >          * Owner is in EEVDF deferred-dequeue: still physically on
> >          * the runqueue but has called schedule(). A sched_delayed
> >          * task never has blocked_on set, so the chain cannot be
> >          * followed further. Deactivate the donor for now; proper
> >          * handling will come with the blocked-owner series.
> >          *
> >          * XXX: Don't handle sched_delayed owners yet.
> >          */
> >         return proxy_deactivate(rq, donor);
> >     }
> >
> > The comment replaces the shared "XXX" with per-case rationale, making
> > it clearer why they are handled differently once proper support lands.
> >
> > Does that work, or would you prefer the two conditions be collapsed
> > back into one?
>
> I suppose we can expand on that comment but all of this will go away
> soon(ish) anyways. Do we need to modify it?
>
> I'll let John, Peter answer what they think is best. Personally, I
> feel the current transient comment is good enough indicating that it'll
> be handled with the upcoming changes similar to owner on remote CPU.

Again, thank you to K Prateek for his excellent analysis and response
in this thread.

Admittedly, being halfway through upstreaming the proxy-exec series,
the code is a little awkward. But even in my current full patch
series, I handle (!on_rq || sched_delayed) together, as the logic ends
up being the same. So I'm not sure splitting the two conditions out is
really valuable.

Even though sched_delayed is in some sense transient, for proxy-exec
it's not meaningfully different from !on_rq; it's just an intermediate
step.  Either the task is woken up (same as with a !on_rq task) or the
dequeue completes (and it becomes !on_rq).

If there are proposals for ways to handle sched_delayed differently to
some benefit, then I'm open to splitting it out, but I'm not sure I
see it yet.

thanks
-john