[PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()

John Stultz posted 9 patches 3 weeks, 4 days ago
There is a newer version of this series
[PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by John Stultz 3 weeks, 4 days ago
With proxy-execution, the scheduler selects the donor, but for
blocked donors, we end up running the lock owner.

This caused some complexity, because the class schedulers make
sure to remove the task they pick from their pushable task
lists, which prevents the donor from being migrated, but there
wasn't then anything to prevent rq->curr from being migrated
if rq->curr != rq->donor.

This was sort of hacked around by calling proxy_tag_curr() on
the rq->curr task if we were running something other then the
donor. proxy_tag_curr() did a dequeue/enqueue pair on the
rq->curr task, allowing the class schedulers to remove it from
their pushable list.

The dequeue/enqueue pair was wasteful, and additonally K Prateek
highlighted that we didn't properly undo things when we stopped
proxying, leaving the lock owner off the pushable list.

After some alternative approaches were considered, Peter
suggested just having the RT/DL classes just avoid migrating
when task_on_cpu().

So rework pick_next_pushable_dl_task() and the rt
pick_next_pushable_task() functions so that they skip over the
first pushable task if it is on_cpu.

Then just drop all of the proxy_tag_curr() logic.

Fixes: be39617e38e0 ("sched: Fix proxy/current (push,pull)ability")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c     | 24 ------------------------
 kernel/sched/deadline.c | 16 ++++++++++++++--
 kernel/sched/rt.c       | 15 ++++++++++++---
 3 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7f77c165a6e0..d86d648a75a4b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6702,23 +6702,6 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 }
 #endif /* SCHED_PROXY_EXEC */
 
-static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
-{
-	if (!sched_proxy_exec())
-		return;
-	/*
-	 * pick_next_task() calls set_next_task() on the chosen task
-	 * at some point, which ensures it is not push/pullable.
-	 * However, the chosen/donor task *and* the mutex owner form an
-	 * atomic pair wrt push/pull.
-	 *
-	 * Make sure owner we run is not pushable. Unfortunately we can
-	 * only deal with that by means of a dequeue/enqueue cycle. :-/
-	 */
-	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
-	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
-}
-
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6871,9 +6854,6 @@ static void __sched notrace __schedule(int sched_mode)
 		 */
 		RCU_INIT_POINTER(rq->curr, next);
 
-		if (!task_current_donor(rq, next))
-			proxy_tag_curr(rq, next);
-
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -6907,10 +6887,6 @@ static void __sched notrace __schedule(int sched_mode)
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
-		/* In case next was already curr but just got blocked_donor */
-		if (!task_current_donor(rq, next))
-			proxy_tag_curr(rq, next);
-
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq, NULL);
 		raw_spin_rq_unlock_irq(rq);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d08b004293234..4e746f4de6529 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2801,12 +2801,24 @@ static int find_later_rq(struct task_struct *task)
 
 static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
 {
-	struct task_struct *p;
+	struct task_struct *p = NULL;
+	struct rb_node *next_node;
 
 	if (!has_pushable_dl_tasks(rq))
 		return NULL;
 
-	p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
+	next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
+	while (next_node) {
+		p = __node_2_pdl(next_node);
+		/* make sure task isn't on_cpu (possible with proxy-exec) */
+		if (!task_on_cpu(rq, p))
+			break;
+
+		next_node = rb_next(next_node);
+	}
+
+	if (!p)
+		return NULL;
 
 	WARN_ON_ONCE(rq->cpu != task_cpu(p));
 	WARN_ON_ONCE(task_current(rq, p));
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f69e1f16d9238..61569b622d1a3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1853,13 +1853,22 @@ static int find_lowest_rq(struct task_struct *task)
 
 static struct task_struct *pick_next_pushable_task(struct rq *rq)
 {
-	struct task_struct *p;
+	struct plist_head *head = &rq->rt.pushable_tasks;
+	struct task_struct *i, *p = NULL;
 
 	if (!has_pushable_tasks(rq))
 		return NULL;
 
-	p = plist_first_entry(&rq->rt.pushable_tasks,
-			      struct task_struct, pushable_tasks);
+	plist_for_each_entry(i, head, pushable_tasks) {
+		/* make sure task isn't on_cpu (possible with proxy-exec) */
+		if (!task_on_cpu(rq, i)) {
+			p = i;
+			break;
+		}
+	}
+
+	if (!p)
+		return NULL;
 
 	BUG_ON(rq->cpu != task_cpu(p));
 	BUG_ON(task_current(rq, p));
-- 
2.53.0.880.g73c4285caa-goog
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by K Prateek Nayak 3 weeks, 1 day ago
Hello John,

On 3/13/2026 8:00 AM, John Stultz wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b7f77c165a6e0..d86d648a75a4b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6702,23 +6702,6 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  }
>  #endif /* SCHED_PROXY_EXEC */
>  
> -static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
> -{
> -	if (!sched_proxy_exec())
> -		return;
> -	/*
> -	 * pick_next_task() calls set_next_task() on the chosen task
> -	 * at some point, which ensures it is not push/pullable.
> -	 * However, the chosen/donor task *and* the mutex owner form an
> -	 * atomic pair wrt push/pull.
> -	 *
> -	 * Make sure owner we run is not pushable. Unfortunately we can
> -	 * only deal with that by means of a dequeue/enqueue cycle. :-/
> -	 */
> -	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> -	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> -}
> -
>  /*
>   * __schedule() is the main scheduler function.
>   *
> @@ -6871,9 +6854,6 @@ static void __sched notrace __schedule(int sched_mode)
>  		 */
>  		RCU_INIT_POINTER(rq->curr, next);
>  
> -		if (!task_current_donor(rq, next))
> -			proxy_tag_curr(rq, next);
> -

Back to my concern with the queuing of the balance_callback, and the
deadline and RT folks can keep me honest here, consider the following:

    CPU0
    ====

  ======> Task A (prio: 80)
  ...
  
  mutex_lock(Mutex0)
  ... /* Executing critical section. */

    =====> Interrupt: Wakes up Task B (prio: 50); B->blocked_on = Mutex0;
      resched_curr()
    <===== Interrupt return
  preempt_schedule_irq()
    schedule()
      put_prev_set_next_Task(A, B)
      rq->donor = B
      if (task_is_blocked(B)
        next = find_proxy_task() /* Return Task A */
      rq->curr = A
      queue_balance_callback()
    do_balance_callbacks()
      /* Finds A as task_on_cpu(); Does nothing. */

  ... /* returns from schedule */
  ... /* continues with critical section */

  mutex_unlock(Mutex0)
    mutex_handoff(B /* Task B */)
    preempt_disable()
      try_to_wake_up()
        resched_curr()
    preempt_enable()
      preempt_schedule()
        proxy_force_return()
          /* Returns to same CPU */

        /*
         * put_prev_set_next_task() is skipped since
         * rq->donor context is same. no balance
         * callbacks are queued. Task A still on the
         * push list.
         */
        rq->donor = B
        rq->curr = B

  =======> sched_out: Task A

  !!! No balance callback; Task A still on push list. !!!
  
  <======= sched_in: Task B


So what I'm getting to is, if we find that rq->donor has not changed
with sched_proxy_exec() but rq->curr has changed during schedule(), we
should forcefully do a:

  prev->sched_class->put_prev_task(rq, rq->donor, rq->donor /* or rq->idle / NULL ? */);
  next->sched_class->set_next_task(rq, rq->donor, true /* to queue balance callback. */);

That way, when we do set_nex_task(), we see if we potentially have
tasks in the push list and queue a balance callback since the
task_on_cpu() condition may no longer apply to the tasks left behind
on the list.

Thoughts?

>  		/*
>  		 * The membarrier system call requires each architecture
>  		 * to have a full memory barrier after updating
> @@ -6907,10 +6887,6 @@ static void __sched notrace __schedule(int sched_mode)
>  		/* Also unlocks the rq: */
>  		rq = context_switch(rq, prev, next, &rf);
>  	} else {
> -		/* In case next was already curr but just got blocked_donor */
> -		if (!task_current_donor(rq, next))
> -			proxy_tag_curr(rq, next);
> -
>  		rq_unpin_lock(rq, &rf);
>  		__balance_callbacks(rq, NULL);
>  		raw_spin_rq_unlock_irq(rq);

-- 
Thanks and Regards,
Prateek
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by John Stultz 3 weeks ago
On Sun, Mar 15, 2026 at 9:27 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Back to my concern with the queuing of the balance_callback, and the
> deadline and RT folks can keep me honest here, consider the following:
>
>     CPU0
>     ====
>
>   ======> Task A (prio: 80)
>   ...
>
>   mutex_lock(Mutex0)
>   ... /* Executing critical section. */
>
>     =====> Interrupt: Wakes up Task B (prio: 50); B->blocked_on = Mutex0;
>       resched_curr()
>     <===== Interrupt return
>   preempt_schedule_irq()
>     schedule()
>       put_prev_set_next_Task(A, B)
>       rq->donor = B
>       if (task_is_blocked(B)
>         next = find_proxy_task() /* Return Task A */
>       rq->curr = A
>       queue_balance_callback()
>     do_balance_callbacks()
>       /* Finds A as task_on_cpu(); Does nothing. */
>
>   ... /* returns from schedule */
>   ... /* continues with critical section */
>
>   mutex_unlock(Mutex0)
>     mutex_handoff(B /* Task B */)
>     preempt_disable()
>       try_to_wake_up()
>         resched_curr()
>     preempt_enable()
>       preempt_schedule()
>         proxy_force_return()
>           /* Returns to same CPU */
>
>         /*
>          * put_prev_set_next_task() is skipped since
>          * rq->donor context is same. no balance
>          * callbacks are queued. Task A still on the
>          * push list.
>          */
>         rq->donor = B
>         rq->curr = B
>   =======> sched_out: Task A
>
>   !!! No balance callback; Task A still on push list. !!!
>
>   <======= sched_in: Task B

Hrm. I'm feeling like I'm a little lost here, specifically after
proxy_force_return(), since it doesn't exist yet at this point in the
patch series. But assuming we're at the "Handle blocked-waiter
migration" point in the series, I'd think it would be something like:

rq->donor= B
rq->curr = A
<< task A >>
mutex_unlock(Mutex0)
  mutex_handoff(B /* Task B */)
   preempt_disable()
     try_to_wake_up()
       resched_curr()
    preempt_enable()
      preempt_schedule()
        __schedule()
          find_proxy_task()
             proxy_force_return()
             return NULL
        pick_again:
          next = pick_next_task()
                      __pick_next_task() /* Returns B */
          rq->donor =B
          rq->curr = B
          context_switch()
<<switch to B >>
          finish_task_switch()
            finish_lock_switch()
              __balance_callbacks()

Your point "put_prev_set_next_task() is skipped since rq->donor
context is same" wasn't initially obvious to me, as the fair scheduler
does have a (p == prev) check, but it doesn't enqueue balance
callbacks.  And for RT/DL/SCX we should be using the pick_task()
method, which calls put_prev_set_next_task() in __pick_next_task().
But indeed, *inside* of put_prev_set_next_task() we return early if
(next == prev).

So I see your concern and agree.

> So what I'm getting to is, if we find that rq->donor has not changed
> with sched_proxy_exec() but rq->curr has changed during schedule(), we
> should forcefully do a:
>
>   prev->sched_class->put_prev_task(rq, rq->donor, rq->donor /* or rq->idle / NULL ? */);
>   next->sched_class->set_next_task(rq, rq->donor, true /* to queue balance callback. */);
>
> That way, when we do set_nex_task(), we see if we potentially have
> tasks in the push list and queue a balance callback since the
> task_on_cpu() condition may no longer apply to the tasks left behind
> on the list.
>
> Thoughts?

Yeah. I wonder if we can express this inside of
put_prev_set_next_task(). Reworking the shortcut to maybe:
    if (next == prev && next != rq->curr)

I probably need to think on this tomorrow, as I suspect the above has
some holes, but it seems like it would catch the cases that would
matter (maybe the issue is it catches too much - we'd probably also
trip it if we A boosted B, and then we hit schedule and again chose A
to boost B, which we probably could have skipped).

I guess adding a new helper function to manually do the
put_prev/set_next could be added to the top level __schedule() logic
in the (prev != next) case, though we'll have to preserve the
prev_donor on the stack probably.

thanks
-john
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by K Prateek Nayak 3 weeks ago
Hello John,

On 3/17/2026 10:19 AM, John Stultz wrote:
> On Sun, Mar 15, 2026 at 9:27 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Back to my concern with the queuing of the balance_callback, and the
>> deadline and RT folks can keep me honest here, consider the following:
>>
>>     CPU0
>>     ====
>>
>>   ======> Task A (prio: 80)
>>   ...
>>
>>   mutex_lock(Mutex0)
>>   ... /* Executing critical section. */
>>
>>     =====> Interrupt: Wakes up Task B (prio: 50); B->blocked_on = Mutex0;
>>       resched_curr()
>>     <===== Interrupt return
>>   preempt_schedule_irq()
>>     schedule()
>>       put_prev_set_next_Task(A, B)
>>       rq->donor = B
>>       if (task_is_blocked(B)
>>         next = find_proxy_task() /* Return Task A */
>>       rq->curr = A
>>       queue_balance_callback()
>>     do_balance_callbacks()
>>       /* Finds A as task_on_cpu(); Does nothing. */
>>
>>   ... /* returns from schedule */
>>   ... /* continues with critical section */
>>
>>   mutex_unlock(Mutex0)
>>     mutex_handoff(B /* Task B */)
>>     preempt_disable()
>>       try_to_wake_up()
>>         resched_curr()
>>     preempt_enable()
>>       preempt_schedule()
>>         proxy_force_return()
>>           /* Returns to same CPU */
>>
>>         /*
>>          * put_prev_set_next_task() is skipped since
>>          * rq->donor context is same. no balance
>>          * callbacks are queued. Task A still on the
>>          * push list.
>>          */
>>         rq->donor = B
>>         rq->curr = B
>>   =======> sched_out: Task A
>>
>>   !!! No balance callback; Task A still on push list. !!!
>>
>>   <======= sched_in: Task B
> 
> Hrm. I'm feeling like I'm a little lost here, specifically after
> proxy_force_return(), since it doesn't exist yet at this point in the
> patch series.

Yeah I had to look a little bit ahead to poke holes here. Sorry about
that!

> But assuming we're at the "Handle blocked-waiter
> migration" point in the series, I'd think it would be something like:
> 
> rq->donor= B
> rq->curr = A
> << task A >>
> mutex_unlock(Mutex0)
>   mutex_handoff(B /* Task B */)
>    preempt_disable()
>      try_to_wake_up()
>        resched_curr()
>     preempt_enable()
>       preempt_schedule()
>         __schedule()
>           find_proxy_task()
>              proxy_force_return()
>              return NULL
>         pick_again:
>           next = pick_next_task()
>                       __pick_next_task() /* Returns B */
>           rq->donor =B
>           rq->curr = B
>           context_switch()
> <<switch to B >>
>           finish_task_switch()
>             finish_lock_switch()
>               __balance_callbacks()
> 
> Your point "put_prev_set_next_task() is skipped since rq->donor
> context is same" wasn't initially obvious to me, as the fair scheduler
> does have a (p == prev) check, but it doesn't enqueue balance
> callbacks.  And for RT/DL/SCX we should be using the pick_task()
> method, which calls put_prev_set_next_task() in __pick_next_task().
> But indeed, *inside* of put_prev_set_next_task() we return early if
> (next == prev).
> 
> So I see your concern and agree.
> 
>> So what I'm getting to is, if we find that rq->donor has not changed
>> with sched_proxy_exec() but rq->curr has changed during schedule(), we
>> should forcefully do a:
>>
>>   prev->sched_class->put_prev_task(rq, rq->donor, rq->donor /* or rq->idle / NULL ? */);
>>   next->sched_class->set_next_task(rq, rq->donor, true /* to queue balance callback. */);
>>
>> That way, when we do set_nex_task(), we see if we potentially have
>> tasks in the push list and queue a balance callback since the
>> task_on_cpu() condition may no longer apply to the tasks left behind
>> on the list.
>>
>> Thoughts?
> 
> Yeah. I wonder if we can express this inside of
> put_prev_set_next_task(). Reworking the shortcut to maybe:
>     if (next == prev && next != rq->curr)
> 
> I probably need to think on this tomorrow, as I suspect the above has
> some holes, but it seems like it would catch the cases that would
> matter

Also this needs to be done after find_proxy_task() since
"donor->blocked_on" needs to be cleared to queue callbacks else we'll
bail out on task_is_blocked() check in set_next_task.*() with
PROXY_WAKING when it is done as a part of pick_next_task().

> (maybe the issue is it catches too much - we'd probably also
> trip it if we A boosted B, and then we hit schedule and again chose A
> to boost B, which we probably could have skipped).

Ack! Doing it when the execution context changes with donor context
remaining the same would be the most optimal.

> 
> I guess adding a new helper function to manually do the
> put_prev/set_next could be added to the top level __schedule() logic
> in the (prev != next) case, though we'll have to preserve the
> prev_donor on the stack probably.

That seems like the best option to me too.

Also, deadline, RT, fair, and idle don't really care about the "next"
argument of put_prev_task() and the only one that does care is
put_prev_task_scx() to call switch_class() callback so putting it as
either NULL or "rq->donor" should be safe.

-- 
Thanks and Regards,
Prateek

Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by Peter Zijlstra 2 weeks, 5 days ago
On Tue, Mar 17, 2026 at 11:11:20AM +0530, K Prateek Nayak wrote:

> Also, deadline, RT, fair, and idle don't really care about the "next"
> argument of put_prev_task() and the only one that does care is
> put_prev_task_scx() to call switch_class() callback so putting it as
> either NULL or "rq->donor" should be safe.

https://lkml.kernel.org/r/20260317104343.225156112@infradead.org

Makes fair care about the @next argument to put_prev_task_fair().
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by K Prateek Nayak 2 weeks, 5 days ago
On 3/18/2026 6:25 PM, Peter Zijlstra wrote:
> On Tue, Mar 17, 2026 at 11:11:20AM +0530, K Prateek Nayak wrote:
> 
>> Also, deadline, RT, fair, and idle don't really care about the "next"
>> argument of put_prev_task() and the only one that does care is
>> put_prev_task_scx() to call switch_class() callback so putting it as
>> either NULL or "rq->donor" should be safe.
> 
> https://lkml.kernel.org/r/20260317104343.225156112@infradead.org
> 
> Makes fair care about the @next argument to put_prev_task_fair().

Ack! In this case, using "rq->donor" is better since we skip traversing
the entire cgroup hierarchy and bail-out at the common parent which is
the immediate next entity.

-- 
Thanks and Regards,
Prateek
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by John Stultz 3 weeks ago
On Mon, Mar 16, 2026 at 10:41 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 3/17/2026 10:19 AM, John Stultz wrote:
> >
> > I guess adding a new helper function to manually do the
> > put_prev/set_next could be added to the top level __schedule() logic
> > in the (prev != next) case, though we'll have to preserve the
> > prev_donor on the stack probably.
>
> That seems like the best option to me too.
>
> Also, deadline, RT, fair, and idle don't really care about the "next"
> argument of put_prev_task() and the only one that does care is
> put_prev_task_scx() to call switch_class() callback so putting it as
> either NULL or "rq->donor" should be safe.

Ack.
Here's the change I'm testing tonight (against 6.18):
https://github.com/johnstultz-work/linux-dev/commit/0cc72a4923143f496e33711cbcc1afdf6d861ca6

Feel free to suggest a better name for the helper function. It feels a
little clunky (and sort of sad right after getting rid of the clunky
proxy_tag_curr(), to re-add something so similar).

Thanks again for your thoughtful feedback!
-john
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by Peter Zijlstra 2 weeks, 5 days ago
On Mon, Mar 16, 2026 at 11:04:28PM -0700, John Stultz wrote:
> On Mon, Mar 16, 2026 at 10:41 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> > On 3/17/2026 10:19 AM, John Stultz wrote:
> > >
> > > I guess adding a new helper function to manually do the
> > > put_prev/set_next could be added to the top level __schedule() logic
> > > in the (prev != next) case, though we'll have to preserve the
> > > prev_donor on the stack probably.
> >
> > That seems like the best option to me too.
> >
> > Also, deadline, RT, fair, and idle don't really care about the "next"
> > argument of put_prev_task() and the only one that does care is
> > put_prev_task_scx() to call switch_class() callback so putting it as
> > either NULL or "rq->donor" should be safe.
> 
> Ack.
> Here's the change I'm testing tonight (against 6.18):
> https://github.com/johnstultz-work/linux-dev/commit/0cc72a4923143f496e33711cbcc1afdf6d861ca6
> 
> Feel free to suggest a better name for the helper function. It feels a
> little clunky (and sort of sad right after getting rid of the clunky
> proxy_tag_curr(), to re-add something so similar).

Does this capture it?

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7100,9 +7103,11 @@ static void __sched notrace __schedule(i
 pick_again:
 	assert_balance_callbacks_empty(rq);
 	next = pick_next_task(rq, rq->donor, &rf);
-	rq_set_donor(rq, next);
 	rq->next_class = next->sched_class;
 	if (sched_proxy_exec()) {
+		struct task_struct *prev_donor = rq->donor;
+
+		rq_set_donor(rq, next);
 		if (unlikely(next->blocked_on)) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
@@ -7114,6 +7119,24 @@ static void __sched notrace __schedule(i
 				goto keep_resched;
 			}
 		}
+
+		/*
+		 * When transitioning like:
+		 *
+		 *	   prev		next
+		 * donor:    B		  B
+		 * curr:     A		  B
+		 *
+		 * then put_prev_set_next_task() will not have done anything,
+		 * since B == B. However, A might have missed a RT/DL balance
+		 * opportunity due to being on_cpu.
+		 */
+		if (next == rq->donor && next == prev_donor) {
+			next->sched_class->put_prev_task(rq, next, next);
+			next->sched_class->set_next_task(rq, next, true);
+		}
+	} else {
+		rq_set_donor(rq, next);
 	}
 picked:
 	clear_tsk_need_resched(prev);
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by John Stultz 2 weeks, 5 days ago
On Wed, Mar 18, 2026 at 6:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Mar 16, 2026 at 11:04:28PM -0700, John Stultz wrote:
> > On Mon, Mar 16, 2026 at 10:41 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> > > On 3/17/2026 10:19 AM, John Stultz wrote:
> > > >
> > > > I guess adding a new helper function to manually do the
> > > > put_prev/set_next could be added to the top level __schedule() logic
> > > > in the (prev != next) case, though we'll have to preserve the
> > > > prev_donor on the stack probably.
> > >
> > > That seems like the best option to me too.
> > >
> > > Also, deadline, RT, fair, and idle don't really care about the "next"
> > > argument of put_prev_task() and the only one that does care is
> > > put_prev_task_scx() to call switch_class() callback so putting it as
> > > either NULL or "rq->donor" should be safe.
> >
> > Ack.
> > Here's the change I'm testing tonight (against 6.18):
> > https://github.com/johnstultz-work/linux-dev/commit/0cc72a4923143f496e33711cbcc1afdf6d861ca6
> >
> > Feel free to suggest a better name for the helper function. It feels a
> > little clunky (and sort of sad right after getting rid of the clunky
> > proxy_tag_curr(), to re-add something so similar).
>
> Does this capture it?

Yeah, this looks equivalent to what I had in my patch linked above.
I'll go ahead and take your version though.

The only tweak I might consider is setting the prev_donor along with
prev, so we don't have to move the rq_set_donor() call to be on both
sides of the sched_proxy_exec() conditional.

Feels much simpler to read, and I'd hope the compiler would optimize
the extra assignment away if it were unused at compile time. But maybe
there is a concern you have I'm missing?

thanks
-john
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by Peter Zijlstra 2 weeks, 5 days ago
On Wed, Mar 18, 2026 at 01:30:17PM -0700, John Stultz wrote:

> Yeah, this looks equivalent to what I had in my patch linked above.
> I'll go ahead and take your version though.
> 
> The only tweak I might consider is setting the prev_donor along with
> prev, so we don't have to move the rq_set_donor() call to be on both
> sides of the sched_proxy_exec() conditional.
> 
> Feels much simpler to read, and I'd hope the compiler would optimize
> the extra assignment away if it were unused at compile time. But maybe
> there is a concern you have I'm missing?

So my goal was to capture as much of the PE specific stuff inside that
sched_proxy_exec() branch.

Having to push that rq_set_donor() into the else branch was a little
unfortunate, but keeping it all inside that branch makes that we can
easily break that out into its own function.
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by John Stultz 2 weeks, 5 days ago
On Wed, Mar 18, 2026 at 1:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 18, 2026 at 01:30:17PM -0700, John Stultz wrote:
>
> > Yeah, this looks equivalent to what I had in my patch linked above.
> > I'll go ahead and take your version though.
> >
> > The only tweak I might consider is setting the prev_donor along with
> > prev, so we don't have to move the rq_set_donor() call to be on both
> > sides of the sched_proxy_exec() conditional.
> >
> > Feels much simpler to read, and I'd hope the compiler would optimize
> > the extra assignment away if it were unused at compile time. But maybe
> > there is a concern you have I'm missing?
>
> So my goal was to capture as much of the PE specific stuff inside that
> sched_proxy_exec() branch.
>
> Having to push that rq_set_donor() into the else branch was a little
> unfortunate, but keeping it all inside that branch makes that we can
> easily break that out into its own function.

Ok. Sounds good. I'll match your suggestion then.

Appreciate the quick feedback!
-john
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by Peter Zijlstra 2 weeks, 5 days ago
On Wed, Mar 18, 2026 at 02:36:40PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 16, 2026 at 11:04:28PM -0700, John Stultz wrote:
> > On Mon, Mar 16, 2026 at 10:41 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> > > On 3/17/2026 10:19 AM, John Stultz wrote:
> > > >
> > > > I guess adding a new helper function to manually do the
> > > > put_prev/set_next could be added to the top level __schedule() logic
> > > > in the (prev != next) case, though we'll have to preserve the
> > > > prev_donor on the stack probably.
> > >
> > > That seems like the best option to me too.
> > >
> > > Also, deadline, RT, fair, and idle don't really care about the "next"
> > > argument of put_prev_task() and the only one that does care is
> > > put_prev_task_scx() to call switch_class() callback so putting it as
> > > either NULL or "rq->donor" should be safe.
> > 
> > Ack.
> > Here's the change I'm testing tonight (against 6.18):
> > https://github.com/johnstultz-work/linux-dev/commit/0cc72a4923143f496e33711cbcc1afdf6d861ca6
> > 
> > Feel free to suggest a better name for the helper function. It feels a
> > little clunky (and sort of sad right after getting rid of the clunky
> > proxy_tag_curr(), to re-add something so similar).
> 
> Does this capture it?
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7100,9 +7103,11 @@ static void __sched notrace __schedule(i
>  pick_again:
>  	assert_balance_callbacks_empty(rq);
>  	next = pick_next_task(rq, rq->donor, &rf);
> -	rq_set_donor(rq, next);
>  	rq->next_class = next->sched_class;
>  	if (sched_proxy_exec()) {
> +		struct task_struct *prev_donor = rq->donor;
> +
> +		rq_set_donor(rq, next);
>  		if (unlikely(next->blocked_on)) {
>  			next = find_proxy_task(rq, next, &rf);
>  			if (!next) {
> @@ -7114,6 +7119,24 @@ static void __sched notrace __schedule(i
>  				goto keep_resched;
>  			}
>  		}
> +
> +		/*
> +		 * When transitioning like:
> +		 *
> +		 *	   prev		next
> +		 * donor:    B		  B
> +		 * curr:     A		  B
> +		 *
> +		 * then put_prev_set_next_task() will not have done anything,
> +		 * since B == B. However, A might have missed a RT/DL balance
> +		 * opportunity due to being on_cpu.
> +		 */
> +		if (next == rq->donor && next == prev_donor) {

	&& next != prev

> +			next->sched_class->put_prev_task(rq, next, next);
> +			next->sched_class->set_next_task(rq, next, true);
> +		}
> +	} else {
> +		rq_set_donor(rq, next);
>  	}
>  picked:
>  	clear_tsk_need_resched(prev);
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by K Prateek Nayak 2 weeks, 5 days ago
Hello Peter,

On 3/18/2026 7:22 PM, Peter Zijlstra wrote:
>> @@ -7114,6 +7119,24 @@ static void __sched notrace __schedule(i
>>  				goto keep_resched;
>>  			}
>>  		}
>> +
>> +		/*
>> +		 * When transitioning like:
>> +		 *
>> +		 *	   prev		next
>> +		 * donor:    B		  B
>> +		 * curr:     A		  B

Even for curr going A -> C, we should do this.

Consider B is blocked on C which is in turn blocked on A. B gets picked
and proxies A first as a result of the wait chain (B -> C -> A)

A runs and does an unlock with a handoff to C clearing its blocked_on.
In schedule B is picked again but this time it proxies C.

A might be on the push list and was skipped last time around since
A->on_cpu = 1 but now that is gone and we should still do put_prev_task()
+ set_next_task().

>> +		 *
>> +		 * then put_prev_set_next_task() will not have done anything,
>> +		 * since B == B. However, A might have missed a RT/DL balance
>> +		 * opportunity due to being on_cpu.
>> +		 */
>> +		if (next == rq->donor && next == prev_donor) {
> 
> 	&& next != prev
> 
>> +			next->sched_class->put_prev_task(rq, next, next);
>> +			next->sched_class->set_next_task(rq, next, true);
>> +		}
>> +	} else {
>> +		rq_set_donor(rq, next);
>>  	}
>>  picked:
>>  	clear_tsk_need_resched(prev);

-- 
Thanks and Regards,
Prateek
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by K Prateek Nayak 3 weeks ago
Hello John,

On 3/17/2026 11:34 AM, John Stultz wrote:
> On Mon, Mar 16, 2026 at 10:41 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 3/17/2026 10:19 AM, John Stultz wrote:
>>>
>>> I guess adding a new helper function to manually do the
>>> put_prev/set_next could be added to the top level __schedule() logic
>>> in the (prev != next) case, though we'll have to preserve the
>>> prev_donor on the stack probably.
>>
>> That seems like the best option to me too.
>>
>> Also, deadline, RT, fair, and idle don't really care about the "next"
>> argument of put_prev_task() and the only one that does care is
>> put_prev_task_scx() to call switch_class() callback so putting it as
>> either NULL or "rq->donor" should be safe.
> 
> Ack.
> Here's the change I'm testing tonight (against 6.18):
> https://github.com/johnstultz-work/linux-dev/commit/0cc72a4923143f496e33711cbcc1afdf6d861ca6

Thanks a ton for pushing out to WIP! Only nit. would be a:

  s/rq->donor/prev_donor/

on the lines with {put_prev/set_next}_task() to save on the
additional dereference since they are both the same (but maybe
the complier figures that out on its own?)

Also would a sched_proxy_exec() check within that function
make sense to skip evaluation of that branch entirely when
proxy exec is disabled via cmdline?

> Feel free to suggest a better name for the helper function. It feels a
> little clunky (and sort of sad right after getting rid of the clunky
> proxy_tag_curr(), to re-add something so similar).

I know and I'm sorry about that (T_T)

-- 
Thanks and Regards,
Prateek

Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by John Stultz 2 weeks, 6 days ago
On Tue, Mar 17, 2026 at 12:53 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
>
> Hello John,
>
> On 3/17/2026 11:34 AM, John Stultz wrote:
> > On Mon, Mar 16, 2026 at 10:41 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >> On 3/17/2026 10:19 AM, John Stultz wrote:
> >>>
> >>> I guess adding a new helper function to manually do the
> >>> put_prev/set_next could be added to the top level __schedule() logic
> >>> in the (prev != next) case, though we'll have to preserve the
> >>> prev_donor on the stack probably.
> >>
> >> That seems like the best option to me too.
> >>
> >> Also, deadline, RT, fair, and idle don't really care about the "next"
> >> argument of put_prev_task() and the only one that does care is
> >> put_prev_task_scx() to call switch_class() callback so putting it as
> >> either NULL or "rq->donor" should be safe.
> >
> > Ack.
> > Here's the change I'm testing tonight (against 6.18):
> > https://github.com/johnstultz-work/linux-dev/commit/0cc72a4923143f496e33711cbcc1afdf6d861ca6
>
> Thanks a ton for pushing out to WIP! Only nit. would be a:
>
>   s/rq->donor/prev_donor/
>
> on the lines with {put_prev/set_next}_task() to save on the
> additional dereference since they are both the same (but maybe
> the complier figures that out on its own?)

While the rq->donor and prev_donor are the same, I sort of preferred using:
  prev_donor->sched_class->put_prev_task(rq, prev_donor, rq->donor);
  rq->donor->sched_class->set_next_task(rq, rq->donor, true);

As it matches the familiar prev/next pattern used elsewhere (to me it
visually makes more sense and doesn't look as confusing).
I'd hope the compilers woudl sort out they could save the derefernce,
but maybe I can avoid extra derefernces and still preserve the pattern
with a local "donor" variable? That should make sure its simpler for
the compiler to optimize when they are equivalent.

> Also would a sched_proxy_exec() check within that function
> make sense to skip evaluation of that branch entirely when
> proxy exec is disabled via cmdline?

So after Peter's feedback that the sched_proxy_exec() checks weren't
optimizing out the way I'd hope, I'm hesitant to add another check
there. Seems the donor == prev_donor check would amount to the same
amount of work?

> > Feel free to suggest a better name for the helper function. It feels a
> > little clunky (and sort of sad right after getting rid of the clunky
> > proxy_tag_curr(), to re-add something so similar).
>
> I know and I'm sorry about that (T_T)

Oh, no worries! Again, I really appreciate you catching these subtle issues!
thanks
-john
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by Juri Lelli 3 weeks, 3 days ago
Hello,

On 13/03/26 02:30, John Stultz wrote:

...

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index d08b004293234..4e746f4de6529 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2801,12 +2801,24 @@ static int find_later_rq(struct task_struct *task)
>  
>  static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
>  {
> -	struct task_struct *p;
> +	struct task_struct *p = NULL;
> +	struct rb_node *next_node;
>  
>  	if (!has_pushable_dl_tasks(rq))
>  		return NULL;
>  
> -	p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
> +	next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
> +	while (next_node) {
> +		p = __node_2_pdl(next_node);
> +		/* make sure task isn't on_cpu (possible with proxy-exec) */
> +		if (!task_on_cpu(rq, p))
> +			break;
> +
> +		next_node = rb_next(next_node);
> +	}
> +
> +	if (!p)
> +		return NULL;

Can't this return an on_cpu task if we hit the corner case where all
pushable tasks are on_cpu?

Thanks,
Juri
Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
Posted by John Stultz 3 weeks, 3 days ago
On Fri, Mar 13, 2026 at 6:48 AM Juri Lelli <juri.lelli@redhat.com> wrote:
>
> Hello,
>
> On 13/03/26 02:30, John Stultz wrote:
>
> ...
>
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index d08b004293234..4e746f4de6529 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -2801,12 +2801,24 @@ static int find_later_rq(struct task_struct *task)
> >
> >  static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
> >  {
> > -     struct task_struct *p;
> > +     struct task_struct *p = NULL;
> > +     struct rb_node *next_node;
> >
> >       if (!has_pushable_dl_tasks(rq))
> >               return NULL;
> >
> > -     p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
> > +     next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
> > +     while (next_node) {
> > +             p = __node_2_pdl(next_node);
> > +             /* make sure task isn't on_cpu (possible with proxy-exec) */
> > +             if (!task_on_cpu(rq, p))
> > +                     break;
> > +
> > +             next_node = rb_next(next_node);
> > +     }
> > +
> > +     if (!p)
> > +             return NULL;
>
> Can't this return an on_cpu task if we hit the corner case where all
> pushable tasks are on_cpu?
>

Oof. Yep. Thanks for catching this!

Let me know if you have any feedback on the rest of the series. I'll
try reviewers a chance to catch anything else, and will respin this
next week.

Really appreciate the review!
thanks
-john