[PATCH] sched/proxy_exec: Optimize proxy_tag_curr() pushable removal

soolaugust@gmail.com posted 1 patch 1 month, 1 week ago
From: zhidao su <suzhidao@xiaomi.com>

proxy_tag_curr() is called on every proxy-execution activation
to prevent the owner task from being pushed to another CPU while
the donor runs on its behalf.

The current implementation uses a full dequeue/enqueue(SAVE/RESTORE)
cycle.  This is unnecessarily expensive:

  RT owner: O(log n) run-list manipulation and plist update, plus
    uclamp/PSI/sched_core bookkeeping, all with zero net effect
    except for the single dequeue_pushable_task() call buried
    inside dequeue_task_rt().

  DL owner: same, plus DEQUEUE_SAVE triggers sub_running_bw() +
    sub_rq_bw(), which ENQUEUE_RESTORE immediately undoes via
    add_rq_bw() + add_running_bw().  The bandwidth counters of
    an owner that never left the runqueue should not be perturbed.

  CFS owner: the whole cycle is a no-op; CFS has no pushable
    list.  can_migrate_task() already rejects blocked owners
    via task_is_blocked(), making migration impossible.

The SAVE/RESTORE cycle achieves pushable removal only indirectly:
enqueue_task_rt/dl() suppresses re-enqueue into the pushable
list when task_is_blocked(owner) is true.  The same result is
obtained more directly by calling dequeue_pushable_task() or
dequeue_pushable_dl_task() once, without any of the side effects.

Replace the workaround with per-class direct calls:

  RT: dequeue_pushable_task(rq, owner)   -- O(1) plist remove
  DL: dequeue_pushable_dl_task(rq, owner) -- O(log n) rb_erase,
      but avoids the bandwidth counter churn entirely
  CFS: no-op (no pushable list; task_is_blocked() suffices)

Both functions are promoted from static and declared in sched.h.
deadline.c also gains the missing isolation.h include required
by dl_get_task_effective_cpus().

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
 kernel/sched/core.c     | 28 +++++++++++++++++++---------
 kernel/sched/deadline.c |  3 ++-
 kernel/sched/rt.c       |  2 +-
 kernel/sched/sched.h    |  2 ++
 4 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dc9f17b35e4..2aba15d84b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6728,16 +6728,26 @@ static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
 	if (!sched_proxy_exec())
 		return;
 	/*
-	 * pick_next_task() calls set_next_task() on the chosen task
-	 * at some point, which ensures it is not push/pullable.
-	 * However, the chosen/donor task *and* the mutex owner form an
-	 * atomic pair wrt push/pull.
+	 * The donor goes through set_next_task() which calls
+	 * dequeue_pushable_task() making it non-pushable.  The owner
+	 * does not go through that path, so we must remove it from
+	 * the pushable list explicitly.
 	 *
-	 * Make sure owner we run is not pushable. Unfortunately we can
-	 * only deal with that by means of a dequeue/enqueue cycle. :-/
-	 */
-	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
-	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
+	 * For RT tasks: remove from the plist directly.
+	 * For DL tasks: remove from the rb-tree directly.
+	 * For CFS tasks: no pushable list exists; can_migrate_task()
+	 * already rejects blocked owners via task_is_blocked().
+	 *
+	 * The prior dequeue/enqueue(SAVE/RESTORE) cycle achieved the
+	 * same result by relying on task_is_blocked() suppressing the
+	 * re-enqueue into the pushable list, but it carried O(log n)
+	 * overhead and, for DL owners, triggered sub_running_bw() +
+	 * sub_rq_bw() -- bandwidth counter churn with no net effect.
+	 */
+	if (rt_task(owner))
+		dequeue_pushable_task(rq, owner);
+	else if (dl_task(owner))
+		dequeue_pushable_dl_task(rq, owner);
 }
 
 /*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d08b0042932..a43557fd84b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -18,6 +18,7 @@
 
 #include <linux/cpuset.h>
 #include <linux/sched/clock.h>
+#include <linux/sched/isolation.h>
 #include <uapi/linux/sched/types.h>
 #include "sched.h"
 #include "pelt.h"
@@ -593,7 +594,7 @@ static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
 	}
 }
 
-static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
 {
 	struct dl_rq *dl_rq = &rq->dl;
 	struct rb_root_cached *root = &dl_rq->pushable_dl_tasks_root;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f69e1f16d92..5a124ee3114 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -410,7 +410,7 @@ static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
 	}
 }
 
-static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
+void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
 {
 	plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 43bbf0693cc..dd5e187e6c7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -362,6 +362,7 @@ extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *att
 extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern int  dl_bw_deactivate(int cpu);
 extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec);
+extern void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p);
 /*
  * SCHED_DEADLINE supports servers (nested scheduling) with the following
  * interface:
@@ -3329,6 +3330,7 @@ print_numa_stats(struct seq_file *m, int node, unsigned long tsf,
 
 extern void init_cfs_rq(struct cfs_rq *cfs_rq);
 extern void init_rt_rq(struct rt_rq *rt_rq);
+extern void dequeue_pushable_task(struct rq *rq, struct task_struct *p);
 extern void init_dl_rq(struct dl_rq *dl_rq);
 
 extern void cfs_bandwidth_usage_inc(void);
-- 
2.43.0
Re: [PATCH] sched/proxy_exec: Optimize proxy_tag_curr() pushable removal
Posted by K Prateek Nayak 1 month ago
Hello Zhidao,

On 3/3/2026 5:27 PM, soolaugust@gmail.com wrote:
> The SAVE/RESTORE cycle achieves pushable removal only indirectly:
> enqueue_task_rt/dl() suppresses re-enqueue into the pushable
> list when task_is_blocked(owner) is true.  The same result is
> obtained more directly by calling dequeue_pushable_task() or
> dequeue_pushable_dl_task() once, without any of the side effects.
> 
> Replace the workaround with per-class direct calls:

I had this question in the past and from my reading, I think we
can safely call the dequeue_pushable*() helper (John might have some
notes on these bits) but that direct call is just horrible!

Why can't we just have a sched_class->proxy_tag_curr(rq, p) or such
and just call it if a "sched_class" populates it?

> 
>   RT: dequeue_pushable_task(rq, owner)   -- O(1) plist remove
>   DL: dequeue_pushable_dl_task(rq, owner) -- O(log n) rb_erase,
>       but avoids the bandwidth counter churn entirely
>   CFS: no-op (no pushable list; task_is_blocked() suffices)
> 
> Both functions are promoted from static and declared in sched.h.
> deadline.c also gains the missing isolation.h include required
> by dl_get_task_effective_cpus().
> 
> Signed-off-by: zhidao su <suzhidao@xiaomi.com>
> ---
>  kernel/sched/core.c     | 28 +++++++++++++++++++---------
>  kernel/sched/deadline.c |  3 ++-
>  kernel/sched/rt.c       |  2 +-
>  kernel/sched/sched.h    |  2 ++
>  4 files changed, 24 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index dc9f17b35e4..2aba15d84b7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6728,16 +6728,26 @@ static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
>         if (!sched_proxy_exec())
>                 return;
>         /*
> -        * pick_next_task() calls set_next_task() on the chosen task
> -        * at some point, which ensures it is not push/pullable.
> -        * However, the chosen/donor task *and* the mutex owner form an
> -        * atomic pair wrt push/pull.
> +        * The donor goes through set_next_task() which calls
> +        * dequeue_pushable_task() making it non-pushable.  The owner
> +        * does not go through that path, so we must remove it from
> +        * the pushable list explicitly.
>          *
> -        * Make sure owner we run is not pushable. Unfortunately we can
> -        * only deal with that by means of a dequeue/enqueue cycle. :-/
> -        */
> -       dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> -       enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> +        * For RT tasks: remove from the plist directly.
> +        * For DL tasks: remove from the rb-tree directly.
> +        * For CFS tasks: no pushable list exists; can_migrate_task()
> +        * already rejects blocked owners via task_is_blocked().

Tag is for the current running context (owner). The task_on_cpu() check is
what guards that for fair.

> +        *
> +        * The prior dequeue/enqueue(SAVE/RESTORE) cycle achieved the
> +        * same result by relying on task_is_blocked() suppressing the
> +        * re-enqueue into the pushable list, but it carried O(log n)
> +        * overhead and, for DL owners, triggered sub_running_bw() +
> +        * sub_rq_bw() -- bandwidth counter churn with no net effect.
> +        */
> +       if (rt_task(owner))
> +               dequeue_pushable_task(rq, owner);
> +       else if (dl_task(owner))
> +               dequeue_pushable_dl_task(rq, owner);

I think a sched_class callback would be better and it'll also allow
other sched classes to just populate it when need arises instead of
adding more else if here. Like:

  if (p->sched_class->proxy_tag_curr) /* or dequeue_pushable_task()? */
    p->sched_class->proxy_tag_curr(rq, p);

Also I forgot how the current context gets put back on the pushable list
once the proxy is done since at that point it is just a preempted task
on the CPU.

John, do you remember how this happens?

I think we might also need a proxy_untag_current() which calls into
enqueue_pushable.*_task() which does the minimal bits of put_prev_task()
undoing the tag and possibly queuing some balance callbacks which would
have been skipped by set_next_task() on not seeing any pushable tasks
at the time of donor pick?

>  }
> 
>  /*

-- 
Thanks and Regards,
Prateek
Re: [PATCH] sched/proxy_exec: Optimize proxy_tag_curr() pushable removal
Posted by John Stultz 1 month ago
On Tue, Mar 3, 2026 at 7:40 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 3/3/2026 5:27 PM, soolaugust@gmail.com wrote:
> > The SAVE/RESTORE cycle achieves pushable removal only indirectly:
> > enqueue_task_rt/dl() suppresses re-enqueue into the pushable
> > list when task_is_blocked(owner) is true.  The same result is
> > obtained more directly by calling dequeue_pushable_task() or
> > dequeue_pushable_dl_task() once, without any of the side effects.
> >
> > Replace the workaround with per-class direct calls:
>
> I had this question in the past and from my reading, I think we
> can safely call the dequeue_pushable*() helper (John might have some
> notes on these bits) but that direct call is just horrible!
>
> Why can't we just have a sched_class->proxy_tag_curr(rq, p) or such
> and just call it if a "sched_class" populates it?

Yeah, I agree we should use a sched_class-> method rather than doing a
big conditional tree.

> Also I forgot how the current context gets put back on the pushable list
> once the proxy is done since at that point it is just a preempted task
> on the CPU.
>
> John, do you remember how this happens?
>
> I think we might also need a proxy_untag_current() which calls into
> enqueue_pushable.*_task() which does the minimal bits of put_prev_task()
> undoing the tag and possibly queuing some balance callbacks which would
> have been skipped by set_next_task() on not seeing any pushable tasks
> at the time of donor pick?

Oh hmm, good catch! So I think you're right that right now we're
leaning on the owner to be run without a donor before we return it to
the pushable list, which isn't great.

I'll take a swing at addressing this (along with your other feedback -
this year has been hectic and I've only just now finally managed to
get back to proxy-exec upstreaming efforts!) for v25.

Thanks again for the careful review and feedback!
-john
Re: [PATCH] sched/proxy_exec: Optimize proxy_tag_curr() pushable removal
Posted by soolaugust@gmail.com 1 month ago
From: zhidao su <suzhidao@xiaomi.com>

On Wed, Mar 4, 2026 at 10:28 AM John Stultz <jstultz@google.com> wrote:
> Yeah, I agree we should use a sched_class-> method rather than doing
> a big conditional tree.
>
> Oh hmm, good catch! So I think you're right that right now we're
> leaning on the owner to be run without a donor before we return it
> to the pushable list, which isn't great.
>
> I'll take a swing at addressing this (along with your other feedback
> - this year has been hectic and I've only just now finally managed
> to get back to proxy-exec upstreaming efforts!) for v25.

Understood, and thanks to both of you for the thorough review.

The sched_class->proxy_tag_curr() direction makes sense -- it avoids
the conditional tree and keeps each class's pushable logic self-
contained. And the missing proxy_untag_current() is a real gap I
had not considered; good to know it will be addressed in v25.

I'll drop this RFC and wait for v25. Looking forward to it.
Re: [PATCH] sched/proxy_exec: Optimize proxy_tag_curr() pushable removal
Posted by John Stultz 1 month ago
On Tue, Mar 3, 2026 at 9:33 PM <soolaugust@gmail.com> wrote:
>
> From: zhidao su <suzhidao@xiaomi.com>
>
> On Wed, Mar 4, 2026 at 10:28 AM John Stultz <jstultz@google.com> wrote:
> > Yeah, I agree we should use a sched_class-> method rather than doing
> > a big conditional tree.
> >
> > Oh hmm, good catch! So I think you're right that right now we're
> > leaning on the owner to be run without a donor before we return it
> > to the pushable list, which isn't great.
> >
> > I'll take a swing at addressing this (along with your other feedback
> > - this year has been hectic and I've only just now finally managed
> > to get back to proxy-exec upstreaming efforts!) for v25.
>
> Understood, and thanks to both of you for the thorough review.
>
> The sched_class->proxy_tag_curr() direction makes sense -- it avoids
> the conditional tree and keeps each class's pushable logic self-
> contained. And the missing proxy_untag_current() is a real gap I
> had not considered; good to know it will be addressed in v25.
>
> I'll drop this RFC and wait for v25. Looking forward to it.

Thanks! I'm going to try to send just two patches to address this
issue now as an RFC, so it's not stuck behind the rest of my v25
rework.

Again, I really appreciate your submission here, the analysis of the
overhead cost you shared here is very clear and helps motivate fixing
this properly.

thanks
-john