From: Valentin Schneider <valentin.schneider@arm.com>
Proxy execution forms atomic pairs of tasks: The waiting donor
task (scheduling context) and a proxy (execution context). The
donor task, along with the rest of the blocked chain, follows
the proxy wrt CPU placement.
They can be the same task, in which case push/pull doesn't need any
modification. When they are different, however,

  FIFO1 & FIFO42:

                   ,-> RT42
                   |    | blocked-on
                   |    v
     blocked_donor |  mutex
                   |    | owner
                   |    v
                   `-- RT1

     RT1
     RT42

     CPU0             CPU1
      ^                 ^
      |                 |
   overloaded      !overloaded
   rq prio = 42    rq prio = 0

RT1 is eligible to be pushed to CPU1, but should that happen it will
"carry" RT42 along. Clearly here neither RT1 nor RT42 must be seen as
push/pullable.
Unfortunately, only the donor task is usually dequeued from the rq,
and the proxy'ed execution context (rq->curr) remains on the rq.
This can cause RT1 to be selected for migration from logic like the
rt pushable_list.
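For reference, the pushable_list logic in question boils down to
something like the following (a simplified sketch of
kernel/sched/rt.c's pick_next_pushable_task(), details trimmed):

/*
 * push_rt_task() migrates the first entry of the rq's pushable_tasks
 * plist to a less loaded CPU. A proxy left enqueued (RT1 above) stays
 * on that list and can be picked here, dragging its donor (RT42) along.
 */
static struct task_struct *pick_next_pushable_task(struct rq *rq)
{
	if (!has_pushable_tasks(rq))
		return NULL;

	return plist_first_entry(&rq->rt.pushable_tasks,
				 struct task_struct, pushable_tasks);
}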
Thus, add a dequeue/enqueue cycle on the proxy task before __schedule
returns, which allows the sched class logic to avoid adding the now
current task to the pushable_list.
Furthermore, tasks becoming blocked on a mutex don't need an explicit
dequeue/enqueue cycle to be made (push/pull)able: they have to be running
to block on a mutex, thus they will eventually hit put_prev_task().
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v3:
* Tweaked comments & commit message
v5:
* Minor simplifications to utilize the fix earlier
in the patch series.
* Rework the wording of the commit message to match selected/
proxy terminology and expand a bit to make it more clear how
it works.
v6:
* Dropped now-unused proxied value, to be re-added later in the
series when it is used, as caught by Dietmar
v7:
* Unused function argument fixup
* Commit message nit pointed out by Metin Kaya
* Dropped unproven unlikely() and use sched_proxy_exec()
in proxy_tag_curr, suggested by Metin Kaya
v8:
* More cleanups and typo fixes suggested by Metin Kaya
v11:
* Cleanup of commit message suggested by Metin
v12:
* Rework for rq_selected -> rq->donor renaming
---
kernel/sched/core.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b4f7b14f62a24..3596244f613f8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6722,6 +6722,23 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 }
 #endif /* SCHED_PROXY_EXEC */
 
+static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
+{
+	if (!sched_proxy_exec())
+		return;
+	/*
+	 * pick_next_task() calls set_next_task() on the chosen task
+	 * at some point, which ensures it is not push/pullable.
+	 * However, the chosen/donor task *and* the mutex owner form an
+	 * atomic pair wrt push/pull.
+	 *
+	 * Make sure owner we run is not pushable. Unfortunately we can
+	 * only deal with that by means of a dequeue/enqueue cycle. :-/
+	 */
+	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
+	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
+}
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6856,6 +6873,10 @@ static void __sched notrace __schedule(int sched_mode)
 		 * changes to task_struct made by pick_next_task().
 		 */
 		RCU_INIT_POINTER(rq->curr, next);
+
+		if (!task_current_donor(rq, next))
+			proxy_tag_curr(rq, next);
+
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -6890,6 +6911,10 @@ static void __sched notrace __schedule(int sched_mode)
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
+		/* In case next was already curr but just got blocked_donor */
+		if (!task_current_donor(rq, next))
+			proxy_tag_curr(rq, next);
+
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq);
 		raw_spin_rq_unlock_irq(rq);
--
2.49.0.rc0.332.g42c0ae87b1-goog
On Wed, Mar 12, 2025 at 03:11:36PM -0700, John Stultz wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b4f7b14f62a24..3596244f613f8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6722,6 +6722,23 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> }
> #endif /* SCHED_PROXY_EXEC */
>
> +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
> +{
> + if (!sched_proxy_exec())
> + return;
> + /*
> + * pick_next_task() calls set_next_task() on the chosen task
> + * at some point, which ensures it is not push/pullable.
> + * However, the chosen/donor task *and* the mutex owner form an
> + * atomic pair wrt push/pull.
> + *
> + * Make sure owner we run is not pushable. Unfortunately we can
> + * only deal with that by means of a dequeue/enqueue cycle. :-/
> + */
> + dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> + enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> +}
So this is probably fine at this point; but we should eventually look at
fixing this.
We can probably look at (ab)using sched_class::set_cpus_allowed() for
this.
Hello John, Peter,
On 3/17/2025 7:37 PM, Peter Zijlstra wrote:
> On Wed, Mar 12, 2025 at 03:11:36PM -0700, John Stultz wrote:
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index b4f7b14f62a24..3596244f613f8 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6722,6 +6722,23 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>> }
>> #endif /* SCHED_PROXY_EXEC */
>>
>> +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
>> +{
>> + if (!sched_proxy_exec())
>> + return;
>> + /*
>> + * pick_next_task() calls set_next_task() on the chosen task
>> + * at some point, which ensures it is not push/pullable.
>> + * However, the chosen/donor task *and* the mutex owner form an
>> + * atomic pair wrt push/pull.
>> + *
>> + * Make sure owner we run is not pushable. Unfortunately we can
>> + * only deal with that by means of a dequeue/enqueue cycle. :-/
>> + */
>> + dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
>> + enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
>> +}
>
> So this is probably fine at this point; but we should eventually look at
> fixing this.
>
> We can probably look at (ab)using sched_class::set_cpus_allowed() for
> this.
Isn't the net result in this case just "dequeue_pushable_task()"?
Can it be defined as a class callback to avoid a full dequeue + requeue
overhead for the fair class during proxy?
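Roughly something like the below, i.e. a hypothetical sketch -- the
hook name and wiring are made up here, not taken from the posted
series:

static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
{
	if (!sched_proxy_exec())
		return;

	/*
	 * Hypothetical per-class hook: RT (and DL) could wire this up
	 * to their existing pushable-list helpers (for RT, effectively
	 * dequeue_pushable_task()), while fair has no push/pull
	 * machinery and leaves the hook NULL, skipping the full
	 * dequeue/enqueue cycle entirely.
	 */
	if (owner->sched_class->dequeue_pushable)
		owner->sched_class->dequeue_pushable(rq, owner);
}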
--
Thanks and Regards,
Prateek
Hello John,
On 3/13/2025 3:41 AM, John Stultz wrote:
[..snip..]
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b4f7b14f62a24..3596244f613f8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6722,6 +6722,23 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> }
> #endif /* SCHED_PROXY_EXEC */
>
> +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
> +{
> + if (!sched_proxy_exec())
> + return;
> + /*
> + * pick_next_task() calls set_next_task() on the chosen task
> + * at some point, which ensures it is not push/pullable.
> + * However, the chosen/donor task *and* the mutex owner form an
> + * atomic pair wrt push/pull.
> + *
> + * Make sure owner we run is not pushable. Unfortunately we can
> + * only deal with that by means of a dequeue/enqueue cycle. :-/
> + */
> + dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> + enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> +}
> +
> /*
> * __schedule() is the main scheduler function.
> *
> @@ -6856,6 +6873,10 @@ static void __sched notrace __schedule(int sched_mode)
> * changes to task_struct made by pick_next_task().
> */
> RCU_INIT_POINTER(rq->curr, next);
> +
> + if (!task_current_donor(rq, next))
> + proxy_tag_curr(rq, next);
I don't see any dependency on rq->curr for the task_current_donor() check.
Could this check be moved outside of the if-else block to avoid
duplicating in both places since rq_set_donor() was called just after
pick_next_task() or am I missing something?
> +
> /*
> * The membarrier system call requires each architecture
> * to have a full memory barrier after updating
> @@ -6890,6 +6911,10 @@ static void __sched notrace __schedule(int sched_mode)
> /* Also unlocks the rq: */
> rq = context_switch(rq, prev, next, &rf);
> } else {
> + /* In case next was already curr but just got blocked_donor */
> + if (!task_current_donor(rq, next))
> + proxy_tag_curr(rq, next);
> +
> rq_unpin_lock(rq, &rf);
> __balance_callbacks(rq);
> raw_spin_rq_unlock_irq(rq);
--
Thanks and Regards,
Prateek
On Fri, Mar 14, 2025 at 1:40 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 3/13/2025 3:41 AM, John Stultz wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index b4f7b14f62a24..3596244f613f8 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6722,6 +6722,23 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> > }
> > #endif /* SCHED_PROXY_EXEC */
> >
> > +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
> > +{
> > + if (!sched_proxy_exec())
> > + return;
> > + /*
> > + * pick_next_task() calls set_next_task() on the chosen task
> > + * at some point, which ensures it is not push/pullable.
> > + * However, the chosen/donor task *and* the mutex owner form an
> > + * atomic pair wrt push/pull.
> > + *
> > + * Make sure owner we run is not pushable. Unfortunately we can
> > + * only deal with that by means of a dequeue/enqueue cycle. :-/
> > + */
> > + dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> > + enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> > +}
> > +
> > /*
> > * __schedule() is the main scheduler function.
> > *
> > @@ -6856,6 +6873,10 @@ static void __sched notrace __schedule(int sched_mode)
> > * changes to task_struct made by pick_next_task().
> > */
> > RCU_INIT_POINTER(rq->curr, next);
> > +
> > + if (!task_current_donor(rq, next))
> > + proxy_tag_curr(rq, next);
>
> I don't see any dependency on rq->curr for the task_current_donor() check.
> Could this check be moved outside of the if-else block to avoid
> duplicating in both places since rq_set_donor() was called just after
> pick_next_task() or am I missing something?
So this check is just looking to see if next is the same as the
selected rq->donor (what pick_next_task() chose).
If so, nothing to do, same as always.
But if not (so we are proxying in this case), we need to call
proxy_tag_curr() because we have to make sure both the donor and the
proxy are not on a sched-classes pushable list.
This is because the logic around pick_next_task() calls
set_next_task() on the returned donor task, and in the sched-class
code, (for example RT) that logic will remove the chosen donor task
from the pushable list.
But when we find a proxy task to run on behalf of the donor, the
problem is that the proxy might be on the sched-class' pushable list.
So if we are proxying, we do a dequeue and enqueue pair, which allows
us to re-evaluate if the task is rq->curr, which will prevent it from
being added to any such pushable list. This avoids the potential of
the balance callbacks trying to migrate the rq->curr under us.
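For reference, the enqueue-side check that the cycle re-runs looks
roughly like the below -- paraphrased from the rt.c bits later in the
series, so treat the exact condition as illustrative:

static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
	/* ... clock update, enqueue_rt_entity(), etc. ... */

	/*
	 * Don't mark the task pushable if it is now serving as the
	 * rq's execution (rq->curr) or scheduling (rq->donor) context;
	 * the dequeue/enqueue cycle in proxy_tag_curr() exists
	 * precisely to re-run this check once rq->curr has switched to
	 * the proxy.
	 */
	if (!task_current(rq, p) && !task_current_donor(rq, p) &&
	    p->nr_cpus_allowed > 1)
		enqueue_pushable_task(rq, p);
}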
Thanks so much for the review and the question! Let me know if that
makes any more sense, or if you have suggestions on how I could better
explain it in the commit message to help.
Appreciate it!
-john
Hello John,
On 3/15/2025 10:40 AM, John Stultz wrote:
[..snip..]
>>> @@ -6856,6 +6873,10 @@ static void __sched notrace __schedule(int sched_mode)
>>> * changes to task_struct made by pick_next_task().
>>> */
>>> RCU_INIT_POINTER(rq->curr, next);
>>> +
>>> + if (!task_current_donor(rq, next))
>>> + proxy_tag_curr(rq, next);
>>
>> I don't see any dependency on rq->curr for the task_current_donor() check.
>> Could this check be moved outside of the if-else block to avoid
>> duplicating in both places since rq_set_donor() was called just after
>> pick_next_task() or am I missing something?
>
> So this check is just looking to see if next is the same as the
> selected rq->donor (what pick_next_task() chose).
>
> If so, nothing to do, same as always.
>
> But if not (so we are proxying in this case), we need to call
> proxy_tag_curr() because we have to make sure both the donor and the
> proxy are not on a sched-classes pushable list.
>
> This is because the logic around pick_next_task() calls
> set_next_task() on the returned donor task, and in the sched-class
> code, (for example RT) that logic will remove the chosen donor task
> from the pushable list.
>
> But when we find a proxy task to run on behalf of the donor, the
> problem is that the proxy might be on the sched-class' pushable list.
> So if we are proxying, we do a dequeue and enqueue pair, which allows
> us to re-evaluate if the task is rq->curr, which will prevent it from
> being added to any such pushable list. This avoids the potential of
> the balance callbacks trying to migrate the rq->curr under us.
>
> Thanks so much for the review and the question! Let me know if that
> makes any more sense, or if you have suggestions on how I could better
> explain it in the commit message to help.
Thanks a ton for clarifying. I found the enqueue_task_rt() bits from
Patch 5 and then it made sense.
P.S. Could the enqueue_task_rt() bits be moved to this patch, since
they fit here better?
I couldn't see a dependency for the enqueue bits in Patch 5: on
finding a "blocked_on" task, the logic simply dequeues it, and since
proxy_resched_idle() will nuke the rq->{curr,donor} reference before
that, it should be safe to move those bits here unless I missed
something again :)
>
> Appreciate it!
> -john
--
Thanks and Regards,
Prateek