Hey All,
Yet another iteration on the next chunk of the Proxy Exec
series: Simple Donor Migration
This is just the next step for Proxy Execution, to allow us to
migrate blocked donors across runqueues to boost remote lock
owners.
As always, I’m trying to submit this larger work in smallish
digestible pieces. In this portion of the series, I’m only
submitting for review and consideration some recent fixups, along
with the logic that allows us to do donor (blocked waiter)
migration. This requires some additional changes to locking and
extra state tracking to ensure we don’t accidentally run a
migrated donor on a cpu it isn’t affined to, as well as some
extra handling to deal with balance callback state that needs to
be reset when we decide to pick a different task after doing
donor migration.
I really want to share my appreciation for feedback provided by
Peter, K Prateek and Juri on the last revision!
New in this iteration:
* Fix missed balancing opportunity that K Prateek noticed
* Fix bug in pick_next_pushable_task_dl() that Juri noticed
* Use guard() in attach_one_task() as suggested by K Prateek
* Add context analysis annotations, as suggested by Peter
* Introduce proxy_release/reacquire_rq_lock() helpers as
suggested by Peter
* Rework comments and logic in numerous places to address
feedback from Peter
* Mark tasks PROXY_WAKING if try_to_block_task() fails due to a
signal, as noted by K Prateek
I’d love to get further feedback on any place where these
patches are confusing, or could use additional clarifications.
There have also been some further improvements in the full Proxy
Execution series:
* Tweaks to proxy_needs_return() suggested by K Prateek
* Additional tweaks to address concern about signal edge cases
from K Prateek
I’d appreciate any testing or comments that folks have with
the full set!
You can find the full Proxy Exec series here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v26-7.0-rc5/
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v26-7.0-rc5
Issues still to address with the full series:
* Resolve a regression in the later optimized donor-migration
changes when combined with the “Fix 'stuck' dl_server” change in 6.19
* With the full series against 7.0-rc3, when doing heavy stress
testing, I’m occasionally hitting crashes due to a null return
from __pick_eevdf(). Need to dig into this and find out why it
doesn’t happen against 6.18
* Continue working to get sched_ext to be ok with Proxy
Execution enabled.
* Reevaluate performance regression K Prateek Nayak found with
the full series.
* The chain migration functionality needs further iterations and
better validation to ensure it truly maintains the RT/DL load
balancing invariants (despite this being broken in vanilla
upstream with RT_PUSH_IPI currently)
Future work:
* Expand to more locking primitives: Figuring out pi-futexes
would be good; using proxy for Binder PI is something else
we’re exploring.
* Eventually: Work to replace rt_mutexes and get things happy
with PREEMPT_RT
I’d love any feedback or review thoughts on the full series as
well. I’m trying to keep the chunks small, reviewable and
iteratively testable, but if you have any suggestions on how to
improve the larger series, I’m all ears.
Credit/Disclaimer:
---------------------
As always, this Proxy Execution series has a long history with
lots of developers that deserve credit:
First described in a paper[1] by Watkins, Straub, and Niehaus, it
then grew from patches from Peter Zijlstra, extended with lots of
work by Juri Lelli, Valentin Schneider, and Connor O'Brien (and
thank you to Steven Rostedt for providing additional details
here!).
Thanks also to Joel Fernandes, Dietmar Eggemann, Metin Kaya,
K Prateek Nayak and Suleiman Souhlal for their substantial
review, suggestions, and patch contributions.
So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are surely
mine.
Thanks so much!
-john
[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
John Stultz (10):
sched: Make class_schedulers avoid pushing current, and get rid of
proxy_tag_curr()
sched: Minimise repeated sched_proxy_exec() checking
sched: Fix potentially missing balancing with Proxy Exec
locking: Add task::blocked_lock to serialize blocked_on state
sched: Fix modifying donor->blocked_on without proper locking
sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy
return-migration
sched: Add assert_balance_callbacks_empty helper
sched: Add logic to zap balance callbacks if we pick again
sched: Move attach_one_task and attach_task helpers to sched.h
sched: Handle blocked-waiter migration (and return migration)
include/linux/sched.h | 91 ++++++----
init/init_task.c | 1 +
kernel/fork.c | 1 +
kernel/locking/mutex-debug.c | 4 +-
kernel/locking/mutex.c | 40 +++--
kernel/locking/mutex.h | 6 +
kernel/locking/ww_mutex.h | 16 +-
kernel/sched/core.c | 328 +++++++++++++++++++++++++++++------
kernel/sched/deadline.c | 18 +-
kernel/sched/fair.c | 26 ---
kernel/sched/rt.c | 15 +-
kernel/sched/sched.h | 32 +++-
12 files changed, 442 insertions(+), 136 deletions(-)
--
2.53.0.1018.g2bb0e51243-goog
Hello John,
On 3/25/2026 12:43 AM, John Stultz wrote:
> I really want to share my appreciation for feedback provided by
> Peter, K Prateek and Juri on the last revision!
And we really appreciate you working on this! (Cannot state this
enough)
> There have also been some further improvements in the full Proxy
> Execution series:
> * Tweaks to proxy_needs_return() suggested by K Prateek
To answer your question on v25, I finally seem to have
ttwu_state_match() happy with the pieces in:
https://github.com/kudureranganath/linux/commits/kudure/sched/proxy/ttwu_state_match/
The base rationale is still the same from
https://lore.kernel.org/lkml/eccf9bb5-8455-48e5-aa35-4878c25f6822@amd.com/
tl;dr
Use rq_lock() to serialize clearing of "p->blocked_on". All the other
transitions change "p->blocked_on" between non-NULL values. Exploit
this to use ttwu_state_match() + ttwu_runnable() in our favor when
waking up blocked donors and handling their return migration.
These are the commits of interest on the tree with a brief explanation:
I added back proxy_reset_donor() for the sake of testing now that a
bunch of other bits are addressed in v26. I've mostly been testing at
the below commit (this time with LOCKDEP enabled):
82a29c2ecd4b sched/core: Reset the donor to current task when donor is woken
...
5dc4507b1f04 [ANNOTATION] === proxy donor/blocked-waiter migration before this point ===
Above range which has you as the author has not been touched - same as
what you have on your proxy-exec-v26-7.0-rc5 branch.
I did not tackle the sleeping owner bits yet because there are too
many locks, lists, and synchronization nuances that I still need to
wrap my head around. That said ...
The commits below make ttwu_state_match() sufficient to handle
the return migration, which allows for using wake_up_process(). The
patches are small, and the major ones should have enough rationale
in the comments and the commit message to justify the changes.
0b3810f43c66 sched/core: Simplify proxy_force_return()
609c41b77eaf sched/core: Remove proxy_task_runnable_but_waking()
157721338332 sched/core: Prepare proxy_deactivate() to comply with ttwu state machinery
abefa0729920 sched/core: Allow callers of try_to_block_task() to handle "blocked_on" relation
Only change to below was conflict resolution as a result of some
re-arrangement.
787b078b588f sched: Handle blocked-waiter migration (and return migration)
These are a few changes to proxy_needs_return() exploiting the idea
of "p->blocked_on" being only cleared under the rq_lock:
84a2b581dfe8 sched/core: Remove "p->wake_cpu" constraint in proxy_needs_return()
c52d51d67452 sched/core: Handle "blocked_on" clearing for wakeups in ttwu_runnable()
I just moved this further because I think it is an important bit to
handle the return migration vs wakeup of blocked donor. This too
has only been modified to resolve conflicts and nothing more.
9db85fb35c22 sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
These are two small trivial fixes - one that already exists in your
tree and is required for using proxy_resched_idle() from
proxy_needs_return() and the other to clear "p->blocked_on" when a
wakeup races with __schedule():
0d6a01bb19db sched/core: Clear "blocked_on" relation if schedule races with wakeup
fd60c48f7b71 sched: Avoid donor->sched_class->yield_task() null traversal
Everything before this is same as what is in your tree.
The bottom ones have the most information and the commit messages
get briefer as we move to the top, but I believe there is enough
context in comments + commit log to justify these changes. Some may
actually have too much context, but I've dumped my head out.
I'll freeze this branch and use a WIP like yours if and when I
manage to crash and burn these bits.
I know you are already oversubscribed so please take a look on a
best effort basis. I can also resend this as a separate series
once v26 lands if there is enough interest around.
Sorry for the dump and thank you again for patiently working on
this. Much appreciated _/\_
--
Thanks and Regards,
Prateek
On Wed, Mar 25, 2026 at 3:52 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 3/25/2026 12:43 AM, John Stultz wrote:
> > There have also been some further improvements in the full Proxy
> > Execution series:
> > * Tweaks to proxy_needs_return() suggested by K Prateek
>
> To answer your question on v25, I finally seem to have
> ttwu_state_match() happy with the pieces in:
> https://github.com/kudureranganath/linux/commits/kudure/sched/proxy/ttwu_state_match/
>
> The base rationale is still the same from
> https://lore.kernel.org/lkml/eccf9bb5-8455-48e5-aa35-4878c25f6822@amd.com/

So thank you so much for sharing this tree! It's definitely helpful
and better shows how to split up the larger proposal you had.

I've been occupied chasing the null __pick_eevdf() return issue
(which I've now tripped without my proxy changes, so it's an
upstream thing, but I'd still like to bisect it down), along with
other items, so I've not yet been able to fully ingest your
changes. I did run some testing on them and didn't see any
immediate issues (other than the null __pick_eevdf() issue, which
limits the testing time to ~4 hours), and I even ran it along with
the sleeping owner enqueuing change on top, which had been giving
me grief in earlier attempts to integrate these suggestions. So
that's good!

My initial/brief reactions looking through your series:

* sched/core: Clear "blocked_on" relation if schedule races with wakeup

At first glance, this makes me feel nervous, because clearing the
blocked_on value has long been a source of bugs in the development
of the proxy series, as the task might have been proxy-migrated to
a cpu where it can't run. That's why my mental rules tend towards
doing the clearing in a few places and setting PROXY_WAKING in
most cases (so we're sure to evaluate the task before letting it
run). My earlier logic of keeping blocked_on_state separate from
blocked_on was trying to make these rules as obvious as possible,
and since consolidating them I still get mentally muddied at times
- ie, we probably don't need to be clearing blocked_on in the
mutex lock paths anymore, but the symmetry is a little helpful to
me.

But the fact that you're clearing the state on prev here, and at
that point prev is current, saves it, since current can obviously
run on this cpu. So it probably just needs a comment to that
effect.

* sched/core: Handle "blocked_on" clearing for wakeups in ttwu_runnable()

Mostly looks sane to me (though I still have some hesitancy about
dropping the set_task_blocked_on_waking() bit)

* sched/core: Remove "p->wake_cpu" constraint in proxy_needs_return()

Yeah, that's a sound call, the shortcut isn't necessary and just
adds complexity.

* sched/core: Allow callers of try_to_block_task() to handle
"blocked_on" relation

Seems like it could be pulled up earlier in the series? (with your
first change)

* sched/core: Prepare proxy_deactivate() to comply with ttwu state machinery

This one I've not totally gotten my head around, still. The
"WRITE_ONCE(p->__state, TASK_RUNNING);" in find_proxy_task() feels
wrong, as it looks like we're overriding what ttwu should be
handling. But again, this is only done on current, so it's
probably ok. Similarly, the clear_task_blocked_on() in
proxy_deactivate() doesn't make it clear how we ensure we're not
proxy-migrated, and the clear_task_blocked_on() in __block_task()
feels wrong to me, as I think we will need that for sleeping owner
enqueuing.

But again, this didn't crash (at least right away), so it may just
be that I've not fit it into my mental model yet and I'll get it
eventually.

* sched/core: Remove proxy_task_runnable_but_waking()

Looks lovely, but obviously depends on the previous changes.

* sched/core: Simplify proxy_force_return()

Again, I really like how much that simplifies the logic! But I'm
hesitant, as my previous attempts to do similar didn't work, and
it seems it depends on the ttwu state machinery change I've not
fully understood.

* sched/core: Reset the donor to current task when donor is woken

Looks nice! I fret there may be some subtlety I'm missing, but
once I get some confidence in it, I'll be happy to have it.

Anyway, apologies I've not had more time to spend on your feedback
yet. I was hoping to start integrating and folding in your
proposed changes for another revision (if you are ok with that - I
can keep them separate as well, but it feels like more churn for
reviewers), but with Peter sounding like he's in-progress on
queueing the current set (with modifications), I want to wait to
see if we should just work this out on top of what he has (which
I'm fine with).

As always, many many thanks for your time and feedback here! I
really appreciate your contributions to this effort!

-john
Hello John,
On 3/28/2026 12:40 AM, John Stultz wrote:
> On Wed, Mar 25, 2026 at 3:52 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 3/25/2026 12:43 AM, John Stultz wrote:
>>> There have also been some further improvements in the full Proxy
>>> Execution series:
>>> * Tweaks to proxy_needs_return() suggested by K Prateek
>>
>> To answer your question on v25, I finally seem to have
>> ttwu_state_match() happy with the pieces in:
>> https://github.com/kudureranganath/linux/commits/kudure/sched/proxy/ttwu_state_match/
>>
>> The base rationale is still the same from
>> https://lore.kernel.org/lkml/eccf9bb5-8455-48e5-aa35-4878c25f6822@amd.com/
>
> So thank you so much for sharing this tree! It's definitely helpful
> and better shows how to split up the larger proposal you had.
>
> I've been occupied chasing the null __pick_eevdf() return issue (which
> I've now tripped without my proxy changes, so it's an upstream thing
> but I'd still like to bisect it down),
Is the __pick_eevdf() returning NULL on tip:sched/core or is this on
mainline?
_Insert GTA San Andreas "Here we go again." meme here_
> along with other items, so I've
> not yet been able to fully ingest your changes. I did run some testing
> on them and didn't see any immediate issues (other than the null
> __pick_eevdf() issue, which limits the testing time to ~4 hours), and
> I even ran it along with the sleeping owner enqueuing change on top
> which had been giving me grief in earlier attempts to integrate these
> suggestions. So that's good!
>
> My initial/brief reactions looking through your series:
>
> * sched/core: Clear "blocked_on" relation if schedule races with wakeup
>
> At first glance, this makes me feel nervous because clearing the
> blocked_on value has long been a source of bugs in the development of
> the proxy series, as the task might have been proxy-migrated to a cpu
> where it can't run. That's why my mental rules tend towards doing the
> clearing in a few places and setting PROXY_WAKING in most cases (so
> we're sure to evaluate the task before letting it run). My earlier
> logic of keeping blocked_on_state separate from blocked_on was trying
> to make these rules as obvious as possible, and since consolidating
> them I still get mentally muddied at times - ie, we probably don't
> need to be clearing blocked_on in the mutex lock paths anymore, but
> the symmetry is a little helpful to me.
>
> But the fact that you're clearing the state on prev here, and at that
> point prev is current saves it, since current can obviously run on
> this cpu. So probably just needs a comment to that effect.
Ack!
>
> * sched/core: Handle "blocked_on" clearing for wakeups in ttwu_runnable()
>
> Mostly looks sane to me (though I still have some hesitancy about
> dropping the set_task_blocked_on_waking() bit)
>
> * sched/core: Remove "p->wake_cpu" constraint in proxy_needs_return()
>
> Yeah, that's a sound call, the shortcut isn't necessary and just adds
> complexity.
>
> * sched/core: Allow callers of try_to_block_task() to handle
> "blocked_on" relation
>
> Seems like it could be pulled up earlier in the series? (with your first change)
>
> * sched/core: Prepare proxy_deactivate() to comply with ttwu state machinery
>
> This one I've not totally gotten my head around, still. The
> "WRITE_ONCE(p->__state, TASK_RUNNING);" in find_proxy_task() feels
> wrong, as it looks like we're overriding what ttwu should be handling.
So the reason for that is, we can have:
CPU0 (owner - A)                        CPU1 (donor - B)
================                        ================
mutex_unlock(M)
  atomic_long_try_cmpxchg_release()     /* B is just trying to acquire the mutex. */
  ...                                   schedule() /* prev = B, next = B; B is blocked on A */
                                          find_proxy_task()
                                            ...
                                            owner = __mutex_owner(M);
                                            if (!owner && task_current(rq, B))
                                              __clear_task_blocked_on(p, NULL)
                                              return B
  __set_task_blocked_on_waking(B, M);   ... /* B starts running without TASK_RUNNING. */
  /* nop since !B->blocked_on */
  /*
   * Scenario 1 - B gets mutex and then sets
   * TASK_RUNNING on its own.
   */
  /* Scenario 2 */
  wake_q_add(B)
  wake_up_process()
    ttwu_state_match() /* true */
    B->__state = TASK_RUNNING;
So in either case, task will wake up and set TASK_RUNNING so we
can just do the pending bits of wakeup in __schedule(). I think
even without an explicit TASK_RUNNING it should be fine but I
need to jog my memory on why I added that (maybe for caution).
If the task fails to acquire the mutex, it'll reset to blocked
state and go into schedule(), and everything should just work
out fine.
> But again, this is only done on current, so it's probably ok.
> Similarly the clear_task_blocked_on() in proxy_deactivate() doesn't
> make it clear how we ensure we're not proxy-migrated,
So the rationale behind that was, we should *never* hit that
condition, but if we do, perhaps we can simply do a
move_queued_task() back to "wake_cpu" to ensure correctness?
> and the
> clear_task_blocked_on() in __block_task() feels wrong to me, as I
> think we will need that for sleeping owner enqueuing.
Yes, for sleeping owner that is not the ideal place - I completely
agree with that. Let me go stare at that and find a better place
to put it.
>
> But again, this didn't crash (at least right away), so it may just be
> I've not fit it into my mental model yet and I'll get it eventually.
Yeah, but then you lose the "blocked_on" chain when you deactivate
the donors, only for it to be reconstructed by running the task
for a little bit and re-establishing that relation, so although
it might not have crashed (yet!), it is pretty inefficient.
I'll go stare more at that.
>
> * sched/core: Remove proxy_task_runnable_but_waking()
>
> Looks lovely, but obviously depends on the previous changes.
>
> * sched/core: Simplify proxy_force_return()
>
> Again, I really like how much that simplifies the logic! But I'm
> hesitant as my previous attempts to do similar didn't work, and it
> seems it depends on the ttwu state machinery change I've not fully
> understood.
Highly intertwined indeed! I'll try to add more comments and improve
the commit messages.
>
> * sched/core: Reset the donor to current task when donor is woken
>
> Looks nice! I fret there may be some subtlety I'm missing, but once I
> get some confidence in it, I'll be happy to have it.
Ack! I too will keep testing. Btw, do you have something that stresses
the deadline bits? I can't seem to reliably get something running with
a lot of preemptions when holding mutexes.
>
> Anyway, apologies I've not had more time to spend on your feedback
> yet. I was hoping to start integrating and folding in your proposed
> changes for another revision (if you are ok with that - I can keep
> them separate as well, but it feels like more churn for reviewers),
> but with Peter sounding like he's in-progress on queueing the current
> set (with modifications), I want to wait to see if we should just work
> this out on top of what he has (which I'm fine with).
Ack! None of this is strictly necessary until we get to ttwu handling
the return migration so it should be okay. If you are occupied, I can
test and send these changes on top separately too to ease some load.
>
> As always, many many thanks for your time and feedback here! I really
> appreciate your contributions to this effort!
And thanks a ton for looking at the tree. Much appreciated _/\_
--
Thanks and Regards,
Prateek
On Wed, Mar 25, 2026 at 04:22:14PM +0530, K Prateek Nayak wrote:
I tried to have a quick look, but I find it *very* hard to make sense of
the differences.
(could be I just don't know how to operate github -- that seems a
recurrent theme)
Anyway, this:
> fd60c48f7b71 sched: Avoid donor->sched_class->yield_task() null traversal
That seems *very* dodgy indeed. Exposing idle as the donor seems ... wrong?
Anyway, you seem to want to drive the return migration from the regular
wakeup path and I don't mind doing that, provided it isn't horrible. But
we can do this on top of these patches, right?
That is, I'm thinking of taking these patches, they're in reasonable
shape, and John deserves a little progress :-)
I did find myself liking the below a little better, but I'll just sneak
that in.
---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6822,7 +6822,6 @@ static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
__must_hold(__rq_lockp(rq))
{
- enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
struct task_struct *owner = NULL;
bool curr_in_chain = false;
int this_cpu = cpu_of(rq);
@@ -6838,8 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
clear_task_blocked_on(p, PROXY_WAKING);
return p;
}
- action = NEEDS_RETURN;
- break;
+ goto force_return;
}
/*
@@ -6874,16 +6872,14 @@ find_proxy_task(struct rq *rq, struct ta
__clear_task_blocked_on(p, NULL);
return p;
}
- action = NEEDS_RETURN;
- break;
+ goto force_return;
}
if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
/* XXX Don't handle blocked owners/delayed dequeue yet */
if (curr_in_chain)
return proxy_resched_idle(rq);
- action = DEACTIVATE_DONOR;
- break;
+ goto deactivate;
}
owner_cpu = task_cpu(owner);
@@ -6894,8 +6890,7 @@ find_proxy_task(struct rq *rq, struct ta
*/
if (curr_in_chain)
return proxy_resched_idle(rq);
- action = MIGRATE;
- break;
+ goto migrate_task;
}
if (task_on_rq_migrating(owner)) {
@@ -6952,26 +6947,20 @@ find_proxy_task(struct rq *rq, struct ta
* guarantee its existence, as per ttwu_remote().
*/
}
-
- /* Handle actions we need to do outside of the guard() scope */
- switch (action) {
- case DEACTIVATE_DONOR:
- if (proxy_deactivate(rq, donor))
- return NULL;
- /* If deactivate fails, force return */
- p = donor;
- fallthrough;
- case NEEDS_RETURN:
- proxy_force_return(rq, rf, p);
- return NULL;
- case MIGRATE:
- proxy_migrate_task(rq, rf, p, owner_cpu);
- return NULL;
- case FOUND:
- /* fallthrough */;
- }
WARN_ON_ONCE(owner && !owner->on_rq);
return owner;
+
+deactivate:
+ if (proxy_deactivate(rq, donor))
+ return NULL;
+ /* If deactivate fails, force return */
+ p = donor;
+force_return:
+ proxy_force_return(rq, rf, p);
+ return NULL;
+migrate_task:
+ proxy_migrate_task(rq, rf, p, owner_cpu);
+ return NULL;
}
#else /* SCHED_PROXY_EXEC */
static struct task_struct *
On Fri, Mar 27, 2026 at 4:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Anyway, you seem to want to drive the return migration from the regular
> wakeup path and I don't mind doing that, provided it isn't horrible. But
> we can do this on top of these patches, right?
>
> That is, I'm thinking of taking these patches, they're in reasonable
> shape, and John deserves a little progress :-)
>
> I did find myself liking the below a little better, but I'll just sneak
> that in.

I was expecting to respin with some of Prateek's and Steven's
feedback (and include your suggested switch to using goto to get
out of the locking scope), but I'd totally not object to you
taking this set (with whatever tweaks you'd prefer).

Do let me know when you can share your queue, and I'll rebase and
rework the rest of the series along with any un-integrated
feedback to this set.

thanks
-john
Hello Peter,

On 3/27/2026 5:18 PM, Peter Zijlstra wrote:
> I tried to have a quick look, but I find it *very* hard to make sense of
> the differences.

A couple of concerns I had with the current approach:

1. Why can't we simply do block_task() + wake_up_process() for return
   migration?

2. Why does proxy_needs_return() (this comes later in John's tree but
   I moved it up ahead) need the proxy_task_runnable_but_waking()
   override of the ttwu_state_match() machinery?
   (https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9)

3. How can proxy_deactivate() see a TASK_RUNNING for a blocked donor?

So I went back after my discussion with John at LPC to see if the
ttwu_state_match() stuff can be left alone, and I sent out that
incomprehensible diff on v24. Then I put a tree where my mouth is to
give better rationale behind each small hunk that was mostly in my
head until then. Voila, another (slightly less) incomprehensible set
of bite sized changes :-)

>
> Anyway, this:
>
>> fd60c48f7b71 sched: Avoid donor->sched_class->yield_task() null traversal
>
> That seems *very* dodgy indeed. Exposing idle as the donor seems ... wrong?

That should get fixed by
https://github.com/kudureranganath/linux/commit/82a29c2ecd4b5f8eb082bb6a4a647aa16a2850be

John has mentioned hitting some warnings a while back from that:
https://lore.kernel.org/lkml/f5bc87a7-390f-4e68-95b0-10cab2b92caf@amd.com/

Since v26 does proxy_resched_idle() before doing
proxy_release_rq_lock() in proxy_force_return(), that shouldn't be a
problem.

Speaking of that commit, I would like you or Juri to confirm if it is
okay to set a throttled deadline task as rq->donor for a while until
it hits resched.

>
> Anyway, you seem to want to drive the return migration from the regular
> wakeup path and I don't mind doing that, provided it isn't horrible. But
> we can do this on top of these patches, right?
>
> That is, I'm thinking of taking these patches, they're in reasonable
> shape, and John deserves a little progress :-)
>
> I did find myself liking the below a little better, but I'll just sneak
> that in.

So John originally had that, and then I saw Dan's comment in
cleanup.h that reads:

    Lastly, given that the benefit of cleanup helpers is removal of
    "goto", and that the "goto" statement can jump between scopes, the
    expectation is that usage of "goto" and cleanup helpers is never
    mixed in the same function. I.e. for a given routine, convert all
    resources that need a "goto" cleanup to scope-based cleanup, or
    convert none of them.

which can either be interpreted as "Don't do it unless you know what
you are doing" or "There is at least one compiler that will get a
goto + cleanup guard wrong", and to err on the side of caution, I
suggested we do break + enums. If there are no concerns, then the
suggested diff is indeed much better.

--
Thanks and Regards,
Prateek
On Fri, Mar 27, 2026 at 07:03:19PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 3/27/2026 5:18 PM, Peter Zijlstra wrote:
> > I tried to have a quick look, but I find it *very* hard to make sense of
> > the differences.
>
> A couple of concerns I had with the current approach:
>
> 1. Why can't we simply do block_task() + wake_up_process() for return
>    migration?

So the way things are set up now, we have the blocked task 'on_rq', so
ttwu() will take the ttwu_runnable() path, and we wake the task on the
'wrong' CPU.

At this point '->state == TASK_RUNNING' and schedule() will pick it and
... we hit '->blocked_on == PROXY_WAKING', which leads to
proxy_force_return(), which does deactivate_task()+activate_task() as
per a normal migration, and then all is well.

Right?

You're asking why proxy_force_return() doesn't use block_task()+ttwu()?
That seems really wrong at that point -- after all: '->state ==
TASK_RUNNING'.

Or; are you asking why we don't block_task() at the point where we set
'->blocked_on = PROXY_WAKING'? And then let ttwu() sort things out?

I suspect the latter is really hard to do vs lock ordering, but I've not
thought it through.

One thing you *can* do is frob ttwu_runnable() to 'refuse' to wake the
task, and then it goes into the normal path and will do the migration.
I've done things like that before.

Does that fix all the return-migration cases?

> 2. Why does proxy_needs_return() (this comes later in John's tree but
>    I moved it up ahead) need the proxy_task_runnable_but_waking()
>    override of the ttwu_state_match() machinery?
>    (https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9)

Since it comes later, I've not seen it and not given it thought ;-)

(I mean, I've probably seen it at some point, but being the gold-fish
that I am, I have no recollection, so I might as well not have seen it).

A brief look now makes me confused. The comment fails to describe how
that situation could ever come to pass.

> 3. How can proxy_deactivate() see a TASK_RUNNING for a blocked donor?

I was looking at that.. I'm not sure. I mean, having the clause doesn't
hurt, but yeah, dunno.

> Speaking of that commit, I would like you or Juri to confirm if it is
> okay to set a throttled deadline task as rq->donor for a while until
> it hits resched.

I think that should be okay.
Hello Peter,
On 3/27/2026 9:30 PM, Peter Zijlstra wrote:
> On Fri, Mar 27, 2026 at 07:03:19PM +0530, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> On 3/27/2026 5:18 PM, Peter Zijlstra wrote:
>>> I tried to have a quick look, but I find it *very* hard to make sense of
>>> the differences.
>>
>> Couple of concerns I had with the current approach is:
>>
>> 1. Why can't we simply do block_task() + wake_up_process() for return
>> migration?
>
> So the way things are set up now, we have the blocked task 'on_rq', so
> ttwu() will take ttwu_runnable() path, and we wake the task on the
> 'wrong' CPU.
>
> At this point '->state == TASK_RUNNABLE' and schedule() will pick it and
> ... we hit '->blocked_on == PROXY_WAKING', which leads to
> proxy_force_return(), which does deactivate_task()+activate_task() as
> per a normal migration, and then all is well.
>
> Right?
>
> You're asking why proxy_force_return() doesn't use block_task()+ttwu()?
> That seems really wrong at that point -- after all: '->state ==
> TASK_RUNNABLE'.
>
> Or; are you asking why we don't block_task() at the point where we set
> '->blocked_on = PROXY_WAKING'? And then let ttwu() sort things out?
>
> I suspect the latter is really hard to do vs lock ordering, but I've not
> thought it through.
So taking a step back, this is what we have today (at least the
common scenario):
CPU0 (donor - A) CPU1 (owner - B)
================ ================
mutex_lock()
__set_current_state(TASK_INTERRUPTIBLE)
__set_task_blocked_on(M)
schedule()
/* Retained for proxy */
proxy_migrate_task()
==================================> /* Migrates to CPU1 */
...
send_sig(B)
signal_wake_up_state()
wake_up_state()
try_to_wake_up()
ttwu_runnable()
ttwu_do_wakeup() =============> /* A->__state = TASK_RUNNING */
/*
* After this point ttwu_state_match()
* will fail for A so a mutex_unlock()
* will have to go through __schedule()
* for return migration.
*/
__schedule()
find_proxy_task()
/* Scenario 1 - B sleeps */
__clear_task_blocked_on()
proxy_deactivate(A)
/* A->__state == TASK_RUNNING */
/* fallthrough */
/* Scenario 2 - return migration after unlock() */
__clear_task_blocked_on()
/*
* At this point proxy stops.
* Much later after signal.
*/
proxy_force_return()
schedule() <==================================
signal_pending_state()
clear_task_blocked_on()
__set_current_state(TASK_RUNNING)
... /* return with -EINTR */
Basically, a blocked donor has to wait for a mutex_unlock() before it
can go process the signal and bail out on the mutex_lock_interruptible()
which seems counter productive - but it is still okay from correctness
perspective.
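For reference, the schedule()-side check in the flow above, signal_pending_state(), is what decides whether a pending signal interrupts the sleep. Here is a minimal userspace sketch of that logic (state bit values copied from include/linux/sched.h; the kernel's signal_pending()/__fatal_signal_pending() checks are reduced to plain booleans, so this is an illustration, not the kernel code):

```c
#include <stdbool.h>

/* Task state bits as in include/linux/sched.h (recent kernels). */
#define TASK_INTERRUPTIBLE   0x00000001
#define TASK_UNINTERRUPTIBLE 0x00000002
#define TASK_WAKEKILL        0x00000100

/*
 * Sketch of signal_pending_state(): a pending signal only interrupts a
 * sleep if the task declared itself interruptible, or killable in the
 * case of a fatal signal.
 */
static bool signal_pending_state_sketch(unsigned long state,
					bool signal_pending,
					bool fatal_signal_pending)
{
	if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
		return false;
	if (!signal_pending)
		return false;
	return (state & TASK_INTERRUPTIBLE) || fatal_signal_pending;
}
```

In the Scenario 2 flow, the donor slept in TASK_INTERRUPTIBLE, so once it finally runs schedule() again after the unlock, this check fires and mutex_lock_interruptible() bails out with -EINTR.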
>
> One thing you *can* do is frob ttwu_runnable() to 'refuse' to wake the
> task, and then it goes into the normal path and will do the migration.
> I've done things like that before.
>
> Does that fix all the return-migration cases?
Yes it does! We can handle the return via ttwu_runnable(), which is
what proxy_needs_return() in the next chunk of changes aims to do, and
build the invariant that TASK_RUNNING + task_is_blocked() is an illegal
state outside of __schedule(), which works well with ttwu_state_match().
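The invariant that TASK_RUNNING + task_is_blocked() is illegal outside of __schedule() can be expressed as a simple predicate. A hypothetical userspace sketch (the struct, field, and helper names here are stand-ins for the kernel's p->__state, p->blocked_on and task_is_blocked(), not the real definitions):

```c
#include <stdbool.h>
#include <stddef.h>

#define TASK_RUNNING       0x00000000
#define TASK_INTERRUPTIBLE 0x00000001

/* Stand-in holding just the two fields the invariant reads. */
struct task_sketch {
	unsigned int state;		/* stand-in for p->__state */
	const void *blocked_on;		/* mutex pointer, or a PROXY_WAKING marker */
};

/* Mirrors task_is_blocked(): the task still has a blocked_on relation. */
static bool task_is_blocked_sketch(const struct task_sketch *p)
{
	return p->blocked_on != NULL;
}

/*
 * The proposed invariant: outside of __schedule(), a task must never be
 * observed runnable while still blocked on a mutex.
 */
static bool proxy_state_legal(const struct task_sketch *p)
{
	return !(p->state == TASK_RUNNING && task_is_blocked_sketch(p));
}
```

The point of establishing this is that ttwu_state_match() can then rely on a blocked donor never reporting TASK_RUNNING, so the wakeup path always routes through the return-migration handling.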
>
>> 2. Why does proxy_needs_return() (this comes later in John's tree but I
>> moved it up ahead) need the proxy_task_runnable_but_waking() override
>> of the ttwu_state_mach() machinery?
>> (https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9)
>
> Since it comes later, I've not seen it and not given it thought ;-)
>
> (I mean, I've probably seen it at some point, but being the gold-fish
> that I am, I have no recollection, so I might as well not have seen it).
>
> A brief look now makes me confused. The comment fails to describe how
> that situation could ever come to pass.
That is a signal delivery happening before the unlock, which forces
TASK_RUNNING; but since we are still waiting on the unlock, the wakeup
from the unlock will see TASK_RUNNING + PROXY_WAKING.
We then later force it on the ttwu path to do return via
ttwu_runnable().
>
>> 3. How can proxy_deactivate() see a TASK_RUNNING for blocked donor?
>
> I was looking at that.. I'm not sure. I mean, having the clause doesn't
> hurt, but yeah, dunno.
Outlined in that flow above - Scenario 1.
>
>
>> Speaking of that commit, I would like you or Juri to confirm if it is
>> okay to set a throttled deadline task as rq->donor for a while until it
>> hits resched.
>
> I think that should be okay.
Good to know! Are you planning to push out the changes to queue? I can
send an RFC with the patches from my tree on top and we can perhaps
discuss it piecewise next week. Then we can decide if we want those
changes or not ;-)
--
Thanks and Regards,
Prateek
On Fri, Mar 27, 2026 at 10:27:08PM +0530, K Prateek Nayak wrote:
> So taking a step back, this is what we have today (at least the
> common scenario):
>
> CPU0 (donor - A) CPU1 (owner - B)
> ================ ================
>
> mutex_lock()
> __set_current_state(TASK_INTERRUPTIBLE)
> __set_task_blocked_on(M)
> schedule()
> /* Retained for proxy */
> proxy_migrate_task()
> ==================================> /* Migrates to CPU1 */
> ...
> send_sig(B)
> signal_wake_up_state()
> wake_up_state()
> try_to_wake_up()
> ttwu_runnable()
> ttwu_do_wakeup() =============> /* A->__state = TASK_RUNNING */
>
> /*
> * After this point ttwu_state_match()
> * will fail for A so a mutex_unlock()
> * will have to go through __schedule()
> * for return migration.
> */
>
> __schedule()
> find_proxy_task()
>
> /* Scenario 1 - B sleeps */
> __clear_task_blocked_on()
> proxy_deactivate(A)
> /* A->__state == TASK_RUNNING */
> /* fallthrough */
>
> /* Scenario 2 - return migration after unlock() */
> __clear_task_blocked_on()
> /*
> * At this point proxy stops.
> * Much later after signal.
> */
> proxy_force_return()
> schedule() <==================================
> signal_pending_state()
>
> clear_task_blocked_on()
> __set_current_state(TASK_RUNNING)
>
> ... /* return with -EINTR */
>
>
> Basically, a blocked donor has to wait for a mutex_unlock() before it
> can go process the signal and bail out on the mutex_lock_interruptible()
> which seems counter productive - but it is still okay from correctness
> perspective.
>
> >
> > One thing you *can* do is frob ttwu_runnable() to 'refuse' to wake the
> > task, and then it goes into the normal path and will do the migration.
> > I've done things like that before.
> >
> > Does that fix all the return-migration cases?
>
> Yes it does! If we handle the return via ttwu_runnable(), which is what
> proxy_needs_return() in the next chunk of changes aims to do and we can
> build the invariant that TASK_RUNNING + task_is_blocked() is an illegal
> state outside of __schedule() which works well with ttwu_state_match().
>
> >
> >> 2. Why does proxy_needs_return() (this comes later in John's tree but I
> >> moved it up ahead) need the proxy_task_runnable_but_waking() override
> >> of the ttwu_state_mach() machinery?
> >> (https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9)
> >
> > Since it comes later, I've not seen it and not given it thought ;-)
> >
> > (I mean, I've probably seen it at some point, but being the gold-fish
> > that I am, I have no recollection, so I might as well not have seen it).
> >
> > A brief look now makes me confused. The comment fails to describe how
> > that situation could ever come to pass.
>
> That is a signal delivery happening before unlock which will force
> TASK_RUNNING but since we are waiting on an unlock, the wakeup from
> unlock will see TASK_RUNNING + PROXY_WAKING.
>
> We then later force it on the ttwu path to do return via
> ttwu_runnable().
So, I've not gone through all the cases yet, and it is *COMPLETELY*
untested, but something like the below perhaps?
---
include/linux/sched.h | 2
kernel/sched/core.c | 173 ++++++++++++++++----------------------------------
2 files changed, 58 insertions(+), 117 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -161,7 +161,7 @@ struct user_event_mm;
*/
#define is_special_task_state(state) \
((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | \
- TASK_DEAD | TASK_FROZEN))
+ TASK_DEAD | TASK_WAKING | TASK_FROZEN))
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
# define debug_normal_state_change(state_value) \
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2160,8 +2160,29 @@ void deactivate_task(struct rq *rq, stru
dequeue_task(rq, p, flags);
}
-static void block_task(struct rq *rq, struct task_struct *p, int flags)
+static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
{
+ int flags = DEQUEUE_NOCLOCK;
+
+ p->sched_contributes_to_load =
+ (task_state & TASK_UNINTERRUPTIBLE) &&
+ !(task_state & TASK_NOLOAD) &&
+ !(task_state & TASK_FROZEN);
+
+ if (unlikely(is_special_task_state(task_state)))
+ flags |= DEQUEUE_SPECIAL;
+
+ /*
+ * __schedule() ttwu()
+ * prev_state = prev->state; if (p->on_rq && ...)
+ * if (prev_state) goto out;
+ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
+ * p->state = TASK_WAKING
+ *
+ * Where __schedule() and ttwu() have matching control dependencies.
+ *
+ * After this, schedule() must not care about p->state any more.
+ */
if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
__block_task(rq, p);
}
@@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
*/
static int ttwu_runnable(struct task_struct *p, int wake_flags)
{
- struct rq_flags rf;
- struct rq *rq;
- int ret = 0;
+ ACQUIRE(__task_rq_lock, guard)(p);
+ struct rq *rq = guard.rq;
- rq = __task_rq_lock(p, &rf);
- if (task_on_rq_queued(p)) {
- update_rq_clock(rq);
- if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
- if (!task_on_cpu(rq, p)) {
+ if (!task_on_rq_queued(p))
+ return 0;
+
+ if (sched_proxy_exec() && p->blocked_on) {
+ guard(raw_spinlock)(&p->blocked_lock);
+ struct mutex *lock = p->blocked_on;
+ if (lock) {
/*
- * When on_rq && !on_cpu the task is preempted, see if
- * it should preempt the task that is current now.
+ * TASK_WAKING is a special state and results in
+ * DEQUEUE_SPECIAL such that the task will actually be
+ * forced from the runqueue.
*/
- wakeup_preempt(rq, p, wake_flags);
+ block_task(rq, p, TASK_WAKING);
+ p->blocked_on = NULL;
+ return 0;
}
- ttwu_do_wakeup(p);
- ret = 1;
}
- __task_rq_unlock(rq, p, &rf);
- return ret;
+ update_rq_clock(rq);
+ if (p->se.sched_delayed)
+ enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+ if (!task_on_cpu(rq, p)) {
+ /*
+ * When on_rq && !on_cpu the task is preempted, see if
+ * it should preempt the task that is current now.
+ */
+ wakeup_preempt(rq, p, wake_flags);
+ }
+ ttwu_do_wakeup(p);
+ return 1;
}
void sched_ttwu_pending(void *arg)
@@ -6519,7 +6551,6 @@ static bool try_to_block_task(struct rq
unsigned long *task_state_p, bool should_block)
{
unsigned long task_state = *task_state_p;
- int flags = DEQUEUE_NOCLOCK;
if (signal_pending_state(task_state, p)) {
WRITE_ONCE(p->__state, TASK_RUNNING);
@@ -6539,26 +6570,7 @@ static bool try_to_block_task(struct rq
if (!should_block)
return false;
- p->sched_contributes_to_load =
- (task_state & TASK_UNINTERRUPTIBLE) &&
- !(task_state & TASK_NOLOAD) &&
- !(task_state & TASK_FROZEN);
-
- if (unlikely(is_special_task_state(task_state)))
- flags |= DEQUEUE_SPECIAL;
-
- /*
- * __schedule() ttwu()
- * prev_state = prev->state; if (p->on_rq && ...)
- * if (prev_state) goto out;
- * p->on_rq = 0; smp_acquire__after_ctrl_dep();
- * p->state = TASK_WAKING
- *
- * Where __schedule() and ttwu() have matching control dependencies.
- *
- * After this, schedule() must not care about p->state any more.
- */
- block_task(rq, p, flags);
+ block_task(rq, p, task_state);
return true;
}
@@ -6586,13 +6598,12 @@ static inline struct task_struct *proxy_
return rq->idle;
}
-static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static void proxy_deactivate(struct rq *rq, struct task_struct *donor)
{
unsigned long state = READ_ONCE(donor->__state);
- /* Don't deactivate if the state has been changed to TASK_RUNNING */
- if (state == TASK_RUNNING)
- return false;
+ WARN_ON_ONCE(state == TASK_RUNNING);
+
/*
* Because we got donor from pick_next_task(), it is *crucial*
* that we call proxy_resched_idle() before we deactivate it.
@@ -6603,7 +6614,7 @@ static bool proxy_deactivate(struct rq *
* need to be changed from next *before* we deactivate.
*/
proxy_resched_idle(rq);
- return try_to_block_task(rq, donor, &state, true);
+ block_task(rq, donor, state);
}
static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
@@ -6677,71 +6688,6 @@ static void proxy_migrate_task(struct rq
proxy_reacquire_rq_lock(rq, rf);
}
-static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
- struct task_struct *p)
- __must_hold(__rq_lockp(rq))
-{
- struct rq *task_rq, *target_rq = NULL;
- int cpu, wake_flag = WF_TTWU;
-
- lockdep_assert_rq_held(rq);
- WARN_ON(p == rq->curr);
-
- if (p == rq->donor)
- proxy_resched_idle(rq);
-
- proxy_release_rq_lock(rq, rf);
- /*
- * We drop the rq lock, and re-grab task_rq_lock to get
- * the pi_lock (needed for select_task_rq) as well.
- */
- scoped_guard (task_rq_lock, p) {
- task_rq = scope.rq;
-
- /*
- * Since we let go of the rq lock, the task may have been
- * woken or migrated to another rq before we got the
- * task_rq_lock. So re-check we're on the same RQ. If
- * not, the task has already been migrated and that CPU
- * will handle any futher migrations.
- */
- if (task_rq != rq)
- break;
-
- /*
- * Similarly, if we've been dequeued, someone else will
- * wake us
- */
- if (!task_on_rq_queued(p))
- break;
-
- /*
- * Since we should only be calling here from __schedule()
- * -> find_proxy_task(), no one else should have
- * assigned current out from under us. But check and warn
- * if we see this, then bail.
- */
- if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
- WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
- __func__, cpu_of(task_rq),
- p->comm, p->pid, p->on_cpu);
- break;
- }
-
- update_rq_clock(task_rq);
- deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
- cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
- set_task_cpu(p, cpu);
- target_rq = cpu_rq(cpu);
- clear_task_blocked_on(p, NULL);
- }
-
- if (target_rq)
- attach_one_task(target_rq, p);
-
- proxy_reacquire_rq_lock(rq, rf);
-}
-
/*
* Find runnable lock owner to proxy for mutex blocked donor
*
@@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
clear_task_blocked_on(p, PROXY_WAKING);
return p;
}
- goto force_return;
+ goto deactivate;
}
/*
@@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
__clear_task_blocked_on(p, NULL);
return p;
}
- goto force_return;
+ goto deactivate;
}
if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
@@ -6891,12 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
return owner;
deactivate:
- if (proxy_deactivate(rq, donor))
- return NULL;
- /* If deactivate fails, force return */
- p = donor;
-force_return:
- proxy_force_return(rq, rf, p);
+ proxy_deactivate(rq, donor);
return NULL;
migrate_task:
proxy_migrate_task(rq, rf, p, owner_cpu);
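To illustrate the is_special_task_state() hunk in the patch above: with TASK_WAKING added to the mask, block_task(rq, p, TASK_WAKING) sees a special state and therefore adds DEQUEUE_SPECIAL, forcing the task off the runqueue. A small standalone sketch contrasting the mask before and after the change (bit values copied from include/linux/sched.h as of recent kernels; the two helper names are mine, not the kernel's):

```c
#include <stdbool.h>

/* Task state bits as defined in include/linux/sched.h (recent kernels). */
#define TASK_INTERRUPTIBLE 0x00000001
#define __TASK_STOPPED     0x00000004
#define __TASK_TRACED      0x00000008
#define TASK_PARKED        0x00000040
#define TASK_DEAD          0x00000080
#define TASK_WAKING        0x00000200
#define TASK_FROZEN        0x00002000

/* Mask before the patch: TASK_WAKING is not a special state. */
static bool is_special_task_state_before(unsigned long state)
{
	return state & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED |
			TASK_DEAD | TASK_FROZEN);
}

/*
 * Mask after the patch: TASK_WAKING now counts as special, so
 * block_task() will pass DEQUEUE_SPECIAL for it.
 */
static bool is_special_task_state_after(unsigned long state)
{
	return state & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED |
			TASK_DEAD | TASK_WAKING | TASK_FROZEN);
}
```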
On Thu, Apr 2, 2026 at 8:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Mar 27, 2026 at 10:27:08PM +0530, K Prateek Nayak wrote:
>
> > So taking a step back, this is what we have today (at least the
> > common scenario):
> >
> > CPU0 (donor - A) CPU1 (owner - B)
> > ================ ================
> >
> > mutex_lock()
> > __set_current_state(TASK_INTERRUPTIBLE)
> > __set_task_blocked_on(M)
> > schedule()
> > /* Retained for proxy */
> > proxy_migrate_task()
> > ==================================> /* Migrates to CPU1 */
> > ...
> > send_sig(B)
> > signal_wake_up_state()
> > wake_up_state()
> > try_to_wake_up()
> > ttwu_runnable()
> > ttwu_do_wakeup() =============> /* A->__state = TASK_RUNNING */
> >
> > /*
> > * After this point ttwu_state_match()
> > * will fail for A so a mutex_unlock()
> > * will have to go through __schedule()
> > * for return migration.
> > */
> >
> > __schedule()
> > find_proxy_task()
> >
> > /* Scenario 1 - B sleeps */
> > __clear_task_blocked_on()
> > proxy_deactivate(A)
> > /* A->__state == TASK_RUNNING */
> > /* fallthrough */
> >
> > /* Scenario 2 - return migration after unlock() */
> > __clear_task_blocked_on()
> > /*
> > * At this point proxy stops.
> > * Much later after signal.
> > */
> > proxy_force_return()
> > schedule() <==================================
> > signal_pending_state()
> >
> > clear_task_blocked_on()
> > __set_current_state(TASK_RUNNING)
> >
> > ... /* return with -EINTR */
> >
> >
> > Basically, a blocked donor has to wait for a mutex_unlock() before it
> > can go process the signal and bail out on the mutex_lock_interruptible()
> > which seems counter productive - but it is still okay from correctness
> > perspective.
> >
> > >
> > > One thing you *can* do is frob ttwu_runnable() to 'refuse' to wake the
> > > task, and then it goes into the normal path and will do the migration.
> > > I've done things like that before.
> > >
> > > Does that fix all the return-migration cases?
> >
> > Yes it does! If we handle the return via ttwu_runnable(), which is what
> > proxy_needs_return() in the next chunk of changes aims to do and we can
> > build the invariant that TASK_RUNNING + task_is_blocked() is an illegal
> > state outside of __schedule() which works well with ttwu_state_match().
> >
> > >
> > >> 2. Why does proxy_needs_return() (this comes later in John's tree but I
> > >> moved it up ahead) need the proxy_task_runnable_but_waking() override
> > >> of the ttwu_state_mach() machinery?
> > >> (https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9)
> > >
> > > Since it comes later, I've not seen it and not given it thought ;-)
> > >
> > > (I mean, I've probably seen it at some point, but being the gold-fish
> > > that I am, I have no recollection, so I might as well not have seen it).
> > >
> > > A brief look now makes me confused. The comment fails to describe how
> > > that situation could ever come to pass.
> >
> > That is a signal delivery happening before unlock which will force
> > TASK_RUNNING but since we are waiting on an unlock, the wakeup from
> > unlock will see TASK_RUNNING + PROXY_WAKING.
> >
> > We then later force it on the ttwu path to do return via
> > ttwu_runnable().
>
> So, I've not gone through all the cases yet, and it is *COMPLETELY*
> untested, but something like the below perhaps?
>
So, just to clarify, is this suggestion an alternative to my
return-migration-via-ttwu logic (not included in the v26
simple-donor-migration chunk you've queued)?
https://github.com/johnstultz-work/linux-dev/commit/dfaa472f2a0b2f6a0c73083aaba5c55e256fdb56
I'm working on prepping that next chunk of the series (which includes
that logic) to send out here shortly, integrating the few bits from
Prateek that I've managed to get my head around.
There's still a bunch of other changes tied into waking a rq->donor
outside of __schedule() in that chunk, so I'm not sure if this
discussion would be easier to have in context once those are on the
list?
So some of my (certainly confused) thoughts below...
> ---
> include/linux/sched.h | 2
> kernel/sched/core.c | 173 ++++++++++++++++----------------------------------
> 2 files changed, 58 insertions(+), 117 deletions(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -161,7 +161,7 @@ struct user_event_mm;
> */
> #define is_special_task_state(state) \
> ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | \
> - TASK_DEAD | TASK_FROZEN))
> + TASK_DEAD | TASK_WAKING | TASK_FROZEN))
>
> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> # define debug_normal_state_change(state_value) \
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
...
> @@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> - struct rq_flags rf;
> - struct rq *rq;
> - int ret = 0;
> + ACQUIRE(__task_rq_lock, guard)(p);
> + struct rq *rq = guard.rq;
>
> - rq = __task_rq_lock(p, &rf);
> - if (task_on_rq_queued(p)) {
> - update_rq_clock(rq);
> - if (p->se.sched_delayed)
> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> - if (!task_on_cpu(rq, p)) {
> + if (!task_on_rq_queued(p))
> + return 0;
> +
> + if (sched_proxy_exec() && p->blocked_on) {
> + guard(raw_spinlock)(&p->blocked_lock);
> + struct mutex *lock = p->blocked_on;
> + if (lock) {
> /*
> - * When on_rq && !on_cpu the task is preempted, see if
> - * it should preempt the task that is current now.
> + * TASK_WAKING is a special state and results in
> + * DEQUEUE_SPECIAL such that the task will actually be
> + * forced from the runqueue.
> */
> - wakeup_preempt(rq, p, wake_flags);
> + block_task(rq, p, TASK_WAKING);
> + p->blocked_on = NULL;
> + return 0;
> }
> - ttwu_do_wakeup(p);
> - ret = 1;
> }
> - __task_rq_unlock(rq, p, &rf);
>
> - return ret;
> + update_rq_clock(rq);
> + if (p->se.sched_delayed)
> + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
I can't precisely remember the details now, but I believe we need to
handle enqueueing sched_delayed tasks before handling blocked_on
tasks.
> + if (!task_on_cpu(rq, p)) {
> + /*
> + * When on_rq && !on_cpu the task is preempted, see if
> + * it should preempt the task that is current now.
> + */
> + wakeup_preempt(rq, p, wake_flags);
> + }
> + ttwu_do_wakeup(p);
> + return 1;
> }
>
> void sched_ttwu_pending(void *arg)
...
> @@ -6586,13 +6598,12 @@ static inline struct task_struct *proxy_
> return rq->idle;
> }
>
> -static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +static void proxy_deactivate(struct rq *rq, struct task_struct *donor)
> {
> unsigned long state = READ_ONCE(donor->__state);
>
> - /* Don't deactivate if the state has been changed to TASK_RUNNING */
> - if (state == TASK_RUNNING)
> - return false;
> + WARN_ON_ONCE(state == TASK_RUNNING);
> +
> /*
> * Because we got donor from pick_next_task(), it is *crucial*
> * that we call proxy_resched_idle() before we deactivate it.
> @@ -6603,7 +6614,7 @@ static bool proxy_deactivate(struct rq *
> * need to be changed from next *before* we deactivate.
> */
> proxy_resched_idle(rq);
> - return try_to_block_task(rq, donor, &state, true);
> + block_task(rq, donor, state);
> }
>
> static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
> @@ -6677,71 +6688,6 @@ static void proxy_migrate_task(struct rq
> proxy_reacquire_rq_lock(rq, rf);
> }
>
> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> - struct task_struct *p)
> - __must_hold(__rq_lockp(rq))
> -{
> - struct rq *task_rq, *target_rq = NULL;
> - int cpu, wake_flag = WF_TTWU;
> -
> - lockdep_assert_rq_held(rq);
> - WARN_ON(p == rq->curr);
> -
> - if (p == rq->donor)
> - proxy_resched_idle(rq);
> -
> - proxy_release_rq_lock(rq, rf);
> - /*
> - * We drop the rq lock, and re-grab task_rq_lock to get
> - * the pi_lock (needed for select_task_rq) as well.
> - */
> - scoped_guard (task_rq_lock, p) {
> - task_rq = scope.rq;
> -
> - /*
> - * Since we let go of the rq lock, the task may have been
> - * woken or migrated to another rq before we got the
> - * task_rq_lock. So re-check we're on the same RQ. If
> - * not, the task has already been migrated and that CPU
> - * will handle any futher migrations.
> - */
> - if (task_rq != rq)
> - break;
> -
> - /*
> - * Similarly, if we've been dequeued, someone else will
> - * wake us
> - */
> - if (!task_on_rq_queued(p))
> - break;
> -
> - /*
> - * Since we should only be calling here from __schedule()
> - * -> find_proxy_task(), no one else should have
> - * assigned current out from under us. But check and warn
> - * if we see this, then bail.
> - */
> - if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
> - WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
> - __func__, cpu_of(task_rq),
> - p->comm, p->pid, p->on_cpu);
> - break;
> - }
> -
> - update_rq_clock(task_rq);
> - deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
> - cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> - set_task_cpu(p, cpu);
> - target_rq = cpu_rq(cpu);
> - clear_task_blocked_on(p, NULL);
> - }
> -
> - if (target_rq)
> - attach_one_task(target_rq, p);
> -
> - proxy_reacquire_rq_lock(rq, rf);
> -}
> -
> /*
> * Find runnable lock owner to proxy for mutex blocked donor
> *
> @@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
> clear_task_blocked_on(p, PROXY_WAKING);
> return p;
> }
> - goto force_return;
> + goto deactivate;
> }
>
> /*
> @@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
> __clear_task_blocked_on(p, NULL);
> return p;
> }
> - goto force_return;
> + goto deactivate;
> }
>
> if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
> @@ -6891,12 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
> return owner;
>
> deactivate:
> - if (proxy_deactivate(rq, donor))
> - return NULL;
> - /* If deactivate fails, force return */
> - p = donor;
> -force_return:
> - proxy_force_return(rq, rf, p);
> + proxy_deactivate(rq, donor);
> return NULL;
> migrate_task:
> proxy_migrate_task(rq, rf, p, owner_cpu);
So I like getting rid of proxy_force_return(), but it's not clear to me
that proxy_deactivate() is what we want to do in these
find_proxy_task() edge cases.
If we are already racing with ttwu, deactivating the task might open
more windows where we could lose the wakeup.
In fact, the whole reason we have proxy_force_return() is that earlier
in proxy-exec development, when we hit those edge cases, we would
usually just call proxy_resched_idle() to drop the rq lock and let ttwu
do its thing, but there kept on being cases where we would end up with
lost wakeups.
But I'll give this a shot (and will integrate your ttwu_runnable
cleanups regardless) and see how it does.
thanks
-john
Hello Peter, John,
On 4/3/2026 12:01 AM, John Stultz wrote:
>> So, I've not gone through all the cases yet, and it is *COMPLETELY*
>> untested, but something like the below perhaps?
>>
>
> So, just to clarify, this suggestion is as an alternative to my
> return-migration via ttwu logic (not included in the v26
> simple-donor-migration chunk you've queued)?
> https://github.com/johnstultz-work/linux-dev/commit/dfaa472f2a0b2f6a0c73083aaba5c55e256fdb56
>
> I'm working on prepping that next chunk of the series (which includes
> that logic) to send out here shortly (integrating the few bits I from
> Prateek that I've managerd to get my head around).
Thanks and I agree that it might be more digestible that way.
>
> There's still a bunch of other changes tied into waking a rq->donor
> outside of __schedule() in that chunk, so I'm not sure if this
> discussion would be easier to have in context once those are on the
> list?
>
> So some of my (certainly confused) thoughts below...
>
>> ---
>> include/linux/sched.h | 2
>> kernel/sched/core.c | 173 ++++++++++++++++----------------------------------
>> 2 files changed, 58 insertions(+), 117 deletions(-)
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -161,7 +161,7 @@ struct user_event_mm;
>> */
>> #define is_special_task_state(state) \
>> ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | \
>> - TASK_DEAD | TASK_FROZEN))
>> + TASK_DEAD | TASK_WAKING | TASK_FROZEN))
>>
>> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
>> # define debug_normal_state_change(state_value) \
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
> ...
>> @@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
>> */
>> static int ttwu_runnable(struct task_struct *p, int wake_flags)
>> {
>> - struct rq_flags rf;
>> - struct rq *rq;
>> - int ret = 0;
>> + ACQUIRE(__task_rq_lock, guard)(p);
>> + struct rq *rq = guard.rq;
>>
>> - rq = __task_rq_lock(p, &rf);
>> - if (task_on_rq_queued(p)) {
>> - update_rq_clock(rq);
>> - if (p->se.sched_delayed)
>> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
>> - if (!task_on_cpu(rq, p)) {
>> + if (!task_on_rq_queued(p))
>> + return 0;
>> +
>> + if (sched_proxy_exec() && p->blocked_on) {
>> + guard(raw_spinlock)(&p->blocked_lock);
>> + struct mutex *lock = p->blocked_on;
>> + if (lock) {
>> /*
>> - * When on_rq && !on_cpu the task is preempted, see if
>> - * it should preempt the task that is current now.
>> + * TASK_WAKING is a special state and results in
>> + * DEQUEUE_SPECIAL such that the task will actually be
>> + * forced from the runqueue.
>> */
>> - wakeup_preempt(rq, p, wake_flags);
>> + block_task(rq, p, TASK_WAKING);
This needs to reset the rq->donor if the task getting woken up is the
current donor.
>> + p->blocked_on = NULL;
>> + return 0;
>> }
>> - ttwu_do_wakeup(p);
>> - ret = 1;
>> }
>> - __task_rq_unlock(rq, p, &rf);
>>
>> - return ret;
>> + update_rq_clock(rq);
nit: since block_task() now uses DEQUEUE_NOCLOCK, we need to move this
clock update before the block.
>> + if (p->se.sched_delayed)
>> + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
>
> I can't precisely remember the details now, but I believe we need to
> handle enqueueing sched_delayed tasks before handling blocked_on
> tasks.
So proxy_deactivate() can still delay the task, leading to
task_on_rq_queued() being true and the wakeup coming to
ttwu_runnable(); so either we dequeue it fully in proxy_deactivate(),
or we teach block_task() to add a DEQUEUE_DELAYED flag when
task_is_blocked().
I think the former is cleaner, but then we don't decay the lag for fair
tasks :-(
We can't simply re-enqueue it either, since proxy migration might have
put it on a CPU outside its affinity mask, so we need to take a full
dequeue + wakeup in ttwu_runnable().
>
>
>> + if (!task_on_cpu(rq, p)) {
>> + /*
>> + * When on_rq && !on_cpu the task is preempted, see if
>> + * it should preempt the task that is current now.
>> + */
>> + wakeup_preempt(rq, p, wake_flags);
>> + }
>> + ttwu_do_wakeup(p);
>> + return 1;
>> }
>>
>> void sched_ttwu_pending(void *arg)
> ...
>> @@ -6586,13 +6598,12 @@ static inline struct task_struct *proxy_
>> return rq->idle;
>> }
>>
>> -static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
>> +static void proxy_deactivate(struct rq *rq, struct task_struct *donor)
>> {
>> unsigned long state = READ_ONCE(donor->__state);
>>
>> - /* Don't deactivate if the state has been changed to TASK_RUNNING */
>> - if (state == TASK_RUNNING)
>> - return false;
>> + WARN_ON_ONCE(state == TASK_RUNNING);
>> +
>> /*
>> * Because we got donor from pick_next_task(), it is *crucial*
>> * that we call proxy_resched_idle() before we deactivate it.
>> @@ -6603,7 +6614,7 @@ static bool proxy_deactivate(struct rq *
>> * need to be changed from next *before* we deactivate.
>> */
>> proxy_resched_idle(rq);
>> - return try_to_block_task(rq, donor, &state, true);
>> + block_task(rq, donor, state);
>> }
>>
>> static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
>> @@ -6677,71 +6688,6 @@ static void proxy_migrate_task(struct rq
>> proxy_reacquire_rq_lock(rq, rf);
>> }
>>
>> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
>> - struct task_struct *p)
>> - __must_hold(__rq_lockp(rq))
>> -{
>> - struct rq *task_rq, *target_rq = NULL;
>> - int cpu, wake_flag = WF_TTWU;
>> -
>> - lockdep_assert_rq_held(rq);
>> - WARN_ON(p == rq->curr);
>> -
>> - if (p == rq->donor)
>> - proxy_resched_idle(rq);
>> -
>> - proxy_release_rq_lock(rq, rf);
>> - /*
>> - * We drop the rq lock, and re-grab task_rq_lock to get
>> - * the pi_lock (needed for select_task_rq) as well.
>> - */
>> - scoped_guard (task_rq_lock, p) {
>> - task_rq = scope.rq;
>> -
>> - /*
>> - * Since we let go of the rq lock, the task may have been
>> - * woken or migrated to another rq before we got the
>> - * task_rq_lock. So re-check we're on the same RQ. If
>> - * not, the task has already been migrated and that CPU
>> - * will handle any futher migrations.
>> - */
>> - if (task_rq != rq)
>> - break;
>> -
>> - /*
>> - * Similarly, if we've been dequeued, someone else will
>> - * wake us
>> - */
>> - if (!task_on_rq_queued(p))
>> - break;
>> -
>> - /*
>> - * Since we should only be calling here from __schedule()
>> - * -> find_proxy_task(), no one else should have
>> - * assigned current out from under us. But check and warn
>> - * if we see this, then bail.
>> - */
>> - if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
>> - WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
>> - __func__, cpu_of(task_rq),
>> - p->comm, p->pid, p->on_cpu);
>> - break;
>> - }
>> -
>> - update_rq_clock(task_rq);
>> - deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
>> - cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
>> - set_task_cpu(p, cpu);
>> - target_rq = cpu_rq(cpu);
>> - clear_task_blocked_on(p, NULL);
>> - }
>> -
>> - if (target_rq)
>> - attach_one_task(target_rq, p);
>> -
>> - proxy_reacquire_rq_lock(rq, rf);
>> -}
>> -
Went a little heavy on the delete there, did you? :-)
>> /*
>> * Find runnable lock owner to proxy for mutex blocked donor
>> *
>> @@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
>> clear_task_blocked_on(p, PROXY_WAKING);
>> return p;
>> }
>> - goto force_return;
>> + goto deactivate;
>> }
This makes sense if we preserve the !TASK_RUNNING + p->blocked_on
invariant since we'll definitely get a wakeup here.
>>
>> /*
>> @@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
>> __clear_task_blocked_on(p, NULL);
>> return p;
>> }
>> - goto force_return;
>> + goto deactivate;
This too makes sense considering !owner implies some task will be woken
up, but ... if we take this task off and another task steals the mutex,
this task will no longer be able to proxy it since it is completely
blocked now.
Probably not desired. We should at least let it run, see if it can
get the mutex, and re-evaluate p->blocked_on, since !owner is a
limbo state.
>> }
>>
>> if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
>> @@ -6891,12 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
>> return owner;
>>
>> deactivate:
>> - if (proxy_deactivate(rq, donor))
>> - return NULL;
>> - /* If deactivate fails, force return */
>> - p = donor;
>> -force_return:
>> - proxy_force_return(rq, rf, p);
>> + proxy_deactivate(rq, donor);
>> return NULL;
>> migrate_task:
>> proxy_migrate_task(rq, rf, p, owner_cpu);
>
> So I like getting rid of proxy_force_return(), but it's not clear to me
> that proxy_deactivate() is what we want to do in these
> find_proxy_task() edge cases.
>
> It feels like if we are already racing with ttwu, deactivating the
> task seems like it might open more windows where we might lose the
> wakeup.
>
> In fact, the whole reason we have proxy_force_return() is that earlier
> in the proxy-exec development, when we hit those edge cases we usually
> would return proxy_reschedule_idle() just to drop the rq lock and let
> ttwu do its thing, but there kept on being cases where we would end up
> with lost wakeups.
>
> But I'll give this a shot (and will integrate your ttwu_runnable
> cleanups regardless) and see how it does.
So I added the following on top of Peter's diff on top of
queue:sched/core, and it hasn't crashed and burnt yet when running a
handful of instances of sched-messaging with a mix of fair and SCHED_RR
priority:
(Includes John's findings from the parallel thread)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b2b2451720a..e845e3a8ae65 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2160,7 +2160,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
dequeue_task(rq, p, flags);
}
-static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
+static void block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
{
int flags = DEQUEUE_NOCLOCK;
@@ -3696,6 +3696,20 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
}
}
+static void zap_balance_callbacks(struct rq *rq);
+
+static inline void proxy_reset_donor(struct rq *rq)
+{
+#ifdef CONFIG_SCHED_PROXY_EXEC
+ WARN_ON_ONCE(rq->curr == rq->donor);
+
+ put_prev_set_next_task(rq, rq->donor, rq->curr);
+ rq_set_donor(rq, rq->curr);
+ zap_balance_callbacks(rq);
+ resched_curr(rq);
+#endif
+}
+
/*
* Consider @p being inside a wait loop:
*
@@ -3730,6 +3744,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
return 0;
update_rq_clock(rq);
+ if (p->se.sched_delayed)
+ enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
if (sched_proxy_exec() && p->blocked_on) {
guard(raw_spinlock)(&p->blocked_lock);
struct mutex *lock = p->blocked_on;
@@ -3738,15 +3754,20 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
* TASK_WAKING is a special state and results in
* DEQUEUE_SPECIAL such that the task will actually be
* forced from the runqueue.
+ *
+ * XXX: All of this is now equivalent of
+ * proxy_needs_return() from John's series :-)
*/
- block_task(rq, p, TASK_WAKING);
p->blocked_on = NULL;
+ if (task_current(rq, p))
+ goto out;
+ if (task_current_donor(rq, p))
+ proxy_reset_donor(rq);
+ block_task(rq, p, TASK_WAKING);
return 0;
}
}
-
- if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+out:
if (!task_on_cpu(rq, p)) {
/*
* When on_rq && !on_cpu the task is preempted, see if
@@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
smp_cond_load_acquire(&p->on_cpu, !VAL);
+ /*
+ * We never clear the blocked_on relation on proxy_deactivate.
+ * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
+ * when waking up. Since this is a fully blocked, off CPU task
+ * waking up, it should be safe to clear the blocked_on relation.
+ */
+ if (task_is_blocked(p))
+ clear_task_blocked_on(p, NULL);
+
cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
if (task_cpu(p) != cpu) {
if (p->in_iowait) {
@@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
switch_count = &prev->nvcsw;
}
+ /* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
+ if (!prev_state && task_is_blocked(prev))
+ clear_task_blocked_on(prev, NULL);
+
pick_again:
assert_balance_callbacks_empty(rq);
next = pick_next_task(rq, rq->donor, &rf);
---
Now, there are obviously some sharp edges that I have highlighted above
which may affect performance and correctness to some extent, but once we
have the sleeping-owner bits, it should all go away.
Anyways, I'll let you bash me now on why that try_to_wake_up() hunk
might be totally wrong and dangerous :-)
--
Thanks and Regards,
Prateek
On Thu, Apr 2, 2026 at 11:31 AM John Stultz <jstultz@google.com> wrote:
> On Thu, Apr 2, 2026 at 8:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So, I've not gone through all the cases yet, and it is *COMPLETELY*
> > untested, but something like the below perhaps?
>
> But I'll give this a shot (and will integrate your ttwu_runnable
> cleanups regardless) and see how it does.

So I tweaked it to move the blocked_on handling after the
p->se.sched_delayed check in ttwu_runnable(), otherwise it crashed
very quickly.

But even then, I unfortunately quickly see the
WARN_ON_ONCE(state == TASK_RUNNING); in proxy_deactivate() trip and
then it hangs after starting the mutex_lock-torture tests. :(

I'll try to get my next set cleaned up and shared here shortly and
maybe we can try to whittle it down to something closer to your
approach.

thanks
-john
On Fri, Mar 27, 2026 at 07:03:19PM +0530, K Prateek Nayak wrote:
> So John originally had that and then I saw Dan's comment in
> cleanup.h that reads:
>
>     Lastly, given that the benefit of cleanup helpers is removal of
>     "goto", and that the "goto" statement can jump between scopes, the
>     expectation is that usage of "goto" and cleanup helpers is never
>     mixed in the same function. I.e. for a given routine, convert all
>     resources that need a "goto" cleanup to scope-based cleanup, or
>     convert none of them.
>
> which can either be interpreted as "Don't do it unless you know what
> you are doing" or "There is at least one compiler that will get a
> goto + cleanup guard wrong" and to err on side of caution, I
> suggested we do break + enums.
>
> If there are no concerns, then the suggested diff is indeed much
> better.

IIRC the concern was doing partial error handling conversions and
getting it hopelessly wrong.

And while some GCC's generate wrong code when you goto into a guard,
all clangs ever will error on that, so any such code should not
survive the robots.

And then there was an issue with computed gotos and asm_goto, but the
former are exceedingly rare (and again, clang will error IIRC) and the
latter we upped the minimum clang version for.

Anyway, there is nothing inherently wrong with using goto to exit a
scope and it works well.
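(For readers unfamiliar with the "break + enums" alternative Prateek
mentions: below is a hypothetical userspace sketch, not the kernel's
actual guard() infrastructure. It uses GCC/Clang's
__attribute__((cleanup)) on a pthread mutex as a stand-in for a
scoped_guard, where break exits the guarded scope and an enum records
which exit path was taken. The guard_lock macro, unlock_mutex helper,
and check() function are all made up for illustration.)

```c
#include <pthread.h>

/* Cleanup helper: unlock the mutex when the guard variable leaves scope. */
static void unlock_mutex(pthread_mutex_t **m) { pthread_mutex_unlock(*m); }

/* Hypothetical userspace analog of a scoped guard for a pthread mutex. */
#define guard_lock(m) \
	pthread_mutex_t *_g __attribute__((cleanup(unlock_mutex))) = (m); \
	pthread_mutex_lock(_g)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

enum outcome { DONE, RETRY, FAIL };

enum outcome check(int v)
{
	enum outcome ret = DONE;

	do {				/* guarded scope: break exits it */
		guard_lock(&lock);
		if (v < 0) { ret = FAIL; break; }
		if (v == 0) { ret = RETRY; break; }
		/* ... work under the lock ... */
	} while (0);
	/* lock is released here; ret records which path we took */

	return ret;
}
```

No goto ever crosses the guard variable's scope, so none of the
compiler hazards above apply; the cost is the extra enum bookkeeping.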
On Fri, Mar 27, 2026 at 04:20:52PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 27, 2026 at 07:03:19PM +0530, K Prateek Nayak wrote:
>
> > So John originally had that and then I saw Dan's comment in
> > cleanup.h that reads:
> >
> > Lastly, given that the benefit of cleanup helpers is removal of
> > "goto", and that the "goto" statement can jump between scopes, the
> > expectation is that usage of "goto" and cleanup helpers is never
> > mixed in the same function. I.e. for a given routine, convert all
> > resources that need a "goto" cleanup to scope-based cleanup, or
> > convert none of them.
> >
> > which can either be interpreted as "Don't do it unless you know what
> > you are doing" or "There is at least one compiler that will get a
> > goto + cleanup guard wrong" and to err on side of caution, I
> > suggested we do break + enums.
> >
> > If there are no concerns, then the suggested diff is indeed much
> > better.
>
> IIRC the concern was doing partial error handling conversions and
> getting it hopelessly wrong.
>
> And while some GCC's generate wrong code when you goto into a guard, all
> clangs ever will error on that, so any such code should not survive the
> robots.
>
> And then there was an issue with computed gotos and asm_goto, but the
> former are exceedingly rare (and again, clang will error IIRC) and the
> latter we upped the minimum clang version for.
>
> Anyway, there is nothing inherently wrong with using goto to exit a
> scope and it works well.
That is, consider this:
void *foo(int bar)
{
int err;
something_1();
err = register_something(..);
if (err)
goto unregister;
void *obj __free(kfree) = kzalloc_obj(...);
....
return_ptr(obj);
unregister:
undo_something_1();
return ERR_PTR(err);
}
Looks okay, right? Except note how 'unregister' is inside the scope of
@obj.
(And this compiles 'fine' with various GCC)
This is the kind of errors that you get from partial error handling
conversion and is why Dan wrote what he did.
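A safe variant of Peter's example can keep the goto by making sure the
label never sits inside the cleanup variable's scope. The sketch below
is a self-contained userspace analog, not the kernel code: it uses
GCC/Clang's __attribute__((cleanup)) in place of __free(kfree), and
register_something()/undo_something_1() are hypothetical stand-ins.

```c
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for __free(kfree): free *p when it leaves scope. */
static void cleanup_free(void *p) { free(*(void **)p); }
#define __free_on_scope_exit __attribute__((cleanup(cleanup_free)))

/* Hypothetical helpers mirroring the names in the example above. */
static int register_something(void) { return 0; }	/* 0 on success */
static void undo_something_1(void) { }

void *foo(void)
{
	int err = register_something();

	if (err)
		goto unregister;

	/*
	 * Open an explicit block so @obj's scope closes before the label:
	 * no goto can now jump into or out of @obj's live range.
	 */
	{
		char *obj __free_on_scope_exit = calloc(1, 32);
		if (obj) {
			strcpy(obj, "ok");
			/* Transfer ownership to the caller; the cleanup
			 * then frees NULL, which is a no-op. */
			char *ret = obj;
			obj = NULL;
			return ret;
		}
		err = -1;
	}

unregister:
	undo_something_1();
	return NULL;
}
```

With the label outside the block, the structure Dan's comment warns
about simply cannot arise, whichever compiler builds it.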