Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.
Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.
Because tasks may get migrated to where they cannot run, also
modify the scheduling classes to avoid sched class migrations on
mutex blocked tasks, leaving find_proxy_task() and related logic
to do the migrations and return migrations.
This was split out from the larger proxy patch, and
significantly reworked.
Credits for the original patch go to:
Peter Zijlstra (Intel) <peterz@infradead.org>
Juri Lelli <juri.lelli@redhat.com>
Valentin Schneider <valentin.schneider@arm.com>
Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v6:
* Integrated sched_proxy_exec() check in proxy_return_migration()
* Minor cleanups to diff
* Unpin the rq before calling __balance_callbacks()
* Tweak proxy migration to migrate the deeper task in the chain,
  to avoid tasks ping-ponging between rqs
v7:
* Fixup for unused function arguments
* Switch from that_rq -> target_rq, other minor tweaks, and typo
fixes suggested by Metin Kaya
* Switch back to doing return migration in the ttwu path, which
avoids nasty lock juggling and performance issues
* Fixes for UP builds
v8:
* More simplifications from Metin Kaya
* Fixes for null owner case, including doing return migration
* Cleanup proxy_needs_return logic
v9:
* Narrow logic in ttwu that sets BO_RUNNABLE, to avoid missed
return migrations
* Switch to using zap_balance_callbacks rather than running
  them when we are dropping rq locks for proxy_migration.
* Drop task_is_blocked check in sched_submit_work as suggested
by Metin (may re-add later if this causes trouble)
* Do return migration when we're not on wake_cpu. This avoids
  bad task placement caused by proxy migrations, an issue raised
  by Xuewen Yan
* Fix to call set_next_task(rq->curr) prior to dropping rq lock
to avoid rq->curr getting migrated before we have actually
switched from it
* Cleanup to re-use proxy_resched_idle() instead of open coding
it in proxy_migrate_task()
* Fix return migration not to use DEQUEUE_SLEEP, so that we
properly see the task as task_on_rq_migrating() after it is
dequeued but before set_task_cpu() has been called on it
* Fix to broaden find_proxy_task() checks to avoid race where
a task is dequeued off the rq due to return migration, but
set_task_cpu() and the enqueue on another rq happened after
we checked task_cpu(owner). This ensures we don't proxy
using a task that is not actually on our runqueue.
* Cleanup to avoid the locked BO_WAKING->BO_RUNNABLE transition
in try_to_wake_up() if proxy execution isn't enabled.
* Cleanup to improve comment in proxy_migrate_task() explaining
the set_next_task(rq->curr) logic
* Cleanup deadline.c change to stylistically match rt.c change
* Numerous cleanups suggested by Metin
v10:
* Drop WARN_ON(task_is_blocked(p)) in ttwu current case
v11:
* Include proxy_set_task_cpu from later in the series to this
  change so we can use it, rather than reworking logic later
  in the series.
* Fix problem with return migration, where affinity was changed
and wake_cpu was left outside the affinity mask.
* Avoid reading the owner's cpu twice (as it might change in between)
  to avoid occasional migration-to-same-cpu edge cases
* Add extra WARN_ON checks for wake_cpu and return migration
edge cases.
* Typo fix from Metin
v13:
* As we set ret, return it rather than just NULL (pulling this
  change in from a later patch)
* Avoid deadlock between try_to_wake_up() and find_proxy_task() when
blocked_on cycle with ww_mutex is trying a mid-chain wakeup.
* Tweaks to use new __set_blocked_on_runnable() helper
* Potential fix for incorrectly updated task->dl_server issues
* Minor comment improvements
* Add logic to handle missed wakeups, in that case doing return
migration from the find_proxy_task() path
* Minor cleanups
v14:
* Improve edge cases where we wouldn't set the task as BO_RUNNABLE
v15:
* Added comment to better describe proxy_needs_return() as suggested
by Qais
* Build fixes for !CONFIG_SMP reported by
Maciej Żenczykowski <maze@google.com>
* Adds fix for re-evaluating proxy_needs_return when
sched_proxy_exec() is disabled, reported and diagnosed by:
kuyo chang <kuyo.chang@mediatek.com>
v16:
* Larger rework of needs_return logic in find_proxy_task, in
  order to avoid problems with CPU hotplug
* Rework to use guard() as suggested by Peter
v18:
* Integrate optimization suggested by Suleiman to do the checks
  for sleeping owners before checking if the task_cpu is this_cpu,
  so that we can avoid needlessly proxy-migrating tasks only to
  then dequeue them. Also do the migration check last.
* Improve comments around guard locking
* Include tweak to ttwu_runnable() as suggested by
hupu <hupu.gm@gmail.com>
* Rework the logic releasing the rq->donor reference before letting
go of the rqlock. Just use rq->idle.
* Go back to doing return migration on BO_WAKING owners, as I was
hitting some softlockups caused by running tasks not making
it out of BO_WAKING.
v19:
* Fixed proxy_force_return() logic for !SMP cases
v21:
* Reworked donor deactivation for unhandled sleeping owners
* Commit message tweaks
v22:
* Add comments around zap_balance_callbacks in proxy_migration logic
* Rework logic to avoid gotos out of guard() scopes, and instead
use break and switch() on action value, as suggested by K Prateek
* K Prateek suggested simplifications around putting donor and
setting idle as next task in the migration paths, which I further
simplified to using proxy_resched_idle()
* Comment improvements
* Dropped curr != donor check in pick_next_task_fair() suggested by
K Prateek
v23:
* Rework to use the PROXY_WAKING approach suggested by Peter
* Drop unnecessarily setting wake_cpu when affinity changes
as noticed by Peter
* Split out the ttwu() logic changes into a later separate patch
as suggested by Peter
v24:
* Numerous fixes for rq clock handling, pointed out by K Prateek
* Slight tweak to where put_task() is called suggested by K Prateek
v25:
* Use WF_TTWU in proxy_force_return(), suggested by K Prateek
* Drop get/put_task_struct() in proxy_force_return(), suggested by
K Prateek
* Use attach_one_task() to reduce repetitive logic, as suggested
by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 221 ++++++++++++++++++++++++++++++++++++++------
1 file changed, 191 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index af497b8c72dce..fe20204cf51cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3643,6 +3643,23 @@ void update_rq_avg_idle(struct rq *rq)
rq->idle_stamp = 0;
}
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
+{
+ unsigned int wake_cpu;
+
+ /*
+ * Since we are enqueuing a blocked task on a cpu it may
+ * not be able to run on, preserve wake_cpu across
+ * __set_task_cpu() so we can return the task to where it
+ * was previously runnable.
+ */
+ wake_cpu = p->wake_cpu;
+ __set_task_cpu(p, cpu);
+ p->wake_cpu = wake_cpu;
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
+
static void
ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
struct rq_flags *rf)
@@ -4242,13 +4259,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
ttwu_queue(p, cpu, wake_flags);
}
out:
- /*
- * For now, if we've been woken up, clear the task->blocked_on
- * regardless if it was set to a mutex or PROXY_WAKING so the
- * task can run. We will need to be more careful later when
- * properly handling proxy migration
- */
- clear_task_blocked_on(p, NULL);
if (success)
ttwu_stat(p, task_cpu(p), wake_flags);
@@ -6575,7 +6585,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
return rq->idle;
}
-static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
{
unsigned long state = READ_ONCE(donor->__state);
@@ -6595,17 +6605,135 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
return try_to_block_task(rq, donor, &state, true);
}
-static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+/*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * Note: The owner can disappear, but simply migrate to @target_cpu
+ * and leave that CPU to sort things out.
+ */
+static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, int target_cpu)
{
- if (!__proxy_deactivate(rq, donor)) {
- /*
- * XXX: For now, if deactivation failed, set donor
- * as unblocked, as we aren't doing proxy-migrations
- * yet (more logic will be needed then).
- */
- clear_task_blocked_on(donor, NULL);
+ struct rq *target_rq = cpu_rq(target_cpu);
+
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * Since we're going to drop @rq, we have to put(@rq->donor) first,
+ * otherwise we have a reference that no longer belongs to us.
+ *
+ * Additionally, as we put_prev_task(prev) earlier, it's possible that
+ * prev will migrate away as soon as we drop the rq lock, however we
+ * still have it marked as rq->curr, as we've not yet switched tasks.
+ *
+ * So call proxy_resched_idle() to let go of the references before
+ * we release the lock.
+ */
+ proxy_resched_idle(rq);
+
+ WARN_ON(p == rq->curr);
+
+ deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+ proxy_set_task_cpu(p, target_cpu);
+
+ /*
+ * We have to zap callbacks before unlocking the rq
+ * as another CPU may jump in and call sched_balance_rq
+ * which can trip the warning in rq_pin_lock() if we
+ * leave callbacks set.
+ */
+ zap_balance_callbacks(rq);
+ rq_unpin_lock(rq, rf);
+ raw_spin_rq_unlock(rq);
+
+ attach_one_task(target_rq, p);
+
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ update_rq_clock(rq);
+}
+
+static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p)
+{
+ struct rq *this_rq, *target_rq;
+ struct rq_flags this_rf;
+ int cpu, wake_flag = WF_TTWU;
+
+ lockdep_assert_rq_held(rq);
+ WARN_ON(p == rq->curr);
+
+ /*
+ * We have to zap callbacks before unlocking the rq
+ * as another CPU may jump in and call sched_balance_rq
+ * which can trip the warning in rq_pin_lock() if we
+ * leave callbacks set.
+ */
+ zap_balance_callbacks(rq);
+ rq_unpin_lock(rq, rf);
+ raw_spin_rq_unlock(rq);
+
+ /*
+ * We drop the rq lock, and re-grab task_rq_lock to get
+ * the pi_lock (needed for select_task_rq) as well.
+ */
+ this_rq = task_rq_lock(p, &this_rf);
+
+ /*
+ * Since we let go of the rq lock, the task may have been
+ * woken or migrated to another rq before we got the
+ * task_rq_lock. So re-check we're on the same RQ. If
+ * not, the task has already been migrated and that CPU
+ * will handle any further migrations.
+ */
+ if (this_rq != rq)
+ goto err_out;
+
+ /* Similarly, if we've been dequeued, someone else will wake us */
+ if (!task_on_rq_queued(p))
+ goto err_out;
+
+ /*
+ * Since we should only be calling here from __schedule()
+ * -> find_proxy_task(), no one else should have
+ * assigned current out from under us. But check and warn
+ * if we see this, then bail.
+ */
+ if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
+ WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
+ __func__, cpu_of(this_rq),
+ p->comm, p->pid, p->on_cpu);
+ goto err_out;
}
- return NULL;
+
+ update_rq_clock(this_rq);
+ proxy_resched_idle(this_rq);
+ deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
+ cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+ set_task_cpu(p, cpu);
+ target_rq = cpu_rq(cpu);
+ clear_task_blocked_on(p, NULL);
+ task_rq_unlock(this_rq, p, &this_rf);
+
+ attach_one_task(target_rq, p);
+
+ /* Finally, re-grab the original rq lock and return to pick-again */
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ update_rq_clock(rq);
+ return;
+
+err_out:
+ task_rq_unlock(this_rq, p, &this_rf);
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ update_rq_clock(rq);
}
/*
@@ -6627,17 +6755,25 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
{
- enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
+ enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
struct task_struct *owner = NULL;
+ bool curr_in_chain = false;
int this_cpu = cpu_of(rq);
struct task_struct *p;
struct mutex *mutex;
+ int owner_cpu;
/* Follow blocked_on chain. */
for (p = donor; (mutex = p->blocked_on); p = owner) {
- /* if its PROXY_WAKING, resched_idle so ttwu can complete */
- if (mutex == PROXY_WAKING)
- return proxy_resched_idle(rq);
+ /* if it's PROXY_WAKING, do return migration or run if current */
+ if (mutex == PROXY_WAKING) {
+ if (task_current(rq, p)) {
+ clear_task_blocked_on(p, PROXY_WAKING);
+ return p;
+ }
+ action = NEEDS_RETURN;
+ break;
+ }
/*
* By taking mutex->wait_lock we hold off concurrent mutex_unlock()
@@ -6657,26 +6793,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
return NULL;
}
+ if (task_current(rq, p))
+ curr_in_chain = true;
+
owner = __mutex_owner(mutex);
if (!owner) {
/*
- * If there is no owner, clear blocked_on
- * and return p so it can run and try to
- * acquire the lock
+ * If there is no owner, either clear blocked_on
+ * and return p (if it is current and safe to
+ * just run on this rq), or return-migrate the task.
*/
- __clear_task_blocked_on(p, mutex);
- return p;
+ if (task_current(rq, p)) {
+ __clear_task_blocked_on(p, NULL);
+ return p;
+ }
+ action = NEEDS_RETURN;
+ break;
}
if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
/* XXX Don't handle blocked owners/delayed dequeue yet */
+ if (curr_in_chain)
+ return proxy_resched_idle(rq);
action = DEACTIVATE_DONOR;
break;
}
- if (task_cpu(owner) != this_cpu) {
- /* XXX Don't handle migrations yet */
- action = DEACTIVATE_DONOR;
+ owner_cpu = task_cpu(owner);
+ if (owner_cpu != this_cpu) {
+ /*
+ * @owner can disappear, simply migrate to @owner_cpu
+ * and leave that CPU to sort things out.
+ */
+ if (curr_in_chain)
+ return proxy_resched_idle(rq);
+ action = MIGRATE;
break;
}
@@ -6738,7 +6889,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
/* Handle actions we need to do outside of the guard() scope */
switch (action) {
case DEACTIVATE_DONOR:
- return proxy_deactivate(rq, donor);
+ if (proxy_deactivate(rq, donor))
+ return NULL;
+ /* If deactivate fails, force return */
+ p = donor;
+ fallthrough;
+ case NEEDS_RETURN:
+ proxy_force_return(rq, rf, p);
+ return NULL;
+ case MIGRATE:
+ proxy_migrate_task(rq, rf, p, owner_cpu);
+ return NULL;
case FOUND:
/* fallthrough */;
}
--
2.53.0.880.g73c4285caa-goog
On Fri, Mar 13, 2026 at 02:30:10AM +0000, John Stultz wrote:
> +/*
> + * If the blocked-on relationship crosses CPUs, migrate @p to the
> + * owner's CPU.
> + *
> + * This is because we must respect the CPU affinity of execution
> + * contexts (owner) but we can ignore affinity for scheduling
> + * contexts (@p). So we have to move scheduling contexts towards
> + * potential execution contexts.
> + *
> + * Note: The owner can disappear, but simply migrate to @target_cpu
> + * and leave that CPU to sort things out.
> + */
> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> + struct task_struct *p, int target_cpu)
> {
> + struct rq *target_rq = cpu_rq(target_cpu);
> +
> + lockdep_assert_rq_held(rq);
> +
> + /*
> + * Since we're going to drop @rq, we have to put(@rq->donor) first,
> + * otherwise we have a reference that no longer belongs to us.
> + *
> + * Additionally, as we put_prev_task(prev) earlier, it's possible that
> + * prev will migrate away as soon as we drop the rq lock, however we
> + * still have it marked as rq->curr, as we've not yet switched tasks.
> + *
> + * So call proxy_resched_idle() to let go of the references before
> + * we release the lock.
> + */
> + proxy_resched_idle(rq);
This comment confuses the heck out of me. It seems to imply we need to
schedule before dropping rq->lock.
> +
> + WARN_ON(p == rq->curr);
> +
> + deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> + proxy_set_task_cpu(p, target_cpu);
> +
> + /*
> + * We have to zap callbacks before unlocking the rq
> + * as another CPU may jump in and call sched_balance_rq
> + * which can trip the warning in rq_pin_lock() if we
> + * leave callbacks set.
> + */
It might be good to explain where these callbacks come from.
> + zap_balance_callbacks(rq);
> + rq_unpin_lock(rq, rf);
> + raw_spin_rq_unlock(rq);
> +
> + attach_one_task(target_rq, p);
> +
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> + update_rq_clock(rq);
> +}
> +
> +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> + struct task_struct *p)
> +{
> + struct rq *this_rq, *target_rq;
> + struct rq_flags this_rf;
> + int cpu, wake_flag = WF_TTWU;
> +
> + lockdep_assert_rq_held(rq);
> + WARN_ON(p == rq->curr);
> +
> + /*
> + * We have to zap callbacks before unlocking the rq
> + * as another CPU may jump in and call sched_balance_rq
> + * which can trip the warning in rq_pin_lock() if we
> + * leave callbacks set.
> + */
idem
> + zap_balance_callbacks(rq);
> + rq_unpin_lock(rq, rf);
> + raw_spin_rq_unlock(rq);
This is in fact the very same sequence as above.
> +
> + /*
> + * We drop the rq lock, and re-grab task_rq_lock to get
> + * the pi_lock (needed for select_task_rq) as well.
> + */
> + this_rq = task_rq_lock(p, &this_rf);
> +
> + /*
> + * Since we let go of the rq lock, the task may have been
> + * woken or migrated to another rq before we got the
> + * task_rq_lock. So re-check we're on the same RQ. If
> + * not, the task has already been migrated and that CPU
> + * will handle any further migrations.
> + */
> + if (this_rq != rq)
> + goto err_out;
> +
> + /* Similarly, if we've been dequeued, someone else will wake us */
> + if (!task_on_rq_queued(p))
> + goto err_out;
> +
> + /*
> + * Since we should only be calling here from __schedule()
> + * -> find_proxy_task(), no one else should have
> + * assigned current out from under us. But check and warn
> + * if we see this, then bail.
> + */
> + if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> + WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
> + __func__, cpu_of(this_rq),
> + p->comm, p->pid, p->on_cpu);
> + goto err_out;
> }
> - return NULL;
> +
> + update_rq_clock(this_rq);
> + proxy_resched_idle(this_rq);
> + deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
> + cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> + set_task_cpu(p, cpu);
> + target_rq = cpu_rq(cpu);
> + clear_task_blocked_on(p, NULL);
> + task_rq_unlock(this_rq, p, &this_rf);
> +
> + attach_one_task(target_rq, p);
> +
> + /* Finally, re-grab the original rq lock and return to pick-again */
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> + update_rq_clock(rq);
> + return;
> +
> +err_out:
> + task_rq_unlock(this_rq, p, &this_rf);
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> + update_rq_clock(rq);
> }
Hurm... how about something like so?
---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6662,6 +6662,28 @@ static bool proxy_deactivate(struct rq *
return try_to_block_task(rq, donor, &state, true);
}
+static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
+ __releases(__rq_lockp(rq))
+{
+ /*
+ * We have to zap callbacks before unlocking the rq
+ * as another CPU may jump in and call sched_balance_rq
+ * which can trip the warning in rq_pin_lock() if we
+ * leave callbacks set.
+ */
+ zap_balance_callbacks(rq);
+ rq_unpin_lock(rq, rf);
+ raw_spin_rq_unlock(rq);
+}
+
+static inline void proxy_reacquire_rq_lock(struct rq *rq, struct rq_flags *rf)
+ __acquires(__rq_lockp(rq))
+{
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ update_rq_clock(rq);
+}
+
/*
* If the blocked-on relationship crosses CPUs, migrate @p to the
* owner's CPU.
@@ -6676,6 +6698,7 @@ static bool proxy_deactivate(struct rq *
*/
static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
struct task_struct *p, int target_cpu)
+ __must_hold(__rq_lockp(rq))
{
struct rq *target_rq = cpu_rq(target_cpu);
@@ -6699,98 +6722,72 @@ static void proxy_migrate_task(struct rq
deactivate_task(rq, p, DEQUEUE_NOCLOCK);
proxy_set_task_cpu(p, target_cpu);
- /*
- * We have to zap callbacks before unlocking the rq
- * as another CPU may jump in and call sched_balance_rq
- * which can trip the warning in rq_pin_lock() if we
- * leave callbacks set.
- */
- zap_balance_callbacks(rq);
- rq_unpin_lock(rq, rf);
- raw_spin_rq_unlock(rq);
+ proxy_release_rq_lock(rq, rf);
attach_one_task(target_rq, p);
- raw_spin_rq_lock(rq);
- rq_repin_lock(rq, rf);
- update_rq_clock(rq);
+ proxy_reacquire_rq_lock(rq, rf);
}
static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
struct task_struct *p)
+ __must_hold(__rq_lockp(rq))
{
- struct rq *this_rq, *target_rq;
- struct rq_flags this_rf;
+ struct rq *task_rq, *target_rq = NULL;
int cpu, wake_flag = WF_TTWU;
lockdep_assert_rq_held(rq);
WARN_ON(p == rq->curr);
- /*
- * We have to zap callbacks before unlocking the rq
- * as another CPU may jump in and call sched_balance_rq
- * which can trip the warning in rq_pin_lock() if we
- * leave callbacks set.
- */
- zap_balance_callbacks(rq);
- rq_unpin_lock(rq, rf);
- raw_spin_rq_unlock(rq);
+ proxy_release_rq_lock(rq, rf);
/*
* We drop the rq lock, and re-grab task_rq_lock to get
* the pi_lock (needed for select_task_rq) as well.
*/
- this_rq = task_rq_lock(p, &this_rf);
+ scoped_guard (task_rq_lock, p) {
+ task_rq = scope.rq;
- /*
- * Since we let go of the rq lock, the task may have been
- * woken or migrated to another rq before we got the
- * task_rq_lock. So re-check we're on the same RQ. If
- * not, the task has already been migrated and that CPU
- * will handle any further migrations.
- */
- if (this_rq != rq)
- goto err_out;
-
- /* Similarly, if we've been dequeued, someone else will wake us */
- if (!task_on_rq_queued(p))
- goto err_out;
-
- /*
- * Since we should only be calling here from __schedule()
- * -> find_proxy_task(), no one else should have
- * assigned current out from under us. But check and warn
- * if we see this, then bail.
- */
- if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
- WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
- __func__, cpu_of(this_rq),
- p->comm, p->pid, p->on_cpu);
- goto err_out;
+ /*
+ * Since we let go of the rq lock, the task may have been
+ * woken or migrated to another rq before we got the
+ * task_rq_lock. So re-check we're on the same RQ. If
+ * not, the task has already been migrated and that CPU
+ * will handle any further migrations.
+ */
+ if (task_rq != rq)
+ break;
+
+ /* Similarly, if we've been dequeued, someone else will wake us */
+ if (!task_on_rq_queued(p))
+ break;
+
+ /*
+ * Since we should only be calling here from __schedule()
+ * -> find_proxy_task(), no one else should have
+ * assigned current out from under us. But check and warn
+ * if we see this, then bail.
+ */
+ if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
+ WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
+ __func__, cpu_of(task_rq),
+ p->comm, p->pid, p->on_cpu);
+ break;
+ }
+
+ update_rq_clock(task_rq);
+ proxy_resched_idle(task_rq);
+ deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
+ cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+ set_task_cpu(p, cpu);
+ target_rq = cpu_rq(cpu);
+ clear_task_blocked_on(p, NULL);
}
- update_rq_clock(this_rq);
- proxy_resched_idle(this_rq);
- deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
- cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
- set_task_cpu(p, cpu);
- target_rq = cpu_rq(cpu);
- clear_task_blocked_on(p, NULL);
- task_rq_unlock(this_rq, p, &this_rf);
+ if (target_rq)
+ attach_one_task(target_rq, p);
- attach_one_task(target_rq, p);
-
- /* Finally, re-grab the original rq lock and return to pick-again */
- raw_spin_rq_lock(rq);
- rq_repin_lock(rq, rf);
- update_rq_clock(rq);
- return;
-
-err_out:
- task_rq_unlock(this_rq, p, &this_rf);
- raw_spin_rq_lock(rq);
- rq_repin_lock(rq, rf);
- update_rq_clock(rq);
+ proxy_reacquire_rq_lock(rq, rf);
}
/*
@@ -6811,6 +6808,7 @@ static void proxy_force_return(struct rq
*/
static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+ __must_hold(__rq_lockp(rq))
{
enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
struct task_struct *owner = NULL;
On Thu, Mar 19, 2026 at 5:50 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 13, 2026 at 02:30:10AM +0000, John Stultz wrote:
> > +/*
> > + * If the blocked-on relationship crosses CPUs, migrate @p to the
> > + * owner's CPU.
> > + *
> > + * This is because we must respect the CPU affinity of execution
> > + * contexts (owner) but we can ignore affinity for scheduling
> > + * contexts (@p). So we have to move scheduling contexts towards
> > + * potential execution contexts.
> > + *
> > + * Note: The owner can disappear, but simply migrate to @target_cpu
> > + * and leave that CPU to sort things out.
> > + */
> > +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> > + struct task_struct *p, int target_cpu)
> > {
> > + struct rq *target_rq = cpu_rq(target_cpu);
> > +
> > + lockdep_assert_rq_held(rq);
> > +
> > + /*
> > + * Since we're going to drop @rq, we have to put(@rq->donor) first,
> > + * otherwise we have a reference that no longer belongs to us.
> > + *
> > + * Additionally, as we put_prev_task(prev) earlier, it's possible that
> > + * prev will migrate away as soon as we drop the rq lock, however we
> > + * still have it marked as rq->curr, as we've not yet switched tasks.
> > + *
> > + * So call proxy_resched_idle() to let go of the references before
> > + * we release the lock.
> > + */
> > + proxy_resched_idle(rq);
>
> This comment confuses the heck out of me. It seems to imply we need to
> schedule before dropping rq->lock.
Fair point, I wrote that a while back and indeed it's not really clear
(the rq->curr bit doesn't make much sense to me now).
There is a similar explanation in proxy_deactivate(), which may be
clearer?
Basically, since we are migrating a blocked donor, it could be
rq->donor, and we want to make sure there aren't any references from
this rq to it before we drop the lock. This avoids another cpu jumping
in and grabbing the rq lock and referencing rq->donor or cfs_rq->curr,
etc. after we have migrated it to another cpu.
I'll rework the comment into something like the above, but feel free
to suggest rewordings if you prefer.
> > +
> > + WARN_ON(p == rq->curr);
> > +
> > + deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> > + proxy_set_task_cpu(p, target_cpu);
> > +
> > + /*
> > + * We have to zap callbacks before unlocking the rq
> > + * as another CPU may jump in and call sched_balance_rq
> > + * which can trip the warning in rq_pin_lock() if we
> > + * leave callbacks set.
> > + */
>
> It might be good to explain where these callbacks come from.
Ack. I've taken a swing at this and will include it in the next revision.
>
> > + zap_balance_callbacks(rq);
> > + rq_unpin_lock(rq, rf);
> > + raw_spin_rq_unlock(rq);
> > +
> > + attach_one_task(target_rq, p);
> > +
> > + raw_spin_rq_lock(rq);
> > + rq_repin_lock(rq, rf);
> > + update_rq_clock(rq);
> > +}
> > +
> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > + struct task_struct *p)
> > +{
> > + struct rq *this_rq, *target_rq;
> > + struct rq_flags this_rf;
> > + int cpu, wake_flag = WF_TTWU;
> > +
> > + lockdep_assert_rq_held(rq);
> > + WARN_ON(p == rq->curr);
> > +
> > + /*
> > + * We have to zap callbacks before unlocking the rq
> > + * as another CPU may jump in and call sched_balance_rq
> > + * which can trip the warning in rq_pin_lock() if we
> > + * leave callbacks set.
> > + */
>
> idem
>
> > + zap_balance_callbacks(rq);
> > + rq_unpin_lock(rq, rf);
> > + raw_spin_rq_unlock(rq);
>
> This is in fact the very same sequence as above.
>
...
>
> Hurm... how about something like so?
Sounds good. I've worked this in and am testing it now.
Thanks for the feedback and suggestions!
-john
On Fri, Mar 13, 2026 at 02:30:10AM +0000, John Stultz wrote:
> Add logic to handle migrating a blocked waiter to a remote
> cpu where the lock owner is runnable.
>
> Additionally, as the blocked task may not be able to run
> on the remote cpu, add logic to handle return migration once
> the waiting task is given the mutex.
>
> Because tasks may get migrated to where they cannot run, also
> modify the scheduling classes to avoid sched class migrations on
> mutex blocked tasks, leaving find_proxy_task() and related logic
> to do the migrations and return migrations.
>
> This was split out from the larger proxy patch, and
> significantly reworked.
>
> Credits for the original patch go to:
> Peter Zijlstra (Intel) <peterz@infradead.org>
> Juri Lelli <juri.lelli@redhat.com>
> Valentin Schneider <valentin.schneider@arm.com>
> Connor O'Brien <connoro@google.com>
>
> Signed-off-by: John Stultz <jstultz@google.com>
This patch wants the below; otherwise clang-22+ builds will be sad.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6676,6 +6676,7 @@ static bool proxy_deactivate(struct rq *
*/
static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
struct task_struct *p, int target_cpu)
+ __must_hold(__rq_lockp(rq))
{
struct rq *target_rq = cpu_rq(target_cpu);
@@ -6718,6 +6719,7 @@ static void proxy_migrate_task(struct rq
static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
struct task_struct *p)
+ __must_hold(__rq_lockp(rq))
{
struct rq *this_rq, *target_rq;
struct rq_flags this_rf;
@@ -6811,6 +6813,7 @@ static void proxy_force_return(struct rq
*/
static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+ __must_hold(__rq_lockp(rq))
{
enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
struct task_struct *owner = NULL;
Hello,
I couldn't convince myself the below is not potentially racy ...
On 13/03/26 02:30, John Stultz wrote:
...
> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> + struct task_struct *p, int target_cpu)
> {
> - if (!__proxy_deactivate(rq, donor)) {
> - /*
> - * XXX: For now, if deactivation failed, set donor
> - * as unblocked, as we aren't doing proxy-migrations
> - * yet (more logic will be needed then).
> - */
> - clear_task_blocked_on(donor, NULL);
> + struct rq *target_rq = cpu_rq(target_cpu);
> +
> + lockdep_assert_rq_held(rq);
> +
> + /*
> + * Since we're going to drop @rq, we have to put(@rq->donor) first,
> + * otherwise we have a reference that no longer belongs to us.
> + *
> + * Additionally, as we put_prev_task(prev) earlier, it's possible that
> + * prev will migrate away as soon as we drop the rq lock, however we
> + * still have it marked as rq->curr, as we've not yet switched tasks.
> + *
> + * So call proxy_resched_idle() to let go of the references before
> + * we release the lock.
> + */
> + proxy_resched_idle(rq);
> +
> + WARN_ON(p == rq->curr);
> +
> + deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> + proxy_set_task_cpu(p, target_cpu);
> +
> + /*
> + * We have to zap callbacks before unlocking the rq
> + * as another CPU may jump in and call sched_balance_rq
> + * which can trip the warning in rq_pin_lock() if we
> + * leave callbacks set.
> + */
> + zap_balance_callbacks(rq);
> + rq_unpin_lock(rq, rf);
> + raw_spin_rq_unlock(rq);
> +
> + attach_one_task(target_rq, p);
We release the rq lock between deactivate and attach (and we hold
neither wait_lock nor blocked_lock, as they are out of scope at this
point). Can't something like the following happen?
- Task A: blocked on mutex M, queued on CPU 0
- Task B: owns mutex M, running on CPU 1
CPU 0 (migrating A→CPU 1) CPU 1 (B finishes critical section)
------------------------- ------------------------------------
find_proxy_task(donor=A):
owner = B, owner_cpu = 1
action = MIGRATE
// guard releases wait_lock
proxy_migrate_task(A, cpu=1):
deactivate_task(rq0, A)
→ A->on_rq = 0
proxy_set_task_cpu(A, 1)
→ A->cpu = 1
raw_spin_rq_unlock(rq0)
→ RQ0 LOCK RELEASED
// Task B running
mutex_unlock(M):
lock(&M->wait_lock) // ← Can grab it
A->blocked_on = PROXY_WAKING
unlock(&M->wait_lock)
wake_up_q():
try_to_wake_up(A):
sees A->on_rq == 0
cpu = select_task_rq(A)
→ returns CPU 2
set_task_cpu(A, 2)
ttwu_queue(A, 2)
→ A enqueued on CPU 2
→ A->on_rq = 1, A->cpu = 2
attach_one_task(rq1, A):
attach_task(rq1, A):
WARN_ON_ONCE(task_rq(A) != rq1)
→ Fires! task_rq(A) = rq2
activate_task(rq1, A)
→ Double-enqueue! A->on_rq already = 1
What am I missing? :)
Thanks,
Juri
Hello Juri,
On 3/18/2026 12:05 PM, Juri Lelli wrote:
>> + deactivate_task(rq, p, DEQUEUE_NOCLOCK);
>> + proxy_set_task_cpu(p, target_cpu);
>> +
>> + /*
>> + * We have to zap callbacks before unlocking the rq
>> + * as another CPU may jump in and call sched_balance_rq
>> + * which can trip the warning in rq_pin_lock() if we
>> + * leave callbacks set.
>> + */
>> + zap_balance_callbacks(rq);
>> + rq_unpin_lock(rq, rf);
>> + raw_spin_rq_unlock(rq);
>> +
>> + attach_one_task(target_rq, p);
>
> We release rq lock between deactivate and attach (and we don't hold
> neither wait_lock nor blocked_lock as they are out of scope at this
> point). Can't something like the following happen?
>
> - Task A: blocked on mutex M, queued on CPU 0
> - Task B: owns mutex M, running on CPU 1
>
> CPU 0 (migrating A→CPU 1) CPU 1 (B finishes critical section)
> ------------------------- ------------------------------------
> find_proxy_task(donor=A):
> owner = B, owner_cpu = 1
> action = MIGRATE
> // guard releases wait_lock
>
> proxy_migrate_task(A, cpu=1):
> deactivate_task(rq0, A)
> → A->on_rq = 0
This sets TASK_ON_RQ_MIGRATING
before dequeuing.
block_task() is the only one
that clears task->on_rq now.
> proxy_set_task_cpu(A, 1)
> → A->cpu = 1
> raw_spin_rq_unlock(rq0)
> → RQ0 LOCK RELEASED
> // Task B running
> mutex_unlock(M):
> lock(&M->wait_lock) // ← Can grab it
> A->blocked_on = PROXY_WAKING
> unlock(&M->wait_lock)
> wake_up_q():
> try_to_wake_up(A):
CPU1 sees p->on_rq (TASK_ON_RQ_MIGRATING)
and goes into ttwu_runnable(), stalling
at __task_rq_lock() since it sees
task_on_rq_migrating() ...
attach is done here
A->on_rq is set to
TASK_ON_RQ_QUEUED
... we come back here, see
task_on_rq_queued(), and simply do a
wakeup_preempt() and bail out early
from the try_to_wake_up() path.
> sees A->on_rq == 0
> cpu = select_task_rq(A)
> → returns CPU 2
> set_task_cpu(A, 2)
> ttwu_queue(A, 2)
> → A enqueued on CPU 2
> → A->on_rq = 1, A->cpu = 2
>
> attach_one_task(rq1, A):
> attach_task(rq1, A):
> WARN_ON_ONCE(task_rq(A) != rq1)
> → Fires! task_rq(A) = rq2
> activate_task(rq1, A)
> → Double-enqueue! A->on_rq already = 1
Thus, we avoid that unless I'm mistaken :-)
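For reference, the stall happens in __task_rq_lock(), which roughly does the
following (a simplified sketch of the helper in kernel/sched/core.c, with the
rq pinning and lockdep details omitted):

```c
struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
{
	struct rq *rq;

	for (;;) {
		rq = task_rq(p);
		raw_spin_rq_lock(rq);
		/* Stable only if we hold the lock of the task's current rq
		 * and no migration is in flight. */
		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
			return rq;
		raw_spin_rq_unlock(rq);

		/* Spin until the in-flight migration completes. */
		while (unlikely(task_on_rq_migrating(p)))
			cpu_relax();
	}
}
```

So the TASK_ON_RQ_MIGRATING marker acts as the gate: ttwu cannot observe the
task between the deactivate and the attach.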
--
Thanks and Regards,
Prateek
On 18/03/26 12:26, K Prateek Nayak wrote:
> Hello Juri,
>
> On 3/18/2026 12:05 PM, Juri Lelli wrote:
> >> + deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> >> + proxy_set_task_cpu(p, target_cpu);
> >> +
> >> + /*
> >> + * We have to zap callbacks before unlocking the rq
> >> + * as another CPU may jump in and call sched_balance_rq
> >> + * which can trip the warning in rq_pin_lock() if we
> >> + * leave callbacks set.
> >> + */
> >> + zap_balance_callbacks(rq);
> >> + rq_unpin_lock(rq, rf);
> >> + raw_spin_rq_unlock(rq);
> >> +
> >> + attach_one_task(target_rq, p);
> >
> > We release rq lock between deactivate and attach (and we don't hold
> > neither wait_lock nor blocked_lock as they are out of scope at this
> > point). Can't something like the following happen?
> >
> > - Task A: blocked on mutex M, queued on CPU 0
> > - Task B: owns mutex M, running on CPU 1
> >
> > CPU 0 (migrating A→CPU 1)          CPU 1 (B finishes critical section)
> > -------------------------          ------------------------------------
> > find_proxy_task(donor=A):
> > owner = B, owner_cpu = 1
> > action = MIGRATE
> > // guard releases wait_lock
> >
> > proxy_migrate_task(A, cpu=1):
> > deactivate_task(rq0, A)
> > → A->on_rq = 0
>
> This sets TASK_ON_RQ_MIGRATING
> before dequeuing.

Right you are, I missed this! Sorry for the noise and thanks for the quick reply.

Best,
Juri
Hello John,
On 3/13/2026 8:00 AM, John Stultz wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index af497b8c72dce..fe20204cf51cc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3643,6 +3643,23 @@ void update_rq_avg_idle(struct rq *rq)
> rq->idle_stamp = 0;
> }
>
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> +{
> + unsigned int wake_cpu;
> +
> + /*
> + * Since we are enqueuing a blocked task on a cpu it may
> + * not be able to run on, preserve wake_cpu when we
> + * __set_task_cpu so we can return the task to where it
> + * was previously runnable.
> + */
> + wake_cpu = p->wake_cpu;
> + __set_task_cpu(p, cpu);
> + p->wake_cpu = wake_cpu;
> +}
> +#endif /* CONFIG_SCHED_PROXY_EXEC */
> +
> static void
> ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> struct rq_flags *rf)
> @@ -4242,13 +4259,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> ttwu_queue(p, cpu, wake_flags);
> }
> out:
> - /*
> - * For now, if we've been woken up, clear the task->blocked_on
> - * regardless if it was set to a mutex or PROXY_WAKING so the
> - * task can run. We will need to be more careful later when
> - * properly handling proxy migration
> - */
> - clear_task_blocked_on(p, NULL);
So, for this bit, there are mutex variants that are interruptible and
killable which probably benefits from clearing the blocked_on
relation.
For potential proxy task that are still queued, we'll hit the
ttwu_runnable() path and resched out of there so it makes sense to
mark them as PROXY_WAKING so schedule() can return migrate them, they
run and hit the signal_pending_state() check in __mutex_lock_common()
loop, and return -EINTR.
Otherwise, if they need a full wakeup, they may be blocked on a
sleeping owner, in which case it is beneficial to clear blocked_on, do
a full wakeup, and let them run to evaluate the pending signal.
ttwu_state_match() should filter out any spurious signals. Thoughts?
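(The bail-out being referred to is the standard signal check in the
__mutex_lock_common() wait loop, roughly:)

```c
	/* inside __mutex_lock_common()'s wait loop, once the task runs: */
	if (signal_pending_state(state, current)) {
		ret = -EINTR;
		goto err;
	}
```

So as long as a PROXY_WAKING task eventually runs on a usable CPU, the
interruptible/killable variants still observe the signal.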
> if (success)
> ttwu_stat(p, task_cpu(p), wake_flags);
>
> @@ -6575,7 +6585,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
> return rq->idle;
> }
>
> -static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
> {
> unsigned long state = READ_ONCE(donor->__state);
>
> @@ -6595,17 +6605,135 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
> return try_to_block_task(rq, donor, &state, true);
> }
>
> -static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +/*
> + * If the blocked-on relationship crosses CPUs, migrate @p to the
> + * owner's CPU.
> + *
> + * This is because we must respect the CPU affinity of execution
> + * contexts (owner) but we can ignore affinity for scheduling
> + * contexts (@p). So we have to move scheduling contexts towards
> + * potential execution contexts.
> + *
> + * Note: The owner can disappear, but simply migrate to @target_cpu
> + * and leave that CPU to sort things out.
> + */
> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> + struct task_struct *p, int target_cpu)
> {
> - if (!__proxy_deactivate(rq, donor)) {
> - /*
> - * XXX: For now, if deactivation failed, set donor
> - * as unblocked, as we aren't doing proxy-migrations
> - * yet (more logic will be needed then).
> - */
> - clear_task_blocked_on(donor, NULL);
> + struct rq *target_rq = cpu_rq(target_cpu);
> +
> + lockdep_assert_rq_held(rq);
> +
> + /*
> + * Since we're going to drop @rq, we have to put(@rq->donor) first,
> + * otherwise we have a reference that no longer belongs to us.
> + *
> + * Additionally, as we put_prev_task(prev) earlier, it's possible that
> + * prev will migrate away as soon as we drop the rq lock, however we
> + * still have it marked as rq->curr, as we've not yet switched tasks.
> + *
> + * So call proxy_resched_idle() to let go of the references before
> + * we release the lock.
> + */
> + proxy_resched_idle(rq);
> +
> + WARN_ON(p == rq->curr);
> +
> + deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> + proxy_set_task_cpu(p, target_cpu);
> +
> + /*
> + * We have to zap callbacks before unlocking the rq
> + * as another CPU may jump in and call sched_balance_rq
> + * which can trip the warning in rq_pin_lock() if we
> + * leave callbacks set.
> + */
> + zap_balance_callbacks(rq);
> + rq_unpin_lock(rq, rf);
> + raw_spin_rq_unlock(rq);
> +
> + attach_one_task(target_rq, p);
> +
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> + update_rq_clock(rq);
> +}
> +
> +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> + struct task_struct *p)
> +{
> + struct rq *this_rq, *target_rq;
> + struct rq_flags this_rf;
> + int cpu, wake_flag = WF_TTWU;
> +
> + lockdep_assert_rq_held(rq);
> + WARN_ON(p == rq->curr);
> +
> + /*
> + * We have to zap callbacks before unlocking the rq
> + * as another CPU may jump in and call sched_balance_rq
> + * which can trip the warning in rq_pin_lock() if we
> + * leave callbacks set.
> + */
> + zap_balance_callbacks(rq);
> + rq_unpin_lock(rq, rf);
> + raw_spin_rq_unlock(rq);
> +
> + /*
> + * We drop the rq lock, and re-grab task_rq_lock to get
> + * the pi_lock (needed for select_task_rq) as well.
> + */
> + this_rq = task_rq_lock(p, &this_rf);
> +
> + /*
> + * Since we let go of the rq lock, the task may have been
> + * woken or migrated to another rq before we got the
> + * task_rq_lock. So re-check we're on the same RQ. If
> + * not, the task has already been migrated and that CPU
> + * will handle any further migrations.
> + */
> + if (this_rq != rq)
> + goto err_out;
> +
> + /* Similarly, if we've been dequeued, someone else will wake us */
> + if (!task_on_rq_queued(p))
> + goto err_out;
> +
> + /*
> + * Since we should only be calling here from __schedule()
> + * -> find_proxy_task(), no one else should have
> + * assigned current out from under us. But check and warn
> + * if we see this, then bail.
> + */
> + if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> + WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
> + __func__, cpu_of(this_rq),
> + p->comm, p->pid, p->on_cpu);
> + goto err_out;
> }
> - return NULL;
> +
> + update_rq_clock(this_rq);
> + proxy_resched_idle(this_rq);
I still think this is too late, and only required if we are moving the
donor. Can we do this before we drop the rq_lock so that a remote
wakeup doesn't need to clear this? (although I think we don't have
that bit in the ttwu path anymore and we rely on the schedule() bits
completely for return migration on this version - any particular
reason?).
> + deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
> + cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> + set_task_cpu(p, cpu);
> + target_rq = cpu_rq(cpu);
> + clear_task_blocked_on(p, NULL);
> + task_rq_unlock(this_rq, p, &this_rf);
> +
> + attach_one_task(target_rq, p);
I'm still having a hard time believing we cannot use wake_up_process()
but let me look more into that tomorrow when the sun rises.
> +
> + /* Finally, re-grab the original rq lock and return to pick-again */
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> + update_rq_clock(rq);
> + return;
> +
> +err_out:
> + task_rq_unlock(this_rq, p, &this_rf);
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> + update_rq_clock(rq);
> }
>
> /*
--
Thanks and Regards,
Prateek
On Sun, Mar 15, 2026 at 10:38 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 3/13/2026 8:00 AM, John Stultz wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index af497b8c72dce..fe20204cf51cc 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3643,6 +3643,23 @@ void update_rq_avg_idle(struct rq *rq)
> > rq->idle_stamp = 0;
> > }
> >
> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> > +{
> > + unsigned int wake_cpu;
> > +
> > + /*
> > + * Since we are enqueuing a blocked task on a cpu it may
> > + * not be able to run on, preserve wake_cpu when we
> > + * __set_task_cpu so we can return the task to where it
> > + * was previously runnable.
> > + */
> > + wake_cpu = p->wake_cpu;
> > + __set_task_cpu(p, cpu);
> > + p->wake_cpu = wake_cpu;
> > +}
> > +#endif /* CONFIG_SCHED_PROXY_EXEC */
> > +
> > static void
> > ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> > struct rq_flags *rf)
> > @@ -4242,13 +4259,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > ttwu_queue(p, cpu, wake_flags);
> > }
> > out:
> > - /*
> > - * For now, if we've been woken up, clear the task->blocked_on
> > - * regardless if it was set to a mutex or PROXY_WAKING so the
> > - * task can run. We will need to be more careful later when
> > - * properly handling proxy migration
> > - */
> > - clear_task_blocked_on(p, NULL);
>
> So, for this bit, there are mutex variants that are interruptible and
> killable which probably benefits from clearing the blocked_on
> relation.
This is a good point! I need to re-review some of this with that in mind.
> For potential proxy task that are still queued, we'll hit the
> ttwu_runnable() path and resched out of there so it makes sense to
> mark them as PROXY_WAKING so schedule() can return migrate them, they
> run and hit the signal_pending_state() check in __mutex_lock_common()
> loop, and return -EINTR.
>
> Otherwise, if they need a full wakeup, they may be blocked on a
> sleeping owner, in which case it is beneficial to clear blocked_on, do
> a full wakeup. and let them run to evaluate the pending signal.
>
> ttwu_state_match() should filter out any spurious signals. Thoughts?
So, I don't think we can keep clear_task_blocked_on(p, NULL) in the
out: path there, as then any wakeup would allow the task to run on
that runqueue, even if it was not smp affined.
But if we did go through the select_task_rq() logic, then clearing the
blocked_on bit should be safe. However if blocked_on is set, the task
is likely to be on the rq, so most cases will shortcut at
ttwu_runnable(), so we probably wouldn't get there.
So maybe if I understand your suggestion, we should
clear_task_blocked_on() if we select_task_rq(), and otherwise in the
error path set any blocked_on value to PROXY_WAKING?
I guess this could also move the set_task_blocked_on_waking into ttwu
instead of the lock waker logic. I'll play with that.
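To make that concrete, the shape I'm imagining is something like the
following (a hypothetical, untested sketch against the ttwu path; the
set_task_blocked_on_waking() call here is my guess at how the interface
would move, not what the patch currently does):

```c
	/* hypothetical sketch, in try_to_wake_up(): */
	if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags)) {
		/*
		 * Task is still queued (possibly proxy-migrated somewhere
		 * it can't run), so don't clear blocked_on here. Mark it
		 * PROXY_WAKING instead and let find_proxy_task() handle
		 * the return migration from __schedule().
		 */
		set_task_blocked_on_waking(p, p->blocked_on);
		goto out;
	}
	...
	cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
	/*
	 * We went through select_task_rq(), so the chosen CPU respects
	 * the task's affinity and clearing blocked_on is safe.
	 */
	clear_task_blocked_on(p, NULL);
	ttwu_queue(p, cpu, wake_flags);
```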
> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > + struct task_struct *p)
> > +{
> > + struct rq *this_rq, *target_rq;
> > + struct rq_flags this_rf;
> > + int cpu, wake_flag = WF_TTWU;
> > +
> > + lockdep_assert_rq_held(rq);
> > + WARN_ON(p == rq->curr);
> > +
> > + /*
> > + * We have to zap callbacks before unlocking the rq
> > + * as another CPU may jump in and call sched_balance_rq
> > + * which can trip the warning in rq_pin_lock() if we
> > + * leave callbacks set.
> > + */
> > + zap_balance_callbacks(rq);
> > + rq_unpin_lock(rq, rf);
> > + raw_spin_rq_unlock(rq);
> > +
> > + /*
> > + * We drop the rq lock, and re-grab task_rq_lock to get
> > + * the pi_lock (needed for select_task_rq) as well.
> > + */
> > + this_rq = task_rq_lock(p, &this_rf);
> > +
> > + /*
> > + * Since we let go of the rq lock, the task may have been
> > + * woken or migrated to another rq before we got the
> > + * task_rq_lock. So re-check we're on the same RQ. If
> > + * not, the task has already been migrated and that CPU
> > + * will handle any further migrations.
> > + */
> > + if (this_rq != rq)
> > + goto err_out;
> > +
> > + /* Similarly, if we've been dequeued, someone else will wake us */
> > + if (!task_on_rq_queued(p))
> > + goto err_out;
> > +
> > + /*
> > + * Since we should only be calling here from __schedule()
> > + * -> find_proxy_task(), no one else should have
> > + * assigned current out from under us. But check and warn
> > + * if we see this, then bail.
> > + */
> > + if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> > + WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
> > + __func__, cpu_of(this_rq),
> > + p->comm, p->pid, p->on_cpu);
> > + goto err_out;
> > }
> > - return NULL;
> > +
> > + update_rq_clock(this_rq);
> > + proxy_resched_idle(this_rq);
>
> I still think this is too late, and only required if we are moving the
> donor. Can we do this before we drop the rq_lock so that a remote
> wakeup doesn't need to clear this? (although I think we don't have
Sorry I'm not sure I'm following this bit. Are you suggesting the
update_rq_clock goes above the error handling? Or are you suggesting I
move proxy_resched_idle() elsewhere?
> that bit in the ttwu path anymore and we rely on the schedule() bits
> completely for return migration on this version - any particular
> reason?).
Yes, Peter wanted the return-migration via ttwu to be in a separate patch:
https://lore.kernel.org/lkml/20251009114302.GI3245006@noisy.programming.kicks-ass.net/
>
> > + deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
> > + cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> > + set_task_cpu(p, cpu);
> > + target_rq = cpu_rq(cpu);
> > + clear_task_blocked_on(p, NULL);
> > + task_rq_unlock(this_rq, p, &this_rf);
> > +
> > + attach_one_task(target_rq, p);
>
> I'm still having a hard time believing we cannot use wake_up_process()
> but let me look more into that tomorrow when the sun rises.
I'm curious to hear if you had much luck on this. I've tinkered a bit
today, but keep on hitting the same issue:
<<<Task A>>>
__mutex_unlock_slowpath(lock);
set_task_blocked_on_waking(task_B, lock);
wake_up_process(task_B); /* via wake_up_q() */
try_to_wake_up(task_B, TASK_NORMAL, 0);
ttwu_runnable(task_B, WF_TTWU); /*donor is on_rq, so we trip into this */
ttwu_do_wakeup(task_B);
WRITE_ONCE(p->__state, TASK_RUNNING);
preempt_schedule_irq()
__schedule()
next = pick_next_task(); /* returns task_B (still PROXY_WAKING) */
find_proxy_task(rq, task_B, &rf)
proxy_force_return(rq, rf, task_B);
At this point conceptually we want to dequeue task_B from the
runqueue, and call wake_up_process() so it will be return-migrated to
a runqueue it can run on.
However, the task state is already TASK_RUNNING now, so calling
wake_up_process() again will just shortcut out at ttwu_state_match().
Transitioning to TASK_INTERRUPTIBLE or something else before calling
wake_up_process() seems risky to me (but let me know if I'm wrong here).
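(For reference: wake_up_process() is try_to_wake_up(p, TASK_NORMAL, 0), and
the ttwu_state_match() filter boils down to roughly the check below, so a
TASK_RUNNING task always fails the match and the wakeup is a no-op:)

```c
	/* rough sketch of the state filter in try_to_wake_up() */
	if (!(READ_ONCE(p->__state) & state))
		goto out;	/* TASK_RUNNING is 0, so it never matches */
```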
So to me, doing the manual deactivate/select_task_rq/attach_one_task
work in proxy_force_return() seems the most straightforward, even
though it is a little duplicative of the ttwu logic.
I think when I had something similar before, it was leaning on
modifications to ttwu(), which this patch avoids at Peter's request.
Though maybe this logic can be simplified with the later optimization
patch to do return migration in the ttwu path?
thanks
-john