[RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration)

Posted by John Stultz 4 weeks, 1 day ago
Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.

Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.

Because tasks may get migrated to where they cannot run, also
modify the scheduling classes to avoid sched-class migrations on
mutex-blocked tasks, leaving find_proxy_task() and related logic
to do the migrations and return migrations.

This was split out from the larger proxy patch, and
significantly reworked.

Credits for the original patch go to:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Signed-off-by: John Stultz <jstultz@google.com>
---
v6:
* Integrated sched_proxy_exec() check in proxy_return_migration()
* Minor cleanups to diff
* Unpin the rq before calling __balance_callbacks()
* Tweak proxy migration to migrate the deeper task in the chain, to
  avoid tasks ping-ponging between rqs
v7:
* Fixup for unused function arguments
* Switch from that_rq -> target_rq, other minor tweaks, and typo
  fixes suggested by Metin Kaya
* Switch back to doing return migration in the ttwu path, which
  avoids nasty lock juggling and performance issues
* Fixes for UP builds
v8:
* More simplifications from Metin Kaya
* Fixes for null owner case, including doing return migration
* Cleanup proxy_needs_return logic
v9:
* Narrow logic in ttwu that sets BO_RUNNABLE, to avoid missed
  return migrations
* Switch to using zap_balance_callbacks() rather than running
  them when we are dropping rq locks for proxy migration.
* Drop task_is_blocked check in sched_submit_work as suggested
  by Metin (may re-add later if this causes trouble)
* Do return migration when we're not on wake_cpu. This avoids
  bad task placement caused by proxy migrations, an issue raised
  by Xuewen Yan
* Fix to call set_next_task(rq->curr) prior to dropping rq lock
  to avoid rq->curr getting migrated before we have actually
  switched from it
* Cleanup to re-use proxy_resched_idle() instead of open coding
  it in proxy_migrate_task()
* Fix return migration not to use DEQUEUE_SLEEP, so that we
  properly see the task as task_on_rq_migrating() after it is
  dequeued but before set_task_cpu() has been called on it
* Fix to broaden find_proxy_task() checks to avoid race where
  a task is dequeued off the rq due to return migration, but
  set_task_cpu() and the enqueue on another rq happened after
  we checked task_cpu(owner). This ensures we don't proxy
  using a task that is not actually on our runqueue.
* Cleanup to avoid the locked BO_WAKING->BO_RUNNABLE transition
  in try_to_wake_up() if proxy execution isn't enabled.
* Cleanup to improve comment in proxy_migrate_task() explaining
  the set_next_task(rq->curr) logic
* Cleanup deadline.c change to stylistically match rt.c change
* Numerous cleanups suggested by Metin
v10:
* Drop WARN_ON(task_is_blocked(p)) in ttwu current case
v11:
* Include proxy_set_task_cpu() from later in the series in this
  change so we can use it, rather than reworking the logic later
  in the series.
* Fix problem with return migration, where affinity was changed
  and wake_cpu was left outside the affinity mask.
* Avoid reading the owner's cpu twice (as it might change in between)
  to avoid occasional migration-to-same-cpu edge cases
* Add extra WARN_ON checks for wake_cpu and return migration
  edge cases.
* Typo fix from Metin
v13:
* As we set ret, return it, not just NULL (pulling this change
  in from a later patch)
* Avoid deadlock between try_to_wake_up() and find_proxy_task() when
  blocked_on cycle with ww_mutex is trying a mid-chain wakeup.
* Tweaks to use new __set_blocked_on_runnable() helper
* Potential fix for incorrectly updated task->dl_server issues
* Minor comment improvements
* Add logic to handle missed wakeups, in that case doing return
  migration from the find_proxy_task() path
* Minor cleanups
v14:
* Improve edge cases where we wouldn't set the task as BO_RUNNABLE
v15:
* Added comment to better describe proxy_needs_return() as suggested
  by Qais
* Build fixes for !CONFIG_SMP reported by
  Maciej Żenczykowski <maze@google.com>
* Add fix for re-evaluating proxy_needs_return() when
  sched_proxy_exec() is disabled, reported and diagnosed by:
  kuyo chang <kuyo.chang@mediatek.com>
v16:
* Larger rework of needs_return logic in find_proxy_task, in
  order to avoid problems with cpuhotplug
* Rework to use guard() as suggested by Peter
v18:
* Integrate optimization suggested by Suleiman to do the checks
  for sleeping owners before checking if the task_cpu is this_cpu,
  so that we can avoid needlessly proxy-migrating tasks to only
  then dequeue them. Also check if migrating last.
* Improve comments around guard locking
* Include tweak to ttwu_runnable() as suggested by
  hupu <hupu.gm@gmail.com>
* Rework the logic releasing the rq->donor reference before letting
  go of the rqlock. Just use rq->idle.
* Go back to doing return migration on BO_WAKING owners, as I was
  hitting some softlockups caused by running tasks not making
  it out of BO_WAKING.
v19:
* Fixed proxy_force_return() logic for !SMP cases
v21:
* Reworked donor deactivation for unhandled sleeping owners
* Commit message tweaks

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 247 +++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/fair.c |   3 +-
 2 files changed, 222 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01bf5ef8d9fcc..0f824446c6046 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3157,6 +3157,14 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 
 	__do_set_cpus_allowed(p, ctx);
 
+	/*
+	 * It might be that p->wake_cpu is no longer
+	 * allowed, so set it to dest_cpu so that return
+	 * migration doesn't send the task to an invalid CPU.
+	 */
+	if (!is_cpu_allowed(p, p->wake_cpu))
+		p->wake_cpu = dest_cpu;
+
 	return affine_move_task(rq, p, rf, dest_cpu, ctx->flags);
 
 out:
@@ -3717,6 +3725,67 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
 	trace_sched_wakeup(p);
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
+{
+	unsigned int wake_cpu;
+
+	/*
+	 * Since we are enqueuing a blocked task on a cpu it may
+	 * not be able to run on, preserve wake_cpu when we
+	 * __set_task_cpu so we can return the task to where it
+	 * was previously runnable.
+	 */
+	wake_cpu = p->wake_cpu;
+	__set_task_cpu(p, cpu);
+	p->wake_cpu = wake_cpu;
+}
+
+static bool proxy_task_runnable_but_waking(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+	return (READ_ONCE(p->__state) == TASK_RUNNING &&
+		READ_ONCE(p->blocked_on_state) == BO_WAKING);
+}
+#else /* !CONFIG_SCHED_PROXY_EXEC */
+static bool proxy_task_runnable_but_waking(struct task_struct *p)
+{
+	return false;
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
+
+/*
+ * Checks to see if task p has been proxy-migrated to another rq
+ * and needs to be returned. If so, we deactivate the task here
+ * so that it can be properly woken up on the p->wake_cpu
+ * (or whichever cpu select_task_rq() picks at the bottom of
+ * try_to_wake_up()).
+ */
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+	bool ret = false;
+
+	if (!sched_proxy_exec())
+		return false;
+
+	raw_spin_lock(&p->blocked_lock);
+	if (__get_task_blocked_on(p) && p->blocked_on_state == BO_WAKING) {
+		if (!task_current(rq, p) && (p->wake_cpu != cpu_of(rq))) {
+			if (task_current_donor(rq, p)) {
+				put_prev_task(rq, p);
+				rq_set_donor(rq, rq->idle);
+			}
+			deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+			ret = true;
+		}
+		__set_blocked_on_runnable(p);
+		resched_curr(rq);
+	}
+	raw_spin_unlock(&p->blocked_lock);
+	return ret;
+}
+
 static void
 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		 struct rq_flags *rf)
@@ -3802,6 +3871,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 		update_rq_clock(rq);
 		if (p->se.sched_delayed)
 			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+		if (proxy_needs_return(rq, p))
+			goto out;
 		if (!task_on_cpu(rq, p)) {
 			/*
 			 * When on_rq && !on_cpu the task is preempted, see if
@@ -3812,6 +3883,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 		ttwu_do_wakeup(p);
 		ret = 1;
 	}
+out:
 	__task_rq_unlock(rq, &rf);
 
 	return ret;
@@ -4199,6 +4271,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 *    it disabling IRQs (this allows not taking ->pi_lock).
 		 */
 		WARN_ON_ONCE(p->se.sched_delayed);
+		/* If current is waking up, we know we can run here, so set BO_RUNNABLE */
+		set_blocked_on_runnable(p);
 		if (!ttwu_state_match(p, state, &success))
 			goto out;
 
@@ -4215,8 +4289,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
 		smp_mb__after_spinlock();
-		if (!ttwu_state_match(p, state, &success))
-			break;
+		if (!ttwu_state_match(p, state, &success)) {
+			/*
+			 * If we're already TASK_RUNNING and BO_WAKING,
+			 * continue on to the ttwu_runnable() check to force
+			 * proxy_needs_return() evaluation.
+			 */
+			if (!proxy_task_runnable_but_waking(p))
+				break;
+		}
 
 		trace_sched_waking(p);
 
@@ -4278,6 +4359,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 * enqueue, such as ttwu_queue_wakelist().
 		 */
 		WRITE_ONCE(p->__state, TASK_WAKING);
+		set_blocked_on_runnable(p);
 
 		/*
 		 * If the owning (remote) CPU is still in the middle of schedule() with
@@ -4328,12 +4410,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
-	/*
-	 * For now, if we've been woken up, set us as BO_RUNNABLE
-	 * We will need to be more careful later when handling
-	 * proxy migration
-	 */
-	set_blocked_on_runnable(p);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
@@ -6635,7 +6711,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
 	return rq->idle;
 }
 
-static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 {
 	unsigned long state = READ_ONCE(donor->__state);
 
@@ -6655,17 +6731,98 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
 	return try_to_block_task(rq, donor, &state, true);
 }
 
-static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+/*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * Note: The owner can disappear, but simply migrate to @target_cpu
+ * and leave that CPU to sort things out.
+ */
+static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p, int target_cpu)
 {
-	if (!__proxy_deactivate(rq, donor)) {
-		/*
-		 * XXX: For now, if deactivation failed, set donor
-		 * as unblocked, as we aren't doing proxy-migrations
-		 * yet (more logic will be needed then).
-		 */
-		force_blocked_on_runnable(donor);
-	}
-	return NULL;
+	struct rq *target_rq = cpu_rq(target_cpu);
+
+	lockdep_assert_rq_held(rq);
+
+	/*
+	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
+	 * otherwise we have a reference that no longer belongs to us.
+	 *
+	 * Additionally, as we put_prev_task(prev) earlier, it's possible that
+	 * prev will migrate away as soon as we drop the rq lock, however we
+	 * still have it marked as rq->curr, as we've not yet switched tasks.
+	 *
+	 * After the migration, we are going to pick_again in the __schedule
+	 * logic, so backtrack a bit before we release the lock:
+	 * Put rq->donor, and set rq->curr as rq->donor and set_next_task,
+	 * so that we're close to the situation we had entering __schedule
+	 * the first time.
+	 *
+	 * Then when we re-acquire the lock, we will re-put rq->curr, then
+	 * rq_set_donor(rq->idle) and set_next_task(rq->idle), before
+	 * picking again.
+	 */
+	/* XXX - Added to address problems with changed dl_server semantics - double check */
+	__put_prev_set_next_dl_server(rq, rq->donor, rq->curr);
+	put_prev_task(rq, rq->donor);
+	rq_set_donor(rq, rq->idle);
+	set_next_task(rq, rq->idle);
+
+	WARN_ON(p == rq->curr);
+
+	deactivate_task(rq, p, 0);
+	proxy_set_task_cpu(p, target_cpu);
+
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+	raw_spin_rq_lock(target_rq);
+
+	activate_task(target_rq, p, 0);
+	wakeup_preempt(target_rq, p, 0);
+
+	raw_spin_rq_unlock(target_rq);
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+}
+
+static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p)
+{
+	lockdep_assert_rq_held(rq);
+
+	put_prev_task(rq, rq->donor);
+	rq_set_donor(rq, rq->idle);
+	set_next_task(rq, rq->idle);
+
+	WARN_ON(p == rq->curr);
+
+	set_blocked_on_waking(p);
+	get_task_struct(p);
+	block_task(rq, p, 0);
+
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+
+	wake_up_process(p);
+	put_task_struct(p);
+
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+}
+
+static inline bool proxy_can_run_here(struct rq *rq, struct task_struct *p)
+{
+	if (p == rq->curr || p->wake_cpu == cpu_of(rq))
+		return true;
+	return false;
 }
 
 /*
@@ -6688,9 +6845,11 @@ static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
 	struct task_struct *owner = NULL;
+	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
 	struct mutex *mutex;
+	int owner_cpu;
 
 	/* Follow blocked_on chain. */
 	for (p = donor; task_is_blocked(p); p = owner) {
@@ -6716,6 +6875,10 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			return NULL;
 		}
 
+		/* Double check blocked_on_state now we're holding the lock */
+		if (p->blocked_on_state == BO_RUNNABLE)
+			return p;
+
 		/*
 		 * If a ww_mutex hits the die/wound case, it marks the task as
 		 * BO_WAKING and calls try_to_wake_up(), so that the mutex
@@ -6731,26 +6894,46 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * try_to_wake_up from completing and doing the return
 		 * migration.
 		 *
-		 * So when we hit a !BO_BLOCKED task briefly schedule idle
-		 * so we release the rq and let the wakeup complete.
+		 * So when we hit a BO_WAKING task, try to wake it up ourselves.
 		 */
-		if (p->blocked_on_state != BO_BLOCKED)
-			return proxy_resched_idle(rq);
+		if (p->blocked_on_state == BO_WAKING) {
+			if (task_current(rq, p)) {
+				/* If it's current, just set it runnable */
+				__force_blocked_on_runnable(p);
+				return p;
+			}
+			goto needs_return;
+		}
+
+		if (task_current(rq, p))
+			curr_in_chain = true;
 
 		owner = __mutex_owner(mutex);
 		if (!owner) {
+			/* If the owner is null, we may have some work to do */
+			if (!proxy_can_run_here(rq, p))
+				goto needs_return;
+
 			__force_blocked_on_runnable(p);
 			return p;
 		}
 
 		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
 			/* XXX Don't handle blocked owners/delayed dequeue yet */
+			if (curr_in_chain)
+				return proxy_resched_idle(rq);
 			goto deactivate_donor;
 		}
 
-		if (task_cpu(owner) != this_cpu) {
-			/* XXX Don't handle migrations yet */
-			goto deactivate_donor;
+		owner_cpu = task_cpu(owner);
+		if (owner_cpu != this_cpu) {
+			/*
+			 * @owner can disappear, simply migrate to @owner_cpu
+			 * and leave that CPU to sort things out.
+			 */
+			if (curr_in_chain)
+				return proxy_resched_idle(rq);
+			goto migrate;
 		}
 
 		if (task_on_rq_migrating(owner)) {
@@ -6817,8 +7000,18 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	 * blocked_lock released, so we have to get out of the
 	 * guard() scope.
 	 */
+migrate:
+	proxy_migrate_task(rq, rf, p, owner_cpu);
+	return NULL;
+needs_return:
+	proxy_force_return(rq, rf, p);
+	return NULL;
 deactivate_donor:
-	return proxy_deactivate(rq, donor);
+	if (!proxy_deactivate(rq, donor)) {
+		p = donor;
+		goto needs_return;
+	}
+	return NULL;
 }
 #else /* SCHED_PROXY_EXEC */
 static struct task_struct *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c2..cc531eb939831 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	se = &p->se;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (prev->sched_class != &fair_sched_class)
+	if (prev->sched_class != &fair_sched_class ||
+	    rq->curr != rq->donor)
 		goto simple;
 
 	__put_prev_set_next_dl_server(rq, prev, p);
-- 
2.51.0.338.gd7d06c2dae-goog
Re: [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration)
Posted by K Prateek Nayak 2 weeks, 3 days ago
Hello John,

On 9/4/2025 5:51 AM, John Stultz wrote:
> @@ -6655,17 +6731,98 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
>  	return try_to_block_task(rq, donor, &state, true);
>  }
>  
> -static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +/*
> + * If the blocked-on relationship crosses CPUs, migrate @p to the
> + * owner's CPU.
> + *
> + * This is because we must respect the CPU affinity of execution
> + * contexts (owner) but we can ignore affinity for scheduling
> + * contexts (@p). So we have to move scheduling contexts towards
> + * potential execution contexts.
> + *
> + * Note: The owner can disappear, but simply migrate to @target_cpu
> + * and leave that CPU to sort things out.
> + */
> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p, int target_cpu)
>  {
> -	if (!__proxy_deactivate(rq, donor)) {
> -		/*
> -		 * XXX: For now, if deactivation failed, set donor
> -		 * as unblocked, as we aren't doing proxy-migrations
> -		 * yet (more logic will be needed then).
> -		 */
> -		force_blocked_on_runnable(donor);
> -	}
> -	return NULL;
> +	struct rq *target_rq = cpu_rq(target_cpu);
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	/*
> +	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
> +	 * otherwise we have a reference that no longer belongs to us.
> +	 *
> +	 * Additionally, as we put_prev_task(prev) earlier, its possible that
> +	 * prev will migrate away as soon as we drop the rq lock, however we
> +	 * still have it marked as rq->curr, as we've not yet switched tasks.
> +	 *
> +	 * After the migration, we are going to pick_again in the __schedule
> +	 * logic, so backtrack a bit before we release the lock:
> +	 * Put rq->donor, and set rq->curr as rq->donor and set_next_task,
> +	 * so that we're close to the situation we had entering __schedule
> +	 * the first time.
> +	 *
> +	 * Then when we re-aquire the lock, we will re-put rq->curr then
> +	 * rq_set_donor(rq->idle) and set_next_task(rq->idle), before
> +	 * picking again.
> +	 */
> +	/* XXX - Added to address problems with changed dl_server semantics - double check */
> +	__put_prev_set_next_dl_server(rq, rq->donor, rq->curr);

Given we are tagging the rq->dl_server to the donor's context, should we
do:

    __put_prev_set_next_dl_server(rq, rq->donor, rq->idle);

... since we are setting rq->idle as next task and the donor?

On a side note, this can perhaps be simplified as:

    put_prev_set_next_task(rq, rq->donor, rq->idle);
    rq_set_donor(rq, rq->idle);

put_prev_set_next_task() will take care of the dl_server handoff too.
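
For illustration, the top of proxy_migrate_task() would then read
roughly like below (an untested sketch on my end, just to show the
shape of it):

    /* Hand the rq over to idle before we drop the rq lock */
    put_prev_set_next_task(rq, rq->donor, rq->idle);
    rq_set_donor(rq, rq->idle);

    WARN_ON(p == rq->curr);

    deactivate_task(rq, p, 0);
    proxy_set_task_cpu(p, target_cpu);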

> +	put_prev_task(rq, rq->donor);
> +	rq_set_donor(rq, rq->idle);
> +	set_next_task(rq, rq->idle);
> +
> +	WARN_ON(p == rq->curr);
> +
> +	deactivate_task(rq, p, 0);
> +	proxy_set_task_cpu(p, target_cpu);
> +
> +	zap_balance_callbacks(rq);

Is this zap necessary? Given we return NULL from find_proxy_task() for
the migrate case, __schedule() would zap the callbacks soon. I don't see
any WARN_ON() for the balance callbacks in the unpin + repin path, so
this might not be necessary, or am I mistaken?

> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +	raw_spin_rq_lock(target_rq);
> +
> +	activate_task(target_rq, p, 0);
> +	wakeup_preempt(target_rq, p, 0);
> +
> +	raw_spin_rq_unlock(target_rq);
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +}
> +
> +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p)
> +{
> +	lockdep_assert_rq_held(rq);
> +
> +	put_prev_task(rq, rq->donor);
> +	rq_set_donor(rq, rq->idle);
> +	set_next_task(rq, rq->idle);

Similar set of comments as above.

> +
> +	WARN_ON(p == rq->curr);
> +
> +	set_blocked_on_waking(p);
> +	get_task_struct(p);
> +	block_task(rq, p, 0);
> +
> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +
> +	wake_up_process(p);
> +	put_task_struct(p);
> +
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +}
> +
> +static inline bool proxy_can_run_here(struct rq *rq, struct task_struct *p)
> +{
> +	if (p == rq->curr || p->wake_cpu == cpu_of(rq))
> +		return true;
> +	return false;
>  }
>  
>  /*
> @@ -6688,9 +6845,11 @@ static struct task_struct *
>  find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  {
>  	struct task_struct *owner = NULL;
> +	bool curr_in_chain = false;
>  	int this_cpu = cpu_of(rq);
>  	struct task_struct *p;
>  	struct mutex *mutex;
> +	int owner_cpu;
>  
>  	/* Follow blocked_on chain. */
>  	for (p = donor; task_is_blocked(p); p = owner) {
> @@ -6716,6 +6875,10 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  			return NULL;
>  		}
>  
> +		/* Double check blocked_on_state now we're holding the lock */
> +		if (p->blocked_on_state == BO_RUNNABLE)
> +			return p;
> +
>  		/*
>  		 * If a ww_mutex hits the die/wound case, it marks the task as
>  		 * BO_WAKING and calls try_to_wake_up(), so that the mutex
> @@ -6731,26 +6894,46 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  		 * try_to_wake_up from completing and doing the return
>  		 * migration.
>  		 *
> -		 * So when we hit a !BO_BLOCKED task briefly schedule idle
> -		 * so we release the rq and let the wakeup complete.
> +		 * So when we hit a BO_WAKING task try to wake it up ourselves.
>  		 */
> -		if (p->blocked_on_state != BO_BLOCKED)
> -			return proxy_resched_idle(rq);
> +		if (p->blocked_on_state == BO_WAKING) {
> +			if (task_current(rq, p)) {
> +				/* If its current just set it runnable */
> +				__force_blocked_on_runnable(p);
> +				return p;
> +			}
> +			goto needs_return;
> +		}
> +
> +		if (task_current(rq, p))
> +			curr_in_chain = true;
>  
>  		owner = __mutex_owner(mutex);
>  		if (!owner) {
> +			/* If the owner is null, we may have some work to do */
> +			if (!proxy_can_run_here(rq, p))
> +				goto needs_return;
> +
>  			__force_blocked_on_runnable(p);
>  			return p;
>  		}
>  
>  		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
>  			/* XXX Don't handle blocked owners/delayed dequeue yet */
> +			if (curr_in_chain)
> +				return proxy_resched_idle(rq);
>  			goto deactivate_donor;
>  		}
>  
> -		if (task_cpu(owner) != this_cpu) {
> -			/* XXX Don't handle migrations yet */
> -			goto deactivate_donor;
> +		owner_cpu = task_cpu(owner);
> +		if (owner_cpu != this_cpu) {
> +			/*
> +			 * @owner can disappear, simply migrate to @owner_cpu
> +			 * and leave that CPU to sort things out.
> +			 */
> +			if (curr_in_chain)
> +				return proxy_resched_idle(rq);
> +			goto migrate;
>  		}
>  
>  		if (task_on_rq_migrating(owner)) {
> @@ -6817,8 +7000,18 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  	 * blocked_lock released, so we have to get out of the
>  	 * guard() scope.
>  	 */
> +migrate:
> +	proxy_migrate_task(rq, rf, p, owner_cpu);
> +	return NULL;
> +needs_return:
> +	proxy_force_return(rq, rf, p);
> +	return NULL;
>  deactivate_donor:
> -	return proxy_deactivate(rq, donor);
> +	if (!proxy_deactivate(rq, donor)) {
> +		p = donor;
> +		goto needs_return;
> +	}
> +	return NULL;
>  }
>  #else /* SCHED_PROXY_EXEC */
>  static struct task_struct *
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c2..cc531eb939831 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  	se = &p->se;
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -	if (prev->sched_class != &fair_sched_class)
> +	if (prev->sched_class != &fair_sched_class ||
> +	    rq->curr != rq->donor)

Why is this special handling required?

>  		goto simple;
>  
>  	__put_prev_set_next_dl_server(rq, prev, p);

-- 
Thanks and Regards,
Prateek
Re: [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration)
Posted by John Stultz 1 week, 6 days ago
On Mon, Sep 15, 2025 at 3:07 AM 'K Prateek Nayak' via kernel-team
<kernel-team@android.com> wrote:
> On 9/4/2025 5:51 AM, John Stultz wrote:
> > +     /* XXX - Added to address problems with changed dl_server semantics - double check */
> > +     __put_prev_set_next_dl_server(rq, rq->donor, rq->curr);
>
> Given we are tagging the rq->dl_server to the donor's context, should we
> do:
>
>     __put_prev_set_next_dl_server(rq, rq->donor, rq->idle);
>
> ... since we are setting rq->idle as next task and the donor?
>
> On a side note, this can perhaps be simplified as:
>
>     put_prev_set_next_task(rq, rq->donor, rq->idle);
>     rq_set_donor(rq, rq->idle);
>
> put_prev_set_next_task() will take are of the dl_server handoff too.

Nice! That is simpler. I think I can simplify those two lines down to
proxy_resched_idle() as well.
And looking, the big comment above that section needs an update as
well, as it's out of date.
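
Something like the below, roughly (with the assumption that
proxy_resched_idle() keeps doing the put_prev/set_next handoff to
rq->idle and marking it as donor; not a final version):

    /*
     * Retarget the rq at idle before we drop the rq lock
     * (assumes proxy_resched_idle() does the put_prev/set_next
     * and rq_set_donor() handoff for us).
     */
    proxy_resched_idle(rq);

    WARN_ON(p == rq->curr);

    deactivate_task(rq, p, 0);
    proxy_set_task_cpu(p, target_cpu);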

> > +     put_prev_task(rq, rq->donor);
> > +     rq_set_donor(rq, rq->idle);
> > +     set_next_task(rq, rq->idle);
> > +
> > +     WARN_ON(p == rq->curr);
> > +
> > +     deactivate_task(rq, p, 0);
> > +     proxy_set_task_cpu(p, target_cpu);
> > +
> > +     zap_balance_callbacks(rq);
>
> Is this zap necessary? Given we return NULL from find_proxy_task() for
> migrate case, __schedule() would zap the callback soon. I don't see
> any WARN_ON() for the balance callbacks in the unpin + repin path so
> this might not be necessary or am I mistaken?

So unfortunately the issue is that when we're letting go of the
runqueue, other CPUs can jump in via sched_balance_domains() ->
sched_balance_rq() -> rq_lock_irqsave() -> rq_pin_lock(), which will
trip the
WARN_ON_ONCE(rq->balance_callback && rq->balance_callback !=
&balance_push_callback); check if we leave callbacks queued.

So we need to zap the callbacks before we drop the rq lock.  I'll add
a comment to make that clear.
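
Roughly along these lines (exact wording still to be polished):

    /*
     * Zap the callbacks before dropping the rq lock: once it is
     * released, another CPU can take the rq (e.g. via
     * sched_balance_rq()) and rq_pin_lock() will warn if stale
     * balance callbacks are still queued.
     */
    zap_balance_callbacks(rq);
    rq_unpin_lock(rq, rf);
    raw_spin_rq_unlock(rq);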

And yeah, that means we do call zap_balance_callbacks() again after we
re-take the lock and return null.

I guess we could remove the zap_balance_callbacks() call in
__schedule() and instead call it in every case where find_proxy_task()
returns NULL? Or we call it twice in the migration paths, as it should
just have one entry the second time.


> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > +                            struct task_struct *p)
> > +{
> > +     lockdep_assert_rq_held(rq);
> > +
> > +     put_prev_task(rq, rq->donor);
> > +     rq_set_donor(rq, rq->idle);
> > +     set_next_task(rq, rq->idle);
>
> Similar set of comments as above.

Ack.

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b173a059315c2..cc531eb939831 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> >       se = &p->se;
> >
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > -     if (prev->sched_class != &fair_sched_class)
> > +     if (prev->sched_class != &fair_sched_class ||
> > +         rq->curr != rq->donor)
>
> Why is this special handling required?

Hrmm.. that's a good question, as I haven't thought about that in
a while, and I'm unfortunately forgetful. My instinct is that it's a
concern that checking prev's sched class is fair assumes the prev
task was selected by the fair scheduler's pick_next_task() last
time around. But since we may have picked a blocked RT task as donor,
if we are proxying, we can't assume prev was previously chosen by
pick_next_task_fair(). So we skip the optimization just to be careful.

But this may be overly cautious and looking it over I don't
immediately see it as necessary. So I'll drop it and do some thorough
testing without it. If I do find an issue I'll re-add it with a much
better comment so I don't forget again.  I do really appreciate you
checking me here!

thanks
-john
Re: [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration)
Posted by K Prateek Nayak 1 week, 4 days ago
Hello John,

On 9/20/2025 8:08 AM, John Stultz wrote:
>>> +     put_prev_task(rq, rq->donor);
>>> +     rq_set_donor(rq, rq->idle);
>>> +     set_next_task(rq, rq->idle);
>>> +
>>> +     WARN_ON(p == rq->curr);
>>> +
>>> +     deactivate_task(rq, p, 0);
>>> +     proxy_set_task_cpu(p, target_cpu);
>>> +
>>> +     zap_balance_callbacks(rq);
>>
>> Is this zap necessary? Given we return NULL from find_proxy_task() for
>> migrate case, __schedule() would zap the callback soon. I don't see
>> any WARN_ON() for the balance callbacks in the unpin + repin path so
>> this might not be necessary or am I mistaken?
> 
> So unfortunately the issue is that when we're letting go of the
> runqueue other CPUs can jump in via sched_balance_domains()  ->
> sched_balance_rq() -> rq_lock_irqsave() -> rq_pin_lock() which will
> trip the
> WARN_ON_ONCE(rq->balance_callback && rq->balance_callback !=
> &balance_push_callback); case if we leave callbacks.

Ah! That makes sense. I overlooked the fact that we have *other CPUs*
that can take the rq lock once dropped.

> 
> So we need to zap the callbacks before we drop the rq lock.  I'll add
> a comment to make that clear.

Thank you!

> 
> And yeah, that means we do call zap_balance_callbacks() again after we
> re-take the lock and return null.
> 
> I guess we could remove the zap_balance_callbacks() call in
> __schedule() and instead call it in every case where find_proxy_task()
> returns NULL? Or we call it twice in the migration paths, as it should
> just have one entry the second time.

I think a comment would be good enough with the current flow. I'll let
you decide which is the best option based on your understanding of the
complete flow (I'm still slowly wrapping my head around all this :)

[..snip..]

>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index b173a059315c2..cc531eb939831 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>>>       se = &p->se;
>>>
>>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>> -     if (prev->sched_class != &fair_sched_class)
>>> +     if (prev->sched_class != &fair_sched_class ||
>>> +         rq->curr != rq->donor)
>>
>> Why is this special handling required?
> 
> Hrmm.. that's a good question, as I haven't thought about that in
> awhile, and I'm unfortunately forgetful. My instinct is that its a
> concern that by checking the prev sched class is fair, is assuming the
> prev task was selected by the fair scheduler's pick_next_task last
> time around. But since we may have picked a blocked RT task as donor,
> if we are proxying, we can't assume prev was previously chosen by
> pick_next_task_fair(). So we skip the optimization just to be careful.

My interpretation was: since pick_next_task_fair() selected the
scheduling context, it shouldn't matter what the execution context was.

> 
> But this may be overly cautious and looking it over I don't
> immediately see it as necessary. So I'll drop it and do some thorough
> testing without it. If I do find an issue I'll re-add it with a much
> better comment so I don't forget again.

Thank you! I too will do some testing and let you know if I see
something go awry.

-- 
Thanks and Regards,
Prateek