From nobody Mon Oct 6 11:49:39 2025
Date: Tue, 22 Jul 2025 07:05:47 +0000 In-Reply-To: <20250722070600.3267819-1-jstultz@google.com> References: <20250722070600.3267819-1-jstultz@google.com> Message-ID: <20250722070600.3267819-2-jstultz@google.com> Subject: [RFC][PATCH v20 1/6] locking: Add task::blocked_lock to serialize blocked_on state From: John Stultz To: LKML Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com So far, we have been able to utilize the mutex::wait_lock for serializing the blocked_on state, but when we move to proxying across runqueues, we will need to add more state and a way to serialize changes to this state in contexts where we don't hold the mutex::wait_lock. So introduce the task::blocked_lock, which nests under the mutex::wait_lock in the locking order, and rework the locking to use it. Signed-off-by: John Stultz --- v15: * Split back out into later in the series v16: * Fixups to mark tasks unblocked before sleeping in mutex_optimistic_spin() * Rework to use guard() as suggested by Peter v19: * Rework logic for PREEMPT_RT issues reported by K Prateek Nayak Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. 
McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- include/linux/sched.h | 25 ++++++++++++++++++------- init/init_task.c | 1 + kernel/fork.c | 1 + kernel/locking/mutex.c | 34 ++++++++++++++++++++++------------ kernel/locking/ww_mutex.h | 6 ++++-- kernel/sched/core.c | 4 +++- 6 files changed, 49 insertions(+), 22 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 5b4e1cd52e27a..a6654948d264f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1232,6 +1232,7 @@ struct task_struct { #endif =20 struct mutex *blocked_on; /* lock we're blocked on */ + raw_spinlock_t blocked_lock; =20 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER /* @@ -2145,8 +2146,8 @@ static inline void __set_task_blocked_on(struct task_= struct *p, struct mutex *m) WARN_ON_ONCE(!m); /* The task should only be setting itself as blocked */ WARN_ON_ONCE(p !=3D current); - /* Currently we serialize blocked_on under the mutex::wait_lock */ - lockdep_assert_held_once(&m->wait_lock); + /* Currently we serialize blocked_on under the task::blocked_lock */ + lockdep_assert_held_once(&p->blocked_lock); /* * Check ensure we don't overwrite existing mutex value * with a different mutex. Note, setting it to the same @@ -2158,15 +2159,14 @@ static inline void __set_task_blocked_on(struct tas= k_struct *p, struct mutex *m) =20 static inline void set_task_blocked_on(struct task_struct *p, struct mutex= *m) { - guard(raw_spinlock_irqsave)(&m->wait_lock); + guard(raw_spinlock_irqsave)(&p->blocked_lock); __set_task_blocked_on(p, m); } =20 static inline void __clear_task_blocked_on(struct task_struct *p, struct m= utex *m) { - WARN_ON_ONCE(!m); - /* Currently we serialize blocked_on under the mutex::wait_lock */ - lockdep_assert_held_once(&m->wait_lock); + /* Currently we serialize blocked_on under the task::blocked_lock */ + lockdep_assert_held_once(&p->blocked_lock); /* * There may be cases where we re-clear already cleared * blocked_on relationships, but make sure we are not @@ -2178,8 +2178,15 @@ static inline void __clear_task_blocked_on(struct ta= sk_struct *p, struct mutex * =20 static inline void clear_task_blocked_on(struct task_struct *p, struct mut= ex *m) { - guard(raw_spinlock_irqsave)(&m->wait_lock); + guard(raw_spinlock_irqsave)(&p->blocked_lock); + __clear_task_blocked_on(p, m); +} + +static inline void clear_task_blocked_on_nested(struct task_struct *p, str= uct mutex *m) +{ + raw_spin_lock_nested(&p->blocked_lock, SINGLE_DEPTH_NESTING); __clear_task_blocked_on(p, m); + raw_spin_unlock(&p->blocked_lock); } #else static inline void __clear_task_blocked_on(struct task_struct *p, struct r= t_mutex *m) @@ -2189,6 +2196,10 @@ static inline void __clear_task_blocked_on(struct ta= sk_struct *p, struct rt_mute static inline void clear_task_blocked_on(struct task_struct *p, struct rt_= mutex *m) { } + +static inline void clear_task_blocked_on_nested(struct task_struct *p, str= uct rt_mutex *m) +{ +} #endif /* !CONFIG_PREEMPT_RT */ =20 static __always_inline bool need_resched(void) diff --git a/init/init_task.c b/init/init_task.c index e557f622bd906..7e29d86153d9f 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -140,6 +140,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { .journal_info =3D NULL, INIT_CPU_TIMERS(init_task) .pi_lock =3D __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock), + .blocked_lock =3D __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock), 
.timer_slack_ns =3D 50000, /* 50 usec default slack */ .thread_pid =3D &init_struct_pid, .thread_node =3D LIST_HEAD_INIT(init_signals.thread_head), diff --git a/kernel/fork.c b/kernel/fork.c index 5f87f05aff4a0..6a294e6ee105d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2025,6 +2025,7 @@ __latent_entropy struct task_struct *copy_process( ftrace_graph_init_task(p); =20 rt_mutex_init_task(p); + raw_spin_lock_init(&p->blocked_lock); =20 lockdep_assert_irqs_enabled(); #ifdef CONFIG_PROVE_LOCKING diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 80d778fedd605..2ab6d291696e8 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -614,6 +614,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int st= ate, unsigned int subclas } =20 raw_spin_lock_irqsave(&lock->wait_lock, flags); + raw_spin_lock(¤t->blocked_lock); /* * After waiting to acquire the wait_lock, try again. */ @@ -657,7 +658,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int st= ate, unsigned int subclas * the handoff. */ if (__mutex_trylock(lock)) - goto acquired; + break; =20 /* * Check for signals and kill conditions while holding @@ -675,18 +676,21 @@ __mutex_lock_common(struct mutex *lock, unsigned int = state, unsigned int subclas goto err; } =20 + raw_spin_unlock(¤t->blocked_lock); raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q); =20 schedule_preempt_disabled(); =20 first =3D __mutex_waiter_is_first(lock, &waiter); =20 + raw_spin_lock_irqsave(&lock->wait_lock, flags); + raw_spin_lock(¤t->blocked_lock); /* * As we likely have been woken up by task * that has cleared our blocked_on state, re-set * it to the lock we are trying to acquire. */ - set_task_blocked_on(current, lock); + __set_task_blocked_on(current, lock); set_current_state(state); /* * Here we order against unlock; we must either see it change @@ -697,23 +701,27 @@ __mutex_lock_common(struct mutex *lock, unsigned int = state, unsigned int subclas break; =20 if (first) { - trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN); + bool opt_acquired; + /* * mutex_optimistic_spin() can call schedule(), so - * clear blocked on so we don't become unselectable + * we need to release these locks before calling it, + * and clear blocked on so we don't become unselectable * to run. 
*/ - clear_task_blocked_on(current, lock); - if (mutex_optimistic_spin(lock, ww_ctx, &waiter)) + __clear_task_blocked_on(current, lock); + raw_spin_unlock(¤t->blocked_lock); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags); + trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN); + opt_acquired =3D mutex_optimistic_spin(lock, ww_ctx, &waiter); + raw_spin_lock_irqsave(&lock->wait_lock, flags); + raw_spin_lock(¤t->blocked_lock); + __set_task_blocked_on(current, lock); + if (opt_acquired) break; - set_task_blocked_on(current, lock); trace_contention_begin(lock, LCB_F_MUTEX); } - - raw_spin_lock_irqsave(&lock->wait_lock, flags); } - raw_spin_lock_irqsave(&lock->wait_lock, flags); -acquired: __clear_task_blocked_on(current, lock); __set_current_state(TASK_RUNNING); =20 @@ -739,6 +747,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int st= ate, unsigned int subclas if (ww_ctx) ww_mutex_lock_acquired(ww, ww_ctx); =20 + raw_spin_unlock(¤t->blocked_lock); raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q); preempt_enable(); return 0; @@ -750,6 +759,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int st= ate, unsigned int subclas err_early_kill: WARN_ON(__get_task_blocked_on(current)); trace_contention_end(lock, ret); + raw_spin_unlock(¤t->blocked_lock); raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q); debug_mutex_free_waiter(&waiter); mutex_release(&lock->dep_map, ip); @@ -959,7 +969,7 @@ static noinline void __sched __mutex_unlock_slowpath(st= ruct mutex *lock, unsigne next =3D waiter->task; =20 debug_mutex_wake_waiter(lock, waiter); - __clear_task_blocked_on(next, lock); + clear_task_blocked_on(next, lock); wake_q_add(&wake_q, next); } =20 diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h index 086fd5487ca77..bf13039fb2a04 100644 --- a/kernel/locking/ww_mutex.h +++ b/kernel/locking/ww_mutex.h @@ -289,7 +289,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER = *waiter, * blocked_on pointer. Otherwise we can see circular * blocked_on relationships that can't resolve. */ - __clear_task_blocked_on(waiter->task, lock); + /* nested as we should hold current->blocked_lock already */ + clear_task_blocked_on_nested(waiter->task, lock); wake_q_add(wake_q, waiter->task); } =20 @@ -343,7 +344,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock, * blocked_on pointer. Otherwise we can see circular * blocked_on relationships that can't resolve. */ - __clear_task_blocked_on(owner, lock); + /* nested as we should hold current->blocked_lock already */ + clear_task_blocked_on_nested(owner, lock); wake_q_add(wake_q, owner); } return true; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f7f576ad9b223..52c0f16aab101 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6633,6 +6633,7 @@ static struct task_struct *proxy_deactivate(struct rq= *rq, struct task_struct *d * p->pi_lock * rq->lock * mutex->wait_lock + * p->blocked_lock * * Returns the task that is going to be used as execution context (the one * that is actually going to be run on cpu_of(rq)). @@ -6656,8 +6657,9 @@ find_proxy_task(struct rq *rq, struct task_struct *do= nor, struct rq_flags *rf) * and ensure @owner sticks around. 
*/ guard(raw_spinlock)(&mutex->wait_lock); + guard(raw_spinlock)(&p->blocked_lock); =20 - /* Check again that p is blocked with wait_lock held */ + /* Check again that p is blocked with blocked_lock held */ if (mutex !=3D __get_task_blocked_on(p)) { /* * Something changed in the blocked_on chain and --=20 2.50.0.727.gbf7dc18ff4-goog From nobody Mon Oct 6 11:49:39 2025 X-Gm-Message-State: 
AOJu0YyN8EnpeKamQRY0IH+wulcI5gqDLwQGa80WByPQT0Sh3PX8g8ij sQg02irUvflgYwXHElvgFtBQknAhqlG050dtIoGPH7rVTJjPnwai7rYyj60mpsRrNJiE0gNNV2b IYYscNfQNzz7N1w3fGlJCovEEm3ZMef92EOD/cK2QNBC/xxNWqMSMUr78IbYEwKa/36sJ4G3q47 Slib1DUnDI51ZFJeNcU8DuX/pp3W/lufKG08t9Wrr3hNnDkH+7 X-Google-Smtp-Source: AGHT+IEOM/ui4m2BLzjuK0adEhE+IGtqN0NbMfedQhwU15wEoHwx69ftHSYIP71+XAIs4A9jykWnsxGuJ7sg X-Received: from pjbpv12.prod.google.com ([2002:a17:90b:3c8c:b0:312:15b:e5d1]) (user=jstultz job=prod-delivery.src-stubby-dispatcher) by 2002:a17:90a:dfce:b0:312:e76f:5213 with SMTP id 98e67ed59e1d1-31c9e77ab0dmr30924064a91.28.1753167984679; Tue, 22 Jul 2025 00:06:24 -0700 (PDT) Date: Tue, 22 Jul 2025 07:05:48 +0000 In-Reply-To: <20250722070600.3267819-1-jstultz@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250722070600.3267819-1-jstultz@google.com> X-Mailer: git-send-email 2.50.0.727.gbf7dc18ff4-goog Message-ID: <20250722070600.3267819-3-jstultz@google.com> Subject: [RFC][PATCH v20 2/6] kernel/locking: Add blocked_on_state to provide necessary tri-state for return migration From: John Stultz To: LKML Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" As we add functionality to proxy execution, we may migrate a donor task to a runqueue where it can't run due to cpu affinity. Thus, we must be careful to ensure we return-migrate the task back to a cpu in its cpumask when it becomes unblocked. Thus we need more then just a binary concept of the task being blocked on a mutex or not. So add a blocked_on_state value to the task, that allows the task to move through BO_RUNNING -> BO_BLOCKED -> BO_WAKING and back to BO_RUNNING. This provides a guard state in BO_WAKING so we can know the task is no longer blocked but we don't want to run it until we have potentially done return migration, back to a usable cpu. Signed-off-by: John Stultz --- v15: * Split blocked_on_state into its own patch later in the series, as the tri-state isn't necessary until we deal with proxy/return migrations v16: * Handle case where task in the chain is being set as BO_WAKING by another cpu (usually via ww_mutex die code). Make sure we release the rq lock so the wakeup can complete. * Rework to use guard() in find_proxy_task() as suggested by Peter v18: * Add initialization of blocked_on_state for init_task v19: * PREEMPT_RT build fixups and rework suggested by K Prateek Nayak v20: * Simplify one of the blocked_on_state changes to avoid extra PREMEPT_RT conditionals Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. 
McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- include/linux/sched.h | 100 ++++++++++++++++++++++---------------- init/init_task.c | 1 + kernel/fork.c | 1 + kernel/locking/mutex.c | 15 +++--- kernel/locking/ww_mutex.h | 17 +++---- kernel/sched/core.c | 26 +++++++++- kernel/sched/sched.h | 2 +- 7 files changed, 100 insertions(+), 62 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index a6654948d264f..ced001f889519 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -812,6 +812,12 @@ struct kmap_ctrl { #endif }; =20 +enum blocked_on_state { + BO_RUNNABLE, + BO_BLOCKED, + BO_WAKING, +}; + struct task_struct { #ifdef CONFIG_THREAD_INFO_IN_TASK /* @@ -1231,6 +1237,7 @@ struct task_struct { struct rt_mutex_waiter *pi_blocked_on; #endif =20 + enum blocked_on_state blocked_on_state; struct mutex *blocked_on; /* lock we're blocked on */ raw_spinlock_t blocked_lock; =20 @@ -2131,76 +2138,83 @@ extern int __cond_resched_rwlock_write(rwlock_t *lo= ck); __cond_resched_rwlock_write(lock); \ }) =20 -#ifndef CONFIG_PREEMPT_RT -static inline struct mutex *__get_task_blocked_on(struct task_struct *p) +static inline void __force_blocked_on_runnable(struct task_struct *p) { - struct mutex *m =3D p->blocked_on; + lockdep_assert_held(&p->blocked_lock); + p->blocked_on_state =3D BO_RUNNABLE; +} =20 - if (m) - lockdep_assert_held_once(&m->wait_lock); - return m; +static inline void force_blocked_on_runnable(struct task_struct *p) +{ + guard(raw_spinlock_irqsave)(&p->blocked_lock); + __force_blocked_on_runnable(p); } =20 -static inline void __set_task_blocked_on(struct task_struct *p, struct mut= ex *m) +static inline void __set_blocked_on_runnable(struct task_struct *p) { - WARN_ON_ONCE(!m); - /* The task should only be setting itself as blocked */ - WARN_ON_ONCE(p !=3D current); - /* Currently we serialize blocked_on under the task::blocked_lock */ - lockdep_assert_held_once(&p->blocked_lock); - /* - * Check ensure we don't overwrite existing mutex value - * with a different mutex. Note, setting it to the same - * lock repeatedly is ok. - */ - WARN_ON_ONCE(p->blocked_on && p->blocked_on !=3D m); - p->blocked_on =3D m; + lockdep_assert_held(&p->blocked_lock); + + if (p->blocked_on_state =3D=3D BO_WAKING) + p->blocked_on_state =3D BO_RUNNABLE; } =20 -static inline void set_task_blocked_on(struct task_struct *p, struct mutex= *m) +static inline void set_blocked_on_runnable(struct task_struct *p) { + if (!sched_proxy_exec()) + return; + guard(raw_spinlock_irqsave)(&p->blocked_lock); - __set_task_blocked_on(p, m); + __set_blocked_on_runnable(p); } =20 -static inline void __clear_task_blocked_on(struct task_struct *p, struct m= utex *m) +static inline void __set_blocked_on_waking(struct task_struct *p) { - /* Currently we serialize blocked_on under the task::blocked_lock */ - lockdep_assert_held_once(&p->blocked_lock); - /* - * There may be cases where we re-clear already cleared - * blocked_on relationships, but make sure we are not - * clearing the relationship with a different lock. 
- */ - WARN_ON_ONCE(m && p->blocked_on && p->blocked_on !=3D m); - p->blocked_on =3D NULL; + lockdep_assert_held(&p->blocked_lock); + + if (p->blocked_on_state =3D=3D BO_BLOCKED) + p->blocked_on_state =3D BO_WAKING; } =20 -static inline void clear_task_blocked_on(struct task_struct *p, struct mut= ex *m) +static inline struct mutex *__get_task_blocked_on(struct task_struct *p) { - guard(raw_spinlock_irqsave)(&p->blocked_lock); - __clear_task_blocked_on(p, m); + lockdep_assert_held_once(&p->blocked_lock); + return p->blocked_on; } =20 -static inline void clear_task_blocked_on_nested(struct task_struct *p, str= uct mutex *m) +static inline void set_blocked_on_waking_nested(struct task_struct *p) { raw_spin_lock_nested(&p->blocked_lock, SINGLE_DEPTH_NESTING); - __clear_task_blocked_on(p, m); + __set_blocked_on_waking(p); raw_spin_unlock(&p->blocked_lock); } -#else -static inline void __clear_task_blocked_on(struct task_struct *p, struct r= t_mutex *m) -{ -} =20 -static inline void clear_task_blocked_on(struct task_struct *p, struct rt_= mutex *m) +static inline void __set_task_blocked_on(struct task_struct *p, struct mut= ex *m) { + WARN_ON_ONCE(!m); + /* The task should only be setting itself as blocked */ + WARN_ON_ONCE(p !=3D current); + /* Currently we serialize blocked_on under the task::blocked_lock */ + lockdep_assert_held_once(&p->blocked_lock); + /* + * Check ensure we don't overwrite existing mutex value + * with a different mutex. + */ + WARN_ON_ONCE(p->blocked_on); + p->blocked_on =3D m; + p->blocked_on_state =3D BO_BLOCKED; } =20 -static inline void clear_task_blocked_on_nested(struct task_struct *p, str= uct rt_mutex *m) +static inline void __clear_task_blocked_on(struct task_struct *p, struct m= utex *m) { + /* The task should only be clearing itself */ + WARN_ON_ONCE(p !=3D current); + /* Currently we serialize blocked_on under the task::blocked_lock */ + lockdep_assert_held_once(&p->blocked_lock); + /* Make sure we are clearing the relationship with the right lock */ + WARN_ON_ONCE(p->blocked_on !=3D m); + p->blocked_on =3D NULL; + p->blocked_on_state =3D BO_RUNNABLE; } -#endif /* !CONFIG_PREEMPT_RT */ =20 static __always_inline bool need_resched(void) { diff --git a/init/init_task.c b/init/init_task.c index 7e29d86153d9f..6d72ec23410a6 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -174,6 +174,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { .mems_allowed_seq =3D SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq, &init_task.alloc_lock), #endif + .blocked_on_state =3D BO_RUNNABLE, #ifdef CONFIG_RT_MUTEXES .pi_waiters =3D RB_ROOT_CACHED, .pi_top_task =3D NULL, diff --git a/kernel/fork.c b/kernel/fork.c index 6a294e6ee105d..5eacb25a0c5ab 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2124,6 +2124,7 @@ __latent_entropy struct task_struct *copy_process( lockdep_init_task(p); #endif =20 + p->blocked_on_state =3D BO_RUNNABLE; p->blocked_on =3D NULL; /* not blocked yet */ =20 #ifdef CONFIG_BCACHE diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 2ab6d291696e8..b5145ddaec242 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -686,11 +686,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int s= tate, unsigned int subclas raw_spin_lock_irqsave(&lock->wait_lock, flags); raw_spin_lock(¤t->blocked_lock); /* - * As we likely have been woken up by task - * that has cleared our blocked_on state, re-set - * it to the lock we are trying to acquire. 
+ * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE */ - __set_task_blocked_on(current, lock); + current->blocked_on_state =3D BO_BLOCKED; set_current_state(state); /* * Here we order against unlock; we must either see it change @@ -709,14 +707,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int = state, unsigned int subclas * and clear blocked on so we don't become unselectable * to run. */ - __clear_task_blocked_on(current, lock); + current->blocked_on_state =3D BO_RUNNABLE; raw_spin_unlock(¤t->blocked_lock); raw_spin_unlock_irqrestore(&lock->wait_lock, flags); trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN); opt_acquired =3D mutex_optimistic_spin(lock, ww_ctx, &waiter); raw_spin_lock_irqsave(&lock->wait_lock, flags); raw_spin_lock(¤t->blocked_lock); - __set_task_blocked_on(current, lock); + current->blocked_on_state =3D BO_BLOCKED; if (opt_acquired) break; trace_contention_begin(lock, LCB_F_MUTEX); @@ -968,8 +966,11 @@ static noinline void __sched __mutex_unlock_slowpath(s= truct mutex *lock, unsigne =20 next =3D waiter->task; =20 + raw_spin_lock(&next->blocked_lock); debug_mutex_wake_waiter(lock, waiter); - clear_task_blocked_on(next, lock); + WARN_ON_ONCE(__get_task_blocked_on(next) !=3D lock); + __set_blocked_on_waking(next); + raw_spin_unlock(&next->blocked_lock); wake_q_add(&wake_q, next); } =20 diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h index bf13039fb2a04..44eceffd79b35 100644 --- a/kernel/locking/ww_mutex.h +++ b/kernel/locking/ww_mutex.h @@ -285,12 +285,12 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITE= R *waiter, debug_mutex_wake_waiter(lock, waiter); #endif /* - * When waking up the task to die, be sure to clear the - * blocked_on pointer. Otherwise we can see circular - * blocked_on relationships that can't resolve. + * When waking up the task to die, be sure to set the + * blocked_on_state to BO_WAKING. Otherwise we can see + * circular blocked_on relationships that can't resolve. */ /* nested as we should hold current->blocked_lock already */ - clear_task_blocked_on_nested(waiter->task, lock); + set_blocked_on_waking_nested(waiter->task); wake_q_add(wake_q, waiter->task); } =20 @@ -340,12 +340,11 @@ static bool __ww_mutex_wound(struct MUTEX *lock, */ if (owner !=3D current) { /* - * When waking up the task to wound, be sure to clear the - * blocked_on pointer. Otherwise we can see circular - * blocked_on relationships that can't resolve. + * When waking up the task to wound, be sure to set the + * blocked_on_state to BO_WAKING. Otherwise we can see + * circular blocked_on relationships that can't resolve. */ - /* nested as we should hold current->blocked_lock already */ - clear_task_blocked_on_nested(owner, lock); + set_blocked_on_waking_nested(owner); wake_q_add(wake_q, owner); } return true; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 52c0f16aab101..7ae5f2d257eb5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4322,6 +4322,7 @@ int try_to_wake_up(struct task_struct *p, unsigned in= t state, int wake_flags) ttwu_queue(p, cpu, wake_flags); } out: + set_blocked_on_runnable(p); if (success) ttwu_stat(p, task_cpu(p), wake_flags); =20 @@ -6617,7 +6618,7 @@ static struct task_struct *proxy_deactivate(struct rq= *rq, struct task_struct *d * as unblocked, as we aren't doing proxy-migrations * yet (more logic will be needed then). 
*/ - donor->blocked_on =3D NULL; + donor->blocked_on_state =3D BO_RUNNABLE; } return NULL; } @@ -6670,9 +6671,30 @@ find_proxy_task(struct rq *rq, struct task_struct *d= onor, struct rq_flags *rf) return NULL; } =20 + /* + * If a ww_mutex hits the die/wound case, it marks the task as + * BO_WAKING and calls try_to_wake_up(), so that the mutex + * cycle can be broken and we avoid a deadlock. + * + * However, if at that moment, we are here on the cpu which the + * die/wounded task is enqueued, we might loop on the cycle as + * BO_WAKING still causes task_is_blocked() to return true + * (since we want return migration to occur before we run the + * task). + * + * Unfortunately since we hold the rq lock, it will block + * try_to_wake_up from completing and doing the return + * migration. + * + * So when we hit a !BO_BLOCKED task briefly schedule idle + * so we release the rq and let the wakeup complete. + */ + if (p->blocked_on_state !=3D BO_BLOCKED) + return proxy_resched_idle(rq); + owner =3D __mutex_owner(mutex); if (!owner) { - __clear_task_blocked_on(p, mutex); + __force_blocked_on_runnable(p); return p; } =20 diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d3f33d10c58c9..d27e8a260e89d 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2267,7 +2267,7 @@ static inline bool task_is_blocked(struct task_struct= *p) if (!sched_proxy_exec()) return false; =20 - return !!p->blocked_on; + return !!p->blocked_on && p->blocked_on_state !=3D BO_RUNNABLE; } =20 static inline int task_on_cpu(struct rq *rq, struct task_struct *p) --=20 2.50.0.727.gbf7dc18ff4-goog From nobody Mon Oct 6 11:49:39 2025 
Date: Tue, 22 Jul 2025 07:05:49 +0000 In-Reply-To: <20250722070600.3267819-1-jstultz@google.com> References: <20250722070600.3267819-1-jstultz@google.com> Message-ID: <20250722070600.3267819-4-jstultz@google.com> Subject: [RFC][PATCH v20 3/6] sched: Add logic to zap balance callbacks if we pick again From: John Stultz To: LKML Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com With proxy-exec, a task is selected to run via pick_next_task(), and then if it is a mutex-blocked task, we call find_proxy_task() to find a runnable owner. If the runnable owner is on another cpu, we will need to migrate the selected donor task away, after which we go to pick_again and call pick_next_task() to choose something else. However, in the first call to pick_next_task(), we may have had a balance_callback set up by the class scheduler. After we pick again, it's possible pick_next_task_fair() will be called, which calls sched_balance_newidle() and sched_balance_rq(). 
This will throw a warning: [ 8.796467] rq->balance_callback && rq->balance_callback !=3D &balance_p= ush_callback [ 8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched= _balance_rq+0xe92/0x1250 ... [ 8.796467] Call Trace: [ 8.796467] [ 8.796467] ? __warn.cold+0xb2/0x14e [ 8.796467] ? sched_balance_rq+0xe92/0x1250 [ 8.796467] ? report_bug+0x107/0x1a0 [ 8.796467] ? handle_bug+0x54/0x90 [ 8.796467] ? exc_invalid_op+0x17/0x70 [ 8.796467] ? asm_exc_invalid_op+0x1a/0x20 [ 8.796467] ? sched_balance_rq+0xe92/0x1250 [ 8.796467] sched_balance_newidle+0x295/0x820 [ 8.796467] pick_next_task_fair+0x51/0x3f0 [ 8.796467] __schedule+0x23a/0x14b0 [ 8.796467] ? lock_release+0x16d/0x2e0 [ 8.796467] schedule+0x3d/0x150 [ 8.796467] worker_thread+0xb5/0x350 [ 8.796467] ? __pfx_worker_thread+0x10/0x10 [ 8.796467] kthread+0xee/0x120 [ 8.796467] ? __pfx_kthread+0x10/0x10 [ 8.796467] ret_from_fork+0x31/0x50 [ 8.796467] ? __pfx_kthread+0x10/0x10 [ 8.796467] ret_from_fork_asm+0x1a/0x30 [ 8.796467] This is because if a RT task was originally picked, it will setup the rq->balance_callback with push_rt_tasks() via set_next_task_rt(). Once the task is migrated away and we pick again, we haven't processed any balance callbacks, so rq->balance_callback is not in the same state as it was the first time pick_next_task was called. To handle this, add a zap_balance_callbacks() helper function which cleans up the blance callbacks without running them. This should be ok, as we are effectively undoing the state set in the first call to pick_next_task(), and when we pick again, the new callback can be configured for the donor task actually selected. Signed-off-by: John Stultz --- v20: * Tweaked to avoid build issues with different configs Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- kernel/sched/core.c | 39 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 38 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7ae5f2d257eb5..30e676c2d582b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4990,6 +4990,40 @@ static inline void finish_task(struct task_struct *p= rev) smp_store_release(&prev->on_cpu, 0); } =20 +#ifdef CONFIG_SCHED_PROXY_EXEC +/* + * Only called from __schedule context + * + * There are some cases where we are going to re-do the action + * that added the balance callbacks. We may not be in a state + * where we can run them, so just zap them so they can be + * properly re-added on the next time around. This is similar + * handling to running the callbacks, except we just don't call + * them. + */ +static void zap_balance_callbacks(struct rq *rq) +{ + struct balance_callback *next, *head; + bool found =3D false; + + lockdep_assert_rq_held(rq); + + head =3D rq->balance_callback; + while (head) { + if (head =3D=3D &balance_push_callback) + found =3D true; + next =3D head->next; + head->next =3D NULL; + head =3D next; + } + rq->balance_callback =3D found ? 
&balance_push_callback : NULL; +} +#else +static inline void zap_balance_callbacks(struct rq *rq) +{ +} +#endif + static void do_balance_callbacks(struct rq *rq, struct balance_callback *h= ead) { void (*func)(struct rq *rq); @@ -6920,8 +6954,11 @@ static void __sched notrace __schedule(int sched_mod= e) rq_set_donor(rq, next); if (unlikely(task_is_blocked(next))) { next =3D find_proxy_task(rq, next, &rf); - if (!next) + if (!next) { + /* zap the balance_callbacks before picking again */ + zap_balance_callbacks(rq); goto pick_again; + } if (next =3D=3D rq->idle) goto keep_resched; } --=20 2.50.0.727.gbf7dc18ff4-goog From nobody Mon Oct 6 11:49:39 2025 
Date: Tue, 22 Jul 2025 07:05:50 +0000 In-Reply-To: <20250722070600.3267819-1-jstultz@google.com> References: <20250722070600.3267819-1-jstultz@google.com> Message-ID: <20250722070600.3267819-5-jstultz@google.com> Subject: [RFC][PATCH v20 4/6] sched: Handle blocked-waiter migration (and return migration) From: John Stultz To: LKML Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com Add logic to handle migrating a blocked waiter to a remote cpu where the lock owner is runnable. Additionally, as the blocked task may not be able to run on the remote cpu, add logic to handle return migration once the waiting task is given the mutex. Because tasks may get migrated to where they cannot run, also modify the scheduling classes to avoid sched class migrations on mutex-blocked tasks, leaving find_proxy_task() and related logic to do the migrations and return migrations. This was split out from the larger proxy patch, and significantly reworked. Credits for the original patch go to: Peter Zijlstra (Intel) Juri Lelli Valentin Schneider Connor O'Brien NOTE: With this patch I've hit a few cases where we seem to miss a BO_WAKING->BO_RUNNABLE transition (and return migration) that I'd expect to happen in ttwu(). So I have logic in find_proxy_task() to detect this and handle the return migration later. However, I'm not quite happy with that, as it shouldn't be necessary, and I am still trying to understand where I'm losing the wakeup & return migration. 
Signed-off-by: John Stultz --- v6: * Integrated sched_proxy_exec() check in proxy_return_migration() * Minor cleanups to diff * Unpin the rq before calling __balance_callbacks() * Tweak proxy migrate to migrate deeper task in chain, to avoid tasks pingponging between rqs v7: * Fixup for unused function arguments * Switch from that_rq -> target_rq, other minor tweaks, and typo fixes suggested by Metin Kaya * Switch back to doing return migration in the ttwu path, which avoids nasty lock juggling and performance issues * Fixes for UP builds v8: * More simplifications from Metin Kaya * Fixes for null owner case, including doing return migration * Cleanup proxy_needs_return logic v9: * Narrow logic in ttwu that sets BO_RUNNABLE, to avoid missed return migrations * Switch to using zap_balance_callbacks rathern then running them when we are dropping rq locks for proxy_migration. * Drop task_is_blocked check in sched_submit_work as suggested by Metin (may re-add later if this causes trouble) * Do return migration when we're not on wake_cpu. This avoids bad task placement caused by proxy migrations raised by Xuewen Yan * Fix to call set_next_task(rq->curr) prior to dropping rq lock to avoid rq->curr getting migrated before we have actually switched from it * Cleanup to re-use proxy_resched_idle() instead of open coding it in proxy_migrate_task() * Fix return migration not to use DEQUEUE_SLEEP, so that we properly see the task as task_on_rq_migrating() after it is dequeued but before set_task_cpu() has been called on it * Fix to broaden find_proxy_task() checks to avoid race where a task is dequeued off the rq due to return migration, but set_task_cpu() and the enqueue on another rq happened after we checked task_cpu(owner). This ensures we don't proxy using a task that is not actually on our runqueue. * Cleanup to avoid the locked BO_WAKING->BO_RUNNABLE transition in try_to_wake_up() if proxy execution isn't enabled. * Cleanup to improve comment in proxy_migrate_task() explaining the set_next_task(rq->curr) logic * Cleanup deadline.c change to stylistically match rt.c change * Numerous cleanups suggested by Metin v10: * Drop WARN_ON(task_is_blocked(p)) in ttwu current case v11: * Include proxy_set_task_cpu from later in the series to this change so we can use it, rather then reworking logic later in the series. * Fix problem with return migration, where affinity was changed and wake_cpu was left outside the affinity mask. * Avoid reading the owner's cpu twice (as it might change inbetween) to avoid occasional migration-to-same-cpu edge cases * Add extra WARN_ON checks for wake_cpu and return migration edge cases. * Typo fix from Metin v13: * As we set ret, return it, not just NULL (pulling this change in from later patch) * Avoid deadlock between try_to_wake_up() and find_proxy_task() when blocked_on cycle with ww_mutex is trying a mid-chain wakeup. 
* Tweaks to use new __set_blocked_on_runnable() helper * Potential fix for incorrectly updated task->dl_server issues * Minor comment improvements * Add logic to handle missed wakeups, in that case doing return migration from the find_proxy_task() path * Minor cleanups v14: * Improve edge cases where we wouldn't set the task as BO_RUNNABLE v15: * Added comment to better describe proxy_needs_return() as suggested by Qais * Build fixes for !CONFIG_SMP reported by Maciej =C5=BBenczykowski * Adds fix for re-evaluating proxy_needs_return when sched_proxy_exec() is disabled, reported and diagnosed by: kuyo chang v16: * Larger rework of needs_return logic in find_proxy_task, in order to avoid problems with cpuhotplug * Rework to use guard() as suggested by Peter v18: * Integrate optimization suggested by Suleiman to do the checks for sleeping owners before checking if the task_cpu is this_cpu, so that we can avoid needlessly proxy-migrating tasks to only then dequeue them. Also check if migrating last. * Improve comments around guard locking * Include tweak to ttwu_runnable() as suggested by hupu * Rework the logic releasing the rq->donor reference before letting go of the rqlock. Just use rq->idle. * Go back to doing return migration on BO_WAKING owners, as I was hitting some softlockups caused by running tasks not making it out of BO_WAKING. v19: * Fixed proxy_force_return() logic for !SMP cases Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com fix for proxy migration logic on !SMP --- kernel/sched/core.c | 251 ++++++++++++++++++++++++++++++++++++++++---- kernel/sched/fair.c | 3 +- 2 files changed, 230 insertions(+), 24 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 30e676c2d582b..1c249d1d62f5a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3151,6 +3151,14 @@ static int __set_cpus_allowed_ptr_locked(struct task= _struct *p, =20 __do_set_cpus_allowed(p, ctx); =20 + /* + * It might be that the p->wake_cpu is no longer + * allowed, so set it to the dest_cpu so return + * migration doesn't send it to an invalid cpu + */ + if (!is_cpu_allowed(p, p->wake_cpu)) + p->wake_cpu =3D dest_cpu; + return affine_move_task(rq, p, rf, dest_cpu, ctx->flags); =20 out: @@ -3711,6 +3719,67 @@ static inline void ttwu_do_wakeup(struct task_struct= *p) trace_sched_wakeup(p); } =20 +#ifdef CONFIG_SCHED_PROXY_EXEC +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu) +{ + unsigned int wake_cpu; + + /* + * Since we are enqueuing a blocked task on a cpu it may + * not be able to run on, preserve wake_cpu when we + * __set_task_cpu so we can return the task to where it + * was previously runnable. 
+	 */
+	wake_cpu = p->wake_cpu;
+	__set_task_cpu(p, cpu);
+	p->wake_cpu = wake_cpu;
+}
+
+static bool proxy_task_runnable_but_waking(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+	return (READ_ONCE(p->__state) == TASK_RUNNING &&
+		READ_ONCE(p->blocked_on_state) == BO_WAKING);
+}
+#else /* !CONFIG_SCHED_PROXY_EXEC */
+static bool proxy_task_runnable_but_waking(struct task_struct *p)
+{
+	return false;
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
+
+/*
+ * Checks to see if task p has been proxy-migrated to another rq
+ * and needs to be returned. If so, we deactivate the task here
+ * so that it can be properly woken up on the p->wake_cpu
+ * (or whichever cpu select_task_rq() picks at the bottom of
+ * try_to_wake_up()).
+ */
+static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
+{
+	bool ret = false;
+
+	if (!sched_proxy_exec())
+		return false;
+
+	raw_spin_lock(&p->blocked_lock);
+	if (__get_task_blocked_on(p) && p->blocked_on_state == BO_WAKING) {
+		if (!task_current(rq, p) && (p->wake_cpu != cpu_of(rq))) {
+			if (task_current_donor(rq, p)) {
+				put_prev_task(rq, p);
+				rq_set_donor(rq, rq->idle);
+			}
+			deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+			ret = true;
+		}
+		__set_blocked_on_runnable(p);
+		resched_curr(rq);
+	}
+	raw_spin_unlock(&p->blocked_lock);
+	return ret;
+}
+
 static void ttwu_do_activate(struct rq *rq, struct task_struct *p,
 			     int wake_flags, struct rq_flags *rf)
@@ -3796,6 +3865,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 		update_rq_clock(rq);
 		if (p->se.sched_delayed)
 			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+		if (proxy_needs_return(rq, p))
+			goto out;
 		if (!task_on_cpu(rq, p)) {
 			/*
 			 * When on_rq && !on_cpu the task is preempted, see if
@@ -3806,6 +3877,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 		ttwu_do_wakeup(p);
 		ret = 1;
 	}
+out:
 	__task_rq_unlock(rq, &rf);
 
 	return ret;
@@ -4193,6 +4265,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 * it disabling IRQs (this allows not taking ->pi_lock).
 		 */
 		WARN_ON_ONCE(p->se.sched_delayed);
+		/* If current is waking up, we know we can run here, so set BO_RUNNABLE */
+		set_blocked_on_runnable(p);
 		if (!ttwu_state_match(p, state, &success))
 			goto out;
 
@@ -4209,8 +4283,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
 		smp_mb__after_spinlock();
-		if (!ttwu_state_match(p, state, &success))
-			break;
+		if (!ttwu_state_match(p, state, &success)) {
+			/*
+			 * If we're already TASK_RUNNING and BO_WAKING,
+			 * continue on to the ttwu_runnable check to force
+			 * proxy_needs_return evaluation
+			 */
+			if (!proxy_task_runnable_but_waking(p))
+				break;
+		}
 
 		trace_sched_waking(p);
 
@@ -4272,6 +4353,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * enqueue, such as ttwu_queue_wakelist().
 	 */
 	WRITE_ONCE(p->__state, TASK_WAKING);
+	set_blocked_on_runnable(p);
 
 	/*
 	 * If the owning (remote) CPU is still in the middle of schedule() with
@@ -4322,7 +4404,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
-	set_blocked_on_runnable(p);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
@@ -6624,7 +6705,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
 	return rq->idle;
 }
 
-static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 {
 	unsigned long state = READ_ONCE(donor->__state);
 
@@ -6644,17 +6725,98 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
 	return try_to_block_task(rq, donor, &state, true);
 }
 
-static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+/*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * Note: The owner can disappear, but simply migrate to @target_cpu
+ * and leave that CPU to sort things out.
+ */
+static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p, int target_cpu)
 {
-	if (!__proxy_deactivate(rq, donor)) {
-		/*
-		 * XXX: For now, if deactivation failed, set donor
-		 * as unblocked, as we aren't doing proxy-migrations
-		 * yet (more logic will be needed then).
-		 */
-		donor->blocked_on_state = BO_RUNNABLE;
-	}
-	return NULL;
+	struct rq *target_rq = cpu_rq(target_cpu);
+
+	lockdep_assert_rq_held(rq);
+
+	/*
+	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
+	 * otherwise we have a reference that no longer belongs to us.
+	 *
+	 * Additionally, as we put_prev_task(prev) earlier, it's possible that
+	 * prev will migrate away as soon as we drop the rq lock, however we
+	 * still have it marked as rq->curr, as we've not yet switched tasks.
+	 *
+	 * After the migration, we are going to pick_again in the __schedule
+	 * logic, so backtrack a bit before we release the lock:
+	 * Put rq->donor, and set rq->curr as rq->donor and set_next_task,
+	 * so that we're close to the situation we had entering __schedule
+	 * the first time.
+	 *
+	 * Then when we re-acquire the lock, we will re-put rq->curr then
+	 * rq_set_donor(rq->idle) and set_next_task(rq->idle), before
+	 * picking again.
+ */ + /* XXX - Added to address problems with changed dl_server semantics - dou= ble check */ + __put_prev_set_next_dl_server(rq, rq->donor, rq->curr); + put_prev_task(rq, rq->donor); + rq_set_donor(rq, rq->idle); + set_next_task(rq, rq->idle); + + WARN_ON(p =3D=3D rq->curr); + + deactivate_task(rq, p, 0); + proxy_set_task_cpu(p, target_cpu); + + zap_balance_callbacks(rq); + rq_unpin_lock(rq, rf); + raw_spin_rq_unlock(rq); + raw_spin_rq_lock(target_rq); + + activate_task(target_rq, p, 0); + wakeup_preempt(target_rq, p, 0); + + raw_spin_rq_unlock(target_rq); + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); +} + +static void proxy_force_return(struct rq *rq, struct rq_flags *rf, + struct task_struct *p) +{ + lockdep_assert_rq_held(rq); + + put_prev_task(rq, rq->donor); + rq_set_donor(rq, rq->idle); + set_next_task(rq, rq->idle); + + WARN_ON(p =3D=3D rq->curr); + + p->blocked_on_state =3D BO_WAKING; + get_task_struct(p); + block_task(rq, p, 0); + + zap_balance_callbacks(rq); + rq_unpin_lock(rq, rf); + raw_spin_rq_unlock(rq); + + wake_up_process(p); + put_task_struct(p); + + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); +} + +static inline bool proxy_can_run_here(struct rq *rq, struct task_struct *p) +{ + if (p =3D=3D rq->curr || p->wake_cpu =3D=3D cpu_of(rq)) + return true; + return false; } =20 /* @@ -6677,9 +6839,11 @@ static struct task_struct * find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags = *rf) { struct task_struct *owner =3D NULL; + bool curr_in_chain =3D false; int this_cpu =3D cpu_of(rq); struct task_struct *p; struct mutex *mutex; + int owner_cpu; =20 /* Follow blocked_on chain. */ for (p =3D donor; task_is_blocked(p); p =3D owner) { @@ -6705,6 +6869,10 @@ find_proxy_task(struct rq *rq, struct task_struct *d= onor, struct rq_flags *rf) return NULL; } =20 + /* Double check blocked_on_state now we're holding the lock */ + if (p->blocked_on_state =3D=3D BO_RUNNABLE) + return p; + /* * If a ww_mutex hits the die/wound case, it marks the task as * BO_WAKING and calls try_to_wake_up(), so that the mutex @@ -6720,26 +6888,50 @@ find_proxy_task(struct rq *rq, struct task_struct *= donor, struct rq_flags *rf) * try_to_wake_up from completing and doing the return * migration. * - * So when we hit a !BO_BLOCKED task briefly schedule idle - * so we release the rq and let the wakeup complete. + * So when we hit a BO_WAKING task try to wake it up ourselves. 
 	 */
-	if (p->blocked_on_state != BO_BLOCKED)
-		return proxy_resched_idle(rq);
+	if (p->blocked_on_state == BO_WAKING) {
+		if (task_current(rq, p)) {
+			/* If it's current, just set it runnable */
+			__force_blocked_on_runnable(p);
+			return p;
+		}
+		goto needs_return;
+	}
+
+	if (task_current(rq, p))
+		curr_in_chain = true;
 
 	owner = __mutex_owner(mutex);
 	if (!owner) {
+		/* If the owner is null, we may have some work to do */
+		if (!proxy_can_run_here(rq, p))
+			goto needs_return;
+		__force_blocked_on_runnable(p);
 		return p;
 	}
 
 	if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
-		/* XXX Don't handle blocked owners/delayed dequeue yet */
-		return proxy_deactivate(rq, donor);
+		/* XXX Don't handle blocked owners / delayed dequeue yet */
+		if (!proxy_deactivate(rq, donor)) {
+			if (!proxy_can_run_here(rq, p))
+				goto needs_return;
+			__force_blocked_on_runnable(p);
+			return p;
+		}
+		return NULL;
 	}
 
-	if (task_cpu(owner) != this_cpu) {
-		/* XXX Don't handle migrations yet */
-		return proxy_deactivate(rq, donor);
+	owner_cpu = task_cpu(owner);
+	if (owner_cpu != this_cpu) {
+		/*
+		 * @owner can disappear, simply migrate to @owner_cpu
+		 * and leave that CPU to sort things out.
+		 */
+		if (curr_in_chain)
+			return proxy_resched_idle(rq);
+		goto migrate;
 	}
 
 	if (task_on_rq_migrating(owner)) {
@@ -6799,6 +6991,19 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 
 	WARN_ON_ONCE(owner && !owner->on_rq);
 	return owner;
+
+	/*
+	 * NOTE: This logic is down here, because we need to call
+	 * the functions with the mutex wait_lock and task
+	 * blocked_lock released, so we have to get out of the
+	 * guard() scope.
+	 */
+migrate:
+	proxy_migrate_task(rq, rf, p, owner_cpu);
+	return NULL;
+needs_return:
+	proxy_force_return(rq, rf, p);
+	return NULL;
 }
 #else /* SCHED_PROXY_EXEC */
 static struct task_struct *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c2..cc531eb939831 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		se = &p->se;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (prev->sched_class != &fair_sched_class)
+	if (prev->sched_class != &fair_sched_class ||
+	    rq->curr != rq->donor)
 		goto simple;
 
 	__put_prev_set_next_dl_server(rq, prev, p);
-- 
2.50.0.727.gbf7dc18ff4-goog
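The state machine above is easier to follow outside of scheduler context. Below is a minimal user-space sketch (not part of the patch; struct task, struct rq-less fields and needs_return_migration() are simplified stand-ins for task_struct, rq and proxy_needs_return()) showing the BO_BLOCKED -> BO_WAKING -> BO_RUNNABLE transitions and the wake_cpu check that decides whether a woken task has to be return-migrated:

/* Illustrative user-space model of the blocked_on_state handling above.
 * The names here are simplified stand-ins, not the kernel's real types.
 */
#include <stdbool.h>
#include <stdio.h>

enum blocked_on_state { BO_RUNNABLE, BO_BLOCKED, BO_WAKING };

struct task {
	enum blocked_on_state blocked_on_state;
	void *blocked_on;	/* mutex the task is blocked on, if any */
	int cpu;		/* where the task is currently enqueued */
	int wake_cpu;		/* where it was last runnable by choice */
};

/*
 * Mirrors the proxy_needs_return() decision: a task that was
 * proxy-migrated (cpu != wake_cpu) and is being woken (BO_WAKING)
 * has to be dequeued and sent back through the normal wakeup path.
 */
static bool needs_return_migration(const struct task *p)
{
	return p->blocked_on && p->blocked_on_state == BO_WAKING &&
	       p->cpu != p->wake_cpu;
}

int main(void)
{
	int mutex;	/* stand-in for a struct mutex */
	struct task p = { .blocked_on_state = BO_BLOCKED, .blocked_on = &mutex,
			  .cpu = 3, .wake_cpu = 1 };

	/* A ww_mutex die/wound or an unlock marks the waiter as waking... */
	p.blocked_on_state = BO_WAKING;

	/* ...and the wakeup path decides whether to migrate it home. */
	printf("needs return migration: %s\n",
	       needs_return_migration(&p) ? "yes" : "no");

	p.blocked_on_state = BO_RUNNABLE;	/* after return migration */
	return 0;
}

Built with any C99 compiler this prints "needs return migration: yes", since the task sits on CPU 3 but was last runnable on CPU 1.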
From nobody Mon Oct 6 11:49:39 2025
Date: Tue, 22 Jul 2025 07:05:51 +0000
In-Reply-To: <20250722070600.3267819-1-jstultz@google.com>
References: <20250722070600.3267819-1-jstultz@google.com>
Message-ID: <20250722070600.3267819-6-jstultz@google.com>
Subject: [RFC][PATCH v20 5/6] sched: Add blocked_donor link to task for smarter mutex handoffs
From: John Stultz
To: LKML
Cc: Peter Zijlstra , Juri Lelli , Valentin Schneider , "Connor O'Brien" , John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E.
McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra Add link to the task this task is proxying for, and use it so the mutex owner can do an intelligent hand-off of the mutex to the task that the owner is running on behalf. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Juri Lelli Signed-off-by: Valentin Schneider Signed-off-by: Connor O'Brien [jstultz: This patch was split out from larger proxy patch] Signed-off-by: John Stultz --- v5: * Split out from larger proxy patch v6: * Moved proxied value from earlier patch to this one where it is actually used * Rework logic to check sched_proxy_exec() instead of using ifdefs * Moved comment change to this patch where it makes sense v7: * Use more descriptive term then "us" in comments, as suggested by Metin Kaya. * Minor typo fixup from Metin Kaya * Reworked proxied variable to prev_not_proxied to simplify usage v8: * Use helper for donor blocked_on_state transition v9: * Re-add mutex lock handoff in the unlock path, but only when we have a blocked donor * Slight reword of commit message suggested by Metin v18: * Add task_init initialization for blocked_donor, suggested by Suleiman Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- include/linux/sched.h | 1 + init/init_task.c | 1 + kernel/fork.c | 1 + kernel/locking/mutex.c | 41 ++++++++++++++++++++++++++++++++++++++--- kernel/sched/core.c | 18 ++++++++++++++++-- 5 files changed, 57 insertions(+), 5 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index ced001f889519..675e2f89ec0f8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1239,6 +1239,7 @@ struct task_struct { =20 enum blocked_on_state blocked_on_state; struct mutex *blocked_on; /* lock we're blocked on */ + struct task_struct *blocked_donor; /* task that is boosting this task */ raw_spinlock_t blocked_lock; =20 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER diff --git a/init/init_task.c b/init/init_task.c index 6d72ec23410a6..627bbd8953e88 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -175,6 +175,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { &init_task.alloc_lock), #endif .blocked_on_state =3D BO_RUNNABLE, + .blocked_donor =3D NULL, #ifdef CONFIG_RT_MUTEXES .pi_waiters =3D RB_ROOT_CACHED, .pi_top_task =3D NULL, diff --git a/kernel/fork.c b/kernel/fork.c index 5eacb25a0c5ab..61a2ac850faf0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2126,6 +2126,7 @@ __latent_entropy struct task_struct *copy_process( =20 p->blocked_on_state =3D BO_RUNNABLE; p->blocked_on =3D NULL; /* not blocked yet */ + p->blocked_donor =3D NULL; /* nobody is boosting p yet */ =20 #ifdef CONFIG_BCACHE p->sequential_io =3D 0; diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index b5145ddaec242..da6e964498ad0 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -926,7 +926,7 @@ EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible); */ static noinline void __sched 
__mutex_unlock_slowpath(struct mutex *lock, u= nsigned long ip) { - struct task_struct *next =3D NULL; + struct task_struct *donor, *next =3D NULL; DEFINE_WAKE_Q(wake_q); unsigned long owner; unsigned long flags; @@ -945,6 +945,12 @@ static noinline void __sched __mutex_unlock_slowpath(s= truct mutex *lock, unsigne MUTEX_WARN_ON(__owner_task(owner) !=3D current); MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP); =20 + if (sched_proxy_exec() && current->blocked_donor) { + /* force handoff if we have a blocked_donor */ + owner =3D MUTEX_FLAG_HANDOFF; + break; + } + if (owner & MUTEX_FLAG_HANDOFF) break; =20 @@ -958,7 +964,34 @@ static noinline void __sched __mutex_unlock_slowpath(s= truct mutex *lock, unsigne =20 raw_spin_lock_irqsave(&lock->wait_lock, flags); debug_mutex_unlock(lock); - if (!list_empty(&lock->wait_list)) { + + if (sched_proxy_exec()) { + raw_spin_lock(¤t->blocked_lock); + /* + * If we have a task boosting current, and that task was boosting + * current through this lock, hand the lock to that task, as that + * is the highest waiter, as selected by the scheduling function. + */ + donor =3D current->blocked_donor; + if (donor) { + struct mutex *next_lock; + + raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING); + next_lock =3D __get_task_blocked_on(donor); + if (next_lock =3D=3D lock) { + next =3D donor; + __set_blocked_on_waking(donor); + wake_q_add(&wake_q, donor); + current->blocked_donor =3D NULL; + } + raw_spin_unlock(&donor->blocked_lock); + } + } + + /* + * Failing that, pick any on the wait list. + */ + if (!next && !list_empty(&lock->wait_list)) { /* get the first entry from the wait-list: */ struct mutex_waiter *waiter =3D list_first_entry(&lock->wait_list, @@ -966,7 +999,7 @@ static noinline void __sched __mutex_unlock_slowpath(st= ruct mutex *lock, unsigne =20 next =3D waiter->task; =20 - raw_spin_lock(&next->blocked_lock); + raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING); debug_mutex_wake_waiter(lock, waiter); WARN_ON_ONCE(__get_task_blocked_on(next) !=3D lock); __set_blocked_on_waking(next); @@ -977,6 +1010,8 @@ static noinline void __sched __mutex_unlock_slowpath(s= truct mutex *lock, unsigne if (owner & MUTEX_FLAG_HANDOFF) __mutex_handoff(lock, next); =20 + if (sched_proxy_exec()) + raw_spin_unlock(¤t->blocked_lock); raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q); } =20 diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1c249d1d62f5a..2c3a4b9518927 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6823,7 +6823,17 @@ static inline bool proxy_can_run_here(struct rq *rq,= struct task_struct *p) * Find runnable lock owner to proxy for mutex blocked donor * * Follow the blocked-on relation: - * task->blocked_on -> mutex->owner -> task... + * + * ,-> task + * | | blocked-on + * | v + * blocked_donor | mutex + * | | owner + * | v + * `-- task + * + * and set the blocked_donor relation, this latter is used by the mutex + * code to find which (blocked) task to hand-off to. * * Lock order: * @@ -6987,6 +6997,7 @@ find_proxy_task(struct rq *rq, struct task_struct *do= nor, struct rq_flags *rf) * rq, therefore holding @rq->lock is sufficient to * guarantee its existence, as per ttwu_remote(). 
 	 */
+		owner->blocked_donor = p;
 	}
 
 	WARN_ON_ONCE(owner && !owner->on_rq);
 	return owner;
@@ -7083,6 +7094,7 @@ static void __sched notrace __schedule(int sched_mode)
 	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
+	bool prev_not_proxied;
 	int cpu;
 
 	trace_sched_entry_tp(preempt, CALLER_ADDR0);
@@ -7154,9 +7166,11 @@ static void __sched notrace __schedule(int sched_mode)
 		switch_count = &prev->nvcsw;
 	}
 
+	prev_not_proxied = !prev->blocked_donor;
pick_again:
 	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
+	next->blocked_donor = NULL;
 	if (unlikely(task_is_blocked(next))) {
 		next = find_proxy_task(rq, next, &rf);
 		if (!next) {
@@ -7220,7 +7234,7 @@ static void __sched notrace __schedule(int sched_mode)
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
 		/* In case next was already curr but just got blocked_donor */
-		if (!task_current_donor(rq, next))
+		if (prev_not_proxied && next->blocked_donor)
 			proxy_tag_curr(rq, next);
 
 		rq_unpin_lock(rq, &rf);
-- 
2.50.0.727.gbf7dc18ff4-goog
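To illustrate the hand-off policy this patch adds to __mutex_unlock_slowpath(): prefer the unlocker's blocked_donor when that donor is blocked on the lock being released, and only fall back to the head of the wait list otherwise. The sketch below models just that decision in user space; struct lock, struct task and pick_next_owner() are simplified stand-ins, not the kernel types or API:

/* Simplified user-space model of the donor-first hand-off choice.
 * struct lock and struct task are stand-ins for struct mutex and
 * task_struct; the flat array stands in for the mutex wait list.
 */
#include <stddef.h>
#include <stdio.h>

struct lock;

struct task {
	const char *name;
	struct lock *blocked_on;	/* lock this task waits for */
	struct task *blocked_donor;	/* task boosting this task */
};

struct lock {
	struct task *owner;
	struct task *waiters[8];
	int nr_waiters;
};

/* Prefer the donor that was boosting the owner through this lock;
 * otherwise fall back to the first waiter, as before the patch. */
static struct task *pick_next_owner(struct lock *l, struct task *unlocker)
{
	struct task *donor = unlocker->blocked_donor;

	if (donor && donor->blocked_on == l) {
		unlocker->blocked_donor = NULL;
		return donor;
	}
	return l->nr_waiters ? l->waiters[0] : NULL;
}

int main(void)
{
	struct lock l = { 0 };
	struct task owner  = { .name = "owner" };
	struct task waiter = { .name = "first waiter", .blocked_on = &l };
	struct task donor  = { .name = "donor", .blocked_on = &l };

	l.owner = &owner;
	l.waiters[l.nr_waiters++] = &waiter;
	l.waiters[l.nr_waiters++] = &donor;
	owner.blocked_donor = &donor;	/* donor has been boosting owner */

	printf("hand off to: %s\n", pick_next_owner(&l, &owner)->name);
	return 0;
}

Run as-is it prints "hand off to: donor", even though the donor is not the first entry on the wait list, which is the behavioral change the patch is after.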
From nobody Mon Oct 6 11:49:39 2025
Date: Tue, 22 Jul 2025 07:05:52 +0000
In-Reply-To: <20250722070600.3267819-1-jstultz@google.com>
References: <20250722070600.3267819-1-jstultz@google.com>
Message-ID: <20250722070600.3267819-7-jstultz@google.com>
Subject: [RFC][PATCH v20 6/6] sched: Migrate whole chain in proxy_migrate_task()
From: John Stultz
To: LKML
Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Instead of migrating one task each time through find_proxy_task(), we
can walk up the blocked_donor ptrs and migrate the entire current
chain in one go.

This was broken out of earlier patches and held back while the series
was being stabilized, but I wanted to re-introduce it.

Signed-off-by: John Stultz
---
v12:
* Earlier this was re-using blocked_node, but I hit a race with
  activating blocked entities, and to avoid it introduced a new
  migration_node listhead
v18:
* Add init_task initialization of migration_node as suggested by
  Suleiman

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E.
McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- include/linux/sched.h | 1 + init/init_task.c | 1 + kernel/fork.c | 1 + kernel/sched/core.c | 25 +++++++++++++++++-------- 4 files changed, 20 insertions(+), 8 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 675e2f89ec0f8..e9242dfa5f271 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1240,6 +1240,7 @@ struct task_struct { enum blocked_on_state blocked_on_state; struct mutex *blocked_on; /* lock we're blocked on */ struct task_struct *blocked_donor; /* task that is boosting this task */ + struct list_head migration_node; raw_spinlock_t blocked_lock; =20 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER diff --git a/init/init_task.c b/init/init_task.c index 627bbd8953e88..65e0f90285966 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -176,6 +176,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { #endif .blocked_on_state =3D BO_RUNNABLE, .blocked_donor =3D NULL, + .migration_node =3D LIST_HEAD_INIT(init_task.migration_node), #ifdef CONFIG_RT_MUTEXES .pi_waiters =3D RB_ROOT_CACHED, .pi_top_task =3D NULL, diff --git a/kernel/fork.c b/kernel/fork.c index 61a2ac850faf0..892940ea52958 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2127,6 +2127,7 @@ __latent_entropy struct task_struct *copy_process( p->blocked_on_state =3D BO_RUNNABLE; p->blocked_on =3D NULL; /* not blocked yet */ p->blocked_donor =3D NULL; /* nobody is boosting p yet */ + INIT_LIST_HEAD(&p->migration_node); =20 #ifdef CONFIG_BCACHE p->sequential_io =3D 0; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2c3a4b9518927..c1d813a9cde96 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6741,6 +6741,7 @@ static void proxy_migrate_task(struct rq *rq, struct = rq_flags *rf, struct task_struct *p, int target_cpu) { struct rq *target_rq =3D cpu_rq(target_cpu); + LIST_HEAD(migrate_list); =20 lockdep_assert_rq_held(rq); =20 @@ -6768,19 +6769,27 @@ static void proxy_migrate_task(struct rq *rq, struc= t rq_flags *rf, rq_set_donor(rq, rq->idle); set_next_task(rq, rq->idle); =20 - WARN_ON(p =3D=3D rq->curr); - - deactivate_task(rq, p, 0); - proxy_set_task_cpu(p, target_cpu); - + for (; p; p =3D p->blocked_donor) { + WARN_ON(p =3D=3D rq->curr); + deactivate_task(rq, p, 0); + proxy_set_task_cpu(p, target_cpu); + /* + * We can abuse blocked_node to migrate the thing, + * because @p was still on the rq. + */ + list_add(&p->migration_node, &migrate_list); + } zap_balance_callbacks(rq); rq_unpin_lock(rq, rf); raw_spin_rq_unlock(rq); raw_spin_rq_lock(target_rq); + while (!list_empty(&migrate_list)) { + p =3D list_first_entry(&migrate_list, struct task_struct, migration_node= ); + list_del_init(&p->migration_node); =20 - activate_task(target_rq, p, 0); - wakeup_preempt(target_rq, p, 0); - + activate_task(target_rq, p, 0); + wakeup_preempt(target_rq, p, 0); + } raw_spin_rq_unlock(target_rq); raw_spin_rq_lock(rq); rq_repin_lock(rq, rf); --=20 2.50.0.727.gbf7dc18ff4-goog