From nobody Mon Jun  8 07:24:34 2026
Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 440EF3BADB6;
	Thu,  4 Jun 2026 18:45:55 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=193.142.43.55
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1780598757; cv=none;
 b=kAn3eOgczDrlOdBsogICWnu7jZdUf7bmWL6VtASMgf7Llk6Ygx/hOQ0SbA5u+z5Wpcspv7wco3BTGEbI2O5O8uB9rxQCnpl4l+nPPUIXtEk5YsHCxKypaEYCnWpDs8+YOINNcOBFVYsOcnIWggY1weCtWMq28tgNwTPaWtQvk5w=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1780598757; c=relaxed/simple;
	bh=I47a7t8HO992cWt9IxKO55Jz95yCV1APYYp5ZCZl0OA=;
	h=Date:From:To:Subject:Cc:In-Reply-To:References:MIME-Version:
	 Message-ID:Content-Type;
 b=ZQNL5UU2iWazlfkAfs/DeWOmVcS4/wun/T9Bm3TF2TSguc5anMjY5J29pYYCNu6uAwI4xseoLa0g1nylgwGcfLJ7oELwD3eZXp6hola3kIrtpQWlzw/ftWbP1AnQDAsDl38PqIg/ZTLNqNjMfi9apNAPOEDrFiyEFx0+r5LxHcg=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linutronix.de;
 spf=pass smtp.mailfrom=linutronix.de;
 dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b=0evIg6wD;
 dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b=cFJgWqTJ; arc=none smtp.client-ip=193.142.43.55
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b="0evIg6wD";
	dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b="cFJgWqTJ"
Date: Thu, 04 Jun 2026 18:45:52 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020; t=1780598754;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=qfoQ2UU6Tb82ofkpMu1sq/+DSc1MHcor2pzVCyXUrKc=;
	b=0evIg6wDLWQS0bGh9bvOU/lj25rw0XLh44E4wwDETokG+9D1DInsIuFekxFqWBsyvJFOld
	Q7u3hq5+s0gE0XS8rzaDsUJ+kfBEcrAFByHkRc5JZmZTHHzbeX/fYkuqrMcpYHwEhimi/b
	OMUAmRnNLXBc2UUUoQtJ7Iu93Ca7llJuFILCG0smt9X8bDovlqFpfbmNYocJtnMytlm5fB
	k5KsY+p3/S3byFyYH7Iuqv5GxICyYmnK5GiOOUPRNvwYRwHrXhkhOZqvV9UjmMki/Ae2Se
	f0ZeJkzMEeoHHw6r9BROYF9xe/UNhmAXXHsXbHYviOKAe4frSWKuB9Gqr9cCnw==
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020e; t=1780598754;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=qfoQ2UU6Tb82ofkpMu1sq/+DSc1MHcor2pzVCyXUrKc=;
	b=cFJgWqTJbXmM+NtoRcAkdAvuUr4oAdcIExbS6RDKLu55ycRY1JLUhEpyZL3fnb94slsF12
	TrSKscWhwRSkk0Bg==
From: "tip-bot2 for John Stultz" <tip-bot2@linutronix.de>
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: 
 [tip: sched/core] sched: Rework prev_balance() to avoid stale prev references
Cc: John Stultz <jstultz@google.com>,
 "Peter Zijlstra (Intel)" <peterz@infradead.org>, x86@kernel.org,
 linux-kernel@vger.kernel.org
In-Reply-To: <20260512025635.2840817-2-jstultz@google.com>
References: <20260512025635.2840817-2-jstultz@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Message-ID: <178059875243.710.15556955671777468090.tip-bot2@tip-bot2>
Robot-ID: <tip-bot2@linutronix.de>
Robot-Unsubscribe: 
 Contact <mailto:tglx@kernel.org> to get blacklisted from these emails
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     7a3a6bfbd62a2ba3e0ef1e92d6b71abb66890825
Gitweb:        https://git.kernel.org/tip/7a3a6bfbd62a2ba3e0ef1e92d6b71abb6=
6890825
Author:        John Stultz <jstultz@google.com>
AuthorDate:    Tue, 12 May 2026 02:56:11=20
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 02 Jun 2026 12:26:06 +02:00

sched: Rework prev_balance() to avoid stale prev references

Historically, the prev value in __schedule() was rq->curr.
This prev value is passed down through numerous functions and
used in the scheduling class implementations. Because prev was
on_cpu until the end of __schedule(), it was stable across the
rq lock drops that the class->balance() implementations often do.

However, with proxy-exec, the prev passed to functions called
by __schedule() is rq->donor, which may not be the same as
rq->curr and may not be on_cpu. This makes the prev value
potentially unstable across rq lock drops.

A recently found issue with proxy-exec is that once we begin doing
return migration from try_to_wake_up(), it's possible we may be
waking up the rq->donor. When we do this, we call
proxy_resched_idle() to put_prev_set_next(), setting rq->donor to
rq->idle and allowing the previous donor to be return migrated and
run.

This however runs into trouble, as on another CPU we might be in
the middle of calling __schedule(). Conceptually the rq lock is
held for the majority of the time, but in calling prev_balance()
it's possible the class->balance() handler may briefly drop the
rq lock. This opens a window for try_to_wake_up() to wake and
return migrate the rq->donor before the class logic reacquires
the rq lock.

Unfortunately prev_balance() takes a prev argument, to which we
pass rq->donor. This prev value can now become stale and incorrect
across an rq lock drop.

So, to correct this, rework the prev_balance() call so that it does
not take a "prev" argument, and have the callees read rq->donor
directly instead.

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-2-jstultz@google.com
---
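Illustration only, not part of the commit: a minimal user-space
sketch of the window described above, assuming a toy model. The
struct task and struct rq types, drop_and_retake_lock(), and the
two walk_classes_*() helpers are hypothetical stand-ins and are not
the kernel's definitions. A simulated remote wakeup swaps rq->donor
to rq->idle while the lock is dropped. The old shape keeps using the
pointer sampled by the caller; the new shape reads rq->donor each
time a class is entered.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration; NOT the kernel's types. */
struct task { int id; };

struct rq {
	struct task *donor;	/* task whose scheduling context is used */
	struct task *idle;
	int locked;
};

/*
 * Model a balance callback dropping and retaking the rq lock.
 * While the lock is dropped, a simulated remote wakeup replaces
 * rq->donor with rq->idle.
 */
static void drop_and_retake_lock(struct rq *rq)
{
	assert(rq->locked);
	rq->locked = 0;
	rq->donor = rq->idle;	/* stand-in for try_to_wake_up() */
	rq->locked = 1;
}

/* Old shape: one sample of the donor is reused for every class. */
static struct task *walk_classes_old(struct rq *rq, struct task *prev)
{
	struct task *seen = NULL;
	int i;

	for (i = 0; i < 2; i++) {
		seen = prev;		/* may be stale on later passes */
		drop_and_retake_lock(rq);
	}
	return seen;
}

/* New shape: no prev argument; each class reads rq->donor. */
static struct task *walk_classes_new(struct rq *rq)
{
	struct task *seen = NULL;
	int i;

	for (i = 0; i < 2; i++) {
		seen = rq->donor;	/* fresh read with the lock held */
		drop_and_retake_lock(rq);
	}
	return seen;
}
```

In this model the second pass of walk_classes_old() still operates
on the original task even though rq->donor is now rq->idle, while
the second pass of walk_classes_new() sees rq->idle.
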
 kernel/sched/core.c      | 33 ++++++++++++++++-----------------
 kernel/sched/deadline.c  |  8 +++++++-
 kernel/sched/idle.c      |  2 +-
 kernel/sched/rt.c        |  8 +++++++-
 kernel/sched/sched.h     |  2 +-
 kernel/sched/stop_task.c |  2 +-
 6 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c8bfd6..a9c9b89 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5986,10 +5986,9 @@ static inline void schedule_debug(struct task_struct=
 *prev, bool preempt)
 	schedstat_inc(this_rq()->sched_count);
 }
=20
-static void prev_balance(struct rq *rq, struct task_struct *prev,
-			 struct rq_flags *rf)
+static void prev_balance(struct rq *rq, struct rq_flags *rf)
 {
-	const struct sched_class *start_class =3D prev->sched_class;
+	const struct sched_class *start_class =3D rq->donor->sched_class;
 	const struct sched_class *class;
=20
 	/*
@@ -6001,7 +6000,7 @@ static void prev_balance(struct rq *rq, struct task_s=
truct *prev,
 	 * a runnable task of @class priority or higher.
 	 */
 	for_active_class_range(class, start_class, &idle_sched_class) {
-		if (class->balance && class->balance(rq, prev, rf))
+		if (class->balance && class->balance(rq, rf))
 			break;
 	}
 }
@@ -6010,7 +6009,7 @@ static void prev_balance(struct rq *rq, struct task_s=
truct *prev,
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags =
*rf)
+__pick_next_task(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
 	const struct sched_class *class;
@@ -6027,7 +6026,7 @@ __pick_next_task(struct rq *rq, struct task_struct *p=
rev, struct rq_flags *rf)
 	 * higher scheduling class, because otherwise those lose the
 	 * opportunity to pull in more work from other CPUs.
 	 */
-	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
+	if (likely(!sched_class_above(rq->donor->sched_class, &fair_sched_class) =
&&
 		   rq->nr_running =3D=3D rq->cfs.h_nr_queued)) {
=20
 		p =3D pick_task_fair(rq, rf);
@@ -6038,19 +6037,19 @@ __pick_next_task(struct rq *rq, struct task_struct =
*prev, struct rq_flags *rf)
 		if (!p)
 			p =3D pick_task_idle(rq, rf);
=20
-		put_prev_set_next_task(rq, prev, p);
+		put_prev_set_next_task(rq, rq->donor, p);
 		return p;
 	}
=20
 restart:
-	prev_balance(rq, prev, rf);
+	prev_balance(rq, rf);
=20
 	for_each_active_class(class) {
 		p =3D class->pick_task(rq, rf);
 		if (unlikely(p =3D=3D RETRY_TASK))
 			goto restart;
 		if (p) {
-			put_prev_set_next_task(rq, prev, p);
+			put_prev_set_next_task(rq, rq->donor, p);
 			return p;
 		}
 	}
@@ -6102,7 +6101,7 @@ extern void task_vruntime_update(struct rq *rq, struc=
t task_struct *p, bool in_f
 static void queue_core_balance(struct rq *rq);
=20
 static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *r=
f)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
 	struct task_struct *next, *p, *max;
@@ -6115,7 +6114,7 @@ pick_next_task(struct rq *rq, struct task_struct *pre=
v, struct rq_flags *rf)
 	bool need_sync;
=20
 	if (!sched_core_enabled(rq))
-		return __pick_next_task(rq, prev, rf);
+		return __pick_next_task(rq, rf);
=20
 	cpu =3D cpu_of(rq);
=20
@@ -6128,7 +6127,7 @@ pick_next_task(struct rq *rq, struct task_struct *pre=
v, struct rq_flags *rf)
 		 */
 		rq->core_pick =3D NULL;
 		rq->core_dl_server =3D NULL;
-		return __pick_next_task(rq, prev, rf);
+		return __pick_next_task(rq, rf);
 	}
=20
 	/*
@@ -6152,7 +6151,7 @@ pick_next_task(struct rq *rq, struct task_struct *pre=
v, struct rq_flags *rf)
 		goto out_set_next;
 	}
=20
-	prev_balance(rq, prev, rf);
+	prev_balance(rq, rf);
=20
 	smt_mask =3D cpu_smt_mask(cpu);
 	need_sync =3D !!rq->core->core_cookie;
@@ -6334,7 +6333,7 @@ restart_multi:
 	}
=20
 out_set_next:
-	put_prev_set_next_task(rq, prev, next);
+	put_prev_set_next_task(rq, rq->donor, next);
 	if (rq->core->core_forceidle_count && next =3D=3D rq->idle)
 		queue_core_balance(rq);
=20
@@ -6557,10 +6556,10 @@ static inline void sched_core_cpu_deactivate(unsign=
ed int cpu) {}
 static inline void sched_core_cpu_dying(unsigned int cpu) {}
=20
 static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *r=
f)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
-	return __pick_next_task(rq, prev, rf);
+	return __pick_next_task(rq, rf);
 }
=20
 #endif /* !CONFIG_SCHED_CORE */
@@ -7108,7 +7107,7 @@ static void __sched notrace __schedule(int sched_mode)
=20
 pick_again:
 	assert_balance_callbacks_empty(rq);
-	next =3D pick_next_task(rq, rq->donor, &rf);
+	next =3D pick_next_task(rq, &rf);
 	rq->next_class =3D next->sched_class;
 	if (sched_proxy_exec()) {
 		struct task_struct *prev_donor =3D rq->donor;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f9e62ed..6ef5a80 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2698,8 +2698,14 @@ static void check_preempt_equal_dl(struct rq *rq, st=
ruct task_struct *p)
 	resched_curr(rq);
 }
=20
-static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flag=
s *rf)
+static int balance_dl(struct rq *rq, struct rq_flags *rf)
 {
+	/*
+	 * Note, rq->donor may change during rq lock drops,
+	 * so don't re-use prev across lock drops
+	 */
+	struct task_struct *p =3D rq->donor;
+
 	if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index a83be0c..ff39120 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -462,7 +462,7 @@ select_task_rq_idle(struct task_struct *p, int cpu, int=
 flags)
 }
=20
 static int
-balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_idle(struct rq *rq, struct rq_flags *rf)
 {
 	return WARN_ON_ONCE(1);
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e6ea728..e474c31 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1596,8 +1596,14 @@ static void check_preempt_equal_prio(struct rq *rq, =
struct task_struct *p)
 	resched_curr(rq);
 }
=20
-static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flag=
s *rf)
+static int balance_rt(struct rq *rq, struct rq_flags *rf)
 {
+	/*
+	 * Note, rq->donor may change during rq lock drops,
+	 * so don't re-use p across lock drops
+	 */
+	struct task_struct *p =3D rq->donor;
+
 	if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 332ecf8..ef715f2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2587,7 +2587,7 @@ struct sched_class {
 	/*
 	 * schedule/pick_next_task/prev_balance: rq->lock
 	 */
-	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *=
rf);
+	int (*balance)(struct rq *rq, struct rq_flags *rf);
=20
 	/*
 	 * schedule/pick_next_task: rq->lock
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index f95798b..c909ca0 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -16,7 +16,7 @@ select_task_rq_stop(struct task_struct *p, int cpu, int f=
lags)
 }
=20
 static int
-balance_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_stop(struct rq *rq, struct rq_flags *rf)
 {
 	return sched_stop_runnable(rq);
 }