Date: Thu, 04 Jun 2026 18:45:52 -0000
From: "tip-bot2 for John Stultz"
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: sched/core] sched: Rework prev_balance() to avoid stale prev references
Cc: John Stultz, "Peter Zijlstra (Intel)", x86@kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260512025635.2840817-2-jstultz@google.com>
References: <20260512025635.2840817-2-jstultz@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Message-ID: <178059875243.710.15556955671777468090.tip-bot2@tip-bot2>

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     7a3a6bfbd62a2ba3e0ef1e92d6b71abb66890825
Gitweb:        https://git.kernel.org/tip/7a3a6bfbd62a2ba3e0ef1e92d6b71abb66890825
Author:        John Stultz
AuthorDate:    Tue, 12 May 2026 02:56:11
Committer:     Peter Zijlstra
CommitterDate: Tue, 02 Jun 2026 12:26:06 +02:00

sched: Rework prev_balance() to avoid stale prev references

Historically, the prev value from __schedule() was rq->curr.
This prev value is passed down through numerous functions, and
used in the class scheduler implementations. The fact that
prev was on_cpu until the end of __schedule() meant it was
stable across the rq lock drops that the class->balance()
implementations often do.

However, with proxy-exec, the prev passed to functions called
by __schedule() is rq->donor, which may not be the same as
rq->curr and may not be on_cpu. This makes the prev value
potentially unstable across rq lock drops.

A recently found issue with proxy-exec is that when we begin
doing return migration from try_to_wake_up(), it's possible we
may be waking up the rq->donor. When we do this, we call
proxy_resched_idle() to put_prev_set_next(), setting rq->donor
to rq->idle, allowing the task that was rq->donor to be return
migrated and to run.

This, however, runs into trouble, as on another CPU we might be
in the middle of calling __schedule(). Conceptually the rq lock
is held for the majority of the time, but in calling
prev_balance() it's possible the class->balance() handler call
may briefly drop the rq lock. This opens a window for
try_to_wake_up() to wake and return migrate the rq->donor before
the class logic reacquires the rq lock.

Unfortunately, prev_balance() takes a prev argument, to which
we pass rq->donor. However, this prev value can now become stale
and incorrect across a rq lock drop.

So, to correct this, rework the prev_balance() call so that
it does not take a "prev" argument.

Signed-off-by: John Stultz
Signed-off-by: Peter Zijlstra (Intel)
Link: https://patch.msgid.link/20260512025635.2840817-2-jstultz@google.com
---
 kernel/sched/core.c      | 33 ++++++++++++++++-----------------
 kernel/sched/deadline.c  |  8 +++++++-
 kernel/sched/idle.c      |  2 +-
 kernel/sched/rt.c        |  8 +++++++-
 kernel/sched/sched.h     |  2 +-
 kernel/sched/stop_task.c |  2 +-
 6 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c8bfd6..a9c9b89 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5986,10 +5986,9 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
 	schedstat_inc(this_rq()->sched_count);
 }
 
-static void prev_balance(struct rq *rq, struct task_struct *prev,
-			 struct rq_flags *rf)
+static void prev_balance(struct rq *rq, struct rq_flags *rf)
 {
-	const struct sched_class *start_class = prev->sched_class;
+	const struct sched_class *start_class = rq->donor->sched_class;
 	const struct sched_class *class;
 
 	/*
@@ -6001,7 +6000,7 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
 	 * a runnable task of @class priority or higher.
 	 */
 	for_active_class_range(class, start_class, &idle_sched_class) {
-		if (class->balance && class->balance(rq, prev, rf))
+		if (class->balance && class->balance(rq, rf))
 			break;
 	}
 }
@@ -6010,7 +6009,7 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
 	const struct sched_class *class;
@@ -6027,7 +6026,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * higher scheduling class, because otherwise those lose the
 	 * opportunity to pull in more work from other CPUs.
 	 */
-	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
+	if (likely(!sched_class_above(rq->donor->sched_class, &fair_sched_class) &&
 		   rq->nr_running == rq->cfs.h_nr_queued)) {
 
 		p = pick_task_fair(rq, rf);
@@ -6038,19 +6037,19 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		if (!p)
 			p = pick_task_idle(rq, rf);
 
-		put_prev_set_next_task(rq, prev, p);
+		put_prev_set_next_task(rq, rq->donor, p);
 		return p;
 	}
 
 restart:
-	prev_balance(rq, prev, rf);
+	prev_balance(rq, rf);
 
 	for_each_active_class(class) {
 		p = class->pick_task(rq, rf);
 		if (unlikely(p == RETRY_TASK))
 			goto restart;
 		if (p) {
-			put_prev_set_next_task(rq, prev, p);
+			put_prev_set_next_task(rq, rq->donor, p);
 			return p;
 		}
 	}
@@ -6102,7 +6101,7 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
 static void queue_core_balance(struct rq *rq);
 
 static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
 	struct task_struct *next, *p, *max;
@@ -6115,7 +6114,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	bool need_sync;
 
 	if (!sched_core_enabled(rq))
-		return __pick_next_task(rq, prev, rf);
+		return __pick_next_task(rq, rf);
 
 	cpu = cpu_of(rq);
 
@@ -6128,7 +6127,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		 */
 		rq->core_pick = NULL;
 		rq->core_dl_server = NULL;
-		return __pick_next_task(rq, prev, rf);
+		return __pick_next_task(rq, rf);
 	}
 
 	/*
@@ -6152,7 +6151,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		goto out_set_next;
 	}
 
-	prev_balance(rq, prev, rf);
+	prev_balance(rq, rf);
 
 	smt_mask = cpu_smt_mask(cpu);
 	need_sync = !!rq->core->core_cookie;
@@ -6334,7 +6333,7 @@ restart_multi:
 	}
 
 out_set_next:
-	put_prev_set_next_task(rq, prev, next);
+	put_prev_set_next_task(rq, rq->donor, next);
 	if (rq->core->core_forceidle_count && next == rq->idle)
 		queue_core_balance(rq);
 
@@ -6557,10 +6556,10 @@ static inline void sched_core_cpu_deactivate(unsigned int cpu) {}
 static inline void sched_core_cpu_dying(unsigned int cpu) {}
 
 static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
-	return __pick_next_task(rq, prev, rf);
+	return __pick_next_task(rq, rf);
 }
 
 #endif /* !CONFIG_SCHED_CORE */
@@ -7108,7 +7107,7 @@ static void __sched notrace __schedule(int sched_mode)
 
 pick_again:
 	assert_balance_callbacks_empty(rq);
-	next = pick_next_task(rq, rq->donor, &rf);
+	next = pick_next_task(rq, &rf);
 	rq->next_class = next->sched_class;
 	if (sched_proxy_exec()) {
 		struct task_struct *prev_donor = rq->donor;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f9e62ed..6ef5a80 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2698,8 +2698,14 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	resched_curr(rq);
 }
 
-static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_dl(struct rq *rq, struct rq_flags *rf)
 {
+	/*
+	 * Note, rq->donor may change during rq lock drops,
+	 * so don't re-use prev across lock drops
+	 */
+	struct task_struct *p = rq->donor;
+
 	if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index a83be0c..ff39120 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -462,7 +462,7 @@ select_task_rq_idle(struct task_struct *p, int cpu, int flags)
 }
 
 static int
-balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_idle(struct rq *rq, struct rq_flags *rf)
 {
 	return WARN_ON_ONCE(1);
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e6ea728..e474c31 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1596,8 +1596,14 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 	resched_curr(rq);
 }
 
-static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_rt(struct rq *rq, struct rq_flags *rf)
 {
+	/*
+	 * Note, rq->donor may change during rq lock drops,
+	 * so don't re-use p across lock drops
+	 */
+	struct task_struct *p = rq->donor;
+
 	if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 332ecf8..ef715f2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2587,7 +2587,7 @@ struct sched_class {
 	/*
 	 * schedule/pick_next_task/prev_balance: rq->lock
 	 */
-	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+	int (*balance)(struct rq *rq, struct rq_flags *rf);
 
 	/*
 	 * schedule/pick_next_task: rq->lock
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index f95798b..c909ca0 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -16,7 +16,7 @@ select_task_rq_stop(struct task_struct *p, int cpu, int flags)
 }
 
 static int
-balance_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_stop(struct rq *rq, struct rq_flags *rf)
 {
 	return sched_stop_runnable(rq);
 }
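
Illustration (not part of the commit): the stale-prev hazard described in
the changelog can be reproduced outside the kernel. What follows is a
minimal, single-threaded C sketch under stated assumptions. Every toy_*
name and type is hypothetical, and the rq lock drop plus the remote
try_to_wake_up() is simulated by a plain function call rather than real
locking. It contrasts a balance-style helper that trusts a caller-supplied
prev snapshot with one that re-reads the donor from the runqueue after the
lock drop, which is the shape this patch moves to.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; these are not the kernel's types. */
struct toy_task { int id; };

struct toy_rq {
	struct toy_task *donor;	/* task whose scheduling context is in use */
	struct toy_task *idle;	/* per-runqueue idle task */
};

/*
 * Models a balance step that briefly drops the runqueue lock. While the
 * lock is dropped, a simulated remote wakeup switches the donor to idle.
 */
static void toy_drop_and_retake_lock(struct toy_rq *rq)
{
	rq->donor = rq->idle;
}

/* Old shape: the caller's snapshot of the donor survives the lock drop. */
static struct toy_task *toy_balance_with_prev(struct toy_rq *rq,
					      struct toy_task *prev)
{
	toy_drop_and_retake_lock(rq);
	return prev;		/* may no longer match rq->donor */
}

/* New shape: no prev argument; the donor is read after the lock drop. */
static struct toy_task *toy_balance_reread(struct toy_rq *rq)
{
	toy_drop_and_retake_lock(rq);
	return rq->donor;	/* reflects changes made during the drop */
}

/* Returns 1 when the snapshot and the re-read value disagree. */
static int toy_demo(void)
{
	struct toy_task idle = { 0 }, task = { 1 };
	struct toy_rq rq = { .donor = &task, .idle = &idle };

	struct toy_task *stale = toy_balance_with_prev(&rq, rq.donor);
	assert(stale == &task && rq.donor == &idle);

	rq.donor = &task;	/* reset the toy runqueue for the second run */
	struct toy_task *fresh = toy_balance_reread(&rq);
	assert(fresh == rq.donor);

	return stale != fresh;
}
```

In this sketch toy_demo() returns 1: the snapshot still points at the
original task while the runqueue's donor has become the idle task. The
real balance_dl() and balance_rt() in the diff above take a fresh local
copy of rq->donor on entry for the same reason.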