Date: Mon, 24 Nov 2025 22:30:59 +0000
In-Reply-To: <20251124223111.3616950-1-jstultz@google.com>
References: <20251124223111.3616950-1-jstultz@google.com>
Message-ID: <20251124223111.3616950-8-jstultz@google.com>
Subject: [PATCH v24 07/11] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references
From: John Stultz
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
    Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon,
    Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
    K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal,
    kuyo chang, hupu, kernel-team@android.com

Historically, the prev value in __schedule() was rq->curr. This prev
value is passed down through numerous functions and used in the
scheduler class implementations. Because prev stayed on_cpu until the
end of __schedule(), it was stable across the rq lock drops that the
class->pick_next_task() and ->balance() implementations often do.

However, with proxy-exec, the prev passed to functions called by
__schedule() is rq->donor, which may not be the same as rq->curr and
may not be on_cpu. This makes the prev value potentially unstable
across rq lock drops.

A recently found issue with proxy-exec is that when we begin doing
return migration from try_to_wake_up(), it is possible we are waking
up the rq->donor. When we do this, we call proxy_resched_idle() to
put_prev_set_next(), setting rq->donor to rq->idle, which allows the
old donor task to be return migrated and run.

This, however, runs into trouble, as on another CPU we might be in the
middle of calling __schedule(). Conceptually the rq lock is held for
the majority of that time, but the class->pick_next_task() handler or
the ->balance() call invoked from pick_next_task() may briefly drop
the rq lock. This opens a window for try_to_wake_up() to wake and
return migrate the rq->donor before the class logic reacquires the rq
lock.

Unfortunately pick_next_task() and prev_balance() take a prev
argument, to which we pass rq->donor, and that prev value can now
become stale and incorrect across an rq lock drop.

To correct this, rework the pick_next_task() and prev_balance() calls
so that they do not take a "prev" argument. Also rework the class
->pick_next_task() and ->balance() implementations to drop the prev
argument; in the cases where it was used, have the class functions
reference rq->donor directly and avoid saving the value across rq lock
drops, so that we don't end up with stale references.
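To make the window concrete, here is a rough sketch of the hazard and of
the re-read pattern the classes are moved to. This is illustrative only
and not code from the patch: balance_example(), need_pull(),
drop_rq_lock() and reacquire_rq_lock() are placeholder names standing in
for the real per-class balance paths and their double_lock_balance()-style
lock juggling.

static int balance_example(struct rq *rq, struct rq_flags *rf)
{
	/* Read the donor while the rq lock from __schedule() is held */
	struct task_struct *p = rq->donor;

	if (need_pull(rq, p)) {
		drop_rq_lock(rq, rf);
		/*
		 * Window: try_to_wake_up() on another CPU can
		 * proxy_resched_idle() this rq and return migrate the
		 * old donor here, leaving 'p' stale.
		 */
		reacquire_rq_lock(rq, rf);
		/*
		 * Anything past the lock drop must re-read rq->donor
		 * instead of reusing 'p'.
		 */
		p = rq->donor;
	}
	return 0;
}

Hence the reworked implementations read rq->donor at the top of each call
and never carry the value across an rq lock drop.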
Signed-off-by: John Stultz
---
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 kernel/sched/core.c      | 37 ++++++++++++++++++-------------------
 kernel/sched/deadline.c  |  8 +++++++-
 kernel/sched/ext.c       |  8 ++++++--
 kernel/sched/fair.c      | 15 ++++++++++-----
 kernel/sched/idle.c      |  2 +-
 kernel/sched/rt.c        |  8 +++++++-
 kernel/sched/sched.h     |  8 ++++----
 kernel/sched/stop_task.c |  2 +-
 8 files changed, 54 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4c5493b0ad210..fcf64c4db437e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5955,10 +5955,9 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
 	schedstat_inc(this_rq()->sched_count);
 }

-static void prev_balance(struct rq *rq, struct task_struct *prev,
-			 struct rq_flags *rf)
+static void prev_balance(struct rq *rq, struct rq_flags *rf)
 {
-	const struct sched_class *start_class = prev->sched_class;
+	const struct sched_class *start_class = rq->donor->sched_class;
 	const struct sched_class *class;

 #ifdef CONFIG_SCHED_CLASS_EXT
@@ -5983,7 +5982,7 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
 	 * a runnable task of @class priority or higher.
 	 */
 	for_active_class_range(class, start_class, &idle_sched_class) {
-		if (class->balance && class->balance(rq, prev, rf))
+		if (class->balance && class->balance(rq, rf))
 			break;
 	}
 }
@@ -5992,7 +5991,7 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct rq_flags *rf)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -6008,34 +6007,34 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * higher scheduling class, because otherwise those lose the
 	 * opportunity to pull in more work from other CPUs.
 	 */
-	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
+	if (likely(!sched_class_above(rq->donor->sched_class, &fair_sched_class) &&
 		   rq->nr_running == rq->cfs.h_nr_queued)) {

-		p = pick_next_task_fair(rq, prev, rf);
+		p = pick_next_task_fair(rq, rf);
 		if (unlikely(p == RETRY_TASK))
 			goto restart;

 		/* Assume the next prioritized class is idle_sched_class */
 		if (!p) {
 			p = pick_task_idle(rq);
-			put_prev_set_next_task(rq, prev, p);
+			put_prev_set_next_task(rq, rq->donor, p);
 		}

 		return p;
 	}

 restart:
-	prev_balance(rq, prev, rf);
+	prev_balance(rq, rf);

 	for_each_active_class(class) {
 		if (class->pick_next_task) {
-			p = class->pick_next_task(rq, prev);
+			p = class->pick_next_task(rq);
 			if (p)
 				return p;
 		} else {
 			p = class->pick_task(rq);
 			if (p) {
-				put_prev_set_next_task(rq, prev, p);
+				put_prev_set_next_task(rq, rq->donor, p);
 				return p;
 			}
 		}
@@ -6084,7 +6083,7 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
 static void queue_core_balance(struct rq *rq);

 static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
 {
 	struct task_struct *next, *p, *max = NULL;
 	const struct cpumask *smt_mask;
@@ -6096,7 +6095,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	bool need_sync;

 	if (!sched_core_enabled(rq))
-		return __pick_next_task(rq, prev, rf);
+		return __pick_next_task(rq, rf);

 	cpu = cpu_of(rq);

@@ -6109,7 +6108,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		 */
 		rq->core_pick = NULL;
 		rq->core_dl_server = NULL;
-		return __pick_next_task(rq, prev, rf);
+		return __pick_next_task(rq, rf);
 	}

 	/*
@@ -6133,7 +6132,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		goto out_set_next;
 	}

-	prev_balance(rq, prev, rf);
+	prev_balance(rq, rf);

 	smt_mask = cpu_smt_mask(cpu);
 	need_sync = !!rq->core->core_cookie;
@@ -6306,7 +6305,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	}

 out_set_next:
-	put_prev_set_next_task(rq, prev, next);
+	put_prev_set_next_task(rq, rq->donor, next);
 	if (rq->core->core_forceidle_count && next == rq->idle)
 		queue_core_balance(rq);

@@ -6528,9 +6527,9 @@ static inline void sched_core_cpu_deactivate(unsigned int cpu) {}
 static inline void sched_core_cpu_dying(unsigned int cpu) {}

 static struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task(struct rq *rq, struct rq_flags *rf)
 {
-	return __pick_next_task(rq, prev, rf);
+	return __pick_next_task(rq, rf);
 }

 #endif /* !CONFIG_SCHED_CORE */
@@ -7097,7 +7096,7 @@ static void __sched notrace __schedule(int sched_mode)

 pick_again:
 	assert_balance_callbacks_empty(rq);
-	next = pick_next_task(rq, rq->donor, &rf);
+	next = pick_next_task(rq, &rf);
 	rq_set_donor(rq, next);
 	if (unlikely(task_is_blocked(next))) {
 		next = find_proxy_task(rq, next, &rf);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c4402542ef44f..d86fc3dd0d806 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2268,8 +2268,14 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	resched_curr(rq);
 }

-static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_dl(struct rq *rq, struct rq_flags *rf)
 {
+	/*
+	 * Note, rq->donor may change during rq lock drops,
+	 * so don't re-use prev across lock drops
+	 */
+	struct task_struct *p = rq->donor;
+
 	if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7e0fcfdc06a2d..5c6cb0a3be738 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2153,9 +2153,13 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	return true;
 }

-static int balance_scx(struct rq *rq, struct task_struct *prev,
-		       struct rq_flags *rf)
+static int balance_scx(struct rq *rq, struct rq_flags *rf)
 {
+	/*
+	 * Note, rq->donor may change during rq lock drops,
+	 * so don't re-use prev across lock drops
+	 */
+	struct task_struct *prev = rq->donor;
 	int ret;

 	rq_unpin_lock(rq, rf);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 328ea325a1d1c..7d2e92a55b164 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8713,7 +8713,7 @@ static void set_cpus_allowed_fair(struct task_struct *p, struct affinity_context
 }

 static int
-balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_fair(struct rq *rq, struct rq_flags *rf)
 {
 	if (sched_fair_runnable(rq))
 		return 1;
@@ -8866,13 +8866,18 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
 static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);

 struct task_struct *
-pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_next_task_fair(struct rq *rq, struct rq_flags *rf)
 {
 	struct sched_entity *se;
-	struct task_struct *p;
+	struct task_struct *p, *prev;
 	int new_tasks;

 again:
+	/*
+	 * Re-read rq->donor at the top as it may have
+	 * changed across a rq lock drop
+	 */
+	prev = rq->donor;
 	p = pick_task_fair(rq);
 	if (!p)
 		goto idle;
@@ -8952,9 +8957,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	return NULL;
 }

-static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_struct *prev)
+static struct task_struct *__pick_next_task_fair(struct rq *rq)
 {
-	return pick_next_task_fair(rq, prev, NULL);
+	return pick_next_task_fair(rq, NULL);
 }

 static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c39b089d4f09b..a7c718c1733ba 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -439,7 +439,7 @@ select_task_rq_idle(struct task_struct *p, int cpu, int flags)
 }

 static int
-balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_idle(struct rq *rq, struct rq_flags *rf)
 {
 	return WARN_ON_ONCE(1);
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index fb07dcfc60a24..17cfac1da38b6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1591,8 +1591,14 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 	resched_curr(rq);
 }

-static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_rt(struct rq *rq, struct rq_flags *rf)
 {
+	/*
+	 * Note, rq->donor may change during rq lock drops,
+	 * so don't re-use p across lock drops
+	 */
+	struct task_struct *p = rq->donor;
+
 	if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a0de4f00edd61..424c40bd46e2f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2415,18 +2415,18 @@ struct sched_class {

 	void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);

-	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+	int (*balance)(struct rq *rq, struct rq_flags *rf);
 	struct task_struct *(*pick_task)(struct rq *rq);
 	/*
 	 * Optional! When implemented pick_next_task() should be equivalent to:
 	 *
 	 *   next = pick_task();
 	 *   if (next) {
-	 *       put_prev_task(prev);
+	 *       put_prev_task(rq->donor);
 	 *       set_next_task_first(next);
 	 *   }
 	 */
-	struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev);
+	struct task_struct *(*pick_next_task)(struct rq *rq);

 	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct task_struct *next);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
@@ -2586,7 +2586,7 @@ static inline bool sched_fair_runnable(struct rq *rq)
 	return rq->cfs.nr_queued > 0;
 }

-extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+extern struct task_struct *pick_next_task_fair(struct rq *rq, struct rq_flags *rf);
 extern struct task_struct *pick_task_idle(struct rq *rq);

 #define SCA_CHECK		0x01
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 2d4e279f05ee9..73aeb0743aa2e 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -16,7 +16,7 @@ select_task_rq_stop(struct task_struct *p, int cpu, int flags)
 }

 static int
-balance_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+balance_stop(struct rq *rq, struct rq_flags *rf)
 {
 	return sched_stop_runnable(rq);
 }
-- 
2.52.0.487.g5c8c507ade-goog