From nobody Tue Dec 2 00:25:42 2025
Date: Mon, 24 Nov 2025 22:30:53 +0000
In-Reply-To: <20251124223111.3616950-1-jstultz@google.com>
References: <20251124223111.3616950-1-jstultz@google.com>
Message-ID: <20251124223111.3616950-2-jstultz@google.com>
Subject: [PATCH v24 01/11] locking: Add task::blocked_lock to serialize blocked_on state
From: John Stultz
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com

So far, we have been able to utilize the mutex::wait_lock for
serializing the blocked_on state, but when we move to proxying
across runqueues, we will need to add more state and a way to
serialize changes to this state in contexts where we don't hold
the mutex::wait_lock.

So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.

Signed-off-by: John Stultz
Reviewed-by: K Prateek Nayak
---
v15:
* Split back out into later in the series
v16:
* Fixups to mark tasks unblocked before sleeping in
  mutex_optimistic_spin()
* Rework to use guard() as suggested by Peter
v19:
* Rework logic for PREEMPT_RT issues reported by K Prateek Nayak
v21:
* After recently thinking more on the ww_mutex code, I reworked the
  blocked_lock usage in mutex lock to avoid having to take nested
  locks in the ww_mutex paths, as I was concerned the lock ordering
  constraints weren't as strong as I had previously thought.
v22:
* Added some extra spaces to avoid dense code blocks, as suggested
  by K Prateek
v23:
* Move get_task_blocked_on() to kernel/locking/mutex.h as requested
  by PeterZ
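[Not part of the patch: a minimal sketch of the lock nesting this change
establishes, using a hypothetical example_mark_blocked() helper. The
ordering shown (mutex::wait_lock taken first, task::blocked_lock nested
inside it) matches the locking order described above.]

/* Illustrative only: lock order is mutex::wait_lock -> task::blocked_lock */
static void example_mark_blocked(struct task_struct *p, struct mutex *m)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&m->wait_lock, flags);	/* outer lock */
	raw_spin_lock(&p->blocked_lock);		/* nests inside wait_lock */

	__set_task_blocked_on(p, m);			/* serialized by blocked_lock */

	raw_spin_unlock(&p->blocked_lock);
	raw_spin_unlock_irqrestore(&m->wait_lock, flags);
}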
Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com
---
 include/linux/sched.h        | 48 +++++++++++++-----------------------
 init/init_task.c             |  1 +
 kernel/fork.c                |  1 +
 kernel/locking/mutex-debug.c |  4 +--
 kernel/locking/mutex.c       | 40 +++++++++++++++++++-----------
 kernel/locking/mutex.h       |  6 +++++
 kernel/locking/ww_mutex.h    |  4 +--
 kernel/sched/core.c          |  4 ++-
 8 files changed, 58 insertions(+), 50 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b469878de25c8..16a2951f78b1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1241,6 +1241,7 @@ struct task_struct {
 #endif
 
 	struct mutex		*blocked_on;	/* lock we're blocked on */
+	raw_spinlock_t		blocked_lock;
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
 	/*
@@ -2149,57 +2150,42 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 #ifndef CONFIG_PREEMPT_RT
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
-	struct mutex *m = p->blocked_on;
-
-	if (m)
-		lockdep_assert_held_once(&m->wait_lock);
-	return m;
+	lockdep_assert_held_once(&p->blocked_lock);
+	return p->blocked_on;
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
 	WARN_ON_ONCE(!m);
 	/* The task should only be setting itself as blocked */
 	WARN_ON_ONCE(p != current);
-	/* Currently we serialize blocked_on under the mutex::wait_lock */
-	lockdep_assert_held_once(&m->wait_lock);
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
 	/*
 	 * Check ensure we don't overwrite existing mutex value
 	 * with a different mutex. Note, setting it to the same
 	 * lock repeatedly is ok.
 	 */
-	WARN_ON_ONCE(blocked_on && blocked_on != m);
-	WRITE_ONCE(p->blocked_on, m);
-}
-
-static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
-{
-	guard(raw_spinlock_irqsave)(&m->wait_lock);
-	__set_task_blocked_on(p, m);
+	WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+	p->blocked_on = m;
 }
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	if (m) {
-		struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
-		/* Currently we serialize blocked_on under the mutex::wait_lock */
-		lockdep_assert_held_once(&m->wait_lock);
-		/*
-		 * There may be cases where we re-clear already cleared
-		 * blocked_on relationships, but make sure we are not
-		 * clearing the relationship with a different lock.
-		 */
-		WARN_ON_ONCE(blocked_on && blocked_on != m);
-	}
-	WRITE_ONCE(p->blocked_on, NULL);
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
+	/*
+	 * There may be cases where we re-clear already cleared
+	 * blocked_on relationships, but make sure we are not
+	 * clearing the relationship with a different lock.
+	 */
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	p->blocked_on = NULL;
 }
 
 static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	guard(raw_spinlock_irqsave)(&m->wait_lock);
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
 	__clear_task_blocked_on(p, m);
 }
 #else
diff --git a/init/init_task.c b/init/init_task.c
index a55e2189206fa..60477d74546e0 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -143,6 +143,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
 	.timer_slack_ns	= 50000, /* 50 usec default slack */
 	.thread_pid	= &init_struct_pid,
 	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a95..0697084be202f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2038,6 +2038,7 @@ __latent_entropy struct task_struct *copy_process(
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
+	raw_spin_lock_init(&p->blocked_lock);
 
 	lockdep_assert_irqs_enabled();
 #ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 949103fd8e9b5..1d8cff71f65e1 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 	lockdep_assert_held(&lock->wait_lock);
 
 	/* Current thread can't be already blocked (since it's executing!) */
-	DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task));
+	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			       struct task_struct *task)
 {
-	struct mutex *blocked_on = __get_task_blocked_on(task);
+	struct mutex *blocked_on = get_task_blocked_on(task);
 
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index de7d6702cd96c..c44fc63d4476e 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -640,6 +640,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass,
 		goto err_early_kill;
 	}
 
+	raw_spin_lock(&current->blocked_lock);
 	__set_task_blocked_on(current, lock);
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
@@ -653,8 +654,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass,
 		 * the handoff.
 		 */
 		if (__mutex_trylock(lock))
-			goto acquired;
+			break;
 
+		raw_spin_unlock(&current->blocked_lock);
 		/*
 		 * Check for signals and kill conditions while holding
 		 * wait_lock. This ensures the lock cancellation is ordered
@@ -677,12 +679,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass,
 
 		first = __mutex_waiter_is_first(lock, &waiter);
 
+		raw_spin_lock_irqsave(&lock->wait_lock, flags);
+		raw_spin_lock(&current->blocked_lock);
 		/*
 		 * As we likely have been woken up by task
 		 * that has cleared our blocked_on state, re-set
 		 * it to the lock we are trying to acquire.
 		 */
-		set_task_blocked_on(current, lock);
+		__set_task_blocked_on(current, lock);
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -693,25 +697,33 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass,
 			break;
 
 		if (first) {
-			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+			bool opt_acquired;
+
 			/*
 			 * mutex_optimistic_spin() can call schedule(), so
-			 * clear blocked on so we don't become unselectable
+			 * we need to release these locks before calling it,
+			 * and clear blocked on so we don't become unselectable
 			 * to run.
 			 */
-			clear_task_blocked_on(current, lock);
-			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+			__clear_task_blocked_on(current, lock);
+			raw_spin_unlock(&current->blocked_lock);
+			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+
+			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+			opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+
+			raw_spin_lock_irqsave(&lock->wait_lock, flags);
+			raw_spin_lock(&current->blocked_lock);
+			__set_task_blocked_on(current, lock);
+
+			if (opt_acquired)
 				break;
-			set_task_blocked_on(current, lock);
 			trace_contention_begin(lock, LCB_F_MUTEX);
 		}
-
-		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	}
-	raw_spin_lock_irqsave(&lock->wait_lock, flags);
-acquired:
 	__clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
+	raw_spin_unlock(&current->blocked_lock);
 
 	if (ww_ctx) {
 		/*
@@ -740,11 +752,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass,
 	return 0;
 
 err:
-	__clear_task_blocked_on(current, lock);
+	clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
-	WARN_ON(__get_task_blocked_on(current));
+	WARN_ON(get_task_blocked_on(current));
 	trace_contention_end(lock, ret);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 	debug_mutex_free_waiter(&waiter);
@@ -955,7 +967,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		__clear_task_blocked_on(next, lock);
+		clear_task_blocked_on(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 2e8080a9bee37..5cfd663e2c011 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -47,6 +47,12 @@ static inline struct task_struct *__mutex_owner(struct mutex *lock)
 	return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
 }
 
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	return __get_task_blocked_on(p);
+}
+
 #ifdef CONFIG_DEBUG_MUTEXES
 extern void debug_mutex_lock_common(struct mutex *lock,
 				    struct mutex_waiter *waiter);
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 31a785afee6c0..e4a81790ea7dd 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -289,7 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 	 * blocked_on pointer. Otherwise we can see circular
 	 * blocked_on relationships that can't resolve.
 	 */
-	__clear_task_blocked_on(waiter->task, lock);
+	clear_task_blocked_on(waiter->task, lock);
 	wake_q_add(wake_q, waiter->task);
 }
 
@@ -347,7 +347,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 	 * are waking the mutex owner, who may be currently
 	 * blocked on a different mutex.
 	 */
-	__clear_task_blocked_on(owner, NULL);
+	clear_task_blocked_on(owner, NULL);
 	wake_q_add(wake_q, owner);
 	}
 	return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6960c1bfc741a..b3f9bc20b7e1f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6607,6 +6607,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
  *  p->pi_lock
  *    rq->lock
  *      mutex->wait_lock
+ *        p->blocked_lock
  *
  * Returns the task that is going to be used as execution context (the one
  * that is actually going to be run on cpu_of(rq)).
@@ -6630,8 +6631,9 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	 * and ensure @owner sticks around.
 	 */
 	guard(raw_spinlock)(&mutex->wait_lock);
+	guard(raw_spinlock)(&p->blocked_lock);
 
-	/* Check again that p is blocked with wait_lock held */
+	/* Check again that p is blocked with blocked_lock held */
 	if (mutex != __get_task_blocked_on(p)) {
 		/*
 		 * Something changed in the blocked_on chain and
-- 
2.52.0.487.g5c8c507ade-goog

From nobody Tue Dec 2 00:25:42 2025
Date: Mon, 24 Nov 2025 22:30:54 +0000
In-Reply-To: <20251124223111.3616950-1-jstultz@google.com>
References: <20251124223111.3616950-1-jstultz@google.com>
Message-ID: <20251124223111.3616950-3-jstultz@google.com>
Subject: [PATCH v24 02/11] sched: Fix modifying donor->blocked_on without proper locking
From: John Stultz
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com

Introduce an action enum in find_proxy_task() which allows us to
handle work that needs to be done outside the mutex.wait_lock and
task.blocked_lock guard scopes.

This ensures proper locking when we clear the donor's blocked_on
pointer in proxy_deactivate(), and the switch statement will be
useful as we add more cases to handle later in this series.

Reviewed-by: K Prateek Nayak
Signed-off-by: John Stultz
---
v23:
* Split out from earlier patch.
v24:
* Minor re-ordering of local variables to keep with style, as
  suggested by K Prateek
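[Not part of the patch: a tiny illustration of the action-enum pattern,
using hypothetical example_*() names. Work that cannot be done while a
guard()-scoped lock is held is recorded in the enum and carried out after
the scope ends, which is what the switch at the end of find_proxy_task()
does.]

/* Illustrative pattern only */
static struct task_struct *example_pick(struct rq *rq, struct task_struct *donor, struct mutex *mutex)
{
	enum { FOUND, DEACTIVATE_DONOR } action = FOUND;

	scoped_guard(raw_spinlock, &mutex->wait_lock) {
		if (!example_owner_is_usable(mutex))
			action = DEACTIVATE_DONOR;	/* defer: wait_lock is still held here */
	}

	/* guard scope has ended; the deferred work is now safe to do */
	switch (action) {
	case DEACTIVATE_DONOR:
		return example_deactivate(rq, donor);
	case FOUND:
		break;
	}
	return donor;
}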
Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com
---
 kernel/sched/core.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b3f9bc20b7e1f..1b6fd173daadd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6591,7 +6591,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
 		 * as unblocked, as we aren't doing proxy-migrations
 		 * yet (more logic will be needed then).
 		 */
-		donor->blocked_on = NULL;
+		clear_task_blocked_on(donor, NULL);
 	}
 	return NULL;
 }
@@ -6615,6 +6615,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
+	enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
 	struct task_struct *owner = NULL;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
@@ -6652,12 +6653,14 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 
 	if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
 		/* XXX Don't handle blocked owners/delayed dequeue yet */
-		return proxy_deactivate(rq, donor);
+		action = DEACTIVATE_DONOR;
+		break;
 	}
 
 	if (task_cpu(owner) != this_cpu) {
 		/* XXX Don't handle migrations yet */
-		return proxy_deactivate(rq, donor);
+		action = DEACTIVATE_DONOR;
+		break;
 	}
 
 	if (task_on_rq_migrating(owner)) {
@@ -6715,6 +6718,13 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	 */
 	}
 
+	/* Handle actions we need to do outside of the guard() scope */
+	switch (action) {
+	case DEACTIVATE_DONOR:
+		return proxy_deactivate(rq, donor);
+	case FOUND:
+		/* fallthrough */;
+	}
 	WARN_ON_ONCE(owner && !owner->on_rq);
 	return owner;
 }
-- 
2.52.0.487.g5c8c507ade-goog

From nobody Tue Dec 2 00:25:42 2025
Date: Mon, 24 Nov 2025 22:30:55 +0000
In-Reply-To: <20251124223111.3616950-1-jstultz@google.com>
References: <20251124223111.3616950-1-jstultz@google.com>
Message-ID: <20251124223111.3616950-4-jstultz@google.com>
Subject: [PATCH v24 03/11] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
From: John Stultz
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com

As we add functionality to proxy execution, we may migrate a donor
task to a runqueue where it can't run due to cpu affinity. Thus, we
must be careful to ensure we return-migrate the task back to a cpu
in its cpumask when it becomes unblocked.

Peter helpfully provided the following example with pictures:

"Suppose we have a ww_mutex cycle:

           ,-+-* Mutex-1 <-.
    Task-A ---'            |
       |               ,-- Task-B
       `-> Mutex-2 *-+-'

Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and where
Task-B holds Mutex-2 and tries to acquire Mutex-1. Then the
blocked_on->owner chain will go in circles.

    Task-A -> Mutex-2
       ^          |
       |          v
    Mutex-1 <- Task-B

We need two things:

 - find_proxy_task() to stop iterating the circle;
 - the woken task to 'unblock' and run, such that it can back-off
   and re-try the transaction.

Now, the current code [without this patch] does:

    __clear_task_blocked_on();
    wake_q_add();

And surely clearing ->blocked_on is sufficient to break the cycle.
Suppose it is Task-B that is made to back-off, then we have:

    Task-A -> Mutex-2 -> Task-B (no further blocked_on)

and it would attempt to run Task-B. Or worse, it could directly pick
Task-B and run it, without ever getting into find_proxy_task().

Now, here is a problem because Task-B might not be runnable on the
CPU it is currently on; and because !task_is_blocked() we don't get
into the proxy paths, so nobody is going to fix this up.

Ideally we would have dequeued Task-B alongside of clearing
->blocked_on, but alas, [the lock ordering prevents us from getting
the task_rq_lock() and] spoils things."

Thus we need more than just a binary concept of the task being
blocked on a mutex or not. So allow setting blocked_on to
PROXY_WAKING as a special value which specifies the task is no
longer blocked, but needs to be evaluated for return migration
*before* it can be run.

This will then be used in a later patch to handle proxy
return-migration.

Reviewed-by: K Prateek Nayak
Signed-off-by: John Stultz
---
v15:
* Split blocked_on_state into its own patch later in the series, as
  the tri-state isn't necessary until we deal with proxy/return
  migrations
v16:
* Handle case where task in the chain is being set as BO_WAKING by
  another cpu (usually via ww_mutex die code). Make sure we release
  the rq lock so the wakeup can complete.
* Rework to use guard() in find_proxy_task() as suggested by Peter
v18:
* Add initialization of blocked_on_state for init_task
v19:
* PREEMPT_RT build fixups and rework suggested by K Prateek Nayak
v20:
* Simplify one of the blocked_on_state changes to avoid extra
  PREEMPT_RT conditionals
v21:
* Slight reworks due to avoiding nested blocked_lock locking
* Be consistent in use of blocked_on_state helper functions
* Rework calls to proxy_deactivate() to do proper locking around
  blocked_on_state changes that we were cheating in previous
  versions.
* Minor cleanups, some comment improvements
v22:
* Re-order blocked_on_state helpers to try to make it clearer that
  set_task_blocked_on() and clear_task_blocked_on() are the main
  enter/exit states and the blocked_on_state helpers help manage
  the transition states within. Per feedback from K Prateek Nayak.
* Rework blocked_on_state to be defined within
  CONFIG_SCHED_PROXY_EXEC as suggested by K Prateek Nayak.
* Reworked empty stub functions to just take one line as suggested
  by K Prateek
* Avoid using gotos out of a guard() scope, as highlighted by
  K Prateek, and instead rework logic to break and switch() on an
  action value.
v23:
* Big rework to using PROXY_WAKING instead of blocked_on_state as
  suggested by Peter.
* Reworked commit message to include Peter's nice diagrams and
  example for why this extra state is necessary.
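[Not part of the patch: a minimal consumer-side sketch of the
PROXY_WAKING sentinel, with a hypothetical helper name, to illustrate
the three states a wakee can be in under this scheme.]

/* Illustrative only */
static bool example_needs_return_migration(struct task_struct *p)
{
	guard(raw_spinlock_irqsave)(&p->blocked_lock);

	if (!p->blocked_on)			/* fully runnable */
		return false;
	if (p->blocked_on == PROXY_WAKING)	/* woken, but maybe on a CPU it can't run on */
		return true;
	return false;				/* still blocked on a real mutex */
}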
Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com
---
 include/linux/sched.h     | 51 +++++++++++++++++++++++++++++++++++++--
 kernel/locking/mutex.c    |  2 +-
 kernel/locking/ww_mutex.h | 16 ++++++------
 kernel/sched/core.c       | 17 +++++++++++++
 4 files changed, 75 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 16a2951f78b1f..0d6c4c31e3624 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2148,10 +2148,20 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 })
 
 #ifndef CONFIG_PREEMPT_RT
+
+/*
+ * With proxy exec, if a task has been proxy-migrated, it may be a donor
+ * on a cpu that it can't actually run on. Thus we need a special state
+ * to denote that the task is being woken, but that it needs to be
+ * evaluated for return-migration before it is run. So if the task is
+ * blocked_on PROXY_WAKING, return migrate it before running it.
+ */
+#define PROXY_WAKING ((struct mutex *)(-1L))
+
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
-	return p->blocked_on;
+	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2179,7 +2189,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * blocked_on relationships, but make sure we are not
 	 * clearing the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
 	p->blocked_on = NULL;
 }
 
@@ -2188,6 +2198,35 @@ static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 	guard(raw_spinlock_irqsave)(&p->blocked_lock);
 	__clear_task_blocked_on(p, m);
 }
+
+static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
+{
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
+
+	if (!sched_proxy_exec()) {
+		__clear_task_blocked_on(p, m);
+		return;
+	}
+
+	/* Don't set PROXY_WAKING if blocked_on was already cleared */
+	if (!p->blocked_on)
+		return;
+	/*
+	 * There may be cases where we set PROXY_WAKING on tasks that were
+	 * already set to waking, but make sure we are not changing
+	 * the relationship with a different lock.
+	 */
+	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	p->blocked_on = PROXY_WAKING;
+}
+
+static inline void set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
+{
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	__set_task_blocked_on_waking(p, m);
+}
+
 #else
 static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
 {
@@ -2196,6 +2235,14 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mute
 static inline void clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
 {
 }
+
+static inline void __set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
+{
+}
+
+static inline void set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
+{
+}
 #endif /* !CONFIG_PREEMPT_RT */
 
 static __always_inline bool need_resched(void)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index c44fc63d4476e..3cb9001d15119 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -967,7 +967,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		clear_task_blocked_on(next, lock);
+		set_task_blocked_on_waking(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index e4a81790ea7dd..5cd9dfa4b31e6 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 	debug_mutex_wake_waiter(lock, waiter);
 #endif
 	/*
-	 * When waking up the task to die, be sure to clear the
-	 * blocked_on pointer. Otherwise we can see circular
-	 * blocked_on relationships that can't resolve.
+	 * When waking up the task to die, be sure to set the
+	 * blocked_on to PROXY_WAKING. Otherwise we can see
+	 * circular blocked_on relationships that can't resolve.
 	 */
-	clear_task_blocked_on(waiter->task, lock);
+	set_task_blocked_on_waking(waiter->task, lock);
 	wake_q_add(wake_q, waiter->task);
 }
 
@@ -339,15 +339,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 	 */
 	if (owner != current) {
 		/*
-		 * When waking up the task to wound, be sure to clear the
-		 * blocked_on pointer. Otherwise we can see circular
-		 * blocked_on relationships that can't resolve.
+		 * When waking up the task to wound, be sure to set the
+		 * blocked_on to PROXY_WAKING. Otherwise we can see
+		 * circular blocked_on relationships that can't resolve.
 		 *
 		 * NOTE: We pass NULL here instead of lock, because we
 		 * are waking the mutex owner, who may be currently
 		 * blocked on a different mutex.
 		 */
-		clear_task_blocked_on(owner, NULL);
+		set_task_blocked_on_waking(owner, NULL);
 		wake_q_add(wake_q, owner);
 	}
 	return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1b6fd173daadd..b8a8495b82525 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4293,6 +4293,13 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
+	/*
+	 * For now, if we've been woken up, clear the task->blocked_on
+	 * regardless if it was set to a mutex or PROXY_WAKING so the
+	 * task can run. We will need to be more careful later when
+	 * properly handling proxy migration
+	 */
+	clear_task_blocked_on(p, NULL);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
@@ -6627,6 +6634,11 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		/* Something changed in the chain, so pick again */
 		if (!mutex)
 			return NULL;
+
+		/* if its PROXY_WAKING, resched_idle so ttwu can complete */
+		if (mutex == PROXY_WAKING)
+			return proxy_resched_idle(rq);
+
 		/*
 		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
 		 * and ensure @owner sticks around.
@@ -6647,6 +6659,11 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 
 	owner = __mutex_owner(mutex);
 	if (!owner) {
+		/*
+		 * If there is no owner, clear blocked_on
+		 * and return p so it can run and try to
+		 * acquire the lock
+		 */
 		__clear_task_blocked_on(p, mutex);
 		return p;
 	}
-- 
2.52.0.487.g5c8c507ade-goog

From nobody Tue Dec 2 00:25:42 2025
Date: Mon, 24 Nov 2025 22:30:56 +0000
In-Reply-To: <20251124223111.3616950-1-jstultz@google.com>
References: <20251124223111.3616950-1-jstultz@google.com>
Message-ID: <20251124223111.3616950-5-jstultz@google.com>
Subject: [PATCH v24 04/11] sched: Add assert_balance_callbacks_empty helper
From: John Stultz
To: LKML
Cc: John Stultz, Peter Zijlstra, K Prateek Nayak, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com

With proxy-exec utilizing pick-again logic, we can end up having
balance callbacks set by the previous pick_next_task() call left on
the list. So pull the warning out into a helper function, and make
sure we check it when we pick again.

Suggested-by: Peter Zijlstra
Reviewed-by: K Prateek Nayak
Signed-off-by: John Stultz
---
v24:
* Use IS_ENABLED() as suggested by K Prateek
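[Not part of the patch: a restatement of the helper this change adds,
purely to illustrate the IS_ENABLED() idiom mentioned in the v24 note;
the example_ name is hypothetical. IS_ENABLED(CONFIG_PROVE_LOCKING)
keeps the expression compiled and type-checked in all configs while
letting the compiler drop it entirely when the option is off.]

/* Illustrative sketch of the IS_ENABLED() pattern used by the new helper */
static inline void example_assert_no_stale_callbacks(struct rq *rq)
{
	WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_LOCKING) &&
		     rq->balance_callback &&
		     rq->balance_callback != &balance_push_callback);
}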
Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com
---
 kernel/sched/core.c  | 1 +
 kernel/sched/sched.h | 9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8a8495b82525..cfe71c2764558 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6896,6 +6896,7 @@ static void __sched notrace __schedule(int sched_mode)
 	}
 
 pick_again:
+	assert_balance_callbacks_empty(rq);
 	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
 	if (unlikely(task_is_blocked(next))) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adfb6e3409d72..a0de4f00edd61 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1780,6 +1780,13 @@ static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
 static inline void scx_rq_clock_invalidate(struct rq *rq) {}
 #endif /* !CONFIG_SCHED_CLASS_EXT */
 
+static inline void assert_balance_callbacks_empty(struct rq *rq)
+{
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_LOCKING) &&
+		     rq->balance_callback &&
+		     rq->balance_callback != &balance_push_callback);
+}
+
 /*
  * Lockdep annotation that avoids accidental unlocks; it's like a
  * sticky/continuous lockdep_assert_held().
@@ -1796,7 +1803,7 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
 	rf->clock_update_flags = 0;
-	WARN_ON_ONCE(rq->balance_callback && rq->balance_callback != &balance_push_callback);
+	assert_balance_callbacks_empty(rq);
 }
 
 static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
-- 
2.52.0.487.g5c8c507ade-goog

From nobody Tue Dec 2 00:25:42 2025
Date: Mon, 24 Nov 2025 22:30:57 +0000
In-Reply-To: <20251124223111.3616950-1-jstultz@google.com>
References: <20251124223111.3616950-1-jstultz@google.com>
Message-ID: <20251124223111.3616950-6-jstultz@google.com>
Subject: [PATCH v24 05/11] sched: Add logic to zap balance callbacks if we pick again
From: John Stultz
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com

With proxy-exec, a task is selected to run via pick_next_task(), and
then if it is a mutex blocked task, we call find_proxy_task() to find
a runnable owner. If the runnable owner is on another cpu, we will
need to migrate the selected donor task away, after which we will
pick again and call pick_next_task() to choose something else.

However, in the first call to pick_next_task(), we may have had a
balance_callback set up by the class scheduler.
After we pick again, it's possible pick_next_task_fair() will be
called, which calls sched_balance_newidle() and sched_balance_rq().
This will throw a warning:

[    8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
[    8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
...
[    8.796467] Call Trace:
[    8.796467]  <TASK>
[    8.796467]  ? __warn.cold+0xb2/0x14e
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  ? report_bug+0x107/0x1a0
[    8.796467]  ? handle_bug+0x54/0x90
[    8.796467]  ? exc_invalid_op+0x17/0x70
[    8.796467]  ? asm_exc_invalid_op+0x1a/0x20
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  sched_balance_newidle+0x295/0x820
[    8.796467]  pick_next_task_fair+0x51/0x3f0
[    8.796467]  __schedule+0x23a/0x14b0
[    8.796467]  ? lock_release+0x16d/0x2e0
[    8.796467]  schedule+0x3d/0x150
[    8.796467]  worker_thread+0xb5/0x350
[    8.796467]  ? __pfx_worker_thread+0x10/0x10
[    8.796467]  kthread+0xee/0x120
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork+0x31/0x50
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork_asm+0x1a/0x30
[    8.796467]  </TASK>

This is because if an RT task was originally picked, it will set up
the rq->balance_callback with push_rt_tasks() via set_next_task_rt().
Once the task is migrated away and we pick again, we haven't
processed any balance callbacks, so rq->balance_callback is not in
the same state as it was the first time pick_next_task() was called.

To handle this, add a zap_balance_callbacks() helper function which
cleans up the balance callbacks without running them. This should be
ok, as we are effectively undoing the state set in the first call to
pick_next_task(), and when we pick again, the new callback can be
configured for the donor task actually selected.

Reviewed-by: K Prateek Nayak
Signed-off-by: John Stultz
---
v20:
* Tweaked to avoid build issues with different configs
v22:
* Spelling fix suggested by K Prateek
* Collapsed the stub implementation to one line as suggested by
  K Prateek
* Zap callbacks when we resched idle, as suggested by K Prateek
v24:
* Don't conditionalize function on CONFIG_SCHED_PROXY_EXEC as the
  callers will be optimized out if that is unset, and the dead
  function will be removed, as suggested by K Prateek
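[Not part of the patch: a minimal sketch of the zap pattern described
above, with a hypothetical example_zap() name. It walks the
singly-linked callback list, unlinks each entry so it can be re-queued
later, and preserves only the special balance_push_callback sentinel if
it was queued.]

/* Illustrative only: discard queued balance callbacks without running them */
static void example_zap(struct balance_callback **listp)
{
	struct balance_callback *head = *listp, *next;
	bool keep_push = false;

	while (head) {
		if (head == &balance_push_callback)
			keep_push = true;
		next = head->next;
		head->next = NULL;	/* unlink so it can be re-added later */
		head = next;
	}
	*listp = keep_push ? &balance_push_callback : NULL;
}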
Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com
---
 kernel/sched/core.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cfe71c2764558..8a3f9f63916dd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4970,6 +4970,34 @@ static inline void finish_task(struct task_struct *prev)
 	smp_store_release(&prev->on_cpu, 0);
 }
 
+/*
+ * Only called from __schedule context
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added on the next time around. This is similar
+ * handling to running the callbacks, except we just don't call
+ * them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+	struct balance_callback *next, *head;
+	bool found = false;
+
+	lockdep_assert_rq_held(rq);
+
+	head = rq->balance_callback;
+	while (head) {
+		if (head == &balance_push_callback)
+			found = true;
+		next = head->next;
+		head->next = NULL;
+		head = next;
+	}
+	rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+
 static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
 {
 	void (*func)(struct rq *rq);
@@ -6901,10 +6929,15 @@ static void __sched notrace __schedule(int sched_mode)
 	rq_set_donor(rq, next);
 	if (unlikely(task_is_blocked(next))) {
 		next = find_proxy_task(rq, next, &rf);
-		if (!next)
+		if (!next) {
+			/* zap the balance_callbacks before picking again */
+			zap_balance_callbacks(rq);
 			goto pick_again;
-		if (next == rq->idle)
+		}
+		if (next == rq->idle) {
+			zap_balance_callbacks(rq);
 			goto keep_resched;
+		}
 	}
 picked:
 	clear_tsk_need_resched(prev);
-- 
2.52.0.487.g5c8c507ade-goog

From nobody Tue Dec 2 00:25:42 2025
Wgf4QGDccITlNbUt5ZBp1YPooNjYdTBJf+VxnRZ+yblDw2l5lEYgUFGGYj+xj4fXOCZp oa9A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764023483; x=1764628283; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=9hQvx+sEMIn9F+7kYJf9djkeYOBZrvWc9dVVTWZFAow=; b=DWD1CYT49ZxRMcSfQLSGNdcFqwLixj/CHhxzngV9YPPjR0mkmXMMHJP3U0dgVzodic MoTPTAbepMtRQEC21NXxf/wV42qBtiwcbWYZzwsrxKAWnZc3iusBUZ35ge8UrY2ZdMtS 3s6NCpiWwhUb7ZGUpb4OjpJ6Tolxf6id2xzbuY/osgd4mapJZ2kTzx6jwwGdVGw8BXfP mFFdHd9EguNmgaGcq6uasXj0hPh4k4tc/05Db6n15VgeHau+fDr2MTzZyjENEVgPhx8w GH9vxO3yluQEor9ljr2VavDGllgckclftXR7iU2WATNncfaM4XoDiIPQKjMBCPHPvemA mDmA== X-Gm-Message-State: AOJu0YwESW5wvRVM3Nu5aQbkrYxLmN+TmDMl4os9GX8YYdr3IaAbCldv jBa2eTUPF32xJ+wHRHTkWVooE3Z+OrkIw5CtQizPUsOohWvRPAWwg6jhnw4cSwbDfSD/5IpRc1H KF04ADewmVnhW9RZpl2WQbagGreRZOj1thXKqYaCpNKULqpxmHFEkrwQ9RGM9hX7g+8ajpjfzj0 HvYfU8ErQTP+lNlqMqoBLRiTobaqM5lJ7o6moLU7WPpuNOuWID X-Google-Smtp-Source: AGHT+IFM3rvDXC29Vtgj0+Z0Et2LWW/lxcrPzZCQs8FksWDuuMjShpcRab0EQH67pQkfWDeKEePzKvVanC6L X-Received: from plso3.prod.google.com ([2002:a17:902:bcc3:b0:297:e585:34c1]) (user=jstultz job=prod-delivery.src-stubby-dispatcher) by 2002:a17:903:3b87:b0:295:565b:c691 with SMTP id d9443c01a7336-29baaf75d56mr5273515ad.17.1764023482903; Mon, 24 Nov 2025 14:31:22 -0800 (PST) Date: Mon, 24 Nov 2025 22:30:58 +0000 In-Reply-To: <20251124223111.3616950-1-jstultz@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20251124223111.3616950-1-jstultz@google.com> X-Mailer: git-send-email 2.52.0.487.g5c8c507ade-goog Message-ID: <20251124223111.3616950-7-jstultz@google.com> Subject: [PATCH v24 06/11] sched: Handle blocked-waiter migration (and return migration) From: John Stultz To: LKML Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add logic to handle migrating a blocked waiter to a remote cpu where the lock owner is runnable. Additionally, as the blocked task may not be able to run on the remote cpu, add logic to handle return migration once the waiting task is given the mutex. Because tasks may get migrated to where they cannot run, also modify the scheduling classes to avoid sched class migrations on mutex blocked tasks, leaving find_proxy_task() and related logic to do the migrations and return migrations. This was split out from the larger proxy patch, and significantly reworked. 
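In rough pseudocode, the per-waiter decision this adds to find_proxy_task() looks like the sketch below. This is only an illustrative simplification of the diff further down: locking, refcounting, rq-clock handling, and the "task is current on this rq" special cases are all elided, and the helper names simply mirror the ones introduced by this patch. The idea is that scheduling contexts chase execution contexts: affinity is enforced for the owner actually executing, while the blocked waiter's scheduling context can be parked on any runqueue.

    /* sketch only: p is the blocked task we are walking, rq is this CPU's rq */
    if (p->blocked_on == PROXY_WAKING || !__mutex_owner(mutex)) {
            /*
             * The waiter now needs to run itself, but it may have been
             * proxy-migrated outside its affinity mask while boosting,
             * so send it back toward p->wake_cpu.
             */
            proxy_force_return(rq, rf, p);
    } else if (task_cpu(owner) != cpu_of(rq)) {
            /*
             * The owner is runnable on another CPU: move the blocked
             * donor's scheduling context over to the owner's CPU.
             */
            proxy_migrate_task(rq, rf, p, task_cpu(owner));
    }

In either case this CPU gives up on the current pick and goes back around the pick_again loop.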
Credits for the original patch go to: Peter Zijlstra (Intel) Juri Lelli Valentin Schneider Connor O'Brien Signed-off-by: John Stultz --- v6: * Integrated sched_proxy_exec() check in proxy_return_migration() * Minor cleanups to diff * Unpin the rq before calling __balance_callbacks() * Tweak proxy migrate to migrate deeper task in chain, to avoid tasks pingponging between rqs v7: * Fixup for unused function arguments * Switch from that_rq -> target_rq, other minor tweaks, and typo fixes suggested by Metin Kaya * Switch back to doing return migration in the ttwu path, which avoids nasty lock juggling and performance issues * Fixes for UP builds v8: * More simplifications from Metin Kaya * Fixes for null owner case, including doing return migration * Cleanup proxy_needs_return logic v9: * Narrow logic in ttwu that sets BO_RUNNABLE, to avoid missed return migrations * Switch to using zap_balance_callbacks rathern then running them when we are dropping rq locks for proxy_migration. * Drop task_is_blocked check in sched_submit_work as suggested by Metin (may re-add later if this causes trouble) * Do return migration when we're not on wake_cpu. This avoids bad task placement caused by proxy migrations raised by Xuewen Yan * Fix to call set_next_task(rq->curr) prior to dropping rq lock to avoid rq->curr getting migrated before we have actually switched from it * Cleanup to re-use proxy_resched_idle() instead of open coding it in proxy_migrate_task() * Fix return migration not to use DEQUEUE_SLEEP, so that we properly see the task as task_on_rq_migrating() after it is dequeued but before set_task_cpu() has been called on it * Fix to broaden find_proxy_task() checks to avoid race where a task is dequeued off the rq due to return migration, but set_task_cpu() and the enqueue on another rq happened after we checked task_cpu(owner). This ensures we don't proxy using a task that is not actually on our runqueue. * Cleanup to avoid the locked BO_WAKING->BO_RUNNABLE transition in try_to_wake_up() if proxy execution isn't enabled. * Cleanup to improve comment in proxy_migrate_task() explaining the set_next_task(rq->curr) logic * Cleanup deadline.c change to stylistically match rt.c change * Numerous cleanups suggested by Metin v10: * Drop WARN_ON(task_is_blocked(p)) in ttwu current case v11: * Include proxy_set_task_cpu from later in the series to this change so we can use it, rather then reworking logic later in the series. * Fix problem with return migration, where affinity was changed and wake_cpu was left outside the affinity mask. * Avoid reading the owner's cpu twice (as it might change inbetween) to avoid occasional migration-to-same-cpu edge cases * Add extra WARN_ON checks for wake_cpu and return migration edge cases. * Typo fix from Metin v13: * As we set ret, return it, not just NULL (pulling this change in from later patch) * Avoid deadlock between try_to_wake_up() and find_proxy_task() when blocked_on cycle with ww_mutex is trying a mid-chain wakeup. 
* Tweaks to use new __set_blocked_on_runnable() helper * Potential fix for incorrectly updated task->dl_server issues * Minor comment improvements * Add logic to handle missed wakeups, in that case doing return migration from the find_proxy_task() path * Minor cleanups v14: * Improve edge cases where we wouldn't set the task as BO_RUNNABLE v15: * Added comment to better describe proxy_needs_return() as suggested by Qais * Build fixes for !CONFIG_SMP reported by Maciej =C5=BBenczykowski * Adds fix for re-evaluating proxy_needs_return when sched_proxy_exec() is disabled, reported and diagnosed by: kuyo chang v16: * Larger rework of needs_return logic in find_proxy_task, in order to avoid problems with cpuhotplug * Rework to use guard() as suggested by Peter v18: * Integrate optimization suggested by Suleiman to do the checks for sleeping owners before checking if the task_cpu is this_cpu, so that we can avoid needlessly proxy-migrating tasks to only then dequeue them. Also check if migrating last. * Improve comments around guard locking * Include tweak to ttwu_runnable() as suggested by hupu * Rework the logic releasing the rq->donor reference before letting go of the rqlock. Just use rq->idle. * Go back to doing return migration on BO_WAKING owners, as I was hitting some softlockups caused by running tasks not making it out of BO_WAKING. v19: * Fixed proxy_force_return() logic for !SMP cases v21: * Reworked donor deactivation for unhandled sleeping owners * Commit message tweaks v22: * Add comments around zap_balance_callbacks in proxy_migration logic * Rework logic to avoid gotos out of guard() scopes, and instead use break and switch() on action value, as suggested by K Prateek * K Prateek suggested simplifications around putting donor and setting idle as next task in the migration paths, which I further simplified to using proxy_resched_idle() * Comment improvements * Dropped curr !=3D donor check in pick_next_task_fair() suggested by K Prateek v23: * Rework to use the PROXY_WAKING approach suggested by Peter * Drop unnecessarily setting wake_cpu when affinity changes as noticed by Peter * Split out the ttwu() logic changes into a later separate patch as suggested by Peter v24: * Numerous fixes for rq clock handling, pointed out by K Prateek * Slight tweak to where put_task() is called suggested by K Prateek Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- kernel/sched/core.c | 232 ++++++++++++++++++++++++++++++++++++++------ 1 file changed, 202 insertions(+), 30 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8a3f9f63916dd..4c5493b0ad210 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3682,6 +3682,23 @@ static inline void ttwu_do_wakeup(struct task_struct= *p) trace_sched_wakeup(p); } =20 +#ifdef CONFIG_SCHED_PROXY_EXEC +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu) +{ + unsigned int wake_cpu; + + /* + * Since we are enqueuing a blocked task on a cpu it may + * not be able to run on, preserve wake_cpu when we + * __set_task_cpu so we can return the task to where it + * was previously runnable. 
+ */ + wake_cpu =3D p->wake_cpu; + __set_task_cpu(p, cpu); + p->wake_cpu =3D wake_cpu; +} +#endif /* CONFIG_SCHED_PROXY_EXEC */ + static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, struct rq_flags *rf) @@ -4293,13 +4310,6 @@ int try_to_wake_up(struct task_struct *p, unsigned i= nt state, int wake_flags) ttwu_queue(p, cpu, wake_flags); } out: - /* - * For now, if we've been woken up, clear the task->blocked_on - * regardless if it was set to a mutex or PROXY_WAKING so the - * task can run. We will need to be more careful later when - * properly handling proxy migration - */ - clear_task_blocked_on(p, NULL); if (success) ttwu_stat(p, task_cpu(p), wake_flags); =20 @@ -6598,7 +6608,7 @@ static inline struct task_struct *proxy_resched_idle(= struct rq *rq) return rq->idle; } =20 -static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor) +static bool proxy_deactivate(struct rq *rq, struct task_struct *donor) { unsigned long state =3D READ_ONCE(donor->__state); =20 @@ -6618,17 +6628,146 @@ static bool __proxy_deactivate(struct rq *rq, stru= ct task_struct *donor) return try_to_block_task(rq, donor, &state, true); } =20 -static struct task_struct *proxy_deactivate(struct rq *rq, struct task_str= uct *donor) +/* + * If the blocked-on relationship crosses CPUs, migrate @p to the + * owner's CPU. + * + * This is because we must respect the CPU affinity of execution + * contexts (owner) but we can ignore affinity for scheduling + * contexts (@p). So we have to move scheduling contexts towards + * potential execution contexts. + * + * Note: The owner can disappear, but simply migrate to @target_cpu + * and leave that CPU to sort things out. + */ +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf, + struct task_struct *p, int target_cpu) { - if (!__proxy_deactivate(rq, donor)) { - /* - * XXX: For now, if deactivation failed, set donor - * as unblocked, as we aren't doing proxy-migrations - * yet (more logic will be needed then). - */ - clear_task_blocked_on(donor, NULL); + struct rq *target_rq =3D cpu_rq(target_cpu); + + lockdep_assert_rq_held(rq); + + /* + * Since we're going to drop @rq, we have to put(@rq->donor) first, + * otherwise we have a reference that no longer belongs to us. + * + * Additionally, as we put_prev_task(prev) earlier, its possible that + * prev will migrate away as soon as we drop the rq lock, however we + * still have it marked as rq->curr, as we've not yet switched tasks. + * + * So call proxy_resched_idle() to let go of the references before + * we release the lock. + */ + proxy_resched_idle(rq); + + WARN_ON(p =3D=3D rq->curr); + + deactivate_task(rq, p, DEQUEUE_NOCLOCK); + proxy_set_task_cpu(p, target_cpu); + + /* + * We have to zap callbacks before unlocking the rq + * as another CPU may jump in and call sched_balance_rq + * which can trip the warning in rq_pin_lock() if we + * leave callbacks set. 
+ */ + zap_balance_callbacks(rq); + rq_unpin_lock(rq, rf); + raw_spin_rq_unlock(rq); + + raw_spin_rq_lock(target_rq); + activate_task(target_rq, p, 0); + wakeup_preempt(target_rq, p, 0); + raw_spin_rq_unlock(target_rq); + + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); + update_rq_clock(rq); +} + +static void proxy_force_return(struct rq *rq, struct rq_flags *rf, + struct task_struct *p) +{ + struct rq *this_rq, *target_rq; + struct rq_flags this_rf; + int cpu, wake_flag =3D 0; + + lockdep_assert_rq_held(rq); + WARN_ON(p =3D=3D rq->curr); + + get_task_struct(p); + + /* + * We have to zap callbacks before unlocking the rq + * as another CPU may jump in and call sched_balance_rq + * which can trip the warning in rq_pin_lock() if we + * leave callbacks set. + */ + zap_balance_callbacks(rq); + rq_unpin_lock(rq, rf); + raw_spin_rq_unlock(rq); + + /* + * We drop the rq lock, and re-grab task_rq_lock to get + * the pi_lock (needed for select_task_rq) as well. + */ + this_rq =3D task_rq_lock(p, &this_rf); + + /* + * Since we let go of the rq lock, the task may have been + * woken or migrated to another rq before we got the + * task_rq_lock. So re-check we're on the same RQ. If + * not, the task has already been migrated and that CPU + * will handle any futher migrations. + */ + if (this_rq !=3D rq) + goto err_out; + + /* Similarly, if we've been dequeued, someone else will wake us */ + if (!task_on_rq_queued(p)) + goto err_out; + + /* + * Since we should only be calling here from __schedule() + * -> find_proxy_task(), no one else should have + * assigned current out from under us. But check and warn + * if we see this, then bail. + */ + if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) { + WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n", + __func__, cpu_of(this_rq), + p->comm, p->pid, p->on_cpu); + goto err_out; } - return NULL; + + update_rq_clock(this_rq); + proxy_resched_idle(this_rq); + deactivate_task(this_rq, p, DEQUEUE_NOCLOCK); + cpu =3D select_task_rq(p, p->wake_cpu, &wake_flag); + set_task_cpu(p, cpu); + target_rq =3D cpu_rq(cpu); + clear_task_blocked_on(p, NULL); + task_rq_unlock(this_rq, p, &this_rf); + + /* Drop this_rq and grab target_rq for activation */ + raw_spin_rq_lock(target_rq); + activate_task(target_rq, p, 0); + wakeup_preempt(target_rq, p, 0); + put_task_struct(p); + raw_spin_rq_unlock(target_rq); + + /* Finally, re-grab the origianl rq lock and return to pick-again */ + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); + update_rq_clock(rq); + return; + +err_out: + task_rq_unlock(this_rq, p, &this_rf); + put_task_struct(p); + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); + update_rq_clock(rq); } =20 /* @@ -6650,11 +6789,13 @@ static struct task_struct *proxy_deactivate(struct = rq *rq, struct task_struct *d static struct task_struct * find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags = *rf) { - enum { FOUND, DEACTIVATE_DONOR } action =3D FOUND; + enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action =3D FOUND; struct task_struct *owner =3D NULL; + bool curr_in_chain =3D false; int this_cpu =3D cpu_of(rq); struct task_struct *p; struct mutex *mutex; + int owner_cpu; =20 /* Follow blocked_on chain. 
*/ for (p =3D donor; task_is_blocked(p); p =3D owner) { @@ -6663,9 +6804,15 @@ find_proxy_task(struct rq *rq, struct task_struct *d= onor, struct rq_flags *rf) if (!mutex) return NULL; =20 - /* if its PROXY_WAKING, resched_idle so ttwu can complete */ - if (mutex =3D=3D PROXY_WAKING) - return proxy_resched_idle(rq); + /* if its PROXY_WAKING, do return migration or run if current */ + if (mutex =3D=3D PROXY_WAKING) { + if (task_current(rq, p)) { + clear_task_blocked_on(p, PROXY_WAKING); + return p; + } + action =3D NEEDS_RETURN; + break; + } =20 /* * By taking mutex->wait_lock we hold off concurrent mutex_unlock() @@ -6685,26 +6832,41 @@ find_proxy_task(struct rq *rq, struct task_struct *= donor, struct rq_flags *rf) return NULL; } =20 + if (task_current(rq, p)) + curr_in_chain =3D true; + owner =3D __mutex_owner(mutex); if (!owner) { /* - * If there is no owner, clear blocked_on - * and return p so it can run and try to - * acquire the lock + * If there is no owner, either clear blocked_on + * and return p (if it is current and safe to + * just run on this rq), or return-migrate the task. */ - __clear_task_blocked_on(p, mutex); - return p; + if (task_current(rq, p)) { + __clear_task_blocked_on(p, NULL); + return p; + } + action =3D NEEDS_RETURN; + break; } =20 if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) { /* XXX Don't handle blocked owners/delayed dequeue yet */ + if (curr_in_chain) + return proxy_resched_idle(rq); action =3D DEACTIVATE_DONOR; break; } =20 - if (task_cpu(owner) !=3D this_cpu) { - /* XXX Don't handle migrations yet */ - action =3D DEACTIVATE_DONOR; + owner_cpu =3D task_cpu(owner); + if (owner_cpu !=3D this_cpu) { + /* + * @owner can disappear, simply migrate to @owner_cpu + * and leave that CPU to sort things out. + */ + if (curr_in_chain) + return proxy_resched_idle(rq); + action =3D MIGRATE; break; } =20 @@ -6766,7 +6928,17 @@ find_proxy_task(struct rq *rq, struct task_struct *d= onor, struct rq_flags *rf) /* Handle actions we need to do outside of the guard() scope */ switch (action) { case DEACTIVATE_DONOR: - return proxy_deactivate(rq, donor); + if (proxy_deactivate(rq, donor)) + return NULL; + /* If deactivate fails, force return */ + p =3D donor; + fallthrough; + case NEEDS_RETURN: + proxy_force_return(rq, rf, p); + return NULL; + case MIGRATE: + proxy_migrate_task(rq, rf, p, owner_cpu); + return NULL; case FOUND: /* fallthrough */; } --=20 2.52.0.487.g5c8c507ade-goog From nobody Tue Dec 2 00:25:42 2025 Received: from mail-pl1-f202.google.com (mail-pl1-f202.google.com [209.85.214.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 54AB933BBC4 for ; Mon, 24 Nov 2025 22:31:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764023487; cv=none; b=Bh2iQT9+N2sTDwZ67tDgyn75L3cyfUrw9BsLs3ANpcRbcVeduXDRmIUx6CGf83NTFQpeVAuNIX1quO7+7EaAxSWW/B9iU72c0wg/JnpKRndYBTQ7MVXPk45ZhwVsseGmTMSWG7R9Km1nC+eC4LeUI54gEocD9bmSa5eBuzLQ7mc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764023487; c=relaxed/simple; bh=uL3Cm6b5lAgPhnEQGVTfJQM/5ZovUGAK9B5f1bhnqvM=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; 
b=Db0hAoObA+fL5SZ+HMotyla0a9vg7q/kEPZXpRMI1rgNVyFp0q81bqQKsyI5jKIPrwXELzIr+/TajGw2No66cPPwRwICVC9eIsFj9gU+0xNC5zLUwchylReuYb/u6zfYX9g72V5IF1D8nmF7Bfg81+nRQ4Gns9hjOyPcmVIzR8Y= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=jIfqXAzU; arc=none smtp.client-ip=209.85.214.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="jIfqXAzU" Received: by mail-pl1-f202.google.com with SMTP id d9443c01a7336-295fbc7d4abso75193285ad.1 for ; Mon, 24 Nov 2025 14:31:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1764023485; x=1764628285; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=XcCH6d0H0rHy6O/ohUMcok7+Eoj6mVSLe7ga5uhX8N0=; b=jIfqXAzU8O6rVjMSXgk2lGLLx9ZMllJidc8ekCT+F2QlNpMIewH7pzvL7u2dk9UABz 9ypC/0tPiFN14Tq7/pPvUcBmlGAtJq/jc7lmiwu291RPZxV01q0MgQnW6GBrARqo5o18 etSKCICToqZOwwV4bMAx1xeGq15abn4I3m4WXYN4VYsD0ztSHShHffl2WOFwia8/4BQT stQYlSRiYoonLsVIdKuih2TfJHHS1RVFWA0hUvc6PyLY9Z4bUMQr1iHnrq5wiQTx4ICa sGrq9hWwamE1jeI+7Inuu1zovh0+FuI9xMrq9u0zWvefd5GoUQ/dO2T/eAHVutw86hgJ mWHQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764023485; x=1764628285; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=XcCH6d0H0rHy6O/ohUMcok7+Eoj6mVSLe7ga5uhX8N0=; b=OT8L/JR6UVibHQU8b4at9Dh3amVOhaN5+hD3V6AVXOvIqbndhF5dy9g6FQx5f0oeFt v+n2TaEGf3z44uHbrblhKf0v1l5BgMWc+nOh8zxdDR65DBQcS0HKii9uxBreT3MmgywZ VRwjAH0KBCWpsWiGcNJezdPiDGjg//9wlenN+o94iuxh1GwPAuWrN2CQEDz9xhx6EVW9 mBKqTG0xOwQNXGjRD7ZGyjAXiYGTDFGcOWwmTFX/cLKwqSI/daigSp4aQbbIewxNUgTV VXoV4KuBvcW/l8I8ulxj+QlfxLoqLhc3Q/FqozkEPr4EBjqfJSZ8zqEzYOWIPI0cCOiL LURg== X-Gm-Message-State: AOJu0YwCHrGt2jB0an0xxMuBzZwtb1Y8ou6tO+KVGvg/dA7XyVaWq0ik NTRSOd0x3O8P2WrUJFhUF7L6evBm7mvdjUd6lLyJtHsXGhcVZ3uCpmsSz2cOHdazCX1/c+T168d PdKW+XqRzHC+TOMnvSsmo3QdWv/ADfntjRgIy1xHeaHYYWaV7544spqklG/trJJReSrDp8Iktmb 8BxFuOGRbFmYC05pbn0PSHmc77O4Q1WUGbOtMMnrtM+bsUFuiW X-Google-Smtp-Source: AGHT+IFILPWjBT2VDvkeE+PV5eqFbpiXmulQvLqNjsYFUkE86qk68cUA28ZL51Mybn72ze3MAfD1NWmNi3Jy X-Received: from plbkj5.prod.google.com ([2002:a17:903:6c5:b0:293:de:a528]) (user=jstultz job=prod-delivery.src-stubby-dispatcher) by 2002:a17:902:ebc2:b0:298:33c9:eda2 with SMTP id d9443c01a7336-29b6bf3bccbmr146958445ad.33.1764023484658; Mon, 24 Nov 2025 14:31:24 -0800 (PST) Date: Mon, 24 Nov 2025 22:30:59 +0000 In-Reply-To: <20251124223111.3616950-1-jstultz@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20251124223111.3616950-1-jstultz@google.com> X-Mailer: git-send-email 2.52.0.487.g5c8c507ade-goog Message-ID: <20251124223111.3616950-8-jstultz@google.com> Subject: [PATCH v24 07/11] sched: Rework pick_next_task() and prev_balance() to avoid stale prev references From: John Stultz To: LKML Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , 
Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

Historically, the prev value from __schedule() was the rq->curr. This prev value is passed down through numerous functions, and used in the class scheduler implementations. The fact that prev was on_cpu until the end of __schedule() meant it was stable across the rq lock drops that the class->pick_next_task() and ->balance() implementations often do.

However, with proxy-exec, the prev passed to functions called by __schedule() is rq->donor, which may not be the same as rq->curr and may not be on_cpu; this makes the prev value potentially unstable across rq lock drops.

A recently found issue with proxy-exec is that when we begin doing return migration from try_to_wake_up(), it's possible we may be waking up the rq->donor. When we do this, we call proxy_resched_idle() to put_prev_set_next(), setting rq->donor to rq->idle and allowing the old donor to be return-migrated and run. This, however, runs into trouble, as on another cpu we might be in the middle of calling __schedule(). Conceptually the rq lock is held for the majority of the time, but in calling pick_next_task() it's possible the class->pick_next_task() handler or the ->balance() call may briefly drop the rq lock. This opens a window for try_to_wake_up() to wake and return-migrate the rq->donor before the class logic reacquires the rq lock.

Unfortunately pick_next_task() and prev_balance() pass in a prev argument, to which we pass rq->donor. However, this prev value can now become stale and incorrect across an rq lock drop.

So, to correct this, rework the pick_next_task() and prev_balance() calls so that they do not take a "prev" argument. Also rework the class ->pick_next_task() and ->balance() implementations to drop the prev argument and, in the cases where it was used, have the class functions reference rq->donor directly rather than saving the value across rq lock drops, so that we don't end up with stale references.

Signed-off-by: John Stultz
---
Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E.
McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- kernel/sched/core.c | 37 ++++++++++++++++++------------------- kernel/sched/deadline.c | 8 +++++++- kernel/sched/ext.c | 8 ++++++-- kernel/sched/fair.c | 15 ++++++++++----- kernel/sched/idle.c | 2 +- kernel/sched/rt.c | 8 +++++++- kernel/sched/sched.h | 8 ++++---- kernel/sched/stop_task.c | 2 +- 8 files changed, 54 insertions(+), 34 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4c5493b0ad210..fcf64c4db437e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5955,10 +5955,9 @@ static inline void schedule_debug(struct task_struct= *prev, bool preempt) schedstat_inc(this_rq()->sched_count); } =20 -static void prev_balance(struct rq *rq, struct task_struct *prev, - struct rq_flags *rf) +static void prev_balance(struct rq *rq, struct rq_flags *rf) { - const struct sched_class *start_class =3D prev->sched_class; + const struct sched_class *start_class =3D rq->donor->sched_class; const struct sched_class *class; =20 #ifdef CONFIG_SCHED_CLASS_EXT @@ -5983,7 +5982,7 @@ static void prev_balance(struct rq *rq, struct task_s= truct *prev, * a runnable task of @class priority or higher. */ for_active_class_range(class, start_class, &idle_sched_class) { - if (class->balance && class->balance(rq, prev, rf)) + if (class->balance && class->balance(rq, rf)) break; } } @@ -5992,7 +5991,7 @@ static void prev_balance(struct rq *rq, struct task_s= truct *prev, * Pick up the highest-prio task: */ static inline struct task_struct * -__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags = *rf) +__pick_next_task(struct rq *rq, struct rq_flags *rf) { const struct sched_class *class; struct task_struct *p; @@ -6008,34 +6007,34 @@ __pick_next_task(struct rq *rq, struct task_struct = *prev, struct rq_flags *rf) * higher scheduling class, because otherwise those lose the * opportunity to pull in more work from other CPUs. 
*/ - if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) && + if (likely(!sched_class_above(rq->donor->sched_class, &fair_sched_class) = && rq->nr_running =3D=3D rq->cfs.h_nr_queued)) { =20 - p =3D pick_next_task_fair(rq, prev, rf); + p =3D pick_next_task_fair(rq, rf); if (unlikely(p =3D=3D RETRY_TASK)) goto restart; =20 /* Assume the next prioritized class is idle_sched_class */ if (!p) { p =3D pick_task_idle(rq); - put_prev_set_next_task(rq, prev, p); + put_prev_set_next_task(rq, rq->donor, p); } =20 return p; } =20 restart: - prev_balance(rq, prev, rf); + prev_balance(rq, rf); =20 for_each_active_class(class) { if (class->pick_next_task) { - p =3D class->pick_next_task(rq, prev); + p =3D class->pick_next_task(rq); if (p) return p; } else { p =3D class->pick_task(rq); if (p) { - put_prev_set_next_task(rq, prev, p); + put_prev_set_next_task(rq, rq->donor, p); return p; } } @@ -6084,7 +6083,7 @@ extern void task_vruntime_update(struct rq *rq, struc= t task_struct *p, bool in_f static void queue_core_balance(struct rq *rq); =20 static struct task_struct * -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *r= f) +pick_next_task(struct rq *rq, struct rq_flags *rf) { struct task_struct *next, *p, *max =3D NULL; const struct cpumask *smt_mask; @@ -6096,7 +6095,7 @@ pick_next_task(struct rq *rq, struct task_struct *pre= v, struct rq_flags *rf) bool need_sync; =20 if (!sched_core_enabled(rq)) - return __pick_next_task(rq, prev, rf); + return __pick_next_task(rq, rf); =20 cpu =3D cpu_of(rq); =20 @@ -6109,7 +6108,7 @@ pick_next_task(struct rq *rq, struct task_struct *pre= v, struct rq_flags *rf) */ rq->core_pick =3D NULL; rq->core_dl_server =3D NULL; - return __pick_next_task(rq, prev, rf); + return __pick_next_task(rq, rf); } =20 /* @@ -6133,7 +6132,7 @@ pick_next_task(struct rq *rq, struct task_struct *pre= v, struct rq_flags *rf) goto out_set_next; } =20 - prev_balance(rq, prev, rf); + prev_balance(rq, rf); =20 smt_mask =3D cpu_smt_mask(cpu); need_sync =3D !!rq->core->core_cookie; @@ -6306,7 +6305,7 @@ pick_next_task(struct rq *rq, struct task_struct *pre= v, struct rq_flags *rf) } =20 out_set_next: - put_prev_set_next_task(rq, prev, next); + put_prev_set_next_task(rq, rq->donor, next); if (rq->core->core_forceidle_count && next =3D=3D rq->idle) queue_core_balance(rq); =20 @@ -6528,9 +6527,9 @@ static inline void sched_core_cpu_deactivate(unsigned= int cpu) {} static inline void sched_core_cpu_dying(unsigned int cpu) {} =20 static struct task_struct * -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *r= f) +pick_next_task(struct rq *rq, struct rq_flags *rf) { - return __pick_next_task(rq, prev, rf); + return __pick_next_task(rq, rf); } =20 #endif /* !CONFIG_SCHED_CORE */ @@ -7097,7 +7096,7 @@ static void __sched notrace __schedule(int sched_mode) =20 pick_again: assert_balance_callbacks_empty(rq); - next =3D pick_next_task(rq, rq->donor, &rf); + next =3D pick_next_task(rq, &rf); rq_set_donor(rq, next); if (unlikely(task_is_blocked(next))) { next =3D find_proxy_task(rq, next, &rf); diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index c4402542ef44f..d86fc3dd0d806 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2268,8 +2268,14 @@ static void check_preempt_equal_dl(struct rq *rq, st= ruct task_struct *p) resched_curr(rq); } =20 -static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flag= s *rf) +static int balance_dl(struct rq *rq, struct rq_flags *rf) { + /* + * Note, rq->donor may 
change during rq lock drops, + * so don't re-use prev across lock drops + */ + struct task_struct *p =3D rq->donor; + if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) { /* * This is OK, because current is on_cpu, which avoids it being diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 7e0fcfdc06a2d..5c6cb0a3be738 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2153,9 +2153,13 @@ static int balance_one(struct rq *rq, struct task_st= ruct *prev) return true; } =20 -static int balance_scx(struct rq *rq, struct task_struct *prev, - struct rq_flags *rf) +static int balance_scx(struct rq *rq, struct rq_flags *rf) { + /* + * Note, rq->donor may change during rq lock drops, + * so don't re-use prev across lock drops + */ + struct task_struct *prev =3D rq->donor; int ret; =20 rq_unpin_lock(rq, rf); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 328ea325a1d1c..7d2e92a55b164 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8713,7 +8713,7 @@ static void set_cpus_allowed_fair(struct task_struct = *p, struct affinity_context } =20 static int -balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) +balance_fair(struct rq *rq, struct rq_flags *rf) { if (sched_fair_runnable(rq)) return 1; @@ -8866,13 +8866,18 @@ static void __set_next_task_fair(struct rq *rq, str= uct task_struct *p, bool firs static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool = first); =20 struct task_struct * -pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_fla= gs *rf) +pick_next_task_fair(struct rq *rq, struct rq_flags *rf) { struct sched_entity *se; - struct task_struct *p; + struct task_struct *p, *prev; int new_tasks; =20 again: + /* + * Re-read rq->donor at the top as it may have + * changed across a rq lock drop + */ + prev =3D rq->donor; p =3D pick_task_fair(rq); if (!p) goto idle; @@ -8952,9 +8957,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct= *prev, struct rq_flags *rf return NULL; } =20 -static struct task_struct *__pick_next_task_fair(struct rq *rq, struct tas= k_struct *prev) +static struct task_struct *__pick_next_task_fair(struct rq *rq) { - return pick_next_task_fair(rq, prev, NULL); + return pick_next_task_fair(rq, NULL); } =20 static struct task_struct *fair_server_pick_task(struct sched_dl_entity *d= l_se) diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index c39b089d4f09b..a7c718c1733ba 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -439,7 +439,7 @@ select_task_rq_idle(struct task_struct *p, int cpu, int= flags) } =20 static int -balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) +balance_idle(struct rq *rq, struct rq_flags *rf) { return WARN_ON_ONCE(1); } diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index fb07dcfc60a24..17cfac1da38b6 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1591,8 +1591,14 @@ static void check_preempt_equal_prio(struct rq *rq, = struct task_struct *p) resched_curr(rq); } =20 -static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flag= s *rf) +static int balance_rt(struct rq *rq, struct rq_flags *rf) { + /* + * Note, rq->donor may change during rq lock drops, + * so don't re-use p across lock drops + */ + struct task_struct *p =3D rq->donor; + if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) { /* * This is OK, because current is on_cpu, which avoids it being diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a0de4f00edd61..424c40bd46e2f 100644 --- a/kernel/sched/sched.h +++ 
b/kernel/sched/sched.h @@ -2415,18 +2415,18 @@ struct sched_class { =20 void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags); =20 - int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *= rf); + int (*balance)(struct rq *rq, struct rq_flags *rf); struct task_struct *(*pick_task)(struct rq *rq); /* * Optional! When implemented pick_next_task() should be equivalent to: * * next =3D pick_task(); * if (next) { - * put_prev_task(prev); + * put_prev_task(rq->donor); * set_next_task_first(next); * } */ - struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *= prev); + struct task_struct *(*pick_next_task)(struct rq *rq); =20 void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct task_s= truct *next); void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first); @@ -2586,7 +2586,7 @@ static inline bool sched_fair_runnable(struct rq *rq) return rq->cfs.nr_queued > 0; } =20 -extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_= struct *prev, struct rq_flags *rf); +extern struct task_struct *pick_next_task_fair(struct rq *rq, struct rq_fl= ags *rf); extern struct task_struct *pick_task_idle(struct rq *rq); =20 #define SCA_CHECK 0x01 diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c index 2d4e279f05ee9..73aeb0743aa2e 100644 --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c @@ -16,7 +16,7 @@ select_task_rq_stop(struct task_struct *p, int cpu, int f= lags) } =20 static int -balance_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) +balance_stop(struct rq *rq, struct rq_flags *rf) { return sched_stop_runnable(rq); } --=20 2.52.0.487.g5c8c507ade-goog From nobody Tue Dec 2 00:25:42 2025 Received: from mail-pj1-f73.google.com (mail-pj1-f73.google.com [209.85.216.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 07C7C33C1BA for ; Mon, 24 Nov 2025 22:31:26 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.73 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764023488; cv=none; b=jLqYUE6XvXcumYwmr0DW9unAyzCg5BuSK8SSNbfz5hTRujIskH+Z9QZBtjDMlu0mi+Mwk2RXUviIiBcuz2Y8yGWRP+cD/GBESfmogu2WvbkhrXO27Msk5RUz8wwzlO0sYuBQGXDr894lxS+sAaG9s7yqwIadxHlVawN91QtQ19o= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764023488; c=relaxed/simple; bh=4q473SgTwCYp6QKeMFdKzYedqzxmtloTSbgEr0egrbs=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=qIMsTWsE/CkMZkcFRxSBpgISjq0i7unScq9NGNnlpbVCSp9Tx7ifU6LSe1FPPKW+A0aPvyg9rV2htlj9RHp3zNLJ354tM7c3WDBX4NTQVMO+qTsWi1ji46Vgobf+dLsW8IrdS7izx+nf5WQPSX03jyeNAqdqZWuHepZKk0eQk5U= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=EXecwnG6; arc=none smtp.client-ip=209.85.216.73 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="EXecwnG6" Received: by mail-pj1-f73.google.com with SMTP id 98e67ed59e1d1-3437b43eec4so7781920a91.3 for ; 
Mon, 24 Nov 2025 14:31:26 -0800 (PST)
Date: Mon, 24 Nov 2025 22:31:00 +0000
In-Reply-To: <20251124223111.3616950-1-jstultz@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20251124223111.3616950-1-jstultz@google.com>
X-Mailer: git-send-email 2.52.0.487.g5c8c507ade-goog
Message-ID: <20251124223111.3616950-9-jstultz@google.com>
Subject: [PATCH v24 08/11] sched: Avoid donor->sched_class->yield_task() null traversal
From: John Stultz
To: LKML
Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

With proxy-exec, once we do return migration from ttwu(), if a task is proxying for a waiting donor and the donor is woken up, we switch rq->donor to point to idle briefly until we can re-enter __schedule(). However, if a task that was acting as a proxy calls into yield() right after the donor is switched to idle, it may trip a NULL pointer dereference, because the idle task doesn't have a yield_task() pointer.

So add a conditional to ensure we don't try to call the yield_task() pointer in that case.
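The shape of the guard is simply a NULL check on the class hook before the indirect call, roughly as sketched here (a minimal restatement of the do_sched_yield() change in the diff below, shown on its own for clarity):

    /*
     * rq->donor may briefly be rq->idle, and idle_sched_class provides
     * no yield_task() hook, so guard the indirect call.
     */
    if (rq->donor->sched_class->yield_task)
            rq->donor->sched_class->yield_task(rq);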
This was only recently found because, prior to commit 127b90315ca07 ("sched/proxy: Yield the donor task"), do_sched_yield() incorrectly called current->sched_class->yield_task() instead of using rq->donor.

Signed-off-by: John Stultz
---
Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com
---
 kernel/sched/syscalls.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index bf360a6fbb800..4b2b81437b03b 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -1351,7 +1351,8 @@ static void do_sched_yield(void)
 	rq = this_rq_lock_irq(&rf);

 	schedstat_inc(rq->yld_count);
-	rq->donor->sched_class->yield_task(rq);
+	if (rq->donor->sched_class->yield_task)
+		rq->donor->sched_class->yield_task(rq);

 	preempt_disable();
 	rq_unlock_irq(rq, &rf);
-- 
2.52.0.487.g5c8c507ade-goog

From nobody Tue Dec 2 00:25:42 2025
Date: Mon, 24 Nov 2025 22:31:01 +0000
In-Reply-To: <20251124223111.3616950-1-jstultz@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20251124223111.3616950-1-jstultz@google.com>
X-Mailer: git-send-email 2.52.0.487.g5c8c507ade-goog
Message-ID: <20251124223111.3616950-10-jstultz@google.com>
Subject: [PATCH v24 09/11] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
From: John Stultz
To: LKML
Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

This patch adds logic so try_to_wake_up() will notice if we are waking a task where blocked_on == PROXY_WAKING, and if necessary dequeue the task so the wakeup will naturally return-migrate the donor task back to a CPU it can run on.

This helps performance, as we do the dequeue and wakeup under the locks normally taken in try_to_wake_up(), and it avoids having to do proxy_force_return() from __schedule(), which has to re-take similar locks and then force a pick-again loop.

This was split out from the larger proxy patch, and significantly reworked.
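In rough pseudocode, the check added on the ttwu_runnable() path behaves like the sketch below. This is an illustrative simplification of proxy_needs_return() in the diff that follows; the blocked_lock handling, the resched of the current donor, and the rq-clock update are elided, and the names simply mirror the ones used by this patch.

    /* sketch only: called with rq locked while waking p */
    if (p->blocked_on == PROXY_WAKING) {
            __clear_task_blocked_on(p, PROXY_WAKING);
            if (!task_current(rq, p) && p->wake_cpu != cpu_of(rq)) {
                    /*
                     * Dequeue here so the rest of the wakeup re-selects a
                     * CPU the task is allowed to run on (usually wake_cpu).
                     */
                    block_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SPECIAL);
                    return true;    /* fall through to a full wakeup/placement */
            }
    }
    return false;                   /* no return migration needed */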
Credits for the original patch go to: Peter Zijlstra (Intel) Juri Lelli Valentin Schneider Connor O'Brien Signed-off-by: John Stultz --- v24: * Reworked proxy_needs_return() so its less nested as suggested by K Prateek * Switch to using block_task with DEQUEUE_SPECIAL as suggested by K Prateek * Fix edge case to reset wake_cpu if select_task_rq() chooses the current rq and we skip set_task_cpu() Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- kernel/sched/core.c | 84 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 82 insertions(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fcf64c4db437e..e4a49c694dcd9 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3697,6 +3697,64 @@ static inline void proxy_set_task_cpu(struct task_st= ruct *p, int cpu) __set_task_cpu(p, cpu); p->wake_cpu =3D wake_cpu; } + +static bool proxy_task_runnable_but_waking(struct task_struct *p) +{ + if (!sched_proxy_exec()) + return false; + return (READ_ONCE(p->__state) =3D=3D TASK_RUNNING && + READ_ONCE(p->blocked_on) =3D=3D PROXY_WAKING); +} + +static inline struct task_struct *proxy_resched_idle(struct rq *rq); + +/* + * Checks to see if task p has been proxy-migrated to another rq + * and needs to be returned. If so, we deactivate the task here + * so that it can be properly woken up on the p->wake_cpu + * (or whichever cpu select_task_rq() picks at the bottom of + * try_to_wake_up() + */ +static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p) +{ + if (!sched_proxy_exec()) + return false; + + guard(raw_spinlock)(&p->blocked_lock); + + /* If task isn't PROXY_WAKING, we don't need to do return migration */ + if (p->blocked_on !=3D PROXY_WAKING) + return false; + + __clear_task_blocked_on(p, PROXY_WAKING); + + /* If already current, don't need to return migrate */ + if (task_current(rq, p)) + return false; + + /* If wake_cpu is targeting this cpu, don't bother return migrating */ + if (p->wake_cpu =3D=3D cpu_of(rq)) { + resched_curr(rq); + return false; + } + + /* If we're return migrating the rq->donor, switch it out for idle */ + if (task_current_donor(rq, p)) + proxy_resched_idle(rq); + + /* (ab)Use DEQUEUE_SPECIAL to ensure task is always blocked here. 
*/ + block_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SPECIAL); + return true; +} +#else /* !CONFIG_SCHED_PROXY_EXEC */ +static bool proxy_task_runnable_but_waking(struct task_struct *p) +{ + return false; +} +static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p) +{ + return false; +} #endif /* CONFIG_SCHED_PROXY_EXEC */ =20 static void @@ -3784,6 +3842,8 @@ static int ttwu_runnable(struct task_struct *p, int w= ake_flags) update_rq_clock(rq); if (p->se.sched_delayed) enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED); + if (proxy_needs_return(rq, p)) + goto out; if (!task_on_cpu(rq, p)) { /* * When on_rq && !on_cpu the task is preempted, see if @@ -3794,6 +3854,7 @@ static int ttwu_runnable(struct task_struct *p, int w= ake_flags) ttwu_do_wakeup(p); ret =3D 1; } +out: __task_rq_unlock(rq, &rf); =20 return ret; @@ -4181,6 +4242,8 @@ int try_to_wake_up(struct task_struct *p, unsigned in= t state, int wake_flags) * it disabling IRQs (this allows not taking ->pi_lock). */ WARN_ON_ONCE(p->se.sched_delayed); + /* If p is current, we know we can run here, so clear blocked_on */ + clear_task_blocked_on(p, NULL); if (!ttwu_state_match(p, state, &success)) goto out; =20 @@ -4197,8 +4260,15 @@ int try_to_wake_up(struct task_struct *p, unsigned i= nt state, int wake_flags) */ scoped_guard (raw_spinlock_irqsave, &p->pi_lock) { smp_mb__after_spinlock(); - if (!ttwu_state_match(p, state, &success)) - break; + if (!ttwu_state_match(p, state, &success)) { + /* + * If we're already TASK_RUNNING, and PROXY_WAKING + * continue on to ttwu_runnable check to force + * proxy_needs_return evaluation + */ + if (!proxy_task_runnable_but_waking(p)) + break; + } =20 trace_sched_waking(p); =20 @@ -4305,6 +4375,16 @@ int try_to_wake_up(struct task_struct *p, unsigned i= nt state, int wake_flags) wake_flags |=3D WF_MIGRATED; psi_ttwu_dequeue(p); set_task_cpu(p, cpu); + } else if (cpu !=3D p->wake_cpu) { + /* + * If we were proxy-migrated to cpu, then + * select_task_rq() picks cpu instead of wake_cpu + * to return to, we won't call set_task_cpu(), + * leaving a stale wake_cpu pointing to where we + * proxy-migrated from. 
So just fixup wake_cpu here + * if its not correct + */ + p->wake_cpu =3D cpu; } =20 ttwu_queue(p, cpu, wake_flags); --=20 2.52.0.487.g5c8c507ade-goog From nobody Tue Dec 2 00:25:42 2025 Received: from mail-pj1-f73.google.com (mail-pj1-f73.google.com [209.85.216.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DEA0D33D6C2 for ; Mon, 24 Nov 2025 22:31:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.73 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764023491; cv=none; b=dJpm85UUlkeG5l7mBY0WnAXYfAO9eXsj49tYLPT23alRdNoJqeuiLoRrw2PwIUIExeCG1oGp80r4dzi9ghzc8oPHXGUSyfu2oyAKiKQzg1F7DY/N22jy7okEgERq3sYQaTB3oi2PvTfstH6BmsBkP9ymkS8O+BPl+JRZP8nJAbw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764023491; c=relaxed/simple; bh=Ucsmiu/NoiNA7ehb8K9jpIU8UXamZ22tEhb5x7pNYSo=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=tJBwoAdQTQXsvh7Qih2Raxx5OQiQl24wz1MQbCr9RJzyTll4nv4PGXi4LLhpmv3KsytsHv98NLNkXlDsLoGprOVkgHCjky+mw/0S2TE7/uWmVyhb/yeO8Wk+UoeeCb2JEzBVjB3eGNo3py5wkG+6FESA0wYduxDJUBM+5YHc0pI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=ZlC41WNX; arc=none smtp.client-ip=209.85.216.73 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="ZlC41WNX" Received: by mail-pj1-f73.google.com with SMTP id 98e67ed59e1d1-34740cc80d5so4480716a91.0 for ; Mon, 24 Nov 2025 14:31:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1764023489; x=1764628289; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=rSWdmSGe7yQPl8xRxcM72FT/N1f79D0LaxcCyZMZ5tA=; b=ZlC41WNXPZrwOT6Cko0i6v1GdDM+iV02oADxhAofuQ93o/Scn8DZCreHnfI5jU4mz/ RnQSB7kpLmTeCRqHWfgSnTw3Q+kz+Wro/lARx66rM5Oe5C/lMXzJFh+eAoQcKbgt+wV0 1IYy95ccrMirWVCz9QlhWOy6/DXLCXjPBy0sls469zZKTCgS/pm36B2QYbt5KHFfHcQ+ pztppnwQVEeqRJ1T+3NgJP2/x3nrPOKkr7wZdF3vxRLlxXa3KFN8neYkKQE3eBwrehZZ da3KB28nON0cYDzLlaNDLjtpoZa6ISFUFR9LkeDH0+MyQhjq2QZHGEVtib3nkX5j99Bw i0Ig== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764023489; x=1764628289; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=rSWdmSGe7yQPl8xRxcM72FT/N1f79D0LaxcCyZMZ5tA=; b=fHxdvkPMBovF1N/gBmw6gozs4CDYXUk/Q7g7UESyl8kvTVJSTD75/9/OKDM7r7bZ1z Qp5sRZ9J+nQpyOcRtyYYyz+WChCogOBt4QUSg0QyOtR4bIxqPslLlMaGppdypE957bHw sAvM4TUQyQhjdWMNvzKNKuibhC/fWJhkyZnA8+sZG/rS3mKQ9OpQjwMwG1ekcLKnbWcT RwnWjf45m680qXZXLWe4C43yfWzjcmpIepXEhsH3xohfyFBNUB/8RgyJ7DylFqyF4JNb lQ2CNm+ZTNvzWPct966555V9D9vwFEMrRCVZdyPyEfacyijq7b1UIqM9Rw4aDmJH/nVT HW1Q== X-Gm-Message-State: AOJu0Yx0DR+Wa64u8jWO2MxpORwj1IhdtQkCEQjwz98gpvgeUaEuJuvg OE9QY8nEU/wg+HbTP3szqAKIn6YJZkdFaGtCb8zJZTEzDOsyiLaPhBIasQvT+Pee8QJlYKhN+u0 
odC1aUXmHmyZAZs9YFT/lxSSOS//1YOXfN3yz3g8XBcCZ9h/aFvurMnZ5NDY/kqNMnQie6Y8uct uBU0yB4bRu1SWpT1LXeqT4ZxqgnNHLQCw//Yo6vXj33Bq4nzya X-Google-Smtp-Source: AGHT+IEIBisKczKfQ13L3ryImviRmXgpTQrc1TK9cSBl6wpyLa7pofj5WKWXb7O+/+T24lSWS3spRkIlS+xO X-Received: from pjblx14.prod.google.com ([2002:a17:90b:4b0e:b0:340:b94b:e61d]) (user=jstultz job=prod-delivery.src-stubby-dispatcher) by 2002:a17:90b:2d87:b0:340:9d52:44c1 with SMTP id 98e67ed59e1d1-34733f3f6eamr14647549a91.35.1764023489187; Mon, 24 Nov 2025 14:31:29 -0800 (PST) Date: Mon, 24 Nov 2025 22:31:02 +0000 In-Reply-To: <20251124223111.3616950-1-jstultz@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20251124223111.3616950-1-jstultz@google.com> X-Mailer: git-send-email 2.52.0.487.g5c8c507ade-goog Message-ID: <20251124223111.3616950-11-jstultz@google.com> Subject: [PATCH v24 10/11] sched: Add blocked_donor link to task for smarter mutex handoffs From: John Stultz To: LKML Cc: Peter Zijlstra , Juri Lelli , Valentin Schneider , "Connor O'Brien" , John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra Add link to the task this task is proxying for, and use it so the mutex owner can do an intelligent hand-off of the mutex to the task that the owner is running on behalf. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Juri Lelli Signed-off-by: Valentin Schneider Signed-off-by: Connor O'Brien [jstultz: This patch was split out from larger proxy patch] Signed-off-by: John Stultz --- v5: * Split out from larger proxy patch v6: * Moved proxied value from earlier patch to this one where it is actually used * Rework logic to check sched_proxy_exec() instead of using ifdefs * Moved comment change to this patch where it makes sense v7: * Use more descriptive term then "us" in comments, as suggested by Metin Kaya. * Minor typo fixup from Metin Kaya * Reworked proxied variable to prev_not_proxied to simplify usage v8: * Use helper for donor blocked_on_state transition v9: * Re-add mutex lock handoff in the unlock path, but only when we have a blocked donor * Slight reword of commit message suggested by Metin v18: * Add task_init initialization for blocked_donor, suggested by Suleiman v23: * Reworks for PROXY_WAKING approach suggested by PeterZ Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. 
Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com
---
 include/linux/sched.h  |  1 +
 init/init_task.c       |  1 +
 kernel/fork.c          |  1 +
 kernel/locking/mutex.c | 44 +++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/core.c    | 18 ++++++++++++++++--
 5 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0d6c4c31e3624..178ed37850470 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1241,6 +1241,7 @@ struct task_struct {
 #endif

 	struct mutex		*blocked_on;	/* lock we're blocked on */
+	struct task_struct	*blocked_donor;	/* task that is boosting this task */
 	raw_spinlock_t		blocked_lock;

 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
diff --git a/init/init_task.c b/init/init_task.c
index 60477d74546e0..34853a511b4d8 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -177,6 +177,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.mems_allowed_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq,
 						 &init_task.alloc_lock),
 #endif
+	.blocked_donor = NULL,
 #ifdef CONFIG_RT_MUTEXES
 	.pi_waiters = RB_ROOT_CACHED,
 	.pi_top_task = NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index 0697084be202f..0a9a17e25b85d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2136,6 +2136,7 @@ __latent_entropy struct task_struct *copy_process(
 	lockdep_init_task(p);

 	p->blocked_on = NULL;    /* not blocked yet */
+	p->blocked_donor = NULL; /* nobody is boosting p yet */

 #ifdef CONFIG_BCACHE
 	p->sequential_io = 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 3cb9001d15119..08f438a54f56f 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -926,7 +926,7 @@ EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible);
  */
 static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
 {
-	struct task_struct *next = NULL;
+	struct task_struct *donor, *next = NULL;
 	DEFINE_WAKE_Q(wake_q);
 	unsigned long owner;
 	unsigned long flags;
@@ -945,6 +945,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		MUTEX_WARN_ON(__owner_task(owner) != current);
 		MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP);

+		if (sched_proxy_exec() && current->blocked_donor) {
+			/* force handoff if we have a blocked_donor */
+			owner = MUTEX_FLAG_HANDOFF;
+			break;
+		}
+
 		if (owner & MUTEX_FLAG_HANDOFF)
 			break;

@@ -958,7 +964,34 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne

 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	debug_mutex_unlock(lock);
-	if (!list_empty(&lock->wait_list)) {
+
+	if (sched_proxy_exec()) {
+		raw_spin_lock(&current->blocked_lock);
+		/*
+		 * If we have a task boosting current, and that task was boosting
+		 * current through this lock, hand the lock to that task, as that
+		 * is the highest waiter, as selected by the scheduling function.
+		 */
+		donor = current->blocked_donor;
+		if (donor) {
+			struct mutex *next_lock;
+
+			raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
+			next_lock = __get_task_blocked_on(donor);
+			if (next_lock == lock) {
+				next = donor;
+				__set_task_blocked_on_waking(donor, next_lock);
+				wake_q_add(&wake_q, donor);
+				current->blocked_donor = NULL;
+			}
+			raw_spin_unlock(&donor->blocked_lock);
+		}
+	}
+
+	/*
+	 * Failing that, pick any on the wait list.
+	 */
+	if (!next && !list_empty(&lock->wait_list)) {
 		/* get the first entry from the wait-list: */
 		struct mutex_waiter *waiter =
			list_first_entry(&lock->wait_list,
@@ -966,14 +999,19 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne

 		next = waiter->task;

+		raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING);
 		debug_mutex_wake_waiter(lock, waiter);
-		set_task_blocked_on_waking(next, lock);
+		__set_task_blocked_on_waking(next, lock);
+		raw_spin_unlock(&next->blocked_lock);
 		wake_q_add(&wake_q, next);
+	}

 	if (owner & MUTEX_FLAG_HANDOFF)
 		__mutex_handoff(lock, next);

+	if (sched_proxy_exec())
+		raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 }

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e4a49c694dcd9..7f42ec01192dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6853,7 +6853,17 @@ static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
  * Find runnable lock owner to proxy for mutex blocked donor
  *
  * Follow the blocked-on relation:
- *   task->blocked_on -> mutex->owner -> task...
+ *
+ *                ,-> task
+ *                |     | blocked-on
+ *                |     v
+ *  blocked_donor |   mutex
+ *                |     | owner
+ *                |     v
+ *                `-- task
+ *
+ * and set the blocked_donor relation, this latter is used by the mutex
+ * code to find which (blocked) task to hand-off to.
 *
 * Lock order:
 *
@@ -7002,6 +7012,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * rq, therefore holding @rq->lock is sufficient to
 		 * guarantee its existence, as per ttwu_remote().
 		 */
+		owner->blocked_donor = p;
 	}

 	/* Handle actions we need to do outside of the guard() scope */
@@ -7102,6 +7113,7 @@ static void __sched notrace __schedule(int sched_mode)
 	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
+	bool prev_not_proxied;
 	int cpu;

 	/* Trace preemptions consistently with task switches */
@@ -7174,10 +7186,12 @@ static void __sched notrace __schedule(int sched_mode)
 		switch_count = &prev->nvcsw;
 	}

+	prev_not_proxied = !prev->blocked_donor;
 pick_again:
 	assert_balance_callbacks_empty(rq);
 	next = pick_next_task(rq, &rf);
 	rq_set_donor(rq, next);
+	next->blocked_donor = NULL;
 	if (unlikely(task_is_blocked(next))) {
 		next = find_proxy_task(rq, next, &rf);
 		if (!next) {
@@ -7243,7 +7257,7 @@ static void __sched notrace __schedule(int sched_mode)
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
 		/* In case next was already curr but just got blocked_donor */
-		if (!task_current_donor(rq, next))
+		if (prev_not_proxied && next->blocked_donor)
 			proxy_tag_curr(rq, next);

 		rq_unpin_lock(rq, &rf);
--
2.52.0.487.g5c8c507ade-goog
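For illustration, the donor-first handoff policy used in __mutex_unlock_slowpath() above can be modeled in a few lines of plain C: if the unlocking owner has a blocked_donor and that donor is blocked on this very lock, hand the lock to the donor; otherwise fall back to the first task on the wait list. The structures and the pick_next_owner() helper below are simplified stand-ins, not the kernel's types or APIs, and all locking is omitted:

/* Minimal user-space sketch of the "donor first, then wait list" policy. */
#include <stdio.h>
#include <stddef.h>

struct lock;

struct task {
	const char *name;
	struct lock *blocked_on;     /* lock this task is waiting for */
	struct task *blocked_donor;  /* task that has been boosting this one */
};

struct lock {
	struct task *owner;
	struct task *first_waiter;   /* stand-in for the real wait list */
};

/* Prefer the owner's donor when it is blocked on this lock; else first waiter. */
static struct task *pick_next_owner(struct lock *l)
{
	struct task *donor = l->owner->blocked_donor;

	if (donor && donor->blocked_on == l) {
		l->owner->blocked_donor = NULL;  /* handoff consumes the link */
		return donor;
	}
	return l->first_waiter;
}

int main(void)
{
	struct lock m = { 0 };
	struct task owner  = { "owner",  NULL, NULL };
	struct task waiter = { "waiter", &m,   NULL };
	struct task donor  = { "donor",  &m,   NULL };

	m.owner = &owner;
	m.first_waiter = &waiter;
	owner.blocked_donor = &donor;  /* donor has been proxying for owner */

	printf("hand off to: %s\n", pick_next_owner(&m)->name);  /* prints "donor" */
	return 0;
}

In the kernel the same decision is of course made under wait_lock and the per-task blocked_lock, and the fallback walks the real mutex wait list rather than a single pointer.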
From nobody Tue Dec 2 00:25:42 2025
Date: Mon, 24 Nov 2025 22:31:03 +0000
In-Reply-To: <20251124223111.3616950-1-jstultz@google.com>
References: <20251124223111.3616950-1-jstultz@google.com>
Message-ID: <20251124223111.3616950-12-jstultz@google.com>
Subject: [PATCH v24 11/11] sched: Migrate whole chain in proxy_migrate_task()
From: John Stultz
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team@android.com

Instead of migrating one task each time through find_proxy_task(), we
can walk up the blocked_donor ptrs and migrate the entire current
chain in one go.

This was broken out of earlier patches and held back while the series
was being stabilized, but I wanted to re-introduce it.

Signed-off-by: John Stultz
---
v12:
* Earlier this was re-using blocked_node, but I hit a race with
  activating blocked entities, and to avoid it introduced a new
  migration_node listhead
v18:
* Add init_task initialization of migration_node as suggested by Suleiman
v22:
* Move migration_node under CONFIG_SCHED_PROXY_EXEC as suggested by
  K Prateek
Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com
---
 include/linux/sched.h |  3 +++
 init/init_task.c      |  3 +++
 kernel/fork.c         |  3 +++
 kernel/sched/core.c   | 24 +++++++++++++++++-------
 4 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178ed37850470..775cc06f756d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1243,6 +1243,9 @@ struct task_struct {
 	struct mutex		*blocked_on;	/* lock we're blocked on */
 	struct task_struct	*blocked_donor;	/* task that is boosting this task */
 	raw_spinlock_t		blocked_lock;
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	struct list_head	migration_node;
+#endif

 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
 	/*
diff --git a/init/init_task.c b/init/init_task.c
index 34853a511b4d8..78fb7cb83fa5d 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -178,6 +178,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 						 &init_task.alloc_lock),
 #endif
 	.blocked_donor = NULL,
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	.migration_node = LIST_HEAD_INIT(init_task.migration_node),
+#endif
 #ifdef CONFIG_RT_MUTEXES
 	.pi_waiters = RB_ROOT_CACHED,
 	.pi_top_task = NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index 0a9a17e25b85d..a7561480e879e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2137,6 +2137,9 @@ __latent_entropy struct task_struct *copy_process(

 	p->blocked_on = NULL;    /* not blocked yet */
 	p->blocked_donor = NULL; /* nobody is boosting p yet */
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	INIT_LIST_HEAD(&p->migration_node);
+#endif

 #ifdef CONFIG_BCACHE
 	p->sequential_io = 0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f42ec01192dc..0c50d154050a3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6723,6 +6723,7 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 				struct task_struct *p, int target_cpu)
 {
 	struct rq *target_rq = cpu_rq(target_cpu);
+	LIST_HEAD(migrate_list);

 	lockdep_assert_rq_held(rq);

@@ -6739,11 +6740,16 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 */
 	proxy_resched_idle(rq);

-	WARN_ON(p == rq->curr);
-
-	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
-	proxy_set_task_cpu(p, target_cpu);
-
+	for (; p; p = p->blocked_donor) {
+		WARN_ON(p == rq->curr);
+		deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+		proxy_set_task_cpu(p, target_cpu);
+		/*
+		 * We can abuse migration_node to migrate the thing,
+		 * because @p was still on the rq.
+		 */
+		list_add(&p->migration_node, &migrate_list);
+	}
 	/*
 	 * We have to zap callbacks before unlocking the rq
 	 * as another CPU may jump in and call sched_balance_rq
@@ -6755,8 +6761,12 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 	raw_spin_rq_unlock(rq);

 	raw_spin_rq_lock(target_rq);
-	activate_task(target_rq, p, 0);
-	wakeup_preempt(target_rq, p, 0);
+	while (!list_empty(&migrate_list)) {
+		p = list_first_entry(&migrate_list, struct task_struct, migration_node);
+		list_del_init(&p->migration_node);
+		activate_task(target_rq, p, 0);
+		wakeup_preempt(target_rq, p, 0);
+	}
 	raw_spin_rq_unlock(target_rq);

 	raw_spin_rq_lock(rq);
--
2.52.0.487.g5c8c507ade-goog
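For illustration, the two-phase chain migration that proxy_migrate_task() performs above can be modeled in plain C: first walk the blocked_donor pointers once, detaching and remembering every task in the chain, then place all of them on the target in a second pass. The struct task and the migrate_chain() helper below are simplified stand-ins for the kernel's types, and run-queue locking and dequeue/enqueue details are omitted:

/* Minimal user-space sketch of migrating a whole blocked_donor chain. */
#include <stdio.h>

#define MAX_CHAIN 8

struct task {
	const char *name;
	struct task *blocked_donor;	/* next task up the boosting chain */
	int cpu;
};

/* Move @p and everyone boosting it to @target_cpu in one go. */
static void migrate_chain(struct task *p, int target_cpu)
{
	struct task *chain[MAX_CHAIN];	/* stand-in for the local migrate_list */
	int n = 0, i;

	/* phase 1: "deactivate" each task on the source and remember it */
	for (; p && n < MAX_CHAIN; p = p->blocked_donor)
		chain[n++] = p;

	/* phase 2: "activate" every remembered task on the target */
	for (i = 0; i < n; i++)
		chain[i]->cpu = target_cpu;
}

int main(void)
{
	struct task c = { "C", NULL, 0 };
	struct task b = { "B", &c, 0 };	/* C is boosting B */
	struct task a = { "A", &b, 0 };	/* B is boosting A */

	migrate_chain(&a, 2);

	printf("A:%d B:%d C:%d\n", a.cpu, b.cpu, c.cpu);  /* all end up on cpu 2 */
	return 0;
}

The kernel version needs the intermediate list because the source and target run-queue locks cannot be held at the same time across the walk; collecting the chain first lets it drop one lock before taking the other, which the sketch glosses over.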