From nobody Fri Oct 3 07:40:23 2025
Date: Thu, 4 Sep 2025 00:21:51 +0000
In-Reply-To: <20250904002201.971268-1-jstultz@google.com>
References: <20250904002201.971268-1-jstultz@google.com>
Message-ID: <20250904002201.971268-2-jstultz@google.com>
Subject: [RESEND][PATCH v21 1/6] locking: Add task::blocked_lock to serialize blocked_on state
From: John Stultz
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
    Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon,
    Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
    K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal,
    kuyo chang, hupu, kernel-team@android.com

So far, we have been able to utilize the mutex::wait_lock for
serializing the blocked_on state, but when we move to proxying
across runqueues, we will need to add more state and a way to
serialize changes to this state in contexts where we don't hold
the mutex::wait_lock.

So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.

Signed-off-by: John Stultz
Reviewed-by: K Prateek Nayak
---
v15:
* Split back out into later in the series
v16:
* Fixups to mark tasks unblocked before sleeping in
  mutex_optimistic_spin()
* Rework to use guard() as suggested by Peter
v19:
* Rework logic for PREEMPT_RT issues reported by K Prateek Nayak
v21:
* After recently thinking more on ww_mutex code, I reworked the
  blocked_lock usage in mutex lock to avoid having to take nested
  locks in the ww_mutex paths, as I was concerned the lock ordering
  constraints weren't as strong as I had previously thought.
Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. 
McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- include/linux/sched.h | 52 +++++++++++++++--------------------- init/init_task.c | 1 + kernel/fork.c | 1 + kernel/locking/mutex-debug.c | 4 +-- kernel/locking/mutex.c | 37 +++++++++++++++---------- kernel/locking/ww_mutex.h | 4 +-- kernel/sched/core.c | 4 ++- 7 files changed, 54 insertions(+), 49 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index f8188b8333503..3ec0ef0d91603 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1235,6 +1235,7 @@ struct task_struct { #endif =20 struct mutex *blocked_on; /* lock we're blocked on */ + raw_spinlock_t blocked_lock; =20 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER /* @@ -2143,57 +2144,48 @@ extern int __cond_resched_rwlock_write(rwlock_t *lo= ck); #ifndef CONFIG_PREEMPT_RT static inline struct mutex *__get_task_blocked_on(struct task_struct *p) { - struct mutex *m =3D p->blocked_on; + lockdep_assert_held_once(&p->blocked_lock); + return p->blocked_on; +} =20 - if (m) - lockdep_assert_held_once(&m->wait_lock); - return m; +static inline struct mutex *get_task_blocked_on(struct task_struct *p) +{ + guard(raw_spinlock_irqsave)(&p->blocked_lock); + return __get_task_blocked_on(p); } =20 static inline void __set_task_blocked_on(struct task_struct *p, struct mut= ex *m) { - struct mutex *blocked_on =3D READ_ONCE(p->blocked_on); - WARN_ON_ONCE(!m); /* The task should only be setting itself as blocked */ WARN_ON_ONCE(p !=3D current); - /* Currently we serialize blocked_on under the mutex::wait_lock */ - lockdep_assert_held_once(&m->wait_lock); + /* Currently we serialize blocked_on under the task::blocked_lock */ + lockdep_assert_held_once(&p->blocked_lock); /* * Check ensure we don't overwrite existing mutex value * with a different mutex. Note, setting it to the same * lock repeatedly is ok. */ - WARN_ON_ONCE(blocked_on && blocked_on !=3D m); - WRITE_ONCE(p->blocked_on, m); -} - -static inline void set_task_blocked_on(struct task_struct *p, struct mutex= *m) -{ - guard(raw_spinlock_irqsave)(&m->wait_lock); - __set_task_blocked_on(p, m); + WARN_ON_ONCE(p->blocked_on && p->blocked_on !=3D m); + p->blocked_on =3D m; } =20 static inline void __clear_task_blocked_on(struct task_struct *p, struct m= utex *m) { - if (m) { - struct mutex *blocked_on =3D READ_ONCE(p->blocked_on); - - /* Currently we serialize blocked_on under the mutex::wait_lock */ - lockdep_assert_held_once(&m->wait_lock); - /* - * There may be cases where we re-clear already cleared - * blocked_on relationships, but make sure we are not - * clearing the relationship with a different lock. - */ - WARN_ON_ONCE(blocked_on && blocked_on !=3D m); - } - WRITE_ONCE(p->blocked_on, NULL); + /* Currently we serialize blocked_on under the task::blocked_lock */ + lockdep_assert_held_once(&p->blocked_lock); + /* + * There may be cases where we re-clear already cleared + * blocked_on relationships, but make sure we are not + * clearing the relationship with a different lock. 
+ */ + WARN_ON_ONCE(m && p->blocked_on && p->blocked_on !=3D m); + p->blocked_on =3D NULL; } =20 static inline void clear_task_blocked_on(struct task_struct *p, struct mut= ex *m) { - guard(raw_spinlock_irqsave)(&m->wait_lock); + guard(raw_spinlock_irqsave)(&p->blocked_lock); __clear_task_blocked_on(p, m); } #else diff --git a/init/init_task.c b/init/init_task.c index e557f622bd906..7e29d86153d9f 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -140,6 +140,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { .journal_info =3D NULL, INIT_CPU_TIMERS(init_task) .pi_lock =3D __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock), + .blocked_lock =3D __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock), .timer_slack_ns =3D 50000, /* 50 usec default slack */ .thread_pid =3D &init_struct_pid, .thread_node =3D LIST_HEAD_INIT(init_signals.thread_head), diff --git a/kernel/fork.c b/kernel/fork.c index af673856499dc..db6d08946ec11 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2030,6 +2030,7 @@ __latent_entropy struct task_struct *copy_process( ftrace_graph_init_task(p); =20 rt_mutex_init_task(p); + raw_spin_lock_init(&p->blocked_lock); =20 lockdep_assert_irqs_enabled(); #ifdef CONFIG_PROVE_LOCKING diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c index 949103fd8e9b5..1d8cff71f65e1 100644 --- a/kernel/locking/mutex-debug.c +++ b/kernel/locking/mutex-debug.c @@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct = mutex_waiter *waiter, lockdep_assert_held(&lock->wait_lock); =20 /* Current thread can't be already blocked (since it's executing!) */ - DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task)); + DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task)); } =20 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *wa= iter, struct task_struct *task) { - struct mutex *blocked_on =3D __get_task_blocked_on(task); + struct mutex *blocked_on =3D get_task_blocked_on(task); =20 DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list)); DEBUG_LOCKS_WARN_ON(waiter->task !=3D task); diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index de7d6702cd96c..fac40c456098e 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -640,6 +640,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int st= ate, unsigned int subclas goto err_early_kill; } =20 + raw_spin_lock(¤t->blocked_lock); __set_task_blocked_on(current, lock); set_current_state(state); trace_contention_begin(lock, LCB_F_MUTEX); @@ -653,8 +654,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int st= ate, unsigned int subclas * the handoff. */ if (__mutex_trylock(lock)) - goto acquired; + break; =20 + raw_spin_unlock(¤t->blocked_lock); /* * Check for signals and kill conditions while holding * wait_lock. This ensures the lock cancellation is ordered @@ -677,12 +679,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int = state, unsigned int subclas =20 first =3D __mutex_waiter_is_first(lock, &waiter); =20 + raw_spin_lock_irqsave(&lock->wait_lock, flags); + raw_spin_lock(¤t->blocked_lock); /* * As we likely have been woken up by task * that has cleared our blocked_on state, re-set * it to the lock we are trying to acquire. 
*/ - set_task_blocked_on(current, lock); + __set_task_blocked_on(current, lock); set_current_state(state); /* * Here we order against unlock; we must either see it change @@ -693,25 +697,30 @@ __mutex_lock_common(struct mutex *lock, unsigned int = state, unsigned int subclas break; =20 if (first) { - trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN); + bool opt_acquired; + /* * mutex_optimistic_spin() can call schedule(), so - * clear blocked on so we don't become unselectable + * we need to release these locks before calling it, + * and clear blocked on so we don't become unselectable * to run. */ - clear_task_blocked_on(current, lock); - if (mutex_optimistic_spin(lock, ww_ctx, &waiter)) + __clear_task_blocked_on(current, lock); + raw_spin_unlock(¤t->blocked_lock); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags); + trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN); + opt_acquired =3D mutex_optimistic_spin(lock, ww_ctx, &waiter); + raw_spin_lock_irqsave(&lock->wait_lock, flags); + raw_spin_lock(¤t->blocked_lock); + __set_task_blocked_on(current, lock); + if (opt_acquired) break; - set_task_blocked_on(current, lock); trace_contention_begin(lock, LCB_F_MUTEX); } - - raw_spin_lock_irqsave(&lock->wait_lock, flags); } - raw_spin_lock_irqsave(&lock->wait_lock, flags); -acquired: __clear_task_blocked_on(current, lock); __set_current_state(TASK_RUNNING); + raw_spin_unlock(¤t->blocked_lock); =20 if (ww_ctx) { /* @@ -740,11 +749,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int = state, unsigned int subclas return 0; =20 err: - __clear_task_blocked_on(current, lock); + clear_task_blocked_on(current, lock); __set_current_state(TASK_RUNNING); __mutex_remove_waiter(lock, &waiter); err_early_kill: - WARN_ON(__get_task_blocked_on(current)); + WARN_ON(get_task_blocked_on(current)); trace_contention_end(lock, ret); raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q); debug_mutex_free_waiter(&waiter); @@ -955,7 +964,7 @@ static noinline void __sched __mutex_unlock_slowpath(st= ruct mutex *lock, unsigne next =3D waiter->task; =20 debug_mutex_wake_waiter(lock, waiter); - __clear_task_blocked_on(next, lock); + clear_task_blocked_on(next, lock); wake_q_add(&wake_q, next); } =20 diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h index 31a785afee6c0..e4a81790ea7dd 100644 --- a/kernel/locking/ww_mutex.h +++ b/kernel/locking/ww_mutex.h @@ -289,7 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER = *waiter, * blocked_on pointer. Otherwise we can see circular * blocked_on relationships that can't resolve. */ - __clear_task_blocked_on(waiter->task, lock); + clear_task_blocked_on(waiter->task, lock); wake_q_add(wake_q, waiter->task); } =20 @@ -347,7 +347,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock, * are waking the mutex owner, who may be currently * blocked on a different mutex. */ - __clear_task_blocked_on(owner, NULL); + clear_task_blocked_on(owner, NULL); wake_q_add(wake_q, owner); } return true; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index be00629f0ba4c..0180853dd48c5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6639,6 +6639,7 @@ static struct task_struct *proxy_deactivate(struct rq= *rq, struct task_struct *d * p->pi_lock * rq->lock * mutex->wait_lock + * p->blocked_lock * * Returns the task that is going to be used as execution context (the one * that is actually going to be run on cpu_of(rq)). 
@@ -6662,8 +6663,9 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	 * and ensure @owner sticks around.
 	 */
 	guard(raw_spinlock)(&mutex->wait_lock);
+	guard(raw_spinlock)(&p->blocked_lock);
 
-	/* Check again that p is blocked with wait_lock held */
+	/* Check again that p is blocked with blocked_lock held */
 	if (mutex != __get_task_blocked_on(p)) {
 		/*
 		 * Something changed in the blocked_on chain and
-- 
2.51.0.338.gd7d06c2dae-goog

From nobody Fri Oct 3 07:40:23 2025
Date: Thu, 4 Sep 2025 00:21:52 +0000
In-Reply-To: <20250904002201.971268-1-jstultz@google.com>
References: <20250904002201.971268-1-jstultz@google.com>
Message-ID: <20250904002201.971268-3-jstultz@google.com>
Subject: [RESEND][PATCH v21 2/6] sched/locking: Add blocked_on_state to provide necessary tri-state for proxy return-migration
From: John Stultz
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
    Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon,
    Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
    K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal,
    kuyo chang, hupu, kernel-team@android.com

As we add functionality to proxy execution, we may migrate a donor
task to a runqueue where it can't run due to cpu affinity. Thus, we
must be careful to ensure we return-migrate the task back to a cpu
in its cpumask when it becomes unblocked.

We therefore need more than just a binary concept of the task being
blocked on a mutex or not. So add a blocked_on_state value to the
task that allows the task to move through BO_RUNNABLE -> BO_BLOCKED
-> BO_WAKING and back to BO_RUNNABLE. This provides a guard state in
BO_WAKING, so we can know the task is no longer blocked but we don't
want to run it until we have potentially done return migration back
to a usable cpu.

Signed-off-by: John Stultz
---
v15:
* Split blocked_on_state into its own patch later in the series, as
  the tri-state isn't necessary until we deal with proxy/return
  migrations
v16:
* Handle case where task in the chain is being set as BO_WAKING by
  another cpu (usually via ww_mutex die code). Make sure we release
  the rq lock so the wakeup can complete.
* Rework to use guard() in find_proxy_task() as suggested by Peter
v18:
* Add initialization of blocked_on_state for init_task
v19:
* PREEMPT_RT build fixups and rework suggested by K Prateek Nayak
v20:
* Simplify one of the blocked_on_state changes to avoid extra
  PREEMPT_RT conditionals
v21:
* Slight reworks due to avoiding nested blocked_lock locking
* Be consistent in use of blocked_on_state helper functions
* Rework calls to proxy_deactivate() to do proper locking around
  blocked_on_state changes that we were cheating in previous versions.
* Minor cleanups, some comment improvements Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- include/linux/sched.h | 80 +++++++++++++++++++++++++++++---------- init/init_task.c | 1 + kernel/fork.c | 1 + kernel/locking/mutex.c | 15 ++++---- kernel/locking/ww_mutex.h | 20 ++++------ kernel/sched/core.c | 44 +++++++++++++++++++-- kernel/sched/sched.h | 2 +- 7 files changed, 120 insertions(+), 43 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 3ec0ef0d91603..5801de1a44a79 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -815,6 +815,12 @@ struct kmap_ctrl { #endif }; =20 +enum blocked_on_state { + BO_RUNNABLE, + BO_BLOCKED, + BO_WAKING, +}; + struct task_struct { #ifdef CONFIG_THREAD_INFO_IN_TASK /* @@ -1234,6 +1240,7 @@ struct task_struct { struct rt_mutex_waiter *pi_blocked_on; #endif =20 + enum blocked_on_state blocked_on_state; struct mutex *blocked_on; /* lock we're blocked on */ raw_spinlock_t blocked_lock; =20 @@ -2141,7 +2148,52 @@ extern int __cond_resched_rwlock_write(rwlock_t *loc= k); __cond_resched_rwlock_write(lock); \ }) =20 -#ifndef CONFIG_PREEMPT_RT +static inline void __force_blocked_on_runnable(struct task_struct *p) +{ + lockdep_assert_held(&p->blocked_lock); + p->blocked_on_state =3D BO_RUNNABLE; +} + +static inline void force_blocked_on_runnable(struct task_struct *p) +{ + guard(raw_spinlock_irqsave)(&p->blocked_lock); + __force_blocked_on_runnable(p); +} + +static inline void __force_blocked_on_blocked(struct task_struct *p) +{ + lockdep_assert_held(&p->blocked_lock); + p->blocked_on_state =3D BO_BLOCKED; +} + +static inline void __set_blocked_on_runnable(struct task_struct *p) +{ + lockdep_assert_held(&p->blocked_lock); + if (p->blocked_on_state =3D=3D BO_WAKING) + p->blocked_on_state =3D BO_RUNNABLE; +} + +static inline void set_blocked_on_runnable(struct task_struct *p) +{ + if (!sched_proxy_exec()) + return; + guard(raw_spinlock_irqsave)(&p->blocked_lock); + __set_blocked_on_runnable(p); +} + +static inline void __set_blocked_on_waking(struct task_struct *p) +{ + lockdep_assert_held(&p->blocked_lock); + if (p->blocked_on_state =3D=3D BO_BLOCKED) + p->blocked_on_state =3D BO_WAKING; +} + +static inline void set_blocked_on_waking(struct task_struct *p) +{ + guard(raw_spinlock_irqsave)(&p->blocked_lock); + __set_blocked_on_waking(p); +} + static inline struct mutex *__get_task_blocked_on(struct task_struct *p) { lockdep_assert_held_once(&p->blocked_lock); @@ -2163,24 +2215,23 @@ static inline void __set_task_blocked_on(struct tas= k_struct *p, struct mutex *m) lockdep_assert_held_once(&p->blocked_lock); /* * Check ensure we don't overwrite existing mutex value - * with a different mutex. Note, setting it to the same - * lock repeatedly is ok. + * with a different mutex. 
*/ - WARN_ON_ONCE(p->blocked_on && p->blocked_on !=3D m); + WARN_ON_ONCE(p->blocked_on); p->blocked_on =3D m; + p->blocked_on_state =3D BO_BLOCKED; } =20 static inline void __clear_task_blocked_on(struct task_struct *p, struct m= utex *m) { + /* The task should only be clearing itself */ + WARN_ON_ONCE(p !=3D current); /* Currently we serialize blocked_on under the task::blocked_lock */ lockdep_assert_held_once(&p->blocked_lock); - /* - * There may be cases where we re-clear already cleared - * blocked_on relationships, but make sure we are not - * clearing the relationship with a different lock. - */ - WARN_ON_ONCE(m && p->blocked_on && p->blocked_on !=3D m); + /* Make sure we are clearing the relationship with the right lock */ + WARN_ON_ONCE(m && p->blocked_on !=3D m); p->blocked_on =3D NULL; + p->blocked_on_state =3D BO_RUNNABLE; } =20 static inline void clear_task_blocked_on(struct task_struct *p, struct mut= ex *m) @@ -2188,15 +2239,6 @@ static inline void clear_task_blocked_on(struct task= _struct *p, struct mutex *m) guard(raw_spinlock_irqsave)(&p->blocked_lock); __clear_task_blocked_on(p, m); } -#else -static inline void __clear_task_blocked_on(struct task_struct *p, struct r= t_mutex *m) -{ -} - -static inline void clear_task_blocked_on(struct task_struct *p, struct rt_= mutex *m) -{ -} -#endif /* !CONFIG_PREEMPT_RT */ =20 static __always_inline bool need_resched(void) { diff --git a/init/init_task.c b/init/init_task.c index 7e29d86153d9f..6d72ec23410a6 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -174,6 +174,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { .mems_allowed_seq =3D SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq, &init_task.alloc_lock), #endif + .blocked_on_state =3D BO_RUNNABLE, #ifdef CONFIG_RT_MUTEXES .pi_waiters =3D RB_ROOT_CACHED, .pi_top_task =3D NULL, diff --git a/kernel/fork.c b/kernel/fork.c index db6d08946ec11..4bd0731995e86 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2129,6 +2129,7 @@ __latent_entropy struct task_struct *copy_process( lockdep_init_task(p); #endif =20 + p->blocked_on_state =3D BO_RUNNABLE; p->blocked_on =3D NULL; /* not blocked yet */ =20 #ifdef CONFIG_BCACHE diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index fac40c456098e..42e4d2e6e4ad4 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -682,11 +682,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int s= tate, unsigned int subclas raw_spin_lock_irqsave(&lock->wait_lock, flags); raw_spin_lock(¤t->blocked_lock); /* - * As we likely have been woken up by task - * that has cleared our blocked_on state, re-set - * it to the lock we are trying to acquire. + * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE */ - __set_task_blocked_on(current, lock); + __force_blocked_on_blocked(current); set_current_state(state); /* * Here we order against unlock; we must either see it change @@ -705,14 +703,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int = state, unsigned int subclas * and clear blocked on so we don't become unselectable * to run. 
*/ - __clear_task_blocked_on(current, lock); + __force_blocked_on_runnable(current); raw_spin_unlock(¤t->blocked_lock); raw_spin_unlock_irqrestore(&lock->wait_lock, flags); trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN); opt_acquired =3D mutex_optimistic_spin(lock, ww_ctx, &waiter); raw_spin_lock_irqsave(&lock->wait_lock, flags); raw_spin_lock(¤t->blocked_lock); - __set_task_blocked_on(current, lock); + __force_blocked_on_blocked(current); if (opt_acquired) break; trace_contention_begin(lock, LCB_F_MUTEX); @@ -963,8 +961,11 @@ static noinline void __sched __mutex_unlock_slowpath(s= truct mutex *lock, unsigne =20 next =3D waiter->task; =20 + raw_spin_lock(&next->blocked_lock); debug_mutex_wake_waiter(lock, waiter); - clear_task_blocked_on(next, lock); + WARN_ON_ONCE(__get_task_blocked_on(next) !=3D lock); + __set_blocked_on_waking(next); + raw_spin_unlock(&next->blocked_lock); wake_q_add(&wake_q, next); } =20 diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h index e4a81790ea7dd..f34363615eb34 100644 --- a/kernel/locking/ww_mutex.h +++ b/kernel/locking/ww_mutex.h @@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITE= R *waiter, debug_mutex_wake_waiter(lock, waiter); #endif /* - * When waking up the task to die, be sure to clear the - * blocked_on pointer. Otherwise we can see circular - * blocked_on relationships that can't resolve. + * When waking up the task to die, be sure to set the + * blocked_on_state to BO_WAKING. Otherwise we can see + * circular blocked_on relationships that can't resolve. */ - clear_task_blocked_on(waiter->task, lock); + set_blocked_on_waking(waiter->task); wake_q_add(wake_q, waiter->task); } =20 @@ -339,15 +339,11 @@ static bool __ww_mutex_wound(struct MUTEX *lock, */ if (owner !=3D current) { /* - * When waking up the task to wound, be sure to clear the - * blocked_on pointer. Otherwise we can see circular - * blocked_on relationships that can't resolve. - * - * NOTE: We pass NULL here instead of lock, because we - * are waking the mutex owner, who may be currently - * blocked on a different mutex. + * When waking up the task to wound, be sure to set the + * blocked_on_state to BO_WAKING. Otherwise we can see + * circular blocked_on relationships that can't resolve. */ - clear_task_blocked_on(owner, NULL); + set_blocked_on_waking(owner); wake_q_add(wake_q, owner); } return true; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0180853dd48c5..e0007660161fa 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4328,6 +4328,12 @@ int try_to_wake_up(struct task_struct *p, unsigned i= nt state, int wake_flags) ttwu_queue(p, cpu, wake_flags); } out: + /* + * For now, if we've been woken up, set us as BO_RUNNABLE + * We will need to be more careful later when handling + * proxy migration + */ + set_blocked_on_runnable(p); if (success) ttwu_stat(p, task_cpu(p), wake_flags); =20 @@ -6623,7 +6629,7 @@ static struct task_struct *proxy_deactivate(struct rq= *rq, struct task_struct *d * as unblocked, as we aren't doing proxy-migrations * yet (more logic will be needed then). */ - donor->blocked_on =3D NULL; + force_blocked_on_runnable(donor); } return NULL; } @@ -6676,20 +6682,41 @@ find_proxy_task(struct rq *rq, struct task_struct *= donor, struct rq_flags *rf) return NULL; } =20 + /* + * If a ww_mutex hits the die/wound case, it marks the task as + * BO_WAKING and calls try_to_wake_up(), so that the mutex + * cycle can be broken and we avoid a deadlock. 
+ * + * However, if at that moment, we are here on the cpu which the + * die/wounded task is enqueued, we might loop on the cycle as + * BO_WAKING still causes task_is_blocked() to return true + * (since we want return migration to occur before we run the + * task). + * + * Unfortunately since we hold the rq lock, it will block + * try_to_wake_up from completing and doing the return + * migration. + * + * So when we hit a !BO_BLOCKED task briefly schedule idle + * so we release the rq and let the wakeup complete. + */ + if (p->blocked_on_state !=3D BO_BLOCKED) + return proxy_resched_idle(rq); + owner =3D __mutex_owner(mutex); if (!owner) { - __clear_task_blocked_on(p, mutex); + __force_blocked_on_runnable(p); return p; } =20 if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) { /* XXX Don't handle blocked owners/delayed dequeue yet */ - return proxy_deactivate(rq, donor); + goto deactivate_donor; } =20 if (task_cpu(owner) !=3D this_cpu) { /* XXX Don't handle migrations yet */ - return proxy_deactivate(rq, donor); + goto deactivate_donor; } =20 if (task_on_rq_migrating(owner)) { @@ -6749,6 +6776,15 @@ find_proxy_task(struct rq *rq, struct task_struct *d= onor, struct rq_flags *rf) =20 WARN_ON_ONCE(owner && !owner->on_rq); return owner; + + /* + * NOTE: This logic is down here, because we need to call + * the functions with the mutex wait_lock and task + * blocked_lock released, so we have to get out of the + * guard() scope. + */ +deactivate_donor: + return proxy_deactivate(rq, donor); } #else /* SCHED_PROXY_EXEC */ static struct task_struct * diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index be9745d104f75..845454ec81a22 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2264,7 +2264,7 @@ static inline bool task_is_blocked(struct task_struct= *p) if (!sched_proxy_exec()) return false; =20 - return !!p->blocked_on; + return !!p->blocked_on && p->blocked_on_state !=3D BO_RUNNABLE; } =20 static inline int task_on_cpu(struct rq *rq, struct task_struct *p) --=20 2.51.0.338.gd7d06c2dae-goog From nobody Fri Oct 3 07:40:23 2025 Received: from mail-pf1-f201.google.com (mail-pf1-f201.google.com [209.85.210.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B3C8119CCF5 for ; Thu, 4 Sep 2025 00:22:09 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1756945331; cv=none; b=A2m9yT7JbsDWDf8yILBCtLKGYakKyVf8ZDHDiPtbyxJ3wOL47a6s3CZjWBtJXQNjXEGXjBFkaxTlsiNDdwCIJIurbOafN9PF0JUZjxONSr8qNNBIepUiTrJRtA8vA7iBSSqL2y7Hmv+Kga3JRVBV6dOJj0+pGXvT1d8gE1/9YO8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1756945331; c=relaxed/simple; bh=0C0beGOnuMmvTJ05/X24Md5t0O0RTWRq6BF8Z8LngD0=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=kjuPjGEZk7SwkVx8+wb1COogQNjfLd5lubuPvXrMrBzYrfOJEMMRxExL31BwbfUGxvUr0egqsxydxYoMzb168igJtscTbuvNHyoxbJV+7LKouiwE2HfJZLFGUJbZsCeVqOk3VXJsDdmUdM49jQ5QjRGBVZ1BoAxOH17sdrART9w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=hPD9fM9C; arc=none smtp.client-ip=209.85.210.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) 
Date: Thu, 4 Sep 2025 00:21:53 +0000
In-Reply-To: <20250904002201.971268-1-jstultz@google.com>
References: <20250904002201.971268-1-jstultz@google.com>
Message-ID: <20250904002201.971268-4-jstultz@google.com>
Subject: [RESEND][PATCH v21 3/6] sched: Add logic to zap balance callbacks if we pick again
From: John Stultz
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
    Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon,
    Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
    K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal,
    kuyo chang, hupu, kernel-team@android.com

With proxy-exec, a task is selected to run via pick_next_task(), and
then if it is a mutex blocked task, we call find_proxy_task() to find
a runnable owner.
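
As a minimal, self-contained sketch of that flow (the types and helpers
here are toy stand-ins invented for illustration only; the real
find_proxy_task() in the patches below also deals with runqueue and
blocked_lock locking, wakeups and migration):

#include <stdio.h>

/* Toy stand-ins for the kernel's task_struct and mutex; illustrative only. */
struct toy_mutex;

struct toy_task {
	const char *name;
	struct toy_mutex *blocked_on;	/* mutex this task is waiting on, if any */
};

struct toy_mutex {
	struct toy_task *owner;		/* current lock holder */
};

/*
 * Walk the blocked_on chain from the selected (donor) task until we
 * reach a task that is not itself blocked -- the owner that can make
 * progress.  This mirrors the role find_proxy_task() plays after
 * pick_next_task() has picked a mutex-blocked donor.
 */
static struct toy_task *toy_find_proxy_task(struct toy_task *donor)
{
	struct toy_task *p = donor;

	while (p->blocked_on && p->blocked_on->owner)
		p = p->blocked_on->owner;
	return p;
}

int main(void)
{
	struct toy_task owner = { "owner", NULL };
	struct toy_mutex m = { &owner };
	struct toy_task waiter = { "waiter", &m };

	/* The waiter's scheduling context was picked, but the owner gets to run. */
	printf("picked %s, running %s\n", waiter.name,
	       toy_find_proxy_task(&waiter)->name);
	return 0;
}
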
If the runnable owner is on another cpu, we will need to migrate the
selected donor task away, after which we goto pick_again so that
pick_next_task() can choose something else.

However, in the first call to pick_next_task(), we may have had a
balance_callback setup by the class scheduler. After we pick again,
it's possible pick_next_task_fair() will be called, which calls
sched_balance_newidle() and sched_balance_rq(). This will throw a
warning:

[    8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
[    8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
...
[    8.796467] Call Trace:
[    8.796467]
[    8.796467]  ? __warn.cold+0xb2/0x14e
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  ? report_bug+0x107/0x1a0
[    8.796467]  ? handle_bug+0x54/0x90
[    8.796467]  ? exc_invalid_op+0x17/0x70
[    8.796467]  ? asm_exc_invalid_op+0x1a/0x20
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  sched_balance_newidle+0x295/0x820
[    8.796467]  pick_next_task_fair+0x51/0x3f0
[    8.796467]  __schedule+0x23a/0x14b0
[    8.796467]  ? lock_release+0x16d/0x2e0
[    8.796467]  schedule+0x3d/0x150
[    8.796467]  worker_thread+0xb5/0x350
[    8.796467]  ? __pfx_worker_thread+0x10/0x10
[    8.796467]  kthread+0xee/0x120
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork+0x31/0x50
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork_asm+0x1a/0x30
[    8.796467]

This is because if an RT task was originally picked, it will set up
the rq->balance_callback with push_rt_tasks() via set_next_task_rt().
Once the task is migrated away and we pick again, we haven't processed
any balance callbacks, so rq->balance_callback is not in the same
state as it was the first time pick_next_task was called.

To handle this, add a zap_balance_callbacks() helper function which
cleans up the balance callbacks without running them. This should be
ok, as we are effectively undoing the state set in the first call to
pick_next_task(), and when we pick again, the new callback can be
configured for the donor task actually selected.

Signed-off-by: John Stultz
---
v20:
* Tweaked to avoid build issues with different configs
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e0007660161fa..01bf5ef8d9fcc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5001,6 +5001,40 @@ static inline void finish_task(struct task_struct *prev)
 	smp_store_release(&prev->on_cpu, 0);
 }
 
+#if defined(CONFIG_SCHED_PROXY_EXEC)
+/*
+ * Only called from __schedule context
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added on the next time around. This is similar
+ * handling to running the callbacks, except we just don't call
+ * them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+	struct balance_callback *next, *head;
+	bool found = false;
+
+	lockdep_assert_rq_held(rq);
+
+	head = rq->balance_callback;
+	while (head) {
+		if (head == &balance_push_callback)
+			found = true;
+		next = head->next;
+		head->next = NULL;
+		head = next;
+	}
+	rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+#else
+static inline void zap_balance_callbacks(struct rq *rq)
+{
+}
+#endif
+
 static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
 {
 	void (*func)(struct rq *rq);
@@ -6941,8 +6975,11 @@ static void __sched notrace __schedule(int sched_mode)
 		rq_set_donor(rq, next);
 		if (unlikely(task_is_blocked(next))) {
 			next = find_proxy_task(rq, next, &rf);
-			if (!next)
+			if (!next) {
+				/* zap the balance_callbacks before picking again */
+				zap_balance_callbacks(rq);
 				goto pick_again;
+			}
 			if (next == rq->idle)
 				goto keep_resched;
 		}
-- 
2.51.0.338.gd7d06c2dae-goog

From nobody Fri Oct 3 07:40:23 2025
Date: Thu, 4 Sep 2025 00:21:54 +0000
In-Reply-To: <20250904002201.971268-1-jstultz@google.com>
References: <20250904002201.971268-1-jstultz@google.com>
Message-ID: <20250904002201.971268-5-jstultz@google.com>
Subject: [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration (and return migration)
From: John Stultz
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
    Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon,
    Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
    K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal,
    kuyo chang, hupu, kernel-team@android.com

Add logic to handle migrating a blocked waiter to a remote cpu where
the lock owner is runnable.

Additionally, as the blocked task may not be able to run on the
remote cpu, add logic to handle return migration once the waiting
task is given the mutex.

Because tasks may get migrated to where they cannot run, also modify
the scheduling classes to avoid sched class migrations on mutex
blocked tasks, leaving find_proxy_task() and related logic to do the
migrations and return migrations.

This was split out from the larger proxy patch, and significantly
reworked.
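
A rough, self-contained sketch of the migrate/return-migrate idea
(toy types invented for illustration; the real handling below lives
in proxy_set_task_cpu(), proxy_needs_return(), proxy_migrate_task()
and proxy_force_return(), and must also cope with runqueue locking
and a wake_cpu that is no longer allowed):

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of a migratable waiter; the fields loosely mirror the
 * cpu/wake_cpu/cpus_mask ideas used by the real code below.
 */
struct toy_task {
	unsigned int allowed_mask;	/* one bit per cpu the task may run on */
	int cpu;			/* runqueue the task currently sits on */
	int wake_cpu;			/* where to return the task once unblocked */
};

static bool toy_cpu_allowed(const struct toy_task *p, int cpu)
{
	return p->allowed_mask & (1u << cpu);
}

/* Blocked-waiter migration: follow the owner, but remember where we came from. */
static void toy_proxy_migrate(struct toy_task *p, int owner_cpu)
{
	p->wake_cpu = p->cpu;	/* preserve the original placement */
	p->cpu = owner_cpu;	/* may violate affinity; tolerable while blocked */
}

/* Return migration: once the task is granted the mutex, put it somewhere it may run. */
static void toy_return_migrate(struct toy_task *p)
{
	if (!toy_cpu_allowed(p, p->cpu))
		p->cpu = p->wake_cpu;	/* the real code also handles wake_cpu being disallowed */
}

int main(void)
{
	struct toy_task waiter = { .allowed_mask = 0x3, .cpu = 1, .wake_cpu = 1 };

	toy_proxy_migrate(&waiter, 5);	/* owner runs on cpu 5, outside the waiter's mask */
	toy_return_migrate(&waiter);	/* waiter got the lock and is waking up */
	printf("waiter returned to cpu %d\n", waiter.cpu);
	return 0;
}
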
Credits for the original patch go to: Peter Zijlstra (Intel) Juri Lelli Valentin Schneider Connor O'Brien Signed-off-by: John Stultz --- v6: * Integrated sched_proxy_exec() check in proxy_return_migration() * Minor cleanups to diff * Unpin the rq before calling __balance_callbacks() * Tweak proxy migrate to migrate deeper task in chain, to avoid tasks pingponging between rqs v7: * Fixup for unused function arguments * Switch from that_rq -> target_rq, other minor tweaks, and typo fixes suggested by Metin Kaya * Switch back to doing return migration in the ttwu path, which avoids nasty lock juggling and performance issues * Fixes for UP builds v8: * More simplifications from Metin Kaya * Fixes for null owner case, including doing return migration * Cleanup proxy_needs_return logic v9: * Narrow logic in ttwu that sets BO_RUNNABLE, to avoid missed return migrations * Switch to using zap_balance_callbacks rathern then running them when we are dropping rq locks for proxy_migration. * Drop task_is_blocked check in sched_submit_work as suggested by Metin (may re-add later if this causes trouble) * Do return migration when we're not on wake_cpu. This avoids bad task placement caused by proxy migrations raised by Xuewen Yan * Fix to call set_next_task(rq->curr) prior to dropping rq lock to avoid rq->curr getting migrated before we have actually switched from it * Cleanup to re-use proxy_resched_idle() instead of open coding it in proxy_migrate_task() * Fix return migration not to use DEQUEUE_SLEEP, so that we properly see the task as task_on_rq_migrating() after it is dequeued but before set_task_cpu() has been called on it * Fix to broaden find_proxy_task() checks to avoid race where a task is dequeued off the rq due to return migration, but set_task_cpu() and the enqueue on another rq happened after we checked task_cpu(owner). This ensures we don't proxy using a task that is not actually on our runqueue. * Cleanup to avoid the locked BO_WAKING->BO_RUNNABLE transition in try_to_wake_up() if proxy execution isn't enabled. * Cleanup to improve comment in proxy_migrate_task() explaining the set_next_task(rq->curr) logic * Cleanup deadline.c change to stylistically match rt.c change * Numerous cleanups suggested by Metin v10: * Drop WARN_ON(task_is_blocked(p)) in ttwu current case v11: * Include proxy_set_task_cpu from later in the series to this change so we can use it, rather then reworking logic later in the series. * Fix problem with return migration, where affinity was changed and wake_cpu was left outside the affinity mask. * Avoid reading the owner's cpu twice (as it might change inbetween) to avoid occasional migration-to-same-cpu edge cases * Add extra WARN_ON checks for wake_cpu and return migration edge cases. * Typo fix from Metin v13: * As we set ret, return it, not just NULL (pulling this change in from later patch) * Avoid deadlock between try_to_wake_up() and find_proxy_task() when blocked_on cycle with ww_mutex is trying a mid-chain wakeup. 
* Tweaks to use new __set_blocked_on_runnable() helper * Potential fix for incorrectly updated task->dl_server issues * Minor comment improvements * Add logic to handle missed wakeups, in that case doing return migration from the find_proxy_task() path * Minor cleanups v14: * Improve edge cases where we wouldn't set the task as BO_RUNNABLE v15: * Added comment to better describe proxy_needs_return() as suggested by Qais * Build fixes for !CONFIG_SMP reported by Maciej =C5=BBenczykowski * Adds fix for re-evaluating proxy_needs_return when sched_proxy_exec() is disabled, reported and diagnosed by: kuyo chang v16: * Larger rework of needs_return logic in find_proxy_task, in order to avoid problems with cpuhotplug * Rework to use guard() as suggested by Peter v18: * Integrate optimization suggested by Suleiman to do the checks for sleeping owners before checking if the task_cpu is this_cpu, so that we can avoid needlessly proxy-migrating tasks to only then dequeue them. Also check if migrating last. * Improve comments around guard locking * Include tweak to ttwu_runnable() as suggested by hupu * Rework the logic releasing the rq->donor reference before letting go of the rqlock. Just use rq->idle. * Go back to doing return migration on BO_WAKING owners, as I was hitting some softlockups caused by running tasks not making it out of BO_WAKING. v19: * Fixed proxy_force_return() logic for !SMP cases v21: * Reworked donor deactivation for unhandled sleeping owners * Commit message tweaks Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- kernel/sched/core.c | 247 +++++++++++++++++++++++++++++++++++++++----- kernel/sched/fair.c | 3 +- 2 files changed, 222 insertions(+), 28 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 01bf5ef8d9fcc..0f824446c6046 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3157,6 +3157,14 @@ static int __set_cpus_allowed_ptr_locked(struct task= _struct *p, =20 __do_set_cpus_allowed(p, ctx); =20 + /* + * It might be that the p->wake_cpu is no longer + * allowed, so set it to the dest_cpu so return + * migration doesn't send it to an invalid cpu + */ + if (!is_cpu_allowed(p, p->wake_cpu)) + p->wake_cpu =3D dest_cpu; + return affine_move_task(rq, p, rf, dest_cpu, ctx->flags); =20 out: @@ -3717,6 +3725,67 @@ static inline void ttwu_do_wakeup(struct task_struct= *p) trace_sched_wakeup(p); } =20 +#ifdef CONFIG_SCHED_PROXY_EXEC +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu) +{ + unsigned int wake_cpu; + + /* + * Since we are enqueuing a blocked task on a cpu it may + * not be able to run on, preserve wake_cpu when we + * __set_task_cpu so we can return the task to where it + * was previously runnable. 
+ */ + wake_cpu =3D p->wake_cpu; + __set_task_cpu(p, cpu); + p->wake_cpu =3D wake_cpu; +} + +static bool proxy_task_runnable_but_waking(struct task_struct *p) +{ + if (!sched_proxy_exec()) + return false; + return (READ_ONCE(p->__state) =3D=3D TASK_RUNNING && + READ_ONCE(p->blocked_on_state) =3D=3D BO_WAKING); +} +#else /* !CONFIG_SCHED_PROXY_EXEC */ +static bool proxy_task_runnable_but_waking(struct task_struct *p) +{ + return false; +} +#endif /* CONFIG_SCHED_PROXY_EXEC */ + +/* + * Checks to see if task p has been proxy-migrated to another rq + * and needs to be returned. If so, we deactivate the task here + * so that it can be properly woken up on the p->wake_cpu + * (or whichever cpu select_task_rq() picks at the bottom of + * try_to_wake_up() + */ +static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p) +{ + bool ret =3D false; + + if (!sched_proxy_exec()) + return false; + + raw_spin_lock(&p->blocked_lock); + if (__get_task_blocked_on(p) && p->blocked_on_state =3D=3D BO_WAKING) { + if (!task_current(rq, p) && (p->wake_cpu !=3D cpu_of(rq))) { + if (task_current_donor(rq, p)) { + put_prev_task(rq, p); + rq_set_donor(rq, rq->idle); + } + deactivate_task(rq, p, DEQUEUE_NOCLOCK); + ret =3D true; + } + __set_blocked_on_runnable(p); + resched_curr(rq); + } + raw_spin_unlock(&p->blocked_lock); + return ret; +} + static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, struct rq_flags *rf) @@ -3802,6 +3871,8 @@ static int ttwu_runnable(struct task_struct *p, int w= ake_flags) update_rq_clock(rq); if (p->se.sched_delayed) enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED); + if (proxy_needs_return(rq, p)) + goto out; if (!task_on_cpu(rq, p)) { /* * When on_rq && !on_cpu the task is preempted, see if @@ -3812,6 +3883,7 @@ static int ttwu_runnable(struct task_struct *p, int w= ake_flags) ttwu_do_wakeup(p); ret =3D 1; } +out: __task_rq_unlock(rq, &rf); =20 return ret; @@ -4199,6 +4271,8 @@ int try_to_wake_up(struct task_struct *p, unsigned in= t state, int wake_flags) * it disabling IRQs (this allows not taking ->pi_lock). */ WARN_ON_ONCE(p->se.sched_delayed); + /* If current is waking up, we know we can run here, so set BO_RUNNBLE */ + set_blocked_on_runnable(p); if (!ttwu_state_match(p, state, &success)) goto out; =20 @@ -4215,8 +4289,15 @@ int try_to_wake_up(struct task_struct *p, unsigned i= nt state, int wake_flags) */ scoped_guard (raw_spinlock_irqsave, &p->pi_lock) { smp_mb__after_spinlock(); - if (!ttwu_state_match(p, state, &success)) - break; + if (!ttwu_state_match(p, state, &success)) { + /* + * If we're already TASK_RUNNING, and BO_WAKING + * continue on to ttwu_runnable check to force + * proxy_needs_return evaluation + */ + if (!proxy_task_runnable_but_waking(p)) + break; + } =20 trace_sched_waking(p); =20 @@ -4278,6 +4359,7 @@ int try_to_wake_up(struct task_struct *p, unsigned in= t state, int wake_flags) * enqueue, such as ttwu_queue_wakelist(). 
*/ WRITE_ONCE(p->__state, TASK_WAKING); + set_blocked_on_runnable(p); =20 /* * If the owning (remote) CPU is still in the middle of schedule() with @@ -4328,12 +4410,6 @@ int try_to_wake_up(struct task_struct *p, unsigned i= nt state, int wake_flags) ttwu_queue(p, cpu, wake_flags); } out: - /* - * For now, if we've been woken up, set us as BO_RUNNABLE - * We will need to be more careful later when handling - * proxy migration - */ - set_blocked_on_runnable(p); if (success) ttwu_stat(p, task_cpu(p), wake_flags); =20 @@ -6635,7 +6711,7 @@ static inline struct task_struct *proxy_resched_idle(= struct rq *rq) return rq->idle; } =20 -static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor) +static bool proxy_deactivate(struct rq *rq, struct task_struct *donor) { unsigned long state =3D READ_ONCE(donor->__state); =20 @@ -6655,17 +6731,98 @@ static bool __proxy_deactivate(struct rq *rq, struc= t task_struct *donor) return try_to_block_task(rq, donor, &state, true); } =20 -static struct task_struct *proxy_deactivate(struct rq *rq, struct task_str= uct *donor) +/* + * If the blocked-on relationship crosses CPUs, migrate @p to the + * owner's CPU. + * + * This is because we must respect the CPU affinity of execution + * contexts (owner) but we can ignore affinity for scheduling + * contexts (@p). So we have to move scheduling contexts towards + * potential execution contexts. + * + * Note: The owner can disappear, but simply migrate to @target_cpu + * and leave that CPU to sort things out. + */ +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf, + struct task_struct *p, int target_cpu) { - if (!__proxy_deactivate(rq, donor)) { - /* - * XXX: For now, if deactivation failed, set donor - * as unblocked, as we aren't doing proxy-migrations - * yet (more logic will be needed then). - */ - force_blocked_on_runnable(donor); - } - return NULL; + struct rq *target_rq =3D cpu_rq(target_cpu); + + lockdep_assert_rq_held(rq); + + /* + * Since we're going to drop @rq, we have to put(@rq->donor) first, + * otherwise we have a reference that no longer belongs to us. + * + * Additionally, as we put_prev_task(prev) earlier, its possible that + * prev will migrate away as soon as we drop the rq lock, however we + * still have it marked as rq->curr, as we've not yet switched tasks. + * + * After the migration, we are going to pick_again in the __schedule + * logic, so backtrack a bit before we release the lock: + * Put rq->donor, and set rq->curr as rq->donor and set_next_task, + * so that we're close to the situation we had entering __schedule + * the first time. + * + * Then when we re-aquire the lock, we will re-put rq->curr then + * rq_set_donor(rq->idle) and set_next_task(rq->idle), before + * picking again. 
+ */ + /* XXX - Added to address problems with changed dl_server semantics - dou= ble check */ + __put_prev_set_next_dl_server(rq, rq->donor, rq->curr); + put_prev_task(rq, rq->donor); + rq_set_donor(rq, rq->idle); + set_next_task(rq, rq->idle); + + WARN_ON(p =3D=3D rq->curr); + + deactivate_task(rq, p, 0); + proxy_set_task_cpu(p, target_cpu); + + zap_balance_callbacks(rq); + rq_unpin_lock(rq, rf); + raw_spin_rq_unlock(rq); + raw_spin_rq_lock(target_rq); + + activate_task(target_rq, p, 0); + wakeup_preempt(target_rq, p, 0); + + raw_spin_rq_unlock(target_rq); + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); +} + +static void proxy_force_return(struct rq *rq, struct rq_flags *rf, + struct task_struct *p) +{ + lockdep_assert_rq_held(rq); + + put_prev_task(rq, rq->donor); + rq_set_donor(rq, rq->idle); + set_next_task(rq, rq->idle); + + WARN_ON(p =3D=3D rq->curr); + + set_blocked_on_waking(p); + get_task_struct(p); + block_task(rq, p, 0); + + zap_balance_callbacks(rq); + rq_unpin_lock(rq, rf); + raw_spin_rq_unlock(rq); + + wake_up_process(p); + put_task_struct(p); + + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); +} + +static inline bool proxy_can_run_here(struct rq *rq, struct task_struct *p) +{ + if (p =3D=3D rq->curr || p->wake_cpu =3D=3D cpu_of(rq)) + return true; + return false; } =20 /* @@ -6688,9 +6845,11 @@ static struct task_struct * find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags = *rf) { struct task_struct *owner =3D NULL; + bool curr_in_chain =3D false; int this_cpu =3D cpu_of(rq); struct task_struct *p; struct mutex *mutex; + int owner_cpu; =20 /* Follow blocked_on chain. */ for (p =3D donor; task_is_blocked(p); p =3D owner) { @@ -6716,6 +6875,10 @@ find_proxy_task(struct rq *rq, struct task_struct *d= onor, struct rq_flags *rf) return NULL; } =20 + /* Double check blocked_on_state now we're holding the lock */ + if (p->blocked_on_state =3D=3D BO_RUNNABLE) + return p; + /* * If a ww_mutex hits the die/wound case, it marks the task as * BO_WAKING and calls try_to_wake_up(), so that the mutex @@ -6731,26 +6894,46 @@ find_proxy_task(struct rq *rq, struct task_struct *= donor, struct rq_flags *rf) * try_to_wake_up from completing and doing the return * migration. * - * So when we hit a !BO_BLOCKED task briefly schedule idle - * so we release the rq and let the wakeup complete. + * So when we hit a BO_WAKING task try to wake it up ourselves. */ - if (p->blocked_on_state !=3D BO_BLOCKED) - return proxy_resched_idle(rq); + if (p->blocked_on_state =3D=3D BO_WAKING) { + if (task_current(rq, p)) { + /* If its current just set it runnable */ + __force_blocked_on_runnable(p); + return p; + } + goto needs_return; + } + + if (task_current(rq, p)) + curr_in_chain =3D true; =20 owner =3D __mutex_owner(mutex); if (!owner) { + /* If the owner is null, we may have some work to do */ + if (!proxy_can_run_here(rq, p)) + goto needs_return; + __force_blocked_on_runnable(p); return p; } =20 if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) { /* XXX Don't handle blocked owners/delayed dequeue yet */ + if (curr_in_chain) + return proxy_resched_idle(rq); goto deactivate_donor; } =20 - if (task_cpu(owner) !=3D this_cpu) { - /* XXX Don't handle migrations yet */ - goto deactivate_donor; + owner_cpu =3D task_cpu(owner); + if (owner_cpu !=3D this_cpu) { + /* + * @owner can disappear, simply migrate to @owner_cpu + * and leave that CPU to sort things out. 
+	 */
+	if (curr_in_chain)
+		return proxy_resched_idle(rq);
+	goto migrate;
 	}
 
 	if (task_on_rq_migrating(owner)) {
@@ -6817,8 +7000,18 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
	 * blocked_lock released, so we have to get out of the
	 * guard() scope.
	 */
+migrate:
+	proxy_migrate_task(rq, rf, p, owner_cpu);
+	return NULL;
+needs_return:
+	proxy_force_return(rq, rf, p);
+	return NULL;
 deactivate_donor:
-	return proxy_deactivate(rq, donor);
+	if (!proxy_deactivate(rq, donor)) {
+		p = donor;
+		goto needs_return;
+	}
+	return NULL;
 }
 #else /* SCHED_PROXY_EXEC */
 static struct task_struct *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c2..cc531eb939831 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		se = &p->se;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (prev->sched_class != &fair_sched_class)
+	if (prev->sched_class != &fair_sched_class ||
+	    rq->curr != rq->donor)
 		goto simple;
 
 	__put_prev_set_next_dl_server(rq, prev, p);
-- 
2.51.0.338.gd7d06c2dae-goog

From nobody Fri Oct 3 07:40:23 2025
Date: Thu, 4 Sep 2025 00:21:55 +0000
In-Reply-To: <20250904002201.971268-1-jstultz@google.com>
References: <20250904002201.971268-1-jstultz@google.com>
Message-ID: <20250904002201.971268-6-jstultz@google.com>
Subject: [RESEND][PATCH v21 5/6] sched: Add blocked_donor link to task for smarter mutex handoffs
From: John Stultz
To: LKML
Cc: Peter Zijlstra , Juri Lelli , Valentin Schneider , "Connor O'Brien" , John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com

From: Peter Zijlstra

Add a link to the task this task is proxying for, and use it so the
mutex owner can do an intelligent hand-off of the mutex to the task
on whose behalf the owner is running.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Juri Lelli
Signed-off-by: Valentin Schneider
Signed-off-by: Connor O'Brien
[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: John Stultz
---
v5:
* Split out from larger proxy patch
v6:
* Moved proxied value from earlier patch to this one where it is actually used
* Rework logic to check sched_proxy_exec() instead of using ifdefs
* Moved comment change to this patch where it makes sense
v7:
* Use a more descriptive term than "us" in comments, as suggested by Metin Kaya.
* Minor typo fixup from Metin Kaya * Reworked proxied variable to prev_not_proxied to simplify usage v8: * Use helper for donor blocked_on_state transition v9: * Re-add mutex lock handoff in the unlock path, but only when we have a blocked donor * Slight reword of commit message suggested by Metin v18: * Add task_init initialization for blocked_donor, suggested by Suleiman Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E. McKenney" Cc: Metin Kaya Cc: Xuewen Yan Cc: K Prateek Nayak Cc: Thomas Gleixner Cc: Daniel Lezcano Cc: Suleiman Souhlal Cc: kuyo chang Cc: hupu Cc: kernel-team@android.com --- include/linux/sched.h | 1 + init/init_task.c | 1 + kernel/fork.c | 1 + kernel/locking/mutex.c | 41 ++++++++++++++++++++++++++++++++++++++--- kernel/sched/core.c | 18 ++++++++++++++++-- 5 files changed, 57 insertions(+), 5 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 5801de1a44a79..ab12eb738c440 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1242,6 +1242,7 @@ struct task_struct { =20 enum blocked_on_state blocked_on_state; struct mutex *blocked_on; /* lock we're blocked on */ + struct task_struct *blocked_donor; /* task that is boosting this task */ raw_spinlock_t blocked_lock; =20 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER diff --git a/init/init_task.c b/init/init_task.c index 6d72ec23410a6..627bbd8953e88 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -175,6 +175,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { &init_task.alloc_lock), #endif .blocked_on_state =3D BO_RUNNABLE, + .blocked_donor =3D NULL, #ifdef CONFIG_RT_MUTEXES .pi_waiters =3D RB_ROOT_CACHED, .pi_top_task =3D NULL, diff --git a/kernel/fork.c b/kernel/fork.c index 4bd0731995e86..86fe43ee35952 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2131,6 +2131,7 @@ __latent_entropy struct task_struct *copy_process( =20 p->blocked_on_state =3D BO_RUNNABLE; p->blocked_on =3D NULL; /* not blocked yet */ + p->blocked_donor =3D NULL; /* nobody is boosting p yet */ =20 #ifdef CONFIG_BCACHE p->sequential_io =3D 0; diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 42e4d2e6e4ad4..76cba3580fce7 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -921,7 +921,7 @@ EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible); */ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, u= nsigned long ip) { - struct task_struct *next =3D NULL; + struct task_struct *donor, *next =3D NULL; DEFINE_WAKE_Q(wake_q); unsigned long owner; unsigned long flags; @@ -940,6 +940,12 @@ static noinline void __sched __mutex_unlock_slowpath(s= truct mutex *lock, unsigne MUTEX_WARN_ON(__owner_task(owner) !=3D current); MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP); =20 + if (sched_proxy_exec() && current->blocked_donor) { + /* force handoff if we have a blocked_donor */ + owner =3D MUTEX_FLAG_HANDOFF; + break; + } + if (owner & MUTEX_FLAG_HANDOFF) break; =20 @@ -953,7 +959,34 @@ static noinline void __sched __mutex_unlock_slowpath(s= truct mutex *lock, unsigne =20 raw_spin_lock_irqsave(&lock->wait_lock, flags); debug_mutex_unlock(lock); - if (!list_empty(&lock->wait_list)) { + + if (sched_proxy_exec()) { + raw_spin_lock(¤t->blocked_lock); + /* + * If we have a task boosting current, and that task was boosting + * current through this lock, 
hand the lock to that task, as that + * is the highest waiter, as selected by the scheduling function. + */ + donor =3D current->blocked_donor; + if (donor) { + struct mutex *next_lock; + + raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING); + next_lock =3D __get_task_blocked_on(donor); + if (next_lock =3D=3D lock) { + next =3D donor; + __set_blocked_on_waking(donor); + wake_q_add(&wake_q, donor); + current->blocked_donor =3D NULL; + } + raw_spin_unlock(&donor->blocked_lock); + } + } + + /* + * Failing that, pick any on the wait list. + */ + if (!next && !list_empty(&lock->wait_list)) { /* get the first entry from the wait-list: */ struct mutex_waiter *waiter =3D list_first_entry(&lock->wait_list, @@ -961,7 +994,7 @@ static noinline void __sched __mutex_unlock_slowpath(st= ruct mutex *lock, unsigne =20 next =3D waiter->task; =20 - raw_spin_lock(&next->blocked_lock); + raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING); debug_mutex_wake_waiter(lock, waiter); WARN_ON_ONCE(__get_task_blocked_on(next) !=3D lock); __set_blocked_on_waking(next); @@ -972,6 +1005,8 @@ static noinline void __sched __mutex_unlock_slowpath(s= truct mutex *lock, unsigne if (owner & MUTEX_FLAG_HANDOFF) __mutex_handoff(lock, next); =20 + if (sched_proxy_exec()) + raw_spin_unlock(¤t->blocked_lock); raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q); } =20 diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0f824446c6046..cac03f68cbcce 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6829,7 +6829,17 @@ static inline bool proxy_can_run_here(struct rq *rq,= struct task_struct *p) * Find runnable lock owner to proxy for mutex blocked donor * * Follow the blocked-on relation: - * task->blocked_on -> mutex->owner -> task... + * + * ,-> task + * | | blocked-on + * | v + * blocked_donor | mutex + * | | owner + * | v + * `-- task + * + * and set the blocked_donor relation, this latter is used by the mutex + * code to find which (blocked) task to hand-off to. * * Lock order: * @@ -6989,6 +6999,7 @@ find_proxy_task(struct rq *rq, struct task_struct *do= nor, struct rq_flags *rf) * rq, therefore holding @rq->lock is sufficient to * guarantee its existence, as per ttwu_remote(). 
 */
+	owner->blocked_donor = p;
 	}
 
 	WARN_ON_ONCE(owner && !owner->on_rq);
@@ -7091,6 +7102,7 @@ static void __sched notrace __schedule(int sched_mode)
 	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
+	bool prev_not_proxied;
 	int cpu;
 
 	/* Trace preemptions consistently with task switches */
@@ -7163,9 +7175,11 @@ static void __sched notrace __schedule(int sched_mode)
 		switch_count = &prev->nvcsw;
 	}
 
+	prev_not_proxied = !prev->blocked_donor;
 pick_again:
 	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
+	next->blocked_donor = NULL;
 	if (unlikely(task_is_blocked(next))) {
 		next = find_proxy_task(rq, next, &rf);
 		if (!next) {
@@ -7229,7 +7243,7 @@ static void __sched notrace __schedule(int sched_mode)
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
 		/* In case next was already curr but just got blocked_donor */
-		if (!task_current_donor(rq, next))
+		if (prev_not_proxied && next->blocked_donor)
 			proxy_tag_curr(rq, next);
 
 		rq_unpin_lock(rq, &rf);
-- 
2.51.0.338.gd7d06c2dae-goog

From nobody Fri Oct 3 07:40:23 2025
Date: Thu, 4 Sep 2025 00:21:56 +0000
In-Reply-To: <20250904002201.971268-1-jstultz@google.com>
References: <20250904002201.971268-1-jstultz@google.com>
Message-ID: <20250904002201.971268-7-jstultz@google.com>
Subject: [RESEND][PATCH v21 6/6] sched: Migrate whole chain in proxy_migrate_task()
From: John Stultz
To: LKML
Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Will Deacon , Waiman Long , Boqun Feng , "Paul E. McKenney" , Metin Kaya , Xuewen Yan , K Prateek Nayak , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com

Instead of migrating one task each time through find_proxy_task(), we
can walk up the blocked_donor ptrs and migrate the entire current
chain in one go.

This was broken out of earlier patches and held back while the series
was being stabilized, but I wanted to re-introduce it.

Signed-off-by: John Stultz
---
v12:
* Earlier this was re-using blocked_node, but I hit a race with activating
  blocked entities, and to avoid it introduced a new migration_node listhead
v18:
* Add init_task initialization of migration_node as suggested by Suleiman
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 include/linux/sched.h | 1 +
 init/init_task.c      | 1 +
 kernel/fork.c         | 1 +
 kernel/sched/core.c   | 25 +++++++++++++++++--------
 4 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ab12eb738c440..176ec117f4041 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1243,6 +1243,7 @@ struct task_struct {
 	enum blocked_on_state	blocked_on_state;
 	struct mutex		*blocked_on;	/* lock we're blocked on */
 	struct task_struct	*blocked_donor;	/* task that is boosting this task */
+	struct list_head	migration_node;
 	raw_spinlock_t		blocked_lock;
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
diff --git a/init/init_task.c b/init/init_task.c
index 627bbd8953e88..65e0f90285966 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -176,6 +176,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 #endif
 	.blocked_on_state = BO_RUNNABLE,
 	.blocked_donor	= NULL,
+	.migration_node = LIST_HEAD_INIT(init_task.migration_node),
 #ifdef CONFIG_RT_MUTEXES
 	.pi_waiters	= RB_ROOT_CACHED,
 	.pi_top_task	= NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index 86fe43ee35952..01fd08c463871 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2132,6 +2132,7 @@ __latent_entropy struct task_struct *copy_process(
 	p->blocked_on_state = BO_RUNNABLE;
 	p->blocked_on = NULL; /* not blocked yet */
 	p->blocked_donor = NULL; /* nobody is boosting p yet */
+	INIT_LIST_HEAD(&p->migration_node);
 
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cac03f68cbcce..26f7a11a39e0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6747,6 +6747,7 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 			       struct task_struct *p, int target_cpu)
 {
 	struct rq *target_rq = cpu_rq(target_cpu);
+	LIST_HEAD(migrate_list);
 
 	lockdep_assert_rq_held(rq);
 
@@ -6774,19 +6775,27 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 	rq_set_donor(rq, rq->idle);
 	set_next_task(rq, rq->idle);
 
-	WARN_ON(p == rq->curr);
-
-	deactivate_task(rq, p, 0);
-	proxy_set_task_cpu(p, target_cpu);
-
+	for (; p; p = p->blocked_donor) {
+		WARN_ON(p == rq->curr);
+		deactivate_task(rq, p, 0);
+		proxy_set_task_cpu(p, target_cpu);
+		/*
+		 * We can abuse migration_node to migrate the thing,
+		 * because @p was still on the rq.
+		 */
+		list_add(&p->migration_node, &migrate_list);
+	}
 	zap_balance_callbacks(rq);
 	rq_unpin_lock(rq, rf);
 	raw_spin_rq_unlock(rq);
 	raw_spin_rq_lock(target_rq);
+	while (!list_empty(&migrate_list)) {
+		p = list_first_entry(&migrate_list, struct task_struct, migration_node);
+		list_del_init(&p->migration_node);
 
-	activate_task(target_rq, p, 0);
-	wakeup_preempt(target_rq, p, 0);
-
+		activate_task(target_rq, p, 0);
+		wakeup_preempt(target_rq, p, 0);
+	}
 	raw_spin_rq_unlock(target_rq);
 	raw_spin_rq_lock(rq);
 	rq_repin_lock(rq, rf);
-- 
2.51.0.338.gd7d06c2dae-goog
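
The chain walk in proxy_migrate_task() above can be hard to picture from the
diff alone. Below is a minimal userspace sketch of the same two-phase shape:
while the source runqueue lock would still be held, the blocked_donor chain is
unhooked onto a private list; only after the source lock is dropped and the
target lock taken is that list drained onto the target CPU. Everything here
(toy_task, collect_chain, place_chain, migrate_next) is an illustrative
stand-in invented for this sketch, not kernel API; the real locking, list_head
plumbing, and scheduler calls are deliberately omitted.

#include <stdio.h>

/* Toy stand-in for task_struct; only the fields the sketch needs. */
struct toy_task {
	const char *name;
	int cpu;                        /* which "runqueue" the task sits on */
	struct toy_task *blocked_donor; /* task boosting this one, as in the patch */
	struct toy_task *migrate_next;  /* private migration-list linkage */
};

/*
 * Phase 1: walk the blocked_donor chain and collect every task onto a
 * private list (mirrors the deactivate_task() + list_add() loop).
 */
static struct toy_task *collect_chain(struct toy_task *p)
{
	struct toy_task *list = NULL;

	for (; p; p = p->blocked_donor) {
		p->migrate_next = list;   /* push onto the local migrate list */
		list = p;
	}
	return list;
}

/*
 * Phase 2: drain the list and place each task on the target CPU
 * (mirrors the activate_task() + wakeup_preempt() loop).
 */
static void place_chain(struct toy_task *list, int target_cpu)
{
	while (list) {
		struct toy_task *p = list;

		list = list->migrate_next;
		p->migrate_next = NULL;
		p->cpu = target_cpu;
		printf("migrated %s to cpu%d\n", p->name, target_cpu);
	}
}

int main(void)
{
	struct toy_task near_owner = { .name = "closest-to-owner", .cpu = 0 };
	struct toy_task middle     = { .name = "middle",           .cpu = 0 };
	struct toy_task outermost  = { .name = "outermost-donor",  .cpu = 0 };

	/* blocked_donor points at the task boosting this one */
	near_owner.blocked_donor = &middle;
	middle.blocked_donor     = &outermost;

	/* phase 1: detach the whole chain from the "source" cpu ... */
	struct toy_task *list = collect_chain(&near_owner);

	/* phase 2: ... then place every collected task on the "target" cpu */
	place_chain(list, 2);
	return 0;
}

The two phases mirror the patch's choice to drop the source runqueue lock
before taking the target one: tasks are detached while the source side is
held, and only attached once the target side is held.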