From nobody Thu Jun 18 08:58:39 2026
Date: Thu, 30 Apr 2026 21:50:46 +0000
In-Reply-To: <20260430215103.2978955-1-jstultz@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References:
 <20260430215103.2978955-1-jstultz@google.com>
X-Mailer: git-send-email 2.54.0.545.g6539524ca2-goog
Message-ID: <20260430215103.2978955-2-jstultz@google.com>
Subject: [PATCH v2 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
From: John Stultz
To: LKML
Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
 Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
 Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
 Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
 Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
 K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal,
 kuyo chang, hupu, kernel-team@android.com
Content-Type: text/plain; charset="utf-8"

Vineeth reported seeing a KVM-related deadlock connected to workqueue
lockups on the android17-6.18 tree, which has Proxy Execution enabled
(using the full patch stack). I've subsequently reproduced it on
v7.1-rc1.

On further debugging he found:
- The kvm-irqfd-cleanup workqueue and rcu_gp land in a per-cpu pwq
  (workqueue pool).
- One kvm-irqfd-cleanup worker (say A) takes a mutex and then calls
  synchronize_srcu_expedited().
- Another kvm-irqfd-cleanup worker (say B) tries to acquire the lock
  and then gets blocked.
- On the way to blocking, this cpu gets an IPI, and on return from the
  IPI it calls __schedule() without having completed the workqueue
  accounting (worker->sleeping = 0 and decrementing pool->nr_running).
  That accounting is done in sched_submit_work() ->
  wq_worker_sleeping(), called from schedule(), and we got preempted
  before that.
- Proxy execution doesn't immediately take B off the runqueue, as
  p->blocked_on is set during __mutex_lock.
- The next time B is picked to run, the scheduler notices that A (the
  mutex holder) is not on a runqueue and then blocks B:
  find_proxy_task() -> proxy_deactivate() -> block_task()
- And things are then stuck.
  A is waiting for the workqueue to be run, but B can't run the
  workqueue as it is blocked on A.

The trouble is that with Proxy Execution, in __mutex_lock_common() we
set the task state to TASK_UNINTERRUPTIBLE and set blocked_on before
calling into schedule(), where sched_submit_work() will be called.

But if an IPI comes in before we call schedule(), the interrupt will
call __schedule(SM_PREEMPT) directly. This causes the scheduler to see
the current task as blocked_on and deactivate it (because the owner is
off the runqueue). Since it's deactivated, it won't be run, and it
won't get to call sched_submit_work(). And then we see workqueue
stalls.

Without proxy-execution, things work, as the SM_PREEMPT case will
prevent the task from being dequeued. It can then be selected again
and run, which allows it to finish calling into schedule() and call
sched_submit_work() before actually blocking.

Peter didn't like my earlier attempt to solve this by clearing the
blocked_on state and marking the task's __state as TASK_RUNNING, as we
shouldn't modify __state from schedule().

So this approach is slightly different. We use the low bit of the
blocked_on pointer as a latch bit flag. When the task sets the
blocked_on pointer, we don't consider it for use with proxy execution
until the latch is set. We then only set the latch bit in __schedule()
when we are not in an SM_PREEMPT case and are considering blocking the
task.

This makes the blocked_on state machine a little more complex:

  NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL

With additional transitions:

  // only done on current
  ptr:latched -> NULL
  // only done on current or when trying to set waking
  ptr:unlatched -> NULL

Here NULL and ptr:unlatched are functionally equivalent except for the
ability to transition to ptr:latched.

Credit for this idea is due to Vineeth and Suleiman, who had proposed
something very similar when the issue was first reported.
Thanks as well to Peter for suggesting it, and to K Prateek, who
helped iterate and shared an initial working version.

Many thanks to Vineeth for figuring out this very obscure race and for
implementing a test tool to make it easily reproducible!

Reported-by: Vineeth Pillai
Signed-off-by: John Stultz
---
v2:
* Switch to using an extra flag bit to ensure we don't proxy early in
  SM_PREEMPT cases, as suggested by Peter (and Vineeth and Suleiman)
  and developed with help from K Prateek

Cc: Vineeth Pillai
Cc: Sonam Sanju
Cc: Sean Christopherson
Cc: Kunwu Chan
Cc: Tejun Heo
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 include/linux/sched.h | 64 +++++++++++++++++++++++++++++++++++--------
 kernel/sched/core.c   | 15 ++++++----
 2 files changed, 63 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb51..8b9e971d98f67 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2183,18 +2183,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
 #ifndef CONFIG_PREEMPT_RT
 
 /*
- * With proxy exec, if a task has been proxy-migrated, it may be a donor
- * on a cpu that it can't actually run on. Thus we need a special state
- * to denote that the task is being woken, but that it needs to be
- * evaluated for return-migration before it is run. So if the task is
- * blocked_on PROXY_WAKING, return migrate it before running it.
+ * The proxy exec blocked_on pointer value uses the low bit as a latch
+ * value which clarifies if the blocked_on value is used for proxying or
+ * not.
+ *
+ * The state machine looks something like
+ *   NULL -> ptr:unlatched -> ptr:latched -> PROXY_WAKING -> NULL
+ *
+ * With some additional transitions:
+ *   ptr:unlatched -> NULL  (done on current, or via set_task_blocked_on_waking())
+ *   ptr:latched -> NULL  (done only on current)
+ *
+ * 1) NULL and ptr:unlatched are effectively equivalent, no proxying will occur
+ * 2) ptr:latched is the state when proxying will occur
+ * 3) PROXY_WAKING is used when the task is being woken to ensure we
+ *    return-migrate proxy-migrated tasks before running them (note it has
+ *    the latch bit set).
  */
-#define PROXY_WAKING ((struct mutex *)(-1L))
+#define PROXY_BLOCKED_LATCH (1UL)
+#define PROXY_BLOCKED_ON_MASK(x) ((struct mutex *)((unsigned long)(x) & ~PROXY_BLOCKED_LATCH))
+#define PROXY_WAKING ((struct mutex *)(-1L)) /* PROXY_WAKING has LATCH bit set */
+
+static inline struct mutex *task_is_blocked_on(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+	return (struct mutex *)((unsigned long)p->blocked_on & PROXY_BLOCKED_LATCH);
+}
+
+static inline void __set_task_blocked_on_latched(struct task_struct *p)
+{
+	lockdep_assert_held_once(&p->blocked_lock);
+	WARN_ON_ONCE(!p->blocked_on);
+	p->blocked_on = (struct mutex *)((unsigned long)p->blocked_on | PROXY_BLOCKED_LATCH);
+}
+
+static inline struct mutex *__get_task_latched_blocked_on(struct task_struct *p)
+{
+	if (!task_is_blocked_on(p))
+		return NULL;
+	if (p->blocked_on == PROXY_WAKING)
+		return PROXY_WAKING;
+	return PROXY_BLOCKED_ON_MASK(p->blocked_on);
+}
 
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
-	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
+	if (p->blocked_on == PROXY_WAKING)
+		return NULL;
+	return PROXY_BLOCKED_ON_MASK(p->blocked_on);
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2215,6 +2253,8 @@ static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
+	struct mutex *bo = p->blocked_on;
+
 	/* Currently we serialize blocked_on under the task::blocked_lock */
 	lockdep_assert_held_once(&p->blocked_lock);
 	/*
@@ -2222,7 +2262,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * blocked_on relationships, but make sure we are not
 	 * clearing the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && bo && __get_task_blocked_on(p) != m && bo != PROXY_WAKING);
 	p->blocked_on = NULL;
 }
 
@@ -2242,15 +2282,17 @@ static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mu
 		return;
 	}
 
-	/* Don't set PROXY_WAKING if blocked_on was already cleared */
-	if (!p->blocked_on)
+	/* Don't set PROXY_WAKING if we are not really blocked_on */
+	if (!task_is_blocked_on(p)) {
+		p->blocked_on = NULL; /* clear if unlatched */
 		return;
+	}
 	/*
 	 * There may be cases where we set PROXY_WAKING on tasks that were
 	 * already set to waking, but make sure we are not changing
 	 * the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	WARN_ON_ONCE(m && __get_task_blocked_on(p) != m && p->blocked_on != PROXY_WAKING);
 	p->blocked_on = PROXY_WAKING;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25ae..2f912bf698446 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6599,8 +6599,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	 * blocked on a mutex, and we want to keep it on the runqueue
 	 * to be selectable for proxy-execution.
 	 */
-	if (!should_block)
-		return false;
+	if (!should_block) {
+		guard(raw_spinlock)(&p->blocked_lock);
+		if (p->blocked_on) {
+			__set_task_blocked_on_latched(p);
+			return false;
+		}
+	}
 
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
@@ -6833,7 +6838,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	int owner_cpu;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; (mutex = p->blocked_on); p = owner) {
+	for (p = donor; (mutex = __get_task_latched_blocked_on(p)); p = owner) {
 		/* if its PROXY_WAKING, do return migration or run if current */
 		if (mutex == PROXY_WAKING) {
 			if (task_current(rq, p)) {
@@ -6851,7 +6856,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		guard(raw_spinlock)(&p->blocked_lock);
 
 		/* Check again that p is blocked with blocked_lock held */
-		if (mutex != __get_task_blocked_on(p)) {
+		if (mutex != __get_task_latched_blocked_on(p)) {
 			/*
 			 * Something changed in the blocked_on chain and
 			 * we don't know if only at this level. So, let's
@@ -7107,7 +7112,7 @@ static void __sched notrace __schedule(int sched_mode)
 		struct task_struct *prev_donor = rq->donor;
 
 		rq_set_donor(rq, next);
-		if (unlikely(next->blocked_on)) {
+		if (unlikely(task_is_blocked_on(next))) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
 				zap_balance_callbacks(rq);
-- 
2.54.0.545.g6539524ca2-goog

From nobody Thu Jun 18 08:58:39 2026
Date: Thu, 30 Apr 2026 21:50:47 +0000
In-Reply-To: <20260430215103.2978955-1-jstultz@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20260430215103.2978955-1-jstultz@google.com>
X-Mailer: git-send-email 2.54.0.545.g6539524ca2-goog
Message-ID: <20260430215103.2978955-3-jstultz@google.com>
Subject: [PATCH v2 2/2] locking: mutex: Fix proxy-exec potentially deactivating tasks marked TASK_RUNNING
From: John Stultz
To: LKML
Cc: John Stultz, Vineeth Pillai, Sonam Sanju, Sean Christopherson,
 Kunwu Chan, Tejun Heo, Joel Fernandes, Qais Yousef, Ingo Molnar,
 Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
 Valentin Schneider, Steven Rostedt, Will Deacon, Waiman Long,
 Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
 K Prateek Nayak, Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal,
 kuyo chang, hupu, kernel-team@android.com
Content-Type: text/plain; charset="utf-8"

Vineeth came up with a test driver that could trigger workqueue
stalls. After fixing one issue this test found, Vineeth reported the
test was still failing.

Greatly simplified, a task that tries to take a mutex already owned by
another task that is sleeping can hit an edge case in
__mutex_lock_common(). If the task fails to get the lock and calls
into schedule(), but gets a spurious wakeup, it will find that it is
the first waiter and go into the mutex_optimistic_spin() logic.

However, before calling mutex_optimistic_spin(), we clear the task's
blocked_on state, since mutex_optimistic_spin() may call schedule() if
need_resched() is set. After mutex_optimistic_spin() fails, we set
blocked_on again, restart the main mutex loop, try to take the lock
and call into schedule_preempt_disabled().
From there, with proxy-execution, we'll see the task is blocked_on,
follow the chain, see the owner is sleeping and dequeue the waiting
task from the runqueue.

This all sounds fine and reasonable. But what I had missed is that in
mutex_optimistic_spin(), not only do we call schedule(), but we also
set TASK_RUNNING right before doing so. This is ok for that invocation
of schedule(). But when we come back we re-set the blocked_on we had
just cleared, yet we do not re-set the task state to
TASK_INTERRUPTIBLE/TASK_UNINTERRUPTIBLE.

This means we have a task that is both blocked_on and TASK_RUNNING, so
when the proxy execution code dequeues the task, we are in trouble:
future wakeups will be short-circuited by the ttwu_state_match()
check, which sees a TASK_RUNNING task and concludes there is nothing
to wake.

Thus, to avoid this, after mutex_optimistic_spin(), set the task state
back when we set blocked_on.

Many, many thanks again to Vineeth for his very useful test driver,
which uncovered this long-hidden bug that I hadn't tripped in all my
testing! I'm very impressed with the problems he's uncovered!

Reported-by: Vineeth Pillai
Tested-by: Vineeth Pillai
Signed-off-by: John Stultz
---
Cc: Vineeth Pillai
Cc: Sonam Sanju
Cc: Sean Christopherson
Cc: Kunwu Chan
Cc: Tejun Heo
Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Metin Kaya
Cc: Xuewen Yan
Cc: K Prateek Nayak
Cc: Thomas Gleixner
Cc: Daniel Lezcano
Cc: Suleiman Souhlal
Cc: kuyo chang
Cc: hupu
Cc: kernel-team@android.com
---
 kernel/locking/mutex.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 09534628dc01a..a93d4c6bee1a3 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -763,6 +763,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 		raw_spin_lock(&current->blocked_lock);
 		__set_task_blocked_on(current, lock);
+		set_current_state(state);
 
 		if (opt_acquired)
 			break;
-- 
2.54.0.545.g6539524ca2-goog
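
For readers following the blocked_on latch transitions described in
patch 1/2, here is a minimal userspace sketch of the pointer-tagging
idea. It is not kernel code and not the patch's implementation: struct
task, model_schedule(), latched_blocked_on() and latch_demo() are
hypothetical stand-ins, locking is omitted entirely, and the only
assumption carried over is that mutex pointers are at least 2-byte
aligned, so the low bit is free to act as the latch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; illustration only. */
struct mutex { int dummy; };
struct task { struct mutex *blocked_on; };

#define LATCH	((uintptr_t)1)
/* All bits set, so the latch bit is set too (mirrors PROXY_WAKING). */
#define WAKING	((struct mutex *)(uintptr_t)-1)

/* Nonzero once the latch bit has been set on the blocked_on value. */
int is_latched(const struct task *p)
{
	return ((uintptr_t)p->blocked_on & LATCH) != 0;
}

/* What a proxy chain walk would see: NULL unless the value is latched. */
struct mutex *latched_blocked_on(const struct task *p)
{
	if (!is_latched(p))
		return NULL;
	if (p->blocked_on == WAKING)
		return WAKING;
	return (struct mutex *)((uintptr_t)p->blocked_on & ~LATCH);
}

/*
 * Model of the scheduler entry: only a non-preempt entry latches.
 * A preempting entry leaves ptr:unlatched alone, so the task is not
 * treated as blocked and can finish its pre-blocking bookkeeping.
 */
void model_schedule(struct task *p, int preempt)
{
	if (!preempt && p->blocked_on)
		p->blocked_on = (struct mutex *)((uintptr_t)p->blocked_on | LATCH);
}

/* Walks NULL -> unlatched -> latched -> WAKING -> NULL; returns 0 if ok. */
int latch_demo(void)
{
	static struct mutex m;
	struct task t = { .blocked_on = NULL };
	int bad = 0;

	t.blocked_on = &m;			/* NULL -> ptr:unlatched */
	model_schedule(&t, 1);			/* preempted before schedule() */
	bad += latched_blocked_on(&t) != NULL;	/* still invisible to proxying */

	model_schedule(&t, 0);			/* ptr:unlatched -> ptr:latched */
	bad += latched_blocked_on(&t) != &m;

	t.blocked_on = WAKING;			/* ptr:latched -> PROXY_WAKING */
	bad += latched_blocked_on(&t) != WAKING;

	t.blocked_on = NULL;			/* PROXY_WAKING -> NULL */
	bad += latched_blocked_on(&t) != NULL;
	return bad;
}
```

Calling latch_demo() from a small main() and checking for a zero
return exercises each transition once. The point of the sketch is the
second step: a preempting entry never sets the latch, which is why the
SM_PREEMPT path cannot deactivate a task that has not yet reached
schedule().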