From: Peter Zijlstra <peterz@infradead.org>
In preparation for nesting mutex::wait_lock under rq::lock, we need
to remove wakeups from under it.
Do this by utilizing wake_qs to defer the wakeup until after the
lock is dropped.
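To illustrate the pattern (a sketch for readers, not part of the
patch itself): waiters to be woken are collected on an on-stack
wake_q while wait_lock is held, and the actual wakeup happens only
once the lock has been dropped:

	DEFINE_WAKE_Q(wake_q);

	raw_spin_lock(&lock->wait_lock);
	/* decide which waiters to wake, but defer the wakeup itself */
	wake_q_add(&wake_q, waiter->task);
	raw_spin_unlock(&lock->wait_lock);

	wake_up_q(&wake_q);	/* safe: wait_lock is no longer held */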
Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and
08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait
mutexes")]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[jstultz: rebased to mainline, added extra wake_up_q & init
to avoid hangs, similar to Connor's rework of this patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Reverted to an earlier version of this patch to undo
  the change that kept the wake_q in the ctx structure, as
  that broke the rule that the wake_q must always be on the
  stack, since it's not safe for concurrent use.
v6:
* Made tweaks suggested by Waiman Long
v7:
* Fixups to pass wake_qs down for PREEMPT_RT logic
v10:
* Moved the preempt_enable() down, closer to the unlock, as
  suggested by Valentin
* Added additional preempt_disable() coverage around the
  wake_q calls, as also noted by Valentin (see the sketch
  after these notes)
v12:
* Fixes and simplifications from K Prateek Nayak and Peter Zijlstra
* Commit message tweak
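Illustration (not part of the patch): the preempt_disable() coverage
mentioned in the v10 notes takes this shape wherever a wake_q is
flushed right after wait_lock is dropped:

	/*
	 * Disable preemption before dropping wait_lock so a woken,
	 * possibly higher-priority task cannot preempt this context
	 * before all pending wakeups have been issued.
	 */
	preempt_disable();
	raw_spin_unlock(&lock->wait_lock);
	wake_up_q(&wake_q);
	preempt_enable();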
---
kernel/futex/pi.c | 6 +++-
kernel/locking/mutex.c | 16 ++++++++---
kernel/locking/rtmutex.c | 49 +++++++++++++++++++++++----------
kernel/locking/rtmutex_api.c | 11 ++++++--
kernel/locking/rtmutex_common.h | 3 +-
kernel/locking/rwbase_rt.c | 8 +++++-
kernel/locking/rwsem.c | 4 +--
kernel/locking/spinlock_rt.c | 5 +++--
kernel/locking/ww_mutex.h | 30 ++++++++++++--------
9 files changed, 93 insertions(+), 39 deletions(-)
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 5722467f2737..d62cca5ed8f4 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -922,6 +922,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int trylock)
struct rt_mutex_waiter rt_waiter;
struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
+ DEFINE_WAKE_Q(wake_q);
int res, ret;
if (!IS_ENABLED(CONFIG_FUTEX_PI))
@@ -1018,8 +1019,11 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int trylock)
* such that futex_unlock_pi() is guaranteed to observe the waiter when
* it sees the futex_q::pi_state.
*/
- ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
+ ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current, &wake_q);
+ preempt_disable();
raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
+ wake_up_q(&wake_q);
+ preempt_enable();
if (ret) {
if (ret == 1)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index cbae8c0b89ab..6c94da061ec2 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -575,6 +575,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
struct lockdep_map *nest_lock, unsigned long ip,
struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
{
+ DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
struct ww_mutex *ww;
int ret;
@@ -625,7 +626,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
*/
if (__mutex_trylock(lock)) {
if (ww_ctx)
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
goto skip_wait;
}
@@ -645,7 +646,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
* Add in stamp order, waking up waiters that must kill
* themselves.
*/
- ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
+ ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx, &wake_q);
if (ret)
goto err_early_kill;
}
@@ -681,6 +682,10 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
}
raw_spin_unlock(&lock->wait_lock);
+ /* Make sure we do wakeups before calling schedule */
+ wake_up_q(&wake_q);
+ wake_q_init(&wake_q);
+
schedule_preempt_disabled();
first = __mutex_waiter_is_first(lock, &waiter);
@@ -714,7 +719,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
*/
if (!ww_ctx->is_wait_die &&
!__mutex_waiter_is_first(lock, &waiter))
- __ww_mutex_check_waiters(lock, ww_ctx);
+ __ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
}
__mutex_remove_waiter(lock, &waiter);
@@ -730,6 +735,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
ww_mutex_lock_acquired(ww, ww_ctx);
raw_spin_unlock(&lock->wait_lock);
+ wake_up_q(&wake_q);
preempt_enable();
return 0;
@@ -741,6 +747,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
raw_spin_unlock(&lock->wait_lock);
debug_mutex_free_waiter(&waiter);
mutex_release(&lock->dep_map, ip);
+ wake_up_q(&wake_q);
preempt_enable();
return ret;
}
@@ -951,9 +958,10 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next);
+ preempt_disable();
raw_spin_unlock(&lock->wait_lock);
-
wake_up_q(&wake_q);
+ preempt_enable();
}
#ifndef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index ebebd0eec7f6..8ada6567a141 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -34,13 +34,15 @@
static inline int __ww_mutex_add_waiter(struct rt_mutex_waiter *waiter,
struct rt_mutex *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
return 0;
}
static inline void __ww_mutex_check_waiters(struct rt_mutex *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
}
@@ -1201,7 +1203,8 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
struct rt_mutex_waiter *waiter,
struct task_struct *task,
struct ww_acquire_ctx *ww_ctx,
- enum rtmutex_chainwalk chwalk)
+ enum rtmutex_chainwalk chwalk,
+ struct wake_q_head *wake_q)
{
struct task_struct *owner = rt_mutex_owner(lock);
struct rt_mutex_waiter *top_waiter = waiter;
@@ -1245,7 +1248,10 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
/* Check whether the waiter should back out immediately */
rtm = container_of(lock, struct rt_mutex, rtmutex);
- res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
+ preempt_disable();
+ res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx, wake_q);
+ wake_up_q(wake_q);
+ preempt_enable();
if (res) {
raw_spin_lock(&task->pi_lock);
rt_mutex_dequeue(lock, waiter);
@@ -1679,7 +1685,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
unsigned int state,
enum rtmutex_chainwalk chwalk,
- struct rt_mutex_waiter *waiter)
+ struct rt_mutex_waiter *waiter,
+ struct wake_q_head *wake_q)
{
struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex);
struct ww_mutex *ww = ww_container_of(rtm);
@@ -1690,7 +1697,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
/* Try to acquire the lock again: */
if (try_to_take_rt_mutex(lock, current, NULL)) {
if (build_ww_mutex() && ww_ctx) {
- __ww_mutex_check_waiters(rtm, ww_ctx);
+ __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
ww_mutex_lock_acquired(ww, ww_ctx);
}
return 0;
@@ -1700,7 +1707,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
trace_contention_begin(lock, LCB_F_RT);
- ret = task_blocks_on_rt_mutex(lock, waiter, current, ww_ctx, chwalk);
+ ret = task_blocks_on_rt_mutex(lock, waiter, current, ww_ctx, chwalk, wake_q);
if (likely(!ret))
ret = rt_mutex_slowlock_block(lock, ww_ctx, state, NULL, waiter);
@@ -1708,7 +1715,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
/* acquired the lock */
if (build_ww_mutex() && ww_ctx) {
if (!ww_ctx->is_wait_die)
- __ww_mutex_check_waiters(rtm, ww_ctx);
+ __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
ww_mutex_lock_acquired(ww, ww_ctx);
}
} else {
@@ -1730,7 +1737,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
- unsigned int state)
+ unsigned int state,
+ struct wake_q_head *wake_q)
{
struct rt_mutex_waiter waiter;
int ret;
@@ -1739,7 +1747,7 @@ static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
waiter.ww_ctx = ww_ctx;
ret = __rt_mutex_slowlock(lock, ww_ctx, state, RT_MUTEX_MIN_CHAINWALK,
- &waiter);
+ &waiter, wake_q);
debug_rt_mutex_free_waiter(&waiter);
return ret;
@@ -1755,6 +1763,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
unsigned int state)
{
+ DEFINE_WAKE_Q(wake_q);
unsigned long flags;
int ret;
@@ -1776,8 +1785,11 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
* irqsave/restore variants.
*/
raw_spin_lock_irqsave(&lock->wait_lock, flags);
- ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
+ ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q);
+ preempt_disable();
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+ wake_up_q(&wake_q);
+ preempt_enable();
rt_mutex_post_schedule();
return ret;
@@ -1804,7 +1816,8 @@ static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock,
* rtlock_slowlock_locked - Slow path lock acquisition for RT locks
* @lock: The underlying RT mutex
*/
-static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock)
+static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock,
+ struct wake_q_head *wake_q)
{
struct rt_mutex_waiter waiter;
struct task_struct *owner;
@@ -1821,7 +1834,7 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock)
trace_contention_begin(lock, LCB_F_RT);
- task_blocks_on_rt_mutex(lock, &waiter, current, NULL, RT_MUTEX_MIN_CHAINWALK);
+ task_blocks_on_rt_mutex(lock, &waiter, current, NULL, RT_MUTEX_MIN_CHAINWALK, wake_q);
for (;;) {
/* Try to acquire the lock again */
@@ -1832,7 +1845,11 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock)
owner = rt_mutex_owner(lock);
else
owner = NULL;
+ preempt_disable();
raw_spin_unlock_irq(&lock->wait_lock);
+ wake_up_q(wake_q);
+ wake_q_init(wake_q);
+ preempt_enable();
if (!owner || !rtmutex_spin_on_owner(lock, &waiter, owner))
schedule_rtlock();
@@ -1857,10 +1874,14 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock)
static __always_inline void __sched rtlock_slowlock(struct rt_mutex_base *lock)
{
unsigned long flags;
+ DEFINE_WAKE_Q(wake_q);
raw_spin_lock_irqsave(&lock->wait_lock, flags);
- rtlock_slowlock_locked(lock);
+ rtlock_slowlock_locked(lock, &wake_q);
+ preempt_disable();
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+ wake_up_q(&wake_q);
+ preempt_enable();
}
#endif /* RT_MUTEX_BUILD_SPINLOCKS */
diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c
index a6974d044593..747f2da16037 100644
--- a/kernel/locking/rtmutex_api.c
+++ b/kernel/locking/rtmutex_api.c
@@ -291,7 +291,8 @@ void __sched rt_mutex_proxy_unlock(struct rt_mutex_base *lock)
*/
int __sched __rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
struct rt_mutex_waiter *waiter,
- struct task_struct *task)
+ struct task_struct *task,
+ struct wake_q_head *wake_q)
{
int ret;
@@ -302,7 +303,7 @@ int __sched __rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
/* We enforce deadlock detection for futexes */
ret = task_blocks_on_rt_mutex(lock, waiter, task, NULL,
- RT_MUTEX_FULL_CHAINWALK);
+ RT_MUTEX_FULL_CHAINWALK, wake_q);
if (ret && !rt_mutex_owner(lock)) {
/*
@@ -341,12 +342,16 @@ int __sched rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
struct task_struct *task)
{
int ret;
+ DEFINE_WAKE_Q(wake_q);
raw_spin_lock_irq(&lock->wait_lock);
- ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+ ret = __rt_mutex_start_proxy_lock(lock, waiter, task, &wake_q);
if (unlikely(ret))
remove_waiter(lock, waiter);
+ preempt_disable();
raw_spin_unlock_irq(&lock->wait_lock);
+ wake_up_q(&wake_q);
+ preempt_enable();
return ret;
}
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 1162e07cdaea..c38a2d2d4a7e 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -83,7 +83,8 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
extern void rt_mutex_proxy_unlock(struct rt_mutex_base *lock);
extern int __rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
struct rt_mutex_waiter *waiter,
- struct task_struct *task);
+ struct task_struct *task,
+ struct wake_q_head *);
extern int rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
struct rt_mutex_waiter *waiter,
struct task_struct *task);
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 34a59569db6b..9f4322c07486 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -69,6 +69,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
unsigned int state)
{
struct rt_mutex_base *rtm = &rwb->rtmutex;
+ DEFINE_WAKE_Q(wake_q);
int ret;
rwbase_pre_schedule();
@@ -110,7 +111,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
* For rwlocks this returns 0 unconditionally, so the below
* !ret conditionals are optimized out.
*/
- ret = rwbase_rtmutex_slowlock_locked(rtm, state);
+ ret = rwbase_rtmutex_slowlock_locked(rtm, state, &wake_q);
/*
* On success the rtmutex is held, so there can't be a writer
@@ -121,7 +122,12 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
*/
if (!ret)
atomic_inc(&rwb->readers);
+
+ preempt_disable();
raw_spin_unlock_irq(&rtm->wait_lock);
+ wake_up_q(&wake_q);
+ preempt_enable();
+
if (!ret)
rwbase_rtmutex_unlock(rtm);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2bbb6eca5144..2ddb827e3bea 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1413,8 +1413,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
#define rwbase_rtmutex_lock_state(rtm, state) \
__rt_mutex_lock(rtm, state)
-#define rwbase_rtmutex_slowlock_locked(rtm, state) \
- __rt_mutex_slowlock_locked(rtm, NULL, state)
+#define rwbase_rtmutex_slowlock_locked(rtm, state, wq) \
+ __rt_mutex_slowlock_locked(rtm, NULL, state, wq)
#define rwbase_rtmutex_unlock(rtm) \
__rt_mutex_unlock(rtm)
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index 38e292454fcc..fb1810a14c9d 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -162,7 +162,8 @@ rwbase_rtmutex_lock_state(struct rt_mutex_base *rtm, unsigned int state)
}
static __always_inline int
-rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state)
+rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state,
+ struct wake_q_head *wake_q)
{
- rtlock_slowlock_locked(rtm);
+ rtlock_slowlock_locked(rtm, wake_q);
return 0;
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 76d204b7d29c..a54bd16d0f17 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -275,7 +275,7 @@ __ww_ctx_less(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
*/
static bool
__ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx, struct wake_q_head *wake_q)
{
if (!ww_ctx->is_wait_die)
return false;
@@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
#ifndef WW_RT
debug_mutex_wake_waiter(lock, waiter);
#endif
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task);
}
return true;
@@ -299,7 +299,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
*/
static bool __ww_mutex_wound(struct MUTEX *lock,
struct ww_acquire_ctx *ww_ctx,
- struct ww_acquire_ctx *hold_ctx)
+ struct ww_acquire_ctx *hold_ctx,
+ struct wake_q_head *wake_q)
{
struct task_struct *owner = __ww_mutex_owner(lock);
@@ -331,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* wakeup pending to re-read the wounded state.
*/
if (owner != current)
- wake_up_process(owner);
+ wake_q_add(wake_q, owner);
return true;
}
@@ -352,7 +353,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* The current task must not be on the wait list.
*/
static void
-__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
+__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct MUTEX_WAITER *cur;
@@ -364,8 +366,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
if (!cur->ww_ctx)
continue;
- if (__ww_mutex_die(lock, cur, ww_ctx) ||
- __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
+ if (__ww_mutex_die(lock, cur, ww_ctx, wake_q) ||
+ __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx, wake_q))
break;
}
}
@@ -377,6 +379,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
+ DEFINE_WAKE_Q(wake_q);
+
ww_mutex_lock_acquired(lock, ctx);
/*
@@ -405,8 +409,11 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* die or wound us.
*/
lock_wait_lock(&lock->base);
- __ww_mutex_check_waiters(&lock->base, ctx);
+ __ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
+ preempt_disable();
unlock_wait_lock(&lock->base);
+ wake_up_q(&wake_q);
+ preempt_enable();
}
static __always_inline int
@@ -488,7 +495,8 @@ __ww_mutex_check_kill(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
static inline int
__ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
struct MUTEX *lock,
- struct ww_acquire_ctx *ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ struct wake_q_head *wake_q)
{
struct MUTEX_WAITER *cur, *pos = NULL;
bool is_wait_die;
@@ -532,7 +540,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
pos = cur;
/* Wait-Die: ensure younger waiters die. */
- __ww_mutex_die(lock, cur, ww_ctx);
+ __ww_mutex_die(lock, cur, ww_ctx, wake_q);
}
__ww_waiter_add(lock, waiter, pos);
@@ -550,7 +558,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
* such that either we or the fastpath will wound @ww->ctx.
*/
smp_mb();
- __ww_mutex_wound(lock, ww_ctx, ww->ctx);
+ __ww_mutex_wound(lock, ww_ctx, ww->ctx, wake_q);
}
return 0;
--
2.47.0.rc1.288.g06298d1525-goog
On Sat, Oct 12, 2024, at 01:25, John Stultz wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> In preparation for nesting mutex::wait_lock under rq::lock, we need
> to remove wakeups from under it.
>
> Do this by utilizing wake_qs to defer the wakeup until after the
> lock is dropped.

To follow up from IRC, this patch is the one that caused a boot-time
regression in linux-next in the regulator framework.

Anders Roxell found this during testing on the Rock Pi 4 board
(Rockchip RK3399 based).

The boot log with the NULL pointer dereference is at
https://lkft.validation.linaro.org/scheduler/job/7979980#L741

The interesting bit is this:

[    0.957586] rk_gmac-dwmac fe300000.ethernet: Enable RX Mitigation via HW Watchdog Timer
[    0.969402] hub 6-0:1.0: USB hub found
[    0.969450] hub 6-0:1.0: 1 port detected
[    0.988163] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[    0.988172] Mem abort info:
[    0.988174]   ESR = 0x0000000096000004
[    0.988176]   EC = 0x25: DABT (current EL), IL = 32 bits
[    0.988180]   SET = 0, FnV = 0
[    0.988183]   EA = 0, S1PTW = 0
[    0.988185]   FSC = 0x04: level 0 translation fault
[    0.988187] Data abort info:
[    0.988189]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[    0.988191]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    0.988194]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    0.988197] [0000000000000000] user address but active_mm is swapper
[    0.988201] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[    0.988205] Modules linked in:
[    0.988217] Hardware name: Radxa ROCK Pi 4B (DT)
[    0.988225] pc : wake_up_q (kernel/sched/core.c:1059)
[    0.988238] lr : wake_up_q (kernel/sched/core.c:1054)
[    0.988243] sp : ffff800083433a00
[    0.988245] x29: ffff800083433a00 x28: 0000000000000000 x27: ffff0000053b6080
[    0.988253] x26: ffff800083433b90 x25: ffff0000053b6000 x24: ffff800080098000
[    0.988259] x23: 00000000ffffffff x22: 0000000000000001 x21: 0000000000000000
[    0.988265] x20: fffffffffffff850 x19: 0000000000000000 x18: 0000000000000001
[    0.988272] x17: ffff800075678000 x16: ffff800082728000 x15: 019ee6ab98006e30
[    0.988278] x14: 000002ce459acd0c x13: 000b52b4cf08772c x12: 000000000000000f
[    0.988284] x11: 0000000000000000 x10: 0000000000000a50 x9 : ffff800083433870
[    0.988291] x8 : ffff0000050fceb0 x7 : ffff0000f76ab9c0 x6 : 00000000000b52b4
[    0.988297] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[    0.988303] x2 : 0000000000002710 x1 : 0000000000000001 x0 : 0000000000002710
[    0.988310] Call trace:
[    0.988313]  wake_up_q+0x50/0xf0 P)
[    0.988319]  wake_up_q+0xa0/0xf0 L)
[    0.988325]  __ww_rt_mutex_lock.isra.0 (arch/arm64/include/asm/preempt.h:62 (discriminator 2) kernel/locking/rtmutex.c:1794 kernel/locking/ww_rt_mutex.c:71)
[    0.988333]  ww_mutex_lock (kernel/locking/ww_rt_mutex.c:82)
[    0.988338]  regulator_lock_recursive (drivers/regulator/core.c:161 drivers/regulator/core.c:333)
[    0.988347]  regulator_lock_recursive (drivers/regulator/core.c:348)
[    0.988354]  regulator_lock_dependent (drivers/regulator/core.c:409)
[    0.988360]  regulator_set_voltage (drivers/regulator/core.c:4173)
[    0.988366]  _opp_config_regulator_single (include/linux/regulator/consumer.h:707 (discriminator 1) drivers/opp/core.c:933 drivers/opp/core.c:1019)
[    0.988375]  _set_opp (drivers/opp/core.c:1253)
[    0.988379]  dev_pm_opp_set_rate (drivers/opp/core.c:1357)
[    0.988384]  set_target (drivers/cpufreq/cpufreq-dt.c:63)
[    0.988392]  __cpufreq_driver_target (drivers/cpufreq/cpufreq.c:2292 drivers/cpufreq/cpufreq.c:2355)
[    0.988398]  sugov_work (kernel/sched/cpufreq_schedutil.c:537)
[    0.988406]  kthread_worker_fn (arch/arm64/include/asm/jump_label.h:32 include/linux/freezer.h:36 include/linux/freezer.h:54 kernel/kthread.c:861)
[    0.988414]  kthread (kernel/kthread.c:389)
[    0.988420]  ret_from_fork (arch/arm64/kernel/entry.S:863)
[    0.988430] Code: f100067f 54000320 d11ec274 aa1303f5 (f9400273)

> @@ -1776,8 +1785,11 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
>  	 * irqsave/restore variants.
>  	 */
>  	raw_spin_lock_irqsave(&lock->wait_lock, flags);
> -	ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
> +	ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q);
> +	preempt_disable();
>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> +	wake_up_q(&wake_q);
> +	preempt_enable();
>  	rt_mutex_post_schedule();
>
>  	return ret;

This is apparently where things went wrong, but it's possible that
the actual root cause is in the regulator framework instead.

The NULL pointer itself happens when chasing wake_q->first->next,
so it would seem that one of these got reinitialized at the wrong
time, perhaps with a task_struct getting freed or getting put on
more than one wake_q_head list at the same time.

     Arnd
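[For readers following the analysis: wake_up_q() consumes the list
destructively, roughly like the following (paraphrased from
kernel/sched/core.c; details vary by kernel version). Passing the
same head in twice without re-initializing it makes the walk start
from a node whose ->next was already cleared, which is exactly the
NULL chase Arnd describes.]

	void wake_up_q(struct wake_q_head *head)
	{
		struct wake_q_node *node = head->first;

		while (node != WAKE_Q_TAIL) {
			struct task_struct *task;

			task = container_of(node, struct task_struct, wake_q);
			node = node->next;		/* advance first... */
			task->wake_q.next = NULL;	/* ...then mark the entry reusable */
			wake_up_process(task);
			put_task_struct(task);
		}
	}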
On Wed, Nov 13, 2024 at 10:18 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Sat, Oct 12, 2024, at 01:25, John Stultz wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> >
> > In preparation for nesting mutex::wait_lock under rq::lock, we need
> > to remove wakeups from under it.
> >
> > Do this by utilizing wake_qs to defer the wakeup until after the
> > lock is dropped.
>
> To follow up from IRC, this patch is the one that caused a boot-time
> regression in linux-next in the regulator framework.
>
> Anders Roxell found this during testing on the Rock Pi 4 board
> (Rockchip RK3399 based).
>
> [... full boot log and quoted hunk snipped; see above ...]
>
> This is apparently where things went wrong, but it's possible that
> the actual root cause is in the regulator framework instead.
>
> The NULL pointer itself happens when chasing wake_q->first->next,
> so it would seem that one of these got reinitialized at the wrong
> time, perhaps with a task_struct getting freed or getting put on
> more than one wake_q_head list at the same time.

Thanks so much again to Anders for finding and reporting this!

So I've managed to reproduce this and I'm chasing down what's going
wrong now.

Thanks so much for the report!
-john
Anders had bisected a crash using PREEMPT_RT with linux-next and
isolated it down to commit 894d1b3db41c ("locking/mutex: Remove
wakeups from under mutex::wait_lock"), where it seemed the
wake_q structure was somehow getting corrupted, causing a NULL
pointer dereference.
I was able to easily reproduce this with PREEMPT_RT and managed
to isolate it down to the fact that, through various call stacks,
we were actually calling wake_up_q() twice on the same wake_q.
I found that in the problematic commit, I had added the
wake_up_q() call in task_blocks_on_rt_mutex() around
__ww_mutex_add_waiter(), following a similar pattern in
__mutex_lock_common().
However, it's just wrong. We haven't dropped the lock->wait_lock,
so it's contrary to the point of the original patch. And it
didn't match the __mutex_lock_common() logic of re-initializing
the wake_q after calling it midway through the call stack.
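For reference, the __mutex_lock_common() logic mentioned above
(taken from the original patch) is safe precisely because it flushes
the wake_q only after dropping wait_lock, and then re-initializes it:

	raw_spin_unlock(&lock->wait_lock);

	/* Make sure we do wakeups before calling schedule */
	wake_up_q(&wake_q);
	wake_q_init(&wake_q);	/* the head may now be reused safely */

	schedule_preempt_disabled();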
Looking at it now, the wake_up_q() call is incorrect and should
just be removed. So drop the erroneous logic I had added.
Anders: Can you double check this resolves the issue for you?
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: regressions@lists.linux.dev
Cc: Thorsten Leemhuis <linux@leemhuis.info>
Cc: Anders Roxell <anders.roxell@linaro.org>
Fixes: 894d1b3db41c ("locking/mutex: Remove wakeups from under mutex::wait_lock")
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/lkml/6afb936f-17c7-43fa-90e0-b9e780866097@app.fastmail.com/
Signed-off-by: John Stultz <jstultz@google.com>
---
kernel/locking/rtmutex.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index c7de80ee1f9d..a01e81179df0 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1248,10 +1248,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
/* Check whether the waiter should back out immediately */
rtm = container_of(lock, struct rt_mutex, rtmutex);
- preempt_disable();
res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx, wake_q);
- wake_up_q(wake_q);
- preempt_enable();
if (res) {
raw_spin_lock(&task->pi_lock);
rt_mutex_dequeue(lock, waiter);
--
2.47.0.277.g8800431eea-goog
Hello John,

On 11/14/2024 3:22 AM, John Stultz wrote:
> Anders had bisected a crash using PREEMPT_RT with linux-next and
> isolated it down to commit 894d1b3db41c ("locking/mutex: Remove
> wakeups from under mutex::wait_lock"), where it seemed the
> wake_q structure was somehow getting corrupted, causing a NULL
> pointer dereference.
>
> I was able to easily reproduce this with PREEMPT_RT and managed
> to isolate it down to the fact that, through various call stacks,
> we were actually calling wake_up_q() twice on the same wake_q.
>
> Looking at it now, the wake_up_q() call is incorrect and should
> just be removed. So drop the erroneous logic I had added.
>
> [... rest of commit message, tags, and patch snipped ...]

I've been running the rtmutex_lock torture test in addition to a few
standard micro-benchmarks with the fix on my system on top of
tip:sched/core and I haven't encountered any splats there. Feel free
to add:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

--
Thanks and Regards,
Prateek
On Wed, Nov 13, 2024 at 10:37 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 11/14/2024 3:22 AM, John Stultz wrote:
> > Anders had bisected a crash using PREEMPT_RT with linux-next and
> > isolated it down to commit 894d1b3db41c ("locking/mutex: Remove
> > wakeups from under mutex::wait_lock"), where it seemed the
> > wake_q structure was somehow getting corrupted, causing a NULL
> > pointer dereference.
> >
> > I was able to easily reproduce this with PREEMPT_RT and managed
> > to isolate it down to the fact that, through various call stacks,
> > we were actually calling wake_up_q() twice on the same wake_q.
> >
> > Looking at it now, the wake_up_q() call is incorrect and should
> > just be removed. So drop the erroneous logic I had added.
>
> [...]
>
> I've been running the rtmutex_lock torture test in addition to a few
> standard micro-benchmarks with the fix on my system on top of
> tip:sched/core and I haven't encountered any splats there. Feel free
> to add:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thank you so much for testing! I really appreciate it!

I'll resend with the provided tags and without the RFC here soon.
-john
On Wed, 13 Nov 2024 at 22:53, John Stultz <jstultz@google.com> wrote:
>
> Anders had bisected a crash using PREEMPT_RT with linux-next and
> isolated it down to commit 894d1b3db41c ("locking/mutex: Remove
> wakeups from under mutex::wait_lock"), where it seemed the
> wake_q structure was somehow getting corrupted, causing a NULL
> pointer dereference.
>
> [... rest of commit message, tags, and patch snipped ...]
>
> Anders: Can you double check this resolves the issue for you?

Thank you John for looking into the issue.

It booted just fine on the rockpi4 now, on top of tag next-2024111.

Tested-by: Anders Roxell <anders.roxell@linaro.org>

Cheers,
Anders