[RESEND][PATCH v18 3/8] locking/mutex: Add p->blocked_on wrappers for correctness checks

Posted by John Stultz 3 months ago
From: Valentin Schneider <valentin.schneider@arm.com>

This lets us assert mutex::wait_lock is held whenever we access
p->blocked_on, as well as warn us about unexpected state changes.
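
As a rough sketch of the convention this introduces (names as in the
hunks below): the double-underscore variants assume the caller already
holds the mutex's wait_lock, while the plain variants take and release
it themselves via a guard:

    raw_spin_lock_irqsave(&lock->wait_lock, flags);
    __set_task_blocked_on(current, lock);   /* ok: wait_lock is held */
    raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

    set_task_blocked_on(current, lock);     /* acquires/releases wait_lock */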

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
[fix conflicts, call in more places]
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: tweaked commit subject, reworked a good bit]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Added get_task_blocked_on() accessor
v4:
* Address READ_ONCE usage that was dropped in v2
* Reordered to be a later add-on to the main patch series, as
  Peter was unhappy with similar wrappers in other patches.
v5:
* Added some extra correctness checking in wrappers
v7:
* Tweaks to reorder this change in the patch series
* Minor cleanup to set_task_blocked_on() suggested by Metin Kaya
v15:
* Split out into its own patch again.
* Further improve assumption checks in helpers.
v16:
* Fix optimistic spin case that can call schedule()
v17:
* Fix typos caught by Metin Kaya
* Add lockdep_assert_held_once and drop the READ_ONCE in
  __get_task_blocked_on(), as suggested by Juri Lelli

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h        | 54 ++++++++++++++++++++++++++++++++++--
 kernel/locking/mutex-debug.c |  4 +--
 kernel/locking/mutex.c       | 32 ++++++++++-----------
 kernel/locking/ww_mutex.h    |  6 ++--
 4 files changed, 70 insertions(+), 26 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index aa7966141a090..1d7f625adbb5e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include <linux/sched/prio.h>
 #include <linux/sched/types.h>
 #include <linux/signal_types.h>
+#include <linux/spinlock.h>
 #include <linux/syscall_user_dispatch_types.h>
 #include <linux/mm_types_task.h>
 #include <linux/netdevice_xmit.h>
@@ -2177,6 +2178,57 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 	__cond_resched_rwlock_write(lock);					\
 })
 
+static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+	WARN_ON_ONCE(!m);
+	/* The task should only be setting itself as blocked */
+	WARN_ON_ONCE(p != current);
+	/* Currently we serialize blocked_on under the mutex::wait_lock */
+	lockdep_assert_held_once(&m->wait_lock);
+	/*
+	 * Check to ensure we don't overwrite an existing mutex
+	 * value with a different mutex. Note, setting it to the
+	 * same lock repeatedly is ok.
+	 */
+	WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+	p->blocked_on = m;
+}
+
+static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+	guard(raw_spinlock_irqsave)(&m->wait_lock);
+	__set_task_blocked_on(p, m);
+}
+
+static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+	WARN_ON_ONCE(!m);
+	/* Currently we serialize blocked_on under the mutex::wait_lock */
+	lockdep_assert_held_once(&m->wait_lock);
+	/*
+	 * There may be cases where we re-clear already cleared
+	 * blocked_on relationships, but make sure we are not
+	 * clearing the relationship with a different lock.
+	 */
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	p->blocked_on = NULL;
+}
+
+static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+	guard(raw_spinlock_irqsave)(&m->wait_lock);
+	__clear_task_blocked_on(p, m);
+}
+
+static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
+{
+	struct mutex *m = p->blocked_on;
+
+	if (m)
+		lockdep_assert_held_once(&m->wait_lock);
+	return m;
+}
+
 static __always_inline bool need_resched(void)
 {
 	return unlikely(tif_need_resched());
@@ -2216,8 +2268,6 @@ extern bool sched_task_on_rq(struct task_struct *p);
 extern unsigned long get_wchan(struct task_struct *p);
 extern struct task_struct *cpu_curr_snapshot(int cpu);
 
-#include <linux/spinlock.h>
-
 /*
  * In order to reduce various lock holder preemption latencies provide an
  * interface to see if a vCPU is currently running or not.
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 758b7a6792b0c..949103fd8e9b5 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 	lockdep_assert_held(&lock->wait_lock);
 
 	/* Current thread can't be already blocked (since it's executing!) */
-	DEBUG_LOCKS_WARN_ON(task->blocked_on);
+	DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task));
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			 struct task_struct *task)
 {
-	struct mutex *blocked_on = READ_ONCE(task->blocked_on);
+	struct mutex *blocked_on = __get_task_blocked_on(task);
 
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index e2f59863a866e..80d778fedd605 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -644,8 +644,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			goto err_early_kill;
 	}
 
-	WARN_ON(current->blocked_on);
-	current->blocked_on = lock;
+	__set_task_blocked_on(current, lock);
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
 	for (;;) {
@@ -685,9 +684,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		/*
 		 * As we likely have been woken up by task
 		 * that has cleared our blocked_on state, re-set
-		 * it to the lock we are trying to aquire.
+		 * it to the lock we are trying to acquire.
 		 */
-		current->blocked_on = lock;
+		set_task_blocked_on(current, lock);
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -699,11 +698,15 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 		if (first) {
 			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
-			/* clear blocked_on as mutex_optimistic_spin may schedule() */
-			current->blocked_on = NULL;
+			/*
+			 * mutex_optimistic_spin() can call schedule(), so
+			 * clear blocked_on so we don't become unselectable
+			 * to run.
+			 */
+			clear_task_blocked_on(current, lock);
 			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
 				break;
-			current->blocked_on = lock;
+			set_task_blocked_on(current, lock);
 			trace_contention_begin(lock, LCB_F_MUTEX);
 		}
 
@@ -711,7 +714,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	}
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 acquired:
-	current->blocked_on = NULL;
+	__clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
 
 	if (ww_ctx) {
@@ -741,11 +744,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	return 0;
 
 err:
-	current->blocked_on = NULL;
+	__clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
-	WARN_ON(current->blocked_on);
+	WARN_ON(__get_task_blocked_on(current));
 	trace_contention_end(lock, ret);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 	debug_mutex_free_waiter(&waiter);
@@ -956,14 +959,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		/*
-		 * Unlock wakeups can be happening in parallel
-		 * (when optimistic spinners steal and release
-		 * the lock), so blocked_on may already be
-		 * cleared here.
-		 */
-		WARN_ON(next->blocked_on && next->blocked_on != lock);
-		next->blocked_on = NULL;
+		__clear_task_blocked_on(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 00db40946328e..086fd5487ca77 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -289,9 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		 * blocked_on pointer. Otherwise we can see circular
 		 * blocked_on relationships that can't resolve.
 		 */
-		WARN_ON(waiter->task->blocked_on &&
-			waiter->task->blocked_on != lock);
-		waiter->task->blocked_on = NULL;
+		__clear_task_blocked_on(waiter->task, lock);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -345,7 +343,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 			 * blocked_on pointer. Otherwise we can see circular
 			 * blocked_on relationships that can't resolve.
 			 */
-			owner->blocked_on = NULL;
+			__clear_task_blocked_on(owner, lock);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
-- 
2.50.0.727.gbf7dc18ff4-goog
Re: [RESEND][PATCH v18 3/8] locking/mutex: Add p->blocked_on wrappers for correctness checks
Posted by K Prateek Nayak 3 months ago
Hello John,

On 7/8/2025 2:13 AM, John Stultz wrote:
> @@ -2177,6 +2178,57 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
>  	__cond_resched_rwlock_write(lock);					\
>  })
>  
> +static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
> +{
> +	WARN_ON_ONCE(!m);
> +	/* The task should only be setting itself as blocked */
> +	WARN_ON_ONCE(p != current);
> +	/* Currently we serialize blocked_on under the mutex::wait_lock */
> +	lockdep_assert_held_once(&m->wait_lock);

Sorry I didn't catch this earlier but building this with PREEMPT_RT fails
with the following error:

    ./include/linux/sched.h: In function ‘__set_task_blocked_on’:
    ./include/linux/sched.h:2187:36: error: ‘struct mutex’ has no member named ‘wait_lock’
     2187 |         lockdep_assert_held_once(&m->wait_lock);


Putting all these helpers behind an "#ifndef CONFIG_PREEMPT_RT" will then
lead to this error:

    kernel/locking/ww_mutex.h:292:17: error: implicit declaration of function ‘__clear_task_blocked_on’ [-Werror=implicit-function-declaration]
      292 |                 __clear_task_blocked_on(waiter->task, lock);


Putting the "__clear_task_blocked_on()" calls in kernel/locking/ww_mutex.h
behind "#ifndef WW_RT" (which should start from Patch 2 since MUTEX and
MUTEX_WAITER for ww_mutex will resolve to rt_mutex and rt_mutex_waiter in
the presence of WW_RT) solves all build issues with PREEMPT_RT. I'll let
others comment on the correctness tho :)
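
For reference, the failure is structural rather than a missing include:
with PREEMPT_RT, struct mutex wraps an rt_mutex and simply has no
wait_lock member. Roughly (paraphrasing include/linux/mutex_types.h):

    #ifdef CONFIG_PREEMPT_RT
    struct mutex {
            struct rt_mutex_base    rtmutex;
    #ifdef CONFIG_DEBUG_LOCK_ALLOC
            struct lockdep_map      dep_map;
    #endif
    };
    #endif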


Following is the diff I used on top of this series for reference:

(build and boot tested only)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1d7f625adbb5..d47c9e4edf4f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2178,6 +2178,8 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 	__cond_resched_rwlock_write(lock);					\
 })
 
+#ifndef CONFIG_PREEMPT_RT
+
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
 	WARN_ON_ONCE(!m);
@@ -2229,6 +2231,8 @@ static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 	return m;
 }
 
+#endif
+
 static __always_inline bool need_resched(void)
 {
 	return unlikely(tif_need_resched());
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 086fd5487ca7..fd67a4a49892 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -283,13 +283,14 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 	if (waiter->ww_ctx->acquired > 0 && __ww_ctx_less(waiter->ww_ctx, ww_ctx)) {
 #ifndef WW_RT
 		debug_mutex_wake_waiter(lock, waiter);
-#endif
+
 		/*
 		 * When waking up the task to die, be sure to clear the
 		 * blocked_on pointer. Otherwise we can see circular
 		 * blocked_on relationships that can't resolve.
 		 */
 		__clear_task_blocked_on(waiter->task, lock);
+#endif
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -338,12 +339,14 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		 * wakeup pending to re-read the wounded state.
 		 */
 		if (owner != current) {
+#ifndef WW_RT
 			/*
 			 * When waking up the task to wound, be sure to clear the
 			 * blocked_on pointer. Otherwise we can see circular
 			 * blocked_on relationships that can't resolve.
 			 */
 			__clear_task_blocked_on(owner, lock);
+#endif
 			wake_q_add(wake_q, owner);
 		}
 		return true;
---

I'll make sure to give the PREEMPT_RT build a spin next time around.
Sorry for the oversight.

> +	/*
> +	 * Check to ensure we don't overwrite an existing mutex
> +	 * value with a different mutex. Note, setting it to the
> +	 * same lock repeatedly is ok.
> +	 */
> +	WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
> +	p->blocked_on = m;
> +}
> +
> +static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
> +{
> +	guard(raw_spinlock_irqsave)(&m->wait_lock);
> +	__set_task_blocked_on(p, m);
> +}
> +
> +static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
> +{
> +	WARN_ON_ONCE(!m);
> +	/* Currently we serialize blocked_on under the mutex::wait_lock */
> +	lockdep_assert_held_once(&m->wait_lock);
> +	/*
> +	 * There may be cases where we re-clear already cleared
> +	 * blocked_on relationships, but make sure we are not
> +	 * clearing the relationship with a different lock.
> +	 */
> +	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
> +	p->blocked_on = NULL;
> +}
> +
> +static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
> +{
> +	guard(raw_spinlock_irqsave)(&m->wait_lock);
> +	__clear_task_blocked_on(p, m);
> +}
> +
> +static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
> +{
> +	struct mutex *m = p->blocked_on;
> +
> +	if (m)
> +		lockdep_assert_held_once(&m->wait_lock);
> +	return m;
> +}
> +
>  static __always_inline bool need_resched(void)
>  {
>  	return unlikely(tif_need_resched());


-- 
Thanks and Regards,
Prateek

Re: [RESEND][PATCH v18 3/8] locking/mutex: Add p->blocked_on wrappers for correctness checks
Posted by John Stultz 3 months ago
On Mon, Jul 7, 2025 at 11:44 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 7/8/2025 2:13 AM, John Stultz wrote:
> > @@ -2177,6 +2178,57 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
> >       __cond_resched_rwlock_write(lock);                                      \
> >  })
> >
> > +static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
> > +{
> > +     WARN_ON_ONCE(!m);
> > +     /* The task should only be setting itself as blocked */
> > +     WARN_ON_ONCE(p != current);
> > +     /* Currently we serialize blocked_on under the mutex::wait_lock */
> > +     lockdep_assert_held_once(&m->wait_lock);
>
> Sorry I didn't catch this earlier but building this with PREEMPT_RT fails
> with the following error:
...
> @@ -338,12 +339,14 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
>                  * wakeup pending to re-read the wounded state.
>                  */
>                 if (owner != current) {
> +#ifndef WW_RT
>                         /*
>                          * When waking up the task to wound, be sure to clear the
>                          * blocked_on pointer. Otherwise we can see circular
>                          * blocked_on relationships that can't resolve.
>                          */
>                         __clear_task_blocked_on(owner, lock);
> +#endif
>                         wake_q_add(wake_q, owner);
>                 }
>                 return true;
> ---
>
> I'll make sure to give the PREEMPT_RT build a spin next time around.
> Sorry for the oversight.

No no, thank you for letting me know! Very much appreciate the detailed testing!

Between PROXY_EXEC, SMP and PREEMPT_RT, there's a fair number of
relevant combinations to test, and I've only done so in an ad-hoc
fashion (and I'm sure I'm still missing some).
So I've added a few defconfigs along with a build script to my full
tree so I can better automate build testing across all the patches in
the series, so this hopefully doesn't bite me again.
(Though let me know if you find something else, as I probably will
need to add another defconfig to test).

I've also tried to rework your suggestion above a little more deeply
into the helper logic so we don't have as many ifdefs in the
functions.  It unfortunately touches a lot of the patch series, so it's
been slow going getting all the changes into the most sensible spots.
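
As a rough sketch of that direction (hypothetical, not the final v19
code): keep the #ifdef in one spot in sched.h and let the RT side
compile the helpers away, e.g.:

    #ifndef CONFIG_PREEMPT_RT
    /* the wrapper functions as in this patch */
    #else
    /*
     * blocked_on is only tracked for non-RT mutexes; a macro stub
     * keeps this type-agnostic for WW_RT, where the ww_mutex lock
     * type resolves to rt_mutex.
     */
    # define __clear_task_blocked_on(p, m)  do { } while (0)
    #endif

so the call sites in ww_mutex.h wouldn't each need their own #ifndef
WW_RT guards.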

I'll continue testing and refining the changes locally, and if there's
no other feedback on the series I'll send out a v19 near the end of
the week.

thanks
-john