[RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on

John Stultz posted 7 patches 1 year, 2 months ago
[RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
Posted by John Stultz 1 year, 2 months ago
From: Peter Zijlstra <peterz@infradead.org>

Track the blocked-on relation for mutexes, to allow following this
relation at schedule time.

   task
     | blocked-on
     v
   mutex
     | owner
     v
   task

Also add a blocked_on_state value so we can distinguish when a
task is blocked_on a mutex, but is either blocked, waking up, or
runnable (such that it can try to acquire the lock it's blocked
on).

This avoids some of the subtle & racy games where the blocked_on
state gets cleared, only to have it re-added by the
mutex_lock_slowpath call when it tries to acquire the lock on
wakeup.
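
The intended state flow, roughly, is:

   BO_RUNNABLE -> BO_BLOCKED   (__mutex_lock_common() sets blocked_on
                                and marks the task blocked)
   BO_BLOCKED  -> BO_WAKING    (mutex unlock, or ww-mutex die/wound,
                                wakes the waiter)
   BO_WAKING   -> BO_RUNNABLE  (try_to_wake_up() lets the task try to
                                take the lock again)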

Also add blocked_lock to the task_struct so we can safely
serialize the blocked-on state.

Finally, add wrappers that are useful for providing correctness
checks. Folded in from a patch by:
  Valentin Schneider <valentin.schneider@arm.com>

This will all be used for tracking blocked-task/mutex chains
with the proxy-execution patches in a similar fashion to how
priority inheritance is done with rt_mutexes.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[minor changes while rebasing]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Fixed blocked_on tracking in error paths that was causing crashes
v4:
* Ensure we clear blocked_on when waking ww_mutexes to die or wound.
  This is critical so we don't get circular blocked_on relationships
  that can't be resolved.
v5:
* Fix potential bug where the skip_wait path might clear blocked_on
  when that path never set it
* Slight tweaks to where we set blocked_on to make it consistent,
  along with extra WARN_ON correctness checking
* Minor comment changes
v7:
* Minor commit message change suggested by Metin Kaya
* Fix WARN_ON conditionals in unlock path (as blocked_on might already
  be cleared), found while looking at an issue Metin Kaya raised.
* Minor tweaks to be consistent in what we do under the
  blocked_on lock, also tweaked variable name to avoid confusion
  with label, and comment typos, as suggested by Metin Kaya
* Minor tweak for CONFIG_SCHED_PROXY_EXEC name change
* Moved unused block of code to later in the series, as suggested
  by Metin Kaya
* Switch to a tri-state to be able to distinguish from waking and
  runnable so we can later safely do return migration from ttwu
* Folded together with related blocked_on changes
v8:
* Fix issue leaving task BO_BLOCKED when calling into optimistic
  spinning path.
* Include helper to better handle BO_BLOCKED->BO_WAKING transitions
v9:
* Typo fixup pointed out by Metin
* Cleanup BO_WAKING->BO_RUNNABLE transitions for the !proxy case
* Many cleanups and simplifications suggested by Metin
v11:
* Whitespace fixup pointed out by Metin
v13:
* Refactor set_blocked_on helpers to clean things up a bit
v14:
* Small build fixup with PREEMPT_RT
---
 include/linux/sched.h        | 66 ++++++++++++++++++++++++++++++++----
 init/init_task.c             |  1 +
 kernel/fork.c                |  4 +--
 kernel/locking/mutex-debug.c |  9 ++---
 kernel/locking/mutex.c       | 40 ++++++++++++++++++----
 kernel/locking/ww_mutex.h    | 24 +++++++++++--
 kernel/sched/core.c          |  1 +
 7 files changed, 125 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 24e338ac34d7b..0ad8033f8c2b9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include <linux/sched/prio.h>
 #include <linux/sched/types.h>
 #include <linux/signal_types.h>
+#include <linux/spinlock.h>
 #include <linux/syscall_user_dispatch_types.h>
 #include <linux/mm_types_task.h>
 #include <linux/netdevice_xmit.h>
@@ -775,6 +776,12 @@ struct kmap_ctrl {
 #endif
 };
 
+enum blocked_on_state {
+	BO_RUNNABLE,
+	BO_BLOCKED,
+	BO_WAKING,
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -1195,10 +1202,9 @@ struct task_struct {
 	struct rt_mutex_waiter		*pi_blocked_on;
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
-	/* Mutex deadlock detection: */
-	struct mutex_waiter		*blocked_on;
-#endif
+	enum blocked_on_state		blocked_on_state;
+	struct mutex			*blocked_on;	/* lock we're blocked on */
+	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	int				non_block_count;
@@ -2118,6 +2124,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 	__cond_resched_rwlock_write(lock);					\
 })
 
+static inline void __set_blocked_on_runnable(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	if (p->blocked_on_state == BO_WAKING)
+		p->blocked_on_state = BO_RUNNABLE;
+}
+
+static inline void set_blocked_on_runnable(struct task_struct *p)
+{
+	unsigned long flags;
+
+	if (!sched_proxy_exec())
+		return;
+
+	raw_spin_lock_irqsave(&p->blocked_lock, flags);
+	__set_blocked_on_runnable(p);
+	raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
+}
+
+static inline void set_blocked_on_waking(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	if (p->blocked_on_state == BO_BLOCKED)
+		p->blocked_on_state = BO_WAKING;
+}
+
+static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	/*
+	 * Check we are clearing values to NULL or setting NULL
+	 * to values to ensure we don't overwrite existing mutex
+	 * values or clear already cleared values
+	 */
+	WARN_ON((!m && !p->blocked_on) || (m && p->blocked_on));
+
+	p->blocked_on = m;
+	p->blocked_on_state = m ? BO_BLOCKED : BO_RUNNABLE;
+}
+
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	return p->blocked_on;
+}
+
 static __always_inline bool need_resched(void)
 {
 	return unlikely(tif_need_resched());
@@ -2157,8 +2213,6 @@ extern bool sched_task_on_rq(struct task_struct *p);
 extern unsigned long get_wchan(struct task_struct *p);
 extern struct task_struct *cpu_curr_snapshot(int cpu);
 
-#include <linux/spinlock.h>
-
 /*
  * In order to reduce various lock holder preemption latencies provide an
  * interface to see if a vCPU is currently running or not.
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd906..7e29d86153d9f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -140,6 +140,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
 	.timer_slack_ns = 50000, /* 50 usec default slack */
 	.thread_pid	= &init_struct_pid,
 	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index f253e81d0c28e..160bead843afb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2231,6 +2231,7 @@ __latent_entropy struct task_struct *copy_process(
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
+	raw_spin_lock_init(&p->blocked_lock);
 
 	lockdep_assert_irqs_enabled();
 #ifdef CONFIG_PROVE_LOCKING
@@ -2329,9 +2330,8 @@ __latent_entropy struct task_struct *copy_process(
 	lockdep_init_task(p);
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
+	p->blocked_on_state = BO_RUNNABLE;
 	p->blocked_on = NULL; /* not blocked yet */
-#endif
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
 	p->sequential_io_avg	= 0;
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 6e6f6071cfa27..1d8cff71f65e1 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -53,17 +53,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 {
 	lockdep_assert_held(&lock->wait_lock);
 
-	/* Mark the current thread as blocked on the lock: */
-	task->blocked_on = waiter;
+	/* Current thread can't be already blocked (since it's executing!) */
+	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			 struct task_struct *task)
 {
+	struct mutex *blocked_on = get_task_blocked_on(task);
+
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
-	DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
-	task->blocked_on = NULL;
+	DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);
 
 	INIT_LIST_HEAD(&waiter->list);
 	waiter->task = NULL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 3302e52f0c967..8f5d3fe6c1029 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -597,6 +597,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	}
 
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
+	raw_spin_lock(&current->blocked_lock);
 	/*
 	 * After waiting to acquire the wait_lock, try again.
 	 */
@@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			goto err_early_kill;
 	}
 
+	set_task_blocked_on(current, lock);
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
 	for (;;) {
@@ -639,7 +641,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		 * the handoff.
 		 */
 		if (__mutex_trylock(lock))
-			goto acquired;
+			break; /* acquired */;
 
 		/*
 		 * Check for signals and kill conditions while holding
@@ -657,6 +659,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 				goto err;
 		}
 
+		raw_spin_unlock(&current->blocked_lock);
 		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		/* Make sure we do wakeups before calling schedule */
 		wake_up_q(&wake_q);
@@ -666,6 +669,13 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 		first = __mutex_waiter_is_first(lock, &waiter);
 
+		raw_spin_lock_irqsave(&lock->wait_lock, flags);
+		raw_spin_lock(&current->blocked_lock);
+
+		/*
+		 * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE
+		 */
+		current->blocked_on_state = BO_BLOCKED;
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -676,16 +686,26 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			break;
 
 		if (first) {
+			bool opt_acquired;
+
+			/*
+			 * mutex_optimistic_spin() can schedule, so  we need to
+			 * release these locks before calling it.
+			 */
+			current->blocked_on_state = BO_RUNNABLE;
+			raw_spin_unlock(&current->blocked_lock);
+			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
-			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+			opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+			raw_spin_lock_irqsave(&lock->wait_lock, flags);
+			raw_spin_lock(&current->blocked_lock);
+			current->blocked_on_state = BO_BLOCKED;
+			if (opt_acquired)
 				break;
 			trace_contention_begin(lock, LCB_F_MUTEX);
 		}
-
-		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	}
-	raw_spin_lock_irqsave(&lock->wait_lock, flags);
-acquired:
+	set_task_blocked_on(current, NULL);
 	__set_current_state(TASK_RUNNING);
 
 	if (ww_ctx) {
@@ -710,16 +730,20 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	if (ww_ctx)
 		ww_mutex_lock_acquired(ww, ww_ctx);
 
+	raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	wake_up_q(&wake_q);
 	preempt_enable();
 	return 0;
 
 err:
+	set_task_blocked_on(current, NULL);
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
+	WARN_ON(get_task_blocked_on(current));
 	trace_contention_end(lock, ret);
+	raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, ip);
@@ -928,8 +952,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 		next = waiter->task;
 
+		raw_spin_lock(&next->blocked_lock);
 		debug_mutex_wake_waiter(lock, waiter);
+		WARN_ON(get_task_blocked_on(next) != lock);
+		set_blocked_on_waking(next);
 		wake_q_add(&wake_q, next);
+		raw_spin_unlock(&next->blocked_lock);
 	}
 
 	if (owner & MUTEX_FLAG_HANDOFF)
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 37f025a096c9d..d9ff2022eef6f 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -281,10 +281,21 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		return false;
 
 	if (waiter->ww_ctx->acquired > 0 && __ww_ctx_less(waiter->ww_ctx, ww_ctx)) {
+		/* nested as we should hold current->blocked_lock already */
+		raw_spin_lock_nested(&waiter->task->blocked_lock, SINGLE_DEPTH_NESTING);
 #ifndef WW_RT
 		debug_mutex_wake_waiter(lock, waiter);
+		/*
+		 * When waking up the task to die, be sure to set the
+		 * blocked_on_state to WAKING. Otherwise we can see
+		 * circular blocked_on relationships that can't
+		 * resolve.
+		 */
+		WARN_ON(get_task_blocked_on(waiter->task) != lock);
 #endif
+		set_blocked_on_waking(waiter->task);
 		wake_q_add(wake_q, waiter->task);
+		raw_spin_unlock(&waiter->task->blocked_lock);
 	}
 
 	return true;
@@ -331,9 +342,18 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		 * it's wounded in __ww_mutex_check_kill() or has a
 		 * wakeup pending to re-read the wounded state.
 		 */
-		if (owner != current)
+		if (owner != current) {
+			/* nested as we should hold current->blocked_lock already */
+			raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING);
+			/*
+			 * When waking up the task to wound, be sure to set the
+			 * blocked_on_state flag. Otherwise we can see circular
+			 * blocked_on relationships that can't resolve.
+			 */
+			set_blocked_on_waking(owner);
 			wake_q_add(wake_q, owner);
-
+			raw_spin_unlock(&owner->blocked_lock);
+		}
 		return true;
 	}
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d712e177d3b75..f8714050b6d0d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4350,6 +4350,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
+	set_blocked_on_runnable(p);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
-- 
2.47.0.371.ga323438b13-goog
Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
Posted by Peter Zijlstra 1 year, 1 month ago
On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:

> Also add a blocked_on_state value so we can distinguish when a
> task is blocked_on a mutex, but is either blocked, waking up, or
> runnable (such that it can try to acquire the lock its blocked
> on).
> 
> This avoids some of the subtle & racy games where the blocked_on
> state gets cleared, only to have it re-added by the
> mutex_lock_slowpath call when it tries to acquire the lock on
> wakeup

If you can remember those subtle cases, I'm sure our future selves
would've loved it if you wrote a comment to go with these states :-)

> +enum blocked_on_state {
> +	BO_RUNNABLE,
> +	BO_BLOCKED,
> +	BO_WAKING,
> +};
> +
>  struct task_struct {
>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>  	/*
> @@ -1195,10 +1202,9 @@ struct task_struct {
>  	struct rt_mutex_waiter		*pi_blocked_on;
>  #endif
>  
> -#ifdef CONFIG_DEBUG_MUTEXES
> -	/* Mutex deadlock detection: */
> -	struct mutex_waiter		*blocked_on;
> -#endif
> +	enum blocked_on_state		blocked_on_state;
> +	struct mutex			*blocked_on;	/* lock we're blocked on */
> +	raw_spinlock_t			blocked_lock;
>  
>  #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
>  	int				non_block_count;
> @@ -2118,6 +2124,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
>  	__cond_resched_rwlock_write(lock);					\
>  })
>  
> +static inline void __set_blocked_on_runnable(struct task_struct *p)
> +{
> +	lockdep_assert_held(&p->blocked_lock);
> +
> +	if (p->blocked_on_state == BO_WAKING)
> +		p->blocked_on_state = BO_RUNNABLE;
> +}
> +
> +static inline void set_blocked_on_runnable(struct task_struct *p)
> +{
> +	unsigned long flags;
> +
> +	if (!sched_proxy_exec())
> +		return;
> +
> +	raw_spin_lock_irqsave(&p->blocked_lock, flags);

Do we want to make this:

	guard(raw_spinlock_irqsave)(&p->blocked_lock);

?

> +	__set_blocked_on_runnable(p);
> +	raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
> +}
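
That is, something like this (untested):

	static inline void set_blocked_on_runnable(struct task_struct *p)
	{
		if (!sched_proxy_exec())
			return;

		/* lock is dropped automatically when the scope ends */
		guard(raw_spinlock_irqsave)(&p->blocked_lock);
		__set_blocked_on_runnable(p);
	}

which saves the flags variable and the explicit unlock.
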
> +
> +static inline void set_blocked_on_waking(struct task_struct *p)

consistent naming would be __set_blocked_on_waking()

> +{
> +	lockdep_assert_held(&p->blocked_lock);
> +
> +	if (p->blocked_on_state == BO_BLOCKED)
> +		p->blocked_on_state = BO_WAKING;
> +}
> +
> +static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)

__set_task_blocked_on()

> +{
> +	lockdep_assert_held(&p->blocked_lock);
> +
> +	/*
> +	 * Check we are clearing values to NULL or setting NULL
> +	 * to values to ensure we don't overwrite existing mutex
> +	 * values or clear already cleared values
> +	 */
> +	WARN_ON((!m && !p->blocked_on) || (m && p->blocked_on));
> +
> +	p->blocked_on = m;
> +	p->blocked_on_state = m ? BO_BLOCKED : BO_RUNNABLE;
> +}
> +
> +static inline struct mutex *get_task_blocked_on(struct task_struct *p)

__get_task_blocked_on()

> +{
> +	lockdep_assert_held(&p->blocked_lock);
> +
> +	return p->blocked_on;
> +}

That gets us the __ prefix if the caller is assumed to have taken
blocked_lock.

>  static __always_inline bool need_resched(void)
>  {
>  	return unlikely(tif_need_resched());
> @@ -2157,8 +2213,6 @@ extern bool sched_task_on_rq(struct task_struct *p);
>  extern unsigned long get_wchan(struct task_struct *p);
>  extern struct task_struct *cpu_curr_snapshot(int cpu);
>  
> -#include <linux/spinlock.h>
> -
>  /*
>   * In order to reduce various lock holder preemption latencies provide an
>   * interface to see if a vCPU is currently running or not.
> diff --git a/init/init_task.c b/init/init_task.c
> index e557f622bd906..7e29d86153d9f 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -140,6 +140,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
>  	.journal_info	= NULL,
>  	INIT_CPU_TIMERS(init_task)
>  	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
> +	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
>  	.timer_slack_ns = 50000, /* 50 usec default slack */
>  	.thread_pid	= &init_struct_pid,
>  	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f253e81d0c28e..160bead843afb 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2231,6 +2231,7 @@ __latent_entropy struct task_struct *copy_process(
>  	ftrace_graph_init_task(p);
>  
>  	rt_mutex_init_task(p);
> +	raw_spin_lock_init(&p->blocked_lock);
>  
>  	lockdep_assert_irqs_enabled();
>  #ifdef CONFIG_PROVE_LOCKING
> @@ -2329,9 +2330,8 @@ __latent_entropy struct task_struct *copy_process(
>  	lockdep_init_task(p);
>  #endif
>  
> -#ifdef CONFIG_DEBUG_MUTEXES
> +	p->blocked_on_state = BO_RUNNABLE;
>  	p->blocked_on = NULL; /* not blocked yet */
> -#endif
>  #ifdef CONFIG_BCACHE
>  	p->sequential_io	= 0;
>  	p->sequential_io_avg	= 0;
> diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
> index 6e6f6071cfa27..1d8cff71f65e1 100644
> --- a/kernel/locking/mutex-debug.c
> +++ b/kernel/locking/mutex-debug.c
> @@ -53,17 +53,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
>  {
>  	lockdep_assert_held(&lock->wait_lock);
>  
> -	/* Mark the current thread as blocked on the lock: */
> -	task->blocked_on = waiter;
> +	/* Current thread can't be already blocked (since it's executing!) */
> +	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
>  }
>  
>  void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
>  			 struct task_struct *task)
>  {
> +	struct mutex *blocked_on = get_task_blocked_on(task);
> +
>  	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
>  	DEBUG_LOCKS_WARN_ON(waiter->task != task);
> -	DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
> -	task->blocked_on = NULL;
> +	DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);
>  
>  	INIT_LIST_HEAD(&waiter->list);
>  	waiter->task = NULL;
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 3302e52f0c967..8f5d3fe6c1029 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -597,6 +597,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  	}
>  
>  	raw_spin_lock_irqsave(&lock->wait_lock, flags);
> +	raw_spin_lock(&current->blocked_lock);
>  	/*
>  	 * After waiting to acquire the wait_lock, try again.
>  	 */
> @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  			goto err_early_kill;
>  	}
>  
> +	set_task_blocked_on(current, lock);
>  	set_current_state(state);

blocked_on_state mirrors task-state

>  	trace_contention_begin(lock, LCB_F_MUTEX);
>  	for (;;) {
> @@ -639,7 +641,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  		 * the handoff.
>  		 */
>  		if (__mutex_trylock(lock))
> -			goto acquired;
> +			break; /* acquired */;
>  
>  		/*
>  		 * Check for signals and kill conditions while holding
> @@ -657,6 +659,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  				goto err;
>  		}
>  
> +		raw_spin_unlock(&current->blocked_lock);
>  		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  		/* Make sure we do wakeups before calling schedule */
>  		wake_up_q(&wake_q);
> @@ -666,6 +669,13 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  
>  		first = __mutex_waiter_is_first(lock, &waiter);
>  
> +		raw_spin_lock_irqsave(&lock->wait_lock, flags);
> +		raw_spin_lock(&current->blocked_lock);
> +
> +		/*
> +		 * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE
> +		 */
> +		current->blocked_on_state = BO_BLOCKED;
>  		set_current_state(state);

And blocked_on_state again mirrors task-state

>  		/*
>  		 * Here we order against unlock; we must either see it change
> @@ -676,16 +686,26 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  			break;
>  
>  		if (first) {
> +			bool opt_acquired;
> +
> +			/*
> +			 * mutex_optimistic_spin() can schedule, so  we need to
> +			 * release these locks before calling it.
> +			 */
> +			current->blocked_on_state = BO_RUNNABLE;
> +			raw_spin_unlock(&current->blocked_lock);
> +			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
> -			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
> +			opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);

I'm confused -- didn't I kill optimistic spinning for proxy? IIRC it
fundamentally doesn't make sense since we do a hand-off to the donor
thread.

> +			raw_spin_lock_irqsave(&lock->wait_lock, flags);
> +			raw_spin_lock(&current->blocked_lock);
> +			current->blocked_on_state = BO_BLOCKED;
> +			if (opt_acquired)
>  				break;
>  			trace_contention_begin(lock, LCB_F_MUTEX);
>  		}
> -
> -		raw_spin_lock_irqsave(&lock->wait_lock, flags);
>  	}
> -	raw_spin_lock_irqsave(&lock->wait_lock, flags);
> -acquired:
> +	set_task_blocked_on(current, NULL);
>  	__set_current_state(TASK_RUNNING);

And again..

>  
>  	if (ww_ctx) {
> @@ -710,16 +730,20 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  	if (ww_ctx)
>  		ww_mutex_lock_acquired(ww, ww_ctx);
>  
> +	raw_spin_unlock(&current->blocked_lock);
>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  	wake_up_q(&wake_q);
>  	preempt_enable();
>  	return 0;
>  
>  err:
> +	set_task_blocked_on(current, NULL);
>  	__set_current_state(TASK_RUNNING);

and one again..

>  	__mutex_remove_waiter(lock, &waiter);
>  err_early_kill:
> +	WARN_ON(get_task_blocked_on(current));
>  	trace_contention_end(lock, ret);
> +	raw_spin_unlock(&current->blocked_lock);
>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  	debug_mutex_free_waiter(&waiter);
>  	mutex_release(&lock->dep_map, ip);
> @@ -928,8 +952,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>  
>  		next = waiter->task;
>  
> +		raw_spin_lock(&next->blocked_lock);
>  		debug_mutex_wake_waiter(lock, waiter);
> +		WARN_ON(get_task_blocked_on(next) != lock);
> +		set_blocked_on_waking(next);

and more..

>  		wake_q_add(&wake_q, next);
> +		raw_spin_unlock(&next->blocked_lock);
>  	}
>  
>  	if (owner & MUTEX_FLAG_HANDOFF)
> diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
> index 37f025a096c9d..d9ff2022eef6f 100644
> --- a/kernel/locking/ww_mutex.h
> +++ b/kernel/locking/ww_mutex.h
> @@ -281,10 +281,21 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
>  		return false;
>  
>  	if (waiter->ww_ctx->acquired > 0 && __ww_ctx_less(waiter->ww_ctx, ww_ctx)) {
> +		/* nested as we should hold current->blocked_lock already */
> +		raw_spin_lock_nested(&waiter->task->blocked_lock, SINGLE_DEPTH_NESTING);
>  #ifndef WW_RT
>  		debug_mutex_wake_waiter(lock, waiter);
> +		/*
> +		 * When waking up the task to die, be sure to set the
> +		 * blocked_on_state to WAKING. Otherwise we can see
> +		 * circular blocked_on relationships that can't
> +		 * resolve.
> +		 */
> +		WARN_ON(get_task_blocked_on(waiter->task) != lock);
>  #endif
> +		set_blocked_on_waking(waiter->task);
>  		wake_q_add(wake_q, waiter->task);
> +		raw_spin_unlock(&waiter->task->blocked_lock);
>  	}
>  
>  	return true;
> @@ -331,9 +342,18 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
>  		 * it's wounded in __ww_mutex_check_kill() or has a
>  		 * wakeup pending to re-read the wounded state.
>  		 */
> -		if (owner != current)
> +		if (owner != current) {
> +			/* nested as we should hold current->blocked_lock already */
> +			raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING);
> +			/*
> +			 * When waking up the task to wound, be sure to set the
> +			 * blocked_on_state flag. Otherwise we can see circular
> +			 * blocked_on relationships that can't resolve.
> +			 */
> +			set_blocked_on_waking(owner);
>  			wake_q_add(wake_q, owner);
> -
> +			raw_spin_unlock(&owner->blocked_lock);
> +		}
>  		return true;
>  	}
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d712e177d3b75..f8714050b6d0d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4350,6 +4350,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  		ttwu_queue(p, cpu, wake_flags);
>  	}
>  out:
> +	set_blocked_on_runnable(p);
>  	if (success)
>  		ttwu_stat(p, task_cpu(p), wake_flags);
>  

All in all I'm properly confused by this patch... please write
more/better comments and changelog.
Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
Posted by John Stultz 1 year, 1 month ago
On Fri, Dec 13, 2024 at 3:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:
>
> > Also add a blocked_on_state value so we can distinguish when a
> > task is blocked_on a mutex, but is either blocked, waking up, or
> > runnable (such that it can try to acquire the lock its blocked
> > on).
> >
> > This avoids some of the subtle & racy games where the blocked_on
> > state gets cleared, only to have it re-added by the
> > mutex_lock_slowpath call when it tries to acquire the lock on
> > wakeup
>
> If you can remember those sublte cases, I'm sure our future selves
> would've loved it if you wrote a comment to go with these states :-)

Thanks so much for the review feedback! I really appreciate it!

So yes, the description can use improvement here. I at one time had
3-4 separate very fine grained patches (see the top 4 patches here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained/?after=c4cad6e353c00254a2dfbb227ef81d8c3827427d+35)
that I rolled into one when sending out (mostly to avoid overwhelming
folks), but the squished commit description isn't as clear.
So if it's helpful I can split this back out?

I'll also add some better comments as well.


> > +static inline void set_blocked_on_runnable(struct task_struct *p)
> > +{
> > +     unsigned long flags;
> > +
> > +     if (!sched_proxy_exec())
> > +             return;
> > +
> > +     raw_spin_lock_irqsave(&p->blocked_lock, flags);
>
> Do we want to make this:
>
>         guard(raw_spinlock_irqsave)(&p->blocked_lock);
>
> ?

Sure, sorry, while I've seen the guard bits, I've not yet really utilized them.
I'll take a stab at this for v15.

>
> > +     __set_blocked_on_runnable(p);
> > +     raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
> > +}
> > +
> > +static inline void set_blocked_on_waking(struct task_struct *p)
>
> consistent naming would be __set_blocked_on_waking()

Yeah, I started with the unlocked versions and then added the one case
that takes the lock, but that does make it inconsistent. I'll rework
all these. Thanks for pointing this out!


> > @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> >                       goto err_early_kill;
> >       }
> >
> > +     set_task_blocked_on(current, lock);
> >       set_current_state(state);
>
> blocked_on_state mirrors task-state

Yea. I know I always move a little fast in my talks, but at OSPM and
LPC I've raised the point that the blocked_on_state very much
seems like it aliases the task->__state.
(See slide 26: https://lpc.events/event/18/contributions/1887/attachments/1402/3074/LPC_%20Proxy%20Exec%20deep%20dive%20outline.pdf)

I've not quite worked out how to integrate it though.

My initial introduction of the blocked_on_state was mostly to help
detangle the code, as there were a number of cases originally where in
order to let the task be selected to try to acquire the mutex, we
cleared the task->blocked_on pointer.  But this often ran into races
with ttwu: once the blocked_on pointer was cleared, the task could get
selected to run before the return migration happened. So having the
tri-state BLOCKED->WAKING->RUNNABLE be explicit helped ensure we
enforced the rules properly so we didn't run on the wrong cpu.

Trying to squish this tristate into the task->__state has eluded me a
bit (mostly due to the lockfree manipulations being a little subtle -
defaulting to RUNNABLE has usually been safe, as the task will just
loop on the condition, but with this we really don't want to lose the
signal that we need to do a return migration before the task runs), so
I'd love your insight here.

One thing I've been thinking about is reworking the blocked_on_state
to instead just be a "needs_return" flag. This might simplify things
as we could only set needs_return when we proxy_migrate away from the
task->wake_cpu, so we wouldn't have to handle the WAKING->RUNNABLE
transition for tasks that were never migrated.  And, once it's down to
a single bit, that might work better as a task->__state flag (but
again, it has to prevent transition to TASK_RUNNING until we migrate
back).

Thoughts?
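
For concreteness, a very rough sketch of that idea (the flag and the
helpers below are hypothetical, nothing like this exists yet):

	/* hypothetical: only mark tasks that actually got proxy-migrated
	 * away from their wake_cpu */
	static inline void proxy_set_needs_return(struct task_struct *p)
	{
		if (task_cpu(p) != p->wake_cpu)
			WRITE_ONCE(p->needs_return, true);	/* hypothetical field */
	}

	/* hypothetical: the wakeup/pick side must not let p reach
	 * TASK_RUNNING until the return migration has happened */
	static inline bool task_needs_return(struct task_struct *p)
	{
		return READ_ONCE(p->needs_return);
	}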


> > @@ -676,16 +686,26 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> >                       break;
> >
> >               if (first) {
> > +                     bool opt_acquired;
> > +
> > +                     /*
> > +                      * mutex_optimistic_spin() can schedule, so  we need to
> > +                      * release these locks before calling it.
> > +                      */
> > +                     current->blocked_on_state = BO_RUNNABLE;
> > +                     raw_spin_unlock(&current->blocked_lock);
> > +                     raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> >                       trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
> > -                     if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
> > +                     opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
>
> I'm confused -- didn't I kill optimistic spinning for proxy? IIRC it
> fundamentally doesn't make sense since we do a hand-off to the donor
> thread.

So, the trouble was that killing optimistic spinning with proxy-exec
had a larger performance impact. So around v8 (this time last year, I
think), I worked to re-enable optimistic spinning.

The idea is that as long as the lock owner is running, we do the
optimistic spin, as there's no benefit to proxy boosting it (the owner
is already running, mission accomplished). But, if it's not on_cpu, we
consider ourselves blocked, and will go into __schedule() where we can
be selected and boost whomever the chain is waiting on. Then if we are
proxy boosting, on release we can hand it over to the boosting donor.

This feels like it nicely balances the normal non-proxy performance
where possible, only switching into proxy-boosting when things are
going to take longer to release.
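
I.e., roughly (simplified, not the actual code):

	struct task_struct *owner = __mutex_owner(lock);

	if (owner && READ_ONCE(owner->on_cpu)) {
		/* owner is running: just spin, boosting it buys nothing */
	} else {
		/*
		 * owner isn't on a cpu: consider ourselves blocked and go
		 * into __schedule(), where being selected lets us boost
		 * whomever the chain is waiting on; on release the lock
		 * can then be handed off to the boosting donor.
		 */
	}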

> All in all I'm properly confused by this patch... please write
> more/better comments changelog.

Sure, I think it's probably worth splitting it out, so I'll take a
swing at that before v15.

Definitely would appreciate your thoughts on the idea for a
TASK_NEEDS_RETURN __state flag or if you see a more elegant way to put
an intermediate step in place on the wakeup to ensure we don't wake
proxy-migrated tasks on the wrong cpu.

Thanks so much again for the input! Really appreciate the feedback!
-john
Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
Posted by Peter Zijlstra 1 year, 1 month ago
On Fri, Dec 13, 2024 at 07:39:57PM -0800, John Stultz wrote:
> On Fri, Dec 13, 2024 at 3:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:
> >
> > > Also add a blocked_on_state value so we can distinguish when a
> > > task is blocked_on a mutex, but is either blocked, waking up, or
> > > runnable (such that it can try to acquire the lock its blocked
> > > on).
> > >
> > > This avoids some of the subtle & racy games where the blocked_on
> > > state gets cleared, only to have it re-added by the
> > > mutex_lock_slowpath call when it tries to acquire the lock on
> > > wakeup
> >
> > If you can remember those sublte cases, I'm sure our future selves
> > would've loved it if you wrote a comment to go with these states :-)
> 
> Thanks so much for the review feedback! I really appreciate it!
> 
> So yes, the description can use improvement here. I at one time had
> 3-4 separate very fine grained patches (see the top 4 patches here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained/?after=c4cad6e353c00254a2dfbb227ef81d8c3827427d+35)
> that I rolled into one when sending out(mostly to avoid overwhelming
> folks), but the squished commit description isn't as clear.
> So if it's helpful I can split this back out?
> 
> I'll also add some better comments as well.

Not sure yet about splitting back out -- let me try and figure out what
all is actually done / needed.

So blocked_lock started out as another lock around ttwu(), in order to
serialize the task wakeup vs reading a remote ->blocked_on relation.

Since we do this with rq->lock held, it can't be ->pi_lock, and hence
->blocked_lock was born.

Later patches appear to have moved it into mutex, mirroring the
->wait_lock -- this is probably better.

/me goes chase that state thing for a bit..

So, this all seems to have started with:

  https://github.com/johnstultz-work/linux-dev/commit/c122552e07b75bb39d0297431799801de30f147f


> > > @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> > >                       goto err_early_kill;
> > >       }
> > >
> > > +     set_task_blocked_on(current, lock);
> > >       set_current_state(state);
> >
> > blocked_on_state mirrors task-state
> 
> Yea. I know I always move a little fast in my talks but at OSPM and
> LPC, but I've raised the point that the blocked_on_state very much
> seems like it aliases the task->__state.
> (See slide 26: https://lpc.events/event/18/contributions/1887/attachments/1402/3074/LPC_%20Proxy%20Exec%20deep%20dive%20outline.pdf)
> 
> I've not quite worked out how to integrate it though.
> 
> My initial introduction of the blocked_on_state was mostly to help
> detangle the code, as there were a number of cases originally where in
> order to let the task be selected to try to acquire the mutex, we
> cleared the task->blocked_on pointer.  But this often ran into races
> with ttwu, as if it cleared the blocked_on pointer, the task could get
> selected to run before the return migration happened. So having the
> tri-state BLOCKED->WAKING->RUNNABLE be explicit helped ensure we
> enforced the rules properly so we didn't run on the wrong cpu.

Right, so we already have a TASK_WAKING state, that is that
intermediate. But let me try and get a feel for how things relate;
TASK_WAKING is after we've determined it is eligible for wakeup, but
before it is enqueued -- e.g. it sits on the wakelist.

This BO_WAKING is not quite that. It happens before doing the actual
wakeup.

Also, when I pull your proxy-exec-v7-6.7-rc6-fine-grained and put that
next to your slides, I'm a little confused.

Specifically, slide 27 talks about a modification to try_to_wake_up() in
order to force the task into ttwu_runnable() such that it then hits
proxy_needs_return(). This latter part I can find, but not the former.

/me puzzles

Hmm, since a blocked task is on the runqueue and all that, you should
always walk that path, anything else would be buggy.

So fundamentally the problem is that you want to do a reverse migration
-- it needs to go back to a CPU where it is allowed to run. The way it
does this is to dequeue itself and let ttwu continue and clean up the
mess.

This all makes sense, but why can't you do this on the waking side? That
is, instead of setting BO_WAKING, do this whole if (!is_cpu_allowed())
deactivate_task() thing right there, before issuing the wakeup.

Notably, that sched_proxy_exec() block in __mutex_unlock_slowpath():

 - probably doesn't need current->blocked_lock, current isn't blocked
 
 - probably doesn't need lock->wait_lock, donor is stable under
   preempt_disable().


Something like the completely untested below (also, ttwu path needs
surgery now):

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2711af8c0052..47cfa01cd066 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -946,32 +946,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 	}
 
 	preempt_disable();
+	next = proxy_handoff(lock);
+
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	debug_mutex_unlock(lock);
 
-	if (sched_proxy_exec()) {
-		raw_spin_lock(&current->blocked_lock);
-		/*
-		 * If we have a task boosting current, and that task was boosting
-		 * current through this lock, hand the lock to that task, as that
-		 * is the highest waiter, as selected by the scheduling function.
-		 */
-		donor = current->blocked_donor;
-		if (donor) {
-			struct mutex *next_lock;
-
-			raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
-			next_lock = get_task_blocked_on(donor);
-			if (next_lock == lock) {
-				next = donor;
-				donor->blocked_on_state = BO_WAKING;
-				wake_q_add(&wake_q, donor);
-				current->blocked_donor = NULL;
-			}
-			raw_spin_unlock(&donor->blocked_lock);
-		}
-	}
-
 	/*
 	 * Failing that, pick any on the wait list.
 	 */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6eaffa913495..30d7371bb5c4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4035,6 +4035,53 @@ static inline void activate_blocked_entities(struct rq *target_rq,
 }
 #endif /* CONFIG_SCHED_PROXY_EXEC */
 
+struct task_struct *proxy_handoff(struct mutex *lock);
+{
+	struct task_struct *next;
+
+	if (!sched_proxy_exec())
+		return NULL;
+
+	/*
+	 * current->blocked_donor can't change if we can't schedule
+	 * caller needs to do this, since its needs stabiliy of return value
+	 */
+	lockdep_assert_preemption_disabled();
+	next = current->blocked_donor;
+	if (!next)
+		return NULL;
+
+	scoped_guard (task_rq_lock, next) {
+		/*
+		 * current->blocked_donor had better be on the same CPU as current
+		 */
+		WARN_ON_ONCE(scope.rq != this_rq());
+
+		scoped_guard (raw_spin_lock, next->blocked_lock) {
+			/*
+			 * WARN_ON on this? How can this happen
+			 */
+			if (next->blocked_on != lock)
+				return NULL;
+		}
+
+		/*
+		 * blocked_on relation is stable, since we hold both
+		 * next->pi_lock and it's rq->lock
+		 *
+		 * OK -- we have a donor, it is blocked on the lock we're about
+		 * to release and it cannot run on this CPU -- fixies are
+		 * required.
+		 *
+		 * Dequeue the task, such that ttwu() can fix up the placement thing.
+		 */
+		if (!is_cpu_allowed(next, cpu_of(scope.rq)))
+			deactivate_task(scope.rq, next, DEQUEUE_SLEEP);
+	}
+
+	return next;
+}
+
 #ifdef CONFIG_SMP
 static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
 {
Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
Posted by John Stultz 1 year, 1 month ago
On Mon, Dec 16, 2024 at 8:54 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Dec 13, 2024 at 07:39:57PM -0800, John Stultz wrote:
> > On Fri, Dec 13, 2024 at 3:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:
> > So yes, the description can use improvement here. I at one time had
> > 3-4 separate very fine grained patches (see the top 4 patches here:
> > https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained/?after=c4cad6e353c00254a2dfbb227ef81d8c3827427d+35)
> > that I rolled into one when sending out(mostly to avoid overwhelming
> > folks), but the squished commit description isn't as clear.
> > So if it's helpful I can split this back out?
> >
> > I'll also add some better comments as well.
>
> Not sure yet about splitting back out -- let me try and figure out what
> all is actually done / needed.
>
> So blocked_lock started out as another lock around ttwu(), in order to
> serialize the task wakeup vs reading a remote ->blocked_on relation.

I think of it primarily to serialize the task->blocked* state (there
gets to be quite a bit by the end of the proxy series).

> Since we do this with rq->lock held, it can't be ->pi_lock, and hence
> ->blocked_lock was born.

Yeah, we needed to use something other than the task->pi_lock to
serialize it as it has to nest under the mutex->wait_lock.

> Later patches appear to have moved it into mutex, mirroring the
> ->wait_lock -- this is probably better.
>
> /me goes chase that state thing for a bit..

? I'm not sure I followed this.  The blocked_lock continues to
serialize the task->blocked* state through the patch series.


> > > > @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> > > >                       goto err_early_kill;
> > > >       }
> > > >
> > > > +     set_task_blocked_on(current, lock);
> > > >       set_current_state(state);
> > >
> > > blocked_on_state mirrors task-state
> >
> > Yea. I know I always move a little fast in my talks but at OSPM and
> > LPC, but I've raised the point that the blocked_on_state very much
> > seems like it aliases the task->__state.
> > (See slide 26: https://lpc.events/event/18/contributions/1887/attachments/1402/3074/LPC_%20Proxy%20Exec%20deep%20dive%20outline.pdf)
> >
> > I've not quite worked out how to integrate it though.
> >
> > My initial introduction of the blocked_on_state was mostly to help
> > detangle the code, as there were a number of cases originally where in
> > order to let the task be selected to try to acquire the mutex, we
> > cleared the task->blocked_on pointer.  But this often ran into races
> > with ttwu, as if it cleared the blocked_on pointer, the task could get
> > selected to run before the return migration happened. So having the
> > tri-state BLOCKED->WAKING->RUNNABLE be explicit helped ensure we
> > enforced the rules properly so we didn't run on the wrong cpu.
>
> Right, so we already have a TASK_WAKING state, that is that
> intermediate. But let me try and get a feel for how things relate;
> TASK_WAKING is after we've determined it is elegible for wakeup, but
> before it is enqueued -- eg. it sits on the wakelist.
>
> This BO_WAKING is not quite that. It happens before doing the actual
> wakeup.

Right. So BO_WAKING is set from the point that the mutex unlock (or
ww_mutex_wound/die code) tries to wake up a task, as the task may have
been proxy-migrated, so we need to return it before it runs.
So it's sort of a guard to make sure __schedule() doesn't just run the
task (as we could if it were BO_RUNNABLE), but also that we don't try
to proxy for it (as we would if it were BO_BLOCKED), since it needs to
be migrated.
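
So roughly, from __schedule()'s point of view (glossing over details
from later patches in the series):

	if (p->blocked_on_state == BO_BLOCKED) {
		/* ok to proxy: run the lock owner on p's behalf */
	} else if (p->blocked_on_state == BO_WAKING) {
		/* neither run p nor proxy for it; it may first need to be
		 * return-migrated back to a cpu it can run on */
	} else { /* BO_RUNNABLE */
		/* ok to pick and run p directly */
	}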

My suspicion is reframing BO_WAKING as TASK_NEEDS_RETURN might be
doable, but I couldn't be sure we wouldn't hit an edge case where
TASK_RUNNING gets written over everything - effectively clearing the
guard.
Maybe the NEEDS_RETURN could just be a side state similar to the
blocked_on_state, and then BO_BLOCKED vs BO_RUNNABLE would just be
distinguishable by the !!task->blocked_on value.

> Also, when I pull your proxy-exec-v7-6.7-rc6-fine-gained and put that
> next to your slides, I'm a little confused.

So, apologies for the confusion. The link I sent you to the
fine-grained changes is a bit old (v7), since I stopped maintaining
the fine-grained patches around v11 (and the most recently shared
version is v14).
I just shared the v7 link as an example to see if it would be helpful
to split the patches here out again in a similar fashion (and I don't
think I published later versions of the fine-grained series).

> Specifically, slide 27 talks about a modification to try_to_wake_up() in
> order to force the task into ttwu_runnable() such that it then hits
> proxy_needs_return(). This latter part I can find, but not the former.
>
> /me puzzles

So the slides actually have links to the code at that time, which
should be the v12 series:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v12-6.11-rc5

And the try_to_wake_up() change from slide 27 is here:
https://github.com/johnstultz-work/linux-dev/commit/cc828a6bac87dcd5734902021973659fe52ef7e6#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73eR4158

(and here's the link to the v14 version of the same:
https://github.com/johnstultz-work/linux-dev/commit/602a4197c83b83cff08c7adda31324739c7938c7#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73eR4314
)

> Hmm, since a blocked task is on the runqueue and all that, you should
> always walk that path, anything else would be buggy.
>
> So fundamentally the problem is that you want to do a reverse migration
> -- it needs to go back to a CPU where it is allowed to run. The way it
> does this is to dequeue itself and let ttwu continue and clean up the
> mess.

Yep.

> This all makes sense, but why can't you do this on the waking side? That
> is, instead of setting BO_WAKING, do this whole if (!is_cpu_allowed())
> deactivate_task() thing right there, before issuing the wakeup.
>
> Notably, that sched_proxy_exec() block in __mutex_unlock_slowpath():
>
>  - probably doesn't need current->blocked_lock, current isn't blocked
>
>  - probably doesn't need lock->wait_lock, donor is stable under
>    preempt_disable().
>
>
> Something like the completely untested below (also, ttwu path needs
> surgery now):
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 2711af8c0052..47cfa01cd066 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -946,32 +946,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>         }
>
>         preempt_disable();
> +       next = proxy_handoff(lock);
> +
>         raw_spin_lock_irqsave(&lock->wait_lock, flags);
>         debug_mutex_unlock(lock);
>
> -       if (sched_proxy_exec()) {
> -               raw_spin_lock(&current->blocked_lock);
> -               /*
> -                * If we have a task boosting current, and that task was boosting
> -                * current through this lock, hand the lock to that task, as that
> -                * is the highest waiter, as selected by the scheduling function.
> -                */
> -               donor = current->blocked_donor;
> -               if (donor) {
> -                       struct mutex *next_lock;
> -
> -                       raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
> -                       next_lock = get_task_blocked_on(donor);
> -                       if (next_lock == lock) {
> -                               next = donor;
> -                               donor->blocked_on_state = BO_WAKING;
> -                               wake_q_add(&wake_q, donor);
> -                               current->blocked_donor = NULL;
> -                       }
> -                       raw_spin_unlock(&donor->blocked_lock);
> -               }
> -       }
> -
>         /*
>          * Failing that, pick any on the wait list.
>          */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6eaffa913495..30d7371bb5c4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4035,6 +4035,53 @@ static inline void activate_blocked_entities(struct rq *target_rq,
>  }
>  #endif /* CONFIG_SCHED_PROXY_EXEC */
>
> +struct task_struct *proxy_handoff(struct mutex *lock);
> +{
> +       struct task_struct *next;
> +
> +       if (!sched_proxy_exec())
> +               return NULL;
> +
> +       /*
> +        * current->blocked_donor can't change if we can't schedule
> +        * caller needs to do this, since its needs stabiliy of return value
> +        */
> +       lockdep_assert_preemption_disabled();
> +       next = current->blocked_donor;
> +       if (!next)
> +               return NULL;
> +
> +       scoped_guard (task_rq_lock, next) {
> +               /*
> +                * current->blocked_donor had better be on the same CPU as current
> +                */
> +               WARN_ON_ONCE(scope.rq != this_rq());
> +
> +               scoped_guard (raw_spin_lock, next->blocked_lock) {
> +                       /*
> +                        * WARN_ON on this? How can this happen
> +                        */
> +                       if (next->blocked_on != lock)
> +                               return NULL;
> +               }
> +
> +               /*
> +                * blocked_on relation is stable, since we hold both
> +                * next->pi_lock and it's rq->lock
> +                *
> +                * OK -- we have a donor, it is blocked on the lock we're about
> +                * to release and it cannot run on this CPU -- fixies are
> +                * required.
> +                *
> +                * Dequeue the task, such that ttwu() can fix up the placement thing.
> +                */
> +               if (!is_cpu_allowed(next, cpu_of(scope.rq)))

nit, we'd want to check it's not the wake_cpu, so we try to return it so
proxy migrations don't upset the task's original placement

> +                       deactivate_task(scope.rq, next, DEQUEUE_SLEEP);
> +       }
> +
> +       return next;
> +}
> +

Ok. I'll stare at all this a bit and see if I can give it a try.  I
fret that this doesn't handle the case where wakeups on the task occur
through other code paths? (So we still need BO_WAKING/NEEDS_RETURN to
prevent us from running until we migrate back). I don't really have a
specific case I can articulate, but my gut is telling me the problem
will be w/ ww_mutexes as that was a major source of problems with the
early versions of the patches that I believe tried to use logic
similar to this.

Again, I really appreciate your insight and suggestions here!

thanks
-john
Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
Posted by Peter Zijlstra 1 year, 1 month ago
On Mon, Dec 16, 2024 at 09:01:24PM -0800, John Stultz wrote:
> > Specifically, slide 27 talks about a modification to try_to_wake_up() in
> > order to force the task into ttwu_runnable() such that it then hits
> > proxy_needs_return(). This latter part I can find, but not the former.
> >
> > /me puzzles
> 
> So the slides actually have links to the code at that time, which
> should be the v12 series:
>   https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v12-6.11-rc5

Oh, no the reason I looked at -v7 is that I thought it was the most
recent one.

This github piece of shit lists it as the first 'proxy-exec-v7*' in the
branch pull-down with earlier versions below it.

Only if you scroll down further (which I didn't, because why would I),
you'll eventually come across 'proxy-exec-v14*'.

I'm looking forward to the day where the web implodes under its own
incompetence, urgh, web sucks.
Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
Posted by Peter Zijlstra 1 year, 1 month ago
On Mon, Dec 16, 2024 at 09:01:24PM -0800, John Stultz wrote:

> > +struct task_struct *proxy_handoff(struct mutex *lock);
> > +{
> > +       struct task_struct *next;
> > +
> > +       if (!sched_proxy_exec())
> > +               return NULL;
> > +
> > +       /*
> > +        * current->blocked_donor can't change if we can't schedule
> > +        * caller needs to do this, since its needs stabiliy of return value
> > +        */
> > +       lockdep_assert_preemption_disabled();
> > +       next = current->blocked_donor;
> > +       if (!next)
> > +               return NULL;
> > +
> > +       scoped_guard (task_rq_lock, next) {
> > +               /*
> > +                * current->blocked_donor had better be on the same CPU as current
> > +                */
> > +               WARN_ON_ONCE(scope.rq != this_rq());
> > +
> > +               scoped_guard (raw_spin_lock, next->blocked_lock) {
> > +                       /*
> > +                        * WARN_ON on this? How can this happen
> > +                        */
> > +                       if (next->blocked_on != lock)
> > +                               return NULL;
> > +               }
> > +
> > +               /*
> > +                * blocked_on relation is stable, since we hold both
> > +                * next->pi_lock and it's rq->lock
> > +                *
> > +                * OK -- we have a donor, it is blocked on the lock we're about
> > +                * to release and it cannot run on this CPU -- fixies are
> > +                * required.
> > +                *
> > +                * Dequeue the task, such that ttwu() can fix up the placement thing.
> > +                */
> > +               if (!is_cpu_allowed(next, cpu_of(scope.rq)))
> 
> nit, we'd want to check its not the wake_cpu so we try to return it so
> proxy migrations don't upset the tasks' original placement

I don't think that really matters, not doing this migration is probably
faster and then load balance will try and fix things again.

The thing is, you want this task to take over the lock and start running
ASAP, and we know *this* CPU is about to release the lock and then the
power of the block-chain is likely to make the next task run.

So less migration is more better, otherwise you then get to pull the
entire block chain over to whatever -- and you know how much 'fun' that
is.

> > +                       deactivate_task(scope.rq, next, DEQUEUE_SLEEP);
> > +       }
> > +
> > +       return next;
> > +}
> > +
> 
> Ok. I'll stare at all this a bit and see if I can give it a try.  I
> fret that this doesn't handle the case if wakeups on the task occur
> through other code paths? (So we still need BO_WAKING/NEEDS_RETURN to
> prevent us from running until we migrate back). I don't really have a
> specific case I can articulate, but my gut is telling me the problem
> will be w/ ww_mutexes as that was a major source of problems with the
> early versions of the patches that I believe tried to use logic
> similar to this.

Right, so I've not looked at ww_mutex specifically yet, but I thought I
understood it was more or less the same problem there. If you've
got more details, please do share :-)
Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
Posted by Peter Zijlstra 1 year, 1 month ago
On Mon, Dec 16, 2024 at 09:01:24PM -0800, John Stultz wrote:
> On Mon, Dec 16, 2024 at 8:54 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Dec 13, 2024 at 07:39:57PM -0800, John Stultz wrote:
> > > On Fri, Dec 13, 2024 at 3:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:
> > > So yes, the description can use improvement here. I at one time had
> > > 3-4 separate very fine grained patches (see the top 4 patches here:
> > > https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained/?after=c4cad6e353c00254a2dfbb227ef81d8c3827427d+35)
> > > that I rolled into one when sending out(mostly to avoid overwhelming
> > > folks), but the squished commit description isn't as clear.
> > > So if it's helpful I can split this back out?
> > >
> > > I'll also add some better comments as well.
> >
> > Not sure yet about splitting back out -- let me try and figure out what
> > all is actually done / needed.
> >
> > So blocked_lock started out as another lock around ttwu(), in order to
> > serialize the task wakeup vs reading a remote ->blocked_on relation.
> 
> I think of it primarily to serialize the task->blocked* state (there
> gets to be quite a bit by the end of the proxy series).
> 
> > Since we do this with rq->lock held, it can't be ->pi_lock, and hence
> > ->blocked_lock was born.
> 
> Yeah, we needed to use something other than the task->pi_lock to
> serialize it as it has to nest under the mutex->wait_lock.

No, the critical bit is nesting under rq->lock -- we need to be able to
walk the blocked relation in the middle of schedule(). You can equally
wrap blocked_lock outside of wait_lock, that doesn't really matter much.

> > Later patches appear to have moved it into mutex, mirroring the
> > ->wait_lock -- this is probably better.
> >
> > /me goes chase that state thing for a bit..
> 
> ? I'm not sure I followed this.  The blocked_lock continues to
> serialize the task->blocked* state through the patch series.

Well, there was only ->blocked_on, and on UP you don't need
serialization beyond disabling preemption.

The tricky bit is SMP, then you need something to stabilize the blocked
relation.
Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
Posted by Peter Zijlstra 1 year, 1 month ago
On Mon, Dec 16, 2024 at 05:54:19PM +0100, Peter Zijlstra wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6eaffa913495..30d7371bb5c4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4035,6 +4035,53 @@ static inline void activate_blocked_entities(struct rq *target_rq,
>  }
>  #endif /* CONFIG_SCHED_PROXY_EXEC */
>  
> +struct task_struct *proxy_handoff(struct mutex *lock);
> +{
> +	struct task_struct *next;
> +
> +	if (!sched_proxy_exec())
> +		return NULL;
> +
> +	/*
> +	 * current->blocked_donor can't change if we can't schedule
> +	 * caller needs to do this, since its needs stabiliy of return value
> +	 */
> +	lockdep_assert_preemption_disabled();
> +	next = current->blocked_donor;
> +	if (!next)
> +		return NULL;
> +
> +	scoped_guard (task_rq_lock, next) {
> +		/*
> +		 * current->blocked_donor had better be on the same CPU as current
> +		 */
> +		WARN_ON_ONCE(scope.rq != this_rq());
> +
> +		scoped_guard (raw_spin_lock, next->blocked_lock) {
> +			/*
> +			 * WARN_ON on this? How can this happen
> +			 */
> +			if (next->blocked_on != lock)
> +				return NULL;
> +		}
> +
> +		/*
> +		 * blocked_on relation is stable, since we hold both
> +		 * next->pi_lock and it's rq->lock
> +		 *
> +		 * OK -- we have a donor, it is blocked on the lock we're about
> +		 * to release and it cannot run on this CPU -- fixies are
> +		 * required.
> +		 *
> +		 * Dequeue the task, such that ttwu() can fix up the placement thing.
> +		 */
> +		if (!is_cpu_allowed(next, cpu_of(scope.rq)))
> +			deactivate_task(scope.rq, next, DEQUEUE_SLEEP);
> +	}

It is probably better to do:

	scoped_guard (raw_spin_lock_irq, next->pi_lock) {

		int cpu = smp_processor_id();
		WARN_ON_ONCE(task_cpu(next) != cpu);

		...

		if (!is_cpu_allowed(next, cpu)) {
			struct rq_flags rf;
			struct rq *rq;
			rq = __task_rq_lock(next, &rf);
			deactivate_task(rq, next, DEQUEUE_SLEEP);
			__task_rq_unlock(rq, &rf);
		}
	}	

In order to minimize the amount of rq->lock'ing.

> +
> +	return next;
> +}
> +
>  #ifdef CONFIG_SMP
>  static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
>  {