From: Steven Rostedt <rostedt@goodmis.org>
Instead of using the callback_mutex to protect the linked list of callbacks
in unwind_deferred_task_work(), use SRCU. This function is called every
time a task that has to record a requested stack trace exits, which can
happen for many tasks on several CPUs at the same time. A mutex there is a
bottleneck that can cause contention and slow down performance.
As the callbacks themselves are allowed to sleep, regular RCU cannot be
used to protect the list. Instead use SRCU, as that still allows the
callbacks to sleep and the list can be read without needing to hold the
callback_mutex.
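
For reference, a minimal sketch of the pattern this moves to (the names
here are illustrative only, not the symbols this patch adds): the mutex
keeps serializing writers, while readers only enter an SRCU read-side
critical section and are still allowed to sleep inside it.

#include <linux/srcu.h>
#include <linux/rculist.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(example_mutex);	/* serializes add/remove only */
static LIST_HEAD(example_list);
DEFINE_STATIC_SRCU(example_srcu);

struct example_work {
	struct list_head	list;
	void			(*func)(struct example_work *work);
};

/* Reader: func() may sleep, which is why SRCU is used instead of RCU */
static void example_walk(void)
{
	struct example_work *w;
	int idx;

	idx = srcu_read_lock(&example_srcu);
	list_for_each_entry_srcu(w, &example_list, list,
				 srcu_read_lock_held(&example_srcu))
		w->func(w);
	srcu_read_unlock(&example_srcu, idx);
}

/* Writer: removal waits for all SRCU readers before the entry is reused */
static void example_remove(struct example_work *w)
{
	mutex_lock(&example_mutex);
	list_del_rcu(&w->list);
	mutex_unlock(&example_mutex);
	synchronize_srcu(&example_srcu);
}
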
Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/
Also added a new guard (srcu_lite) written by Peter Zijlstra
Link: https://lore.kernel.org/all/20250715102912.GQ1613200@noisy.programming.kicks-ass.net/
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v13: https://lore.kernel.org/20250708012359.172959778@kernel.org
- Have the locking of the linked list walk use guard(srcu_lite)
(Peter Zijlstra)
- Fixed up due to the new atomic_long logic.
include/linux/srcu.h | 4 ++++
kernel/unwind/deferred.c | 27 +++++++++++++++++++++------
2 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 900b0d5c05f5..879054b8bf87 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -524,4 +524,8 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_struct,
srcu_read_unlock(_T->lock, _T->idx),
int idx)
+DEFINE_LOCK_GUARD_1(srcu_lite, struct srcu_struct,
+ _T->idx = srcu_read_lock_lite(_T->lock),
+ srcu_read_unlock_lite(_T->lock, _T->idx),
+ int idx)
#endif
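
(For context, a guard defined with DEFINE_LOCK_GUARD_1() as above would be
used like the sketch below; the srcu_struct here is hypothetical and only
for illustration. The read lock is dropped automatically when the scope is
left.)

#include <linux/cleanup.h>
#include <linux/srcu.h>

DEFINE_STATIC_SRCU(demo_srcu);		/* hypothetical, for the example only */

static void demo_reader(void)
{
	/* srcu_read_lock_lite() on entry, srcu_read_unlock_lite() at scope exit */
	guard(srcu_lite)(&demo_srcu);

	/* ... read-side traversal of SRCU-protected data ... */
}
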
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 2311b725d691..353f7af610bf 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -41,7 +41,7 @@ static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
#define UNWIND_MAX_ENTRIES \
((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
static DEFINE_MUTEX(callback_mutex);
static LIST_HEAD(callbacks);
@@ -49,6 +49,7 @@ static LIST_HEAD(callbacks);
/* Zero'd bits are available for assigning callback users */
static unsigned long unwind_mask = RESERVED_BITS;
+DEFINE_STATIC_SRCU(unwind_srcu);
static inline bool unwind_pending(struct unwind_task_info *info)
{
@@ -174,8 +175,9 @@ static void unwind_deferred_task_work(struct callback_head *head)
cookie = info->id.id;
- guard(mutex)(&callback_mutex);
- list_for_each_entry(work, &callbacks, list) {
+ guard(srcu_lite)(&unwind_srcu);
+ list_for_each_entry_srcu(work, &callbacks, list,
+ srcu_read_lock_held(&unwind_srcu)) {
if (test_bit(work->bit, &bits)) {
work->func(work, &trace, cookie);
if (info->cache)
@@ -213,7 +215,7 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
{
struct unwind_task_info *info = &current->unwind_info;
unsigned long old, bits;
- unsigned long bit = BIT(work->bit);
+ unsigned long bit;
int ret;
*cookie = 0;
@@ -230,6 +232,14 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
if (WARN_ON_ONCE(!CAN_USE_IN_NMI && in_nmi()))
return -EINVAL;
+ /* Do not allow cancelled works to request again */
+ bit = READ_ONCE(work->bit);
+ if (WARN_ON_ONCE(bit < 0))
+ return -EINVAL;
+
+ /* Only need the mask now */
+ bit = BIT(bit);
+
guard(irqsave)();
*cookie = get_cookie(info);
@@ -281,10 +291,15 @@ void unwind_deferred_cancel(struct unwind_work *work)
return;
guard(mutex)(&callback_mutex);
- list_del(&work->list);
+ list_del_rcu(&work->list);
+
+ /* Do not allow any more requests and prevent callbacks */
+ work->bit = -1;
__clear_bit(bit, &unwind_mask);
+ synchronize_srcu(&unwind_srcu);
+
guard(rcu)();
/* Clear this bit from all threads */
for_each_process_thread(g, t) {
@@ -307,7 +322,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
work->bit = ffz(unwind_mask);
__set_bit(work->bit, &unwind_mask);
- list_add(&work->list, &callbacks);
+ list_add_rcu(&work->list, &callbacks);
work->func = func;
return 0;
}
--
2.47.2
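
A rough sketch of how a tracer might use this interface, inferred from the
calls visible in the diff above (the header name and the exact
unwind_callback_t signature are assumptions, not confirmed by this patch):

/* Sketch only: header name and callback signature are assumed. */
#include <linux/unwind_deferred.h>

static struct unwind_work my_unwind_work;

/* Runs later in task context (may fault/sleep) once the user-space
 * stack trace for the request identified by @cookie is available. */
static void my_unwind_callback(struct unwind_work *work,
			       struct unwind_stacktrace *trace, u64 cookie)
{
	/* record the trace entries against @cookie ... */
}

static int my_tracer_start(void)
{
	return unwind_deferred_init(&my_unwind_work, my_unwind_callback);
}

/* From the event context: request a deferred user stack trace;
 * @cookie ties this event to the trace delivered later. */
static void my_tracer_event(void)
{
	u64 cookie;

	if (unwind_deferred_request(&my_unwind_work, &cookie) < 0)
		return;

	/* stash @cookie with the event record */
}

static void my_tracer_stop(void)
{
	unwind_deferred_cancel(&my_unwind_work);
}
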
On Wed, Jul 16, 2025 at 08:49:19PM -0400, Steven Rostedt wrote:

[ ... ]

> +DEFINE_LOCK_GUARD_1(srcu_lite, struct srcu_struct,

You need srcu_fast because srcu_lite is being removed.  They are quite
similar, but srcu_fast is faster and is NMI-safe.  (This last might or
might not matter here.)

See https://lore.kernel.org/all/20250716225418.3014815-3-paulmck@kernel.org/
for a srcu_fast_notrace, so something like this:

DEFINE_LOCK_GUARD_1(srcu_fast, struct srcu_struct,
		    _T->scp = srcu_read_lock_fast(_T->lock),
		    srcu_read_unlock_fast(_T->lock, _T->scp),
		    struct srcu_ctr __percpu *scp)

Other than that, it looks plausible.

Thanx, Paul
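
(Assuming the srcu_fast guard definition quoted above were added, its use
would mirror the other SRCU guards; a sketch with a hypothetical
srcu_struct, not code from this series:)

#include <linux/cleanup.h>
#include <linux/srcu.h>

DEFINE_STATIC_SRCU(demo_srcu);		/* hypothetical, for the example only */

static void demo_fast_reader(void)
{
	/* The guard stashes the struct srcu_ctr __percpu * cookie from
	 * srcu_read_lock_fast() and unlocks automatically at scope exit;
	 * unlike plain SRCU there is no read-side smp_mb(). */
	guard(srcu_fast)(&demo_srcu);

	/* ... read-side critical section ... */
}
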
On Wed, 16 Jul 2025 21:43:47 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > +DEFINE_LOCK_GUARD_1(srcu_lite, struct srcu_struct,
>
> You need srcu_fast because srcu_lite is being removed.  They are quite
> similar, but srcu_fast is faster and is NMI-safe.  (This last might or
> might not matter here.)
>
> See https://lore.kernel.org/all/20250716225418.3014815-3-paulmck@kernel.org/
> for a srcu_fast_notrace, so something like this:

Yeah, I already saw that patch.

> DEFINE_LOCK_GUARD_1(srcu_fast, struct srcu_struct,
> 		    _T->scp = srcu_read_lock_fast(_T->lock),
> 		    srcu_read_unlock_fast(_T->lock, _T->scp),
> 		    struct srcu_ctr __percpu *scp)
>
> Other than that, it looks plausible.

Using srcu_lite or srcu_fast is an optimization here. And since I saw you
adding the guard for srcu_fast in that other thread, I'll just use normal
SRCU here for this series, and in the future we could convert it over to
srcu_fast.

Thanks!

-- Steve
On Thu, Jul 17, 2025 at 08:25:26AM -0400, Steven Rostedt wrote:

[ ... ]

> Using srcu_lite or srcu_fast is an optimization here. And since I saw you
> adding the guard for srcu_fast in that other thread, I'll just use normal
> SRCU here for this series, and in the future we could convert it over to
> srcu_fast.

Works for me!

That said, "in the future" started in -next some time back and is slated
to start in mainline in the upcoming v6.17 merge window.  SRCU-lite is
being removed from the kernel, and has been deprecated via checkpatch.pl.

So if there is some reason that you absolutely cannot immediately convert
to SRCU-fast, let's please discuss.

Thanx, Paul
On Thu, 17 Jul 2025 08:48:40 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> So if there is some reason that you absolutely cannot immediately convert
> to SRCU-fast, let's please discuss.

There's two reasons I wouldn't add it immediately.

One, is the guard(srcu_fast) isn't in mainline yet. I would either need
to open code it, or play the tricks of basing code off your tree.

Two, I'm still grasping at the concept of srcu_fast (and srcu_lite for
that matter), where I rather be slow and safe than optimize and be
unsafe. The code where this is used may be faulting in user space
memory, so it doesn't need the micro-optimizations now.

-- Steve
On Thu, Jul 17, 2025 at 12:10:10PM -0400, Steven Rostedt wrote:
> On Thu, 17 Jul 2025 08:48:40 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > So if there is some reason that you absolutely cannot immediately convert
> > to SRCU-fast, let's please discuss.
>
> There's two reasons I wouldn't add it immediately.
>
> One, is the guard(srcu_fast) isn't in mainline yet. I would either need
> to open code it, or play the tricks of basing code off your tree.

Fair point!  But guard(srcu_fast) isn't in my tree, either, just
guard(srcu_fast_nopreempt).  So why not add guard(srcu_fast) in your
tree, and we can ack it.  Yes, that means we will have a merge conflict
at some point, but it will be a trivial one.

> Two, I'm still grasping at the concept of srcu_fast (and srcu_lite for
> that matter), where I rather be slow and safe than optimize and be
> unsafe. The code where this is used may be faulting in user space
> memory, so it doesn't need the micro-optimizations now.

Straight-up SRCU and guard(srcu), then?  Both are already in mainline.

Or are those read-side smp_mb() calls a no-go for this code?

Thanx, Paul
On Thu, 17 Jul 2025 09:27:34 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > Two, I'm still grasping at the concept of srcu_fast (and srcu_lite for
> > that matter), where I rather be slow and safe than optimize and be
> > unsafe. The code where this is used may be faulting in user space
> > memory, so it doesn't need the micro-optimizations now.
>
> Straight-up SRCU and guard(srcu), then?  Both are already in mainline.
>
> Or are those read-side smp_mb() calls a no-go for this code?

As I stated, the read-side is likely going to be faulting in user space
memory. I don't think one or two smp_mb() will really make much of a
difference ;-)

It's not urgent. If it can be switched to srcu_fast, we can do it later.

Thanks,

-- Steve
On Thu, Jul 17, 2025 at 12:38:35PM -0400, Steven Rostedt wrote:
> On Thu, 17 Jul 2025 09:27:34 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > > Two, I'm still grasping at the concept of srcu_fast (and srcu_lite for
> > > that matter), where I rather be slow and safe than optimize and be
> > > unsafe. The code where this is used may be faulting in user space
> > > memory, so it doesn't need the micro-optimizations now.
> >
> > Straight-up SRCU and guard(srcu), then?  Both are already in mainline.
> >
> > Or are those read-side smp_mb() calls a no-go for this code?
>
> As I stated, the read-side is likely going to be faulting in user space
> memory. I don't think one or two smp_mb() will really make much of a
> difference ;-)
>
> It's not urgent. If it can be switched to srcu_fast, we can do it later.

Very good, we will continue with our removal of SRCU-lite, and I might
as well add guard(srcu_fast) in my current series.

Thanx, Paul