From: Steven Rostedt <rostedt@goodmis.org>
Instead of using the callback_mutex to protect the linked list of callbacks
in unwind_deferred_task_work(), use SRCU. This function is called every
time a task that has to record a requested stack trace exits, which can
happen for many tasks on several CPUs at the same time. A mutex there is a
bottleneck that can cause contention and slow down performance.
As the callbacks themselves are allowed to sleep, regular RCU cannot be
used to protect the list. Instead use SRCU, as that still allows the
callbacks to sleep and the list can be read without needing to hold the
callback_mutex.
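
For reference, a minimal sketch of the pattern this moves to (the names
here are illustrative only, not the symbols this patch adds): the mutex
keeps serializing writers, while readers only enter an SRCU read-side
critical section and are still allowed to sleep inside it.

#include <linux/srcu.h>
#include <linux/rculist.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(example_mutex);	/* serializes add/remove only */
static LIST_HEAD(example_list);
DEFINE_STATIC_SRCU(example_srcu);

struct example_work {
	struct list_head	list;
	void			(*func)(struct example_work *work);
};

/* Reader: func() may sleep, which is why SRCU is used instead of RCU */
static void example_walk(void)
{
	struct example_work *w;
	int idx;

	idx = srcu_read_lock(&example_srcu);
	list_for_each_entry_srcu(w, &example_list, list,
				 srcu_read_lock_held(&example_srcu))
		w->func(w);
	srcu_read_unlock(&example_srcu, idx);
}

/* Writer: removal waits for all SRCU readers before the entry is reused */
static void example_remove(struct example_work *w)
{
	mutex_lock(&example_mutex);
	list_del_rcu(&w->list);
	mutex_unlock(&example_mutex);
	synchronize_srcu(&example_srcu);
}
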
Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/
Also added a new guard (srcu_lite) written by Peter Zijlstra
Link: https://lore.kernel.org/all/20250715102912.GQ1613200@noisy.programming.kicks-ass.net/
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v13: https://lore.kernel.org/20250708012359.172959778@kernel.org
- Have the locking of the linked list walk use guard(srcu_lite)
(Peter Zijlstra)
- Fixed up due to the new atomic_long logic.
include/linux/srcu.h | 4 ++++
kernel/unwind/deferred.c | 27 +++++++++++++++++++++------
2 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 900b0d5c05f5..879054b8bf87 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -524,4 +524,8 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_struct,
srcu_read_unlock(_T->lock, _T->idx),
int idx)
+DEFINE_LOCK_GUARD_1(srcu_lite, struct srcu_struct,
+ _T->idx = srcu_read_lock_lite(_T->lock),
+ srcu_read_unlock_lite(_T->lock, _T->idx),
+ int idx)
#endif
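
(For context, a guard defined with DEFINE_LOCK_GUARD_1() as above would be
used like the sketch below; the srcu_struct here is hypothetical and only
for illustration. The read lock is dropped automatically when the scope is
left.)

#include <linux/cleanup.h>
#include <linux/srcu.h>

DEFINE_STATIC_SRCU(demo_srcu);		/* hypothetical, for the example only */

static void demo_reader(void)
{
	/* srcu_read_lock_lite() on entry, srcu_read_unlock_lite() at scope exit */
	guard(srcu_lite)(&demo_srcu);

	/* ... read-side traversal of SRCU-protected data ... */
}
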
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 2311b725d691..353f7af610bf 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -41,7 +41,7 @@ static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
#define UNWIND_MAX_ENTRIES \
((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
static DEFINE_MUTEX(callback_mutex);
static LIST_HEAD(callbacks);
@@ -49,6 +49,7 @@ static LIST_HEAD(callbacks);
/* Zero'd bits are available for assigning callback users */
static unsigned long unwind_mask = RESERVED_BITS;
+DEFINE_STATIC_SRCU(unwind_srcu);
static inline bool unwind_pending(struct unwind_task_info *info)
{
@@ -174,8 +175,9 @@ static void unwind_deferred_task_work(struct callback_head *head)
cookie = info->id.id;
- guard(mutex)(&callback_mutex);
- list_for_each_entry(work, &callbacks, list) {
+ guard(srcu_lite)(&unwind_srcu);
+ list_for_each_entry_srcu(work, &callbacks, list,
+ srcu_read_lock_held(&unwind_srcu)) {
if (test_bit(work->bit, &bits)) {
work->func(work, &trace, cookie);
if (info->cache)
@@ -213,7 +215,7 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
{
struct unwind_task_info *info = &current->unwind_info;
unsigned long old, bits;
- unsigned long bit = BIT(work->bit);
+ unsigned long bit;
int ret;
*cookie = 0;
@@ -230,6 +232,14 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
if (WARN_ON_ONCE(!CAN_USE_IN_NMI && in_nmi()))
return -EINVAL;
+ /* Do not allow cancelled works to request again */
+ bit = READ_ONCE(work->bit);
+ if (WARN_ON_ONCE(bit < 0))
+ return -EINVAL;
+
+ /* Only need the mask now */
+ bit = BIT(bit);
+
guard(irqsave)();
*cookie = get_cookie(info);
@@ -281,10 +291,15 @@ void unwind_deferred_cancel(struct unwind_work *work)
return;
guard(mutex)(&callback_mutex);
- list_del(&work->list);
+ list_del_rcu(&work->list);
+
+ /* Do not allow any more requests and prevent callbacks */
+ work->bit = -1;
__clear_bit(bit, &unwind_mask);
+ synchronize_srcu(&unwind_srcu);
+
guard(rcu)();
/* Clear this bit from all threads */
for_each_process_thread(g, t) {
@@ -307,7 +322,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
work->bit = ffz(unwind_mask);
__set_bit(work->bit, &unwind_mask);
- list_add(&work->list, &callbacks);
+ list_add_rcu(&work->list, &callbacks);
work->func = func;
return 0;
}
--
2.47.2
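
A rough sketch of how a tracer might use this interface, inferred from the
calls visible in the diff above (the header name and the exact
unwind_callback_t signature are assumptions, not confirmed by this patch):

/* Sketch only: header name and callback signature are assumed. */
#include <linux/unwind_deferred.h>

static struct unwind_work my_unwind_work;

/* Runs later in task context (may fault/sleep) once the user-space
 * stack trace for the request identified by @cookie is available. */
static void my_unwind_callback(struct unwind_work *work,
			       struct unwind_stacktrace *trace, u64 cookie)
{
	/* record the trace entries against @cookie ... */
}

static int my_tracer_start(void)
{
	return unwind_deferred_init(&my_unwind_work, my_unwind_callback);
}

/* From the event context: request a deferred user stack trace;
 * @cookie ties this event to the trace delivered later. */
static void my_tracer_event(void)
{
	u64 cookie;

	if (unwind_deferred_request(&my_unwind_work, &cookie) < 0)
		return;

	/* stash @cookie with the event record */
}

static void my_tracer_stop(void)
{
	unwind_deferred_cancel(&my_unwind_work);
}
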
On Wed, Jul 16, 2025 at 08:49:19PM -0400, Steven Rostedt wrote:

[ ... ]

> +DEFINE_LOCK_GUARD_1(srcu_lite, struct srcu_struct,

You need srcu_fast because srcu_lite is being removed.  They are quite
similar, but srcu_fast is faster and is NMI-safe.  (This last might or
might not matter here.)

See https://lore.kernel.org/all/20250716225418.3014815-3-paulmck@kernel.org/
for a srcu_fast_notrace, so something like this:

DEFINE_LOCK_GUARD_1(srcu_fast, struct srcu_struct,
		    _T->scp = srcu_read_lock_fast(_T->lock),
		    srcu_read_unlock_fast(_T->lock, _T->scp),
		    struct srcu_ctr __percpu *scp)

Other than that, it looks plausible.

Thanx, Paul
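
(Assuming the srcu_fast guard definition quoted above were added, its use
would mirror the other SRCU guards; a sketch with a hypothetical
srcu_struct, not code from this series:)

#include <linux/cleanup.h>
#include <linux/srcu.h>

DEFINE_STATIC_SRCU(demo_srcu);		/* hypothetical, for the example only */

static void demo_fast_reader(void)
{
	/* The guard stashes the struct srcu_ctr __percpu * cookie from
	 * srcu_read_lock_fast() and unlocks automatically at scope exit;
	 * unlike plain SRCU there is no read-side smp_mb(). */
	guard(srcu_fast)(&demo_srcu);

	/* ... read-side critical section ... */
}
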
On Wed, 16 Jul 2025 21:43:47 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > +DEFINE_LOCK_GUARD_1(srcu_lite, struct srcu_struct,
>
> You need srcu_fast because srcu_lite is being removed.  They are quite
> similar, but srcu_fast is faster and is NMI-safe.  (This last might or
> might not matter here.)
>
> See https://lore.kernel.org/all/20250716225418.3014815-3-paulmck@kernel.org/
> for a srcu_fast_notrace, so something like this:

Yeah, I already saw that patch.

> DEFINE_LOCK_GUARD_1(srcu_fast, struct srcu_struct,
> 		    _T->scp = srcu_read_lock_fast(_T->lock),
> 		    srcu_read_unlock_fast(_T->lock, _T->scp),
> 		    struct srcu_ctr __percpu *scp)
>
> Other than that, it looks plausible.

Using srcu_lite or srcu_fast is an optimization here. And since I saw you
adding the guard for srcu_fast in that other thread, I'll just use normal
SRCU here for this series, and in the future we could convert it over to
srcu_fast.

Thanks!

-- Steve
On Thu, Jul 17, 2025 at 08:25:26AM -0400, Steven Rostedt wrote:

[ ... ]

> Using srcu_lite or srcu_fast is an optimization here. And since I saw you
> adding the guard for srcu_fast in that other thread, I'll just use normal
> SRCU here for this series, and in the future we could convert it over to
> srcu_fast.

Works for me!

That said, "in the future" started in -next some time back and is slated
to start in mainline in the upcoming v6.17 merge window.  SRCU-lite is
being removed from the kernel, and has been deprecated via checkpatch.pl.

So if there is some reason that you absolutely cannot immediately convert
to SRCU-fast, let's please discuss.

Thanx, Paul
On Thu, 17 Jul 2025 08:48:40 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> So if there is some reason that you absolutely cannot immediately convert
> to SRCU-fast, let's please discuss.

There's two reasons I wouldn't add it immediately.

One, is the guard(srcu_fast) isn't in mainline yet. I would either need
to open code it, or play the tricks of basing code off your tree.

Two, I'm still grasping at the concept of srcu_fast (and srcu_lite for
that matter), where I rather be slow and safe than optimize and be
unsafe. The code where this is used may be faulting in user space
memory, so it doesn't need the micro-optimizations now.

-- Steve
On Thu, Jul 17, 2025 at 12:10:10PM -0400, Steven Rostedt wrote:
> On Thu, 17 Jul 2025 08:48:40 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > So if there is some reason that you absolutely cannot immediately convert
> > to SRCU-fast, let's please discuss.
>
> There's two reasons I wouldn't add it immediately.
>
> One, is the guard(srcu_fast) isn't in mainline yet. I would either need
> to open code it, or play the tricks of basing code off your tree.

Fair point!  But guard(srcu_fast) isn't in my tree, either, just
guard(srcu_fast_nopreempt).  So why not add guard(srcu_fast) in your
tree, and we can ack it.  Yes, that means we will have a merge conflict
at some point, but it will be a trivial one.

> Two, I'm still grasping at the concept of srcu_fast (and srcu_lite for
> that matter), where I rather be slow and safe than optimize and be
> unsafe. The code where this is used may be faulting in user space
> memory, so it doesn't need the micro-optimizations now.

Straight-up SRCU and guard(srcu), then?  Both are already in mainline.

Or are those read-side smp_mb() calls a no-go for this code?

Thanx, Paul
On Thu, 17 Jul 2025 09:27:34 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > Two, I'm still grasping at the concept of srcu_fast (and srcu_lite for
> > that matter), where I rather be slow and safe than optimize and be
> > unsafe. The code where this is used may be faulting in user space
> > memory, so it doesn't need the micro-optimizations now.
>
> Straight-up SRCU and guard(srcu), then?  Both are already in mainline.
>
> Or are those read-side smp_mb() calls a no-go for this code?

As I stated, the read-side is likely going to be faulting in user space
memory. I don't think one or two smp_mb() will really make much of a
difference ;-)

It's not urgent. If it can be switched to srcu_fast, we can do it later.

Thanks,

-- Steve
On Thu, Jul 17, 2025 at 12:38:35PM -0400, Steven Rostedt wrote:
> On Thu, 17 Jul 2025 09:27:34 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > > Two, I'm still grasping at the concept of srcu_fast (and srcu_lite for
> > > that matter), where I rather be slow and safe than optimize and be
> > > unsafe. The code where this is used may be faulting in user space
> > > memory, so it doesn't need the micro-optimizations now.
> >
> > Straight-up SRCU and guard(srcu), then?  Both are already in mainline.
> >
> > Or are those read-side smp_mb() calls a no-go for this code?
>
> As I stated, the read-side is likely going to be faulting in user space
> memory. I don't think one or two smp_mb() will really make much of a
> difference ;-)
>
> It's not urgent. If it can be switched to srcu_fast, we can do it later.

Very good, we will continue with our removal of SRCU-lite, and I might
as well add guard(srcu_fast) in my current series.

Thanx, Paul