[PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe

Steven Rostedt posted 14 patches 3 months ago
There is a newer version of this series
[PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
Posted by Steven Rostedt 3 months ago
From: Steven Rostedt <rostedt@goodmis.org>

Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it and safely request a user space stacktrace when the task exits.

Note, this is only allowed for architectures that implement a safe
cmpxchg. If an architecture requests a deferred stack trace from NMI
context that does not support a safe NMI cmpxchg, it will get an -EINVAL.
For those architectures, they would need another method (perhaps an
irqwork), to request a deferred user space stack trace. That can be dealt
with later if one of theses architectures require this feature.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v12: https://lore.kernel.org/20250701005451.737614486@goodmis.org

- Now that the timestamp has been replaced by a cookie that uses only a 32
  bit cmpxchg(), this code just checks if the architecture has a safe
  cmpxchg that can be used in NMI and doesn't do the 64 bit check.
  Only the pending value is converted to local_t.

 include/linux/unwind_deferred_types.h |  4 +-
 kernel/unwind/deferred.c              | 56 ++++++++++++++++++++++-----
 2 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 79b4f8cece53..cd95ed1c8610 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 
+#include <asm/local.h>
+
 struct unwind_cache {
 	unsigned int		nr_entries;
 	unsigned long		entries[];
@@ -20,7 +22,7 @@ struct unwind_task_info {
 	struct unwind_cache	*cache;
 	struct callback_head	work;
 	union unwind_task_id	id;
-	int			pending;
+	local_t			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index b1faaa55e5d5..2417e4ebbc82 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -12,6 +12,31 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 
+/*
+ * For requesting a deferred user space stack trace from NMI context
+ * the architecture must support a safe cmpxchg in NMI context.
+ * For those architectures that do not have that, then it cannot ask
+ * for a deferred user space stack trace from an NMI context. If it
+ * does, then it will get -EINVAL.
+ */
+#if defined(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG)
+# define CAN_USE_IN_NMI		1
+static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
+{
+	u32 old = 0;
+
+	return try_cmpxchg(&info->id.cnt, &old, cnt);
+}
+#else
+# define CAN_USE_IN_NMI		0
+/* When NMIs are not allowed, this always succeeds */
+static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
+{
+	info->id.cnt = cnt;
+	return true;
+}
+#endif
+
 /* Make the cache fit in a 4K page */
 #define UNWIND_MAX_ENTRIES					\
 	((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
@@ -43,7 +68,6 @@ static u64 get_cookie(struct unwind_task_info *info)
 {
 	u32 cpu_cnt;
 	u32 cnt;
-	u32 old = 0;
 
 	if (info->id.cpu)
 		return info->id.id;
@@ -52,7 +76,7 @@ static u64 get_cookie(struct unwind_task_info *info)
 	cpu_cnt += 2;
 	cnt = cpu_cnt | 1; /* Always make non zero */
 
-	if (try_cmpxchg(&info->id.cnt, &old, cnt)) {
+	if (try_assign_cnt(info, cnt)) {
 		/* Update the per cpu counter */
 		__this_cpu_write(unwind_ctx_ctr, cpu_cnt);
 	}
@@ -119,11 +143,11 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_work *work;
 	u64 cookie;
 
-	if (WARN_ON_ONCE(!info->pending))
+	if (WARN_ON_ONCE(!local_read(&info->pending)))
 		return;
 
 	/* Allow work to come in again */
-	WRITE_ONCE(info->pending, 0);
+	local_set(&info->pending, 0);
 
 	/*
 	 * From here on out, the callback must always be called, even if it's
@@ -170,31 +194,43 @@ static void unwind_deferred_task_work(struct callback_head *head)
 int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 {
 	struct unwind_task_info *info = &current->unwind_info;
+	long pending;
 	int ret;
 
 	*cookie = 0;
 
-	if (WARN_ON_ONCE(in_nmi()))
-		return -EINVAL;
-
 	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
 	    !user_mode(task_pt_regs(current)))
 		return -EINVAL;
 
+	/* NMI requires having safe cmpxchg operations */
+	if (!CAN_USE_IN_NMI && in_nmi())
+		return -EINVAL;
+
 	guard(irqsave)();
 
 	*cookie = get_cookie(info);
 
 	/* callback already pending? */
-	if (info->pending)
+	pending = local_read(&info->pending);
+	if (pending)
 		return 1;
 
+	if (CAN_USE_IN_NMI) {
+		/* Claim the work unless an NMI just now swooped in to do so. */
+		if (!local_try_cmpxchg(&info->pending, &pending, 1))
+			return 1;
+	} else {
+		local_set(&info->pending, 1);
+	}
+
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret))
+	if (WARN_ON_ONCE(ret)) {
+		local_set(&info->pending, 0);
 		return ret;
+	}
 
-	info->pending = 1;
 	return 0;
 }
 
-- 
2.47.2
Re: [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
Posted by Peter Zijlstra 2 months, 3 weeks ago
On Mon, Jul 07, 2025 at 09:22:46PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> Make unwind_deferred_request() NMI-safe so tracers in NMI context can
> call it and safely request a user space stacktrace when the task exits.
> 
> Note, this is only allowed for architectures that implement a safe
> cmpxchg. If an architecture requests a deferred stack trace from NMI
> context that does not support a safe NMI cmpxchg, it will get an -EINVAL.
> For those architectures, they would need another method (perhaps an
> irqwork), to request a deferred user space stack trace. That can be dealt
> with later if one of theses architectures require this feature.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>

How's this instead?

---
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -12,6 +12,40 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 
+/*
+ * For requesting a deferred user space stack trace from NMI context
+ * the architecture must support a safe cmpxchg in NMI context.
+ * For those architectures that do not have that, then it cannot ask
+ * for a deferred user space stack trace from an NMI context. If it
+ * does, then it will get -EINVAL.
+ */
+#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+#define UNWIND_NMI_SAFE 1
+static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
+{
+	u32 zero = 0;
+	return try_cmpxchg(&info->id.cnt, &zero, cnt);
+}
+static inline bool test_and_set_pending(struct unwind_task_info *info)
+{
+	return info->pending || cmpxchg_local(&info->pending, 0, 1);
+}
+#else
+#define UNWIND_NMI_SAFE 0
+/* When NMIs are not allowed, this always succeeds */
+static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
+{
+	info->id.cnt = cnt;
+	return true;
+}
+static inline bool test_and_set_pending(struct unwind_task_info *info)
+{
+	int pending = info->pending;
+	info->pending = 1;
+	return pending;
+}
+#endif /* CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG */
+
 /* Make the cache fit in a 4K page */
 #define UNWIND_MAX_ENTRIES					\
 	((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
@@ -41,21 +75,16 @@ static DEFINE_PER_CPU(u32, unwind_ctx_ct
  */
 static u64 get_cookie(struct unwind_task_info *info)
 {
-	u32 cpu_cnt;
-	u32 cnt;
-	u32 old = 0;
+	u32 cnt = 1;
 
 	if (info->id.cpu)
 		return info->id.id;
 
-	cpu_cnt = __this_cpu_read(unwind_ctx_ctr);
-	cpu_cnt += 2;
-	cnt = cpu_cnt | 1; /* Always make non zero */
-
-	if (try_cmpxchg(&info->id.cnt, &old, cnt)) {
-		/* Update the per cpu counter */
-		__this_cpu_write(unwind_ctx_ctr, cpu_cnt);
-	}
+	/* LSB it always set to ensure 0 is an invalid value. */
+	cnt |= __this_cpu_read(unwind_ctx_ctr) + 2;
+	if (try_assign_cnt(info, cnt))
+		__this_cpu_write(unwind_ctx_ctr, cnt);
+
 	/* Interrupts are disabled, the CPU will always be same */
 	info->id.cpu = smp_processor_id() + 1; /* Must be non zero */
 
@@ -174,27 +203,29 @@ int unwind_deferred_request(struct unwin
 
 	*cookie = 0;
 
-	if (WARN_ON_ONCE(in_nmi()))
-		return -EINVAL;
-
 	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
 	    !user_mode(task_pt_regs(current)))
 		return -EINVAL;
 
+	/* NMI requires having safe cmpxchg operations */
+	if (WARN_ON_ONCE(!UNWIND_NMI_SAFE && in_nmi()))
+		return -EINVAL;
+
 	guard(irqsave)();
 
 	*cookie = get_cookie(info);
 
 	/* callback already pending? */
-	if (info->pending)
+	if (test_and_set_pending(info))
 		return 1;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret))
+	if (WARN_ON_ONCE(ret)) {
+		WRITE_ONCE(info->pending, 0);
 		return ret;
+	}
 
-	info->pending = 1;
 	return 0;
 }
Re: [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
Posted by Steven Rostedt 2 months, 3 weeks ago
On Mon, 14 Jul 2025 15:29:36 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> +#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
> +#define UNWIND_NMI_SAFE 1
> +static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
> +{
> +	u32 zero = 0;
> +	return try_cmpxchg(&info->id.cnt, &zero, cnt);
> +}
> +static inline bool test_and_set_pending(struct unwind_task_info *info)
> +{
> +	return info->pending || cmpxchg_local(&info->pending, 0, 1);
> +}
> +#el

Patch 10 moves the pending bit into the unwind_mask as it needs to be
in sync with the different tracer requests. I'm not sure how this
change will interact with that.

-- Steve
Re: [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
Posted by Peter Zijlstra 2 months, 3 weeks ago
On Mon, Jul 14, 2025 at 10:19:19AM -0400, Steven Rostedt wrote:
> On Mon, 14 Jul 2025 15:29:36 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > +#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
> > +#define UNWIND_NMI_SAFE 1
> > +static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
> > +{
> > +	u32 zero = 0;
> > +	return try_cmpxchg(&info->id.cnt, &zero, cnt);
> > +}
> > +static inline bool test_and_set_pending(struct unwind_task_info *info)
> > +{
> > +	return info->pending || cmpxchg_local(&info->pending, 0, 1);
> > +}
> > +#el
> 
> Patch 10 moves the pending bit into the unwind_mask as it needs to be
> in sync with the different tracer requests. I'm not sure how this
> change will interact with that.

Urgh; so I hate reviewing code you're ripping out in the next patch :-(

Let me go stare at that.
Re: [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
Posted by Steven Rostedt 2 months, 3 weeks ago
On Mon, 14 Jul 2025 17:05:16 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Urgh; so I hate reviewing code you're ripping out in the next patch :-(

Sorry. It just happened to be developed that way. Patch 10 came about
to fix a bug that was triggered with the current method.

> 
> Let me go stare at that.

Thanks!

-- Steve
Re: [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
Posted by Peter Zijlstra 2 months, 3 weeks ago
On Mon, Jul 14, 2025 at 11:11:58AM -0400, Steven Rostedt wrote:
> On Mon, 14 Jul 2025 17:05:16 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Urgh; so I hate reviewing code you're ripping out in the next patch :-(
> 
> Sorry. It just happened to be developed that way. Patch 10 came about
> to fix a bug that was triggered with the current method.

Sure; but then you rework the series such that the bug never happened
and reviewers don't go insane from the back and forth and possibly
stumbling over the same bug you then fix later.

You should know this.

I'm going to not stare at email for some 3 weeks soon; I strongly
suggest you take this time to fix up this series to not suffer nonsense
like this.
Re: [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
Posted by Steven Rostedt 2 months, 3 weeks ago
On Tue, 15 Jul 2025 11:09:55 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Jul 14, 2025 at 11:11:58AM -0400, Steven Rostedt wrote:
> > On Mon, 14 Jul 2025 17:05:16 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >   
> > > Urgh; so I hate reviewing code you're ripping out in the next patch :-(  
> > 
> > Sorry. It just happened to be developed that way. Patch 10 came about
> > to fix a bug that was triggered with the current method.  
> 
> Sure; but then you rework the series such that the bug never happened
> and reviewers don't go insane from the back and forth and possibly
> stumbling over the same bug you then fix later.
> 
> You should know this.

The bug was with actually with the next patch (#8) that uses the bitmask to
know which tracer requested a callback. Patch 8 cleared the bit after the
callbacks were called. The bug that was triggered was when the tracer set
an event to do a user space stack trace on an event that is called between
the task_work and going back to user space. It triggered an infinite loop
because the bit would get set again and trigger another task_work!

I can merge patch 8 and 10, but it still would not have affected this
patch, and would have likely led to the same confusion.

> 
> I'm going to not stare at email for some 3 weeks soon; I strongly
> suggest you take this time to fix up this series to not suffer nonsense
> like this.
> 

Sure, I'll take a deep look at your review and work on the next series to
hopefully address each of your concerns.

Thanks!

-- Steve