[PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space

From: Steven Rostedt <rostedt@goodmis.org>

When testing the deferred unwinder by attaching deferred user space
stacktraces to events, a livelock happened. This occurred when the
deferred unwinding was attached to the irqs_disabled event, which
triggers after the task_work callbacks are called and before the task
goes back to user space.

The event callback would be registered while irqs were disabled, the
task_work would trigger, call the callback for this work, and clear the
work's bit. Then, before getting back to user space, irqs would be
disabled again, the event would trigger again, and a new task_work would
be registered. This caused an infinite loop and the system hung.

To prevent this, clear the bits at the very last moment before going back
to user space and when instrumentation is disabled. That is in
unwind_exit_to_user_mode().

Move the pending bit from a separate value on the task_struct to the most
significant bit of the unwind_mask (which saves space on the task_struct).
This allows the pending bit to be modified atomically along with the work
bits.

Instead of clearing a work's bit right after its callback is called, the
clearing is delayed until exit back to user space. If the work is requested
again before then, no new task_work is queued and the caller is notified
that its callback has already been called for this entry context (via the
UNWIND_ALREADY_EXECUTED return value).

The pending bit is cleared before calling the callback functions but the
current work bits remain. If one of the called works registers again, it
will not trigger a task_work if its bit is still present in the task's
unwind_mask.

If a new work registers, then it will set both the pending bit and its own
bit but clear the other work bits so that their callbacks do not get
called again.
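
For illustration only (not part of the patch): a minimal user-space sketch
of the unwind_mask protocol described above. The helper names are made up
for the example, and the return values 1 and 2 stand in for
UNWIND_ALREADY_PENDING and UNWIND_ALREADY_EXECUTED.

#include <stdio.h>

#define UNWIND_PENDING	(1UL << (8 * sizeof(unsigned long) - 1))

/* Simulate a deferred request from the work that owns @bit. */
static int request(unsigned long *mask, int bit)
{
	unsigned long old = *mask;

	/* This work's bit is already set: still pending or already ran. */
	if (old & (1UL << bit))
		return (old & UNWIND_PENDING) ? 1 : 2;

	if (old & UNWIND_PENDING) {
		/* A task_work is already scheduled; just add this bit. */
		*mask |= 1UL << bit;
		return 0;
	}

	/* First request since the callbacks last ran (or since entering
	 * the kernel): claim PENDING and drop the other works' stale bits. */
	*mask = UNWIND_PENDING | (1UL << bit);
	return 0;
}

/* Simulate the task_work: clear PENDING, keep the work bits until exit. */
static void callbacks_run(unsigned long *mask)
{
	*mask &= ~UNWIND_PENDING;
}

int main(void)
{
	unsigned long mask = 0;
	int ret;

	ret = request(&mask, 1);
	printf("perf requests   -> %d, mask=%#lx\n", ret, mask);
	callbacks_run(&mask);
	printf("callbacks ran   -> mask=%#lx (perf bit kept)\n", mask);
	ret = request(&mask, 1);
	printf("perf requests   -> %d (already executed)\n", ret);
	ret = request(&mask, 2);
	printf("ftrace requests -> %d, mask=%#lx (perf bit cleared)\n", ret, mask);
	mask = 0;	/* exit back to user space clears everything */
	return 0;
}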

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v12: https://lore.kernel.org/20250701005452.242933931@goodmis.org

- Removed no longer used local.h headers from unwind_deferred_types.h

 include/linux/unwind_deferred.h       | 25 +++++++--
 include/linux/unwind_deferred_types.h |  3 --
 kernel/unwind/deferred.c              | 76 ++++++++++++++++++---------
 3 files changed, 74 insertions(+), 30 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 12bffdb0648e..587e120c0fd6 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -18,6 +18,14 @@ struct unwind_work {
 
 #ifdef CONFIG_UNWIND_USER
 
+#define UNWIND_PENDING_BIT	(BITS_PER_LONG - 1)
+#define UNWIND_PENDING		BIT(UNWIND_PENDING_BIT)
+
+enum {
+	UNWIND_ALREADY_PENDING	= 1,
+	UNWIND_ALREADY_EXECUTED	= 2,
+};
+
 void unwind_task_init(struct task_struct *task);
 void unwind_task_free(struct task_struct *task);
 
@@ -29,15 +37,26 @@ void unwind_deferred_cancel(struct unwind_work *work);
 
 static __always_inline void unwind_reset_info(void)
 {
-	if (unlikely(current->unwind_info.id.id))
+	struct unwind_task_info *info = &current->unwind_info;
+	unsigned long bits;
+
+	/* Was there any unwinding? */
+	if (unlikely(info->unwind_mask)) {
+		bits = info->unwind_mask;
+		do {
+			/* Is a task_work going to run again before going back */
+			if (bits & UNWIND_PENDING)
+				return;
+		} while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
 		current->unwind_info.id.id = 0;
+	}
 	/*
 	 * As unwind_user_faultable() can be called directly and
 	 * depends on nr_entries being cleared on exit to user,
 	 * this needs to be a separate conditional.
 	 */
-	if (unlikely(current->unwind_info.cache))
-		current->unwind_info.cache->nr_entries = 0;
+	if (unlikely(info->cache))
+		info->cache->nr_entries = 0;
 }
 
 #else /* !CONFIG_UNWIND_USER */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 7a03a8672b0d..db6c65daf185 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,8 +2,6 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 
-#include <asm/local.h>
-
 struct unwind_cache {
 	unsigned int		nr_entries;
 	unsigned long		entries[];
@@ -23,7 +21,6 @@ struct unwind_task_info {
 	struct callback_head	work;
 	unsigned long		unwind_mask;
 	union unwind_task_id	id;
-	local_t			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 9aed9866f460..256458f3eafe 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -47,6 +47,11 @@ static LIST_HEAD(callbacks);
 static unsigned long unwind_mask;
 DEFINE_STATIC_SRCU(unwind_srcu);
 
+static inline bool unwind_pending(struct unwind_task_info *info)
+{
+	return test_bit(UNWIND_PENDING_BIT, &info->unwind_mask);
+}
+
 /*
  * This is a unique percpu identifier for a given task entry context.
  * Conceptually, it's incremented every time the CPU enters the kernel from
@@ -143,14 +148,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
+	unsigned long bits;
 	u64 cookie;
 	int idx;
 
-	if (WARN_ON_ONCE(!local_read(&info->pending)))
+	if (WARN_ON_ONCE(!unwind_pending(info)))
 		return;
 
-	/* Allow work to come in again */
-	local_set(&info->pending, 0);
+	/* Clear pending bit but make sure to have the current bits */
+	bits = READ_ONCE(info->unwind_mask);
+	while (!try_cmpxchg(&info->unwind_mask, &bits, bits & ~UNWIND_PENDING))
+		;
 
 	/*
 	 * From here on out, the callback must always be called, even if it's
@@ -166,10 +174,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	idx = srcu_read_lock(&unwind_srcu);
 	list_for_each_entry_srcu(work, &callbacks, list,
 				 srcu_read_lock_held(&unwind_srcu)) {
-		if (test_bit(work->bit, &info->unwind_mask)) {
+		if (test_bit(work->bit, &bits))
 			work->func(work, &trace, cookie);
-			clear_bit(work->bit, &info->unwind_mask);
-		}
 	}
 	srcu_read_unlock(&unwind_srcu, idx);
 }
@@ -194,15 +200,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
  * because it has already been previously called for the same entry context,
  * it will be called again with the same stack trace and cookie.
  *
- * Return: 1 if the the callback was already queued.
- *         0 if the callback successfully was queued.
+ * Return: 0 if the callback successfully was queued.
+ *         UNWIND_ALREADY_PENDING if the the callback was already queued.
+ *         UNWIND_ALREADY_EXECUTED if the callback was already called
+ *                (and will not be called again)
  *         Negative if there's an error.
  *         @cookie holds the cookie of the first request by any user
  */
 int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 {
 	struct unwind_task_info *info = &current->unwind_info;
-	long pending;
+	unsigned long old, bits;
 	int bit;
 	int ret;
 
@@ -225,32 +233,52 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 
 	*cookie = get_cookie(info);
 
-	/* This is already queued */
-	if (test_bit(bit, &info->unwind_mask))
-		return 1;
+	old = READ_ONCE(info->unwind_mask);
+
+	/* Is this already queued */
+	if (test_bit(bit, &old)) {
+		/*
+		 * If pending is not set, it means this work's callback
+		 * was already called.
+		 */
+		return old & UNWIND_PENDING ? UNWIND_ALREADY_PENDING :
+			UNWIND_ALREADY_EXECUTED;
+	}
 
-	/* callback already pending? */
-	pending = local_read(&info->pending);
-	if (pending)
+	if (unwind_pending(info))
 		goto out;
 
+	/*
+	 * This is the first to enable another task_work for this task since
+	 * the task entered the kernel, or had already called the callbacks.
+	 * Set only the bit for this work and clear all others as they have
+	 * already had their callbacks called, and do not need to call them
+	 * again because of this work.
+	 */
+	bits = UNWIND_PENDING | BIT(bit);
+
+	/*
+	 * If the cmpxchg() fails, it means that an NMI came in and set
+	 * the pending bit as well as cleared the other bits. Just
+	 * jump to setting the bit for this work.
+	 */
 	if (CAN_USE_IN_NMI) {
-		/* Claim the work unless an NMI just now swooped in to do so. */
-		if (!local_try_cmpxchg(&info->pending, &pending, 1))
+		if (!try_cmpxchg(&info->unwind_mask, &old, bits))
 			goto out;
 	} else {
-		local_set(&info->pending, 1);
+		info->unwind_mask = bits;
 	}
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret)) {
-		local_set(&info->pending, 0);
-		return ret;
-	}
 
+	if (WARN_ON_ONCE(ret))
+		WRITE_ONCE(info->unwind_mask, 0);
+
+	return ret;
  out:
-	return test_and_set_bit(bit, &info->unwind_mask);
+	return test_and_set_bit(bit, &info->unwind_mask) ?
+		UNWIND_ALREADY_PENDING : 0;
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
@@ -286,7 +314,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	guard(mutex)(&callback_mutex);
 
 	/* See if there's a bit in the mask available */
-	if (unwind_mask == ~0UL)
+	if (unwind_mask == ~(UNWIND_PENDING))
 		return -EBUSY;
 
 	work->bit = ffz(unwind_mask);
-- 
2.47.2
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Peter Zijlstra 2 months, 3 weeks ago
On Mon, Jul 07, 2025 at 09:22:49PM -0400, Steven Rostedt wrote:
> diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
> index 12bffdb0648e..587e120c0fd6 100644
> --- a/include/linux/unwind_deferred.h
> +++ b/include/linux/unwind_deferred.h
> @@ -18,6 +18,14 @@ struct unwind_work {
>  
>  #ifdef CONFIG_UNWIND_USER
>  
> +#define UNWIND_PENDING_BIT	(BITS_PER_LONG - 1)
> +#define UNWIND_PENDING		BIT(UNWIND_PENDING_BIT)

Since it really didn't matter what bit you took, why not take bit 0?

> +
> +enum {
> +	UNWIND_ALREADY_PENDING	= 1,
> +	UNWIND_ALREADY_EXECUTED	= 2,
> +};
> +
>  void unwind_task_init(struct task_struct *task);
>  void unwind_task_free(struct task_struct *task);
>  
> @@ -29,15 +37,26 @@ void unwind_deferred_cancel(struct unwind_work *work);
>  
>  static __always_inline void unwind_reset_info(void)
>  {
> -	if (unlikely(current->unwind_info.id.id))
> +	struct unwind_task_info *info = &current->unwind_info;
> +	unsigned long bits;
> +
> +	/* Was there any unwinding? */
> +	if (unlikely(info->unwind_mask)) {
> +		bits = info->unwind_mask;
> +		do {
> +			/* Is a task_work going to run again before going back */
> +			if (bits & UNWIND_PENDING)
> +				return;
> +		} while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
>  		current->unwind_info.id.id = 0;
> +	}
>  	/*
>  	 * As unwind_user_faultable() can be called directly and
>  	 * depends on nr_entries being cleared on exit to user,
>  	 * this needs to be a separate conditional.
>  	 */
> -	if (unlikely(current->unwind_info.cache))
> -		current->unwind_info.cache->nr_entries = 0;
> +	if (unlikely(info->cache))
> +		info->cache->nr_entries = 0;
>  }
>  
>  #else /* !CONFIG_UNWIND_USER */

> diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
> index 9aed9866f460..256458f3eafe 100644
> --- a/kernel/unwind/deferred.c
> +++ b/kernel/unwind/deferred.c
> @@ -47,6 +47,11 @@ static LIST_HEAD(callbacks);
>  static unsigned long unwind_mask;
>  DEFINE_STATIC_SRCU(unwind_srcu);
>  
> +static inline bool unwind_pending(struct unwind_task_info *info)
> +{
> +	return test_bit(UNWIND_PENDING_BIT, &info->unwind_mask);
> +}
> +
>  /*
>   * This is a unique percpu identifier for a given task entry context.
>   * Conceptually, it's incremented every time the CPU enters the kernel from
> @@ -143,14 +148,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
>  	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
>  	struct unwind_stacktrace trace;
>  	struct unwind_work *work;
> +	unsigned long bits;
>  	u64 cookie;
>  	int idx;
>  
> -	if (WARN_ON_ONCE(!local_read(&info->pending)))
> +	if (WARN_ON_ONCE(!unwind_pending(info)))
>  		return;
>  
> -	/* Allow work to come in again */
> -	local_set(&info->pending, 0);
> +	/* Clear pending bit but make sure to have the current bits */
> +	bits = READ_ONCE(info->unwind_mask);
> +	while (!try_cmpxchg(&info->unwind_mask, &bits, bits & ~UNWIND_PENDING))
> +		;

We have:

	bits = atomic_long_fetch_andnot(UNWIND_PENDING, &info->unwind_mask);

for that. A fair number of architecture can actually do this better than
a cmpxchg loop.

>  
>  	/*
>  	 * From here on out, the callback must always be called, even if it's
> @@ -166,10 +174,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
>  	idx = srcu_read_lock(&unwind_srcu);
>  	list_for_each_entry_srcu(work, &callbacks, list,
>  				 srcu_read_lock_held(&unwind_srcu)) {
> -		if (test_bit(work->bit, &info->unwind_mask)) {
> +		if (test_bit(work->bit, &bits))
>  			work->func(work, &trace, cookie);
> -			clear_bit(work->bit, &info->unwind_mask);
> -		}
>  	}
>  	srcu_read_unlock(&unwind_srcu, idx);
>  }
> @@ -194,15 +200,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
>   * because it has already been previously called for the same entry context,
>   * it will be called again with the same stack trace and cookie.
>   *
> - * Return: 1 if the the callback was already queued.
> - *         0 if the callback successfully was queued.
> + * Return: 0 if the callback successfully was queued.
> + *         UNWIND_ALREADY_PENDING if the the callback was already queued.
> + *         UNWIND_ALREADY_EXECUTED if the callback was already called
> + *                (and will not be called again)
>   *         Negative if there's an error.
>   *         @cookie holds the cookie of the first request by any user
>   */

Lots of babbling in the Changelog, but no real elucidation as to why you
need this second return value.

AFAICT it serves no real purpose; the users of this function should not
care. The only difference is that the unwind reference (your cookie)
becomes a backward reference instead of a forward reference. But why
would anybody care?

Whatever tool is ultimately in charge of gluing humpty^Wstacktraces back
together again should have no problem with this.

>  int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
>  {
>  	struct unwind_task_info *info = &current->unwind_info;
> -	long pending;
> +	unsigned long old, bits;
>  	int bit;
>  	int ret;
>  
> @@ -225,32 +233,52 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
>  
>  	*cookie = get_cookie(info);
>  
> -	/* This is already queued */
> -	if (test_bit(bit, &info->unwind_mask))
> -		return 1;
> +	old = READ_ONCE(info->unwind_mask);
> +
> +	/* Is this already queued */
> +	if (test_bit(bit, &old)) {
> +		/*
> +		 * If pending is not set, it means this work's callback
> +		 * was already called.
> +		 */
> +		return old & UNWIND_PENDING ? UNWIND_ALREADY_PENDING :
> +			UNWIND_ALREADY_EXECUTED;
> +	}
>  
> -	/* callback already pending? */
> -	pending = local_read(&info->pending);
> -	if (pending)
> +	if (unwind_pending(info))
>  		goto out;
>  
> +	/*
> +	 * This is the first to enable another task_work for this task since
> +	 * the task entered the kernel, or had already called the callbacks.
> +	 * Set only the bit for this work and clear all others as they have
> +	 * already had their callbacks called, and do not need to call them
> +	 * again because of this work.
> +	 */
> +	bits = UNWIND_PENDING | BIT(bit);
> +
> +	/*
> +	 * If the cmpxchg() fails, it means that an NMI came in and set
> +	 * the pending bit as well as cleared the other bits. Just
> +	 * jump to setting the bit for this work.
> +	 */
>  	if (CAN_USE_IN_NMI) {
> -		/* Claim the work unless an NMI just now swooped in to do so. */
> -		if (!local_try_cmpxchg(&info->pending, &pending, 1))
> +		if (!try_cmpxchg(&info->unwind_mask, &old, bits))
>  			goto out;
>  	} else {
> -		local_set(&info->pending, 1);
> +		info->unwind_mask = bits;
>  	}
>  
>  	/* The work has been claimed, now schedule it. */
>  	ret = task_work_add(current, &info->work, TWA_RESUME);
> -	if (WARN_ON_ONCE(ret)) {
> -		local_set(&info->pending, 0);
> -		return ret;
> -	}
>  
> +	if (WARN_ON_ONCE(ret))
> +		WRITE_ONCE(info->unwind_mask, 0);
> +
> +	return ret;
>   out:
> -	return test_and_set_bit(bit, &info->unwind_mask);
> +	return test_and_set_bit(bit, &info->unwind_mask) ?
> +		UNWIND_ALREADY_PENDING : 0;
>  }

This is some of the most horrifyingly confused code I've seen in a
while.

Please just slow down and think for a minute.

The below is the last four patches rolled into one. Not been near a
compiler.

---

--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -524,4 +524,8 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_st
 		    srcu_read_unlock(_T->lock, _T->idx),
 		    int idx)
 
+DEFINE_LOCK_GUARD_1(srcu_lite, struct srcu_struct,
+		    _T->idx = srcu_read_lock_lite(_T->lock),
+		    srcu_read_unlock_lite(_T->lock, _T->idx),
+		    int idx)
 #endif
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -13,10 +13,14 @@ typedef void (*unwind_callback_t)(struct
 struct unwind_work {
 	struct list_head		list;
 	unwind_callback_t		func;
+	int				bit;
 };
 
 #ifdef CONFIG_UNWIND_USER
 
+#define UNWIND_PENDING_BIT	0
+#define UNWIND_PENDING		BIT(UNWIND_PENDING_BIT)
+
 void unwind_task_init(struct task_struct *task);
 void unwind_task_free(struct task_struct *task);
 
@@ -28,15 +32,26 @@ void unwind_deferred_cancel(struct unwin
 
 static __always_inline void unwind_reset_info(void)
 {
-	if (unlikely(current->unwind_info.id.id))
+	struct unwind_task_info *info = &current->unwind_info;
+	unsigned long bits;
+
+	/* Was there any unwinding? */
+	if (unlikely(info->unwind_mask)) {
+		bits = raw_atomic_long_read(&info->unwind_mask);
+		do {
+			/* Is a task_work going to run again before going back */
+			if (bits & UNWIND_PENDING)
+				return;
+		} while (!raw_atomic_long_try_cmpxchg(&info->unwind_mask, &bits, 0UL));
 		current->unwind_info.id.id = 0;
+	}
 	/*
 	 * As unwind_user_faultable() can be called directly and
 	 * depends on nr_entries being cleared on exit to user,
 	 * this needs to be a separate conditional.
 	 */
-	if (unlikely(current->unwind_info.cache))
-		current->unwind_info.cache->nr_entries = 0;
+	if (unlikely(info->cache))
+		info->cache->nr_entries = 0;
 }
 
 #else /* !CONFIG_UNWIND_USER */
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 
+#include <linux/atomic.h>
+
 struct unwind_cache {
 	unsigned int		nr_entries;
 	unsigned long		entries[];
@@ -19,8 +21,8 @@ union unwind_task_id {
 struct unwind_task_info {
 	struct unwind_cache	*cache;
 	struct callback_head	work;
+	atomic_long_t		unwind_mask;
 	union unwind_task_id	id;
-	int			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -12,13 +12,39 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 
+/*
+ * For requesting a deferred user space stack trace from NMI context
+ * the architecture must support a safe cmpxchg in NMI context.
+ * For those architectures that do not have that, then it cannot ask
+ * for a deferred user space stack trace from an NMI context. If it
+ * does, then it will get -EINVAL.
+ */
+#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+#define UNWIND_NMI_SAFE 1
+static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
+{
+	u32 zero = 0;
+	return try_cmpxchg(&info->id.cnt, &zero, cnt);
+}
+#else
+#define UNWIND_NMI_SAFE 0
+/* When NMIs are not allowed, this always succeeds */
+static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
+{
+	info->id.cnt = cnt;
+	return true;
+}
+#endif /* CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG */
+
 /* Make the cache fit in a 4K page */
 #define UNWIND_MAX_ENTRIES					\
 	((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
 
-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
+static unsigned long unwind_mask;
+DEFINE_STATIC_SRCU(unwind_srcu);
 
 /*
  * This is a unique percpu identifier for a given task entry context.
@@ -41,21 +67,16 @@ static DEFINE_PER_CPU(u32, unwind_ctx_ct
  */
 static u64 get_cookie(struct unwind_task_info *info)
 {
-	u32 cpu_cnt;
-	u32 cnt;
-	u32 old = 0;
+	u32 cnt = 1;
 
 	if (info->id.cpu)
 		return info->id.id;
 
-	cpu_cnt = __this_cpu_read(unwind_ctx_ctr);
-	cpu_cnt += 2;
-	cnt = cpu_cnt | 1; /* Always make non zero */
-
-	if (try_cmpxchg(&info->id.cnt, &old, cnt)) {
-		/* Update the per cpu counter */
-		__this_cpu_write(unwind_ctx_ctr, cpu_cnt);
-	}
+	/* LSB it always set to ensure 0 is an invalid value. */
+	cnt |= __this_cpu_read(unwind_ctx_ctr) + 2;
+	if (try_assign_cnt(info, cnt))
+		__this_cpu_write(unwind_ctx_ctr, cnt);
+
 	/* Interrupts are disabled, the CPU will always be same */
 	info->id.cpu = smp_processor_id() + 1; /* Must be non zero */
 
@@ -117,13 +138,13 @@ static void unwind_deferred_task_work(st
 	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
+	unsigned long bits;
 	u64 cookie;
 
-	if (WARN_ON_ONCE(!info->pending))
+	if (WARN_ON_ONCE(!unwind_pending(info)))
 		return;
 
-	/* Allow work to come in again */
-	WRITE_ONCE(info->pending, 0);
+	bits = atomic_long_fetch_andnot(UNWIND_PENDING, &info->unwind_mask);
 
 	/*
 	 * From here on out, the callback must always be called, even if it's
@@ -136,9 +157,11 @@ static void unwind_deferred_task_work(st
 
 	cookie = info->id.id;
 
-	guard(mutex)(&callback_mutex);
-	list_for_each_entry(work, &callbacks, list) {
-		work->func(work, &trace, cookie);
+	guard(srcu_lite)(&unwind_srcu);
+	list_for_each_entry_srcu(work, &callbacks, list,
+				 srcu_read_lock_held(&unwind_srcu)) {
+		if (test_bit(work->bit, &bits))
+			work->func(work, &trace, cookie);
 	}
 }
 
@@ -162,7 +185,7 @@ static void unwind_deferred_task_work(st
  * because it has already been previously called for the same entry context,
  * it will be called again with the same stack trace and cookie.
  *
- * Return: 1 if the the callback was already queued.
+ * Return: 1 if the callback was already queued.
  *         0 if the callback successfully was queued.
  *         Negative if there's an error.
  *         @cookie holds the cookie of the first request by any user
@@ -170,41 +193,62 @@ static void unwind_deferred_task_work(st
 int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 {
 	struct unwind_task_info *info = &current->unwind_info;
-	int ret;
+	unsigned long bits, mask;
+	int bit, ret;
 
 	*cookie = 0;
 
-	if (WARN_ON_ONCE(in_nmi()))
-		return -EINVAL;
-
 	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
 	    !user_mode(task_pt_regs(current)))
 		return -EINVAL;
 
+	/* NMI requires having safe cmpxchg operations */
+	if (WARN_ON_ONCE(!UNWIND_NMI_SAFE && in_nmi()))
+		return -EINVAL;
+
+	/* Do not allow cancelled works to request again */
+	bit = READ_ONCE(work->bit);
+	if (WARN_ON_ONCE(bit < 0))
+		return -EINVAL;
+
 	guard(irqsave)();
 
 	*cookie = get_cookie(info);
 
-	/* callback already pending? */
-	if (info->pending)
+	bits = UNWIND_PENDING | BIT(bit);
+	mask = atomic_long_fetch_or(bits, &info->unwind_mask);
+	if (mask & bits)
 		return 1;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
 	if (WARN_ON_ONCE(ret))
-		return ret;
-
-	info->pending = 1;
-	return 0;
+		atomic_long_set(0, &info->unwind_mask);
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
 {
+	struct task_struct *g, *t;
+	int bit;
+
 	if (!work)
 		return;
 
 	guard(mutex)(&callback_mutex);
-	list_del(&work->list);
+	list_del_rcu(&work->list);
+	bit = work->bit;
+
+	/* Do not allow any more requests and prevent callbacks */
+	work->bit = -1;
+
+	__clear_bit(bit, &unwind_mask);
+
+	synchronize_srcu(&unwind_srcu);
+
+	guard(rcu)();
+	/* Clear this bit from all threads */
+	for_each_process_thread(g, t)
+		atomic_long_andnot(BIT(bit), &t->unwind_info.unwind_mask);
 }
 
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -212,7 +256,15 @@ int unwind_deferred_init(struct unwind_w
 	memset(work, 0, sizeof(*work));
 
 	guard(mutex)(&callback_mutex);
-	list_add(&work->list, &callbacks);
+
+	/* See if there's a bit in the mask available */
+	if (unwind_mask == ~UNWIND_PENDING)
+		return -EBUSY;
+
+	work->bit = ffz(unwind_mask);
+	__set_bit(work->bit, &unwind_mask);
+
+	list_add_rcu(&work->list, &callbacks);
 	work->func = func;
 	return 0;
 }
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Steven Rostedt 2 months, 3 weeks ago
On Tue, 15 Jul 2025 12:29:12 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Jul 07, 2025 at 09:22:49PM -0400, Steven Rostedt wrote:
> >  
> > +	/*
> > +	 * This is the first to enable another task_work for this task since
> > +	 * the task entered the kernel, or had already called the callbacks.
> > +	 * Set only the bit for this work and clear all others as they have
> > +	 * already had their callbacks called, and do not need to call them
> > +	 * again because of this work.
> > +	 */
> > +	bits = UNWIND_PENDING | BIT(bit);
> > +
> > +	/*
> > +	 * If the cmpxchg() fails, it means that an NMI came in and set
> > +	 * the pending bit as well as cleared the other bits. Just
> > +	 * jump to setting the bit for this work.
> > +	 */
> >  	if (CAN_USE_IN_NMI) {
> > -		/* Claim the work unless an NMI just now swooped in to do so. */
> > -		if (!local_try_cmpxchg(&info->pending, &pending, 1))
> > +		if (!try_cmpxchg(&info->unwind_mask, &old, bits))
> >  			goto out;
> >  	} else {
> > -		local_set(&info->pending, 1);
> > +		info->unwind_mask = bits;
> >  	}
> >  
> >  	/* The work has been claimed, now schedule it. */
> >  	ret = task_work_add(current, &info->work, TWA_RESUME);
> > -	if (WARN_ON_ONCE(ret)) {
> > -		local_set(&info->pending, 0);
> > -		return ret;
> > -	}
> >  
> > +	if (WARN_ON_ONCE(ret))
> > +		WRITE_ONCE(info->unwind_mask, 0);
> > +
> > +	return ret;
> >   out:
> > -	return test_and_set_bit(bit, &info->unwind_mask);
> > +	return test_and_set_bit(bit, &info->unwind_mask) ?
> > +		UNWIND_ALREADY_PENDING : 0;
> >  }  
> 
> This is some of the most horrifyingly confused code I've seen in a
> while.
> 
> Please just slow down and think for a minute.
> 
> The below is the last four patches rolled into one. Not been near a
> compiler.

The above is still needed as is (explained below).


> @@ -170,41 +193,62 @@ static void unwind_deferred_task_work(st
>  int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
>  {
>  	struct unwind_task_info *info = &current->unwind_info;
> -	int ret;
> +	unsigned long bits, mask;
> +	int bit, ret;
>  
>  	*cookie = 0;
>  
> -	if (WARN_ON_ONCE(in_nmi()))
> -		return -EINVAL;
> -
>  	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
>  	    !user_mode(task_pt_regs(current)))
>  		return -EINVAL;
>  
> +	/* NMI requires having safe cmpxchg operations */
> +	if (WARN_ON_ONCE(!UNWIND_NMI_SAFE && in_nmi()))
> +		return -EINVAL;
> +
> +	/* Do not allow cancelled works to request again */
> +	bit = READ_ONCE(work->bit);
> +	if (WARN_ON_ONCE(bit < 0))
> +		return -EINVAL;
> +
>  	guard(irqsave)();
>  
>  	*cookie = get_cookie(info);
>  
> -	/* callback already pending? */
> -	if (info->pending)
> +	bits = UNWIND_PENDING | BIT(bit);
> +	mask = atomic_long_fetch_or(bits, &info->unwind_mask);
> +	if (mask & bits)
>  		return 1;


So the fetch_or() isn't good enough for what needs to be done, which is
why the code above is the way it is.

We have this scenario:


  perf and ftrace are both tracing the same task. perf with bit 1 and
  ftrace with bit 2. Let's say there's even another perf program
  running that registered bit 3.


  perf requests a deferred callback, and info->unwind_mask gets bit 1
  and the pending bit set.

  The task is exiting to user space and calls perf's callback and
  clears the pending bit but keeps perf's bit set as it was already
  called, and doesn't need to be called again even if perf requests a
  new stacktrace before the task gets back to user space.

  Now before the task gets back to user space, ftrace requests the
  deferred trace. To do so, it must set the pending bit and its bit,
  but it must also clear the perf bit as it should not call perf's
  callback again.

The atomic_long_fetch_or() above will set ftrace's bit but not clear
perf's bit, and the perf callback will get called a second time even
though perf never requested another callback.

This is why the code at the top has:

	bits = UNWIND_PENDING | BIT(bit);

	/*
	 * If the cmpxchg() fails, it means that an NMI came in and set
	 * the pending bit as well as cleared the other bits. Just
	 * jump to setting the bit for this work.
	 */
	if (CAN_USE_IN_NMI) {
		/* Claim the work unless an NMI just now swooped in to do so. */
		if (!try_cmpxchg(&info->unwind_mask, &old, bits))
			goto out;

That cmpxchg() clears out any of the old bits if pending isn't set. Now
if an NMI came in and the other perf process requested a callback, it
would set its own bit plus the pending bit and then ftrace only needs
to jump to the end and do the test_and_set on its bit.
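
To make the difference concrete, here is a tiny user-space sketch (not
kernel code; the bit positions are assumptions for the example) of that
scenario, showing why a plain fetch_or keeps perf's stale bit while the
cmpxchg claim drops it:

#include <stdio.h>

#define PENDING	(1UL << (8 * sizeof(unsigned long) - 1))
#define PERF	(1UL << 1)	/* first perf program */
#define FTRACE	(1UL << 2)	/* ftrace */
#define PERF2	(1UL << 3)	/* second perf program (e.g. from an NMI) */

int main(void)
{
	/* perf requested and the task_work already ran its callback:
	 * PENDING was cleared but perf's bit stays set until exit. */
	unsigned long mask = PERF;

	/* ftrace request done with a plain fetch_or: perf's stale bit
	 * survives, so the next task_work calls perf's callback again. */
	unsigned long via_fetch_or = mask | PENDING | FTRACE;

	/* ftrace request done with try_cmpxchg(&mask, &old, PENDING|FTRACE):
	 * the stale bits are dropped and only ftrace's callback runs. */
	unsigned long via_cmpxchg = PENDING | FTRACE;

	/* If an NMI request won the cmpxchg race, ftrace falls back to
	 * test_and_set_bit() on its own bit (the "goto out" path). */
	unsigned long after_nmi_race = PENDING | PERF2 | FTRACE;

	printf("after perf callback:  %#lx\n", mask);
	printf("ftrace via fetch_or:  %#lx (perf bit still set)\n", via_fetch_or);
	printf("ftrace via cmpxchg:   %#lx (perf bit cleared)\n", via_cmpxchg);
	printf("after NMI race:       %#lx\n", after_nmi_race);
	return 0;
}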

-- Steve


>  
>  	/* The work has been claimed, now schedule it. */
>  	ret = task_work_add(current, &info->work, TWA_RESUME);
>  	if (WARN_ON_ONCE(ret))
> -		return ret;
> -
> -	info->pending = 1;
> -	return 0;
> +		atomic_long_set(0, &info->unwind_mask);
>  }
>
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Steven Rostedt 2 months, 3 weeks ago
On Wed, 16 Jul 2025 14:26:09 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

>   Now before the task gets back to user space, ftrace requests the
>   deferred trace. To do so, it must set the pending bit and its bit,
>   but it must also clear the perf bit as it should not call perf's
>   callback again.

After ftrace clears the bits, it is possible that the first perf
program will request again and this time it will get another callback
with the same cookie. But at least it has a request between the last
callback and the next one.

That is, it would have:

[Task enters kernel]
  request -> add cookie
  request -> add cookie
  [..]
  callback -> add trace + cookie
 [ftrace clears bits]
  request -> add cookie
  callback -> add trace + cookie
[Task exits back to user space]

Which shouldn't be too confusing. But if we just do the fetch_or and perf
didn't request a new trace, it would have:

[Task enters kernel]
  request -> add cookie
  request -> add cookie
  [..]
  callback -> add trace + cookie
 [ftrace clears bits]
  callback -> add trace + cookie
[Task exits back to user space]

Where there are two callbacks written to the perf buffer for the same
request.

Maybe this isn't a problem, but I was trying to avoid spurious callbacks
caused by other tracers affecting the state.

-- Steve
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Steven Rostedt 2 months, 3 weeks ago
On Wed, 16 Jul 2025 14:33:52 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> [Task enters kernel]
>   request -> add cookie
>   request -> add cookie
>   [..]
>   callback -> add trace + cookie
>  [ftrace clears bits]
>   callback -> add trace + cookie
> [Task exits back to user space]

Another solution could be to add another unwind mask of completed
callbacks. Then we could use the fetch_or(). I guess that could look
like this:

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 15045999c5e2..0124865aaab4 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -61,8 +61,10 @@ static __always_inline void unwind_reset_info(void)
 		} while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
 		current->unwind_info.id.id = 0;
 
-		if (unlikely(info->cache))
+		if (unlikely(info->cache)) {
 			info->cache->nr_entries = 0;
+			info->cache->unwind_completed = 0;
+		}
 	}
 }
 
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 5dc9cda141ff..33b62ac25c86 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -3,6 +3,7 @@
 #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 
 struct unwind_cache {
+	unsigned long		unwind_completed;
 	unsigned int		nr_entries;
 	unsigned long		entries[];
 };
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 60cc71062f86..9a3e06ee9d63 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -170,13 +170,19 @@ static void process_unwind_deferred(struct task_struct *task)
 
 	unwind_user_faultable(&trace);
 
+	if (info->cache)
+		bits &= ~(info->cache->unwind_completed);
+
 	cookie = info->id.id;
 
 	guard(srcu_lite)(&unwind_srcu);
 	list_for_each_entry_srcu(work, &callbacks, list,
 				 srcu_read_lock_held(&unwind_srcu)) {
-		if (test_bit(work->bit, &bits))
+		if (test_bit(work->bit, &bits)) {
 			work->func(work, &trace, cookie);
+			if (info->cache)
+				info->cache->unwind_completed |= BIT(work->bit);
+		}
 	}
 }
 
Instead of adding another long word to the task_struct, I just use the
unwind_cache that gets allocated on first use.

I think this can work. I'll switch over to this so I can use the
fetch_or(), and there should be no extra callbacks, even if an
already-called work is requested again after another work's request has
triggered a new task_work.

I'll update this patch (and fold it into the bitmask patch) to use the
fetch_or(), and make the above a separate patch that just gets rid of
spurious callbacks.

-- Steve
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Steven Rostedt 2 months, 3 weeks ago
On Tue, 15 Jul 2025 12:29:12 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> The below is the last four patches rolled into one. Not been near a
> compiler.

And it shows ;-)

> @@ -117,13 +138,13 @@ static void unwind_deferred_task_work(st
>  	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
>  	struct unwind_stacktrace trace;
>  	struct unwind_work *work;
> +	unsigned long bits;
>  	u64 cookie;
>  
> -	if (WARN_ON_ONCE(!info->pending))
> +	if (WARN_ON_ONCE(!unwind_pending(info)))
>  		return;
>  
> -	/* Allow work to come in again */
> -	WRITE_ONCE(info->pending, 0);
> +	bits = atomic_long_fetch_andnot(UNWIND_PENDING, &info->unwind_mask);

I may need to do what other parts of the kernel have done and turn the
above into:

	bits = atomic_long_fetch_andnot(UNWIND_PENDING, (atomic_long_t *)&info->unwind_mask);

As there are other bit manipulations that atomic_long does not take care
of, and it's making the code more confusing. When I looked to see how
other users of atomic_long_andnot() did things, most just cast the
value to use that function :-/
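
For example, a small wrapper like this (hypothetical helper name, assuming
unwind_mask stays declared as an unsigned long) would at least keep the
cast in one place:

static inline unsigned long unwind_clear_pending(struct unwind_task_info *info)
{
	/* unwind_mask is an unsigned long so it works with the plain bit
	 * helpers; cast it for the atomic_long API like other callers do. */
	return atomic_long_fetch_andnot(UNWIND_PENDING,
					(atomic_long_t *)&info->unwind_mask);
}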

-- Steve


>  
>  	/*
>  	 * From here on out, the callback must always be called, even if it's
> @@ -136,9 +157,11 @@ static void unwind_deferred_task_work(st
>  
>  	cookie = info->id.id;
>  
> -	guard(mutex)(&callback_mutex);
> -	list_for_each_entry(work, &callbacks, list) {
> -		work->func(work, &trace, cookie);
> +	guard(srcu_lite)(&unwind_srcu);
> +	list_for_each_entry_srcu(work, &callbacks, list,
> +				 srcu_read_lock_held(&unwind_srcu)) {
> +		if (test_bit(work->bit, &bits))
> +			work->func(work, &trace, cookie);
>  	}
>  }
>  
> @@ -162,7 +185,7 @@ static void unwind_deferred_task_work(st
>   * because it has already been previously called for the same entry context,
>   * it will be called again with the same stack trace and cookie.
>   *
> - * Return: 1 if the the callback was already queued.
> + * Return: 1 if the callback was already queued.
>   *         0 if the callback successfully was queued.
>   *         Negative if there's an error.
>   *         @cookie holds the cookie of the first request by any user
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Steven Rostedt 2 months, 3 weeks ago
On Tue, 15 Jul 2025 12:29:12 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> @@ -170,41 +193,62 @@ static void unwind_deferred_task_work(st
>  int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
>  {
>  	struct unwind_task_info *info = &current->unwind_info;
> -	int ret;
> +	unsigned long bits, mask;
> +	int bit, ret;
>  
>  	*cookie = 0;
>  
> -	if (WARN_ON_ONCE(in_nmi()))
> -		return -EINVAL;
> -
>  	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
>  	    !user_mode(task_pt_regs(current)))
>  		return -EINVAL;
>  
> +	/* NMI requires having safe cmpxchg operations */
> +	if (WARN_ON_ONCE(!UNWIND_NMI_SAFE && in_nmi()))
> +		return -EINVAL;

I don't think we want to have a WARN_ON() here as the perf series tries
to first do the deferred unwinding and, if that fails, falls back to its
old method.

By having a WARN_ON(), we need to make perf aware of this limitation
too. Do we want to do that?

-- Steve
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Peter Zijlstra 2 months, 3 weeks ago
On Tue, Jul 15, 2025 at 01:20:16PM -0400, Steven Rostedt wrote:
> On Tue, 15 Jul 2025 12:29:12 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > @@ -170,41 +193,62 @@ static void unwind_deferred_task_work(st
> >  int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> >  {
> >  	struct unwind_task_info *info = &current->unwind_info;
> > -	int ret;
> > +	unsigned long bits, mask;
> > +	int bit, ret;
> >  
> >  	*cookie = 0;
> >  
> > -	if (WARN_ON_ONCE(in_nmi()))
> > -		return -EINVAL;
> > -
> >  	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
> >  	    !user_mode(task_pt_regs(current)))
> >  		return -EINVAL;
> >  
> > +	/* NMI requires having safe cmpxchg operations */
> > +	if (WARN_ON_ONCE(!UNWIND_NMI_SAFE && in_nmi()))
> > +		return -EINVAL;
> 
> I don't think we want to have a WARN_ON() here as the perf series tries
> to first do the deferred unwinding and if that fails, it will go back
> to it's old method.

The thing is, I don't think we have an architecture that supports NMIs
and does not have NMI safe cmpxchg. And if we do have one such -- I
don't think it has perf; perf very much assumes cmpxchg is NMI safe.

Calling this from NMI context and not having an NMI safe cmpxchg is very
much a dodgy use case. Please leave the WARN; if it ever triggers, we'll
look at whoever manages to hit it and deal with it then.
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Steven Rostedt 2 months, 3 weeks ago
On Tue, 15 Jul 2025 12:29:12 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Jul 07, 2025 at 09:22:49PM -0400, Steven Rostedt wrote:
> > diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
> > index 12bffdb0648e..587e120c0fd6 100644
> > --- a/include/linux/unwind_deferred.h
> > +++ b/include/linux/unwind_deferred.h
> > @@ -18,6 +18,14 @@ struct unwind_work {
> >  
> >  #ifdef CONFIG_UNWIND_USER
> >  
> > +#define UNWIND_PENDING_BIT	(BITS_PER_LONG - 1)
> > +#define UNWIND_PENDING		BIT(UNWIND_PENDING_BIT)  
> 
> Since it really didn't matter what bit you took, why not take bit 0?

I was worried about it affecting the global unwind_mask test, but I guess
bit zero works too. In fact, I think I could just set the PENDING and USED
(see next patch) bits in the global unwind mask as being already "used" and
then change:

	/* See if there's a bit in the mask available */
	if (unwind_mask == ~(UNWIND_PENDING|UNWIND_USED))
		return -EBUSY;

Back to:

	/* See if there's a bit in the mask available */
	if (unwind_mask == ~0UL)
		return -EBUSY;

> 


> >  /*
> >   * This is a unique percpu identifier for a given task entry context.
> >   * Conceptually, it's incremented every time the CPU enters the kernel from
> > @@ -143,14 +148,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
> >  	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
> >  	struct unwind_stacktrace trace;
> >  	struct unwind_work *work;
> > +	unsigned long bits;
> >  	u64 cookie;
> >  	int idx;
> >  
> > -	if (WARN_ON_ONCE(!local_read(&info->pending)))
> > +	if (WARN_ON_ONCE(!unwind_pending(info)))
> >  		return;
> >  
> > -	/* Allow work to come in again */
> > -	local_set(&info->pending, 0);
> > +	/* Clear pending bit but make sure to have the current bits */
> > +	bits = READ_ONCE(info->unwind_mask);
> > +	while (!try_cmpxchg(&info->unwind_mask, &bits, bits & ~UNWIND_PENDING))
> > +		;  
> 
> We have:
> 
> 	bits = atomic_long_fetch_andnot(UNWIND_PENDING, &info->unwind_mask);
> 
> for that. A fair number of architecture can actually do this better than
> a cmpxchg loop.

Thanks, I didn't know about that one.

> 
> >  
> >  	/*
> >  	 * From here on out, the callback must always be called, even if it's
> > @@ -166,10 +174,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
> >  	idx = srcu_read_lock(&unwind_srcu);
> >  	list_for_each_entry_srcu(work, &callbacks, list,
> >  				 srcu_read_lock_held(&unwind_srcu)) {
> > -		if (test_bit(work->bit, &info->unwind_mask)) {
> > +		if (test_bit(work->bit, &bits))
> >  			work->func(work, &trace, cookie);
> > -			clear_bit(work->bit, &info->unwind_mask);
> > -		}
> >  	}
> >  	srcu_read_unlock(&unwind_srcu, idx);
> >  }
> > @@ -194,15 +200,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
> >   * because it has already been previously called for the same entry context,
> >   * it will be called again with the same stack trace and cookie.
> >   *
> > - * Return: 1 if the the callback was already queued.
> > - *         0 if the callback successfully was queued.
> > + * Return: 0 if the callback successfully was queued.
> > + *         UNWIND_ALREADY_PENDING if the the callback was already queued.
> > + *         UNWIND_ALREADY_EXECUTED if the callback was already called
> > + *                (and will not be called again)
> >   *         Negative if there's an error.
> >   *         @cookie holds the cookie of the first request by any user
> >   */  
> 
> Lots of babbling in the Changelog, but no real elucidation as to why you
> need this second return value.
> 
> AFAICT it serves no real purpose; the users of this function should not
> care. The only difference is that the unwind reference (your cookie)
> becomes a backward reference instead of a forward reference. But why
> would anybody care?

Older versions of the code required it. I think I can remove it now.

> 
> Whatever tool is ultimately in charge of gluing humpty^Wstacktraces back
> together again should have no problem with this.
> 
> >  int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> >  {
> >  	struct unwind_task_info *info = &current->unwind_info;
> > -	long pending;
> > +	unsigned long old, bits;
> >  	int bit;
> >  	int ret;
> >  
> > @@ -225,32 +233,52 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> >  
> >  	*cookie = get_cookie(info);
> >  
> > -	/* This is already queued */
> > -	if (test_bit(bit, &info->unwind_mask))
> > -		return 1;
> > +	old = READ_ONCE(info->unwind_mask);
> > +
> > +	/* Is this already queued */
> > +	if (test_bit(bit, &old)) {
> > +		/*
> > +		 * If pending is not set, it means this work's callback
> > +		 * was already called.
> > +		 */
> > +		return old & UNWIND_PENDING ? UNWIND_ALREADY_PENDING :
> > +			UNWIND_ALREADY_EXECUTED;
> > +	}
> >  
> > -	/* callback already pending? */
> > -	pending = local_read(&info->pending);
> > -	if (pending)
> > +	if (unwind_pending(info))
> >  		goto out;
> >  
> > +	/*
> > +	 * This is the first to enable another task_work for this task since
> > +	 * the task entered the kernel, or had already called the callbacks.
> > +	 * Set only the bit for this work and clear all others as they have
> > +	 * already had their callbacks called, and do not need to call them
> > +	 * again because of this work.
> > +	 */
> > +	bits = UNWIND_PENDING | BIT(bit);
> > +
> > +	/*
> > +	 * If the cmpxchg() fails, it means that an NMI came in and set
> > +	 * the pending bit as well as cleared the other bits. Just
> > +	 * jump to setting the bit for this work.
> > +	 */
> >  	if (CAN_USE_IN_NMI) {
> > -		/* Claim the work unless an NMI just now swooped in to do so. */
> > -		if (!local_try_cmpxchg(&info->pending, &pending, 1))
> > +		if (!try_cmpxchg(&info->unwind_mask, &old, bits))
> >  			goto out;
> >  	} else {
> > -		local_set(&info->pending, 1);
> > +		info->unwind_mask = bits;
> >  	}
> >  
> >  	/* The work has been claimed, now schedule it. */
> >  	ret = task_work_add(current, &info->work, TWA_RESUME);
> > -	if (WARN_ON_ONCE(ret)) {
> > -		local_set(&info->pending, 0);
> > -		return ret;
> > -	}
> >  
> > +	if (WARN_ON_ONCE(ret))
> > +		WRITE_ONCE(info->unwind_mask, 0);
> > +
> > +	return ret;
> >   out:
> > -	return test_and_set_bit(bit, &info->unwind_mask);
> > +	return test_and_set_bit(bit, &info->unwind_mask) ?
> > +		UNWIND_ALREADY_PENDING : 0;
> >  }  
> 
> This is some of the most horrifyingly confused code I've seen in a
> while.
> 
> Please just slow down and think for a minute.
> 
> The below is the last four patches rolled into one. Not been near a
> compiler.

Are you recommending that I fold those patches into one?

I'm fine with that. Note, part of the way things are broken up is because I
took Josh's code and built on top of his work. I tend to try not to modify
someone else's code when doing that, and instead make building blocks of
each stage.

Also, it follows the way I tend to review code, which is to take the entire
patch set, apply it, then look at each patch compared to the final result.

That probably explains why my patch series is confusing for you, as it was
written more for the way I review. Sorry about that.

-- Steve
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Peter Zijlstra 2 months, 3 weeks ago
On Tue, Jul 15, 2025 at 08:49:32AM -0400, Steven Rostedt wrote:
> > The below is the last four patches rolled into one. Not been near a
> > compiler.
> 
> Are you recommending that I fold those patches into one?

Not particularly; but given the terrible back and forth, the only sane
way to 'show' my changes is from patch 6 onwards folded. If I were to
diff against patch 9 it'd be a shitshow.

At the very least the SRCU thing ought to be broken back out. Not sure
how many pieces it can reasonably be broken into, see what works.
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Steven Rostedt 2 months, 3 weeks ago
On Tue, 15 Jul 2025 08:49:32 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > >   *
> > > - * Return: 1 if the the callback was already queued.
> > > - *         0 if the callback successfully was queued.
> > > + * Return: 0 if the callback successfully was queued.
> > > + *         UNWIND_ALREADY_PENDING if the the callback was already queued.
> > > + *         UNWIND_ALREADY_EXECUTED if the callback was already called
> > > + *                (and will not be called again)
> > >   *         Negative if there's an error.
> > >   *         @cookie holds the cookie of the first request by any user
> > >   */    
> > 
> > Lots of babbling in the Changelog, but no real elucidation as to why you
> > need this second return value.
> > 
> > AFAICT it serves no real purpose; the users of this function should not
> > care. The only difference is that the unwind reference (your cookie)
> > becomes a backward reference instead of a forward reference. But why
> > would anybody care?  
> 
> Older versions of the code required it. I think I can remove it now.

Ah it is still used in the perf code:

perf_callchain() has:

        if (defer_user) {
                int ret = deferred_request(event);
                if (!ret)
                        local_inc(&event->ctx->nr_no_switch_fast);
                else if (ret < 0)
                        defer_user = false;
        }

Where deferred_request() is a static function that returns the result
of the unwind request. If it is zero, it means the callback will be
called; if it is greater than zero, it means it has already been called;
and negative is an error (in which case perf uses the old method).

It looks like when the callback is called it expects nr_no_switch_fast
to be incremented and it will decrement it. This is directly from
Josh's patch and I don't know perf well enough to know if that update
to nr_no_switch_fast is needed.

If it's not needed, we can just return 0 on success and negative on
failure. What do you think?
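
If the increment really isn't needed, the request site could collapse to
something like this (just a sketch; whether nr_no_switch_fast can be left
out here is exactly the open question):

        if (defer_user && deferred_request(event) < 0) {
                /* Deferred unwind not available, use the old method. */
                defer_user = false;
        }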

Here's the original patch:

  https://lore.kernel.org/all/20250708020050.928524258@kernel.org/

-- Steve
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Peter Zijlstra 2 months, 3 weeks ago
On Tue, Jul 15, 2025 at 02:06:50PM -0400, Steven Rostedt wrote:
> On Tue, 15 Jul 2025 08:49:32 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > > >   *
> > > > - * Return: 1 if the the callback was already queued.
> > > > - *         0 if the callback successfully was queued.
> > > > + * Return: 0 if the callback successfully was queued.
> > > > + *         UNWIND_ALREADY_PENDING if the the callback was already queued.
> > > > + *         UNWIND_ALREADY_EXECUTED if the callback was already called
> > > > + *                (and will not be called again)
> > > >   *         Negative if there's an error.
> > > >   *         @cookie holds the cookie of the first request by any user
> > > >   */    
> > > 
> > > Lots of babbling in the Changelog, but no real elucidation as to why you
> > > need this second return value.
> > > 
> > > AFAICT it serves no real purpose; the users of this function should not
> > > care. The only difference is that the unwind reference (your cookie)
> > > becomes a backward reference instead of a forward reference. But why
> > > would anybody care?  
> > 
> > Older versions of the code required it. I think I can remove it now.
> 
> Ah it is still used in the perf code:
> 
> perf_callchain() has:
> 
>         if (defer_user) {
>                 int ret = deferred_request(event);
>                 if (!ret)
>                         local_inc(&event->ctx->nr_no_switch_fast);
>                 else if (ret < 0)
>                         defer_user = false;
>         }
> 
> Where deferred_requests() is as static function that returns the result
> of the unwind request. If it is zero, it means the callback will be
> called, if it is greater than zero it means it has already been called,
> and negative is an error (and use the old method).
> 
> It looks like when the callback is called it expects nr_no_switch_fast
> to be incremented and it will decrement it. This is directly from
> Josh's patch and I don't know perf well enough to know if that update
> to nr_no_switch_fast is needed.
> 
> If it's not needed, we can just return 0 on success and negative on
> failure. What do you think?

I'm yet again confused. I don't see this code differentiate between the 1
and 2 return values (the ALREADY_PENDING and ALREADY_EXECUTED cases).

Anyway, fundamentally I don't think there is a problem with backward
references as opposed to the normal forward references.

So leave it out for now.
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Steven Rostedt 2 months, 3 weeks ago
On Tue, 15 Jul 2025 14:06:50 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > > > + * Return: 0 if the callback successfully was queued.
> > > > + *         UNWIND_ALREADY_PENDING if the the callback was already queued.
> > > > + *         UNWIND_ALREADY_EXECUTED if the callback was already called
> > > > + *                (and will not be called again)
> > > >   *         Negative if there's an error.
> > > >   *         @cookie holds the cookie of the first request by any user
> > > >   */      
> > > 
> > > Lots of babbling in the Changelog, but no real elucidation as to why you
> > > need this second return value.
> > > 
> > > AFAICT it serves no real purpose; the users of this function should not
> > > care. The only difference is that the unwind reference (your cookie)
> > > becomes a backward reference instead of a forward reference. But why
> > > would anybody care?    
> > 
> > Older versions of the code required it. I think I can remove it now.  
> 
> Ah it is still used in the perf code:
> 
> perf_callchain() has:
> 
>         if (defer_user) {
>                 int ret = deferred_request(event);
>                 if (!ret)
>                         local_inc(&event->ctx->nr_no_switch_fast);

Hmm, I guess this could work if it returned non-zero for both already
queued and already executed. So it doesn't need to be two different
values.

-- Steve


>                 else if (ret < 0)
>                         defer_user = false;
>         }
Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
Posted by Steven Rostedt 2 months, 3 weeks ago
On Tue, 15 Jul 2025 14:06:50 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Ah it is still used in the perf code:

Either way, what I'll do is remove this special return value from this
series, and add it back in the perf series if needed.

This is one of the problems that arises when you take a series with
lots of changes and try to break it apart. You will always miss
something that is only there because a change was made for another
series.

Working on each one to make sure this all works, it starts to blend together :-p

-- Steve