From: Josh Poimboeuf <jpoimboe@kernel.org>
If the faultable user space unwinder is available (the one that will be
used for sframes), let perf make use of it. Currently all user stack
traces are captured at the request site, which mostly happens in
interrupt or NMI context where user space is only accessible if it is
currently present in memory. Not only may the user stack have been
swapped out, but the use of sframes in particular will require faulting
in user pages, which is not possible from interrupt context. Instead,
add a framework that delays reading the user space stack until the task
returns to user space, where faulting in pages is possible. This is also
advantageous because the user space stack does not change while the task
is in the kernel, so it removes duplicate user stack entries when
profiling a long running system call.
A new perf context called PERF_CONTEXT_USER_DEFERRED is created. When a
deferred unwind is required, usually because an interrupt or NMI
triggered the request (but it can happen for any callchain), the kernel
callchain that is captured immediately is appended with
PERF_CONTEXT_USER_DEFERRED and a task_work (pending_unwind_work) is
queued on the task.
When the task exits to user space and the task_work handler runs, it
performs the user stack unwind and records the user stack trace in a new
record type called PERF_RECORD_CALLCHAIN_DEFERRED. The perf tool in user
space then needs to attach this stack trace to each of the task's
preceding kernel callchains that contain the PERF_CONTEXT_USER_DEFERRED
context.
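For illustration, a perf tool could stitch the two records roughly as
follows. This is only a sketch of the consumer side and is not part of
this patch; stitch_callchain() is a hypothetical helper and the record
parsing around it is elided:

#include <stdlib.h>
#include <linux/types.h>
#include <linux/perf_event.h>	/* PERF_CONTEXT_USER_DEFERRED (new in this patch) */

/*
 * Merge a sample callchain that ends in PERF_CONTEXT_USER_DEFERRED with
 * the ips[] of the later PERF_RECORD_CALLCHAIN_DEFERRED from the same
 * task.  The deferred ips[] already begin with PERF_CONTEXT_USER.
 */
static __u64 *stitch_callchain(const __u64 *sample_ips, __u64 sample_nr,
			       const __u64 *deferred_ips, __u64 deferred_nr,
			       __u64 *out_nr)
{
	__u64 *out = malloc((sample_nr + deferred_nr) * sizeof(*out));
	__u64 n = 0;

	if (!out)
		return NULL;

	for (__u64 i = 0; i < sample_nr; i++) {
		if (sample_ips[i] == PERF_CONTEXT_USER_DEFERRED) {
			/* Replace the marker with the deferred user trace. */
			for (__u64 j = 0; j < deferred_nr; j++)
				out[n++] = deferred_ips[j];
			continue;
		}
		out[n++] = sample_ips[i];
	}
	*out_nr = n;
	return out;
}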
As struct unwind_stacktrace stores its entries as "unsigned long" and
they are copied directly into struct perf_callchain_entry, whose "ip"
field is defined as u64, deferred callchains are currently only allowed
on 64-bit architectures. This could change in the future if there is
demand for it on 32-bit architectures.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/perf_event.h | 7 +-
include/uapi/linux/perf_event.h | 20 +++-
kernel/bpf/stackmap.c | 4 +-
kernel/events/callchain.c | 11 +-
kernel/events/core.c | 156 +++++++++++++++++++++++++-
tools/include/uapi/linux/perf_event.h | 20 +++-
6 files changed, 210 insertions(+), 8 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fd1d91017b99..1527afa952f7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -53,6 +53,7 @@
#include <linux/security.h>
#include <linux/static_call.h>
#include <linux/lockdep.h>
+#include <linux/unwind_deferred.h>
#include <asm/local.h>
@@ -880,6 +881,10 @@ struct perf_event {
struct callback_head pending_task;
unsigned int pending_work;
+ unsigned int pending_unwind_callback;
+ struct callback_head pending_unwind_work;
+ struct rcuwait pending_unwind_wait;
+
atomic_t event_limit;
/* address range filters */
@@ -1720,7 +1725,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark);
+ u32 max_stack, bool crosstask, bool add_mark, bool defer_user);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 78a362b80027..20b8f890113b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -463,7 +463,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ defer_callchain: 1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wake up every n events */
@@ -1239,6 +1240,22 @@ enum perf_event_type {
*/
PERF_RECORD_AUX_OUTPUT_HW_ID = 21,
+ /*
+ * This user callchain capture was deferred until shortly before
+ * returning to user space. Previous samples would have kernel
+ * callchains only and they need to be stitched with this to make full
+ * callchains.
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 cookie;
+ * u64 nr;
+ * u64 ips[nr];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_CALLCHAIN_DEFERRED = 22,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -1269,6 +1286,7 @@ enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
PERF_CONTEXT_USER = (__u64)-512,
+ PERF_CONTEXT_USER_DEFERRED = (__u64)-640,
PERF_CONTEXT_GUEST = (__u64)-2048,
PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index ec3a57a5fba1..339f7cbbcf36 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
max_depth = sysctl_perf_event_max_stack;
trace = get_perf_callchain(regs, kernel, user, max_depth,
- false, false);
+ false, false, false);
if (unlikely(!trace))
/* couldn't fetch the stack trace */
@@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
trace = get_callchain_entry_for_task(task, max_depth);
else
trace = get_perf_callchain(regs, kernel, user, max_depth,
- crosstask, false);
+ crosstask, false, false);
if (unlikely(!trace) || trace->nr < skip) {
if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 808c0d7a31fa..d0e0da66a164 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -218,7 +218,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark)
+ u32 max_stack, bool crosstask, bool add_mark, bool defer_user)
{
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
@@ -251,6 +251,15 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
regs = task_pt_regs(current);
}
+ if (defer_user) {
+ /*
+ * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
+ * which can be stitched to this one.
+ */
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+ goto exit_put;
+ }
+
if (add_mark)
perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 28de3baff792..37e684edbc8a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5582,6 +5582,95 @@ static bool exclusive_event_installable(struct perf_event *event,
return true;
}
+static void perf_pending_unwind_sync(struct perf_event *event)
+{
+ might_sleep();
+
+ if (!event->pending_unwind_callback)
+ return;
+
+ /*
+ * If the task is queued to the current task's queue, we
+ * obviously can't wait for it to complete. Simply cancel it.
+ */
+ if (task_work_cancel(current, &event->pending_unwind_work)) {
+ event->pending_unwind_callback = 0;
+ local_dec(&event->ctx->nr_no_switch_fast);
+ return;
+ }
+
+ /*
+ * All accesses related to the event are within the same RCU section in
+ * perf_event_callchain_deferred(). The RCU grace period before the
+ * event is freed will make sure all those accesses are complete by then.
+ */
+ rcuwait_wait_event(&event->pending_unwind_wait, !event->pending_unwind_callback, TASK_UNINTERRUPTIBLE);
+}
+
+struct perf_callchain_deferred_event {
+ struct perf_event_header header;
+ u64 cookie;
+ u64 nr;
+ u64 ips[];
+};
+
+static void perf_event_callchain_deferred(struct callback_head *work)
+{
+ struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
+ struct perf_callchain_deferred_event deferred_event;
+ u64 callchain_context = PERF_CONTEXT_USER;
+ struct unwind_stacktrace trace;
+ struct perf_output_handle handle;
+ struct perf_sample_data data;
+ u64 nr;
+
+ if (!event->pending_unwind_callback)
+ return;
+
+ if (unwind_user_faultable(&trace) < 0)
+ goto out;
+
+ /*
+ * All accesses to the event must belong to the same implicit RCU
+ * read-side critical section as the ->pending_unwind_callback reset.
+ * See comment in perf_pending_unwind_sync().
+ */
+ guard(rcu)();
+
+ if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
+ goto out;
+
+ nr = trace.nr + 1 ; /* '+1' == callchain_context */
+
+ deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
+ deferred_event.header.misc = PERF_RECORD_MISC_USER;
+ deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
+
+ deferred_event.nr = nr;
+ deferred_event.cookie = unwind_user_get_cookie();
+
+ perf_event_header__init_id(&deferred_event.header, &data, event);
+
+ if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
+ goto out;
+
+ perf_output_put(&handle, deferred_event);
+ perf_output_put(&handle, callchain_context);
+ /* trace.entries[] are not guaranteed to be 64bit */
+ for (int i = 0; i < trace.nr; i++) {
+ u64 entry = trace.entries[i];
+ perf_output_put(&handle, entry);
+ }
+ perf_event__output_id_sample(event, &handle, &data);
+
+ perf_output_end(&handle);
+
+out:
+ event->pending_unwind_callback = 0;
+ local_dec(&event->ctx->nr_no_switch_fast);
+ rcuwait_wake_up(&event->pending_unwind_wait);
+}
+
static void perf_free_addr_filters(struct perf_event *event);
/* vs perf_event_alloc() error */
@@ -5649,6 +5738,7 @@ static void _free_event(struct perf_event *event)
{
irq_work_sync(&event->pending_irq);
irq_work_sync(&event->pending_disable_irq);
+ perf_pending_unwind_sync(event);
unaccount_event(event);
@@ -8194,6 +8284,46 @@ static u64 perf_get_page_size(unsigned long addr)
static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
+/*
+ * Returns:
+* > 0 : if already queued.
+ * 0 : if it performed the queuing
+ * < 0 : if it did not get queued.
+ */
+static int deferred_request(struct perf_event *event)
+{
+ struct callback_head *work = &event->pending_unwind_work;
+ int pending;
+ int ret;
+
+ /* Only defer for task events */
+ if (!event->ctx->task)
+ return -EINVAL;
+
+ if ((current->flags & (PF_KTHREAD | PF_USER_WORKER)) ||
+ !user_mode(task_pt_regs(current)))
+ return -EINVAL;
+
+ guard(irqsave)();
+
+ /* callback already pending? */
+ pending = READ_ONCE(event->pending_unwind_callback);
+ if (pending)
+ return 1;
+
+ /* Claim the work unless an NMI just now swooped in to do so. */
+ if (!try_cmpxchg(&event->pending_unwind_callback, &pending, 1))
+ return 1;
+
+ /* The work has been claimed, now schedule it. */
+ ret = task_work_add(current, work, TWA_RESUME);
+ if (WARN_ON_ONCE(ret)) {
+ WRITE_ONCE(event->pending_unwind_callback, 0);
+ return ret;
+ }
+ return 0;
+}
+
struct perf_callchain_entry *
perf_callchain(struct perf_event *event, struct pt_regs *regs)
{
@@ -8204,6 +8334,9 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
bool crosstask = event->ctx->task && event->ctx->task != current;
const u32 max_stack = event->attr.sample_max_stack;
struct perf_callchain_entry *callchain;
+ /* perf currently only supports deferred in 64bit */
+ bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) && user &&
+ event->attr.defer_callchain;
if (!current->mm)
user = false;
@@ -8211,8 +8344,21 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (!kernel && !user)
return &__empty_callchain;
- callchain = get_perf_callchain(regs, kernel, user,
- max_stack, crosstask, true);
+ /* Disallow cross-task callchains. */
+ if (event->ctx->task && event->ctx->task != current)
+ return &__empty_callchain;
+
+ if (defer_user) {
+ int ret = deferred_request(event);
+ if (!ret)
+ local_inc(&event->ctx->nr_no_switch_fast);
+ else if (ret < 0)
+ defer_user = false;
+ }
+
+ callchain = get_perf_callchain(regs, kernel, user, max_stack,
+ crosstask, true, defer_user);
+
return callchain ?: &__empty_callchain;
}
@@ -12882,6 +13028,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
init_task_work(&event->pending_task, perf_pending_task);
+ rcuwait_init(&event->pending_unwind_wait);
+
mutex_init(&event->mmap_mutex);
raw_spin_lock_init(&event->addr_filters.lock);
@@ -13050,6 +13198,10 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (err)
return ERR_PTR(err);
+ if (event->attr.defer_callchain)
+ init_task_work(&event->pending_unwind_work,
+ perf_event_callchain_deferred);
+
/* symmetric to unaccount_event() in _free_event() */
account_event(event);
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 78a362b80027..20b8f890113b 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -463,7 +463,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ defer_callchain: 1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wake up every n events */
@@ -1239,6 +1240,22 @@ enum perf_event_type {
*/
PERF_RECORD_AUX_OUTPUT_HW_ID = 21,
+ /*
+ * This user callchain capture was deferred until shortly before
+ * returning to user space. Previous samples would have kernel
+ * callchains only and they need to be stitched with this to make full
+ * callchains.
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 cookie;
+ * u64 nr;
+ * u64 ips[nr];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_CALLCHAIN_DEFERRED = 22,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -1269,6 +1286,7 @@ enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
PERF_CONTEXT_USER = (__u64)-512,
+ PERF_CONTEXT_USER_DEFERRED = (__u64)-640,
PERF_CONTEXT_GUEST = (__u64)-2048,
PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
--
2.50.1
On Mon, Sep 08, 2025 at 01:14:14PM -0400, Steven Rostedt wrote:
> +struct perf_callchain_deferred_event {
> + struct perf_event_header header;
> + u64 cookie;
> + u64 nr;
> + u64 ips[];
> +};
> +
> +static void perf_event_callchain_deferred(struct callback_head *work)
> +{
> + struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
> + struct perf_callchain_deferred_event deferred_event;
> + u64 callchain_context = PERF_CONTEXT_USER;
> + struct unwind_stacktrace trace;
> + struct perf_output_handle handle;
> + struct perf_sample_data data;
> + u64 nr;
> +
> + if (!event->pending_unwind_callback)
> + return;
> +
> + if (unwind_user_faultable(&trace) < 0)
> + goto out;
> +
> + /*
> + * All accesses to the event must belong to the same implicit RCU
> + * read-side critical section as the ->pending_unwind_callback reset.
> + * See comment in perf_pending_unwind_sync().
> + */
> + guard(rcu)();
> +
> + if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
> + goto out;
> +
> + nr = trace.nr + 1 ; /* '+1' == callchain_context */
> +
> + deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
> + deferred_event.header.misc = PERF_RECORD_MISC_USER;
> + deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
> +
> + deferred_event.nr = nr;
> + deferred_event.cookie = unwind_user_get_cookie();
> +
> + perf_event_header__init_id(&deferred_event.header, &data, event);
> +
> + if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
> + goto out;
> +
> + perf_output_put(&handle, deferred_event);
> + perf_output_put(&handle, callchain_context);
> + /* trace.entries[] are not guaranteed to be 64bit */
> + for (int i = 0; i < trace.nr; i++) {
> + u64 entry = trace.entries[i];
> + perf_output_put(&handle, entry);
> + }
> + perf_event__output_id_sample(event, &handle, &data);
> +
> + perf_output_end(&handle);
> +
> +out:
> + event->pending_unwind_callback = 0;
> + local_dec(&event->ctx->nr_no_switch_fast);
> + rcuwait_wake_up(&event->pending_unwind_wait);
> +}
> +
> +/*
> + * Returns:
> +* > 0 : if already queued.
> + * 0 : if it performed the queuing
> + * < 0 : if it did not get queued.
> + */
> +static int deferred_request(struct perf_event *event)
> +{
> + struct callback_head *work = &event->pending_unwind_work;
> + int pending;
> + int ret;
> +
> + /* Only defer for task events */
> + if (!event->ctx->task)
> + return -EINVAL;
> +
> + if ((current->flags & (PF_KTHREAD | PF_USER_WORKER)) ||
> + !user_mode(task_pt_regs(current)))
> + return -EINVAL;
> +
> + guard(irqsave)();
> +
> + /* callback already pending? */
> + pending = READ_ONCE(event->pending_unwind_callback);
> + if (pending)
> + return 1;
> +
> + /* Claim the work unless an NMI just now swooped in to do so. */
> + if (!try_cmpxchg(&event->pending_unwind_callback, &pending, 1))
> + return 1;
> +
> + /* The work has been claimed, now schedule it. */
> + ret = task_work_add(current, work, TWA_RESUME);
> + if (WARN_ON_ONCE(ret)) {
> + WRITE_ONCE(event->pending_unwind_callback, 0);
> + return ret;
> + }
> + return 0;
> +}
So the thing that stands out is that you're not actually using the
unwind infrastructure you've previously created. Things like: struct
unwind_work, unwind_deferred_{init,request,cancel}() all go unused, and
instead you seem to have built a parallel set, with similar bugs to the
ones I just had to fix in the unwind_deferred things :/
I'm also not much of a fan of nr_no_switch_fast, and the fact that this
patch is limited to per-task events, and you're then adding another 300+
lines of code to support per-cpu events later on.
Fundamentally we only have one stack-trace per task at any one point. We
can have many events per task and many more per-cpu. Let us stick a
struct unwind_work in task_struct and have the perf callback function
use perf_iterate_sb() to find all events that want delivery or so (or we
can add another per perf_event_context list for this purpose).
But duplicating all this seems 'unfortunate'.
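Roughly the shape being suggested, as a sketch only (it assumes the
unwind_deferred_{init,request}() API from earlier in the series;
perf_unwind_callback() and the event iteration details are
placeholders, not working code):

#include <linux/init.h>
#include <linux/perf_event.h>
#include <linux/unwind_deferred.h>

/* Sketch only: one unwind_work shared by perf, not one per event. */
static struct unwind_work perf_unwind_work;

static void perf_unwind_callback(struct unwind_work *work,
				 struct unwind_stacktrace *trace, u64 cookie)
{
	/*
	 * One stack trace per task: iterate the events that asked for a
	 * deferred callchain for 'current' (perf_iterate_sb() or a
	 * dedicated per perf_event_context list) and emit one
	 * PERF_RECORD_CALLCHAIN_DEFERRED per event, tagged with 'cookie'.
	 */
}

static int perf_defer_request(struct perf_event *event)
{
	u64 cookie;

	if (!event->attr.defer_callchain)
		return -EINVAL;

	/* Duplicate requests for the same task are coalesced by the core. */
	return unwind_deferred_request(&perf_unwind_work, &cookie);
}

static int __init perf_unwind_init(void)
{
	return unwind_deferred_init(&perf_unwind_work, perf_unwind_callback);
}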
On Tue, 23 Sep 2025 12:32:13 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> I'm also not much of a fan of nr_no_switch_fast, and the fact that this
> patch is limited to per-task events, and you're then adding another 300+
> lines of code to support per-cpu events later on.
BTW, I'm not exactly sure what "nr_no_switch_fast" is for. Josh had it
in his patches and I kept it.
I'm almost done with my next version, which moves a lot of the "follow
task" work into the deferred unwind infrastructure and drastically
simplifies this patch.
But I still have this "nr_no_switch_fast" increment when a request is
successfully made and decremented when the stacktrace is executed. In the
task switch perf code there's:
/* PMIs are disabled; ctx->nr_no_switch_fast is stable. */
if (local_read(&ctx->nr_no_switch_fast) ||
local_read(&next_ctx->nr_no_switch_fast)) {
/*
* Must not swap out ctx when there's pending
* events that rely on the ctx->task relation.
*
* Likewise, when a context contains inherit +
* SAMPLE_READ events they should be switched
* out using the slow path so that they are
* treated as if they were distinct contexts.
*/
raw_spin_unlock(&next_ctx->lock);
rcu_read_unlock();
goto inside_switch;
}
Is this mostly to do with PMU counters? Here the only relation between
the task and the event is that the task is going to have a deferred
stack trace.
Can I safely drop this counter?
-- Steve
On Tue, 23 Sep 2025 12:32:13 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> So the thing that stands out is that you're not actually using the
> unwind infrastructure you've previously created. Things like: struct
> unwind_work, unwind_deferred_{init,request,cancel}() all go unused, and
> instead you seem to have built a parallel set, with similar bugs to the
> ones I just had to fix in the unwind_deferred things :/
>
> I'm also not much of a fan of nr_no_switch_fast, and the fact that this
> patch is limited to per-task events, and you're then adding another 300+
> lines of code to support per-cpu events later on.
>
> Fundamentally we only have one stack-trace per task at any one point. We
> can have many events per task and many more per-cpu. Let us stick a
> struct unwind_work in task_struct and have the perf callback function
> use perf_iterate_sb() to find all events that want delivery or so (or we
> can add another per perf_event_context list for this purpose).
>
> But duplicating all this seems 'unfortunate'.
We could remove this and have perf only use the CPU version. That may
be better in the long run anyway, as it gets rid of the duplication. In
fact, that was the original plan, but since Josh wrote this patch
thinking it was all that perf needed (which ended up not being the case),
I kept it in. But I believe this will work just the same as the
CPU tracing, which uses all the other infrastructure.
-- Steve
On Mon, Sep 08, 2025 at 01:14:14PM -0400, Steven Rostedt wrote:
> + /*
> + * All accesses related to the event are within the same RCU section in
> + * perf_event_callchain_deferred(). The RCU grace period before the
> + * event is freed will make sure all those accesses are complete by then.
> + */
> + rcuwait_wait_event(&event->pending_unwind_wait, !event->pending_unwind_callback, TASK_UNINTERRUPTIBLE);
You need a narrower terminal; this is again excessive. I mostly code
with my screen split in 4 columns (3 if I can't find my glasses), and
that gets me around 90 character columns.
> + if (event->attr.defer_callchain)
> + init_task_work(&event->pending_unwind_work,
> + perf_event_callchain_deferred);
And let me hand you a bucket of {}, I've heard they're getting expensive
over in the US due to this tariff nonsense :-)
On Mon, Sep 08, 2025 at 01:14:14PM -0400, Steven Rostedt wrote:
> +static void perf_event_callchain_deferred(struct callback_head *work)
> +{
> + struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
> + struct perf_callchain_deferred_event deferred_event;
> + u64 callchain_context = PERF_CONTEXT_USER;
> + struct unwind_stacktrace trace;
> + struct perf_output_handle handle;
> + struct perf_sample_data data;
> + u64 nr;
> +
> + if (!event->pending_unwind_callback)
> + return;
> +
> + if (unwind_user_faultable(&trace) < 0)
> + goto out;
This is broken. Because:
> +
> + /*
> + * All accesses to the event must belong to the same implicit RCU
> + * read-side critical section as the ->pending_unwind_callback reset.
> + * See comment in perf_pending_unwind_sync().
> + */
> + guard(rcu)();
Here you start a guard, that lasts until close of function..
> +
> + if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
> + goto out;
> +
> + nr = trace.nr + 1 ; /* '+1' == callchain_context */
> +
> + deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
> + deferred_event.header.misc = PERF_RECORD_MISC_USER;
> + deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
> +
> + deferred_event.nr = nr;
> + deferred_event.cookie = unwind_user_get_cookie();
> +
> + perf_event_header__init_id(&deferred_event.header, &data, event);
> +
> + if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
> + goto out;
> +
> + perf_output_put(&handle, deferred_event);
> + perf_output_put(&handle, callchain_context);
> + /* trace.entries[] are not guaranteed to be 64bit */
> + for (int i = 0; i < trace.nr; i++) {
> + u64 entry = trace.entries[i];
> + perf_output_put(&handle, entry);
> + }
> + perf_event__output_id_sample(event, &handle, &data);
> +
> + perf_output_end(&handle);
> +
> +out:
Which very much includes here, so your goto jumps into a scope, which is
not permitted.
GCC can fail to warn on this, but clang will consistently fail to
compile this. Surely the robot would've told you by now -- even if
you're not using clang yourself.
> + event->pending_unwind_callback = 0;
> + local_dec(&event->ctx->nr_no_switch_fast);
> + rcuwait_wake_up(&event->pending_unwind_wait);
> +}
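A minimal user space reproduction of the same class of error, using the
cleanup attribute that guard() is built on (not the kernel code itself):

#include <stdio.h>

static void unlock(int *x)
{
	printf("cleanup %d\n", *x);
}

int bad(int err)
{
	if (err)
		goto out;	/* jumps into the scope of 'lock' */

	__attribute__((cleanup(unlock))) int lock = 1;
out:
	/*
	 * clang rejects the jump above as entering a protected scope;
	 * gcc may accept it without a warning.
	 */
	return err;
}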
On Tue, 23 Sep 2025 11:19:35 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 08, 2025 at 01:14:14PM -0400, Steven Rostedt wrote:
>
> > +static void perf_event_callchain_deferred(struct callback_head *work)
> > +{
> > + struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
> > + struct perf_callchain_deferred_event deferred_event;
> > + u64 callchain_context = PERF_CONTEXT_USER;
> > + struct unwind_stacktrace trace;
> > + struct perf_output_handle handle;
> > + struct perf_sample_data data;
> > + u64 nr;
> > +
> > + if (!event->pending_unwind_callback)
> > + return;
> > +
> > + if (unwind_user_faultable(&trace) < 0)
> > + goto out;
>
> This is broken. Because:
>
> > +
> > + /*
> > + * All accesses to the event must belong to the same implicit RCU
> > + * read-side critical section as the ->pending_unwind_callback reset.
> > + * See comment in perf_pending_unwind_sync().
> > + */
> > + guard(rcu)();
>
> Here you start a guard, that lasts until close of function..
>
> > +
> > + if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
> > + goto out;
> > +
> > + nr = trace.nr + 1 ; /* '+1' == callchain_context */
> > +
> > + deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
> > + deferred_event.header.misc = PERF_RECORD_MISC_USER;
> > + deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
> > +
> > + deferred_event.nr = nr;
> > + deferred_event.cookie = unwind_user_get_cookie();
> > +
> > + perf_event_header__init_id(&deferred_event.header, &data, event);
> > +
> > + if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
> > + goto out;
> > +
> > + perf_output_put(&handle, deferred_event);
> > + perf_output_put(&handle, callchain_context);
> > + /* trace.entries[] are not guaranteed to be 64bit */
> > + for (int i = 0; i < trace.nr; i++) {
> > + u64 entry = trace.entries[i];
> > + perf_output_put(&handle, entry);
> > + }
> > + perf_event__output_id_sample(event, &handle, &data);
> > +
> > + perf_output_end(&handle);
> > +
> > +out:
>
> Which very much includes here, so your goto jumps into a scope, which is
> not permitted.
Nice catch.
>
> GCC can fail to warn on this, but clang will consistently fail to
> compile this. Surely the robot would've told you by now -- even if
> you're not using clang yourself.
Unfortunately it hasn't :-(
I need to start building with clang more often.
I even pushed this to a git tree. Not sure why it didn't get flagged.
-- Steve
On Tue, Sep 23, 2025 at 05:35:15AM -0400, Steven Rostedt wrote:

> I even pushed this to a git tree. Not sure why it didn't get flagged.

I've been looking at this... how do I enable CONFIG_UNWIND_USER ?

I suspect the problem is that it's impossible to actually compile/use
this code.
On Tue, 23 Sep 2025 11:38:21 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Sep 23, 2025 at 05:35:15AM -0400, Steven Rostedt wrote:
>
> > I even pushed this to a git tree. Not sure why it didn't get flagged.
>
> I've been looking at this... how do I enable CONFIG_UNWIND_USER ?

Hmm, maybe that's why it wasn't flagged.

> I suspect the problem is that it's impossible to actually compile/use
> this code.

It needs an arch to enable it. Here's the x86 patches that I was hoping
would get in too:

https://lore.kernel.org/linux-trace-kernel/20250827193644.527334838@kernel.org/

Hmm, but I had a branch that applied all the necessary patches :-/

-- Steve
On Tue, Sep 23, 2025 at 06:28:48AM -0400, Steven Rostedt wrote:
> On Tue, 23 Sep 2025 11:38:21 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Tue, Sep 23, 2025 at 05:35:15AM -0400, Steven Rostedt wrote:
> >
> > > I even pushed this to a git tree. Not sure why it didn't get flagged.
> >
> > I've been looking at this... how do I enable CONFIG_UNWIND_USER ?
>
> Hmm, maybe that's why it wasn't flagged.
>
> > I suspect the problem is that it's impossible to actually compile/use
> > this code.
>
> It needs an arch to enable it. Here's the x86 patches that I was hoping
> would get in too:
>
> https://lore.kernel.org/linux-trace-kernel/20250827193644.527334838@kernel.org/

Hmm, let me go find that.