[PATCH v16 0/4] perf: Support the deferred unwinding infrastructure

Posted by Steven Rostedt 4 months ago
This is based on top of tip/perf/core commit: 6d48436560e91be85

Then I added the patches from Peter Zijlstra:

    https://lore.kernel.org/all/20250924075948.579302904@infradead.org/

This series implements the perf interface to use deferred user space stack
tracing.

The patches for the user space side should still work with this series:

  https://lore.kernel.org/linux-trace-kernel/20250908175319.841517121@kernel.org

Patch 1 updates the deferred unwinding infrastructure. It adds a new
function, unwind_deferred_task_init(), which is used when a tracer (perf)
only needs to follow a single task. The descriptor it returns can be used
the same way as the one returned by unwind_deferred_init(), but the tracer
must only use it on one task at a time.
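
For illustration, a minimal sketch of how a tracer might use it, assuming the
interface mirrors unwind_deferred_init() (the callback signature and trace
fields are from the patches below; record_ip() is a hypothetical consumer and
error handling is omitted):

	static void my_unwind_cb(struct unwind_work *work,
				 struct unwind_stacktrace *trace, u64 cookie)
	{
		/* Runs in task context, shortly before returning to user space */
		for (unsigned int i = 0; i < trace->nr; i++)
			record_ip(trace->entries[i], cookie);
	}

	static struct unwind_work my_work;

	/* At attach time, for a tracer following a single task: */
	unwind_deferred_task_init(&my_work, my_unwind_cb);

	/* From the event handler (e.g. in NMI context), request the unwind: */
	u64 cookie;
	unwind_deferred_request(&my_work, &cookie);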

Patch 2 adds per-task deferred stack traces to perf. It adds a new event
type, PERF_RECORD_CALLCHAIN_DEFERRED, that is recorded when a task is about
to return to user space, at a point where pages may be faulted in. It also
adds a new callchain context, PERF_CONTEXT_USER_DEFERRED, that is used as a
placeholder in a kernel callchain to mark where the deferred user space
stack trace should be appended.

Patch 3 adds the user stack trace context cookie to the kernel callchain,
right after the PERF_CONTEXT_USER_DEFERRED context, so that the user space
side can map the request to the deferred user space stack trace.
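
Conceptually, the sampled kernel callchain for a deferred request then ends
like this (a sketch of the layout described above):

	PERF_CONTEXT_KERNEL
	<kernel return addresses ...>
	PERF_CONTEXT_USER_DEFERRED	/* placeholder for the user stack */
	<cookie>			/* matches a later PERF_RECORD_CALLCHAIN_DEFERRED */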

Patch 4 adds support for per-CPU perf events, allowing the kernel to
associate each per-CPU perf event buffer with a single application. This is
needed so that when a deferred stack trace is requested for a task that then
migrates to another CPU, the kernel knows which CPU buffer to record the
stack trace in. More than one perf tool may be running at a time, and a
request made by one perf tool should have its deferred trace go to that same
tool's per-CPU event buffer. A global list of descriptors, one per perf tool
using deferred stack tracing, is created to manage this.
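
Purely as an illustration of that bookkeeping (the struct and field names
below are hypothetical, not taken from the patch):

	/* Hypothetical: one descriptor per perf tool using deferred callchains */
	struct perf_unwind_deferred {
		struct list_head	list;		/* on the global descriptor list */
		struct unwind_work	unwind_work;	/* deferred unwind registration */
		/* ... plus the tool's per-CPU events, to pick the output buffer */
	};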

Changes since v15: https://lore.kernel.org/linux-trace-kernel/20250825180638.877627656@kernel.org/

- The main update was that I moved the code to do single-task deferred
  stack tracing into the unwind code. That allowed reusing the code for
  tracing all tasks, and simplified the perf code in doing so.

  The first patch updates the unwind deferred code to have this
  infrastructure. It only added a new function:
    unwind_deferred_task_init()
  This is the same as unwind_deferred_init(), but it is used when the
  tracer will only trace a single task. The descriptor returned has its
  own task_work callback, and it allows any number of callers, not the
  limited set that the "all task" deferred unwinding has.

- The new code also removed the need to expose the generation of the
  cookie.

Josh Poimboeuf (1):
      perf: Support deferred user callchains

Steven Rostedt (3):
      unwind: Add interface to allow tracing a single task
      perf: Have the deferred request record the user context cookie
      perf: Support deferred user callchains for per CPU events

----
 include/linux/perf_event.h            |   9 +-
 include/linux/unwind_deferred.h       |  15 ++
 include/uapi/linux/perf_event.h       |  25 ++-
 kernel/bpf/stackmap.c                 |   4 +-
 kernel/events/callchain.c             |  14 +-
 kernel/events/core.c                  | 362 +++++++++++++++++++++++++++++++++-
 kernel/unwind/deferred.c              | 283 ++++++++++++++++++++++----
 tools/include/uapi/linux/perf_event.h |  25 ++-
 8 files changed, 686 insertions(+), 51 deletions(-)
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Tue, Oct 07, 2025 at 05:40:08PM -0400, Steven Rostedt wrote:

>  include/linux/perf_event.h            |   9 +-
>  include/linux/unwind_deferred.h       |  15 ++
>  include/uapi/linux/perf_event.h       |  25 ++-
>  kernel/bpf/stackmap.c                 |   4 +-
>  kernel/events/callchain.c             |  14 +-
>  kernel/events/core.c                  | 362 +++++++++++++++++++++++++++++++++-
>  kernel/unwind/deferred.c              | 283 ++++++++++++++++++++++----
>  tools/include/uapi/linux/perf_event.h |  25 ++-
>  8 files changed, 686 insertions(+), 51 deletions(-)

After staring at this some, I mostly threw it all out and wrote the
below.

I also have some hackery on the userspace patches to go along with this,
and it all sits in my unwind/cleanup branch.

Trouble is, pretty much every unwind is 510 entries long -- this cannot
be right. I'm sure there's a silly mistake in unwind/user.c but I'm too
tired to find it just now. I'll try again tomorrow.
  
---
 include/linux/perf_event.h            |    2 
 include/linux/unwind_deferred.h       |   12 -----
 include/linux/unwind_deferred_types.h |   13 +++++
 include/uapi/linux/perf_event.h       |   21 ++++++++-
 kernel/bpf/stackmap.c                 |    4 -
 kernel/events/callchain.c             |   14 +++++-
 kernel/events/core.c                  |   79 +++++++++++++++++++++++++++++++++-
 tools/include/uapi/linux/perf_event.h |   21 ++++++++-
 8 files changed, 146 insertions(+), 20 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1720,7 +1720,7 @@ extern void perf_callchain_user(struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark);
+		   u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -6,18 +6,6 @@
 #include <linux/unwind_user.h>
 #include <linux/unwind_deferred_types.h>
 
-struct unwind_work;
-
-typedef void (*unwind_callback_t)(struct unwind_work *work,
-				  struct unwind_stacktrace *trace,
-				  u64 cookie);
-
-struct unwind_work {
-	struct list_head		list;
-	unwind_callback_t		func;
-	int				bit;
-};
-
 #ifdef CONFIG_UNWIND_USER
 
 enum {
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -39,4 +39,17 @@ struct unwind_task_info {
 	union unwind_task_id	id;
 };
 
+struct unwind_work;
+struct unwind_stacktrace;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work,
+				  struct unwind_stacktrace *trace,
+				  u64 cookie);
+
+struct unwind_work {
+	struct list_head		list;
+	unwind_callback_t		func;
+	int				bit;
+};
+
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -463,7 +463,9 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				defer_callchain:  1, /* request PERF_RECORD_CALLCHAIN_DEFERRED records */
+				defer_output   :  1, /* output PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1   : 24;
 
 	union {
 		__u32		wakeup_events;	  /* wake up every n events */
@@ -1239,6 +1241,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space.  Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				cookie;
+	 *	u64				nr;
+	 *	u64				ips[nr];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED		= 22,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1269,6 +1287,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV				= (__u64)-32,
 	PERF_CONTEXT_KERNEL			= (__u64)-128,
 	PERF_CONTEXT_USER			= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED		= (__u64)-640,
 
 	PERF_CONTEXT_GUEST			= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL		= (__u64)-2176,
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_re
 		max_depth = sysctl_perf_event_max_stack;
 
 	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false);
+				   false, false, 0);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
@@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_re
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false);
+					   crosstask, false, 0);
 
 	if (unlikely(!trace) || trace->nr < skip) {
 		if (may_fault)
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -218,7 +218,7 @@ static void fixup_uretprobe_trampoline_e
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark)
+		   u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -251,6 +251,18 @@ get_perf_callchain(struct pt_regs *regs,
 			regs = task_pt_regs(current);
 		}
 
+		if (defer_cookie) {
+			/*
+			 * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
+			 * which can be stitched to this one, and add
+			 * the cookie after it (it will be cut off when the
+			 * user stack is copied to the callchain).
+			 */
+			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+			perf_callchain_store_context(&ctx, defer_cookie);
+			goto exit_put;
+		}
+
 		if (add_mark)
 			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -56,6 +56,7 @@
 #include <linux/buildid.h>
 #include <linux/task_work.h>
 #include <linux/percpu-rwsem.h>
+#include <linux/unwind_deferred.h>
 
 #include "internal.h"
 
@@ -8200,6 +8201,8 @@ static u64 perf_get_page_size(unsigned l
 
 static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
 
+static struct unwind_work perf_unwind_work;
+
 struct perf_callchain_entry *
 perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
@@ -8208,8 +8211,11 @@ perf_callchain(struct perf_event *event,
 		!(current->flags & (PF_KTHREAD | PF_USER_WORKER));
 	/* Disallow cross-task user callchains. */
 	bool crosstask = event->ctx->task && event->ctx->task != current;
+	bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) && user &&
+			  event->attr.defer_callchain;
 	const u32 max_stack = event->attr.sample_max_stack;
 	struct perf_callchain_entry *callchain;
+	u64 defer_cookie;
 
 	if (!current->mm)
 		user = false;
@@ -8217,8 +8223,13 @@ perf_callchain(struct perf_event *event,
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, kernel, user,
-				       max_stack, crosstask, true);
+	if (!(user && defer_user && !crosstask &&
+	      unwind_deferred_request(&perf_unwind_work, &defer_cookie) >= 0))
+		defer_cookie = 0;
+
+	callchain = get_perf_callchain(regs, kernel, user, max_stack,
+				       crosstask, true, defer_cookie);
+
 	return callchain ?: &__empty_callchain;
 }
 
@@ -10003,6 +10014,67 @@ void perf_event_bpf_event(struct bpf_pro
 	perf_iterate_sb(perf_event_bpf_output, &bpf_event, NULL);
 }
 
+struct perf_callchain_deferred_event {
+	struct unwind_stacktrace *trace;
+	struct {
+		struct perf_event_header	header;
+		u64				cookie;
+		u64				nr;
+		u64				ips[];
+	} event;
+};
+
+static void perf_callchain_deferred_output(struct perf_event *event, void *data)
+{
+	struct perf_callchain_deferred_event *deferred_event = data;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	int ret, size = deferred_event->event.header.size;
+
+	if (!event->attr.defer_output)
+		return;
+
+	/* XXX do we really need sample_id_all for this ??? */
+	perf_event_header__init_id(&deferred_event->event.header, &sample, event);
+
+	ret = perf_output_begin(&handle, &sample, event,
+				deferred_event->event.header.size);
+	if (ret)
+		goto out;
+
+	perf_output_put(&handle, deferred_event->event);
+	for (int i = 0; i < deferred_event->trace->nr; i++) {
+		u64 entry = deferred_event->trace->entries[i];
+		perf_output_put(&handle, entry);
+	}
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+out:
+	deferred_event->event.header.size = size;
+}
+
+/* Deferred unwinding callback for task specific events */
+static void perf_unwind_deferred_callback(struct unwind_work *work,
+					 struct unwind_stacktrace *trace, u64 cookie)
+{
+	struct perf_callchain_deferred_event deferred_event = {
+		.trace = trace,
+		.event = {
+			.header = {
+				.type = PERF_RECORD_CALLCHAIN_DEFERRED,
+				.misc = PERF_RECORD_MISC_USER,
+				.size = sizeof(deferred_event.event) +
+					(trace->nr * sizeof(u64)),
+			},
+			.cookie = cookie,
+			.nr = trace->nr,
+		},
+	};
+
+	perf_iterate_sb(perf_callchain_deferred_output, &deferred_event, NULL);
+}
+
 struct perf_text_poke_event {
 	const void		*old_bytes;
 	const void		*new_bytes;
@@ -14799,6 +14871,9 @@ void __init perf_event_init(void)
 
 	idr_init(&pmu_idr);
 
+	unwind_deferred_init(&perf_unwind_work,
+			     perf_unwind_deferred_callback);
+
 	perf_event_init_all_cpus();
 	init_srcu_struct(&pmus_srcu);
 	perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -463,7 +463,9 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				defer_callchain:  1, /* request PERF_RECORD_CALLCHAIN_DEFERRED records */
+				defer_output   :  1, /* output PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1   : 24;
 
 	union {
 		__u32		wakeup_events;	  /* wake up every n events */
@@ -1239,6 +1241,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space.  Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				cookie;
+	 *	u64				nr;
+	 *	u64				ips[nr];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED		= 22,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1269,6 +1287,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV				= (__u64)-32,
 	PERF_CONTEXT_KERNEL			= (__u64)-128,
 	PERF_CONTEXT_USER			= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED		= (__u64)-640,
 
 	PERF_CONTEXT_GUEST			= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL		= (__u64)-2176,
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Thu, Oct 23, 2025 at 05:00:02PM +0200, Peter Zijlstra wrote:

> Trouble is, pretty much every unwind is 510 entries long -- this cannot
> be right. I'm sure there's a silly mistake in unwind/user.c but I'm too
> tired to find it just now. I'll try again tomorrow.

PEBKAC
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Fri, Oct 24, 2025 at 11:29:26AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 23, 2025 at 05:00:02PM +0200, Peter Zijlstra wrote:
> 
> > Trouble is, pretty much every unwind is 510 entries long -- this cannot
> > be right. I'm sure there's a silly mistake in unwind/user.c but I'm too
> > tired to find it just now. I'll try again tomorrow.
> 
> PEBKAC

Anyway, while staring at this, I noted that the perf userspace unwind
code has a few bits that are missing from the new shiny thing.

How about something like so? This adds an optional arch specific unwinder
at the very highest priority (bit 0) and uses that to do a few extra
bits before disabling itself and falling back to whatever lower prio
unwinder to do the actual unwinding.

---
 arch/x86/events/core.c             |   40 ---------------------------
 arch/x86/include/asm/unwind_user.h |    4 ++
 arch/x86/include/asm/uprobes.h     |    9 ++++++
 arch/x86/kernel/unwind_user.c      |   53 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/uprobes.c          |   32 ++++++++++++++++++++++
 include/linux/unwind_user_types.h  |    5 ++-
 kernel/unwind/user.c               |    7 ++++
 7 files changed, 109 insertions(+), 41 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2845,46 +2845,6 @@ static unsigned long get_segment_base(un
 	return get_desc_base(desc);
 }
 
-#ifdef CONFIG_UPROBES
-/*
- * Heuristic-based check if uprobe is installed at the function entry.
- *
- * Under assumption of user code being compiled with frame pointers,
- * `push %rbp/%ebp` is a good indicator that we indeed are.
- *
- * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
- * If we get this wrong, captured stack trace might have one extra bogus
- * entry, but the rest of stack trace will still be meaningful.
- */
-static bool is_uprobe_at_func_entry(struct pt_regs *regs)
-{
-	struct arch_uprobe *auprobe;
-
-	if (!current->utask)
-		return false;
-
-	auprobe = current->utask->auprobe;
-	if (!auprobe)
-		return false;
-
-	/* push %rbp/%ebp */
-	if (auprobe->insn[0] == 0x55)
-		return true;
-
-	/* endbr64 (64-bit only) */
-	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
-		return true;
-
-	return false;
-}
-
-#else
-static bool is_uprobe_at_func_entry(struct pt_regs *regs)
-{
-	return false;
-}
-#endif /* CONFIG_UPROBES */
-
 #ifdef CONFIG_IA32_EMULATION
 
 #include <linux/compat.h>
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -8,4 +8,8 @@
 	.fp_off		= -2*(ws),			\
 	.use_fp		= true,
 
+#define HAVE_UNWIND_USER_ARCH 1
+
+extern int unwind_user_next_arch(struct unwind_user_state *state);
+
 #endif /* _ASM_X86_UNWIND_USER_H */
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -62,4 +62,13 @@ struct arch_uprobe_task {
 	unsigned int			saved_tf;
 };
 
+#ifdef CONFIG_UPROBES
+extern bool is_uprobe_at_func_entry(struct pt_regs *regs);
+#else
+static inline bool is_uprobe_at_func_entry(struct pt_regs *regs)
+{
+	return false;
+}
+#endif /* CONFIG_UPROBES */
+
 #endif	/* _ASM_UPROBES_H */
--- /dev/null
+++ b/arch/x86/kernel/unwind_user.c
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/unwind_user.h>
+#include <linux/uprobes.h>
+#include <linux/uaccess.h>
+#include <linux/sched/task_stack.h>
+#include <asm/processor.h>
+#include <asm/tlbflush.h>
+
+int unwind_user_next_arch(struct unwind_user_state *state)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	/* only once, on the first iteration */
+	state->available_types &= ~UNWIND_USER_TYPE_ARCH;
+
+	/* We don't know how to unwind VM86 stacks. */
+	if (regs->flags & X86_VM_MASK) {
+		state->done = true;
+		return 0;
+	}
+
+	/*
+	 * If we are called from uprobe handler, and we are indeed at the very
+	 * entry to user function (which is normally a `push %rbp` instruction,
+	 * under assumption of application being compiled with frame pointers),
+	 * we should read return address from *regs->sp before proceeding
+	 * to follow frame pointers, otherwise we'll skip immediate caller
+	 * as %rbp is not yet setup.
+	 */
+	if (!is_uprobe_at_func_entry(regs))
+		return -EINVAL;
+
+#ifdef CONFIG_COMPAT
+	if (state->ws == sizeof(int)) {
+		unsigned int retaddr;
+		int ret = get_user(retaddr, (unsigned int __user *)regs->sp);
+		if (ret)
+			return ret;
+
+		state->ip = retaddr;
+		return 0;
+	}
+#endif
+	unsigned long retaddr;
+	int ret = get_user(retaddr, (unsigned long __user *)regs->sp);
+	if (ret)
+		return ret;
+
+	state->ip = retaddr;
+	return 0;
+}
+
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1791,3 +1791,35 @@ bool arch_uretprobe_is_alive(struct retu
 	else
 		return regs->sp <= ret->stack;
 }
+
+/*
+ * Heuristic-based check if uprobe is installed at the function entry.
+ *
+ * Under assumption of user code being compiled with frame pointers,
+ * `push %rbp/%ebp` is a good indicator that we indeed are.
+ *
+ * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
+ * If we get this wrong, captured stack trace might have one extra bogus
+ * entry, but the rest of stack trace will still be meaningful.
+ */
+bool is_uprobe_at_func_entry(struct pt_regs *regs)
+{
+	struct arch_uprobe *auprobe;
+
+	if (!current->utask)
+		return false;
+
+	auprobe = current->utask->auprobe;
+	if (!auprobe)
+		return false;
+
+	/* push %rbp/%ebp */
+	if (auprobe->insn[0] == 0x55)
+		return true;
+
+	/* endbr64 (64-bit only) */
+	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
+		return true;
+
+	return false;
+}
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -3,13 +3,15 @@
 #define _LINUX_UNWIND_USER_TYPES_H
 
 #include <linux/types.h>
+#include <linux/bits.h>
 
 /*
  * Unwind types, listed in priority order: lower numbers are attempted first if
  * available.
  */
 enum unwind_user_type_bits {
-	UNWIND_USER_TYPE_FP_BIT =		0,
+	UNWIND_USER_TYPE_ARCH_BIT = 0,
+	UNWIND_USER_TYPE_FP_BIT,
 
 	NR_UNWIND_USER_TYPE_BITS,
 };
@@ -17,6 +19,7 @@ enum unwind_user_type_bits {
 enum unwind_user_type {
 	/* Type "none" for the start of stack walk iteration. */
 	UNWIND_USER_TYPE_NONE =			0,
+	UNWIND_USER_TYPE_ARCH =			BIT(UNWIND_USER_TYPE_ARCH_BIT),
 	UNWIND_USER_TYPE_FP =			BIT(UNWIND_USER_TYPE_FP_BIT),
 };
 
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -79,6 +79,10 @@ static int unwind_user_next(struct unwin
 
 		state->current_type = type;
 		switch (type) {
+		case UNWIND_USER_TYPE_ARCH:
+			if (!unwind_user_next_arch(state))
+				return 0;
+			continue;
 		case UNWIND_USER_TYPE_FP:
 			if (!unwind_user_next_fp(state))
 				return 0;
@@ -107,6 +111,9 @@ static int unwind_user_start(struct unwi
 		return -EINVAL;
 	}
 
+	if (HAVE_UNWIND_USER_ARCH)
+		state->available_types |= UNWIND_USER_TYPE_ARCH;
+
 	if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
 		state->available_types |= UNWIND_USER_TYPE_FP;
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Jens Remus 3 months, 2 weeks ago
Hello Peter!

On 10/24/2025 12:41 PM, Peter Zijlstra wrote:
> On Fri, Oct 24, 2025 at 11:29:26AM +0200, Peter Zijlstra wrote:
>> On Thu, Oct 23, 2025 at 05:00:02PM +0200, Peter Zijlstra wrote:
>>
>>> Trouble is, pretty much every unwind is 510 entries long -- this cannot
>>> be right. I'm sure there's a silly mistake in unwind/user.c but I'm too
>>> tired to find it just now. I'll try again tomorrow.
>>
>> PEBKAC
> 
> Anyway, while staring at this, I noted that the perf userspace unwind
> code has a few bits that are missing from the new shiny thing.
> 
> How about something like so? This adds an optional arch specific unwinder
> at the very highest priority (bit 0) and uses that to do a few extra
> bits before disabling itself and falling back to whatever lower prio
> unwinder to do the actual unwinding.

unwind user sframe does not need any of this special handling, because
it knows for each IP whether the SP or FP is the CFA base register
and whether the FP and RA have been saved.

Isn't this actually specific to unwind user fp?  If the IP is at
function entry, then the FP has not been set up yet.  I think unwind user
fp could handle this using an arch specific is_uprobe_at_func_entry() to
determine whether to use a new frame_fp_entry instead of frame_fp.  For
x86 the following frame_fp_entry should work, if I am not wrong:

#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws)	\
	.cfa_off	=  1*(ws),		\
	.ra_off		= -1*(ws),		\
	.fp_off		= 0,			\
	.use_fp		= false,
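
For 64-bit (ws = 8) that encodes: CFA = SP + 8 (the caller's SP at the call
site), RA = *(CFA - 8) = *SP (the just-pushed return address), and FP left
untouched, since %rbp has not been saved yet:

	/*
	 * x86-64 stack at function entry, before `push %rbp` (ws = 8):
	 *
	 *	SP -> [ return address ]	RA = *SP = *(CFA - 8)
	 *	CFA = SP + 8			caller's SP at the call site
	 *	FP unchanged			fp_off = 0, use_fp = false
	 */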

Following roughly outlines the required changes:

diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c

-static int unwind_user_next_fp(struct unwind_user_state *state)
+static int unwind_user_next_common(struct unwind_user_state *state,
+                                  const struct unwind_user_frame *frame,
+                                  struct pt_regs *regs)

@@ -71,6 +83,7 @@ static int unwind_user_next_common(struct unwind_user_state *state,
        state->sp = sp;
        if (frame->fp_off)
                state->fp = fp;
+       state->topmost = false;
        return 0;
 }
@@ -154,6 +167,7 @@ static int unwind_user_start(struct unwind_user_state *state)
        state->sp = user_stack_pointer(regs);
        state->fp = frame_pointer(regs);
        state->ws = compat_user_mode(regs) ? sizeof(int) : sizeof(long);
+       state->topmost = true;

        return 0;
 }

static int unwind_user_next_fp(struct unwind_user_state *state)
{
	const struct unwind_user_frame fp_frame = {
		ARCH_INIT_USER_FP_FRAME(state->ws)
	};
	const struct unwind_user_frame fp_entry_frame = {
		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
	};
	struct pt_regs *regs = task_pt_regs(current);

	if (state->topmost && is_uprobe_at_func_entry(regs))
		return unwind_user_next_common(state, &fp_entry_frame, regs);
	else
		return unwind_user_next_common(state, &fp_frame, regs);
}

diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
@@ -43,6 +43,7 @@ struct unwind_user_state {
        unsigned int                            ws;
        enum unwind_user_type                   current_type;
        unsigned int                            available_types;
+       bool                                    topmost;
        bool                                    done;
 };

What do you think?

> +++ b/arch/x86/kernel/unwind_user.c
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <linux/unwind_user.h>
> +#include <linux/uprobes.h>
> +#include <linux/uaccess.h>
> +#include <linux/sched/task_stack.h>
> +#include <asm/processor.h>
> +#include <asm/tlbflush.h>
> +
> +int unwind_user_next_arch(struct unwind_user_state *state)
> +{
> +	struct pt_regs *regs = task_pt_regs(current);
> +
> +	/* only once, on the first iteration */
> +	state->available_types &= ~UNWIND_USER_TYPE_ARCH;
> +
> +	/* We don't know how to unwind VM86 stacks. */
> +	if (regs->flags & X86_VM_MASK) {
> +		state->done = true;
> +		return 0;
> +	}
> +
> +	/*
> +	 * If we are called from uprobe handler, and we are indeed at the very
> +	 * entry to user function (which is normally a `push %rbp` instruction,
> +	 * under assumption of application being compiled with frame pointers),
> +	 * we should read return address from *regs->sp before proceeding
> +	 * to follow frame pointers, otherwise we'll skip immediate caller
> +	 * as %rbp is not yet setup.
> +	 */
> +	if (!is_uprobe_at_func_entry(regs))
> +		return -EINVAL;
> +
> +#ifdef CONFIG_COMPAT
> +	if (state->ws == sizeof(int)) {
> +		unsigned int retaddr;
> +		int ret = get_user(retaddr, (unsigned int __user *)regs->sp);
> +		if (ret)
> +			return ret;
> +
> +		state->ip = retaddr;
> +		return 0;
> +	}
> +#endif
> +	unsigned long retaddr;
> +	int ret = get_user(retaddr, (unsigned long __user *)regs->sp);
> +	if (ret)
> +		return ret;
> +
> +	state->ip = retaddr;
> +	return 0;
> +}

Above would then not be needed, as the unwind_user_next_fp() logic would
do the right thing.

> +++ b/arch/x86/kernel/uprobes.c
> @@ -1791,3 +1791,35 @@ bool arch_uretprobe_is_alive(struct retu
>  	else
>  		return regs->sp <= ret->stack;
>  }
> +
> +/*
> + * Heuristic-based check if uprobe is installed at the function entry.
> + *
> + * Under assumption of user code being compiled with frame pointers,
> + * `push %rbp/%ebp` is a good indicator that we indeed are.
> + *
> + * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
> + * If we get this wrong, captured stack trace might have one extra bogus
> + * entry, but the rest of stack trace will still be meaningful.
> + */
> +bool is_uprobe_at_func_entry(struct pt_regs *regs)
> +{
> +	struct arch_uprobe *auprobe;
> +
> +	if (!current->utask)
> +		return false;
> +
> +	auprobe = current->utask->auprobe;
> +	if (!auprobe)
> +		return false;
> +
> +	/* push %rbp/%ebp */
> +	if (auprobe->insn[0] == 0x55)
> +		return true;
> +
> +	/* endbr64 (64-bit only) */
> +	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
> +		return true;
> +
> +	return false;
> +}
Regards,
Jens

Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Fri, Oct 24, 2025 at 03:58:20PM +0200, Jens Remus wrote:
> Hello Peter!
> 
> On 10/24/2025 12:41 PM, Peter Zijlstra wrote:
> > On Fri, Oct 24, 2025 at 11:29:26AM +0200, Peter Zijlstra wrote:
> >> On Thu, Oct 23, 2025 at 05:00:02PM +0200, Peter Zijlstra wrote:
> >>
> >>> Trouble is, pretty much every unwind is 510 entries long -- this cannot
> >>> be right. I'm sure there's a silly mistake in unwind/user.c but I'm too
> >>> tired to find it just now. I'll try again tomorrow.
> >>
> >> PEBKAC
> > 
> > Anyway, while staring at this, I noted that the perf userspace unwind
> > code has a few bits that are missing from the new shiny thing.
> > 
> > How about something like so? This adds an optional arch specific unwinder
> > at the very highest priority (bit 0) and uses that to do a few extra
> > bits before disabling itself and falling back to whatever lower prio
> > unwinder to do the actual unwinding.
> 
> unwind user sframe does not need any of this special handling, because
> it knows for each IP whether the SP or FP is the CFA base register
> and whether the FP and RA have been saved.

It still can't unwind VM86 stacks. But yes, it should do lots better
with that start of function hack.

> Isn't this actually specific to unwind user fp?  If the IP is at
> function entry, then the FP has not been set up yet.  I think unwind user
> fp could handle this using an arch specific is_uprobe_at_func_entry() to
> determine whether to use a new frame_fp_entry instead of frame_fp.  For
> x86 the following frame_fp_entry should work, if I am not wrong:
> 
> #define ARCH_INIT_USER_FP_ENTRY_FRAME(ws)	\
> 	.cfa_off	=  1*(ws),		\
> 	.ra_off		= -1*(ws),		\
> 	.fp_off		= 0,			\
> 	.use_fp		= false,
> 
> Following roughly outlines the required changes:
> 
> diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
> 
> -static int unwind_user_next_fp(struct unwind_user_state *state)
> +static int unwind_user_next_common(struct unwind_user_state *state,
> +                                  const struct unwind_user_frame *frame,
> +                                  struct pt_regs *regs)
> 
> @@ -71,6 +83,7 @@ static int unwind_user_next_common(struct unwind_user_state *state,
>         state->sp = sp;
>         if (frame->fp_off)
>                 state->fp = fp;
> +       state->topmost = false;
>         return 0;
>  }
> @@ -154,6 +167,7 @@ static int unwind_user_start(struct unwind_user_state *state)
>         state->sp = user_stack_pointer(regs);
>         state->fp = frame_pointer(regs);
>         state->ws = compat_user_mode(regs) ? sizeof(int) : sizeof(long);
> +       state->topmost = true;
> 
>         return 0;
>  }
> 
> static int unwind_user_next_fp(struct unwind_user_state *state)
> {
> 	const struct unwind_user_frame fp_frame = {
> 		ARCH_INIT_USER_FP_FRAME(state->ws)
> 	};
> 	const struct unwind_user_frame fp_entry_frame = {
> 		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
> 	};
> 	struct pt_regs *regs = task_pt_regs(current);
> 
> 	if (state->topmost && is_uprobe_at_func_entry(regs))
> 		return unwind_user_next_common(state, &fp_entry_frame, regs);
> 	else
> 		return unwind_user_next_common(state, &fp_frame, regs);
> }
> 
> diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
> @@ -43,6 +43,7 @@ struct unwind_user_state {
>         unsigned int                            ws;
>         enum unwind_user_type                   current_type;
>         unsigned int                            available_types;
> +       bool                                    topmost;
>         bool                                    done;
>  };
> 
> What do you think?

Yeah, I suppose that should work. Let me rework things accordingly.
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Fri, Oct 24, 2025 at 04:08:15PM +0200, Peter Zijlstra wrote:

> Yeah, I suppose that should work. Let me rework things accordingly.

---
Subject: unwind_user/x86: Teach FP unwind about start of function
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Oct 24 12:31:10 CEST 2025

When userspace is interrupted at the start of a function, before we
get a chance to complete the frame, unwind will miss one caller.

X86 has a uprobe specific fixup for this, add bits to the generic
unwinder to support this.

Suggested-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/core.c             |   40 -------------------------------------
 arch/x86/include/asm/unwind_user.h |   12 +++++++++++
 arch/x86/include/asm/uprobes.h     |    9 ++++++++
 arch/x86/kernel/uprobes.c          |   32 +++++++++++++++++++++++++++++
 include/linux/unwind_user_types.h  |    1 
 kernel/unwind/user.c               |   35 ++++++++++++++++++++++++--------
 6 files changed, 80 insertions(+), 49 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2845,46 +2845,6 @@ static unsigned long get_segment_base(un
 	return get_desc_base(desc);
 }
 
-#ifdef CONFIG_UPROBES
-/*
- * Heuristic-based check if uprobe is installed at the function entry.
- *
- * Under assumption of user code being compiled with frame pointers,
- * `push %rbp/%ebp` is a good indicator that we indeed are.
- *
- * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
- * If we get this wrong, captured stack trace might have one extra bogus
- * entry, but the rest of stack trace will still be meaningful.
- */
-static bool is_uprobe_at_func_entry(struct pt_regs *regs)
-{
-	struct arch_uprobe *auprobe;
-
-	if (!current->utask)
-		return false;
-
-	auprobe = current->utask->auprobe;
-	if (!auprobe)
-		return false;
-
-	/* push %rbp/%ebp */
-	if (auprobe->insn[0] == 0x55)
-		return true;
-
-	/* endbr64 (64-bit only) */
-	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
-		return true;
-
-	return false;
-}
-
-#else
-static bool is_uprobe_at_func_entry(struct pt_regs *regs)
-{
-	return false;
-}
-#endif /* CONFIG_UPROBES */
-
 #ifdef CONFIG_IA32_EMULATION
 
 #include <linux/compat.h>
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_UNWIND_USER_H
 
 #include <asm/ptrace.h>
+#include <asm/uprobes.h>
 
 #define ARCH_INIT_USER_FP_FRAME(ws)			\
 	.cfa_off	=  2*(ws),			\
@@ -10,6 +11,12 @@
 	.fp_off		= -2*(ws),			\
 	.use_fp		= true,
 
+#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws)		\
+	.cfa_off	=  1*(ws),			\
+	.ra_off		= -1*(ws),			\
+	.fp_off		= 0,				\
+	.use_fp		= false,
+
 static inline int unwind_user_word_size(struct pt_regs *regs)
 {
 	/* We can't unwind VM86 stacks */
@@ -22,4 +29,9 @@ static inline int unwind_user_word_size(
 	return sizeof(long);
 }
 
+static inline bool unwind_user_at_function_start(struct pt_regs *regs)
+{
+	return is_uprobe_at_func_entry(regs);
+}
+
 #endif /* _ASM_X86_UNWIND_USER_H */
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -62,4 +62,13 @@ struct arch_uprobe_task {
 	unsigned int			saved_tf;
 };
 
+#ifdef CONFIG_UPROBES
+extern bool is_uprobe_at_func_entry(struct pt_regs *regs);
+#else
+static inline bool is_uprobe_at_func_entry(struct pt_regs *regs)
+{
+	return false;
+}
+#endif /* CONFIG_UPROBES */
+
 #endif	/* _ASM_UPROBES_H */
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1791,3 +1791,35 @@ bool arch_uretprobe_is_alive(struct retu
 	else
 		return regs->sp <= ret->stack;
 }
+
+/*
+ * Heuristic-based check if uprobe is installed at the function entry.
+ *
+ * Under assumption of user code being compiled with frame pointers,
+ * `push %rbp/%ebp` is a good indicator that we indeed are.
+ *
+ * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
+ * If we get this wrong, captured stack trace might have one extra bogus
+ * entry, but the rest of stack trace will still be meaningful.
+ */
+bool is_uprobe_at_func_entry(struct pt_regs *regs)
+{
+	struct arch_uprobe *auprobe;
+
+	if (!current->utask)
+		return false;
+
+	auprobe = current->utask->auprobe;
+	if (!auprobe)
+		return false;
+
+	/* push %rbp/%ebp */
+	if (auprobe->insn[0] == 0x55)
+		return true;
+
+	/* endbr64 (64-bit only) */
+	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
+		return true;
+
+	return false;
+}
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -39,6 +39,7 @@ struct unwind_user_state {
 	unsigned int				ws;
 	enum unwind_user_type			current_type;
 	unsigned int				available_types;
+	bool					topmost;
 	bool					done;
 };
 
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -26,14 +26,12 @@ get_user_word(unsigned long *word, unsig
 	return get_user(*word, addr);
 }
 
-static int unwind_user_next_fp(struct unwind_user_state *state)
+static int unwind_user_next_common(struct unwind_user_state *state,
+				   const struct unwind_user_frame *frame)
 {
-	const struct unwind_user_frame frame = {
-		ARCH_INIT_USER_FP_FRAME(state->ws)
-	};
 	unsigned long cfa, fp, ra;
 
-	if (frame.use_fp) {
+	if (frame->use_fp) {
 		if (state->fp < state->sp)
 			return -EINVAL;
 		cfa = state->fp;
@@ -42,7 +40,7 @@ static int unwind_user_next_fp(struct un
 	}
 
 	/* Get the Canonical Frame Address (CFA) */
-	cfa += frame.cfa_off;
+	cfa += frame->cfa_off;
 
 	/* stack going in wrong direction? */
 	if (cfa <= state->sp)
@@ -53,19 +51,37 @@ static int unwind_user_next_fp(struct un
 		return -EINVAL;
 
 	/* Find the Return Address (RA) */
-	if (get_user_word(&ra, cfa, frame.ra_off, state->ws))
+	if (get_user_word(&ra, cfa, frame->ra_off, state->ws))
 		return -EINVAL;
 
-	if (frame.fp_off && get_user_word(&fp, cfa, frame.fp_off, state->ws))
+	if (frame->fp_off && get_user_word(&fp, cfa, frame->fp_off, state->ws))
 		return -EINVAL;
 
 	state->ip = ra;
 	state->sp = cfa;
-	if (frame.fp_off)
+	if (frame->fp_off)
 		state->fp = fp;
+	state->topmost = false;
 	return 0;
 }
 
+static int unwind_user_next_fp(struct unwind_user_state *state)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	const struct unwind_user_frame fp_frame = {
+		ARCH_INIT_USER_FP_FRAME(state->ws)
+	};
+	const struct unwind_user_frame fp_entry_frame = {
+		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
+	};
+
+	if (state->topmost && unwind_user_at_function_start(regs))
+		return unwind_user_next_common(state, &fp_entry_frame);
+
+	return unwind_user_next_common(state, &fp_frame);
+}
+
 static int unwind_user_next(struct unwind_user_state *state)
 {
 	unsigned long iter_mask = state->available_types;
@@ -118,6 +134,7 @@ static int unwind_user_start(struct unwi
 		state->done = true;
 		return -EINVAL;
 	}
+	state->topmost = true;
 
 	return 0;
 }
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Florian Weimer 3 months ago
* Peter Zijlstra:

> +/*
> + * Heuristic-based check if uprobe is installed at the function entry.
> + *
> + * Under assumption of user code being compiled with frame pointers,
> + * `push %rbp/%ebp` is a good indicator that we indeed are.
> + *
> + * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
> + * If we get this wrong, captured stack trace might have one extra bogus
> + * entry, but the rest of stack trace will still be meaningful.
> + */
> +bool is_uprobe_at_func_entry(struct pt_regs *regs)

Is this specifically for uprobes?  Wouldn't it make sense to tell the
kernel when the uprobe is installed whether the frame pointer has been
set up at this point?  Userspace can typically figure this out easily
enough (it's not much more difficult to find the address of the
function).

Thanks,
Florian
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Peter Zijlstra 3 months ago
On Tue, Nov 04, 2025 at 12:22:01PM +0100, Florian Weimer wrote:
> * Peter Zijlstra:
> 
> > +/*
> > + * Heuristic-based check if uprobe is installed at the function entry.
> > + *
> > + * Under assumption of user code being compiled with frame pointers,
> > + * `push %rbp/%ebp` is a good indicator that we indeed are.
> > + *
> > + * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
> > + * If we get this wrong, captured stack trace might have one extra bogus
> > + * entry, but the rest of stack trace will still be meaningful.
> > + */
> > +bool is_uprobe_at_func_entry(struct pt_regs *regs)
> 
> Is this specifically for uprobes?  Wouldn't it make sense to tell the
> kernel when the uprobe is installed whether the frame pointer has been
> set up at this point?  Userspace can typically figure this out easily
> enough (it's not much more difficult to find the address of the
> function).

Yeah, I suppose so. Not sure the actual user interface for this allows
for that. Someone would have to dig into that a bit.
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Jens Remus 3 months, 2 weeks ago
Hello Peter,

very nice!

On 10/24/2025 4:51 PM, Peter Zijlstra wrote:

> Subject: unwind_user/x86: Teach FP unwind about start of function
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Oct 24 12:31:10 CEST 2025
> 
> When userspace is interrupted at the start of a function, before we
> get a chance to complete the frame, unwind will miss one caller.
> 
> X86 has a uprobe specific fixup for this, add bits to the generic
> unwinder to support this.
> 
> Suggested-by: Jens Remus <jremus@linux.ibm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> +++ b/kernel/unwind/user.c

> +static int unwind_user_next_fp(struct unwind_user_state *state)
> +{
> +	struct pt_regs *regs = task_pt_regs(current);
> +
> +	const struct unwind_user_frame fp_frame = {
> +		ARCH_INIT_USER_FP_FRAME(state->ws)
> +	};
> +	const struct unwind_user_frame fp_entry_frame = {
> +		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
> +	};
> +
> +	if (state->topmost && unwind_user_at_function_start(regs))
> +		return unwind_user_next_common(state, &fp_entry_frame);

IIUC this will cause kernel/unwind/user.c to fail to compile on
architectures that will support HAVE_UNWIND_USER_SFRAME but not
HAVE_UNWIND_USER_FP (such as s390), and thus do not need to implement
unwind_user_at_function_start().

Either s390 would need to supply a dummy unwind_user_at_function_start()
or the unwind user sframe series needs to address this and supply
a dummy one if FP is not enabled, so that the code compiles with only
SFRAME enabled.

What do you think?

> +
> +	return unwind_user_next_common(state, &fp_frame);
> +}
Thanks and regards,
Jens

Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Fri, Oct 24, 2025 at 05:09:02PM +0200, Jens Remus wrote:
> Hello Peter,
> 
> very nice!
> 
> On 10/24/2025 4:51 PM, Peter Zijlstra wrote:
> 
> > Subject: unwind_user/x86: Teach FP unwind about start of function
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Fri Oct 24 12:31:10 CEST 2025
> > 
> > When userspace is interrupted at the start of a function, before we
> > get a chance to complete the frame, unwind will miss one caller.
> > 
> > X86 has a uprobe specific fixup for this, add bits to the generic
> > unwinder to support this.
> > 
> > Suggested-by: Jens Remus <jremus@linux.ibm.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> > +++ b/kernel/unwind/user.c
> 
> > +static int unwind_user_next_fp(struct unwind_user_state *state)
> > +{
> > +	struct pt_regs *regs = task_pt_regs(current);
> > +
> > +	const struct unwind_user_frame fp_frame = {
> > +		ARCH_INIT_USER_FP_FRAME(state->ws)
> > +	};
> > +	const struct unwind_user_frame fp_entry_frame = {
> > +		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
> > +	};
> > +
> > +	if (state->topmost && unwind_user_at_function_start(regs))
> > +		return unwind_user_next_common(state, &fp_entry_frame);
> 
> IIUC this will cause kernel/unwind/user.c to fail compile on
> architectures that will support HAVE_UNWIND_USER_SFRAME but not
> HAVE_UNWIND_USER_FP (such as s390), and thus do not need to implement
> unwind_user_at_function_start().
> 
> Either s390 would need to supply a dummy unwind_user_at_function_start()
> or the unwind user sframe series needs to address this and supply
> a dummy one if FP is not enabled, so that the code compiles with only
> SFRAME enabled.
> 
> What do you think?

I'll make it conditional on HAVE_UNWIND_USER_FP -- but tomorrow or so.
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Fri, Oct 24, 2025 at 04:51:56PM +0200, Peter Zijlstra wrote:

> --- a/arch/x86/include/asm/unwind_user.h
> +++ b/arch/x86/include/asm/unwind_user.h
> @@ -3,6 +3,7 @@
>  #define _ASM_X86_UNWIND_USER_H
>  
>  #include <asm/ptrace.h>
> +#include <asm/uprobes.h>
>  
>  #define ARCH_INIT_USER_FP_FRAME(ws)			\
>  	.cfa_off	=  2*(ws),			\
> @@ -10,6 +11,12 @@
>  	.fp_off		= -2*(ws),			\
>  	.use_fp		= true,
>  
> +#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws)		\
> +	.cfa_off	=  1*(ws),			\
> +	.ra_off		= -1*(ws),			\
> +	.fp_off		= 0,				\
> +	.use_fp		= false,
> +
>  static inline int unwind_user_word_size(struct pt_regs *regs)
>  {
>  	/* We can't unwind VM86 stacks */
> @@ -22,4 +29,9 @@ static inline int unwind_user_word_size(
>  	return sizeof(long);
>  }
>  
> +static inline bool unwind_user_at_function_start(struct pt_regs *regs)
> +{
> +	return is_uprobe_at_func_entry(regs);
> +}
> +
>  #endif /* _ASM_X86_UNWIND_USER_H */

> --- a/include/linux/unwind_user_types.h
> +++ b/include/linux/unwind_user_types.h
> @@ -39,6 +39,7 @@ struct unwind_user_state {
>  	unsigned int				ws;
>  	enum unwind_user_type			current_type;
>  	unsigned int				available_types;
> +	bool					topmost;
>  	bool					done;
>  };
>  
> --- a/kernel/unwind/user.c
> +++ b/kernel/unwind/user.c

>  
> +static int unwind_user_next_fp(struct unwind_user_state *state)
> +{
> +	struct pt_regs *regs = task_pt_regs(current);
> +
> +	const struct unwind_user_frame fp_frame = {
> +		ARCH_INIT_USER_FP_FRAME(state->ws)
> +	};
> +	const struct unwind_user_frame fp_entry_frame = {
> +		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
> +	};
> +
> +	if (state->topmost && unwind_user_at_function_start(regs))
> +		return unwind_user_next_common(state, &fp_entry_frame);
> +
> +	return unwind_user_next_common(state, &fp_frame);
> +}
> +
>  static int unwind_user_next(struct unwind_user_state *state)
>  {
>  	unsigned long iter_mask = state->available_types;
> @@ -118,6 +134,7 @@ static int unwind_user_start(struct unwi
>  		state->done = true;
>  		return -EINVAL;
>  	}
> +	state->topmost = true;
>  
>  	return 0;
>  }

And right before sending this, I realized we could do the
unwind_user_at_function_start() in unwind_user_start() and set something
like state->entry = true instead of topmost.

That saves having to do task_pt_regs() in unwind_user_next_fp().

Does that make sense?
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Fri, Oct 24, 2025 at 04:54:02PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 24, 2025 at 04:51:56PM +0200, Peter Zijlstra wrote:
> 
> > --- a/arch/x86/include/asm/unwind_user.h
> > +++ b/arch/x86/include/asm/unwind_user.h
> > @@ -3,6 +3,7 @@
> >  #define _ASM_X86_UNWIND_USER_H
> >  
> >  #include <asm/ptrace.h>
> > +#include <asm/uprobes.h>
> >  
> >  #define ARCH_INIT_USER_FP_FRAME(ws)			\
> >  	.cfa_off	=  2*(ws),			\
> > @@ -10,6 +11,12 @@
> >  	.fp_off		= -2*(ws),			\
> >  	.use_fp		= true,
> >  
> > +#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws)		\
> > +	.cfa_off	=  1*(ws),			\
> > +	.ra_off		= -1*(ws),			\
> > +	.fp_off		= 0,				\
> > +	.use_fp		= false,
> > +
> >  static inline int unwind_user_word_size(struct pt_regs *regs)
> >  {
> >  	/* We can't unwind VM86 stacks */
> > @@ -22,4 +29,9 @@ static inline int unwind_user_word_size(
> >  	return sizeof(long);
> >  }
> >  
> > +static inline bool unwind_user_at_function_start(struct pt_regs *regs)
> > +{
> > +	return is_uprobe_at_func_entry(regs);
> > +}
> > +
> >  #endif /* _ASM_X86_UNWIND_USER_H */
> 
> > --- a/include/linux/unwind_user_types.h
> > +++ b/include/linux/unwind_user_types.h
> > @@ -39,6 +39,7 @@ struct unwind_user_state {
> >  	unsigned int				ws;
> >  	enum unwind_user_type			current_type;
> >  	unsigned int				available_types;
> > +	bool					topmost;
> >  	bool					done;
> >  };
> >  
> > --- a/kernel/unwind/user.c
> > +++ b/kernel/unwind/user.c
> 
> >  
> > +static int unwind_user_next_fp(struct unwind_user_state *state)
> > +{
> > +	struct pt_regs *regs = task_pt_regs(current);
> > +
> > +	const struct unwind_user_frame fp_frame = {
> > +		ARCH_INIT_USER_FP_FRAME(state->ws)
> > +	};
> > +	const struct unwind_user_frame fp_entry_frame = {
> > +		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
> > +	};
> > +
> > +	if (state->topmost && unwind_user_at_function_start(regs))
> > +		return unwind_user_next_common(state, &fp_entry_frame);
> > +
> > +	return unwind_user_next_common(state, &fp_frame);
> > +}
> > +
> >  static int unwind_user_next(struct unwind_user_state *state)
> >  {
> >  	unsigned long iter_mask = state->available_types;
> > @@ -118,6 +134,7 @@ static int unwind_user_start(struct unwi
> >  		state->done = true;
> >  		return -EINVAL;
> >  	}
> > +	state->topmost = true;
> >  
> >  	return 0;
> >  }
> 
> And right before sending this, I realized we could do the
> unwind_user_at_function_start() in unwind_user_start() and set something
> like state->entry = true instead of topmost.
> 
> That saves having to do task_pt_regs() in unwind_user_next_fp().
> 
> Does that make sense?

Urgh, that makes us call that weird hack for sframe too, which isn't
needed. Oh well, ignore this.
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Fri, Oct 24, 2025 at 04:57:35PM +0200, Peter Zijlstra wrote:

> Urgh, that makes us call that weird hack for sframe too, which isn't
> needed. Oh well, ignore this.

I've decided to stop tinkering for today and pushed out the lot into:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core

It seems to build and work on the one test build I did, so fingers
crossed.

If there is anything you want changed, please holler, I'll not push to
tip until at least Monday anyway.
[tip: perf/core] unwind_user/x86: Teach FP unwind about start of function
Posted by tip-bot2 for Peter Zijlstra 3 months, 1 week ago
The following commit has been merged into the perf/core branch of tip:

Commit-ID:     ae25884ad749e7f6e0c3565513bdc8aa2554a425
Gitweb:        https://git.kernel.org/tip/ae25884ad749e7f6e0c3565513bdc8aa2554a425
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 24 Oct 2025 12:31:10 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 10:29:58 +01:00

unwind_user/x86: Teach FP unwind about start of function

When userspace is interrupted at the start of a function, before we
get a chance to complete the frame, unwind will miss one caller.

X86 has a uprobe specific fixup for this, add bits to the generic
unwinder to support this.

Suggested-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251024145156.GM4068168@noisy.programming.kicks-ass.net
---
 arch/x86/events/core.c             | 40 +-----------------------------
 arch/x86/include/asm/unwind_user.h | 12 +++++++++-
 arch/x86/include/asm/uprobes.h     |  9 +++++++-
 arch/x86/kernel/uprobes.c          | 32 +++++++++++++++++++++++-
 include/linux/unwind_user_types.h  |  1 +-
 kernel/unwind/user.c               | 39 +++++++++++++++++++++-------
 6 files changed, 84 insertions(+), 49 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 745caa6..0cf68ad 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2845,46 +2845,6 @@ static unsigned long get_segment_base(unsigned int segment)
 	return get_desc_base(desc);
 }
 
-#ifdef CONFIG_UPROBES
-/*
- * Heuristic-based check if uprobe is installed at the function entry.
- *
- * Under assumption of user code being compiled with frame pointers,
- * `push %rbp/%ebp` is a good indicator that we indeed are.
- *
- * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
- * If we get this wrong, captured stack trace might have one extra bogus
- * entry, but the rest of stack trace will still be meaningful.
- */
-static bool is_uprobe_at_func_entry(struct pt_regs *regs)
-{
-	struct arch_uprobe *auprobe;
-
-	if (!current->utask)
-		return false;
-
-	auprobe = current->utask->auprobe;
-	if (!auprobe)
-		return false;
-
-	/* push %rbp/%ebp */
-	if (auprobe->insn[0] == 0x55)
-		return true;
-
-	/* endbr64 (64-bit only) */
-	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
-		return true;
-
-	return false;
-}
-
-#else
-static bool is_uprobe_at_func_entry(struct pt_regs *regs)
-{
-	return false;
-}
-#endif /* CONFIG_UPROBES */
-
 #ifdef CONFIG_IA32_EMULATION
 
 #include <linux/compat.h>
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index b166e10..c4f1ff8 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_UNWIND_USER_H
 
 #include <asm/ptrace.h>
+#include <asm/uprobes.h>
 
 #define ARCH_INIT_USER_FP_FRAME(ws)			\
 	.cfa_off	=  2*(ws),			\
@@ -10,6 +11,12 @@
 	.fp_off		= -2*(ws),			\
 	.use_fp		= true,
 
+#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws)		\
+	.cfa_off	=  1*(ws),			\
+	.ra_off		= -1*(ws),			\
+	.fp_off		= 0,				\
+	.use_fp		= false,
+
 static inline int unwind_user_word_size(struct pt_regs *regs)
 {
 	/* We can't unwind VM86 stacks */
@@ -22,4 +29,9 @@ static inline int unwind_user_word_size(struct pt_regs *regs)
 	return sizeof(long);
 }
 
+static inline bool unwind_user_at_function_start(struct pt_regs *regs)
+{
+	return is_uprobe_at_func_entry(regs);
+}
+
 #endif /* _ASM_X86_UNWIND_USER_H */
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 1ee2e51..362210c 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -62,4 +62,13 @@ struct arch_uprobe_task {
 	unsigned int			saved_tf;
 };
 
+#ifdef CONFIG_UPROBES
+extern bool is_uprobe_at_func_entry(struct pt_regs *regs);
+#else
+static inline bool is_uprobe_at_func_entry(struct pt_regs *regs)
+{
+	return false;
+}
+#endif /* CONFIG_UPROBES */
+
 #endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index a563e90..7be8e36 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1791,3 +1791,35 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
 	else
 		return regs->sp <= ret->stack;
 }
+
+/*
+ * Heuristic-based check if uprobe is installed at the function entry.
+ *
+ * Under assumption of user code being compiled with frame pointers,
+ * `push %rbp/%ebp` is a good indicator that we indeed are.
+ *
+ * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
+ * If we get this wrong, captured stack trace might have one extra bogus
+ * entry, but the rest of stack trace will still be meaningful.
+ */
+bool is_uprobe_at_func_entry(struct pt_regs *regs)
+{
+	struct arch_uprobe *auprobe;
+
+	if (!current->utask)
+		return false;
+
+	auprobe = current->utask->auprobe;
+	if (!auprobe)
+		return false;
+
+	/* push %rbp/%ebp */
+	if (auprobe->insn[0] == 0x55)
+		return true;
+
+	/* endbr64 (64-bit only) */
+	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
+		return true;
+
+	return false;
+}
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 938f7e6..412729a 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -39,6 +39,7 @@ struct unwind_user_state {
 	unsigned int				ws;
 	enum unwind_user_type			current_type;
 	unsigned int				available_types;
+	bool					topmost;
 	bool					done;
 };
 
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 6428715..39e2707 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -26,14 +26,12 @@ get_user_word(unsigned long *word, unsigned long base, int off, unsigned int ws)
 	return get_user(*word, addr);
 }
 
-static int unwind_user_next_fp(struct unwind_user_state *state)
+static int unwind_user_next_common(struct unwind_user_state *state,
+				   const struct unwind_user_frame *frame)
 {
-	const struct unwind_user_frame frame = {
-		ARCH_INIT_USER_FP_FRAME(state->ws)
-	};
 	unsigned long cfa, fp, ra;
 
-	if (frame.use_fp) {
+	if (frame->use_fp) {
 		if (state->fp < state->sp)
 			return -EINVAL;
 		cfa = state->fp;
@@ -42,7 +40,7 @@ static int unwind_user_next_fp(struct unwind_user_state *state)
 	}
 
 	/* Get the Canonical Frame Address (CFA) */
-	cfa += frame.cfa_off;
+	cfa += frame->cfa_off;
 
 	/* stack going in wrong direction? */
 	if (cfa <= state->sp)
@@ -53,19 +51,41 @@ static int unwind_user_next_fp(struct unwind_user_state *state)
 		return -EINVAL;
 
 	/* Find the Return Address (RA) */
-	if (get_user_word(&ra, cfa, frame.ra_off, state->ws))
+	if (get_user_word(&ra, cfa, frame->ra_off, state->ws))
 		return -EINVAL;
 
-	if (frame.fp_off && get_user_word(&fp, cfa, frame.fp_off, state->ws))
+	if (frame->fp_off && get_user_word(&fp, cfa, frame->fp_off, state->ws))
 		return -EINVAL;
 
 	state->ip = ra;
 	state->sp = cfa;
-	if (frame.fp_off)
+	if (frame->fp_off)
 		state->fp = fp;
+	state->topmost = false;
 	return 0;
 }
 
+static int unwind_user_next_fp(struct unwind_user_state *state)
+{
+#ifdef CONFIG_HAVE_UNWIND_USER_FP
+	struct pt_regs *regs = task_pt_regs(current);
+
+	if (state->topmost && unwind_user_at_function_start(regs)) {
+		const struct unwind_user_frame fp_entry_frame = {
+			ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
+		};
+		return unwind_user_next_common(state, &fp_entry_frame);
+	}
+
+	const struct unwind_user_frame fp_frame = {
+		ARCH_INIT_USER_FP_FRAME(state->ws)
+	};
+	return unwind_user_next_common(state, &fp_frame);
+#else
+	return -EINVAL;
+#endif
+}
+
 static int unwind_user_next(struct unwind_user_state *state)
 {
 	unsigned long iter_mask = state->available_types;
@@ -118,6 +138,7 @@ static int unwind_user_start(struct unwind_user_state *state)
 		state->done = true;
 		return -EINVAL;
 	}
+	state->topmost = true;
 
 	return 0;
 }
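
For clarity, a rough sketch of the arithmetic the two frame descriptions
above resolve to on 64-bit x86 (ws == 8); the variable names are
illustrative and mirror unwind_user_next_common():

	/* Mid-function (ARCH_INIT_USER_FP_FRAME, use_fp == true):
	 * the frame pointer chain is established, so walk it. */
	cfa = fp + 2*8;			/* above saved RBP and RA */
	ra  = *(u64 *)(cfa - 1*8);	/* return address */
	fp  = *(u64 *)(cfa - 2*8);	/* caller's saved RBP */

	/* Function entry (ARCH_INIT_USER_FP_ENTRY_FRAME, use_fp == false):
	 * the uprobe fired before `push %rbp`, so only the return address
	 * is on the stack and the caller's RBP is still live in the
	 * register. */
	cfa = sp + 1*8;
	ra  = *(u64 *)(cfa - 1*8);
	/* fp_off == 0: state->fp is left untouched */

As the comment on is_uprobe_at_func_entry() notes, getting the entry
heuristic wrong costs at most one bogus entry at the top of the trace;
the rest of the walk is unaffected.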
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Steven Rostedt 3 months, 2 weeks ago
On Thu, 23 Oct 2025 17:00:02 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> +/* Deferred unwinding callback for task specific events */
> +static void perf_unwind_deferred_callback(struct unwind_work *work,
> +					 struct unwind_stacktrace *trace, u64 cookie)
> +{
> +	struct perf_callchain_deferred_event deferred_event = {
> +		.trace = trace,
> +		.event = {
> +			.header = {
> +				.type = PERF_RECORD_CALLCHAIN_DEFERRED,
> +				.misc = PERF_RECORD_MISC_USER,
> +				.size = sizeof(deferred_event.event) +
> +					(trace->nr * sizeof(u64)),
> +			},
> +			.cookie = cookie,
> +			.nr = trace->nr,
> +		},
> +	};
> +
> +	perf_iterate_sb(perf_callchain_deferred_output, &deferred_event, NULL);
> +}
> +

So "perf_iterate_sb()" was the key point I was missing. I'm guessing it's
basically a demultiplexer that distributes events to all the requestors?

If I had known this, I would have done it completely differently.

-- Steve
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Thu, Oct 23, 2025 at 12:40:57PM -0400, Steven Rostedt wrote:
> On Thu, 23 Oct 2025 17:00:02 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > +/* Deferred unwinding callback for task specific events */
> > +static void perf_unwind_deferred_callback(struct unwind_work *work,
> > +					 struct unwind_stacktrace *trace, u64 cookie)
> > +{
> > +	struct perf_callchain_deferred_event deferred_event = {
> > +		.trace = trace,
> > +		.event = {
> > +			.header = {
> > +				.type = PERF_RECORD_CALLCHAIN_DEFERRED,
> > +				.misc = PERF_RECORD_MISC_USER,
> > +				.size = sizeof(deferred_event.event) +
> > +					(trace->nr * sizeof(u64)),
> > +			},
> > +			.cookie = cookie,
> > +			.nr = trace->nr,
> > +		},
> > +	};
> > +
> > +	perf_iterate_sb(perf_callchain_deferred_output, &deferred_event, NULL);
> > +}
> > +
> 
> So "perf_iterate_sb()" was the key point I was missing. I'm guessing it's
> basically a demultiplexer that distributes events to all the requestors?

A superset. Basically every event in the relevant context that 'wants'
it.

It is what we use for all traditional side-band events (hence the _sb
naming) like mmap, task creation/exit, etc.

I was under the impression the perf tool would create one software dummy
event to listen specifically for these events per buffer, but alas, when
I looked at the tool this does not appear to be the case.

As a result it is possible to receive these events multiple times. And
since that is a problem that needs to be solved anyway, I didn't think
it 'relevant' in this case.

> If I had known this, I would have done it completely differently.

I did mention it here:

  https://lkml.kernel.org/r/20250923103213.GD3419281@noisy.programming.kicks-ass.net

Anyway, no worries. Onwards to figuring out WTF the unwinder doesn't
seem to terminate properly.
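
Since nothing prevents two events in the same context from both having
defer_output set, a consumer has to be prepared to de-duplicate by
cookie when stitching. A minimal tool-side sketch, assuming a hash
table keyed on the cookie (hash_lookup()/hash_insert()/append_ips() are
hypothetical helpers, not actual perf-tool code):

	/* Hypothetical stitching sketch; not actual perf tooling. */
	struct deferred_chain {
		__u64	cookie;
		__u64	nr;
		__u64	ips[];
	};

	static void on_callchain_deferred(struct deferred_chain *rec)
	{
		/* A second record with the same cookie is a duplicate. */
		if (!hash_lookup(deferred_map, rec->cookie))
			hash_insert(deferred_map, rec->cookie, rec);
	}

	static void stitch_sample(__u64 *chain, __u64 nr)
	{
		for (__u64 i = 0; i + 1 < nr; i++) {
			if (chain[i] != PERF_CONTEXT_USER_DEFERRED)
				continue;
			/* The next entry is the cookie to match. */
			struct deferred_chain *d =
				hash_lookup(deferred_map, chain[i + 1]);
			if (d)
				append_ips(d->ips, d->nr);
			break;
		}
	}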
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Peter Zijlstra 3 months, 2 weeks ago
Arnaldo, Namhyung,

On Fri, Oct 24, 2025 at 10:26:56AM +0200, Peter Zijlstra wrote:

> > So "perf_iterate_sb()" was the key point I was missing. I'm guessing it's
> > basically a demultiplexer that distributes events to all the requestors?
> 
> A superset. Basically every event in the relevant context that 'wants'
> it.
> 
> It is what we use for all traditional side-band events (hence the _sb
> naming) like mmap, task creation/exit, etc.
> 
> I was under the impression the perf tool would create one software dummy
> event to listen specifically for these events per buffer, but alas, when
> I looked at the tool this does not appear to be the case.
> 
> As a result it is possible to receive these events multiple times. And
> since that is a problem that needs to be solved anyway, I didn't think
> it 'relevant' in this case.

When I use:

  perf record -ag -e cycles -e instructions

I get:

# event : name = cycles, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|CPU|PERIOD|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1, defer_callchain = 1
# event : name = instructions, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0x1 (PERF_COUNT_HW_INSTRUCTIONS), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|CPU|PERIOD|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1, defer_callchain = 1
# event : name = dummy:u, , id = { }, type = 1 (PERF_TYPE_SOFTWARE), size = 136, config = 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq } = 1, sample_type = IP|TID|TIME|CPU|IDENTIFIER, read_format = ID|LOST, exclude_kernel = 1, exclude_hv = 1, mmap = 1, comm = 1, task = 1, sample_id_all = 1, exclude_guest = 1, mmap2 = 1, comm_exec = 1, ksymbol = 1, bpf_event = 1, build_id = 1, defer_output = 1

And we have this dummy event I spoke of above; and it has defer_output
set, none of the others do. This is what I expected.

*However*, when I use:

  perf record -g -e cycles -e instructions

I get:

# event : name = cycles, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|ID|PERIOD, read_format = ID|LOST, disabled = 1, inherit = 1, mmap = 1, comm = 1, freq = 1, enable_on_exec = 1, task = 1, sample_id_all = 1, mmap2 = 1, comm_exec = 1, ksymbol = 1, bpf_event = 1, build_id = 1, defer_callchain = 1, defer_output = 1
# event : name = instructions, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0x1 (PERF_COUNT_HW_INSTRUCTIONS), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|ID|PERIOD, read_format = ID|LOST, disabled = 1, inherit = 1, freq = 1, enable_on_exec = 1, sample_id_all = 1, defer_callchain = 1

Which doesn't have a dummy event. Notably the first real event has
defer_output set (and all the other sideband stuff like mmap, comm,
etc.).

Is there a reason the !cpu mode doesn't have the dummy event? Anyway, it
should all work; it's just an unexpected inconsistency that confused me.
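
For reference, the two layouts above boil down to roughly the following
attr setup on the tool side (a sketch against the uapi additions in this
series; perf_event_open() and the mmap/read plumbing are omitted):

	#include <linux/perf_event.h>	/* patched uapi from this series */

	struct perf_event_attr cycles = {
		.size		 = sizeof(struct perf_event_attr),
		.type		 = PERF_TYPE_HARDWARE,
		.config		 = PERF_COUNT_HW_CPU_CYCLES,
		.sample_type	 = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
				   PERF_SAMPLE_TIME | PERF_SAMPLE_CALLCHAIN,
		.freq		 = 1,
		.sample_freq	 = 2000,
		.defer_callchain = 1,	/* kernel part now, user part deferred */
	};

	struct perf_event_attr dummy = {
		.size		 = sizeof(struct perf_event_attr),
		.type		 = PERF_TYPE_SOFTWARE,
		.config		 = PERF_COUNT_SW_DUMMY,
		.sample_period	 = 1,
		.exclude_kernel	 = 1,
		.mmap		 = 1,
		.comm		 = 1,
		.task		 = 1,
		.defer_output	 = 1,	/* CALLCHAIN_DEFERRED records land here */
	};

Keeping defer_output on a single (dummy) event per buffer is what avoids
the duplicated PERF_RECORD_CALLCHAIN_DEFERRED records mentioned above.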
Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
Posted by Namhyung Kim 3 months, 2 weeks ago
Hi Peter,

On Fri, Oct 24, 2025 at 02:58:41PM +0200, Peter Zijlstra wrote:
> 
> Arnaldo, Namhyung,
> 
> On Fri, Oct 24, 2025 at 10:26:56AM +0200, Peter Zijlstra wrote:
> 
> > > So "perf_iterate_sb()" was the key point I was missing. I'm guessing it's
> > > basically a demultiplexer that distributes events to all the requestors?
> > 
> > A superset. Basically every event in the relevant context that 'wants'
> > it.
> > 
> > It is what we use for all traditional side-band events (hence the _sb
> > naming) like mmap, task creation/exit, etc.
> > 
> > I was under the impression the perf tool would create one software dummy
> > event to listen specifically for these events per buffer, but alas, when
> > I looked at the tool this does not appear to be the case.
> > 
> > As a result it is possible to receive these events multiple times. And
> > since that is a problem that needs to be solved anyway, I didn't think
> > it 'relevant' in this case.
> 
> When I use:
> 
>   perf record -ag -e cycles -e instructions
> 
> I get:
> 
> # event : name = cycles, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|CPU|PERIOD|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1, defer_callchain = 1
> # event : name = instructions, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0x1 (PERF_COUNT_HW_INSTRUCTIONS), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|CPU|PERIOD|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1, defer_callchain = 1
> # event : name = dummy:u, , id = { }, type = 1 (PERF_TYPE_SOFTWARE), size = 136, config = 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq } = 1, sample_type = IP|TID|TIME|CPU|IDENTIFIER, read_format = ID|LOST, exclude_kernel = 1, exclude_hv = 1, mmap = 1, comm = 1, task = 1, sample_id_all = 1, exclude_guest = 1, mmap2 = 1, comm_exec = 1, ksymbol = 1, bpf_event = 1, build_id = 1, defer_output = 1
> 
> And we have this dummy event I spoke of above; and it has defer_output
> set, none of the others do. This is what I expected.
> 
> *However*, when I use:
> 
>   perf record -g -e cycles -e instructions
> 
> I get:
> 
> # event : name = cycles, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|ID|PERIOD, read_format = ID|LOST, disabled = 1, inherit = 1, mmap = 1, comm = 1, freq = 1, enable_on_exec = 1, task = 1, sample_id_all = 1, mmap2 = 1, comm_exec = 1, ksymbol = 1, bpf_event = 1, build_id = 1, defer_callchain = 1, defer_output = 1
> # event : name = instructions, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0x1 (PERF_COUNT_HW_INSTRUCTIONS), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|ID|PERIOD, read_format = ID|LOST, disabled = 1, inherit = 1, freq = 1, enable_on_exec = 1, sample_id_all = 1, defer_callchain = 1
> 
> Which doesn't have a dummy event. Notably the first real event has
> defer_output set (and all the other sideband stuff like mmap, comm,
> etc.).
> 
> Is there a reason the !cpu mode doesn't have the dummy event? Anyway, it
> should all work; it's just an unexpected inconsistency that confused me.

Right, I don't remember why.  I think there's no reason to do it for
system-wide mode only.

Adrian, do you have any idea?  I have a vague memory that you worked on
this in the past.

Thanks,
Namhyung
[tip: perf/core] perf: Support deferred user unwind
Posted by tip-bot2 for Peter Zijlstra 3 months, 1 week ago
The following commit has been merged into the perf/core branch of tip:

Commit-ID:     c69993ecdd4dfde2b7da08b022052a33b203da07
Gitweb:        https://git.kernel.org/tip/c69993ecdd4dfde2b7da08b022052a33b203da07
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Thu, 23 Oct 2025 15:17:05 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 10:29:58 +01:00

perf: Support deferred user unwind

Add support for deferred userspace unwind to perf.

Perf currently relies on in-place stack unwinding, from NMI context and
all that. This moves the userspace part of the unwind to right before
the return to userspace.

This has two distinct benefits. The biggest is that it moves the
unwind to a faultable context: it becomes possible to fault in debug
info (.eh_frame, SFrame, etc.) that might not otherwise be readily
available. Secondly, it de-duplicates the user callchain when multiple
samples happen during the same kernel entry.

To facilitate this the perf interface is extended with a new record
type:

  PERF_RECORD_CALLCHAIN_DEFERRED

and two new attribute flags:

  perf_event_attr::defer_callchain - to request the user unwind be deferred
  perf_event_attr::defer_output    - to request PERF_RECORD_CALLCHAIN_DEFERRED records

The existing PERF_RECORD_SAMPLE callchain section gets a new
context type:

  PERF_CONTEXT_USER_DEFERRED

After this context entry comes a single entry denoting the 'cookie' of
the deferred callchain that should be attached here, matching the
'cookie' field of the above-mentioned PERF_RECORD_CALLCHAIN_DEFERRED.

The 'defer_callchain' flag is expected on all events with
PERF_SAMPLE_CALLCHAIN. The 'defer_output' flag is expected on the event
responsible for collecting side-band events (like mmap, comm, etc.).
Setting 'defer_output' on multiple events will get you duplicated
PERF_RECORD_CALLCHAIN_DEFERRED records.
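
Schematically, and purely as an illustration (kip*/uip* stand for
made-up kernel/user addresses), a deferred sample and its matching
record pair up like this:

	PERF_RECORD_SAMPLE callchain:
	  PERF_CONTEXT_KERNEL, kip0, kip1, PERF_CONTEXT_USER_DEFERRED, cookie

	PERF_RECORD_CALLCHAIN_DEFERRED:
	  cookie, nr = 2, ips[] = { uip0, uip1 }

	stitched view (one plausible tool-side result):
	  PERF_CONTEXT_KERNEL, kip0, kip1, PERF_CONTEXT_USER, uip0, uip1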

Based on earlier patches by Josh and Steven.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251023150002.GR4067720@noisy.programming.kicks-ass.net
---
 include/linux/perf_event.h            |  2 +-
 include/linux/unwind_deferred.h       | 12 +----
 include/linux/unwind_deferred_types.h | 13 ++++-
 include/uapi/linux/perf_event.h       | 21 ++++++-
 kernel/bpf/stackmap.c                 |  4 +-
 kernel/events/callchain.c             | 14 ++++-
 kernel/events/core.c                  | 78 +++++++++++++++++++++++++-
 tools/include/uapi/linux/perf_event.h | 21 ++++++-
 8 files changed, 145 insertions(+), 20 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fd1d910..9870d76 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1720,7 +1720,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark);
+		   u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index f4743c8..bc7ae7d 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -6,18 +6,6 @@
 #include <linux/unwind_user.h>
 #include <linux/unwind_deferred_types.h>
 
-struct unwind_work;
-
-typedef void (*unwind_callback_t)(struct unwind_work *work,
-				  struct unwind_stacktrace *trace,
-				  u64 cookie);
-
-struct unwind_work {
-	struct list_head		list;
-	unwind_callback_t		func;
-	int				bit;
-};
-
 #ifdef CONFIG_UNWIND_USER
 
 enum {
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 0a4c8dd..18fa393 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -39,4 +39,17 @@ struct unwind_task_info {
 	union unwind_task_id	id;
 };
 
+struct unwind_work;
+struct unwind_stacktrace;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work,
+				  struct unwind_stacktrace *trace,
+				  u64 cookie);
+
+struct unwind_work {
+	struct list_head		list;
+	unwind_callback_t		func;
+	int				bit;
+};
+
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 78a362b..d292f96 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -463,7 +463,9 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				defer_callchain:  1, /* request PERF_RECORD_CALLCHAIN_DEFERRED records */
+				defer_output   :  1, /* output PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1   : 24;
 
 	union {
 		__u32		wakeup_events;	  /* wake up every n events */
@@ -1239,6 +1241,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space.  Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				cookie;
+	 *	u64				nr;
+	 *	u64				ips[nr];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED		= 22,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1269,6 +1287,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV				= (__u64)-32,
 	PERF_CONTEXT_KERNEL			= (__u64)-128,
 	PERF_CONTEXT_USER			= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED		= (__u64)-640,
 
 	PERF_CONTEXT_GUEST			= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL		= (__u64)-2176,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 4d53cdd..8f1daca 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 		max_depth = sysctl_perf_event_max_stack;
 
 	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false);
+				   false, false, 0);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
@@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false);
+					   crosstask, false, 0);
 
 	if (unlikely(!trace) || trace->nr < skip) {
 		if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 808c0d7..b9c7e00 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -218,7 +218,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark)
+		   u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -251,6 +251,18 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 			regs = task_pt_regs(current);
 		}
 
+		if (defer_cookie) {
+			/*
+			 * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
+			 * which can be stitched to this one, and add
+			 * the cookie after it (it will be cut off when the
+			 * user stack is copied to the callchain).
+			 */
+			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+			perf_callchain_store_context(&ctx, defer_cookie);
+			goto exit_put;
+		}
+
 		if (add_mark)
 			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7541f6f..f6a08c7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -56,6 +56,7 @@
 #include <linux/buildid.h>
 #include <linux/task_work.h>
 #include <linux/percpu-rwsem.h>
+#include <linux/unwind_deferred.h>
 
 #include "internal.h"
 
@@ -8200,6 +8201,8 @@ static u64 perf_get_page_size(unsigned long addr)
 
 static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
 
+static struct unwind_work perf_unwind_work;
+
 struct perf_callchain_entry *
 perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
@@ -8208,8 +8211,11 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 		!(current->flags & (PF_KTHREAD | PF_USER_WORKER));
 	/* Disallow cross-task user callchains. */
 	bool crosstask = event->ctx->task && event->ctx->task != current;
+	bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) && user &&
+			  event->attr.defer_callchain;
 	const u32 max_stack = event->attr.sample_max_stack;
 	struct perf_callchain_entry *callchain;
+	u64 defer_cookie;
 
 	if (!current->mm)
 		user = false;
@@ -8217,8 +8223,13 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, kernel, user,
-				       max_stack, crosstask, true);
+	if (!(user && defer_user && !crosstask &&
+	      unwind_deferred_request(&perf_unwind_work, &defer_cookie) >= 0))
+		defer_cookie = 0;
+
+	callchain = get_perf_callchain(regs, kernel, user, max_stack,
+				       crosstask, true, defer_cookie);
+
 	return callchain ?: &__empty_callchain;
 }
 
@@ -10003,6 +10014,66 @@ void perf_event_bpf_event(struct bpf_prog *prog,
 	perf_iterate_sb(perf_event_bpf_output, &bpf_event, NULL);
 }
 
+struct perf_callchain_deferred_event {
+	struct unwind_stacktrace *trace;
+	struct {
+		struct perf_event_header	header;
+		u64				cookie;
+		u64				nr;
+		u64				ips[];
+	} event;
+};
+
+static void perf_callchain_deferred_output(struct perf_event *event, void *data)
+{
+	struct perf_callchain_deferred_event *deferred_event = data;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	int ret, size = deferred_event->event.header.size;
+
+	if (!event->attr.defer_output)
+		return;
+
+	/* XXX do we really need sample_id_all for this ??? */
+	perf_event_header__init_id(&deferred_event->event.header, &sample, event);
+
+	ret = perf_output_begin(&handle, &sample, event,
+				deferred_event->event.header.size);
+	if (ret)
+		goto out;
+
+	perf_output_put(&handle, deferred_event->event);
+	for (int i = 0; i < deferred_event->trace->nr; i++) {
+		u64 entry = deferred_event->trace->entries[i];
+		perf_output_put(&handle, entry);
+	}
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+out:
+	deferred_event->event.header.size = size;
+}
+
+static void perf_unwind_deferred_callback(struct unwind_work *work,
+					 struct unwind_stacktrace *trace, u64 cookie)
+{
+	struct perf_callchain_deferred_event deferred_event = {
+		.trace = trace,
+		.event = {
+			.header = {
+				.type = PERF_RECORD_CALLCHAIN_DEFERRED,
+				.misc = PERF_RECORD_MISC_USER,
+				.size = sizeof(deferred_event.event) +
+					(trace->nr * sizeof(u64)),
+			},
+			.cookie = cookie,
+			.nr = trace->nr,
+		},
+	};
+
+	perf_iterate_sb(perf_callchain_deferred_output, &deferred_event, NULL);
+}
+
 struct perf_text_poke_event {
 	const void		*old_bytes;
 	const void		*new_bytes;
@@ -14799,6 +14870,9 @@ void __init perf_event_init(void)
 
 	idr_init(&pmu_idr);
 
+	unwind_deferred_init(&perf_unwind_work,
+			     perf_unwind_deferred_callback);
+
 	perf_event_init_all_cpus();
 	init_srcu_struct(&pmus_srcu);
 	perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 78a362b..d292f96 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -463,7 +463,9 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				defer_callchain:  1, /* request PERF_RECORD_CALLCHAIN_DEFERRED records */
+				defer_output   :  1, /* output PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1   : 24;
 
 	union {
 		__u32		wakeup_events;	  /* wake up every n events */
@@ -1239,6 +1241,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space.  Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				cookie;
+	 *	u64				nr;
+	 *	u64				ips[nr];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED		= 22,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1269,6 +1287,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV				= (__u64)-32,
 	PERF_CONTEXT_KERNEL			= (__u64)-128,
 	PERF_CONTEXT_USER			= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED		= (__u64)-640,
 
 	PERF_CONTEXT_GUEST			= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL		= (__u64)-2176,