[PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure

Steven Rostedt posted 13 patches 8 months, 4 weeks ago
There is a newer version of this series
[PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
Posted by Steven Rostedt 8 months, 4 weeks ago
This series does not make any user space visible changes.
It only adds the necessary infrastructure of the deferred unwinder.

 Based off of tip/master: d60119b82d8871c66563f8657bb3e550a80234de

This has modifications in x86 and I would like it to go through the x86
tree. Preferably it can go into this merge window so we can focus on getting
perf and ftrace to work on top of this.

Perf exposes a lot of the interface to user space, as the perf tool needs
to handle the merging of the stacks, so I figured it would be better to just
get the kernel side mostly done and then work out the kinks of the code
between user and kernel.

As there is no exposure to user space at this time, if something is found
wrong with it, it can be fixed without worrying about breaking API.

I ran all these patches through my tests and they passed
(had to add the fix for the CONFIG_MODULE in tip/master:
 https://lore.kernel.org/all/20250513025839.495755-1-ebiggers@kernel.org/)

Changes since v8: https://lore.kernel.org/all/20250509164524.448387100@goodmis.org/

- The patches posted here have not been updated since v8.

- Removed updates to perf proper

  This is only to create the unwind infrastructure so that perf and
  ftrace can be worked on simultaneously without dependencies for
  each.

- Discussion about using guard for SRCU

   Andrii Nakryiko brought up using guard for SRCU around the
   list iteration, but it was decided that just using the normal
   methods was fine for this use case.


Josh Poimboeuf (9):
      unwind_user: Add user space unwinding API
      unwind_user: Add frame pointer support
      unwind_user/x86: Enable frame pointer unwinding on x86
      perf/x86: Rename and move get_segment_base() and make it global
      unwind_user: Add compat mode frame pointer support
      unwind_user/x86: Enable compat mode frame pointer unwinding on x86
      unwind_user/deferred: Add unwind cache
      unwind_user/deferred: Add deferred unwinding interface
      unwind_user/deferred: Make unwind deferral requests NMI-safe

Steven Rostedt (4):
      unwind_user/deferred: Add unwind_deferred_trace()
      unwind deferred: Use bitmask to determine which callbacks to call
      unwind deferred: Use SRCU unwind_deferred_task_work()
      unwind: Clear unwind_mask on exit back to user space

----
 MAINTAINERS                              |   8 +
 arch/Kconfig                             |  11 +
 arch/x86/Kconfig                         |   2 +
 arch/x86/events/core.c                   |  44 +---
 arch/x86/include/asm/ptrace.h            |   2 +
 arch/x86/include/asm/unwind_user.h       |  61 +++++
 arch/x86/include/asm/unwind_user_types.h |  17 ++
 arch/x86/kernel/ptrace.c                 |  38 ++++
 include/asm-generic/Kbuild               |   2 +
 include/asm-generic/unwind_user.h        |  24 ++
 include/asm-generic/unwind_user_types.h  |   9 +
 include/linux/entry-common.h             |   2 +
 include/linux/sched.h                    |   6 +
 include/linux/unwind_deferred.h          |  72 ++++++
 include/linux/unwind_deferred_types.h    |  17 ++
 include/linux/unwind_user.h              |  15 ++
 include/linux/unwind_user_types.h        |  35 +++
 kernel/Makefile                          |   1 +
 kernel/fork.c                            |   4 +
 kernel/unwind/Makefile                   |   1 +
 kernel/unwind/deferred.c                 | 367 +++++++++++++++++++++++++++++++
 kernel/unwind/user.c                     | 130 +++++++++++
 22 files changed, 829 insertions(+), 39 deletions(-)
 create mode 100644 arch/x86/include/asm/unwind_user.h
 create mode 100644 arch/x86/include/asm/unwind_user_types.h
 create mode 100644 include/asm-generic/unwind_user.h
 create mode 100644 include/asm-generic/unwind_user_types.h
 create mode 100644 include/linux/unwind_deferred.h
 create mode 100644 include/linux/unwind_deferred_types.h
 create mode 100644 include/linux/unwind_user.h
 create mode 100644 include/linux/unwind_user_types.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/deferred.c
 create mode 100644 kernel/unwind/user.c
Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
Posted by Steven Rostedt 8 months, 4 weeks ago
On Tue, 13 May 2025 18:34:35 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> This has modifications in x86 and I would like it to go through the x86
> tree. Preferably it can go into this merge window so we can focus on getting
> perf and ftrace to work on top of this.

I think it may be best for me to remove the two x86 specific patches, and
rebuild the ftrace work on top of it. For testing, I'll just keep those two
patches in my tree locally, but then I can get this moving for this merge
window.

Next merge window, we can spend more time on getting the perf API working
properly.

-- Steve
Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
Posted by Namhyung Kim 8 months, 3 weeks ago
Hi Steve,

On Wed, May 14, 2025 at 01:27:20PM -0400, Steven Rostedt wrote:
> On Tue, 13 May 2025 18:34:35 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > This has modifications in x86 and I would like it to go through the x86
> > tree. Preferably it can go into this merge window so we can focus on getting
> > perf and ftrace to work on top of this.
> 
> I think it may be best for me to remove the two x86 specific patches, and
> rebuild the ftrace work on top of it. For testing, I'll just keep those two
> patches in my tree locally, but then I can get this moving for this merge
> window.

Maybe I asked this before but I don't remember if I got the answer. :)
How does it handle task exits as it won't go to userspace?  I guess it'll
lose user callstacks for exit syscalls and other termination paths.

Similarly, it will miss user callstacks in the samples at the end of
profiling if the target tasks remain in the kernel (or they sleep).
It looks like a fundamental limitation of the deferred callchains.

Thanks,
Namhyung

> 
> Next merge window, we can spend more time on getting the perf API working
> properly.
> 
> -- Steve
Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
Posted by Masami Hiramatsu (Google) 8 months, 3 weeks ago
On Fri, 16 May 2025 16:39:56 -0700
Namhyung Kim <namhyung@kernel.org> wrote:

> Hi Steve,
> 
> On Wed, May 14, 2025 at 01:27:20PM -0400, Steven Rostedt wrote:
> > On Tue, 13 May 2025 18:34:35 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > This has modifications in x86 and I would like it to go through the x86
> > > tree. Preferably it can go into this merge window so we can focus on getting
> > > perf and ftrace to work on top of this.
> > 
> > I think it may be best for me to remove the two x86 specific patches, and
> > rebuild the ftrace work on top of it. For testing, I'll just keep those two
> > patches in my tree locally, but then I can get this moving for this merge
> > window.
> 
> Maybe I asked this before but I don't remember if I got the answer. :)
> How does it handle task exits as it won't go to userspace?  I guess it'll
> lose user callstacks for exit syscalls and other termination paths.
> 
> Similarly, it will miss user callstacks in the samples at the end of
> profiling if the target tasks remain in the kernel (or they sleep).
> It looks like a fundamental limitation of the deferred callchains.

Can we use a hybrid approach for this case?
It might be more balanced (from the performance point of view) to save
the full stack in a classic way only in this case, rather than faulting
on process exit or doing file access just to load the sframe.

Thanks,

> 
> Thanks,
> Namhyung
> 
> > 
> > Next merge window, we can spend more time on getting the perf API working
> > properly.
> > 
> > -- Steve


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
Posted by Steven Rostedt 8 months, 3 weeks ago
On Wed, 21 May 2025 08:26:05 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > Maybe I asked this before but I don't remember if I got the answer. :)
> > How does it handle task exits as it won't go to userspace?  I guess it'll
> > lose user callstacks for exit syscalls and other termination paths.

I just checked, and the good news is that task_work does indeed get called
when a task exits. The bad news is that it happens after do_exit() cleans
up the task's "mm" structure via exit_mm(). Which means that current->mm is
NULL :-p

There's a proposal to move trace_sched_process_exit() to before exit_mm().
If that happens, we could make that tracepoint a "faultable" tracepoint and
then the unwind infrastructure could attach to it and do the unwinding from
that tracepoint.
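
To make the ordering problem concrete, here is a minimal user space sketch. It is only a model under stated assumptions: `struct task`, `try_unwind()` and `exit_task()` are hypothetical names invented for illustration and are not kernel APIs. Setting `mm` to NULL stands in for exit_mm(), and the second `try_unwind()` call stands in for task_work running afterwards.

```c
#include <assert.h>
#include <stddef.h>

struct mm { int dummy; };

struct task {
	struct mm *mm;	/* NULL once the address space is gone */
	int pending;	/* a deferred user unwind was requested */
	int unwound;	/* number of successful user unwinds */
};

/* Without an mm there is no user memory left to walk. */
static int try_unwind(struct task *t)
{
	if (!t->mm)
		return -1;
	t->unwound++;
	t->pending = 0;
	return 0;
}

/*
 * Model of the exit path. If hook_before_exit_mm is set, a hook runs
 * while the mm is still valid; otherwise only the late "task_work"
 * step runs, after the mm has been dropped.
 */
static int exit_task(struct task *t, int hook_before_exit_mm)
{
	int ret = 0;

	if (hook_before_exit_mm && t->pending)
		ret = try_unwind(t);	/* early hook: mm still valid */

	t->mm = NULL;			/* stands in for exit_mm() */

	if (t->pending)
		ret = try_unwind(t);	/* late task_work: mm is gone */

	return ret;
}
```

In this model, a task with a pending request exits with an error and no trace when only the late step runs, and gets exactly one trace when the early hook is enabled.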

> > 
> > Similarly, it will miss user callstacks in the samples at the end of
> > profiling if the target tasks remain in the kernel (or they sleep).
> > It looks like a fundamental limitation of the deferred callchains.  

Yes that is a limitation.

> 
> Can we use a hybrid approach for this case?
> It might be more balanced (from the performance point of view) to save
> the full stack in a classic way only in this case, rather than faulting
> on process exit or doing file access just to load the sframe.

Another approach is that the tool (like perf) could request to take the
user space stack trace every time a task enters the kernel via a system
call.

-- Steve
Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
Posted by Steven Rostedt 8 months, 3 weeks ago
On Tue, 20 May 2025 19:55:49 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> There's a proposal to move trace_sched_process_exit() to before exit_mm().
> If that happens, we could make that tracepoint a "faultable" tracepoint and
> then the unwind infrastructure could attach to it and do the unwinding from
> that tracepoint.

The below patch does work. It's just a PoC and would need to be broken up
and also cleaned up.

I created a TRACE_EVENT_FAULTABLE() that is basically just a
TRACE_EVENT_SYSCALL(), and used that for the sched_process_exit tracepoint.

I then had the unwinder attach to that tracepoint when the first unwind
callback is registered.

I had to change the check in the trace from testing PF_EXITING to just
checking whether current->mm is NULL.

But this does work for the exiting of a task:

-- Steve

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index a351763e6965..eb98bb61126e 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -617,6 +617,8 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 #define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign,	\
 			    print, reg, unreg)			\
 	DECLARE_TRACE_SYSCALL(name, PARAMS(proto), PARAMS(args))
+#define TRACE_EVENT_FAULTABLE(name, proto, args, struct, assign, print)	\
+	DECLARE_TRACE_SYSCALL(name, PARAMS(proto), PARAMS(args))
 
 #define TRACE_EVENT_FLAGS(event, flag)
 
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index ed52d0506c69..b228424744fd 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -50,6 +50,10 @@
 #define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign, print, reg, unreg) \
 	DEFINE_TRACE_SYSCALL(name, reg, unreg, PARAMS(proto), PARAMS(args))
 
+#undef TRACE_EVENT_FAULTABLE
+#define TRACE_EVENT_FAULTABLE(name, proto, args, struct, assign, print) \
+	DEFINE_TRACE_SYSCALL(name, NULL, NULL, PARAMS(proto), PARAMS(args))
+
 #undef TRACE_EVENT_NOP
 #define TRACE_EVENT_NOP(name, proto, args, struct, assign, print)
 
@@ -125,6 +129,7 @@
 #undef TRACE_EVENT_FN
 #undef TRACE_EVENT_FN_COND
 #undef TRACE_EVENT_SYSCALL
+#undef TRACE_EVENT_FAULTABLE
 #undef TRACE_EVENT_CONDITION
 #undef TRACE_EVENT_NOP
 #undef DEFINE_EVENT_NOP
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 3bec9fb73a36..c6d7894970e3 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -326,13 +326,13 @@ DEFINE_EVENT(sched_process_template, sched_process_free,
 	     TP_ARGS(p));
 
 /*
- * Tracepoint for a task exiting.
+ * Tracepoint for a task exiting (allows faulting)
  * Note, it's a superset of sched_process_template and should be kept
  * compatible as much as possible. sched_process_exits has an extra
  * `group_dead` argument, so sched_process_template can't be used,
  * unfortunately, just like sched_migrate_task above.
  */
-TRACE_EVENT(sched_process_exit,
+TRACE_EVENT_FAULTABLE(sched_process_exit,
 
 	TP_PROTO(struct task_struct *p, bool group_dead),
 
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 4f22136fd465..0ed57e7906d1 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -55,6 +55,16 @@
 			     PARAMS(print));		       \
 	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
 
+#undef TRACE_EVENT_FAULTABLE
+#define TRACE_EVENT_FAULTABLE(name, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_SYSCALL_CLASS(name,		       \
+			     PARAMS(proto),		       \
+			     PARAMS(args),		       \
+			     PARAMS(tstruct),		       \
+			     PARAMS(assign),		       \
+			     PARAMS(print));		       \
+	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
+
 #include "stages/stage1_struct_define.h"
 
 #undef DECLARE_EVENT_CLASS
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 63d0237bad3e..7aad471f2887 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -11,6 +11,8 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 
+#include <trace/events/sched.h>
+
 #define UNWIND_MAX_ENTRIES 512
 
 /* Guards adding to or removing from the list of callbacks */
@@ -77,7 +79,7 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
 	/* Should always be called from faultable context */
 	might_fault();
 
-	if (current->flags & PF_EXITING)
+	if (!current->mm)
 		return -EINVAL;
 
 	if (!info->cache) {
@@ -107,14 +109,14 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
 	return 0;
 }
 
-static void unwind_deferred_task_work(struct callback_head *head)
+static void process_unwind_deferred(void)
 {
-	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+	struct task_struct *task = current;
+	struct unwind_task_info *info = &task->unwind_info;
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
 	unsigned long bits;
 	u64 timestamp;
-	struct task_struct *task = current;
 	int idx;
 
 	if (WARN_ON_ONCE(!unwind_pending(task)))
@@ -152,6 +155,21 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	srcu_read_unlock(&unwind_srcu, idx);
 }
 
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+	process_unwind_deferred();
+}
+
+static void unwind_deferred_callback(void *data, struct task_struct *p, bool group_dead)
+{
+	if (!unwind_pending(p))
+		return;
+
+	process_unwind_deferred();
+
+	task_work_cancel(p, &p->unwind_info.work);
+}
+
 static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
 {
 	struct unwind_task_info *info = &current->unwind_info;
@@ -329,6 +347,10 @@ void unwind_deferred_cancel(struct unwind_work *work)
 	for_each_process_thread(g, t) {
 		clear_bit(bit, &t->unwind_mask);
 	}
+
+	/* Is this the last registered unwinding? */
+	if (!unwind_mask)
+		unregister_trace_sched_process_exit(unwind_deferred_callback, NULL);
 }
 
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -341,6 +363,15 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	if (unwind_mask == ~(UNWIND_PENDING))
 		return -EBUSY;
 
+	/* Is this the first registered unwinding? */
+	if (!unwind_mask) {
+		int ret;
+
+		ret = register_trace_sched_process_exit(unwind_deferred_callback, NULL);
+		if (ret < 0)
+			return ret;
+	}
+
 	work->bit = ffz(unwind_mask);
 	unwind_mask |= 1UL << work->bit;
Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
Posted by Steven Rostedt 8 months, 3 weeks ago
On Fri, 16 May 2025 16:39:56 -0700
Namhyung Kim <namhyung@kernel.org> wrote:

> Hi Steve,
> 
> On Wed, May 14, 2025 at 01:27:20PM -0400, Steven Rostedt wrote:
> > On Tue, 13 May 2025 18:34:35 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >   
> > > This has modifications in x86 and I would like it to go through the x86
> > > tree. Preferably it can go into this merge window so we can focus on getting
> > > perf and ftrace to work on top of this.  
> > 
> > I think it may be best for me to remove the two x86 specific patches, and
> > rebuild the ftrace work on top of it. For testing, I'll just keep those two
> > patches in my tree locally, but then I can get this moving for this merge
> > window.  
> 
> Maybe I asked this before but I don't remember if I got the answer. :)
> How does it handle task exits as it won't go to userspace?  I guess it'll
> lose user callstacks for exit syscalls and other termination paths.
> 
> Similarly, it will miss user callstacks in the samples at the end of
> profiling if the target tasks remain in the kernel (or they sleep).
> It looks like a fundamental limitation of the deferred callchains.
> 

Ah, I think I forgot about that. I believe the exit path can also be a
faultable path. All it needs is a hook to do the exit. Is there any
"task work" clean up on exit? I need to take a look.

-- Steve
Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
Posted by Ingo Molnar 8 months, 3 weeks ago
* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 16 May 2025 16:39:56 -0700
> Namhyung Kim <namhyung@kernel.org> wrote:
> 
> > Hi Steve,
> > 
> > On Wed, May 14, 2025 at 01:27:20PM -0400, Steven Rostedt wrote:
> > > On Tue, 13 May 2025 18:34:35 -0400
> > > Steven Rostedt <rostedt@goodmis.org> wrote:
> > >   
> > > > This has modifications in x86 and I would like it to go through the x86
> > > > tree. Preferably it can go into this merge window so we can focus on getting
> > > > perf and ftrace to work on top of this.  
> > > 
> > > I think it may be best for me to remove the two x86 specific patches, and
> > > rebuild the ftrace work on top of it. For testing, I'll just keep those two
> > > patches in my tree locally, but then I can get this moving for this merge
> > > window.  
> > 
> > Maybe I asked this before but I don't remember if I got the answer. :)
> > How does it handle task exits as it won't go to userspace?  I guess it'll
> > lose user callstacks for exit syscalls and other termination paths.
> > 
> > Similarly, it will miss user callstacks in the samples at the end of
> > profiling if the target tasks remain in the kernel (or they sleep).
> > It looks like a fundamental limitation of the deferred callchains.
> > 
> 
> Ah, I think I forgot about that. I believe the exit path can also be a
> faultable path. All it needs is a hook to do the exit. Is there any
> "task work" clean up on exit? I need to take a look.

Could you please not rush this facility into v6.16? It barely had any 
design review so far, and I'm still not entirely sure about the 
approach.

Thanks,

	Ingo
Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
Posted by Steven Rostedt 8 months, 3 weeks ago
On Tue, 20 May 2025 11:35:56 +0200
Ingo Molnar <mingo@kernel.org> wrote:

> > Ah, I think I forgot about that. I believe the exit path can also be a
> > faultable path. All it needs is a hook to do the exit. Is there any
> > "task work" clean up on exit? I need to take a look.  
> 
> Could you please not rush this facility into v6.16? It barely had any 
> design review so far, and I'm still not entirely sure about the 
> approach.

Hi Ingo,

Note, there has been a lot of discussion on this approach, although it's
been mostly at conferences and in meetings. At GNU Cauldron in September
2024 (before Plumbers) Josh, Mathieu and I discussed it in quite some
detail. I've since been hosting a monthly meeting with engineers from
Google, Red Hat, EfficiOS, Oracle, Microsoft and others (I can invite you to
it if you would like). There are actually two meetings (one in an
Asia-friendly timezone and another in a Europe-friendly timezone). The first
patches from this went out in October 2024:

  https://lore.kernel.org/all/cover.1730150953.git.jpoimboe@kernel.org/

There wasn't much discussion on it, although Peter did reply, and I believe
that we did address all of his concerns.

Josh then changed the approach from what we originally discussed, which was
to just have each tracer attach a task_work to the task it wants a trace
from, thinking that would be good enough. But Mathieu and I found that it
doesn't work: even a perf event cannot handle this, because it would need to
keep track of several tasks that may migrate.

Let me state the goal of this work, as it started from a session at the
2022 Tracing Summit in London. We needed a way to get reliable user space
stack traces without using frame pointers. We discussed various methods,
including using eh_frame but they all had issues. We concluded that it
would be nice to have an ORC unwinder (technically it's called a stack
walker), for user space.

Then in Nov 2022, I read about "sframes" which was exactly what we were
looking for:

   https://www.phoronix.com/news/GNU-Binutils-SFrame

At FOSDEM 2023, I asked Jose Marchesi (a GCC maintainer) about sframes, and
it just happened to be one of his employees that created it (Indu Bhagat).
At Kernel Recipes 2023, Brendan Gregg, during his talk, asked if it would
be great to have user space stack walking without needing to run everything
with frame pointers. I raised my hand and asked if he had heard about
"sframes", which he had not. I explained what it was and he was very
interested. After that talk, Josh Poimboeuf came up to me and asked if he
could implement this, to which I agreed.

One thing that is needed for sframes is that it has to be done in a
faultable context. The sframe sections are like ORC, where it has lookup
tables, but they live in the user space address, and because they can be
large, they can't be locked into memory. This brings up the deferred
unwinding aspect.
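
As a rough illustration of what "lookup tables" means here, the sketch below finds the table row that covers a given program counter and derives where the return address would be stored. This is a deliberately simplified model and not the real SFrame or ORC layout: `struct frame_row`, `lookup_row()` and `ra_slot()` are hypothetical names, and the table is an ordinary in-memory array, whereas the real tables live in user space memory and can fault when read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row: valid from start_pc up to the next row's start_pc. */
struct frame_row {
	uint64_t start_pc;	/* first address this row applies to */
	int32_t cfa_off;	/* CFA = sp + cfa_off */
	int32_t ra_off;		/* return address is stored at CFA + ra_off */
};

/*
 * Binary search for the last row whose start_pc <= pc.
 * The table must be sorted by start_pc. Returns NULL if pc
 * lies before the first row.
 */
static const struct frame_row *
lookup_row(const struct frame_row *tbl, size_t n, uint64_t pc)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].start_pc <= pc)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo ? &tbl[lo - 1] : NULL;
}

/* Address of the stack slot that holds the caller's return address. */
static uint64_t ra_slot(const struct frame_row *row, uint64_t sp)
{
	assert(row != NULL);
	return sp + (uint64_t)((int64_t)row->cfa_off + row->ra_off);
}
```

A stack walker would repeat this lookup for each frame, which is why the tables have to be readable at unwind time.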

At the LSFMMBPF 2023 conference, Indu and I ran a session on sframes
to get a better idea on how to implement this. The perf user space stack
tracing was mentioned, where if frame pointers are not there, perf may copy
thousands of bytes of the user space stack into the perf buffer. If there's
a long system call, it may do this several times, and because the user
space stack does not change while the task is in the kernel, this is
thousands of bytes of duplicate data. I asked Jiri Olsa "why not just defer
the stack trace and only make a single entry", and I believe he replied "we
are thinking about it", but nothing further came of it.
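
The saving can be sketched with a small model. Everything here is an assumption made for illustration: the 8192-byte copy size is a made-up figure standing in for "thousands of bytes", and `struct task_state`, `sample_deferred()` and `return_to_user()` are hypothetical names. The point is only that several samples taken during one kernel entry share one identifier and cause one user unwind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_COPY_BYTES 8192	/* hypothetical per-sample stack copy size */

struct task_state {
	int pending;		/* a deferred user unwind has been requested */
	uint64_t cookie;	/* identifies the current kernel entry */
	unsigned int unwinds_done;	/* user unwinds actually performed */
};

/* Classic approach: every sample copies a chunk of the user stack. */
static size_t classic_bytes(unsigned int samples)
{
	return (size_t)samples * STACK_COPY_BYTES;
}

/* Deferred approach: a sample only marks the task and notes the cookie. */
static uint64_t sample_deferred(struct task_state *t)
{
	t->pending = 1;
	return t->cookie;
}

/* On return to user space: unwind once if requested, then start a new entry. */
static void return_to_user(struct task_state *t)
{
	if (t->pending) {
		t->unwinds_done++;
		t->pending = 0;
	}
	t->cookie++;
}
```

With three samples during one long system call, the classic model copies three times the stack chunk, while the deferred model records three samples carrying the same cookie and performs a single unwind.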

Now, there's a serious push to get sframes upstream, and that will take a
few steps. These are:

1) Create a new user unwind stack call that is expected to be called in
   faultable context. If a tracer (perf, ftrace, BPF or whatever) knows
   it's in a faultable context and wants a trace, it can simply ask for it.

   It was also asked to have a user space system call that can get this
   trace so that a function in a user space application can see what is
   calling it.

2) Create a deferred stack unwinding infrastructure that can be used by
   many clients (perf, ftrace, BPF or whatever) and called in any context
   (interrupt or NMI).

   As the unwind stack call needs to be in a faultable context, and it is
   very common to want both a kernel stack trace along with a user space
   stack trace and this can happen in an NMI, allow the kernel stack trace
   to be executed and delay the user stack trace.

   The accounting for this is not easy, as it has a many to many
   relationship. You could have perf, ftrace and BPF all asking for a
   delayed stack trace and they all need to be called. But each could be
   wanting a stack trace from a different set of tasks. Keeping track of
   which tracer gets a callback for which task is where this patch set
   comes in. Or at least just the infrastructure part.

3) Add sframes.

   The final part is to get this working with sframes, where a distro
   builds all its applications with sframes enabled and then perf, ftrace,
   BPF or whatever gets access to it through this interface.
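
The many-to-many accounting in step 2 can be sketched in user space with one bit per registered tracer and one mask per task. This is a minimal model, not the code in kernel/unwind/deferred.c: `tracer_register()`, `request_deferred()`, `run_deferred()`, `count_cb()` and the layout of the mask are hypothetical, chosen only to show how a per-task bitmask can record which tracers want a callback for which task.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TRACERS 63			/* bit 63 is kept as a "pending" flag */
#define PENDING_BIT (1ULL << 63)

typedef void (*unwind_cb_t)(int tracer_bit, int task_id);

struct tracer {
	unwind_cb_t func;
	int in_use;
};

struct task {
	int id;
	uint64_t unwind_mask;	/* which tracers asked for this task */
};

static struct tracer tracers[MAX_TRACERS];
static int hits[MAX_TRACERS];	/* callbacks seen per tracer, for the demo */

/* Example callback: just count how often each tracer is called. */
static void count_cb(int tracer_bit, int task_id)
{
	(void)task_id;
	hits[tracer_bit]++;
}

/* Claim the first free bit for a new tracer; -1 if none is left. */
static int tracer_register(unwind_cb_t func)
{
	for (int bit = 0; bit < MAX_TRACERS; bit++) {
		if (!tracers[bit].in_use) {
			tracers[bit].in_use = 1;
			tracers[bit].func = func;
			return bit;
		}
	}
	return -1;
}

/* May be called from any context: only records the request. */
static void request_deferred(struct task *t, int bit)
{
	assert(bit >= 0 && bit < MAX_TRACERS);
	t->unwind_mask |= (1ULL << bit) | PENDING_BIT;
}

/* On the way back to user space: call each requesting tracer once. */
static int run_deferred(struct task *t)
{
	uint64_t bits = t->unwind_mask & ~PENDING_BIT;
	int called = 0;

	t->unwind_mask = 0;
	for (int bit = 0; bit < MAX_TRACERS; bit++) {
		if ((bits & (1ULL << bit)) && tracers[bit].in_use) {
			tracers[bit].func(bit, t->id);
			called++;
		}
	}
	return called;
}
```

If two tracers both request a trace for one task and only the second requests one for another task, the first task triggers two callbacks and the second triggers one, no matter how many times each request was repeated.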

There's quite a bit of momentum in work happening today that is being built
expecting us to get to step 3. There's no use in adding sframes to
applications if the kernel can't read them. The only way to read them is to
have this deferred infrastructure.

We've discussed the current design quite a bit, but until there are actual
users starting to build on top of it, all the corner cases may not come
out. That's why I'm suggesting that we just get the basic infrastructure
in this merge window, where it's not fully enabled (there are no users of
it). We can then have several users build on top of it in the next merge
window to see if anything breaks.

As perf has the biggest user space ABI, where the delayed stack trace may
be in a different event buffer than where the kernel stack trace occurred
(due to migration), it's the one I'm most concerned about getting right,
as once it is exposed to user space, it can never change. That's the one I
do want to focus on the most. But it shouldn't delay getting the non user
space visible aspect of the kernel side moving forward. If we find an
issue, it can always be changed because it doesn't affect user space.
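
One way to picture the merging problem on the tool side is sketched below. It is purely hypothetical: the record layouts and `match_deferred()` are invented for illustration and do not describe any existing or proposed perf ABI. It assumes the kernel sample and the later user stack record both carry the task id and the timestamp handed out when the deferred trace was requested, so they can be paired even if they land in different buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record written when the kernel stack is sampled. */
struct kernel_sample {
	int tid;
	uint64_t timestamp;	/* identifies the deferred request */
};

/* Hypothetical record written later, possibly to another buffer. */
struct deferred_user {
	int tid;
	uint64_t timestamp;	/* same value as in the kernel sample */
	int nr_frames;
};

/*
 * Find the deferred user stack that belongs to a kernel sample by
 * matching both task id and timestamp. Returns NULL if it has not
 * arrived (for example, the task never returned to user space).
 */
static const struct deferred_user *
match_deferred(const struct kernel_sample *ks,
	       const struct deferred_user *du, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (du[i].tid == ks->tid && du[i].timestamp == ks->timestamp)
			return &du[i];
	}
	return NULL;
}
```

A NULL result corresponds to the limitation raised earlier in the thread: samples from a task that stays in the kernel have no user stack to merge.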

I've been testing this on the ftrace side and so far, everything works
fine. But the ftrace side has the deferred trace always in the same
instance buffer as where it was triggered, as it doesn't depend on user
space API for that.

I've also been running this on perf and it too is working well. But I don't
know the perf interface well enough to make sure there aren't other corner
cases that I may be missing.

For 6.16, I would just like to get the common infrastructure in the kernel
so there's no dependency on different tracers. This patch set adds the
changes but does not enable it yet. This way, perf, ftrace and BPF can all
work to build on top of the changes without needing to share code.

There are only two patches that touch x86 here. I have another patch series
that removes them and implements this for ftrace, but because no
architecture has it enabled, it's just dead code. But it would also allow
building on top of the infrastructure, where any architecture could enable
it. Kind of like what PREEMPT_RT did.

-- Steve
Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
Posted by Steven Rostedt 8 months, 3 weeks ago
On Tue, 20 May 2025 11:57:21 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> For 6.16, I would just like to get the common infrastructure in the kernel
> so there's no dependency on different tracers. This patch set adds the
> changes but does not enable it yet. This way, perf, ftrace and BPF can all
> work to build on top of the changes without needing to share code.

Another thing that would work for me is not to push it this merge window.
Instead, if you could make a separate branch with this code, based on top
of v6.15 when it is released, that other subsystems could work on top of,
that would be great!

-- Steve