[PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure

Steven Rostedt posted 14 patches 3 months, 1 week ago
There is a newer version of this series
MAINTAINERS                              |   8 +
arch/Kconfig                             |  11 +
arch/x86/Kconfig                         |   2 +
arch/x86/include/asm/unwind_user.h       |  42 ++++
arch/x86/include/asm/unwind_user_types.h |  17 ++
arch/x86/kernel/stacktrace.c             |  28 +++
include/asm-generic/Kbuild               |   2 +
include/asm-generic/unwind_user.h        |   5 +
include/asm-generic/unwind_user_types.h  |   5 +
include/linux/entry-common.h             |   2 +
include/linux/sched.h                    |   5 +
include/linux/unwind_deferred.h          |  79 +++++++
include/linux/unwind_deferred_types.h    |  20 ++
include/linux/unwind_user.h              |  45 ++++
include/linux/unwind_user_types.h        |  39 ++++
kernel/Makefile                          |   1 +
kernel/exit.c                            |   2 +
kernel/fork.c                            |   4 +
kernel/unwind/Makefile                   |   1 +
kernel/unwind/deferred.c                 | 357 +++++++++++++++++++++++++++++++
kernel/unwind/user.c                     | 130 +++++++++++
21 files changed, 805 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user.h
create mode 100644 arch/x86/include/asm/unwind_user_types.h
create mode 100644 include/asm-generic/unwind_user.h
create mode 100644 include/asm-generic/unwind_user_types.h
create mode 100644 include/linux/unwind_deferred.h
create mode 100644 include/linux/unwind_deferred_types.h
create mode 100644 include/linux/unwind_user.h
create mode 100644 include/linux/unwind_user_types.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/deferred.c
create mode 100644 kernel/unwind/user.c
[PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
Posted by Steven Rostedt 3 months, 1 week ago
[
   UPDATE: Florian Weimer is looking into having Fedora built with SFrames
           so that once this becomes available in the kernel, it will
           also be usable in Fedora.
]

This is the first patch series of a set that will make it possible to use
SFrames[1] in the Linux kernel. Here is a quick recap of the motivation for
doing this.

Currently the only way to get a user space stack trace from a stack
walk (and not just by copying a large amount of the user stack into the
kernel ring buffer) is to use frame pointers. This has a few issues. The
biggest one is that compiling frame pointers into every application and
library has been shown to cause performance overhead.

Another issue is that the format of the frames may not always be consistent
between different compilers, and some architectures (s390) have no defined
format to do a reliable stack walk. The only way to perform user space
profiling on these architectures is to copy the user stack into the kernel
buffer.

SFrames is now supported in gcc binutils and will soon also be supported
by LLVM. SFrames acts more like ORC, and lives in the ELF executable
file as its own section. Like ORC it has two tables. The first table is
sorted by instruction pointer (IP). Looking up the current IP in the first
table gives an entry that takes you to the second table, which tells you
where the return address of the current function is located. You can then
look that return address up in the first table to find the return address
of the calling function, and so on. This performs a user space stack walk.
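
A minimal sketch of the two-table walk described above, as self-contained
user space C. The table layouts (struct func_entry, struct frame_row), the
helper names and the stack model are simplified stand-ins invented for
illustration. They are not the real SFrame format or any kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical first table: one entry per function, sorted by start IP. */
struct func_entry {
	uint64_t start;		/* first IP of the function */
	uint64_t end;		/* one past the last IP of the function */
	size_t   row;		/* index into the second table */
};

/* Hypothetical second table: where the return address lives. */
struct frame_row {
	size_t ra_slot;		/* stack words from SP to the return address */
};

/* Binary search of the first table for the entry covering @ip. */
static const struct func_entry *find_func(const struct func_entry *tbl,
					  size_t n, uint64_t ip)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ip < tbl[mid].start)
			hi = mid;
		else if (ip >= tbl[mid].end)
			lo = mid + 1;
		else
			return &tbl[mid];
	}
	return NULL;
}

/*
 * Walk a simulated stack (an array of words, index 0 is the top).
 * Records each IP found and returns the number of entries stored.
 * The walk stops at the first IP that is not covered by the first table.
 */
size_t toy_unwind(const struct func_entry *funcs, size_t nr_funcs,
		  const struct frame_row *rows,
		  const uint64_t *stack, size_t stack_words,
		  uint64_t ip, size_t sp,
		  uint64_t *trace, size_t max_entries)
{
	size_t n = 0;

	assert(trace != NULL);

	while (n < max_entries) {
		const struct func_entry *f = find_func(funcs, nr_funcs, ip);
		size_t ra_idx;

		if (!f)
			break;
		trace[n++] = ip;

		ra_idx = sp + rows[f->row].ra_slot;
		if (ra_idx >= stack_words)
			break;
		ip = stack[ra_idx];	/* return address = caller's IP */
		sp = ra_idx + 1;	/* caller's SP is just past it */
	}
	return n;
}
```

Each loop iteration is one "first table, then second table" step: the IP
selects a function entry, the entry's row says where the return address
sits on the stack, and that return address becomes the IP for the next
lookup.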

Now because the SFrame section lives in the ELF file it needs to be faulted
into memory when it is used. This means that walking the user space stack
requires being in a faultable context. As profilers like perf request a stack
trace in interrupt or NMI context, the walk cannot be done at the time it is
requested. Instead it must be deferred until it is safe to fault in user
space. One place this is known to be safe is when the task is about to return
back to user space.

Josh originally wrote the PoC of this code, and the last version he posted
was back in January:

   https://lore.kernel.org/all/cover.1737511963.git.jpoimboe@kernel.org/

That series contained everything from adding new faultable user space
stack walking code, deferring the stack walk, implementing sframes,
and fixing up x86 (VDSO), to adding both the kernel and user space sides
of perf to make it work. But Josh ran out of time to work on it and
I picked it up. As there are several parts to this series, I also broke
it up, especially since there are parts of his series that do not depend
on each other.

This series contains only the core infrastructure that all the rest needs.
Of the 14 patches, only 3 are x86 specific. The rest is simply the unwinding
code that s390 can build against. I moved the 3 x86 specific patches to the
end of the series too.

Since multiple tracers (like perf, ftrace, bpf, etc) can attach to the
deferred unwinder and each of these tracers can attach to some or all
of the tasks to trace, there is a many to many relationship. This relationship
needs to be made in interrupt or NMI context so it cannot rely on any
allocation. To handle this, a bitmask is used. There's a global bitmask the
size of a long, from which a single bit is allocated when a tracer registers
for deferred stack traces. The task struct also has a bitmask: when a request
comes in from one of the tracers to have a deferred stack trace, the
corresponding bit for that tracer is set in the task's mask. As one of the
bits represents that a request has been made, at most 31 tracers on 32 bit
systems or 63 tracers on 64 bit systems may be registered at a given time.
This should not be an issue as only one perf application, or ftrace instance
should request a bit. BPF should also use only one bit and handle any
multiplexing for its users.
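
A minimal sketch of that bitmask scheme in user space C. The function
names, the choice of bit 0 as the reserved "request made" bit, and the
struct task stand-in are assumptions made for illustration. The real code
also has to be NMI-safe, which this sketch does not attempt (it uses plain,
non-atomic operations).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define PENDING_BIT	0	/* assumption: reserved "request made" bit */
#define RESERVED_BITS	(1UL << PENDING_BIT)

/* Global mask of the bits that have been handed out to tracers. */
static unsigned long tracer_ids = RESERVED_BITS;

/* Stand-in for the per-task state; only the mask matters here. */
struct task {
	unsigned long unwind_mask;
};

/* Returns the bit assigned to a new tracer, or -1 if none is free. */
int tracer_register(void)
{
	for (unsigned int bit = 0; bit < BITS_PER_LONG; bit++) {
		if (!(tracer_ids & (1UL << bit))) {
			tracer_ids |= 1UL << bit;
			return (int)bit;
		}
	}
	return -1;
}

/* Gives a tracer's bit back; the reserved bit can never be released. */
void tracer_unregister(int bit)
{
	if (bit > PENDING_BIT && (unsigned int)bit < BITS_PER_LONG)
		tracer_ids &= ~(1UL << bit);
}

/*
 * Records that the tracer owning @bit wants a deferred stack trace
 * for @t. No allocation is needed, only bit operations.
 * Returns 1 if this is the tracer's first request, 0 if already set.
 */
int task_request(struct task *t, int bit)
{
	unsigned long want;

	assert(bit > PENDING_BIT && (unsigned int)bit < BITS_PER_LONG);
	want = 1UL << bit;

	if (t->unwind_mask & want)
		return 0;
	t->unwind_mask |= want | RESERVED_BITS;
	return 1;
}
```

Because one bit is reserved, tracer_register() in this sketch can succeed
at most BITS_PER_LONG - 1 times, which matches the 31 or 63 tracer limit
described above.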

When the first request is made for a deferred stack trace from a task, it will
take a timestamp. This timestamp will be used as the identifier for the user
space stack trace. As the user space stack trace does not change while the
task is in the kernel, requests that come in after the first request and
before the task goes back to user space will get the same timestamp. This
timestamp also serves the purpose of knowing how far back a given user space
stack trace goes. If there are dropped events, and the events dropped miss a
task entering user space and coming back to the kernel, the new stack trace
taken when it goes back to user space should not be used with the events
before the drop happened.
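
A minimal sketch of the timestamp-as-identifier idea in user space C. The
struct, the function names and the counter used as a clock are all
hypothetical; the sketch only shows that every request made during one
trip through the kernel sees the same value, and that a later trip gets a
different one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the per-task unwind state. */
struct task_unwind {
	uint64_t timestamp;	/* 0: no request since entering the kernel */
};

/* Hypothetical monotonic clock; a counter keeps the sketch deterministic. */
static uint64_t fake_clock;

static uint64_t clock_now(void)
{
	return ++fake_clock;
}

/* The first request takes the timestamp; later requests reuse it. */
uint64_t unwind_request_timestamp(struct task_unwind *t)
{
	assert(t != NULL);

	if (!t->timestamp)
		t->timestamp = clock_now();
	return t->timestamp;
}

/* Called when the task returns to user space: the identifier expires. */
void unwind_exit_to_user(struct task_unwind *t)
{
	assert(t != NULL);
	t->timestamp = 0;
}

/* A deferred stack trace belongs to an event only if the stamps match. */
int trace_matches_event(uint64_t event_ts, uint64_t trace_ts)
{
	return event_ts == trace_ts;
}
```

A consumer that lost events can use a check like trace_matches_event() to
avoid attaching a newer user space stack trace to events recorded before
the drop.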

When a tracer makes a request, it gets this timestamp, and the bit for the
requesting tracer is set in the task's bitmask. A task work is used to have
the task do the callbacks before it goes back to user space. When it does, it
will scan its bitmask and call all the callbacks for the tracers that have
their corresponding bit set. The callback will receive the user space stack
trace as well as the timestamp that was used.
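
A minimal sketch of that dispatch step in user space C. The callback
signature, the callbacks[] table and count_cb() are hypothetical names for
illustration; the actual series runs this from task work and protects the
callback list with SRCU, neither of which is modeled here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACERS	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical callback type: gets the stack trace and its timestamp. */
typedef void (*unwind_cb)(int bit, const uint64_t *trace, size_t nr,
			  uint64_t timestamp);

/* One callback slot per tracer bit. */
unwind_cb callbacks[MAX_TRACERS];

/* Example callback that only records that it ran and with which stamp. */
int calls_seen[MAX_TRACERS];
uint64_t last_ts;

void count_cb(int bit, const uint64_t *trace, size_t nr, uint64_t timestamp)
{
	(void)trace;
	(void)nr;
	calls_seen[bit]++;
	last_ts = timestamp;
}

/*
 * Run on the way back to user space: call the callback of every tracer
 * whose bit is set in the task's mask, then leave the mask cleared.
 */
void run_deferred_callbacks(unsigned long *task_mask, const uint64_t *trace,
			    size_t nr, uint64_t timestamp)
{
	unsigned long pending;

	assert(task_mask != NULL);
	pending = *task_mask;
	*task_mask = 0;

	for (unsigned int bit = 0; bit < MAX_TRACERS; bit++) {
		if ((pending & (1UL << bit)) && callbacks[bit])
			callbacks[bit]((int)bit, trace, nr, timestamp);
	}
}
```

The stack trace is produced once and handed to every requesting tracer, so
the cost of the user space walk does not grow with the number of tracers
that asked for it.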

That's the basic idea. Obviously there's more to it than the above
explanation, but each patch explains what it is doing, and it is broken up
step by step.

I run two SFrame meetings once a month (one in Asia friendly timezone and
the other in Europe friendly). We have developers from Google, Oracle, Red Hat,
IBM, EfficiOS, Meta, Microsoft, and more that attend. (If anyone is interested
in attending let me know). I have been running these since December of 2024.
Last year at GNU Cauldron, a few of us got together to discuss the design
and such. We are pretty confident that the current design is sound. We have
working code on top of this and have been testing it.

Since the s390 folks want to start working on this (they have patches to
sframes already from working on the prototypes), I would like this series
to be a separate branch based on top of v6.16-rc4. Then all the subsystems
that want to work on top of this can do so, as there's no real dependency
between them.

I have more patches on top of this series that add perf support, ftrace
support, sframe support and the x86 fix ups (for VDSO). Each of those
patch series can be worked on independently, but they all depend on this
series (although the x86 specific patches at the end aren't necessarily
needed, at least for other architectures).

Please review, and if you are happy with them, let's get them in a branch
that we all can use. I'm happy to take it in my tree if I can get acks on the
x86 code. Or it can be in the tip tree as a separate branch on top of 6.16-rc4
and I'll just base my work on top of that. Doesn't matter either way.

This is based on top of v6.16-rc4 and the code is here:

  git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git unwind/core

  Head SHA1: 649fe8a37fbb8bc7eb1d420630523feb4f44d1d7

Changes since v11: https://lore.kernel.org/linux-trace-kernel/20250625225600.555017347@goodmis.org/

- Add USED bit to the task's unwind_mask to know if the faultable user stack
  function was used or not. This allows for only having to check one value on
  the way back to user space to know if it has to do more work or not.

- Fix header macro protection name to include X86 (Ingo Molnar)

- Use insn_get_seg_base() to get segment registers instead of using the
  function perf uses and making it global. Also as that function doesn't
  look to have a requirement to disable interrupts, the scoped_guard(irqsave)
  is removed.

- Check return code of insn_get_seg_base() for the unlikely event that it
  returns invalid (-1).

- Moved arch_unwind_user_init() into stacktrace.c. To use
  insn_get_seg_base(), it must include insn-eval.h, which defines
  pt_regs_offset(). That name is also used in the perf generic code as an
  array, so including insn-eval.h in the header file causes a build
  conflict.

- Update the comments for arch_unwind_user_init/next to explain that a macro
  needs to be defined with those names if they are going to be used.

Josh Poimboeuf (7):
      unwind_user: Add user space unwinding API
      unwind_user: Add frame pointer support
      unwind_user: Add compat mode frame pointer support
      unwind_user/deferred: Add unwind cache
      unwind_user/deferred: Add deferred unwinding interface
      unwind_user/x86: Enable frame pointer unwinding on x86
      unwind_user/x86: Enable compat mode frame pointer unwinding on x86

Steven Rostedt (7):
      unwind_user/deferred: Add unwind_user_faultable()
      unwind_user/deferred: Make unwind deferral requests NMI-safe
      unwind deferred: Use bitmask to determine which callbacks to call
      unwind deferred: Use SRCU unwind_deferred_task_work()
      unwind: Clear unwind_mask on exit back to user space
      unwind: Add USED bit to only have one conditional on way back to user space
      unwind: Finish up unwind when a task exits

Re: [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
Posted by Linus Torvalds 3 months, 1 week ago
On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> This is the first patch series of a set that will make it possible to be able
> to use SFrames[1] in the Linux kernel. A quick recap of the motivation for
> doing this.

You have a '[1]' to indicate there's a link to what SFrames are.

But no such link actually exists in this email. Hmm?

           Linus
Re: [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
Posted by Steven Rostedt 3 months, 1 week ago
On Mon, 30 Jun 2025 19:06:12 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > This is the first patch series of a set that will make it possible to be able
> > to use SFrames[1] in the Linux kernel. A quick recap of the motivation for
> > doing this.  
> 
> You have a '[1]' to indicate there's a link to what SFrames are.
> 
> But no such link actually exists in this email. Hmm?
> 

Oops. I cut and pasted from v11:

  https://lore.kernel.org/linux-trace-kernel/20250625225600.555017347@goodmis.org/

But ended the cut too early. I stopped at my sig:

---
  Thanks!

  -- Steve


  [1] https://sourceware.org/binutils/wiki/sframe


-- Steve
Re: [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
Posted by Kees Cook 3 months, 1 week ago
On Mon, Jun 30, 2025 at 10:45:39PM -0400, Steven Rostedt wrote:
> On Mon, 30 Jun 2025 19:06:12 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > This is the first patch series of a set that will make it possible to be able
> > > to use SFrames[1] in the Linux kernel. A quick recap of the motivation for
> > > doing this.  
> > 
> > You have a '[1]' to indicate there's a link to what SFrames are.
> [...]
>   [1] https://sourceware.org/binutils/wiki/sframe

Okay, I've read the cover letter and this wiki page, but I am dense: why
does the _kernel_ want to do this? Shouldn't it only be userspace that
cares about userspace unwinding? I don't use perf, ftrace, and ebpf
enough to make this obvious to me, I guess. ;)

-- 
Kees Cook
Re: [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
Posted by Steven Rostedt 3 months, 1 week ago
On Tue, 1 Jul 2025 15:49:23 -0700
Kees Cook <kees@kernel.org> wrote:

> On Mon, Jun 30, 2025 at 10:45:39PM -0400, Steven Rostedt wrote:
> > On Mon, 30 Jun 2025 19:06:12 -0700
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >   
> > > On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:  
> > > >
> > > > This is the first patch series of a set that will make it possible to be able
> > > > to use SFrames[1] in the Linux kernel. A quick recap of the motivation for
> > > > doing this.    
> > > 
> > > You have a '[1]' to indicate there's a link to what SFrames are.  
> > [...]
> >   [1] https://sourceware.org/binutils/wiki/sframe  
> 
> Okay, I've read the cover letter and this wiki page, but I am dense: why
> does the _kernel_ want to do this? Shouldn't it only be userspace that
> cares about userspace unwinding? I don't use perf, ftrace, and ebpf
> enough to make this obvious to me, I guess. ;)
> 

It's how perf does profiling. It needs to walk the user space stack to see
what functions are being called. Ftrace can do the same thing, but it is not
used as much because it doesn't have the tooling (yet) to figure out what the
user space addresses mean (but I'm working on fixing that).

And BPF has commands that it can do, but I don't know BPF enough to comment.

The big user is perf with profiling. It currently uses frame pointers, but
because of the way frame pointers are set up, it misses a lot of the leaf
functions when the interrupt triggers (sframes does not have that
problem). Also, if frame pointers are not configured, perf may just copy
thousands of bytes of the user space stack into the kernel ring buffer and
then parse it later (this isn't used often due to the overhead).

Then there's s390, which doesn't have frame pointers and can only copy
thousands of bytes to do any meaningful user space profiling.

Note, this has been a long standing issue. In 2022, we had a BOF on
this, looking for something like ORC in user space as it would solve lots
of our issues. Then in December of that same year, we heard about SFrames.

At Kernel Recipes in 2023, Brendan Gregg said during his talk that
there needs to be a better way to do profiling of user space from the
kernel without frame pointers. I mentioned SFrames and he was quite excited
to hear about it. That's also when Josh, who was in attendance, asked
if he could do the implementation of it in the kernel!

Anyway, yeah, it's something that has a ton of interest, as it's the way
for tools like perf to give nice graphs of where user space bottlenecks
exist.

-- Steve
Re: [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
Posted by Kees Cook 3 months, 1 week ago
On Tue, Jul 01, 2025 at 07:26:19PM -0400, Steven Rostedt wrote:
> On Tue, 1 Jul 2025 15:49:23 -0700 Kees Cook <kees@kernel.org> wrote:
> > Okay, I've read the cover letter and this wiki page, but I am dense: why
> > does the _kernel_ want to do this? Shouldn't it only be userspace that
> > cares about userspace unwinding? I don't use perf, ftrace, and ebpf
> > enough to make this obvious to me, I guess. ;)
> [...]
> Anyway, yeah, it's something that has a ton of interest, as it's the way
> for tools like perf to give nice graphs of where user space bottlenecks
> exist.

Right! Yeah, I know it's very wanted -- I wasn't saying "don't do this in
the kernel", but quite literally, "*I* am missing something about why
this is so important." :) And thank you, now I see: the sampling-based
profiling of userspace must happen via the kernel.

-- 
Kees Cook
Re: [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
Posted by Mathieu Desnoyers 3 months, 1 week ago
On 2025-07-02 10:56, Kees Cook wrote:
> On Tue, Jul 01, 2025 at 07:26:19PM -0400, Steven Rostedt wrote:
>> On Tue, 1 Jul 2025 15:49:23 -0700 Kees Cook <kees@kernel.org> wrote:
>>> Okay, I've read the cover letter and this wiki page, but I am dense: why
>>> does the _kernel_ want to do this? Shouldn't it only be userspace that
>>> cares about userspace unwinding? I don't use perf, ftrace, and ebpf
>>> enough to make this obvious to me, I guess. ;)
>> [...]
>> Anyway, yeah, it's something that has a ton of interest, as it's the way
>> for tools like perf to give nice graphs of where user space bottlenecks
>> exist.
> 
> Right! Yeah, I know it's very wanted -- I wasn't saying "don't this in
> the kernel", but quite literally, "*I* am missing something about why
> this is so important." :) And thank you, now I see: the sampling-based
> profiling of userspace must happen via the kernel.
> 

I should add that once we have this in place for perf sampling,
it enables the following additional use-cases:

- Sample stack traces from specific tracepoints, e.g. system call
   entry. This allows associating the kernel tracepoints with their
   userspace call chain (causality) without requiring tracing of
   userspace.

- Implement a system call that allows a userspace thread to use the
   kernel stack sampling facilities on itself rather than reimplement
   the stack walk and sframe registry handling + decoding on the
   userspace side.

For this last point, it's only relevant because the infrastructure
will already be in place for stack sampling from the kernel. So we'd
eliminate duplication by allowing its use from userspace as well.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com