[PATCH v11 00/15] unwind_deferred: Implement sframe handling

Jens Remus posted 15 patches 3 months, 2 weeks ago
There is a newer version of this series
[PATCH v11 00/15] unwind_deferred: Implement sframe handling
Posted by Jens Remus 3 months, 2 weeks ago
This is the implementation of parsing the SFrame section in an ELF file.
It's a continuation of Josh's and Steve's last work that can be found
here:

   https://lore.kernel.org/all/cover.1737511963.git.jpoimboe@kernel.org/
   https://lore.kernel.org/all/20250827201548.448472904@kernel.org/

Currently the only way to get a user space stack trace from a stack
walk (rather than copying a large amount of the user stack into the kernel
ring buffer) is to use frame pointers. This has a few issues. The biggest
one is that compiling frame pointers into every application and library
has been shown to cause performance overhead.

Another issue is that the format of the frames may not always be consistent
between different compilers, and some architectures (s390) have no defined
format for doing a reliable stack walk. The only way to perform user space
profiling on these architectures is to copy the user stack into the kernel
buffer.

SFrames[1] are now supported in GNU Binutils and soon will also be supported
by LLVM. SFrames act more like ORC, and live in the ELF executable file as
their own section. Like ORC, SFrame has two tables. The first table is
sorted by instruction pointer (IP). Looking up the current IP in the first
table leads to an entry in the second table, which tells you where the
return address of the current function is located. That return address can
then be looked up in the first table to find the return address of the
calling function, and so on. This performs a user space stack walk.
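
As an illustration only, the lookup loop described above can be sketched in
plain C. This is not the kernel code and not the real SFrame layout: the
struct names (fde, fre), the fields, and walk_user_stack() are hypothetical,
and the user stack is modeled as an array of 8-byte words where address N is
word N/8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical first-table entry, sorted by function start address. */
struct fde {
	uint64_t start;   /* first IP of the function */
	uint64_t size;    /* size of the function in bytes */
	size_t fre_idx;   /* index of the matching second-table entry */
};

/* Hypothetical second-table entry (one per function, for simplicity). */
struct fre {
	int32_t cfa_off;  /* CFA = SP + cfa_off */
	int32_t ra_off;   /* return address is stored at CFA + ra_off */
};

/* Binary search the sorted first table for the function containing ip. */
static const struct fde *find_fde(const struct fde *tbl, size_t n, uint64_t ip)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].start <= ip)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo == 0)
		return NULL;
	return (ip - tbl[lo - 1].start < tbl[lo - 1].size) ? &tbl[lo - 1] : NULL;
}

/* Walk the simulated stack; returns the number of IPs stored in trace. */
size_t walk_user_stack(const struct fde *fdes, size_t nfde,
		       const struct fre *fres,
		       const uint64_t *stack, size_t nwords,
		       uint64_t ip, uint64_t sp,
		       uint64_t *trace, size_t max)
{
	size_t n = 0;

	while (n < max) {
		const struct fde *f = find_fde(fdes, nfde, ip);
		const struct fre *r;
		uint64_t cfa, ra_addr;

		if (!f)
			break;          /* IP not covered: stop the walk */
		trace[n++] = ip;

		r = &fres[f->fre_idx];
		cfa = sp + (uint64_t)(int64_t)r->cfa_off;
		ra_addr = cfa + (uint64_t)(int64_t)r->ra_off;
		if (ra_addr % 8 || ra_addr / 8 >= nwords)
			break;          /* would read outside the stack */

		ip = stack[ra_addr / 8];  /* return address = caller's IP */
		sp = cfa;                 /* caller's SP is this frame's CFA */
	}
	return n;
}
```

The real format has several second-table entries per function, selected by
the offset of the IP within the function; the sketch collapses that to one
entry per function to keep the loop visible.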

Now because the SFrame section lives in the ELF file it needs to be faulted
into memory when it is used. This means that walking the user space stack
requires being in a faultable context. As profilers like perf request a stack
trace in interrupt or NMI context, the walk cannot be done at the time it is
requested. Instead it must be deferred until it is safe to fault in user
space. One place this is known to be safe is when the task is about to return
to user space.
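
The deferral idea can be modeled with a toy example. This is a sketch under
stated assumptions, not the deferred unwind infrastructure itself: struct
task, request_unwind() and return_to_user() are hypothetical names, and the
faultable stack walk is replaced by a counter.

```c
#include <stdbool.h>

/* Hypothetical per-task state; not the kernel's task_struct. */
struct task {
	bool unwind_pending;    /* a walk was requested but not yet done */
	unsigned int requests;  /* requests seen in non-faultable context */
	unsigned int unwinds;   /* walks actually performed */
};

/*
 * Called where faulting is not allowed (e.g. NMI): only record
 * that a stack trace is wanted, do not touch user memory.
 */
void request_unwind(struct task *t)
{
	t->requests++;
	t->unwind_pending = true;
}

/*
 * Called on the way back to user space, where faulting is safe:
 * perform the walk once, however many requests were recorded.
 */
void return_to_user(struct task *t)
{
	if (!t->unwind_pending)
		return;
	t->unwind_pending = false;
	t->unwinds++;           /* stand-in for the faultable stack walk */
}
```

In this model several requests made before the task returns to user space
are served by a single walk, since the user stack does not change while the
task stays in the kernel.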

This series makes the deferred unwind code implement SFrames.

[1] https://sourceware.org/binutils/wiki/sframe

Changes since v10:
- Rebase on v6.17-rc1 with Peter's unwind user fixes and x86 support
  series [2] and Steve's support for the deferred unwinding infrastructure
  series in perf [3] and perf tool [4] on top.
- Support for SFrame V2 PC-relative FDE function start address. (Jens)
- Support for SFrame V2 representing RA undefined as indication for
  outermost frames. (Jens)

[2]: [PATCH 00/12] Various fixes and x86 support,
     https://lore.kernel.org/all/20250924075948.579302904@infradead.org/
[3]: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure,
     https://lore.kernel.org/all/20251007214008.080852573@kernel.org/
[4]: [PATCH v16 0/4] perf tool: Support the deferred unwinding infrastructure,
     https://lore.kernel.org/all/20250908175319.841517121@kernel.org/

Patches 1 and 2 are suggested fixups to patches from Peter's unwind user
fixes and x86 support series.  They keep the factoring out of the word
size from the frame's CFA, FP, and RA offsets local to unwind user fp, as
unwind user sframe does use absolute offsets.

Patches 3, 6, and 14 have been updated to exclusively support the recent
PC-relative SFrame FDE function start address encoding.  With Binutils 2.45
the SFrame V2 FDE function start address field value is an offset from the
field (i.e. PC-relative) instead of from the .sframe section start.  This
is indicated by the new SFrame header flag SFRAME_F_FDE_FUNC_START_PCREL.
Old SFrame V2 sections are rejected with the dynamic debug message
"bad/unsupported sframe header".
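
A minimal sketch of the PC-relative decoding, for illustration only. The
helper fde_func_start() and its parameters are hypothetical and are not
taken from the patches; the flag value 0x4 is assumed to match the
definition in Binutils' sframe.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag value assumed from Binutils' sframe.h. */
#define SFRAME_F_FDE_FUNC_START_PCREL 0x4

/*
 * Hypothetical helper: compute a function start address from an FDE.
 *   hdr_flags  - flags byte of the SFrame header
 *   field_addr - address of the FDE's function start address field
 *   field_val  - signed value stored in that field
 * Returns false for the old section-relative encoding, which this
 * sketch (like the series) does not support.
 */
bool fde_func_start(uint8_t hdr_flags, uint64_t field_addr,
		    int32_t field_val, uint64_t *start)
{
	if (!(hdr_flags & SFRAME_F_FDE_FUNC_START_PCREL))
		return false;

	/* PC-relative: the offset is applied to the field's own address. */
	*start = field_addr + (uint64_t)(int64_t)field_val;
	return true;
}
```

With the old encoding the same field value would instead be added to the
.sframe section start address, which is why the two cannot be mixed.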

Patches 9 and 10 add support to unwind user and unwind user sframe for
a recent change of the SFrame V2 format to represent an undefined
return address as an SFrame FRE without any offsets, which is used as
indication for outermost frames.  Note that currently only a development
build of Binutils mainline generates SFrame information including this
new indication for outermost frames.  SFrame information without the new
indication is still supported.  Without these patches, unwind user sframe
would identify such new SFrame FREs without any offsets as corrupted and
remove the .sframe section, causing any further stack tracing using
sframe to fail.
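
The difference in behavior can be sketched as follows. This is an
illustration, not the code from patches 9 and 10: enum fre_kind and
classify_fre() are hypothetical, and the upper bound of three offsets
(CFA, RA, FP) is an assumption made for this sketch.

```c
#include <stdbool.h>

enum fre_kind {
	FRE_NORMAL,     /* offsets present: continue unwinding */
	FRE_OUTERMOST,  /* no offsets: RA undefined, stop the walk cleanly */
	FRE_CORRUPT,    /* treat the .sframe section as unusable */
};

/*
 * Hypothetical classification of an FRE by its number of offsets.
 * supports_outermost models whether the outermost frame indication
 * is understood by the unwinder.
 */
enum fre_kind classify_fre(unsigned int num_offsets, bool supports_outermost)
{
	if (num_offsets == 0)
		return supports_outermost ? FRE_OUTERMOST : FRE_CORRUPT;
	if (num_offsets > 3)    /* assumed limit: CFA, RA and FP offsets */
		return FRE_CORRUPT;
	return FRE_NORMAL;
}
```

An unwinder without the indication ends up in the FRE_CORRUPT case for a
perfectly valid outermost frame, which is the failure mode described above.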

Regards,
Jens


Jens Remus (4):
  fixup! unwind: Implement compat fp unwind
  fixup! unwind_user/x86: Enable frame pointer unwinding on x86
  unwind_user: Stop when reaching an outermost frame
  unwind_user/sframe: Add support for outermost frame indication

Josh Poimboeuf (11):
  unwind_user/sframe: Add support for reading .sframe headers
  unwind_user/sframe: Store sframe section data in per-mm maple tree
  x86/uaccess: Add unsafe_copy_from_user() implementation
  unwind_user/sframe: Add support for reading .sframe contents
  unwind_user/sframe: Detect .sframe sections in executables
  unwind_user/sframe: Wire up unwind_user to sframe
  unwind_user/sframe/x86: Enable sframe unwinding on x86
  unwind_user/sframe: Remove .sframe section on detected corruption
  unwind_user/sframe: Show file name in debug output
  unwind_user/sframe: Add .sframe validation option
  unwind_user/sframe: Add prctl() interface for registering .sframe
    sections

 MAINTAINERS                        |   1 +
 arch/Kconfig                       |  23 ++
 arch/x86/Kconfig                   |   1 +
 arch/x86/include/asm/mmu.h         |   2 +-
 arch/x86/include/asm/uaccess.h     |  39 +-
 arch/x86/include/asm/unwind_user.h |  11 +-
 fs/binfmt_elf.c                    |  49 ++-
 include/linux/mm_types.h           |   3 +
 include/linux/sframe.h             |  60 +++
 include/linux/unwind_user_types.h  |   5 +-
 include/uapi/linux/elf.h           |   1 +
 include/uapi/linux/prctl.h         |   6 +-
 kernel/fork.c                      |  10 +
 kernel/sys.c                       |   9 +
 kernel/unwind/Makefile             |   3 +-
 kernel/unwind/sframe.c             | 615 +++++++++++++++++++++++++++++
 kernel/unwind/sframe.h             |  72 ++++
 kernel/unwind/sframe_debug.h       |  68 ++++
 kernel/unwind/user.c               |  56 ++-
 mm/init-mm.c                       |   2 +
 20 files changed, 1004 insertions(+), 32 deletions(-)
 create mode 100644 include/linux/sframe.h
 create mode 100644 kernel/unwind/sframe.c
 create mode 100644 kernel/unwind/sframe.h
 create mode 100644 kernel/unwind/sframe_debug.h

-- 
2.48.1
Re: [PATCH v11 00/15] unwind_deferred: Implement sframe handling
Posted by Andrew Morton 3 months, 2 weeks ago
On Wed, 22 Oct 2025 16:43:11 +0200 Jens Remus <jremus@linux.ibm.com> wrote:

> This is the implementation of parsing the SFrame section in an ELF file.

Presently x86_64-only, it seems.  Can we expect to see this implemented
for other architectures?

Would a selftest for this be appropriate?  To give testers some way of
exercising the code and to make life better for people who are enabling
this on other architectures.

In what tree do you anticipate this project being carried?
Re: [PATCH v11 00/15] unwind_deferred: Implement sframe handling
Posted by Steven Rostedt 3 months, 2 weeks ago
On Wed, 22 Oct 2025 13:39:32 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 22 Oct 2025 16:43:11 +0200 Jens Remus <jremus@linux.ibm.com> wrote:
> 
> > This is the implementation of parsing the SFrame section in an ELF file.  
> 
> Presently x86_64-only, it seems.  Can we expect to see this implemented
> for other architectures?

Yes, and Jens is here to port it to the s390 :-)

Currently Peter Zijlstra and I are updating the deferred unwinder. Jens is
working on getting sframes to work with it. His interest is getting it for
s390 whereas ours is for x86.

> 
> Would a selftest for this be appropriate?  To give testers some way of
> exercising the code and to make life better for people who are enabling
> this on other architectures.

Yes, we should definitely have selftests. But we are far from getting there.
One requirement is that the toolchain used to build the test must support
generating sframes.

> 
> In what tree do you anticipate this project being carried?
> 

It will likely go through either tip or my tree.

-- Steve
Re: [PATCH v11 00/15] unwind_deferred: Implement sframe handling
Posted by Fangrui Song 3 months, 2 weeks ago
On 2025-10-22, Jens Remus wrote:
>This is the implementation of parsing the SFrame section in an ELF file.
>It's a continuation of Josh's and Steve's last work that can be found
>here:
>
>   https://lore.kernel.org/all/cover.1737511963.git.jpoimboe@kernel.org/
>   https://lore.kernel.org/all/20250827201548.448472904@kernel.org/
>
>Currently the only way to get a user space stack trace from a stack
>walk (and not just copying large amount of user stack into the kernel
>ring buffer) is to use frame pointers. This has a few issues. The biggest
>one is that compiling frame pointers into every application and library
>has been shown to cause performance overhead.
>
>Another issue is that the format of the frames may not always be consistent
>between different compilers and some architectures (s390) has no defined
>format to do a reliable stack walk. The only way to perform user space
>profiling on these architectures is to copy the user stack into the kernel
>buffer.
>
>SFrames[1] is now supported in gcc binutils and soon will also be supported
>by LLVM. 

Please consider dropping the statement, "soon will also be supported by LLVM."
Speaking as LLVM's MC, lld/ELF, and binary utilities maintainer, I have
significant concerns about the v2 format, specifically its apparent
disregard for standard ELF and linker conventions
(https://maskray.me/blog/2025-09-28-remarks-on-sframe#linking-and-execution-views)

To arm64 maintainers, it is a critical time to revisit the unwind
information format, as I have outlined in my blog post:

A sorted address table like .eh_frame_hdr might still be needed, but the
design could be very different for arm64.

I am curious whether anyone has thought about a library that parses
.eh_frame and generates SFrame. If objtool integrates this library, it can
generate SFrame for vmlinux and modules without relying on the
assembler/linker. Linkers and assemblers require a level of stability, and
that is currently a concern on the toolchain side.

(https://sourceware.org/pipermail/binutils/2025-October/144974.html
"This 'linker will DTRT' assertion glosses over significant
implementation complexity. Each version needs not just a reader but
version-specific *merging* logic in every linker—fundamentally different
from simply reading a format.")

Re: [PATCH v11 00/15] unwind_deferred: Implement sframe handling
Posted by Steven Rostedt 3 months, 2 weeks ago
On Thu, 23 Oct 2025 01:09:02 -0700
Fangrui Song <maskray@sourceware.org> wrote:

> Please consider dropping the statement, "soon will also be supported by LLVM."
> Speaking as LLVM's MC, lld/ELF, and binary utilities maintainer, I have
> significant concerns about the v2 format, specifically its apparent
> disregard for standard ELF and linker conventions
> (https://maskray.me/blog/2025-09-28-remarks-on-sframe#linking-and-execution-views)

Please note, v2 can be dropped entirely. There are no plans to have the Linux
kernel ship with v2. The patches for v2 for the Linux kernel are for
testing purposes only (which is what helped find the issues with v2).

The plan is to have v3 be the first version supported by an official
release of the Linux kernel, with the assumption that changes after v3 will
be minimal.

The reason there was such a big difference between v2 and v3 is that v2
was the first version to have a consumer try to use it in a more
production-like environment. This found several corner cases that needed to
be addressed, and showed that the current layout of v2 was not acceptable.

No linker needs to support v2 as there will be no consumers of it.

-- Steve