[RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer

Li Pengfei posted 3 patches 2 days, 7 hours ago
From: Pengfei Li <lipengfei28@xiaomi.com>

Hi Steven, all,

This is v2 of the ftrace stackmap series. It addresses the Sashiko
review at [1] and incorporates the kernel test robot's toctree fix.

The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, while the full
stacks are exported via tracefs.

Problem
=======

With stacktrace enabled, each trace event stores a full kernel
stack (typically 10-20 frames x 8 bytes = 80-160 bytes). On
production devices with 4-8 MB trace buffers, this fills the
buffer in seconds, limiting the usefulness of boot-time tracing
and always-on performance monitoring.

Design
======

The implementation is a lock-free hash map modeled after
tracing_map.c, as suggested by Steven [2]:

- lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- pre-allocated element pool, so there is no allocation on the hot path
- linear probing with a 2x over-provisioned table
- capped probe length, so worst-case lookup/insert cost stays bounded
- currently implemented for the global trace instance
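
For readers who want a concrete picture of the points above, here is a
minimal userspace sketch, not the code in this series. It uses C11
atomics in place of the kernel's cmpxchg, and every name and constant
in it (stackmap_get_id, MAP_BITS, MAX_PROBE, the FNV-1a hash) is
hypothetical. The stack_id is simply the pool index, so it never
changes once assigned.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define MAP_BITS   4                  /* hypothetical; tiny for illustration */
#define POOL_SIZE  (1u << MAP_BITS)   /* pre-allocated elements */
#define TABLE_SIZE (POOL_SIZE * 2)    /* 2x over-provisioned slot table */
#define MAX_PROBE  8                  /* hypothetical probe bound */
#define MAX_DEPTH  16

static_assert((TABLE_SIZE & (TABLE_SIZE - 1)) == 0, "table size must be a power of two");

struct stack_elt {
	uint32_t nr;
	uint64_t ips[MAX_DEPTH];
};

struct stackmap {
	_Atomic uint32_t slots[TABLE_SIZE]; /* 0 = empty, else pool index + 1 */
	_Atomic uint32_t next_elt;          /* next free pool element */
	struct stack_elt pool[POOL_SIZE];
};

/* FNV-1a over the return addresses; an assumption, any good hash works. */
static uint32_t hash_stack(const uint64_t *ips, uint32_t nr)
{
	uint32_t h = 2166136261u;

	for (uint32_t i = 0; i < nr; i++)
		for (int b = 0; b < 8; b++) {
			h ^= (uint8_t)(ips[i] >> (8 * b));
			h *= 16777619u;
		}
	return h;
}

static int elt_matches(const struct stack_elt *e, const uint64_t *ips, uint32_t nr)
{
	return e->nr == nr && memcmp(e->ips, ips, nr * sizeof(*ips)) == 0;
}

/* Returns a stable stack_id (>= 0), or -1 when the stack is dropped. */
int stackmap_get_id(struct stackmap *m, const uint64_t *ips, uint32_t nr)
{
	uint32_t mine = 0; /* pool index + 1 of the element we filled, 0 = none */
	uint32_t idx;

	if (nr > MAX_DEPTH)
		nr = MAX_DEPTH;
	idx = hash_stack(ips, nr) & (TABLE_SIZE - 1);

	for (uint32_t probe = 0; probe < MAX_PROBE; probe++) {
		_Atomic uint32_t *slot = &m->slots[(idx + probe) & (TABLE_SIZE - 1)];
		uint32_t cur = atomic_load_explicit(slot, memory_order_acquire);

		if (cur == 0) {
			if (mine == 0) {
				uint32_t n;

				/* Pool-exhaustion fast path: no allocation, just drop. */
				if (atomic_load_explicit(&m->next_elt, memory_order_relaxed) >= POOL_SIZE)
					return -1;
				n = atomic_fetch_add_explicit(&m->next_elt, 1, memory_order_relaxed);
				if (n >= POOL_SIZE)
					return -1;
				m->pool[n].nr = nr;
				memcpy(m->pool[n].ips, ips, nr * sizeof(*ips));
				mine = n + 1;
			}
			/* Publish the filled element; on failure cur holds the winner. */
			if (atomic_compare_exchange_strong_explicit(slot, &cur, mine,
					memory_order_acq_rel, memory_order_acquire))
				return (int)(mine - 1);
		}
		/* Occupied slot (or lost race): same stack means same id.
		 * If we already filled an element, this sketch simply leaks it. */
		if (elt_matches(&m->pool[cur - 1], ips, nr))
			return (int)(cur - 1);
	}
	return -1; /* probe bound reached */
}
```

Calling stackmap_get_id() twice with the same frames returns the same
id; a different stack gets a new id, even if both hash to the same
slot.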

The ring buffer stores only stack_id. Full stacks are exported via:

  /sys/kernel/debug/tracing/stack_map
  /sys/kernel/debug/tracing/stack_map_stat
  /sys/kernel/debug/tracing/stack_map_bin

Reset semantics
===============

Reset is treated as a control-path operation and is only supported
when tracing is stopped on the owning trace_array. Online reset is
intentionally not supported.

The reset path:
- atomically claims reset rights via cmpxchg
- rejects reset with -EBUSY if tracing is active
- blocks new get_id() callers via the resetting flag
- waits for in-flight ftrace callback paths with synchronize_rcu()
- clears the map and releases resetting with release semantics
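
The sequence above can be sketched in userspace C11 as follows. This
is only an illustration under stated assumptions: struct stackmap_ctl,
stackmap_reset(), stackmap_may_insert() and wait_for_readers() are
hypothetical names, the entries counter stands in for the real table
and pool, and wait_for_readers() is an empty placeholder where the
kernel code would call synchronize_rcu().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct stackmap_ctl {
	atomic_int resetting;   /* 1 while a reset owns the map */
	atomic_bool tracing_on; /* stand-in for the trace_array state */
	unsigned int entries;   /* stand-in for table + pool contents */
};

/* Placeholder: the kernel path would wait with synchronize_rcu(). */
static void wait_for_readers(void)
{
}

/* Hot-path check: new get_id() callers back off while a reset runs. */
bool stackmap_may_insert(struct stackmap_ctl *c)
{
	return !atomic_load_explicit(&c->resetting, memory_order_acquire);
}

int stackmap_reset(struct stackmap_ctl *c)
{
	int expected = 0;

	assert(c != NULL);

	/* 1. Claim reset rights; a concurrent reset loses and gets -EBUSY. */
	if (!atomic_compare_exchange_strong_explicit(&c->resetting, &expected, 1,
			memory_order_acq_rel, memory_order_acquire))
		return -EBUSY;

	/* 2. Online reset is not supported: give the claim back. */
	if (atomic_load(&c->tracing_on)) {
		atomic_store_explicit(&c->resetting, 0, memory_order_release);
		return -EBUSY;
	}

	/* 3. resetting == 1 now blocks new callers (stackmap_may_insert()). */
	/* 4. Wait for callers that got in before the flag was set. */
	wait_for_readers();

	/* 5. Clear, then release so later callers see an empty map. */
	c->entries = 0;
	atomic_store_explicit(&c->resetting, 0, memory_order_release);
	return 0;
}
```

With tracing_on set, stackmap_reset() returns -EBUSY and leaves the
contents untouched; once tracing is stopped it returns 0 and the map
is empty.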

Why not reuse tracing_map.c
===========================

This series follows the same overall lock-free approach, but uses a
purpose-built structure. tracing_map.c is designed for histogram-style
aggregation with fixed-size keys and value fields, while this use case
needs variable-length stack storage plus reference counting.

Why not reuse BPF stackmap
==========================

BPF_MAP_TYPE_STACK_TRACE addresses a similar problem, but requires a
BPF program and the BPF runtime. This series keeps the functionality
inside ftrace and available without CONFIG_BPF.

Unlike BPF stackmap, which may replace entries on collision, this
design keeps stack_id stable once assigned, which is important because
ring buffer events may reference that stack_id long after insertion.

Test results
============

Platform: ARM64 Qualcomm SM8850 (8 cores), kernel 6.12, bits=14,
tracing sched_switch + kmem_cache_alloc with stacktrace trigger,
5-second capture, default ring buffer.

Per-event payload (measured from tracing stats):

  Event                    Full stack    Stackmap    Reduction
  ---------------------    ----------    --------    ---------
  sched_switch             102 B/entry   48 B/entry    -53%
  kmem_cache_alloc         111 B/entry   44 B/entry    -60%
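
The Reduction column can be reproduced from the two byte counts with
the small helpers below. The function names are hypothetical, and
entries_per_mib() is only a naive upper bound that ignores ring-buffer
page headers and any per-page padding.

```c
#include <assert.h>

/* Size reduction in percent, rounded to the nearest integer. */
unsigned int reduction_pct(unsigned int full, unsigned int dedup)
{
	assert(full > 0 && dedup <= full);
	return (200u * (full - dedup) + full) / (2u * full);
}

/* Naive count of entries per MiB of buffer, ignoring page overhead. */
unsigned int entries_per_mib(unsigned int bytes_per_entry)
{
	assert(bytes_per_entry > 0);
	return (1024u * 1024u) / bytes_per_entry;
}
```

reduction_pct(102, 48) gives 53 and reduction_pct(111, 44) gives 60,
matching the table.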

In the same 5-second capture window, the smaller per-event footprint
translated to many more retained events before wraparound. For
sched_switch:

  - without stackmap:      43,950 retained entries
  - with stackmap:      1,710,044 retained entries

During the same runs, the stackmap observed a few thousand unique
stacks and no drops.

Boot-time activation is also supported via:

  trace_options=stackmap,stacktrace

Events recorded before stackmap initialization fall back to full
stack traces and are not deduplicated; events recorded afterwards
are. The transition itself does not drop events.

QEMU validation
===============

The series also runs cleanly in QEMU on aarch64 (mainline,
qemu-system-aarch64, 2 vCPU, virt machine, busybox initrd).

A post-init smoke test verified:
- stack_map, stack_map_stat, stack_map_bin, and options/stackmap exist
- enabling stackmap + stacktrace produces stack_id events
- stack_map_stat shows non-zero successes and zero drops
- reset is rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- stack_map_bin magic is correct

Changes since RFC v1
====================

- tightened reset semantics: reset now requires tracing to be stopped
  and returns -EBUSY if tracing is active or another reset is in progress
- fixed publication/consumption ordering with smp_store_release() /
  smp_load_acquire()
- bounded probe length and added pool-exhaustion fast-path handling
- moved hash_seed into struct ftrace_stackmap
- switched the element pool to a single flat vmalloc allocation
- bounded bits range to [10, 18] to limit worst-case memory usage
- fixed TRACE_ITER(STACKMAP) handling
- tightened stack_map reset input parsing
- renamed stat counters to "successes" / "success_rate" so the meaning
  is unambiguous (counts events served, including first-time inserts)
- added documentation, selftest coverage, and userspace dump tooling
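
To illustrate the publication/consumption ordering item in the list
above, here is a userspace sketch of the pattern, assuming C11
release/acquire atomics as stand-ins for smp_store_release() and
smp_load_acquire(). struct elt, struct slot, slot_publish() and
slot_lookup() are hypothetical names, not the structures in the
patches.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEPTH 4

struct elt {
	uint32_t nr;
	uint64_t ips[MAX_DEPTH];
};

struct slot {
	_Atomic(struct elt *) val; /* NULL until the element is published */
};

/* Writer: fill the element completely, then publish the pointer. */
void slot_publish(struct slot *s, struct elt *e, const uint64_t *ips, uint32_t nr)
{
	assert(nr <= MAX_DEPTH);
	e->nr = nr;
	for (uint32_t i = 0; i < nr; i++)
		e->ips[i] = ips[i];
	/* Release: the stores above are visible before the pointer is. */
	atomic_store_explicit(&s->val, e, memory_order_release);
}

/* Reader: a non-NULL result is guaranteed to be fully initialized. */
const struct elt *slot_lookup(struct slot *s)
{
	return atomic_load_explicit(&s->val, memory_order_acquire);
}
```

A reader that sees NULL treats the slot as not yet published; a reader
that sees a pointer can safely compare nr and ips against its own
stack.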

Known limitations
=================

- Per-instance stackmap support is not included in this series.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once the
  binary format settles.

Usage
=====

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

[1] https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260514034916.2162517-1-lipengfei28%40xiaomi.com
[2] https://lore.kernel.org/all/20260513085145.30dd23e0@fedora/

Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap

 Documentation/trace/ftrace-stackmap.rst       | 145 ++++
 Documentation/trace/index.rst                 |   1 +
 kernel/trace/Kconfig                          |  21 +
 kernel/trace/Makefile                         |   1 +
 kernel/trace/trace.c                          |  66 ++
 kernel/trace/trace.h                          |  16 +
 kernel/trace/trace_entries.h                  |  15 +
 kernel/trace/trace_output.c                   |  23 +
 kernel/trace/trace_stackmap.c                 | 643 ++++++++++++++++++
 kernel/trace/trace_stackmap.h                 |  56 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 100 +++
 tools/tracing/stackmap_dump.py                | 150 ++++
 12 files changed, 1237 insertions(+)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100755 tools/tracing/stackmap_dump.py

-- 
2.34.1