From: Pengfei Li <lipengfei28@xiaomi.com>
Hi Steven, all,
This is v2 of the ftrace stackmap series. It addresses the Sashiko
review at [1] and incorporates the kernel test robot's toctree fix.
The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, while the full
stacks are exported via tracefs.
Problem
=======
With stacktrace enabled, each trace event stores a full kernel
stack (typically 10-20 frames x 8 bytes = 80-160 bytes). On
production devices with 4-8 MB trace buffers, this fills the
buffer in seconds, limiting the usefulness of boot-time tracing
and always-on performance monitoring.
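As a rough illustration of the arithmetic above, the sketch below estimates how many stack payloads fit in a 4 MB buffer when each event carries a full stack versus a 4-byte stack_id. It counts stack payload only and ignores event headers and the event's own fields (an assumption made here for simplicity), so the results are upper bounds, not measurements.

```python
# Back-of-the-envelope estimate; counts stack payload only (assumption:
# event headers and the event's own fields are ignored).
FRAME_BYTES = 8          # one return address on a 64-bit kernel
STACK_ID_BYTES = 4       # size of the stack_id stored in the ring buffer


def events_per_buffer(buffer_bytes: int, payload_bytes: int) -> int:
    """Number of payloads of the given size that fit in the buffer."""
    return buffer_bytes // payload_bytes


buffer_bytes = 4 * 1024 * 1024   # 4 MB trace buffer
for frames in (10, 20):
    full = frames * FRAME_BYTES
    print(f"{frames} frames: {full} B/stack, "
          f"{events_per_buffer(buffer_bytes, full)} full stacks vs "
          f"{events_per_buffer(buffer_bytes, STACK_ID_BYTES)} stack_ids")
```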
Design
======
The implementation is a lock-free hash map modeled after
tracing_map.c, as suggested by Steven [2]:
- lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- pre-allocated element pool, so there is no allocation on the hot path
- linear probing with a 2x over-provisioned table
- capped probe length, so worst-case lookup/insert cost stays bounded
- currently implemented for the global trace instance
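The properties listed above can be illustrated with a small single-threaded Python model. It is not the kernel code: the class name, the max_probe default, and the use of the pool index as stack_id are assumptions for illustration, and the cmpxchg slot claim is replaced by a plain store because the model has no concurrency.

```python
from typing import List, Optional, Tuple


class StackMapModel:
    """Single-threaded model of a fixed-size, linear-probing stack map."""

    def __init__(self, bits: int, max_probe: int = 64) -> None:
        self.max_elts = 1 << bits                  # pre-allocated pool size
        self.slots: List[Optional[int]] = [None] * (2 * self.max_elts)  # 2x table
        self.pool: List[Tuple[int, ...]] = []      # pool index doubles as stack_id
        self.max_probe = max_probe                 # hypothetical probe cap
        self.drops = 0

    def get_id(self, stack: Tuple[int, ...]) -> Optional[int]:
        """Return a stable id for `stack`, or None if it had to be dropped."""
        mask = len(self.slots) - 1
        idx = hash(stack) & mask
        for _ in range(self.max_probe):
            elt = self.slots[idx]
            if elt is None:
                if len(self.pool) >= self.max_elts:    # pool exhausted
                    break
                # The kernel would claim this slot with cmpxchg; a plain
                # store is enough in this single-threaded model.
                self.pool.append(stack)
                self.slots[idx] = len(self.pool) - 1
                return self.slots[idx]
            if self.pool[elt] == stack:                # already present: dedup hit
                return elt
            idx = (idx + 1) & mask                     # linear probing
        self.drops += 1
        return None
```

Entries are never moved or replaced once inserted, so a repeated get_id() for the same stack walks the same probe sequence and returns the same id.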
The ring buffer stores only stack_id. Full stacks are exported via:
  /sys/kernel/debug/tracing/stack_map
  /sys/kernel/debug/tracing/stack_map_stat
  /sys/kernel/debug/tracing/stack_map_bin
Reset semantics
===============
Reset is treated as a control-path operation and is only supported
when tracing is stopped on the owning trace_array. Online reset is
intentionally not supported.
The reset path:
- atomically claims reset rights via cmpxchg
- rejects reset with -EBUSY if tracing is active
- blocks new get_id() callers via the resetting flag
- waits for in-flight ftrace callback paths with synchronize_rcu()
- clears the map and releases resetting with release semantics
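The ordering of these steps can be sketched with a small Python model. It is only an illustration under stated assumptions: the class and method names are hypothetical, a threading.Lock stands in for the atomicity of cmpxchg, and the synchronize_rcu() wait and the release store appear only as comments.

```python
import errno
import threading


class ResetModel:
    """Models the reset control path described above; not kernel code."""

    def __init__(self) -> None:
        self._lock = threading.Lock()   # stands in for the atomicity of cmpxchg
        self.resetting = False
        self.tracing_active = False
        self.entries = {}

    def _cmpxchg_resetting(self, old: bool, new: bool) -> bool:
        """Atomically set resetting to `new` if it currently equals `old`."""
        with self._lock:
            if self.resetting != old:
                return False
            self.resetting = new
            return True

    def get_id(self, stack):
        if self.resetting:              # new callers back off during reset
            return None
        return self.entries.setdefault(stack, len(self.entries))

    def reset(self) -> int:
        if not self._cmpxchg_resetting(False, True):
            return -errno.EBUSY         # another reset is in progress
        try:
            if self.tracing_active:
                return -errno.EBUSY     # online reset is not supported
            # The kernel waits for in-flight callers here (synchronize_rcu()).
            self.entries.clear()
            return 0
        finally:
            self.resetting = False      # kernel: store with release semantics
```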
Why not reuse tracing_map.c
===========================
This series follows the same overall lock-free approach, but uses a
purpose-built structure. tracing_map.c is designed for histogram-style
aggregation with fixed-size keys and value fields, while this use case
needs variable-length stack storage plus reference counting.
Why not reuse BPF stackmap
==========================
BPF_MAP_TYPE_STACK_TRACE addresses a similar problem, but requires a
BPF program and the BPF runtime. This series keeps the functionality
inside ftrace and available without CONFIG_BPF.
Unlike BPF stackmap, which may replace entries on collision, this
design keeps stack_id stable once assigned, which is important because
ring buffer events may reference that stack_id long after insertion.
Test results
============
Platform: ARM64 Qualcomm SM8850 (8 cores), kernel 6.12, bits=14,
tracing sched_switch + kmem_cache_alloc with stacktrace trigger,
5-second capture, default ring buffer.
Per-event payload (measured from tracing stats):
  Event              Full stack    Stackmap     Reduction
  -----------------  ------------  -----------  ---------
  sched_switch       102 B/entry   48 B/entry   -53%
  kmem_cache_alloc   111 B/entry   44 B/entry   -60%
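For reference, the Reduction column is the relative change in per-entry size, rounded to a whole percent. The short check below recomputes it from the two measured columns; the function name is made up for this example.

```python
def reduction_pct(full_bytes: int, stackmap_bytes: int) -> int:
    """Relative change in per-entry size, rounded to a whole percent."""
    return round(100 * (stackmap_bytes - full_bytes) / full_bytes)


measurements = {            # (full stack, stackmap) in bytes per entry
    "sched_switch": (102, 48),
    "kmem_cache_alloc": (111, 44),
}
for event, (full, dedup) in measurements.items():
    print(f"{event}: {reduction_pct(full, dedup)}%")
```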
In the same 5-second capture window, the smaller per-event footprint
translated to many more retained events before wraparound. For
sched_switch:
- without stackmap: 43,950 retained entries
- with stackmap: 1,710,044 retained entries
During the same runs, the stackmap observed a few thousand unique
stacks and no drops.
Boot-time activation is also supported via:
  trace_options=stackmap,stacktrace
Events that occur before stackmap initialization fall back to full
stack traces; later events are deduplicated. The transition itself
does not drop events.
QEMU validation
===============
The series also runs cleanly in QEMU on aarch64 (mainline,
qemu-system-aarch64, 2 vCPU, virt machine, busybox initrd).
A post-init smoke test verified:
- stack_map, stack_map_stat, stack_map_bin, and options/stackmap exist
- enabling stackmap + stacktrace produces stack_id events
- stack_map_stat shows non-zero successes and zero drops
- reset is rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- stack_map_bin magic is correct
Changes since RFC v1
====================
- tightened reset semantics: reset now requires tracing to be stopped
  and returns -EBUSY if tracing is active or another reset is in progress
- fixed publication/consumption ordering with smp_store_release() /
  smp_load_acquire()
- bounded probe length and added pool-exhaustion fast-path handling
- moved hash_seed into struct ftrace_stackmap
- switched the element pool to a single flat vmalloc allocation
- bounded bits range to [10, 18] to limit worst-case memory usage
- fixed TRACE_ITER(STACKMAP) handling
- tightened stack_map reset input parsing
- renamed stat counters to "successes" / "success_rate" so the meaning
  is unambiguous (counts events served, including first-time inserts)
- added documentation, selftest coverage, and userspace dump tooling
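To give a feel for what the [10, 18] bits range means in practice, the sketch below computes pool and table sizes per bits value. It assumes, by analogy with tracing_map.c, that bits selects 2^bits pool elements and that the 2x over-provisioned table has twice as many slots. Byte sizes are left out because they depend on the per-element layout, which is not described here.

```python
MIN_BITS, MAX_BITS = 10, 18   # range stated in the changelog


def table_geometry(bits: int) -> tuple:
    """Return (pool elements, table slots) for a given bits value."""
    if not MIN_BITS <= bits <= MAX_BITS:
        raise ValueError(f"bits must be in [{MIN_BITS}, {MAX_BITS}]")
    elements = 1 << bits      # assumption: pool holds 2^bits stacks
    slots = 2 * elements      # 2x over-provisioned table
    return elements, slots


for bits in (MIN_BITS, 14, MAX_BITS):
    elements, slots = table_geometry(bits)
    print(f"bits={bits}: {elements} elements, {slots} slots")
```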
Known limitations
=================
- Per-instance stackmap support is not included in this series.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once the
  binary format settles.
Usage
=====
  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace
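The same two writes can be scripted, for example from a test harness. The helper below is a minimal sketch and not part of the series: the function name is hypothetical, the tracefs path is a parameter so it can be pointed elsewhere, and on a real system it needs root and a kernel with this series applied.

```python
import os


def enable_stackmap(tracefs: str = "/sys/kernel/debug/tracing") -> None:
    """Write 1 to options/stackmap, then options/stacktrace."""
    for option in ("stackmap", "stacktrace"):
        with open(os.path.join(tracefs, "options", option), "w") as f:
            f.write("1\n")
```

Calling enable_stackmap() with no arguments uses the debugfs path shown above.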
[1] https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260514034916.2162517-1-lipengfei28%40xiaomi.com
[2] https://lore.kernel.org/all/20260513085145.30dd23e0@fedora/
Pengfei Li (3):
trace: add lock-free stackmap for stack trace deduplication
trace: integrate stackmap into ftrace stack recording path
trace: add documentation, selftest and tooling for stackmap
Documentation/trace/ftrace-stackmap.rst | 145 ++++
Documentation/trace/index.rst | 1 +
kernel/trace/Kconfig | 21 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 66 ++
kernel/trace/trace.h | 16 +
kernel/trace/trace_entries.h | 15 +
kernel/trace/trace_output.c | 23 +
kernel/trace/trace_stackmap.c | 643 ++++++++++++++++++
kernel/trace/trace_stackmap.h | 56 ++
.../ftrace/test.d/ftrace/stackmap-basic.tc | 100 +++
tools/tracing/stackmap_dump.py | 150 ++++
12 files changed, 1237 insertions(+)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100755 tools/tracing/stackmap_dump.py
--
2.34.1