From: Pengfei Li <lipengfei28@xiaomi.com>
Hi Masami, Steven, all,
This is v4 of the ftrace stackmap series. It is sent as a new thread.
v3: https://lore.kernel.org/all/cover.1779769138.git.lipengfei28@xiaomi.com/
The series adds stack trace deduplication to ftrace. When the
'stackmap' option is enabled alongside 'stacktrace', the ring buffer
stores a 4-byte stack_id instead of a full kernel stack trace, and the
full stacks are exported once via tracefs (stack_map / stack_map_bin).
Rebased onto v7.1-rc5 (e8c2f9fdadee).
Motivation
==========
The target use case is long-duration, from-boot kernel tracing where
the same stacks recur very frequently and the bottleneck is ring
buffer space, not CPU.
Concretely: tracing the slab allocator from boot for hours to study
memory aging and to catch the allocation backtraces behind a usage
peak. With a stacktrace trigger on the slab tracepoints, every event
today carries a full kernel stack (~80-160 bytes). On a fixed-size
ring buffer that bounds how far back in time the trace reaches: the
buffer wraps in seconds to minutes and the early-boot history -- the
part we care about -- is overwritten before it can be consumed.
In this workload the set of distinct stacks is small and highly
repetitive, so storing a 4-byte stack_id per event and the full stack
only once dramatically increases the time span a given buffer covers.
The intended operating model is exactly the low-overhead one ftrace is
good at: let the trace run for a long time producing a comparatively
small, dense log, then resolve stack_ids offline (cat stack_map, or
parse stack_map_bin with the included tool) during analysis.
This is complementary to, not a replacement for, the existing full
stack recording: deep stacks and the early pre-init window still fall
back to full stacks (see below).
Effect on retention
===================
Same fixed per-CPU buffer, slab allocation workload with a shallow
kernel stack (kmem_cache_alloc), stackmap OFF vs ON:
                retained events   bytes/event   time span
  stackmap OFF          645,068        ~104 B      15.0 s
  stackmap ON         1,397,741         ~48 B      27.7 s
  ratio                   2.17x         2.17x       1.85x
The buffer holds ~2.17x more events and reaches ~1.85x further back in
time for the same memory. The win grows with stack depth and with how
repetitive the stacks are; for deep, highly-repeated stacks the
per-event size approaches the 4-byte stack_id plus event header.
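
The ratios in the table can be re-derived from the raw numbers. The
short Python check below uses only the figures quoted above; the
variable names are illustrative and not part of the series.

```python
# Figures copied from the retention table above.
off = {"events": 645_068, "bytes_per_event": 104, "span_s": 15.0}
on = {"events": 1_397_741, "bytes_per_event": 48, "span_s": 27.7}

# More events fit because each event is smaller; the time span grows
# somewhat less than the event count.
events_ratio = on["events"] / off["events"]
size_ratio = off["bytes_per_event"] / on["bytes_per_event"]
span_ratio = on["span_s"] / off["span_s"]

print(f"events {events_ratio:.2f}x, size {size_ratio:.2f}x, span {span_ratio:.2f}x")
```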
Changes since v3
================
Correctness:
- Deep stacks are never silently truncated or merged. A stack deeper
than FTRACE_STACKMAP_MAX_DEPTH (64) now falls back to a full stack
trace; ftrace_stackmap_get_id() returns -E2BIG rather than
truncating, so two distinct stacks sharing their first 64 frames
can no longer collapse to one stack_id.
- reset is now genuinely destructive and coherent: under the
reader_sem write lock it clears the owning trace_array's ring
buffer (and snapshot) BEFORE clearing the map, and uses
tracing_reset_all_cpus() rather than _online_cpus() so a
TRACE_STACK_ID written by a now-offline CPU cannot survive and
dangle against a cleared map.
- __ftrace_trace_stack() reserves the TRACE_STACK_ID ring-buffer
slot before inserting into the map, so stack_map_stat counters and
ref_count stay consistent with what the ring buffer actually
references (failed reservation -> full stack, map untouched).
- ref_count / successes / drops now saturate (INT_MAX / LONG_MAX)
instead of wrapping during multi-hour traces with billions of hits.
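
The deep-stack fallback and the saturating ref_count can be modelled
in user space as below. This is only a sketch of the behaviour
described in the list above, not the kernel code: ToyStackMap and
record() are hypothetical names, and the real map is lock-free and
fixed-size rather than a Python dict.

```python
import errno

MAX_DEPTH = 64        # mirrors FTRACE_STACKMAP_MAX_DEPTH from the series
REF_MAX = 2**31 - 1   # INT_MAX, the saturation point for ref_count


class ToyStackMap:
    """User-space model only; names and structure are assumptions."""

    def __init__(self):
        self.ids = {}        # tuple of return addresses -> stack_id
        self.ref_count = {}  # stack_id -> saturating hit counter

    def get_id(self, ips):
        # Too-deep stacks are rejected rather than truncated, so two
        # stacks sharing their first MAX_DEPTH frames never alias.
        if len(ips) > MAX_DEPTH:
            return -errno.E2BIG
        stack_id = self.ids.setdefault(tuple(ips), len(self.ids))
        # Saturate instead of wrapping around.
        hits = self.ref_count.get(stack_id, 0) + 1
        self.ref_count[stack_id] = min(hits, REF_MAX)
        return stack_id


def record(stackmap, ips):
    """Return what the ring buffer would store for this event."""
    stack_id = stackmap.get_id(ips)
    if stack_id < 0:
        return ("full_stack", tuple(ips))  # fallback path
    return ("stack_id", stack_id)
```

With this model, two 65-frame stacks that differ only in the last
frame are both recorded as distinct full stacks instead of sharing
one stack_id.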
Global-instance gating:
- Enabling 'stackmap' on a secondary instance via the aggregate
trace_options file is now rejected, not just hidden in the
per-instance options/ directory.
- tracefs init is failure-atomic: the required stack_map file is
created before the map pointer is published; if it cannot be
created the map is destroyed and never published. An init state
(PENDING/DONE/FAILED) lets boot-time trace_options=stackmap set
the flag before the map exists; the hot path falls back to full
stacks until the map is published. Enables after a permanent init
failure are still rejected, so options/stackmap never reports an
option that is enabled but has no effect.
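
The init-state gating can be summarised as two predicates. This is a
simplified model of the rules stated above, with hypothetical
function names; the kernel checks the published map pointer rather
than an enum on the hot path, and what happens to a boot-time flag if
init later fails is not modelled here.

```python
from enum import Enum


class InitState(Enum):
    PENDING = "pending"  # tracefs init has not run yet (early boot)
    DONE = "done"        # stack_map file created, map published
    FAILED = "failed"    # init failed permanently, map destroyed


def may_enable(state):
    """Whether a write that sets the 'stackmap' option is accepted."""
    return state is not InitState.FAILED


def use_stackmap(flag_set, state):
    """Whether the hot path stores a stack_id (else a full stack)."""
    return flag_set and state is InitState.DONE
```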
ABI / tooling:
- Binary magic corrected to 0x46534D42 ('FSMB'); version is 1 (first
upstream ABI). Documentation, tool and selftest updated to match.
- Text and binary exports now follow the same trampoline-marker and
trace_adjust_address() handling as the normal stack print path.
- stackmap_dump.py emits hex addresses in 'ips' and shows the ftrace
trampoline marker only in the resolved 'symbols'.
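
For reference, the magic value reads as 'FSMB' when the 32-bit value
is written most-significant byte first. The header check below is
only a sketch: the 8-byte layout (u32 magic followed by u32 version)
and the little-endian default are assumptions made for illustration,
and check_header() is a hypothetical helper, not code from
stackmap_dump.py.

```python
import struct

STACKMAP_MAGIC = 0x46534D42  # value quoted in this cover letter
STACKMAP_VERSION = 1

# Packed big-endian, the magic spells 'FSMB'.
assert struct.pack(">I", STACKMAP_MAGIC) == b"FSMB"


def check_header(buf, byteorder="<"):
    """Validate an ASSUMED header: u32 magic, then u32 version."""
    magic, version = struct.unpack_from(byteorder + "II", buf, 0)
    if magic != STACKMAP_MAGIC:
        raise ValueError(f"bad magic {magic:#010x}")
    if version != STACKMAP_VERSION:
        raise ValueError(f"unsupported version {version}")
    return version
```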
Selftests:
- New stackmap-reset.tc: verifies reset clears stale <stack_id N>
from the trace buffer and checks the stack_map_bin magic/version.
- stackmap-instance-gate.tc extended to verify the trace_options
write path is rejected on a secondary instance.
- stackmap-basic.tc no longer treats a nonzero drops count as a
failure (drops is a by-design fallback); only zero successes with
nonzero drops is fatal.
Open questions for maintainers
==============================
Two design points where I would value direction before polishing
further:
1. Eager vs lazy allocation. The element pool is allocated at
fs_initcall when CONFIG_FTRACE_STACKMAP=y, regardless of whether
userspace ever enables the option (~8 MB at the default bits=14,
up to ~135 MB at bits=18). This keeps the hot path allocation-free
with no allocation-failure path under tracing pressure. Is eager
allocation acceptable, or would you prefer lazy allocation on the
first 'echo 1 > options/stackmap'?
2. Binary ABI now or later. stack_map_bin is a new tracefs binary
interface (magic 0x46534D42, version 1). Is it acceptable to
introduce it now, or would you prefer the first version ship with
the text stack_map interface only and add the binary export once
trace-cmd / libtraceevent integration is designed?
Test results
============
QEMU (aarch64 virt, v7.1-rc5 + this series), boot-to-init smoke test:
- stackmap functional suite: 16/16 PASS, including reset clearing the
trace buffer (stale <stack_id> count 48 -> 0), stack_map_bin
magic/version, global-vs-secondary instance gating, and the
trace_options rejection on a secondary instance.
- boot-time activation (trace_options=stackmap,stacktrace on the
kernel cmdline): 3/3 PASS -- the option survives the
pre-initialization window and the map is live once published.
- ftrace startup self-tests pass with the new TRACE_STACK_ID entry.
Device retention numbers above were collected on a Xiaomi SM8850
(ARM64) running an Android workload, comparing the same buffer with
the option off and on.
Known limitations
=================
- Per-instance stackmap support is not included; the option is gated
to the global instance (in the tracefs layout and at the
set_tracer_flag() write path). Per-instance maps are a follow-up.
- Deduplication is best-effort, not strict: under heavy contention,
two CPUs racing on the same stack hash may each claim a different
slot, producing a few duplicate entries; ref_count is then split
across them. This keeps the hot path lock-free.
- The stackmap covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, serialized against reset
but not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once the
binary format settles (see open question 2).
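
Because deduplication is best-effort, offline analysis may want to
fold duplicate entries back together. One possible approach is
sketched below: group entries by their frame list and sum the split
ref_counts. The (stack_id, ips, ref_count) tuple form is a
hypothetical in-memory representation of a parsed export, not a
format defined by this series.

```python
from collections import defaultdict


def merge_duplicates(entries):
    """Group map entries by identical frames and sum their ref_counts.

    `entries` is an iterable of (stack_id, ips, ref_count) tuples.
    Returns {ips_tuple: (sorted stack_ids, total ref_count)}.
    """
    ids = defaultdict(list)
    totals = defaultdict(int)
    for stack_id, ips, ref_count in entries:
        key = tuple(ips)
        ids[key].append(stack_id)
        totals[key] += ref_count
    return {key: (sorted(ids[key]), totals[key]) for key in ids}
```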
Usage
=====
  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace
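
Once a trace has been captured, the <stack_id N> markers in the text
output can be expanded offline. The sketch below assumes the caller
has already built a dict from stack_id to symbol names (for example
from stack_map); resolve() and that dict are hypothetical, and the
parsing of stack_map itself is left out.

```python
import re

STACK_ID_RE = re.compile(r"<stack_id (\d+)>")


def resolve(trace_text, stacks):
    """Replace each '<stack_id N>' marker with the symbols for stack N.

    `stacks` maps stack_id -> list of symbol strings (an assumed,
    caller-built structure). Unknown ids are left untouched.
    """
    def expand(match):
        frames = stacks.get(int(match.group(1)))
        if frames is None:
            return match.group(0)
        return " <= ".join(frames)

    return STACK_ID_RE.sub(expand, trace_text)
```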
Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap
 Documentation/trace/ftrace-stackmap.rst     | 177 ++++
 Documentation/trace/index.rst               |   1 +
 kernel/trace/Kconfig                        |  22 +
 kernel/trace/Makefile                       |   1 +
 kernel/trace/trace.c                        | 216 ++++-
 kernel/trace/trace.h                        |  17 +
 kernel/trace/trace_entries.h                |  15 +
 kernel/trace/trace_output.c                 |  23 +
 kernel/trace/trace_selftest.c               |   1 +
 kernel/trace/trace_stackmap.c               | 889 ++++++++++++++++++
 kernel/trace/trace_stackmap.h               |  57 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc  | 111 +++
 .../test.d/ftrace/stackmap-instance-gate.tc |  54 ++
 .../ftrace/test.d/ftrace/stackmap-reset.tc  |  76 ++
 tools/tracing/stackmap_dump.py              | 164 ++++
 15 files changed, 1821 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
 create mode 100755 tools/tracing/stackmap_dump.py
--
2.34.1