From nobody Mon Jun 8 21:51:06 2026
From: Li Pengfei
To: mhiramat@kernel.org, rostedt@goodmis.org
Cc: linux-trace-kernel@vger.kernel.org, linux-kernel@vger.kernel.org,
	cmllamas@google.com, zhangbo56@xiaomi.com, Pengfei Li
Subject: [RFC PATCH v3 1/3] trace: add lock-free stackmap for stack trace deduplication
Date: Tue, 26 May 2026 19:52:43 +0800
X-Mailer: git-send-email 2.34.1
References: <20260514034916.2162517-1-lipengfei28@xiaomi.com>

From: Pengfei Li

Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full stack
traces (80-160 bytes each) in the ring buffer for every event, ftrace
can store a 4-byte stack_id when the stackmap option is enabled.

The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:

  - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
  - Pre-allocated element pool (zero allocation on hot path)
  - Linear probing with 2x over-provisioned table; probe length is
    bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup
    is O(1) even when the table is heavily loaded with
    claimed-but-empty slots from pool exhaustion
  - Single global instance (initialized for the global trace array)

The Kconfig depends on ARCH_HAVE_NMI_SAFE_CMPXCHG, matching the
existing tracing_map / hist_triggers requirement: the lock-free hot
path uses cmpxchg in a context that may be reached from NMI.

The stackmap is exported via three tracefs nodes:

  - stack_map: text export with symbol resolution (mode 0640)
  - stack_map_stat: counters (entries, successes, drops, success_rate)
  - stack_map_bin: binary export (all fields native-endian)

Hot-path counters use per-CPU local_t (NMI-safe single-instruction
increments) instead of atomic64_t. atomic64_t falls back to
raw_spinlock_t-based emulation on 32-bit GENERIC_ATOMIC64 systems,
which would deadlock if an NMI hit while the spinlock was held.
local_t avoids this hazard.

Reset semantics:

  - Reset is a control-path operation only allowed when tracing is
    stopped on the owning trace_array. Online reset (with tracing
    active) is intentionally not supported.
  - Reset uses atomic_cmpxchg() to claim the resetting flag, then
    verifies tracer_tracing_is_on() returns false.
  - synchronize_rcu() drains in-flight get_id() callers from the
    ftrace callback path (which runs preempt-disabled).
  - A reader_sem (rw_semaphore) serializes the clearing memset
    against tracefs readers (seq_file iteration and stack_map_bin
    snapshot), which run in process context and aren't covered by
    synchronize_rcu(). The hot path doesn't take this lock.
  - Reset clears the resetting flag with atomic_set_release() so a
    subsequent get_id() observes a fully cleared map.
  - get_id() uses atomic_read_acquire() on resetting so subsequent
    loads of entry->key/val are properly ordered after the check
    (control dependencies only order stores per LKMM).
  - Concurrent reset returns -EBUSY; reset while tracing is active
    returns -EBUSY.

Concurrency notes:

  - entry->val publication uses smp_store_release() paired with
    smp_load_acquire() in all dereferencing readers.
  - entry->key reads (in get_id, seq_start/next, bin_open) use
    READ_ONCE() to avoid LKMM data races with the cmpxchg writer.
  - elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before
    use in seq_show and bin_open.
  - Pool exhaustion: stackmap_get_elt() short-circuits via
    atomic_read() before the contended atomic RMW, avoiding cacheline
    contention once the pool is full. Slots that win cmpxchg but
    cannot get an elt are left 'claimed but empty'; subsequent
    lookups treat val==NULL as a miss and probe past them.

Hash key:

  - Per-instance random seed stored in the stackmap struct (no global
    state), seeded at create time.
  - 32-bit jhash is forced to 1 if it lands on 0 (which is the
    free-slot sentinel). Full memcmp confirms matches.

Memory:

  - Single flat vmalloc for the element pool (no per-elt kzalloc).
  - bits parameter clamped to [10, 18]: at the maximum bits=18, the
    element pool is ~135 MB and a stack_map_bin snapshot may briefly
    allocate another ~135 MB.
  - struct stackmap_bin_snapshot uses u64 (not size_t) for its size
    field so data[] is 8-byte aligned on both 32-bit and 64-bit
    architectures, avoiding alignment faults when writing u64 IPs on
    strict-alignment architectures.
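
The claim/publish protocol from the concurrency and hash key notes
above can be modeled in userspace. What follows is a minimal sketch
under stated assumptions, not code from this patch: every model_*
name and MODEL_* size is hypothetical, C11 atomics stand in for the
kernel's cmpxchg(), smp_store_release() and smp_load_acquire(), and
FNV-1a is swapped in for jhash2(). The reset path and the per-CPU
counters are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model sizes, far smaller than the patch's defaults. */
#define MODEL_BITS      4
#define MODEL_MAX_ELTS  (1u << MODEL_BITS)        /* element pool size */
#define MODEL_MAP_SIZE  (1u << (MODEL_BITS + 1))  /* 2x over-provisioned */
#define MODEL_MAX_PROBE 16
#define MODEL_MAX_DEPTH 8

static_assert((MODEL_MAP_SIZE & (MODEL_MAP_SIZE - 1)) == 0,
	      "table size must be a power of two");

struct model_elt {
	uint32_t nr;
	atomic_uint ref_count;
	unsigned long ips[MODEL_MAX_DEPTH];
};

struct model_entry {
	_Atomic uint32_t key;           /* 0 = free slot */
	struct model_elt *_Atomic val;  /* NULL until published */
};

static struct model_entry model_entries[MODEL_MAP_SIZE];
static struct model_elt model_pool[MODEL_MAX_ELTS];
static atomic_uint model_next_elt;

/* FNV-1a stand-in for jhash2(); 0 is remapped to 1 (0 marks a free slot). */
static uint32_t model_hash(const unsigned long *ips, size_t len)
{
	const unsigned char *p = (const unsigned char *)ips;
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++)
		h = (h ^ p[i]) * 16777619u;
	return h ? h : 1;
}

/* Take the next pool element, or NULL once the pool is used up. */
static struct model_elt *model_get_elt(void)
{
	unsigned int cur = atomic_load(&model_next_elt);

	while (cur < MODEL_MAX_ELTS) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&model_next_elt, &cur, cur + 1))
			return &model_pool[cur];
	}
	return NULL;
}

/* Returns the slot index used as the stack id, or -1 on a drop. */
static int model_get_id(const unsigned long *ips, uint32_t nr)
{
	if (!nr)
		return -1;
	if (nr > MODEL_MAX_DEPTH)
		nr = MODEL_MAX_DEPTH;

	size_t len = nr * sizeof(*ips);
	uint32_t key = model_hash(ips, len);
	uint32_t idx = key >> (32 - (MODEL_BITS + 1));

	for (int probes = 0; probes < MODEL_MAX_PROBE; probes++, idx++) {
		idx &= MODEL_MAP_SIZE - 1;
		struct model_entry *e = &model_entries[idx];
		uint32_t seen = atomic_load_explicit(&e->key, memory_order_relaxed);

		if (seen == key) {
			/* Acquire pairs with the release store below. */
			struct model_elt *elt =
				atomic_load_explicit(&e->val, memory_order_acquire);

			if (elt && elt->nr == nr && !memcmp(elt->ips, ips, len)) {
				atomic_fetch_add(&elt->ref_count, 1);
				return (int)idx;
			}
			/* Mid-insert, claimed-but-empty, or a collision: probe on. */
		} else if (!seen) {
			uint32_t expected = 0;

			if (atomic_compare_exchange_strong(&e->key, &expected, key)) {
				struct model_elt *elt = model_get_elt();

				if (!elt)
					return -1;  /* slot stays claimed but empty */
				elt->nr = nr;
				atomic_store_explicit(&elt->ref_count, 1,
						      memory_order_relaxed);
				memcpy(elt->ips, ips, len);
				/* Publish only after the element is fully written. */
				atomic_store_explicit(&e->val, elt,
						      memory_order_release);
				return (int)idx;
			}
		}
	}
	return -1;
}
```

Calling model_get_id() twice with the same IP array should return the
same slot index and raise that element's ref_count to 2, while a
different array lands in a different slot. As in the patch, the
returned id is the table slot index, not the pool index.
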

Kernel command line parameter:

  - ftrace_stackmap.bits=N: set map capacity (2^N unique stacks,
    range 10-18, default 14)

Signed-off-by: Pengfei Li
---
 kernel/trace/Kconfig          |  22 +
 kernel/trace/Makefile         |   1 +
 kernel/trace/trace_stackmap.c | 780 ++++++++++++++++++++++++++++++++++
 kernel/trace/trace_stackmap.h |  57 +++
 4 files changed, 860 insertions(+)
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..e49cae886ff0 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,28 @@ config STACK_TRACER
 
 	  Say N if unsure.
 
+config FTRACE_STACKMAP
+	bool "Ftrace stack map deduplication"
+	depends on TRACING
+	depends on STACKTRACE
+	depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select KALLSYMS
+	help
+	  This enables a global stack trace hash table for ftrace, inspired
+	  by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+	  only a stack_id in the ring buffer instead of the full stack trace,
+	  significantly reducing trace buffer usage when the same call stacks
+	  appear repeatedly.
+
+	  The deduplicated stacks are exported via:
+	    /sys/kernel/debug/tracing/stack_map
+
+	  Writing to this file resets the stack map. Reading shows all unique
+	  stacks with their stack_id and reference count.
+
+	  Say Y if you want to reduce ftrace buffer usage for stack traces.
+	  Say N if unsure.
+
 config TRACE_PREEMPT_TOGGLE
 	bool
 	help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 8d3d96e847d8..c2d9b2bf895a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
 obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
 obj-$(CONFIG_NOP_TRACER) += trace_nop.o
 obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
 obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
 obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..c89f6d527c96
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,780 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ *   - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
+ *   - Pre-allocated element pool (zero allocation on hot path)
+ *   - Linear probing with 2x over-provisioned table; probe length
+ *     bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup
+ *     cost constant even when the table is heavily loaded
+ *   - Single global instance (initialized for the global trace array)
+ *
+ * Reset is a control-path operation, only allowed when tracing is
+ * stopped on the owning trace_array. The protocol is:
+ *
+ *   - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights
+ *     and blocks new get_id() callers (they observe resetting=1 and
+ *     return -EINVAL).
+ *   - tracer_tracing_is_on() is checked AFTER the cmpxchg, so the
+ *     resetting flag itself prevents new insertions even if userspace
+ *     re-enables tracing immediately after the check.
+ *   - synchronize_rcu() drains in-flight get_id() callers from the
+ *     ftrace callback path, which runs with preemption disabled.
+ *
+ * Online reset (with tracing active) is intentionally not supported
+ * to keep the design simple and the proof obligations small.
+ *
+ * The 32-bit jhash of the stack IPs is the hash table key. On hash
+ * collision, linear probing finds the next slot and full memcmp
+ * confirms the match.
+ *
+ * Concurrent userspace readers (cat stack_map / stack_map_bin) get
+ * a best-effort snapshot. They are coherent with the hot path
+ * (smp_load_acquire on entry->val), but they are not coherent with
+ * a concurrent reset; since reset requires tracing to be stopped,
+ * mid-iteration reset can produce truncated or partial output but
+ * never crashes.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Bound the linear-probe scan length. With a 2x over-provisioned table,
+ * a well-distributed hash gives very short probe chains. Capping at 64
+ * keeps worst-case lookup O(1) even when the table is heavily loaded
+ * with claimed-but-empty slots from pool exhaustion.
+ */
+#define FTRACE_STACKMAP_MAX_PROBE	64
+
+/*
+ * Memory ordering of entry->val: published with smp_store_release()
+ * by the inserter; consumed with smp_load_acquire() by every reader
+ * that dereferences the elt (get_id, seq_show, bin_open). This pairs
+ * the writes to elt->{nr,ips,ref_count} (initialized BEFORE the
+ * publish) with the reads of those fields (which happen AFTER the
+ * load). seq_start / seq_next only test val for NULL and use the
+ * acquire load purely to keep memory ordering symmetric.
+ */
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+	u32		nr;		/* actual number of IPs */
+	atomic_t	ref_count;
+	unsigned long	ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+	u32			key;	/* 0 = free, non-zero = jhash */
+	struct stackmap_elt	*val;	/* NULL until fully published */
+};
+
+struct ftrace_stackmap {
+	struct trace_array	*tr;		/* owning trace_array */
+	unsigned int		map_bits;
+	unsigned int		map_size;	/* 1 << (map_bits + 1) */
+	unsigned int		max_elts;	/* 1 << map_bits */
+	u32			hash_seed;	/* per-instance jhash seed */
+	atomic_t		next_elt;	/* index into elts pool */
+	struct stackmap_entry	*entries;	/* hash table */
+	struct stackmap_elt	*elts;		/* flat element pool */
+	atomic_t		resetting;
+	/*
+	 * Reader/reset serialization. Held in shared mode (read lock)
+	 * across seq_file iteration and binary snapshot construction;
+	 * held in exclusive mode (write lock) by reset's clearing
+	 * phase. The hot path (get_id) does not take this lock; it
+	 * uses smp_load_acquire/smp_store_release on entry->val and
+	 * the resetting flag for the lock-free protocol.
+	 */
+	struct rw_semaphore	reader_sem;
+	/*
+	 * Per-CPU counters using local_t. local_t increments are NMI-
+	 * safe on all architectures (single-instruction or interrupt-
+	 * masked) and avoid the raw_spinlock_t fallback that
+	 * atomic64_t uses on 32-bit GENERIC_ATOMIC64, which would
+	 * deadlock if an NMI hit while the spinlock was held.
+	 */
+	local_t __percpu	*successes;	/* events served (hits + new inserts) */
+	local_t __percpu	*drops;
+};
+
+/*
+ * Cap the bits parameter to keep worst-case allocations bounded:
+ *   bits=18 → 256K elts, 512K slots, ~135 MB elt pool, ~135 MB bin
+ *   export.
+ * Smaller workloads should use the default (14) which gives 16K elts
+ * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel
+ * parameter for higher unique-stack capacity.
+ */
+#define FTRACE_STACKMAP_BITS_MIN	10
+#define FTRACE_STACKMAP_BITS_MAX	18
+#define FTRACE_STACKMAP_BITS_DEFAULT	14
+
+static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT;
+static int __init stackmap_bits_setup(char *str)
+{
+	unsigned long val;
+
+	if (kstrtoul(str, 0, &val))
+		return -EINVAL;
+	val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX);
+	stackmap_map_bits = val;
+	return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+	int idx;
+
+	/*
+	 * Fast-path early-out once the pool is fully consumed. Avoids
+	 * the contended atomic RMW on next_elt for every traced event
+	 * after the pool is exhausted.
+	 */
+	if (atomic_read(&smap->next_elt) >= smap->max_elts)
+		return NULL;
+
+	idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+	if (idx < smap->max_elts)
+		return &smap->elts[idx];
+	return NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr)
+{
+	struct ftrace_stackmap *smap;
+	unsigned int bits;
+
+	smap = kzalloc(sizeof(*smap), GFP_KERNEL);
+	if (!smap)
+		return ERR_PTR(-ENOMEM);
+
+	/* Defensive clamp: reject bogus bits even if early_param is bypassed. */
+	bits = clamp_val(stackmap_map_bits,
+			 FTRACE_STACKMAP_BITS_MIN,
+			 FTRACE_STACKMAP_BITS_MAX);
+
+	smap->tr = tr;
+	smap->map_bits = bits;
+	smap->max_elts = 1U << bits;
+	smap->map_size = 1U << (bits + 1);	/* 2x over-provision */
+
+	smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+	if (!smap->entries) {
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * Single large vmalloc of the element pool, indexed flat.
+	 * At bits=18 this is 256K * sizeof(struct stackmap_elt). The
+	 * struct is ~520 B on 64-bit (4 + 4 + 64*8), so total ~135 MB.
+	 */
+	smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts);
+	if (!smap->elts) {
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	smap->successes = alloc_percpu(local_t);
+	if (!smap->successes) {
+		vfree(smap->elts);
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+	smap->drops = alloc_percpu(local_t);
+	if (!smap->drops) {
+		free_percpu(smap->successes);
+		vfree(smap->elts);
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	smap->hash_seed = get_random_u32();
+	atomic_set(&smap->next_elt, 0);
+	atomic_set(&smap->resetting, 0);
+	init_rwsem(&smap->reader_sem);
+
+	return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+	if (!smap || IS_ERR(smap))
+		return;
+	free_percpu(smap->drops);
+	free_percpu(smap->successes);
+	vfree(smap->elts);
+	vfree(smap->entries);
+	kfree(smap);
+}
+
+/**
+ * ftrace_stackmap_reset - clear all entries in the stackmap
+ * @smap: the stackmap to reset
+ *
+ * Returns 0 on success, -EBUSY if another reset is already in
+ * progress, or if tracing is currently active on the owning
+ * trace_array.
+ *
+ * Online reset (with tracing active) is not supported. Caller must
+ * stop tracing first (echo 0 > tracing_on).
+ *
+ * Caller is process context (typically sysfs write handler).
+ *
+ * Protocol:
+ *   1. Atomically claim reset rights via cmpxchg on @resetting.
+ *   2. Verify tracing is stopped on @smap->tr; if not, release the
+ *      claim and return -EBUSY. The resetting flag itself blocks
+ *      any subsequent get_id() callers.
+ *   3. synchronize_rcu() drains in-flight get_id() callers from the
+ *      ftrace callback path (which runs preempt-disabled).
+ *   4. memset entries, elts, and counters.
+ *   5. Release the resetting flag with release semantics so any new
+ *      get_id() observes a fully cleared map.
+ */
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+	if (!smap)
+		return 0;
+
+	if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0)
+		return -EBUSY;
+
+	if (smap->tr && tracer_tracing_is_on(smap->tr)) {
+		atomic_set(&smap->resetting, 0);
+		return -EBUSY;
+	}
+
+	/*
+	 * synchronize_rcu() itself is a full barrier; no extra smp_mb()
+	 * is needed before it. It drains in-flight ftrace callbacks that
+	 * may have already passed the resetting check with the old value.
+	 */
+	synchronize_rcu();
+
+	/*
+	 * Take the reader_sem in exclusive mode. This serializes the
+	 * memset against any tracefs reader (seq_file iteration or
+	 * stack_map_bin snapshot) that may currently hold the rwsem
+	 * for read. synchronize_rcu() already drained the hot path;
+	 * this rwsem covers process-context readers that aren't
+	 * preempt-disabled.
+	 */
+	down_write(&smap->reader_sem);
+
+	memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+	memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts);
+
+	atomic_set(&smap->next_elt, 0);
+	{
+		int cpu;
+
+		for_each_possible_cpu(cpu) {
+			local_set(per_cpu_ptr(smap->successes, cpu), 0);
+			local_set(per_cpu_ptr(smap->drops, cpu), 0);
+		}
+	}
+
+	up_write(&smap->reader_sem);
+
+	/* Release resetting=0 so new get_id() observes a cleared map. */
+	atomic_set_release(&smap->resetting, 0);
+	return 0;
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries)
+{
+	u32 key_hash, idx, test_key, trace_len;
+	struct stackmap_entry *entry;
+	struct stackmap_elt *val;
+	int probes = 0;
+
+	/*
+	 * atomic_read_acquire() pairs with atomic_set_release() in the
+	 * reset path. This ensures that subsequent reads of entry->key
+	 * and entry->val are ordered after this check; without acquire,
+	 * the CPU would only have a control dependency, which orders
+	 * subsequent stores but not loads (per LKMM).
+	 */
+	if (!smap || !nr_entries || atomic_read_acquire(&smap->resetting))
+		return -EINVAL;
+	if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+		nr_entries = FTRACE_STACKMAP_MAX_DEPTH;
+
+	trace_len = nr_entries * sizeof(unsigned long);
+	/*
+	 * jhash2() requires the length in u32 units and the data to be
+	 * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+	 * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+	 * directly; the cast to u32* is safe because ips[] is naturally
+	 * aligned to sizeof(unsigned long) >= 4.
+	 */
+	key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+			  smap->hash_seed);
+	if (key_hash == 0)
+		key_hash = 1;	/* 0 means free slot */
+
+	idx = key_hash >> (32 - (smap->map_bits + 1));
+
+	while (probes < FTRACE_STACKMAP_MAX_PROBE) {
+		idx &= (smap->map_size - 1);
+		entry = &smap->entries[idx];
+		/*
+		 * READ_ONCE() to avoid LKMM data race with concurrent
+		 * cmpxchg(&entry->key, 0, key_hash) on this slot.
+		 */
+		test_key = READ_ONCE(entry->key);
+
+		if (test_key == key_hash) {
+			/*
+			 * smp_load_acquire pairs with smp_store_release in
+			 * the publisher below; ensures we see fully-formed
+			 * elt fields (nr, ips, ref_count) before dereference.
+			 */
+			val = smp_load_acquire(&entry->val);
+			/*
+			 * READ_ONCE(val->nr) keeps style consistent with
+			 * the seq_show / bin_open readers. nr is write-once
+			 * (set before publish, never modified afterwards),
+			 * so the load is data-race-free, but READ_ONCE
+			 * silences any analysis tool that flags a plain
+			 * read of a field that is also read under acquire
+			 * elsewhere.
+			 */
+			if (val && READ_ONCE(val->nr) == nr_entries &&
+			    memcmp(val->ips, ips, trace_len) == 0) {
+				atomic_inc(&val->ref_count);
+				local_inc(this_cpu_ptr(smap->successes));
+				return (int)idx;
+			}
+			/*
+			 * val == NULL: another CPU is mid-insert, or this
+			 * slot is "claimed but empty" (pool exhausted).
+			 * val != NULL but mismatch: 32-bit hash collision
+			 * with a different stack. In both cases, advance.
+			 */
+		} else if (!test_key) {
+			/*
+			 * Free slot: try to claim it.
+			 *
+			 * If two CPUs race here with the same key_hash
+			 * (same stack), one loses the cmpxchg, advances,
+			 * and may insert the same stack at a later slot.
+			 * This can produce a small number of duplicate
+			 * entries under heavy contention. The trade-off
+			 * is accepted to keep the hot path lock-free;
+			 * ref_count is split across the duplicates and
+			 * total memory cost is bounded by the element
+			 * pool size.
+			 */
+			if (cmpxchg(&entry->key, 0, key_hash) == 0) {
+				struct stackmap_elt *elt;
+
+				elt = stackmap_get_elt(smap);
+				if (!elt) {
+					/*
+					 * Pool exhausted. We claimed this
+					 * slot with cmpxchg but cannot fill
+					 * it. Leave key set so the slot
+					 * stays "claimed but empty"; future
+					 * lookups treat val==NULL as a miss
+					 * and probe past it. Cannot revert
+					 * key=0 without racing other CPUs.
+					 */
+					local_inc(this_cpu_ptr(smap->drops));
+					return -ENOSPC;
+				}
+
+				elt->nr = nr_entries;
+				atomic_set(&elt->ref_count, 1);
+				memcpy(elt->ips, ips, trace_len);
+
+				/*
+				 * Publish elt with release semantics so the
+				 * reader's smp_load_acquire can safely
+				 * dereference val->nr / val->ips.
+				 */
+				smp_store_release(&entry->val, elt);
+				local_inc(this_cpu_ptr(smap->successes));
+				return (int)idx;
+			}
+			/* cmpxchg failed; another CPU claimed this slot. */
+		}
+
+		idx++;
+		probes++;
+	}
+
+	local_inc(this_cpu_ptr(smap->drops));
+	return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+	struct ftrace_stackmap *smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+	u32 i;
+
+	if (!smap)
+		return NULL;
+	/*
+	 * Take the reader_sem to serialize against ftrace_stackmap_reset(),
+	 * which holds it for write while clearing the table. Released in
+	 * stackmap_seq_stop(), which seq_file calls regardless of whether
+	 * start() returned an element or NULL (per Documentation/filesystems
+	 * /seq_file.rst: "the iterator value returned by start() or next()
+	 * is guaranteed to be passed to a subsequent next() or stop()").
+	 */
+	down_read(&smap->reader_sem);
+	for (i = *pos; i < smap->map_size; i++) {
+		if (READ_ONCE(smap->entries[i].key) &&
+		    smp_load_acquire(&smap->entries[i].val)) {
+			*pos = i;
+			return &smap->entries[i];
+		}
+	}
+	return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+	u32 i;
+
+	if (!smap)
+		return NULL;
+	for (i = *pos + 1; i < smap->map_size; i++) {
+		if (READ_ONCE(smap->entries[i].key) &&
+		    smp_load_acquire(&smap->entries[i].val)) {
+			*pos = i;
+			return &smap->entries[i];
+		}
+	}
+	/*
+	 * Advance *pos past the end so that on the next read() the
+	 * subsequent stackmap_seq_start() call returns NULL and the
+	 * iteration terminates. Without this, seq_read() would loop
+	 * on the last element.
+	 */
+	*pos = smap->map_size;
+	return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+
+	/*
+	 * seq_file invokes stop() unconditionally after each iteration
+	 * pass (see seq_read_iter / traverse), even when start() returned
+	 * NULL. Always release here, balanced against the down_read in
+	 * stackmap_seq_start().
+	 */
+	if (smap)
+		up_read(&smap->reader_sem);
+}
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+	struct stackmap_entry *entry = v;
+	struct stackmap_elt *elt = smp_load_acquire(&entry->val);
+	struct stackmap_seq_private *priv = m->private;
+	u32 idx = entry - priv->smap->entries;
+	u32 i, nr;
+
+	if (!elt)
+		return 0;
+
+	nr = READ_ONCE(elt->nr);
+	if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+		nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+	seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+		   idx, atomic_read(&elt->ref_count), nr);
+	for (i = 0; i < nr; i++)
+		seq_printf(m, " [%u] %pS\n", i, (void *)elt->ips[i]);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+	.start	= stackmap_seq_start,
+	.next	= stackmap_seq_next,
+	.stop	= stackmap_seq_stop,
+	.show	= stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+	struct stackmap_seq_private *priv;
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open_private(file, &stackmap_seq_ops,
+			       sizeof(struct stackmap_seq_private));
+	if (ret)
+		return ret;
+	m = file->private_data;
+	priv = m->private;
+	priv->smap = inode->i_private;
+	return 0;
+}
+
+/*
+ * Accept exactly "0" or "reset" (optionally followed by a single newline).
+ */
+static bool stackmap_write_is_reset(const char *buf, size_t n)
+{
+	if (n > 0 && buf[n - 1] == '\n')
+		n--;
+	return (n == 1 && buf[0] == '0') ||
+	       (n == 5 && memcmp(buf, "reset", 5) == 0);
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+			      size_t count, loff_t *ppos)
+{
+	struct seq_file *m = file->private_data;
+	struct stackmap_seq_private *priv = m->private;
+	char buf[8];
+	size_t n = min(count, sizeof(buf) - 1);
+	int ret;
+
+	if (n == 0)
+		return -EINVAL;
+	if (copy_from_user(buf, ubuf, n))
+		return -EFAULT;
+	buf[n] = '\0';
+
+	if (!stackmap_write_is_reset(buf, n))
+		return -EINVAL;
+
+	/*
+	 * ftrace_stackmap_reset() atomically claims reset rights via
+	 * cmpxchg and returns -EBUSY if another reset is in progress
+	 * or if tracing is active.
+	 */
+	ret = ftrace_stackmap_reset(priv->smap);
+	if (ret)
+		return ret;
+	return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+	.open		= stackmap_open,
+	.read		= seq_read,
+	.write		= stackmap_write,
+	.llseek		= seq_lseek,
+	.release	= seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+	struct ftrace_stackmap *smap = m->private;
+	u64 successes = 0, drops = 0;
+	u32 entries;
+	int cpu;
+
+	if (!smap) {
+		seq_puts(m, "stackmap not initialized\n");
+		return 0;
+	}
+
+	entries = atomic_read(&smap->next_elt);
+	for_each_possible_cpu(cpu) {
+		successes += local_read(per_cpu_ptr(smap->successes, cpu));
+		drops += local_read(per_cpu_ptr(smap->drops, cpu));
+	}
+
+	seq_printf(m, "entries: %u / %u\n", entries, smap->max_elts);
+	seq_printf(m, "table_size: %u\n", smap->map_size);
+	seq_printf(m, "successes: %llu\n", successes);
+	seq_printf(m, "drops: %llu\n", drops);
+	if (successes + drops > 0)
+		seq_printf(m, "success_rate: %llu%%\n",
+			   successes * 100 / (successes + drops));
+	return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+	.open		= stackmap_stat_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+	/*
+	 * Use u64 (not size_t) so data[] is 8-byte aligned on both
+	 * 32-bit and 64-bit architectures. The IP array within data[]
+	 * is accessed as u64*, which would alignment-fault on strict
+	 * architectures (e.g. older ARM, SPARC) if data[] started at
+	 * a 4-byte boundary.
+	 */
+	u64	size;
+	char	data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+	struct ftrace_stackmap *smap = inode->i_private;
+	struct stackmap_bin_snapshot *snap;
+	struct ftrace_stackmap_bin_header *hdr;
+	size_t alloc_size, off;
+	u32 nr_entries, i, nr_stacks;
+
+	if (!smap)
+		return -ENODEV;
+
+	/*
+	 * Worst-case allocation size: every populated entry uses a
+	 * full-depth stack. The (+1) gives one slack slot in case a
+	 * concurrent insert lands between this snapshot and iteration.
+	 * The loop below performs an explicit bounds check anyway.
+	 *
+	 * At bits=18 this caps at ~135 MB. The file is mode 0440
+	 * (TRACE_MODE_READ), so only privileged users can open it.
+	 */
+	nr_entries = atomic_read(&smap->next_elt);
+	alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+		     (sizeof(struct ftrace_stackmap_bin_entry) +
+		      FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+
+	snap = vmalloc(sizeof(*snap) + alloc_size);
+	if (!snap)
+		return -ENOMEM;
+
+	hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+	hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+	hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+	hdr->reserved = 0;
+	off = sizeof(*hdr);
+	nr_stacks = 0;
+
+	/*
+	 * Take reader_sem to serialize against ftrace_stackmap_reset(),
+	 * which clears the table and elt pool under the write lock.
+	 */
+	down_read(&smap->reader_sem);
+
+	for (i = 0; i < smap->map_size; i++) {
+		struct stackmap_entry *entry = &smap->entries[i];
+		struct stackmap_elt *elt;
+		struct ftrace_stackmap_bin_entry *e;
+		u64 *ips_out;
+		u32 k, nr;
+
+		if (!READ_ONCE(entry->key))
+			continue;
+		elt = smp_load_acquire(&entry->val);
+		if (!elt)
+			continue;
+
+		nr = READ_ONCE(elt->nr);
+		if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+			nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+		/* Bounds check: stop if we would overflow the allocation. */
+		if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size)
+			break;
+
+		e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+		e->stack_id = i;
+		e->nr = nr;
+		e->ref_count = atomic_read(&elt->ref_count);
+		e->reserved = 0;
+		off += sizeof(*e);
+
+		ips_out = (u64 *)(snap->data + off);
+		for (k = 0; k < nr; k++)
+			ips_out[k] = (u64)elt->ips[k];
+		off += nr * sizeof(u64);
+		nr_stacks++;
+	}
+
+	up_read(&smap->reader_sem);
+
+	hdr->nr_stacks = nr_stacks;
+	snap->size = off;
+	file->private_data = snap;
+	return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+				 size_t count, loff_t *ppos)
+{
+	struct stackmap_bin_snapshot *snap = file->private_data;
+
+	if (!snap)
+		return -EINVAL;
+	return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+	vfree(file->private_data);
+	return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+	.open = stackmap_bin_open,
+	.read = stackmap_bin_read,
+	.llseek = default_llseek,
+	.release = stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..2e82bd6fb1c3
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include
+#include
+
+#define FTRACE_STACKMAP_MAX_DEPTH 64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC 0x464D5342 /* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION 2
+
+struct ftrace_stackmap_bin_header {
+	u32 magic;
+	u32 version;
+	u32 nr_stacks;
+	u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+	u32 stack_id;
+	u32 nr;
+	u32 ref_count;
+	u32 reserved;
+	/* followed by u64 ips[nr] */
+};
+
+struct trace_array;
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries);
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *
+ftrace_stackmap_create(struct trace_array *tr) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+					 unsigned long *ips, unsigned int n)
+{ return -EOPNOTSUPP; }
+static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
-- 
2.34.1

From nobody Mon Jun 8 21:51:06 2026
From: Li Pengfei
To: mhiramat@kernel.org, rostedt@goodmis.org
Cc: linux-trace-kernel@vger.kernel.org, linux-kernel@vger.kernel.org, cmllamas@google.com, zhangbo56@xiaomi.com, Pengfei Li
Subject: [RFC PATCH v3 2/3] trace: integrate stackmap into ftrace stack recording path
Date: Tue, 26 May 2026 19:52:44 +0800
References: <20260514034916.2162517-1-lipengfei28@xiaomi.com>

From: Pengfei Li

Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.

Changes:
- New TRACE_STACK_ID in trace_type enum
- New stack_id_entry in trace_entries.h
- New TRACE_ITER(STACKMAP) trace option flag; when
  CONFIG_FTRACE_STACKMAP is disabled, TRACE_ITER_STACKMAP_BIT is
  defined as -1 so that TRACE_ITER(STACKMAP) evaluates to 0 (following
  the existing pattern used by TRACE_ITER_PROF_TEXT_OFFSET)
- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and ZEROED_TRACE_FLAGS
  so it is only exposed under the top-level trace instance, matching
  the convention already used for global-only options such as 'printk'
  and 'record-cmd'. Secondary instances under tracing/instances/*/ do
  not see the option at all, avoiding a confusing no-op.
- Modified __ftrace_trace_stack() to call ftrace_stackmap_get_id()
  when the stackmap option is active. If reserving a TRACE_STACK_ID
  ring-buffer slot fails after a successful get_id(), the path falls
  through to the full-stack recording so the event still gets a stack
  trace recorded.
- Stackmap pointer read with smp_load_acquire(), published with
  smp_store_release() to ensure proper initialization ordering
- NULL check on tr->stackmap is retained as defense-in-depth: events
  that fire before fs_initcall (when the map is created) or after a
  failed ftrace_stackmap_create() observe a NULL pointer and fall back
  to full stack recording without dereferencing it
- ftrace_stackmap_create() takes the owning trace_array so the
  stackmap can later check tracing state during reset
- Added stack_id print handler in trace_output.c
- Added TRACE_STACK_ID to trace_valid_entry() in trace_selftest.c so
  ftrace startup selftests don't reject the new entry type when the
  stackmap option is enabled

Fallback behavior: if stackmap returns an error (pool exhausted,
resetting, or NULL pointer), the full stack trace is recorded as
before -- no new failure modes introduced.

Per-instance stackmap support is left as a follow-up; gating the
option via TOP_LEVEL_TRACE_FLAGS makes the global-only scope explicit
at the tracefs interface rather than relying on a silent runtime
fallback.

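As a rough illustration of the fallback rule described above (a
userspace model, not kernel code), the decision can be sketched in a
few lines of Python. ToyStackMap and record_stack are hypothetical
names; the sequential ids are a simplification (the patch uses the
hash-table slot index as stack_id), and the extra fallback taken when
the ring-buffer reservation fails is left out:

```python
from typing import Dict, List, Optional, Tuple

Stack = Tuple[int, ...]
Record = Tuple[str, object]


class ToyStackMap:
    """Hypothetical bounded stack -> id map standing in for the stackmap."""

    def __init__(self, max_elts: int) -> None:
        self.max_elts = max_elts
        self.ids: Dict[Stack, int] = {}

    def get_id(self, stack: Stack) -> int:
        """Return the id for a stack, or -1 once the element pool is full."""
        sid = self.ids.get(stack)
        if sid is not None:
            return sid  # already known: deduplicated
        if len(self.ids) >= self.max_elts:
            return -1  # pool exhausted: caller must fall back
        sid = len(self.ids)
        self.ids[stack] = sid
        return sid


def record_stack(smap: Optional[ToyStackMap], stack: List[int],
                 ring: List[Record]) -> None:
    """Append a compact stack_id record, or the full stack on any failure."""
    if smap is not None:  # a missing map behaves like the NULL-pointer case
        sid = smap.get_id(tuple(stack))
        if sid >= 0:
            ring.append(("stack_id", sid))
            return
    ring.append(("stack", tuple(stack)))  # full-stack fallback


ring: List[Record] = []
smap = ToyStackMap(max_elts=1)
record_stack(smap, [0x10, 0x20], ring)  # new stack, stored as an id
record_stack(smap, [0x10, 0x20], ring)  # same stack, same id
record_stack(smap, [0x30], ring)        # pool full, full stack recorded
record_stack(None, [0x10, 0x20], ring)  # no map yet, full stack recorded
```

Every call appends exactly one record, which is the property the
commit message relies on: a stackmap failure changes the record type,
never whether a stack is recorded.
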
Usage:
  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

Signed-off-by: Pengfei Li
---
 kernel/trace/trace.c          | 78 ++++++++++++++++++++++++++++++++++-
 kernel/trace/trace.h          | 16 +++++++
 kernel/trace/trace_entries.h  | 15 +++++++
 kernel/trace/trace_output.c   | 23 +++++++++++
 kernel/trace/trace_selftest.c |  1 +
 5 files changed, 131 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..36120355e549 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
 
 #include "trace.h"
 #include "trace_output.h"
+#include "trace_stackmap.h"
 
 #ifdef CONFIG_FTRACE_STARTUP_TEST
 /*
@@ -509,12 +510,13 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
 /* trace_options that are only supported by global_trace */
 #define TOP_LEVEL_TRACE_FLAGS (TRACE_ITER(PRINTK) | \
 	TRACE_ITER(PRINTK_MSGONLY) | TRACE_ITER(RECORD_CMD) | \
-	TRACE_ITER(PROF_TEXT_OFFSET) | FPROFILE_DEFAULT_FLAGS)
+	TRACE_ITER(PROF_TEXT_OFFSET) | TRACE_ITER(STACKMAP) | \
+	FPROFILE_DEFAULT_FLAGS)
 
 /* trace_flags that are default zero for instances */
 #define ZEROED_TRACE_FLAGS \
 	(TRACE_ITER(EVENT_FORK) | TRACE_ITER(FUNC_FORK) | TRACE_ITER(TRACE_PRINTK) | \
-	 TRACE_ITER(COPY_MARKER))
+	 TRACE_ITER(COPY_MARKER) | TRACE_ITER(STACKMAP))
 
 /*
  * The global_trace is the descriptor that holds the top-level tracing
@@ -2184,6 +2186,49 @@ void __ftrace_trace_stack(struct trace_array *tr,
 	}
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	/*
+	 * If stackmap dedup is enabled, try to store only the stack_id
+	 * in the ring buffer instead of the full stack trace.
+	 */
+	if (tr->trace_flags & TRACE_ITER(STACKMAP)) {
+		struct ftrace_stackmap *smap;
+		struct stack_id_entry *sid_entry;
+		int sid;
+
+		smap = smp_load_acquire(&tr->stackmap);
+		if (!smap)
+			goto full_stack;
+
+		sid = ftrace_stackmap_get_id(smap, fstack->calls, nr_entries);
+		if (sid >= 0) {
+			event = __trace_buffer_lock_reserve(buffer,
+					TRACE_STACK_ID,
+					sizeof(*sid_entry), trace_ctx);
+			if (!event) {
+				/*
+				 * Could not reserve a TRACE_STACK_ID slot;
+				 * fall back to the full-stack path so the
+				 * event still gets a stack trace recorded.
+				 */
+				goto full_stack;
+			}
+			sid_entry = ring_buffer_event_data(event);
+			sid_entry->stack_id = sid;
+			/*
+			 * stack_id is a synthetic side-event attached to a
+			 * primary trace event that was already subject to
+			 * filtering. No per-event filter is defined for
+			 * TRACE_STACK_ID, so commit unconditionally.
+			 */
+			__buffer_unlock_commit(buffer, event);
+			goto out;
+		}
+		/* On stackmap failure, record the full stack instead. */
+	}
+full_stack:
+#endif
+
 	event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
 				    struct_size(entry, caller, nr_entries),
 				    trace_ctx);
@@ -9222,6 +9267,35 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
 			NULL, &tracing_dyn_info_fops);
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	{
+		struct ftrace_stackmap *smap;
+
+		smap = ftrace_stackmap_create(&global_trace);
+		if (!IS_ERR(smap)) {
+			/*
+			 * Use smp_store_release to ensure the stackmap
+			 * structure is fully initialized before publishing
+			 * the pointer to concurrent trace event readers.
+			 */
+			smp_store_release(&global_trace.stackmap, smap);
+			trace_create_file("stack_map", TRACE_MODE_WRITE, NULL,
+					  smap, &ftrace_stackmap_fops);
+			trace_create_file("stack_map_stat", TRACE_MODE_READ, NULL,
+					  smap, &ftrace_stackmap_stat_fops);
+			trace_create_file("stack_map_bin", TRACE_MODE_READ, NULL,
+					  smap, &ftrace_stackmap_bin_fops);
+		} else {
+			pr_warn("ftrace stackmap init failed, dedup disabled\n");
+			/*
+			 * global_trace is statically defined; its stackmap
+			 * field is zero-initialized via BSS, so leaving it
+			 * NULL ensures the smp_load_acquire() in
+			 * __ftrace_trace_stack() falls back to full stack.
+			 */
+		}
+	}
+#endif
 	create_trace_instances(NULL);
 
 	update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..7e7d5e5a35ff 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
 	TRACE_TIMERLAT,
 	TRACE_RAW_DATA,
 	TRACE_FUNC_REPEATS,
+	TRACE_STACK_ID,
 
 	__TRACE_LAST_TYPE,
 };
@@ -453,6 +454,9 @@ struct trace_array {
 	struct cond_snapshot *cond_snapshot;
 #endif
 	struct trace_func_repeats __percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+	struct ftrace_stackmap *stackmap;
+#endif
 	/*
 	 * On boot up, the ring buffer is set to the minimum size, so that
 	 * we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
 			  TRACE_GRAPH_RET); \
 		IF_ASSIGN(var, ent, struct func_repeats_entry, \
 			  TRACE_FUNC_REPEATS); \
+		IF_ASSIGN(var, ent, struct stack_id_entry, \
+			  TRACE_STACK_ID); \
 		__ftrace_bad_type(); \
 	} while (0)
 
@@ -1449,7 +1455,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 # define STACK_FLAGS
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS \
+		C(STACKMAP, "stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP_BIT -1
+#endif
+
 #ifdef CONFIG_FUNCTION_PROFILER
+
 # define PROFILER_FLAGS \
 		C(PROF_TEXT_OFFSET, "prof-text-offset"),
 # ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1521,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		FUNCTION_FLAGS \
 		FGRAPH_FLAGS \
 		STACK_FLAGS \
+		STACKMAP_FLAGS \
 		BRANCH_FLAGS \
 		PROFILER_FLAGS \
 		FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
 		 (void *)__entry->caller[6], (void *)__entry->caller[7])
 );
 
+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+	TRACE_STACK_ID,
+
+	F_STRUCT(
+		__field( int, stack_id )
+	),
+
+	F_printk("", __entry->stack_id)
+);
+
 /*
  * trace_printk entry:
  */
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
 	.funcs = &trace_user_stack_funcs,
 };
 
+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+					      int flags, struct trace_event *event)
+{
+	struct stack_id_entry *field;
+	struct trace_seq *s = &iter->seq;
+
+	trace_assign_type(field, iter->ent);
+	trace_seq_printf(s, "\n", field->stack_id);
+
+	return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+	.trace = trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+	.type = TRACE_STACK_ID,
+	.funcs = &trace_stack_id_funcs,
+};
+
 /* TRACE_HWLAT */
 static enum print_line_t
 trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
 	&trace_wake_event,
 	&trace_stack_event,
 	&trace_user_stack_event,
+	&trace_stack_id_event,
 	&trace_bputs_event,
 	&trace_bprint_event,
 	&trace_print_event,
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 929c84075315..0c97065b0d68 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -14,6 +14,7 @@ static inline int trace_valid_entry(struct trace_entry *entry)
 	case TRACE_CTX:
 	case TRACE_WAKE:
 	case TRACE_STACK:
+	case TRACE_STACK_ID:
 	case TRACE_PRINT:
 	case TRACE_BRANCH:
 	case TRACE_GRAPH_ENT:
-- 
2.34.1

From nobody Mon Jun 8 21:51:06 2026
From: Li Pengfei
To: mhiramat@kernel.org, rostedt@goodmis.org
Cc: linux-trace-kernel@vger.kernel.org, linux-kernel@vger.kernel.org, cmllamas@google.com, zhangbo56@xiaomi.com, Pengfei Li, kernel test robot
Subject: [RFC PATCH v3 3/3] trace: add documentation, selftest and tooling for stackmap
Date: Tue, 26 May 2026 19:52:45 +0800
References: <20260514034916.2162517-1-lipengfei28@xiaomi.com>

From: Pengfei Li

Add supporting files for the ftrace stackmap feature:

Documentation/trace/ftrace-stackmap.rst:
  Documentation covering design, usage, tracefs interface, binary
  format, and performance characteristics. Added to the 'Core Tracing
  Frameworks' toctree in Documentation/trace/index.rst.

  Documents:
  - Reset requires tracing to be stopped first
  - Boot-time activation via trace_options=stackmap
  - bits parameter range [10, 18] and worst-case memory usage
  - tracefs file modes (0640 / 0440)
  - Best-effort snapshot semantics for stack_map_bin
  - Counter naming: successes (events served), drops, success_rate
  - Gravestone amplification when the pool is exhausted

tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
  Functional selftest verifying:
  - stackmap tracefs nodes exist
  - enabling stackmap + stacktrace produces stack_id events
  - stack_map_stat shows non-zero successes and zero drops
  - reset clears entries when tracing is stopped
  - reset is rejected (-EBUSY) while tracing is active

  Test reads trace contents BEFORE switching back to the nop tracer
  (tracer_init() unconditionally calls tracing_reset_online_cpus(),
  which would empty the ring buffer). The function:tracer dependency
  is declared in '# requires:' so ftracetest skips on kernels without
  CONFIG_FUNCTION_TRACER instead of failing spuriously. An EXIT trap
  restores options/stackmap and options/stacktrace on any exit path.

tools/tracing/stackmap_dump.py:
  Python script to parse the binary stack_map_bin export. Features:
  - Automatic endianness detection via magic number
  - Batched addr2line via stdin (avoids ARG_MAX with large stacks)
  - JSON output mode
  - Top-N filtering by ref_count

  Binary format: all fields are native-endian. The parser detects
  byte order by reading the magic value (0x464D5342 = 'FSMB').

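A minimal sketch of the byte-order detection described above, assuming
only the layout given in this series (a 16-byte header of four u32
fields, then per-stack 16-byte entry headers each followed by nr u64
addresses). It is not the code in tools/tracing/stackmap_dump.py;
detect_byte_order and iter_stacks are hypothetical names:

```python
import struct
from typing import Iterator, List, Tuple

MAGIC = 0x464D5342  # FTRACE_STACKMAP_BIN_MAGIC from trace_stackmap.h
HEADER_LEN = 16     # magic, version, nr_stacks, reserved (u32 each)
ENTRY_LEN = 16      # stack_id, nr, ref_count, reserved (u32 each)


def detect_byte_order(blob: bytes) -> str:
    """Return '<' (little) or '>' (big), the struct-module order prefixes."""
    if len(blob) < HEADER_LEN:
        raise ValueError("file shorter than the 16-byte header")
    # Compare integers, not ASCII: the magic is not a byte palindrome,
    # so at most one of the two interpretations can match.
    for order in ("<", ">"):
        (magic,) = struct.unpack_from(order + "I", blob, 0)
        if magic == MAGIC:
            return order
    raise ValueError("bad magic: not a stack_map_bin export")


def iter_stacks(blob: bytes) -> Iterator[Tuple[int, int, List[int]]]:
    """Yield (stack_id, ref_count, ips) for every stack in the export."""
    order = detect_byte_order(blob)
    _magic, _version, nr_stacks, _rsvd = struct.unpack_from(order + "4I", blob, 0)
    off = HEADER_LEN
    for _ in range(nr_stacks):
        stack_id, nr, ref_count, _rsvd = struct.unpack_from(order + "4I", blob, off)
        off += ENTRY_LEN
        # The explicit '<' / '>' prefix disables padding, so the u64
        # array starts immediately after the entry header.
        ips = struct.unpack_from(f"{order}{nr}Q", blob, off)
        off += 8 * nr
        yield stack_id, ref_count, list(ips)
```

A truncated file surfaces as struct.error from unpack_from() rather
than as silently short data, which seems a reasonable default for a
best-effort snapshot.
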
Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-kbuild-all/202605160010.fakzGVVq-lkp@intel.com/
Signed-off-by: Pengfei Li
---
 Documentation/trace/ftrace-stackmap.rst       | 162 ++++++++++++++++++
 Documentation/trace/index.rst                 |   1 +
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 103 +++++++++++
 .../test.d/ftrace/stackmap-instance-gate.tc   |  42 +++++
 tools/tracing/stackmap_dump.py                | 150 ++++++++++++++++
 5 files changed, 458 insertions(+)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100755 tools/tracing/stackmap_dump.py

diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..191347be3664
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks
+  (default: 14 → 16384 stacks; valid range: 10-18).
+
+  At ``bits=18`` the kernel reserves roughly 130 MB of vmalloc memory
+  for the element pool. Each ``open()`` of ``stack_map_bin`` may
+  briefly allocate a similar amount for a snapshot. The cap is set
+  intentionally to bound memory usage.
+
+Usage
+=====
+
+Enable stack deduplication::
+
+  echo 1 > /sys/kernel/debug/tracing/options/stackmap
+  echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+  echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ```` instead of full stack traces::
+
+  sh-1234 [006] d.h.. 123.456789:
+
+To view the actual stacks::
+
+  cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+  stack_id 42 [ref 1337, depth 8]
+    [0] schedule+0x48/0xc0
+    [1] schedule_timeout+0x1c/0x30
+    ...
+
+To view statistics::
+
+  cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+  entries: 2500 / 16384
+  table_size: 32768
+  successes: 148923
+  drops: 0
+  success_rate: 100%
+
+To reset the stack map (tracing must be stopped first)::
+
+  echo 0 > /sys/kernel/debug/tracing/tracing_on
+  echo 0 > /sys/kernel/debug/tracing/stack_map
+
+Reset returns ``-EBUSY`` if tracing is currently active, or if another
+reset is already in progress.
+
+Boot-time activation
+====================
+
+The stackmap option can be enabled from the kernel command line::
+
+  trace_options=stackmap,stacktrace
+
+Trace events that fire before the tracefs filesystem is initialized
+(``fs_initcall`` time) fall back to recording full stack traces; once
+``ftrace_stackmap_create()`` runs, subsequent events are deduplicated.
+The crossover is automatic and lossless: no events are dropped, but
+early-boot stacks recorded before the crossover are not deduplicated.
+
+Tracefs Nodes
+=============
+
+The stack_map files are owned by root and not world-readable
+(``stack_map``: 0640; ``stack_map_stat`` and ``stack_map_bin``: 0440).
+
+``stack_map``
+  Text export of all deduplicated stacks with symbol resolution.
+  Writing ``0`` or ``reset`` clears all entries (only when tracing
+  is stopped).
+
+``stack_map_stat``
+  Statistics: entries (allocated unique stacks), table_size,
+  successes (events served), drops (events that fell back to
+  full-stack recording), and success_rate. Drops accumulate when
+  the element pool is exhausted; once that happens, slots that
+  won the cmpxchg but failed to allocate an element remain
+  "claimed but empty" and increase probe pressure for any future
+  insert hashing to the same bucket. Reset (when tracing is
+  stopped) clears these gravestones.
+
+``stack_map_bin``
+  Binary export for efficient userspace consumption. Format:
+
+  - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+  - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+  All fields are written in the kernel's native byte order.
+  Userspace tools detect endianness by reading the magic value.
+  Magic: ``0x464D5342`` ('FSMB'), Version: 2.
+
+  The export is a best-effort snapshot allocated at ``open()``;
+  concurrent inserts during the snapshot may be truncated. A
+  bounds check ensures no overflow.
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr. Cliff Click's non-blocking hash table
+algorithm:
+
+- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context
+- **Memory**: Pre-allocated element pool, zero allocation on the hot path
+  (no GFP_ATOMIC failures under memory pressure)
+- **Collision**: Linear probing with a 2x over-provisioned table; probe
+  length is bounded so worst-case insert/lookup is O(1)
+- **Scope**: Currently supports the global trace instance
+- **Hash**: 32-bit jhash with a per-instance random seed; full ``memcmp``
+  confirms matches
+
+Deduplication is best-effort, not strict: if two CPUs race in the
+insert path with the same ``key_hash`` (i.e. the same stack), the
+``cmpxchg`` loser advances by one slot and may insert the same stack
+again. Under heavy contention this can produce a small number of
+duplicate entries for the same stack; ``ref_count`` is then split
+across the duplicates. Total memory is still bounded by the element
+pool size, and lookup correctness is unaffected (each duplicate is
+a self-consistent entry with its own ``stack_id``). The trade-off is
+intentional and keeps the hot path lock-free.
+
+Performance
+===========
+
+Typical results on an aarch64 SMP system (function tracer, 2 seconds):
+
+- Unique stacks: ~3000
+- Dedup rate: 84-98% (depends on workload diversity)
+- Ring buffer savings: ~80% for stack data
+- Overhead per event: ~50ns (one jhash + hash table lookup)
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5d9bf4694d5d..ac8b1141c23a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -33,6 +33,7 @@ the Linux kernel.
    ftrace
    ftrace-design
    ftrace-uses
+   ftrace-stackmap
    kprobes
    kprobetrace
    fprobetrace
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
new file mode 100644
index 000000000000..18fa998ae460
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
@@ -0,0 +1,103 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap basic functionality
+# requires: stack_map options/stackmap function:tracer
+
+# Test that ftrace stackmap deduplication works:
+# 1. Enable stackmap + stacktrace options
+# 2. Run function tracer briefly
+# 3. Verify trace contains events (read BEFORE switching
+#    tracer back to nop, since tracer_init() resets the ring buffer)
+# 4. Verify stack_map has entries and zero drops
+# 5. Verify reset is rejected (-EBUSY) while tracing is active
+# 6. Verify reset clears the map when tracing is stopped
+
+fail() {
+    echo "FAIL: $1"
+    exit_fail
+}
+
+# Restore state on any exit (success, fail, or interrupt) so a
+# half-finished test does not leave stacktrace/stackmap enabled.
+cleanup() {
+    disable_tracing 2>/dev/null
+    echo nop > current_tracer 2>/dev/null
+    echo 0 > options/stackmap 2>/dev/null
+    echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+
+# Read trace contents NOW, before switching tracer back to nop.
+# tracer_init() unconditionally calls tracing_reset_online_cpus(),
+# so the ring buffer would be empty after 'echo nop > current_tracer'.
+count=$(grep -c " events"
+fi
+
+# Now safe to switch back and disable options
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries:=0}"
+if [ "$entries" -eq 0 ]; then
+    fail "stackmap has zero entries after tracing"
+fi
+
+successes=$(cat stack_map_stat | grep "^successes:" | awk '{print $2}')
+: "${successes:=0}"
+if [ "$successes" -eq 0 ]; then
+    fail "stackmap has zero successes"
+fi
+
+drops=$(cat stack_map_stat | grep "^drops:" | awk '{print $2}')
+: "${drops:=0}"
+if [ "$drops" -ne 0 ]; then
+    fail "stackmap had $drops drops (pool exhausted?)"
+fi
+
+# Check stack_map text output is parseable
+first_id=$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}')
+if [ -z "$first_id" ]; then
+    fail "stack_map output has no stack_id entries"
+fi
+
+# Test that reset is rejected while tracing is active
+enable_tracing
+if echo 0 > stack_map 2>/dev/null; then
+    disable_tracing
+    fail "stackmap reset should fail while tracing is active"
+fi
+disable_tracing
+
+# Test reset works when tracing is stopped
+echo 0 > stack_map
+entries_after=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries_after:=-1}"
+if [ "$entries_after" -ne 0 ]; then
+    fail "stackmap reset did not clear entries (got $entries_after)"
+fi
+
+echo "stackmap basic test passed: $entries unique stacks, $successes successes, $drops drops"
+exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
new file mode 100644
index 000000000000..49848eac2624
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
@@ -0,0 +1,42 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap option is gated to the top-level trace instance
+# requires: stack_map options/stackmap instances
+
+# The 'stackmap' option is added to TOP_LEVEL_TRACE_FLAGS, matching the
+# convention used for global-only options like 'printk' and 'record-cmd'.
+# Verify that:
+# 1. The global instance exposes options/stackmap and the stack_map* nodes.
+# 2. A newly created secondary instance under instances/ does NOT expose
+#    options/stackmap or stack_map* nodes.
+
+fail() {
+    echo "FAIL: $1"
+    rmdir instances/test_stackmap_gate 2>/dev/null
+    exit_fail
+}
+
+# 1. Global instance must expose the option and the nodes
+test -e options/stackmap || fail "options/stackmap missing on global instance"
+test -e stack_map || fail "stack_map missing on global instance"
+test -e stack_map_stat || fail "stack_map_stat missing on global instance"
+test -e stack_map_bin || fail "stack_map_bin missing on global instance"
+
+# 2. Create a secondary instance and verify it does NOT see the option
+#    or the stack_map* nodes.
+mkdir instances/test_stackmap_gate || fail "could not create secondary instance"
+
+if [ -e instances/test_stackmap_gate/options/stackmap ]; then
+    fail "secondary instance unexpectedly exposes options/stackmap"
+fi
+
+for f in stack_map stack_map_stat stack_map_bin; do
+    if [ -e instances/test_stackmap_gate/$f ]; then
+        fail "secondary instance unexpectedly has $f"
+    fi
+done
+
+rmdir instances/test_stackmap_gate || fail "could not remove secondary instance"
+
+echo "stackmap option gating to top-level instance works"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..fcd8ddcd97de
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,150 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+    # Pull from device and parse
+    adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+    python3 stackmap_dump.py /tmp/stack_map.bin
+
+    # With vmlinux for offline symbol resolution
+    python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+    # JSON output for tooling
+    python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x464D5342  # 'FSMB'
+HEADER_SIZE = 16  # 4 x u32
+ENTRY_SIZE = 16  # 4 x u32
+
+
+def detect_endianness(data):
+    """Detect byte order from magic number in header."""
+    if len(data) < 4:
+        raise ValueError("File too small")
+    magic_le = struct.unpack_from('<I', data, 0)[0]
+    if magic_le == MAGIC:
+        return '<'
+    magic_be = struct.unpack_from('>I', data, 0)[0]
+    if magic_be == MAGIC:
+        return '>'
+    raise ValueError(f"Bad magic: 0x{magic_le:08x} (neither LE nor BE)")
+
+
+def batch_addr2line(vmlinux, addrs):
+    """Resolve multiple addresses in one addr2line invocation."""
+    if not addrs:
+        return {}
+    try:
+        # Feed addresses on stdin to avoid ARG_MAX limits with large
+        # numbers of addresses (one stack can have 30+ frames; a
+        # snapshot can have thousands of unique stacks).
+        stdin = '\n'.join(hex(a) for a in addrs) + '\n'
+        result = subprocess.run(
+            ['addr2line', '-f', '-e', vmlinux],
+            input=stdin, capture_output=True, text=True, timeout=60
+        )
+        lines = result.stdout.split('\n')
+        # addr2line outputs 2 lines per address: function name + source location
+        symbols = {}
+        for i, addr in enumerate(addrs):
+            idx = i * 2
+            if idx < len(lines) and lines[idx] and lines[idx] != '??':
+                symbols[addr] = lines[idx]
+        return symbols
+    except (subprocess.TimeoutExpired, FileNotFoundError) as e:
+        print(f"warning: addr2line failed: {e}", file=sys.stderr)
+        return {}
+
+
+def parse_stackmap_bin(data):
+    """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+    if len(data) < HEADER_SIZE:
+        raise ValueError("File too small for header")
+
+    endian = detect_endianness(data)
+    header_fmt = f'{endian}IIII'
+    entry_fmt = f'{endian}IIII'
+
+    magic, version, nr_stacks, _ = struct.unpack_from(header_fmt, data, 0)
+    if version != 2:
+        raise ValueError(f"Unsupported version: {version}")
+
+    offset = HEADER_SIZE
+    for _ in range(nr_stacks):
+        if offset + ENTRY_SIZE > len(data):
+            break
+        stack_id, nr, ref_count, _ = struct.unpack_from(entry_fmt, data, offset)
+        offset += ENTRY_SIZE
+
+        ips_size = nr * 8
+        if offset + ips_size > len(data):
+            break
+        ips = struct.unpack_from(f'{endian}{nr}Q', data, offset)
+        offset += ips_size
+
+        yield stack_id, ref_count, list(ips)
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+    parser.add_argument('file', help='Path to stack_map_bin file')
+    parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+    parser.add_argument('--json', action='store_true', help='JSON output')
+    parser.add_argument('--top', type=int, default=0,
+                        help='Show only top N stacks by ref_count')
+    args = parser.parse_args()
+
+    with open(args.file, 'rb') as f:
+        data = f.read()
+
+    stacks = list(parse_stackmap_bin(data))
+
+    if args.top > 0:
+        stacks.sort(key=lambda x: x[1], reverse=True)
+        stacks = stacks[:args.top]
+
+    # Batch symbol resolution
+    symbols = {}
+    if args.vmlinux:
+        all_addrs = set()
+        for _, _, ips in stacks:
+            all_addrs.update(ips)
+        symbols = batch_addr2line(args.vmlinux, list(all_addrs))
+
+    if args.json:
+        output = []
+        for stack_id, ref_count, ips in stacks:
+            entry = {
+                'stack_id': stack_id,
+                'ref_count': ref_count,
+                'ips': [f'0x{ip:x}' for ip in ips]
+            }
+            if args.vmlinux:
+                entry['symbols'] = [symbols.get(ip, f'0x{ip:x}')
+                                    for ip in ips]
+            output.append(entry)
+        print(json.dumps(output, indent=2))
+    else:
+        for stack_id, ref_count, ips in stacks:
+            print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+            for i, ip in enumerate(ips):
+                sym = symbols.get(ip, '')
+                if sym:
+                    sym = f' {sym}'
+                print(f"  [{i}] 0x{ip:x}{sym}")
+            print()
+
+    print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+    main()
-- 
2.34.1
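
To exercise a stack_map_bin parser without a device, it may help to build a synthetic blob. The sketch below follows the version 2 layout given in the documentation hunk (16-byte header, then per stack a 16-byte record followed by nr u64 addresses) and round-trips it in both byte orders. The helper names build_blob() and read_blob() and the sample addresses are hypothetical and not part of the patch; the kernel side is not modelled.

```python
import struct

MAGIC = 0x464D5342  # magic value as stated in the documentation hunk
VERSION = 2


def build_blob(stacks, endian='<'):
    """Pack [(stack_id, ref_count, [ips])] into the documented layout.

    Hypothetical test helper: header is magic, version, nr_stacks, reserved;
    each record is stack_id, nr, ref_count, reserved, then nr u64 values.
    """
    out = struct.pack(f'{endian}IIII', MAGIC, VERSION, len(stacks), 0)
    for stack_id, ref_count, ips in stacks:
        out += struct.pack(f'{endian}IIII', stack_id, len(ips), ref_count, 0)
        out += struct.pack(f'{endian}{len(ips)}Q', *ips)
    return out


def read_blob(data):
    """Decode a blob, using the byte order in which the magic matches."""
    for endian in ('<', '>'):
        if struct.unpack_from(f'{endian}I', data, 0)[0] == MAGIC:
            break
    else:
        raise ValueError('bad magic')
    _, version, nr_stacks, _ = struct.unpack_from(f'{endian}IIII', data, 0)
    if version != VERSION:
        raise ValueError(f'unsupported version {version}')
    offset, stacks = 16, []
    for _ in range(nr_stacks):
        if offset + 16 > len(data):  # truncated record header: stop early
            break
        stack_id, nr, ref_count, _ = struct.unpack_from(
            f'{endian}IIII', data, offset)
        offset += 16
        if offset + nr * 8 > len(data):  # truncated address list: stop early
            break
        ips = list(struct.unpack_from(f'{endian}{nr}Q', data, offset))
        offset += nr * 8
        stacks.append((stack_id, ref_count, ips))
    return stacks


# Made-up sample data: two stacks with two and one frames respectively.
sample = [
    (1, 42, [0xffff800080010000, 0xffff800080020000]),
    (2, 7, [0xffff800080030000]),
]
for order in ('<', '>'):
    assert read_blob(build_blob(sample, order)) == sample
```

Writing build_blob(sample, '>') to a file gives a big-endian input for stackmap_dump.py on a little-endian host, which is one way to check the magic-based byte order detection.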