From: Li Pengfei
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
	lipengfei28@xiaomi.com, zhangbo56@xiaomi.com
Subject: [RFC PATCH v4 1/3] trace: add lock-free stackmap for stack trace deduplication
Date: Tue, 16 Jun 2026 14:41:17 +0800
Message-Id: <20260616064119.438063-2-lipengfei28@xiaomi.com>
In-Reply-To: <20260616064119.438063-1-lipengfei28@xiaomi.com>
References: <20260616064119.438063-1-lipengfei28@xiaomi.com>

From: Pengfei Li

Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full stack
traces (80-160 bytes each) in the ring buffer for every event, ftrace
can store a 4-byte stack_id when the stackmap option is enabled.

The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:

  - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
  - Pre-allocated element pool (zero allocation on hot path)
  - Linear probing with 2x over-provisioned table; probe length is
    bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup
    is O(1) even when the table is heavily loaded with
    claimed-but-empty slots from pool exhaustion
  - Single global instance (initialized for the global trace array)

The Kconfig depends on ARCH_HAVE_NMI_SAFE_CMPXCHG, matching the
existing tracing_map / hist_triggers requirement: the lock-free hot
path uses cmpxchg in a context that may be reached from NMI.

The stackmap is exported via three tracefs nodes:

  - stack_map: text export with symbol resolution (mode 0640)
  - stack_map_stat: counters (entries, successes, drops, success_rate)
  - stack_map_bin: binary export (magic 0x46534D42 'FSMB', version 1,
    all fields native-endian)

ftrace_stackmap_get_id() never truncates: a stack deeper than
FTRACE_STACKMAP_MAX_DEPTH (64) returns -E2BIG so the caller records a
full stack instead. This prevents two distinct traces that share their
first 64 frames from being merged into one stack_id.

Hot-path counters use per-CPU local_t (NMI-safe single-instruction
increments) instead of atomic64_t. atomic64_t falls back to
raw_spinlock_t-based emulation on 32-bit GENERIC_ATOMIC64 systems,
which would deadlock if an NMI hit while the spinlock was held.
local_t avoids this hazard.

All counters saturate rather than wrap on long (from-boot, multi-hour)
traces: ref_count via atomic_add_unless(.., INT_MAX) and
successes/drops via local_add_unless(.., LONG_MAX).

Reset semantics:

  - Reset is a control-path operation only allowed when tracing is
    stopped on the owning trace_array. Online reset (with tracing
    active) is intentionally not supported.
  - Reset is destructive: under the reader_sem write lock it clears
    the owning trace_array's ring buffer (and snapshot buffer) BEFORE
    the map, so an external observer never sees "trace still has but
    the map is already empty". The buffers are cleared with
    tracing_reset_all_cpus() rather than _online_cpus() so a
    TRACE_STACK_ID written by a now-offline CPU cannot survive a reset.
  - Reset uses atomic_cmpxchg() to claim the resetting flag, then
    verifies tracer_tracing_is_on() returns false.
  - synchronize_rcu() drains in-flight get_id() callers from the
    ftrace callback path (which runs preempt-disabled).
  - The reader_sem (rw_semaphore) serializes the clearing against
    tracefs readers (seq_file iteration and stack_map_bin snapshot),
    which run in process context and aren't covered by
    synchronize_rcu(). Readers take it shared, reset takes it
    exclusive, so a reset cannot tear an iteration in progress. The
    hot path doesn't take this lock.
  - Reset clears the resetting flag with atomic_set_release() so a
    subsequent get_id() observes a fully cleared map.
  - get_id() uses atomic_read_acquire() on resetting so subsequent
    loads of entry->key/val are properly ordered after the check
    (control dependencies only order stores per LKMM).
  - Concurrent reset, or reset while tracing is active, returns -EBUSY.

Concurrency notes:

  - entry->val publication uses smp_store_release() paired with
    smp_load_acquire() in all dereferencing readers.
  - entry->key reads (in get_id, seq_start/next, bin_open) use
    READ_ONCE() to avoid LKMM data races with the cmpxchg writer.
  - elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before
    use in seq_show and bin_open.
  - Pool exhaustion: stackmap_get_elt() short-circuits via
    atomic_read() before the contended atomic RMW, avoiding cacheline
    contention once the pool is full. Slots that win cmpxchg but
    cannot get an elt are left 'claimed but empty'; subsequent lookups
    treat val==NULL as a miss and probe past them.
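
As a rough userspace illustration of the pool-exhaustion and
publication notes above (this is not the patch's code): the sketch
below uses C11 atomics in place of the kernel primitives. POOL_SIZE,
struct pool, struct slot, pool_get(), slot_publish() and slot_depth()
are hypothetical names chosen for the example, and the
compare-exchange loop only approximates atomic_fetch_add_unless().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define POOL_SIZE 4	/* hypothetical tiny pool, for illustration only */

struct elt {
	unsigned int nr;		/* payload written before publication */
};

struct pool {
	atomic_int next;		/* index of the next unused element */
	struct elt elts[POOL_SIZE];
};

struct slot {
	_Atomic(struct elt *) val;	/* NULL until fully published */
};

/* Hand out the next element, or NULL once the pool is exhausted. */
static struct elt *pool_get(struct pool *p)
{
	int idx = atomic_load_explicit(&p->next, memory_order_relaxed);

	/*
	 * The plain load above is the cheap early-out: once the pool
	 * is full, the loop body (the contended RMW) never runs.
	 */
	while (idx < POOL_SIZE) {
		/* On failure idx is refreshed with the current value. */
		if (atomic_compare_exchange_weak_explicit(&p->next, &idx,
							  idx + 1,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return &p->elts[idx];
	}
	return NULL;
}

/* Writer: initialize the payload first, then publish with release. */
static void slot_publish(struct slot *s, struct elt *e, unsigned int nr)
{
	assert(e != NULL);
	e->nr = nr;
	atomic_store_explicit(&s->val, e, memory_order_release);
}

/* Reader: acquire load; an unpublished slot (NULL) counts as a miss. */
static int slot_depth(struct slot *s)
{
	struct elt *e = atomic_load_explicit(&s->val, memory_order_acquire);

	return e ? (int)e->nr : -1;
}
```

A reader that observes a non-NULL pointer through the acquire load is
guaranteed to see the payload written before the release store; a
slot whose owner could not obtain an element simply stays NULL and is
skipped.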

Hash key:

  - Per-instance random seed stored in the stackmap struct (no global
    state), seeded at create time.
  - 32-bit jhash is forced to 1 if it lands on 0 (which is the
    free-slot sentinel). Full memcmp confirms matches.

Memory:

  - Single flat vmalloc for the element pool (no per-elt kzalloc).
  - bits parameter clamped to [10, 18]: at the maximum bits=18, the
    element pool is ~135 MB and a stack_map_bin snapshot may briefly
    allocate another ~135 MB.
  - struct stackmap_bin_snapshot uses u64 (not size_t) for its size
    field so data[] is 8-byte aligned on both 32-bit and 64-bit
    architectures, avoiding alignment faults when writing u64 IPs on
    strict-alignment architectures.

Kernel command line parameter:

  - ftrace_stackmap.bits=N: set map capacity (2^N unique stacks,
    range 10-18, default 14)

Signed-off-by: Pengfei Li
---
 kernel/trace/Kconfig          |  22 +
 kernel/trace/Makefile         |   1 +
 kernel/trace/trace_stackmap.c | 889 ++++++++++++++++++++++++++++++++++
 kernel/trace/trace_stackmap.h |  57 +++
 4 files changed, 969 insertions(+)
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..e49cae886ff0 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,28 @@ config STACK_TRACER
 
 	  Say N if unsure.
 
+config FTRACE_STACKMAP
+	bool "Ftrace stack map deduplication"
+	depends on TRACING
+	depends on STACKTRACE
+	depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select KALLSYMS
+	help
+	  This enables a global stack trace hash table for ftrace, inspired
+	  by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+	  only a stack_id in the ring buffer instead of the full stack trace,
+	  significantly reducing trace buffer usage when the same call stacks
+	  appear repeatedly.
+
+	  The deduplicated stacks are exported via:
+	    /sys/kernel/debug/tracing/stack_map
+
+	  Writing to this file resets the stack map. Reading shows all unique
+	  stacks with their stack_id and reference count.
+
+	  Say Y if you want to reduce ftrace buffer usage for stack traces.
+	  Say N if unsure.
+
 config TRACE_PREEMPT_TOGGLE
 	bool
 	help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 8d3d96e847d8..c2d9b2bf895a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
 obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
 obj-$(CONFIG_NOP_TRACER) += trace_nop.o
 obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
 obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
 obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..9e9fdf85071d
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,889 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ *   - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
+ *   - Pre-allocated element pool (zero allocation on hot path)
+ *   - Linear probing with 2x over-provisioned table; probe length
+ *     bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup
+ *     cost constant even when the table is heavily loaded
+ *   - Single global instance (initialized for the global trace array)
+ *
+ * Reset is a control-path operation, only allowed when tracing is
+ * stopped on the owning trace_array. The protocol is:
+ *
+ *   - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights
+ *     and blocks new get_id() callers (they observe resetting=1 and
+ *     return -EINVAL).
+ *   - trace_types_lock serializes the tracer_tracing_is_on() check and
+ *     the destructive ring-buffer reset against tracefs writes to
+ *     tracing_on.
+ *   - synchronize_rcu() drains in-flight get_id() callers from the
+ *     ftrace callback path, which runs with preemption disabled.
+ *
+ * Online reset (with tracing active) is intentionally not supported
+ * to keep the design simple and the proof obligations small.
+ *
+ * The 32-bit jhash of the stack IPs is the hash table key. On hash
+ * collision, linear probing finds the next slot and full memcmp
+ * confirms the match.
+ *
+ * Concurrent userspace readers (cat stack_map / stack_map_bin) get
+ * a best-effort snapshot. They are coherent with the hot path
+ * (smp_load_acquire on entry->val); they are also serialized
+ * against reset via smap->reader_sem (readers take it in shared
+ * mode, reset in exclusive mode), so a reset cannot tear an
+ * iteration in progress -- it waits for active readers to drop
+ * the rwsem before clearing the map. The hot path is coordinated
+ * with reset separately, via acquire/release on smap->resetting.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Bound the linear-probe scan length. With a 2x over-provisioned table,
+ * a well-distributed hash gives very short probe chains. Capping at 64
+ * keeps worst-case lookup O(1) even when the table is heavily loaded
+ * with claimed-but-empty slots from pool exhaustion.
+ */
+#define FTRACE_STACKMAP_MAX_PROBE	64
+
+/*
+ * Memory ordering of entry->val: published with smp_store_release()
+ * by the inserter; consumed with smp_load_acquire() by every reader
+ * that dereferences the elt (get_id, seq_show, bin_open). This pairs
+ * the writes to elt->{nr,ips,ref_count} (initialized BEFORE the
+ * publish) with the reads of those fields (which happen AFTER the
+ * load). seq_start / seq_next only test val for NULL and use the
+ * acquire load purely to keep memory ordering symmetric.
+ */
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+	u32		nr;		/* actual number of IPs */
+	atomic_t	ref_count;
+	unsigned long	ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+	u32			key;	/* 0 = free, non-zero = jhash */
+	struct stackmap_elt	*val;	/* NULL until fully published */
+};
+
+static struct stackmap_elt *stackmap_load_elt(struct stackmap_entry *entry)
+{
+	/*
+	 * Pairs with the smp_store_release() that publishes entry->val
+	 * after fully initializing the element payload.
+	 */
+	return smp_load_acquire(&entry->val);
+}
+
+struct ftrace_stackmap {
+	struct trace_array	*tr;		/* owning trace_array */
+	unsigned int		map_bits;
+	unsigned int		map_size;	/* 1 << (map_bits + 1) */
+	unsigned int		max_elts;	/* 1 << map_bits */
+	u32			hash_seed;	/* per-instance jhash seed */
+	atomic_t		next_elt;	/* index into elts pool */
+	struct stackmap_entry	*entries;	/* hash table */
+	struct stackmap_elt	*elts;		/* flat element pool */
+	atomic_t		resetting;
+	/*
+	 * Reader/reset serialization. Held in shared mode (read lock)
+	 * across seq_file iteration and binary snapshot construction;
+	 * held in exclusive mode (write lock) by reset's clearing
+	 * phase. The hot path (get_id) does not take this lock -- it
+	 * uses smp_load_acquire/smp_store_release on entry->val and
+	 * the resetting flag for the lock-free protocol.
+	 */
+	struct rw_semaphore	reader_sem;
+	/*
+	 * Per-CPU counters using local_t. local_t increments are NMI-
+	 * safe on all architectures (single-instruction or interrupt-
+	 * masked) and avoid the raw_spinlock_t fallback that
+	 * atomic64_t uses on 32-bit GENERIC_ATOMIC64 -- which would
+	 * deadlock if an NMI hit while the spinlock was held.
+	 */
+	local_t __percpu	*successes;	/* events served (hits + new inserts) */
+	local_t __percpu	*drops;
+};
+
+/*
+ * Cap the bits parameter to keep worst-case allocations bounded:
+ * bits=18 → 256K elts, 512K slots, ~135 MB elt pool, ~135 MB bin
+ * export.
+ * Smaller workloads should use the default (14) which gives 16K elts
+ * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel
+ * parameter for higher unique-stack capacity.
+ */
+#define FTRACE_STACKMAP_BITS_MIN	10
+#define FTRACE_STACKMAP_BITS_MAX	18
+#define FTRACE_STACKMAP_BITS_DEFAULT	14
+
+static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT;
+static int __init stackmap_bits_setup(char *str)
+{
+	unsigned long val;
+
+	if (kstrtoul(str, 0, &val))
+		return -EINVAL;
+	val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX);
+	stackmap_map_bits = val;
+	return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+	int idx;
+
+	/*
+	 * Fast-path early-out once the pool is fully consumed. Avoids
+	 * the contended atomic RMW on next_elt for every traced event
+	 * after the pool is exhausted.
+	 */
+	if (atomic_read(&smap->next_elt) >= smap->max_elts)
+		return NULL;
+
+	idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+	if (idx < smap->max_elts)
+		return &smap->elts[idx];
+	return NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr)
+{
+	struct ftrace_stackmap *smap;
+	unsigned int bits;
+
+	smap = kzalloc_obj(*smap, GFP_KERNEL);
+	if (!smap)
+		return ERR_PTR(-ENOMEM);
+
+	/* Defensive clamp: reject bogus bits even if early_param is bypassed. */
+	bits = clamp_val(stackmap_map_bits,
+			 FTRACE_STACKMAP_BITS_MIN,
+			 FTRACE_STACKMAP_BITS_MAX);
+
+	smap->tr = tr;
+	smap->map_bits = bits;
+	smap->max_elts = 1U << bits;
+	smap->map_size = 1U << (bits + 1);	/* 2x over-provision */
+
+	smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+	if (!smap->entries) {
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * Single large vmalloc of the element pool, indexed flat.
+	 * At bits=18 this is 256K * sizeof(struct stackmap_elt). The
+	 * struct is ~520 B (4 + 4 + 64*8), so total ~135 MB.
+	 */
+	smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts);
+	if (!smap->elts) {
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	smap->successes = alloc_percpu(local_t);
+	if (!smap->successes) {
+		vfree(smap->elts);
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+	smap->drops = alloc_percpu(local_t);
+	if (!smap->drops) {
+		free_percpu(smap->successes);
+		vfree(smap->elts);
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	smap->hash_seed = get_random_u32();
+	atomic_set(&smap->next_elt, 0);
+	atomic_set(&smap->resetting, 0);
+	init_rwsem(&smap->reader_sem);
+
+	return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+	if (!smap || IS_ERR(smap))
+		return;
+	free_percpu(smap->drops);
+	free_percpu(smap->successes);
+	vfree(smap->elts);
+	vfree(smap->entries);
+	kfree(smap);
+}
+
+/**
+ * ftrace_stackmap_reset - clear all entries in the stackmap
+ * @smap: the stackmap to reset
+ *
+ * Returns 0 on success, or -EBUSY if another reset is already in
+ * progress or if tracing is currently active on the owning
+ * trace_array.
+ *
+ * Online reset (with tracing active) is not supported. Caller must
+ * stop tracing first (echo 0 > tracing_on).
+ *
+ * Caller is process context (typically the tracefs write handler).
+ *
+ * Protocol:
+ *   1. Atomically claim reset rights via cmpxchg on @resetting.
+ *   2. Take trace_types_lock to serialize against tracefs writes to
+ *      tracing_on.
+ *   3. Verify tracing is stopped on @smap->tr; if not, release the
+ *      claim and return -EBUSY. The resetting flag itself blocks
+ *      any subsequent get_id() callers.
+ *   4. synchronize_rcu() drains in-flight get_id() callers from the
+ *      ftrace callback path (which runs preempt-disabled).
+ *   5. Reset the ring buffer(s), then memset entries, elts, and
+ *      counters.
+ *   6. Release the resetting flag with release semantics so any new
+ *      get_id() observes a fully cleared map.
+ */
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+	struct trace_array *tr;
+	int ret = 0;
+
+	if (!smap)
+		return 0;
+
+	if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0)
+		return -EBUSY;
+
+	mutex_lock(&trace_types_lock);
+
+	tr = smap->tr;
+	if (tr && tracer_tracing_is_on(tr)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	/*
+	 * synchronize_rcu() itself is a full barrier; no extra smp_mb()
+	 * is needed before it. It drains in-flight ftrace callbacks that
+	 * may have already passed the resetting check with the old value.
+	 */
+	synchronize_rcu();
+
+	/*
+	 * Take the reader_sem in exclusive mode. This serializes the
+	 * memset against any tracefs reader (seq_file iteration or
+	 * stack_map_bin snapshot) that may currently hold the rwsem
+	 * for read. synchronize_rcu() already drained the hot path;
+	 * this rwsem covers process-context readers that aren't
+	 * preempt-disabled.
+	 */
+	down_write(&smap->reader_sem);
+
+	/*
+	 * Clear the ring buffer(s) BEFORE the map, both under the write
+	 * lock. The ring buffer may still hold TRACE_STACK_ID events
+	 * whose stack_id points at slots we are about to free/reuse.
+	 * Resetting the buffer first guarantees an external observer
+	 * never sees the inconsistent "trace still has but
+	 * the map is already empty" window: it sees either (old buffer,
+	 * old map) or (cleared buffer, old map) or (cleared buffer,
+	 * cleared map) -- never (old buffer, cleared map).
+	 *
+	 * Use tracing_reset_all_cpus() (not _online_cpus) so per-CPU
+	 * buffers belonging to currently offline CPUs are also cleared.
+	 * The ring buffer is allocated per-possible-CPU; an offline CPU's
+	 * buffer can still hold a TRACE_STACK_ID event written before
+	 * the CPU went offline. tracing_reset_online_cpus() iterates
+	 * for_each_online_buffer_cpu() and would leave that data behind
+	 * to be observed once the CPU comes back online (or by the
+	 * trace reader, which iterates all allocated CPU buffers),
+	 * recreating the stale-stack_id window we are trying to close.
+	 *
+	 * Since reset requires tracing to be stopped, this makes "reset"
+	 * an explicitly destructive operation on the owning trace_array,
+	 * keeping ring-buffer stack_ids and the map coherent.
+	 */
+	if (tr) {
+		tracing_reset_all_cpus(&tr->array_buffer);
+#ifdef CONFIG_TRACER_SNAPSHOT
+		if (tr->allocated_snapshot)
+			tracing_reset_all_cpus(&tr->snapshot_buffer);
+#endif
+	}
+
+	memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+	memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts);
+
+	atomic_set(&smap->next_elt, 0);
+	{
+		int cpu;
+
+		for_each_possible_cpu(cpu) {
+			local_set(per_cpu_ptr(smap->successes, cpu), 0);
+			local_set(per_cpu_ptr(smap->drops, cpu), 0);
+		}
+	}
+
+	up_write(&smap->reader_sem);
+
+out_unlock:
+	mutex_unlock(&trace_types_lock);
+
+	/* Release resetting=0 so new get_id() observes a cleared map. */
+	atomic_set_release(&smap->resetting, 0);
+	return ret;
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries)
+{
+	u32 key_hash, idx, test_key, trace_len;
+	struct stackmap_entry *entry;
+	struct stackmap_elt *val;
+	int probes = 0;
+
+	/*
+	 * atomic_read_acquire() pairs with atomic_set_release() in the
+	 * reset path. This ensures that subsequent reads of entry->key
+	 * and entry->val are ordered after this check; without acquire,
+	 * the CPU would only have a control dependency, which orders
+	 * subsequent stores but not loads (per LKMM).
+	 */
+	if (!smap || !nr_entries || atomic_read_acquire(&smap->resetting))
+		return -EINVAL;
+	/*
+	 * Never truncate: a stack deeper than the map can hold must not be
+	 * silently shortened, or two distinct traces sharing their first
+	 * FTRACE_STACKMAP_MAX_DEPTH frames would be merged into one
+	 * stack_id. The caller is expected to fall back to a full stack
+	 * trace for such events. Reject defensively in case of a future
+	 * caller that forgets this contract.
+	 */
+	if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+		return -E2BIG;
+
+	trace_len = nr_entries * sizeof(unsigned long);
+	/*
+	 * jhash2() requires the length in u32 units and the data to be
+	 * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+	 * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+	 * directly; the cast to u32* is safe because ips[] is naturally
+	 * aligned to sizeof(unsigned long) >= 4.
+	 */
+	key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+			  smap->hash_seed);
+	if (key_hash == 0)
+		key_hash = 1;	/* 0 means free slot */
+
+	idx = key_hash >> (32 - (smap->map_bits + 1));
+
+	while (probes < FTRACE_STACKMAP_MAX_PROBE) {
+		idx &= (smap->map_size - 1);
+		entry = &smap->entries[idx];
+		/*
+		 * READ_ONCE() to avoid LKMM data race with concurrent
+		 * cmpxchg(&entry->key, 0, key_hash) on this slot.
+		 */
+		test_key = READ_ONCE(entry->key);
+
+		if (test_key == key_hash) {
+			val = stackmap_load_elt(entry);
+			/*
+			 * READ_ONCE(val->nr) keeps style consistent with
+			 * the seq_show / bin_open readers. nr is write-once
+			 * (set before publish, never modified afterwards),
+			 * so the load is data-race-free, but READ_ONCE
+			 * silences any analysis tool that flags a plain
+			 * read of a field that is also read under acquire
+			 * elsewhere.
+			 */
+			if (val && READ_ONCE(val->nr) == nr_entries &&
+			    memcmp(val->ips, ips, trace_len) == 0) {
+				/*
+				 * ref_count is a best-effort popularity
+				 * counter. On a long (from-boot, multi-hour)
+				 * trace a hot stack can be hit billions of
+				 * times. atomic_add_unless() gives true
+				 * saturation at INT_MAX even under concurrent
+				 * hits on multiple CPUs (a plain
+				 * check-then-inc could let several CPUs past
+				 * the check near the cap and still wrap).
+				 */
+				atomic_add_unless(&val->ref_count, 1, INT_MAX);
+				/*
+				 * successes/drops are best-effort throughput
+				 * counters. Saturate at LONG_MAX so they do
+				 * not wrap on long runs (notably where local_t
+				 * is 32-bit), matching ref_count's behaviour.
+				 */
+				local_add_unless(this_cpu_ptr(smap->successes),
+						 1, LONG_MAX);
+				return (int)idx;
+			}
+			/*
+			 * val == NULL: another CPU is mid-insert, or this
+			 * slot is "claimed but empty" (pool exhausted).
+			 * val != NULL but mismatch: 32-bit hash collision
+			 * with a different stack. In both cases, advance.
+			 */
+		} else if (!test_key) {
+			/*
+			 * Free slot: try to claim it.
+			 *
+			 * If two CPUs race here with the same key_hash
+			 * (same stack), one loses the cmpxchg, advances,
+			 * and may insert the same stack at a later slot.
+			 * This can produce a small number of duplicate
+			 * entries under heavy contention. The trade-off
+			 * is accepted to keep the hot path lock-free;
+			 * ref_count is split across the duplicates and
+			 * total memory cost is bounded by the element
+			 * pool size.
+			 */
+			if (cmpxchg(&entry->key, 0, key_hash) == 0) {
+				struct stackmap_elt *elt;
+
+				elt = stackmap_get_elt(smap);
+				if (!elt) {
+					/*
+					 * Pool exhausted. We claimed this
+					 * slot with cmpxchg but cannot fill
+					 * it. Leave key set so the slot
+					 * stays "claimed but empty" -- future
+					 * lookups treat val==NULL as a miss
+					 * and probe past it. Cannot revert
+					 * key=0 without racing other CPUs.
+					 */
+					local_add_unless(this_cpu_ptr(smap->drops),
+							 1, LONG_MAX);
+					return -ENOSPC;
+				}
+
+				elt->nr = nr_entries;
+				atomic_set(&elt->ref_count, 1);
+				memcpy(elt->ips, ips, trace_len);
+
+				/*
+				 * Publish elt with release semantics so the
+				 * reader's smp_load_acquire can safely
+				 * dereference val->nr / val->ips.
+				 */
+				smp_store_release(&entry->val, elt);
+				local_add_unless(this_cpu_ptr(smap->successes),
+						 1, LONG_MAX);
+				return (int)idx;
+			}
+			/* cmpxchg failed; another CPU claimed this slot. */
+		}
+
+		idx++;
+		probes++;
+	}
+
+	local_add_unless(this_cpu_ptr(smap->drops), 1, LONG_MAX);
+	return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+	struct ftrace_stackmap *smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+	u32 i;
+
+	if (!smap)
+		return NULL;
+	/*
+	 * Take the reader_sem to serialize against ftrace_stackmap_reset(),
+	 * which holds it for write while clearing the table. Released in
+	 * stackmap_seq_stop(), which seq_file calls regardless of whether
+	 * start() returned an element or NULL (per Documentation/filesystems
+	 * /seq_file.rst: "the iterator value returned by start() or next()
+	 * is guaranteed to be passed to a subsequent next() or stop()").
+	 */
+	down_read(&smap->reader_sem);
+	for (i = *pos; i < smap->map_size; i++) {
+		if (READ_ONCE(smap->entries[i].key) &&
+		    stackmap_load_elt(&smap->entries[i])) {
+			*pos = i;
+			return &smap->entries[i];
+		}
+	}
+	return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+	u32 i;
+
+	if (!smap)
+		return NULL;
+	for (i = *pos + 1; i < smap->map_size; i++) {
+		if (READ_ONCE(smap->entries[i].key) &&
+		    stackmap_load_elt(&smap->entries[i])) {
+			*pos = i;
+			return &smap->entries[i];
+		}
+	}
+	/*
+	 * Advance *pos past the end so that on the next read() the
+	 * subsequent stackmap_seq_start() call returns NULL and the
+	 * iteration terminates. Without this, seq_read() would loop
+	 * on the last element.
+	 */
+	*pos = smap->map_size;
+	return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+
+	/*
+	 * seq_file invokes stop() unconditionally after each iteration
+	 * pass (see seq_read_iter / traverse), even when start() returned
+	 * NULL. Always release here, balanced against the down_read in
+	 * stackmap_seq_start().
+	 */
+	if (smap)
+		up_read(&smap->reader_sem);
+}
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+	struct stackmap_entry *entry = v;
+	struct stackmap_seq_private *priv = m->private;
+	struct stackmap_elt *elt;
+	u32 idx = entry - priv->smap->entries;
+	u32 i, nr;
+
+	elt = stackmap_load_elt(entry);
+	if (!elt)
+		return 0;
+
+	nr = READ_ONCE(elt->nr);
+	if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+		nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+	seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+		   idx, atomic_read(&elt->ref_count), nr);
+	for (i = 0; i < nr; i++) {
+		unsigned long ip = elt->ips[i];
+
+		/*
+		 * Mirror trace_stack_print(): __ftrace_trace_stack()
+		 * may replace trampoline addresses with
+		 * FTRACE_TRAMPOLINE_MARKER before the stack reaches the
+		 * map, and normal addresses must go through
+		 * trace_adjust_address() (KASLR / module text delta)
+		 * before symbolization. Without this the export would
+		 * print a bogus symbol for the marker and unadjusted
+		 * addresses for everything else.
+		 */
+		if (ip == FTRACE_TRAMPOLINE_MARKER) {
+			seq_printf(m, " [%u] [FTRACE TRAMPOLINE]\n", i);
+			continue;
+		}
+		seq_printf(m, " [%u] %pS\n", i,
+			   (void *)trace_adjust_address(priv->smap->tr, ip));
+	}
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+	.start	= stackmap_seq_start,
+	.next	= stackmap_seq_next,
+	.stop	= stackmap_seq_stop,
+	.show	= stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+	struct stackmap_seq_private *priv;
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open_private(file, &stackmap_seq_ops,
+			       sizeof(struct stackmap_seq_private));
+	if (ret)
+		return ret;
+	m = file->private_data;
+	priv = m->private;
+	priv->smap = inode->i_private;
+	return 0;
+}
+
+/*
+ * Accept exactly "0" or "reset" (optionally followed by a single newline).
+ */
+static bool stackmap_write_is_reset(const char *buf, size_t n)
+{
+	if (n > 0 && buf[n - 1] == '\n')
+		n--;
+	return (n == 1 && buf[0] == '0') ||
+	       (n == 5 && memcmp(buf, "reset", 5) == 0);
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+			      size_t count, loff_t *ppos)
+{
+	struct seq_file *m = file->private_data;
+	struct stackmap_seq_private *priv = m->private;
+	char buf[8];
+	size_t n = min(count, sizeof(buf) - 1);
+	int ret;
+
+	if (n == 0)
+		return -EINVAL;
+	if (copy_from_user(buf, ubuf, n))
+		return -EFAULT;
+	buf[n] = '\0';
+
+	if (!stackmap_write_is_reset(buf, n))
+		return -EINVAL;
+
+	/*
+	 * ftrace_stackmap_reset() atomically claims reset rights via
+	 * cmpxchg and returns -EBUSY if another reset is in progress
+	 * or if tracing is active.
+	 */
+	ret = ftrace_stackmap_reset(priv->smap);
+	if (ret)
+		return ret;
+	return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+	.open = stackmap_open,
+	.read = seq_read,
+	.write = stackmap_write,
+	.llseek = seq_lseek,
+	.release = seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+	struct ftrace_stackmap *smap = m->private;
+	u64 successes = 0, drops = 0;
+	u32 entries;
+	int cpu;
+
+	if (!smap) {
+		seq_puts(m, "stackmap not initialized\n");
+		return 0;
+	}
+
+	entries = atomic_read(&smap->next_elt);
+	for_each_possible_cpu(cpu) {
+		successes += local_read(per_cpu_ptr(smap->successes, cpu));
+		drops += local_read(per_cpu_ptr(smap->drops, cpu));
+	}
+
+	seq_printf(m, "entries: %u / %u\n", entries, smap->max_elts);
+	seq_printf(m, "table_size: %u\n", smap->map_size);
+	seq_printf(m, "successes: %llu\n", successes);
+	seq_printf(m, "drops: %llu\n", drops);
+	if (successes + drops > 0)
+		seq_printf(m, "success_rate: %llu%%\n",
+			   successes * 100 / (successes + drops));
+	return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct
file *file)
+{
+	return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+	.open = stackmap_stat_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+	/*
+	 * Use u64 (not size_t) so data[] is 8-byte aligned on both
+	 * 32-bit and 64-bit architectures. The IP array within data[]
+	 * is accessed as u64*, which would alignment-fault on strict
+	 * architectures (e.g. older ARM, SPARC) if data[] started at
+	 * a 4-byte boundary.
+	 */
+	u64 size;
+	char data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+	struct ftrace_stackmap *smap = inode->i_private;
+	struct stackmap_bin_snapshot *snap;
+	struct ftrace_stackmap_bin_header *hdr;
+	size_t alloc_size, off;
+	u32 nr_entries, i, nr_stacks;
+
+	if (!smap)
+		return -ENODEV;
+
+	/*
+	 * Worst-case allocation size: every populated entry uses a
+	 * full-depth stack. The (+1) gives one slack slot in case a
+	 * concurrent insert lands between this snapshot and iteration.
+	 * The loop below performs an explicit bounds check anyway.
+	 *
+	 * At bits=18 this caps at ~135 MB. The file is mode 0440
+	 * (TRACE_MODE_READ), so only privileged users can open it.
+	 */
+	nr_entries = atomic_read(&smap->next_elt);
+	alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+		     (sizeof(struct ftrace_stackmap_bin_entry) +
+		      FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+
+	snap = vmalloc(sizeof(*snap) + alloc_size);
+	if (!snap)
+		return -ENOMEM;
+
+	hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+	hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+	hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+	hdr->reserved = 0;
+	off = sizeof(*hdr);
+	nr_stacks = 0;
+
+	/*
+	 * Take reader_sem to serialize against ftrace_stackmap_reset(),
+	 * which clears the table and elt pool under the write lock.
+	 */
+	down_read(&smap->reader_sem);
+
+	for (i = 0; i < smap->map_size; i++) {
+		struct stackmap_entry *entry = &smap->entries[i];
+		struct stackmap_elt *elt;
+		struct ftrace_stackmap_bin_entry *e;
+		u64 *ips_out;
+		u32 k, nr;
+
+		if (!READ_ONCE(entry->key))
+			continue;
+		elt = stackmap_load_elt(entry);
+		if (!elt)
+			continue;
+
+		nr = READ_ONCE(elt->nr);
+		if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+			nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+		/* Bounds check: stop if we would overflow the allocation. */
+		if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size)
+			break;
+
+		e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+		e->stack_id = i;
+		e->nr = nr;
+		e->ref_count = atomic_read(&elt->ref_count);
+		e->reserved = 0;
+		off += sizeof(*e);
+
+		ips_out = (u64 *)(snap->data + off);
+		for (k = 0; k < nr; k++) {
+			unsigned long ip = elt->ips[k];
+
+			/*
+			 * Emit the trampoline marker verbatim so userspace
+			 * can render it as [FTRACE TRAMPOLINE]; pass every
+			 * other address through trace_adjust_address() so the
+			 * binary export follows the same address-adjustment
+			 * rules as the text export.
+			 */
+			if (ip == FTRACE_TRAMPOLINE_MARKER)
+				ips_out[k] = (u64)FTRACE_TRAMPOLINE_MARKER;
+			else
+				ips_out[k] = (u64)trace_adjust_address(smap->tr, ip);
+		}
+		off += nr * sizeof(u64);
+		nr_stacks++;
+	}
+
+	up_read(&smap->reader_sem);
+
+	hdr->nr_stacks = nr_stacks;
+	snap->size = off;
+	file->private_data = snap;
+	return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+				 size_t count, loff_t *ppos)
+{
+	struct stackmap_bin_snapshot *snap = file->private_data;
+
+	if (!snap)
+		return -EINVAL;
+	return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+	vfree(file->private_data);
+	return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+	.open = stackmap_bin_open,
+	.read = stackmap_bin_read,
+	.llseek = default_llseek,
+	.release = stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..7c2e5ab9d36d
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include
+#include
+
+#define FTRACE_STACKMAP_MAX_DEPTH 64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC 0x46534D42 /* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION 1
+
+struct ftrace_stackmap_bin_header {
+	u32 magic;
+	u32 version;
+	u32 nr_stacks;
+	u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+	u32 stack_id;
+	u32 nr;
+	u32 ref_count;
+	u32 reserved;
+	/* followed by u64 ips[nr] */
+};
+
+struct trace_array;
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries);
+int
ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *
+ftrace_stackmap_create(struct trace_array *tr) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+					 unsigned long *ips, unsigned int n)
+{ return -EOPNOTSUPP; }
+static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
-- 
2.34.1

From nobody Fri Jun 19 05:12:11 2026
From: Li Pengfei
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
 linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
 lipengfei28@xiaomi.com, zhangbo56@xiaomi.com
Subject: [RFC PATCH v4 2/3] trace: integrate stackmap into ftrace stack recording path
Date: Tue, 16 Jun 2026 14:41:18 +0800
Message-Id: <20260616064119.438063-3-lipengfei28@xiaomi.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260616064119.438063-1-lipengfei28@xiaomi.com>
References: <20260616064119.438063-1-lipengfei28@xiaomi.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding:
quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Pengfei Li

Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.

Changes:
- New TRACE_STACK_ID in trace_type enum and stack_id_entry in
  trace_entries.h.
- New TRACE_ITER(STACKMAP) trace option flag; when
  CONFIG_FTRACE_STACKMAP is disabled, TRACE_ITER_STACKMAP_BIT is
  defined as -1 so that TRACE_ITER(STACKMAP) evaluates to 0 (following
  the existing pattern used by TRACE_ITER_PROF_TEXT_OFFSET).
- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and ZEROED_TRACE_FLAGS
  so it is only exposed under the top-level trace instance, matching
  the convention already used for global-only options such as 'printk'
  and 'record-cmd'. Secondary instances under tracing/instances/*/ do
  not see the option in their options/ directory.
- set_tracer_flag() additionally rejects enabling STACKMAP on a
  secondary instance. The per-option file is hidden on secondary
  instances, but a write to the aggregate trace_options file still
  reaches set_tracer_flag(); without this check the bit could be
  accepted and then become a silent no-op in the hot path (where
  tr->stackmap is NULL). This closes the global-instance-only gate at
  the write path, not just in the tracefs layout.
- __ftrace_trace_stack() reserves the TRACE_STACK_ID ring-buffer slot
  BEFORE calling ftrace_stackmap_get_id(), so the map (and its
  ref_count / success counters) is only mutated when a ring-buffer
  event will actually reference the entry. If the reservation fails it
  falls back to a full stack; if get_id() fails it discards the
  reserved slot and falls back. A stack deeper than
  FTRACE_STACKMAP_MAX_DEPTH skips the map entirely (get_id() would
  return -E2BIG) and records a full stack, so deep traces are never
  truncated or merged.
- The stackmap pointer is read with smp_load_acquire() and published
  with smp_store_release() to ensure proper initialization ordering.
  The hot path falls back to a full stack whenever tr->stackmap is
  NULL.
- ftrace_stackmap_create() takes the owning trace_array so the
  stackmap can later clear that trace_array's buffers during reset.
- Add a stack_id print handler in trace_output.c and add
  TRACE_STACK_ID to trace_valid_entry() in trace_selftest.c so ftrace
  startup selftests accept the new entry type when the stackmap option
  is enabled.

Failure-atomic init and boot-time activation:
- The global stackmap and its tracefs files are created during
  tracer_init_tracefs(). stack_map is the single required file (it is
  both the resolver and the reset interface); it is created BEFORE the
  map pointer is published with smp_store_release(), so an observed
  non-NULL tr->stackmap implies the resolver/reset file exists. If
  stack_map cannot be created the map is destroyed and never
  published.
- A small init-state (PENDING / DONE / FAILED) lets set_tracer_flag()
  distinguish "not initialized yet" from "init failed". Boot-time
  options (trace_options=stackmap,stacktrace) are applied before the
  tracefs init work runs; the flag is allowed to be set while init is
  PENDING (the hot path falls back until the map is published, then
  the boot-set option takes effect), and is only rejected once init
  has permanently FAILED. On failure the STACKMAP flag is also cleared
  from the global instance so options/stackmap never reports an
  enabled no-op.

Fallback behavior: if stackmap returns an error (pool exhausted,
resetting, NULL pointer, or a too-deep stack), the full stack trace is
recorded as before, so no new failure modes are introduced.

Per-instance stackmap support is left as a follow-up; gating the
option to the global instance (both in the tracefs layout and at the
set_tracer_flag() write path) makes the global-only scope explicit.
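
The fallback order described above can be modelled in plain userspace C.
This is only a sketch under stated assumptions, not the kernel code:
struct model_state, choose_record() and get_id_calls are hypothetical
names invented for illustration, and MAX_DEPTH simply mirrors
FTRACE_STACKMAP_MAX_DEPTH (64). It shows that get_id() is reached only
after the ring-buffer reservation has succeeded, and that every failure
ends in a full stack record.

```c
#include <stdbool.h>

#define MAX_DEPTH 64	/* mirrors FTRACE_STACKMAP_MAX_DEPTH */

enum record_kind { RECORD_STACK_ID, RECORD_FULL_STACK };

/* Hypothetical stand-ins for the kernel state seen by the hot path. */
struct model_state {
	bool option_enabled;	/* TRACE_ITER(STACKMAP) is set */
	bool map_published;	/* tr->stackmap is non-NULL */
	bool reserve_ok;	/* TRACE_STACK_ID slot reservation succeeds */
	int get_id_result;	/* >= 0 is a stack_id, < 0 is an error */
	int get_id_calls;	/* how often the map lookup was reached */
};

static enum record_kind choose_record(struct model_state *st,
				      unsigned int nr_entries)
{
	if (!st->option_enabled || !st->map_published)
		return RECORD_FULL_STACK;
	/* Deep stacks bypass the map so they are never truncated. */
	if (nr_entries > MAX_DEPTH)
		return RECORD_FULL_STACK;
	/* Reserve first: a failed reservation leaves the map untouched. */
	if (!st->reserve_ok)
		return RECORD_FULL_STACK;
	st->get_id_calls++;
	/* On error the reserved slot is discarded and we fall back. */
	if (st->get_id_result < 0)
		return RECORD_FULL_STACK;
	return RECORD_STACK_ID;
}
```

In this model a failed reservation returns RECORD_FULL_STACK without
incrementing get_id_calls, which corresponds to the "map untouched"
guarantee in the change list.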

Usage:
  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

Signed-off-by: Pengfei Li
---
 kernel/trace/trace.c          | 216 +++++++++++++++++++++++++++++++++-
 kernel/trace/trace.h          |  17 +++
 kernel/trace/trace_entries.h  |  15 +++
 kernel/trace/trace_output.c   |  23 ++++
 kernel/trace/trace_selftest.c |   1 +
 5 files changed, 269 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..e00bee5d0e01 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
 
 #include "trace.h"
 #include "trace_output.h"
+#include "trace_stackmap.h"
 
 #ifdef CONFIG_FTRACE_STARTUP_TEST
 /*
@@ -509,12 +510,13 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
 /* trace_options that are only supported by global_trace */
 #define TOP_LEVEL_TRACE_FLAGS (TRACE_ITER(PRINTK) | \
 	TRACE_ITER(PRINTK_MSGONLY) | TRACE_ITER(RECORD_CMD) | \
-	TRACE_ITER(PROF_TEXT_OFFSET) | FPROFILE_DEFAULT_FLAGS)
+	TRACE_ITER(PROF_TEXT_OFFSET) | TRACE_ITER(STACKMAP) | \
+	FPROFILE_DEFAULT_FLAGS)
 
 /* trace_flags that are default zero for instances */
 #define ZEROED_TRACE_FLAGS \
 	(TRACE_ITER(EVENT_FORK) | TRACE_ITER(FUNC_FORK) | TRACE_ITER(TRACE_PRINTK) | \
-	 TRACE_ITER(COPY_MARKER))
+	 TRACE_ITER(COPY_MARKER) | TRACE_ITER(STACKMAP))
 
 /*
  * The global_trace is the descriptor that holds the top-level tracing
@@ -1562,7 +1564,7 @@ void tracing_reset_online_cpus(struct array_buffer *buf)
 	ring_buffer_record_enable(buffer);
 }
 
-static void tracing_reset_all_cpus(struct array_buffer *buf)
+void tracing_reset_all_cpus(struct array_buffer *buf)
 {
 	struct trace_buffer *buffer = buf->buffer;
 
@@ -2184,6 +2186,75 @@ void __ftrace_trace_stack(struct trace_array *tr,
 	}
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	/*
+	 * If stackmap dedup is enabled, try to store only the stack_id
+	 * in the ring buffer instead of the full stack trace.
+	 *
+	 * Reserve the TRACE_STACK_ID ring-buffer slot BEFORE inserting
+	 * into the stackmap. This guarantees the map is only mutated
+	 * (and its ref_count / success counters bumped) when a
+	 * ring-buffer event will actually reference the entry:
+	 * - reservation fails -> fall back to full stack, map untouched
+	 * - get_id() fails -> discard the reserved slot, fall back
+	 * so stack_map_stat counters stay consistent with what the ring
+	 * buffer holds, and a failed reservation never consumes a map
+	 * slot for an event that records a full stack anyway.
+	 */
+	if (tr->trace_flags & TRACE_ITER(STACKMAP)) {
+		struct ftrace_stackmap *smap;
+		struct stack_id_entry *sid_entry;
+		int sid;
+
+		/*
+		 * Pairs with the smp_store_release() that publishes the
+		 * fully initialized global stackmap at tracefs init.
+		 */
+		smap = smp_load_acquire(&tr->stackmap);
+		if (!smap)
+			goto full_stack;
+
+		/*
+		 * The stackmap stores at most FTRACE_STACKMAP_MAX_DEPTH
+		 * frames per entry. A deeper trace would be truncated, and
+		 * two distinct stacks that share the first MAX_DEPTH frames
+		 * would hash and compare equal, silently merging into one
+		 * stack_id. Keep the conservative full-stack path for deep
+		 * traces so no information is lost or misattributed.
+		 */
+		if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+			goto full_stack;
+
+		event = __trace_buffer_lock_reserve(buffer, TRACE_STACK_ID,
+						    sizeof(*sid_entry), trace_ctx);
+		if (!event)
+			goto full_stack;
+
+		sid = ftrace_stackmap_get_id(smap, fstack->calls, nr_entries);
+		if (sid < 0) {
+			/*
+			 * Pool exhausted or a reset is in progress. Discard
+			 * the reserved stack_id slot and record the full
+			 * stack instead, so the event still gets a trace.
+			 */
+			__trace_event_discard_commit(buffer, event);
+			goto full_stack;
+		}
+
+		sid_entry = ring_buffer_event_data(event);
+		sid_entry->stack_id = sid;
+		/*
+		 * stack_id is a synthetic side-event attached to a
+		 * primary trace event that was already subject to
+		 * filtering.
No per-event filter is defined for
+		 * TRACE_STACK_ID, so commit unconditionally.
+		 */
+		__buffer_unlock_commit(buffer, event);
+		goto out;
+	}
+full_stack:
+#endif
+
 	event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
 				    struct_size(entry, caller, nr_entries),
 				    trace_ctx);
@@ -3979,6 +4050,33 @@ int trace_keep_overwrite(struct tracer *tracer, u64 mask, int set)
 	return 0;
 }
 
+#ifdef CONFIG_FTRACE_STACKMAP
+/*
+ * Tracks tracefs-time initialization of the global stackmap so that
+ * set_tracer_flag() can distinguish "not initialized yet" from
+ * "initialization permanently failed".
+ *
+ * Boot-time options (trace_options=stackmap,stacktrace) are applied
+ * very early, before tracer_init_tracefs() creates and publishes the
+ * map. We must allow the STACKMAP flag to be set during that window
+ * (the hot path falls back to a full stack while tr->stackmap is NULL,
+ * then starts using the map once it is published). We must, however,
+ * reject the enable once init has *failed*, so options/stackmap never
+ * reports an enabled no-op.
+ *
+ * Written once from the tracefs init work before any concurrent
+ * userspace writer to trace_options can run, then only read; a plain
+ * int is therefore sufficient.
+ */
+enum {
+	STACKMAP_INIT_PENDING,	/* tracer_init_tracefs() not run yet */
+	STACKMAP_INIT_DONE,	/* map published, stack_map file created */
+	STACKMAP_INIT_FAILED,	/* permanent failure, never available */
+};
+
+static int stackmap_init_state = STACKMAP_INIT_PENDING;
+#endif
+
 int set_tracer_flag(struct trace_array *tr, u64 mask, int enabled)
 {
 	switch (mask) {
@@ -3993,6 +4091,33 @@ int set_tracer_flag(struct trace_array *tr, u64 mask, int enabled)
 	if (!!(tr->trace_flags & mask) == !!enabled)
 		return 0;
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	/*
+	 * STACKMAP is intentionally global-instance-only: the dedup map,
+	 * its tracefs files (stack_map / stack_map_stat / stack_map_bin)
+	 * and the lifetime/reset semantics are tied to the global trace
+	 * array. options/stackmap is hidden on secondary instances via
+	 * TOP_LEVEL_TRACE_FLAGS, but writes still reach set_tracer_flag()
+	 * through the aggregate trace_options file. Reject the enable on
+	 * a secondary instance so it cannot be silently accepted and then
+	 * become a no-op in the hot path (where tr->stackmap is NULL and
+	 * the code falls back to a full stack trace).
+	 *
+	 * On the global instance, allow the enable while init is still
+	 * pending (boot-time trace_options=stackmap is applied before the
+	 * tracefs init work creates the map; the hot path falls back
+	 * until the map is published). Only reject once init has
+	 * permanently failed, so options/stackmap never reports an
+	 * enabled no-op. READ_ONCE() suffices: this only inspects the
+	 * init state, it does not dereference the map (the hot path uses
+	 * smp_load_acquire(&tr->stackmap) for that).
+	 */
+	if (mask == TRACE_ITER(STACKMAP) && enabled &&
+	    (tr != &global_trace ||
+	     READ_ONCE(stackmap_init_state) == STACKMAP_INIT_FAILED))
+		return -EINVAL;
+#endif
+
 	/* Give the tracer a chance to approve the change */
 	if (tr->current_trace->flag_changed)
 		if (tr->current_trace->flag_changed(tr, mask, !!enabled))
@@ -9222,6 +9347,91 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
 			NULL, &tracing_dyn_info_fops);
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	{
+		struct ftrace_stackmap *smap;
+		struct dentry *map_file;
+
+		smap = ftrace_stackmap_create(&global_trace);
+		if (!IS_ERR(smap)) {
+			/*
+			 * Failure-atomic init: stack_map is the single
+			 * required tracefs file (it doubles as the reset
+			 * interface and the human-readable resolver). If
+			 * we cannot create it, the hot path must not be
+			 * able to emit events that no one can
+			 * resolve or clear, so refuse to publish the map
+			 * and tear it down.
+			 *
+			 * Create stack_map BEFORE smp_store_release() so an
+			 * observed non-NULL global_trace.stackmap implies
+			 * its resolver/reset file exists.
+			 */
+			map_file = trace_create_file("stack_map",
+						     TRACE_MODE_WRITE, NULL,
+						     smap,
+						     &ftrace_stackmap_fops);
+			if (!map_file) {
+				pr_warn("ftrace stackmap init: stack_map create failed, dedup disabled\n");
+				ftrace_stackmap_destroy(smap);
+				/*
+				 * Permanent failure. Record it and clear a
+				 * STACKMAP flag that a boot-time
+				 * trace_options=stackmap may have set, so
+				 * options/stackmap does not report an
+				 * enabled no-op and later userspace enables
+				 * return -EINVAL.
+				 */
+				WRITE_ONCE(stackmap_init_state,
+					   STACKMAP_INIT_FAILED);
+				global_trace.trace_flags &=
+					~TRACE_ITER(STACKMAP);
+			} else {
+				/*
+				 * smp_store_release pairs with the
+				 * smp_load_acquire() in
+				 * __ftrace_trace_stack(). Publishing only
+				 * after the required file exists keeps
+				 * "smap visible" => "resolver/reset
+				 * available".
+				 */
+				smp_store_release(&global_trace.stackmap,
+						  smap);
+				WRITE_ONCE(stackmap_init_state,
+					   STACKMAP_INIT_DONE);
+				/*
+				 * stat and bin are auxiliary observability
+				 * surfaces. If they fail to be created we
+				 * keep dedup enabled (the kernel side still
+				 * works, and stack_map alone is enough to
+				 * resolve and reset); trace_create_file()
+				 * already pr_warn()s on failure.
+				 */
+				trace_create_file("stack_map_stat",
+						  TRACE_MODE_READ, NULL,
+						  smap,
+						  &ftrace_stackmap_stat_fops);
+				trace_create_file("stack_map_bin",
+						  TRACE_MODE_READ, NULL,
+						  smap,
+						  &ftrace_stackmap_bin_fops);
+			}
+		} else {
+			pr_warn("ftrace stackmap init failed, dedup disabled\n");
+			/*
+			 * global_trace is statically defined; its stackmap
+			 * field is zero-initialized via BSS, so leaving it
+			 * NULL ensures the smp_load_acquire() in
+			 * __ftrace_trace_stack() falls back to full stack.
+			 * Mark init failed and clear any boot-time STACKMAP
+			 * flag so userspace enables are rejected rather than
+			 * becoming silent no-ops.
+			 */
+			WRITE_ONCE(stackmap_init_state, STACKMAP_INIT_FAILED);
+			global_trace.trace_flags &= ~TRACE_ITER(STACKMAP);
+		}
+	}
+#endif
 	create_trace_instances(NULL);
 
 	update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..95db43bfc747 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
 	TRACE_TIMERLAT,
 	TRACE_RAW_DATA,
 	TRACE_FUNC_REPEATS,
+	TRACE_STACK_ID,
 
 	__TRACE_LAST_TYPE,
 };
@@ -453,6 +454,9 @@ struct trace_array {
 	struct cond_snapshot *cond_snapshot;
 #endif
 	struct trace_func_repeats __percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+	struct ftrace_stackmap *stackmap;
+#endif
 	/*
 	 * On boot up, the ring buffer is set to the minimum size, so that
 	 * we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
 			  TRACE_GRAPH_RET); \
 		IF_ASSIGN(var, ent, struct func_repeats_entry, \
 			  TRACE_FUNC_REPEATS); \
+		IF_ASSIGN(var, ent, struct stack_id_entry, \
+			  TRACE_STACK_ID); \
 		__ftrace_bad_type(); \
 	} while (0)
 
@@ -689,6 +695,7 @@ extern int tracing_disabled;
 int tracer_init(struct tracer *t, struct trace_array *tr);
 int tracing_is_enabled(void);
 void tracing_reset_online_cpus(struct array_buffer *buf);
+void tracing_reset_all_cpus(struct array_buffer *buf);
 void tracing_reset_all_online_cpus(void);
 void tracing_reset_all_online_cpus_unlocked(void);
 int tracing_open_generic(struct inode *inode, struct file *filp);
@@ -1449,7 +1456,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 # define STACK_FLAGS
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS \
+		C(STACKMAP, "stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP_BIT -1
+#endif
+
 #ifdef CONFIG_FUNCTION_PROFILER
+
 # define PROFILER_FLAGS \
 		C(PROF_TEXT_OFFSET, "prof-text-offset"),
 # ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1522,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		FUNCTION_FLAGS \
 		FGRAPH_FLAGS \
 		STACK_FLAGS \
+		STACKMAP_FLAGS \
 		BRANCH_FLAGS \
 		PROFILER_FLAGS \
 		FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
 		 (void *)__entry->caller[6], (void *)__entry->caller[7])
 );
 
+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+	TRACE_STACK_ID,
+
+	F_STRUCT(
+		__field( int, stack_id )
+	),
+
+	F_printk("", __entry->stack_id)
+);
+
 /*
  * trace_printk entry:
  */
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
 	.funcs = &trace_user_stack_funcs,
 };
 
+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+					      int flags, struct trace_event *event)
+{
+	struct stack_id_entry *field;
+	struct trace_seq *s = &iter->seq;
+
+	trace_assign_type(field, iter->ent);
+	trace_seq_printf(s, "\n", field->stack_id);
+
+	return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+	.trace = trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+	.type = TRACE_STACK_ID,
+	.funcs = &trace_stack_id_funcs,
+};
+
 /* TRACE_HWLAT */
 static enum print_line_t trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
 	&trace_wake_event,
 	&trace_stack_event,
 	&trace_user_stack_event,
+	&trace_stack_id_event,
 	&trace_bputs_event,
 	&trace_bprint_event,
 	&trace_print_event,
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 929c84075315..0c97065b0d68 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -14,6 +14,7 @@ static inline int trace_valid_entry(struct trace_entry *entry)
 	case TRACE_CTX:
 	case TRACE_WAKE:
 	case TRACE_STACK:
+	case TRACE_STACK_ID:
 	case TRACE_PRINT:
 	case TRACE_BRANCH:
 	case TRACE_GRAPH_ENT:
-- 
2.34.1

From nobody Fri Jun 19 05:12:11 2026
b=J0KIce9H301kNnYBKMaQM0948+SBU0IVW7itqJpf3XbX4VwpKXZCWuKe+Etgm5gcBZ 1MSeH0JeYERbuWcb5hvOYr7dLRCyg+QEHpO7MBZUCNxUQbLZiYCLhkQyQEnXxuT2Hqei vKntKi4B3XRZWlhn65Epn0m4CrGEJQVa8z8tyGf+7cfblAzmiM2Vp0MWzlYJqxjMTRb1 v+RlqT0vLs9O2mdXaEEtyICQGzCHk1m1gRHrs5/a35Y3oRBfBG6g8oDZZAaH25MJ3tgo nSTirVrlA4HJviIq7CK17HIKjUd0sczZJYHNEDweCosn4ZrQyz64xMVeltMgVocZa9ba t8WA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1781592147; x=1782196947; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=gdc1n+XKhvad81vsVkXIxxp7+FIV7UAbFF/cS0eyfrM=; b=RoqpgNl0iYZpHCu1JAF6vu6H2valip8e80siZSnkN85+7tAE0jDapgV+KfiFs1tOjT Fte7kYd56TrmPubshSEv4smcF+hhmpTJd7mitCU4ITlUrhjeHbKnsh1ZzjWYHoxpm6Su 0yNGRuR/jJa+MrClRuvtVCyV+dSQrE+MrvWi61SUuoelgUpFY0Ne9AoU9educNQ2cMxf BAka1VxaMkWRGJsaNVb1bOrh4D6y4HFP51ggvt0fXK/bN5tLkhiWsYL3d8D0h+XPxXTh OjOmHM5or7jWGsrkHCGD9soUvOg9osemAWhP4dTM4JdvJ8mVBIMd5GgT5CRUpdLXcBS+ Qzwg== X-Forwarded-Encrypted: i=1; AFNElJ91ZpO/5NEy2IKF6efB52+GtJ62ugtLZGtRfupdOAdTJNeIQIf9MrFMHjfXm5MU26hoZo2vzIV3jZYcRVA=@vger.kernel.org X-Gm-Message-State: AOJu0YzdKPVMQSSW3x62IRcIobTPhYJ5O/bbjMuo4gmwaWf3waESehpk 57KggwzKl3cc81OsiGZs8hL5bq6lOKJDXj7DF/dRxjn/lkMvz5trQlp4 X-Gm-Gg: Acq92OEazXq5DgJWAhl6Zblh93ctED/8vh1ToHsanuaZcLUZzpvxi81XEq1o3fYrmJ5 yG7qDtbJJHQRORtOmwY42pZMV6zASnRoAxxfNZDQiCtgn+smyT2jKmgiyV7VhjXzJIz62gbrEz7 hohVe0sNB48s1lD88mjtL4gN9yKdjAve+itBqDbvQ8vZNMqU3tfSgVUqzbNE2MHuKi2OaU7swG7 VOsh3RlGvxoulE6n2bzOf3E/h3ply/cvJkLmHX7f1ci6ww6y2HS0HMzWlC3lnwfI6Vlj6bJos6z L7//iFYcQxXrAIGxsLCWHLooeefs2pC7LfWdDV6szVaaZtS+slq+5+R41pYIFAiBXod2QeOzUax ds5cL+HGT9xf0hhk54b3U2UMwU6MA54A9IUTzStM9VYjTg/HsOLT0SDnnPC97hN9JI/ZQEJl6SQ /JhuUJr4p+GrBWZGYl8KqEVmTTg4g7QRrk2DHD6w== X-Received: by 2002:a05:7022:f9a:b0:139:89ad:8deb with SMTP id a92af1059eb24-13989ad9011mr314176c88.9.1781592146682; Mon, 15 Jun 2026 23:42:26 -0700 (PDT) Received: from 
localhost.localdomain ([2408:8607:1b00:8:55be:2dbe:9cd4:7306]) by smtp.gmail.com with ESMTPSA id a92af1059eb24-1384b910c51sm12499158c88.4.2026.06.15.23.42.17 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 15 Jun 2026 23:42:24 -0700 (PDT) From: Li Pengfei X-Google-Original-From: Li Pengfei To: Steven Rostedt , Masami Hiramatsu Cc: Mathieu Desnoyers , Mark Rutland , Jonathan Corbet , Shuah Khan , linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, lipengfei28@xiaomi.com, zhangbo56@xiaomi.com, kernel test robot Subject: [RFC PATCH v4 3/3] trace: add documentation, selftest and tooling for stackmap Date: Tue, 16 Jun 2026 14:41:19 +0800 Message-Id: <20260616064119.438063-4-lipengfei28@xiaomi.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260616064119.438063-1-lipengfei28@xiaomi.com> References: <20260616064119.438063-1-lipengfei28@xiaomi.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable From: Pengfei Li Add supporting files for the ftrace stackmap feature: Documentation/trace/ftrace-stackmap.rst: Documentation covering design, usage, tracefs interface, binary format, and performance characteristics. Added to the 'Core Tracing Frameworks' toctree in Documentation/trace/index.rst. 
Documents: - Reset is destructive: it requires tracing to be stopped and also clears the ring buffer so no stale survives - Boot-time activation via trace_options=3Dstackmap - bits parameter range [10, 18] and worst-case memory usage - tracefs file modes (0640 / 0440) - Best-effort snapshot semantics for stack_map_bin, serialized against reset via the reader_sem - Counter naming: successes (events served), drops, success_rate; successes/drops are best-effort and saturate on long runs - Gravestone amplification when the pool is exhausted tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc: Functional selftest verifying: - stackmap tracefs nodes exist - enabling stackmap + stacktrace produces stack_id events - stack_map_stat shows non-zero successes (a nonzero drops count is a legitimate by-design fallback and is not treated as failure; only zero successes alongside nonzero drops is fatal) - reset clears entries when tracing is stopped - reset is rejected (-EBUSY) while tracing is active Test reads trace contents BEFORE switching back to the nop tracer (tracer_init() unconditionally resets the ring buffer). The function:tracer dependency is declared in '# requires:' so ftracetest skips on kernels without CONFIG_FUNCTION_TRACER instead of failing spuriously. tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc: Verifies the destructive-reset semantics and the binary ABI header: - after 'echo 0 > stack_map', the trace buffer no longer contains any stale - stack_map_bin begins with the expected magic and version tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc: Verifies the option is gated to the top-level instance: a secondary instance neither exposes options/stackmap nor the stack_map* nodes, and writing 'stackmap' to its aggregate trace_options file is rejected rather than accepted as a no-op. tools/tracing/stackmap_dump.py: Python script to parse the binary stack_map_bin export. 
Features: - Automatic endianness detection via magic number - Batched addr2line via stdin (avoids ARG_MAX with large stacks) - JSON output mode (ips are always hex addresses; the ftrace trampoline marker is shown only in the resolved symbols) - Top-N filtering by ref_count Binary format: all fields are native-endian. The parser detects byte order by reading the magic value (0x46534D42 =3D 'FSMB'). Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202605160010.fakzGVVq-lkp@int= el.com/ Signed-off-by: Pengfei Li --- Documentation/trace/ftrace-stackmap.rst | 177 ++++++++++++++++++ Documentation/trace/index.rst | 1 + .../ftrace/test.d/ftrace/stackmap-basic.tc | 111 +++++++++++ .../test.d/ftrace/stackmap-instance-gate.tc | 54 ++++++ .../ftrace/test.d/ftrace/stackmap-reset.tc | 76 ++++++++ tools/tracing/stackmap_dump.py | 164 ++++++++++++++++ 6 files changed, 583 insertions(+) create mode 100644 Documentation/trace/ftrace-stackmap.rst create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-b= asic.tc create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-i= nstance-gate.tc create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-r= eset.tc create mode 100755 tools/tracing/stackmap_dump.py diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/= ftrace-stackmap.rst new file mode 100644 index 000000000000..8d0b5c389862 --- /dev/null +++ b/Documentation/trace/ftrace-stackmap.rst @@ -0,0 +1,177 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D +Ftrace Stack Map +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +:Author: Pengfei Li + +Overview +=3D=3D=3D=3D=3D=3D=3D=3D + +The ftrace stack map provides stack trace deduplication for the ftrace +ring buffer. 
When enabled, instead of storing full kernel stack traces +(typically 80-160 bytes each) in the ring buffer for every event, ftrace +stores only a 4-byte ``stack_id``. The full stacks are maintained in a +separate hash table and exported via tracefs for userspace to resolve. + +This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated +into ftrace's infrastructure, requiring no userspace daemon. + +Configuration +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +Enable ``CONFIG_FTRACE_STACKMAP=3Dy`` in the kernel config. + +Kernel command line parameters: + +- ``ftrace_stackmap.bits=3DN`` - Set map capacity to 2^N unique stacks + (default: 14 =E2=86=92 16384 stacks; valid range: 10-18). + + At ``bits=3D18`` the kernel reserves roughly 130 MB of vmalloc memory + for the element pool. Each ``open()`` of ``stack_map_bin`` may + briefly allocate a similar amount for a snapshot. The cap is set + intentionally to bound memory usage. + +Usage +=3D=3D=3D=3D=3D + +Enable stack deduplication:: + + echo 1 > /sys/kernel/debug/tracing/options/stackmap + echo 1 > /sys/kernel/debug/tracing/options/stacktrace + echo function > /sys/kernel/debug/tracing/current_tracer + +The trace output will show ```` instead of full stack traces:: + + sh-1234 [006] d.h.. 123.456789: + +To view the actual stacks:: + + cat /sys/kernel/debug/tracing/stack_map + +Output format:: + + stack_id 42 [ref 1337, depth 8] + [0] schedule+0x48/0xc0 + [1] schedule_timeout+0x1c/0x30 + ... + +To view statistics:: + + cat /sys/kernel/debug/tracing/stack_map_stat + +Output:: + + entries: 2500 / 16384 + table_size: 32768 + successes: 148923 + drops: 0 + success_rate: 100% + +To reset the stack map (tracing must be stopped first):: + + echo 0 > /sys/kernel/debug/tracing/tracing_on + echo 0 > /sys/kernel/debug/tracing/stack_map + +Reset returns ``-EBUSY`` if tracing is currently active, or if another +reset is already in progress. 
+ +Reset is destructive to the trace buffer: because the ring buffer may +still hold ```` events that reference soon-to-be-reused +slots, resetting the map also resets the owning trace buffer (and its +snapshot, if allocated). This keeps ring-buffer stack_ids and the map +coherent. Read out any trace data you need before resetting. + +Boot-time activation +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +The stackmap option can be enabled from the kernel command line:: + + trace_options=3Dstackmap,stacktrace + +Trace events that fire before the tracefs filesystem is initialized +(``fs_initcall`` time) fall back to recording full stack traces; once +``ftrace_stackmap_create()`` runs, subsequent events are deduplicated. +The crossover is automatic and lossless =E2=80=94 no events are dropped, b= ut +early-boot stacks recorded before the crossover are not deduplicated. + +Tracefs Nodes +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +The stack_map files are owned by root and not world-readable +(``stack_map``: 0640; ``stack_map_stat`` and ``stack_map_bin``: 0440). + +``stack_map`` + Text export of all deduplicated stacks with symbol resolution. + Writing ``0`` or ``reset`` clears all entries (only when tracing + is stopped). + +``stack_map_stat`` + Statistics: entries (allocated unique stacks), table_size, + successes (events served), drops (events that fell back to + full-stack recording), and success_rate. Drops accumulate when + the element pool is exhausted; once that happens, slots that + won the cmpxchg but failed to allocate an element remain + "claimed but empty" and increase probe pressure for any future + insert hashing to the same bucket. Reset (when tracing is + stopped) clears these gravestones. + +``stack_map_bin`` + Binary export for efficient userspace consumption. 
Format: + + - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + rese= rved(u32) + - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) = + ips(u64 =C3=97 nr) + + All fields are written in the kernel's native byte order. + Userspace tools detect endianness by reading the magic value. + Magic: ``0x46534D42`` ('FSMB'), Version: 1. + + Trampoline frames are exported as the sentinel value + ``0x7fffffff`` (FTRACE_TRAMPOLINE_MARKER); all other addresses are + passed through ``trace_adjust_address()`` so they match the + ``stack_map`` text output's address-adjustment rules. Note this is + the same adjustment ftrace applies to its own trace output (mainly + relevant for persistent / last-boot buffers), not a general KASLR + un-offset: resolving these addresses offline still requires the + matching kernel's symbol information. + + The export is a best-effort snapshot allocated at ``open()``; + concurrent inserts during the snapshot may be truncated. A + bounds check ensures no overflow. + +Design +=3D=3D=3D=3D=3D=3D + +The stack map is modeled after ``tracing_map.c`` (used by hist triggers), +using a lock-free design based on Dr. Cliff Click's non-blocking hash table +algorithm: + +- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context +- **Memory**: Pre-allocated element pool, zero allocation on the hot path + (no GFP_ATOMIC failures under memory pressure) +- **Collision**: Linear probing with a 2x over-provisioned table; probe + length is bounded so worst-case insert/lookup is O(1) +- **Scope**: Currently supports the global trace instance +- **Hash**: 32-bit jhash with a per-instance random seed; full ``memcmp`` + confirms matches + +Deduplication is best-effort, not strict: if two CPUs race in the +insert path with the same ``key_hash`` (i.e. the same stack), the +``cmpxchg`` loser advances by one slot and may insert the same stack +again. 
Under heavy contention this can produce a small number of +duplicate entries for the same stack; ``ref_count`` is then split +across the duplicates. Total memory is still bounded by the element +pool size, and lookup correctness is unaffected (each duplicate is +a self-consistent entry with its own ``stack_id``). The trade-off is +intentional and keeps the hot path lock-free. + +Performance +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +Typical results on an aarch64 SMP system (function tracer, 2 seconds): + +- Unique stacks: ~3000 +- Dedup rate: 84-98% (depends on workload diversity) +- Ring buffer savings: ~80% for stack data +- Overhead per event: ~50ns (one jhash + hash table lookup) diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst index 5d9bf4694d5d..ac8b1141c23a 100644 --- a/Documentation/trace/index.rst +++ b/Documentation/trace/index.rst @@ -33,6 +33,7 @@ the Linux kernel. ftrace ftrace-design ftrace-uses + ftrace-stackmap kprobes kprobetrace fprobetrace diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc= b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc new file mode 100644 index 000000000000..64dfe7cc66bd --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc @@ -0,0 +1,111 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: ftrace - stackmap basic functionality +# requires: stack_map options/stackmap function:tracer + +# Test that ftrace stackmap deduplication works: +# 1. Enable stackmap + stacktrace options +# 2. Run function tracer briefly +# 3. Verify trace contains events (read BEFORE switching +# tracer back to nop, since tracer_init() resets the ring buffer) +# 4. Verify stack_map has entries and at least some successes (drops is +# a legitimate by-design fallback counter and is allowed to be nonzero; +# only zero successes alongside nonzero drops indicates breakage) +# 5. Verify reset is rejected (-EBUSY) while tracing is active +# 6. 
Verify reset clears the map when tracing is stopped + +fail() { + echo "FAIL: $1" + exit_fail +} + +# Restore state on any exit (success, fail, or interrupt) so a +# half-finished test does not leave stacktrace/stackmap enabled. +cleanup() { + disable_tracing 2>/dev/null + echo nop > current_tracer 2>/dev/null + echo 0 > options/stackmap 2>/dev/null + echo 0 > options/stacktrace 2>/dev/null +} +trap cleanup EXIT + +disable_tracing +clear_trace + +# Verify stackmap files exist +test -f stack_map || fail "stack_map file missing" +test -f stack_map_stat || fail "stack_map_stat file missing" +test -f stack_map_bin || fail "stack_map_bin file missing" + +# Enable stackmap dedup +echo 1 > options/stackmap +echo 1 > options/stacktrace + +# Run function tracer briefly +echo function > current_tracer +enable_tracing +sleep 1 +disable_tracing + +# Read trace contents NOW, before switching tracer back to nop. +# tracer_init() unconditionally calls tracing_reset_online_cpus(), +# so the ring buffer would be empty after 'echo nop > current_tracer'. +count=3D$(grep -c " events" +fi + +# Now safe to switch back and disable options +echo nop > current_tracer +echo 0 > options/stackmap + +# Check stack_map_stat +entries=3D$(cat stack_map_stat | grep "^entries:" | awk '{print $2}') +: "${entries:=3D0}" +if [ "$entries" -eq 0 ]; then + fail "stackmap has zero entries after tracing" +fi + +successes=3D$(cat stack_map_stat | grep "^successes:" | awk '{print $2}') +: "${successes:=3D0}" +if [ "$successes" -eq 0 ]; then + fail "stackmap has zero successes" +fi + +drops=3D$(cat stack_map_stat | grep "^drops:" | awk '{print $2}') +: "${drops:=3D0}" +# drops is a legitimate by-design fallback counter: when the map is full +# or under heavy probe pressure, stackmap falls back to recording a full +# stack instead of a stack_id. A nonzero drops count is therefore not a +# failure. 
Only treat it as fatal if dedup never worked at all (no +# successes), which would indicate the feature is genuinely broken rather +# than merely under pressure. +if [ "$successes" -eq 0 ] && [ "$drops" -ne 0 ]; then + fail "stackmap had $drops drops and zero successes (feature broken?)" +fi + +# Check stack_map text output is parseable +first_id=3D$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}') +if [ -z "$first_id" ]; then + fail "stack_map output has no stack_id entries" +fi + +# Test that reset is rejected while tracing is active +enable_tracing +if echo 0 > stack_map 2>/dev/null; then + disable_tracing + fail "stackmap reset should fail while tracing is active" +fi +disable_tracing + +# Test reset works when tracing is stopped +echo 0 > stack_map +entries_after=3D$(cat stack_map_stat | grep "^entries:" | awk '{print $2}') +: "${entries_after:=3D-1}" +if [ "$entries_after" -ne 0 ]; then + fail "stackmap reset did not clear entries (got $entries_after)" +fi + +echo "stackmap basic test passed: $entries unique stacks, $successes succe= sses, $drops drops" +exit 0 diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance= -gate.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-g= ate.tc new file mode 100644 index 000000000000..28810ba20432 --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc @@ -0,0 +1,54 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: ftrace - stackmap option is gated to the top-level trace in= stance +# requires: stack_map options/stackmap instances + +# The 'stackmap' option is added to TOP_LEVEL_TRACE_FLAGS, matching the +# convention used for global-only options like 'printk' and 'record-cmd'. +# Verify that: +# 1. The global instance exposes options/stackmap and the stack_map* nodes. +# 2. A newly created secondary instance under instances/ does NOT expose +# options/stackmap or stack_map* nodes. 
+ +fail() { + echo "FAIL: $1" + rmdir instances/test_stackmap_gate 2>/dev/null + exit_fail +} + +# 1. Global instance must expose the option and the nodes +test -e options/stackmap || fail "options/stackmap missing on global insta= nce" +test -e stack_map || fail "stack_map missing on global instance" +test -e stack_map_stat || fail "stack_map_stat missing on global instanc= e" +test -e stack_map_bin || fail "stack_map_bin missing on global instance" + +# 2. Create a secondary instance and verify it does NOT see the option +# or the stack_map* nodes. +mkdir instances/test_stackmap_gate || fail "could not create secondary ins= tance" + +if [ -e instances/test_stackmap_gate/options/stackmap ]; then + fail "secondary instance unexpectedly exposes options/stackmap" +fi + +for f in stack_map stack_map_stat stack_map_bin; do + if [ -e instances/test_stackmap_gate/$f ]; then + fail "secondary instance unexpectedly has $f" + fi +done + +# 3. The aggregate trace_options file still reaches set_tracer_flag(), +# so writing 'stackmap' there must be rejected on a secondary +# instance. Otherwise the bit could appear set in trace_options +# while the hot path silently falls back to a full stack trace +# (tr->stackmap =3D=3D NULL). 
+if echo stackmap > instances/test_stackmap_gate/trace_options 2>/dev/null;= then + fail "secondary instance accepted 'echo stackmap > trace_options'" +fi +if grep -qw stackmap instances/test_stackmap_gate/trace_options; then + fail "secondary instance trace_options reports stackmap as set" +fi + +rmdir instances/test_stackmap_gate || fail "could not remove secondary ins= tance" + +echo "stackmap option gating to top-level instance works" +exit 0 diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc= b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc new file mode 100644 index 000000000000..803cc282f9ab --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc @@ -0,0 +1,76 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: ftrace - stackmap reset clears the trace buffer and ABI hea= der +# requires: stack_map options/stackmap function:tracer + +# Lock in the two things most likely to regress in the stackmap ABI / +# lifetime: +# 1. Resetting the stackmap (echo 0 > stack_map, tracing stopped) also +# clears the trace buffer, so no stale can be left +# dangling against an emptied map. +# 2. The stack_map_bin header carries the expected magic ('FSMB' =3D +# 0x46534D42) and version (1). + +fail() { + echo "FAIL: $1" + exit_fail +} + +cleanup() { + disable_tracing 2>/dev/null + echo nop > current_tracer 2>/dev/null + echo 0 > options/stackmap 2>/dev/null + echo 0 > options/stacktrace 2>/dev/null +} +trap cleanup EXIT + +disable_tracing +clear_trace + +echo 1 > options/stackmap +echo 1 > options/stacktrace +echo function > current_tracer +enable_tracing +sleep 1 +disable_tracing + +# Sanity: the buffer must contain stack_id events before reset, otherwise +# the post-reset emptiness check below would be meaningless. +before=3D$(grep -c " events captured before reset" +fi + +# Reset while tracing is stopped. This must succeed AND clear the trace +# buffer (destructive reset semantics). 
+echo 0 > stack_map || fail "reset rejected while tracing stopped" + +after=3D$(grep -c " events after reset" +fi + +entries=3D$(cat stack_map_stat | grep "^entries:" | awk '{print $2}') +: "${entries:=3D-1}" +if [ "$entries" -ne 0 ]; then + fail "stackmap still has $entries entries after reset" +fi + +# Binary export header: magic 'FSMB' (0x46534D42) + version 1. +# od -tx4 renders the 32-bit words in the target's native byte order, +# which matches what the kernel wrote, so the comparison is endian-safe. +if command -v od >/dev/null 2>&1; then + magic=3D$(od -An -tx4 -N4 stack_map_bin | tr -d ' \n') + if [ "$magic" !=3D "46534d42" ]; then + fail "stack_map_bin bad magic: 0x$magic (expected 46534d42)" + fi + ver=3D$(od -An -tx4 -j4 -N4 stack_map_bin | tr -d ' \n') + if [ "$ver" !=3D "00000001" ]; then + fail "stack_map_bin bad version: 0x$ver (expected 00000001)" + fi +fi + +echo "stackmap reset test passed: cleared $before stack_id events, ABI hea= der ok" +exit 0 diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py new file mode 100755 index 000000000000..2d9c49b776e6 --- /dev/null +++ b/tools/tracing/stackmap_dump.py @@ -0,0 +1,164 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: GPL-2.0 +""" +stackmap_dump.py - Parse and display ftrace stack_map_bin binary export. 
+ +Usage: + # Pull from device and parse + adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin + python3 stackmap_dump.py /tmp/stack_map.bin + + # With vmlinux for offline symbol resolution + python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux + + # JSON output for tooling + python3 stackmap_dump.py /tmp/stack_map.bin --json +""" + +import struct +import sys +import argparse +import json +import subprocess + +MAGIC =3D 0x46534D42 # 'FSMB' +HEADER_SIZE =3D 16 # 4 x u32 +ENTRY_SIZE =3D 16 # 4 x u32 + +# __ftrace_trace_stack() replaces trampoline addresses with this marker +# (FTRACE_TRAMPOLINE_MARKER =3D=3D (unsigned long)INT_MAX) before the stack +# is stored, so the binary export carries it verbatim. +FTRACE_TRAMPOLINE_MARKER =3D 0x7fffffff +TRAMPOLINE_LABEL =3D '[FTRACE TRAMPOLINE]' + + +def detect_endianness(data): + """Detect byte order from magic number in header.""" + if len(data) < 4: + raise ValueError("File too small") + magic_le =3D struct.unpack_from('<I', data, 0)[0] + if magic_le =3D=3D MAGIC: + return '<' + magic_be =3D struct.unpack_from('>I', data, 0)[0] + if magic_be =3D=3D MAGIC: + return '>' + raise ValueError(f"Bad magic: 0x{magic_le:08x} (neither LE nor BE)") + + +def batch_addr2line(vmlinux, addrs): + """Resolve multiple addresses in one addr2line invocation.""" + if not addrs: + return {} + try: + # Feed addresses on stdin to avoid ARG_MAX limits with large + # numbers of addresses (one stack can have 30+ frames; a + # snapshot can have thousands of unique stacks).
+ stdin =3D '\n'.join(hex(a) for a in addrs) + '\n' + result =3D subprocess.run( + ['addr2line', '-f', '-e', vmlinux], + input=3Dstdin, capture_output=3DTrue, text=3DTrue, timeout=3D60 + ) + lines =3D result.stdout.split('\n') + # addr2line outputs 2 lines per address: function name + source lo= cation + symbols =3D {} + for i, addr in enumerate(addrs): + idx =3D i * 2 + if idx < len(lines) and lines[idx] and lines[idx] !=3D '??': + symbols[addr] =3D lines[idx] + return symbols + except (subprocess.TimeoutExpired, FileNotFoundError) as e: + print(f"warning: addr2line failed: {e}", file=3Dsys.stderr) + return {} + + +def parse_stackmap_bin(data): + """Parse binary stackmap data, yield (stack_id, ref_count, [ips]).""" + if len(data) < HEADER_SIZE: + raise ValueError("File too small for header") + + endian =3D detect_endianness(data) + header_fmt =3D f'{endian}IIII' + entry_fmt =3D f'{endian}IIII' + + magic, version, nr_stacks, _ =3D struct.unpack_from(header_fmt, data, = 0) + if version !=3D 1: + raise ValueError(f"Unsupported version: {version}") + + offset =3D HEADER_SIZE + for _ in range(nr_stacks): + if offset + ENTRY_SIZE > len(data): + break + stack_id, nr, ref_count, _ =3D struct.unpack_from(entry_fmt, data,= offset) + offset +=3D ENTRY_SIZE + + ips_size =3D nr * 8 + if offset + ips_size > len(data): + break + ips =3D struct.unpack_from(f'{endian}{nr}Q', data, offset) + offset +=3D ips_size + + yield stack_id, ref_count, list(ips) + + +def main(): + parser =3D argparse.ArgumentParser(description=3D'Parse ftrace stack_m= ap_bin') + parser.add_argument('file', help=3D'Path to stack_map_bin file') + parser.add_argument('--vmlinux', help=3D'Path to vmlinux for symbol re= solution') + parser.add_argument('--json', action=3D'store_true', help=3D'JSON outp= ut') + parser.add_argument('--top', type=3Dint, default=3D0, + help=3D'Show only top N stacks by ref_count') + args =3D parser.parse_args() + + with open(args.file, 'rb') as f: + data =3D f.read() + + stacks =3D 
list(parse_stackmap_bin(data)) + + if args.top > 0: + stacks.sort(key=3Dlambda x: x[1], reverse=3DTrue) + stacks =3D stacks[:args.top] + + # Batch symbol resolution + symbols =3D {} + if args.vmlinux: + all_addrs =3D set() + for _, _, ips in stacks: + all_addrs.update(ip for ip in ips + if ip !=3D FTRACE_TRAMPOLINE_MARKER) + symbols =3D batch_addr2line(args.vmlinux, list(all_addrs)) + + def render(ip): + if ip =3D=3D FTRACE_TRAMPOLINE_MARKER: + return TRAMPOLINE_LABEL + return symbols.get(ip, f'0x{ip:x}') + + if args.json: + output =3D [] + for stack_id, ref_count, ips in stacks: + entry =3D { + 'stack_id': stack_id, + 'ref_count': ref_count, + 'ips': [f'0x{ip:x}' for ip in ips] + } + if args.vmlinux: + entry['symbols'] =3D [render(ip) for ip in ips] + output.append(entry) + print(json.dumps(output, indent=3D2)) + else: + for stack_id, ref_count, ips in stacks: + print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}= ]") + for i, ip in enumerate(ips): + if ip =3D=3D FTRACE_TRAMPOLINE_MARKER: + print(f" [{i}] {TRAMPOLINE_LABEL}") + continue + sym =3D symbols.get(ip, '') + if sym: + sym =3D f' {sym}' + print(f" [{i}] 0x{ip:x}{sym}") + print() + + print(f"Total: {len(stacks)} unique stacks", file=3Dsys.stderr) + + +if __name__ =3D=3D '__main__': + main() --=20 2.34.1
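For anyone who wants to exercise the stack_map_bin layout without a patched kernel, the following is a minimal sketch. It assumes only what the documentation in this patch states: a 16-byte header (magic, version, nr_stacks, reserved), a 16-byte per-stack header (stack_id, nr, ref_count, reserved) followed by nr u64 addresses, and native byte order detected via the magic. The helper names build_blob(), detect_endian() and parse_blob() are hypothetical and not part of the patch; the addresses are made up.

```python
import struct

MAGIC = 0x46534D42  # 'FSMB', as stated in the patch documentation
VERSION = 1
HEADER_SIZE = 16    # 4 x u32
ENTRY_SIZE = 16     # 4 x u32


def build_blob(stacks, endian="<"):
    """Pack [(stack_id, ref_count, [ips])] in the documented layout.

    Hypothetical helper for testing only; the kernel produces the real file.
    """
    out = struct.pack(f"{endian}IIII", MAGIC, VERSION, len(stacks), 0)
    for stack_id, ref_count, ips in stacks:
        out += struct.pack(f"{endian}IIII", stack_id, len(ips), ref_count, 0)
        out += struct.pack(f"{endian}{len(ips)}Q", *ips)
    return out


def detect_endian(blob):
    """Return '<' or '>' depending on which byte order yields MAGIC."""
    for endian in ("<", ">"):
        if struct.unpack_from(f"{endian}I", blob, 0)[0] == MAGIC:
            return endian
    raise ValueError("bad magic")


def parse_blob(blob):
    """Return [(stack_id, ref_count, [ips])]; stop at a truncated record."""
    endian = detect_endian(blob)
    _, version, nr_stacks, _ = struct.unpack_from(f"{endian}IIII", blob, 0)
    if version != VERSION:
        raise ValueError(f"unsupported version {version}")
    offset, stacks = HEADER_SIZE, []
    for _ in range(nr_stacks):
        if offset + ENTRY_SIZE > len(blob):
            break
        stack_id, nr, ref_count, _ = struct.unpack_from(
            f"{endian}IIII", blob, offset)
        offset += ENTRY_SIZE
        if offset + 8 * nr > len(blob):
            break
        ips = list(struct.unpack_from(f"{endian}{nr}Q", blob, offset))
        offset += 8 * nr
        stacks.append((stack_id, ref_count, ips))
    return stacks


# Round-trip a synthetic map in both byte orders (addresses are invented;
# 0x7fffffff stands in for the trampoline marker described in the patch).
sample = [(42, 1337, [0xFFFF800008123456, 0x7FFFFFFF]),
          (7, 1, [0xFFFF800008ABCDEF])]
for order in ("<", ">"):
    assert parse_blob(build_blob(sample, order)) == sample
```

Packing the same sample in both byte orders is a cheap way to confirm that a parser really keys off the magic rather than the host's endianness, which matters when the blob is pulled from a device with a different byte order.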