* [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer
@ 2026-05-14 3:49 Li Pengfei
2026-05-14 3:49 ` [RFC PATCH 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei
` (5 more replies)
0 siblings, 6 replies; 15+ messages in thread
From: Li Pengfei @ 2026-05-14 3:49 UTC
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28
From: Pengfei Li <lipengfei28@xiaomi.com>
Hi Steven, all,

This series adds stack trace deduplication to ftrace. In the tests
below it reduces ring buffer usage by 74-78% when the stacktrace
option is enabled.
Problem:
When the stacktrace option is enabled, each trace event stores a full
kernel stack (typically 10-20 frames x 8 bytes = 80-160 bytes). On
production devices with 4-8MB trace buffers, this fills the buffer in
seconds, limiting the usefulness of boot-time tracing and always-on
performance monitoring.
Solution:
This series adds a lock-free hash map (modeled after tracing_map.c, as
suggested by Steven [1]) that deduplicates stack traces. The ring
buffer stores only a 4-byte stack_id; the full stacks are exported via
tracefs.
Design (following tracing_map.c pattern):
- Lock-free insert via cmpxchg (NMI/IRQ/any context safe)
- Pre-allocated element pool (zero allocation on hot path)
- Linear probing with 2x over-provisioned table
- Per-trace_array instance support
We adopted the same lock-free algorithm as tracing_map but with a
purpose-built data structure, because tracing_map's API is designed
for histogram aggregation with fixed-size keys and sum/var fields,
while our use case requires variable-length stack traces with
reference counting.
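
For readers who want to experiment with the design outside the kernel,
here is a minimal userspace sketch of the same idea using C11 atomics:
claim a slot with compare-and-swap, take an element from a pre-allocated
pool, and publish it with a release store. Everything named demo_* is
hypothetical, the FNV-1a hash is only a stand-in for jhash2(), and the
table is tiny on purpose. It omits reference counting, reset and the
statistics counters, so it is an illustration under those assumptions,
not the code from patch 1/3.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BITS      4                    /* 16 pooled stacks (illustrative) */
#define DEMO_MAX_ELTS  (1u << DEMO_BITS)
#define DEMO_MAP_SIZE  (DEMO_MAX_ELTS * 2)  /* 2x over-provisioned table */
#define DEMO_MAX_DEPTH 8

struct demo_elt {
	uint32_t nr;
	uintptr_t ips[DEMO_MAX_DEPTH];
};

struct demo_slot {
	_Atomic uint32_t key;            /* 0 = free, non-zero = hash */
	_Atomic(struct demo_elt *) val;  /* NULL until published */
};

struct demo_map {                        /* must start zero-initialized */
	struct demo_slot slots[DEMO_MAP_SIZE];
	struct demo_elt pool[DEMO_MAX_ELTS];
	atomic_uint next_elt;
};

/* FNV-1a over the raw IP bytes; a stand-in for the kernel's jhash2(). */
static uint32_t demo_hash(const uintptr_t *ips, uint32_t nr)
{
	const unsigned char *p = (const unsigned char *)ips;
	size_t i, len = (size_t)nr * sizeof(*ips);
	uint32_t h = 2166136261u;

	for (i = 0; i < len; i++)
		h = (h ^ p[i]) * 16777619u;
	return h ? h : 1;                /* 0 is reserved for "free slot" */
}

/* Bounded pool allocation: never advances next_elt past the pool size. */
static struct demo_elt *demo_get_elt(struct demo_map *m)
{
	unsigned int n = atomic_load(&m->next_elt);

	while (n < DEMO_MAX_ELTS) {
		/* On failure, n is reloaded with the current counter value. */
		if (atomic_compare_exchange_weak(&m->next_elt, &n, n + 1))
			return &m->pool[n];
	}
	return NULL;
}

/* Returns the slot index used as stack id, or -1 if nothing could be stored. */
int demo_get_id(struct demo_map *m, const uintptr_t *ips, uint32_t nr)
{
	uint32_t key, idx, probes;

	assert(nr > 0 && nr <= DEMO_MAX_DEPTH);
	key = demo_hash(ips, nr);
	idx = key >> (32 - (DEMO_BITS + 1));  /* top bits pick the start slot */

	for (probes = 0; probes < DEMO_MAP_SIZE; probes++, idx++) {
		uint32_t slot = idx & (DEMO_MAP_SIZE - 1);
		struct demo_slot *s = &m->slots[slot];
		uint32_t cur = atomic_load(&s->key);

		if (cur == 0) {
			uint32_t expected = 0;

			if (atomic_compare_exchange_strong(&s->key, &expected, key)) {
				struct demo_elt *e = demo_get_elt(m);

				if (!e)
					return -1;  /* pool exhausted */
				e->nr = nr;
				memcpy(e->ips, ips, nr * sizeof(*ips));
				/* Publish only after the element is filled in. */
				atomic_store_explicit(&s->val, e, memory_order_release);
				return (int)slot;
			}
			cur = expected;  /* lost the race; inspect the winner's key */
		}
		if (cur == key) {
			struct demo_elt *e =
				atomic_load_explicit(&s->val, memory_order_acquire);

			if (e && e->nr == nr && !memcmp(e->ips, ips, nr * sizeof(*ips)))
				return (int)slot;
			/* Hash collision or unpublished slot: keep probing. */
		}
	}
	return -1;
}
```

Unlike the patch, this sketch does not retry when it meets a claimed but
unpublished slot, so two racing inserters of the same stack could end up
with two ids. That keeps the example short and is harmless in a
single-threaded test.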
Test results (ARM64, Qualcomm SM8850, kernel 6.12):
- kmem_cache_alloc events, 1 second capture:
774 unique stacks, 8264 hits, 0 drops, 100% hit rate
Ring buffer usage: 795KB -> 176KB (78% reduction)
- Function tracer, 3 seconds:
3632 unique stacks, 25466 hits, 0 drops
Ring buffer usage: 2.5MB -> 653KB (74% reduction)
Here "hits" counts every successful get_id() call, including the first
insert of a new stack, and hit rate is hits / (hits + drops).
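
The reduction figures above can be re-derived with a few lines of
integer arithmetic. This is only a worked check; the helper name
reduction_percent() is made up for illustration, and it assumes that
2.5MB means 2560KB.

```c
#include <assert.h>

/* Percentage reduction from before_kb to after_kb, rounded to nearest. */
unsigned int reduction_percent(unsigned long before_kb, unsigned long after_kb)
{
	assert(before_kb > 0 && after_kb <= before_kb);
	return (unsigned int)(((before_kb - after_kb) * 100 + before_kb / 2) /
			      before_kb);
}
```

With the numbers quoted above, reduction_percent(795, 176) gives 78 and
reduction_percent(2560, 653) gives 74.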
Note: An earlier prototype using rhashtable crashed in IRQ context
(BUG at rhashtable.h:912), which led us to adopt the tracing_map
cmpxchg-based approach.
Usage:
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
# trace output: <stack_id 42>
# resolve: cat /sys/kernel/debug/tracing/stack_map
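
To tie the two outputs together in a tool, the "<stack_id N>" token in
the trace and the "stack_id N [ref R, depth D]" header lines printed by
stack_map (see stackmap_seq_show() in patch 1/3) can be matched with
sscanf(). The sketch below assumes exactly those two formats; the
function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if `line` is a stack_map header line and fills the outputs. */
int parse_stack_header(const char *line, unsigned int *id,
		       unsigned int *ref, unsigned int *depth)
{
	assert(line && id && ref && depth);
	return sscanf(line, "stack_id %u [ref %u, depth %u]",
		      id, ref, depth) == 3;
}

/* Returns 1 if `line` contains a "<stack_id N>" token from the trace. */
int parse_trace_stack_id(const char *line, int *stack_id)
{
	const char *p;

	assert(line && stack_id);
	p = strstr(line, "<stack_id ");
	return p && sscanf(p, "<stack_id %d>", stack_id) == 1;
}
```

A resolver would read the trace, collect ids with
parse_trace_stack_id(), then scan stack_map and print the frame lines
that follow each matching header.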
[1] https://lore.kernel.org/all/20260513085145.30dd23e0@fedora/
Pengfei Li (3):
trace: add lock-free stackmap for stack trace deduplication
trace: integrate stackmap into ftrace stack recording path
trace: add documentation, selftest and tooling for stackmap
Documentation/trace/ftrace-stackmap.rst | 111 ++++
kernel/trace/Kconfig | 21 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 46 ++
kernel/trace/trace.h | 16 +
kernel/trace/trace_entries.h | 15 +
kernel/trace/trace_output.c | 23 +
kernel/trace/trace_stackmap.c | 569 ++++++++++++++++++
kernel/trace/trace_stackmap.h | 54 ++
.../ftrace/test.d/ftrace/stackmap-basic.tc | 74 +++
tools/tracing/stackmap_dump.py | 120 ++++
11 files changed, 1050 insertions(+)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100755 tools/tracing/stackmap_dump.py
--
2.34.1
* [RFC PATCH 1/3] trace: add lock-free stackmap for stack trace deduplication
From: Li Pengfei @ 2026-05-14 3:49 UTC
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28
From: Pengfei Li <lipengfei28@xiaomi.com>
Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full
stack traces (80-160 bytes each) in the ring buffer for every event,
ftrace can store a 4-byte stack_id when the stackmap option is enabled.
The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:
- Lock-free insert via cmpxchg (safe in NMI/IRQ/any context)
- Pre-allocated element pool (zero allocation on hot path)
- Linear probing with 2x over-provisioned table
- Per-trace_array instance support
The stackmap is exported via three tracefs nodes:
- stack_map: text export with symbol resolution
- stack_map_stat: statistics (entries, hits, drops, hit_rate)
- stack_map_bin: binary export for efficient userspace consumption
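
As a rough guide for consumers of stack_map_bin, the following
userspace sketch walks an image laid out as described by
struct ftrace_stackmap_bin_header and struct ftrace_stackmap_bin_entry
in trace_stackmap.h: one header, then for each stack an entry followed
by nr 64-bit IPs. The fsmb_* names are hypothetical, and the sketch
assumes the reader runs with the same byte order as the kernel that
produced the file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FSMB_MAGIC     0x464D5342u  /* FTRACE_STACKMAP_BIN_MAGIC */
#define FSMB_VERSION   2u           /* FTRACE_STACKMAP_BIN_VERSION */
#define FSMB_MAX_DEPTH 64u          /* FTRACE_STACKMAP_MAX_DEPTH */

struct fsmb_header { uint32_t magic, version, nr_stacks, reserved; };
struct fsmb_entry  { uint32_t stack_id, nr, ref_count, reserved; };

/*
 * Walk a stack_map_bin image of `len` bytes. Returns the number of
 * stacks visited, or -1 if the image is truncated or malformed. On
 * success with at least one stack, *hottest holds the stack_id with
 * the highest ref_count.
 */
long fsmb_walk(const unsigned char *buf, size_t len, uint32_t *hottest)
{
	struct fsmb_header hdr;
	struct fsmb_entry e;
	size_t off = sizeof(hdr);
	uint32_t i, best_ref = 0;

	assert(buf && hottest);
	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.magic != FSMB_MAGIC || hdr.version != FSMB_VERSION)
		return -1;

	for (i = 0; i < hdr.nr_stacks; i++) {
		if (len - off < sizeof(e))
			return -1;
		memcpy(&e, buf + off, sizeof(e));
		off += sizeof(e);

		/* The IP array must fit in what is left of the image. */
		if (e.nr > FSMB_MAX_DEPTH ||
		    (len - off) / sizeof(uint64_t) < e.nr)
			return -1;
		off += (size_t)e.nr * sizeof(uint64_t);

		if (i == 0 || e.ref_count > best_ref) {
			best_ref = e.ref_count;
			*hottest = e.stack_id;
		}
	}
	return (long)hdr.nr_stacks;
}
```

The memcpy() calls avoid unaligned loads, so the buffer can come
straight from read() without any alignment guarantees.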
Kernel command line parameter:
- ftrace_stackmap.bits=N: set map capacity (2^N unique stacks)
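
To get a feel for what a given bits value costs, the sizing rules from
trace_stackmap.c (clamp to 10..20, max_elts = 2^N, a table with twice
as many slots, and fixed-size elements of 64 IPs) can be written out
as below. stackmap_sizing() and struct sizing are hypothetical names,
and the byte count covers only the element payload; it ignores
allocator rounding, the hash table and the pointer array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEPTH 64  /* FTRACE_STACKMAP_MAX_DEPTH */

struct sizing {
	uint64_t max_elts;    /* unique stacks the pool can hold */
	uint64_t map_size;    /* hash slots, 2x over-provisioned */
	uint64_t pool_bytes;  /* payload bytes of all pooled elements */
};

struct sizing stackmap_sizing(unsigned long bits, size_t ip_size)
{
	struct sizing s;

	assert(ip_size == 4 || ip_size == 8);
	/* Same clamp as stackmap_bits_setup(): 1K to 1M stacks. */
	if (bits < 10)
		bits = 10;
	if (bits > 20)
		bits = 20;

	s.max_elts = 1ull << bits;
	s.map_size = s.max_elts * 2;
	/* u32 nr + 4-byte ref_count + MAX_DEPTH instruction pointers */
	s.pool_bytes = s.max_elts * (4 + 4 + MAX_DEPTH * ip_size);
	return s;
}
```

For the default of 14 bits on a 64-bit kernel this gives 16384
elements, 32768 slots and 8519680 bytes (about 8.1 MiB) of element
payload, which is worth keeping in mind next to the 4-8MB trace
buffers mentioned in the cover letter.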
Test results on ARM64 (SM8850, Android 16, kernel 6.12):
- 774 unique stacks from kmem_cache_alloc in 1 second
- 100% hit rate, 0 drops
- 92% hit rate under heavy load (all kmem events)
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/Kconfig | 21 ++
kernel/trace/Makefile | 1 +
kernel/trace/trace_stackmap.c | 569 ++++++++++++++++++++++++++++++++++
kernel/trace/trace_stackmap.h | 54 ++++
4 files changed, 645 insertions(+)
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..2a63fd2c9a96 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,27 @@ config STACK_TRACER
Say N if unsure.
+config FTRACE_STACKMAP
+ bool "Ftrace stack map deduplication"
+ depends on TRACING
+ depends on STACKTRACE
+ select KALLSYMS
+ help
+ This enables a global stack trace hash table for ftrace, inspired
+ by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+ only a stack_id in the ring buffer instead of the full stack trace,
+ significantly reducing trace buffer usage when the same call stacks
+ appear repeatedly.
+
+ The deduplicated stacks are exported via:
+ /sys/kernel/debug/tracing/stack_map
+
+ Writing "0" or "reset" to this file resets the stack map. Reading shows
+ all unique stacks with their stack_id and reference count.
+
+ Say Y if you want to reduce ftrace buffer usage for stack traces.
+ Say N if unsure.
+
config TRACE_PREEMPT_TOGGLE
bool
help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 1decdce8cbef..f1b6175099cc 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..c402e7e7f902
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,569 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ * - Lock-free insert via cmpxchg (safe in NMI/IRQ/any context)
+ * - Pre-allocated element pool (zero allocation on hot path)
+ * - Linear probing with 2x over-provisioned table
+ * - Per-trace_array instance support
+ *
+ * The 32-bit jhash of the stack IPs is used as the hash table key.
+ * On hash collision (different stacks, same 32-bit hash), linear
+ * probing finds the next slot. Full stack comparison (memcmp) is
+ * used to confirm matches.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/jhash.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/random.h>
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+ u32 nr; /* actual number of IPs */
+ atomic_t ref_count;
+ unsigned long ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+ u32 key; /* 0 = free, non-zero = jhash */
+ struct stackmap_elt *val; /* NULL until fully published */
+};
+
+struct ftrace_stackmap {
+ unsigned int map_bits;
+ unsigned int map_size; /* 1 << (map_bits + 1) */
+ unsigned int max_elts; /* 1 << map_bits */
+ atomic_t next_elt; /* index into elts pool */
+ struct stackmap_entry *entries; /* hash table */
+ struct stackmap_elt **elts; /* pre-allocated pool */
+ atomic_t resetting;
+ atomic64_t hits;
+ atomic64_t drops;
+};
+
+static u32 stackmap_hash_seed;
+
+static unsigned int stackmap_map_bits = 14; /* 16384 elts, 32768 slots */
+static int __init stackmap_bits_setup(char *str)
+{
+ unsigned long val;
+
+ if (kstrtoul(str, 0, &val))
+ return -EINVAL;
+ val = clamp_val(val, 10, 20); /* 1K - 1M elts */
+ stackmap_map_bits = val;
+ return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+ int idx;
+
+ idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+ if (idx < smap->max_elts)
+ return smap->elts[idx];
+ return NULL;
+}
+
+static int stackmap_alloc_elts(struct ftrace_stackmap *smap)
+{
+ unsigned int i;
+
+ smap->elts = vzalloc(sizeof(*smap->elts) * smap->max_elts);
+ if (!smap->elts)
+ return -ENOMEM;
+
+ for (i = 0; i < smap->max_elts; i++) {
+ smap->elts[i] = kzalloc(sizeof(struct stackmap_elt), GFP_KERNEL);
+ if (!smap->elts[i])
+ goto fail;
+ }
+ return 0;
+fail:
+ while (i--)
+ kfree(smap->elts[i]);
+ vfree(smap->elts);
+ smap->elts = NULL;
+ return -ENOMEM;
+}
+
+static void stackmap_free_elts(struct ftrace_stackmap *smap)
+{
+ unsigned int i;
+
+ if (!smap->elts)
+ return;
+ for (i = 0; i < smap->max_elts; i++)
+ kfree(smap->elts[i]);
+ vfree(smap->elts);
+ smap->elts = NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(void)
+{
+ struct ftrace_stackmap *smap;
+ static bool seed_initialized;
+ int err;
+
+ smap = kzalloc(sizeof(*smap), GFP_KERNEL);
+ if (!smap)
+ return ERR_PTR(-ENOMEM);
+
+ smap->map_bits = stackmap_map_bits;
+ smap->max_elts = 1 << smap->map_bits;
+ smap->map_size = smap->max_elts * 2; /* 2x over-provision */
+
+ smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+ if (!smap->entries) {
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ err = stackmap_alloc_elts(smap);
+ if (err) {
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(err);
+ }
+
+ atomic_set(&smap->next_elt, 0);
+ atomic_set(&smap->resetting, 0);
+ atomic64_set(&smap->hits, 0);
+ atomic64_set(&smap->drops, 0);
+
+ if (!seed_initialized) {
+ stackmap_hash_seed = get_random_u32();
+ seed_initialized = true;
+ }
+
+ return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+ if (!smap || IS_ERR(smap))
+ return;
+ stackmap_free_elts(smap);
+ vfree(smap->entries);
+ kfree(smap);
+}
+
+void ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+ unsigned int i;
+
+ if (!smap)
+ return;
+
+ /*
+ * Reset protocol:
+ *
+ * 1. Set resetting=1 so get_id() returns -EINVAL immediately.
+ * get_id() callers in NMI/IRQ context will see this and bail
+ * out before touching entries or elts.
+ *
+ * 2. smp_mb() ensures the resetting store is visible to all CPUs
+ * before we start clearing entries. Any get_id() that already
+ * passed the resetting check will complete its cmpxchg and
+ * WRITE_ONCE(entry->val) before we memset, because:
+ * - the cmpxchg claims the slot atomically
+ * - WRITE_ONCE(entry->val) happens before we clear entries
+ * We accept that a handful of in-flight inserts may write into
+ * entries that we are about to clear; those entries will simply
+ * be wiped by the memset below, which is safe.
+ *
+ * 3. Clear entries table, then reset elt pool.
+ *
+ * 4. Issue another smp_mb(), then set resetting=0, so new get_id()
+ * calls see a fully reset map.
+ */
+ atomic_set(&smap->resetting, 1);
+ smp_mb();
+
+ /* Clear hash table */
+ memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+
+ /* Reset elt pool */
+ for (i = 0; i < smap->max_elts; i++)
+ memset(smap->elts[i], 0, sizeof(struct stackmap_elt));
+
+ atomic_set(&smap->next_elt, 0);
+ atomic64_set(&smap->hits, 0);
+ atomic64_set(&smap->drops, 0);
+
+ smp_mb();
+ atomic_set(&smap->resetting, 0);
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries)
+{
+ u32 key_hash, idx, test_key, trace_len;
+ struct stackmap_entry *entry;
+ struct stackmap_elt *val;
+ int dup_try = 0;
+
+ if (!smap || !nr_entries || atomic_read(&smap->resetting))
+ return -EINVAL;
+ if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+ nr_entries = FTRACE_STACKMAP_MAX_DEPTH;
+
+ trace_len = nr_entries * sizeof(unsigned long);
+ /*
+ * jhash2() takes its length in u32 units and expects u32-aligned
+ * data. trace_len is a multiple of sizeof(unsigned long), which is
+ * 4 or 8, so it is always a multiple of 4. The cast to u32* is safe
+ * because ips[] is naturally aligned to sizeof(unsigned long),
+ * which is at least 4.
+ */
+ key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+ stackmap_hash_seed);
+ if (key_hash == 0)
+ key_hash = 1; /* 0 means free slot */
+
+ idx = key_hash >> (32 - (smap->map_bits + 1));
+
+ while (1) {
+ idx &= (smap->map_size - 1);
+ entry = &smap->entries[idx];
+ test_key = entry->key;
+
+ if (test_key && test_key == key_hash) {
+ val = READ_ONCE(entry->val);
+ if (val && val->nr == nr_entries &&
+ memcmp(val->ips, ips, trace_len) == 0) {
+ atomic_inc(&val->ref_count);
+ atomic64_inc(&smap->hits);
+ return (int)idx;
+ } else if (unlikely(!val)) {
+ /* Another CPU is mid-insert; retry */
+ dup_try++;
+ if (dup_try > smap->map_size) {
+ atomic64_inc(&smap->drops);
+ break;
+ }
+ continue;
+ }
+ }
+
+ if (!test_key) {
+ /* Free slot: try to claim it */
+ if (!cmpxchg(&entry->key, 0, key_hash)) {
+ struct stackmap_elt *elt;
+
+ elt = stackmap_get_elt(smap);
+ if (!elt) {
+ /*
+ * Pool exhausted. We claimed this slot with
+ * cmpxchg but cannot fill it. Leave key set
+ * so the slot stays "claimed but empty" —
+ * future lookups will skip it (val == NULL
+ * triggers the mid-insert retry path which
+ * will eventually drop). This is safer than
+ * writing key=0 without cmpxchg, which could
+ * race with another CPU's cmpxchg on the same
+ * slot.
+ */
+ atomic64_inc(&smap->drops);
+ break;
+ }
+
+ elt->nr = nr_entries;
+ atomic_set(&elt->ref_count, 1);
+ memcpy(elt->ips, ips, trace_len);
+
+ /* Ensure elt is fully visible before publish */
+ smp_wmb();
+ WRITE_ONCE(entry->val, elt);
+ atomic64_inc(&smap->hits);
+ return (int)idx;
+ } else {
+ /* cmpxchg failed; someone else claimed it */
+ dup_try++;
+ continue;
+ }
+ }
+
+ idx++;
+ dup_try++;
+ if (dup_try > smap->map_size) {
+ atomic64_inc(&smap->drops);
+ break;
+ }
+ }
+
+ return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+ struct ftrace_stackmap *smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ for (i = *pos; i < smap->map_size; i++) {
+ if (smap->entries[i].key && smap->entries[i].val) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ for (i = *pos + 1; i < smap->map_size; i++) {
+ if (smap->entries[i].key && smap->entries[i].val) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ *pos = i;
+ return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v) { }
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+ struct stackmap_entry *entry = v;
+ struct stackmap_elt *elt = entry->val;
+ struct stackmap_seq_private *priv = m->private;
+ u32 idx = entry - priv->smap->entries;
+ u32 i;
+
+ if (!elt)
+ return 0;
+
+ seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+ idx, atomic_read(&elt->ref_count), elt->nr);
+ for (i = 0; i < elt->nr; i++)
+ seq_printf(m, " [%u] %pS\n", i, (void *)elt->ips[i]);
+ seq_putc(m, '\n');
+ return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+ .start = stackmap_seq_start,
+ .next = stackmap_seq_next,
+ .stop = stackmap_seq_stop,
+ .show = stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+ struct stackmap_seq_private *priv;
+ struct seq_file *m;
+ int ret;
+
+ ret = seq_open_private(file, &stackmap_seq_ops,
+ sizeof(struct stackmap_seq_private));
+ if (ret)
+ return ret;
+ m = file->private_data;
+ priv = m->private;
+ priv->smap = inode->i_private;
+ return 0;
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct seq_file *m = file->private_data;
+ struct stackmap_seq_private *priv = m->private;
+ char buf[8];
+ size_t n = min(count, sizeof(buf) - 1);
+
+ if (copy_from_user(buf, ubuf, n))
+ return -EFAULT;
+ buf[n] = '\0';
+ if (n == 0 || (buf[0] != '0' && strncmp(buf, "reset", 5) != 0))
+ return -EINVAL;
+
+ ftrace_stackmap_reset(priv->smap);
+ return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+ .open = stackmap_open,
+ .read = seq_read,
+ .write = stackmap_write,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+ struct ftrace_stackmap *smap = m->private;
+ u32 entries;
+ u64 hits, drops;
+
+ if (!smap) {
+ seq_puts(m, "stackmap not initialized\n");
+ return 0;
+ }
+
+ entries = atomic_read(&smap->next_elt);
+ hits = atomic64_read(&smap->hits);
+ drops = atomic64_read(&smap->drops);
+
+ seq_printf(m, "entries: %u / %u\n", entries, smap->max_elts);
+ seq_printf(m, "table_size: %u\n", smap->map_size);
+ seq_printf(m, "hits: %llu\n", hits);
+ seq_printf(m, "drops: %llu\n", drops);
+ if (hits + drops > 0)
+ seq_printf(m, "hit_rate: %llu%%\n",
+ hits * 100 / (hits + drops));
+ return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+ .open = stackmap_stat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+ size_t size;
+ char data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+ struct ftrace_stackmap *smap = inode->i_private;
+ struct stackmap_bin_snapshot *snap;
+ struct ftrace_stackmap_bin_header *hdr;
+ size_t alloc_size, off;
+ u32 i, nr_stacks;
+
+ if (!smap)
+ return -ENODEV;
+
+ /*
+ * Allocate based on actual entry count, not max_elts worst case.
+ * Each entry needs a header struct plus up to MAX_DEPTH u64 IPs.
+ * Add 1 to nr_entries to avoid zero-size alloc on empty map.
+ */
+ {
+ u32 nr_entries = atomic_read(&smap->next_elt);
+
+ alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+ (sizeof(struct ftrace_stackmap_bin_entry) +
+ FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+ }
+
+ snap = vmalloc(sizeof(*snap) + alloc_size);
+ if (!snap)
+ return -ENOMEM;
+
+ hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+ hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+ hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+ hdr->reserved = 0;
+ off = sizeof(*hdr);
+ nr_stacks = 0;
+
+ for (i = 0; i < smap->map_size; i++) {
+ struct stackmap_entry *entry = &smap->entries[i];
+ struct stackmap_elt *elt;
+ struct ftrace_stackmap_bin_entry *e;
+ u64 *ips_out;
+ u32 k;
+
+ if (!entry->key)
+ continue;
+ elt = READ_ONCE(entry->val);
+ if (!elt)
+ continue;
+
+ e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+ e->stack_id = i;
+ e->nr = elt->nr;
+ e->ref_count = atomic_read(&elt->ref_count);
+ e->reserved = 0;
+ off += sizeof(*e);
+
+ ips_out = (u64 *)(snap->data + off);
+ for (k = 0; k < elt->nr; k++)
+ ips_out[k] = (u64)elt->ips[k];
+ off += elt->nr * sizeof(u64);
+ nr_stacks++;
+ }
+
+ hdr->nr_stacks = nr_stacks;
+ snap->size = off;
+ file->private_data = snap;
+ return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct stackmap_bin_snapshot *snap = file->private_data;
+
+ if (!snap)
+ return -EINVAL;
+ return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+ vfree(file->private_data);
+ return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+ .open = stackmap_bin_open,
+ .read = stackmap_bin_read,
+ .llseek = default_llseek,
+ .release = stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..74ad649a79f7
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define FTRACE_STACKMAP_MAX_DEPTH 64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC 0x464D5342 /* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION 2
+
+struct ftrace_stackmap_bin_header {
+ u32 magic;
+ u32 version;
+ u32 nr_stacks;
+ u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+ u32 stack_id;
+ u32 nr;
+ u32 ref_count;
+ u32 reserved;
+ /* followed by u64 ips[nr] */
+};
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(void);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries);
+void ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *ftrace_stackmap_create(void) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+ unsigned long *ips, unsigned int n)
+{ return -ENOSYS; }
+static inline void ftrace_stackmap_reset(struct ftrace_stackmap *s) { }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
--
2.34.1
* [RFC PATCH 2/3] trace: integrate stackmap into ftrace stack recording path
From: Li Pengfei @ 2026-05-14 3:49 UTC
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28
From: Pengfei Li <lipengfei28@xiaomi.com>
Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.
Changes:
- New TRACE_STACK_ID in trace_type enum
- New stack_id_entry in trace_entries.h (just 'int stack_id')
- New TRACE_ITER_STACKMAP trace option flag
- Modified __ftrace_trace_stack() to call ftrace_stackmap_get_id()
when stackmap option is active
- Added stack_id print handler in trace_output.c
- Added stackmap field to struct trace_array (per-instance support)
The stack_id event is committed unconditionally (no filter check)
since it is a synthetic side-event tied to the parent event which
was already subject to filtering.
Fallback behavior: if stackmap returns an error (pool exhausted or
resetting), the full stack trace is recorded as before.
Usage:
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/trace.c | 46 ++++++++++++++++++++++++++++++++++++
kernel/trace/trace.h | 16 +++++++++++++
kernel/trace/trace_entries.h | 15 ++++++++++++
kernel/trace/trace_output.c | 23 ++++++++++++++++++
4 files changed, 100 insertions(+)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..c72cb8491217 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
#include "trace.h"
#include "trace_output.h"
+#include "trace_stackmap.h"
#ifdef CONFIG_FTRACE_STARTUP_TEST
/*
@@ -2184,6 +2185,37 @@ void __ftrace_trace_stack(struct trace_array *tr,
}
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+ /*
+ * If stackmap dedup is enabled, try to store only the stack_id
+ * in the ring buffer instead of the full stack trace.
+ */
+ if (tr->trace_flags & TRACE_ITER_STACKMAP) {
+ struct stack_id_entry *sid_entry;
+ int sid;
+
+ sid = ftrace_stackmap_get_id(tr->stackmap, fstack->calls, nr_entries);
+ if (sid >= 0) {
+ event = __trace_buffer_lock_reserve(buffer,
+ TRACE_STACK_ID,
+ sizeof(*sid_entry), trace_ctx);
+ if (!event)
+ goto out;
+ sid_entry = ring_buffer_event_data(event);
+ sid_entry->stack_id = sid;
+ /*
+ * stack_id is a synthetic side-event attached to a
+ * primary trace event that was already subject to
+ * filtering. No per-event filter is defined for
+ * TRACE_STACK_ID, so commit unconditionally.
+ */
+ __buffer_unlock_commit(buffer, event);
+ goto out;
+ }
+ /* Fall through to full stack on stackmap failure */
+ }
+#endif
+
event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
struct_size(entry, caller, nr_entries),
trace_ctx);
@@ -9222,6 +9254,20 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
NULL, &tracing_dyn_info_fops);
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+ global_trace.stackmap = ftrace_stackmap_create();
+ if (!IS_ERR(global_trace.stackmap)) {
+ trace_create_file("stack_map", TRACE_MODE_WRITE, NULL,
+ global_trace.stackmap, &ftrace_stackmap_fops);
+ trace_create_file("stack_map_stat", TRACE_MODE_READ, NULL,
+ global_trace.stackmap, &ftrace_stackmap_stat_fops);
+ trace_create_file("stack_map_bin", TRACE_MODE_READ, NULL,
+ global_trace.stackmap, &ftrace_stackmap_bin_fops);
+ } else {
+ pr_warn("ftrace stackmap init failed, dedup disabled\n");
+ global_trace.stackmap = NULL;
+ }
+#endif
create_trace_instances(NULL);
update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..74f421a89347 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
TRACE_TIMERLAT,
TRACE_RAW_DATA,
TRACE_FUNC_REPEATS,
+ TRACE_STACK_ID,
__TRACE_LAST_TYPE,
};
@@ -453,6 +454,9 @@ struct trace_array {
struct cond_snapshot *cond_snapshot;
#endif
struct trace_func_repeats __percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+ struct ftrace_stackmap *stackmap;
+#endif
/*
* On boot up, the ring buffer is set to the minimum size, so that
* we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
TRACE_GRAPH_RET); \
IF_ASSIGN(var, ent, struct func_repeats_entry, \
TRACE_FUNC_REPEATS); \
+ IF_ASSIGN(var, ent, struct stack_id_entry, \
+ TRACE_STACK_ID); \
__ftrace_bad_type(); \
} while (0)
@@ -1449,7 +1455,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
# define STACK_FLAGS
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS \
+ C(STACKMAP, "stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP 0UL
+#endif
+
#ifdef CONFIG_FUNCTION_PROFILER
+
# define PROFILER_FLAGS \
C(PROF_TEXT_OFFSET, "prof-text-offset"),
# ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1521,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
FUNCTION_FLAGS \
FGRAPH_FLAGS \
STACK_FLAGS \
+ STACKMAP_FLAGS \
BRANCH_FLAGS \
PROFILER_FLAGS \
FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
(void *)__entry->caller[6], (void *)__entry->caller[7])
);
+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+ TRACE_STACK_ID,
+
+ F_STRUCT(
+ __field( int, stack_id )
+ ),
+
+ F_printk("<stack_id %d>", __entry->stack_id)
+);
+
/*
* trace_printk entry:
*/
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
.funcs = &trace_user_stack_funcs,
};
+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+ int flags, struct trace_event *event)
+{
+ struct stack_id_entry *field;
+ struct trace_seq *s = &iter->seq;
+
+ trace_assign_type(field, iter->ent);
+ trace_seq_printf(s, "<stack_id %d>\n", field->stack_id);
+
+ return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+ .trace = trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+ .type = TRACE_STACK_ID,
+ .funcs = &trace_stack_id_funcs,
+};
+
/* TRACE_HWLAT */
static enum print_line_t
trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
&trace_wake_event,
&trace_stack_event,
&trace_user_stack_event,
+ &trace_stack_id_event,
&trace_bputs_event,
&trace_bprint_event,
&trace_print_event,
--
2.34.1
* [RFC PATCH 3/3] trace: add documentation, selftest and tooling for stackmap
From: Li Pengfei @ 2026-05-14 3:49 UTC
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28
From: Pengfei Li <lipengfei28@xiaomi.com>
Add supporting files for the ftrace stackmap feature:
Documentation/trace/ftrace-stackmap.rst:
Comprehensive documentation covering design, usage, tracefs
interface, binary format, and performance characteristics.
tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
Basic functional selftest that verifies:
- stackmap tracefs nodes exist
- enabling stackmap + stacktrace produces stack_id events
- stack_map_stat shows non-zero hits
- reset clears entries
tools/tracing/stackmap_dump.py:
Python script to parse the binary stack_map_bin export.
Supports offline symbol resolution via addr2line, JSON output,
and top-N filtering by ref_count.
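The record layout the script consumes can be exercised without a device. The following is only an illustrative sketch: it assumes the little-endian layout used by the script's struct format strings (16-byte header, 16-byte per-stack header, then nr u64 IPs), and build_blob(), parse_blob() and the sample addresses are hypothetical, not part of this patch.

```python
import struct

MAGIC = 0x464D5342
HEADER_FMT = '<IIII'   # magic, version, nr_stacks, reserved
ENTRY_FMT = '<IIII'    # stack_id, nr, ref_count, reserved

def build_blob(stacks, version=2):
    """Hypothetical helper: pack (stack_id, ref_count, ips) tuples."""
    blob = struct.pack(HEADER_FMT, MAGIC, version, len(stacks), 0)
    for stack_id, ref_count, ips in stacks:
        blob += struct.pack(ENTRY_FMT, stack_id, len(ips), ref_count, 0)
        blob += struct.pack(f'<{len(ips)}Q', *ips)
    return blob

def parse_blob(data):
    """Decode a blob back into (stack_id, ref_count, ips) tuples."""
    magic, version, nr_stacks, _ = struct.unpack_from(HEADER_FMT, data, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    offset = struct.calcsize(HEADER_FMT)
    out = []
    for _ in range(nr_stacks):
        stack_id, nr, ref_count, _ = struct.unpack_from(ENTRY_FMT, data, offset)
        offset += struct.calcsize(ENTRY_FMT)
        ips = list(struct.unpack_from(f'<{nr}Q', data, offset))
        offset += nr * 8
        out.append((stack_id, ref_count, ips))
    return out

# Made-up sample data: two stacks with two and one frames.
sample = [(42, 1337, [0xffff800080010000, 0xffff800080010abc]),
          (43, 2, [0xffff800080020000])]
blob = build_blob(sample)
decoded = parse_blob(blob)
print(len(blob), decoded == sample)
```

A synthetic blob like this may be handy for testing the parser on a host machine before pulling a real stack_map_bin from a target.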
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
Documentation/trace/ftrace-stackmap.rst | 111 ++++++++++++++++
.../ftrace/test.d/ftrace/stackmap-basic.tc | 74 +++++++++++
tools/tracing/stackmap_dump.py | 120 ++++++++++++++++++
3 files changed, 305 insertions(+)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100755 tools/tracing/stackmap_dump.py
diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..8f6410d4258c
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,111 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li <lipengfei28@xiaomi.com>
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks (default: 14, range: 10-20)
+
+Usage
+=====
+
+Enable stack deduplication::
+
+ echo 1 > /sys/kernel/debug/tracing/options/stackmap
+ echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+ echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ``<stack_id N>`` instead of full stack traces::
+
+ sh-1234 [006] d.h.. 123.456789: <stack_id 42>
+
+To view the actual stacks::
+
+ cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+ stack_id 42 [ref 1337, depth 8]
+ [0] schedule+0x48/0xc0
+ [1] schedule_timeout+0x1c/0x30
+ ...
+
+To view statistics::
+
+ cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+ entries: 2500
+ table_size: 5000
+ hits: 148923
+ drops: 0
+ hit_rate: 98%
+
+To reset the stack map::
+
+ echo 0 > /sys/kernel/debug/tracing/stack_map
+
+Tracefs Nodes
+=============
+
+``stack_map``
+ Text export of all deduplicated stacks with symbol resolution.
+ Writing ``0`` or ``reset`` clears all entries.
+
+``stack_map_stat``
+ Statistics: entry count, hits, drops, and hit rate.
+
+``stack_map_bin``
+ Binary export for efficient userspace consumption. Format:
+
+ - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+ - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+ Magic: ``0x464D5342`` (hex digits spell ASCII 'FMSB'), Version: 2
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr. Cliff Click's non-blocking hash table
+algorithm:
+
+- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context
+- **Memory**: Pre-allocated element pool, zero allocation on the hot path
+ (no GFP_ATOMIC failures under memory pressure)
+- **Collision**: Linear probing with a 2x over-provisioned table
+- **Per-instance**: Each trace_array has its own stackmap, supporting
+ multiple ftrace instances
+- **Hash**: 32-bit jhash of stack IPs; full ``memcmp`` confirms matches
+
+Performance
+===========
+
+Typical results on an ARM64 Android device (function tracer, 2 seconds):
+
+- Unique stacks: ~3000
+- Hit rate: 84-98% (depends on workload diversity)
+- Ring buffer savings: ~80% for stack data
+- Overhead per event: ~50ns (one jhash + hash table lookup)
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
new file mode 100755
index 000000000000..3b0a7f60769f
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
@@ -0,0 +1,74 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap basic functionality
+# requires: stack_map options/stackmap
+
+# Test that ftrace stackmap deduplication works:
+# 1. Enable stackmap + stacktrace options
+# 2. Run function tracer briefly
+# 3. Verify stack_map has entries
+# 4. Verify stack_map_stat shows hits
+# 5. Verify trace contains <stack_id> events
+# 6. Verify reset works
+
+fail() {
+ echo "FAIL: $1"
+ exit_fail
+}
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat has entries
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+if [ "$entries" -eq 0 ]; then
+ fail "stackmap has zero entries after tracing"
+fi
+
+# Check hits > 0
+hits=$(cat stack_map_stat | grep "^hits:" | awk '{print $2}')
+if [ "$hits" -eq 0 ]; then
+ fail "stackmap has zero hits"
+fi
+
+# Record drops for the summary line (not asserted; pool should be large enough for 1s trace)
+drops=$(cat stack_map_stat | grep "^drops:" | awk '{print $2}')
+
+# Check stack_map text output is parseable
+first_id=$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}')
+if [ -z "$first_id" ]; then
+ fail "stack_map output has no stack_id entries"
+fi
+
+# Check trace has stack_id events
+count=$(cat trace | grep -c "stack_id" || true)
+if [ "$count" -eq 0 ]; then
+ fail "trace has no <stack_id> events"
+fi
+
+# Test reset
+echo 0 > stack_map
+entries_after=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+if [ "$entries_after" -ne 0 ]; then
+ fail "stackmap reset did not clear entries"
+fi
+
+echo "stackmap basic test passed: $entries unique stacks, $hits hits, $drops drops"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..91ce80c681ea
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,120 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+ # Pull from device and parse
+ adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+ python3 stackmap_dump.py /tmp/stack_map.bin
+
+ # With vmlinux for offline symbol resolution
+ python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+ # JSON output for tooling
+ python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x464D5342  # hex digits spell ASCII 'FMSB'
+HEADER_FMT = '<IIII' # magic, version, nr_stacks, reserved
+ENTRY_FMT = '<IIII' # stack_id, nr, ref_count, reserved
+HEADER_SIZE = struct.calcsize(HEADER_FMT)
+ENTRY_SIZE = struct.calcsize(ENTRY_FMT)
+
+
+def addr2line(vmlinux, addr):
+ """Resolve address to symbol using addr2line."""
+ try:
+ result = subprocess.run(
+ ['addr2line', '-f', '-e', vmlinux, hex(addr)],
+ capture_output=True, text=True, timeout=5
+ )
+ lines = result.stdout.strip().split('\n')
+ if len(lines) >= 1 and lines[0] != '??':
+ return lines[0]
+ except (subprocess.TimeoutExpired, FileNotFoundError):
+ pass
+ return None
+
+
+def parse_stackmap_bin(data):
+ """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+ if len(data) < HEADER_SIZE:
+ raise ValueError("File too small for header")
+
+ magic, version, nr_stacks, _ = struct.unpack_from(HEADER_FMT, data, 0)
+ if magic != MAGIC:
+ raise ValueError(f"Bad magic: 0x{magic:08x}, expected 0x{MAGIC:08x}")
+ if version not in (1, 2):
+ raise ValueError(f"Unsupported version: {version}")
+
+ offset = HEADER_SIZE
+ for _ in range(nr_stacks):
+ if offset + ENTRY_SIZE > len(data):
+ break
+ stack_id, nr, ref_count, _ = struct.unpack_from(ENTRY_FMT, data, offset)
+ offset += ENTRY_SIZE
+
+ ips_size = nr * 8
+ if offset + ips_size > len(data):
+ break
+ ips = struct.unpack_from(f'<{nr}Q', data, offset)
+ offset += ips_size
+
+ yield stack_id, ref_count, list(ips)
+
+
+def main():
+ parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+ parser.add_argument('file', help='Path to stack_map_bin file')
+ parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+ parser.add_argument('--json', action='store_true', help='JSON output')
+ parser.add_argument('--top', type=int, default=0,
+ help='Show only top N stacks by ref_count')
+ args = parser.parse_args()
+
+ with open(args.file, 'rb') as f:
+ data = f.read()
+
+ stacks = list(parse_stackmap_bin(data))
+
+ if args.top > 0:
+ stacks.sort(key=lambda x: x[1], reverse=True)
+ stacks = stacks[:args.top]
+
+ if args.json:
+ output = []
+ for stack_id, ref_count, ips in stacks:
+ entry = {
+ 'stack_id': stack_id,
+ 'ref_count': ref_count,
+ 'ips': [f'0x{ip:x}' for ip in ips]
+ }
+ if args.vmlinux:
+ entry['symbols'] = [addr2line(args.vmlinux, ip) or f'0x{ip:x}'
+ for ip in ips]
+ output.append(entry)
+ print(json.dumps(output, indent=2))
+ else:
+ for stack_id, ref_count, ips in stacks:
+ print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+ for i, ip in enumerate(ips):
+ sym = ''
+ if args.vmlinux:
+ resolved = addr2line(args.vmlinux, ip)
+ if resolved:
+ sym = f' {resolved}'
+ print(f" [{i}] 0x{ip:x}{sym}")
+ print()
+
+ print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+ main()
--
2.34.1
* Re: [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer
2026-05-14 3:49 [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer Li Pengfei
` (2 preceding siblings ...)
2026-05-14 3:49 ` [RFC PATCH 3/3] trace: add documentation, selftest and tooling for stackmap Li Pengfei
@ 2026-05-21 15:23 ` Steven Rostedt
2026-05-22 10:40 ` [RFC PATCH v2 " Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 " Li Pengfei
5 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2026-05-21 15:23 UTC (permalink / raw)
To: Li Pengfei
Cc: linux-trace-kernel, mhiramat, linux-kernel, cmllamas, zhangbo56,
lipengfei28
On Thu, 14 May 2026 11:49:13 +0800
Li Pengfei <ljdlns1987@gmail.com> wrote:
> From: Pengfei Li <lipengfei28@xiaomi.com>
>
> Hi Steven, all,
>
Hi Pengfei,
Can you address the Sashiko reviews:
https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260514034916.2162517-1-lipengfei28%40xiaomi.com
It has a way to copy the comments. Just reply to this series with a paste of
Sashiko's review and reply to them to explain why the comments may not be
an issue, or submit a new version with fixes if they are issues.
Thanks,
-- Steve
* [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer
2026-05-14 3:49 [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer Li Pengfei
` (3 preceding siblings ...)
2026-05-21 15:23 ` [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer Steven Rostedt
@ 2026-05-22 10:40 ` Li Pengfei
2026-05-22 10:40 ` [PATCH v2 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei
` (3 more replies)
2026-05-26 11:52 ` [RFC PATCH v3 " Li Pengfei
5 siblings, 4 replies; 15+ messages in thread
From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw)
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28,
lkp
From: Pengfei Li <lipengfei28@xiaomi.com>
Hi Steven, all,
This is v2 of the ftrace stackmap series. It addresses the Sashiko
review at [1] and incorporates the kernel test robot's toctree fix.
The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, while the full
stacks are exported via tracefs.
Problem
=======
With stacktrace enabled, each trace event stores a full kernel
stack (typically 10-20 frames x 8 bytes = 80-160 bytes). On
production devices with 4-8 MB trace buffers, this fills the
buffer in seconds, limiting the usefulness of boot-time tracing
and always-on performance monitoring.
Design
======
The implementation is a lock-free hash map modeled after
tracing_map.c, as suggested by Steven [2]:
- lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- pre-allocated element pool, so there is no allocation on the hot path
- linear probing with a 2x over-provisioned table
- bounded probe length to keep worst-case lookup/insert cost bounded
- currently implemented for the global trace instance
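The probing scheme in the list above can be modelled in a few lines. This is a single-threaded sketch only, not the kernel code: Python's built-in hash() stands in for the seeded jhash, the slot index is used as the stack_id (an assumption), and the kernel's cmpxchg slot claim and claimed-but-empty slots are not modelled. StackMapModel and MAX_PROBE are illustrative names.

```python
MAX_PROBE = 64  # mirrors the bounded probe length described above

class StackMapModel:
    """Single-threaded model; the kernel claims slots with cmpxchg."""

    def __init__(self, bits):
        self.max_elts = 1 << bits
        self.size = 1 << (bits + 1)      # 2x over-provisioned table
        self.keys = [0] * self.size      # key 0 marks a free slot
        self.vals = [None] * self.size
        self.used = 0                    # elements taken from the pool

    def get_id(self, stack):
        stack = tuple(stack)
        key = (hash(stack) & 0xFFFFFFFF) or 1   # stand-in hash, never 0
        idx = key & (self.size - 1)
        for _ in range(MAX_PROBE):
            if self.keys[idx] == 0:
                if self.used >= self.max_elts:
                    return None          # pool exhausted: record full stack
                self.keys[idx] = key
                self.vals[idx] = stack
                self.used += 1
                return idx               # first-time insert
            if self.keys[idx] == key and self.vals[idx] == stack:
                return idx               # dedup hit: same id as before
            idx = (idx + 1) & (self.size - 1)   # linear probe
        return None                      # probe limit reached

m = StackMapModel(bits=4)
a = m.get_id([0x1000, 0x2000, 0x3000])
b = m.get_id([0x1000, 0x2000, 0x3000])
c = m.get_id([0x1000, 0x2000])
print(a == b, a != c)
```

Returning None here corresponds to the fallback case in which the caller records the full stack instead of a stack_id.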
The ring buffer stores only stack_id. Full stacks are exported via:
/sys/kernel/debug/tracing/stack_map
/sys/kernel/debug/tracing/stack_map_stat
/sys/kernel/debug/tracing/stack_map_bin
Reset semantics
===============
Reset is treated as a control-path operation and is only supported
when tracing is stopped on the owning trace_array. Online reset is
intentionally not supported.
The reset path:
- atomically claims reset rights via cmpxchg
- rejects reset with -EBUSY if tracing is active
- blocks new get_id() callers via the resetting flag
- waits for in-flight ftrace callback paths with synchronize_rcu()
- clears the map and releases resetting with release semantics
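The ordering of these steps can be sketched as a small state model. It is a simplification under stated assumptions: it is single-threaded, so the cmpxchg claim becomes a plain flag test, and synchronize_rcu() and the release store appear only as comments. ResetModel and its fields are illustrative names, not the kernel structures.

```python
import errno

class ResetModel:
    def __init__(self):
        self.resetting = False
        self.tracing_on = False
        self.entries = {}

    def get_id(self, stack):
        if self.resetting:
            return -errno.EINVAL         # new callers are turned away
        return self.entries.setdefault(tuple(stack), len(self.entries) + 1)

    def reset(self):
        if self.resetting:               # models a failed cmpxchg(0 -> 1)
            return -errno.EBUSY
        self.resetting = True            # claim reset rights
        if self.tracing_on:
            self.resetting = False       # give the claim back
            return -errno.EBUSY
        # The kernel waits here (synchronize_rcu) for in-flight callers.
        self.entries.clear()
        self.resetting = False           # the kernel uses a release store
        return 0

m = ResetModel()
m.tracing_on = True
m.get_id([0x1000, 0x2000])
print(m.reset() == -errno.EBUSY)         # rejected while tracing is active
m.tracing_on = False
print(m.reset() == 0, m.entries == {})   # succeeds once tracing is stopped
```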
Why not reuse tracing_map.c
===========================
This series follows the same overall lock-free approach, but uses a
purpose-built structure. tracing_map.c is designed for histogram-style
aggregation with fixed-size keys and value fields, while this use case
needs variable-length stack storage plus reference counting.
Why not reuse BPF stackmap
==========================
BPF_MAP_TYPE_STACK_TRACE addresses a similar problem, but requires a
BPF program and the BPF runtime. This series keeps the functionality
inside ftrace and available without CONFIG_BPF.
Unlike BPF stackmap, which may replace entries on collision, this
design keeps stack_id stable once assigned, which is important because
ring buffer events may reference that stack_id long after insertion.
Test results
============
Platform: ARM64 Qualcomm SM8850 (8 cores), kernel 6.12, bits=14,
tracing sched_switch + kmem_cache_alloc with stacktrace trigger,
5-second capture, default ring buffer.
Per-event payload (measured from tracing stats):
  Event              Full stack    Stackmap     Reduction
  -----------------  ------------  -----------  ---------
  sched_switch       102 B/entry   48 B/entry   -53%
  kmem_cache_alloc   111 B/entry   44 B/entry   -60%
In the same 5-second capture window, the smaller per-event footprint
translated to many more retained events before wraparound. For
sched_switch:
- without stackmap: 43,950 retained entries
- with stackmap: 1,710,044 retained entries
During the same runs, the stackmap observed a few thousand unique
stacks and no drops.
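The reduction figures for the per-event payload follow directly from the B/entry numbers quoted above. A short check, where reduction_pct() is just an illustrative helper:

```python
def reduction_pct(full_bytes, stackmap_bytes):
    """Percentage saved per entry when a stack_id replaces the full stack."""
    return 100.0 * (full_bytes - stackmap_bytes) / full_bytes

# Per-entry sizes taken from the measurements above.
for event, full, dedup in [("sched_switch", 102, 48),
                           ("kmem_cache_alloc", 111, 44)]:
    print(f"{event}: {reduction_pct(full, dedup):.1f}%")
```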
Boot-time activation is also supported via:
trace_options=stackmap,stacktrace
Events that occur before stackmap initialization fall back to full
stack traces; later events are deduplicated. This transition does
not itself drop events, but early boot stacks recorded before
initialization are not deduplicated.
QEMU validation
===============
The series also runs cleanly in QEMU on aarch64 (mainline,
qemu-system-aarch64, 2 vCPU, virt machine, busybox initrd).
A post-init smoke test verified:
- stack_map, stack_map_stat, stack_map_bin, and options/stackmap exist
- enabling stackmap + stacktrace produces stack_id events
- stack_map_stat shows non-zero successes and zero drops
- reset is rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- stack_map_bin magic is correct
Changes since RFC v1
====================
- tightened reset semantics: reset now requires tracing to be stopped
and returns -EBUSY if tracing is active or another reset is in progress
- fixed publication/consumption ordering with smp_store_release() /
smp_load_acquire()
- bounded probe length and added pool-exhaustion fast-path handling
- moved hash_seed into struct ftrace_stackmap
- switched the element pool to a single flat vmalloc allocation
- bounded bits range to [10, 18] to limit worst-case memory usage
- fixed TRACE_ITER(STACKMAP) handling
- tightened stack_map reset input parsing
- renamed stat counters to "successes" / "success_rate" so the meaning
is unambiguous (counts events served, including first-time inserts)
- added documentation, selftest coverage, and userspace dump tooling
Known limitations
=================
- Per-instance stackmap support is not included in this series.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once the
binary format settles.
Usage
=====
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
[1] https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260514034916.2162517-1-lipengfei28%40xiaomi.com
[2] https://lore.kernel.org/all/20260513085145.30dd23e0@fedora/
Pengfei Li (3):
trace: add lock-free stackmap for stack trace deduplication
trace: integrate stackmap into ftrace stack recording path
trace: add documentation, selftest and tooling for stackmap
Documentation/trace/ftrace-stackmap.rst | 145 ++++
Documentation/trace/index.rst | 1 +
kernel/trace/Kconfig | 21 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 66 ++
kernel/trace/trace.h | 16 +
kernel/trace/trace_entries.h | 15 +
kernel/trace/trace_output.c | 23 +
kernel/trace/trace_stackmap.c | 643 ++++++++++++++++++
kernel/trace/trace_stackmap.h | 56 ++
.../ftrace/test.d/ftrace/stackmap-basic.tc | 100 +++
tools/tracing/stackmap_dump.py | 150 ++++
12 files changed, 1237 insertions(+)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100755 tools/tracing/stackmap_dump.py
--
2.34.1
* [PATCH v2 1/3] trace: add lock-free stackmap for stack trace deduplication
2026-05-22 10:40 ` [RFC PATCH v2 " Li Pengfei
@ 2026-05-22 10:40 ` Li Pengfei
2026-05-22 10:40 ` [PATCH v2 2/3] trace: integrate stackmap into ftrace stack recording path Li Pengfei
` (2 subsequent siblings)
3 siblings, 0 replies; 15+ messages in thread
From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw)
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28,
lkp
From: Pengfei Li <lipengfei28@xiaomi.com>
Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full
stack traces (80-160 bytes each) in the ring buffer for every event,
ftrace can store a 4-byte stack_id when the stackmap option is enabled.
The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:
- Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- Pre-allocated element pool (zero allocation on hot path)
- Linear probing with 2x over-provisioned table; probe length is
bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup
is O(1) even when the table is heavily loaded with claimed-but-
empty slots from pool exhaustion
- Single global instance (initialized for the global trace array)
The stackmap is exported via three tracefs nodes:
- stack_map: text export with symbol resolution (mode 0640)
- stack_map_stat: counters (entries, successes, drops, success_rate)
- stack_map_bin: binary export (all fields native-endian)
Counter naming:
- 'successes' counts events that were successfully assigned a
stack_id (covers both first-time inserts and dedup hits).
- 'drops' counts events that fell back to recording the full stack
(pool exhausted, probe limit reached, or reset in progress).
- 'success_rate' is successes / (successes + drops).
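To make the counter semantics concrete, here is a toy tally under simplifying assumptions: a drop happens only when the pool is full (probe-limit and reset drops are ignored), and stacks are plain strings. tally() and its capacity argument are hypothetical.

```python
def tally(events, capacity):
    """Count successes and drops for a stream of stacks."""
    seen, successes, drops = set(), 0, 0
    for stack in events:
        if stack in seen or len(seen) < capacity:
            seen.add(stack)              # first-time insert or dedup hit
            successes += 1
        else:
            drops += 1                   # pool full: full stack is recorded
    total = successes + drops
    rate = successes / total if total else 0.0
    return successes, drops, rate

# With room for two unique stacks, "C" is dropped; every "A" still succeeds.
print(tally(["A", "B", "A", "C", "A"], capacity=2))
```

Note that the first "A" and the first "B" count as successes even though they are inserts rather than hits, which is the reason for the rename from "hits".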
Reset semantics:
- Reset is a control-path operation only allowed when tracing is
stopped on the owning trace_array. Online reset (with tracing
active) is intentionally not supported to keep the proof
obligations small.
- Reset uses atomic_cmpxchg() to claim the resetting flag, then
verifies tracer_tracing_is_on() returns false. The resetting
flag itself blocks subsequent get_id() callers; userspace
re-enabling tracing after our check still cannot let new
insertions through.
- synchronize_rcu() drains in-flight get_id() callers from the
ftrace callback path, which runs preempt-disabled.
- Reset clears the resetting flag with atomic_set_release() so a
subsequent get_id() observes a fully cleared map.
- Concurrent reset returns -EBUSY; reset while tracing is active
returns -EBUSY.
Concurrency notes:
- entry->val publication uses smp_store_release() paired with
smp_load_acquire() in all dereferencing readers (lookup, seq_show,
bin_open). seq_start/seq_next only check val for NULL and use
READ_ONCE().
- elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before
use in seq_show and bin_open.
- Pool exhaustion: stackmap_get_elt() short-circuits via
atomic_read() before the contended atomic RMW, avoiding cacheline
contention once the pool is full. Slots that win cmpxchg but
cannot get an elt are left 'claimed but empty'; subsequent
lookups treat val==NULL as a miss and probe past them. The
bounded probe length keeps per-event cost O(1).
Hash key:
- Per-instance random seed stored in the stackmap struct (no
global state), seeded at create time.
- 32-bit jhash is forced to 1 if it lands on 0 (which is the
free-slot sentinel). Full memcmp confirms matches.
Memory:
- Single flat vmalloc for the element pool (no per-elt kzalloc).
- bits parameter clamped to [10, 18]: at the maximum bits=18, the
element pool is ~130 MB and a stack_map_bin snapshot may briefly
allocate another ~130 MB.
- struct stackmap_bin_snapshot uses u64 (not size_t) for its size
field so data[] is 8-byte aligned on both 32-bit and 64-bit
architectures, avoiding alignment faults when writing u64 IPs
on strict-alignment architectures.
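The pool sizes quoted above can be reproduced with simple arithmetic. The sketch assumes a 64-bit kernel, a maximum depth of 64 frames (as implied by the 64*8 term in the code comment), and a 16-byte hash table entry (u32 key, padding, pointer); none of these constants are read from the kernel here.

```python
ELT_BYTES = 4 + 4 + 64 * 8   # u32 nr + atomic_t ref_count + 64 IPs
ENTRY_BYTES = 16             # assumed: u32 key + padding + 8-byte pointer

def pool_bytes(bits):
    """Element pool size for 2^bits unique stacks."""
    return (1 << bits) * ELT_BYTES

def table_bytes(bits):
    """Hash table size; the table has 2x as many slots as elements."""
    return (1 << (bits + 1)) * ENTRY_BYTES

for bits in (10, 14, 18):
    print(bits,
          f"pool {pool_bytes(bits) / 2**20:.1f} MiB",
          f"table {table_bytes(bits) / 2**20:.2f} MiB")
```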
Kernel command line parameter:
- ftrace_stackmap.bits=N: set map capacity (2^N unique stacks,
range 10-18, default 14)
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/Kconfig | 21 ++
kernel/trace/Makefile | 1 +
kernel/trace/trace_stackmap.c | 643 ++++++++++++++++++++++++++++++++++
kernel/trace/trace_stackmap.h | 56 +++
4 files changed, 721 insertions(+)
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..2a63fd2c9a96 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,27 @@ config STACK_TRACER
Say N if unsure.
+config FTRACE_STACKMAP
+ bool "Ftrace stack map deduplication"
+ depends on TRACING
+ depends on STACKTRACE
+ select KALLSYMS
+ help
+ This enables a global stack trace hash table for ftrace, inspired
+ by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+ only a stack_id in the ring buffer instead of the full stack trace,
+ significantly reducing trace buffer usage when the same call stacks
+ appear repeatedly.
+
+ The deduplicated stacks are exported via:
+ /sys/kernel/debug/tracing/stack_map
+
+ Writing to this file resets the stack map. Reading shows all unique
+ stacks with their stack_id and reference count.
+
+ Say Y if you want to reduce ftrace buffer usage for stack traces.
+ Say N if unsure.
+
config TRACE_PREEMPT_TOGGLE
bool
help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 1decdce8cbef..f1b6175099cc 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..b23a60e9286c
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,643 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ * - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
+ * - Pre-allocated element pool (zero allocation on hot path)
+ * - Linear probing with 2x over-provisioned table; probe length
+ * bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup
+ * cost constant even when the table is heavily loaded
+ * - Single global instance (initialized for the global trace array)
+ *
+ * Reset is a control-path operation, only allowed when tracing is
+ * stopped on the owning trace_array. The protocol is:
+ *
+ * - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights
+ * and blocks new get_id() callers (they observe resetting=1 and
+ * return -EINVAL).
+ * - tracer_tracing_is_on() is checked AFTER the cmpxchg, so the
+ * resetting flag itself prevents new insertions even if userspace
+ * re-enables tracing immediately after the check.
+ * - synchronize_rcu() drains in-flight get_id() callers from the
+ * ftrace callback path, which runs with preemption disabled.
+ *
+ * Online reset (with tracing active) is intentionally not supported
+ * to keep the design simple and the proof obligations small.
+ *
+ * The 32-bit jhash of the stack IPs is the hash table key. On hash
+ * collision, linear probing finds the next slot and full memcmp
+ * confirms the match.
+ *
+ * Concurrent userspace readers (cat stack_map / stack_map_bin) get
+ * a best-effort snapshot. They are coherent with the hot path
+ * (smp_load_acquire on entry->val) but not with a concurrent reset.
+ * Reset requires tracing to be stopped, so it never races with the
+ * hot path; a reset in the middle of a read can produce truncated
+ * or partial output but never crashes.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/jhash.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/log2.h>
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Bound the linear-probe scan length. With a 2x over-provisioned table,
+ * a well-distributed hash gives very short probe chains. Capping at 64
+ * keeps worst-case lookup O(1) even when the table is heavily loaded
+ * with claimed-but-empty slots from pool exhaustion.
+ */
+#define FTRACE_STACKMAP_MAX_PROBE 64
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+ u32 nr; /* actual number of IPs */
+ atomic_t ref_count;
+ unsigned long ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+ u32 key; /* 0 = free, non-zero = jhash */
+ struct stackmap_elt *val; /* NULL until fully published */
+};
+
+struct ftrace_stackmap {
+ struct trace_array *tr; /* owning trace_array */
+ unsigned int map_bits;
+ unsigned int map_size; /* 1 << (map_bits + 1) */
+ unsigned int max_elts; /* 1 << map_bits */
+ u32 hash_seed; /* per-instance jhash seed */
+ atomic_t next_elt; /* index into elts pool */
+ struct stackmap_entry *entries; /* hash table */
+ struct stackmap_elt *elts; /* flat element pool */
+ atomic_t resetting;
+ atomic64_t successes; /* events served (hits + new inserts) */
+ atomic64_t drops;
+};
+
+/*
+ * Cap the bits parameter to keep worst-case allocations bounded:
+ * bits=18 → 256K elts, 512K slots, ~130 MB elt pool, ~130 MB bin
+ * export.
+ * Smaller workloads should use the default (14) which gives 16K elts
+ * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel
+ * parameter for higher unique-stack capacity.
+ */
+#define FTRACE_STACKMAP_BITS_MIN 10
+#define FTRACE_STACKMAP_BITS_MAX 18
+#define FTRACE_STACKMAP_BITS_DEFAULT 14
+
+static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT;
+static int __init stackmap_bits_setup(char *str)
+{
+ unsigned long val;
+
+ if (kstrtoul(str, 0, &val))
+ return -EINVAL;
+ val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX);
+ stackmap_map_bits = val;
+ return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+ int idx;
+
+ /*
+ * Fast-path early-out once the pool is fully consumed. Avoids
+ * the contended atomic RMW on next_elt for every traced event
+ * after the pool is exhausted.
+ */
+ if (atomic_read(&smap->next_elt) >= smap->max_elts)
+ return NULL;
+
+ idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+ if (idx < smap->max_elts)
+ return &smap->elts[idx];
+ return NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr)
+{
+ struct ftrace_stackmap *smap;
+ unsigned int bits;
+
+ smap = kzalloc(sizeof(*smap), GFP_KERNEL);
+ if (!smap)
+ return ERR_PTR(-ENOMEM);
+
+ /* Defensive clamp: reject bogus bits even if early_param is bypassed. */
+ bits = clamp_val(stackmap_map_bits,
+ FTRACE_STACKMAP_BITS_MIN,
+ FTRACE_STACKMAP_BITS_MAX);
+
+ smap->tr = tr;
+ smap->map_bits = bits;
+ smap->max_elts = 1U << bits;
+ smap->map_size = 1U << (bits + 1); /* 2x over-provision */
+ BUG_ON(!is_power_of_2(smap->map_size));
+
+ smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+ if (!smap->entries) {
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /*
+ * Single large vmalloc of the element pool, indexed flat.
+ * At bits=16 this is 64K * sizeof(struct stackmap_elt). The
+ * struct is 520 B (4 + 4 + 64*8) on 64-bit, so total ~33 MB.
+ */
+ smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts);
+ if (!smap->elts) {
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ smap->hash_seed = get_random_u32();
+ atomic_set(&smap->next_elt, 0);
+ atomic_set(&smap->resetting, 0);
+ atomic64_set(&smap->successes, 0);
+ atomic64_set(&smap->drops, 0);
+
+ return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+ if (!smap || IS_ERR(smap))
+ return;
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+}
+
+/**
+ * ftrace_stackmap_reset - clear all entries in the stackmap
+ * @smap: the stackmap to reset
+ *
+ * Returns 0 on success, -EBUSY if another reset is already in
+ * progress, or if tracing is currently active on the owning
+ * trace_array.
+ *
+ * Online reset (with tracing active) is not supported. Caller must
+ * stop tracing first (echo 0 > tracing_on).
+ *
+ * Caller must be in process context (typically the tracefs write handler).
+ *
+ * Protocol:
+ * 1. Atomically claim reset rights via cmpxchg on @resetting.
+ * 2. Verify tracing is stopped on @smap->tr; if not, release the
+ * claim and return -EBUSY. The resetting flag itself blocks
+ * any subsequent get_id() callers.
+ * 3. synchronize_rcu() drains in-flight get_id() callers from the
+ * ftrace callback path (which runs preempt-disabled).
+ * 4. memset entries, elts, and counters.
+ * 5. Release the resetting flag with release semantics so any new
+ * get_id() observes a fully cleared map.
+ */
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+ if (!smap)
+ return 0;
+
+ if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0)
+ return -EBUSY;
+
+ if (smap->tr && tracer_tracing_is_on(smap->tr)) {
+ atomic_set(&smap->resetting, 0);
+ return -EBUSY;
+ }
+
+ /*
+ * synchronize_rcu() itself is a full barrier; no extra smp_mb()
+ * is needed before it. It drains in-flight ftrace callbacks that
+ * may have already passed the resetting check with the old value.
+ */
+ synchronize_rcu();
+
+ memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+ memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts);
+
+ atomic_set(&smap->next_elt, 0);
+ atomic64_set(&smap->successes, 0);
+ atomic64_set(&smap->drops, 0);
+
+ /* Release resetting=0 so new get_id() observes a cleared map. */
+ atomic_set_release(&smap->resetting, 0);
+ return 0;
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries)
+{
+ u32 key_hash, idx, test_key, trace_len;
+ struct stackmap_entry *entry;
+ struct stackmap_elt *val;
+ int probes = 0;
+
+ if (!smap || !nr_entries || atomic_read(&smap->resetting))
+ return -EINVAL;
+ if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+ nr_entries = FTRACE_STACKMAP_MAX_DEPTH;
+
+ trace_len = nr_entries * sizeof(unsigned long);
+ /*
+ * jhash2() requires the length in u32 units and the data to be
+ * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+ * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+ * directly; the cast to u32* is safe because ips[] is naturally
+ * aligned to sizeof(unsigned long) >= 4.
+ */
+ key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+ smap->hash_seed);
+ if (key_hash == 0)
+ key_hash = 1; /* 0 means free slot */
+
+ idx = key_hash >> (32 - (smap->map_bits + 1));
+
+ while (probes < FTRACE_STACKMAP_MAX_PROBE) {
+ idx &= (smap->map_size - 1);
+ entry = &smap->entries[idx];
+ test_key = READ_ONCE(entry->key);
+
+ if (test_key == key_hash) {
+ /*
+ * smp_load_acquire pairs with smp_store_release in
+ * the publisher below; ensures we see fully-formed
+ * elt fields (nr, ips, ref_count) before dereference.
+ */
+ val = smp_load_acquire(&entry->val);
+ if (val && val->nr == nr_entries &&
+ memcmp(val->ips, ips, trace_len) == 0) {
+ atomic_inc(&val->ref_count);
+ atomic64_inc(&smap->successes);
+ return (int)idx;
+ }
+ /*
+ * val == NULL: another CPU is mid-insert, or this
+ * slot is "claimed but empty" (pool exhausted).
+ * val != NULL but mismatch: 32-bit hash collision
+ * with a different stack. In both cases, advance.
+ */
+ } else if (!test_key) {
+ /* Free slot: try to claim it */
+ if (cmpxchg(&entry->key, 0, key_hash) == 0) {
+ struct stackmap_elt *elt;
+
+ elt = stackmap_get_elt(smap);
+ if (!elt) {
+ /*
+ * Pool exhausted. We claimed this
+ * slot with cmpxchg but cannot fill
+ * it. Leave key set so the slot
+ * stays "claimed but empty" — future
+ * lookups treat val==NULL as a miss
+ * and probe past it. Cannot revert
+ * key=0 without racing other CPUs.
+ */
+ atomic64_inc(&smap->drops);
+ return -ENOSPC;
+ }
+
+ elt->nr = nr_entries;
+ atomic_set(&elt->ref_count, 1);
+ memcpy(elt->ips, ips, trace_len);
+
+ /*
+ * Publish elt with release semantics so the
+ * reader's smp_load_acquire can safely
+ * dereference val->nr / val->ips.
+ */
+ smp_store_release(&entry->val, elt);
+ atomic64_inc(&smap->successes);
+ return (int)idx;
+ }
+ /* cmpxchg failed; another CPU claimed this slot. */
+ }
+
+ idx++;
+ probes++;
+ }
+
+ atomic64_inc(&smap->drops);
+ return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+ struct ftrace_stackmap *smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ for (i = *pos; i < smap->map_size; i++) {
+ if (smap->entries[i].key && READ_ONCE(smap->entries[i].val)) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ for (i = *pos + 1; i < smap->map_size; i++) {
+ if (smap->entries[i].key && READ_ONCE(smap->entries[i].val))
+ break;
+ }
+ /* seq_file requires ->next() to advance *pos even at the end. */
+ *pos = i;
+ return i < smap->map_size ? &smap->entries[i] : NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v) { }
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+ struct stackmap_entry *entry = v;
+ struct stackmap_elt *elt = smp_load_acquire(&entry->val);
+ struct stackmap_seq_private *priv = m->private;
+ u32 idx = entry - priv->smap->entries;
+ u32 i, nr;
+
+ if (!elt)
+ return 0;
+
+ nr = READ_ONCE(elt->nr);
+ if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+ nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+ seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+ idx, atomic_read(&elt->ref_count), nr);
+ for (i = 0; i < nr; i++)
+ seq_printf(m, " [%u] %pS\n", i, (void *)elt->ips[i]);
+ seq_putc(m, '\n');
+ return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+ .start = stackmap_seq_start,
+ .next = stackmap_seq_next,
+ .stop = stackmap_seq_stop,
+ .show = stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+ struct stackmap_seq_private *priv;
+ struct seq_file *m;
+ int ret;
+
+ ret = seq_open_private(file, &stackmap_seq_ops,
+ sizeof(struct stackmap_seq_private));
+ if (ret)
+ return ret;
+ m = file->private_data;
+ priv = m->private;
+ priv->smap = inode->i_private;
+ return 0;
+}
+
+/*
+ * Accept exactly "0" or "reset" (optionally followed by a single newline).
+ */
+static bool stackmap_write_is_reset(const char *buf, size_t n)
+{
+ if (n > 0 && buf[n - 1] == '\n')
+ n--;
+ return (n == 1 && buf[0] == '0') ||
+ (n == 5 && memcmp(buf, "reset", 5) == 0);
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct seq_file *m = file->private_data;
+ struct stackmap_seq_private *priv = m->private;
+ char buf[8];
+ size_t n = min(count, sizeof(buf) - 1);
+ int ret;
+
+ if (n == 0)
+ return -EINVAL;
+ if (copy_from_user(buf, ubuf, n))
+ return -EFAULT;
+ buf[n] = '\0';
+
+ if (!stackmap_write_is_reset(buf, n))
+ return -EINVAL;
+
+ /*
+ * ftrace_stackmap_reset() atomically claims reset rights via
+ * cmpxchg and returns -EBUSY if another reset is in progress
+ * or if tracing is active.
+ */
+ ret = ftrace_stackmap_reset(priv->smap);
+ if (ret)
+ return ret;
+ return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+ .open = stackmap_open,
+ .read = seq_read,
+ .write = stackmap_write,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+ struct ftrace_stackmap *smap = m->private;
+ u32 entries;
+ u64 successes, drops;
+
+ if (!smap) {
+ seq_puts(m, "stackmap not initialized\n");
+ return 0;
+ }
+
+ entries = atomic_read(&smap->next_elt);
+ successes = atomic64_read(&smap->successes);
+ drops = atomic64_read(&smap->drops);
+
+ seq_printf(m, "entries: %u / %u\n", entries, smap->max_elts);
+ seq_printf(m, "table_size: %u\n", smap->map_size);
+ seq_printf(m, "successes: %llu\n", successes);
+ seq_printf(m, "drops: %llu\n", drops);
+ if (successes + drops > 0)
+ seq_printf(m, "success_rate: %llu%%\n",
+ div64_u64(successes * 100, successes + drops));
+ return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+ .open = stackmap_stat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+ /*
+ * Use u64 (not size_t) so data[] is 8-byte aligned on both
+ * 32-bit and 64-bit architectures. The IP array within data[]
+ * is accessed as u64*, which would alignment-fault on strict
+ * architectures (e.g. older ARM, SPARC) if data[] started at
+ * a 4-byte boundary.
+ */
+ u64 size;
+ char data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+ struct ftrace_stackmap *smap = inode->i_private;
+ struct stackmap_bin_snapshot *snap;
+ struct ftrace_stackmap_bin_header *hdr;
+ size_t alloc_size, off;
+ u32 nr_entries, i, nr_stacks;
+
+ if (!smap)
+ return -ENODEV;
+
+ /*
+ * Worst-case allocation size: every populated entry uses a
+ * full-depth stack. The (+1) gives one slack slot in case a
+ * concurrent insert lands between this snapshot and iteration.
+ * The loop below performs an explicit bounds check anyway.
+ *
+ * At bits=16 this is up to ~33 MB, doubling per extra bit. The file is
+ * mode 0440 (TRACE_MODE_READ), so only privileged users can open it.
+ */
+ nr_entries = atomic_read(&smap->next_elt);
+ alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+ (sizeof(struct ftrace_stackmap_bin_entry) +
+ FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+
+ snap = vmalloc(sizeof(*snap) + alloc_size);
+ if (!snap)
+ return -ENOMEM;
+
+ hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+ hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+ hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+ hdr->reserved = 0;
+ off = sizeof(*hdr);
+ nr_stacks = 0;
+
+ for (i = 0; i < smap->map_size; i++) {
+ struct stackmap_entry *entry = &smap->entries[i];
+ struct stackmap_elt *elt;
+ struct ftrace_stackmap_bin_entry *e;
+ u64 *ips_out;
+ u32 k, nr;
+
+ if (!entry->key)
+ continue;
+ elt = smp_load_acquire(&entry->val);
+ if (!elt)
+ continue;
+
+ nr = READ_ONCE(elt->nr);
+ if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+ nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+ /* Bounds check: stop if we would overflow the allocation. */
+ if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size)
+ break;
+
+ e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+ e->stack_id = i;
+ e->nr = nr;
+ e->ref_count = atomic_read(&elt->ref_count);
+ e->reserved = 0;
+ off += sizeof(*e);
+
+ ips_out = (u64 *)(snap->data + off);
+ for (k = 0; k < nr; k++)
+ ips_out[k] = (u64)elt->ips[k];
+ off += nr * sizeof(u64);
+ nr_stacks++;
+ }
+
+ hdr->nr_stacks = nr_stacks;
+ snap->size = off;
+ file->private_data = snap;
+ return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct stackmap_bin_snapshot *snap = file->private_data;
+
+ if (!snap)
+ return -EINVAL;
+ return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+ vfree(file->private_data);
+ return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+ .open = stackmap_bin_open,
+ .read = stackmap_bin_read,
+ .llseek = default_llseek,
+ .release = stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..da51ed919e2c
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define FTRACE_STACKMAP_MAX_DEPTH 64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC 0x464D5342 /* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION 2
+
+struct ftrace_stackmap_bin_header {
+ u32 magic;
+ u32 version;
+ u32 nr_stacks;
+ u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+ u32 stack_id;
+ u32 nr;
+ u32 ref_count;
+ u32 reserved;
+ /* followed by u64 ips[nr] */
+};
+
+struct trace_array;
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries);
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+ unsigned long *ips, unsigned int n)
+{ return -ENOSYS; }
+static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
--
2.34.1
* [PATCH v2 2/3] trace: integrate stackmap into ftrace stack recording path
From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw)
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28,
lkp
From: Pengfei Li <lipengfei28@xiaomi.com>
Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.

Changes:
- New TRACE_STACK_ID in trace_type enum
- New stack_id_entry in trace_entries.h
- New TRACE_ITER(STACKMAP) trace option flag; when CONFIG_FTRACE_STACKMAP
  is disabled, TRACE_ITER_STACKMAP_BIT is defined as -1 so that
  TRACE_ITER(STACKMAP) evaluates to 0 (following the existing pattern
  used by TRACE_ITER_PROF_TEXT_OFFSET)
- Modified __ftrace_trace_stack() to call ftrace_stackmap_get_id()
  when the stackmap option is active
- Stackmap pointer read with smp_load_acquire(), published with
  smp_store_release() to ensure proper initialization ordering
- NULL check on tr->stackmap prevents dereference if creation failed
  or if used on a secondary trace instance (graceful fallback)
- ftrace_stackmap_create() takes the owning trace_array so the
  stackmap can later check tracing state during reset
- Added stack_id print handler in trace_output.c

Fallback behavior: if stackmap returns an error (pool exhausted,
resetting, or NULL pointer), the full stack trace is recorded as
before, so no new failure modes are introduced.

Note: stackmap is currently initialized only for the global trace
instance. Secondary instances fall back to full stack recording.
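
For readers who want to reason about the fallback rule outside the kernel, here is a minimal userspace sketch in Python. It is only a toy model of the decision made in __ftrace_trace_stack(): StackMapModel and record_stack() are hypothetical names invented for this illustration, ids are handed out sequentially (the patch returns hash-table slot indexes instead), and there is no hashing, probing or concurrency.

```python
class StackMapModel:
    """Toy stand-in for the kernel stackmap: a bounded pool, no probing."""

    def __init__(self, max_elts):
        self.max_elts = max_elts
        self.ids = {}           # stack (as tuple) -> stack_id
        self.resetting = False  # models the 'resetting' flag

    def get_id(self, frames):
        """Return a stack_id, or None where the kernel returns an error."""
        if self.resetting or not frames:
            return None
        key = tuple(frames)
        if key in self.ids:
            return self.ids[key]          # duplicate stack: reuse the id
        if len(self.ids) >= self.max_elts:
            return None                   # pool exhausted (-ENOSPC in the patch)
        self.ids[key] = len(self.ids)     # RHS is evaluated before the insert
        return self.ids[key]


def record_stack(smap, frames):
    """Emit a stack_id record if possible, else fall back to the full stack."""
    if smap is not None:
        sid = smap.get_id(frames)
        if sid is not None:
            return ("stack_id", sid)
    return ("stack", list(frames))
```

In this model a missing map, a map that is resetting and an exhausted pool all produce the same ("stack", frames) record that would be written without the option, which is the property the paragraph above describes.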

Usage:
  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/trace.c | 66 ++++++++++++++++++++++++++++++++++++
kernel/trace/trace.h | 16 +++++++++
kernel/trace/trace_entries.h | 15 ++++++++
kernel/trace/trace_output.c | 23 +++++++++++++
4 files changed, 120 insertions(+)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..49a675dffad5 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
#include "trace.h"
#include "trace_output.h"
+#include "trace_stackmap.h"
#ifdef CONFIG_FTRACE_STARTUP_TEST
/*
@@ -2184,6 +2185,43 @@ void __ftrace_trace_stack(struct trace_array *tr,
}
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+ /*
+ * If stackmap dedup is enabled, try to store only the stack_id
+ * in the ring buffer instead of the full stack trace.
+ */
+ if (tr->trace_flags & TRACE_ITER(STACKMAP)) {
+ struct ftrace_stackmap *smap;
+ struct stack_id_entry *sid_entry;
+ int sid;
+
+ smap = smp_load_acquire(&tr->stackmap);
+ if (!smap)
+ goto full_stack;
+
+ sid = ftrace_stackmap_get_id(smap, fstack->calls, nr_entries);
+ if (sid >= 0) {
+ event = __trace_buffer_lock_reserve(buffer,
+ TRACE_STACK_ID,
+ sizeof(*sid_entry), trace_ctx);
+ if (!event)
+ goto out;
+ sid_entry = ring_buffer_event_data(event);
+ sid_entry->stack_id = sid;
+ /*
+ * stack_id is a synthetic side-event attached to a
+ * primary trace event that was already subject to
+ * filtering. No per-event filter is defined for
+ * TRACE_STACK_ID, so commit unconditionally.
+ */
+ __buffer_unlock_commit(buffer, event);
+ goto out;
+ }
+ /* Fall through to full stack on stackmap failure */
+ }
+full_stack:
+#endif
+
event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
struct_size(entry, caller, nr_entries),
trace_ctx);
@@ -9222,6 +9260,34 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
NULL, &tracing_dyn_info_fops);
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+ {
+ struct ftrace_stackmap *smap;
+
+ smap = ftrace_stackmap_create(&global_trace);
+ if (!IS_ERR(smap)) {
+ /*
+ * Use smp_store_release to ensure the stackmap
+ * structure is fully initialized before publishing
+ * the pointer to concurrent trace event readers.
+ */
+ smp_store_release(&global_trace.stackmap, smap);
+ trace_create_file("stack_map", TRACE_MODE_WRITE, NULL,
+ smap, &ftrace_stackmap_fops);
+ trace_create_file("stack_map_stat", TRACE_MODE_READ, NULL,
+ smap, &ftrace_stackmap_stat_fops);
+ trace_create_file("stack_map_bin", TRACE_MODE_READ, NULL,
+ smap, &ftrace_stackmap_bin_fops);
+ } else {
+ pr_warn("ftrace stackmap init failed, dedup disabled\n");
+ /*
+ * global_trace is statically allocated, so its stackmap
+ * pointer is already NULL; leaving it NULL ensures the
+ * load-acquire in __ftrace_trace_stack falls back to full stack.
+ */
+ }
+ }
+#endif
create_trace_instances(NULL);
update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..7e7d5e5a35ff 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
TRACE_TIMERLAT,
TRACE_RAW_DATA,
TRACE_FUNC_REPEATS,
+ TRACE_STACK_ID,
__TRACE_LAST_TYPE,
};
@@ -453,6 +454,9 @@ struct trace_array {
struct cond_snapshot *cond_snapshot;
#endif
struct trace_func_repeats __percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+ struct ftrace_stackmap *stackmap;
+#endif
/*
* On boot up, the ring buffer is set to the minimum size, so that
* we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
TRACE_GRAPH_RET); \
IF_ASSIGN(var, ent, struct func_repeats_entry, \
TRACE_FUNC_REPEATS); \
+ IF_ASSIGN(var, ent, struct stack_id_entry, \
+ TRACE_STACK_ID); \
__ftrace_bad_type(); \
} while (0)
@@ -1449,7 +1455,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
# define STACK_FLAGS
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS \
+ C(STACKMAP, "stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP_BIT -1
+#endif
+
#ifdef CONFIG_FUNCTION_PROFILER
+
# define PROFILER_FLAGS \
C(PROF_TEXT_OFFSET, "prof-text-offset"),
# ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1521,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
FUNCTION_FLAGS \
FGRAPH_FLAGS \
STACK_FLAGS \
+ STACKMAP_FLAGS \
BRANCH_FLAGS \
PROFILER_FLAGS \
FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
(void *)__entry->caller[6], (void *)__entry->caller[7])
);
+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+ TRACE_STACK_ID,
+
+ F_STRUCT(
+ __field( int, stack_id )
+ ),
+
+ F_printk("<stack_id %d>", __entry->stack_id)
+);
+
/*
* trace_printk entry:
*/
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
.funcs = &trace_user_stack_funcs,
};
+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+ int flags, struct trace_event *event)
+{
+ struct stack_id_entry *field;
+ struct trace_seq *s = &iter->seq;
+
+ trace_assign_type(field, iter->ent);
+ trace_seq_printf(s, "<stack_id %d>\n", field->stack_id);
+
+ return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+ .trace = trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+ .type = TRACE_STACK_ID,
+ .funcs = &trace_stack_id_funcs,
+};
+
/* TRACE_HWLAT */
static enum print_line_t
trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
&trace_wake_event,
&trace_stack_event,
&trace_user_stack_event,
+ &trace_stack_id_event,
&trace_bputs_event,
&trace_bprint_event,
&trace_print_event,
--
2.34.1
* [PATCH v2 3/3] trace: add documentation, selftest and tooling for stackmap
From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw)
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28,
lkp
From: Pengfei Li <lipengfei28@xiaomi.com>
Add supporting files for the ftrace stackmap feature:

Documentation/trace/ftrace-stackmap.rst:
  Documentation covering design, usage, tracefs interface, binary
  format, and performance characteristics. Added to the 'Core Tracing
  Frameworks' toctree in Documentation/trace/index.rst. Documents:
  - Reset requires tracing to be stopped first
  - Boot-time activation via trace_options=stackmap
  - bits parameter range [10, 18] and worst-case memory usage
  - tracefs file modes (0640 / 0440)
  - Best-effort snapshot semantics for stack_map_bin
  - Counter naming: successes (events served), drops, success_rate

tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
  Functional selftest verifying:
  - stackmap tracefs nodes exist
  - enabling stackmap + stacktrace produces stack_id events
  - stack_map_stat shows non-zero successes and zero drops
  - reset clears entries when tracing is stopped
  - reset is rejected (-EBUSY) while tracing is active
  Uses an EXIT trap to restore options/stackmap and options/stacktrace
  on any exit path.

tools/tracing/stackmap_dump.py:
  Python script to parse the binary stack_map_bin export.
  Features:
  - Automatic endianness detection via magic number
  - Batched addr2line via stdin (avoids ARG_MAX with large stacks)
  - JSON output mode
  - Top-N filtering by ref_count

Binary format: all fields are native-endian. The parser detects
byte order by reading the magic value (0x464D5342 = 'FSMB').
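
As a rough illustration of the format described above, the following Python sketch parses a snapshot using only the layout given in this series (16-byte header, 16-byte per-stack entry, then nr 64-bit addresses). It is not the code of stackmap_dump.py; detect_byte_order() and parse_snapshot() are hypothetical helper names chosen for this example.

```python
import struct

MAGIC = 0x464D5342   # 'FSMB', FTRACE_STACKMAP_BIN_MAGIC in trace_stackmap.h
HEADER_FMT = "IIII"  # magic, version, nr_stacks, reserved
ENTRY_FMT = "IIII"   # stack_id, nr, ref_count, reserved


def detect_byte_order(data):
    """Return '<' or '>' depending on which byte order yields MAGIC."""
    for order in ("<", ">"):
        (magic,) = struct.unpack_from(order + "I", data, 0)
        if magic == MAGIC:
            return order
    raise ValueError("bad magic: not a stack_map_bin snapshot")


def parse_snapshot(data):
    """Parse a whole snapshot into (version, list of stack dicts)."""
    order = detect_byte_order(data)
    _magic, version, nr_stacks, _rsvd = struct.unpack_from(
        order + HEADER_FMT, data, 0)
    off = struct.calcsize(order + HEADER_FMT)
    stacks = []
    for _ in range(nr_stacks):
        stack_id, nr, ref_count, _rsvd = struct.unpack_from(
            order + ENTRY_FMT, data, off)
        off += struct.calcsize(order + ENTRY_FMT)
        # Each entry is followed by nr instruction pointers as u64.
        ips = struct.unpack_from(f"{order}{nr}Q", data, off)
        off += 8 * nr
        stacks.append({"stack_id": stack_id,
                       "ref_count": ref_count,
                       "ips": list(ips)})
    return version, stacks
```

Trying both byte orders works because the magic is not a byte-wise palindrome, so a snapshot taken on a big-endian target can be decoded on a little-endian host without extra metadata. struct.unpack_from() raises struct.error on a truncated buffer, which a real tool would probably want to report more gracefully.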
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605160010.fakzGVVq-lkp@intel.com/
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
Documentation/trace/ftrace-stackmap.rst | 145 +++++++++++++++++
Documentation/trace/index.rst | 1 +
.../ftrace/test.d/ftrace/stackmap-basic.tc | 100 ++++++++++++
tools/tracing/stackmap_dump.py | 150 ++++++++++++++++++
4 files changed, 396 insertions(+)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100755 tools/tracing/stackmap_dump.py
diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..1230d44d1d23
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,145 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li <lipengfei28@xiaomi.com>
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks
+ (default: 14 → 16384 stacks; valid range: 10-18).
+
+ At ``bits=18`` the kernel reserves roughly 130 MB of vmalloc memory
+ for the element pool. Each ``open()`` of ``stack_map_bin`` may
+ briefly allocate a similar amount for a snapshot. The cap is set
+ intentionally to bound memory usage.
+
+Usage
+=====
+
+Enable stack deduplication::
+
+ echo 1 > /sys/kernel/debug/tracing/options/stackmap
+ echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+ echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ``<stack_id N>`` instead of full stack traces::
+
+ sh-1234 [006] d.h.. 123.456789: <stack_id 42>
+
+To view the actual stacks::
+
+ cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+ stack_id 42 [ref 1337, depth 8]
+ [0] schedule+0x48/0xc0
+ [1] schedule_timeout+0x1c/0x30
+ ...
+
+To view statistics::
+
+ cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+ entries: 2500 / 16384
+ table_size: 32768
+ successes: 148923
+ drops: 0
+ success_rate: 100%
+
+To reset the stack map (tracing must be stopped first)::
+
+ echo 0 > /sys/kernel/debug/tracing/tracing_on
+ echo 0 > /sys/kernel/debug/tracing/stack_map
+
+The write fails with ``EBUSY`` if tracing is currently active, or if
+another reset is already in progress.
+
+Boot-time activation
+====================
+
+The stackmap option can be enabled from the kernel command line::
+
+ trace_options=stackmap,stacktrace
+
+Trace events that fire before the tracefs filesystem is initialized
+(``fs_initcall`` time) fall back to recording full stack traces; once
+``ftrace_stackmap_create()`` runs, subsequent events are deduplicated.
+The crossover is automatic and lossless — no events are dropped, but
+early-boot stacks recorded before the crossover are not deduplicated.
+
+Tracefs Nodes
+=============
+
+The stack_map files are owned by root and not world-readable
+(``stack_map``: 0640; ``stack_map_stat`` and ``stack_map_bin``: 0440).
+
+``stack_map``
+ Text export of all deduplicated stacks with symbol resolution.
+ Writing ``0`` or ``reset`` clears all entries (only when tracing
+ is stopped).
+
+``stack_map_stat``
+ Statistics: entry count, table size, successes, drops, and success rate.
+
+``stack_map_bin``
+ Binary export for efficient userspace consumption. Format:
+
+ - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+ - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+ All fields are written in the kernel's native byte order.
+ Userspace tools detect endianness by reading the magic value.
+ Magic: ``0x464D5342`` ('FSMB'), Version: 2.
+
+ The export is a best-effort snapshot built at ``open()``; stacks
+ inserted while the snapshot is being built may be omitted. A
+ bounds check prevents the copy from overflowing the allocation.
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr. Cliff Click's non-blocking hash table
+algorithm:
+
+- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context
+- **Memory**: Pre-allocated element pool, zero allocation on the hot path
+ (no GFP_ATOMIC failures under memory pressure)
+- **Collision**: Linear probing with a 2x over-provisioned table; probe
+ length is bounded so worst-case insert/lookup is O(1)
+- **Scope**: Currently supports the global trace instance
+- **Hash**: 32-bit jhash with a per-instance random seed; full ``memcmp``
+ confirms matches
+
+Performance
+===========
+
+Typical results on ARM64 Android device (function tracer, 2 seconds):
+
+- Unique stacks: ~3000
+- Hit rate: 84-98% (depends on workload diversity)
+- Ring buffer savings: ~80% for stack data
+- Overhead per event: ~50ns (one jhash + hash table lookup)
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5d9bf4694d5d..ac8b1141c23a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -33,6 +33,7 @@ the Linux kernel.
ftrace
ftrace-design
ftrace-uses
+ ftrace-stackmap
kprobes
kprobetrace
fprobetrace
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
new file mode 100755
index 000000000000..34e4e31ff7a1
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
@@ -0,0 +1,100 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap basic functionality
+# requires: stack_map options/stackmap
+
+# Test that ftrace stackmap deduplication works:
+# 1. Enable stackmap + stacktrace options
+# 2. Run function tracer briefly
+# 3. Verify stack_map has entries
+# 4. Verify stack_map_stat shows successes and zero drops
+# 5. Verify trace contains <stack_id> events
+# 6. Verify reset works when tracing is stopped
+# 7. Verify reset is rejected (-EBUSY) while tracing is active
+
+fail() {
+ echo "FAIL: $1"
+ exit_fail
+}
+
+# Restore state on any exit (success, fail, or interrupt) so a
+# half-finished test does not leave stacktrace/stackmap enabled.
+cleanup() {
+ disable_tracing 2>/dev/null
+ echo nop > current_tracer 2>/dev/null
+ echo 0 > options/stackmap 2>/dev/null
+ echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat has entries (empty output defaults to 0 for the test below)
+entries=$(awk '/^entries:/ {print $2}' stack_map_stat)
+: "${entries:=0}"
+if [ "$entries" -eq 0 ]; then
+ fail "stackmap has zero entries after tracing"
+fi
+
+# Check successes > 0
+successes=$(awk '/^successes:/ {print $2}' stack_map_stat)
+: "${successes:=0}"
+if [ "$successes" -eq 0 ]; then
+ fail "stackmap has zero successes"
+fi
+
+# Check drops == 0 (pool should be large enough for 1s trace)
+drops=$(awk '/^drops:/ {print $2}' stack_map_stat)
+: "${drops:=0}"
+if [ "$drops" -ne 0 ]; then
+ fail "stackmap had $drops drops (pool exhausted?)"
+fi
+
+# Check stack_map text output is parseable
+first_id=$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}')
+if [ -z "$first_id" ]; then
+ fail "stack_map output has no stack_id entries"
+fi
+
+# Check trace has stack_id events
+count=$(grep -c "stack_id" trace || true)
+if [ "$count" -eq 0 ]; then
+ fail "trace has no <stack_id> events"
+fi
+
+# Test reset (tracing must be stopped — disable_tracing was called above)
+echo 0 > stack_map
+entries_after=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries_after:=-1}"
+if [ "$entries_after" -ne 0 ]; then
+ fail "stackmap reset did not clear entries (got $entries_after)"
+fi
+
+# Test that reset is rejected while tracing is active
+enable_tracing
+if echo 0 > stack_map 2>/dev/null; then
+ disable_tracing
+ fail "stackmap reset should fail while tracing is active"
+fi
+disable_tracing
+
+echo "stackmap basic test passed: $entries unique stacks, $successes successes, $drops drops"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..fc5d0c9cf0af
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,150 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+ # Pull from device and parse
+ adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+ python3 stackmap_dump.py /tmp/stack_map.bin
+
+ # With vmlinux for offline symbol resolution
+ python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+ # JSON output for tooling
+ python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x464D5342 # 'FSMB'
+HEADER_SIZE = 16 # 4 x u32
+ENTRY_SIZE = 16 # 4 x u32
+
+
+def detect_endianness(data):
+ """Detect byte order from magic number in header."""
+ if len(data) < 4:
+ raise ValueError("File too small")
+ magic_le = struct.unpack_from('<I', data, 0)[0]
+ if magic_le == MAGIC:
+ return '<'
+ magic_be = struct.unpack_from('>I', data, 0)[0]
+ if magic_be == MAGIC:
+ return '>'
+ raise ValueError(f"Bad magic: 0x{magic_le:08x} (neither LE nor BE)")
+
+
+def batch_addr2line(vmlinux, addrs):
+ """Resolve multiple addresses in one addr2line invocation."""
+ if not addrs:
+ return {}
+ try:
+ # Feed addresses on stdin to avoid ARG_MAX limits with large
+ # numbers of addresses (one stack can have 30+ frames; a
+ # snapshot can have thousands of unique stacks).
+ stdin = '\n'.join(hex(a) for a in addrs) + '\n'
+ result = subprocess.run(
+ ['addr2line', '-f', '-e', vmlinux],
+ input=stdin, capture_output=True, text=True, timeout=60
+ )
+ lines = result.stdout.split('\n')
+ # addr2line outputs 2 lines per address: function name + source location
+ symbols = {}
+ for i, addr in enumerate(addrs):
+ idx = i * 2
+ if idx < len(lines) and lines[idx] and lines[idx] != '??':
+ symbols[addr] = lines[idx]
+ return symbols
+ except (subprocess.TimeoutExpired, FileNotFoundError) as e:
+ print(f"warning: addr2line failed: {e}", file=sys.stderr)
+ return {}
+
+
+def parse_stackmap_bin(data):
+ """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+ if len(data) < HEADER_SIZE:
+ raise ValueError("File too small for header")
+
+ endian = detect_endianness(data)
+ header_fmt = f'{endian}IIII'
+ entry_fmt = f'{endian}IIII'
+
+ magic, version, nr_stacks, _ = struct.unpack_from(header_fmt, data, 0)
+ if version not in (1, 2):
+ raise ValueError(f"Unsupported version: {version}")
+
+ offset = HEADER_SIZE
+ for _ in range(nr_stacks):
+ if offset + ENTRY_SIZE > len(data):
+ break
+ stack_id, nr, ref_count, _ = struct.unpack_from(entry_fmt, data, offset)
+ offset += ENTRY_SIZE
+
+ ips_size = nr * 8
+ if offset + ips_size > len(data):
+ break
+ ips = struct.unpack_from(f'{endian}{nr}Q', data, offset)
+ offset += ips_size
+
+ yield stack_id, ref_count, list(ips)
+
+
+def main():
+ parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+ parser.add_argument('file', help='Path to stack_map_bin file')
+ parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+ parser.add_argument('--json', action='store_true', help='JSON output')
+ parser.add_argument('--top', type=int, default=0,
+ help='Show only top N stacks by ref_count')
+ args = parser.parse_args()
+
+ with open(args.file, 'rb') as f:
+ data = f.read()
+
+ stacks = list(parse_stackmap_bin(data))
+
+ if args.top > 0:
+ stacks.sort(key=lambda x: x[1], reverse=True)
+ stacks = stacks[:args.top]
+
+ # Batch symbol resolution
+ symbols = {}
+ if args.vmlinux:
+ all_addrs = set()
+ for _, _, ips in stacks:
+ all_addrs.update(ips)
+ symbols = batch_addr2line(args.vmlinux, list(all_addrs))
+
+ if args.json:
+ output = []
+ for stack_id, ref_count, ips in stacks:
+ entry = {
+ 'stack_id': stack_id,
+ 'ref_count': ref_count,
+ 'ips': [f'0x{ip:x}' for ip in ips]
+ }
+ if args.vmlinux:
+ entry['symbols'] = [symbols.get(ip, f'0x{ip:x}')
+ for ip in ips]
+ output.append(entry)
+ print(json.dumps(output, indent=2))
+ else:
+ for stack_id, ref_count, ips in stacks:
+ print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+ for i, ip in enumerate(ips):
+ sym = symbols.get(ip, '')
+ if sym:
+ sym = f' {sym}'
+ print(f" [{i}] 0x{ip:x}{sym}")
+ print()
+
+ print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+ main()
--
2.34.1
* Re: [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer
2026-05-22 10:40 ` [RFC PATCH v2 " Li Pengfei
` (2 preceding siblings ...)
2026-05-22 10:40 ` [PATCH v2 3/3] trace: add documentation, selftest and tooling for stackmap Li Pengfei
@ 2026-05-25 6:58 ` Masami Hiramatsu
2026-05-25 7:39 ` Li Pengfei
3 siblings, 1 reply; 15+ messages in thread
From: Masami Hiramatsu @ 2026-05-25 6:58 UTC (permalink / raw)
To: Li Pengfei
Cc: linux-trace-kernel, rostedt, mhiramat, linux-kernel, cmllamas,
zhangbo56, lipengfei28, lkp
Hi Pengfei,
On Fri, 22 May 2026 18:40:14 +0800
Li Pengfei <ljdlns1987@gmail.com> wrote:
> From: Pengfei Li <lipengfei28@xiaomi.com>
>
> Hi Steven, all,
>
> This is v2 of the ftrace stackmap series. It addresses the Sashiko
> review at [1] and incorporates the kernel test robot's toctree fix.
>
> The series adds stack trace deduplication to ftrace. When the
> stacktrace option is enabled, the ring buffer stores a 4-byte
> stack_id instead of a full kernel stack trace, while the full
> stacks are exported via tracefs.
Sashiko still has some comments on the series. Please review them.
https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com
Please reply to the comments on this thread so that we can discuss them
here.
Thanks,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer
2026-05-25 6:58 ` [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer Masami Hiramatsu
@ 2026-05-25 7:39 ` Li Pengfei
0 siblings, 0 replies; 15+ messages in thread
From: Li Pengfei @ 2026-05-25 7:39 UTC (permalink / raw)
To: mhiramat
Cc: linux-trace-kernel, rostedt, linux-kernel, cmllamas, zhangbo56,
lipengfei28, lkp
Hi Masami,
I went through the Sashiko comments on v2 [1]. Per-finding response
below; v3 will incorporate the fixes.
[1] https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com
Patch 1/3:
- memset() torn reads against lockless readers: agreed, the
reset path is not well serialized against tracefs readers.
Will tighten slow-path synchronization in v3.
- seq_next() not advancing *pos on EOF: agreed, will fix in v3.
- atomic_read(&resetting) without acquire: agreed, will switch
to atomic_read_acquire() in v3.
- Plain reads of entry->key: agreed, will use READ_ONCE() in v3.
- atomic64_inc() in NMI-safe hot path on 32-bit GENERIC_ATOMIC64:
agreed, will move the counters off the hot path (local_t /
per-CPU) in v3.
Patch 2/3:
- TRACE_STACK_ID not in trace_valid_entry(): agreed, will add in v3.
- "NULL from kzalloc" comment: wording bug, will correct in v3.
- Reset memset synchronization: same fix as patch 1, finding 1.
Patch 3/3:
- Selftest missing 'function:tracer' in '# requires:': agreed,
will add in v3.
- Selftest wiping the ring buffer via 'echo nop > current_tracer'
before reading trace: agreed, will reorder in v3.
I'll send v3 once the changes are tested.
Pengfei
* [RFC PATCH v3 0/3] trace: stack trace deduplication for ftrace ring buffer
2026-05-14 3:49 [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer Li Pengfei
` (4 preceding siblings ...)
2026-05-22 10:40 ` [RFC PATCH v2 " Li Pengfei
@ 2026-05-26 11:52 ` Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei
` (2 more replies)
5 siblings, 3 replies; 15+ messages in thread
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
To: mhiramat, rostedt
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
From: Pengfei Li <lipengfei28@xiaomi.com>
Hi Masami, Steven, all,
This is v3 of the ftrace stackmap series. It addresses the Sashiko
review on v2 [1] that Masami pointed out.
[1] https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com
The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, while the full
stacks are exported via tracefs.
Rebased onto v7.1-rc5 (e8c2f9fdadee) before sending.
Changes since v2
================
Patch 1 (lock-free stackmap):
- Hot-path counters changed from atomic64_t to per-CPU local_t.
This avoids the raw_spinlock_t fallback that atomic64_t uses on
32-bit GENERIC_ATOMIC64, which would deadlock from NMI context.
- reset() now serializes against tracefs readers via an
rw_semaphore (held for write during the clearing memset, held
for read by seq_file iteration and bin snapshot construction).
synchronize_rcu() alone was insufficient because seq_file/bin
readers are in process context, not preempt-disabled.
- get_id() uses atomic_read_acquire() on smap->resetting so
subsequent loads of entry->key/val are properly ordered after
the check (LKMM control dependencies only order stores).
- All plain reads of entry->key now use READ_ONCE() to avoid
LKMM data races with the cmpxchg writer.
- val->nr in the hot path now uses READ_ONCE() to keep style
consistent with the seq_show / bin_open readers.
- stackmap_seq_next() now updates *pos past map_size on EOF so
seq_read() terminates instead of looping on the last element.
- Added a comment in the cmpxchg-claim path documenting that
two CPUs racing with the same key_hash may produce a small
number of duplicate entries; this is an accepted trade-off
for keeping the hot path lock-free.
- Removed BUG_ON in create path (the constraint is satisfied by
construction; no runtime check needed).
Patch 2 (integration):
- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and
ZEROED_TRACE_FLAGS so the option is only exposed under the
top-level trace instance, matching the convention used for
other global-only options such as 'printk' and 'record-cmd'.
Secondary instances under tracing/instances/*/ no longer see
the option at all, instead of seeing it as a silent no-op.
- TRACE_STACK_ID added to trace_valid_entry() in trace_selftest.c
so ftrace startup selftests don't reject the entry type.
- Corrected a comment about how global_trace.stackmap is
zero-initialized (BSS, not kzalloc).
Patch 3 (docs / selftest / tooling):
- Selftest now reads trace contents BEFORE switching back to the
nop tracer (tracer_init() calls tracing_reset_online_cpus()
which would have emptied the ring buffer).
- Added 'function:tracer' to the selftest '# requires:' line so
ftracetest skips when CONFIG_FUNCTION_TRACER is disabled
instead of failing spuriously.
- Selftest grep tightened to '<stack_id' to avoid future
false-positives if any other tracepoint name contains
"stack_id".
- New stackmap-instance-gate.tc selftest asserts the option and
stack_map* nodes are present on the global instance and absent
on a freshly-created secondary instance, locking in the
TOP_LEVEL_TRACE_FLAGS gating behavior introduced in patch 2.
- Documentation Performance section made vendor-neutral
("aarch64 SMP system" instead of a specific device name) and
the term "Hit rate" replaced with "Dedup rate" to match the
actual stat field name (success_rate).
- Documentation Design section now states that deduplication is
best-effort under heavy contention (cmpxchg races may produce
a small number of duplicate entries for the same stack), so
users observing entries > unique-stacks have a documented
explanation.
Test results
============
Device: Xiaomi SM8850 (ARM64), Android 16, kernel 6.12 (OGKI)
Config: CONFIG_FTRACE_STACKMAP=y, bits=14 (16384 elts, 32768 slots)
Method: 5-second capture with stacktrace trigger
Functional tests (all PASS):
- tracefs nodes (stack_map / stack_map_stat / stack_map_bin) exist
- options/stackmap writable, trace shows <stack_id N>
- stack_map text export with correct symbols
- reset clears entries when tracing stopped
- reset rejected (-EBUSY) while tracing active
- per-event trigger: only specified events get stacks
Performance (sched_switch, 5s):
entries: 466 / 16384
successes: 9159
drops: 0
success_rate: 100%
dedup rate: 95.2% (466 unique stacks / 9625 total events)
Performance (kmem_cache_alloc, 5s):
entries: 1177 / 16384
successes: 60078
drops: 0
success_rate: 100%
dedup rate: 98.1% (1177 unique stacks / 61255 total events)
Ring buffer space savings:
Event             Full stack           Stackmap                          Saving
----------------  -------------------  --------------------------------  ------
sched_switch      9625 × 88B = 847KB   12B × 9625 + 88B × 466 = 156KB    82%
kmem_cache_alloc  61255 × 88B = 5.4MB  12B × 61255 + 88B × 1177 = 839KB  84%
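
The savings column can be re-derived from the event and unique-stack
counts. A minimal Python sketch, assuming the 88-byte full-stack record
and 12-byte stack_id record used in the table (function and constant
names are illustrative only, not part of the series):

```python
# Per-record sizes taken from the table above: an 88-byte full stack
# record versus a 12-byte stack_id record, plus one 88-byte copy of
# each unique stack.
FULL_STACK_BYTES = 88
STACK_ID_BYTES = 12


def ring_buffer_saving(events, unique_stacks):
    """Return (full_bytes, dedup_bytes, saving_fraction)."""
    full = events * FULL_STACK_BYTES
    dedup = events * STACK_ID_BYTES + unique_stacks * FULL_STACK_BYTES
    return full, dedup, 1.0 - dedup / full


for name, events, unique in (("sched_switch", 9625, 466),
                             ("kmem_cache_alloc", 61255, 1177)):
    full, dedup, saving = ring_buffer_saving(events, unique)
    print(f"{name}: {full} B -> {dedup} B ({saving:.1%} saved)")
```

The saving approaches 1 - 12/88 (about 86%) as the number of unique
stacks becomes small relative to the number of events.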
QEMU validation (v3 base: v7.1-rc5)
===================================
The series boots cleanly on aarch64 QEMU. A post-init smoke test
(12/12 PASS) verified all functional behaviors including:
- tracefs nodes appear with correct file modes
- stack_id events emitted, kernel symbols resolve correctly
(e.g. __schedule+0x7cc/0x1138)
- reset rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- per-CPU local_t counters aggregate correctly across CPUs
- stack_map_bin magic correct (0x464D5342 'FSMB')
- 'stackmap' option visible on the global instance, hidden on
secondary instances under tracing/instances/*/
Boot-time activation via 'trace_options=stackmap,stacktrace' works:
events that fire before stackmap initialization fall back to
recording full stack traces; later events are deduplicated. No
events are dropped due to the transition.
Known limitations
=================
- Per-instance stackmap support is not included in this series.
Following the convention used for other global-only options
(PRINTK, RECORD_CMD), the 'stackmap' option is gated to the
top-level trace instance via TOP_LEVEL_TRACE_FLAGS, so it is
not exposed under tracing/instances/*/options/. Per-instance
maps would be a follow-up.
- The element pool is allocated eagerly at fs_initcall when
CONFIG_FTRACE_STACKMAP=y, regardless of whether userspace will
ever enable the option. At the default bits=14 this is roughly
8 MB of vmalloc; at the maximum bits=18, ~135 MB. The eager
allocation keeps the hot path entirely allocation-free and
avoids any allocation-failure path under tracing pressure.
Lazy allocation on first 'echo 1 > options/stackmap' is a
reasonable follow-up if maintainers prefer that trade-off.
- Deduplication is best-effort, not strict: under heavy
concurrent contention two CPUs racing in the insert path with
the same stack hash may each succeed in claiming a different
slot, producing a small number of duplicate entries for the
same stack. ref_count is then split across the duplicates.
This is intentional: it keeps the hot path lock-free and
bounds memory by the element pool size.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once
the binary format settles.
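
For a rough feel of the eager-allocation cost mentioned above, the pool
and hash-table sizes can be estimated from bits. This sketch assumes a
64-bit kernel, FTRACE_STACKMAP_MAX_DEPTH = 64 (suggested by the "64*8"
comment in patch 1, but the header is not shown here), and a 16-byte
struct stackmap_entry; all Python names are hypothetical:

```python
# Assumptions: 64 frames per element, 8-byte unsigned long, 4-byte nr
# and ref_count, and a hash-table entry padded to 16 bytes.
MAX_DEPTH = 64
ELT_BYTES = 4 + 4 + MAX_DEPTH * 8      # nr + ref_count + ips[]
ENTRY_BYTES = 16                        # u32 key + padding + 8-byte pointer


def stackmap_footprint(bits):
    """Return (pool_bytes, table_bytes) for a given bits value."""
    bits = max(10, min(18, bits))       # same [10, 18] clamp as the series
    max_elts = 1 << bits
    map_size = 1 << (bits + 1)          # 2x over-provisioned hash table
    return max_elts * ELT_BYTES, map_size * ENTRY_BYTES


for bits in (10, 14, 18):
    pool, table = stackmap_footprint(bits)
    print(f"bits={bits}: pool {pool / 1e6:.1f} MB, table {table / 1e6:.1f} MB")
```

Under these assumptions the element pool dominates; the hash table
itself stays below 10 MB even at bits=18.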
Usage
=====
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
Pengfei Li (3):
trace: add lock-free stackmap for stack trace deduplication
trace: integrate stackmap into ftrace stack recording path
trace: add documentation, selftest and tooling for stackmap
Documentation/trace/ftrace-stackmap.rst | 162 ++++
Documentation/trace/index.rst | 1 +
kernel/trace/Kconfig | 22 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 78 +-
kernel/trace/trace.h | 16 +
kernel/trace/trace_entries.h | 15 +
kernel/trace/trace_output.c | 23 +
kernel/trace/trace_selftest.c | 1 +
kernel/trace/trace_stackmap.c | 780 ++++++++++++++++++
kernel/trace/trace_stackmap.h | 57 ++
.../ftrace/test.d/ftrace/stackmap-basic.tc | 103 +++
.../test.d/ftrace/stackmap-instance-gate.tc | 42 +
tools/tracing/stackmap_dump.py | 150 ++++
14 files changed, 1449 insertions(+), 2 deletions(-)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
create mode 100755 tools/tracing/stackmap_dump.py
--
2.34.1
* [RFC PATCH v3 1/3] trace: add lock-free stackmap for stack trace deduplication
2026-05-26 11:52 ` [RFC PATCH v3 " Li Pengfei
@ 2026-05-26 11:52 ` Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 2/3] trace: integrate stackmap into ftrace stack recording path Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 3/3] trace: add documentation, selftest and tooling for stackmap Li Pengfei
2 siblings, 0 replies; 15+ messages in thread
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
To: mhiramat, rostedt
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
From: Pengfei Li <lipengfei28@xiaomi.com>
Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full
stack traces (80-160 bytes each) in the ring buffer for every event,
ftrace can store a 4-byte stack_id when the stackmap option is enabled.
The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:
- Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- Pre-allocated element pool (zero allocation on hot path)
- Linear probing with 2x over-provisioned table; probe length is
bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup
is O(1) even when the table is heavily loaded with claimed-but-
empty slots from pool exhaustion
- Single global instance (initialized for the global trace array)
The Kconfig depends on ARCH_HAVE_NMI_SAFE_CMPXCHG, matching the
existing tracing_map / hist_triggers requirement: the lock-free
hot path uses cmpxchg in a context that may be reached from NMI.
The stackmap is exported via three tracefs nodes:
- stack_map: text export with symbol resolution (mode 0640)
- stack_map_stat: counters (entries, successes, drops, success_rate)
- stack_map_bin: binary export (all fields native-endian)
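
As an illustration of the binary export, the following Python sketch
round-trips a synthetic blob. The layout (a 16-byte header of four u32
fields, then per stack a 16-byte entry header followed by nr u64 IPs)
is inferred from stackmap_dump.py in patch 3/3, not from the kernel
writer; the fourth field of each header is treated as reserved, and
build_blob()/parse_blob() are hypothetical helpers:

```python
import struct

MAGIC = 0x464D5342  # 'FSMB'


def build_blob(stacks, endian="<", version=1):
    """Pack [(stack_id, ref_count, [ips])] into the assumed layout."""
    out = [struct.pack(f"{endian}IIII", MAGIC, version, len(stacks), 0)]
    for stack_id, ref_count, ips in stacks:
        out.append(struct.pack(f"{endian}IIII", stack_id, len(ips), ref_count, 0))
        out.append(struct.pack(f"{endian}{len(ips)}Q", *ips))
    return b"".join(out)


def parse_blob(data):
    """Detect byte order from the magic, then decode every stack."""
    for endian in ("<", ">"):
        if struct.unpack_from(f"{endian}I", data, 0)[0] == MAGIC:
            break
    else:
        raise ValueError("bad magic")
    _magic, _version, nr_stacks, _pad = struct.unpack_from(f"{endian}IIII", data, 0)
    offset, stacks = 16, []
    for _ in range(nr_stacks):
        stack_id, nr, ref_count, _pad = struct.unpack_from(
            f"{endian}IIII", data, offset)
        offset += 16
        ips = list(struct.unpack_from(f"{endian}{nr}Q", data, offset))
        offset += nr * 8
        stacks.append((stack_id, ref_count, ips))
    return stacks
```

Because the fields are native-endian, a parser running on a different
host has to probe the magic in both byte orders, as parse_blob() does.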
Hot-path counters use per-CPU local_t (NMI-safe single-instruction
increments) instead of atomic64_t. atomic64_t falls back to
raw_spinlock_t-based emulation on 32-bit GENERIC_ATOMIC64 systems,
which would deadlock if an NMI hit while the spinlock was held.
local_t avoids this hazard.
Reset semantics:
- Reset is a control-path operation only allowed when tracing is
stopped on the owning trace_array. Online reset (with tracing
active) is intentionally not supported.
- Reset uses atomic_cmpxchg() to claim the resetting flag, then
verifies tracer_tracing_is_on() returns false.
- synchronize_rcu() drains in-flight get_id() callers from the
ftrace callback path (which runs preempt-disabled).
- A reader_sem (rw_semaphore) serializes the clearing memset
against tracefs readers (seq_file iteration and stack_map_bin
snapshot), which run in process context and aren't covered by
synchronize_rcu(). The hot path doesn't take this lock.
- Reset clears the resetting flag with atomic_set_release() so a
subsequent get_id() observes a fully cleared map.
- get_id() uses atomic_read_acquire() on resetting so subsequent
loads of entry->key/val are properly ordered after the check
(control dependencies only order stores per LKMM).
- Concurrent reset returns -EBUSY; reset while tracing is active
returns -EBUSY.
Concurrency notes:
- entry->val publication uses smp_store_release() paired with
smp_load_acquire() in all dereferencing readers.
- entry->key reads (in get_id, seq_start/next, bin_open) use
READ_ONCE() to avoid LKMM data races with the cmpxchg writer.
- elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before
use in seq_show and bin_open.
- Pool exhaustion: stackmap_get_elt() short-circuits via
atomic_read() before the contended atomic RMW, avoiding cacheline
contention once the pool is full. Slots that win cmpxchg but
cannot get an elt are left 'claimed but empty'; subsequent
lookups treat val==NULL as a miss and probe past them.
Hash key:
- Per-instance random seed stored in the stackmap struct (no
global state), seeded at create time.
- 32-bit jhash is forced to 1 if it lands on 0 (which is the
free-slot sentinel). Full memcmp confirms matches.
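
The probing rules above can be modeled in a few lines of Python. This
is a single-threaded sketch only: crc32 stands in for the seeded jhash,
a plain assignment stands in for cmpxchg, and returning the slot index
as the stack_id is an assumption, not necessarily what get_id() does.

```python
import zlib

MAX_PROBE = 64  # mirrors FTRACE_STACKMAP_MAX_PROBE


class StackMapModel:
    """Single-threaded model; the kernel code claims slots with cmpxchg."""

    def __init__(self, bits=4, seed=0x1234):
        self.max_elts = 1 << bits
        self.size = 1 << (bits + 1)          # 2x over-provisioned table
        self.keys = [0] * self.size          # 0 marks a free slot
        self.vals = [None] * self.size       # None: claimed but not published
        self.pool_used = 0
        self.seed = seed

    def _hash(self, ips):
        # crc32 is a stand-in for the kernel's seeded jhash.
        raw = b"".join(ip.to_bytes(8, "little") for ip in ips)
        return (zlib.crc32(raw, self.seed) & 0xFFFFFFFF) or 1  # 0 is reserved

    def get_id(self, ips):
        """Return a slot index (used here as stack_id), or -1 on a drop."""
        ips = tuple(ips)
        key = self._hash(ips)
        idx = key & (self.size - 1)
        for _ in range(MAX_PROBE):
            if self.keys[idx] == key and self.vals[idx] == ips:
                return idx                    # hit: the full compare confirms it
            if self.keys[idx] == 0:
                self.keys[idx] = key          # claim the free slot
                if self.pool_used >= self.max_elts:
                    return -1                 # pool exhausted: slot stays empty
                self.pool_used += 1
                self.vals[idx] = ips          # publish the element
                return idx
            idx = (idx + 1) & (self.size - 1)
        return -1                             # probe limit reached
```

A slot whose key matches but whose value is still None is skipped, which
corresponds to treating val==NULL as a miss and probing past it.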
Memory:
- Single flat vmalloc for the element pool (no per-elt kzalloc).
- bits parameter clamped to [10, 18]: at the maximum bits=18, the
element pool is ~135 MB and a stack_map_bin snapshot may briefly
allocate another ~135 MB.
- struct stackmap_bin_snapshot uses u64 (not size_t) for its size
field so data[] is 8-byte aligned on both 32-bit and 64-bit
architectures, avoiding alignment faults when writing u64 IPs
on strict-alignment architectures.
Kernel command line parameter:
- ftrace_stackmap.bits=N: set map capacity (2^N unique stacks,
range 10-18, default 14)
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/Kconfig | 22 +
kernel/trace/Makefile | 1 +
kernel/trace/trace_stackmap.c | 780 ++++++++++++++++++++++++++++++++++
kernel/trace/trace_stackmap.h | 57 +++
4 files changed, 860 insertions(+)
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..e49cae886ff0 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,28 @@ config STACK_TRACER
Say N if unsure.
+config FTRACE_STACKMAP
+ bool "Ftrace stack map deduplication"
+ depends on TRACING
+ depends on STACKTRACE
+ depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
+ select KALLSYMS
+ help
+ This enables a global stack trace hash table for ftrace, inspired
+ by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+ only a stack_id in the ring buffer instead of the full stack trace,
+ significantly reducing trace buffer usage when the same call stacks
+ appear repeatedly.
+
+ The deduplicated stacks are exported via:
+ /sys/kernel/debug/tracing/stack_map
+
+ Writing to this file resets the stack map. Reading shows all unique
+ stacks with their stack_id and reference count.
+
+ Say Y if you want to reduce ftrace buffer usage for stack traces.
+ Say N if unsure.
+
config TRACE_PREEMPT_TOGGLE
bool
help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 8d3d96e847d8..c2d9b2bf895a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..c89f6d527c96
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,780 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ * - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
+ * - Pre-allocated element pool (zero allocation on hot path)
+ * - Linear probing with 2x over-provisioned table; probe length
+ * bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup
+ * cost constant even when the table is heavily loaded
+ * - Single global instance (initialized for the global trace array)
+ *
+ * Reset is a control-path operation, only allowed when tracing is
+ * stopped on the owning trace_array. The protocol is:
+ *
+ * - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights
+ * and blocks new get_id() callers (they observe resetting=1 and
+ * return -EINVAL).
+ * - tracer_tracing_is_on() is checked AFTER the cmpxchg, so the
+ * resetting flag itself prevents new insertions even if userspace
+ * re-enables tracing immediately after the check.
+ * - synchronize_rcu() drains in-flight get_id() callers from the
+ * ftrace callback path, which runs with preemption disabled.
+ *
+ * Online reset (with tracing active) is intentionally not supported
+ * to keep the design simple and the proof obligations small.
+ *
+ * The 32-bit jhash of the stack IPs is the hash table key. On hash
+ * collision, linear probing finds the next slot and full memcmp
+ * confirms the match.
+ *
+ * Concurrent userspace readers (cat stack_map / stack_map_bin) get
+ * a best-effort snapshot. They are coherent with the hot path
+ * (smp_load_acquire on entry->val). The clearing memset in reset is
+ * serialized against them by reader_sem, so a reader does not see a
+ * half-cleared map; output that spans a reset may still be truncated
+ * or partial, but never crashes.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/jhash.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/local_lock.h>
+#include <linux/percpu.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/log2.h>
+#include <asm/local.h>
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Bound the linear-probe scan length. With a 2x over-provisioned table,
+ * a well-distributed hash gives very short probe chains. Capping at 64
+ * keeps worst-case lookup O(1) even when the table is heavily loaded
+ * with claimed-but-empty slots from pool exhaustion.
+ */
+#define FTRACE_STACKMAP_MAX_PROBE 64
+
+/*
+ * Memory ordering of entry->val: published with smp_store_release()
+ * by the inserter; consumed with smp_load_acquire() by every reader
+ * that dereferences the elt (get_id, seq_show, bin_open). This pairs
+ * the writes to elt->{nr,ips,ref_count} (initialized BEFORE the
+ * publish) with the reads of those fields (which happen AFTER the
+ * load). seq_start / seq_next only test val for NULL and use the
+ * acquire load purely to keep memory ordering symmetric.
+ */
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+ u32 nr; /* actual number of IPs */
+ atomic_t ref_count;
+ unsigned long ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+ u32 key; /* 0 = free, non-zero = jhash */
+ struct stackmap_elt *val; /* NULL until fully published */
+};
+
+struct ftrace_stackmap {
+ struct trace_array *tr; /* owning trace_array */
+ unsigned int map_bits;
+ unsigned int map_size; /* 1 << (map_bits + 1) */
+ unsigned int max_elts; /* 1 << map_bits */
+ u32 hash_seed; /* per-instance jhash seed */
+ atomic_t next_elt; /* index into elts pool */
+ struct stackmap_entry *entries; /* hash table */
+ struct stackmap_elt *elts; /* flat element pool */
+ atomic_t resetting;
+ /*
+ * Reader/reset serialization. Held in shared mode (read lock)
+ * across seq_file iteration and binary snapshot construction;
+ * held in exclusive mode (write lock) by reset's clearing
+ * phase. The hot path (get_id) does not take this lock — it
+ * uses smp_load_acquire/smp_store_release on entry->val and
+ * the resetting flag for the lock-free protocol.
+ */
+ struct rw_semaphore reader_sem;
+ /*
+ * Per-CPU counters using local_t. local_t increments are NMI-
+ * safe on all architectures (single-instruction or interrupt-
+ * masked) and avoid the raw_spinlock_t fallback that
+ * atomic64_t uses on 32-bit GENERIC_ATOMIC64 — which would
+ * deadlock if an NMI hit while the spinlock was held.
+ */
+ local_t __percpu *successes; /* events served (hits + new inserts) */
+ local_t __percpu *drops;
+};
+
+/*
+ * Cap the bits parameter to keep worst-case allocations bounded:
+ * bits=18 → 256K elts, 512K slots, ~135 MB elt pool, ~135 MB bin
+ * export.
+ * Smaller workloads should use the default (14) which gives 16K elts
+ * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel
+ * parameter for higher unique-stack capacity.
+ */
+#define FTRACE_STACKMAP_BITS_MIN 10
+#define FTRACE_STACKMAP_BITS_MAX 18
+#define FTRACE_STACKMAP_BITS_DEFAULT 14
+
+static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT;
+static int __init stackmap_bits_setup(char *str)
+{
+ unsigned long val;
+
+ if (kstrtoul(str, 0, &val))
+ return -EINVAL;
+ val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX);
+ stackmap_map_bits = val;
+ return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+ int idx;
+
+ /*
+ * Fast-path early-out once the pool is fully consumed. Avoids
+ * the contended atomic RMW on next_elt for every traced event
+ * after the pool is exhausted.
+ */
+ if (atomic_read(&smap->next_elt) >= smap->max_elts)
+ return NULL;
+
+ idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+ if (idx < smap->max_elts)
+ return &smap->elts[idx];
+ return NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr)
+{
+ struct ftrace_stackmap *smap;
+ unsigned int bits;
+
+ smap = kzalloc(sizeof(*smap), GFP_KERNEL);
+ if (!smap)
+ return ERR_PTR(-ENOMEM);
+
+ /* Defensive clamp: reject bogus bits even if early_param is bypassed. */
+ bits = clamp_val(stackmap_map_bits,
+ FTRACE_STACKMAP_BITS_MIN,
+ FTRACE_STACKMAP_BITS_MAX);
+
+ smap->tr = tr;
+ smap->map_bits = bits;
+ smap->max_elts = 1U << bits;
+ smap->map_size = 1U << (bits + 1); /* 2x over-provision */
+
+ smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+ if (!smap->entries) {
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /*
+ * Single large vmalloc of the element pool, indexed flat.
+ * At bits=18 this is 256K * sizeof(struct stackmap_elt). The
+ * struct is ~520 B (4 + 4 + 64*8), so total ~135 MB.
+ */
+ smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts);
+ if (!smap->elts) {
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ smap->successes = alloc_percpu(local_t);
+ if (!smap->successes) {
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+ smap->drops = alloc_percpu(local_t);
+ if (!smap->drops) {
+ free_percpu(smap->successes);
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ smap->hash_seed = get_random_u32();
+ atomic_set(&smap->next_elt, 0);
+ atomic_set(&smap->resetting, 0);
+ init_rwsem(&smap->reader_sem);
+
+ return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+ if (!smap || IS_ERR(smap))
+ return;
+ free_percpu(smap->drops);
+ free_percpu(smap->successes);
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+}
+
+/**
+ * ftrace_stackmap_reset - clear all entries in the stackmap
+ * @smap: the stackmap to reset
+ *
+ * Returns 0 on success (or if @smap is NULL), or -EBUSY if another
+ * reset is already in progress or tracing is currently active on the
+ * owning trace_array.
+ *
+ * Online reset (with tracing active) is not supported. Caller must
+ * stop tracing first (echo 0 > tracing_on).
+ *
+ * Caller must be in process context (typically the tracefs write handler).
+ *
+ * Protocol:
+ * 1. Atomically claim reset rights via cmpxchg on @resetting.
+ * 2. Verify tracing is stopped on @smap->tr; if not, release the
+ * claim and return -EBUSY. The resetting flag itself blocks
+ * any subsequent get_id() callers.
+ * 3. synchronize_rcu() drains in-flight get_id() callers from the
+ * ftrace callback path (which runs preempt-disabled).
+ * 4. Take reader_sem for write, then memset entries, elts, and counters.
+ * 5. Release the resetting flag with release semantics so any new
+ * get_id() observes a fully cleared map.
+ */
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+ if (!smap)
+ return 0;
+
+ if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0)
+ return -EBUSY;
+
+ if (smap->tr && tracer_tracing_is_on(smap->tr)) {
+ atomic_set(&smap->resetting, 0);
+ return -EBUSY;
+ }
+
+ /*
+ * synchronize_rcu() itself is a full barrier; no extra smp_mb()
+ * is needed before it. It drains in-flight ftrace callbacks that
+ * may have already passed the resetting check with the old value.
+ */
+ synchronize_rcu();
+
+ /*
+ * Take the reader_sem in exclusive mode. This serializes the
+ * memset against any tracefs reader (seq_file iteration or
+ * stack_map_bin snapshot) that may currently hold the rwsem
+ * for read. synchronize_rcu() already drained the hot path;
+ * this rwsem covers process-context readers that aren't
+ * preempt-disabled.
+ */
+ down_write(&smap->reader_sem);
+
+ memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+ memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts);
+
+ atomic_set(&smap->next_elt, 0);
+ {
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ local_set(per_cpu_ptr(smap->successes, cpu), 0);
+ local_set(per_cpu_ptr(smap->drops, cpu), 0);
+ }
+ }
+
+ up_write(&smap->reader_sem);
+
+ /* Release resetting=0 so new get_id() observes a cleared map. */
+ atomic_set_release(&smap->resetting, 0);
+ return 0;
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries)
+{
+ u32 key_hash, idx, test_key, trace_len;
+ struct stackmap_entry *entry;
+ struct stackmap_elt *val;
+ int probes = 0;
+
+ /*
+ * atomic_read_acquire() pairs with atomic_set_release() in the
+ * reset path. This ensures that subsequent reads of entry->key
+ * and entry->val are ordered after this check; without acquire,
+ * the CPU would only have a control dependency, which orders
+ * subsequent stores but not loads (per LKMM).
+ */
+ if (!smap || !nr_entries || atomic_read_acquire(&smap->resetting))
+ return -EINVAL;
+ if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+ nr_entries = FTRACE_STACKMAP_MAX_DEPTH;
+
+ trace_len = nr_entries * sizeof(unsigned long);
+ /*
+ * jhash2() requires the length in u32 units and the data to be
+ * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+ * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+ * directly; the cast to u32* is safe because ips[] is naturally
+ * aligned to sizeof(unsigned long) >= 4.
+ */
+ key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+ smap->hash_seed);
+ if (key_hash == 0)
+ key_hash = 1; /* 0 means free slot */
+
+ idx = key_hash >> (32 - (smap->map_bits + 1));
+
+ while (probes < FTRACE_STACKMAP_MAX_PROBE) {
+ idx &= (smap->map_size - 1);
+ entry = &smap->entries[idx];
+ /*
+ * READ_ONCE() to avoid LKMM data race with concurrent
+ * cmpxchg(&entry->key, 0, key_hash) on this slot.
+ */
+ test_key = READ_ONCE(entry->key);
+
+ if (test_key == key_hash) {
+ /*
+ * smp_load_acquire pairs with smp_store_release in
+ * the publisher below; ensures we see fully-formed
+ * elt fields (nr, ips, ref_count) before dereference.
+ */
+ val = smp_load_acquire(&entry->val);
+ /*
+ * READ_ONCE(val->nr) keeps style consistent with
+ * the seq_show / bin_open readers. nr is write-once
+ * (set before publish, never modified afterwards),
+ * so the load is data-race-free, but READ_ONCE
+ * silences any analysis tool that flags a plain
+ * read of a field that is also read under acquire
+ * elsewhere.
+ */
+ if (val && READ_ONCE(val->nr) == nr_entries &&
+ memcmp(val->ips, ips, trace_len) == 0) {
+ atomic_inc(&val->ref_count);
+ local_inc(this_cpu_ptr(smap->successes));
+ return (int)idx;
+ }
+ /*
+ * val == NULL: another CPU is mid-insert, or this
+ * slot is "claimed but empty" (pool exhausted).
+ * val != NULL but mismatch: 32-bit hash collision
+ * with a different stack. In both cases, advance.
+ */
+ } else if (!test_key) {
+ /*
+ * Free slot: try to claim it.
+ *
+ * If two CPUs race here with the same key_hash
+ * (same stack), one loses the cmpxchg, advances,
+ * and may insert the same stack at a later slot.
+ * This can produce a small number of duplicate
+ * entries under heavy contention. The trade-off
+ * is accepted to keep the hot path lock-free;
+ * ref_count is split across the duplicates and
+ * total memory cost is bounded by the element
+ * pool size.
+ */
+ if (cmpxchg(&entry->key, 0, key_hash) == 0) {
+ struct stackmap_elt *elt;
+
+ elt = stackmap_get_elt(smap);
+ if (!elt) {
+ /*
+ * Pool exhausted. We claimed this
+ * slot with cmpxchg but cannot fill
+ * it. Leave key set so the slot
+ * stays "claimed but empty" — future
+ * lookups treat val==NULL as a miss
+ * and probe past it. Cannot revert
+ * key=0 without racing other CPUs.
+ */
+ local_inc(this_cpu_ptr(smap->drops));
+ return -ENOSPC;
+ }
+
+ elt->nr = nr_entries;
+ atomic_set(&elt->ref_count, 1);
+ memcpy(elt->ips, ips, trace_len);
+
+ /*
+ * Publish elt with release semantics so the
+ * reader's smp_load_acquire can safely
+ * dereference val->nr / val->ips.
+ */
+ smp_store_release(&entry->val, elt);
+ local_inc(this_cpu_ptr(smap->successes));
+ return (int)idx;
+ }
+ /* cmpxchg failed; another CPU claimed this slot. */
+ }
+
+ idx++;
+ probes++;
+ }
+
+ local_inc(this_cpu_ptr(smap->drops));
+ return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+ struct ftrace_stackmap *smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ /*
+ * Take the reader_sem to serialize against ftrace_stackmap_reset(),
+ * which holds it for write while clearing the table. Released in
+ * stackmap_seq_stop(), which seq_file calls even when start() returns
+ * NULL; see Documentation/filesystems/seq_file.rst: "the iterator
+ * value returned by start() or next() is guaranteed to be passed to
+ * a subsequent next() or stop()".
+ */
+ down_read(&smap->reader_sem);
+ for (i = *pos; i < smap->map_size; i++) {
+ if (READ_ONCE(smap->entries[i].key) &&
+ smp_load_acquire(&smap->entries[i].val)) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ for (i = *pos + 1; i < smap->map_size; i++) {
+ if (READ_ONCE(smap->entries[i].key) &&
+ smp_load_acquire(&smap->entries[i].val)) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ /*
+ * Advance *pos past the end so that on the next read() the
+ * subsequent stackmap_seq_start() call returns NULL and the
+ * iteration terminates. Without this, seq_read() would loop
+ * on the last element.
+ */
+ *pos = smap->map_size;
+ return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+
+ /*
+ * seq_file invokes stop() unconditionally after each iteration
+ * pass (see seq_read_iter / traverse), even when start() returned
+ * NULL. Always release here, balanced against the down_read in
+ * stackmap_seq_start().
+ */
+ if (smap)
+ up_read(&smap->reader_sem);
+}
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+ struct stackmap_entry *entry = v;
+ struct stackmap_elt *elt = smp_load_acquire(&entry->val);
+ struct stackmap_seq_private *priv = m->private;
+ u32 idx = entry - priv->smap->entries;
+ u32 i, nr;
+
+ if (!elt)
+ return 0;
+
+ nr = READ_ONCE(elt->nr);
+ if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+ nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+ seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+ idx, atomic_read(&elt->ref_count), nr);
+ for (i = 0; i < nr; i++)
+ seq_printf(m, " [%u] %pS\n", i, (void *)elt->ips[i]);
+ seq_putc(m, '\n');
+ return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+ .start = stackmap_seq_start,
+ .next = stackmap_seq_next,
+ .stop = stackmap_seq_stop,
+ .show = stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+ struct stackmap_seq_private *priv;
+ struct seq_file *m;
+ int ret;
+
+ ret = seq_open_private(file, &stackmap_seq_ops,
+ sizeof(struct stackmap_seq_private));
+ if (ret)
+ return ret;
+ m = file->private_data;
+ priv = m->private;
+ priv->smap = inode->i_private;
+ return 0;
+}
+
+/*
+ * Accept exactly "0" or "reset" (optionally followed by a single newline).
+ */
+static bool stackmap_write_is_reset(const char *buf, size_t n)
+{
+ if (n > 0 && buf[n - 1] == '\n')
+ n--;
+ return (n == 1 && buf[0] == '0') ||
+ (n == 5 && memcmp(buf, "reset", 5) == 0);
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct seq_file *m = file->private_data;
+ struct stackmap_seq_private *priv = m->private;
+ char buf[8];
+ size_t n = min(count, sizeof(buf) - 1);
+ int ret;
+
+ if (n == 0)
+ return -EINVAL;
+ if (copy_from_user(buf, ubuf, n))
+ return -EFAULT;
+ buf[n] = '\0';
+
+ if (!stackmap_write_is_reset(buf, n))
+ return -EINVAL;
+
+ /*
+ * ftrace_stackmap_reset() atomically claims reset rights via
+ * cmpxchg and returns -EBUSY if another reset is in progress
+ * or if tracing is active.
+ */
+ ret = ftrace_stackmap_reset(priv->smap);
+ if (ret)
+ return ret;
+ return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+ .open = stackmap_open,
+ .read = seq_read,
+ .write = stackmap_write,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+ struct ftrace_stackmap *smap = m->private;
+ u64 successes = 0, drops = 0;
+ u32 entries;
+ int cpu;
+
+ if (!smap) {
+ seq_puts(m, "stackmap not initialized\n");
+ return 0;
+ }
+
+ entries = atomic_read(&smap->next_elt);
+ for_each_possible_cpu(cpu) {
+ successes += local_read(per_cpu_ptr(smap->successes, cpu));
+ drops += local_read(per_cpu_ptr(smap->drops, cpu));
+ }
+
+ seq_printf(m, "entries: %u / %u\n", entries, smap->max_elts);
+ seq_printf(m, "table_size: %u\n", smap->map_size);
+ seq_printf(m, "successes: %llu\n", successes);
+ seq_printf(m, "drops: %llu\n", drops);
+ if (successes + drops > 0)
+ seq_printf(m, "success_rate: %llu%%\n",
+ successes * 100 / (successes + drops));
+ return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+ .open = stackmap_stat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+ /*
+ * Use u64 (not size_t) so data[] is 8-byte aligned on both
+ * 32-bit and 64-bit architectures. The IP array within data[]
+ * is accessed as u64*, which would alignment-fault on strict
+ * architectures (e.g. older ARM, SPARC) if data[] started at
+ * a 4-byte boundary.
+ */
+ u64 size;
+ char data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+ struct ftrace_stackmap *smap = inode->i_private;
+ struct stackmap_bin_snapshot *snap;
+ struct ftrace_stackmap_bin_header *hdr;
+ size_t alloc_size, off;
+ u32 nr_entries, i, nr_stacks;
+
+ if (!smap)
+ return -ENODEV;
+
+ /*
+ * Worst-case allocation size: every populated entry uses a
+ * full-depth stack. The (+1) gives one slack slot in case a
+ * concurrent insert lands between this snapshot and iteration.
+ * The loop below performs an explicit bounds check anyway.
+ *
+ * At bits=18 this caps at ~132 MiB. The file is mode 0440
+ * (TRACE_MODE_READ), so only privileged users can open it.
+ */
+ nr_entries = atomic_read(&smap->next_elt);
+ alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+ (sizeof(struct ftrace_stackmap_bin_entry) +
+ FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+
+ snap = vmalloc(sizeof(*snap) + alloc_size);
+ if (!snap)
+ return -ENOMEM;
+
+ hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+ hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+ hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+ hdr->reserved = 0;
+ off = sizeof(*hdr);
+ nr_stacks = 0;
+
+ /*
+ * Take reader_sem to serialize against ftrace_stackmap_reset(),
+ * which clears the table and elt pool under the write lock.
+ */
+ down_read(&smap->reader_sem);
+
+ for (i = 0; i < smap->map_size; i++) {
+ struct stackmap_entry *entry = &smap->entries[i];
+ struct stackmap_elt *elt;
+ struct ftrace_stackmap_bin_entry *e;
+ u64 *ips_out;
+ u32 k, nr;
+
+ if (!READ_ONCE(entry->key))
+ continue;
+ elt = smp_load_acquire(&entry->val);
+ if (!elt)
+ continue;
+
+ nr = READ_ONCE(elt->nr);
+ if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+ nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+ /* Bounds check: stop if we would overflow the allocation. */
+ if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size)
+ break;
+
+ e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+ e->stack_id = i;
+ e->nr = nr;
+ e->ref_count = atomic_read(&elt->ref_count);
+ e->reserved = 0;
+ off += sizeof(*e);
+
+ ips_out = (u64 *)(snap->data + off);
+ for (k = 0; k < nr; k++)
+ ips_out[k] = (u64)elt->ips[k];
+ off += nr * sizeof(u64);
+ nr_stacks++;
+ }
+
+ up_read(&smap->reader_sem);
+
+ hdr->nr_stacks = nr_stacks;
+ snap->size = off;
+ file->private_data = snap;
+ return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct stackmap_bin_snapshot *snap = file->private_data;
+
+ if (!snap)
+ return -EINVAL;
+ return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+ vfree(file->private_data);
+ return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+ .open = stackmap_bin_open,
+ .read = stackmap_bin_read,
+ .llseek = default_llseek,
+ .release = stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..2e82bd6fb1c3
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define FTRACE_STACKMAP_MAX_DEPTH 64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC 0x464D5342 /* bytes 'F','M','S','B' */
+#define FTRACE_STACKMAP_BIN_VERSION 2
+
+struct ftrace_stackmap_bin_header {
+ u32 magic;
+ u32 version;
+ u32 nr_stacks;
+ u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+ u32 stack_id;
+ u32 nr;
+ u32 ref_count;
+ u32 reserved;
+ /* followed by u64 ips[nr] */
+};
+
+struct trace_array;
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries);
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *
+ftrace_stackmap_create(struct trace_array *tr) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+ unsigned long *ips, unsigned int n)
+{ return -EOPNOTSUPP; }
+static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
--
2.34.1
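
The claim-then-publish insert path in ftrace_stackmap_get_id() may be
easier to follow in a single-threaded model. The sketch below is Python,
not kernel code, and rests on stated assumptions: cmpxchg() is replaced
by a plain store, jhash2() by a caller-supplied hash (zlib.crc32 by
default), and MAX_PROBE = 16 is a made-up value because
FTRACE_STACKMAP_MAX_PROBE is defined outside this excerpt.
StackMapModel, crc_hash and the field names are hypothetical.

```python
import errno
import struct
import zlib

MAX_DEPTH = 64   # FTRACE_STACKMAP_MAX_DEPTH in trace_stackmap.h
MAX_PROBE = 16   # assumption: the real FTRACE_STACKMAP_MAX_PROBE is not shown


def crc_hash(ips):
    """Stand-in for jhash2(); any 32-bit hash serves for the model."""
    return zlib.crc32(struct.pack("<%dQ" % len(ips), *ips)) & 0xFFFFFFFF


class StackMapModel:
    """Single-threaded model of the claim-then-publish insert path."""

    def __init__(self, bits, hash_fn=crc_hash, max_probe=MAX_PROBE):
        self.bits = bits
        self.max_elts = 1 << bits            # element pool capacity
        self.map_size = 1 << (bits + 1)      # 2x over-provisioned slot table
        self.keys = [0] * self.map_size      # 0 means free slot
        self.vals = [None] * self.map_size   # None means nothing published
        self.next_elt = 0
        self.hash_fn = hash_fn
        self.max_probe = max_probe
        self.successes = 0
        self.drops = 0

    def get_id(self, ips):
        ips = tuple(ips)[:MAX_DEPTH]
        if not ips:
            return -errno.EINVAL
        key_hash = self.hash_fn(ips) or 1    # 0 is reserved for "free"
        idx = key_hash >> (32 - (self.bits + 1))  # top bits pick the slot
        for _ in range(self.max_probe):
            idx &= self.map_size - 1
            key = self.keys[idx]
            if key == key_hash:
                val = self.vals[idx]
                if val is not None and val["ips"] == ips:
                    val["ref"] += 1
                    self.successes += 1
                    return idx
                # Claimed-but-empty slot or hash collision: keep probing.
            elif key == 0:
                # Kernel: cmpxchg(&entry->key, 0, key_hash)
                self.keys[idx] = key_hash
                if self.next_elt >= self.max_elts:
                    # Pool exhausted: the slot stays claimed but empty.
                    self.drops += 1
                    return -errno.ENOSPC
                self.next_elt += 1
                # Kernel: smp_store_release(&entry->val, elt)
                self.vals[idx] = {"ips": ips, "ref": 1}
                self.successes += 1
                return idx
            idx += 1
        self.drops += 1
        return -errno.ENOSPC
```

With a small table (bits=1) the model shows the pool-exhaustion
behaviour described in the code comments: once next_elt reaches
max_elts, each retry of a not-yet-stored stack claims one more free
slot without filling it, until the probe limit is the only thing left
to hit.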
* [RFC PATCH v3 2/3] trace: integrate stackmap into ftrace stack recording path
2026-05-26 11:52 ` [RFC PATCH v3 " Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei
@ 2026-05-26 11:52 ` Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 3/3] trace: add documentation, selftest and tooling for stackmap Li Pengfei
2 siblings, 0 replies; 15+ messages in thread
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
To: mhiramat, rostedt
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
From: Pengfei Li <lipengfei28@xiaomi.com>
Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.
Changes:
- New TRACE_STACK_ID in trace_type enum
- New stack_id_entry in trace_entries.h
- New TRACE_ITER(STACKMAP) trace option flag; when CONFIG_FTRACE_STACKMAP
is disabled, TRACE_ITER_STACKMAP_BIT is defined as -1 so that
TRACE_ITER(STACKMAP) evaluates to 0 (following the existing pattern
used by TRACE_ITER_PROF_TEXT_OFFSET)
- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and ZEROED_TRACE_FLAGS
so it is only exposed under the top-level trace instance, matching
the convention already used for global-only options such as 'printk'
and 'record-cmd'. Secondary instances under tracing/instances/*/
do not see the option at all, avoiding a confusing no-op.
- Modified __ftrace_trace_stack() to call ftrace_stackmap_get_id()
when the stackmap option is active. If reserving a TRACE_STACK_ID
ring-buffer slot fails after a successful get_id(), the path falls
through to the full-stack recording so the event still gets a stack
trace recorded.
- Stackmap pointer read with smp_load_acquire(), published with
smp_store_release() to ensure proper initialization ordering
- NULL check on tr->stackmap is retained as defense-in-depth: events
that fire before fs_initcall (when the map is created) or after a
failed ftrace_stackmap_create() observe a NULL pointer and fall back
to full stack recording without dereferencing it
- ftrace_stackmap_create() takes the owning trace_array so the
stackmap can later check tracing state during reset
- Added stack_id print handler in trace_output.c
- Added TRACE_STACK_ID to trace_valid_entry() in trace_selftest.c
so ftrace startup selftests don't reject the new entry type when
the stackmap option is enabled
Fallback behavior: if stackmap returns an error (pool exhausted,
resetting, or NULL pointer), the full stack trace is recorded as
before -- no new failure modes introduced.
Per-instance stackmap support is left as a follow-up; gating the
option via TOP_LEVEL_TRACE_FLAGS makes the global-only scope
explicit at the tracefs interface rather than relying on a silent
runtime fallback.
Usage:
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/trace.c | 78 ++++++++++++++++++++++++++++++++++-
kernel/trace/trace.h | 16 +++++++
kernel/trace/trace_entries.h | 15 +++++++
kernel/trace/trace_output.c | 23 +++++++++++
kernel/trace/trace_selftest.c | 1 +
5 files changed, 131 insertions(+), 2 deletions(-)
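
The fallback rules listed in the changelog above can be summarised as a
small decision function. This is a Python model for illustration only,
not the kernel implementation; choose_entry and its parameters are
hypothetical names, and a failed reserve of the full-stack entry itself
is not modelled.

```python
import errno

TRACE_STACK = "TRACE_STACK"        # full stack entry
TRACE_STACK_ID = "TRACE_STACK_ID"  # 4-byte stack_id entry


def choose_entry(option_on, smap, get_id, reserve_ok=True):
    """Return the entry type the model records for one event.

    smap       -- None models tr->stackmap still being NULL
    get_id     -- callable(smap) returning an id >= 0 or a negative errno
    reserve_ok -- False models a failed ring-buffer reserve for the id entry
    """
    if not option_on:
        return TRACE_STACK
    if smap is None:
        return TRACE_STACK      # map not created (yet): goto full_stack
    sid = get_id(smap)
    if sid < 0:
        return TRACE_STACK      # -EINVAL (resetting) or -ENOSPC
    if not reserve_ok:
        return TRACE_STACK      # id obtained, but no room for the id entry
    return TRACE_STACK_ID
```

One consequence visible in patch 1/3: ftrace_stackmap_get_id() bumps
ref_count and the successes counter before the reserve is attempted, so
an event that takes the reserve-failure branch is still counted as a
success in stack_map_stat.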
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..36120355e549 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
#include "trace.h"
#include "trace_output.h"
+#include "trace_stackmap.h"
#ifdef CONFIG_FTRACE_STARTUP_TEST
/*
@@ -509,12 +510,13 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
/* trace_options that are only supported by global_trace */
#define TOP_LEVEL_TRACE_FLAGS (TRACE_ITER(PRINTK) | \
TRACE_ITER(PRINTK_MSGONLY) | TRACE_ITER(RECORD_CMD) | \
- TRACE_ITER(PROF_TEXT_OFFSET) | FPROFILE_DEFAULT_FLAGS)
+ TRACE_ITER(PROF_TEXT_OFFSET) | TRACE_ITER(STACKMAP) | \
+ FPROFILE_DEFAULT_FLAGS)
/* trace_flags that are default zero for instances */
#define ZEROED_TRACE_FLAGS \
(TRACE_ITER(EVENT_FORK) | TRACE_ITER(FUNC_FORK) | TRACE_ITER(TRACE_PRINTK) | \
- TRACE_ITER(COPY_MARKER))
+ TRACE_ITER(COPY_MARKER) | TRACE_ITER(STACKMAP))
/*
* The global_trace is the descriptor that holds the top-level tracing
@@ -2184,6 +2186,49 @@ void __ftrace_trace_stack(struct trace_array *tr,
}
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+ /*
+ * If stackmap dedup is enabled, try to store only the stack_id
+ * in the ring buffer instead of the full stack trace.
+ */
+ if (tr->trace_flags & TRACE_ITER(STACKMAP)) {
+ struct ftrace_stackmap *smap;
+ struct stack_id_entry *sid_entry;
+ int sid;
+
+ smap = smp_load_acquire(&tr->stackmap);
+ if (!smap)
+ goto full_stack;
+
+ sid = ftrace_stackmap_get_id(smap, fstack->calls, nr_entries);
+ if (sid >= 0) {
+ event = __trace_buffer_lock_reserve(buffer,
+ TRACE_STACK_ID,
+ sizeof(*sid_entry), trace_ctx);
+ if (!event) {
+ /*
+ * Could not reserve a TRACE_STACK_ID slot;
+ * fall back to the full-stack path so the
+ * event still gets a stack trace recorded.
+ */
+ goto full_stack;
+ }
+ sid_entry = ring_buffer_event_data(event);
+ sid_entry->stack_id = sid;
+ /*
+ * stack_id is a synthetic side-event attached to a
+ * primary trace event that was already subject to
+ * filtering. No per-event filter is defined for
+ * TRACE_STACK_ID, so commit unconditionally.
+ */
+ __buffer_unlock_commit(buffer, event);
+ goto out;
+ }
+ /* On stackmap failure, record the full stack instead. */
+ }
+full_stack:
+#endif
+
event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
struct_size(entry, caller, nr_entries),
trace_ctx);
@@ -9222,6 +9267,35 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
NULL, &tracing_dyn_info_fops);
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+ {
+ struct ftrace_stackmap *smap;
+
+ smap = ftrace_stackmap_create(&global_trace);
+ if (!IS_ERR(smap)) {
+ /*
+ * Use smp_store_release to ensure the stackmap
+ * structure is fully initialized before publishing
+ * the pointer to concurrent trace event readers.
+ */
+ smp_store_release(&global_trace.stackmap, smap);
+ trace_create_file("stack_map", TRACE_MODE_WRITE, NULL,
+ smap, &ftrace_stackmap_fops);
+ trace_create_file("stack_map_stat", TRACE_MODE_READ, NULL,
+ smap, &ftrace_stackmap_stat_fops);
+ trace_create_file("stack_map_bin", TRACE_MODE_READ, NULL,
+ smap, &ftrace_stackmap_bin_fops);
+ } else {
+ pr_warn("ftrace stackmap init failed, dedup disabled\n");
+ /*
+ * global_trace is statically defined; its stackmap
+ * field is zero-initialized via BSS, so leaving it
+ * NULL ensures the smp_load_acquire() in
+ * __ftrace_trace_stack() falls back to full stack.
+ */
+ }
+ }
+#endif
create_trace_instances(NULL);
update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..7e7d5e5a35ff 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
TRACE_TIMERLAT,
TRACE_RAW_DATA,
TRACE_FUNC_REPEATS,
+ TRACE_STACK_ID,
__TRACE_LAST_TYPE,
};
@@ -453,6 +454,9 @@ struct trace_array {
struct cond_snapshot *cond_snapshot;
#endif
struct trace_func_repeats __percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+ struct ftrace_stackmap *stackmap;
+#endif
/*
* On boot up, the ring buffer is set to the minimum size, so that
* we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
TRACE_GRAPH_RET); \
IF_ASSIGN(var, ent, struct func_repeats_entry, \
TRACE_FUNC_REPEATS); \
+ IF_ASSIGN(var, ent, struct stack_id_entry, \
+ TRACE_STACK_ID); \
__ftrace_bad_type(); \
} while (0)
@@ -1449,7 +1455,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
# define STACK_FLAGS
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS \
+ C(STACKMAP, "stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP_BIT -1
+#endif
+
#ifdef CONFIG_FUNCTION_PROFILER
+
# define PROFILER_FLAGS \
C(PROF_TEXT_OFFSET, "prof-text-offset"),
# ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1521,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
FUNCTION_FLAGS \
FGRAPH_FLAGS \
STACK_FLAGS \
+ STACKMAP_FLAGS \
BRANCH_FLAGS \
PROFILER_FLAGS \
FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
(void *)__entry->caller[6], (void *)__entry->caller[7])
);
+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+ TRACE_STACK_ID,
+
+ F_STRUCT(
+ __field( int, stack_id )
+ ),
+
+ F_printk("<stack_id %d>", __entry->stack_id)
+);
+
/*
* trace_printk entry:
*/
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
.funcs = &trace_user_stack_funcs,
};
+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+ int flags, struct trace_event *event)
+{
+ struct stack_id_entry *field;
+ struct trace_seq *s = &iter->seq;
+
+ trace_assign_type(field, iter->ent);
+ trace_seq_printf(s, "<stack_id %d>\n", field->stack_id);
+
+ return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+ .trace = trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+ .type = TRACE_STACK_ID,
+ .funcs = &trace_stack_id_funcs,
+};
+
/* TRACE_HWLAT */
static enum print_line_t
trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
&trace_wake_event,
&trace_stack_event,
&trace_user_stack_event,
+ &trace_stack_id_event,
&trace_bputs_event,
&trace_bprint_event,
&trace_print_event,
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 929c84075315..0c97065b0d68 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -14,6 +14,7 @@ static inline int trace_valid_entry(struct trace_entry *entry)
case TRACE_CTX:
case TRACE_WAKE:
case TRACE_STACK:
+ case TRACE_STACK_ID:
case TRACE_PRINT:
case TRACE_BRANCH:
case TRACE_GRAPH_ENT:
--
2.34.1
* [RFC PATCH v3 3/3] trace: add documentation, selftest and tooling for stackmap
2026-05-26 11:52 ` [RFC PATCH v3 " Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 2/3] trace: integrate stackmap into ftrace stack recording path Li Pengfei
@ 2026-05-26 11:52 ` Li Pengfei
2 siblings, 0 replies; 15+ messages in thread
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
To: mhiramat, rostedt
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li,
kernel test robot
From: Pengfei Li <lipengfei28@xiaomi.com>
Add supporting files for the ftrace stackmap feature:
Documentation/trace/ftrace-stackmap.rst:
Documentation covering design, usage, tracefs interface, binary
format, and performance characteristics. Added to the 'Core Tracing
Frameworks' toctree in Documentation/trace/index.rst. Documents:
- Reset requires tracing to be stopped first
- Boot-time activation via trace_options=stackmap
- bits parameter range [10, 18] and worst-case memory usage
- tracefs file modes (0640 / 0440)
- Best-effort snapshot semantics for stack_map_bin
- Counter naming: successes (events served), drops, success_rate
- Gravestone amplification when the pool is exhausted
tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
Functional selftest verifying:
- stackmap tracefs nodes exist
- enabling stackmap + stacktrace produces stack_id events
- stack_map_stat shows non-zero successes and zero drops
- reset clears entries when tracing is stopped
- reset is rejected (-EBUSY) while tracing is active
Test reads trace contents BEFORE switching back to the nop tracer
(tracer_init() unconditionally calls tracing_reset_online_cpus(),
which would empty the ring buffer). The function:tracer dependency
is declared in '# requires:' so ftracetest skips on kernels without
CONFIG_FUNCTION_TRACER instead of failing spuriously. An EXIT trap
restores options/stackmap and options/stacktrace on any exit path.
tools/tracing/stackmap_dump.py:
Python script to parse the binary stack_map_bin export.
Features:
- Automatic endianness detection via magic number
- Batched addr2line via stdin (avoids ARG_MAX with large stacks)
- JSON output mode
- Top-N filtering by ref_count
Binary format: all fields are native-endian. The parser detects
byte order by reading the magic value (0x464D5342, bytes 'F','M','S','B').
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605160010.fakzGVVq-lkp@intel.com/
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
Documentation/trace/ftrace-stackmap.rst | 162 ++++++++++++++++++
Documentation/trace/index.rst | 1 +
.../ftrace/test.d/ftrace/stackmap-basic.tc | 103 +++++++++++
.../test.d/ftrace/stackmap-instance-gate.tc | 42 +++++
tools/tracing/stackmap_dump.py | 150 ++++++++++++++++
5 files changed, 458 insertions(+)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
create mode 100755 tools/tracing/stackmap_dump.py
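
For readers who want to check the binary layout by hand, here is a
minimal sketch of a decoder. It assumes only what trace_stackmap.h in
patch 1/3 defines: a header of four u32 fields, then per stack four u32
fields followed by nr u64 addresses, all native-endian. It is not the
stackmap_dump.py added by this patch, which may be structured
differently; detect_byte_order and parse_stack_map_bin are hypothetical
names.

```python
import struct

MAGIC = 0x464D5342   # FTRACE_STACKMAP_BIN_MAGIC
MAX_DEPTH = 64       # FTRACE_STACKMAP_MAX_DEPTH
HDR_FMT = "IIII"     # magic, version, nr_stacks, reserved
ENT_FMT = "IIII"     # stack_id, nr, ref_count, reserved


def detect_byte_order(blob):
    """Return '<' or '>', whichever byte order decodes the magic."""
    if len(blob) < 4:
        raise ValueError("snapshot too short for a magic value")
    for order in ("<", ">"):
        (magic,) = struct.unpack_from(order + "I", blob, 0)
        if magic == MAGIC:
            return order
    raise ValueError("bad magic: not a stack_map_bin snapshot")


def parse_stack_map_bin(blob):
    """Return (version, {stack_id: (ref_count, [ip, ...])})."""
    order = detect_byte_order(blob)
    hdr = struct.Struct(order + HDR_FMT)
    ent = struct.Struct(order + ENT_FMT)
    _magic, version, nr_stacks, _reserved = hdr.unpack_from(blob, 0)
    off = hdr.size
    stacks = {}
    for _ in range(nr_stacks):
        stack_id, nr, ref_count, _res = ent.unpack_from(blob, off)
        off += ent.size
        if nr > MAX_DEPTH:
            raise ValueError("depth %d exceeds MAX_DEPTH" % nr)
        # The entry header is followed by nr u64 instruction pointers.
        ips = struct.unpack_from("%s%dQ" % (order, nr), blob, off)
        off += 8 * nr
        stacks[stack_id] = (ref_count, list(ips))
    return version, stacks
```

A caller would read the whole file once, for example
open("/sys/kernel/debug/tracing/stack_map_bin", "rb").read(), and pass
the bytes to parse_stack_map_bin(). A truncated snapshot surfaces as
struct.error rather than being silently accepted.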
diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..191347be3664
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li <lipengfei28@xiaomi.com>
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks
+ (default: 14 → 16384 stacks; valid range: 10-18).
+
+ At ``bits=18`` the kernel reserves roughly 130 MB of vmalloc memory
+ for the element pool. Each ``open()`` of ``stack_map_bin`` may
+ briefly allocate a similar amount for a snapshot. The cap is set
+ intentionally to bound memory usage.
+
+Usage
+=====
+
+Enable stack deduplication::
+
+ echo 1 > /sys/kernel/debug/tracing/options/stackmap
+ echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+ echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ``<stack_id N>`` instead of full stack traces::
+
+ sh-1234 [006] d.h.. 123.456789: <stack_id 42>
+
+To view the actual stacks::
+
+ cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+ stack_id 42 [ref 1337, depth 8]
+ [0] schedule+0x48/0xc0
+ [1] schedule_timeout+0x1c/0x30
+ ...
+
+To view statistics::
+
+ cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+ entries: 2500 / 16384
+ table_size: 32768
+ successes: 148923
+ drops: 0
+ success_rate: 100%
+
+To reset the stack map (tracing must be stopped first)::
+
+ echo 0 > /sys/kernel/debug/tracing/tracing_on
+ echo 0 > /sys/kernel/debug/tracing/stack_map
+
+Reset returns ``-EBUSY`` if tracing is currently active, or if another
+reset is already in progress.
+
+Boot-time activation
+====================
+
+The stackmap option can be enabled from the kernel command line::
+
+ trace_options=stackmap,stacktrace
+
+Trace events that fire before the tracefs filesystem is initialized
+(``fs_initcall`` time) fall back to recording full stack traces; once
+``ftrace_stackmap_create()`` runs, subsequent events are deduplicated.
+The crossover is automatic and lossless — no events are dropped, but
+early-boot stacks recorded before the crossover are not deduplicated.
+
+Tracefs Nodes
+=============
+
+The stack_map files are owned by root and not world-readable
+(``stack_map``: 0640; ``stack_map_stat`` and ``stack_map_bin``: 0440).
+
+``stack_map``
+ Text export of all deduplicated stacks with symbol resolution.
+ Writing ``0`` or ``reset`` clears all entries (only when tracing
+ is stopped).
+
+``stack_map_stat``
+ Statistics: entries (allocated unique stacks), table_size,
+ successes (events served), drops (events that fell back to
+ full-stack recording), and success_rate. Drops accumulate when
+ the element pool is exhausted; once that happens, slots that
+ won the cmpxchg but failed to allocate an element remain
+ "claimed but empty" and increase probe pressure for any future
+ insert hashing to the same bucket. Reset (when tracing is
+ stopped) clears these tombstones.
+
+``stack_map_bin``
+ Binary export for efficient userspace consumption. Format:
+
+ - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+ - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+ All fields are written in the kernel's native byte order.
+ Userspace tools detect endianness by reading the magic value.
+ Magic: ``0x464D5342`` (ASCII 'FMSB'), Version: 2.
+
+ The export is a best-effort snapshot allocated at ``open()``;
+ concurrent inserts during the snapshot may be truncated. A
+ bounds check ensures no overflow.
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr. Cliff Click's non-blocking hash table
+algorithm:
+
+- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context
+- **Memory**: Pre-allocated element pool, zero allocation on the hot path
+ (no GFP_ATOMIC failures under memory pressure)
+- **Collision**: Linear probing with a 2x over-provisioned table; probe
+ length is bounded so worst-case insert/lookup is O(1)
+- **Scope**: Currently supports the global trace instance
+- **Hash**: 32-bit jhash with a per-instance random seed; full ``memcmp``
+ confirms matches
+
+Deduplication is best-effort, not strict: if two CPUs race in the
+insert path with the same ``key_hash`` (i.e. the same stack), the
+``cmpxchg`` loser advances by one slot and may insert the same stack
+again. Under heavy contention this can produce a small number of
+duplicate entries for the same stack; ``ref_count`` is then split
+across the duplicates. Total memory is still bounded by the element
+pool size, and lookup correctness is unaffected (each duplicate is
+a self-consistent entry with its own ``stack_id``). The trade-off is
+intentional and keeps the hot path lock-free.
+
+Performance
+===========
+
+Typical results on an aarch64 SMP system (function tracer, 2 seconds):
+
+- Unique stacks: ~3000
+- Dedup rate: 84-98% (depends on workload diversity)
+- Ring buffer savings: ~80% for stack data
+- Overhead per event: ~50ns (one jhash + hash table lookup)
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5d9bf4694d5d..ac8b1141c23a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -33,6 +33,7 @@ the Linux kernel.
ftrace
ftrace-design
ftrace-uses
+ ftrace-stackmap
kprobes
kprobetrace
fprobetrace
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
new file mode 100644
index 000000000000..18fa998ae460
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
@@ -0,0 +1,103 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap basic functionality
+# requires: stack_map options/stackmap function:tracer
+
+# Test that ftrace stackmap deduplication works:
+# 1. Enable stackmap + stacktrace options
+# 2. Run function tracer briefly
+# 3. Verify trace contains <stack_id> events (read BEFORE switching
+# tracer back to nop, since tracer_init() resets the ring buffer)
+# 4. Verify stack_map has entries and zero drops
+# 5. Verify reset is rejected (-EBUSY) while tracing is active
+# 6. Verify reset clears the map when tracing is stopped
+
+fail() {
+ echo "FAIL: $1"
+ exit_fail
+}
+
+# Restore state on any exit (success, fail, or interrupt) so a
+# half-finished test does not leave stacktrace/stackmap enabled.
+cleanup() {
+ disable_tracing 2>/dev/null
+ echo nop > current_tracer 2>/dev/null
+ echo 0 > options/stackmap 2>/dev/null
+ echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+
+# Read trace contents NOW, before switching tracer back to nop.
+# tracer_init() unconditionally calls tracing_reset_online_cpus(),
+# so the ring buffer would be empty after 'echo nop > current_tracer'.
+count=$(grep -c "<stack_id" trace || true)
+: "${count:=0}"
+if [ "$count" -eq 0 ]; then
+ fail "trace has no <stack_id> events"
+fi
+
+# Now safe to switch back and disable options
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat
+entries=$(awk '/^entries:/ {print $2}' stack_map_stat)
+: "${entries:=0}"
+if [ "$entries" -eq 0 ]; then
+ fail "stackmap has zero entries after tracing"
+fi
+
+successes=$(awk '/^successes:/ {print $2}' stack_map_stat)
+: "${successes:=0}"
+if [ "$successes" -eq 0 ]; then
+ fail "stackmap has zero successes"
+fi
+
+drops=$(awk '/^drops:/ {print $2}' stack_map_stat)
+: "${drops:=0}"
+if [ "$drops" -ne 0 ]; then
+ fail "stackmap had $drops drops (pool exhausted?)"
+fi
+
+# Check stack_map text output is parseable
+first_id=$(awk '/^stack_id/ {print $2; exit}' stack_map)
+if [ -z "$first_id" ]; then
+ fail "stack_map output has no stack_id entries"
+fi
+
+# Test that reset is rejected while tracing is active
+enable_tracing
+if echo 0 > stack_map 2>/dev/null; then
+ disable_tracing
+ fail "stackmap reset should fail while tracing is active"
+fi
+disable_tracing
+
+# Test reset works when tracing is stopped
+echo 0 > stack_map
+entries_after=$(awk '/^entries:/ {print $2}' stack_map_stat)
+: "${entries_after:=-1}"
+if [ "$entries_after" -ne 0 ]; then
+ fail "stackmap reset did not clear entries (got $entries_after)"
+fi
+
+echo "stackmap basic test passed: $entries unique stacks, $successes successes, $drops drops"
+exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
new file mode 100644
index 000000000000..49848eac2624
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
@@ -0,0 +1,42 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap option is gated to the top-level trace instance
+# requires: stack_map options/stackmap instances
+
+# The 'stackmap' option is added to TOP_LEVEL_TRACE_FLAGS, matching the
+# convention used for global-only options like 'trace_printk' and 'record-cmd'.
+# Verify that:
+# 1. The global instance exposes options/stackmap and the stack_map* nodes.
+# 2. A newly created secondary instance under instances/ does NOT expose
+# options/stackmap or stack_map* nodes.
+
+fail() {
+ echo "FAIL: $1"
+ rmdir instances/test_stackmap_gate 2>/dev/null
+ exit_fail
+}
+
+# 1. Global instance must expose the option and the nodes
+test -e options/stackmap || fail "options/stackmap missing on global instance"
+test -e stack_map || fail "stack_map missing on global instance"
+test -e stack_map_stat || fail "stack_map_stat missing on global instance"
+test -e stack_map_bin || fail "stack_map_bin missing on global instance"
+
+# 2. Create a secondary instance and verify it does NOT see the option
+# or the stack_map* nodes.
+mkdir instances/test_stackmap_gate || fail "could not create secondary instance"
+
+if [ -e instances/test_stackmap_gate/options/stackmap ]; then
+ fail "secondary instance unexpectedly exposes options/stackmap"
+fi
+
+for f in stack_map stack_map_stat stack_map_bin; do
+ if [ -e instances/test_stackmap_gate/$f ]; then
+ fail "secondary instance unexpectedly has $f"
+ fi
+done
+
+rmdir instances/test_stackmap_gate || fail "could not remove secondary instance"
+
+echo "stackmap option gating to top-level instance works"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..fcd8ddcd97de
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,150 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+ # Pull from device and parse
+ adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+ python3 stackmap_dump.py /tmp/stack_map.bin
+
+ # With vmlinux for offline symbol resolution
+ python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+ # JSON output for tooling
+ python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x464D5342 # ASCII 'FMSB', most-significant byte first
+HEADER_SIZE = 16 # 4 x u32
+ENTRY_SIZE = 16 # 4 x u32
+
+
+def detect_endianness(data):
+ """Detect byte order from magic number in header."""
+ if len(data) < 4:
+ raise ValueError("File too small")
+ magic_le = struct.unpack_from('<I', data, 0)[0]
+ if magic_le == MAGIC:
+ return '<'
+ magic_be = struct.unpack_from('>I', data, 0)[0]
+ if magic_be == MAGIC:
+ return '>'
+ raise ValueError(f"Bad magic: 0x{magic_le:08x} (neither LE nor BE)")
+
+
+def batch_addr2line(vmlinux, addrs):
+ """Resolve multiple addresses in one addr2line invocation."""
+ if not addrs:
+ return {}
+ try:
+ # Feed addresses on stdin to avoid ARG_MAX limits with large
+ # numbers of addresses (one stack can have 30+ frames; a
+ # snapshot can have thousands of unique stacks).
+ stdin = '\n'.join(hex(a) for a in addrs) + '\n'
+ result = subprocess.run(
+ ['addr2line', '-f', '-e', vmlinux],
+ input=stdin, capture_output=True, text=True, timeout=60
+ )
+ lines = result.stdout.split('\n')
+ # addr2line outputs 2 lines per address: function name + source location
+ symbols = {}
+ for i, addr in enumerate(addrs):
+ idx = i * 2
+ if idx < len(lines) and lines[idx] and lines[idx] != '??':
+ symbols[addr] = lines[idx]
+ return symbols
+ except (subprocess.TimeoutExpired, FileNotFoundError) as e:
+ print(f"warning: addr2line failed: {e}", file=sys.stderr)
+ return {}
+
+
+def parse_stackmap_bin(data):
+ """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+ if len(data) < HEADER_SIZE:
+ raise ValueError("File too small for header")
+
+ endian = detect_endianness(data)
+ header_fmt = f'{endian}IIII'
+ entry_fmt = f'{endian}IIII'
+
+ magic, version, nr_stacks, _ = struct.unpack_from(header_fmt, data, 0)
+ if version != 2:
+ raise ValueError(f"Unsupported version: {version}")
+
+ offset = HEADER_SIZE
+ for _ in range(nr_stacks):
+ if offset + ENTRY_SIZE > len(data):
+ break
+ stack_id, nr, ref_count, _ = struct.unpack_from(entry_fmt, data, offset)
+ offset += ENTRY_SIZE
+
+ ips_size = nr * 8
+ if offset + ips_size > len(data):
+ break
+ ips = struct.unpack_from(f'{endian}{nr}Q', data, offset)
+ offset += ips_size
+
+ yield stack_id, ref_count, list(ips)
+
+
+def main():
+ parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+ parser.add_argument('file', help='Path to stack_map_bin file')
+ parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+ parser.add_argument('--json', action='store_true', help='JSON output')
+ parser.add_argument('--top', type=int, default=0,
+ help='Show only top N stacks by ref_count')
+ args = parser.parse_args()
+
+ with open(args.file, 'rb') as f:
+ data = f.read()
+
+ stacks = list(parse_stackmap_bin(data))
+
+ if args.top > 0:
+ stacks.sort(key=lambda x: x[1], reverse=True)
+ stacks = stacks[:args.top]
+
+ # Batch symbol resolution
+ symbols = {}
+ if args.vmlinux:
+ all_addrs = set()
+ for _, _, ips in stacks:
+ all_addrs.update(ips)
+ symbols = batch_addr2line(args.vmlinux, list(all_addrs))
+
+ if args.json:
+ output = []
+ for stack_id, ref_count, ips in stacks:
+ entry = {
+ 'stack_id': stack_id,
+ 'ref_count': ref_count,
+ 'ips': [f'0x{ip:x}' for ip in ips]
+ }
+ if args.vmlinux:
+ entry['symbols'] = [symbols.get(ip, f'0x{ip:x}')
+ for ip in ips]
+ output.append(entry)
+ print(json.dumps(output, indent=2))
+ else:
+ for stack_id, ref_count, ips in stacks:
+ print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+ for i, ip in enumerate(ips):
+ sym = symbols.get(ip, '')
+ if sym:
+ sym = f' {sym}'
+ print(f" [{i}] 0x{ip:x}{sym}")
+ print()
+
+ print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+ main()
--
2.34.1