* [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Li Pengfei @ 2026-05-14 3:49 UTC
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28
From: Pengfei Li <lipengfei28@xiaomi.com>

Hi Steven, all,

This series adds stack trace deduplication to ftrace. In the tests
below it reduces ring buffer usage by 74-78% when stacktrace is enabled.

Problem:

When the stacktrace option is enabled, each trace event stores a full
kernel stack (typically 10-20 frames x 8 bytes = 80-160 bytes). On
production devices with 4-8MB trace buffers, this fills the buffer in
seconds, limiting the usefulness of boot-time tracing and always-on
performance monitoring.

Solution:

A lock-free hash map (modeled after tracing_map.c as suggested by
Steven [1]) that deduplicates stack traces. The ring buffer stores
only a 4-byte stack_id; full stacks are exported via tracefs.

Design (following tracing_map.c pattern):

  - Lock-free insert via cmpxchg (NMI/IRQ/any context safe)
  - Pre-allocated element pool (zero allocation on hot path)
  - Linear probing with 2x over-provisioned table
  - Per-trace_array instance support

We adopted the same lock-free algorithm as tracing_map but with a
purpose-built data structure, because tracing_map's API is designed
for histogram aggregation with fixed-size keys and sum/var fields,
while our use case requires variable-length stack traces with
reference counting.
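
[Editor's sketch] The claim-then-publish insert described above can be
illustrated in userspace C11. This is a simplified sketch, not the code
from the patch: the names (get_id, table, pool), the tiny table size and
the FNV-1a hash (standing in for the kernel's jhash2) are assumptions
made for illustration, and a slot that is still mid-insert is simply
skipped instead of retried.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAP_BITS  4                   /* illustrative; the patch defaults to 14 */
#define MAX_ELTS  (1u << MAP_BITS)    /* pre-allocated element pool size */
#define MAP_SIZE  (MAX_ELTS * 2)      /* 2x over-provisioned slot table */
#define MAX_DEPTH 8

struct elt {
	uint32_t nr;                  /* number of valid entries in ips[] */
	unsigned long ips[MAX_DEPTH];
};

struct slot {
	_Atomic uint32_t key;         /* 0 = free, otherwise hash of the stack */
	struct elt *_Atomic val;      /* NULL until the element is published */
};

static struct slot table[MAP_SIZE];
static struct elt pool[MAX_ELTS];
static atomic_uint next_elt;

/* FNV-1a, a stand-in for jhash2(); never returns 0 (0 marks a free slot). */
static uint32_t hash_ips(const unsigned long *ips, uint32_t nr)
{
	const unsigned char *p = (const unsigned char *)ips;
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < nr * sizeof(*ips); i++)
		h = (h ^ p[i]) * 16777619u;
	return h ? h : 1;
}

/* Returns the slot index used as stack id, or -1 if no id could be assigned. */
static int get_id(const unsigned long *ips, uint32_t nr)
{
	uint32_t key, idx;

	if (nr == 0 || nr > MAX_DEPTH)
		return -1;
	key = hash_ips(ips, nr);
	idx = key & (MAP_SIZE - 1);

	for (uint32_t tries = 0; tries < MAP_SIZE; tries++) {
		struct slot *s = &table[idx];
		uint32_t cur = atomic_load(&s->key);

		if (cur == 0) {
			uint32_t expected = 0;

			/* Claim the free slot; only one thread can win. */
			if (atomic_compare_exchange_strong(&s->key, &expected, key)) {
				unsigned int n = atomic_fetch_add(&next_elt, 1);
				struct elt *e;

				if (n >= MAX_ELTS)
					return -1;   /* pool exhausted */
				e = &pool[n];
				e->nr = nr;
				memcpy(e->ips, ips, nr * sizeof(*ips));
				/* Publish only after the element is filled in. */
				atomic_store_explicit(&s->val, e, memory_order_release);
				return (int)idx;
			}
			cur = expected;              /* lost the race: winner's key */
		}

		if (cur == key) {
			struct elt *e = atomic_load_explicit(&s->val,
							     memory_order_acquire);

			/* Same hash is not enough; compare the full stack. */
			if (e && e->nr == nr &&
			    !memcmp(e->ips, ips, nr * sizeof(*ips)))
				return (int)idx;
		}
		idx = (idx + 1) & (MAP_SIZE - 1);    /* linear probing */
	}
	return -1;
}
```

Calling get_id() twice with the same addresses returns the same index,
while a different stack lands in a different slot even when the hashes
collide, because the full comparison fails and probing moves on.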

Test results (ARM64, Qualcomm SM8850, kernel 6.12):

  - kmem_cache_alloc events, 1 second capture:
      774 unique stacks, 8264 hits, 0 drops, 100% hit rate
      Ring buffer savings: 795KB -> 176KB (78% reduction)

  - Function tracer, 3 seconds:
      3632 unique stacks, 25466 hits, 0 drops
      Ring buffer savings: 2.5MB -> 653KB (74% reduction)
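
[Editor's sketch] The reduction percentages can be re-derived from the
quoted buffer sizes. The helper name reduction_pct is hypothetical, and
1MB is assumed to mean 1024KB.

```c
/* Percentage of ring buffer space saved, given sizes in the same unit. */
static double reduction_pct(double before_kb, double after_kb)
{
	return 100.0 * (before_kb - after_kb) / before_kb;
}
```

With these inputs, reduction_pct(795, 176) is about 77.9 and
reduction_pct(2.5 * 1024, 653) is about 74.5, consistent with the
rounded 78% and 74% figures.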

Note: An earlier prototype using rhashtable crashed in IRQ context
(BUG at rhashtable.h:912), which led us to adopt the tracing_map
cmpxchg-based approach.

Usage:

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace
  # trace output: <stack_id 42>
  # resolve: cat /sys/kernel/debug/tracing/stack_map

[1] https://lore.kernel.org/all/20260513085145.30dd23e0@fedora/

Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap

 Documentation/trace/ftrace-stackmap.rst       | 111 ++++
 kernel/trace/Kconfig                          |  21 +
 kernel/trace/Makefile                         |   1 +
 kernel/trace/trace.c                          |  46 ++
 kernel/trace/trace.h                          |  16 +
 kernel/trace/trace_entries.h                  |  15 +
 kernel/trace/trace_output.c                   |  23 +
 kernel/trace/trace_stackmap.c                 | 569 ++++++++++++++++++
 kernel/trace/trace_stackmap.h                 |  54 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc    |  74 +++
 tools/tracing/stackmap_dump.py                | 120 ++++
 11 files changed, 1050 insertions(+)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100755 tools/tracing/stackmap_dump.py

--
2.34.1

* [RFC PATCH 1/3] trace: add lock-free stackmap for stack trace deduplication
From: Li Pengfei @ 2026-05-14 3:49 UTC
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28

From: Pengfei Li <lipengfei28@xiaomi.com>

Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full stack
traces (80-160 bytes each) in the ring buffer for every event, ftrace
can store a 4-byte stack_id when the stackmap option is enabled.

The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:

  - Lock-free insert via cmpxchg (safe in NMI/IRQ/any context)
  - Pre-allocated element pool (zero allocation on hot path)
  - Linear probing with 2x over-provisioned table
  - Per-trace_array instance support

The stackmap is exported via three tracefs nodes:

  - stack_map: text export with symbol resolution
  - stack_map_stat: statistics (entries, hits, drops, hit_rate)
  - stack_map_bin: binary export for efficient userspace consumption

Kernel command line parameter:

  - ftrace_stackmap.bits=N: set map capacity (2^N unique stacks)

Test results on ARM64 (SM8850, Android 16, kernel 6.12):

  - 774 unique stacks from kmem_cache_alloc in 1 second
  - 100% hit rate, 0 drops
  - 92% hit rate under heavy load (all kmem events)

Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 kernel/trace/Kconfig          |  21 ++
 kernel/trace/Makefile         |   1 +
 kernel/trace/trace_stackmap.c | 569 ++++++++++++++++++++++++++++++++++
 kernel/trace/trace_stackmap.h |  54 ++++
 4 files changed, 645 insertions(+)
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..2a63fd2c9a96 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,27 @@ config STACK_TRACER

 	  Say N if unsure.

+config FTRACE_STACKMAP
+	bool "Ftrace stack map deduplication"
+	depends on TRACING
+	depends on STACKTRACE
+	select KALLSYMS
+	help
+	  This enables a global stack trace hash table for ftrace, inspired
+	  by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+	  only a stack_id in the ring buffer instead of the full stack trace,
+	  significantly reducing trace buffer usage when the same call stacks
+	  appear repeatedly.
+
+	  The deduplicated stacks are exported via:
+	    /sys/kernel/debug/tracing/stack_map
+
+	  Writing to this file resets the stack map. Reading shows all unique
+	  stacks with their stack_id and reference count.
+
+	  Say Y if you want to reduce ftrace buffer usage for stack traces.
+	  Say N if unsure.
+
 config TRACE_PREEMPT_TOGGLE
 	bool
 	help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 1decdce8cbef..f1b6175099cc 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
 obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
 obj-$(CONFIG_NOP_TRACER) += trace_nop.o
 obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
 obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
 obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..c402e7e7f902
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,569 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ * - Lock-free insert via cmpxchg (safe in NMI/IRQ/any context)
+ * - Pre-allocated element pool (zero allocation on hot path)
+ * - Linear probing with 2x over-provisioned table
+ * - Per-trace_array instance support
+ *
+ * The 32-bit jhash of the stack IPs is used as the hash table key.
+ * On hash collision (different stacks, same 32-bit hash), linear
+ * probing finds the next slot. Full stack comparison (memcmp) is
+ * used to confirm matches.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/jhash.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/random.h>
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+	u32		nr;		/* actual number of IPs */
+	atomic_t	ref_count;
+	unsigned long	ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+	u32			key;	/* 0 = free, non-zero = jhash */
+	struct stackmap_elt	*val;	/* NULL until fully published */
+};
+
+struct ftrace_stackmap {
+	unsigned int		map_bits;
+	unsigned int		map_size;	/* 1 << (map_bits + 1) */
+	unsigned int		max_elts;	/* 1 << map_bits */
+	atomic_t		next_elt;	/* index into elts pool */
+	struct stackmap_entry	*entries;	/* hash table */
+	struct stackmap_elt	**elts;		/* pre-allocated pool */
+	atomic_t		resetting;
+	atomic64_t		hits;
+	atomic64_t		drops;
+};
+
+static u32 stackmap_hash_seed;
+
+static unsigned int stackmap_map_bits = 14;	/* 16384 elts, 32768 slots */
+static int __init stackmap_bits_setup(char *str)
+{
+	unsigned long val;
+
+	if (kstrtoul(str, 0, &val))
+		return -EINVAL;
+	val = clamp_val(val, 10, 20);	/* 1K - 1M elts */
+	stackmap_map_bits = val;
+	return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+	int idx;
+
+	idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+	if (idx < smap->max_elts)
+		return smap->elts[idx];
+	return NULL;
+}
+
+static int stackmap_alloc_elts(struct ftrace_stackmap *smap)
+{
+	unsigned int i;
+
+	smap->elts = vzalloc(sizeof(*smap->elts) * smap->max_elts);
+	if (!smap->elts)
+		return -ENOMEM;
+
+	for (i = 0; i < smap->max_elts; i++) {
+		smap->elts[i] = kzalloc(sizeof(struct stackmap_elt), GFP_KERNEL);
+		if (!smap->elts[i])
+			goto fail;
+	}
+	return 0;
+fail:
+	while (i--)
+		kfree(smap->elts[i]);
+	vfree(smap->elts);
+	smap->elts = NULL;
+	return -ENOMEM;
+}
+
+static void stackmap_free_elts(struct ftrace_stackmap *smap)
+{
+	unsigned int i;
+
+	if (!smap->elts)
+		return;
+	for (i = 0; i < smap->max_elts; i++)
+		kfree(smap->elts[i]);
+	vfree(smap->elts);
+	smap->elts = NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(void)
+{
+	struct ftrace_stackmap *smap;
+	static bool seed_initialized;
+	int err;
+
+	smap = kzalloc(sizeof(*smap), GFP_KERNEL);
+	if (!smap)
+		return ERR_PTR(-ENOMEM);
+
+	smap->map_bits = stackmap_map_bits;
+	smap->max_elts = 1 << smap->map_bits;
+	smap->map_size = smap->max_elts * 2;	/* 2x over-provision */
+
+	smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+	if (!smap->entries) {
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	err = stackmap_alloc_elts(smap);
+	if (err) {
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(err);
+	}
+
+	atomic_set(&smap->next_elt, 0);
+	atomic_set(&smap->resetting, 0);
+	atomic64_set(&smap->hits, 0);
+	atomic64_set(&smap->drops, 0);
+
+	if (!seed_initialized) {
+		stackmap_hash_seed = get_random_u32();
+		seed_initialized = true;
+	}
+
+	return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+	if (!smap || IS_ERR(smap))
+		return;
+	stackmap_free_elts(smap);
+	vfree(smap->entries);
+	kfree(smap);
+}
+
+void ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+	unsigned int i;
+
+	if (!smap)
+		return;
+
+	/*
+	 * Reset protocol:
+	 *
+	 * 1. Set resetting=1 so get_id() returns -EINVAL immediately.
+	 *    get_id() callers in NMI/IRQ context will see this and bail
+	 *    out before touching entries or elts.
+	 *
+	 * 2. smp_mb() ensures the resetting store is visible to all CPUs
+	 *    before we start clearing entries. Any get_id() that already
+	 *    passed the resetting check will complete its cmpxchg and
+	 *    WRITE_ONCE(entry->val) before we memset, because:
+	 *    - the cmpxchg claims the slot atomically
+	 *    - WRITE_ONCE(entry->val) happens before we clear entries
+	 *    We accept that a handful of in-flight inserts may write into
+	 *    entries that we are about to clear; those entries will simply
+	 *    be wiped by the memset below, which is safe.
+	 *
+	 * 3. Clear entries table, then reset elt pool.
+	 *
+	 * 4. Clear resetting=0 with another smp_mb() so new get_id()
+	 *    calls see a fully reset map.
+	 */
+	atomic_set(&smap->resetting, 1);
+	smp_mb();
+
+	/* Clear hash table */
+	memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+
+	/* Reset elt pool */
+	for (i = 0; i < smap->max_elts; i++)
+		memset(smap->elts[i], 0, sizeof(struct stackmap_elt));
+
+	atomic_set(&smap->next_elt, 0);
+	atomic64_set(&smap->hits, 0);
+	atomic64_set(&smap->drops, 0);
+
+	smp_mb();
+	atomic_set(&smap->resetting, 0);
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries)
+{
+	u32 key_hash, idx, test_key, trace_len;
+	struct stackmap_entry *entry;
+	struct stackmap_elt *val;
+	int dup_try = 0;
+
+	if (!smap || !nr_entries || atomic_read(&smap->resetting))
+		return -EINVAL;
+	if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+		nr_entries = FTRACE_STACKMAP_MAX_DEPTH;
+
+	trace_len = nr_entries * sizeof(unsigned long);
+	/*
+	 * jhash2() requires the length in u32 units and the data to be
+	 * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+	 * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+	 * directly; the cast to u32* is safe because ips[] is naturally
+	 * aligned to sizeof(unsigned long) >= 4.
+	 */
+	key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+			  stackmap_hash_seed);
+	if (key_hash == 0)
+		key_hash = 1;	/* 0 means free slot */
+
+	idx = key_hash >> (32 - (smap->map_bits + 1));
+
+	while (1) {
+		idx &= (smap->map_size - 1);
+		entry = &smap->entries[idx];
+		test_key = entry->key;
+
+		if (test_key && test_key == key_hash) {
+			val = READ_ONCE(entry->val);
+			if (val && val->nr == nr_entries &&
+			    memcmp(val->ips, ips, trace_len) == 0) {
+				atomic_inc(&val->ref_count);
+				atomic64_inc(&smap->hits);
+				return (int)idx;
+			} else if (unlikely(!val)) {
+				/* Another CPU is mid-insert; retry */
+				dup_try++;
+				if (dup_try > smap->map_size) {
+					atomic64_inc(&smap->drops);
+					break;
+				}
+				continue;
+			}
+		}
+
+		if (!test_key) {
+			/* Free slot: try to claim it */
+			if (!cmpxchg(&entry->key, 0, key_hash)) {
+				struct stackmap_elt *elt;
+
+				elt = stackmap_get_elt(smap);
+				if (!elt) {
+					/*
+					 * Pool exhausted. We claimed this slot with
+					 * cmpxchg but cannot fill it. Leave key set
+					 * so the slot stays "claimed but empty" —
+					 * future lookups will skip it (val == NULL
+					 * triggers the mid-insert retry path which
+					 * will eventually drop). This is safer than
+					 * writing key=0 without cmpxchg, which could
+					 * race with another CPU's cmpxchg on the same
+					 * slot.
+					 */
+					atomic64_inc(&smap->drops);
+					break;
+				}
+
+				elt->nr = nr_entries;
+				atomic_set(&elt->ref_count, 1);
+				memcpy(elt->ips, ips, trace_len);
+
+				/* Ensure elt is fully visible before publish */
+				smp_wmb();
+				WRITE_ONCE(entry->val, elt);
+				atomic64_inc(&smap->hits);
+				return (int)idx;
+			} else {
+				/* cmpxchg failed; someone else claimed it */
+				dup_try++;
+				continue;
+			}
+		}
+
+		idx++;
+		dup_try++;
+		if (dup_try > smap->map_size) {
+			atomic64_inc(&smap->drops);
+			break;
+		}
+	}
+
+	return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+	struct ftrace_stackmap *smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+	u32 i;
+
+	if (!smap)
+		return NULL;
+	for (i = *pos; i < smap->map_size; i++) {
+		if (smap->entries[i].key && smap->entries[i].val) {
+			*pos = i;
+			return &smap->entries[i];
+		}
+	}
+	return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+	u32 i;
+
+	for (i = *pos + 1; i < smap->map_size; i++) {
+		if (smap->entries[i].key && smap->entries[i].val) {
+			*pos = i;
+			return &smap->entries[i];
+		}
+	}
+	*pos = i;
+	return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v) { }
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+	struct stackmap_entry *entry = v;
+	struct stackmap_elt *elt = entry->val;
+	struct stackmap_seq_private *priv = m->private;
+	u32 idx = entry - priv->smap->entries;
+	u32 i;
+
+	if (!elt)
+		return 0;
+
+	seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+		   idx, atomic_read(&elt->ref_count), elt->nr);
+	for (i = 0; i < elt->nr; i++)
+		seq_printf(m, " [%u] %pS\n", i, (void *)elt->ips[i]);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+	.start	= stackmap_seq_start,
+	.next	= stackmap_seq_next,
+	.stop	= stackmap_seq_stop,
+	.show	= stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+	struct stackmap_seq_private *priv;
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open_private(file, &stackmap_seq_ops,
+			       sizeof(struct stackmap_seq_private));
+	if (ret)
+		return ret;
+	m = file->private_data;
+	priv = m->private;
+	priv->smap = inode->i_private;
+	return 0;
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+			      size_t count, loff_t *ppos)
+{
+	struct seq_file *m = file->private_data;
+	struct stackmap_seq_private *priv = m->private;
+	char buf[8];
+	size_t n = min(count, sizeof(buf) - 1);
+
+	if (copy_from_user(buf, ubuf, n))
+		return -EFAULT;
+	buf[n] = '\0';
+	if (n == 0 || (buf[0] != '0' && strncmp(buf, "reset", 5) != 0))
+		return -EINVAL;
+
+	ftrace_stackmap_reset(priv->smap);
+	return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+	.open		= stackmap_open,
+	.read		= seq_read,
+	.write		= stackmap_write,
+	.llseek		= seq_lseek,
+	.release	= seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+	struct ftrace_stackmap *smap = m->private;
+	u32 entries;
+	u64 hits, drops;
+
+	if (!smap) {
+		seq_puts(m, "stackmap not initialized\n");
+		return 0;
+	}
+
+	entries = atomic_read(&smap->next_elt);
+	hits = atomic64_read(&smap->hits);
+	drops = atomic64_read(&smap->drops);
+
+	seq_printf(m, "entries: %u / %u\n", entries, smap->max_elts);
+	seq_printf(m, "table_size: %u\n", smap->map_size);
+	seq_printf(m, "hits: %llu\n", hits);
+	seq_printf(m, "drops: %llu\n", drops);
+	if (hits + drops > 0)
+		seq_printf(m, "hit_rate: %llu%%\n",
+			   hits * 100 / (hits + drops));
+	return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+	.open		= stackmap_stat_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+	size_t size;
+	char data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+	struct ftrace_stackmap *smap = inode->i_private;
+	struct stackmap_bin_snapshot *snap;
+	struct ftrace_stackmap_bin_header *hdr;
+	size_t alloc_size, off;
+	u32 i, nr_stacks;
+
+	if (!smap)
+		return -ENODEV;
+
+	/*
+	 * Allocate based on actual entry count, not max_elts worst case.
+	 * Each entry needs a header struct plus up to MAX_DEPTH u64 IPs.
+	 * Add 1 to nr_entries to avoid zero-size alloc on empty map.
+	 */
+	{
+		u32 nr_entries = atomic_read(&smap->next_elt);
+
+		alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+			     (sizeof(struct ftrace_stackmap_bin_entry) +
+			      FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+	}
+
+	snap = vmalloc(sizeof(*snap) + alloc_size);
+	if (!snap)
+		return -ENOMEM;
+
+	hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+	hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+	hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+	hdr->reserved = 0;
+	off = sizeof(*hdr);
+	nr_stacks = 0;
+
+	for (i = 0; i < smap->map_size; i++) {
+		struct stackmap_entry *entry = &smap->entries[i];
+		struct stackmap_elt *elt;
+		struct ftrace_stackmap_bin_entry *e;
+		u64 *ips_out;
+		u32 k;
+
+		if (!entry->key)
+			continue;
+		elt = READ_ONCE(entry->val);
+		if (!elt)
+			continue;
+
+		e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+		e->stack_id = i;
+		e->nr = elt->nr;
+		e->ref_count = atomic_read(&elt->ref_count);
+		e->reserved = 0;
+		off += sizeof(*e);
+
+		ips_out = (u64 *)(snap->data + off);
+		for (k = 0; k < elt->nr; k++)
+			ips_out[k] = (u64)elt->ips[k];
+		off += elt->nr * sizeof(u64);
+		nr_stacks++;
+	}
+
+	hdr->nr_stacks = nr_stacks;
+	snap->size = off;
+	file->private_data = snap;
+	return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+				 size_t count, loff_t *ppos)
+{
+	struct stackmap_bin_snapshot *snap = file->private_data;
+
+	if (!snap)
+		return -EINVAL;
+	return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+	vfree(file->private_data);
+	return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+	.open		= stackmap_bin_open,
+	.read		= stackmap_bin_read,
+	.llseek		= default_llseek,
+	.release	= stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..74ad649a79f7
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define FTRACE_STACKMAP_MAX_DEPTH	64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC	0x464D5342 /* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION	2
+
+struct ftrace_stackmap_bin_header {
+	u32 magic;
+	u32 version;
+	u32 nr_stacks;
+	u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+	u32 stack_id;
+	u32 nr;
+	u32 ref_count;
+	u32 reserved;
+	/* followed by u64 ips[nr] */
+};
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(void);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries);
+void ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *ftrace_stackmap_create(void) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+					 unsigned long *ips, unsigned int n)
+{ return -ENOSYS; }
+static inline void ftrace_stackmap_reset(struct ftrace_stackmap *s) { }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
--
2.34.1
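
[Editor's sketch] The binary layout declared in trace_stackmap.h (a
16-byte header, then one 16-byte record per stack followed by nr u64
addresses) could be walked from userspace roughly as below. The function
name stackmap_bin_parse and the struct names are hypothetical, and the
sketch assumes the file was produced on a machine with the same byte
order as the reader; it is not the stackmap_dump.py tool from this
series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACKMAP_BIN_MAGIC   0x464D5342u   /* 'FSMB', from trace_stackmap.h */
#define STACKMAP_BIN_VERSION 2u
#define STACKMAP_MAX_DEPTH   64u           /* FTRACE_STACKMAP_MAX_DEPTH */

/* Userspace mirrors of the kernel structs (all fields are u32). */
struct bin_header { uint32_t magic, version, nr_stacks, reserved; };
struct bin_entry  { uint32_t stack_id, nr, ref_count, reserved; };

/*
 * Walk a stack_map_bin image held in memory.
 * Returns the number of stacks, or -1 if the buffer is malformed or
 * truncated. On success *total_refs receives the sum of all ref_counts.
 */
static long stackmap_bin_parse(const unsigned char *buf, size_t len,
			       uint64_t *total_refs)
{
	struct bin_header hdr;
	size_t off = sizeof(hdr);
	uint64_t refs = 0;

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.magic != STACKMAP_BIN_MAGIC ||
	    hdr.version != STACKMAP_BIN_VERSION)
		return -1;

	for (uint32_t i = 0; i < hdr.nr_stacks; i++) {
		struct bin_entry e;
		size_t ips_bytes;

		if (len - off < sizeof(e))
			return -1;
		memcpy(&e, buf + off, sizeof(e));
		off += sizeof(e);

		if (e.nr > STACKMAP_MAX_DEPTH)
			return -1;
		ips_bytes = (size_t)e.nr * sizeof(uint64_t);
		if (len - off < ips_bytes)
			return -1;
		/* A real tool would read and symbolize the addresses here. */
		off += ips_bytes;
		refs += e.ref_count;
	}

	if (total_refs)
		*total_refs = refs;
	return (long)hdr.nr_stacks;
}
```

Copying through memcpy() avoids unaligned accesses, and every length is
checked against the remaining buffer before it is used, so a truncated
read of stack_map_bin is rejected instead of being over-read.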

* [RFC PATCH 2/3] trace: integrate stackmap into ftrace stack recording path
From: Li Pengfei @ 2026-05-14 3:49 UTC
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28

From: Pengfei Li <lipengfei28@xiaomi.com>

Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.

Changes:

  - New TRACE_STACK_ID in trace_type enum
  - New stack_id_entry in trace_entries.h (just 'int stack_id')
  - New TRACE_ITER_STACKMAP trace option flag
  - Modified __ftrace_trace_stack() to call ftrace_stackmap_get_id()
    when stackmap option is active
  - Added stack_id print handler in trace_output.c
  - Added stackmap field to struct trace_array (per-instance support)

The stack_id event is committed unconditionally (no filter check)
since it is a synthetic side-event tied to the parent event which
was already subject to filtering.

Fallback behavior: if stackmap returns an error (pool exhausted or
resetting), the full stack trace is recorded as before.

Usage:

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 kernel/trace/trace.c         | 46 ++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h         | 16 +++++++++++++
 kernel/trace/trace_entries.h | 15 ++++++++++++
 kernel/trace/trace_output.c  | 23 ++++++++++++++++++
 4 files changed, 100 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..c72cb8491217 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@

 #include "trace.h"
 #include "trace_output.h"
+#include "trace_stackmap.h"

 #ifdef CONFIG_FTRACE_STARTUP_TEST
 /*
@@ -2184,6 +2185,37 @@ void __ftrace_trace_stack(struct trace_array *tr,
 	}
 #endif

+#ifdef CONFIG_FTRACE_STACKMAP
+	/*
+	 * If stackmap dedup is enabled, try to store only the stack_id
+	 * in the ring buffer instead of the full stack trace.
+	 */
+	if (tr->trace_flags & TRACE_ITER_STACKMAP) {
+		struct stack_id_entry *sid_entry;
+		int sid;
+
+		sid = ftrace_stackmap_get_id(tr->stackmap, fstack->calls, nr_entries);
+		if (sid >= 0) {
+			event = __trace_buffer_lock_reserve(buffer,
+					TRACE_STACK_ID,
+					sizeof(*sid_entry), trace_ctx);
+			if (!event)
+				goto out;
+			sid_entry = ring_buffer_event_data(event);
+			sid_entry->stack_id = sid;
+			/*
+			 * stack_id is a synthetic side-event attached to a
+			 * primary trace event that was already subject to
+			 * filtering. No per-event filter is defined for
+			 * TRACE_STACK_ID, so commit unconditionally.
+			 */
+			__buffer_unlock_commit(buffer, event);
+			goto out;
+		}
+		/* Fall through to full stack on stackmap failure */
+	}
+#endif
+
 	event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
 				    struct_size(entry, caller, nr_entries),
 				    trace_ctx);
@@ -9222,6 +9254,20 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
 			NULL, &tracing_dyn_info_fops);
 #endif

+#ifdef CONFIG_FTRACE_STACKMAP
+	global_trace.stackmap = ftrace_stackmap_create();
+	if (!IS_ERR(global_trace.stackmap)) {
+		trace_create_file("stack_map", TRACE_MODE_WRITE, NULL,
+				  global_trace.stackmap, &ftrace_stackmap_fops);
+		trace_create_file("stack_map_stat", TRACE_MODE_READ, NULL,
+				  global_trace.stackmap, &ftrace_stackmap_stat_fops);
+		trace_create_file("stack_map_bin", TRACE_MODE_READ, NULL,
+				  global_trace.stackmap, &ftrace_stackmap_bin_fops);
+	} else {
+		pr_warn("ftrace stackmap init failed, dedup disabled\n");
+		global_trace.stackmap = NULL;
+	}
+#endif
 	create_trace_instances(NULL);

 	update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..74f421a89347 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
 	TRACE_TIMERLAT,
 	TRACE_RAW_DATA,
 	TRACE_FUNC_REPEATS,
+	TRACE_STACK_ID,

 	__TRACE_LAST_TYPE,
 };
@@ -453,6 +454,9 @@ struct trace_array {
 	struct cond_snapshot	*cond_snapshot;
 #endif
 	struct trace_func_repeats	__percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+	struct ftrace_stackmap	*stackmap;
+#endif
 	/*
 	 * On boot up, the ring buffer is set to the minimum size, so that
 	 * we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
 			  TRACE_GRAPH_RET);		\
 		IF_ASSIGN(var, ent, struct func_repeats_entry,		\
 			  TRACE_FUNC_REPEATS);				\
+		IF_ASSIGN(var, ent, struct stack_id_entry,		\
+			  TRACE_STACK_ID);				\
 		__ftrace_bad_type();					\
 	} while (0)

@@ -1449,7 +1455,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 # define STACK_FLAGS
 #endif

+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS				\
+		C(STACKMAP,		"stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP		0UL
+#endif
+
 #ifdef CONFIG_FUNCTION_PROFILER
+
 # define PROFILER_FLAGS					\
 		C(PROF_TEXT_OFFSET,	"prof-text-offset"),
 # ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1521,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		FUNCTION_FLAGS					\
 		FGRAPH_FLAGS					\
 		STACK_FLAGS					\
+		STACKMAP_FLAGS					\
 		BRANCH_FLAGS					\
 		PROFILER_FLAGS					\
 		FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
 		 (void *)__entry->caller[6], (void *)__entry->caller[7])
 );

+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+	TRACE_STACK_ID,
+
+	F_STRUCT(
+		__field(	int,		stack_id	)
+	),
+
+	F_printk("<stack_id %d>", __entry->stack_id)
+);
+
 /*
  * trace_printk entry:
  */
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
 	.funcs		= &trace_user_stack_funcs,
 };

+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+					      int flags, struct trace_event *event)
+{
+	struct stack_id_entry *field;
+	struct trace_seq *s = &iter->seq;
+
+	trace_assign_type(field, iter->ent);
+	trace_seq_printf(s, "<stack_id %d>\n", field->stack_id);
+
+	return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+	.trace		= trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+	.type		= TRACE_STACK_ID,
+	.funcs		= &trace_stack_id_funcs,
+};
+
 /* TRACE_HWLAT */
 static enum print_line_t
 trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
 	&trace_wake_event,
 	&trace_stack_event,
 	&trace_user_stack_event,
+	&trace_stack_id_event,
 	&trace_bputs_event,
 	&trace_bprint_event,
 	&trace_print_event,
--
2.34.1

* [RFC PATCH 3/3] trace: add documentation, selftest and tooling for stackmap
From: Li Pengfei @ 2026-05-14 3:49 UTC
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28

From: Pengfei Li <lipengfei28@xiaomi.com>

Add supporting files for the ftrace stackmap feature:

Documentation/trace/ftrace-stackmap.rst:
  Comprehensive documentation covering design, usage, tracefs
  interface, binary format, and performance characteristics.

tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
  Basic functional selftest that verifies:
  - stackmap tracefs nodes exist
  - enabling stackmap + stacktrace produces stack_id events
  - stack_map_stat shows non-zero hits
  - reset clears entries

tools/tracing/stackmap_dump.py:
  Python script to parse the binary stack_map_bin export.
  Supports offline symbol resolution via addr2line, JSON output,
  and top-N filtering by ref_count.

Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 Documentation/trace/ftrace-stackmap.rst       | 111 ++++++++++++++++
 .../ftrace/test.d/ftrace/stackmap-basic.tc    |  74 +++++++++++
 tools/tracing/stackmap_dump.py                | 120 ++++++++++++++++++
 3 files changed, 305 insertions(+)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100755 tools/tracing/stackmap_dump.py

diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..8f6410d4258c
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,111 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li <lipengfei28@xiaomi.com>
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks (default: 14, range: 10-20)
+
+Usage
+=====
+
+Enable stack deduplication::
+
+    echo 1 > /sys/kernel/debug/tracing/options/stackmap
+    echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+    echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ``<stack_id N>`` instead of full stack traces::
+
+    sh-1234 [006] d.h.. 123.456789: <stack_id 42>
+
+To view the actual stacks::
+
+    cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+    stack_id 42 [ref 1337, depth 8]
+     [0] schedule+0x48/0xc0
+     [1] schedule_timeout+0x1c/0x30
+     ...
+
+To view statistics::
+
+    cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+    entries: 2500
+    table_size: 5000
+    hits: 148923
+    drops: 0
+    hit_rate: 98%
+
+To reset the stack map::
+
+    echo 0 > /sys/kernel/debug/tracing/stack_map
+
+Tracefs Nodes
+=============
+
+``stack_map``
+    Text export of all deduplicated stacks with symbol resolution.
+    Writing ``0`` or ``reset`` clears all entries.
+
+``stack_map_stat``
+    Statistics: entry count, hits, drops, and hit rate.
+
+``stack_map_bin``
+    Binary export for efficient userspace consumption. Format:
+
+    - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+    - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+    Magic: ``0x464D5342`` ('FSMB'), Version: 2
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr.
Cliff Click's non-blocking hash table +algorithm: + +- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context +- **Memory**: Pre-allocated element pool, zero allocation on the hot path + (no GFP_ATOMIC failures under memory pressure) +- **Collision**: Linear probing with a 2x over-provisioned table +- **Per-instance**: Each trace_array has its own stackmap, supporting + multiple ftrace instances +- **Hash**: 32-bit jhash of stack IPs; full ``memcmp`` confirms matches + +Performance +=========== + +Typical results on ARM64 Android device (function tracer, 2 seconds): + +- Unique stacks: ~3000 +- Hit rate: 84-98% (depends on workload diversity) +- Ring buffer savings: ~80% for stack data +- Overhead per event: ~50ns (one jhash + hash table lookup) diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc new file mode 100755 index 000000000000..3b0a7f60769f --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc @@ -0,0 +1,74 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: ftrace - stackmap basic functionality +# requires: stack_map options/stackmap + +# Test that ftrace stackmap deduplication works: +# 1. Enable stackmap + stacktrace options +# 2. Run function tracer briefly +# 3. Verify stack_map has entries +# 4. Verify stack_map_stat shows hits +# 5. Verify trace contains <stack_id> events +# 6. 
Verify reset works + +fail() { + echo "FAIL: $1" + exit_fail +} + +disable_tracing +clear_trace + +# Verify stackmap files exist +test -f stack_map || fail "stack_map file missing" +test -f stack_map_stat || fail "stack_map_stat file missing" +test -f stack_map_bin || fail "stack_map_bin file missing" + +# Enable stackmap dedup +echo 1 > options/stackmap +echo 1 > options/stacktrace + +# Run function tracer briefly +echo function > current_tracer +enable_tracing +sleep 1 +disable_tracing +echo nop > current_tracer +echo 0 > options/stackmap + +# Check stack_map_stat has entries +entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}') +if [ "$entries" -eq 0 ]; then + fail "stackmap has zero entries after tracing" +fi + +# Check hits > 0 +hits=$(cat stack_map_stat | grep "^hits:" | awk '{print $2}') +if [ "$hits" -eq 0 ]; then + fail "stackmap has zero hits" +fi + +# Check drops == 0 (pool should be large enough for 1s trace) +drops=$(cat stack_map_stat | grep "^drops:" | awk '{print $2}') + +# Check stack_map text output is parseable +first_id=$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}') +if [ -z "$first_id" ]; then + fail "stack_map output has no stack_id entries" +fi + +# Check trace has stack_id events +count=$(cat trace | grep -c "stack_id" || true) +if [ "$count" -eq 0 ]; then + fail "trace has no <stack_id> events" +fi + +# Test reset +echo 0 > stack_map +entries_after=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}') +if [ "$entries_after" -ne 0 ]; then + fail "stackmap reset did not clear entries" +fi + +echo "stackmap basic test passed: $entries unique stacks, $hits hits, $drops drops" +exit 0 diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py new file mode 100755 index 000000000000..91ce80c681ea --- /dev/null +++ b/tools/tracing/stackmap_dump.py @@ -0,0 +1,120 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: GPL-2.0 +""" +stackmap_dump.py - Parse and display ftrace stack_map_bin 
binary export. + +Usage: + # Pull from device and parse + adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin + python3 stackmap_dump.py /tmp/stack_map.bin + + # With vmlinux for offline symbol resolution + python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux + + # JSON output for tooling + python3 stackmap_dump.py /tmp/stack_map.bin --json +""" + +import struct +import sys +import argparse +import json +import subprocess + +MAGIC = 0x464D5342 # 'FSMB' +HEADER_FMT = '<IIII' # magic, version, nr_stacks, reserved +ENTRY_FMT = '<IIII' # stack_id, nr, ref_count, reserved +HEADER_SIZE = struct.calcsize(HEADER_FMT) +ENTRY_SIZE = struct.calcsize(ENTRY_FMT) + + +def addr2line(vmlinux, addr): + """Resolve address to symbol using addr2line.""" + try: + result = subprocess.run( + ['addr2line', '-f', '-e', vmlinux, hex(addr)], + capture_output=True, text=True, timeout=5 + ) + lines = result.stdout.strip().split('\n') + if len(lines) >= 1 and lines[0] != '??': + return lines[0] + except (subprocess.TimeoutExpired, FileNotFoundError): + pass + return None + + +def parse_stackmap_bin(data): + """Parse binary stackmap data, yield (stack_id, ref_count, [ips]).""" + if len(data) < HEADER_SIZE: + raise ValueError("File too small for header") + + magic, version, nr_stacks, _ = struct.unpack_from(HEADER_FMT, data, 0) + if magic != MAGIC: + raise ValueError(f"Bad magic: 0x{magic:08x}, expected 0x{MAGIC:08x}") + if version not in (1, 2): + raise ValueError(f"Unsupported version: {version}") + + offset = HEADER_SIZE + for _ in range(nr_stacks): + if offset + ENTRY_SIZE > len(data): + break + stack_id, nr, ref_count, _ = struct.unpack_from(ENTRY_FMT, data, offset) + offset += ENTRY_SIZE + + ips_size = nr * 8 + if offset + ips_size > len(data): + break + ips = struct.unpack_from(f'<{nr}Q', data, offset) + offset += ips_size + + yield stack_id, ref_count, list(ips) + + +def main(): + parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin') + 
parser.add_argument('file', help='Path to stack_map_bin file') + parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution') + parser.add_argument('--json', action='store_true', help='JSON output') + parser.add_argument('--top', type=int, default=0, + help='Show only top N stacks by ref_count') + args = parser.parse_args() + + with open(args.file, 'rb') as f: + data = f.read() + + stacks = list(parse_stackmap_bin(data)) + + if args.top > 0: + stacks.sort(key=lambda x: x[1], reverse=True) + stacks = stacks[:args.top] + + if args.json: + output = [] + for stack_id, ref_count, ips in stacks: + entry = { + 'stack_id': stack_id, + 'ref_count': ref_count, + 'ips': [f'0x{ip:x}' for ip in ips] + } + if args.vmlinux: + entry['symbols'] = [addr2line(args.vmlinux, ip) or f'0x{ip:x}' + for ip in ips] + output.append(entry) + print(json.dumps(output, indent=2)) + else: + for stack_id, ref_count, ips in stacks: + print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]") + for i, ip in enumerate(ips): + sym = '' + if args.vmlinux: + resolved = addr2line(args.vmlinux, ip) + if resolved: + sym = f' {resolved}' + print(f" [{i}] 0x{ip:x}{sym}") + print() + + print(f"Total: {len(stacks)} unique stacks", file=sys.stderr) + + +if __name__ == '__main__': + main() -- 2.34.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
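For readers who want to exercise the stack_map_bin layout without a device, here is a minimal round-trip sketch. It assumes the layout documented in this patch (16-byte header, then a 16-byte entry header followed by nr 64-bit IPs per stack) and little-endian encoding, as stackmap_dump.py does. build_blob(), parse_blob() and the sample addresses are hypothetical, made up for illustration; they are not part of the series.

```python
import struct

MAGIC = 0x464D5342        # 'FSMB', as documented in the patch
HEADER_FMT = '<IIII'      # magic, version, nr_stacks, reserved
ENTRY_FMT = '<IIII'       # stack_id, nr, ref_count, reserved


def build_blob(stacks, version=2):
    """Pack [(stack_id, ref_count, [ips])] into the documented layout."""
    blob = struct.pack(HEADER_FMT, MAGIC, version, len(stacks), 0)
    for stack_id, ref_count, ips in stacks:
        blob += struct.pack(ENTRY_FMT, stack_id, len(ips), ref_count, 0)
        blob += struct.pack(f'<{len(ips)}Q', *ips)
    return blob


def parse_blob(data):
    """Inverse of build_blob(); same walk as parse_stackmap_bin()."""
    magic, version, nr_stacks, _ = struct.unpack_from(HEADER_FMT, data, 0)
    if magic != MAGIC:
        raise ValueError(f"bad magic 0x{magic:08x}")
    offset = struct.calcsize(HEADER_FMT)
    out = []
    for _ in range(nr_stacks):
        stack_id, nr, ref_count, _ = struct.unpack_from(ENTRY_FMT, data, offset)
        offset += struct.calcsize(ENTRY_FMT)
        ips = list(struct.unpack_from(f'<{nr}Q', data, offset))
        offset += nr * 8
        out.append((stack_id, ref_count, ips))
    return out


# Made-up sample data, for illustration only.
SAMPLE = [(42, 1337, [0xffff800010001000, 0xffff800010002000]),
          (7, 3, [0xffff800010003000])]
BLOB = build_blob(SAMPLE)
ROUND_TRIP = parse_blob(BLOB)
print(len(BLOB), ROUND_TRIP == SAMPLE)
```

One detail worth knowing when inspecting a dump by hand: with little-endian packing the magic 0x464D5342 appears in the file as the bytes "BSMF", not "FSMB".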
* Re: [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer 2026-05-14 3:49 [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer Li Pengfei ` (2 preceding siblings ...) 2026-05-14 3:49 ` [RFC PATCH 3/3] trace: add documentation, selftest and tooling for stackmap Li Pengfei @ 2026-05-21 15:23 ` Steven Rostedt 2026-05-22 10:40 ` [RFC PATCH v2 " Li Pengfei 2026-05-26 11:52 ` [RFC PATCH v3 " Li Pengfei 5 siblings, 0 replies; 16+ messages in thread From: Steven Rostedt @ 2026-05-21 15:23 UTC (permalink / raw) To: Li Pengfei Cc: linux-trace-kernel, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28 On Thu, 14 May 2026 11:49:13 +0800 Li Pengfei <ljdlns1987@gmail.com> wrote: > From: Pengfei Li <lipengfei28@xiaomi.com> > > Hi Steven, all, > Hi Pengfei, Can you address the Sashiko reviews: https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260514034916.2162517-1-lipengfei28%40xiaomi.com It has a way to copy the comments. Just reply to this series with a paste of Sashiko's review and reply to them to explain why the comments may not be an issue, or submit a new version with fixes if they are issues. Thanks, -- Steve ^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer 2026-05-14 3:49 [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer Li Pengfei ` (3 preceding siblings ...) 2026-05-21 15:23 ` [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer Steven Rostedt @ 2026-05-22 10:40 ` Li Pengfei 2026-05-22 10:40 ` [PATCH v2 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei ` (3 more replies) 2026-05-26 11:52 ` [RFC PATCH v3 " Li Pengfei 5 siblings, 4 replies; 16+ messages in thread From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw) To: linux-trace-kernel Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28, lkp From: Pengfei Li <lipengfei28@xiaomi.com> Hi Steven, all, This is v2 of the ftrace stackmap series. It addresses the Sashiko review at [1] and incorporates the kernel test robot's toctree fix. The series adds stack trace deduplication to ftrace. When the stacktrace option is enabled, the ring buffer stores a 4-byte stack_id instead of a full kernel stack trace, while the full stacks are exported via tracefs. Problem ======= With stacktrace enabled, each trace event stores a full kernel stack (typically 10-20 frames x 8 bytes = 80-160 bytes). On production devices with 4-8 MB trace buffers, this fills the buffer in seconds, limiting the usefulness of boot-time tracing and always-on performance monitoring. Design ====== The implementation is a lock-free hash map modeled after tracing_map.c, as suggested by Steven [2]: - lock-free insert via cmpxchg, safe in NMI/IRQ/any context - pre-allocated element pool, so there is no allocation on the hot path - linear probing with a 2x over-provisioned table - bounded probe length to keep worst-case lookup/insert cost bounded - currently implemented for the global trace instance The ring buffer stores only stack_id. 
Full stacks are exported via: /sys/kernel/debug/tracing/stack_map /sys/kernel/debug/tracing/stack_map_stat /sys/kernel/debug/tracing/stack_map_bin Reset semantics =============== Reset is treated as a control-path operation and is only supported when tracing is stopped on the owning trace_array. Online reset is intentionally not supported. The reset path: - atomically claims reset rights via cmpxchg - rejects reset with -EBUSY if tracing is active - blocks new get_id() callers via the resetting flag - waits for in-flight ftrace callback paths with synchronize_rcu() - clears the map and releases resetting with release semantics Why not reuse tracing_map.c =========================== This series follows the same overall lock-free approach, but uses a purpose-built structure. tracing_map.c is designed for histogram-style aggregation with fixed-size keys and value fields, while this use case needs variable-length stack storage plus reference counting. Why not reuse BPF stackmap ========================== BPF_MAP_TYPE_STACK_TRACE addresses a similar problem, but requires a BPF program and the BPF runtime. This series keeps the functionality inside ftrace and available without CONFIG_BPF. Unlike BPF stackmap, which may replace entries on collision, this design keeps stack_id stable once assigned, which is important because ring buffer events may reference that stack_id long after insertion. Test results ============ Platform: ARM64 Qualcomm SM8850 (8 cores), kernel 6.12, bits=14, tracing sched_switch + kmem_cache_alloc with stacktrace trigger, 5-second capture, default ring buffer. Per-event payload (measured from tracing stats): Event Full stack Stackmap Reduction --------------------- ---------- -------- --------- sched_switch 102 B/entry 48 B/entry -53% kmem_cache_alloc 111 B/entry 44 B/entry -60% In the same 5-second capture window, the smaller per-event footprint translated to many more retained events before wraparound. 
For sched_switch: - without stackmap: 43,950 retained entries - with stackmap: 1,710,044 retained entries During the same runs, the stackmap observed a few thousand unique stacks and no drops. Boot-time activation is also supported via: trace_options=stackmap,stacktrace Events that occur before stackmap initialization fall back to full stack traces; later events are deduplicated. This transition does not itself drop events, but early boot stacks recorded before initialization are not deduplicated. QEMU validation =============== The series also runs cleanly in QEMU on aarch64 (mainline, qemu-system-aarch64, 2 vCPU, virt machine, busybox initrd). A post-init smoke test verified: - stack_map, stack_map_stat, stack_map_bin, and options/stackmap exist - enabling stackmap + stacktrace produces stack_id events - stack_map_stat shows non-zero successes and zero drops - reset is rejected with -EBUSY while tracing is active - reset clears the map when tracing is stopped - stack_map_bin magic is correct Changes since RFC v1 ==================== - tightened reset semantics: reset now requires tracing to be stopped and returns -EBUSY if tracing is active or another reset is in progress - fixed publication/consumption ordering with smp_store_release() / smp_load_acquire() - bounded probe length and added pool-exhaustion fast-path handling - moved hash_seed into struct ftrace_stackmap - switched the element pool to a single flat vmalloc allocation - bounded bits range to [10, 18] to limit worst-case memory usage - fixed TRACE_ITER(STACKMAP) handling - tightened stack_map reset input parsing - renamed stat counters to "successes" / "success_rate" so the meaning is unambiguous (counts events served, including first-time inserts) - added documentation, selftest coverage, and userspace dump tooling Known limitations ================= - Per-instance stackmap support is not included in this series. - The stackmap currently covers kernel stacks only. 
- stack_map_bin is a best-effort snapshot, not a fully atomic export. - trace-cmd / libtraceevent integration is left for follow-up once the binary format settles. Usage ===== echo 1 > /sys/kernel/debug/tracing/options/stackmap echo 1 > /sys/kernel/debug/tracing/options/stacktrace [1] https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260514034916.2162517-1-lipengfei28%40xiaomi.com [2] https://lore.kernel.org/all/20260513085145.30dd23e0@fedora/ Pengfei Li (3): trace: add lock-free stackmap for stack trace deduplication trace: integrate stackmap into ftrace stack recording path trace: add documentation, selftest and tooling for stackmap Documentation/trace/ftrace-stackmap.rst | 145 ++++ Documentation/trace/index.rst | 1 + kernel/trace/Kconfig | 21 + kernel/trace/Makefile | 1 + kernel/trace/trace.c | 66 ++ kernel/trace/trace.h | 16 + kernel/trace/trace_entries.h | 15 + kernel/trace/trace_output.c | 23 + kernel/trace/trace_stackmap.c | 643 ++++++++++++++++++ kernel/trace/trace_stackmap.h | 56 ++ .../ftrace/test.d/ftrace/stackmap-basic.tc | 100 +++ tools/tracing/stackmap_dump.py | 150 ++++ 12 files changed, 1237 insertions(+) create mode 100644 Documentation/trace/ftrace-stackmap.rst create mode 100644 kernel/trace/trace_stackmap.c create mode 100644 kernel/trace/trace_stackmap.h create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc create mode 100755 tools/tracing/stackmap_dump.py -- 2.34.1 ^ permalink raw reply [flat|nested] 16+ messages in thread
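The design points in the cover letter above (2x over-provisioned table, pre-allocated pool, bounded linear probing, stable stack_id) can be modeled in a few lines. This is a single-threaded sketch only, not the kernel code: the class and attribute names are hypothetical, zlib.crc32 stands in for the kernel's jhash2(), a plain store stands in for cmpxchg, and no memory ordering is modeled. It assumes, as patch 1/3 describes, that key 0 marks a free slot and that a slot claimed after pool exhaustion stays claimed but empty.

```python
import zlib

MAX_PROBE = 64  # bounded probe length, as in the series


class StackMapModel:
    """Single-threaded model of the dedup table; no real concurrency."""

    def __init__(self, bits):
        self.max_elts = 1 << bits            # pre-allocated pool size
        self.map_size = 1 << (bits + 1)      # 2x over-provisioned table
        self.shift = 32 - (bits + 1)         # top hash bits pick the slot
        self.keys = [0] * self.map_size      # 0 marks a free slot
        self.vals = [None] * self.map_size   # stored stack, or None
        self.refs = [0] * self.map_size
        self.used = 0
        self.successes = 0
        self.drops = 0

    def get_id(self, ips):
        """Return a stable slot index for this stack, or None on a drop."""
        stack = tuple(ips)
        raw = b''.join(ip.to_bytes(8, 'little') for ip in stack)
        key = zlib.crc32(raw) or 1           # stand-in hash; 0 is reserved
        idx = key >> self.shift
        for _ in range(MAX_PROBE):
            if self.keys[idx] == key and self.vals[idx] == stack:
                self.refs[idx] += 1          # dedup hit
                self.successes += 1
                return idx
            if self.keys[idx] == 0:
                self.keys[idx] = key         # claim (cmpxchg in the kernel)
                if self.used >= self.max_elts:
                    self.drops += 1          # pool exhausted, slot stays empty
                    return None
                self.used += 1
                self.vals[idx] = stack
                self.refs[idx] = 1
                self.successes += 1
                return idx
            idx = (idx + 1) & (self.map_size - 1)
        self.drops += 1                      # probe limit reached
        return None


smap = StackMapModel(bits=4)
a = smap.get_id([0x1000, 0x2000, 0x3000])
b = smap.get_id([0x1000, 0x2000, 0x3000])
c = smap.get_id([0x1000, 0x2000, 0x4000])
print(a == b, a != c, smap.successes, smap.drops)
```

A drop (None here) corresponds to the event falling back to a full stack trace in the ring buffer. Because a slot is never reassigned once filled, an id handed out earlier keeps pointing at the same stack until the map is reset.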
* [PATCH v2 1/3] trace: add lock-free stackmap for stack trace deduplication 2026-05-22 10:40 ` [RFC PATCH v2 " Li Pengfei @ 2026-05-22 10:40 ` Li Pengfei 2026-05-22 10:40 ` [PATCH v2 2/3] trace: integrate stackmap into ftrace stack recording path Li Pengfei ` (2 subsequent siblings) 3 siblings, 0 replies; 16+ messages in thread From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw) To: linux-trace-kernel Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28, lkp From: Pengfei Li <lipengfei28@xiaomi.com> Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel stack traces for the ftrace ring buffer. Instead of storing full stack traces (80-160 bytes each) in the ring buffer for every event, ftrace can store a 4-byte stack_id when the stackmap option is enabled. The implementation is modeled after tracing_map.c (used by hist triggers), using the same lock-free design based on Dr. Cliff Click's non-blocking hash table algorithm: - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context - Pre-allocated element pool (zero allocation on hot path) - Linear probing with 2x over-provisioned table; probe length is bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup is O(1) even when the table is heavily loaded with claimed-but- empty slots from pool exhaustion - Single global instance (initialized for the global trace array) The stackmap is exported via three tracefs nodes: - stack_map: text export with symbol resolution (mode 0640) - stack_map_stat: counters (entries, successes, drops, success_rate) - stack_map_bin: binary export (all fields native-endian) Counter naming: - 'successes' counts events that were successfully assigned a stack_id (covers both first-time inserts and dedup hits). - 'drops' counts events that fell back to recording the full stack (pool exhausted, probe limit reached, or reset in progress). - 'success_rate' is successes / (successes + drops). 
Reset semantics: - Reset is a control-path operation only allowed when tracing is stopped on the owning trace_array. Online reset (with tracing active) is intentionally not supported to keep the proof obligations small. - Reset uses atomic_cmpxchg() to claim the resetting flag, then verifies tracer_tracing_is_on() returns false. The resetting flag itself blocks subsequent get_id() callers; userspace re-enabling tracing after our check still cannot let new insertions through. - synchronize_rcu() drains in-flight get_id() callers from the ftrace callback path, which runs preempt-disabled. - Reset clears the resetting flag with atomic_set_release() so a subsequent get_id() observes a fully cleared map. - Concurrent reset returns -EBUSY; reset while tracing is active returns -EBUSY. Concurrency notes: - entry->val publication uses smp_store_release() paired with smp_load_acquire() in all dereferencing readers (lookup, seq_show, bin_open). seq_start/seq_next only check val for NULL and use READ_ONCE(). - elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before use in seq_show and bin_open. - Pool exhaustion: stackmap_get_elt() short-circuits via atomic_read() before the contended atomic RMW, avoiding cacheline contention once the pool is full. Slots that win cmpxchg but cannot get an elt are left 'claimed but empty'; subsequent lookups treat val==NULL as a miss and probe past them. The bounded probe length keeps per-event cost O(1). Hash key: - Per-instance random seed stored in the stackmap struct (no global state), seeded at create time. - 32-bit jhash is forced to 1 if it lands on 0 (which is the free-slot sentinel). Full memcmp confirms matches. Memory: - Single flat vmalloc for the element pool (no per-elt kzalloc). - bits parameter clamped to [10, 18]: at the maximum bits=18, the element pool is ~130 MB and a stack_map_bin snapshot may briefly allocate another ~130 MB. 
- struct stackmap_bin_snapshot uses u64 (not size_t) for its size field so data[] is 8-byte aligned on both 32-bit and 64-bit architectures, avoiding alignment faults when writing u64 IPs on strict-alignment architectures. Kernel command line parameter: - ftrace_stackmap.bits=N: set map capacity (2^N unique stacks, range 10-18, default 14) Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com> --- kernel/trace/Kconfig | 21 ++ kernel/trace/Makefile | 1 + kernel/trace/trace_stackmap.c | 643 ++++++++++++++++++++++++++++++++++ kernel/trace/trace_stackmap.h | 56 +++ 4 files changed, 721 insertions(+) create mode 100644 kernel/trace/trace_stackmap.c create mode 100644 kernel/trace/trace_stackmap.h diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index e130da35808f..2a63fd2c9a96 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -412,6 +412,27 @@ config STACK_TRACER Say N if unsure. +config FTRACE_STACKMAP + bool "Ftrace stack map deduplication" + depends on TRACING + depends on STACKTRACE + select KALLSYMS + help + This enables a global stack trace hash table for ftrace, inspired + by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store + only a stack_id in the ring buffer instead of the full stack trace, + significantly reducing trace buffer usage when the same call stacks + appear repeatedly. + + The deduplicated stacks are exported via: + /sys/kernel/debug/tracing/stack_map + + Writing to this file resets the stack map. Reading shows all unique + stacks with their stack_id and reference count. + + Say Y if you want to reduce ftrace buffer usage for stack traces. + Say N if unsure. 
+ config TRACE_PREEMPT_TOGGLE bool help diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index 1decdce8cbef..f1b6175099cc 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o obj-$(CONFIG_NOP_TRACER) += trace_nop.o obj-$(CONFIG_STACK_TRACER) += trace_stack.o +obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c new file mode 100644 index 000000000000..b23a60e9286c --- /dev/null +++ b/kernel/trace/trace_stackmap.c @@ -0,0 +1,643 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace + * + * Modeled after tracing_map.c (used by hist triggers), this provides + * a lock-free hash map optimized for the ftrace hot path. The design + * is based on Dr. Cliff Click's non-blocking hash table algorithm. + * + * Key properties: + * - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context + * - Pre-allocated element pool (zero allocation on hot path) + * - Linear probing with 2x over-provisioned table; probe length + * bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup + * cost constant even when the table is heavily loaded + * - Single global instance (initialized for the global trace array) + * + * Reset is a control-path operation, only allowed when tracing is + * stopped on the owning trace_array. The protocol is: + * + * - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights + * and blocks new get_id() callers (they observe resetting=1 and + * return -EINVAL). 
+ * - tracer_tracing_is_on() is checked AFTER the cmpxchg, so the + * resetting flag itself prevents new insertions even if userspace + * re-enables tracing immediately after the check. + * - synchronize_rcu() drains in-flight get_id() callers from the + * ftrace callback path, which runs with preemption disabled. + * + * Online reset (with tracing active) is intentionally not supported + * to keep the design simple and the proof obligations small. + * + * The 32-bit jhash of the stack IPs is the hash table key. On hash + * collision, linear probing finds the next slot and full memcmp + * confirms the match. + * + * Concurrent userspace readers (cat stack_map / stack_map_bin) get + * a best-effort snapshot. They are coherent with the hot path + * (smp_load_acquire on entry->val), but they are not coherent with + * a concurrent reset; since reset requires tracing to be stopped, + * mid-iteration reset can produce truncated or partial output but + * never crashes. + */ + +#include <linux/kernel.h> +#include <linux/slab.h> +#include <linux/jhash.h> +#include <linux/seq_file.h> +#include <linux/kallsyms.h> +#include <linux/vmalloc.h> +#include <linux/atomic.h> +#include <linux/random.h> +#include <linux/rcupdate.h> +#include <linux/log2.h> + +#include "trace.h" +#include "trace_stackmap.h" + +/* + * Bound the linear-probe scan length. With a 2x over-provisioned table, + * a well-distributed hash gives very short probe chains. Capping at 64 + * keeps worst-case lookup O(1) even when the table is heavily loaded + * with claimed-but-empty slots from pool exhaustion. + */ +#define FTRACE_STACKMAP_MAX_PROBE 64 + +/* + * Each pre-allocated element holds one unique stack trace. + * Fixed size: MAX_DEPTH entries regardless of actual depth. + */ +struct stackmap_elt { + u32 nr; /* actual number of IPs */ + atomic_t ref_count; + unsigned long ips[FTRACE_STACKMAP_MAX_DEPTH]; +}; + +/* + * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt. 
+ * key == 0 means the slot is free. + */ +struct stackmap_entry { + u32 key; /* 0 = free, non-zero = jhash */ + struct stackmap_elt *val; /* NULL until fully published */ +}; + +struct ftrace_stackmap { + struct trace_array *tr; /* owning trace_array */ + unsigned int map_bits; + unsigned int map_size; /* 1 << (map_bits + 1) */ + unsigned int max_elts; /* 1 << map_bits */ + u32 hash_seed; /* per-instance jhash seed */ + atomic_t next_elt; /* index into elts pool */ + struct stackmap_entry *entries; /* hash table */ + struct stackmap_elt *elts; /* flat element pool */ + atomic_t resetting; + atomic64_t successes; /* events served (hits + new inserts) */ + atomic64_t drops; +}; + +/* + * Cap the bits parameter to keep worst-case allocations bounded: + * bits=18 → 256K elts, 512K slots, ~130 MB elt pool, ~130 MB bin + * export. + * Smaller workloads should use the default (14) which gives 16K elts + * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel + * parameter for higher unique-stack capacity. + */ +#define FTRACE_STACKMAP_BITS_MIN 10 +#define FTRACE_STACKMAP_BITS_MAX 18 +#define FTRACE_STACKMAP_BITS_DEFAULT 14 + +static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT; +static int __init stackmap_bits_setup(char *str) +{ + unsigned long val; + + if (kstrtoul(str, 0, &val)) + return -EINVAL; + val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX); + stackmap_map_bits = val; + return 0; +} +early_param("ftrace_stackmap.bits", stackmap_bits_setup); + +/* --- Element pool --- */ + +static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap) +{ + int idx; + + /* + * Fast-path early-out once the pool is fully consumed. Avoids + * the contended atomic RMW on next_elt for every traced event + * after the pool is exhausted. 
+ */ + if (atomic_read(&smap->next_elt) >= smap->max_elts) + return NULL; + + idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts); + if (idx < smap->max_elts) + return &smap->elts[idx]; + return NULL; +} + +/* --- Create / Destroy / Reset --- */ + +struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr) +{ + struct ftrace_stackmap *smap; + unsigned int bits; + + smap = kzalloc(sizeof(*smap), GFP_KERNEL); + if (!smap) + return ERR_PTR(-ENOMEM); + + /* Defensive clamp: reject bogus bits even if early_param is bypassed. */ + bits = clamp_val(stackmap_map_bits, + FTRACE_STACKMAP_BITS_MIN, + FTRACE_STACKMAP_BITS_MAX); + + smap->tr = tr; + smap->map_bits = bits; + smap->max_elts = 1U << bits; + smap->map_size = 1U << (bits + 1); /* 2x over-provision */ + BUG_ON(!is_power_of_2(smap->map_size)); + + smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size); + if (!smap->entries) { + kfree(smap); + return ERR_PTR(-ENOMEM); + } + + /* + * Single large vmalloc of the element pool, indexed flat. + * At bits=16 this is 64K * sizeof(struct stackmap_elt). The + * struct is ~520 B (8 + 4 + 4 + 64*8), so total ~33 MB. + */ + smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts); + if (!smap->elts) { + vfree(smap->entries); + kfree(smap); + return ERR_PTR(-ENOMEM); + } + + smap->hash_seed = get_random_u32(); + atomic_set(&smap->next_elt, 0); + atomic_set(&smap->resetting, 0); + atomic64_set(&smap->successes, 0); + atomic64_set(&smap->drops, 0); + + return smap; +} + +void ftrace_stackmap_destroy(struct ftrace_stackmap *smap) +{ + if (!smap || IS_ERR(smap)) + return; + vfree(smap->elts); + vfree(smap->entries); + kfree(smap); +} + +/** + * ftrace_stackmap_reset - clear all entries in the stackmap + * @smap: the stackmap to reset + * + * Returns 0 on success, -EBUSY if another reset is already in + * progress, or if tracing is currently active on the owning + * trace_array. + * + * Online reset (with tracing active) is not supported. 
Caller must + * stop tracing first (echo 0 > tracing_on). + * + * Caller is process context (typically sysfs write handler). + * + * Protocol: + * 1. Atomically claim reset rights via cmpxchg on @resetting. + * 2. Verify tracing is stopped on @smap->tr; if not, release the + * claim and return -EBUSY. The resetting flag itself blocks + * any subsequent get_id() callers. + * 3. synchronize_rcu() drains in-flight get_id() callers from the + * ftrace callback path (which runs preempt-disabled). + * 4. memset entries, elts, and counters. + * 5. Release the resetting flag with release semantics so any new + * get_id() observes a fully cleared map. + */ +int ftrace_stackmap_reset(struct ftrace_stackmap *smap) +{ + if (!smap) + return 0; + + if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0) + return -EBUSY; + + if (smap->tr && tracer_tracing_is_on(smap->tr)) { + atomic_set(&smap->resetting, 0); + return -EBUSY; + } + + /* + * synchronize_rcu() itself is a full barrier; no extra smp_mb() + * is needed before it. It drains in-flight ftrace callbacks that + * may have already passed the resetting check with the old value. + */ + synchronize_rcu(); + + memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size); + memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts); + + atomic_set(&smap->next_elt, 0); + atomic64_set(&smap->successes, 0); + atomic64_set(&smap->drops, 0); + + /* Release resetting=0 so new get_id() observes a cleared map. 
*/ + atomic_set_release(&smap->resetting, 0); + return 0; +} + +/* --- Core: get_id (lock-free, NMI-safe) --- */ + +int ftrace_stackmap_get_id(struct ftrace_stackmap *smap, + unsigned long *ips, unsigned int nr_entries) +{ + u32 key_hash, idx, test_key, trace_len; + struct stackmap_entry *entry; + struct stackmap_elt *val; + int probes = 0; + + if (!smap || !nr_entries || atomic_read(&smap->resetting)) + return -EINVAL; + if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH) + nr_entries = FTRACE_STACKMAP_MAX_DEPTH; + + trace_len = nr_entries * sizeof(unsigned long); + /* + * jhash2() requires the length in u32 units and the data to be + * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so + * trace_len is always a multiple of 8 (hence of 4). Use jhash2 + * directly; the cast to u32* is safe because ips[] is naturally + * aligned to sizeof(unsigned long) >= 4. + */ + key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32), + smap->hash_seed); + if (key_hash == 0) + key_hash = 1; /* 0 means free slot */ + + idx = key_hash >> (32 - (smap->map_bits + 1)); + + while (probes < FTRACE_STACKMAP_MAX_PROBE) { + idx &= (smap->map_size - 1); + entry = &smap->entries[idx]; + test_key = entry->key; + + if (test_key == key_hash) { + /* + * smp_load_acquire pairs with smp_store_release in + * the publisher below; ensures we see fully-formed + * elt fields (nr, ips, ref_count) before dereference. + */ + val = smp_load_acquire(&entry->val); + if (val && val->nr == nr_entries && + memcmp(val->ips, ips, trace_len) == 0) { + atomic_inc(&val->ref_count); + atomic64_inc(&smap->successes); + return (int)idx; + } + /* + * val == NULL: another CPU is mid-insert, or this + * slot is "claimed but empty" (pool exhausted). + * val != NULL but mismatch: 32-bit hash collision + * with a different stack. In both cases, advance. 
+ */ + } else if (!test_key) { + /* Free slot: try to claim it */ + if (cmpxchg(&entry->key, 0, key_hash) == 0) { + struct stackmap_elt *elt; + + elt = stackmap_get_elt(smap); + if (!elt) { + /* + * Pool exhausted. We claimed this + * slot with cmpxchg but cannot fill + * it. Leave key set so the slot + * stays "claimed but empty" — future + * lookups treat val==NULL as a miss + * and probe past it. Cannot revert + * key=0 without racing other CPUs. + */ + atomic64_inc(&smap->drops); + return -ENOSPC; + } + + elt->nr = nr_entries; + atomic_set(&elt->ref_count, 1); + memcpy(elt->ips, ips, trace_len); + + /* + * Publish elt with release semantics so the + * reader's smp_load_acquire can safely + * dereference val->nr / val->ips. + */ + smp_store_release(&entry->val, elt); + atomic64_inc(&smap->successes); + return (int)idx; + } + /* cmpxchg failed; another CPU claimed this slot. */ + } + + idx++; + probes++; + } + + atomic64_inc(&smap->drops); + return -ENOSPC; +} + +/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */ + +struct stackmap_seq_private { + struct ftrace_stackmap *smap; +}; + +static void *stackmap_seq_start(struct seq_file *m, loff_t *pos) +{ + struct stackmap_seq_private *priv = m->private; + struct ftrace_stackmap *smap = priv->smap; + u32 i; + + if (!smap) + return NULL; + for (i = *pos; i < smap->map_size; i++) { + if (smap->entries[i].key && READ_ONCE(smap->entries[i].val)) { + *pos = i; + return &smap->entries[i]; + } + } + return NULL; +} + +static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos) +{ + struct stackmap_seq_private *priv = m->private; + struct ftrace_stackmap *smap = priv->smap; + u32 i; + + if (!smap) + return NULL; + for (i = *pos + 1; i < smap->map_size; i++) { + if (smap->entries[i].key && READ_ONCE(smap->entries[i].val)) { + *pos = i; + return &smap->entries[i]; + } + } + return NULL; +} + +static void stackmap_seq_stop(struct seq_file *m, void *v) { } + +static int stackmap_seq_show(struct seq_file 
*m, void *v) +{ + struct stackmap_entry *entry = v; + struct stackmap_elt *elt = smp_load_acquire(&entry->val); + struct stackmap_seq_private *priv = m->private; + u32 idx = entry - priv->smap->entries; + u32 i, nr; + + if (!elt) + return 0; + + nr = READ_ONCE(elt->nr); + if (nr > FTRACE_STACKMAP_MAX_DEPTH) + nr = FTRACE_STACKMAP_MAX_DEPTH; + + seq_printf(m, "stack_id %u [ref %u, depth %u]\n", + idx, atomic_read(&elt->ref_count), nr); + for (i = 0; i < nr; i++) + seq_printf(m, " [%u] %pS\n", i, (void *)elt->ips[i]); + seq_putc(m, '\n'); + return 0; +} + +static const struct seq_operations stackmap_seq_ops = { + .start = stackmap_seq_start, + .next = stackmap_seq_next, + .stop = stackmap_seq_stop, + .show = stackmap_seq_show, +}; + +static int stackmap_open(struct inode *inode, struct file *file) +{ + struct stackmap_seq_private *priv; + struct seq_file *m; + int ret; + + ret = seq_open_private(file, &stackmap_seq_ops, + sizeof(struct stackmap_seq_private)); + if (ret) + return ret; + m = file->private_data; + priv = m->private; + priv->smap = inode->i_private; + return 0; +} + +/* + * Accept exactly "0" or "reset" (optionally followed by a single newline). + */ +static bool stackmap_write_is_reset(const char *buf, size_t n) +{ + if (n > 0 && buf[n - 1] == '\n') + n--; + return (n == 1 && buf[0] == '0') || + (n == 5 && memcmp(buf, "reset", 5) == 0); +} + +static ssize_t stackmap_write(struct file *file, const char __user *ubuf, + size_t count, loff_t *ppos) +{ + struct seq_file *m = file->private_data; + struct stackmap_seq_private *priv = m->private; + char buf[8]; + size_t n = min(count, sizeof(buf) - 1); + int ret; + + if (n == 0) + return -EINVAL; + if (copy_from_user(buf, ubuf, n)) + return -EFAULT; + buf[n] = '\0'; + + if (!stackmap_write_is_reset(buf, n)) + return -EINVAL; + + /* + * ftrace_stackmap_reset() atomically claims reset rights via + * cmpxchg and returns -EBUSY if another reset is in progress + * or if tracing is active. 
+ */ + ret = ftrace_stackmap_reset(priv->smap); + if (ret) + return ret; + return count; +} + +const struct file_operations ftrace_stackmap_fops = { + .open = stackmap_open, + .read = seq_read, + .write = stackmap_write, + .llseek = seq_lseek, + .release = seq_release_private, +}; + +/* --- Stats --- */ + +static int stackmap_stat_show(struct seq_file *m, void *v) +{ + struct ftrace_stackmap *smap = m->private; + u32 entries; + u64 successes, drops; + + if (!smap) { + seq_puts(m, "stackmap not initialized\n"); + return 0; + } + + entries = atomic_read(&smap->next_elt); + successes = atomic64_read(&smap->successes); + drops = atomic64_read(&smap->drops); + + seq_printf(m, "entries: %u / %u\n", entries, smap->max_elts); + seq_printf(m, "table_size: %u\n", smap->map_size); + seq_printf(m, "successes: %llu\n", successes); + seq_printf(m, "drops: %llu\n", drops); + if (successes + drops > 0) + seq_printf(m, "success_rate: %llu%%\n", + successes * 100 / (successes + drops)); + return 0; +} + +static int stackmap_stat_open(struct inode *inode, struct file *file) +{ + return single_open(file, stackmap_stat_show, inode->i_private); +} + +const struct file_operations ftrace_stackmap_stat_fops = { + .open = stackmap_stat_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +/* --- Binary export --- */ + +struct stackmap_bin_snapshot { + /* + * Use u64 (not size_t) so data[] is 8-byte aligned on both + * 32-bit and 64-bit architectures. The IP array within data[] + * is accessed as u64*, which would alignment-fault on strict + * architectures (e.g. older ARM, SPARC) if data[] started at + * a 4-byte boundary. 
+ */ + u64 size; + char data[]; +}; + +static int stackmap_bin_open(struct inode *inode, struct file *file) +{ + struct ftrace_stackmap *smap = inode->i_private; + struct stackmap_bin_snapshot *snap; + struct ftrace_stackmap_bin_header *hdr; + size_t alloc_size, off; + u32 nr_entries, i, nr_stacks; + + if (!smap) + return -ENODEV; + + /* + * Worst-case allocation size: every populated entry uses a + * full-depth stack. The (+1) gives one slack slot in case a + * concurrent insert lands between this snapshot and iteration. + * The loop below performs an explicit bounds check anyway. + * + * At bits=16 this caps at ~33 MB. The file is mode 0440 + * (TRACE_MODE_READ), so only privileged users can open it. + */ + nr_entries = atomic_read(&smap->next_elt); + alloc_size = sizeof(*hdr) + (nr_entries + 1) * + (sizeof(struct ftrace_stackmap_bin_entry) + + FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64)); + + snap = vmalloc(sizeof(*snap) + alloc_size); + if (!snap) + return -ENOMEM; + + hdr = (struct ftrace_stackmap_bin_header *)snap->data; + hdr->magic = FTRACE_STACKMAP_BIN_MAGIC; + hdr->version = FTRACE_STACKMAP_BIN_VERSION; + hdr->reserved = 0; + off = sizeof(*hdr); + nr_stacks = 0; + + for (i = 0; i < smap->map_size; i++) { + struct stackmap_entry *entry = &smap->entries[i]; + struct stackmap_elt *elt; + struct ftrace_stackmap_bin_entry *e; + u64 *ips_out; + u32 k, nr; + + if (!entry->key) + continue; + elt = smp_load_acquire(&entry->val); + if (!elt) + continue; + + nr = READ_ONCE(elt->nr); + if (nr > FTRACE_STACKMAP_MAX_DEPTH) + nr = FTRACE_STACKMAP_MAX_DEPTH; + + /* Bounds check: stop if we would overflow the allocation. 
*/ + if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size) + break; + + e = (struct ftrace_stackmap_bin_entry *)(snap->data + off); + e->stack_id = i; + e->nr = nr; + e->ref_count = atomic_read(&elt->ref_count); + e->reserved = 0; + off += sizeof(*e); + + ips_out = (u64 *)(snap->data + off); + for (k = 0; k < nr; k++) + ips_out[k] = (u64)elt->ips[k]; + off += nr * sizeof(u64); + nr_stacks++; + } + + hdr->nr_stacks = nr_stacks; + snap->size = off; + file->private_data = snap; + return 0; +} + +static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf, + size_t count, loff_t *ppos) +{ + struct stackmap_bin_snapshot *snap = file->private_data; + + if (!snap) + return -EINVAL; + return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size); +} + +static int stackmap_bin_release(struct inode *inode, struct file *file) +{ + vfree(file->private_data); + return 0; +} + +const struct file_operations ftrace_stackmap_bin_fops = { + .open = stackmap_bin_open, + .read = stackmap_bin_read, + .llseek = default_llseek, + .release = stackmap_bin_release, +}; diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h new file mode 100644 index 000000000000..da51ed919e2c --- /dev/null +++ b/kernel/trace/trace_stackmap.h @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _TRACE_STACKMAP_H +#define _TRACE_STACKMAP_H + +#include <linux/types.h> +#include <linux/atomic.h> + +#define FTRACE_STACKMAP_MAX_DEPTH 64 + +/* Binary export format */ +#define FTRACE_STACKMAP_BIN_MAGIC 0x464D5342 /* 'FSMB' */ +#define FTRACE_STACKMAP_BIN_VERSION 2 + +struct ftrace_stackmap_bin_header { + u32 magic; + u32 version; + u32 nr_stacks; + u32 reserved; +}; + +struct ftrace_stackmap_bin_entry { + u32 stack_id; + u32 nr; + u32 ref_count; + u32 reserved; + /* followed by u64 ips[nr] */ +}; + +struct trace_array; + +#ifdef CONFIG_FTRACE_STACKMAP + +struct ftrace_stackmap; + +struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr); +void 
ftrace_stackmap_destroy(struct ftrace_stackmap *smap); +int ftrace_stackmap_get_id(struct ftrace_stackmap *smap, + unsigned long *ips, unsigned int nr_entries); +int ftrace_stackmap_reset(struct ftrace_stackmap *smap); + +extern const struct file_operations ftrace_stackmap_fops; +extern const struct file_operations ftrace_stackmap_stat_fops; +extern const struct file_operations ftrace_stackmap_bin_fops; + +#else + +struct ftrace_stackmap; +static inline struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr) { return NULL; } +static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { } +static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s, + unsigned long *ips, unsigned int n) +{ return -ENOSYS; } +static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; } + +#endif +#endif /* _TRACE_STACKMAP_H */ -- 2.34.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
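The probe loop in ftrace_stackmap_get_id() from patch 1/3 can be modeled in a few lines of userspace Python. This is only a single-threaded sketch under stated assumptions, not the patch's implementation: zlib.crc32 stands in for the seeded jhash2(), plain assignments stand in for cmpxchg and the release store, and the names StackMapModel and max_probe (with its default of 16) are hypothetical. It keeps the three properties the patch describes: a table twice the size of the element pool, a bounded linear probe, and a slot that stays "claimed but empty" when the pool is exhausted.

```python
import errno
import struct
import zlib


class StackMapModel:
    """Single-threaded model of the stackmap probe loop; not kernel code."""

    def __init__(self, bits=4, max_probe=16):
        self.bits = bits
        self.max_elts = 1 << bits           # element pool capacity
        self.map_size = 1 << (bits + 1)     # 2x over-provisioned table
        self.max_probe = max_probe          # illustrative bound (assumption)
        self.keys = [0] * self.map_size     # 0 means "free slot"
        self.vals = [None] * self.map_size  # None means "claimed but empty"
        self.next_elt = 0
        self.successes = 0
        self.drops = 0

    def _hash(self, ips):
        # Stand-in for jhash2(); any 32-bit hash illustrates the idea.
        data = struct.pack(f"<{len(ips)}Q", *ips)
        h = zlib.crc32(data) & 0xFFFFFFFF
        return h or 1  # 0 is reserved for free slots

    def get_id(self, ips):
        ips = tuple(ips)
        key = self._hash(ips)
        idx = key >> (32 - (self.bits + 1))  # top bits pick the start slot
        for _ in range(self.max_probe):
            idx &= self.map_size - 1
            if self.keys[idx] == key:
                elt = self.vals[idx]
                if elt is not None and elt["ips"] == ips:
                    elt["ref"] += 1
                    self.successes += 1
                    return idx
                # Hash collision or claimed-but-empty slot: keep probing.
            elif self.keys[idx] == 0:
                self.keys[idx] = key  # the kernel claims the slot with cmpxchg
                if self.next_elt >= self.max_elts:
                    self.drops += 1   # slot stays claimed but empty
                    return -errno.ENOSPC
                self.next_elt += 1
                self.vals[idx] = {"ips": ips, "ref": 1}
                self.successes += 1
                return idx
            idx += 1
        self.drops += 1
        return -errno.ENOSPC
```

Calling get_id() twice with the same list of addresses returns the same slot index, which is the value the ring buffer would carry as stack_id. Once max_elts distinct stacks are stored, a new stack yields -ENOSPC and the caller is expected to fall back to recording the full stack.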
* [PATCH v2 2/3] trace: integrate stackmap into ftrace stack recording path 2026-05-22 10:40 ` [RFC PATCH v2 " Li Pengfei 2026-05-22 10:40 ` [PATCH v2 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei @ 2026-05-22 10:40 ` Li Pengfei 2026-05-22 10:40 ` [PATCH v2 3/3] trace: add documentation, selftest and tooling for stackmap Li Pengfei 2026-05-25 6:58 ` [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer Masami Hiramatsu 3 siblings, 0 replies; 16+ messages in thread From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw) To: linux-trace-kernel Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28, lkp From: Pengfei Li <lipengfei28@xiaomi.com> Add TRACE_STACK_ID event type and integrate ftrace_stackmap into __ftrace_trace_stack(). When the 'stackmap' trace option is enabled, the stack recording path stores a 4-byte stack_id in the ring buffer instead of the full stack trace. Changes: - New TRACE_STACK_ID in trace_type enum - New stack_id_entry in trace_entries.h - New TRACE_ITER(STACKMAP) trace option flag; when CONFIG_FTRACE_STACKMAP is disabled, TRACE_ITER_STACKMAP_BIT is defined as -1 so that TRACE_ITER(STACKMAP) evaluates to 0 (following the existing pattern used by TRACE_ITER_PROF_TEXT_OFFSET) - Modified __ftrace_trace_stack() to call ftrace_stackmap_get_id() when the stackmap option is active - Stackmap pointer read with smp_load_acquire(), published with smp_store_release() to ensure proper initialization ordering - NULL check on tr->stackmap prevents dereference if creation failed or if used on a secondary trace instance (graceful fallback) - ftrace_stackmap_create() takes the owning trace_array so the stackmap can later check tracing state during reset - Added stack_id print handler in trace_output.c Fallback behavior: if stackmap returns an error (pool exhausted, resetting, or NULL pointer), the full stack trace is recorded as before — no new failure modes introduced. 
Note: stackmap is currently initialized only for the global trace instance. Secondary instances fall back to full stack recording. Usage: echo 1 > /sys/kernel/debug/tracing/options/stackmap echo 1 > /sys/kernel/debug/tracing/options/stacktrace Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com> --- kernel/trace/trace.c | 66 ++++++++++++++++++++++++++++++++++++ kernel/trace/trace.h | 16 +++++++++ kernel/trace/trace_entries.h | 15 ++++++++ kernel/trace/trace_output.c | 23 +++++++++++++ 4 files changed, 120 insertions(+) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 6eb4d3097a4d..49a675dffad5 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -57,6 +57,7 @@ #include "trace.h" #include "trace_output.h" +#include "trace_stackmap.h" #ifdef CONFIG_FTRACE_STARTUP_TEST /* @@ -2184,6 +2185,43 @@ void __ftrace_trace_stack(struct trace_array *tr, } #endif +#ifdef CONFIG_FTRACE_STACKMAP + /* + * If stackmap dedup is enabled, try to store only the stack_id + * in the ring buffer instead of the full stack trace. + */ + if (tr->trace_flags & TRACE_ITER(STACKMAP)) { + struct ftrace_stackmap *smap; + struct stack_id_entry *sid_entry; + int sid; + + smap = smp_load_acquire(&tr->stackmap); + if (!smap) + goto full_stack; + + sid = ftrace_stackmap_get_id(smap, fstack->calls, nr_entries); + if (sid >= 0) { + event = __trace_buffer_lock_reserve(buffer, + TRACE_STACK_ID, + sizeof(*sid_entry), trace_ctx); + if (!event) + goto out; + sid_entry = ring_buffer_event_data(event); + sid_entry->stack_id = sid; + /* + * stack_id is a synthetic side-event attached to a + * primary trace event that was already subject to + * filtering. No per-event filter is defined for + * TRACE_STACK_ID, so commit unconditionally. 
+ */ + __buffer_unlock_commit(buffer, event); + goto out; + } + /* Fall through to full stack on stackmap failure */ + } +full_stack: +#endif + event = __trace_buffer_lock_reserve(buffer, TRACE_STACK, struct_size(entry, caller, nr_entries), trace_ctx); @@ -9222,6 +9260,34 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work) NULL, &tracing_dyn_info_fops); #endif +#ifdef CONFIG_FTRACE_STACKMAP + { + struct ftrace_stackmap *smap; + + smap = ftrace_stackmap_create(&global_trace); + if (!IS_ERR(smap)) { + /* + * Use smp_store_release to ensure the stackmap + * structure is fully initialized before publishing + * the pointer to concurrent trace event readers. + */ + smp_store_release(&global_trace.stackmap, smap); + trace_create_file("stack_map", TRACE_MODE_WRITE, NULL, + smap, &ftrace_stackmap_fops); + trace_create_file("stack_map_stat", TRACE_MODE_READ, NULL, + smap, &ftrace_stackmap_stat_fops); + trace_create_file("stack_map_bin", TRACE_MODE_READ, NULL, + smap, &ftrace_stackmap_bin_fops); + } else { + pr_warn("ftrace stackmap init failed, dedup disabled\n"); + /* + * global_trace.stackmap is already NULL (static storage); + * leaving it NULL ensures the load-acquire in + * __ftrace_trace_stack falls back to full stack. + */ + } + } +#endif create_trace_instances(NULL); update_tracer_options(); diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 80fe152af1dd..7e7d5e5a35ff 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -57,6 +57,7 @@ enum trace_type { TRACE_TIMERLAT, TRACE_RAW_DATA, TRACE_FUNC_REPEATS, + TRACE_STACK_ID, __TRACE_LAST_TYPE, }; @@ -453,6 +454,9 @@ struct trace_array { struct cond_snapshot *cond_snapshot; #endif struct trace_func_repeats __percpu *last_func_repeats; +#ifdef CONFIG_FTRACE_STACKMAP + struct ftrace_stackmap *stackmap; +#endif /* * On boot up, the ring buffer is set to the minimum size, so that * we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void); TRACE_GRAPH_RET); \ IF_ASSIGN(var, ent, struct func_repeats_entry, \ TRACE_FUNC_REPEATS); \ + IF_ASSIGN(var, ent, struct stack_id_entry, \ + TRACE_STACK_ID); \ __ftrace_bad_type(); \ } while (0) @@ -1449,7 +1455,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf, # define STACK_FLAGS #endif +#ifdef CONFIG_FTRACE_STACKMAP +# define STACKMAP_FLAGS \ + C(STACKMAP, "stackmap"), +#else +# define STACKMAP_FLAGS +# define TRACE_ITER_STACKMAP_BIT -1 +#endif + #ifdef CONFIG_FUNCTION_PROFILER + # define PROFILER_FLAGS \ C(PROF_TEXT_OFFSET, "prof-text-offset"), # ifdef CONFIG_FUNCTION_GRAPH_TRACER @@ -1506,6 +1521,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf, FUNCTION_FLAGS \ FGRAPH_FLAGS \ STACK_FLAGS \ + STACKMAP_FLAGS \ BRANCH_FLAGS \ PROFILER_FLAGS \ FPROFILE_FLAGS diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h index 54417468fdeb..89ed14b7e5fd 100644 --- a/kernel/trace/trace_entries.h +++ b/kernel/trace/trace_entries.h @@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry, (void *)__entry->caller[6], (void *)__entry->caller[7]) ); +/* + * Stack ID entry - stores only a stack_id referencing the stackmap. + * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks. 
+ */ +FTRACE_ENTRY(stack_id, stack_id_entry, + + TRACE_STACK_ID, + + F_STRUCT( + __field( int, stack_id ) + ), + + F_printk("<stack_id %d>", __entry->stack_id) +); + /* * trace_printk entry: */ diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index a5ad76175d10..68678ea88159 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = { .funcs = &trace_user_stack_funcs, }; +/* TRACE_STACK_ID */ +static enum print_line_t trace_stack_id_print(struct trace_iterator *iter, + int flags, struct trace_event *event) +{ + struct stack_id_entry *field; + struct trace_seq *s = &iter->seq; + + trace_assign_type(field, iter->ent); + trace_seq_printf(s, "<stack_id %d>\n", field->stack_id); + + return trace_handle_return(s); +} + +static struct trace_event_functions trace_stack_id_funcs = { + .trace = trace_stack_id_print, +}; + +static struct trace_event trace_stack_id_event = { + .type = TRACE_STACK_ID, + .funcs = &trace_stack_id_funcs, +}; + /* TRACE_HWLAT */ static enum print_line_t trace_hwlat_print(struct trace_iterator *iter, int flags, @@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = { &trace_wake_event, &trace_stack_event, &trace_user_stack_event, + &trace_stack_id_event, &trace_bputs_event, &trace_bprint_event, &trace_print_event, -- 2.34.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
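With patch 2/3 applied, the trace shows only `<stack_id N>` markers, so a reader has to join them against the text in stack_map. The following is a small sketch of that join, assuming the two text formats printed by the patches ("stack_id %u [ref %u, depth %u]" followed by " [%u] %pS" lines, and "<stack_id %d>" in the trace). The function names parse_stack_map and expand_trace and the " => " output prefix are hypothetical.

```python
import re

STACK_HDR = re.compile(r"^stack_id (\d+) \[ref (\d+), depth (\d+)\]$")
FRAME = re.compile(r"^\s*\[(\d+)\] (.+)$")
TRACE_ID = re.compile(r"<stack_id (-?\d+)>")


def parse_stack_map(text):
    """Return {stack_id: [frame, ...]} from stack_map text output."""
    stacks = {}
    current = None
    for line in text.splitlines():
        m = STACK_HDR.match(line.strip())
        if m:
            current = stacks.setdefault(int(m.group(1)), [])
            continue
        m = FRAME.match(line)
        if m and current is not None:
            current.append(m.group(2))
    return stacks


def expand_trace(trace_text, stacks):
    """Yield each trace line, followed by the frames of any stack it references."""
    for line in trace_text.splitlines():
        yield line
        m = TRACE_ID.search(line)
        if not m:
            continue
        frames = stacks.get(int(m.group(1)))
        if frames is None:
            yield " => <unknown stack_id>"
        else:
            for frame in frames:
                yield f" => {frame}"
```

Because stack_id is the slot index in the hash table, the mapping is only meaningful until the next reset. A trace and the stack_map it refers to should therefore be captured together, before anything is written to stack_map.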
* [PATCH v2 3/3] trace: add documentation, selftest and tooling for stackmap 2026-05-22 10:40 ` [RFC PATCH v2 " Li Pengfei 2026-05-22 10:40 ` [PATCH v2 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei 2026-05-22 10:40 ` [PATCH v2 2/3] trace: integrate stackmap into ftrace stack recording path Li Pengfei @ 2026-05-22 10:40 ` Li Pengfei 2026-05-25 6:58 ` [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer Masami Hiramatsu 3 siblings, 0 replies; 16+ messages in thread From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw) To: linux-trace-kernel Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28, lkp From: Pengfei Li <lipengfei28@xiaomi.com> Add supporting files for the ftrace stackmap feature: Documentation/trace/ftrace-stackmap.rst: Documentation covering design, usage, tracefs interface, binary format, and performance characteristics. Added to the 'Core Tracing Frameworks' toctree in Documentation/trace/index.rst. Documents: - Reset requires tracing to be stopped first - Boot-time activation via trace_options=stackmap - bits parameter range [10, 18] and worst-case memory usage - tracefs file modes (0640 / 0440) - Best-effort snapshot semantics for stack_map_bin - Counter naming: successes (events served), drops, success_rate tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc: Functional selftest verifying: - stackmap tracefs nodes exist - enabling stackmap + stacktrace produces stack_id events - stack_map_stat shows non-zero successes and zero drops - reset clears entries when tracing is stopped - reset is rejected (-EBUSY) while tracing is active Uses an EXIT trap to restore options/stackmap and options/stacktrace on any exit path. tools/tracing/stackmap_dump.py: Python script to parse the binary stack_map_bin export. 
Features: - Automatic endianness detection via magic number - Batched addr2line via stdin (avoids ARG_MAX with large stacks) - JSON output mode - Top-N filtering by ref_count Binary format: all fields are native-endian. The parser detects byte order by reading the magic value (0x464D5342 = 'FSMB'). Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202605160010.fakzGVVq-lkp@intel.com/ Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com> --- Documentation/trace/ftrace-stackmap.rst | 145 +++++++++++++++++ Documentation/trace/index.rst | 1 + .../ftrace/test.d/ftrace/stackmap-basic.tc | 100 ++++++++++++ tools/tracing/stackmap_dump.py | 150 ++++++++++++++++++ 4 files changed, 396 insertions(+) create mode 100644 Documentation/trace/ftrace-stackmap.rst create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc create mode 100755 tools/tracing/stackmap_dump.py diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst new file mode 100644 index 000000000000..1230d44d1d23 --- /dev/null +++ b/Documentation/trace/ftrace-stackmap.rst @@ -0,0 +1,145 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== +Ftrace Stack Map +====================== + +:Author: Pengfei Li <lipengfei28@xiaomi.com> + +Overview +======== + +The ftrace stack map provides stack trace deduplication for the ftrace +ring buffer. When enabled, instead of storing full kernel stack traces +(typically 80-160 bytes each) in the ring buffer for every event, ftrace +stores only a 4-byte ``stack_id``. The full stacks are maintained in a +separate hash table and exported via tracefs for userspace to resolve. + +This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated +into ftrace's infrastructure, requiring no userspace daemon. + +Configuration +============= + +Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config. 
+ +Kernel command line parameters: + +- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks + (default: 14 → 16384 stacks; valid range: 10-18). + + At ``bits=18`` the kernel reserves roughly 130 MB of vmalloc memory + for the element pool. Each ``open()`` of ``stack_map_bin`` may + briefly allocate a similar amount for a snapshot. The cap is set + intentionally to bound memory usage. + +Usage +===== + +Enable stack deduplication:: + + echo 1 > /sys/kernel/debug/tracing/options/stackmap + echo 1 > /sys/kernel/debug/tracing/options/stacktrace + echo function > /sys/kernel/debug/tracing/current_tracer + +The trace output will show ``<stack_id N>`` instead of full stack traces:: + + sh-1234 [006] d.h.. 123.456789: <stack_id 42> + +To view the actual stacks:: + + cat /sys/kernel/debug/tracing/stack_map + +Output format:: + + stack_id 42 [ref 1337, depth 8] + [0] schedule+0x48/0xc0 + [1] schedule_timeout+0x1c/0x30 + ... + +To view statistics:: + + cat /sys/kernel/debug/tracing/stack_map_stat + +Output:: + + entries: 2500 / 16384 + table_size: 32768 + successes: 148923 + drops: 0 + success_rate: 100% + +To reset the stack map (tracing must be stopped first):: + + echo 0 > /sys/kernel/debug/tracing/tracing_on + echo 0 > /sys/kernel/debug/tracing/stack_map + +Reset returns ``-EBUSY`` if tracing is currently active, or if another +reset is already in progress. + +Boot-time activation +==================== + +The stackmap option can be enabled from the kernel command line:: + + trace_options=stackmap,stacktrace + +Trace events that fire before the tracefs filesystem is initialized +(``fs_initcall`` time) fall back to recording full stack traces; once +``ftrace_stackmap_create()`` runs, subsequent events are deduplicated. +The crossover is automatic and lossless — no events are dropped, but +early-boot stacks recorded before the crossover are not deduplicated. 
+ +Tracefs Nodes +============= + +The stack_map files are owned by root and not world-readable +(``stack_map``: 0640; ``stack_map_stat`` and ``stack_map_bin``: 0440). + +``stack_map`` + Text export of all deduplicated stacks with symbol resolution. + Writing ``0`` or ``reset`` clears all entries (only when tracing + is stopped). + +``stack_map_stat`` + Statistics: entry count, successes, drops, and success rate. + +``stack_map_bin`` + Binary export for efficient userspace consumption. Format: + + - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32) + - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr) + + All fields are written in the kernel's native byte order. + Userspace tools detect endianness by reading the magic value. + Magic: ``0x464D5342`` ('FSMB'), Version: 2. + + The export is a best-effort snapshot allocated at ``open()``; + stacks inserted during the snapshot may be omitted. A + bounds check ensures no overflow. + +Design +====== + +The stack map is modeled after ``tracing_map.c`` (used by hist triggers), +using a lock-free design based on Dr.
Cliff Click's non-blocking hash table +algorithm: + +- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context +- **Memory**: Pre-allocated element pool, zero allocation on the hot path + (no GFP_ATOMIC failures under memory pressure) +- **Collision**: Linear probing with a 2x over-provisioned table; probe + length is bounded so worst-case insert/lookup is O(1) +- **Scope**: Currently supports the global trace instance +- **Hash**: 32-bit jhash with a per-instance random seed; full ``memcmp`` + confirms matches + +Performance +=========== + +Typical results on ARM64 Android device (function tracer, 2 seconds): + +- Unique stacks: ~3000 +- Hit rate: 84-98% (depends on workload diversity) +- Ring buffer savings: ~80% for stack data +- Overhead per event: ~50ns (one jhash + hash table lookup) diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst index 5d9bf4694d5d..ac8b1141c23a 100644 --- a/Documentation/trace/index.rst +++ b/Documentation/trace/index.rst @@ -33,6 +33,7 @@ the Linux kernel. ftrace ftrace-design ftrace-uses + ftrace-stackmap kprobes kprobetrace fprobetrace diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc new file mode 100755 index 000000000000..34e4e31ff7a1 --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc @@ -0,0 +1,100 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: ftrace - stackmap basic functionality +# requires: stack_map options/stackmap + +# Test that ftrace stackmap deduplication works: +# 1. Enable stackmap + stacktrace options +# 2. Run function tracer briefly +# 3. Verify stack_map has entries +# 4. Verify stack_map_stat shows successes and zero drops +# 5. Verify trace contains <stack_id> events +# 6. Verify reset works when tracing is stopped +# 7. 
Verify reset is rejected (-EBUSY) while tracing is active
+
+fail() {
+    echo "FAIL: $1"
+    exit_fail
+}
+
+# Restore state on any exit (success, fail, or interrupt) so a
+# half-finished test does not leave stacktrace/stackmap enabled.
+cleanup() {
+    disable_tracing 2>/dev/null
+    echo nop > current_tracer 2>/dev/null
+    echo 0 > options/stackmap 2>/dev/null
+    echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat has entries (default empty to avoid [: too many args)
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries:=0}"
+if [ "$entries" -eq 0 ]; then
+    fail "stackmap has zero entries after tracing"
+fi
+
+# Check successes > 0
+successes=$(cat stack_map_stat | grep "^successes:" | awk '{print $2}')
+: "${successes:=0}"
+if [ "$successes" -eq 0 ]; then
+    fail "stackmap has zero successes"
+fi
+
+# Check drops == 0 (pool should be large enough for 1s trace)
+drops=$(cat stack_map_stat | grep "^drops:" | awk '{print $2}')
+: "${drops:=0}"
+if [ "$drops" -ne 0 ]; then
+    fail "stackmap had $drops drops (pool exhausted?)"
+fi
+
+# Check stack_map text output is parseable
+first_id=$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}')
+if [ -z "$first_id" ]; then
+    fail "stack_map output has no stack_id entries"
+fi
+
+# Check trace has stack_id events
+count=$(grep -c "stack_id" trace || true)
+if [ "$count" -eq 0 ]; then
+    fail "trace has no <stack_id> events"
+fi
+
+# Test reset (tracing must be stopped; disable_tracing was called above)
+echo 0 > stack_map
+entries_after=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries_after:=-1}"
+if [ "$entries_after" -ne 0 ]; then
+    fail "stackmap reset did not clear entries (got $entries_after)"
+fi
+
+# Test that reset is rejected while tracing is active
+enable_tracing
+if echo 0 > stack_map 2>/dev/null; then
+    disable_tracing
+    fail "stackmap reset should fail while tracing is active"
+fi
+disable_tracing
+
+echo "stackmap basic test passed: $entries unique stacks, $successes successes, $drops drops"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..fc5d0c9cf0af
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,150 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+    # Pull from device and parse
+    adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+    python3 stackmap_dump.py /tmp/stack_map.bin
+
+    # With vmlinux for offline symbol resolution
+    python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+    # JSON output for tooling
+    python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x464D5342  # 'FSMB'
+HEADER_SIZE = 16  # 4 x u32
+ENTRY_SIZE = 16  # 4 x u32
+
+
+def detect_endianness(data):
+    """Detect byte order from magic number in header."""
+    if len(data) < 4:
+        raise ValueError("File too small")
+    magic_le = struct.unpack_from('<I', data, 0)[0]
+    if magic_le == MAGIC:
+        return '<'
+    magic_be = struct.unpack_from('>I', data, 0)[0]
+    if magic_be == MAGIC:
+        return '>'
+    raise ValueError(f"Bad magic: 0x{magic_le:08x} (neither LE nor BE)")
+
+
+def batch_addr2line(vmlinux, addrs):
+    """Resolve multiple addresses in one addr2line invocation."""
+    if not addrs:
+        return {}
+    try:
+        # Feed addresses on stdin to avoid ARG_MAX limits with large
+        # numbers of addresses (one stack can have 30+ frames; a
+        # snapshot can have thousands of unique stacks).
+        stdin = '\n'.join(hex(a) for a in addrs) + '\n'
+        result = subprocess.run(
+            ['addr2line', '-f', '-e', vmlinux],
+            input=stdin, capture_output=True, text=True, timeout=60
+        )
+        lines = result.stdout.split('\n')
+        # addr2line outputs 2 lines per address: function name + source location
+        symbols = {}
+        for i, addr in enumerate(addrs):
+            idx = i * 2
+            if idx < len(lines) and lines[idx] and lines[idx] != '??':
+                symbols[addr] = lines[idx]
+        return symbols
+    except (subprocess.TimeoutExpired, FileNotFoundError) as e:
+        print(f"warning: addr2line failed: {e}", file=sys.stderr)
+        return {}
+
+
+def parse_stackmap_bin(data):
+    """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+    if len(data) < HEADER_SIZE:
+        raise ValueError("File too small for header")
+
+    endian = detect_endianness(data)
+    header_fmt = f'{endian}IIII'
+    entry_fmt = f'{endian}IIII'
+
+    magic, version, nr_stacks, _ = struct.unpack_from(header_fmt, data, 0)
+    if version not in (1, 2):
+        raise ValueError(f"Unsupported version: {version}")
+
+    offset = HEADER_SIZE
+    for _ in range(nr_stacks):
+        if offset + ENTRY_SIZE > len(data):
+            break
+        stack_id, nr, ref_count, _ = struct.unpack_from(entry_fmt, data, offset)
+        offset += ENTRY_SIZE
+
+        ips_size = nr * 8
+        if offset + ips_size > len(data):
+            break
+        ips = struct.unpack_from(f'{endian}{nr}Q', data, offset)
+        offset += ips_size
+
+        yield stack_id, ref_count, list(ips)
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+    parser.add_argument('file', help='Path to stack_map_bin file')
+    parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+    parser.add_argument('--json', action='store_true', help='JSON output')
+    parser.add_argument('--top', type=int, default=0,
+                        help='Show only top N stacks by ref_count')
+    args = parser.parse_args()
+
+    with open(args.file, 'rb') as f:
+        data = f.read()
+
+    stacks = list(parse_stackmap_bin(data))
+
+    if args.top > 0:
+        stacks.sort(key=lambda x: x[1], reverse=True)
+        stacks = stacks[:args.top]
+
+    # Batch symbol resolution
+    symbols = {}
+    if args.vmlinux:
+        all_addrs = set()
+        for _, _, ips in stacks:
+            all_addrs.update(ips)
+        symbols = batch_addr2line(args.vmlinux, list(all_addrs))
+
+    if args.json:
+        output = []
+        for stack_id, ref_count, ips in stacks:
+            entry = {
+                'stack_id': stack_id,
+                'ref_count': ref_count,
+                'ips': [f'0x{ip:x}' for ip in ips]
+            }
+            if args.vmlinux:
+                entry['symbols'] = [symbols.get(ip, f'0x{ip:x}')
+                                    for ip in ips]
+            output.append(entry)
+        print(json.dumps(output, indent=2))
+    else:
+        for stack_id, ref_count, ips in stacks:
+            print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+            for i, ip in enumerate(ips):
+                sym = symbols.get(ip, '')
+                if sym:
+                    sym = f' {sym}'
+                print(f"  [{i}] 0x{ip:x}{sym}")
+            print()
+
+    print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+    main()
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
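For readers who want to exercise the parser without a device, the layout that stackmap_dump.py expects can be sketched as a round trip. This is only an illustration under stated assumptions: the field order (header: magic, version, nr_stacks, one ignored u32; entry: stack_id, nr, ref_count, one ignored u32; then nr 64-bit IPs) is inferred from the parser above, not from the kernel side of the patch. The helper names build_blob() and parse_blob() and the sample addresses are hypothetical.

```python
import struct

MAGIC = 0x464D5342    # magic value used by stackmap_dump.py
HEADER_FMT = "IIII"   # magic, version, nr_stacks, u32 ignored by the parser
ENTRY_FMT = "IIII"    # stack_id, nr, ref_count, u32 ignored by the parser


def build_blob(stacks, endian="<", version=1):
    """Pack [(stack_id, ref_count, [ips])] into the assumed layout."""
    out = struct.pack(endian + HEADER_FMT, MAGIC, version, len(stacks), 0)
    for stack_id, ref_count, ips in stacks:
        out += struct.pack(endian + ENTRY_FMT, stack_id, len(ips), ref_count, 0)
        out += struct.pack(f"{endian}{len(ips)}Q", *ips)
    return out


def parse_blob(data):
    """Walk the blob the same way parse_stackmap_bin() does."""
    # The byte order is whichever one makes the first u32 equal MAGIC.
    endian = "<" if struct.unpack_from("<I", data, 0)[0] == MAGIC else ">"
    magic, _version, nr_stacks, _ = struct.unpack_from(endian + HEADER_FMT, data, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    offset = struct.calcsize(endian + HEADER_FMT)
    result = []
    for _ in range(nr_stacks):
        stack_id, nr, ref_count, _ = struct.unpack_from(
            endian + ENTRY_FMT, data, offset)
        offset += struct.calcsize(endian + ENTRY_FMT)
        ips = list(struct.unpack_from(f"{endian}{nr}Q", data, offset))
        offset += nr * 8
        result.append((stack_id, ref_count, ips))
    return result


# Hypothetical sample data: two stacks with made-up kernel addresses.
sample = [
    (7, 3, [0xFFFFFFC0080A1234, 0xFFFFFFC0080B5678]),
    (9, 1, [0xFFFFFFC008001000]),
]
blob = build_blob(sample)
print(len(blob), parse_blob(blob) == sample)
```

Writing blob to a file and passing that file to stackmap_dump.py should list the same two stacks, provided the assumed layout matches what the kernel actually exports.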
* Re: [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer
2026-05-22 10:40 ` [RFC PATCH v2 " Li Pengfei
` (2 preceding siblings ...)
2026-05-22 10:40 ` [PATCH v2 3/3] trace: add documentation, selftest and tooling for stackmap Li Pengfei
@ 2026-05-25 6:58 ` Masami Hiramatsu
2026-05-25 7:39 ` Li Pengfei
3 siblings, 1 reply; 16+ messages in thread
From: Masami Hiramatsu @ 2026-05-25 6:58 UTC (permalink / raw)
To: Li Pengfei
Cc: linux-trace-kernel, rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28, lkp
Hi Pengfei,
On Fri, 22 May 2026 18:40:14 +0800
Li Pengfei <ljdlns1987@gmail.com> wrote:
> From: Pengfei Li <lipengfei28@xiaomi.com>
>
> Hi Steven, all,
>
> This is v2 of the ftrace stackmap series. It addresses the Sashiko
> review at [1] and incorporates the kernel test robot's toctree fix.
>
> The series adds stack trace deduplication to ftrace. When the
> stacktrace option is enabled, the ring buffer stores a 4-byte
> stack_id instead of a full kernel stack trace, while the full
> stacks are exported via tracefs.
Sashiko still made some comments on the series. Please review it.
https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com
And reply to the comment on this thread, so that we can discuss it here.
Thanks,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer
2026-05-25 6:58 ` [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer Masami Hiramatsu
@ 2026-05-25 7:39 ` Li Pengfei
0 siblings, 0 replies; 16+ messages in thread
From: Li Pengfei @ 2026-05-25 7:39 UTC (permalink / raw)
To: mhiramat
Cc: linux-trace-kernel, rostedt, linux-kernel, cmllamas, zhangbo56, lipengfei28, lkp
Hi Masami,
I went through the Sashiko comments on v2 [1]. Per-finding response
below; v3 will incorporate the fixes.
[1] https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com
Patch 1/3:
- memset() torn reads against lockless readers: agreed, the reset
  path is not well serialized against tracefs readers. Will tighten
  slow-path synchronization in v3.
- seq_next() not advancing *pos on EOF: agreed, will fix in v3.
- atomic_read(&resetting) without acquire: agreed, will switch to
  atomic_read_acquire() in v3.
- Plain reads of entry->key: agreed, will use READ_ONCE() in v3.
- atomic64_inc() in NMI-safe hot path on 32-bit GENERIC_ATOMIC64:
  agreed, will move the counters off the hot path (local_t /
  per-CPU) in v3.
Patch 2/3:
- TRACE_STACK_ID not in trace_valid_entry(): agreed, will add in v3.
- "NULL from kzalloc" comment: wording bug, will correct in v3.
- Reset memset synchronization: same fix as patch 1, finding 1.
Patch 3/3:
- Selftest missing 'function:tracer' in '# requires:': agreed, will
  add in v3.
- Selftest wiping the ring buffer via 'echo nop > current_tracer'
  before reading trace: agreed, will reorder in v3.
I'll send v3 once the changes are tested.
Pengfei
^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC PATCH v3 0/3] trace: stack trace deduplication for ftrace ring buffer
2026-05-14 3:49 [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer Li Pengfei
` (4 preceding siblings ...)
2026-05-22 10:40 ` [RFC PATCH v2 " Li Pengfei
@ 2026-05-26 11:52 ` Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei
` (3 more replies)
5 siblings, 4 replies; 16+ messages in thread
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
To: mhiramat, rostedt
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
From: Pengfei Li <lipengfei28@xiaomi.com>
Hi Masami, Steven, all,
This is v3 of the ftrace stackmap series. It addresses the Sashiko
review on v2 [1] that Masami pointed out.
[1] https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com
The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, while the full
stacks are exported via tracefs.
Rebased onto v7.1-rc5 (e8c2f9fdadee) before sending.
Changes since v2
================
Patch 1 (lock-free stackmap):
- Hot-path counters changed from atomic64_t to per-CPU local_t.
  This avoids the raw_spinlock_t fallback that atomic64_t uses on
  32-bit GENERIC_ATOMIC64, which would deadlock from NMI context.
- reset() now serializes against tracefs readers via an
  rw_semaphore (held for write during the clearing memset, held
  for read by seq_file iteration and bin snapshot construction).
  synchronize_rcu() alone was insufficient because seq_file/bin
  readers are in process context, not preempt-disabled.
- get_id() uses atomic_read_acquire() on smap->resetting so
  subsequent loads of entry->key/val are properly ordered after
  the check (LKMM control dependencies only order stores).
- All plain reads of entry->key now use READ_ONCE() to avoid LKMM
  data races with the cmpxchg writer.
- val->nr in the hot path now uses READ_ONCE() to keep style
  consistent with the seq_show / bin_open readers.
- stackmap_seq_next() now updates *pos past map_size on EOF so
  seq_read() terminates instead of looping on the last element.
- Added a comment in the cmpxchg-claim path documenting that two
  CPUs racing with the same key_hash may produce a small number of
  duplicate entries; this is an accepted trade-off for keeping the
  hot path lock-free.
- Removed BUG_ON in create path (the constraint is satisfied by
  construction; no runtime check needed).
Patch 2 (integration):
- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and
  ZEROED_TRACE_FLAGS so the option is only exposed under the
  top-level trace instance, matching the convention used for other
  global-only options such as 'printk' and 'record-cmd'. Secondary
  instances under tracing/instances/*/ no longer see the option at
  all, instead of seeing it as a silent no-op.
- TRACE_STACK_ID added to trace_valid_entry() in trace_selftest.c
  so ftrace startup selftests don't reject the entry type.
- Corrected a comment about how global_trace.stackmap is
  zero-initialized (BSS, not kzalloc).
Patch 3 (docs / selftest / tooling):
- Selftest now reads trace contents BEFORE switching back to the
  nop tracer (tracer_init() calls tracing_reset_online_cpus() which
  would have emptied the ring buffer).
- Added 'function:tracer' to the selftest '# requires:' line so
  ftracetest skips when CONFIG_FUNCTION_TRACER is disabled instead
  of failing spuriously.
- Selftest grep tightened to '<stack_id' to avoid future
  false-positives if any other tracepoint name contains "stack_id".
- New stackmap-instance-gate.tc selftest asserts the option and
  stack_map* nodes are present on the global instance and absent on
  a freshly-created secondary instance, locking in the
  TOP_LEVEL_TRACE_FLAGS gating behavior introduced in patch 2.
- Documentation Performance section made vendor-neutral ("aarch64
  SMP system" instead of a specific device name) and the term "Hit
  rate" replaced with "Dedup rate" to match the actual stat field
  name (success_rate).
- Documentation Design section now states that deduplication is
  best-effort under heavy contention (cmpxchg races may produce a
  small number of duplicate entries for the same stack), so users
  observing entries > unique-stacks have a documented explanation.
Test results
============
Device: Xiaomi SM8850 (ARM64), Android 16, kernel 6.12 (OGKI)
Config: CONFIG_FTRACE_STACKMAP=y, bits=14 (16384 elts, 32768 slots)
Method: 5-second capture with stacktrace trigger
Functional tests (all PASS):
- tracefs nodes (stack_map / stack_map_stat / stack_map_bin) exist
- options/stackmap writable, trace shows <stack_id N>
- stack_map text export with correct symbols
- reset clears entries when tracing stopped
- reset rejected (-EBUSY) while tracing active
- per-event trigger: only specified events get stacks
Performance (sched_switch, 5s):
  entries: 466 / 16384
  successes: 9159
  drops: 0
  success_rate: 100%
  dedup rate: 95.2% (466 unique stacks / 9625 total events)
Performance (kmem_cache_alloc, 5s):
  entries: 1177 / 16384
  successes: 60078
  drops: 0
  success_rate: 100%
  dedup rate: 98.1% (1177 unique stacks / 61255 total events)
Ring buffer space savings:
  Event             Full stack        Stackmap                   Saving
  ----------------  ----------------  -------------------------  ------
  sched_switch      9625×88B=847KB    12B×9625+88B×466=156KB     82%
  kmem_cache_alloc  61255×88B=5.4MB   12B×61255+88B×1177=839KB   84%
QEMU validation (v3 base: v7.1-rc5)
===================================
The series boots cleanly on aarch64 QEMU. A post-init smoke test
(12/12 PASS) verified all functional behaviors including:
- tracefs nodes appear with correct file modes
- stack_id events emitted, kernel symbols resolve correctly
  (e.g. __schedule+0x7cc/0x1138)
- reset rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- per-CPU local_t counters aggregate correctly across CPUs
- stack_map_bin magic correct (0x464D5342 'FSMB')
- 'stackmap' option visible on the global instance, hidden on
  secondary instances under tracing/instances/*/
Boot-time activation via 'trace_options=stackmap,stacktrace' works:
events that fire before stackmap initialization fall back to
recording full stack traces; later events are deduplicated. No
events are dropped due to the transition.
Known limitations
=================
- Per-instance stackmap support is not included in this series.
  Following the convention used for other global-only options
  (PRINTK, RECORD_CMD), the 'stackmap' option is gated to the
  top-level trace instance via TOP_LEVEL_TRACE_FLAGS, so it is not
  exposed under tracing/instances/*/options/. Per-instance maps
  would be a follow-up.
- The element pool is allocated eagerly at fs_initcall when
  CONFIG_FTRACE_STACKMAP=y, regardless of whether userspace will
  ever enable the option. At the default bits=14 this is roughly
  8 MB of vmalloc; at the maximum bits=18, ~135 MB. The eager
  allocation keeps the hot path entirely allocation-free and avoids
  any allocation-failure path under tracing pressure. Lazy
  allocation on first 'echo 1 > options/stackmap' is a reasonable
  follow-up if maintainers prefer that trade-off.
- Deduplication is best-effort, not strict: under heavy concurrent
  contention two CPUs racing in the insert path with the same stack
  hash may each succeed in claiming a different slot, producing a
  small number of duplicate entries for the same stack. ref_count
  is then split across the duplicates. This is intentional: it
  keeps the hot path lock-free and bounds memory by the element
  pool size.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once
  the binary format settles.
Usage
=====
  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace
Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap
 Documentation/trace/ftrace-stackmap.rst | 162 ++++
 Documentation/trace/index.rst | 1 +
 kernel/trace/Kconfig | 22 +
 kernel/trace/Makefile | 1 +
 kernel/trace/trace.c | 78 +-
 kernel/trace/trace.h | 16 +
 kernel/trace/trace_entries.h | 15 +
 kernel/trace/trace_output.c | 23 +
 kernel/trace/trace_selftest.c | 1 +
 kernel/trace/trace_stackmap.c | 780 ++++++++++++++++++
 kernel/trace/trace_stackmap.h | 57 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc | 103 +++
 .../test.d/ftrace/stackmap-instance-gate.tc | 42 +
 tools/tracing/stackmap_dump.py | 150 ++++
 14 files changed, 1449 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100755 tools/tracing/stackmap_dump.py
--
2.34.1
^ permalink raw reply [flat|nested] 16+ messages in thread
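The space-savings rows and the pool-size figures in the cover letter can be re-derived with a few lines of arithmetic. The sketch below is only a cross-check under stated assumptions: 88 bytes per full-stack event and 12 bytes per stack_id event are the per-event sizes used in the table; counting one 88-byte copy per unique stack on the stackmap side follows the table's own accounting; and the element size of 4 + 4 + 64*8 bytes assumes FTRACE_STACKMAP_MAX_DEPTH is 64, which is inferred from a code comment in patch 1 rather than from the header. Function names are hypothetical. The script prints unrounded values, so they can differ from the rounded table entries by up to a point.

```python
def ring_buffer_bytes(events, unique, full_stack=88, id_event=12):
    """Return (bytes without dedup, bytes with dedup) for one capture.

    Follows the cover letter's accounting: every event costs id_event
    bytes, plus one full_stack-sized copy per unique stack.
    """
    without = events * full_stack
    with_map = events * id_event + unique * full_stack
    return without, with_map


def saving_pct(events, unique):
    """Percentage of bytes saved by storing stack_ids instead of stacks."""
    without, with_map = ring_buffer_bytes(events, unique)
    return 100.0 * (1 - with_map / without)


def pool_bytes(bits, max_depth=64):
    """Element pool size: 2^bits elements of (u32 + atomic_t + ips[])."""
    # max_depth=64 is an assumption taken from a comment in patch 1.
    return (1 << bits) * (4 + 4 + 8 * max_depth)


for name, events, unique in [("sched_switch", 9625, 466),
                             ("kmem_cache_alloc", 61255, 1177)]:
    without, with_map = ring_buffer_bytes(events, unique)
    print(f"{name}: {without} B -> {with_map} B, "
          f"saving {saving_pct(events, unique):.1f}%")

for bits in (14, 18):
    print(f"bits={bits}: pool {pool_bytes(bits) / 1e6:.1f} MB")
```

The same functions can be pointed at other captures by substituting the events and unique-stack counts reported in stack_map_stat.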
* [RFC PATCH v3 1/3] trace: add lock-free stackmap for stack trace deduplication
2026-05-26 11:52 ` [RFC PATCH v3 " Li Pengfei
@ 2026-05-26 11:52 ` Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 2/3] trace: integrate stackmap into ftrace stack recording path Li Pengfei
` (2 subsequent siblings)
3 siblings, 0 replies; 16+ messages in thread
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
To: mhiramat, rostedt
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
From: Pengfei Li <lipengfei28@xiaomi.com>
Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full
stack traces (80-160 bytes each) in the ring buffer for every event,
ftrace can store a 4-byte stack_id when the stackmap option is
enabled.
The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:
- Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- Pre-allocated element pool (zero allocation on hot path)
- Linear probing with 2x over-provisioned table; probe length is
  bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup
  is O(1) even when the table is heavily loaded with
  claimed-but-empty slots from pool exhaustion
- Single global instance (initialized for the global trace array)
The Kconfig depends on ARCH_HAVE_NMI_SAFE_CMPXCHG, matching the
existing tracing_map / hist_triggers requirement: the lock-free hot
path uses cmpxchg in a context that may be reached from NMI.
The stackmap is exported via three tracefs nodes:
- stack_map: text export with symbol resolution (mode 0640)
- stack_map_stat: counters (entries, successes, drops, success_rate)
- stack_map_bin: binary export (all fields native-endian)
Hot-path counters use per-CPU local_t (NMI-safe single-instruction
increments) instead of atomic64_t. atomic64_t falls back to
raw_spinlock_t-based emulation on 32-bit GENERIC_ATOMIC64 systems,
which would deadlock if an NMI hit while the spinlock was held.
local_t avoids this hazard.
Reset semantics:
- Reset is a control-path operation only allowed when tracing is
  stopped on the owning trace_array. Online reset (with tracing
  active) is intentionally not supported.
- Reset uses atomic_cmpxchg() to claim the resetting flag, then
  verifies tracer_tracing_is_on() returns false.
- synchronize_rcu() drains in-flight get_id() callers from the
  ftrace callback path (which runs preempt-disabled).
- A reader_sem (rw_semaphore) serializes the clearing memset against
  tracefs readers (seq_file iteration and stack_map_bin snapshot),
  which run in process context and aren't covered by
  synchronize_rcu(). The hot path doesn't take this lock.
- Reset clears the resetting flag with atomic_set_release() so a
  subsequent get_id() observes a fully cleared map.
- get_id() uses atomic_read_acquire() on resetting so subsequent
  loads of entry->key/val are properly ordered after the check
  (control dependencies only order stores per LKMM).
- Concurrent reset returns -EBUSY; reset while tracing is active
  returns -EBUSY.
Concurrency notes:
- entry->val publication uses smp_store_release() paired with
  smp_load_acquire() in all dereferencing readers.
- entry->key reads (in get_id, seq_start/next, bin_open) use
  READ_ONCE() to avoid LKMM data races with the cmpxchg writer.
- elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before
  use in seq_show and bin_open.
- Pool exhaustion: stackmap_get_elt() short-circuits via
  atomic_read() before the contended atomic RMW, avoiding cacheline
  contention once the pool is full. Slots that win cmpxchg but
  cannot get an elt are left 'claimed but empty'; subsequent lookups
  treat val==NULL as a miss and probe past them.
Hash key:
- Per-instance random seed stored in the stackmap struct (no global
  state), seeded at create time.
- 32-bit jhash is forced to 1 if it lands on 0 (which is the
  free-slot sentinel). Full memcmp confirms matches.
Memory:
- Single flat vmalloc for the element pool (no per-elt kzalloc).
- bits parameter clamped to [10, 18]: at the maximum bits=18, the
  element pool is ~135 MB and a stack_map_bin snapshot may briefly
  allocate another ~135 MB.
- struct stackmap_bin_snapshot uses u64 (not size_t) for its size
  field so data[] is 8-byte aligned on both 32-bit and 64-bit
  architectures, avoiding alignment faults when writing u64 IPs on
  strict-alignment architectures.
Kernel command line parameter:
- ftrace_stackmap.bits=N: set map capacity (2^N unique stacks,
  range 10-18, default 14)
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 kernel/trace/Kconfig | 22 +
 kernel/trace/Makefile | 1 +
 kernel/trace/trace_stackmap.c | 780 ++++++++++++++++++++++++++++++++++
 kernel/trace/trace_stackmap.h | 57 +++
 4 files changed, 860 insertions(+)
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..e49cae886ff0 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,28 @@ config STACK_TRACER
 	  Say N if unsure.

+config FTRACE_STACKMAP
+	bool "Ftrace stack map deduplication"
+	depends on TRACING
+	depends on STACKTRACE
+	depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select KALLSYMS
+	help
+	  This enables a global stack trace hash table for ftrace, inspired
+	  by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+	  only a stack_id in the ring buffer instead of the full stack trace,
+	  significantly reducing trace buffer usage when the same call stacks
+	  appear repeatedly.
+
+	  The deduplicated stacks are exported via:
+	    /sys/kernel/debug/tracing/stack_map
+
+	  Writing to this file resets the stack map. Reading shows all unique
+	  stacks with their stack_id and reference count.
+
+	  Say Y if you want to reduce ftrace buffer usage for stack traces.
+	  Say N if unsure.
+
 config TRACE_PREEMPT_TOGGLE
 	bool
 	help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 8d3d96e847d8..c2d9b2bf895a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
 obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
 obj-$(CONFIG_NOP_TRACER) += trace_nop.o
 obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
 obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
 obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..c89f6d527c96
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,780 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ *   - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
+ *   - Pre-allocated element pool (zero allocation on hot path)
+ *   - Linear probing with 2x over-provisioned table; probe length
+ *     bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup
+ *     cost constant even when the table is heavily loaded
+ *   - Single global instance (initialized for the global trace array)
+ *
+ * Reset is a control-path operation, only allowed when tracing is
+ * stopped on the owning trace_array. The protocol is:
+ *
+ *   - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights
+ *     and blocks new get_id() callers (they observe resetting=1 and
+ *     return -EINVAL).
+ *   - tracer_tracing_is_on() is checked AFTER the cmpxchg, so the
+ *     resetting flag itself prevents new insertions even if userspace
+ *     re-enables tracing immediately after the check.
+ *   - synchronize_rcu() drains in-flight get_id() callers from the
+ *     ftrace callback path, which runs with preemption disabled.
+ *
+ * Online reset (with tracing active) is intentionally not supported
+ * to keep the design simple and the proof obligations small.
+ *
+ * The 32-bit jhash of the stack IPs is the hash table key. On hash
+ * collision, linear probing finds the next slot and full memcmp
+ * confirms the match.
+ *
+ * Concurrent userspace readers (cat stack_map / stack_map_bin) get
+ * a best-effort snapshot. They are coherent with the hot path
+ * (smp_load_acquire on entry->val), but they are not coherent with
+ * a concurrent reset; since reset requires tracing to be stopped,
+ * mid-iteration reset can produce truncated or partial output but
+ * never crashes.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/jhash.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/local_lock.h>
+#include <linux/percpu.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/log2.h>
+#include <asm/local.h>
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Bound the linear-probe scan length. With a 2x over-provisioned table,
+ * a well-distributed hash gives very short probe chains. Capping at 64
+ * keeps worst-case lookup O(1) even when the table is heavily loaded
+ * with claimed-but-empty slots from pool exhaustion.
+ */
+#define FTRACE_STACKMAP_MAX_PROBE 64
+
+/*
+ * Memory ordering of entry->val: published with smp_store_release()
+ * by the inserter; consumed with smp_load_acquire() by every reader
+ * that dereferences the elt (get_id, seq_show, bin_open). This pairs
+ * the writes to elt->{nr,ips,ref_count} (initialized BEFORE the
+ * publish) with the reads of those fields (which happen AFTER the
+ * load). seq_start / seq_next only test val for NULL and use the
+ * acquire load purely to keep memory ordering symmetric.
+ */
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+	u32 nr;				/* actual number of IPs */
+	atomic_t ref_count;
+	unsigned long ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+	u32 key;			/* 0 = free, non-zero = jhash */
+	struct stackmap_elt *val;	/* NULL until fully published */
+};
+
+struct ftrace_stackmap {
+	struct trace_array *tr;		/* owning trace_array */
+	unsigned int map_bits;
+	unsigned int map_size;		/* 1 << (map_bits + 1) */
+	unsigned int max_elts;		/* 1 << map_bits */
+	u32 hash_seed;			/* per-instance jhash seed */
+	atomic_t next_elt;		/* index into elts pool */
+	struct stackmap_entry *entries;	/* hash table */
+	struct stackmap_elt *elts;	/* flat element pool */
+	atomic_t resetting;
+	/*
+	 * Reader/reset serialization. Held in shared mode (read lock)
+	 * across seq_file iteration and binary snapshot construction;
+	 * held in exclusive mode (write lock) by reset's clearing
+	 * phase. The hot path (get_id) does not take this lock - it
+	 * uses smp_load_acquire/smp_store_release on entry->val and
+	 * the resetting flag for the lock-free protocol.
+	 */
+	struct rw_semaphore reader_sem;
+	/*
+	 * Per-CPU counters using local_t. local_t increments are NMI-
+	 * safe on all architectures (single-instruction or interrupt-
+	 * masked) and avoid the raw_spinlock_t fallback that
+	 * atomic64_t uses on 32-bit GENERIC_ATOMIC64 - which would
+	 * deadlock if an NMI hit while the spinlock was held.
+	 */
+	local_t __percpu *successes;	/* events served (hits + new inserts) */
+	local_t __percpu *drops;
+};
+
+/*
+ * Cap the bits parameter to keep worst-case allocations bounded:
+ * bits=18 → 256K elts, 512K slots, ~130 MB elt pool, ~130 MB bin
+ * export.
+ * Smaller workloads should use the default (14) which gives 16K elts
+ * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel
+ * parameter for higher unique-stack capacity.
+ */
+#define FTRACE_STACKMAP_BITS_MIN 10
+#define FTRACE_STACKMAP_BITS_MAX 18
+#define FTRACE_STACKMAP_BITS_DEFAULT 14
+
+static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT;
+static int __init stackmap_bits_setup(char *str)
+{
+	unsigned long val;
+
+	if (kstrtoul(str, 0, &val))
+		return -EINVAL;
+	val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX);
+	stackmap_map_bits = val;
+	return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+	int idx;
+
+	/*
+	 * Fast-path early-out once the pool is fully consumed. Avoids
+	 * the contended atomic RMW on next_elt for every traced event
+	 * after the pool is exhausted.
+	 */
+	if (atomic_read(&smap->next_elt) >= smap->max_elts)
+		return NULL;
+
+	idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+	if (idx < smap->max_elts)
+		return &smap->elts[idx];
+	return NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr)
+{
+	struct ftrace_stackmap *smap;
+	unsigned int bits;
+
+	smap = kzalloc(sizeof(*smap), GFP_KERNEL);
+	if (!smap)
+		return ERR_PTR(-ENOMEM);
+
+	/* Defensive clamp: reject bogus bits even if early_param is bypassed. */
+	bits = clamp_val(stackmap_map_bits,
+			 FTRACE_STACKMAP_BITS_MIN,
+			 FTRACE_STACKMAP_BITS_MAX);
+
+	smap->tr = tr;
+	smap->map_bits = bits;
+	smap->max_elts = 1U << bits;
+	smap->map_size = 1U << (bits + 1);	/* 2x over-provision */
+
+	smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+	if (!smap->entries) {
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * Single large vmalloc of the element pool, indexed flat.
+	 * At bits=18 this is 256K * sizeof(struct stackmap_elt). The
+	 * struct is ~520 B (8 + 4 + 4 + 64*8), so total ~135 MB.
+	 */
+	smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts);
+	if (!smap->elts) {
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	smap->successes = alloc_percpu(local_t);
+	if (!smap->successes) {
+		vfree(smap->elts);
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+	smap->drops = alloc_percpu(local_t);
+	if (!smap->drops) {
+		free_percpu(smap->successes);
+		vfree(smap->elts);
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	smap->hash_seed = get_random_u32();
+	atomic_set(&smap->next_elt, 0);
+	atomic_set(&smap->resetting, 0);
+	init_rwsem(&smap->reader_sem);
+
+	return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+	if (!smap || IS_ERR(smap))
+		return;
+	free_percpu(smap->drops);
+	free_percpu(smap->successes);
+	vfree(smap->elts);
+	vfree(smap->entries);
+	kfree(smap);
+}
+
+/**
+ * ftrace_stackmap_reset - clear all entries in the stackmap
+ * @smap: the stackmap to reset
+ *
+ * Returns 0 on success, -EBUSY if another reset is already in
+ * progress, or if tracing is currently active on the owning
+ * trace_array.
+ *
+ * Online reset (with tracing active) is not supported. Caller must
+ * stop tracing first (echo 0 > tracing_on).
+ *
+ * Caller is process context (typically sysfs write handler).
+ *
+ * Protocol:
+ *   1. Atomically claim reset rights via cmpxchg on @resetting.
+ *   2. Verify tracing is stopped on @smap->tr; if not, release the
+ *      claim and return -EBUSY. The resetting flag itself blocks
+ *      any subsequent get_id() callers.
+ *   3. synchronize_rcu() drains in-flight get_id() callers from the
+ *      ftrace callback path (which runs preempt-disabled).
+ *   4. memset entries, elts, and counters.
+ *   5. Release the resetting flag with release semantics so any new
+ *      get_id() observes a fully cleared map.
+ */
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+	if (!smap)
+		return 0;
+
+	if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0)
+		return -EBUSY;
+
+	if (smap->tr && tracer_tracing_is_on(smap->tr)) {
+		atomic_set(&smap->resetting, 0);
+		return -EBUSY;
+	}
+
+	/*
+	 * synchronize_rcu() itself is a full barrier; no extra smp_mb()
+	 * is needed before it. It drains in-flight ftrace callbacks that
+	 * may have already passed the resetting check with the old value.
+	 */
+	synchronize_rcu();
+
+	/*
+	 * Take the reader_sem in exclusive mode. This serializes the
+	 * memset against any tracefs reader (seq_file iteration or
+	 * stack_map_bin snapshot) that may currently hold the rwsem
+	 * for read. synchronize_rcu() already drained the hot path;
+	 * this rwsem covers process-context readers that aren't
+	 * preempt-disabled.
+	 */
+	down_write(&smap->reader_sem);
+
+	memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+	memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts);
+
+	atomic_set(&smap->next_elt, 0);
+	{
+		int cpu;
+
+		for_each_possible_cpu(cpu) {
+			local_set(per_cpu_ptr(smap->successes, cpu), 0);
+			local_set(per_cpu_ptr(smap->drops, cpu), 0);
+		}
+	}
+
+	up_write(&smap->reader_sem);
+
+	/* Release resetting=0 so new get_id() observes a cleared map. */
+	atomic_set_release(&smap->resetting, 0);
+	return 0;
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries)
+{
+	u32 key_hash, idx, test_key, trace_len;
+	struct stackmap_entry *entry;
+	struct stackmap_elt *val;
+	int probes = 0;
+
+	/*
+	 * atomic_read_acquire() pairs with atomic_set_release() in the
+	 * reset path. This ensures that subsequent reads of entry->key
+	 * and entry->val are ordered after this check; without acquire,
+	 * the CPU would only have a control dependency, which orders
+	 * subsequent stores but not loads (per LKMM).
+	 */
+	if (!smap || !nr_entries || atomic_read_acquire(&smap->resetting))
+		return -EINVAL;
+	if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+		nr_entries = FTRACE_STACKMAP_MAX_DEPTH;
+
+	trace_len = nr_entries * sizeof(unsigned long);
+	/*
+	 * jhash2() requires the length in u32 units and the data to be
+	 * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+	 * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+	 * directly; the cast to u32* is safe because ips[] is naturally
+	 * aligned to sizeof(unsigned long) >= 4.
+	 */
+	key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+			  smap->hash_seed);
+	if (key_hash == 0)
+		key_hash = 1;	/* 0 means free slot */
+
+	idx = key_hash >> (32 - (smap->map_bits + 1));
+
+	while (probes < FTRACE_STACKMAP_MAX_PROBE) {
+		idx &= (smap->map_size - 1);
+		entry = &smap->entries[idx];
+		/*
+		 * READ_ONCE() to avoid LKMM data race with concurrent
+		 * cmpxchg(&entry->key, 0, key_hash) on this slot.
+		 */
+		test_key = READ_ONCE(entry->key);
+
+		if (test_key == key_hash) {
+			/*
+			 * smp_load_acquire pairs with smp_store_release in
+			 * the publisher below; ensures we see fully-formed
+			 * elt fields (nr, ips, ref_count) before dereference.
+			 */
+			val = smp_load_acquire(&entry->val);
+			/*
+			 * READ_ONCE(val->nr) keeps style consistent with
+			 * the seq_show / bin_open readers. nr is write-once
+			 * (set before publish, never modified afterwards),
+			 * so the load is data-race-free, but READ_ONCE
+			 * silences any analysis tool that flags a plain
+			 * read of a field that is also read under acquire
+			 * elsewhere.
+			 */
+			if (val && READ_ONCE(val->nr) == nr_entries &&
+			    memcmp(val->ips, ips, trace_len) == 0) {
+				atomic_inc(&val->ref_count);
+				local_inc(this_cpu_ptr(smap->successes));
+				return (int)idx;
+			}
+			/*
+			 * val == NULL: another CPU is mid-insert, or this
+			 * slot is "claimed but empty" (pool exhausted).
+			 * val != NULL but mismatch: 32-bit hash collision
+			 * with a different stack. In both cases, advance.
+			 */
+		} else if (!test_key) {
+			/*
+			 * Free slot: try to claim it.
+			 *
+			 * If two CPUs race here with the same key_hash
+			 * (same stack), one loses the cmpxchg, advances,
+			 * and may insert the same stack at a later slot.
+			 * This can produce a small number of duplicate
+			 * entries under heavy contention. The trade-off
+			 * is accepted to keep the hot path lock-free;
+			 * ref_count is split across the duplicates and
+			 * total memory cost is bounded by the element
+			 * pool size.
+			 */
+			if (cmpxchg(&entry->key, 0, key_hash) == 0) {
+				struct stackmap_elt *elt;
+
+				elt = stackmap_get_elt(smap);
+				if (!elt) {
+					/*
+					 * Pool exhausted. We claimed this
+					 * slot with cmpxchg but cannot fill
+					 * it. Leave key set so the slot
+					 * stays "claimed but empty" - future
+					 * lookups treat val==NULL as a miss
+					 * and probe past it. Cannot revert
+					 * key=0 without racing other CPUs.
+					 */
+					local_inc(this_cpu_ptr(smap->drops));
+					return -ENOSPC;
+				}
+
+				elt->nr = nr_entries;
+				atomic_set(&elt->ref_count, 1);
+				memcpy(elt->ips, ips, trace_len);
+
+				/*
+				 * Publish elt with release semantics so the
+				 * reader's smp_load_acquire can safely
+				 * dereference val->nr / val->ips.
+				 */
+				smp_store_release(&entry->val, elt);
+				local_inc(this_cpu_ptr(smap->successes));
+				return (int)idx;
+			}
+			/* cmpxchg failed; another CPU claimed this slot.
*/ + } + + idx++; + probes++; + } + + local_inc(this_cpu_ptr(smap->drops)); + return -ENOSPC; +} + +/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */ + +struct stackmap_seq_private { + struct ftrace_stackmap *smap; +}; + +static void *stackmap_seq_start(struct seq_file *m, loff_t *pos) +{ + struct stackmap_seq_private *priv = m->private; + struct ftrace_stackmap *smap = priv->smap; + u32 i; + + if (!smap) + return NULL; + /* + * Take the reader_sem to serialize against ftrace_stackmap_reset(), + * which holds it for write while clearing the table. Released in + * stackmap_seq_stop(), which seq_file calls regardless of whether + * start() returned an element or NULL (per Documentation/filesystems + * /seq_file.rst: "the iterator value returned by start() or next() + * is guaranteed to be passed to a subsequent next() or stop()"). + */ + down_read(&smap->reader_sem); + for (i = *pos; i < smap->map_size; i++) { + if (READ_ONCE(smap->entries[i].key) && + smp_load_acquire(&smap->entries[i].val)) { + *pos = i; + return &smap->entries[i]; + } + } + return NULL; +} + +static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos) +{ + struct stackmap_seq_private *priv = m->private; + struct ftrace_stackmap *smap = priv->smap; + u32 i; + + if (!smap) + return NULL; + for (i = *pos + 1; i < smap->map_size; i++) { + if (READ_ONCE(smap->entries[i].key) && + smp_load_acquire(&smap->entries[i].val)) { + *pos = i; + return &smap->entries[i]; + } + } + /* + * Advance *pos past the end so that on the next read() the + * subsequent stackmap_seq_start() call returns NULL and the + * iteration terminates. Without this, seq_read() would loop + * on the last element. 
+	 */
+	*pos = smap->map_size;
+	return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+
+	/*
+	 * seq_file invokes stop() unconditionally after each iteration
+	 * pass (see seq_read_iter / traverse), even when start() returned
+	 * NULL. Always release here, balanced against the down_read in
+	 * stackmap_seq_start().
+	 */
+	if (smap)
+		up_read(&smap->reader_sem);
+}
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+	struct stackmap_entry *entry = v;
+	struct stackmap_elt *elt = smp_load_acquire(&entry->val);
+	struct stackmap_seq_private *priv = m->private;
+	u32 idx = entry - priv->smap->entries;
+	u32 i, nr;
+
+	if (!elt)
+		return 0;
+
+	nr = READ_ONCE(elt->nr);
+	if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+		nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+	seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+		   idx, atomic_read(&elt->ref_count), nr);
+	for (i = 0; i < nr; i++)
+		seq_printf(m, " [%u] %pS\n", i, (void *)elt->ips[i]);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+	.start	= stackmap_seq_start,
+	.next	= stackmap_seq_next,
+	.stop	= stackmap_seq_stop,
+	.show	= stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+	struct stackmap_seq_private *priv;
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open_private(file, &stackmap_seq_ops,
+			       sizeof(struct stackmap_seq_private));
+	if (ret)
+		return ret;
+	m = file->private_data;
+	priv = m->private;
+	priv->smap = inode->i_private;
+	return 0;
+}
+
+/*
+ * Accept exactly "0" or "reset" (optionally followed by a single newline).
+ */
+static bool stackmap_write_is_reset(const char *buf, size_t n)
+{
+	if (n > 0 && buf[n - 1] == '\n')
+		n--;
+	return (n == 1 && buf[0] == '0') ||
+	       (n == 5 && memcmp(buf, "reset", 5) == 0);
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+			      size_t count, loff_t *ppos)
+{
+	struct seq_file *m = file->private_data;
+	struct stackmap_seq_private *priv = m->private;
+	char buf[8];
+	size_t n = min(count, sizeof(buf) - 1);
+	int ret;
+
+	if (n == 0)
+		return -EINVAL;
+	if (copy_from_user(buf, ubuf, n))
+		return -EFAULT;
+	buf[n] = '\0';
+
+	if (!stackmap_write_is_reset(buf, n))
+		return -EINVAL;
+
+	/*
+	 * ftrace_stackmap_reset() atomically claims reset rights via
+	 * cmpxchg and returns -EBUSY if another reset is in progress
+	 * or if tracing is active.
+	 */
+	ret = ftrace_stackmap_reset(priv->smap);
+	if (ret)
+		return ret;
+	return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+	.open		= stackmap_open,
+	.read		= seq_read,
+	.write		= stackmap_write,
+	.llseek		= seq_lseek,
+	.release	= seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+	struct ftrace_stackmap *smap = m->private;
+	u64 successes = 0, drops = 0;
+	u32 entries;
+	int cpu;
+
+	if (!smap) {
+		seq_puts(m, "stackmap not initialized\n");
+		return 0;
+	}
+
+	entries = atomic_read(&smap->next_elt);
+	for_each_possible_cpu(cpu) {
+		successes += local_read(per_cpu_ptr(smap->successes, cpu));
+		drops += local_read(per_cpu_ptr(smap->drops, cpu));
+	}
+
+	seq_printf(m, "entries: %u / %u\n", entries, smap->max_elts);
+	seq_printf(m, "table_size: %u\n", smap->map_size);
+	seq_printf(m, "successes: %llu\n", successes);
+	seq_printf(m, "drops: %llu\n", drops);
+	if (successes + drops > 0)
+		seq_printf(m, "success_rate: %llu%%\n",
+			   successes * 100 / (successes + drops));
+	return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+	.open		= stackmap_stat_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+	/*
+	 * Use u64 (not size_t) so data[] is 8-byte aligned on both
+	 * 32-bit and 64-bit architectures. The IP array within data[]
+	 * is accessed as u64*, which would alignment-fault on strict
+	 * architectures (e.g. older ARM, SPARC) if data[] started at
+	 * a 4-byte boundary.
+	 */
+	u64 size;
+	char data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+	struct ftrace_stackmap *smap = inode->i_private;
+	struct stackmap_bin_snapshot *snap;
+	struct ftrace_stackmap_bin_header *hdr;
+	size_t alloc_size, off;
+	u32 nr_entries, i, nr_stacks;
+
+	if (!smap)
+		return -ENODEV;
+
+	/*
+	 * Worst-case allocation size: every populated entry uses a
+	 * full-depth stack. The (+1) gives one slack slot in case a
+	 * concurrent insert lands between this snapshot and iteration.
+	 * The loop below performs an explicit bounds check anyway.
+	 *
+	 * At bits=18 this caps at ~135 MB. The file is mode 0440
+	 * (TRACE_MODE_READ), so only privileged users can open it.
+	 */
+	nr_entries = atomic_read(&smap->next_elt);
+	alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+		(sizeof(struct ftrace_stackmap_bin_entry) +
+		 FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+
+	snap = vmalloc(sizeof(*snap) + alloc_size);
+	if (!snap)
+		return -ENOMEM;
+
+	hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+	hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+	hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+	hdr->reserved = 0;
+	off = sizeof(*hdr);
+	nr_stacks = 0;
+
+	/*
+	 * Take reader_sem to serialize against ftrace_stackmap_reset(),
+	 * which clears the table and elt pool under the write lock.
+	 */
+	down_read(&smap->reader_sem);
+
+	for (i = 0; i < smap->map_size; i++) {
+		struct stackmap_entry *entry = &smap->entries[i];
+		struct stackmap_elt *elt;
+		struct ftrace_stackmap_bin_entry *e;
+		u64 *ips_out;
+		u32 k, nr;
+
+		if (!READ_ONCE(entry->key))
+			continue;
+		elt = smp_load_acquire(&entry->val);
+		if (!elt)
+			continue;
+
+		nr = READ_ONCE(elt->nr);
+		if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+			nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+		/* Bounds check: stop if we would overflow the allocation. */
+		if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size)
+			break;
+
+		e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+		e->stack_id = i;
+		e->nr = nr;
+		e->ref_count = atomic_read(&elt->ref_count);
+		e->reserved = 0;
+		off += sizeof(*e);
+
+		ips_out = (u64 *)(snap->data + off);
+		for (k = 0; k < nr; k++)
+			ips_out[k] = (u64)elt->ips[k];
+		off += nr * sizeof(u64);
+		nr_stacks++;
+	}
+
+	up_read(&smap->reader_sem);
+
+	hdr->nr_stacks = nr_stacks;
+	snap->size = off;
+	file->private_data = snap;
+	return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+				 size_t count, loff_t *ppos)
+{
+	struct stackmap_bin_snapshot *snap = file->private_data;
+
+	if (!snap)
+		return -EINVAL;
+	return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+	vfree(file->private_data);
+	return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+	.open		= stackmap_bin_open,
+	.read		= stackmap_bin_read,
+	.llseek		= default_llseek,
+	.release	= stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..2e82bd6fb1c3
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define FTRACE_STACKMAP_MAX_DEPTH 64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC 0x464D5342 /* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION 2
+
+struct ftrace_stackmap_bin_header {
+	u32 magic;
+	u32 version;
+	u32 nr_stacks;
+	u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+	u32 stack_id;
+	u32 nr;
+	u32 ref_count;
+	u32 reserved;
+	/* followed by u64 ips[nr] */
+};
+
+struct trace_array;
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries);
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *
+ftrace_stackmap_create(struct trace_array *tr) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+					 unsigned long *ips, unsigned int n)
+{ return -EOPNOTSUPP; }
+static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
-- 
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
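For readers who want to consume the stack_map_bin export described by the two structs in patch 1/3 (a 16-byte header, then per stack a 16-byte entry followed by u64 ips[nr]), the following is a minimal userspace sketch in C. It is not part of the series. The fsmb_ names and both helper functions are made up for illustration, and the sketch assumes the image was written on a machine with the same byte order as the reader; a byte-swapped magic is simply rejected rather than handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FSMB_MAGIC     0x464D5342u /* FTRACE_STACKMAP_BIN_MAGIC in the patch */
#define FSMB_MAX_DEPTH 64          /* FTRACE_STACKMAP_MAX_DEPTH in the patch */

/* Userspace mirrors of the two export structs; the fsmb_ names are made up. */
struct fsmb_header { uint32_t magic, version, nr_stacks, reserved; };
struct fsmb_entry  { uint32_t stack_id, nr, ref_count, reserved; };

/*
 * Check the header of an in-memory stack_map_bin image.
 * Returns 0 and sets *off to the first entry, or -1 if the image is
 * too short or the magic does not match in native byte order.
 */
static int fsmb_begin(const void *buf, size_t len,
		      struct fsmb_header *hdr, size_t *off)
{
	assert(sizeof(*hdr) == 16);	/* four u32 fields, no padding */
	if (len < sizeof(*hdr))
		return -1;
	memcpy(hdr, buf, sizeof(*hdr));
	if (hdr->magic != FSMB_MAGIC)
		return -1;
	*off = sizeof(*hdr);
	return 0;
}

/*
 * Decode the entry at *off, copy its IPs into ips[] and advance *off.
 * Returns 0 on success, -1 on a truncated or malformed entry.
 */
static int fsmb_next(const void *buf, size_t len, size_t *off,
		     struct fsmb_entry *e, uint64_t ips[FSMB_MAX_DEPTH])
{
	const unsigned char *p = buf;
	size_t ips_bytes;

	if (*off > len || len - *off < sizeof(*e))
		return -1;
	memcpy(e, p + *off, sizeof(*e));
	if (e->nr > FSMB_MAX_DEPTH)
		return -1;
	ips_bytes = (size_t)e->nr * sizeof(uint64_t);
	if (len - *off - sizeof(*e) < ips_bytes)
		return -1;
	memcpy(ips, p + *off + sizeof(*e), ips_bytes);
	*off += sizeof(*e) + ips_bytes;
	return 0;
}
```

A caller would read the whole file into a buffer, call fsmb_begin() once, then call fsmb_next() up to hdr.nr_stacks times. Copying with memcpy() rather than casting into the buffer sidesteps the alignment concern that the stackmap_bin_snapshot comment raises for strict-alignment architectures.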
* [RFC PATCH v3 2/3] trace: integrate stackmap into ftrace stack recording path
2026-05-26 11:52 ` [RFC PATCH v3 " Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei
@ 2026-05-26 11:52 ` Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 3/3] trace: add documentation, selftest and tooling for stackmap Li Pengfei
2026-05-26 19:39 ` [RFC PATCH v3 0/3] trace: stack trace deduplication for ftrace ring buffer Steven Rostedt
3 siblings, 0 replies; 16+ messages in thread
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
To: mhiramat, rostedt
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
From: Pengfei Li <lipengfei28@xiaomi.com>

Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.

Changes:
- New TRACE_STACK_ID in trace_type enum
- New stack_id_entry in trace_entries.h
- New TRACE_ITER(STACKMAP) trace option flag; when
  CONFIG_FTRACE_STACKMAP is disabled, TRACE_ITER_STACKMAP_BIT is
  defined as -1 so that TRACE_ITER(STACKMAP) evaluates to 0
  (following the existing pattern used by
  TRACE_ITER_PROF_TEXT_OFFSET)
- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and ZEROED_TRACE_FLAGS
  so it is only exposed under the top-level trace instance, matching
  the convention already used for global-only options such as
  'printk' and 'record-cmd'. Secondary instances under
  tracing/instances/*/ do not see the option at all, avoiding a
  confusing no-op.
- Modified __ftrace_trace_stack() to call ftrace_stackmap_get_id()
  when the stackmap option is active. If reserving a TRACE_STACK_ID
  ring-buffer slot fails after a successful get_id(), the path falls
  through to the full-stack recording so the event still gets a
  stack trace recorded.
- Stackmap pointer read with smp_load_acquire(), published with
  smp_store_release() to ensure proper initialization ordering
- NULL check on tr->stackmap is retained as defense-in-depth: events
  that fire before fs_initcall (when the map is created) or after a
  failed ftrace_stackmap_create() observe a NULL pointer and fall
  back to full stack recording without dereferencing it
- ftrace_stackmap_create() takes the owning trace_array so the
  stackmap can later check tracing state during reset
- Added stack_id print handler in trace_output.c
- Added TRACE_STACK_ID to trace_valid_entry() in trace_selftest.c so
  ftrace startup selftests don't reject the new entry type when the
  stackmap option is enabled

Fallback behavior: if stackmap returns an error (pool exhausted,
resetting, or NULL pointer), the full stack trace is recorded as
before -- no new failure modes introduced.

Per-instance stackmap support is left as a follow-up; gating the
option via TOP_LEVEL_TRACE_FLAGS makes the global-only scope explicit
at the tracefs interface rather than relying on a silent runtime
fallback.

Usage:
  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 kernel/trace/trace.c          | 78 ++++++++++++++++++++++++++++++++++-
 kernel/trace/trace.h          | 16 +++++++
 kernel/trace/trace_entries.h  | 15 +++++++
 kernel/trace/trace_output.c   | 23 +++++++++++
 kernel/trace/trace_selftest.c |  1 +
 5 files changed, 131 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..36120355e549 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
 
 #include "trace.h"
 #include "trace_output.h"
+#include "trace_stackmap.h"
 
 #ifdef CONFIG_FTRACE_STARTUP_TEST
 /*
@@ -509,12 +510,13 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
 /* trace_options that are only supported by global_trace */
 #define TOP_LEVEL_TRACE_FLAGS (TRACE_ITER(PRINTK) | \
 	TRACE_ITER(PRINTK_MSGONLY) | TRACE_ITER(RECORD_CMD) | \
-	TRACE_ITER(PROF_TEXT_OFFSET) | FPROFILE_DEFAULT_FLAGS)
+	TRACE_ITER(PROF_TEXT_OFFSET) | TRACE_ITER(STACKMAP) | \
+	FPROFILE_DEFAULT_FLAGS)
 
 /* trace_flags that are default zero for instances */
 #define ZEROED_TRACE_FLAGS \
 	(TRACE_ITER(EVENT_FORK) | TRACE_ITER(FUNC_FORK) | TRACE_ITER(TRACE_PRINTK) | \
-	 TRACE_ITER(COPY_MARKER))
+	 TRACE_ITER(COPY_MARKER) | TRACE_ITER(STACKMAP))
 
 /*
  * The global_trace is the descriptor that holds the top-level tracing
@@ -2184,6 +2186,49 @@ void __ftrace_trace_stack(struct trace_array *tr,
 	}
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	/*
+	 * If stackmap dedup is enabled, try to store only the stack_id
+	 * in the ring buffer instead of the full stack trace.
+	 */
+	if (tr->trace_flags & TRACE_ITER(STACKMAP)) {
+		struct ftrace_stackmap *smap;
+		struct stack_id_entry *sid_entry;
+		int sid;
+
+		smap = smp_load_acquire(&tr->stackmap);
+		if (!smap)
+			goto full_stack;
+
+		sid = ftrace_stackmap_get_id(smap, fstack->calls, nr_entries);
+		if (sid >= 0) {
+			event = __trace_buffer_lock_reserve(buffer,
+					TRACE_STACK_ID,
+					sizeof(*sid_entry), trace_ctx);
+			if (!event) {
+				/*
+				 * Could not reserve a TRACE_STACK_ID slot;
+				 * fall back to the full-stack path so the
+				 * event still gets a stack trace recorded.
+				 */
+				goto full_stack;
+			}
+			sid_entry = ring_buffer_event_data(event);
+			sid_entry->stack_id = sid;
+			/*
+			 * stack_id is a synthetic side-event attached to a
+			 * primary trace event that was already subject to
+			 * filtering. No per-event filter is defined for
+			 * TRACE_STACK_ID, so commit unconditionally.
+			 */
+			__buffer_unlock_commit(buffer, event);
+			goto out;
+		}
+		/* On stackmap failure, record the full stack instead. */
+	}
+full_stack:
+#endif
+
 	event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
 				    struct_size(entry, caller, nr_entries),
 				    trace_ctx);
@@ -9222,6 +9267,35 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
 			NULL, &tracing_dyn_info_fops);
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	{
+		struct ftrace_stackmap *smap;
+
+		smap = ftrace_stackmap_create(&global_trace);
+		if (!IS_ERR(smap)) {
+			/*
+			 * Use smp_store_release to ensure the stackmap
+			 * structure is fully initialized before publishing
+			 * the pointer to concurrent trace event readers.
+			 */
+			smp_store_release(&global_trace.stackmap, smap);
+			trace_create_file("stack_map", TRACE_MODE_WRITE, NULL,
+					  smap, &ftrace_stackmap_fops);
+			trace_create_file("stack_map_stat", TRACE_MODE_READ, NULL,
+					  smap, &ftrace_stackmap_stat_fops);
+			trace_create_file("stack_map_bin", TRACE_MODE_READ, NULL,
+					  smap, &ftrace_stackmap_bin_fops);
+		} else {
+			pr_warn("ftrace stackmap init failed, dedup disabled\n");
+			/*
+			 * global_trace is statically defined; its stackmap
+			 * field is zero-initialized via BSS, so leaving it
+			 * NULL ensures the smp_load_acquire() in
+			 * __ftrace_trace_stack() falls back to full stack.
+			 */
+		}
+	}
+#endif
 	create_trace_instances(NULL);
 
 	update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..7e7d5e5a35ff 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
 	TRACE_TIMERLAT,
 	TRACE_RAW_DATA,
 	TRACE_FUNC_REPEATS,
+	TRACE_STACK_ID,
 
 	__TRACE_LAST_TYPE,
 };
@@ -453,6 +454,9 @@ struct trace_array {
 	struct cond_snapshot *cond_snapshot;
 #endif
 	struct trace_func_repeats __percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+	struct ftrace_stackmap *stackmap;
+#endif
 	/*
 	 * On boot up, the ring buffer is set to the minimum size, so that
 	 * we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
 			  TRACE_GRAPH_RET); \
 		IF_ASSIGN(var, ent, struct func_repeats_entry, \
 			  TRACE_FUNC_REPEATS); \
+		IF_ASSIGN(var, ent, struct stack_id_entry, \
+			  TRACE_STACK_ID); \
 		__ftrace_bad_type(); \
 	} while (0)
 
@@ -1449,7 +1455,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 # define STACK_FLAGS
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS \
+		C(STACKMAP, "stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP_BIT -1
+#endif
+
 #ifdef CONFIG_FUNCTION_PROFILER
+
 # define PROFILER_FLAGS \
 		C(PROF_TEXT_OFFSET, "prof-text-offset"),
 # ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1521,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		FUNCTION_FLAGS \
 		FGRAPH_FLAGS \
 		STACK_FLAGS \
+		STACKMAP_FLAGS \
 		BRANCH_FLAGS \
 		PROFILER_FLAGS \
 		FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
 		 (void *)__entry->caller[6], (void *)__entry->caller[7])
 );
 
+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+	TRACE_STACK_ID,
+
+	F_STRUCT(
+		__field(	int,		stack_id	)
+	),
+
+	F_printk("<stack_id %d>", __entry->stack_id)
+);
+
 /*
  * trace_printk entry:
  */
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
 	.funcs		= &trace_user_stack_funcs,
 };
 
+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+					      int flags, struct trace_event *event)
+{
+	struct stack_id_entry *field;
+	struct trace_seq *s = &iter->seq;
+
+	trace_assign_type(field, iter->ent);
+	trace_seq_printf(s, "<stack_id %d>\n", field->stack_id);
+
+	return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+	.trace		= trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+	.type		= TRACE_STACK_ID,
+	.funcs		= &trace_stack_id_funcs,
+};
+
 /* TRACE_HWLAT */
 static enum print_line_t
 trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
 	&trace_wake_event,
 	&trace_stack_event,
 	&trace_user_stack_event,
+	&trace_stack_id_event,
 	&trace_bputs_event,
 	&trace_bprint_event,
 	&trace_print_event,
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 929c84075315..0c97065b0d68 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -14,6 +14,7 @@ static inline int trace_valid_entry(struct trace_entry *entry)
 	case TRACE_CTX:
 	case TRACE_WAKE:
 	case TRACE_STACK:
+	case TRACE_STACK_ID:
 	case TRACE_PRINT:
 	case TRACE_BRANCH:
 	case TRACE_GRAPH_ENT:
-- 
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
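The commit message of patch 2/3 states that TRACE_ITER_STACKMAP_BIT is defined as -1 when CONFIG_FTRACE_STACKMAP is disabled, so that TRACE_ITER(STACKMAP) evaluates to 0. The kernel's actual TRACE_ITER() definition is not shown in this thread, so the following C sketch only illustrates the idea under that assumption. The demo_ names are hypothetical stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit indices; a negative index means "option compiled out". */
enum {
	DEMO_STACKTRACE_BIT = 3,
	DEMO_STACKMAP_BIT   = -1,
};

/* Map a bit index to a flag mask; a negative index yields an empty mask. */
static inline uint64_t demo_iter(int bit)
{
	assert(bit < 64);	/* a 64-bit flags word has no bit 64 or above */
	return bit < 0 ? 0 : 1ULL << bit;
}

/* Test an option against a flags word; never true for a compiled-out bit. */
static inline int demo_option_enabled(uint64_t flags, int bit)
{
	return (flags & demo_iter(bit)) != 0;
}
```

With an empty mask, a test such as the tr->trace_flags check in __ftrace_trace_stack() can never be true, and OR-ing the mask into a group like TOP_LEVEL_TRACE_FLAGS changes nothing. That appears to be why the flag can be referenced outside an #ifdef.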
* [RFC PATCH v3 3/3] trace: add documentation, selftest and tooling for stackmap
2026-05-26 11:52 ` [RFC PATCH v3 " Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei
2026-05-26 11:52 ` [RFC PATCH v3 2/3] trace: integrate stackmap into ftrace stack recording path Li Pengfei
@ 2026-05-26 11:52 ` Li Pengfei
2026-05-26 19:39 ` [RFC PATCH v3 0/3] trace: stack trace deduplication for ftrace ring buffer Steven Rostedt
3 siblings, 0 replies; 16+ messages in thread
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
To: mhiramat, rostedt
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li, kernel test robot
From: Pengfei Li <lipengfei28@xiaomi.com>

Add supporting files for the ftrace stackmap feature:

Documentation/trace/ftrace-stackmap.rst:
  Documentation covering design, usage, tracefs interface, binary
  format, and performance characteristics. Added to the 'Core Tracing
  Frameworks' toctree in Documentation/trace/index.rst. Documents:
  - Reset requires tracing to be stopped first
  - Boot-time activation via trace_options=stackmap
  - bits parameter range [10, 18] and worst-case memory usage
  - tracefs file modes (0640 / 0440)
  - Best-effort snapshot semantics for stack_map_bin
  - Counter naming: successes (events served), drops, success_rate
  - Gravestone amplification when the pool is exhausted

tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
  Functional selftest verifying:
  - stackmap tracefs nodes exist
  - enabling stackmap + stacktrace produces stack_id events
  - stack_map_stat shows non-zero successes and zero drops
  - reset clears entries when tracing is stopped
  - reset is rejected (-EBUSY) while tracing is active

  Test reads trace contents BEFORE switching back to the nop tracer
  (tracer_init() unconditionally calls tracing_reset_online_cpus(),
  which would empty the ring buffer).
  The function:tracer dependency is declared in '# requires:' so
  ftracetest skips on kernels without CONFIG_FUNCTION_TRACER instead
  of failing spuriously. An EXIT trap restores options/stackmap and
  options/stacktrace on any exit path.

tools/tracing/stackmap_dump.py:
  Python script to parse the binary stack_map_bin export. Features:
  - Automatic endianness detection via magic number
  - Batched addr2line via stdin (avoids ARG_MAX with large stacks)
  - JSON output mode
  - Top-N filtering by ref_count

Binary format: all fields are native-endian. The parser detects byte
order by reading the magic value (0x464D5342 = 'FSMB').

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605160010.fakzGVVq-lkp@intel.com/
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 Documentation/trace/ftrace-stackmap.rst       | 162 ++++++++++++++++++
 Documentation/trace/index.rst                 |   1 +
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 103 +++++++++++
 .../test.d/ftrace/stackmap-instance-gate.tc   |  42 +++++
 tools/tracing/stackmap_dump.py                | 150 ++++++++++++++++
 5 files changed, 458 insertions(+)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100755 tools/tracing/stackmap_dump.py

diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..191347be3664
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li <lipengfei28@xiaomi.com>
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks
+  (default: 14 → 16384 stacks; valid range: 10-18).
+
+  At ``bits=18`` the kernel reserves roughly 130 MB of vmalloc memory
+  for the element pool. Each ``open()`` of ``stack_map_bin`` may
+  briefly allocate a similar amount for a snapshot. The cap is set
+  intentionally to bound memory usage.
+
+Usage
+=====
+
+Enable stack deduplication::
+
+  echo 1 > /sys/kernel/debug/tracing/options/stackmap
+  echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+  echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ``<stack_id N>`` instead of full stack traces::
+
+  sh-1234 [006] d.h.. 123.456789: <stack_id 42>
+
+To view the actual stacks::
+
+  cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+  stack_id 42 [ref 1337, depth 8]
+    [0] schedule+0x48/0xc0
+    [1] schedule_timeout+0x1c/0x30
+    ...
+
+To view statistics::
+
+  cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+  entries: 2500 / 16384
+  table_size: 32768
+  successes: 148923
+  drops: 0
+  success_rate: 100%
+
+To reset the stack map (tracing must be stopped first)::
+
+  echo 0 > /sys/kernel/debug/tracing/tracing_on
+  echo 0 > /sys/kernel/debug/tracing/stack_map
+
+Reset returns ``-EBUSY`` if tracing is currently active, or if another
+reset is already in progress.
+
+Boot-time activation
+====================
+
+The stackmap option can be enabled from the kernel command line::
+
+  trace_options=stackmap,stacktrace
+
+Trace events that fire before the tracefs filesystem is initialized
+(``fs_initcall`` time) fall back to recording full stack traces; once
+``ftrace_stackmap_create()`` runs, subsequent events are deduplicated.
+The crossover is automatic and lossless — no events are dropped, but
+early-boot stacks recorded before the crossover are not deduplicated.
+
+Tracefs Nodes
+=============
+
+The stack_map files are owned by root and not world-readable
+(``stack_map``: 0640; ``stack_map_stat`` and ``stack_map_bin``: 0440).
+
+``stack_map``
+    Text export of all deduplicated stacks with symbol resolution.
+    Writing ``0`` or ``reset`` clears all entries (only when tracing
+    is stopped).
+
+``stack_map_stat``
+    Statistics: entries (allocated unique stacks), table_size,
+    successes (events served), drops (events that fell back to
+    full-stack recording), and success_rate. Drops accumulate when
+    the element pool is exhausted; once that happens, slots that
+    won the cmpxchg but failed to allocate an element remain
+    "claimed but empty" and increase probe pressure for any future
+    insert hashing to the same bucket. Reset (when tracing is
+    stopped) clears these gravestones.
+
+``stack_map_bin``
+    Binary export for efficient userspace consumption. Format:
+
+    - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+    - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+    All fields are written in the kernel's native byte order.
+    Userspace tools detect endianness by reading the magic value.
+    Magic: ``0x464D5342`` ('FSMB'), Version: 2.
+
+    The export is a best-effort snapshot allocated at ``open()``;
+    concurrent inserts during the snapshot may be truncated. A
+    bounds check ensures no overflow.
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr. Cliff Click's non-blocking hash table
+algorithm:
+
+- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context
+- **Memory**: Pre-allocated element pool, zero allocation on the hot path
+  (no GFP_ATOMIC failures under memory pressure)
+- **Collision**: Linear probing with a 2x over-provisioned table; probe
+  length is bounded so worst-case insert/lookup is O(1)
+- **Scope**: Currently supports the global trace instance
+- **Hash**: 32-bit jhash with a per-instance random seed; full ``memcmp``
+  confirms matches
+
+Deduplication is best-effort, not strict: if two CPUs race in the
+insert path with the same ``key_hash`` (i.e. the same stack), the
+``cmpxchg`` loser advances by one slot and may insert the same stack
+again. Under heavy contention this can produce a small number of
+duplicate entries for the same stack; ``ref_count`` is then split
+across the duplicates. Total memory is still bounded by the element
+pool size, and lookup correctness is unaffected (each duplicate is
+a self-consistent entry with its own ``stack_id``). The trade-off is
+intentional and keeps the hot path lock-free.
+
+Performance
+===========
+
+Typical results on an aarch64 SMP system (function tracer, 2 seconds):
+
+- Unique stacks: ~3000
+- Dedup rate: 84-98% (depends on workload diversity)
+- Ring buffer savings: ~80% for stack data
+- Overhead per event: ~50ns (one jhash + hash table lookup)
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5d9bf4694d5d..ac8b1141c23a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -33,6 +33,7 @@ the Linux kernel.
    ftrace
    ftrace-design
    ftrace-uses
+   ftrace-stackmap
    kprobes
    kprobetrace
    fprobetrace
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
new file mode 100644
index 000000000000..18fa998ae460
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
@@ -0,0 +1,103 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap basic functionality
+# requires: stack_map options/stackmap function:tracer
+
+# Test that ftrace stackmap deduplication works:
+#   1. Enable stackmap + stacktrace options
+#   2. Run function tracer briefly
+#   3. Verify trace contains <stack_id> events (read BEFORE switching
+#      tracer back to nop, since tracer_init() resets the ring buffer)
+#   4. Verify stack_map has entries and zero drops
+#   5. Verify reset is rejected (-EBUSY) while tracing is active
+#   6. Verify reset clears the map when tracing is stopped
+
+fail() {
+	echo "FAIL: $1"
+	exit_fail
+}
+
+# Restore state on any exit (success, fail, or interrupt) so a
+# half-finished test does not leave stacktrace/stackmap enabled.
+cleanup() {
+	disable_tracing 2>/dev/null
+	echo nop > current_tracer 2>/dev/null
+	echo 0 > options/stackmap 2>/dev/null
+	echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+
+# Read trace contents NOW, before switching tracer back to nop.
+# tracer_init() unconditionally calls tracing_reset_online_cpus(),
+# so the ring buffer would be empty after 'echo nop > current_tracer'.
+count=$(grep -c "<stack_id" trace || true)
+: "${count:=0}"
+if [ "$count" -eq 0 ]; then
+	fail "trace has no <stack_id> events"
+fi
+
+# Now safe to switch back and disable options
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries:=0}"
+if [ "$entries" -eq 0 ]; then
+	fail "stackmap has zero entries after tracing"
+fi
+
+successes=$(cat stack_map_stat | grep "^successes:" | awk '{print $2}')
+: "${successes:=0}"
+if [ "$successes" -eq 0 ]; then
+	fail "stackmap has zero successes"
+fi
+
+drops=$(cat stack_map_stat | grep "^drops:" | awk '{print $2}')
+: "${drops:=0}"
+if [ "$drops" -ne 0 ]; then
+	fail "stackmap had $drops drops (pool exhausted?)"
+fi
+
+# Check stack_map text output is parseable
+first_id=$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}')
+if [ -z "$first_id" ]; then
+	fail "stack_map output has no stack_id entries"
+fi
+
+# Test that reset is rejected while tracing is active
+enable_tracing
+if echo 0 > stack_map 2>/dev/null; then
+	disable_tracing
+	fail "stackmap reset should fail while tracing is active"
+fi
+disable_tracing
+
+# Test reset works when tracing is stopped
+echo 0 > stack_map
+entries_after=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries_after:=-1}"
+if [ "$entries_after" -ne 0 ]; then
+	fail "stackmap reset did not clear entries (got $entries_after)"
+fi
+
+echo "stackmap basic test passed: $entries unique stacks, $successes successes, $drops drops"
+exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
new file mode 100644
index 000000000000..49848eac2624
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
@@ -0,0 +1,42 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap option is gated to the top-level trace instance
+# requires: stack_map options/stackmap instances
+
+# The 'stackmap' option is added to TOP_LEVEL_TRACE_FLAGS, matching the
+# convention used for global-only options like 'printk' and 'record-cmd'.
+# Verify that:
+#   1. The global instance exposes options/stackmap and the stack_map* nodes.
+#   2. A newly created secondary instance under instances/ does NOT expose
+#      options/stackmap or stack_map* nodes.
+
+fail() {
+	echo "FAIL: $1"
+	rmdir instances/test_stackmap_gate 2>/dev/null
+	exit_fail
+}
+
+# 1. Global instance must expose the option and the nodes
+test -e options/stackmap || fail "options/stackmap missing on global instance"
+test -e stack_map || fail "stack_map missing on global instance"
+test -e stack_map_stat || fail "stack_map_stat missing on global instance"
+test -e stack_map_bin || fail "stack_map_bin missing on global instance"
+
+# 2. Create a secondary instance and verify it does NOT see the option
+#    or the stack_map* nodes.
+mkdir instances/test_stackmap_gate || fail "could not create secondary instance"
+
+if [ -e instances/test_stackmap_gate/options/stackmap ]; then
+	fail "secondary instance unexpectedly exposes options/stackmap"
+fi
+
+for f in stack_map stack_map_stat stack_map_bin; do
+	if [ -e instances/test_stackmap_gate/$f ]; then
+		fail "secondary instance unexpectedly has $f"
+	fi
+done
+
+rmdir instances/test_stackmap_gate || fail "could not remove secondary instance"
+
+echo "stackmap option gating to top-level instance works"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..fcd8ddcd97de
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,150 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+    # Pull from device and parse
+    adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+    python3 stackmap_dump.py /tmp/stack_map.bin
+
+    # With vmlinux for offline symbol resolution
+    python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+    # JSON output for tooling
+    python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x464D5342  # 'FSMB'
+HEADER_SIZE = 16    # 4 x u32
+ENTRY_SIZE = 16     # 4 x u32
+
+
+def detect_endianness(data):
+    """Detect byte order from magic number in header."""
+    if len(data) < 4:
+        raise ValueError("File too small")
+    magic_le = struct.unpack_from('<I', data, 0)[0]
+    if magic_le == MAGIC:
+        return '<'
+    magic_be = struct.unpack_from('>I', data, 0)[0]
+    if magic_be == MAGIC:
+        return '>'
+    raise ValueError(f"Bad magic: 0x{magic_le:08x} (neither LE nor BE)")
+
+
+def batch_addr2line(vmlinux, addrs):
+    """Resolve multiple addresses in one addr2line invocation."""
+    if not addrs:
+        return {}
+    try:
+        # Feed addresses on stdin to avoid ARG_MAX limits with large
+        # numbers of addresses (one stack can have 30+ frames; a
+        # snapshot can have thousands of unique stacks).
+        stdin = '\n'.join(hex(a) for a in addrs) + '\n'
+        result = subprocess.run(
+            ['addr2line', '-f', '-e', vmlinux],
+            input=stdin, capture_output=True, text=True, timeout=60
+        )
+        lines = result.stdout.split('\n')
+        # addr2line outputs 2 lines per address: function name + source location
+        symbols = {}
+        for i, addr in enumerate(addrs):
+            idx = i * 2
+            if idx < len(lines) and lines[idx] and lines[idx] != '??':
+                symbols[addr] = lines[idx]
+        return symbols
+    except (subprocess.TimeoutExpired, FileNotFoundError) as e:
+        print(f"warning: addr2line failed: {e}", file=sys.stderr)
+        return {}
+
+
+def parse_stackmap_bin(data):
+    """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+    if len(data) < HEADER_SIZE:
+        raise ValueError("File too small for header")
+
+    endian = detect_endianness(data)
+    header_fmt = f'{endian}IIII'
+    entry_fmt = f'{endian}IIII'
+
+    magic, version, nr_stacks, _ = struct.unpack_from(header_fmt, data, 0)
+    if version != 2:
+        raise ValueError(f"Unsupported version: {version}")
+
+    offset = HEADER_SIZE
+    for _ in range(nr_stacks):
+        if offset + ENTRY_SIZE > len(data):
+            break
+        stack_id, nr, ref_count, _ = struct.unpack_from(entry_fmt, data, offset)
+        offset += ENTRY_SIZE
+
+        ips_size = nr * 8
+        if offset + ips_size > len(data):
+            break
+        ips = struct.unpack_from(f'{endian}{nr}Q', data, offset)
+        offset += ips_size
+
+        yield stack_id, ref_count, list(ips)
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+    parser.add_argument('file', help='Path to stack_map_bin file')
+    parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+    parser.add_argument('--json', action='store_true', help='JSON output')
+    parser.add_argument('--top', type=int, default=0,
+                        help='Show only top N stacks by ref_count')
+    args = parser.parse_args()
+
+    with open(args.file, 'rb') as f:
+        data = f.read()
+
+    stacks = list(parse_stackmap_bin(data))
+
+    if args.top > 0:
+        stacks.sort(key=lambda x: x[1], reverse=True)
+        stacks = stacks[:args.top]
+
+    # Batch symbol resolution
+    symbols = {}
+    if args.vmlinux:
+        all_addrs = set()
+        for _, _, ips in stacks:
+            all_addrs.update(ips)
+        symbols = batch_addr2line(args.vmlinux, list(all_addrs))
+
+    if args.json:
+        output = []
+        for stack_id, ref_count, ips in stacks:
+            entry = {
+                'stack_id': stack_id,
+                'ref_count': ref_count,
+                'ips': [f'0x{ip:x}' for ip in ips]
+            }
+            if args.vmlinux:
+                entry['symbols'] = [symbols.get(ip, f'0x{ip:x}')
+                                    for ip in ips]
+            output.append(entry)
+        print(json.dumps(output, indent=2))
+    else:
+        for stack_id, ref_count, ips in stacks:
+            print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+            for i, ip in enumerate(ips):
+                sym = symbols.get(ip, '')
+                if sym:
+                    sym = f' {sym}'
+                print(f" [{i}] 0x{ip:x}{sym}")
+            print()
+
+    print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+    main()
--
2.34.1

^ permalink raw reply related [flat|nested] 16+ messages in thread
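
Editorial sketch, not part of the patch: the ``stack_map_bin`` layout described in the documentation above (16-byte header, then a 16-byte record header plus ``nr`` u64 addresses per stack) can be exercised without a device by packing a synthetic snapshot and reading it back. The helper names ``build_snapshot`` and ``parse_snapshot`` and the sample addresses are hypothetical and chosen only for illustration; the magic, version, and field order are taken from the documentation text.

```python
import struct

MAGIC = 0x464D5342   # 'FSMB', as stated in the documentation
VERSION = 2          # format version stated in the documentation
HEADER_SIZE = 16     # magic, version, nr_stacks, reserved (4 x u32)
ENTRY_SIZE = 16      # stack_id, nr, ref_count, reserved (4 x u32)


def build_snapshot(stacks, endian='<'):
    """Pack [(stack_id, ref_count, [ips])] into the documented layout.

    Hypothetical helper: it only mimics what the kernel is described
    as exporting, so tools can be tested against known input.
    """
    chunks = [struct.pack(f'{endian}IIII', MAGIC, VERSION, len(stacks), 0)]
    for stack_id, ref_count, ips in stacks:
        chunks.append(struct.pack(f'{endian}IIII',
                                  stack_id, len(ips), ref_count, 0))
        chunks.append(struct.pack(f'{endian}{len(ips)}Q', *ips))
    return b''.join(chunks)


def parse_snapshot(data):
    """Minimal reader: detect byte order from the magic, then walk records."""
    for endian in ('<', '>'):
        if struct.unpack_from(f'{endian}I', data, 0)[0] == MAGIC:
            break
    else:
        raise ValueError("bad magic")

    _magic, version, nr_stacks, _rsvd = struct.unpack_from(
        f'{endian}IIII', data, 0)
    if version != VERSION:
        raise ValueError(f"unsupported version: {version}")

    offset = HEADER_SIZE
    stacks = []
    for _ in range(nr_stacks):
        stack_id, nr, ref_count, _rsvd = struct.unpack_from(
            f'{endian}IIII', data, offset)
        offset += ENTRY_SIZE
        ips = struct.unpack_from(f'{endian}{nr}Q', data, offset)
        offset += 8 * nr
        stacks.append((stack_id, ref_count, list(ips)))
    return stacks


# Made-up sample data: two stacks with invented kernel-looking addresses.
sample = [
    (42, 1337, [0xffff800010001000, 0xffff800010002000]),
    (7, 1, [0xffff800010003000]),
]
blob = build_snapshot(sample)
```

Writing ``blob`` to a file gives an input that a reader of this format can be pointed at; packing with ``endian='>'`` produces the big-endian variant for checking the magic-based byte-order detection.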
* Re: [RFC PATCH v3 0/3] trace: stack trace deduplication for ftrace ring buffer
2026-05-26 11:52 ` [RFC PATCH v3 " Li Pengfei
` (2 preceding siblings ...)
2026-05-26 11:52 ` [RFC PATCH v3 3/3] trace: add documentation, selftest and tooling for stackmap Li Pengfei
@ 2026-05-26 19:39 ` Steven Rostedt
3 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2026-05-26 19:39 UTC (permalink / raw)
To: Li Pengfei
Cc: mhiramat, linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li

Please DO NOT SEND new versions of a patch or patch series as a reply to
the old one. It makes it extremely difficult for maintainers to manage the
replies and patches. A new version should ALWAYS start a new email thread!

On Tue, 26 May 2026 19:52:42 +0800
Li Pengfei <ljdlns1987@gmail.com> wrote:

> From: Pengfei Li <lipengfei28@xiaomi.com>
>
> Hi Masami, Steven, all,
>
> This is v3 of the ftrace stackmap series. It addresses the Sashiko
> review on v2 [1] that Masami pointed out.
>
> [1] https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com
>
> The series adds stack trace deduplication to ftrace. When the
> stacktrace option is enabled, the ring buffer stores a 4-byte
> stack_id instead of a full kernel stack trace, while the full
> stacks are exported via tracefs.
>
> Rebased onto v7.1-rc5 (e8c2f9fdadee) before sending.
>
> Changes since v2
> ================

Then you can use lore to add a link to the old version via the Message-ID
of the old version. You can have the above as:

Changes since v2: https://lore.kernel.org/all/20260522104017.1668638-1-lipengfei28@xiaomi.com/

-- Steve

^ permalink raw reply [flat|nested] 16+ messages in thread