From: Li Pengfei <ljdlns1987@gmail.com>
To: mhiramat@kernel.org, rostedt@goodmis.org
Cc: linux-trace-kernel@vger.kernel.org, linux-kernel@vger.kernel.org,
cmllamas@google.com, zhangbo56@xiaomi.com,
Pengfei Li <lipengfei28@xiaomi.com>
Subject: [RFC PATCH v3 0/3] trace: stack trace deduplication for ftrace ring buffer
Date: Tue, 26 May 2026 19:52:42 +0800
Message-ID: <cover.1779769138.git.lipengfei28@xiaomi.com>
In-Reply-To: <20260514034916.2162517-1-lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Hi Masami, Steven, all,
This is v3 of the ftrace stackmap series. It addresses the Sashiko
review on v2 [1] that Masami pointed out.
[1] https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com
The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, while the full
stacks are exported via tracefs.
Rebased onto v7.1-rc5 (e8c2f9fdadee) before sending.
Changes since v2
================
Patch 1 (lock-free stackmap):
- Hot-path counters changed from atomic64_t to per-CPU local_t.
This avoids the raw_spinlock_t fallback that atomic64_t uses on
32-bit GENERIC_ATOMIC64, which would deadlock from NMI context.
- reset() now serializes against tracefs readers via an
rw_semaphore (held for write during the clearing memset, held
for read by seq_file iteration and bin snapshot construction).
synchronize_rcu() alone was insufficient because seq_file/bin
readers are in process context, not preempt-disabled.
- get_id() uses atomic_read_acquire() on smap->resetting so
subsequent loads of entry->key/val are properly ordered after
the check (an LKMM control dependency orders only later stores,
not later loads).
- All plain reads of entry->key now use READ_ONCE() to avoid
LKMM data races with the cmpxchg writer.
- val->nr in the hot path now uses READ_ONCE() to keep style
consistent with the seq_show / bin_open readers.
- stackmap_seq_next() now updates *pos past map_size on EOF so
seq_read() terminates instead of looping on the last element.
- Added a comment in the cmpxchg-claim path documenting that
two CPUs racing with the same key_hash may produce a small
number of duplicate entries; this is an accepted trade-off
for keeping the hot path lock-free.
- Removed BUG_ON in create path (the constraint is satisfied by
construction; no runtime check needed).
Patch 2 (integration):
- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and
ZEROED_TRACE_FLAGS so the option is only exposed under the
top-level trace instance, matching the convention used for
other global-only options such as 'printk' and 'record-cmd'.
Secondary instances under tracing/instances/*/ no longer see
the option at all, instead of seeing it as a silent no-op.
- TRACE_STACK_ID added to trace_valid_entry() in trace_selftest.c
so ftrace startup selftests don't reject the entry type.
- Corrected a comment about how global_trace.stackmap is
zero-initialized (BSS, not kzalloc).
Patch 3 (docs / selftest / tooling):
- Selftest now reads trace contents BEFORE switching back to the
nop tracer (tracer_init() calls tracing_reset_online_cpus()
which would have emptied the ring buffer).
- Added 'function:tracer' to the selftest '# requires:' line so
ftracetest skips when CONFIG_FUNCTION_TRACER is disabled
instead of failing spuriously.
- Selftest grep tightened to '<stack_id' to avoid future
false positives if any other tracepoint name contains
"stack_id".
- New stackmap-instance-gate.tc selftest asserts the option and
stack_map* nodes are present on the global instance and absent
on a freshly-created secondary instance, locking in the
TOP_LEVEL_TRACE_FLAGS gating behavior introduced in patch 2.
- Documentation Performance section made vendor-neutral
("aarch64 SMP system" instead of a specific device name) and
the term "Hit rate" replaced with "Dedup rate" to match the
actual stat field name (success_rate).
- Documentation Design section now states that deduplication is
best-effort under heavy contention (cmpxchg races may produce
a small number of duplicate entries for the same stack), so
users observing entries > unique-stacks have a documented
explanation.
Test results
============
Device: Xiaomi SM8850 (ARM64), Android 16, kernel 6.12 (OGKI)
Config: CONFIG_FTRACE_STACKMAP=y, bits=14 (16384 elts, 32768 slots)
Method: 5-second capture with stacktrace trigger
Functional tests (all PASS):
- tracefs nodes (stack_map / stack_map_stat / stack_map_bin) exist
- options/stackmap writable, trace shows <stack_id N>
- stack_map text export with correct symbols
- reset clears entries when tracing stopped
- reset rejected (-EBUSY) while tracing active
- per-event trigger: only specified events get stacks
Performance (sched_switch, 5s):
entries: 466 / 16384
successes: 9159
drops: 0
success_rate: 100%
dedup rate: 95.2% (466 unique stacks / 9625 total events)
Performance (kmem_cache_alloc, 5s):
entries: 1177 / 16384
successes: 60078
drops: 0
success_rate: 100%
dedup rate: 98.1% (1177 unique stacks / 61255 total events)
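The dedup rates above can be reproduced from the counters, assuming every event is either a first insertion (counted in entries) or a hit on an existing entry (counted in successes), which is what the quoted totals imply. A minimal Python sketch of that arithmetic; the function name dedup_rate is illustrative and not part of the series:

```python
def dedup_rate(entries: int, successes: int) -> float:
    """Percentage of events that reused an already-stored stack.

    Assumes total events = entries (first insertions) + successes (hits),
    which matches the totals quoted in the results above.
    """
    total = entries + successes
    return 100.0 * successes / total if total else 0.0


# Counter values copied from the two 5-second captures above.
for name, entries, successes in [
    ("sched_switch", 466, 9159),
    ("kmem_cache_alloc", 1177, 60078),
]:
    rate = dedup_rate(entries, successes)
    print(f"{name}: {entries + successes} events, dedup rate {rate:.1f}%")
```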
Ring buffer space savings:
Event             Full stack           Stackmap                      Saving
----------------  -------------------  ----------------------------  ------
sched_switch      9625 × 88B = 847KB   12B×9625 + 88B×466 = 156KB    82%
kmem_cache_alloc  61255 × 88B = 5.4MB  12B×61255 + 88B×1177 = 839KB  84%
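The table's formula can be written out as a short Python sketch. The 88-byte full-stack cost and the 12-byte stack_id record cost are taken from the table; the function and constant names are illustrative only:

```python
FULL_STACK_BYTES = 88   # per-event cost of a full stack, as used in the table
STACK_ID_BYTES = 12     # per-event cost of a stack_id record, as used in the table


def ring_buffer_saving(events: int, unique_stacks: int):
    """Return (full_bytes, stackmap_bytes, saving_percent)."""
    full = events * FULL_STACK_BYTES
    # Every event stores a stack_id; each unique stack is stored once in full.
    deduped = events * STACK_ID_BYTES + unique_stacks * FULL_STACK_BYTES
    saving = 100.0 * (1 - deduped / full)
    return full, deduped, saving


# Event and unique-stack counts copied from the performance results above.
for name, events, unique in [
    ("sched_switch", 9625, 466),
    ("kmem_cache_alloc", 61255, 1177),
]:
    full, deduped, saving = ring_buffer_saving(events, unique)
    print(f"{name}: {full} B full, {deduped} B with stackmap, {saving:.1f}% saved")
```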
QEMU validation (v3 base: v7.1-rc5)
===================================
The series boots cleanly on aarch64 QEMU. A post-init smoke test
(12/12 PASS) verified all functional behaviors including:
- tracefs nodes appear with correct file modes
- stack_id events emitted, kernel symbols resolve correctly
(e.g. __schedule+0x7cc/0x1138)
- reset rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- per-CPU local_t counters aggregate correctly across CPUs
- stack_map_bin magic correct (0x464D5342 'FSMB')
- 'stackmap' option visible on the global instance, hidden on
secondary instances under tracing/instances/*/
Boot-time activation via 'trace_options=stackmap,stacktrace' works:
events that fire before stackmap initialization fall back to
recording full stack traces; later events are deduplicated. No
events are dropped due to the transition.
Known limitations
=================
- Per-instance stackmap support is not included in this series.
Following the convention used for other global-only options
(PRINTK, RECORD_CMD), the 'stackmap' option is gated to the
top-level trace instance via TOP_LEVEL_TRACE_FLAGS, so it is
not exposed under tracing/instances/*/options/. Per-instance
maps would be a follow-up.
- The element pool is allocated eagerly at fs_initcall when
CONFIG_FTRACE_STACKMAP=y, regardless of whether userspace will
ever enable the option. At the default bits=14 this is roughly
8 MB of vmalloc; at the maximum bits=18, ~135 MB. The eager
allocation keeps the hot path entirely allocation-free and
avoids any allocation-failure path under tracing pressure.
Lazy allocation on first 'echo 1 > options/stackmap' is a
reasonable follow-up if maintainers prefer that trade-off.
- Deduplication is best-effort, not strict: under heavy
concurrent contention two CPUs racing in the insert path with
the same stack hash may each succeed in claiming a different
slot, producing a small number of duplicate entries for the
same stack. ref_count is then split across the duplicates.
This is intentional: it keeps the hot path lock-free and
bounds memory by the element pool size.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once
the binary format settles.
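To illustrate the best-effort deduplication limitation above, here is a deterministic Python simulation of one interleaving that would leave two entries for the same stack. It is a sketch under assumptions, not the code in patch 1: it assumes an open-addressed table where a slot's key is claimed first and the stack is published afterwards, and where a CPU that finds a claimed but not yet published slot simply keeps probing. Slot, insert and the sample addresses are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    key: int = 0                     # 0 means empty
    stack: Optional[tuple] = None    # published only after the key is claimed
    ref_count: int = 0


def insert(table: list, key_hash: int, stack: tuple):
    """Generator-based insert; each yield is a point where another CPU may run."""
    idx = key_hash % len(table)
    for _ in range(len(table)):
        slot = table[idx]
        if slot.key == key_hash and slot.stack == stack:
            slot.ref_count += 1      # hit on a fully published entry
            return
        if slot.key == 0:
            slot.key = key_hash      # stands in for cmpxchg(&key, 0, key_hash)
            yield                    # window: key claimed, stack not yet visible
            slot.stack = stack
            slot.ref_count = 1
            return
        idx = (idx + 1) % len(table)  # occupied or unpublished: keep probing


table = [Slot() for _ in range(8)]
stack = (0xFFFF0001, 0xFFFF0002)     # made-up return addresses
key_hash = 0x1234                    # made-up non-zero hash of that stack

cpu_a = insert(table, key_hash, stack)
cpu_b = insert(table, key_hash, stack)
next(cpu_a)          # A claims a slot and pauses before publishing
next(cpu_b)          # B sees the unpublished slot, claims the next one
next(cpu_a, None)    # A publishes its entry
next(cpu_b, None)    # B publishes a duplicate entry

for _ in insert(table, key_hash, stack):
    pass             # a later, uncontended insert hits the first entry

duplicates = [s for s in table if s.stack == stack]
print(len(duplicates), [s.ref_count for s in duplicates])
```

In this run the same stack occupies two slots and its three events are split across them as ref_count 2 and 1, which is the "entries > unique stacks" effect described in the limitation.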
Usage
=====
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
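The same two writes can be scripted, for example from a test harness. A minimal Python sketch; enable_stackmap and its tracefs parameter are illustrative names, and on a real system the writes need root and a kernel built with CONFIG_FTRACE_STACKMAP=y:

```python
from pathlib import Path


def enable_stackmap(tracefs: str = "/sys/kernel/debug/tracing") -> None:
    """Write 1 to options/stackmap and options/stacktrace under tracefs."""
    options = Path(tracefs) / "options"
    # Same order as the shell commands above: stackmap first, then stacktrace.
    for name in ("stackmap", "stacktrace"):
        (options / name).write_text("1\n")


# Example call (not executed here because it touches the live tracefs):
# enable_stackmap()
```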
Pengfei Li (3):
trace: add lock-free stackmap for stack trace deduplication
trace: integrate stackmap into ftrace stack recording path
trace: add documentation, selftest and tooling for stackmap
Documentation/trace/ftrace-stackmap.rst | 162 ++++
Documentation/trace/index.rst | 1 +
kernel/trace/Kconfig | 22 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 78 +-
kernel/trace/trace.h | 16 +
kernel/trace/trace_entries.h | 15 +
kernel/trace/trace_output.c | 23 +
kernel/trace/trace_selftest.c | 1 +
kernel/trace/trace_stackmap.c | 780 ++++++++++++++++++
kernel/trace/trace_stackmap.h | 57 ++
.../ftrace/test.d/ftrace/stackmap-basic.tc | 103 +++
.../test.d/ftrace/stackmap-instance-gate.tc | 42 +
tools/tracing/stackmap_dump.py | 150 ++++
14 files changed, 1449 insertions(+), 2 deletions(-)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
create mode 100755 tools/tracing/stackmap_dump.py
--
2.34.1