From: Li Pengfei <ljdlns1987@gmail.com>
To: mhiramat@kernel.org, rostedt@goodmis.org
Cc: linux-trace-kernel@vger.kernel.org, linux-kernel@vger.kernel.org,
	cmllamas@google.com, zhangbo56@xiaomi.com,
	Pengfei Li <lipengfei28@xiaomi.com>
Subject: [RFC PATCH v3 0/3] trace: stack trace deduplication for ftrace ring buffer
Date: Tue, 26 May 2026 19:52:42 +0800
Message-ID: <cover.1779769138.git.lipengfei28@xiaomi.com>
In-Reply-To: <20260514034916.2162517-1-lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Hi Masami, Steven, all,

This is v3 of the ftrace stackmap series. It addresses the Sashiko
review on v2 [1] that Masami pointed out.

[1] https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com

The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, while the full
stacks are exported via tracefs.

Rebased onto v7.1-rc5 (e8c2f9fdadee) before sending.

Changes since v2
================

Patch 1 (lock-free stackmap):
  - Hot-path counters changed from atomic64_t to per-CPU local_t.
    This avoids the raw_spinlock_t fallback that atomic64_t uses on
    32-bit GENERIC_ATOMIC64, which would deadlock from NMI context.
  - reset() now serializes against tracefs readers via an
    rw_semaphore (held for write during the clearing memset, held
    for read by seq_file iteration and bin snapshot construction).
    synchronize_rcu() alone was insufficient because seq_file/bin
    readers are in process context, not preempt-disabled.
  - get_id() uses atomic_read_acquire() on smap->resetting so
    subsequent loads of entry->key/val are properly ordered after
    the check (an LKMM control dependency orders the check only
    against later stores, not against later loads).
  - All plain reads of entry->key now use READ_ONCE() to avoid
    LKMM data races with the cmpxchg writer.
  - val->nr in the hot path now uses READ_ONCE() to keep style
    consistent with the seq_show / bin_open readers.
  - stackmap_seq_next() now updates *pos past map_size on EOF so
    seq_read() terminates instead of looping on the last element.
  - Added a comment in the cmpxchg-claim path documenting that
    two CPUs racing with the same key_hash may produce a small
    number of duplicate entries; this is an accepted trade-off
    for keeping the hot path lock-free.
  - Removed BUG_ON in create path (the constraint is satisfied by
    construction; no runtime check needed).
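
To illustrate the accepted duplicate-entry trade-off, here is a toy
userspace model, not the kernel code. It assumes an open-addressing table
with linear probing, and a claim step that moves to the next slot when its
cmpxchg loses, without re-checking the winner's key. The class and method
names are hypothetical, and the patch may produce the race through a
different interleaving.

```python
# Toy model of a cmpxchg-claimed open-addressing table (NOT the kernel code).
EMPTY = 0


class ToyStackmap:
    def __init__(self, slots):
        self.keys = [EMPTY] * slots

    def lookup(self, key_hash):
        """Linear probe; return the slot holding key_hash, or None on a miss."""
        n = len(self.keys)
        for i in range(n):
            idx = (key_hash + i) % n
            if self.keys[idx] == key_hash:
                return idx
            if self.keys[idx] == EMPTY:
                return None
        return None

    def cmpxchg(self, idx, old, new):
        """Model of an atomic compare-and-exchange; returns the prior value."""
        prev = self.keys[idx]
        if prev == old:
            self.keys[idx] = new
        return prev

    def claim_from(self, idx, key_hash):
        """Claim the first free slot at or after idx.

        Assumed behaviour: a failed cmpxchg just advances to the next slot
        and does not compare the winner's key against key_hash.
        """
        n = len(self.keys)
        for i in range(n):
            cur = (idx + i) % n
            if self.cmpxchg(cur, EMPTY, key_hash) == EMPTY:
                return cur
        return None


table = ToyStackmap(slots=8)
key_hash = 3

# Both "CPUs" look up the same stack before either has inserted it.
miss_a = table.lookup(key_hash)
miss_b = table.lookup(key_hash)

# Each then runs the claim path; the loser of slot 3 takes slot 4.
slot_a = table.claim_from(key_hash, key_hash)
slot_b = table.claim_from(key_hash, key_hash)

print(miss_a, miss_b, slot_a, slot_b, table.keys.count(key_hash))
```

In this model the same key_hash ends up in two slots, so later hits may
land on either copy and a per-entry reference count would be split between
them.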

Patch 2 (integration):
  - 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and
    ZEROED_TRACE_FLAGS so the option is only exposed under the
    top-level trace instance, matching the convention used for
    other global-only options such as 'printk' and 'record-cmd'.
    Secondary instances under tracing/instances/*/ no longer see
    the option at all, instead of seeing it as a silent no-op.
  - TRACE_STACK_ID added to trace_valid_entry() in trace_selftest.c
    so ftrace startup selftests don't reject the entry type.
  - Corrected a comment about how global_trace.stackmap is
    zero-initialized (BSS, not kzalloc).

Patch 3 (docs / selftest / tooling):
  - Selftest now reads trace contents BEFORE switching back to the
    nop tracer (tracer_init() calls tracing_reset_online_cpus()
    which would have emptied the ring buffer).
  - Added 'function:tracer' to the selftest '# requires:' line so
    ftracetest skips when CONFIG_FUNCTION_TRACER is disabled
    instead of failing spuriously.
  - Selftest grep tightened to '<stack_id' to avoid future
    false-positives if any other tracepoint name contains
    "stack_id".
  - New stackmap-instance-gate.tc selftest asserts the option and
    stack_map* nodes are present on the global instance and absent
    on a freshly-created secondary instance, locking in the
    TOP_LEVEL_TRACE_FLAGS gating behavior introduced in patch 2.
  - Documentation Performance section made vendor-neutral
    ("aarch64 SMP system" instead of a specific device name) and
    the term "Hit rate" replaced with "Dedup rate" to match the
    actual stat field name (success_rate).
  - Documentation Design section now states that deduplication is
    best-effort under heavy contention (cmpxchg races may produce
    a small number of duplicate entries for the same stack), so
    users observing entries > unique-stacks have a documented
    explanation.

Test results
============

Device: Xiaomi SM8850 (ARM64), Android 16, kernel 6.12 (OGKI)
Config: CONFIG_FTRACE_STACKMAP=y, bits=14 (16384 elts, 32768 slots)
Method: 5-second capture with stacktrace trigger

Functional tests (all PASS):
  - tracefs nodes (stack_map / stack_map_stat / stack_map_bin) exist
  - options/stackmap writable, trace shows <stack_id N>
  - stack_map text export with correct symbols
  - reset clears entries when tracing stopped
  - reset rejected (-EBUSY) while tracing active
  - per-event trigger: only specified events get stacks

Performance (sched_switch, 5s):
  entries:       466 / 16384
  successes:     9159
  drops:         0
  success_rate:  100%
  dedup rate:    95.2% (466 unique stacks / 9625 total events)

Performance (kmem_cache_alloc, 5s):
  entries:       1177 / 16384
  successes:     60078
  drops:         0
  success_rate:  100%
  dedup rate:    98.1% (1177 unique stacks / 61255 total events)
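
The totals above are consistent with total events = entries + successes
(466 + 9159 = 9625 and 1177 + 60078 = 61255). Under that assumption, a
minimal sketch of the dedup-rate calculation is shown below; the function
name is hypothetical.

```python
def dedup_rate(entries, successes):
    """Fraction of events that reused an existing stack entry.

    Assumes every event either created a new entry or hit an existing
    one, so total = entries + successes (drops are zero in both runs).
    """
    total = entries + successes
    return 1.0 - entries / total


for name, entries, successes in (
    ("sched_switch", 466, 9159),
    ("kmem_cache_alloc", 1177, 60078),
):
    rate = dedup_rate(entries, successes)
    print(f"{name}: {entries + successes} events, dedup rate {rate * 100:.1f}%")
```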

Ring buffer space savings:
  Event               Full stack             Stackmap                        Saving
  ----------------    -------------------    ----------------------------    ------
  sched_switch        9625 × 88B = 847KB     12B×9625 + 88B×466 = 156KB      82%
  kmem_cache_alloc    61255 × 88B = 5.4MB    12B×61255 + 88B×1177 = 839KB    84%
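
The arithmetic behind the table can be reproduced with the short sketch
below. It takes the 88-byte full-stack record and the 12-byte stack_id
record from the table's own formulas. The constant and function names are
hypothetical.

```python
FULL_STACK_BYTES = 88   # per-event cost when the full stack is recorded
STACK_ID_BYTES = 12     # per-event cost when only a stack_id is recorded


def ring_buffer_saving(total_events, unique_stacks):
    """Return (full_bytes, stackmap_bytes, saving_fraction)."""
    full = total_events * FULL_STACK_BYTES
    # One small record per event plus one full stack per unique entry.
    dedup = total_events * STACK_ID_BYTES + unique_stacks * FULL_STACK_BYTES
    return full, dedup, 1.0 - dedup / full


for name, total, unique in (
    ("sched_switch", 9625, 466),
    ("kmem_cache_alloc", 61255, 1177),
):
    full, dedup, saving = ring_buffer_saving(total, unique)
    print(f"{name}: full={full}B stackmap={dedup}B saving={saving * 100:.1f}%")
```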

QEMU validation (v3 base: v7.1-rc5)
===================================

The series boots cleanly on aarch64 QEMU. A post-init smoke test
(12/12 PASS) verified all functional behaviors including:
- tracefs nodes appear with correct file modes
- stack_id events emitted, kernel symbols resolve correctly
  (e.g. __schedule+0x7cc/0x1138)
- reset rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- per-CPU local_t counters aggregate correctly across CPUs
- stack_map_bin magic correct (0x464D5342 'FSMB')
- 'stackmap' option visible on the global instance, hidden on
  secondary instances under tracing/instances/*/
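
A userspace consumer of stack_map_bin might check the magic roughly as
follows. This is a sketch that assumes the magic is a 32-bit value at
offset 0 of the file, which this cover letter does not specify. The helper
names are hypothetical. Note that the bytes of 0x464D5342 read "FMSB" in
big-endian order and "BSMF" in little-endian order; the integer whose
big-endian bytes spell "FSMB" would be 0x46534D42.

```python
STACK_MAP_BIN_MAGIC = 0x464D5342  # value quoted in this cover letter


def magic_bytes(value, byteorder):
    """Return the 4 bytes a 32-bit magic occupies for the given byte order."""
    return value.to_bytes(4, byteorder)


def has_magic(header, byteorder="little"):
    """Check the first 4 bytes of a header (assumed layout: magic at offset 0)."""
    if len(header) < 4:
        return False
    return int.from_bytes(header[:4], byteorder) == STACK_MAP_BIN_MAGIC


print(magic_bytes(STACK_MAP_BIN_MAGIC, "big"))
print(magic_bytes(STACK_MAP_BIN_MAGIC, "little"))
print(hex(int.from_bytes(b"FSMB", "big")))
```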

Boot-time activation via 'trace_options=stackmap,stacktrace' works:
events that fire before stackmap initialization fall back to
recording full stack traces; later events are deduplicated. No
events are dropped due to the transition.

Known limitations
=================

- Per-instance stackmap support is not included in this series.
  Following the convention used for other global-only options
  (PRINTK, RECORD_CMD), the 'stackmap' option is gated to the
  top-level trace instance via TOP_LEVEL_TRACE_FLAGS, so it is
  not exposed under tracing/instances/*/options/. Per-instance
  maps would be a follow-up.
- The element pool is allocated eagerly at fs_initcall when
  CONFIG_FTRACE_STACKMAP=y, regardless of whether userspace will
  ever enable the option. At the default bits=14 this is roughly
  8 MB of vmalloc; at the maximum bits=18, ~135 MB. The eager
  allocation keeps the hot path entirely allocation-free and
  avoids any allocation-failure path under tracing pressure.
  Lazy allocation on first 'echo 1 > options/stackmap' is a
  reasonable follow-up if maintainers prefer that trade-off.
- Deduplication is best-effort, not strict: under heavy
  concurrent contention two CPUs racing in the insert path with
  the same stack hash may each succeed in claiming a different
  slot, producing a small number of duplicate entries for the
  same stack. ref_count is then split across the duplicates.
  This is intentional: it keeps the hot path lock-free and
  bounds memory by the element pool size.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once
  the binary format settles.
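
For a rough sense of how the eager allocation scales with bits, the sketch
below assumes 2^bits elements and twice as many hash slots, matching the
"16384 elts, 32768 slots" configuration quoted under Test results. It also
assumes 512 bytes per element, a figure inferred from the "roughly 8 MB at
bits=14" statement and not taken from the patch. It ignores the slot table
itself.

```python
ASSUMED_ELT_BYTES = 512  # inferred from "~8 MB at bits=14"; not from the patch


def pool_geometry(bits):
    """Return (elements, slots, pool_bytes) for a given bits setting."""
    elts = 1 << bits
    slots = 2 * elts  # assumed 2x over-provisioning of hash slots
    return elts, slots, elts * ASSUMED_ELT_BYTES


for bits in (14, 18):
    elts, slots, nbytes = pool_geometry(bits)
    print(f"bits={bits}: {elts} elts, {slots} slots, {nbytes / 1e6:.1f} MB")
```

Under these assumptions bits=18 comes to 128 MiB (about 134 MB), in line
with the ~135 MB quoted above.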

Usage
=====

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace
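
The same steps can be scripted. The sketch below writes the two option
files and counts how often each stack_id appears in trace output. It
assumes only that the trace text contains "<stack_id N>" markers, as noted
under Test results. The sample lines and helper names are hypothetical,
and writing under /sys requires root.

```python
import re
from collections import Counter
from pathlib import Path

STACK_ID_RE = re.compile(r"<stack_id (\d+)>")


def enable_stackmap(tracing_dir="/sys/kernel/debug/tracing"):
    """Equivalent of the two 'echo 1 > options/...' commands."""
    for opt in ("stackmap", "stacktrace"):
        (Path(tracing_dir) / "options" / opt).write_text("1\n")


def count_stack_ids(trace_text):
    """Map each stack_id seen in the trace text to its number of occurrences."""
    return Counter(int(n) for n in STACK_ID_RE.findall(trace_text))


# Hypothetical trace excerpt; real lines carry more fields.
sample = """\
 <stack_id 7>
 <stack_id 12>
 <stack_id 7>
"""
counts = count_stack_ids(sample)
print(counts.most_common())
```

On a live system one would call enable_stackmap() and then pass the
contents of the trace file to count_stack_ids().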


Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap

 Documentation/trace/ftrace-stackmap.rst       | 162 ++++
 Documentation/trace/index.rst                 |   1 +
 kernel/trace/Kconfig                          |  22 +
 kernel/trace/Makefile                         |   1 +
 kernel/trace/trace.c                          |  78 +-
 kernel/trace/trace.h                          |  16 +
 kernel/trace/trace_entries.h                  |  15 +
 kernel/trace/trace_output.c                   |  23 +
 kernel/trace/trace_selftest.c                 |   1 +
 kernel/trace/trace_stackmap.c                 | 780 ++++++++++++++++++
 kernel/trace/trace_stackmap.h                 |  57 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 103 +++
 .../test.d/ftrace/stackmap-instance-gate.tc   |  42 +
 tools/tracing/stackmap_dump.py                | 150 ++++
 14 files changed, 1449 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100755 tools/tracing/stackmap_dump.py

-- 
2.34.1



Thread overview: 18+ messages
2026-05-14  3:49 [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer Li Pengfei
2026-05-14  3:49 ` [RFC PATCH 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei
2026-05-14  3:49 ` [RFC PATCH 2/3] trace: integrate stackmap into ftrace stack recording path Li Pengfei
2026-05-14  3:49 ` [RFC PATCH 3/3] trace: add documentation, selftest and tooling for stackmap Li Pengfei
2026-05-21 15:23 ` [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer Steven Rostedt
2026-05-22 10:40 ` [RFC PATCH v2 " Li Pengfei
2026-05-22 10:40   ` [PATCH v2 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei
2026-05-22 10:40   ` [PATCH v2 2/3] trace: integrate stackmap into ftrace stack recording path Li Pengfei
2026-05-22 10:40   ` [PATCH v2 3/3] trace: add documentation, selftest and tooling for stackmap Li Pengfei
2026-05-25  6:58   ` [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer Masami Hiramatsu
2026-05-25  7:39     ` Li Pengfei
2026-05-26 11:52 ` Li Pengfei [this message]
2026-05-26 11:52   ` [RFC PATCH v3 1/3] trace: add lock-free stackmap for stack trace deduplication Li Pengfei
2026-05-26 11:52   ` [RFC PATCH v3 2/3] trace: integrate stackmap into ftrace stack recording path Li Pengfei
2026-05-26 11:52   ` [RFC PATCH v3 3/3] trace: add documentation, selftest and tooling for stackmap Li Pengfei
2026-05-26 19:39   ` [RFC PATCH v3 0/3] trace: stack trace deduplication for ftrace ring buffer Steven Rostedt
2026-05-27  2:23     ` Li Pengfei
2026-06-08  2:06   ` Li Pengfei
