* [RFC PATCH v4 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Li Pengfei @ 2026-06-16 6:41 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
lipengfei28, zhangbo56
From: Pengfei Li <lipengfei28@xiaomi.com>
Hi Masami, Steven, all,
This is v4 of the ftrace stackmap series. It is sent as a new thread.
v3: https://lore.kernel.org/all/cover.1779769138.git.lipengfei28@xiaomi.com/
The series adds stack trace deduplication to ftrace. When the
'stackmap' option is enabled alongside 'stacktrace', the ring buffer
stores a 4-byte stack_id instead of a full kernel stack trace, and the
full stacks are exported once via tracefs (stack_map / stack_map_bin).
Rebased onto v7.1-rc5 (e8c2f9fdadee).
Motivation
==========
The target use case is long-duration, from-boot kernel tracing where
the same stacks recur enormously often and the bottleneck is ring
buffer space, not CPU.
Concretely: tracing the slab allocator from boot for hours to study
memory aging and to catch the allocation backtraces behind a usage
peak. With a stacktrace trigger on the slab tracepoints, every event
today carries a full kernel stack (~80-160 bytes). On a fixed-size
ring buffer that bounds how far back in time the trace reaches: the
buffer wraps in seconds to minutes and the early-boot history -- the
part we care about -- is overwritten before it can be consumed.
In this workload the set of distinct stacks is small and highly
repetitive, so storing a 4-byte stack_id per event and the full stack
only once dramatically increases the time span a given buffer covers.
The intended operating model is exactly the low-overhead one ftrace is
good at: let the trace run for a long time producing a comparatively
small, dense log, then resolve stack_ids offline (cat stack_map, or
parse stack_map_bin with the included tool) during analysis.
This is complementary to, not a replacement for, the existing full
stack recording: deep stacks and the early pre-init window still fall
back to full stacks (see below).
Effect on retention
====================
Same fixed per-CPU buffer, slab allocation workload with a shallow
kernel stack (kmem_cache_alloc), stackmap OFF vs ON:
retained events bytes/event time span
stackmap OFF 645,068 ~104 B 15.0 s
stackmap ON 1,397,741 ~48 B 27.7 s
2.17x 2.17x 1.85x
The buffer holds ~2.17x more events and reaches ~1.85x further back in
time for the same memory. The win grows with stack depth and with how
repetitive the stacks are; for deep, highly-repeated stacks the
per-event size approaches the 4-byte stack_id plus event header.
Changes since v3
================
Correctness:
- Deep stacks are never silently truncated or merged. A stack deeper
than FTRACE_STACKMAP_MAX_DEPTH (64) now falls back to a full stack
trace; ftrace_stackmap_get_id() returns -E2BIG rather than
truncating, so two distinct stacks sharing their first 64 frames
can no longer collapse to one stack_id.
- reset is now genuinely destructive and coherent: under the
reader_sem write lock it clears the owning trace_array's ring
buffer (and snapshot) BEFORE clearing the map, and uses
tracing_reset_all_cpus() rather than _online_cpus() so a
TRACE_STACK_ID written by a now-offline CPU cannot survive and
dangle against a cleared map.
- __ftrace_trace_stack() reserves the TRACE_STACK_ID ring-buffer
slot before inserting into the map, so stack_map_stat counters and
ref_count stay consistent with what the ring buffer actually
references (failed reservation -> full stack, map untouched).
- ref_count / successes / drops now saturate (INT_MAX / LONG_MAX)
instead of wrapping on multi-hour, billions-of-hits traces.
Global-instance gating:
- Enabling 'stackmap' on a secondary instance via the aggregate
trace_options file is now rejected, not just hidden in the
per-instance options/ directory.
- tracefs init is failure-atomic: the required stack_map file is
created before the map pointer is published; if it cannot be
created the map is destroyed and never published. An init-state
(PENDING/DONE/FAILED) lets boot-time trace_options=stackmap set
the flag before the map exists (hot path falls back until it is
published) while still rejecting enables after a permanent init
failure, so options/stackmap never reports an enabled no-op.
ABI / tooling:
- Binary magic corrected to 0x46534D42 ('FSMB'); version is 1 (first
upstream ABI). Documentation, tool and selftest updated to match.
- Text and binary exports now follow the same trampoline-marker and
trace_adjust_address() handling as the normal stack print path.
- stackmap_dump.py emits hex addresses in 'ips' and shows the ftrace
trampoline marker only in the resolved 'symbols'.
Selftests:
- New stackmap-reset.tc: verifies reset clears stale <stack_id N>
from the trace buffer and checks the stack_map_bin magic/version.
- stackmap-instance-gate.tc extended to verify the trace_options
write path is rejected on a secondary instance.
- stackmap-basic.tc no longer treats a nonzero drops count as a
failure (drops is a by-design fallback); only zero successes with
nonzero drops is fatal.
Open questions for maintainers
==============================
Two design points where I would value direction before polishing
further:
1. Eager vs lazy allocation. The element pool is allocated at
fs_initcall when CONFIG_FTRACE_STACKMAP=y, regardless of whether
userspace ever enables the option (~8 MB at the default bits=14,
up to ~135 MB at bits=18). This keeps the hot path allocation-free
with no allocation-failure path under tracing pressure. Is eager
allocation acceptable, or would you prefer lazy allocation on the
first 'echo 1 > options/stackmap'?
2. Binary ABI now or later. stack_map_bin is a new tracefs binary
interface (magic 0x46534D42, version 1). Is it acceptable to
introduce it now, or would you prefer the first version ship with
the text stack_map interface only and add the binary export once
trace-cmd / libtraceevent integration is designed?
Test results
============
QEMU (aarch64 virt, v7.1-rc5 + this series), boot to init smoke test:
- stackmap functional suite: 16/16 PASS, including reset clearing the
trace buffer (stale <stack_id> count 48 -> 0), stack_map_bin
magic/version, global-vs-secondary instance gating, and the
trace_options rejection on a secondary instance.
- boot-time activation (trace_options=stackmap,stacktrace on the
kernel cmdline): 3/3 PASS -- the option survives the
pre-initialization window and the map is live once published.
- ftrace startup self-tests pass with the new TRACE_STACK_ID entry.
Device retention numbers above were collected on a Xiaomi SM8850
(ARM64) running an Android workload, comparing the same buffer with
the option off and on.
Known limitations
=================
- Per-instance stackmap support is not included; the option is gated
to the global instance (in the tracefs layout and at the
set_tracer_flag() write path). Per-instance maps are a follow-up.
- Deduplication is best-effort, not strict: under heavy concurrent
contention two CPUs racing with the same stack hash may each claim a
different slot, producing a few duplicate entries; ref_count is then
split across them. This keeps the hot path lock-free.
- The stackmap covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, serialized against reset
but not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once the
binary format settles (see open question 2).
Usage
=====
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
Pengfei Li (3):
trace: add lock-free stackmap for stack trace deduplication
trace: integrate stackmap into ftrace stack recording path
trace: add documentation, selftest and tooling for stackmap
Documentation/trace/ftrace-stackmap.rst | 177 ++++
Documentation/trace/index.rst | 1 +
kernel/trace/Kconfig | 22 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 216 ++++-
kernel/trace/trace.h | 17 +
kernel/trace/trace_entries.h | 15 +
kernel/trace/trace_output.c | 23 +
kernel/trace/trace_selftest.c | 1 +
kernel/trace/trace_stackmap.c | 889 ++++++++++++++++++
kernel/trace/trace_stackmap.h | 57 ++
.../ftrace/test.d/ftrace/stackmap-basic.tc | 111 +++
.../test.d/ftrace/stackmap-instance-gate.tc | 54 ++
.../ftrace/test.d/ftrace/stackmap-reset.tc | 76 ++
tools/tracing/stackmap_dump.py | 164 ++++
15 files changed, 1821 insertions(+), 3 deletions(-)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
create mode 100755 tools/tracing/stackmap_dump.py
--
2.34.1
^ permalink raw reply
* [RFC PATCH v4 1/3] trace: add lock-free stackmap for stack trace deduplication
From: Li Pengfei @ 2026-06-16 6:41 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
lipengfei28, zhangbo56
In-Reply-To: <20260616064119.438063-1-lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full
stack traces (80-160 bytes each) in the ring buffer for every event,
ftrace can store a 4-byte stack_id when the stackmap option is enabled.
The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:
- Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- Pre-allocated element pool (zero allocation on hot path)
- Linear probing with 2x over-provisioned table; probe length is
bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup
is O(1) even when the table is heavily loaded with claimed-but-
empty slots from pool exhaustion
- Single global instance (initialized for the global trace array)
The Kconfig depends on ARCH_HAVE_NMI_SAFE_CMPXCHG, matching the
existing tracing_map / hist_triggers requirement: the lock-free
hot path uses cmpxchg in a context that may be reached from NMI.
The stackmap is exported via three tracefs nodes:
- stack_map: text export with symbol resolution (mode 0640)
- stack_map_stat: counters (entries, successes, drops, success_rate)
- stack_map_bin: binary export (magic 0x46534D42 'FSMB', version 1,
all fields native-endian)
ftrace_stackmap_get_id() never truncates: a stack deeper than
FTRACE_STACKMAP_MAX_DEPTH (64) returns -E2BIG so the caller records a
full stack instead. This prevents two distinct traces that share their
first 64 frames from being merged into one stack_id.
Hot-path counters use per-CPU local_t (NMI-safe single-instruction
increments) instead of atomic64_t. atomic64_t falls back to
raw_spinlock_t-based emulation on 32-bit GENERIC_ATOMIC64 systems,
which would deadlock if an NMI hit while the spinlock was held.
local_t avoids this hazard. All counters saturate rather than wrap on
long (from-boot, multi-hour) traces: ref_count via
atomic_add_unless(.., INT_MAX) and successes/drops via
local_add_unless(.., LONG_MAX).
Reset semantics:
- Reset is a control-path operation only allowed when tracing is
stopped on the owning trace_array. Online reset (with tracing
active) is intentionally not supported.
- Reset is destructive: under the reader_sem write lock it clears the
owning trace_array's ring buffer (and snapshot buffer) BEFORE the
map, so an external observer never sees "trace still has
<stack_id N> but the map is already empty". The buffers are cleared
with tracing_reset_all_cpus() rather than _online_cpus() so a
TRACE_STACK_ID written by a now-offline CPU cannot survive a reset.
- Reset uses atomic_cmpxchg() to claim the resetting flag, then
verifies tracer_tracing_is_on() returns false.
- synchronize_rcu() drains in-flight get_id() callers from the
ftrace callback path (which runs preempt-disabled).
- The reader_sem (rw_semaphore) serializes the clearing against
tracefs readers (seq_file iteration and stack_map_bin snapshot),
which run in process context and aren't covered by
synchronize_rcu(). Readers take it shared, reset takes it
exclusive, so a reset cannot tear an iteration in progress. The
hot path doesn't take this lock.
- Reset clears the resetting flag with atomic_set_release() so a
subsequent get_id() observes a fully cleared map.
- get_id() uses atomic_read_acquire() on resetting so subsequent
loads of entry->key/val are properly ordered after the check
(control dependencies only order stores per LKMM).
- Concurrent reset, or reset while tracing is active, returns -EBUSY.
Concurrency notes:
- entry->val publication uses smp_store_release() paired with
smp_load_acquire() in all dereferencing readers.
- entry->key reads (in get_id, seq_start/next, bin_open) use
READ_ONCE() to avoid LKMM data races with the cmpxchg writer.
- elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before
use in seq_show and bin_open.
- Pool exhaustion: stackmap_get_elt() short-circuits via
atomic_read() before the contended atomic RMW, avoiding cacheline
contention once the pool is full. Slots that win cmpxchg but
cannot get an elt are left 'claimed but empty'; subsequent
lookups treat val==NULL as a miss and probe past them.
Hash key:
- Per-instance random seed stored in the stackmap struct (no
global state), seeded at create time.
- 32-bit jhash is forced to 1 if it lands on 0 (which is the
free-slot sentinel). Full memcmp confirms matches.
Memory:
- Single flat vmalloc for the element pool (no per-elt kzalloc).
- bits parameter clamped to [10, 18]: at the maximum bits=18, the
element pool is ~135 MB and a stack_map_bin snapshot may briefly
allocate another ~135 MB.
- struct stackmap_bin_snapshot uses u64 (not size_t) for its size
field so data[] is 8-byte aligned on both 32-bit and 64-bit
architectures, avoiding alignment faults when writing u64 IPs
on strict-alignment architectures.
Kernel command line parameter:
- ftrace_stackmap.bits=N: set map capacity (2^N unique stacks,
range 10-18, default 14)
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/Kconfig | 22 +
kernel/trace/Makefile | 1 +
kernel/trace/trace_stackmap.c | 889 ++++++++++++++++++++++++++++++++++
kernel/trace/trace_stackmap.h | 57 +++
4 files changed, 969 insertions(+)
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..e49cae886ff0 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,28 @@ config STACK_TRACER
Say N if unsure.
+config FTRACE_STACKMAP
+ bool "Ftrace stack map deduplication"
+ depends on TRACING
+ depends on STACKTRACE
+ depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
+ select KALLSYMS
+ help
+ This enables a global stack trace hash table for ftrace, inspired
+ by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+ only a stack_id in the ring buffer instead of the full stack trace,
+ significantly reducing trace buffer usage when the same call stacks
+ appear repeatedly.
+
+ The deduplicated stacks are exported via:
+ /sys/kernel/debug/tracing/stack_map
+
+ Writing to this file resets the stack map. Reading shows all unique
+ stacks with their stack_id and reference count.
+
+ Say Y if you want to reduce ftrace buffer usage for stack traces.
+ Say N if unsure.
+
config TRACE_PREEMPT_TOGGLE
bool
help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 8d3d96e847d8..c2d9b2bf895a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..9e9fdf85071d
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,889 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ * - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
+ * - Pre-allocated element pool (zero allocation on hot path)
+ * - Linear probing with 2x over-provisioned table; probe length
+ * bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup
+ * cost constant even when the table is heavily loaded
+ * - Single global instance (initialized for the global trace array)
+ *
+ * Reset is a control-path operation, only allowed when tracing is
+ * stopped on the owning trace_array. The protocol is:
+ *
+ * - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights
+ * and blocks new get_id() callers (they observe resetting=1 and
+ * return -EINVAL).
+ * - trace_types_lock serializes the tracer_tracing_is_on() check and
+ * the destructive ring-buffer reset against tracefs writes to
+ * tracing_on.
+ * - synchronize_rcu() drains in-flight get_id() callers from the
+ * ftrace callback path, which runs with preemption disabled.
+ *
+ * Online reset (with tracing active) is intentionally not supported
+ * to keep the design simple and the proof obligations small.
+ *
+ * The 32-bit jhash of the stack IPs is the hash table key. On hash
+ * collision, linear probing finds the next slot and full memcmp
+ * confirms the match.
+ *
+ * Concurrent userspace readers (cat stack_map / stack_map_bin) get
+ * a best-effort snapshot. They are coherent with the hot path
+ * (smp_load_acquire on entry->val); they are also serialized
+ * against reset via smap->reader_sem (readers take it in shared
+ * mode, reset in exclusive mode), so a reset cannot tear an
+ * iteration in progress -- it waits for active readers to drop
+ * the rwsem before clearing the map. The hot path is coordinated
+ * with reset separately, via acquire/release on smap->resetting.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/jhash.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/local_lock.h>
+#include <linux/percpu.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/log2.h>
+#include <asm/local.h>
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Bound the linear-probe scan length. With a 2x over-provisioned table,
+ * a well-distributed hash gives very short probe chains. Capping at 64
+ * keeps worst-case lookup O(1) even when the table is heavily loaded
+ * with claimed-but-empty slots from pool exhaustion.
+ */
+#define FTRACE_STACKMAP_MAX_PROBE 64
+
+/*
+ * Memory ordering of entry->val: published with smp_store_release()
+ * by the inserter; consumed with smp_load_acquire() by every reader
+ * that dereferences the elt (get_id, seq_show, bin_open). This pairs
+ * the writes to elt->{nr,ips,ref_count} (initialized BEFORE the
+ * publish) with the reads of those fields (which happen AFTER the
+ * load). seq_start / seq_next only test val for NULL and use the
+ * acquire load purely to keep memory ordering symmetric.
+ */
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+ u32 nr; /* actual number of IPs */
+ atomic_t ref_count;
+ unsigned long ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+ u32 key; /* 0 = free, non-zero = jhash */
+ struct stackmap_elt *val; /* NULL until fully published */
+};
+
+static struct stackmap_elt *stackmap_load_elt(struct stackmap_entry *entry)
+{
+ /*
+ * Pairs with the smp_store_release() that publishes entry->val
+ * after fully initializing the element payload.
+ */
+ return smp_load_acquire(&entry->val);
+}
+
+struct ftrace_stackmap {
+ struct trace_array *tr; /* owning trace_array */
+ unsigned int map_bits;
+ unsigned int map_size; /* 1 << (map_bits + 1) */
+ unsigned int max_elts; /* 1 << map_bits */
+ u32 hash_seed; /* per-instance jhash seed */
+ atomic_t next_elt; /* index into elts pool */
+ struct stackmap_entry *entries; /* hash table */
+ struct stackmap_elt *elts; /* flat element pool */
+ atomic_t resetting;
+ /*
+ * Reader/reset serialization. Held in shared mode (read lock)
+ * across seq_file iteration and binary snapshot construction;
+ * held in exclusive mode (write lock) by reset's clearing
+ * phase. The hot path (get_id) does not take this lock — it
+ * uses smp_load_acquire/smp_store_release on entry->val and
+ * the resetting flag for the lock-free protocol.
+ */
+ struct rw_semaphore reader_sem;
+ /*
+ * Per-CPU counters using local_t. local_t increments are NMI-
+ * safe on all architectures (single-instruction or interrupt-
+ * masked) and avoid the raw_spinlock_t fallback that
+ * atomic64_t uses on 32-bit GENERIC_ATOMIC64 — which would
+ * deadlock if an NMI hit while the spinlock was held.
+ */
+ local_t __percpu *successes; /* events served (hits + new inserts) */
+ local_t __percpu *drops;
+};
+
+/*
+ * Cap the bits parameter to keep worst-case allocations bounded:
+ * bits=18 → 256K elts, 512K slots, ~130 MB elt pool, ~130 MB bin
+ * export.
+ * Smaller workloads should use the default (14) which gives 16K elts
+ * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel
+ * parameter for higher unique-stack capacity.
+ */
+#define FTRACE_STACKMAP_BITS_MIN 10
+#define FTRACE_STACKMAP_BITS_MAX 18
+#define FTRACE_STACKMAP_BITS_DEFAULT 14
+
+static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT;
+static int __init stackmap_bits_setup(char *str)
+{
+ unsigned long val;
+
+ if (kstrtoul(str, 0, &val))
+ return -EINVAL;
+ val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX);
+ stackmap_map_bits = val;
+ return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+ int idx;
+
+ /*
+ * Fast-path early-out once the pool is fully consumed. Avoids
+ * the contended atomic RMW on next_elt for every traced event
+ * after the pool is exhausted.
+ */
+ if (atomic_read(&smap->next_elt) >= smap->max_elts)
+ return NULL;
+
+ idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+ if (idx < smap->max_elts)
+ return &smap->elts[idx];
+ return NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr)
+{
+ struct ftrace_stackmap *smap;
+ unsigned int bits;
+
+ smap = kzalloc_obj(*smap, GFP_KERNEL);
+ if (!smap)
+ return ERR_PTR(-ENOMEM);
+
+ /* Defensive clamp: reject bogus bits even if early_param is bypassed. */
+ bits = clamp_val(stackmap_map_bits,
+ FTRACE_STACKMAP_BITS_MIN,
+ FTRACE_STACKMAP_BITS_MAX);
+
+ smap->tr = tr;
+ smap->map_bits = bits;
+ smap->max_elts = 1U << bits;
+ smap->map_size = 1U << (bits + 1); /* 2x over-provision */
+
+ smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+ if (!smap->entries) {
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /*
+ * Single large vmalloc of the element pool, indexed flat.
+ * At bits=18 this is 256K * sizeof(struct stackmap_elt). The
+ * struct is ~520 B (8 + 4 + 4 + 64*8), so total ~135 MB.
+ */
+ smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts);
+ if (!smap->elts) {
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ smap->successes = alloc_percpu(local_t);
+ if (!smap->successes) {
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+ smap->drops = alloc_percpu(local_t);
+ if (!smap->drops) {
+ free_percpu(smap->successes);
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ smap->hash_seed = get_random_u32();
+ atomic_set(&smap->next_elt, 0);
+ atomic_set(&smap->resetting, 0);
+ init_rwsem(&smap->reader_sem);
+
+ return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+ if (!smap || IS_ERR(smap))
+ return;
+ free_percpu(smap->drops);
+ free_percpu(smap->successes);
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+}
+
+/**
+ * ftrace_stackmap_reset - clear all entries in the stackmap
+ * @smap: the stackmap to reset
+ *
+ * Returns 0 on success, -EBUSY if another reset is already in
+ * progress, or if tracing is currently active on the owning
+ * trace_array.
+ *
+ * Online reset (with tracing active) is not supported. Caller must
+ * stop tracing first (echo 0 > tracing_on).
+ *
+ * Caller is process context (typically sysfs write handler).
+ *
+ * Protocol:
+ * 1. Atomically claim reset rights via cmpxchg on @resetting.
+ * 2. Take trace_types_lock to serialize against tracefs writes to
+ * tracing_on.
+ * 3. Verify tracing is stopped on @smap->tr; if not, release the
+ * claim and return -EBUSY. The resetting flag itself blocks
+ * any subsequent get_id() callers.
+ * 4. synchronize_rcu() drains in-flight get_id() callers from the
+ * ftrace callback path (which runs preempt-disabled).
+ * 5. Reset the ring buffer(s), then memset entries, elts, and
+ * counters.
+ * 6. Release the resetting flag with release semantics so any new
+ * get_id() observes a fully cleared map.
+ */
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+ struct trace_array *tr;
+ int ret = 0;
+
+ if (!smap)
+ return 0;
+
+ if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0)
+ return -EBUSY;
+
+ mutex_lock(&trace_types_lock);
+
+ tr = smap->tr;
+ if (tr && tracer_tracing_is_on(tr)) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ /*
+ * synchronize_rcu() itself is a full barrier; no extra smp_mb()
+ * is needed before it. It drains in-flight ftrace callbacks that
+ * may have already passed the resetting check with the old value.
+ */
+ synchronize_rcu();
+
+ /*
+ * Take the reader_sem in exclusive mode. This serializes the
+ * memset against any tracefs reader (seq_file iteration or
+ * stack_map_bin snapshot) that may currently hold the rwsem
+ * for read. synchronize_rcu() already drained the hot path;
+ * this rwsem covers process-context readers that aren't
+ * preempt-disabled.
+ */
+ down_write(&smap->reader_sem);
+
+ /*
+ * Clear the ring buffer(s) BEFORE the map, both under the write
+ * lock. The ring buffer may still hold TRACE_STACK_ID events
+ * whose stack_id points at slots we are about to free/reuse.
+ * Resetting the buffer first guarantees an external observer
+ * never sees the inconsistent "trace still has <stack_id N> but
+ * the map is already empty" window: it sees either (old buffer,
+ * old map) or (cleared buffer, old map) or (cleared buffer,
+ * cleared map) -- never (old buffer, cleared map).
+ *
+ * Use tracing_reset_all_cpus() (not _online_cpus) so per-CPU
+ * buffers belonging to currently offline CPUs are also cleared.
+ * The ring buffer is allocated per-possible-CPU; an offline CPU's
+ * buffer can still hold a TRACE_STACK_ID event written before
+ * the CPU went offline. tracing_reset_online_cpus() iterates
+ * for_each_online_buffer_cpu() and would leave that data behind
+ * to be observed once the CPU comes back online (or by the
+ * trace reader, which iterates all allocated CPU buffers),
+ * recreating the stale-stack_id window we are trying to close.
+ *
+ * Since reset requires tracing to be stopped, this makes "reset"
+ * an explicitly destructive operation on the owning trace_array,
+ * keeping ring-buffer stack_ids and the map coherent.
+ */
+ if (tr) {
+ tracing_reset_all_cpus(&tr->array_buffer);
+#ifdef CONFIG_TRACER_SNAPSHOT
+ if (tr->allocated_snapshot)
+ tracing_reset_all_cpus(&tr->snapshot_buffer);
+#endif
+ }
+
+ memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+ memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts);
+
+ atomic_set(&smap->next_elt, 0);
+ {
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ local_set(per_cpu_ptr(smap->successes, cpu), 0);
+ local_set(per_cpu_ptr(smap->drops, cpu), 0);
+ }
+ }
+
+ up_write(&smap->reader_sem);
+
+out_unlock:
+ mutex_unlock(&trace_types_lock);
+
+ /* Release resetting=0 so new get_id() observes a cleared map. */
+ atomic_set_release(&smap->resetting, 0);
+ return ret;
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries)
+{
+ u32 key_hash, idx, test_key, trace_len;
+ struct stackmap_entry *entry;
+ struct stackmap_elt *val;
+ int probes = 0;
+
+ /*
+ * atomic_read_acquire() pairs with atomic_set_release() in the
+ * reset path. This ensures that subsequent reads of entry->key
+ * and entry->val are ordered after this check; without acquire,
+ * the CPU would only have a control dependency, which orders
+ * subsequent stores but not loads (per LKMM).
+ */
+ if (!smap || !nr_entries || atomic_read_acquire(&smap->resetting))
+ return -EINVAL;
+ /*
+ * Never truncate: a stack deeper than the map can hold must not be
+ * silently shortened, or two distinct traces sharing their first
+ * FTRACE_STACKMAP_MAX_DEPTH frames would be merged into one
+ * stack_id. The caller is expected to fall back to a full stack
+ * trace for such events. Reject defensively in case of a future
+ * caller that forgets this contract.
+ */
+ if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+ return -E2BIG;
+
+ trace_len = nr_entries * sizeof(unsigned long);
+ /*
+ * jhash2() requires the length in u32 units and the data to be
+ * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+ * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+ * directly; the cast to u32* is safe because ips[] is naturally
+ * aligned to sizeof(unsigned long) >= 4.
+ */
+ key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+ smap->hash_seed);
+ if (key_hash == 0)
+ key_hash = 1; /* 0 means free slot */
+
+ idx = key_hash >> (32 - (smap->map_bits + 1));
+
+ while (probes < FTRACE_STACKMAP_MAX_PROBE) {
+ idx &= (smap->map_size - 1);
+ entry = &smap->entries[idx];
+ /*
+ * READ_ONCE() to avoid LKMM data race with concurrent
+ * cmpxchg(&entry->key, 0, key_hash) on this slot.
+ */
+ test_key = READ_ONCE(entry->key);
+
+ if (test_key == key_hash) {
+ val = stackmap_load_elt(entry);
+ /*
+ * READ_ONCE(val->nr) keeps style consistent with
+ * the seq_show / bin_open readers. nr is write-once
+ * (set before publish, never modified afterwards),
+ * so the load is data-race-free, but READ_ONCE
+ * silences any analysis tool that flags a plain
+ * read of a field that is also read under acquire
+ * elsewhere.
+ */
+ if (val && READ_ONCE(val->nr) == nr_entries &&
+ memcmp(val->ips, ips, trace_len) == 0) {
+ /*
+ * ref_count is a best-effort popularity
+ * counter. On a long (from-boot, multi-hour)
+ * trace a hot stack can be hit billions of
+ * times. atomic_add_unless() gives true
+ * saturation at INT_MAX even under concurrent
+ * hits on multiple CPUs (a plain
+ * check-then-inc could let several CPUs past
+ * the check near the cap and still wrap).
+ */
+ atomic_add_unless(&val->ref_count, 1, INT_MAX);
+ /*
+ * successes/drops are best-effort throughput
+ * counters. Saturate at LONG_MAX so they do
+ * not wrap on long runs (notably where local_t
+ * is 32-bit), matching ref_count's behaviour.
+ */
+ local_add_unless(this_cpu_ptr(smap->successes),
+ 1, LONG_MAX);
+ return (int)idx;
+ }
+ /*
+ * val == NULL: another CPU is mid-insert, or this
+ * slot is "claimed but empty" (pool exhausted).
+ * val != NULL but mismatch: 32-bit hash collision
+ * with a different stack. In both cases, advance.
+ */
+ } else if (!test_key) {
+ /*
+ * Free slot: try to claim it.
+ *
+ * If two CPUs race here with the same key_hash
+ * (same stack), one loses the cmpxchg, advances,
+ * and may insert the same stack at a later slot.
+ * This can produce a small number of duplicate
+ * entries under heavy contention. The trade-off
+ * is accepted to keep the hot path lock-free;
+ * ref_count is split across the duplicates and
+ * total memory cost is bounded by the element
+ * pool size.
+ */
+ if (cmpxchg(&entry->key, 0, key_hash) == 0) {
+ struct stackmap_elt *elt;
+
+ elt = stackmap_get_elt(smap);
+ if (!elt) {
+ /*
+ * Pool exhausted. We claimed this
+ * slot with cmpxchg but cannot fill
+ * it. Leave key set so the slot
+ * stays "claimed but empty" — future
+ * lookups treat val==NULL as a miss
+ * and probe past it. Cannot revert
+ * key=0 without racing other CPUs.
+ */
+ local_add_unless(this_cpu_ptr(smap->drops),
+ 1, LONG_MAX);
+ return -ENOSPC;
+ }
+
+ elt->nr = nr_entries;
+ atomic_set(&elt->ref_count, 1);
+ memcpy(elt->ips, ips, trace_len);
+
+ /*
+ * Publish elt with release semantics so the
+ * reader's smp_load_acquire can safely
+ * dereference val->nr / val->ips.
+ */
+ smp_store_release(&entry->val, elt);
+ local_add_unless(this_cpu_ptr(smap->successes),
+ 1, LONG_MAX);
+ return (int)idx;
+ }
+ /* cmpxchg failed; another CPU claimed this slot. */
+ }
+
+ idx++;
+ probes++;
+ }
+
+ local_add_unless(this_cpu_ptr(smap->drops), 1, LONG_MAX);
+ return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+ struct ftrace_stackmap *smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ /*
+ * Take the reader_sem to serialize against ftrace_stackmap_reset(),
+ * which holds it for write while clearing the table. Released in
+ * stackmap_seq_stop(), which seq_file calls regardless of whether
+ * start() returned an element or NULL (per Documentation/filesystems
+ * /seq_file.rst: "the iterator value returned by start() or next()
+ * is guaranteed to be passed to a subsequent next() or stop()").
+ */
+ down_read(&smap->reader_sem);
+ for (i = *pos; i < smap->map_size; i++) {
+ if (READ_ONCE(smap->entries[i].key) &&
+ stackmap_load_elt(&smap->entries[i])) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ for (i = *pos + 1; i < smap->map_size; i++) {
+ if (READ_ONCE(smap->entries[i].key) &&
+ stackmap_load_elt(&smap->entries[i])) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ /*
+ * Advance *pos past the end so that on the next read() the
+ * subsequent stackmap_seq_start() call returns NULL and the
+ * iteration terminates. Without this, seq_read() would loop
+ * on the last element.
+ */
+ *pos = smap->map_size;
+ return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+
+ /*
+ * seq_file invokes stop() unconditionally after each iteration
+ * pass (see seq_read_iter / traverse), even when start() returned
+ * NULL. Always release here, balanced against the down_read in
+ * stackmap_seq_start().
+ */
+ if (smap)
+ up_read(&smap->reader_sem);
+}
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+ struct stackmap_entry *entry = v;
+ struct stackmap_seq_private *priv = m->private;
+ struct stackmap_elt *elt;
+ u32 idx = entry - priv->smap->entries;
+ u32 i, nr;
+
+ elt = stackmap_load_elt(entry);
+ if (!elt)
+ return 0;
+
+ nr = READ_ONCE(elt->nr);
+ if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+ nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+ seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+ idx, atomic_read(&elt->ref_count), nr);
+ for (i = 0; i < nr; i++) {
+ unsigned long ip = elt->ips[i];
+
+ /*
+ * Mirror trace_stack_print(): __ftrace_trace_stack()
+ * may replace trampoline addresses with
+ * FTRACE_TRAMPOLINE_MARKER before the stack reaches the
+ * map, and normal addresses must go through
+ * trace_adjust_address() (KASLR / module text delta)
+ * before symbolization. Without this the export would
+ * print a bogus symbol for the marker and unadjusted
+ * addresses for everything else.
+ */
+ if (ip == FTRACE_TRAMPOLINE_MARKER) {
+ seq_printf(m, " [%u] [FTRACE TRAMPOLINE]\n", i);
+ continue;
+ }
+ seq_printf(m, " [%u] %pS\n", i,
+ (void *)trace_adjust_address(priv->smap->tr, ip));
+ }
+ seq_putc(m, '\n');
+ return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+ .start = stackmap_seq_start,
+ .next = stackmap_seq_next,
+ .stop = stackmap_seq_stop,
+ .show = stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+ struct stackmap_seq_private *priv;
+ struct seq_file *m;
+ int ret;
+
+ ret = seq_open_private(file, &stackmap_seq_ops,
+ sizeof(struct stackmap_seq_private));
+ if (ret)
+ return ret;
+ m = file->private_data;
+ priv = m->private;
+ priv->smap = inode->i_private;
+ return 0;
+}
+
+/*
+ * Accept exactly "0" or "reset" (optionally followed by a single newline).
+ */
+static bool stackmap_write_is_reset(const char *buf, size_t n)
+{
+ if (n > 0 && buf[n - 1] == '\n')
+ n--;
+ return (n == 1 && buf[0] == '0') ||
+ (n == 5 && memcmp(buf, "reset", 5) == 0);
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct seq_file *m = file->private_data;
+ struct stackmap_seq_private *priv = m->private;
+ char buf[8];
+ size_t n = min(count, sizeof(buf) - 1);
+ int ret;
+
+ if (n == 0)
+ return -EINVAL;
+ if (copy_from_user(buf, ubuf, n))
+ return -EFAULT;
+ buf[n] = '\0';
+
+ if (!stackmap_write_is_reset(buf, n))
+ return -EINVAL;
+
+ /*
+ * ftrace_stackmap_reset() atomically claims reset rights via
+ * cmpxchg and returns -EBUSY if another reset is in progress
+ * or if tracing is active.
+ */
+ ret = ftrace_stackmap_reset(priv->smap);
+ if (ret)
+ return ret;
+ return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+ .open = stackmap_open,
+ .read = seq_read,
+ .write = stackmap_write,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+ struct ftrace_stackmap *smap = m->private;
+ u64 successes = 0, drops = 0;
+ u32 entries;
+ int cpu;
+
+ if (!smap) {
+ seq_puts(m, "stackmap not initialized\n");
+ return 0;
+ }
+
+ entries = atomic_read(&smap->next_elt);
+ for_each_possible_cpu(cpu) {
+ successes += local_read(per_cpu_ptr(smap->successes, cpu));
+ drops += local_read(per_cpu_ptr(smap->drops, cpu));
+ }
+
+ seq_printf(m, "entries: %u / %u\n", entries, smap->max_elts);
+ seq_printf(m, "table_size: %u\n", smap->map_size);
+ seq_printf(m, "successes: %llu\n", successes);
+ seq_printf(m, "drops: %llu\n", drops);
+ if (successes + drops > 0)
+ seq_printf(m, "success_rate: %llu%%\n",
+ successes * 100 / (successes + drops));
+ return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+ .open = stackmap_stat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+ /*
+ * Use u64 (not size_t) so data[] is 8-byte aligned on both
+ * 32-bit and 64-bit architectures. The IP array within data[]
+ * is accessed as u64*, which would alignment-fault on strict
+ * architectures (e.g. older ARM, SPARC) if data[] started at
+ * a 4-byte boundary.
+ */
+ u64 size;
+ char data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+ struct ftrace_stackmap *smap = inode->i_private;
+ struct stackmap_bin_snapshot *snap;
+ struct ftrace_stackmap_bin_header *hdr;
+ size_t alloc_size, off;
+ u32 nr_entries, i, nr_stacks;
+
+ if (!smap)
+ return -ENODEV;
+
+ /*
+ * Worst-case allocation size: every populated entry uses a
+ * full-depth stack. The (+1) gives one slack slot in case a
+ * concurrent insert lands between this snapshot and iteration.
+ * The loop below performs an explicit bounds check anyway.
+ *
+ * At bits=18 this caps at ~135 MB. The file is mode 0440
+ * (TRACE_MODE_READ), so only privileged users can open it.
+ */
+ nr_entries = atomic_read(&smap->next_elt);
+ alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+ (sizeof(struct ftrace_stackmap_bin_entry) +
+ FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+
+ snap = vmalloc(sizeof(*snap) + alloc_size);
+ if (!snap)
+ return -ENOMEM;
+
+ hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+ hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+ hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+ hdr->reserved = 0;
+ off = sizeof(*hdr);
+ nr_stacks = 0;
+
+ /*
+ * Take reader_sem to serialize against ftrace_stackmap_reset(),
+ * which clears the table and elt pool under the write lock.
+ */
+ down_read(&smap->reader_sem);
+
+ for (i = 0; i < smap->map_size; i++) {
+ struct stackmap_entry *entry = &smap->entries[i];
+ struct stackmap_elt *elt;
+ struct ftrace_stackmap_bin_entry *e;
+ u64 *ips_out;
+ u32 k, nr;
+
+ if (!READ_ONCE(entry->key))
+ continue;
+ elt = stackmap_load_elt(entry);
+ if (!elt)
+ continue;
+
+ nr = READ_ONCE(elt->nr);
+ if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+ nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+ /* Bounds check: stop if we would overflow the allocation. */
+ if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size)
+ break;
+
+ e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+ e->stack_id = i;
+ e->nr = nr;
+ e->ref_count = atomic_read(&elt->ref_count);
+ e->reserved = 0;
+ off += sizeof(*e);
+
+ ips_out = (u64 *)(snap->data + off);
+ for (k = 0; k < nr; k++) {
+ unsigned long ip = elt->ips[k];
+
+ /*
+ * Emit the trampoline marker verbatim so userspace
+ * can render it as [FTRACE TRAMPOLINE]; pass every
+ * other address through trace_adjust_address() so the
+ * binary export follows the same address-adjustment
+ * rules as the text export.
+ */
+ if (ip == FTRACE_TRAMPOLINE_MARKER)
+ ips_out[k] = (u64)FTRACE_TRAMPOLINE_MARKER;
+ else
+ ips_out[k] = (u64)trace_adjust_address(smap->tr, ip);
+ }
+ off += nr * sizeof(u64);
+ nr_stacks++;
+ }
+
+ up_read(&smap->reader_sem);
+
+ hdr->nr_stacks = nr_stacks;
+ snap->size = off;
+ file->private_data = snap;
+ return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct stackmap_bin_snapshot *snap = file->private_data;
+
+ if (!snap)
+ return -EINVAL;
+ return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+ vfree(file->private_data);
+ return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+ .open = stackmap_bin_open,
+ .read = stackmap_bin_read,
+ .llseek = default_llseek,
+ .release = stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..7c2e5ab9d36d
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define FTRACE_STACKMAP_MAX_DEPTH 64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC 0x46534D42 /* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION 1
+
+struct ftrace_stackmap_bin_header {
+ u32 magic;
+ u32 version;
+ u32 nr_stacks;
+ u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+ u32 stack_id;
+ u32 nr;
+ u32 ref_count;
+ u32 reserved;
+ /* followed by u64 ips[nr] */
+};
+
+struct trace_array;
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries);
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *
+ftrace_stackmap_create(struct trace_array *tr) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+ unsigned long *ips, unsigned int n)
+{ return -EOPNOTSUPP; }
+static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
--
2.34.1
^ permalink raw reply related
* [RFC PATCH v4 2/3] trace: integrate stackmap into ftrace stack recording path
From: Li Pengfei @ 2026-06-16 6:41 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
lipengfei28, zhangbo56
In-Reply-To: <20260616064119.438063-1-lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.
Changes:
- New TRACE_STACK_ID in trace_type enum and stack_id_entry in
trace_entries.h.
- New TRACE_ITER(STACKMAP) trace option flag; when CONFIG_FTRACE_STACKMAP
is disabled, TRACE_ITER_STACKMAP_BIT is defined as -1 so that
TRACE_ITER(STACKMAP) evaluates to 0 (following the existing pattern
used by TRACE_ITER_PROF_TEXT_OFFSET).
- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and ZEROED_TRACE_FLAGS
so it is only exposed under the top-level trace instance, matching
the convention already used for global-only options such as 'printk'
and 'record-cmd'. Secondary instances under tracing/instances/*/
do not see the option in their options/ directory.
- set_tracer_flag() additionally rejects enabling STACKMAP on a
secondary instance. The per-option file is hidden on secondary
instances, but a write to the aggregate trace_options file still
reaches set_tracer_flag(); without this check the bit could be
accepted and then become a silent no-op in the hot path (where
tr->stackmap is NULL). This closes the global-instance-only gate
at the write path, not just in the tracefs layout.
- __ftrace_trace_stack() reserves the TRACE_STACK_ID ring-buffer slot
BEFORE calling ftrace_stackmap_get_id(), so the map (and its
ref_count / success counters) is only mutated when a ring-buffer
event will actually reference the entry. If the reservation fails
it falls back to a full stack; if get_id() fails it discards the
reserved slot and falls back. A stack deeper than
FTRACE_STACKMAP_MAX_DEPTH skips the map entirely (get_id() would
return -E2BIG) and records a full stack, so deep traces are never
truncated or merged.
- Stackmap pointer read with smp_load_acquire(), published with
smp_store_release() to ensure proper initialization ordering. The
hot path falls back to a full stack whenever tr->stackmap is NULL.
- ftrace_stackmap_create() takes the owning trace_array so the
stackmap can later clear that trace_array's buffers during reset.
- Added stack_id print handler in trace_output.c and TRACE_STACK_ID
to trace_valid_entry() in trace_selftest.c so ftrace startup
selftests accept the new entry type when the stackmap option is
enabled.
Failure-atomic init and boot-time activation:
- The global stackmap and its tracefs files are created during
tracer_init_tracefs(). stack_map is the single required file (it is
both the resolver and the reset interface); it is created BEFORE the
map pointer is published with smp_store_release(), so an observed
non-NULL tr->stackmap implies the resolver/reset file exists. If
stack_map cannot be created the map is destroyed and never published.
- A small init-state (PENDING / DONE / FAILED) lets set_tracer_flag()
distinguish "not initialized yet" from "init failed". Boot-time
options (trace_options=stackmap,stacktrace) are applied before the
tracefs init work runs; the flag is allowed to be set while init is
PENDING (the hot path falls back until the map is published, then the
boot-set option takes effect), and is only rejected once init has
permanently FAILED. On failure the STACKMAP flag is also cleared from
the global instance so options/stackmap never reports an enabled
no-op.
Fallback behavior: if stackmap returns an error (pool exhausted,
resetting, NULL pointer, or a too-deep stack), the full stack trace is
recorded as before -- no new failure modes introduced.
Per-instance stackmap support is left as a follow-up; gating the
option to the global instance (both in the tracefs layout and at the
set_tracer_flag() write path) makes the global-only scope explicit.
Usage:
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/trace.c | 216 +++++++++++++++++++++++++++++++++-
kernel/trace/trace.h | 17 +++
kernel/trace/trace_entries.h | 15 +++
kernel/trace/trace_output.c | 23 ++++
kernel/trace/trace_selftest.c | 1 +
5 files changed, 269 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..e00bee5d0e01 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
#include "trace.h"
#include "trace_output.h"
+#include "trace_stackmap.h"
#ifdef CONFIG_FTRACE_STARTUP_TEST
/*
@@ -509,12 +510,13 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
/* trace_options that are only supported by global_trace */
#define TOP_LEVEL_TRACE_FLAGS (TRACE_ITER(PRINTK) | \
TRACE_ITER(PRINTK_MSGONLY) | TRACE_ITER(RECORD_CMD) | \
- TRACE_ITER(PROF_TEXT_OFFSET) | FPROFILE_DEFAULT_FLAGS)
+ TRACE_ITER(PROF_TEXT_OFFSET) | TRACE_ITER(STACKMAP) | \
+ FPROFILE_DEFAULT_FLAGS)
/* trace_flags that are default zero for instances */
#define ZEROED_TRACE_FLAGS \
(TRACE_ITER(EVENT_FORK) | TRACE_ITER(FUNC_FORK) | TRACE_ITER(TRACE_PRINTK) | \
- TRACE_ITER(COPY_MARKER))
+ TRACE_ITER(COPY_MARKER) | TRACE_ITER(STACKMAP))
/*
* The global_trace is the descriptor that holds the top-level tracing
@@ -1562,7 +1564,7 @@ void tracing_reset_online_cpus(struct array_buffer *buf)
ring_buffer_record_enable(buffer);
}
-static void tracing_reset_all_cpus(struct array_buffer *buf)
+void tracing_reset_all_cpus(struct array_buffer *buf)
{
struct trace_buffer *buffer = buf->buffer;
@@ -2184,6 +2186,75 @@ void __ftrace_trace_stack(struct trace_array *tr,
}
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+ /*
+ * If stackmap dedup is enabled, try to store only the stack_id
+ * in the ring buffer instead of the full stack trace.
+ *
+ * Reserve the TRACE_STACK_ID ring-buffer slot BEFORE inserting
+ * into the stackmap. This guarantees the map is only mutated
+ * (and its ref_count / success counters bumped) when a
+ * ring-buffer event will actually reference the entry:
+ * - reservation fails -> fall back to full stack, map untouched
+ * - get_id() fails -> discard the reserved slot, fall back
+ * so stack_map_stat counters stay consistent with what the ring
+ * buffer holds, and a failed reservation never consumes a map
+ * slot for an event that records a full stack anyway.
+ */
+ if (tr->trace_flags & TRACE_ITER(STACKMAP)) {
+ struct ftrace_stackmap *smap;
+ struct stack_id_entry *sid_entry;
+ int sid;
+
+ /*
+ * Pairs with the smp_store_release() that publishes the
+ * fully initialized global stackmap at tracefs init.
+ */
+ smap = smp_load_acquire(&tr->stackmap);
+ if (!smap)
+ goto full_stack;
+
+ /*
+ * The stackmap stores at most FTRACE_STACKMAP_MAX_DEPTH
+ * frames per entry. A deeper trace would be truncated, and
+ * two distinct stacks that share the first MAX_DEPTH frames
+ * would hash and compare equal, silently merging into one
+ * stack_id. Keep the conservative full-stack path for deep
+ * traces so no information is lost or misattributed.
+ */
+ if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+ goto full_stack;
+
+ event = __trace_buffer_lock_reserve(buffer, TRACE_STACK_ID,
+ sizeof(*sid_entry), trace_ctx);
+ if (!event)
+ goto full_stack;
+
+ sid = ftrace_stackmap_get_id(smap, fstack->calls, nr_entries);
+ if (sid < 0) {
+ /*
+ * Pool exhausted or a reset is in progress. Discard
+ * the reserved stack_id slot and record the full
+ * stack instead, so the event still gets a trace.
+ */
+ __trace_event_discard_commit(buffer, event);
+ goto full_stack;
+ }
+
+ sid_entry = ring_buffer_event_data(event);
+ sid_entry->stack_id = sid;
+ /*
+ * stack_id is a synthetic side-event attached to a
+ * primary trace event that was already subject to
+ * filtering. No per-event filter is defined for
+ * TRACE_STACK_ID, so commit unconditionally.
+ */
+ __buffer_unlock_commit(buffer, event);
+ goto out;
+ }
+full_stack:
+#endif
+
event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
struct_size(entry, caller, nr_entries),
trace_ctx);
@@ -3979,6 +4050,33 @@ int trace_keep_overwrite(struct tracer *tracer, u64 mask, int set)
return 0;
}
+#ifdef CONFIG_FTRACE_STACKMAP
+/*
+ * Tracks tracefs-time initialization of the global stackmap so that
+ * set_tracer_flag() can distinguish "not initialized yet" from
+ * "initialization permanently failed".
+ *
+ * Boot-time options (trace_options=stackmap,stacktrace) are applied
+ * very early, before tracer_init_tracefs() creates and publishes the
+ * map. We must allow the STACKMAP flag to be set during that window
+ * (the hot path falls back to a full stack while tr->stackmap is NULL,
+ * then starts using the map once it is published). We must, however,
+ * reject the enable once init has *failed*, so options/stackmap never
+ * reports an enabled no-op.
+ *
+ * Written once from the tracefs init work before any concurrent
+ * userspace writer to trace_options can run, then only read; a plain
+ * int is therefore sufficient.
+ */
+enum {
+ STACKMAP_INIT_PENDING, /* tracer_init_tracefs() not run yet */
+ STACKMAP_INIT_DONE, /* map published, stack_map file created */
+ STACKMAP_INIT_FAILED, /* permanent failure, never available */
+};
+
+static int stackmap_init_state = STACKMAP_INIT_PENDING;
+#endif
+
int set_tracer_flag(struct trace_array *tr, u64 mask, int enabled)
{
switch (mask) {
@@ -3993,6 +4091,33 @@ int set_tracer_flag(struct trace_array *tr, u64 mask, int enabled)
if (!!(tr->trace_flags & mask) == !!enabled)
return 0;
+#ifdef CONFIG_FTRACE_STACKMAP
+ /*
+ * STACKMAP is intentionally global-instance-only: the dedup map,
+ * its tracefs files (stack_map / stack_map_stat / stack_map_bin)
+ * and the lifetime/reset semantics are tied to the global trace
+ * array. options/stackmap is hidden on secondary instances via
+ * TOP_LEVEL_TRACE_FLAGS, but writes still reach set_tracer_flag()
+ * through the aggregate trace_options file. Reject the enable on
+ * a secondary instance so it cannot be silently accepted and then
+ * become a no-op in the hot path (where tr->stackmap is NULL and
+ * the code falls back to a full stack trace).
+ *
+ * On the global instance, allow the enable while init is still
+ * pending (boot-time trace_options=stackmap is applied before the
+ * tracefs init work creates the map; the hot path falls back
+ * until the map is published). Only reject once init has
+ * permanently failed, so options/stackmap never reports an
+ * enabled no-op. READ_ONCE() suffices: this only inspects the
+ * init state, it does not dereference the map (the hot path uses
+ * smp_load_acquire(&tr->stackmap) for that).
+ */
+ if (mask == TRACE_ITER(STACKMAP) && enabled &&
+ (tr != &global_trace ||
+ READ_ONCE(stackmap_init_state) == STACKMAP_INIT_FAILED))
+ return -EINVAL;
+#endif
+
/* Give the tracer a chance to approve the change */
if (tr->current_trace->flag_changed)
if (tr->current_trace->flag_changed(tr, mask, !!enabled))
@@ -9222,6 +9347,91 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
NULL, &tracing_dyn_info_fops);
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+ {
+ struct ftrace_stackmap *smap;
+ struct dentry *map_file;
+
+ smap = ftrace_stackmap_create(&global_trace);
+ if (!IS_ERR(smap)) {
+ /*
+ * Failure-atomic init: stack_map is the single
+ * required tracefs file (it doubles as the reset
+ * interface and the human-readable resolver). If
+ * we cannot create it, the hot path must not be
+ * able to emit <stack_id N> events that no one can
+ * resolve or clear, so refuse to publish the map
+ * and tear it down.
+ *
+ * Create stack_map BEFORE smp_store_release() so an
+ * observed non-NULL global_trace.stackmap implies
+ * its resolver/reset file exists.
+ */
+ map_file = trace_create_file("stack_map",
+ TRACE_MODE_WRITE, NULL,
+ smap,
+ &ftrace_stackmap_fops);
+ if (!map_file) {
+ pr_warn("ftrace stackmap init: stack_map create failed, dedup disabled\n");
+ ftrace_stackmap_destroy(smap);
+ /*
+ * Permanent failure. Record it and clear a
+ * STACKMAP flag that a boot-time
+ * trace_options=stackmap may have set, so
+ * options/stackmap does not report an
+ * enabled no-op and later userspace enables
+ * return -EINVAL.
+ */
+ WRITE_ONCE(stackmap_init_state,
+ STACKMAP_INIT_FAILED);
+ global_trace.trace_flags &=
+ ~TRACE_ITER(STACKMAP);
+ } else {
+ /*
+ * smp_store_release pairs with the
+ * smp_load_acquire() in
+ * __ftrace_trace_stack(). Publishing only
+ * after the required file exists keeps
+ * "smap visible" => "resolver/reset
+ * available".
+ */
+ smp_store_release(&global_trace.stackmap,
+ smap);
+ WRITE_ONCE(stackmap_init_state,
+ STACKMAP_INIT_DONE);
+ /*
+ * stat and bin are auxiliary observability
+ * surfaces. If they fail to be created we
+ * keep dedup enabled (the kernel side still
+ * works, and stack_map alone is enough to
+ * resolve and reset); trace_create_file()
+ * already pr_warn()s on failure.
+ */
+ trace_create_file("stack_map_stat",
+ TRACE_MODE_READ, NULL,
+ smap,
+ &ftrace_stackmap_stat_fops);
+ trace_create_file("stack_map_bin",
+ TRACE_MODE_READ, NULL,
+ smap,
+ &ftrace_stackmap_bin_fops);
+ }
+ } else {
+ pr_warn("ftrace stackmap init failed, dedup disabled\n");
+ /*
+ * global_trace is statically defined; its stackmap
+ * field is zero-initialized via BSS, so leaving it
+ * NULL ensures the smp_load_acquire() in
+ * __ftrace_trace_stack() falls back to full stack.
+ * Mark init failed and clear any boot-time STACKMAP
+ * flag so userspace enables are rejected rather than
+ * becoming silent no-ops.
+ */
+ WRITE_ONCE(stackmap_init_state, STACKMAP_INIT_FAILED);
+ global_trace.trace_flags &= ~TRACE_ITER(STACKMAP);
+ }
+ }
+#endif
create_trace_instances(NULL);
update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..95db43bfc747 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
TRACE_TIMERLAT,
TRACE_RAW_DATA,
TRACE_FUNC_REPEATS,
+ TRACE_STACK_ID,
__TRACE_LAST_TYPE,
};
@@ -453,6 +454,9 @@ struct trace_array {
struct cond_snapshot *cond_snapshot;
#endif
struct trace_func_repeats __percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+ struct ftrace_stackmap *stackmap;
+#endif
/*
* On boot up, the ring buffer is set to the minimum size, so that
* we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
TRACE_GRAPH_RET); \
IF_ASSIGN(var, ent, struct func_repeats_entry, \
TRACE_FUNC_REPEATS); \
+ IF_ASSIGN(var, ent, struct stack_id_entry, \
+ TRACE_STACK_ID); \
__ftrace_bad_type(); \
} while (0)
@@ -689,6 +695,7 @@ extern int tracing_disabled;
int tracer_init(struct tracer *t, struct trace_array *tr);
int tracing_is_enabled(void);
void tracing_reset_online_cpus(struct array_buffer *buf);
+void tracing_reset_all_cpus(struct array_buffer *buf);
void tracing_reset_all_online_cpus(void);
void tracing_reset_all_online_cpus_unlocked(void);
int tracing_open_generic(struct inode *inode, struct file *filp);
@@ -1449,7 +1456,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
# define STACK_FLAGS
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS \
+ C(STACKMAP, "stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP_BIT -1
+#endif
+
#ifdef CONFIG_FUNCTION_PROFILER
+
# define PROFILER_FLAGS \
C(PROF_TEXT_OFFSET, "prof-text-offset"),
# ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1522,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
FUNCTION_FLAGS \
FGRAPH_FLAGS \
STACK_FLAGS \
+ STACKMAP_FLAGS \
BRANCH_FLAGS \
PROFILER_FLAGS \
FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
(void *)__entry->caller[6], (void *)__entry->caller[7])
);
+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+ TRACE_STACK_ID,
+
+ F_STRUCT(
+ __field( int, stack_id )
+ ),
+
+ F_printk("<stack_id %d>", __entry->stack_id)
+);
+
/*
* trace_printk entry:
*/
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
.funcs = &trace_user_stack_funcs,
};
+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+ int flags, struct trace_event *event)
+{
+ struct stack_id_entry *field;
+ struct trace_seq *s = &iter->seq;
+
+ trace_assign_type(field, iter->ent);
+ trace_seq_printf(s, "<stack_id %d>\n", field->stack_id);
+
+ return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+ .trace = trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+ .type = TRACE_STACK_ID,
+ .funcs = &trace_stack_id_funcs,
+};
+
/* TRACE_HWLAT */
static enum print_line_t
trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
&trace_wake_event,
&trace_stack_event,
&trace_user_stack_event,
+ &trace_stack_id_event,
&trace_bputs_event,
&trace_bprint_event,
&trace_print_event,
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 929c84075315..0c97065b0d68 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -14,6 +14,7 @@ static inline int trace_valid_entry(struct trace_entry *entry)
case TRACE_CTX:
case TRACE_WAKE:
case TRACE_STACK:
+ case TRACE_STACK_ID:
case TRACE_PRINT:
case TRACE_BRANCH:
case TRACE_GRAPH_ENT:
--
2.34.1
^ permalink raw reply related
* [RFC PATCH v4 3/3] trace: add documentation, selftest and tooling for stackmap
From: Li Pengfei @ 2026-06-16 6:41 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
lipengfei28, zhangbo56, kernel test robot
In-Reply-To: <20260616064119.438063-1-lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Add supporting files for the ftrace stackmap feature:
Documentation/trace/ftrace-stackmap.rst:
Documentation covering design, usage, tracefs interface, binary
format, and performance characteristics. Added to the 'Core Tracing
Frameworks' toctree in Documentation/trace/index.rst. Documents:
- Reset is destructive: it requires tracing to be stopped and also
clears the ring buffer so no stale <stack_id N> survives
- Boot-time activation via trace_options=stackmap
- bits parameter range [10, 18] and worst-case memory usage
- tracefs file modes (0640 / 0440)
- Best-effort snapshot semantics for stack_map_bin, serialized
against reset via the reader_sem
- Counter naming: successes (events served), drops, success_rate;
successes/drops are best-effort and saturate on long runs
- Gravestone amplification when the pool is exhausted
tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
Functional selftest verifying:
- stackmap tracefs nodes exist
- enabling stackmap + stacktrace produces stack_id events
- stack_map_stat shows non-zero successes (a nonzero drops count is
a legitimate by-design fallback and is not treated as failure;
only zero successes alongside nonzero drops is fatal)
- reset clears entries when tracing is stopped
- reset is rejected (-EBUSY) while tracing is active
Test reads trace contents BEFORE switching back to the nop tracer
(tracer_init() unconditionally resets the ring buffer). The
function:tracer dependency is declared in '# requires:' so
ftracetest skips on kernels without CONFIG_FUNCTION_TRACER instead
of failing spuriously.
tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc:
Verifies the destructive-reset semantics and the binary ABI header:
- after 'echo 0 > stack_map', the trace buffer no longer contains
any stale <stack_id N>
- stack_map_bin begins with the expected magic and version
tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc:
Verifies the option is gated to the top-level instance: a secondary
instance neither exposes options/stackmap nor the stack_map* nodes,
and writing 'stackmap' to its aggregate trace_options file is
rejected rather than accepted as a no-op.
tools/tracing/stackmap_dump.py:
Python script to parse the binary stack_map_bin export.
Features:
- Automatic endianness detection via magic number
- Batched addr2line via stdin (avoids ARG_MAX with large stacks)
- JSON output mode (ips are always hex addresses; the ftrace
trampoline marker is shown only in the resolved symbols)
- Top-N filtering by ref_count
Binary format: all fields are native-endian. The parser detects
byte order by reading the magic value (0x46534D42 = 'FSMB').
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605160010.fakzGVVq-lkp@intel.com/
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
Documentation/trace/ftrace-stackmap.rst | 177 ++++++++++++++++++
Documentation/trace/index.rst | 1 +
.../ftrace/test.d/ftrace/stackmap-basic.tc | 111 +++++++++++
.../test.d/ftrace/stackmap-instance-gate.tc | 54 ++++++
.../ftrace/test.d/ftrace/stackmap-reset.tc | 76 ++++++++
tools/tracing/stackmap_dump.py | 164 ++++++++++++++++
6 files changed, 583 insertions(+)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
create mode 100755 tools/tracing/stackmap_dump.py
diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..8d0b5c389862
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,177 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li <lipengfei28@xiaomi.com>
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks
+ (default: 14 → 16384 stacks; valid range: 10-18).
+
+ At ``bits=18`` the kernel reserves roughly 130 MB of vmalloc memory
+ for the element pool. Each ``open()`` of ``stack_map_bin`` may
+ briefly allocate a similar amount for a snapshot. The cap is set
+ intentionally to bound memory usage.
+
+Usage
+=====
+
+Enable stack deduplication::
+
+ echo 1 > /sys/kernel/debug/tracing/options/stackmap
+ echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+ echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ``<stack_id N>`` instead of full stack traces::
+
+ sh-1234 [006] d.h.. 123.456789: <stack_id 42>
+
+To view the actual stacks::
+
+ cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+ stack_id 42 [ref 1337, depth 8]
+ [0] schedule+0x48/0xc0
+ [1] schedule_timeout+0x1c/0x30
+ ...
+
+To view statistics::
+
+ cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+ entries: 2500 / 16384
+ table_size: 32768
+ successes: 148923
+ drops: 0
+ success_rate: 100%
+
+To reset the stack map (tracing must be stopped first)::
+
+ echo 0 > /sys/kernel/debug/tracing/tracing_on
+ echo 0 > /sys/kernel/debug/tracing/stack_map
+
+Reset returns ``-EBUSY`` if tracing is currently active, or if another
+reset is already in progress.
+
+Reset is destructive to the trace buffer: because the ring buffer may
+still hold ``<stack_id N>`` events that reference soon-to-be-reused
+slots, resetting the map also resets the owning trace buffer (and its
+snapshot, if allocated). This keeps ring-buffer stack_ids and the map
+coherent. Read out any trace data you need before resetting.
+
+Boot-time activation
+====================
+
+The stackmap option can be enabled from the kernel command line::
+
+ trace_options=stackmap,stacktrace
+
+Trace events that fire before the tracefs filesystem is initialized
+(``fs_initcall`` time) fall back to recording full stack traces; once
+``ftrace_stackmap_create()`` runs, subsequent events are deduplicated.
+The crossover is automatic and lossless — no events are dropped, but
+early-boot stacks recorded before the crossover are not deduplicated.
+
+Tracefs Nodes
+=============
+
+The stack_map files are owned by root and not world-readable
+(``stack_map``: 0640; ``stack_map_stat`` and ``stack_map_bin``: 0440).
+
+``stack_map``
+ Text export of all deduplicated stacks with symbol resolution.
+ Writing ``0`` or ``reset`` clears all entries (only when tracing
+ is stopped).
+
+``stack_map_stat``
+ Statistics: entries (allocated unique stacks), table_size,
+ successes (events served), drops (events that fell back to
+ full-stack recording), and success_rate. Drops accumulate when
+ the element pool is exhausted; once that happens, slots that
+ won the cmpxchg but failed to allocate an element remain
+ "claimed but empty" and increase probe pressure for any future
+ insert hashing to the same bucket. Reset (when tracing is
+ stopped) clears these gravestones.
+
+``stack_map_bin``
+ Binary export for efficient userspace consumption. Format:
+
+ - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+ - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+ All fields are written in the kernel's native byte order.
+ Userspace tools detect endianness by reading the magic value.
+ Magic: ``0x46534D42`` ('FSMB'), Version: 1.
+
+ Trampoline frames are exported as the sentinel value
+ ``0x7fffffff`` (FTRACE_TRAMPOLINE_MARKER); all other addresses are
+ passed through ``trace_adjust_address()`` so they match the
+ ``stack_map`` text output's address-adjustment rules. Note this is
+ the same adjustment ftrace applies to its own trace output (mainly
+ relevant for persistent / last-boot buffers), not a general KASLR
+ un-offset: resolving these addresses offline still requires the
+ matching kernel's symbol information.
+
+ The export is a best-effort snapshot allocated at ``open()``;
+ concurrent inserts during the snapshot may be truncated. A
+ bounds check ensures no overflow.
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr. Cliff Click's non-blocking hash table
+algorithm:
+
+- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context
+- **Memory**: Pre-allocated element pool, zero allocation on the hot path
+ (no GFP_ATOMIC failures under memory pressure)
+- **Collision**: Linear probing with a 2x over-provisioned table; probe
+ length is bounded so worst-case insert/lookup is O(1)
+- **Scope**: Currently supports the global trace instance
+- **Hash**: 32-bit jhash with a per-instance random seed; full ``memcmp``
+ confirms matches
+
+Deduplication is best-effort, not strict: if two CPUs race in the
+insert path with the same ``key_hash`` (i.e. the same stack), the
+``cmpxchg`` loser advances by one slot and may insert the same stack
+again. Under heavy contention this can produce a small number of
+duplicate entries for the same stack; ``ref_count`` is then split
+across the duplicates. Total memory is still bounded by the element
+pool size, and lookup correctness is unaffected (each duplicate is
+a self-consistent entry with its own ``stack_id``). The trade-off is
+intentional and keeps the hot path lock-free.
+
+Performance
+===========
+
+Typical results on an aarch64 SMP system (function tracer, 2 seconds):
+
+- Unique stacks: ~3000
+- Dedup rate: 84-98% (depends on workload diversity)
+- Ring buffer savings: ~80% for stack data
+- Overhead per event: ~50ns (one jhash + hash table lookup)
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5d9bf4694d5d..ac8b1141c23a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -33,6 +33,7 @@ the Linux kernel.
ftrace
ftrace-design
ftrace-uses
+ ftrace-stackmap
kprobes
kprobetrace
fprobetrace
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
new file mode 100644
index 000000000000..64dfe7cc66bd
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
@@ -0,0 +1,111 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap basic functionality
+# requires: stack_map options/stackmap function:tracer
+
+# Test that ftrace stackmap deduplication works:
+# 1. Enable stackmap + stacktrace options
+# 2. Run function tracer briefly
+# 3. Verify trace contains <stack_id> events (read BEFORE switching
+# tracer back to nop, since tracer_init() resets the ring buffer)
+# 4. Verify stack_map has entries and at least some successes (drops is
+# a legitimate by-design fallback counter and is allowed to be nonzero;
+# only zero successes alongside nonzero drops indicates breakage)
+# 5. Verify reset is rejected (-EBUSY) while tracing is active
+# 6. Verify reset clears the map when tracing is stopped
+
+fail() {
+ echo "FAIL: $1"
+ exit_fail
+}
+
+# Restore state on any exit (success, fail, or interrupt) so a
+# half-finished test does not leave stacktrace/stackmap enabled.
+cleanup() {
+ disable_tracing 2>/dev/null
+ echo nop > current_tracer 2>/dev/null
+ echo 0 > options/stackmap 2>/dev/null
+ echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+
+# Read trace contents NOW, before switching tracer back to nop.
+# tracer_init() unconditionally calls tracing_reset_online_cpus(),
+# so the ring buffer would be empty after 'echo nop > current_tracer'.
+count=$(grep -c "<stack_id" trace || true)
+: "${count:=0}"
+if [ "$count" -eq 0 ]; then
+ fail "trace has no <stack_id> events"
+fi
+
+# Now safe to switch back and disable options
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries:=0}"
+if [ "$entries" -eq 0 ]; then
+ fail "stackmap has zero entries after tracing"
+fi
+
+successes=$(cat stack_map_stat | grep "^successes:" | awk '{print $2}')
+: "${successes:=0}"
+if [ "$successes" -eq 0 ]; then
+ fail "stackmap has zero successes"
+fi
+
+drops=$(cat stack_map_stat | grep "^drops:" | awk '{print $2}')
+: "${drops:=0}"
+# drops is a legitimate by-design fallback counter: when the map is full
+# or under heavy probe pressure, stackmap falls back to recording a full
+# stack instead of a stack_id. A nonzero drops count is therefore not a
+# failure. Only treat it as fatal if dedup never worked at all (no
+# successes), which would indicate the feature is genuinely broken rather
+# than merely under pressure.
+if [ "$successes" -eq 0 ] && [ "$drops" -ne 0 ]; then
+ fail "stackmap had $drops drops and zero successes (feature broken?)"
+fi
+
+# Check stack_map text output is parseable
+first_id=$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}')
+if [ -z "$first_id" ]; then
+ fail "stack_map output has no stack_id entries"
+fi
+
+# Test that reset is rejected while tracing is active
+enable_tracing
+if echo 0 > stack_map 2>/dev/null; then
+ disable_tracing
+ fail "stackmap reset should fail while tracing is active"
+fi
+disable_tracing
+
+# Test reset works when tracing is stopped
+echo 0 > stack_map
+entries_after=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries_after:=-1}"
+if [ "$entries_after" -ne 0 ]; then
+ fail "stackmap reset did not clear entries (got $entries_after)"
+fi
+
+echo "stackmap basic test passed: $entries unique stacks, $successes successes, $drops drops"
+exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
new file mode 100644
index 000000000000..28810ba20432
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
@@ -0,0 +1,54 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap option is gated to the top-level trace instance
+# requires: stack_map options/stackmap instances
+
+# The 'stackmap' option is added to TOP_LEVEL_TRACE_FLAGS, matching the
+# convention used for global-only options like 'printk' and 'record-cmd'.
+# Verify that:
+# 1. The global instance exposes options/stackmap and the stack_map* nodes.
+# 2. A newly created secondary instance under instances/ does NOT expose
+# options/stackmap or stack_map* nodes.
+
+fail() {
+ echo "FAIL: $1"
+ rmdir instances/test_stackmap_gate 2>/dev/null
+ exit_fail
+}
+
+# 1. Global instance must expose the option and the nodes
+test -e options/stackmap || fail "options/stackmap missing on global instance"
+test -e stack_map || fail "stack_map missing on global instance"
+test -e stack_map_stat || fail "stack_map_stat missing on global instance"
+test -e stack_map_bin || fail "stack_map_bin missing on global instance"
+
+# 2. Create a secondary instance and verify it does NOT see the option
+# or the stack_map* nodes.
+mkdir instances/test_stackmap_gate || fail "could not create secondary instance"
+
+if [ -e instances/test_stackmap_gate/options/stackmap ]; then
+ fail "secondary instance unexpectedly exposes options/stackmap"
+fi
+
+for f in stack_map stack_map_stat stack_map_bin; do
+ if [ -e instances/test_stackmap_gate/$f ]; then
+ fail "secondary instance unexpectedly has $f"
+ fi
+done
+
+# 3. The aggregate trace_options file still reaches set_tracer_flag(),
+# so writing 'stackmap' there must be rejected on a secondary
+# instance. Otherwise the bit could appear set in trace_options
+# while the hot path silently falls back to a full stack trace
+# (tr->stackmap == NULL).
+if echo stackmap > instances/test_stackmap_gate/trace_options 2>/dev/null; then
+ fail "secondary instance accepted 'echo stackmap > trace_options'"
+fi
+if grep -qw stackmap instances/test_stackmap_gate/trace_options; then
+ fail "secondary instance trace_options reports stackmap as set"
+fi
+
+rmdir instances/test_stackmap_gate || fail "could not remove secondary instance"
+
+echo "stackmap option gating to top-level instance works"
+exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
new file mode 100644
index 000000000000..803cc282f9ab
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
@@ -0,0 +1,76 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap reset clears the trace buffer and ABI header
+# requires: stack_map options/stackmap function:tracer
+
+# Lock in the two things most likely to regress in the stackmap ABI /
+# lifetime:
+# 1. Resetting the stackmap (echo 0 > stack_map, tracing stopped) also
+# clears the trace buffer, so no stale <stack_id N> can be left
+# dangling against an emptied map.
+# 2. The stack_map_bin header carries the expected magic ('FSMB' =
+# 0x46534D42) and version (1).
+
+fail() {
+ echo "FAIL: $1"
+ exit_fail
+}
+
+cleanup() {
+ disable_tracing 2>/dev/null
+ echo nop > current_tracer 2>/dev/null
+ echo 0 > options/stackmap 2>/dev/null
+ echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+
+# Sanity: the buffer must contain stack_id events before reset, otherwise
+# the post-reset emptiness check below would be meaningless.
+before=$(grep -c "<stack_id" trace || true)
+: "${before:=0}"
+if [ "$before" -eq 0 ]; then
+ fail "no <stack_id> events captured before reset"
+fi
+
+# Reset while tracing is stopped. This must succeed AND clear the trace
+# buffer (destructive reset semantics).
+echo 0 > stack_map || fail "reset rejected while tracing stopped"
+
+after=$(grep -c "<stack_id" trace || true)
+: "${after:=0}"
+if [ "$after" -ne 0 ]; then
+ fail "trace still has $after stale <stack_id> events after reset"
+fi
+
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries:=-1}"
+if [ "$entries" -ne 0 ]; then
+ fail "stackmap still has $entries entries after reset"
+fi
+
+# Binary export header: magic 'FSMB' (0x46534D42) + version 1.
+# od -tx4 renders the 32-bit words in the target's native byte order,
+# which matches what the kernel wrote, so the comparison is endian-safe.
+if command -v od >/dev/null 2>&1; then
+ magic=$(od -An -tx4 -N4 stack_map_bin | tr -d ' \n')
+ if [ "$magic" != "46534d42" ]; then
+ fail "stack_map_bin bad magic: 0x$magic (expected 46534d42)"
+ fi
+ ver=$(od -An -tx4 -j4 -N4 stack_map_bin | tr -d ' \n')
+ if [ "$ver" != "00000001" ]; then
+ fail "stack_map_bin bad version: 0x$ver (expected 00000001)"
+ fi
+fi
+
+echo "stackmap reset test passed: cleared $before stack_id events, ABI header ok"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..2d9c49b776e6
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,164 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+ # Pull from device and parse
+ adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+ python3 stackmap_dump.py /tmp/stack_map.bin
+
+ # With vmlinux for offline symbol resolution
+ python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+ # JSON output for tooling
+ python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x46534D42 # 'FSMB'
+HEADER_SIZE = 16 # 4 x u32
+ENTRY_SIZE = 16 # 4 x u32
+
+# __ftrace_trace_stack() replaces trampoline addresses with this marker
+# (FTRACE_TRAMPOLINE_MARKER == (unsigned long)INT_MAX) before the stack
+# is stored, so the binary export carries it verbatim.
+FTRACE_TRAMPOLINE_MARKER = 0x7fffffff
+TRAMPOLINE_LABEL = '[FTRACE TRAMPOLINE]'
+
+
+def detect_endianness(data):
+ """Detect byte order from magic number in header."""
+ if len(data) < 4:
+ raise ValueError("File too small")
+ magic_le = struct.unpack_from('<I', data, 0)[0]
+ if magic_le == MAGIC:
+ return '<'
+ magic_be = struct.unpack_from('>I', data, 0)[0]
+ if magic_be == MAGIC:
+ return '>'
+ raise ValueError(f"Bad magic: 0x{magic_le:08x} (neither LE nor BE)")
+
+
+def batch_addr2line(vmlinux, addrs):
+ """Resolve multiple addresses in one addr2line invocation."""
+ if not addrs:
+ return {}
+ try:
+ # Feed addresses on stdin to avoid ARG_MAX limits with large
+ # numbers of addresses (one stack can have 30+ frames; a
+ # snapshot can have thousands of unique stacks).
+ stdin = '\n'.join(hex(a) for a in addrs) + '\n'
+ result = subprocess.run(
+ ['addr2line', '-f', '-e', vmlinux],
+ input=stdin, capture_output=True, text=True, timeout=60
+ )
+ lines = result.stdout.split('\n')
+ # addr2line outputs 2 lines per address: function name + source location
+ symbols = {}
+ for i, addr in enumerate(addrs):
+ idx = i * 2
+ if idx < len(lines) and lines[idx] and lines[idx] != '??':
+ symbols[addr] = lines[idx]
+ return symbols
+ except (subprocess.TimeoutExpired, FileNotFoundError) as e:
+ print(f"warning: addr2line failed: {e}", file=sys.stderr)
+ return {}
+
+
+def parse_stackmap_bin(data):
+ """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+ if len(data) < HEADER_SIZE:
+ raise ValueError("File too small for header")
+
+ endian = detect_endianness(data)
+ header_fmt = f'{endian}IIII'
+ entry_fmt = f'{endian}IIII'
+
+ magic, version, nr_stacks, _ = struct.unpack_from(header_fmt, data, 0)
+ if version != 1:
+ raise ValueError(f"Unsupported version: {version}")
+
+ offset = HEADER_SIZE
+ for _ in range(nr_stacks):
+ if offset + ENTRY_SIZE > len(data):
+ break
+ stack_id, nr, ref_count, _ = struct.unpack_from(entry_fmt, data, offset)
+ offset += ENTRY_SIZE
+
+ ips_size = nr * 8
+ if offset + ips_size > len(data):
+ break
+ ips = struct.unpack_from(f'{endian}{nr}Q', data, offset)
+ offset += ips_size
+
+ yield stack_id, ref_count, list(ips)
+
+
+def main():
+ parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+ parser.add_argument('file', help='Path to stack_map_bin file')
+ parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+ parser.add_argument('--json', action='store_true', help='JSON output')
+ parser.add_argument('--top', type=int, default=0,
+ help='Show only top N stacks by ref_count')
+ args = parser.parse_args()
+
+ with open(args.file, 'rb') as f:
+ data = f.read()
+
+ stacks = list(parse_stackmap_bin(data))
+
+ if args.top > 0:
+ stacks.sort(key=lambda x: x[1], reverse=True)
+ stacks = stacks[:args.top]
+
+ # Batch symbol resolution
+ symbols = {}
+ if args.vmlinux:
+ all_addrs = set()
+ for _, _, ips in stacks:
+ all_addrs.update(ip for ip in ips
+ if ip != FTRACE_TRAMPOLINE_MARKER)
+ symbols = batch_addr2line(args.vmlinux, list(all_addrs))
+
+ def render(ip):
+ if ip == FTRACE_TRAMPOLINE_MARKER:
+ return TRAMPOLINE_LABEL
+ return symbols.get(ip, f'0x{ip:x}')
+
+ if args.json:
+ output = []
+ for stack_id, ref_count, ips in stacks:
+ entry = {
+ 'stack_id': stack_id,
+ 'ref_count': ref_count,
+ 'ips': [f'0x{ip:x}' for ip in ips]
+ }
+ if args.vmlinux:
+ entry['symbols'] = [render(ip) for ip in ips]
+ output.append(entry)
+ print(json.dumps(output, indent=2))
+ else:
+ for stack_id, ref_count, ips in stacks:
+ print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+ for i, ip in enumerate(ips):
+ if ip == FTRACE_TRAMPOLINE_MARKER:
+ print(f" [{i}] {TRAMPOLINE_LABEL}")
+ continue
+ sym = symbols.get(ip, '')
+ if sym:
+ sym = f' {sym}'
+ print(f" [{i}] 0x{ip:x}{sym}")
+ print()
+
+ print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+ main()
--
2.34.1
^ permalink raw reply related
* Re: [PATCH v3 05/13] verification/rvgen: Convert __fill_setup_invariants_func() to Lark
From: Nam Cao @ 2026-06-16 9:00 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
linux-kernel
In-Reply-To: <ded9af4b0cdde4696c0a295e0c622a9d39c445ee.camel@redhat.com>
Gabriele Monaco <gmonaco@redhat.com> writes:
> val is a string, doing val * 1000 repeats the string 1000 times in python!
Oh crap.
^ permalink raw reply
* Re: [PATCH v3 2/9] rv: add generic uprobe infrastructure for RV monitors
From: Gabriele Monaco @ 2026-06-16 9:49 UTC (permalink / raw)
To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <9d1a1d491af16853b2b421f358fd6cca965588ab.1780847473.git.wen.yang@linux.dev>
On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> Introduce rv_uprobe, a thin wrapper around uprobe_consumer providing
> rv_uprobe_attach_path(), rv_uprobe_attach(), and rv_uprobe_detach()
> for RV monitors. An opaque priv pointer is forwarded unchanged to
> entry/return handlers so monitors can carry per-binding state (e.g. a
> latency threshold) to the hot path without any global lookup.
>
> rv_uprobe_detach() is fully synchronous (nosync + sync + path_put +
> kfree), closing the use-after-free window present in open-coded
> patterns where kfree() precedes uprobe_unregister_sync().
>
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
I find your implementation solid, but I wonder if it isn't a bit
overdoing it. It's good to have helper functions abstracting away the
common uprobes code, but I find the number of structures used to indirect/hide
the process a bit confusing.
Users of this headers still know they're using uprobes, so we don't
really need to hide it that much, plus the double function pointers are
an additional performance hit we could easily skip.
As I get it, we want to pass a private pointer (essentially to get the
threshold in tlob) and the uprobe infrastructure doesn't allow that
transparently. So let's keep on using struct embedding but a bit more
cut to the bone:
struct rv_uprobe {
struct uprobe_consumer uc;
struct uprobe *uprobe;
void *priv;
};
Then handlers would use container_of() appropriately to derive the data they
need.
static int tlob_uprobe_entry_handler(struct uprobe_consumer *self, struct pt_regs *regs, __u64 *data)
{
struct rv_uprobe *p = container_of(self, struct rv_uprobe, uc);
struct tlob_uprobe_binding *b = p->priv;
...
}
You could even define the struct directly inside your private data and save one
step (and no void *):
struct rv_uprobe {
struct uprobe_consumer uc;
struct uprobe *uprobe;
};
#define DECLARE_RV_UPROBE(name) struct rv_uprobe name
// And then
struct tlob_uprobe_binding {
u64 threshold_ns;
...
DECLARE_RV_UPROBE(start);
DECLARE_RV_UPROBE(stop);
}
static int tlob_uprobe_entry_handler(struct uprobe_consumer *self, struct pt_regs *regs, __u64 *data)
{
struct tlob_uprobe_binding *b = container_of(self, struct tlob_uprobe_binding, start.uc);
...
}
Now I intentionally skipped offset, as I don't really see why carry it around,
and path, as we could live without it by relying on inodes (uprobes
reference-track them but you'd need to save them somewhere).
Mind, I only build tested this idea (appended after). This was supported by an
LLM which did a few more changes like standardising names and return values of
registration functions (a bit more consistent with other stuff, you don't have
to commit with that though).
By the way, do we really need the duplicate check? Would it break if a user
defines it twice? If not, we could simplify the tlob_add_uprobe() even further
and doing all path related checks in the library, since those aren't quite tlob
specific.
Thoughts?
Thanks,
Gabriele
**Suggested simplification:**
---
diff --git a/include/rv/rv_uprobe.h b/include/rv/rv_uprobe.h
index 9106c5c927..4fb3f50a63 100644
--- a/include/rv/rv_uprobe.h
+++ b/include/rv/rv_uprobe.h
@@ -7,83 +7,56 @@
#ifndef _RV_UPROBE_H
#define _RV_UPROBE_H
-#include <linux/path.h>
#include <linux/types.h>
+#include <linux/uprobes.h>
struct pt_regs;
+struct inode;
/**
* struct rv_uprobe - a single uprobe registered on behalf of an RV monitor
*
- * @offset: byte offset within the ELF binary where the probe is installed
- * @priv: monitor-private pointer; set at attach time, never touched by
- * this layer; passed unchanged to entry_fn / ret_fn
- * @path: resolved path of the probed binary (read-only after attach);
- * callers may use path.dentry for identity comparisons
- *
- * The implementation fields (uprobe_consumer, uprobe handle, callbacks) are
- * private to rv_uprobe.c and are not exposed here; monitors must not access
- * them directly.
+ * @uc: underlying uprobe consumer (publicly visible)
+ * @uprobe: active uprobe structure handle
+ * @inode: inode of the target binary (read-only after registration)
*/
struct rv_uprobe {
- /* public: read-only after rv_uprobe_attach*() */
- loff_t offset;
- void *priv;
- struct path path;
+ struct uprobe_consumer uc;
+ struct uprobe *uprobe;
+ struct inode *inode;
};
-/**
- * rv_uprobe_attach_path - register an uprobe given an already-resolved path
- * @path: path of the target binary; rv_uprobe takes its own reference
- * @offset: byte offset within the binary
- * @entry_fn: called on probe hit (entry); may be NULL
- * @ret_fn: called on function return (uretprobe); may be NULL
- * @priv: opaque pointer forwarded to callbacks unchanged
- *
- * Use this variant when the caller has already resolved the path (e.g. to
- * register multiple probes on the same binary with a single kern_path call).
- * The inode is derived internally via d_real_inode(), so inode and path are
- * always consistent.
- *
- * Returns a pointer to the new rv_uprobe on success, ERR_PTR on failure.
- */
-struct rv_uprobe *rv_uprobe_attach_path(struct path *path, loff_t offset,
- int (*entry_fn)(struct rv_uprobe *p, struct pt_regs *regs, __u64 *data),
- int (*ret_fn)(struct rv_uprobe *p, unsigned long func,
- struct pt_regs *regs, __u64 *data),
- void *priv);
+/* Seamless inline declaration of a named uprobe inside user structs */
+#define DECLARE_RV_UPROBE(name) \
+ struct rv_uprobe name
/**
- * rv_uprobe_attach - resolve binpath and register an uprobe
- * @binpath: absolute path to the target binary
- * @offset: byte offset within the binary
- * @entry_fn: called on probe hit (entry); may be NULL
- * @ret_fn: called on function return (uretprobe); may be NULL
- * @priv: opaque pointer forwarded to callbacks unchanged
+ * rv_uprobe_register - resolve binpath and register an uprobe
+ * @binpath: absolute path to the target binary
+ * @offset: byte offset within the binary
+ * @p: pointer to the allocated/embedded rv_uprobe structure
+ * @handler: called on probe hit (entry); may be NULL
+ * @ret_handler: called on function return (uretprobe); may be NULL
*
- * Resolves binpath via kern_path(), then delegates to rv_uprobe_attach_path().
+ * Resolves binpath via kern_path(), registers the uprobe directly using the
+ * embedded `uprobe_consumer`, and immediately releases the path reference.
*
- * Returns a pointer to the new rv_uprobe on success, ERR_PTR on failure.
+ * Returns 0 on success, or a negative error code on failure.
*/
-struct rv_uprobe *rv_uprobe_attach(const char *binpath, loff_t offset,
- int (*entry_fn)(struct rv_uprobe *p, struct pt_regs *regs, __u64 *data),
- int (*ret_fn)(struct rv_uprobe *p, unsigned long func,
- struct pt_regs *regs, __u64 *data),
- void *priv);
+int rv_uprobe_register(const char *binpath, loff_t offset, struct rv_uprobe *p,
+ int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs, __u64 *data),
+ int (*ret_handler)(struct uprobe_consumer *self, unsigned long func,
+ struct pt_regs *regs, __u64 *data));
/**
- * rv_uprobe_detach - synchronously unregister an uprobe and free it
- * @p: probe to detach; may be NULL (no-op)
- *
- * Calls uprobe_unregister_nosync(), then uprobe_unregister_sync() to wait
- * for any in-progress handler to finish, then releases the path reference
- * and frees the rv_uprobe struct. The caller's priv data is NOT freed.
+ * rv_uprobe_unregister - synchronously unregister a uprobe
+ * @p: probe to unregister; may be NULL (no-op)
*
- * When removing a single probe, prefer this over the three-phase API.
- * Safe to call from process context only (uprobe_unregister_sync() may
- * schedule).
+ * Dequeues the uprobe and waits synchronously for all in-flight handlers
+ * to complete.
*/
-void rv_uprobe_detach(struct rv_uprobe *p);
+void rv_uprobe_unregister(struct rv_uprobe *p);
/**
* rv_uprobe_unregister_nosync - dequeue an uprobe without waiting
@@ -91,9 +64,7 @@ void rv_uprobe_detach(struct rv_uprobe *p);
*
* Removes the uprobe from the uprobe subsystem but does NOT wait for
* in-flight handlers to complete. The caller must call rv_uprobe_sync()
- * before calling rv_uprobe_free() on the same probe.
- *
- * Use this to batch multiple deregistrations before a single rv_uprobe_sync().
+ * before freeing any container holding this probe.
*/
void rv_uprobe_unregister_nosync(struct rv_uprobe *p);
@@ -101,19 +72,8 @@ void rv_uprobe_unregister_nosync(struct rv_uprobe *p);
* rv_uprobe_sync - wait for all in-flight uprobe handlers to complete
*
* Global barrier: waits for every in-flight uprobe handler across the system
- * to finish. Call once after a batch of rv_uprobe_unregister_nosync() calls
- * and before any rv_uprobe_free() call.
+ * to finish.
*/
void rv_uprobe_sync(void);
-/**
- * rv_uprobe_free - release resources of a previously deregistered probe
- * @p: probe to free; may be NULL (no-op)
- *
- * Releases the path reference and frees the rv_uprobe struct. Must only
- * be called after rv_uprobe_sync() has returned. The caller's priv data
- * is NOT freed.
- */
-void rv_uprobe_free(struct rv_uprobe *p);
-
#endif /* _RV_UPROBE_H */
diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
index d8e0c47947..28a6c740c7 100644
--- a/kernel/trace/rv/monitors/tlob/tlob.c
+++ b/kernel/trace/rv/monitors/tlob/tlob.c
@@ -252,8 +252,8 @@ struct tlob_uprobe_binding {
char binpath[TLOB_MAX_PATH];
loff_t offset_start;
loff_t offset_stop;
- struct rv_uprobe *start_probe;
- struct rv_uprobe *stop_probe;
+ DECLARE_RV_UPROBE(start_probe);
+ DECLARE_RV_UPROBE(stop_probe);
};
/* RCU callback: free the slab once no readers remain. */
@@ -512,16 +512,16 @@ int tlob_stop_task(struct task_struct *task)
EXPORT_SYMBOL_GPL(tlob_stop_task);
-static int tlob_uprobe_entry_handler(struct rv_uprobe *p, struct pt_regs *regs,
+static int tlob_uprobe_entry_handler(struct uprobe_consumer *self, struct pt_regs *regs,
__u64 *data)
{
- struct tlob_uprobe_binding *b = p->priv;
+ struct tlob_uprobe_binding *b = container_of(self, struct tlob_uprobe_binding, start_probe.uc);
tlob_start_task(current, b->threshold_ns);
return 0;
}
-static int tlob_uprobe_stop_handler(struct rv_uprobe *p, struct pt_regs *regs,
+static int tlob_uprobe_stop_handler(struct uprobe_consumer *self, struct pt_regs *regs,
__u64 *data)
{
tlob_stop_task(current);
@@ -537,6 +537,7 @@ static int tlob_add_uprobe(u64 threshold_ns, const char *binpath,
{
struct tlob_uprobe_binding *b, *tmp_b;
char pathbuf[TLOB_MAX_PATH];
+ struct inode *inode;
struct path path;
char *canon;
int ret;
@@ -561,10 +562,12 @@ static int tlob_add_uprobe(u64 threshold_ns, const char *binpath,
goto err_path;
}
- /* Reject duplicate start offset for the same binary. */
+ inode = d_real_inode(path.dentry);
+
+ /* Reject duplicate start offset for the same binary inode. */
list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
if (tmp_b->offset_start == offset_start &&
- tmp_b->start_probe->path.dentry == path.dentry) {
+ tmp_b->start_probe.inode == inode) {
ret = -EEXIST;
goto err_path;
}
@@ -577,29 +580,25 @@ static int tlob_add_uprobe(u64 threshold_ns, const char *binpath,
}
strscpy(b->binpath, canon, sizeof(b->binpath));
- /* Both probes share b (priv) and path; attach_path refs path itself. */
- b->start_probe = rv_uprobe_attach_path(&path, offset_start,
- tlob_uprobe_entry_handler, NULL, b);
- if (IS_ERR(b->start_probe)) {
- ret = PTR_ERR(b->start_probe);
- b->start_probe = NULL;
- goto err_path;
- }
+ path_put(&path);
+
+ /* Both probes are registered directly on the embedded fields */
+ ret = rv_uprobe_register(b->binpath, offset_start, &b->start_probe,
+ tlob_uprobe_entry_handler, NULL);
+ if (ret)
+ goto err_free;
- b->stop_probe = rv_uprobe_attach_path(&path, offset_stop,
- tlob_uprobe_stop_handler, NULL, b);
- if (IS_ERR(b->stop_probe)) {
- ret = PTR_ERR(b->stop_probe);
- b->stop_probe = NULL;
+ ret = rv_uprobe_register(b->binpath, offset_stop, &b->stop_probe,
+ tlob_uprobe_stop_handler, NULL);
+ if (ret)
goto err_start;
- }
- path_put(&path);
list_add_tail(&b->list, &tlob_uprobe_list);
return 0;
err_start:
- rv_uprobe_detach(b->start_probe);
+ rv_uprobe_unregister(&b->start_probe);
+ goto err_free;
err_path:
path_put(&path);
err_free:
@@ -611,21 +610,24 @@ static int tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
{
struct tlob_uprobe_binding *b, *tmp;
struct path remove_path;
+ struct inode *inode;
int ret;
ret = kern_path(binpath, LOOKUP_FOLLOW, &remove_path);
if (ret)
return ret;
+ inode = d_real_inode(remove_path.dentry);
+
ret = -ENOENT;
list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
if (b->offset_start != offset_start)
continue;
- if (b->start_probe->path.dentry != remove_path.dentry)
+ if (b->start_probe.inode != inode)
continue;
list_del(&b->list);
- rv_uprobe_detach(b->start_probe);
- rv_uprobe_detach(b->stop_probe);
+ rv_uprobe_unregister(&b->start_probe);
+ rv_uprobe_unregister(&b->stop_probe);
kfree(b);
ret = 0;
break;
@@ -643,8 +645,8 @@ static void tlob_remove_all_uprobes(void)
mutex_lock(&tlob_uprobe_mutex);
list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
list_move(&b->list, &pending);
- rv_uprobe_unregister_nosync(b->start_probe);
- rv_uprobe_unregister_nosync(b->stop_probe);
+ rv_uprobe_unregister_nosync(&b->start_probe);
+ rv_uprobe_unregister_nosync(&b->stop_probe);
}
mutex_unlock(&tlob_uprobe_mutex);
@@ -658,8 +660,6 @@ static void tlob_remove_all_uprobes(void)
rv_uprobe_sync();
list_for_each_entry_safe(b, tmp, &pending, list) {
- rv_uprobe_free(b->start_probe);
- rv_uprobe_free(b->stop_probe);
kfree(b);
}
}
diff --git a/kernel/trace/rv/rv_uprobe.c b/kernel/trace/rv/rv_uprobe.c
index 3d8b764dde..69b8b0c27e 100644
--- a/kernel/trace/rv/rv_uprobe.c
+++ b/kernel/trace/rv/rv_uprobe.c
@@ -10,149 +10,74 @@
#include <linux/uprobes.h>
#include <rv/rv_uprobe.h>
-/*
- * Private extension of struct rv_uprobe. Allocated by rv_uprobe_attach*()
- * and returned to callers as &impl->pub.
- */
-struct rv_uprobe_impl {
- struct rv_uprobe pub; /* must be first; callers hold &pub */
- struct uprobe_consumer uc;
- struct uprobe *uprobe;
- int (*entry_fn)(struct rv_uprobe *p, struct pt_regs *regs, __u64 *data);
- int (*ret_fn)(struct rv_uprobe *p, unsigned long func,
- struct pt_regs *regs, __u64 *data);
-};
-
-static int rv_uprobe_handler(struct uprobe_consumer *uc,
- struct pt_regs *regs, __u64 *data)
-{
- struct rv_uprobe_impl *impl = container_of(uc, struct rv_uprobe_impl, uc);
-
- if (impl->entry_fn)
- return impl->entry_fn(&impl->pub, regs, data);
- return 0;
-}
-
-static int rv_uprobe_ret_handler(struct uprobe_consumer *uc,
- unsigned long func,
- struct pt_regs *regs, __u64 *data)
-{
- struct rv_uprobe_impl *impl = container_of(uc, struct rv_uprobe_impl, uc);
-
- if (impl->ret_fn)
- return impl->ret_fn(&impl->pub, func, regs, data);
- return 0;
-}
-
-static struct rv_uprobe *
-__rv_uprobe_attach(struct inode *inode, struct path *path, loff_t offset,
- int (*entry_fn)(struct rv_uprobe *p, struct pt_regs *regs, __u64 *data),
- int (*ret_fn)(struct rv_uprobe *p, unsigned long func,
- struct pt_regs *regs, __u64 *data),
- void *priv)
-{
- struct rv_uprobe_impl *impl;
- int ret;
-
- if (!entry_fn && !ret_fn)
- return ERR_PTR(-EINVAL);
-
- impl = kzalloc_obj(*impl, GFP_KERNEL);
- if (!impl)
- return ERR_PTR(-ENOMEM);
-
- impl->pub.offset = offset;
- impl->pub.priv = priv;
- impl->entry_fn = entry_fn;
- impl->ret_fn = ret_fn;
- path_get(path);
- impl->pub.path = *path;
-
- if (entry_fn)
- impl->uc.handler = rv_uprobe_handler;
- if (ret_fn)
- impl->uc.ret_handler = rv_uprobe_ret_handler;
-
- impl->uprobe = uprobe_register(inode, offset, 0, &impl->uc);
- if (IS_ERR(impl->uprobe)) {
- ret = PTR_ERR(impl->uprobe);
- path_put(&impl->pub.path);
- kfree(impl);
- return ERR_PTR(ret);
- }
-
- return &impl->pub;
-}
-
-/**
- * rv_uprobe_attach_path - register an uprobe given an already-resolved path
- */
-struct rv_uprobe *rv_uprobe_attach_path(struct path *path, loff_t offset,
- int (*entry_fn)(struct rv_uprobe *p, struct pt_regs *regs, __u64 *data),
- int (*ret_fn)(struct rv_uprobe *p, unsigned long func,
- struct pt_regs *regs, __u64 *data),
- void *priv)
-{
- struct inode *inode = d_real_inode(path->dentry);
-
- return __rv_uprobe_attach(inode, path, offset, entry_fn, ret_fn, priv);
-}
-EXPORT_SYMBOL_GPL(rv_uprobe_attach_path);
-
/**
- * rv_uprobe_attach - resolve binpath and register an uprobe
+ * rv_uprobe_register - resolve binpath and register an uprobe
*/
-struct rv_uprobe *rv_uprobe_attach(const char *binpath, loff_t offset,
- int (*entry_fn)(struct rv_uprobe *p, struct pt_regs *regs, __u64 *data),
- int (*ret_fn)(struct rv_uprobe *p, unsigned long func,
- struct pt_regs *regs, __u64 *data),
- void *priv)
+int rv_uprobe_register(const char *binpath, loff_t offset, struct rv_uprobe *p,
+ int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs, __u64 *data),
+ int (*ret_handler)(struct uprobe_consumer *self, unsigned long func,
+ struct pt_regs *regs, __u64 *data))
{
- struct rv_uprobe *p;
+ struct inode *inode;
struct path path;
int ret;
+ if (!handler && !ret_handler)
+ return -EINVAL;
+
ret = kern_path(binpath, LOOKUP_FOLLOW, &path);
if (ret)
- return ERR_PTR(ret);
+ return ret;
if (!d_is_reg(path.dentry)) {
path_put(&path);
- return ERR_PTR(-EINVAL);
+ return -EINVAL;
}
- p = rv_uprobe_attach_path(&path, offset, entry_fn, ret_fn, priv);
+ inode = d_real_inode(path.dentry);
+
+ p->uc.handler = handler;
+ p->uc.ret_handler = ret_handler;
+ p->inode = inode;
+
+ p->uprobe = uprobe_register(inode, offset, 0, &p->uc);
path_put(&path);
- return p;
+
+ if (IS_ERR(p->uprobe)) {
+ ret = PTR_ERR(p->uprobe);
+ p->uprobe = NULL;
+ p->inode = NULL;
+ return ret;
+ }
+
+ return 0;
}
-EXPORT_SYMBOL_GPL(rv_uprobe_attach);
+EXPORT_SYMBOL_GPL(rv_uprobe_register);
/**
- * rv_uprobe_detach - synchronously unregister an uprobe and free it
+ * rv_uprobe_unregister - synchronously unregister a uprobe
*/
-void rv_uprobe_detach(struct rv_uprobe *p)
+void rv_uprobe_unregister(struct rv_uprobe *p)
{
- if (!p)
+ if (!p || IS_ERR_OR_NULL(p->uprobe))
return;
rv_uprobe_unregister_nosync(p);
rv_uprobe_sync();
- rv_uprobe_free(p);
}
-EXPORT_SYMBOL_GPL(rv_uprobe_detach);
+EXPORT_SYMBOL_GPL(rv_uprobe_unregister);
/**
* rv_uprobe_unregister_nosync - dequeue an uprobe without waiting
*/
void rv_uprobe_unregister_nosync(struct rv_uprobe *p)
{
- struct rv_uprobe_impl *impl;
-
- if (!p)
+ if (!p || IS_ERR_OR_NULL(p->uprobe))
return;
- impl = container_of(p, struct rv_uprobe_impl, pub);
- uprobe_unregister_nosync(impl->uprobe, &impl->uc);
+ uprobe_unregister_nosync(p->uprobe, &p->uc);
+ p->uprobe = NULL;
+ p->inode = NULL;
}
EXPORT_SYMBOL_GPL(rv_uprobe_unregister_nosync);
@@ -164,19 +89,3 @@ void rv_uprobe_sync(void)
uprobe_unregister_sync();
}
EXPORT_SYMBOL_GPL(rv_uprobe_sync);
-
-/**
- * rv_uprobe_free - release resources of a previously deregistered probe
- */
-void rv_uprobe_free(struct rv_uprobe *p)
-{
- struct rv_uprobe_impl *impl;
-
- if (!p)
- return;
-
- impl = container_of(p, struct rv_uprobe_impl, pub);
- path_put(&p->path);
- kfree(impl);
-}
-EXPORT_SYMBOL_GPL(rv_uprobe_free);
^ permalink raw reply related
* Re: [PATCH v3 8/9] selftests/verification: fix verificationtest-ktap for out-of-tree execution
From: Gabriele Monaco @ 2026-06-16 11:14 UTC (permalink / raw)
To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <95e700c62601cf432842269d89a86a492d073f0e.1780847473.git.wen.yang@linux.dev>
On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> verificationtest-ktap used CWD-relative paths which broke when
> invoked outside the verification directory (e.g. via vng).
I still don't get this, I even run from my home directory and it works
without this commit:
make -C linux/tools/testing/selftests/verification/ run_tests
It builds and runs your new tests just fine, how exactly do you run
them? vng shouldn't affect this since it runs from the same directory
as overlay mount, what errors are you getting?
The only issue I see is with the read only filesystems since ftracetest
attempts to write logs, but that doesn't seem solved by your commit
either. Adding the appropriate rwdir lets me run them just fine:
vng -v --rwdir tools/testing/selftests/verification/logs -- make -C tools/testing/selftests/verification run_tests
I'm suspecting you're trying to call verificationtest-ktap manually
from somewhere else to call it only on tlob/, but that looks like
asking for troubles. While you're at it I don't see why you don't just
call ftracetest.
Gabriele
>
> Resolve paths via realpath "$(dirname "$0")" so the script works
> from any working directory. Accept an optional subdirectory argument
> interpreted relative to the script's directory.
>
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
> tools/testing/selftests/verification/verificationtest-ktap | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/verification/verificationtest-
> ktap b/tools/testing/selftests/verification/verificationtest-ktap
> index 18f7fe324e2f..055747cef38a 100755
> --- a/tools/testing/selftests/verification/verificationtest-ktap
> +++ b/tools/testing/selftests/verification/verificationtest-ktap
> @@ -5,4 +5,6 @@
> #
> # Copyright (C) Arm Ltd., 2023
>
> -../ftrace/ftracetest -K -v --rv ../verification
> +dir=$(realpath "$(dirname "$0")")
> +testdir=$(cd "$dir" && realpath "${1:-.}")
> +"$dir/../ftrace/ftracetest" -K -v --rv "$testdir"
^ permalink raw reply
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Brendan Jackman @ 2026-06-16 11:57 UTC (permalink / raw)
To: Vlastimil Babka (SUSE), Gregory Price, David Hildenbrand (Arm)
Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman, Matthew Wilcox
In-Reply-To: <9f1815b0-896b-44ab-9e6d-9316d8f11033@kernel.org>
On Mon Jun 15, 2026 at 2:38 PM UTC, Vlastimil Babka (SUSE) wrote:
> On 6/12/26 17:29, Gregory Price wrote:
>> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
>>> On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
>>> > >
>>> > > I understand this question in two ways:
>>> > >
>>> > > 1) Can we disallow PAGE allocation and limit this to FOLIO allocation
>>> >
>>> > Yes. Can we only allow folios to be allocated from private memory nodes. So let
>>> > me reply to that one below.
>>> >
>>> ... snip ...
>>> >
>>> > At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
>>> > context might be better. I think there was also talk about how the memalloc_*
>>> > interface might be a better way forward. Maybe we would start giving the
>>> > allocator more context ("we are allocating a folio").
>>> >
>>> > The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
>>> >
>>>
>>> I will still probably send the next RFC version tomorrow or friday,
>>> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
>>>
>>> Also, I made a new `anondax` driver which enables userland testing
>>> of this functionality without any specialty hardware.
>>>
>>
>> (apologies for the length of this email: this will all be covered in
>> the coming cover letter, but I just wanted to share a bit of a preview)
>>
>> ===
>>
>> Just another small update - I am planning to post the RFC today once i
>> get some mild cleanup done. It will be based on the dax atomic hotplug
>>
>> https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/
>>
>> But a couple specific details regarding the memalloc pieces that i've
>> learned the past couple of days playing with it.
>>
>> 1) memalloc_folio is required to ensure non-folio allocations don't land
>> on the private node, even if it happens within a memalloc_private
>> context. Since memalloc_folio may be useful in contexts outside of
>> private nodes, I kept this as a separate flag.
>>
>> If we think there will *never* be additional users of memalloc_folio,
>> then we could fold _folio into _private to save the flag for now and
>> add it back when we actually need it.
>>
>> 2) memalloc_private is needed to unlock private nodes, but in the
>> original NOFALLBACK-only design, you also needed __GFP_THISNODE.
>>
>> This is *highly* restrictive. I found when playing with mbind that
>> MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
>> implies a bug).
>>
>> That leads me to #3
>
> I think the memalloc approach is dangerous due to unexpected nesting. There
> might be nested page allocations in page allocation itself (due to some
> debugging option). But also interrupts do not change what "current" points
> to. Suddenly those could start requesting folios and/or private nodes and be
> surprised, I'm afraid.
Minor side-note: couldn't we just define it such that the allocator
ignores the context when not in_task() (and warn if you try to enter the
context while not currently in_task())?
(Don't think this would change the conclusion very much, e.g. doesn't
help with the nesting issues. Mostly curious in case I'm missing a
detail here).
> The memalloc scopes only work well when they restrict the context wrt
> reclaim, and allocations in IRQ have to be already restricted heavily
> (atomic) so further memalloc restrictions don't do anything in practice. But
> to make them change other aspects of the allocations like this won't work.
^ permalink raw reply
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-16 13:47 UTC (permalink / raw)
To: Brendan Jackman
Cc: Vlastimil Babka (SUSE), David Hildenbrand (Arm), Balbir Singh,
lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman, Matthew Wilcox
In-Reply-To: <DJAGEUY8S09F.3V3HF570G85OF@linux.dev>
On Tue, Jun 16, 2026 at 11:57:42AM +0000, Brendan Jackman wrote:
> On Mon Jun 15, 2026 at 2:38 PM UTC, Vlastimil Babka (SUSE) wrote:
> > On 6/12/26 17:29, Gregory Price wrote:
> >> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
> >>> On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
> >>> > >
> >
> > I think the memalloc approach is dangerous due to unexpected nesting. There
> > might be nested page allocations in page allocation itself (due to some
> > debugging option). But also interrupts do not change what "current" points
> > to. Suddenly those could start requesting folios and/or private nodes and be
> > surprised, I'm afraid.
>
> Minor side-note: couldn't we just define it such that the allocator
> ignores the context when not in_task() (and warn if you try to enter the
> context while not currently in_task())?
>
> (Don't think this would change the conclusion very much, e.g. doesn't
> help with the nesting issues. Mostly curious in case I'm missing a
> detail here).
>
I looked at this - only solves one issue and oh boy is that an obtuse
confusing condition to understand. We still suffer from recursion in
reclaim.
~Gregory
^ permalink raw reply
* Re: [PATCH v3 9/9] selftests/verification: add tlob selftests
From: Gabriele Monaco @ 2026-06-16 14:58 UTC (permalink / raw)
To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <4aeb668c8446a9f6366d92e218df386bef7bc965.1780847473.git.wen.yang@linux.dev>
On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> Add selftest coverage for the tlob uprobe monitoring interface under
> tools/testing/selftests/verification/.
>
> test.d/tlob/ contains both the helper sources (tlob_target, tlob_sym)
> and the seven test scripts so the test suite is self-contained.
> tlob_target provides busy-spin, sleep, and preempt workloads;
> tlob_sym
> resolves ELF symbol offsets for uprobe registration.
>
> Seven test scripts exercise uprobe binding management, budget
> violation
> detection, and per-state time accounting (running_ns, waiting_ns,
> sleeping_ns).
>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
> .../testing/selftests/verification/.gitignore | 2 +
> tools/testing/selftests/verification/Makefile | 19 +-
> .../verification/test.d/tlob/Makefile | 20 ++
> .../verification/test.d/tlob/test.d/functions | 1 +
Tests seems to work fine, thanks. I think I'm getting what you're
trying to do and probably defining a dummy functions for tlob isn't the
right thing to do.
ftracetest --rv doesn't quite work with different folders as you'd pick
the ftrace functions. We shouldn't be adding a dummy functions script
every time but rather fix ftracetest directly.
Try the attached patch, it seems to work on my side. Then you'd be able
to use ftracetest --rv as you please (folder is mandatory, I don't
think we need a version inside selftests/verification,
verificationtest-ktap is only meant for the makefile just like
ftracetest-ktap).
Thanks,
Gabriele
---
From 32242f83af8214a51c47167a1904d3663aa43870 Mon Sep 17 00:00:00 2001
From: Gabriele Monaco <gmonaco@redhat.com>
Date: Tue, 16 Jun 2026 16:26:53 +0200
Subject: [PATCH] selftests/tracing: support flexible helper functions
Target directory or file arguments passed to ftracetest had
to contain their own nested test.d/functions files to properly override
the top-level directory (TOP_DIR). This works for standard ftrace tests
that would fall back to ftrace's functions, but doesn't for rv tests.
Introduce find_functions() to recursively check parent and grandparent
directories of the target test directory/file to dynamically locate the
appropriate root functions config.
This allows to define additional directories for rv selftests and run
them with the appropriate rv functions:
ftracetest --rv tools/testing/selftests/verification/test.d/<topic>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
tools/testing/selftests/ftrace/ftracetest | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/ftrace/ftracetest b/tools/testing/selftests/ftrace/ftracetest
index 0a56bf209f..5d1b1f9380 100755
--- a/tools/testing/selftests/ftrace/ftracetest
+++ b/tools/testing/selftests/ftrace/ftracetest
@@ -80,6 +80,23 @@ find_testcases() { #directory
echo `find $1 -name \*.tc | sort`
}
+find_functions() { # [directory] [test_cases]
+ local BASE_DIR="$1"
+ if [ -f "$1" ]; then
+ BASE_DIR="`absdir $1`"
+ elif [ ! -d "$1" ]; then
+ return
+ fi
+
+ if [ -f "$BASE_DIR"/test.d/functions ]; then
+ echo "$BASE_DIR"
+ elif [ -f "$BASE_DIR"/../test.d/functions ]; then
+ abspath "$BASE_DIR"/..
+ elif [ -f "$BASE_DIR"/../../test.d/functions ]; then
+ abspath "$BASE_DIR"/../..
+ fi
+}
+
parse_opts() { # opts
local OPT_TEST_CASES=
local OPT_TEST_DIR=
@@ -159,7 +176,8 @@ parse_opts() { # opts
if [ -n "$OPT_TEST_CASES" ]; then
TEST_CASES=$OPT_TEST_CASES
fi
- if [ -n "$OPT_TEST_DIR" -a -f "$OPT_TEST_DIR"/test.d/functions ]; then
+ OPT_TEST_DIR="`find_functions $OPT_TEST_DIR $OPT_TEST_CASES`"
+ if [ -n "$OPT_TEST_DIR" ]; then
TOP_DIR=$OPT_TEST_DIR
TEST_DIR=$TOP_DIR/test.d
fi
--
2.54.0
^ permalink raw reply related
* [PATCH] tracing: ring-buffer: allowlist clang-generated symbols
From: Arnd Bergmann @ 2026-06-16 16:42 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Nathan Chancellor
Cc: Arnd Bergmann, Mathieu Desnoyers, Nick Desaulniers, Bill Wendling,
Justin Stitt, Vincent Donnefort, Marc Zyngier,
Thomas Weißschuh, Paolo Bonzini, linux-kernel,
linux-trace-kernel, llvm
From: Arnd Bergmann <arnd@arndb.de>
In randconfig build testing using clang-22, I came across two
sets of extra symbols in the ring buffer code that may get
inserted by the compiler:
Unexpected symbols in kernel/trace/simple_ring_buffer.o:
U memset
Unexpected symbols in kernel/trace/simple_ring_buffer.o:
U llvm_gcda_emit_arcs
U llvm_gcda_emit_function
U llvm_gcda_end_file
U llvm_gcda_start_file
U llvm_gcda_summary_info
U llvm_gcov_init
Add all of these to the allowlist.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
kernel/trace/Makefile | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index f934ff586bd4..aa8564fb8ff4 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -146,6 +146,7 @@ KASAN_SANITIZE_undefsyms_base.o := y
UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __msan \
__aeabi_unwind_cpp __s390_indirect_jump __x86_indirect_thunk simple_ring_buffer \
+ memset llvm_gcda llvm_gcov \
$(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
quiet_cmd_check_undefined = NM $<
--
2.39.5
^ permalink raw reply related
* Re: [syzbot] [afs?] WARNING: ODEBUG bug in delete_node (3)
From: David Howells @ 2026-06-16 16:58 UTC (permalink / raw)
To: syzbot
Cc: dhowells, linux-afs, linux-kernel, linux-trace-kernel,
marc.dionne, mathieu.desnoyers, mhiramat, rostedt, syzkaller-bugs
In-Reply-To: <67e9006d.050a0220.1547ec.0097.GAE@google.com>
#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v7.1
^ permalink raw reply
* [PATCH] ring-buffer: Fix ring_buffer_read_page() copying only one event per page
From: David Carlier @ 2026-06-16 17:55 UTC (permalink / raw)
To: rostedt, mhiramat
Cc: mathieu.desnoyers, linux-trace-kernel, linux-kernel,
David Carlier
Commit 8928e4a3be34 ("ring-buffer: Show persistent buffer dropped events
in trace_pipe file") split the "commit" variable in
ring_buffer_read_page() into "commit" (raw) and "size" (masked page
size), but the inner copy loop's terminator was changed to compare rpos
against "event_size" instead of "size".
rpos is the cumulative read offset within the page; event_size is the
length of the single event just copied. The loop thus breaks after the
first event, so only one event is copied per call. This regresses the
per-event memcpy path (partial reads, the active commit page, and
mapped/remote buffers) used by splice/trace_pipe_raw and mmap consumers
into a one-event-at-a-time read.
Compare rpos against the page size as the original code did.
Fixes: 8928e4a3be34 ("ring-buffer: Show persistent buffer dropped events in trace_pipe file")
Signed-off-by: David Carlier <devnexen@gmail.com>
---
kernel/trace/ring_buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 06fb365bb86e..06155ab3366f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7181,7 +7181,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
rpos = reader->read;
pos += event_size;
- if (rpos >= event_size)
+ if (rpos >= size)
break;
event = rb_reader_event(cpu_buffer);
--
2.53.0
^ permalink raw reply related
* Re: [syzbot] [afs] WARNING: ODEBUG bug in delete_node (3)
From: syzbot @ 2026-06-16 20:29 UTC (permalink / raw)
To: dhowells, linux-afs, linux-kernel, linux-trace-kernel,
marc.dionne, mathieu.desnoyers, mhiramat, rostedt, syzkaller-bugs
In-Reply-To: <2150109.1781629120@warthog.procyon.org.uk>
Hello,
syzbot has tested the proposed patch but the reproducer is still triggering an issue:
lost connection to test machine
Tested on:
commit: 8cd9520d Linux 7.1
git tree: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v7.1
console output: https://syzkaller.appspot.com/x/log.txt?x=17026986580000
kernel config: https://syzkaller.appspot.com/x/.config?x=115afb95b0e6f56a
dashboard link: https://syzkaller.appspot.com/bug?extid=ab13429207fe1c8c92e8
compiler: Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
Note: no patches were applied.
^ permalink raw reply
* [PATCH] tracing: Use more common error handling code in event_enable_trigger_parse()
From: Markus Elfring @ 2026-06-16 20:33 UTC (permalink / raw)
To: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers,
Steven Rostedt, Tom Zanussi
Cc: LKML, kernel-janitors
From: Markus Elfring <elfring@users.sourceforge.net>
Date: Tue, 16 Jun 2026 22:22:52 +0200
Use an additional label so that a bit of exception handling can be better
reused at the end of this function implementation.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
kernel/trace/trace_events_trigger.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index 655db2e82513..7e17dd6d5b50 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -1789,17 +1789,14 @@ int event_enable_trigger_parse(struct event_command *cmd_ops,
enable_data->file = event_enable_file;
trigger_data = trigger_data_alloc(cmd_ops, cmd, param, enable_data);
- if (!trigger_data) {
- kfree(enable_data);
- return ret;
- }
+ if (!trigger_data)
+ goto out_free_enable_data;
if (remove) {
event_trigger_unregister(cmd_ops, file, glob+1, trigger_data);
kfree(trigger_data);
- kfree(enable_data);
ret = 0;
- return ret;
+ goto out_free_enable_data;
}
/* Up the trigger_data count to make sure nothing frees it on failure */
@@ -1837,6 +1834,7 @@ int event_enable_trigger_parse(struct event_command *cmd_ops,
out_free:
event_trigger_reset_filter(cmd_ops, trigger_data);
event_trigger_free(trigger_data);
+out_free_enable_data:
kfree(enable_data);
return ret;
--
2.54.0
^ permalink raw reply related
* [PATCH v5 0/7] tracing/probes: Add more typecast features
From: Masami Hiramatsu (Google) @ 2026-06-17 1:02 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
Hi,
Here is the 5th version of series to introduce more typecast features
to probe events. The previous version is here:
https://lore.kernel.org/all/178148603548.185520.3389196102475741865.stgit@devnote2/
In this version, I fixed some issues found by Sashiko reviews,
convert this_cpu_read() into this_cpu_ptr() + dereference[6/7],
and add some syntax testcases.
Steve introduced BTF typecast feature for eprobe[1].
This series extends it and add more options:
1. Expanding BTF typecast to kprobe and fprobe.
(currently only function entry/exit)
2. Introduce container_of like typecast. This adds a "assigned
member" option to the typecast.
(STRUCT,MEMBER)VAR->ANOTHER_MEMBER
This casts VAR to STRUCT type but the VAR is as the address
of STRUCT.MEMBER. In C, it is:
container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER
3. Support nested typecast, e.g.
(STRUCT)((STRUCT2)VAR->MEMBER2)->MEMBER
the nest level must be smaller than 3.
4. Add $current variable to point "current" task_struct.
This is useful with typecast, e.g.
(task_struct)$current->pid
5. per-cpu dereference support.
Intrdouce this_cpu_read(VAR) and this_cpu_ptr(VAR) to
access per-cpu data on the current CPU (accessing other CPU
data is not stable, because it can be changed.)
You can access the member of per-cpu data structure using
typecast like:
(STRUCT)this_cpu_ptr(VAR)->MEMBER
And added a test script to test part of them.
Thanks,
---
Masami Hiramatsu (Google) (7):
tracing/events: Fix to check the simple_tsk_fn creation
tracing/probes: Support typecast for various probe events
tracing/probes: Support nested typecast
tracing/probes: Support field specifier option for typecast
tracing/probes: Add $current variable support
tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
tracing/probes: Add a new testcase for BTF typecasts
Documentation/trace/eprobetrace.rst | 9
Documentation/trace/fprobetrace.rst | 10
Documentation/trace/kprobetrace.rst | 11 +
kernel/trace/trace.c | 8
kernel/trace/trace_probe.c | 439 +++++++++++++++-----
kernel/trace/trace_probe.h | 17 +
kernel/trace/trace_probe_tmpl.h | 25 +
samples/trace_events/trace-events-sample.c | 44 ++
samples/trace_events/trace-events-sample.h | 34 +-
.../ftrace/test.d/dynevent/btf_probe_event.tc | 51 ++
.../ftrace/test.d/dynevent/fprobe_syntax_errors.tc | 11 +
.../ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 11 +
.../ftrace/test.d/kprobe/uprobe_syntax_errors.tc | 5
13 files changed, 559 insertions(+), 116 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH v5 1/7] tracing/events: Fix to check the simple_tsk_fn creation
From: Masami Hiramatsu (Google) @ 2026-06-17 1:02 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178165816303.269421.7302603996990753309.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Sashiko pointed that this sample code does not correctly handle the
failure of thread creation because kthread_run() can return -errno.
Check the simple_tsk_fn is correctly initialized (created) or not.
Link: https://sashiko.dev/#/patchset/178092865666.163648.10457567771536160909.stgit%40devnote2
Fixes: 9cfe06f8cd5c ("tracing/events: add trace-events-sample")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v4:
- Fix to remove decrementing counter in error path, since foo_bar_reg() always returns 0.
- Add a newline to error message.
Changes in v3:
- Recover the usage counter.
---
samples/trace_events/trace-events-sample.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index ecc7db237f2e..0b7a6efdb247 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -107,6 +107,10 @@ int foo_bar_reg(void)
* for consistency sake, we still take the thread_mutex.
*/
simple_tsk_fn = kthread_run(simple_thread_fn, NULL, "event-sample-fn");
+ if (IS_ERR_OR_NULL(simple_tsk_fn)) {
+ pr_err("Failed to create simple_thread_fn\n");
+ simple_tsk_fn = NULL;
+ }
out:
mutex_unlock(&thread_mutex);
return 0;
^ permalink raw reply related
* [PATCH v5 2/7] tracing/probes: Support typecast for various probe events
From: Masami Hiramatsu (Google) @ 2026-06-17 1:03 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178165816303.269421.7302603996990753309.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Support BTF typecast feature on other probe events, but only if it is
kernel function entry or return, and must use function parameter name
or $retval. This means you can do:
(STRUCT)PARAM->MEMBER
but you can not do (this should be enabled by nesting support)
(STRUCT)%reg->MEMBER
To support other probe events, we just need to use last_struct type
when we find a function parameter in parse_btf_arg().
This also update <tracefs>/README file to show struct typecast.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v5:
- Add comments about $retval with typecast.
- Even if the type of retvalue is not known, if user specifies typecast,
use it for its type.
Changes in v3:
- Clarify the limitation.
Changes in v2:
- Fix to re-enable typecast on eprobe.
---
Documentation/trace/fprobetrace.rst | 3 +++
Documentation/trace/kprobetrace.rst | 4 ++++
kernel/trace/trace.c | 2 +-
kernel/trace/trace_probe.c | 23 +++++++++++++++++------
kernel/trace/trace_probe.h | 5 +++++
5 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index b4c2ca3d02c1..7435ded2d66d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,6 +57,9 @@ Synopsis of fprobe-events
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
(x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
and bitfield are supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then derference the pointer defined by
+ ->MEMBER.
(\*1) This is available only when BTF is enabled.
(\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 3b6791c17e9b..f73614997d52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,6 +61,10 @@ Synopsis of kprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and bitfield are
supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then derference the pointer defined by
+ ->MEMBER. Note that this is available only when the probe is
+ on function entry.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..aa93e7b01146 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4325,7 +4325,7 @@ static const char readme_msg[] =
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
"\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
- "\t <argname>[->field[->field|.field...]],\n"
+ "\t [(structname)]<argname>[->field[->field|.field...]],\n"
#endif
#else
"\t $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index fd1caa1f9723..064ca59387ba 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -706,7 +706,7 @@ static int parse_btf_arg(char *varname,
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
code->op = FETCH_OP_RETVAL;
- /* Check whether the function return type is not void */
+ /* Check whether the function return type is not void, even with typecast. */
if (query_btf_context(ctx) == 0) {
if (ctx->proto->type == 0) {
trace_probe_log_err(ctx->offset, NO_RETVAL);
@@ -715,6 +715,13 @@ static int parse_btf_arg(char *varname,
tid = ctx->proto->type;
goto found;
}
+ /*
+ * Even if we can not find appropriate BTF info, we can still access
+ * the field via typecast.
+ */
+ if (ctx->struct_btf)
+ goto found;
+
if (field) {
trace_probe_log_err(ctx->offset + field - varname,
NO_BTF_ENTRY);
@@ -759,7 +766,10 @@ static int parse_btf_arg(char *varname,
return -ENOENT;
found:
- type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+ if (ctx->struct_btf)
+ type = ctx->last_struct;
+ else
+ type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -836,10 +846,11 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
char *tmp;
int ret;
- /* Currently this only works for eprobes */
- if (!(ctx->flags & TPARG_FL_TEVENT)) {
- trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
- return -EINVAL;
+ if (!(tparg_is_event_probe(ctx->flags) ||
+ tparg_is_function_entry(ctx->flags) ||
+ tparg_is_function_return(ctx->flags))) {
+ trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+ return -EOPNOTSUPP;
}
tmp = strchr(arg, ')');
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 15758cc11fc6..883938a74aee 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -414,6 +414,11 @@ static inline bool tparg_is_function_return(unsigned int flags)
return (flags & TPARG_FL_LOC_MASK) == (TPARG_FL_KERNEL | TPARG_FL_RETURN);
}
+static inline bool tparg_is_event_probe(unsigned int flags)
+{
+ return !!(flags & TPARG_FL_TEVENT);
+}
+
struct traceprobe_parse_context {
struct trace_event_call *event;
/* BTF related parameters */
^ permalink raw reply related
* [PATCH v5 3/7] tracing/probes: Support nested typecast
From: Masami Hiramatsu (Google) @ 2026-06-17 1:03 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178165816303.269421.7302603996990753309.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When we hit an open parenthesis right after typecast closing
parenthesis, it means we have nested typecast. This allows us to
typecast a generic data member in a structure to a pointer to
another structure.
For example, to cast a DATA_MEMBER of VAR structure to STRUCT pointer
and get MEMBER value.
(STRUCT)(VAR->DATA_MEMBER)->MEMBER
Also, we can nest typecast.
(STRUCT1)((STRUCT2)$ARG->FIELD2)->FIELD1
Currently the max nest level is limited to 3.
This also allows user to use typecasting for registers or stacks on
kprobe events. e.g.
(STRUCT)(%ax)->MEMBER
(STRUCT)($stack0)->MEMBER
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v4:
- Use orig_offset for reporting NO_PTR_STRCT error.
Changes in v2:
- Fix to skip "->" after closing parenthetsis.
---
Documentation/trace/eprobetrace.rst | 2 +
Documentation/trace/fprobetrace.rst | 2 +
Documentation/trace/kprobetrace.rst | 2 +
kernel/trace/trace.c | 1
kernel/trace/trace_probe.c | 78 +++++++++++++++++++++++++++++++----
kernel/trace/trace_probe.h | 7 +++
6 files changed, 83 insertions(+), 9 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index fe3602540569..cd0b4aa7f896 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -50,6 +50,8 @@ Synopsis of eprobe_events
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that when this is used, the FIELD name does not
need to be prefixed with a '$'.
+ (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ also be used with another FETCHARG instead of FIELD.
Types
-----
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 7435ded2d66d..6b8bb27bb62d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -60,6 +60,8 @@ Synopsis of fprobe-events
(STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
->MEMBER.
+ (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ also be used with another FETCHARG instead of FIELD.
(\*1) This is available only when BTF is enabled.
(\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index f73614997d52..c4382765d5b2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -65,6 +65,8 @@ Synopsis of kprobe_events
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that this is available only when the probe is
on function entry.
+ (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ also be used with another FETCHARG instead of FIELD.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index aa93e7b01146..4f70318918c2 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4326,6 +4326,7 @@ static const char readme_msg[] =
"\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
"\t [(structname)]<argname>[->field[->field|.field...]],\n"
+ "\t [(structname)](fetcharg)->field[->field|.field...],\n"
#endif
#else
"\t $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 064ca59387ba..5d8b85435eda 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -839,10 +839,35 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
return 0;
}
+/* Find the matching closing parenthesis for a given opening parenthesis. */
+static char *find_matched_close_paren(char *s)
+{
+ char *p = s;
+ int count = 0;
+
+ while (*p) {
+ if (*p == '(')
+ count++;
+ else if (*p == ')') {
+ if (--count == 0)
+ return p;
+ }
+ p++;
+ }
+ return NULL;
+}
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+ struct fetch_insn **pcode, struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx);
+
static int handle_typecast(char *arg, struct fetch_insn **pcode,
struct fetch_insn *end,
struct traceprobe_parse_context *ctx)
{
+ int orig_offset = ctx->offset;
+ bool nested = false;
char *tmp;
int ret;
@@ -859,19 +884,56 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
DEREF_OPEN_BRACE);
return -EINVAL;
}
- *tmp = '\0';
- ret = query_btf_struct(arg + 1, ctx);
- *tmp = ')';
+ *tmp++ = '\0';
+
+ /* Handle the nested structure like (STRUCT)(VAR->FIELD)->... */
+ if (*tmp == '(') {
+ char *close = find_matched_close_paren(tmp);
+
+ ctx->offset += tmp - arg;
+ if (!close) {
+ trace_probe_log_err(ctx->offset, DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+ /* We expect a field access for typecast */
+ if (close[1] != '-' || close[2] != '>') {
+ trace_probe_log_err(ctx->offset + close - tmp + 1,
+ TYPECAST_REQ_FIELD);
+ return -EINVAL;
+ }
+ ctx->nested_level++;
+ if (ctx->nested_level > TRACEPROBE_MAX_NESTED_LEVEL) {
+ trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
+ return -E2BIG;
+ }
+ *close = '\0';
+
+ ctx->offset += 1; /* for the '(' */
+ /* We need to parse the nested one */
+ ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
+ pcode, end, ctx);
+ if (ret < 0)
+ return ret;
+ ctx->nested_level--;
+ clear_struct_btf(ctx);
+
+ tmp = close + 3;/* Skip "->" after closing parenthesis */
+ nested = true;
+ }
+
+ ret = query_btf_struct(arg + 1, ctx);
if (ret < 0) {
- trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+ trace_probe_log_err(orig_offset + 1, NO_PTR_STRCT);
return -EINVAL;
}
- tmp++;
-
- ctx->offset += tmp - arg;
- ret = parse_btf_arg(tmp, pcode, end, ctx);
+ ctx->offset = orig_offset + tmp - arg;
+ /* If it is nested, tmp points to the field name. */
+ if (nested)
+ ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
+ else
+ ret = parse_btf_arg(tmp, pcode, end, ctx);
return ret;
}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 883938a74aee..982d32a5df8b 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -435,8 +435,11 @@ struct traceprobe_parse_context {
struct trace_probe *tp;
unsigned int flags;
int offset;
+ int nested_level;
};
+#define TRACEPROBE_MAX_NESTED_LEVEL 3
+
extern int traceprobe_parse_probe_arg(struct trace_probe *tp, int i,
const char *argv,
struct traceprobe_parse_context *ctx);
@@ -571,7 +574,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(TOO_MANY_ARGS, "Too many arguments are specified"), \
C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
- C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"),
+ C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"), \
+ C(TYPECAST_REQ_FIELD, "Typecast requires a field access"), \
+ C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"),
#undef C
#define C(a, b) TP_ERR_##a
^ permalink raw reply related
* [PATCH v5 4/7] tracing/probes: Support field specifier option for typecast
From: Masami Hiramatsu (Google) @ 2026-06-17 1:03 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178165816303.269421.7302603996990753309.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a field specifier option for the typecast. This works like
container_of() macro.
(STRUCT[,FIELD[.FIELD2...]])VAR
This is equivalent to :
container_of(VAR, struct STRUCT, FIELD[.FIELD2...])
For example:
echo "f tick_nohz_handler next_tick=(tick_sched,sched_timer)timer->next_tick" >> dynamic_events
This will trace tick_nohz_handler() with its tick_sched::next_tick which
is converted from @timer by contianer_of(tick, struct tick_sched, sched_timer).
So, if you enabkle both fprobes:tick_nohz_handler__entry and
timer:hrtimer_expire_entry events, we will see something like:
<idle>-0 [002] d.h1. 3778.087272: hrtimer_expire_entry: hrtimer=00000000d63db328 f
unction=tick_nohz_handler now=3777450051040
<idle>-0 [002] d.h1. 3778.087281: tick_nohz_handler__entry: (tick_nohz_handler+0x4
/0x140) next_tick=3777450000000
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v3:
- Fix error caret position.
Changes in v2:
- Use byteoffset for typecast field offset instead of bitoffset. This fixes negative modulo calculation.
- Check whether a field is specified after typecast.
- Reject if typecast field option has arrow operator.
---
Documentation/trace/eprobetrace.rst | 5 +
Documentation/trace/fprobetrace.rst | 8 +-
Documentation/trace/kprobetrace.rst | 8 +-
kernel/trace/trace.c | 4 -
kernel/trace/trace_probe.c | 179 ++++++++++++++++++++++++-----------
kernel/trace/trace_probe.h | 5 +
6 files changed, 142 insertions(+), 67 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index cd0b4aa7f896..680e0af43d5d 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -49,7 +49,10 @@ Synopsis of eprobe_events
(STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that when this is used, the FIELD name does not
- need to be prefixed with a '$'.
+ need to be prefixed with a '$'. ASGN can be specified optionally.
+ If ASGN is specified, FIELD will be cast to the same offset
+ position as the ASGN member, rather than to the beginning of
+ the STRUCT.
(STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
also be used with another FETCHARG instead of FIELD.
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 6b8bb27bb62d..290a9e6f7491 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,10 +57,12 @@ Synopsis of fprobe-events
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
(x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
and bitfield are supported.
- (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
- ->MEMBER.
- (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ ->MEMBER. ASGN can be specified optionally. If ASGN is specified,
+ FIELD will be cast to the same offset position as the ASGN member,
+ rather than to the beginning of the STRUCT.
+ (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
also be used with another FETCHARG instead of FIELD.
(\*1) This is available only when BTF is enabled.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index c4382765d5b2..a62707e6a9f2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,11 +61,13 @@ Synopsis of kprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and bitfield are
supported.
- (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that this is available only when the probe is
- on function entry.
- (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ on function entry. ASGN can be specified optionally. If ASGN
+ is specified, FIELD will be cast to the same offset position
+ as the ASGN member, rather than to the beginning of the STRUCT.
+ (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
also be used with another FETCHARG instead of FIELD.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4f70318918c2..0e36af853199 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4325,8 +4325,8 @@ static const char readme_msg[] =
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
"\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
- "\t [(structname)]<argname>[->field[->field|.field...]],\n"
- "\t [(structname)](fetcharg)->field[->field|.field...],\n"
+ "\t [(structname[,field])]<argname>[->field[->field|.field...]],\n"
+ "\t [(structname[,field])](fetcharg)->field[->field|.field...],\n"
#endif
#else
"\t $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 5d8b85435eda..0f2be8c817fc 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -574,6 +574,65 @@ static int split_next_field(char *varname, char **next_field,
return ret;
}
+/* Inner loop for solving dot operator ('.'). Return bit-offset of the given field */
+static int get_bitoffset_of_field(char **pfieldname, const struct btf_type **ptype,
+ struct traceprobe_parse_context *ctx)
+{
+ const struct btf_type *type = *ptype;
+ const struct btf_member *field;
+ struct btf *btf = ctx_btf(ctx);
+ char *fieldname = *pfieldname;
+ int bitoffs = 0;
+ u32 anon_offs;
+ char *next;
+ int is_ptr;
+ s32 tid;
+
+ do {
+ next = NULL;
+ is_ptr = split_next_field(fieldname, &next, ctx);
+ if (is_ptr < 0)
+ return is_ptr;
+
+ anon_offs = 0;
+ field = btf_find_struct_member(btf, type, fieldname,
+ &anon_offs);
+ if (IS_ERR(field)) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return PTR_ERR(field);
+ }
+ if (!field) {
+ trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
+ return -ENOENT;
+ }
+ /* Add anonymous structure/union offset */
+ bitoffs += anon_offs;
+
+ /* Accumulate the bit-offsets of the dot-connected fields */
+ if (btf_type_kflag(type)) {
+ bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
+ ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
+ } else {
+ bitoffs += field->offset;
+ ctx->last_bitsize = 0;
+ }
+
+ type = btf_type_skip_modifiers(btf, field->type, &tid);
+ if (!type) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return -EINVAL;
+ }
+
+ if (next)
+ ctx->offset += next - fieldname;
+ fieldname = next;
+ } while (!is_ptr && fieldname);
+
+ *pfieldname = fieldname;
+ *ptype = type;
+
+ return bitoffs;
+}
/*
* Parse the field of data structure. The @type must be a pointer type
* pointing the target data structure type.
@@ -583,16 +642,14 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
struct traceprobe_parse_context *ctx)
{
struct fetch_insn *code = *pcode;
- const struct btf_member *field;
- u32 bitoffs, anon_offs;
- bool is_struct = ctx->struct_btf != NULL;
struct btf *btf = ctx_btf(ctx);
- char *next;
- int is_ptr;
+ bool is_first_field = true;
+ int bitoffs;
s32 tid;
do {
- if (!is_struct) {
+ /* For the first field of typecast, @type will be the target structure type. */
+ if (!(is_first_field && ctx->struct_btf)) {
/* Outer loop for solving arrow operator ('->') */
if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
@@ -606,60 +663,25 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
return -EINVAL;
}
}
- /* Only the first type can skip being a pointer */
- is_struct = false;
-
- bitoffs = 0;
- do {
- /* Inner loop for solving dot operator ('.') */
- next = NULL;
- is_ptr = split_next_field(fieldname, &next, ctx);
- if (is_ptr < 0)
- return is_ptr;
-
- anon_offs = 0;
- field = btf_find_struct_member(btf, type, fieldname,
- &anon_offs);
- if (IS_ERR(field)) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return PTR_ERR(field);
- }
- if (!field) {
- trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
- return -ENOENT;
- }
- /* Add anonymous structure/union offset */
- bitoffs += anon_offs;
-
- /* Accumulate the bit-offsets of the dot-connected fields */
- if (btf_type_kflag(type)) {
- bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
- ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
- } else {
- bitoffs += field->offset;
- ctx->last_bitsize = 0;
- }
-
- type = btf_type_skip_modifiers(btf, field->type, &tid);
- if (!type) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return -EINVAL;
- }
-
- ctx->offset += next - fieldname;
- fieldname = next;
- } while (!is_ptr && fieldname);
+ bitoffs = get_bitoffset_of_field(&fieldname, &type, ctx);
+ if (bitoffs < 0)
+ return bitoffs;
if (++code == end) {
trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
return -EINVAL;
}
code->op = FETCH_OP_DEREF; /* TODO: user deref support */
code->offset = bitoffs / 8;
+ if (is_first_field && ctx->struct_btf) {
+ /* The first field can be typecasted with field option. */
+ code->offset -= ctx->prefix_byteoffs;
+ }
*pcode = code;
ctx->last_bitoffs = bitoffs % 8;
ctx->last_type = type;
+ is_first_field = false;
} while (fieldname);
return 0;
@@ -690,6 +712,11 @@ static int parse_btf_arg(char *varname,
NOSUP_DAT_ARG);
return -EOPNOTSUPP;
}
+ if (!field && ctx->struct_btf) {
+ /* Typecast without field option is not supported */
+ trace_probe_log_err(ctx->offset + strlen(varname), TYPECAST_REQ_FIELD);
+ return -EOPNOTSUPP;
+ }
if (ctx->flags & TPARG_FL_TEVENT) {
ret = parse_trace_event(varname, code, ctx);
@@ -700,8 +727,7 @@ static int parse_btf_arg(char *varname,
/* TEVENT is only here via a typecast */
if (WARN_ON_ONCE(ctx->struct_btf == NULL))
return -EINVAL;
- type = ctx->last_struct;
- goto found_type;
+ goto found;
}
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
@@ -770,7 +796,6 @@ static int parse_btf_arg(char *varname,
type = ctx->last_struct;
else
type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
-found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -839,6 +864,46 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
return 0;
}
+static int parse_btf_casttype(char *casttype, struct traceprobe_parse_context *ctx)
+{
+ char *field;
+ int ret;
+
+ /* Field option - evaluated later. */
+ field = strchr(casttype, ',');
+ if (field)
+ *field++ = '\0';
+
+ ret = query_btf_struct(casttype, ctx);
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+ return -EINVAL;
+ }
+
+ if (field) {
+ struct btf_type *type = (struct btf_type *)ctx->last_struct;
+
+ ctx->offset += field - casttype;
+ ret = get_bitoffset_of_field(&field, &ctx->last_struct, ctx);
+ if (ret < 0)
+ return ret;
+ if (ret % 8) {
+ trace_probe_log_err(ctx->offset, TYPECAST_NOT_ALIGNED);
+ return -EINVAL;
+ }
+ if (field != NULL) {
+ /* this means @field skips an arrow operator ("->"). */
+ trace_probe_log_err(ctx->offset - 2, TYPECAST_BAD_ARROW);
+ return -EINVAL;
+ }
+ ctx->prefix_byteoffs = ret / 8;
+ /* Restore the original struct type (overwritten by get_bitoffset_of_field) */
+ ctx->last_struct = type;
+ }
+
+ return ret;
+}
+
/* Find the matching closing parenthesis for a given opening parenthesis. */
static char *find_matched_close_paren(char *s)
{
@@ -922,11 +987,10 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
nested = true;
}
- ret = query_btf_struct(arg + 1, ctx);
- if (ret < 0) {
- trace_probe_log_err(orig_offset + 1, NO_PTR_STRCT);
- return -EINVAL;
- }
+ ctx->offset = orig_offset + 1; /* for the '(' */
+ ret = parse_btf_casttype(arg + 1, ctx);
+ if (ret < 0)
+ return ret;
ctx->offset = orig_offset + tmp - arg;
/* If it is nested, tmp points to the field name. */
@@ -934,6 +998,7 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
else
ret = parse_btf_arg(tmp, pcode, end, ctx);
+ ctx->prefix_byteoffs = 0;
return ret;
}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 982d32a5df8b..44f113faae61 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -436,6 +436,7 @@ struct traceprobe_parse_context {
unsigned int flags;
int offset;
int nested_level;
+ int prefix_byteoffs; /* The byte offset of the prefix field of typecast */
};
#define TRACEPROBE_MAX_NESTED_LEVEL 3
@@ -576,7 +577,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"), \
C(TYPECAST_REQ_FIELD, "Typecast requires a field access"), \
- C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"),
+ C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"), \
+ C(TYPECAST_NOT_ALIGNED, "Typecast field option is not byte-aligned"), \
+ C(TYPECAST_BAD_ARROW, "Typecast field option does not support -> operator"),
#undef C
#define C(a, b) TP_ERR_##a
^ permalink raw reply related
* [PATCH v5 5/7] tracing/probes: Add $current variable support
From: Masami Hiramatsu (Google) @ 2026-06-17 1:03 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178165816303.269421.7302603996990753309.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Since we can use the BTF to cast value to a structure pointer type,
it is useful to introduce "$current" special variable support to
fetcharg.
User can define a fetcharg to access current task_struct properties
using BTF info. e.g.
$current->cpus_ptr
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v5:
- Use s32 for bof_find_btf_id().
Changes in v4:
- Add $current in README when CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y case.
- Fix to prohibit using $current in eprobes and address based kprobes.
Changes in v3:
- Remove $current support from eprobes (because eprobes is only for event)
- Prohibit uprobes to use $current.
Changes in v2:
- Support to parse $current in parse_btf_arg().
- If no typecast on $current, it automatically casted to task_struct.
- Check error case if $current follows something except for "-".
---
Documentation/trace/fprobetrace.rst | 1 +
Documentation/trace/kprobetrace.rst | 1 +
kernel/trace/trace.c | 4 ++--
kernel/trace/trace_probe.c | 38 ++++++++++++++++++++++++++++++++++-
kernel/trace/trace_probe.h | 1 +
kernel/trace/trace_probe_tmpl.h | 3 +++
6 files changed, 45 insertions(+), 3 deletions(-)
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 290a9e6f7491..3392cab016b3 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -50,6 +50,7 @@ Synopsis of fprobe-events
$argN : Fetch the Nth function argument. (N >= 1) (\*2)
$retval : Fetch return value.(\*3)
$comm : Fetch current task comm.
+ $current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index a62707e6a9f2..81e4fe38791d 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -53,6 +53,7 @@ Synopsis of kprobe_events
$argN : Fetch the Nth function argument. (N >= 1) (\*1)
$retval : Fetch return value.(\*2)
$comm : Fetch current task comm.
+ $current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 0e36af853199..7a5676524f1a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4323,13 +4323,13 @@ static const char readme_msg[] =
"\t args: <name>=fetcharg[:type]\n"
"\t fetcharg: (%<register>|$<efield>), @<address>, @<symbol>[+|-<offset>],\n"
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
- "\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
+ "\t $stack<index>, $stack, $retval, $comm, $arg<N>, $current\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
"\t [(structname[,field])]<argname>[->field[->field|.field...]],\n"
"\t [(structname[,field])](fetcharg)->field[->field|.field...],\n"
#endif
#else
- "\t $stack<index>, $stack, $retval, $comm,\n"
+ "\t $stack<index>, $stack, $retval, $comm, $current\n"
#endif
"\t +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
"\t kernel return probes support: $retval, $arg<N>, $comm\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 0f2be8c817fc..93edac7fc943 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -730,6 +730,24 @@ static int parse_btf_arg(char *varname,
goto found;
}
+ if (strcmp(varname, "$current") == 0) {
+ code->op = FETCH_OP_CURRENT;
+ /* If no typecast is specified for $current, use task_struct by default */
+ if (!ctx->struct_btf) {
+ s32 ttid = bpf_find_btf_id("task_struct", BTF_KIND_STRUCT,
+ &ctx->struct_btf);
+
+ if (ttid < 0) {
+ trace_probe_log_err(ctx->offset, NO_BTF_ENTRY);
+ return -ENOENT;
+ }
+ /* btf_type_skip_modifier() requires u32 for type id. */
+ tid = ttid;
+ ctx->last_struct = btf_type_skip_modifiers(ctx->struct_btf, tid, &tid);
+ }
+ goto found;
+ }
+
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
code->op = FETCH_OP_RETVAL;
/* Check whether the function return type is not void, even with typecast. */
@@ -763,8 +781,8 @@ static int parse_btf_arg(char *varname,
return -ENOENT;
}
}
- params = ctx->params;
+ params = ctx->params;
for (i = 0; i < ctx->nr_params; i++) {
const char *name = btf_name_by_offset(ctx->btf, params[i].name_off);
@@ -1254,6 +1272,24 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
return 0;
}
+ /* $current returns the address of the current task_struct. */
+ if (str_has_prefix(arg, "current")) {
+ /* $current is only supported by kernel probe and need function name. */
+ if (!(ctx->flags & TPARG_FL_KERNEL) || !ctx->funcname) {
+ err = TP_ERR_BAD_VAR;
+ goto inval;
+ }
+ arg += strlen("current");
+ if (*arg == '-' && IS_ENABLED(CONFIG_PROBE_EVENTS_BTF_ARGS))
+ return parse_btf_arg(orig_arg, pcode, end, ctx);
+
+ if (*arg != '\0')
+ goto inval;
+
+ code->op = FETCH_OP_CURRENT;
+ return 0;
+ }
+
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
len = str_has_prefix(arg, "arg");
if (len) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 44f113faae61..62645e847bd1 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -96,6 +96,7 @@ enum fetch_op {
FETCH_OP_FOFFS, /* File offset: .immediate */
FETCH_OP_DATA, /* Allocated data: .data */
FETCH_OP_EDATA, /* Entry data: .offset */
+ FETCH_OP_CURRENT, /* Current task_struct address */
// Stage 2 (dereference) op
FETCH_OP_DEREF, /* Dereference: .offset */
FETCH_OP_UDEREF, /* User-space Dereference: .offset */
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index f39b37fcdb3b..f630930288d2 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -112,6 +112,9 @@ process_common_fetch_insn(struct fetch_insn *code, unsigned long *val)
case FETCH_OP_DATA:
*val = (unsigned long)code->data;
break;
+ case FETCH_OP_CURRENT:
+ *val = (unsigned long)current;
+ break;
default:
return -EILSEQ;
}
^ permalink raw reply related
* [PATCH v5 6/7] tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
From: Masami Hiramatsu (Google) @ 2026-06-17 1:03 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178165816303.269421.7302603996990753309.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When tracing the kernel local variables, sometimes we need to get the
CPU local variables. To access it, current simple dereference is not
enough.
Thus, introduce a special this_cpu_read() dereference to access per-cpu
variable for the current CPU (accessing other CPU variable may race with
updates on other CPUs). Also this_cpu_ptr() is for accessing per-cpu
pointer.
Those are working as same as the kernel percpu macro.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v5:
- Simplify this_cpu_read() into +0(this_cpu_ptr()).
Changes in v3:
- Remove NULL check for percpu var because it is just an offset, could be 0.
- Simplify process_fetch_insn_bottom() code.
- If the last operation is this_cpu_read(), read only memory of the specific
size (of type).
Changes in v2:
- Drop +CPU/+PCPU and introduce this_cpu_read() and this_cpu_ptr().
- Support these method with BTF typecast.
- Just check the base address is NOT NULL instead of is_kernel_percpu_address().
---
Documentation/trace/eprobetrace.rst | 2
Documentation/trace/fprobetrace.rst | 2
Documentation/trace/kprobetrace.rst | 2
kernel/trace/trace.c | 1
kernel/trace/trace_probe.c | 151 +++++++++++++++++++++++++----------
kernel/trace/trace_probe.h | 1
kernel/trace/trace_probe_tmpl.h | 22 ++++-
7 files changed, 132 insertions(+), 49 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 680e0af43d5d..279396951b34 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -39,6 +39,8 @@ Synopsis of eprobe_events
@SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
$comm : Fetch current task comm.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+ this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+ this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 3392cab016b3..3439bc9bd351 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -52,6 +52,8 @@ Synopsis of fprobe-events
$comm : Fetch current task comm.
$current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
+ this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+ this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 81e4fe38791d..9ae330eb0a52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -55,6 +55,8 @@ Synopsis of kprobe_events
$comm : Fetch current task comm.
$current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+ this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+ this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 7a5676524f1a..d4121acc2938 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4332,6 +4332,7 @@ static const char readme_msg[] =
"\t $stack<index>, $stack, $retval, $comm, $current\n"
#endif
"\t +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
+ "\t this_cpu_read(<fetcharg>), this_cpu_ptr(<fetcharg>)\n"
"\t kernel return probes support: $retval, $arg<N>, $comm\n"
"\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, char, string, symbol,\n"
"\t b<bit-width>@<bit-offset>/<container-size>, ustring,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 93edac7fc943..34286a2ac72e 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -349,6 +349,93 @@ static int parse_trace_event(char *arg, struct fetch_insn *code,
return -EINVAL;
}
+/* this_cpu_* parser */
+#define THIS_CPU_PTR_PREFIX "this_cpu_ptr("
+#define THIS_CPU_READ_PREFIX "this_cpu_read("
+#define THIS_CPU_PTR_LEN (sizeof(THIS_CPU_PTR_PREFIX) - 1)
+#define THIS_CPU_READ_LEN (sizeof(THIS_CPU_READ_PREFIX) - 1)
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+ struct fetch_insn **pcode, struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx);
+
+/* handle dereference nested call */
+static inline int handle_dereference(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end, struct traceprobe_parse_context *ctx,
+ int deref, long offset)
+{
+ const struct fetch_type *type = find_fetch_type(NULL, ctx->flags);
+ struct fetch_insn *code = *pcode;
+ int cur_offs = ctx->offset;
+ char *tmp;
+ int ret;
+
+ tmp = strrchr(arg, ')');
+ if (!tmp) {
+ trace_probe_log_err(ctx->offset + strlen(arg),
+ DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+
+ *tmp = '\0';
+ ret = parse_probe_arg(arg, type, &code, end, ctx);
+ if (ret)
+ return ret;
+ ctx->offset = cur_offs;
+ if (code->op == FETCH_OP_COMM || code->op == FETCH_OP_DATA) {
+ trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
+ return -EINVAL;
+ }
+ code++;
+ if (code == end) {
+ trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
+ return -EINVAL;
+ }
+ *pcode = code;
+
+ code->op = deref;
+ code->offset = offset;
+ /* Reset the last type if used */
+ ctx->last_type = NULL;
+ return 0;
+}
+
+static int parse_this_cpu(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ struct fetch_insn *code;
+ bool is_ptr = false;
+ int ret;
+
+ if (str_has_prefix(arg, THIS_CPU_PTR_PREFIX)) {
+ arg += THIS_CPU_PTR_LEN;
+ ctx->offset += THIS_CPU_PTR_LEN;
+ is_ptr = true;
+ } else if (str_has_prefix(arg, THIS_CPU_READ_PREFIX)) {
+ arg += THIS_CPU_READ_LEN;
+ ctx->offset += THIS_CPU_READ_LEN;
+ } else
+ return -EINVAL;
+
+ ret = handle_dereference(arg, pcode, end, ctx, FETCH_OP_CPU_PTR, 0);
+ if (ret || is_ptr)
+ return ret;
+
+ /* this_cpu_read(VAR) -> +0(this_cpu_ptr(VAR)) */
+ code = *pcode;
+ code++;
+ if (code == end) {
+ trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
+ return -EINVAL;
+ }
+ code->op = FETCH_OP_DEREF;
+ code->offset = 0;
+ *pcode = code;
+ return 0;
+}
+
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
static u32 btf_type_int(const struct btf_type *t)
@@ -940,11 +1027,6 @@ static char *find_matched_close_paren(char *s)
return NULL;
}
-static int
-parse_probe_arg(char *arg, const struct fetch_type *type,
- struct fetch_insn **pcode, struct fetch_insn *end,
- struct traceprobe_parse_context *ctx);
-
static int handle_typecast(char *arg, struct fetch_insn **pcode,
struct fetch_insn *end,
struct traceprobe_parse_context *ctx)
@@ -970,7 +1052,8 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
*tmp++ = '\0';
/* Handle the nested structure like (STRUCT)(VAR->FIELD)->... */
- if (*tmp == '(') {
+ if (*tmp == '(' || str_has_prefix(tmp, THIS_CPU_PTR_PREFIX) ||
+ str_has_prefix(tmp, THIS_CPU_READ_PREFIX)) {
char *close = find_matched_close_paren(tmp);
ctx->offset += tmp - arg;
@@ -990,12 +1073,18 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
return -E2BIG;
}
- *close = '\0';
- ctx->offset += 1; /* for the '(' */
- /* We need to parse the nested one */
- ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
- pcode, end, ctx);
+ if (*tmp == '(') {
+ /* Extract the inner argument */
+ *close = '\0';
+ ctx->offset += 1;/* for the '(' */
+ /* Parse the nested one */
+ ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
+ pcode, end, ctx);
+ } else {
+ /* this_cpu_* will be parsed in parse_this_cpu() */
+ ret = parse_this_cpu(tmp, pcode, end, ctx);
+ }
if (ret < 0)
return ret;
ctx->nested_level--;
@@ -1465,36 +1554,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
}
ctx->offset += (tmp + 1 - arg) + (arg[0] != '-' ? 1 : 0);
arg = tmp + 1;
- tmp = strrchr(arg, ')');
- if (!tmp) {
- trace_probe_log_err(ctx->offset + strlen(arg),
- DEREF_OPEN_BRACE);
- return -EINVAL;
- } else {
- const struct fetch_type *t2 = find_fetch_type(NULL, ctx->flags);
- int cur_offs = ctx->offset;
-
- *tmp = '\0';
- ret = parse_probe_arg(arg, t2, &code, end, ctx);
- if (ret)
- break;
- ctx->offset = cur_offs;
- if (code->op == FETCH_OP_COMM ||
- code->op == FETCH_OP_DATA) {
- trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
- return -EINVAL;
- }
- if (++code == end) {
- trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
- return -EINVAL;
- }
- *pcode = code;
-
- code->op = deref;
- code->offset = offset;
- /* Reset the last type if used */
- ctx->last_type = NULL;
- }
+ ret = handle_dereference(arg, pcode, end, ctx, deref, offset);
+ if (ret < 0)
+ return ret;
break;
case '\\': /* Immediate value */
if (arg[1] == '"') { /* Immediate string */
@@ -1515,15 +1577,18 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
ret = handle_typecast(arg, pcode, end, ctx);
break;
default:
- if (isalpha(arg[0]) || arg[0] == '_') { /* BTF variable */
+ if (str_has_prefix(arg, THIS_CPU_PTR_PREFIX) ||
+ str_has_prefix(arg, THIS_CPU_READ_PREFIX)) {
+ ret = parse_this_cpu(arg, pcode, end, ctx);
+ } else if (isalpha(arg[0]) || arg[0] == '_') { /* BTF variable */
if (!tparg_is_function_entry(ctx->flags) &&
!tparg_is_function_return(ctx->flags)) {
trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
return -EINVAL;
}
ret = parse_btf_arg(arg, pcode, end, ctx);
- break;
}
+ break;
}
if (!ret && code->op == FETCH_OP_NOP) {
/* Parsed, but do not find fetch method */
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 62645e847bd1..3651ce1f144f 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -100,6 +100,7 @@ enum fetch_op {
// Stage 2 (dereference) op
FETCH_OP_DEREF, /* Dereference: .offset */
FETCH_OP_UDEREF, /* User-space Dereference: .offset */
+ FETCH_OP_CPU_PTR, /* Per-CPU pointer for this CPU */
// Stage 3 (store) ops
FETCH_OP_ST_RAW, /* Raw: .size */
FETCH_OP_ST_MEM, /* Mem: .offset, .size */
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index f630930288d2..9265b03cf19d 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -129,25 +129,35 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
struct fetch_insn *s3 = NULL;
int total = 0, ret = 0, i = 0;
u32 loc = 0;
- unsigned long lval = val;
+ unsigned long lval, llval = val;
stage2:
/* 2nd stage: dereference memory if needed */
do {
- if (code->op == FETCH_OP_DEREF) {
- lval = val;
+ lval = val;
+ switch (code->op) {
+ case FETCH_OP_DEREF:
ret = probe_mem_read(&val, (void *)val + code->offset,
sizeof(val));
- } else if (code->op == FETCH_OP_UDEREF) {
- lval = val;
+ break;
+ case FETCH_OP_UDEREF:
ret = probe_mem_read_user(&val,
(void *)val + code->offset, sizeof(val));
- } else
break;
+ case FETCH_OP_CPU_PTR:
+ val = (unsigned long)this_cpu_ptr((void __percpu *)val);
+ ret = 0;
+ break;
+ default:
+ lval = llval;
+ goto out;
+ }
if (ret)
return ret;
+ llval = lval;
code++;
} while (1);
+out:
s3 = code;
stage3:
^ permalink raw reply related
* [PATCH v5 7/7] tracing/probes: Add a new testcase for BTF typecasts
From: Masami Hiramatsu (Google) @ 2026-06-17 1:03 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178165816303.269421.7302603996990753309.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
With the introduction of container_of-style BTF typecasting and
per-CPU variable access support in trace probes, we need a way to
verify their functionality and prevent regressions.
Add a new ftrace kselftest and update the trace event sample module
to test and validate these features.
Specifically, update the trace-events-sample module to set up a
periodic timer whose callback accesses a per-CPU counter. Introduce
a new sample trace event, foo_timer_fn, to trace this callback
and log the current counter value.
Then, add a new test case, btf_probe_event.tc, which defines a
dynamic probe on the timer callback. The probe uses BTF typecasting
to recover the parent structure from the timer argument and
this_cpu_read() to fetch the per-CPU counter. The test verifies
the integrity of the implementation by ensuring the values
recorded by the dynamic probe match those from the static tracepoint.
Assisted-by: Antigravity:gemini-3.5-flash
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v5:
- Add more syntax test cases.
Changes in v4:
- Fix uprobe $current test.
Changes in v3:
- Add syntax test case.
- Update testcase to use this_cpu_read()
Changes in v2:
- Use timer_shutdown_sync() instead of timer_delete_sync() for teardown.
---
samples/trace_events/trace-events-sample.c | 40 +++++++++++++++-
samples/trace_events/trace-events-sample.h | 34 ++++++++++++-
.../ftrace/test.d/dynevent/btf_probe_event.tc | 51 ++++++++++++++++++++
.../ftrace/test.d/dynevent/fprobe_syntax_errors.tc | 11 ++++
.../ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 11 ++++
.../ftrace/test.d/kprobe/uprobe_syntax_errors.tc | 5 ++
6 files changed, 147 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index 0b7a6efdb247..ca5d98c360cb 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -94,6 +94,20 @@ static int simple_thread_fn(void *arg)
static DEFINE_MUTEX(thread_mutex);
static int simple_thread_cnt;
+static struct foo_timer_data *foo_timer_data;
+
+static void sample_timer_cb(struct timer_list *t)
+{
+ struct foo_timer_data *data = container_of(t, struct foo_timer_data, timer);
+
+ get_cpu();
+ trace_foo_timer_fn(data);
+ (*this_cpu_ptr(data->counter))++;
+ put_cpu();
+
+ mod_timer(t, jiffies + HZ);
+}
+
int foo_bar_reg(void)
{
mutex_lock(&thread_mutex);
@@ -132,9 +146,27 @@ void foo_bar_unreg(void)
static int __init trace_event_init(void)
{
+ foo_timer_data = kzalloc_obj(*foo_timer_data, GFP_KERNEL);
+ if (!foo_timer_data)
+ return -ENOMEM;
+
+ foo_timer_data->name = "sample_timer_counter";
+ foo_timer_data->counter = alloc_percpu(int);
+ if (!foo_timer_data->counter) {
+ kfree(foo_timer_data);
+ return -ENOMEM;
+ }
+
+ timer_setup(&foo_timer_data->timer, sample_timer_cb, 0);
+ mod_timer(&foo_timer_data->timer, jiffies + HZ);
+
simple_tsk = kthread_run(simple_thread, NULL, "event-sample");
- if (IS_ERR(simple_tsk))
- return -1;
+ if (IS_ERR(simple_tsk)) {
+ timer_shutdown_sync(&foo_timer_data->timer);
+ free_percpu(foo_timer_data->counter);
+ kfree(foo_timer_data);
+ return PTR_ERR(simple_tsk);
+ }
return 0;
}
@@ -147,6 +179,10 @@ static void __exit trace_event_exit(void)
kthread_stop(simple_tsk_fn);
simple_tsk_fn = NULL;
mutex_unlock(&thread_mutex);
+
+ timer_shutdown_sync(&foo_timer_data->timer);
+ free_percpu(foo_timer_data->counter);
+ kfree(foo_timer_data);
}
module_init(trace_event_init);
diff --git a/samples/trace_events/trace-events-sample.h b/samples/trace_events/trace-events-sample.h
index 1a05fc153353..816848a456a2 100644
--- a/samples/trace_events/trace-events-sample.h
+++ b/samples/trace_events/trace-events-sample.h
@@ -247,12 +247,14 @@
*/
/*
- * It is OK to have helper functions in the file, but they need to be protected
- * from being defined more than once. Remember, this file gets included more
- * than once.
+ * It is OK to have helper functions and data structures in the file, but they
+ * need to be protected from being defined more than once. Remember, this file
+ * gets included more than once.
*/
#ifndef __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
#define __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
+#include <linux/timer.h>
+
static inline int __length_of(const int *list)
{
int i;
@@ -270,6 +272,13 @@ enum {
TRACE_SAMPLE_BAR = 4,
TRACE_SAMPLE_ZOO = 8,
};
+
+struct foo_timer_data {
+ const char *name;
+ struct timer_list timer;
+ int __percpu *counter;
+};
+
#endif
/*
@@ -595,6 +604,25 @@ TRACE_EVENT(foo_rel_loc,
__get_rel_bitmask(bitmask),
__get_rel_cpumask(cpumask))
);
+
+TRACE_EVENT(foo_timer_fn,
+
+ TP_PROTO(struct foo_timer_data *data),
+
+ TP_ARGS(data),
+
+ TP_STRUCT__entry(
+ __string( name, data->name )
+ __field( int, count )
+ ),
+
+ TP_fast_assign(
+ __assign_str(name);
+ __entry->count = *this_cpu_ptr(data->counter);
+ ),
+
+ TP_printk("name=%s count=%d", __get_str(name), __entry->count)
+);
#endif
/***** NOTICE! The #if protection ends here. *****/
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
new file mode 100644
index 000000000000..96791e120b7d
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
@@ -0,0 +1,51 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: BTF event with typecast and percpu access
+# requires: dynamic_events "this_cpu_read(<fetcharg>)":README "[(structname[,field])]<argname>[->field[->field|.field...]]":README
+
+# Check if the sample module is loaded
+if ! lsmod | grep -q trace_events_sample; then
+ modprobe trace-events-sample || exit_unsupported
+fi
+
+echo 0 > events/enable
+echo > dynamic_events
+
+# The sample_timer_cb(struct timer_list *t) is called.
+# We want to check (STRUCT,FIELD)VAR typecast and this_cpu_read() access.
+# (foo_timer_data,timer)t converts t to struct foo_timer_data * using container_of.
+# data->counter is a per-cpu pointer to int.
+# this_cpu_read(data->counter) should give the value of the counter.
+
+echo 'f:mysample/myevent sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+
+echo 1 > events/mysample/myevent/enable
+echo 1 > events/sample-trace/foo_timer_fn/enable
+
+sleep 2
+
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+
+# Compare the values.
+MATCH=0
+while read line; do
+ if echo $line | grep -q "foo_timer_fn:"; then
+ NAME=`echo $line | sed 's/.*name=\([^ ]*\) .*/\1/'`
+ COUNT=`echo $line | sed 's/.*count=\([^ ]*\).*/\1/'`
+ if grep -q "myevent:.*name=\"${NAME}\" count=$COUNT" trace; then
+ MATCH=$((MATCH+1))
+ fi
+ fi
+done < trace
+
+if [ $MATCH -eq 0 ]; then
+ echo "No matching events found"
+ exit_fail
+fi
+
+# Clean up
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+echo > dynamic_events
+clear_trace
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
index fee479295e2f..0402d0271bfe 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
@@ -112,6 +112,17 @@ check_error 'f vfs_read%return $retval->^foo' # NO_PTR_STRCT
check_error 'f vfs_read file->^foo' # NO_BTF_FIELD
check_error 'f vfs_read file^-.foo' # BAD_HYPHEN
check_error 'f vfs_read ^file:string' # BAD_TYPE4STR
+if grep -qF "[(structname" README ; then
+check_error 'f vfs_read arg1=(task_struct)file^' # TYPECAST_REQ_FIELD
+check_error 'f vfs_read arg1=(f)((f)((f)((f)^((f)file->f)->f)->f)->f)->f' # TOO_MANY_NESTED
+check_error 'f vfs_read arg1=(task_struct,^in_execve)file->comm' # TYPECAST_NOT_ALIGNED
+check_error 'f vfs_read arg1=(task_struct,^foo_bar)file:u8' # NO_BTF_FIELD
+check_error 'f vfs_read arg1=(^task_struct1234)file:u8' # NO_PTR_STRCT
+check_error 'f vfs_read arg1=(task_struct,se^->group_node)file->comm' # TYPECAST_BAD_ARROW
+check_error 'f vfs_read arg1=(task_struct,^->pid)file->comm' # NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct,^.pid)file->comm' # NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct,^.)file->comm' # NO_BTF_FIELD
+fi
fi
else
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
index 8f1c58f0c239..742b342e4930 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
@@ -115,6 +115,17 @@ check_error 'p vfs_read+20 ^$arg*' # NOFENTRY_ARGS
check_error 'p vfs_read ^hoge' # NO_BTFARG
check_error 'p kfree ^$arg10' # NO_BTFARG (exceed the number of parameters)
check_error 'r kfree ^$retval' # NO_RETVAL
+if grep -qF "[(structname" README ; then
+check_error 'p vfs_read arg1=(task_struct)file^' # TYPECAST_REQ_FIELD
+check_error 'p vfs_read arg1=(f)((f)((f)((f)^((f)file->f)->f)->f)->f)->f' # TOO_MANY_NESTED
+check_error 'p vfs_read arg1=(task_struct,^in_execve)file->comm' # TYPECAST_NOT_ALIGNED
+check_error 'p vfs_read arg1=(task_struct,^foo_bar)file:u8' # NO_BTF_FIELD
+check_error 'p vfs_read arg1=(^task_struct1234)file:u8' # NO_PTR_STRCT
+check_error 'p vfs_read arg1=(task_struct,se^->group_node)file->comm' # TYPECAST_BAD_ARROW
+check_error 'p vfs_read arg1=(task_struct,^->pid)file->comm' # NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct,^.pid)file->comm' # NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct,^.)file->comm' # NO_BTF_FIELD
+fi
else
check_error 'p vfs_read ^$arg*' # NOSUP_BTFARG
fi
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
index c817158b99db..e12dc967ec76 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
@@ -28,4 +28,9 @@ if grep -q ".*symstr.*" README; then
check_error 'p /bin/sh:10 $stack0:^symstr' # BAD_TYPE
fi
+# $current is not supported by uprobe
+if grep -q "\$current.*" README; then
+check_error 'p /bin/sh:10 ^$current:u8' # BAD_VAR
+fi
+
exit 0
^ permalink raw reply related
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-17 4:02 UTC (permalink / raw)
To: Gregory Price
Cc: David Hildenbrand (Arm), lsf-pc, linux-kernel, linux-cxl, cgroups,
linux-mm, linux-trace-kernel, damon, kernel-team, gregkh, rafael,
dakr, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <aimSzvoJDrpeQsmM@gourry-fedora-PF4VCD3F>
On Wed, Jun 10, 2026 at 12:37:34PM -0400, Gregory Price wrote:
> On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
> > On 6/10/26 12:41, Gregory Price wrote:
> > > On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > >
> > > Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
> > > which causes spillage into private nodes because slub allows private
> > > nodes in its mask. I think this is fixable.
> > >
> > > I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
> > > code, etc), but it seems like fully dropping the FALLBACK entries and
> > > requiring __GFP_THISNODE might be sufficient.
> >
> > Sorry, I haven't been able to follow up so far, and not sure if that's what you
> > are discussing here ...
> >
> > After the LSF/MM session, I was wondering, whether if we focus on allowing only
> > folios allocations to end up on private memory nodes for now: could the
> > __GFP_THISNODE approach work there?
> >
> > Essentially, disallow any allocations on non-folio paths, and allow folio
> > allocation only with __GFP_THISNODE set.
> >
> > I have to find time to read the other mails in this thread, on my todo list.
> >
> > So sorry if that is precisely what is being discussed here.
> >
>
> So, I remember this being asked, and I didn't fully grok the request.
>
> I'm still not sure I fully understand the question, so apologies if I'm
> answer the wrong things here.
>
> I understand this question in two ways:
>
> 1) Can we disallow PAGE allocation and limit this to FOLIO allocation
> 2) Can we disallow [Feature] (i.e. slab) allocation targeting the node.
>
>
> 1) Can we disallow page allocation and limit this to folios?
>
> No, I don't think so.
>
> Folio allocations are written in terms of page allocations, we would
> have to rewrite folio allocation interfaces and introduce a bunch of
> boilerplate for the sake of this.
>
> struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> int preferred_nid, nodemask_t *nodemask)
> {
> struct page *page;
>
> page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
> if (page)
> set_page_refcounted(page);
> return page;
> }
>
> struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
> nodemask_t *nodemask)
> {
> struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
> preferred_nid, nodemask);
> return page_rmappable_folio(page);
> }
>
> At the end of the day, this all reduces to `get_pages_from_freelist`,
> and at that level we don't really care about folio vs page.
>
> __GFP_COMP is insufficient to differentiate between a non-folio compound
> page and a folio, and __GFP_COMP is passed into __alloc_pages_*
> interfaces all over the kernel.
>
> Trying to detach these paths things seems like a horrible rats nest /
> not feasible / will create a lot of boilerplate for little value.
>
> (I did not fully understand this request when it was asked, I do
> not fully understand this request not, please let me know if I
> have misunderstood what you were asking).
>
I agree with this, any changes to folio only allocation could then be
easily adapted for N_MEMORY_PRIVATE
>
>
> 2) Can we disallow SLAB allocation.
>
> Yeah, but I think a better question is whether there's a difference
> between alloc_pages_node() and kmalloc_node() when it all just sinks
> to the same fundamental code in mm/page_alloc.c
>
> Maybe there's an argument for something like NP_OPT_KMALLOC (allow slab
> allocations on the private node w/ __GFP_THISNODE)
>
> On my current set, I don't implement any explicit filtering at all in
> mm/page_alloc.c - the filtering is a function of the nodes not being
> present in the FALLBACK list and only having a NOFALLBACK list.
>
> What __GFP_THISNODE actually does under the hood is just switch
> which zone list (FALLBACK vs NOFALLBACK) is used for the target node.
>
> For isolation w/o __GFP_PRIVATE, we're removing N_MEMORY_PRIVATE nodes
> from *their own FALLBACK* list and only adding them to their NOFALLBACK
> list. That means to reach a private node you MUST use __GFP_THISNODE.
>
> I realize this is confusing, but essentially we don't have to modify
> mm/page_alloc.c to get the __GFP_THISNODE filtering, we get this from
> the fallback/nofallback list construction.
>
>
> Ok, so how does this flush out in practice - and why do I call this
> filtering mechanism fragile?
>
> consider kmalloc_node() and __slab_alloc():
>
> kmalloc_node(...)
> └─ ___slab_alloc() mm/slub.c:4406 pc.flags |= __GFP_THISNODE
> └─ new_slab(s, pc.flags, node)
> └─ allocate_slab(s, flags, node)
> └─ alloc_slab_page(flags, node, oo, …)
> └─ __alloc_frozen_pages(flags, order, node, NULL);
>
> Slab silently upgrades the page allocator flags here to include
> __GFP_THISNODE - even if the user didn't request that behavior.
>
> This is exactly the kind of "spillage" I said was hard to police at LSF.
>
> Without __GFP_PRIVATE, we have to keep an eye on what around the kernel
> is using __GFP_THISNODE and how.
>
> For mm/slub.c we can choose to do one of thwo things
>
> 1) 100% refuse slab allocations on private nodes, i.e.:
>
> kmalloc_node(..., private_nid, __GFP_THISNODE)
>
> And will fail (return NULL).
>
Doesn't this iterate through N_MEMORY only? N_MEMORY_PRIVATE should not
be in the regular for_each(...) loops
> or
>
> 2) Do not upgrade private-node slab requests w/ __GFP_THISNODE
>
> This allows kmalloc_node() to work the same as folio_alloc()
> or alloc_pages() interfaces (__GFP_THISNODE is the key), with
> the understanding that any __GFP_THISNODE user
>
> We can opt these nodes into slab/kmalloc with a NP_OPT_SLAB
> if the owner wants kmalloc_node(), with the understanding that any
> caller using __GFP_THISNODE may get access.
>
> That's the kind of fragility I was trying to avoid.
>
>
> That said, in practice, I have found that basic kernel operations don't
> generally target use kmalloc_node() w/ __GFP_THISNODE - there's just
> nothing to prevent anyone from doing so.
>
> So this seems promising...
> And then theres arch/powerpc/platforms/powernv/memtrace.c
>
> static u64 memtrace_alloc_node(u32 nid, u64 size)
> {
> ... snip ...
> page = alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_THISNODE |
> __GFP_NOWARN | __GFP_ZERO, nid, NULL);
> ... snip ...
> }
>
> static int memtrace_init_regions_runtime(u64 size)
> {
> ... snip ...
> for_each_online_node(nid) {
> m = memtrace_alloc_node(nid, size);
> ... snip ...
> }
>
> static int memtrace_enable_set(void *data, u64 val)
> {
> ... snip ...
> if (memtrace_init_regions_runtime(val))
> goto out_unlock;
> ... snip ...
> }
>
> This is the *exact* pattern I said would be hard to police - and it
> doesn't look like a bug, just not informed that private nodes exist.
>
> This is why I'm concerned with trying to depend on __GFP_THISNODE as the
> filtering function.
>
> That said, the number of __GFP_THISNODE users is very limited
> kernel-wide, so maybe that's an acceptable maintenance burden?
>
Balbir
^ permalink raw reply
* Re: [PATCH v3 7/9] rv/tlob: add KUnit tests for the tlob monitor
From: Gabriele Monaco @ 2026-06-17 7:49 UTC (permalink / raw)
To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <d9abf11ee0af54689fce87ab95f8e8f7f5eb233d.1780847473.git.wen.yang@linux.dev>
On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:
> +/*
> + * Valid add lines return -ENOENT (kern_path() finds no such file in
> the test
> + * environment) rather than 0; a non-(-EINVAL) return confirms the
> format was
> + * accepted by the parser.
> + */
> +static void tlob_parse_valid_accepted(struct kunit *test)
> +{
> + char buf[128];
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(tlob_parse_valid); i++) {
> + strscpy(buf, tlob_parse_valid[i], sizeof(buf));
> + KUNIT_EXPECT_NE(test,
> tlob_create_or_delete_uprobe(buf), -EINVAL);
Can you perhaps add those uprobes for real from this test case?
Unit tests should not touch the system, you usually do that by:
1. stubbing the tested function when it starts doing bad things
2. test with dummy data not attaching anything (not applicable here)
3. test a different function not affecting the system
I think the cleanest here is 3. so you could just kunit test
tlob_parse_uprobe_line() and tlob_parse_remove_line().
Alternatively just stub the entirety of tlob_add_uprobe()
tlob_remove_uprobe_by_key() and maybe even check that they're called
when expected and not called on failure (right now you aren't testing
valid removals, probably because that's going to break).
I believe a good unit test should be validating the parsing logic only
/or/ the add/remove logic (but that's hard, you can skip it or even
check in selftests).
Right now your tests are trying to do both, so you don't know if
failures came from the uprobes subsystem or allocation (you shouln't
even get there from the unit test).
Then you can just check for success and not for ! EINVAL , which is
confusing.
Thanks,
Gabriele
> + }
> +}
> +
> +static void tlob_parse_invalid_rejected(struct kunit *test)
> +{
> + char buf[128];
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(tlob_parse_invalid); i++) {
> + strscpy(buf, tlob_parse_invalid[i], sizeof(buf));
> + KUNIT_EXPECT_EQ(test,
> tlob_create_or_delete_uprobe(buf), -EINVAL);
> + }
> +}
> +
> +static void tlob_parse_out_of_range_rejected(struct kunit *test)
> +{
> + char buf[128];
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(tlob_parse_out_of_range); i++) {
> + strscpy(buf, tlob_parse_out_of_range[i],
> sizeof(buf));
> + KUNIT_EXPECT_EQ(test,
> tlob_create_or_delete_uprobe(buf), -ERANGE);
> + }
> +}
> +
> +static struct kunit_case tlob_parse_cases[] = {
> + KUNIT_CASE(tlob_parse_valid_accepted),
> + KUNIT_CASE(tlob_parse_invalid_rejected),
> + KUNIT_CASE(tlob_parse_out_of_range_rejected),
> + {}
> +};
> +
> +static struct kunit_suite tlob_parse_suite = {
> + .name = "tlob_parse",
> + .test_cases = tlob_parse_cases,
> +};
> +
> +kunit_test_suite(tlob_parse_suite);
> +
> +MODULE_DESCRIPTION("KUnit tests for the tlob RV monitor");
> +MODULE_LICENSE("GPL");
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox