* [PATCH] ring-buffer: Fix ring_buffer_read_page() copying only one event per page
From: David Carlier @ 2026-06-16 17:55 UTC (permalink / raw)
To: rostedt, mhiramat
Cc: mathieu.desnoyers, linux-trace-kernel, linux-kernel,
David Carlier
Commit 8928e4a3be34 ("ring-buffer: Show persistent buffer dropped events
in trace_pipe file") split the "commit" variable in
ring_buffer_read_page() into "commit" (raw) and "size" (masked page
size), but the inner copy loop's terminator was changed to compare rpos
against "event_size" instead of "size".
rpos is the cumulative read offset within the page; event_size is the
length of the single event just copied. The loop thus breaks after the
first event, so only one event is copied per call. This regresses the
per-event memcpy path (partial reads, the active commit page, and
mapped/remote buffers) used by splice/trace_pipe_raw and mmap consumers
into a one-event-at-a-time read.
Compare rpos against the page size as the original code did.
Fixes: 8928e4a3be34 ("ring-buffer: Show persistent buffer dropped events in trace_pipe file")
Signed-off-by: David Carlier <devnexen@gmail.com>
---
kernel/trace/ring_buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 06fb365bb86e..06155ab3366f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7181,7 +7181,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
rpos = reader->read;
pos += event_size;
- if (rpos >= event_size)
+ if (rpos >= size)
break;
event = rb_reader_event(cpu_buffer);
--
2.53.0
^ permalink raw reply related
* Re: [syzbot] [afs?] WARNING: ODEBUG bug in delete_node (3)
From: David Howells @ 2026-06-16 16:58 UTC (permalink / raw)
To: syzbot
Cc: dhowells, linux-afs, linux-kernel, linux-trace-kernel,
marc.dionne, mathieu.desnoyers, mhiramat, rostedt, syzkaller-bugs
In-Reply-To: <67e9006d.050a0220.1547ec.0097.GAE@google.com>
#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v7.1
^ permalink raw reply
* [PATCH] tracing: ring-buffer: allowlist clang-generated symbols
From: Arnd Bergmann @ 2026-06-16 16:42 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Nathan Chancellor
Cc: Arnd Bergmann, Mathieu Desnoyers, Nick Desaulniers, Bill Wendling,
Justin Stitt, Vincent Donnefort, Marc Zyngier,
Thomas Weißschuh, Paolo Bonzini, linux-kernel,
linux-trace-kernel, llvm
From: Arnd Bergmann <arnd@arndb.de>
In randconfig build testing using clang-22, I came across two
sets of extra symbols in the ring buffer code that may get
inserted by the compiler:
Unexpected symbols in kernel/trace/simple_ring_buffer.o:
U memset
Unexpected symbols in kernel/trace/simple_ring_buffer.o:
U llvm_gcda_emit_arcs
U llvm_gcda_emit_function
U llvm_gcda_end_file
U llvm_gcda_start_file
U llvm_gcda_summary_info
U llvm_gcov_init
Add all of these to the allowlist.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
kernel/trace/Makefile | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index f934ff586bd4..aa8564fb8ff4 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -146,6 +146,7 @@ KASAN_SANITIZE_undefsyms_base.o := y
UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __msan \
__aeabi_unwind_cpp __s390_indirect_jump __x86_indirect_thunk simple_ring_buffer \
+ memset llvm_gcda llvm_gcov \
$(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
quiet_cmd_check_undefined = NM $<
--
2.39.5
^ permalink raw reply related
* Re: [PATCH v3 9/9] selftests/verification: add tlob selftests
From: Gabriele Monaco @ 2026-06-16 14:58 UTC (permalink / raw)
To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <4aeb668c8446a9f6366d92e218df386bef7bc965.1780847473.git.wen.yang@linux.dev>
On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> Add selftest coverage for the tlob uprobe monitoring interface under
> tools/testing/selftests/verification/.
>
> test.d/tlob/ contains both the helper sources (tlob_target, tlob_sym)
> and the seven test scripts so the test suite is self-contained.
> tlob_target provides busy-spin, sleep, and preempt workloads;
> tlob_sym
> resolves ELF symbol offsets for uprobe registration.
>
> Seven test scripts exercise uprobe binding management, budget
> violation
> detection, and per-state time accounting (running_ns, waiting_ns,
> sleeping_ns).
>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
> .../testing/selftests/verification/.gitignore | 2 +
> tools/testing/selftests/verification/Makefile | 19 +-
> .../verification/test.d/tlob/Makefile | 20 ++
> .../verification/test.d/tlob/test.d/functions | 1 +
Tests seems to work fine, thanks. I think I'm getting what you're
trying to do and probably defining a dummy functions for tlob isn't the
right thing to do.
ftracetest --rv doesn't quite work with different folders as you'd pick
the ftrace functions. We shouldn't be adding a dummy functions script
every time but rather fix ftracetest directly.
Try the attached patch, it seems to work on my side. Then you'd be able
to use ftracetest --rv as you please (folder is mandatory, I don't
think we need a version inside selftests/verification,
verificationtest-ktap is only meant for the makefile just like
ftracetest-ktap).
Thanks,
Gabriele
---
From 32242f83af8214a51c47167a1904d3663aa43870 Mon Sep 17 00:00:00 2001
From: Gabriele Monaco <gmonaco@redhat.com>
Date: Tue, 16 Jun 2026 16:26:53 +0200
Subject: [PATCH] selftests/tracing: support flexible helper functions
Target directory or file arguments passed to ftracetest had
to contain their own nested test.d/functions files to properly override
the top-level directory (TOP_DIR). This works for standard ftrace tests
that would fall back to ftrace's functions, but doesn't for rv tests.
Introduce find_functions() to recursively check parent and grandparent
directories of the target test directory/file to dynamically locate the
appropriate root functions config.
This allows to define additional directories for rv selftests and run
them with the appropriate rv functions:
ftracetest --rv tools/testing/selftests/verification/test.d/<topic>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
tools/testing/selftests/ftrace/ftracetest | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/ftrace/ftracetest b/tools/testing/selftests/ftrace/ftracetest
index 0a56bf209f..5d1b1f9380 100755
--- a/tools/testing/selftests/ftrace/ftracetest
+++ b/tools/testing/selftests/ftrace/ftracetest
@@ -80,6 +80,23 @@ find_testcases() { #directory
echo `find $1 -name \*.tc | sort`
}
+find_functions() { # [directory] [test_cases]
+ local BASE_DIR="$1"
+ if [ -f "$1" ]; then
+ BASE_DIR="`absdir $1`"
+ elif [ ! -d "$1" ]; then
+ return
+ fi
+
+ if [ -f "$BASE_DIR"/test.d/functions ]; then
+ echo "$BASE_DIR"
+ elif [ -f "$BASE_DIR"/../test.d/functions ]; then
+ abspath "$BASE_DIR"/..
+ elif [ -f "$BASE_DIR"/../../test.d/functions ]; then
+ abspath "$BASE_DIR"/../..
+ fi
+}
+
parse_opts() { # opts
local OPT_TEST_CASES=
local OPT_TEST_DIR=
@@ -159,7 +176,8 @@ parse_opts() { # opts
if [ -n "$OPT_TEST_CASES" ]; then
TEST_CASES=$OPT_TEST_CASES
fi
- if [ -n "$OPT_TEST_DIR" -a -f "$OPT_TEST_DIR"/test.d/functions ]; then
+ OPT_TEST_DIR="`find_functions $OPT_TEST_DIR $OPT_TEST_CASES`"
+ if [ -n "$OPT_TEST_DIR" ]; then
TOP_DIR=$OPT_TEST_DIR
TEST_DIR=$TOP_DIR/test.d
fi
--
2.54.0
^ permalink raw reply related
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-16 13:47 UTC (permalink / raw)
To: Brendan Jackman
Cc: Vlastimil Babka (SUSE), David Hildenbrand (Arm), Balbir Singh,
lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman, Matthew Wilcox
In-Reply-To: <DJAGEUY8S09F.3V3HF570G85OF@linux.dev>
On Tue, Jun 16, 2026 at 11:57:42AM +0000, Brendan Jackman wrote:
> On Mon Jun 15, 2026 at 2:38 PM UTC, Vlastimil Babka (SUSE) wrote:
> > On 6/12/26 17:29, Gregory Price wrote:
> >> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
> >>> On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
> >>> > >
> >
> > I think the memalloc approach is dangerous due to unexpected nesting. There
> > might be nested page allocations in page allocation itself (due to some
> > debugging option). But also interrupts do not change what "current" points
> > to. Suddenly those could start requesting folios and/or private nodes and be
> > surprised, I'm afraid.
>
> Minor side-note: couldn't we just define it such that the allocator
> ignores the context when not in_task() (and warn if you try to enter the
> context while not currently in_task())?
>
> (Don't think this would change the conclusion very much, e.g. doesn't
> help with the nesting issues. Mostly curious in case I'm missing a
> detail here).
>
I looked at this - only solves one issue and oh boy is that an obtuse
confusing condition to understand. We still suffer from recursion in
reclaim.
~Gregory
^ permalink raw reply
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Brendan Jackman @ 2026-06-16 11:57 UTC (permalink / raw)
To: Vlastimil Babka (SUSE), Gregory Price, David Hildenbrand (Arm)
Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman, Matthew Wilcox
In-Reply-To: <9f1815b0-896b-44ab-9e6d-9316d8f11033@kernel.org>
On Mon Jun 15, 2026 at 2:38 PM UTC, Vlastimil Babka (SUSE) wrote:
> On 6/12/26 17:29, Gregory Price wrote:
>> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
>>> On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
>>> > >
>>> > > I understand this question in two ways:
>>> > >
>>> > > 1) Can we disallow PAGE allocation and limit this to FOLIO allocation
>>> >
>>> > Yes. Can we only allow folios to be allocated from private memory nodes. So let
>>> > me reply to that one below.
>>> >
>>> ... snip ...
>>> >
>>> > At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
>>> > context might be better. I think there was also talk about how the memalloc_*
>>> > interface might be a better way forward. Maybe we would start giving the
>>> > allocator more context ("we are allocating a folio").
>>> >
>>> > The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
>>> >
>>>
>>> I will still probably send the next RFC version tomorrow or friday,
>>> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
>>>
>>> Also, I made a new `anondax` driver which enables userland testing
>>> of this functionality without any specialty hardware.
>>>
>>
>> (apologies for the length of this email: this will all be covered in
>> the coming cover letter, but I just wanted to share a bit of a preview)
>>
>> ===
>>
>> Just another small update - I am planning to post the RFC today once i
>> get some mild cleanup done. It will be based on the dax atomic hotplug
>>
>> https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/
>>
>> But a couple specific details regarding the memalloc pieces that i've
>> learned the past couple of days playing with it.
>>
>> 1) memalloc_folio is required to ensure non-folio allocations don't land
>> on the private node, even if it happens within a memalloc_private
>> context. Since memalloc_folio may be useful in contexts outside of
>> private nodes, I kept this as a separate flag.
>>
>> If we think there will *never* be additional users of memalloc_folio,
>> then we could fold _folio into _private to save the flag for now and
>> add it back when we actually need it.
>>
>> 2) memalloc_private is needed to unlock private nodes, but in the
>> original NOFALLBACK-only design, you also needed __GFP_THISNODE.
>>
>> This is *highly* restrictive. I found when playing with mbind that
>> MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
>> implies a bug).
>>
>> That leads me to #3
>
> I think the memalloc approach is dangerous due to unexpected nesting. There
> might be nested page allocations in page allocation itself (due to some
> debugging option). But also interrupts do not change what "current" points
> to. Suddenly those could start requesting folios and/or private nodes and be
> surprised, I'm afraid.
Minor side-note: couldn't we just define it such that the allocator
ignores the context when not in_task() (and warn if you try to enter the
context while not currently in_task())?
(Don't think this would change the conclusion very much, e.g. doesn't
help with the nesting issues. Mostly curious in case I'm missing a
detail here).
> The memalloc scopes only work well when they restrict the context wrt
> reclaim, and allocations in IRQ have to be already restricted heavily
> (atomic) so further memalloc restrictions don't do anything in practice. But
> to make them change other aspects of the allocations like this won't work.
^ permalink raw reply
* Re: [PATCH v3 8/9] selftests/verification: fix verificationtest-ktap for out-of-tree execution
From: Gabriele Monaco @ 2026-06-16 11:14 UTC (permalink / raw)
To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <95e700c62601cf432842269d89a86a492d073f0e.1780847473.git.wen.yang@linux.dev>
On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> verificationtest-ktap used CWD-relative paths which broke when
> invoked outside the verification directory (e.g. via vng).
I still don't get this, I even run from my home directory and it works
without this commit:
make -C linux/tools/testing/selftests/verification/ run_tests
It builds and runs your new tests just fine, how exactly do you run
them? vng shouldn't affect this since it runs from the same directory
as overlay mount, what errors are you getting?
The only issue I see is with the read only filesystems since ftracetest
attempts to write logs, but that doesn't seem solved by your commit
either. Adding the appropriate rwdir lets me run them just fine:
vng -v --rwdir tools/testing/selftests/verification/logs -- make -C tools/testing/selftests/verification run_tests
I'm suspecting you're trying to call verificationtest-ktap manually
from somewhere else to call it only on tlob/, but that looks like
asking for troubles. While you're at it I don't see why you don't just
call ftracetest.
Gabriele
>
> Resolve paths via realpath "$(dirname "$0")" so the script works
> from any working directory. Accept an optional subdirectory argument
> interpreted relative to the script's directory.
>
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
> tools/testing/selftests/verification/verificationtest-ktap | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/verification/verificationtest-
> ktap b/tools/testing/selftests/verification/verificationtest-ktap
> index 18f7fe324e2f..055747cef38a 100755
> --- a/tools/testing/selftests/verification/verificationtest-ktap
> +++ b/tools/testing/selftests/verification/verificationtest-ktap
> @@ -5,4 +5,6 @@
> #
> # Copyright (C) Arm Ltd., 2023
>
> -../ftrace/ftracetest -K -v --rv ../verification
> +dir=$(realpath "$(dirname "$0")")
> +testdir=$(cd "$dir" && realpath "${1:-.}")
> +"$dir/../ftrace/ftracetest" -K -v --rv "$testdir"
^ permalink raw reply
* Re: [PATCH v3 2/9] rv: add generic uprobe infrastructure for RV monitors
From: Gabriele Monaco @ 2026-06-16 9:49 UTC (permalink / raw)
To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <9d1a1d491af16853b2b421f358fd6cca965588ab.1780847473.git.wen.yang@linux.dev>
On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> Introduce rv_uprobe, a thin wrapper around uprobe_consumer providing
> rv_uprobe_attach_path(), rv_uprobe_attach(), and rv_uprobe_detach()
> for RV monitors. An opaque priv pointer is forwarded unchanged to
> entry/return handlers so monitors can carry per-binding state (e.g. a
> latency threshold) to the hot path without any global lookup.
>
> rv_uprobe_detach() is fully synchronous (nosync + sync + path_put +
> kfree), closing the use-after-free window present in open-coded
> patterns where kfree() precedes uprobe_unregister_sync().
>
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com>
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
I find your implementation solid, but I wonder if it isn't a bit
overdoing it. It's good to have helper functions abstracting away the
common uprobes code, but I find the number of structures used to indirect/hide
the process a bit confusing.
Users of this headers still know they're using uprobes, so we don't
really need to hide it that much, plus the double function pointers are
an additional performance hit we could easily skip.
As I get it, we want to pass a private pointer (essentially to get the
threshold in tlob) and the uprobe infrastructure doesn't allow that
transparently. So let's keep on using struct embedding but a bit more
cut to the bone:
struct rv_uprobe {
struct uprobe_consumer uc;
struct uprobe *uprobe;
void *priv;
};
Then handlers would use container_of() appropriately to derive the data they
need.
static int tlob_uprobe_entry_handler(struct uprobe_consumer *self, struct pt_regs *regs, __u64 *data)
{
struct rv_uprobe *p = container_of(self, struct rv_uprobe, uc);
struct tlob_uprobe_binding *b = p->priv;
...
}
You could even define the struct directly inside your private data and save one
step (and no void *):
struct rv_uprobe {
struct uprobe_consumer uc;
struct uprobe *uprobe;
};
#define DECLARE_RV_UPROBE(name) struct rv_uprobe name
// And then
struct tlob_uprobe_binding {
u64 threshold_ns;
...
DECLARE_RV_UPROBE(start);
DECLARE_RV_UPROBE(stop);
}
static int tlob_uprobe_entry_handler(struct uprobe_consumer *self, struct pt_regs *regs, __u64 *data)
{
struct tlob_uprobe_binding *b = container_of(self, struct tlob_uprobe_binding, start.uc);
...
}
Now I intentionally skipped offset, as I don't really see why carry it around,
and path, as we could live without it by relying on inodes (uprobes
reference-track them but you'd need to save them somewhere).
Mind, I only build tested this idea (appended after). This was supported by an
LLM which did a few more changes like standardising names and return values of
registration functions (a bit more consistent with other stuff, you don't have
to commit with that though).
By the way, do we really need the duplicate check? Would it break if a user
defines it twice? If not, we could simplify the tlob_add_uprobe() even further
and doing all path related checks in the library, since those aren't quite tlob
specific.
Thoughts?
Thanks,
Gabriele
**Suggested simplification:**
---
diff --git a/include/rv/rv_uprobe.h b/include/rv/rv_uprobe.h
index 9106c5c927..4fb3f50a63 100644
--- a/include/rv/rv_uprobe.h
+++ b/include/rv/rv_uprobe.h
@@ -7,83 +7,56 @@
#ifndef _RV_UPROBE_H
#define _RV_UPROBE_H
-#include <linux/path.h>
#include <linux/types.h>
+#include <linux/uprobes.h>
struct pt_regs;
+struct inode;
/**
* struct rv_uprobe - a single uprobe registered on behalf of an RV monitor
*
- * @offset: byte offset within the ELF binary where the probe is installed
- * @priv: monitor-private pointer; set at attach time, never touched by
- * this layer; passed unchanged to entry_fn / ret_fn
- * @path: resolved path of the probed binary (read-only after attach);
- * callers may use path.dentry for identity comparisons
- *
- * The implementation fields (uprobe_consumer, uprobe handle, callbacks) are
- * private to rv_uprobe.c and are not exposed here; monitors must not access
- * them directly.
+ * @uc: underlying uprobe consumer (publicly visible)
+ * @uprobe: active uprobe structure handle
+ * @inode: inode of the target binary (read-only after registration)
*/
struct rv_uprobe {
- /* public: read-only after rv_uprobe_attach*() */
- loff_t offset;
- void *priv;
- struct path path;
+ struct uprobe_consumer uc;
+ struct uprobe *uprobe;
+ struct inode *inode;
};
-/**
- * rv_uprobe_attach_path - register an uprobe given an already-resolved path
- * @path: path of the target binary; rv_uprobe takes its own reference
- * @offset: byte offset within the binary
- * @entry_fn: called on probe hit (entry); may be NULL
- * @ret_fn: called on function return (uretprobe); may be NULL
- * @priv: opaque pointer forwarded to callbacks unchanged
- *
- * Use this variant when the caller has already resolved the path (e.g. to
- * register multiple probes on the same binary with a single kern_path call).
- * The inode is derived internally via d_real_inode(), so inode and path are
- * always consistent.
- *
- * Returns a pointer to the new rv_uprobe on success, ERR_PTR on failure.
- */
-struct rv_uprobe *rv_uprobe_attach_path(struct path *path, loff_t offset,
- int (*entry_fn)(struct rv_uprobe *p, struct pt_regs *regs, __u64 *data),
- int (*ret_fn)(struct rv_uprobe *p, unsigned long func,
- struct pt_regs *regs, __u64 *data),
- void *priv);
+/* Seamless inline declaration of a named uprobe inside user structs */
+#define DECLARE_RV_UPROBE(name) \
+ struct rv_uprobe name
/**
- * rv_uprobe_attach - resolve binpath and register an uprobe
- * @binpath: absolute path to the target binary
- * @offset: byte offset within the binary
- * @entry_fn: called on probe hit (entry); may be NULL
- * @ret_fn: called on function return (uretprobe); may be NULL
- * @priv: opaque pointer forwarded to callbacks unchanged
+ * rv_uprobe_register - resolve binpath and register an uprobe
+ * @binpath: absolute path to the target binary
+ * @offset: byte offset within the binary
+ * @p: pointer to the allocated/embedded rv_uprobe structure
+ * @handler: called on probe hit (entry); may be NULL
+ * @ret_handler: called on function return (uretprobe); may be NULL
*
- * Resolves binpath via kern_path(), then delegates to rv_uprobe_attach_path().
+ * Resolves binpath via kern_path(), registers the uprobe directly using the
+ * embedded `uprobe_consumer`, and immediately releases the path reference.
*
- * Returns a pointer to the new rv_uprobe on success, ERR_PTR on failure.
+ * Returns 0 on success, or a negative error code on failure.
*/
-struct rv_uprobe *rv_uprobe_attach(const char *binpath, loff_t offset,
- int (*entry_fn)(struct rv_uprobe *p, struct pt_regs *regs, __u64 *data),
- int (*ret_fn)(struct rv_uprobe *p, unsigned long func,
- struct pt_regs *regs, __u64 *data),
- void *priv);
+int rv_uprobe_register(const char *binpath, loff_t offset, struct rv_uprobe *p,
+ int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs, __u64 *data),
+ int (*ret_handler)(struct uprobe_consumer *self, unsigned long func,
+ struct pt_regs *regs, __u64 *data));
/**
- * rv_uprobe_detach - synchronously unregister an uprobe and free it
- * @p: probe to detach; may be NULL (no-op)
- *
- * Calls uprobe_unregister_nosync(), then uprobe_unregister_sync() to wait
- * for any in-progress handler to finish, then releases the path reference
- * and frees the rv_uprobe struct. The caller's priv data is NOT freed.
+ * rv_uprobe_unregister - synchronously unregister a uprobe
+ * @p: probe to unregister; may be NULL (no-op)
*
- * When removing a single probe, prefer this over the three-phase API.
- * Safe to call from process context only (uprobe_unregister_sync() may
- * schedule).
+ * Dequeues the uprobe and waits synchronously for all in-flight handlers
+ * to complete.
*/
-void rv_uprobe_detach(struct rv_uprobe *p);
+void rv_uprobe_unregister(struct rv_uprobe *p);
/**
* rv_uprobe_unregister_nosync - dequeue an uprobe without waiting
@@ -91,9 +64,7 @@ void rv_uprobe_detach(struct rv_uprobe *p);
*
* Removes the uprobe from the uprobe subsystem but does NOT wait for
* in-flight handlers to complete. The caller must call rv_uprobe_sync()
- * before calling rv_uprobe_free() on the same probe.
- *
- * Use this to batch multiple deregistrations before a single rv_uprobe_sync().
+ * before freeing any container holding this probe.
*/
void rv_uprobe_unregister_nosync(struct rv_uprobe *p);
@@ -101,19 +72,8 @@ void rv_uprobe_unregister_nosync(struct rv_uprobe *p);
* rv_uprobe_sync - wait for all in-flight uprobe handlers to complete
*
* Global barrier: waits for every in-flight uprobe handler across the system
- * to finish. Call once after a batch of rv_uprobe_unregister_nosync() calls
- * and before any rv_uprobe_free() call.
+ * to finish.
*/
void rv_uprobe_sync(void);
-/**
- * rv_uprobe_free - release resources of a previously deregistered probe
- * @p: probe to free; may be NULL (no-op)
- *
- * Releases the path reference and frees the rv_uprobe struct. Must only
- * be called after rv_uprobe_sync() has returned. The caller's priv data
- * is NOT freed.
- */
-void rv_uprobe_free(struct rv_uprobe *p);
-
#endif /* _RV_UPROBE_H */
diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
index d8e0c47947..28a6c740c7 100644
--- a/kernel/trace/rv/monitors/tlob/tlob.c
+++ b/kernel/trace/rv/monitors/tlob/tlob.c
@@ -252,8 +252,8 @@ struct tlob_uprobe_binding {
char binpath[TLOB_MAX_PATH];
loff_t offset_start;
loff_t offset_stop;
- struct rv_uprobe *start_probe;
- struct rv_uprobe *stop_probe;
+ DECLARE_RV_UPROBE(start_probe);
+ DECLARE_RV_UPROBE(stop_probe);
};
/* RCU callback: free the slab once no readers remain. */
@@ -512,16 +512,16 @@ int tlob_stop_task(struct task_struct *task)
EXPORT_SYMBOL_GPL(tlob_stop_task);
-static int tlob_uprobe_entry_handler(struct rv_uprobe *p, struct pt_regs *regs,
+static int tlob_uprobe_entry_handler(struct uprobe_consumer *self, struct pt_regs *regs,
__u64 *data)
{
- struct tlob_uprobe_binding *b = p->priv;
+ struct tlob_uprobe_binding *b = container_of(self, struct tlob_uprobe_binding, start_probe.uc);
tlob_start_task(current, b->threshold_ns);
return 0;
}
-static int tlob_uprobe_stop_handler(struct rv_uprobe *p, struct pt_regs *regs,
+static int tlob_uprobe_stop_handler(struct uprobe_consumer *self, struct pt_regs *regs,
__u64 *data)
{
tlob_stop_task(current);
@@ -537,6 +537,7 @@ static int tlob_add_uprobe(u64 threshold_ns, const char *binpath,
{
struct tlob_uprobe_binding *b, *tmp_b;
char pathbuf[TLOB_MAX_PATH];
+ struct inode *inode;
struct path path;
char *canon;
int ret;
@@ -561,10 +562,12 @@ static int tlob_add_uprobe(u64 threshold_ns, const char *binpath,
goto err_path;
}
- /* Reject duplicate start offset for the same binary. */
+ inode = d_real_inode(path.dentry);
+
+ /* Reject duplicate start offset for the same binary inode. */
list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
if (tmp_b->offset_start == offset_start &&
- tmp_b->start_probe->path.dentry == path.dentry) {
+ tmp_b->start_probe.inode == inode) {
ret = -EEXIST;
goto err_path;
}
@@ -577,29 +580,25 @@ static int tlob_add_uprobe(u64 threshold_ns, const char *binpath,
}
strscpy(b->binpath, canon, sizeof(b->binpath));
- /* Both probes share b (priv) and path; attach_path refs path itself. */
- b->start_probe = rv_uprobe_attach_path(&path, offset_start,
- tlob_uprobe_entry_handler, NULL, b);
- if (IS_ERR(b->start_probe)) {
- ret = PTR_ERR(b->start_probe);
- b->start_probe = NULL;
- goto err_path;
- }
+ path_put(&path);
+
+ /* Both probes are registered directly on the embedded fields */
+ ret = rv_uprobe_register(b->binpath, offset_start, &b->start_probe,
+ tlob_uprobe_entry_handler, NULL);
+ if (ret)
+ goto err_free;
- b->stop_probe = rv_uprobe_attach_path(&path, offset_stop,
- tlob_uprobe_stop_handler, NULL, b);
- if (IS_ERR(b->stop_probe)) {
- ret = PTR_ERR(b->stop_probe);
- b->stop_probe = NULL;
+ ret = rv_uprobe_register(b->binpath, offset_stop, &b->stop_probe,
+ tlob_uprobe_stop_handler, NULL);
+ if (ret)
goto err_start;
- }
- path_put(&path);
list_add_tail(&b->list, &tlob_uprobe_list);
return 0;
err_start:
- rv_uprobe_detach(b->start_probe);
+ rv_uprobe_unregister(&b->start_probe);
+ goto err_free;
err_path:
path_put(&path);
err_free:
@@ -611,21 +610,24 @@ static int tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
{
struct tlob_uprobe_binding *b, *tmp;
struct path remove_path;
+ struct inode *inode;
int ret;
ret = kern_path(binpath, LOOKUP_FOLLOW, &remove_path);
if (ret)
return ret;
+ inode = d_real_inode(remove_path.dentry);
+
ret = -ENOENT;
list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
if (b->offset_start != offset_start)
continue;
- if (b->start_probe->path.dentry != remove_path.dentry)
+ if (b->start_probe.inode != inode)
continue;
list_del(&b->list);
- rv_uprobe_detach(b->start_probe);
- rv_uprobe_detach(b->stop_probe);
+ rv_uprobe_unregister(&b->start_probe);
+ rv_uprobe_unregister(&b->stop_probe);
kfree(b);
ret = 0;
break;
@@ -643,8 +645,8 @@ static void tlob_remove_all_uprobes(void)
mutex_lock(&tlob_uprobe_mutex);
list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
list_move(&b->list, &pending);
- rv_uprobe_unregister_nosync(b->start_probe);
- rv_uprobe_unregister_nosync(b->stop_probe);
+ rv_uprobe_unregister_nosync(&b->start_probe);
+ rv_uprobe_unregister_nosync(&b->stop_probe);
}
mutex_unlock(&tlob_uprobe_mutex);
@@ -658,8 +660,6 @@ static void tlob_remove_all_uprobes(void)
rv_uprobe_sync();
list_for_each_entry_safe(b, tmp, &pending, list) {
- rv_uprobe_free(b->start_probe);
- rv_uprobe_free(b->stop_probe);
kfree(b);
}
}
diff --git a/kernel/trace/rv/rv_uprobe.c b/kernel/trace/rv/rv_uprobe.c
index 3d8b764dde..69b8b0c27e 100644
--- a/kernel/trace/rv/rv_uprobe.c
+++ b/kernel/trace/rv/rv_uprobe.c
@@ -10,149 +10,74 @@
#include <linux/uprobes.h>
#include <rv/rv_uprobe.h>
-/*
- * Private extension of struct rv_uprobe. Allocated by rv_uprobe_attach*()
- * and returned to callers as &impl->pub.
- */
-struct rv_uprobe_impl {
- struct rv_uprobe pub; /* must be first; callers hold &pub */
- struct uprobe_consumer uc;
- struct uprobe *uprobe;
- int (*entry_fn)(struct rv_uprobe *p, struct pt_regs *regs, __u64 *data);
- int (*ret_fn)(struct rv_uprobe *p, unsigned long func,
- struct pt_regs *regs, __u64 *data);
-};
-
-static int rv_uprobe_handler(struct uprobe_consumer *uc,
- struct pt_regs *regs, __u64 *data)
-{
- struct rv_uprobe_impl *impl = container_of(uc, struct rv_uprobe_impl, uc);
-
- if (impl->entry_fn)
- return impl->entry_fn(&impl->pub, regs, data);
- return 0;
-}
-
-static int rv_uprobe_ret_handler(struct uprobe_consumer *uc,
- unsigned long func,
- struct pt_regs *regs, __u64 *data)
-{
- struct rv_uprobe_impl *impl = container_of(uc, struct rv_uprobe_impl, uc);
-
- if (impl->ret_fn)
- return impl->ret_fn(&impl->pub, func, regs, data);
- return 0;
-}
-
-static struct rv_uprobe *
-__rv_uprobe_attach(struct inode *inode, struct path *path, loff_t offset,
- int (*entry_fn)(struct rv_uprobe *p, struct pt_regs *regs, __u64 *data),
- int (*ret_fn)(struct rv_uprobe *p, unsigned long func,
- struct pt_regs *regs, __u64 *data),
- void *priv)
-{
- struct rv_uprobe_impl *impl;
- int ret;
-
- if (!entry_fn && !ret_fn)
- return ERR_PTR(-EINVAL);
-
- impl = kzalloc_obj(*impl, GFP_KERNEL);
- if (!impl)
- return ERR_PTR(-ENOMEM);
-
- impl->pub.offset = offset;
- impl->pub.priv = priv;
- impl->entry_fn = entry_fn;
- impl->ret_fn = ret_fn;
- path_get(path);
- impl->pub.path = *path;
-
- if (entry_fn)
- impl->uc.handler = rv_uprobe_handler;
- if (ret_fn)
- impl->uc.ret_handler = rv_uprobe_ret_handler;
-
- impl->uprobe = uprobe_register(inode, offset, 0, &impl->uc);
- if (IS_ERR(impl->uprobe)) {
- ret = PTR_ERR(impl->uprobe);
- path_put(&impl->pub.path);
- kfree(impl);
- return ERR_PTR(ret);
- }
-
- return &impl->pub;
-}
-
-/**
- * rv_uprobe_attach_path - register an uprobe given an already-resolved path
- */
-struct rv_uprobe *rv_uprobe_attach_path(struct path *path, loff_t offset,
- int (*entry_fn)(struct rv_uprobe *p, struct pt_regs *regs, __u64 *data),
- int (*ret_fn)(struct rv_uprobe *p, unsigned long func,
- struct pt_regs *regs, __u64 *data),
- void *priv)
-{
- struct inode *inode = d_real_inode(path->dentry);
-
- return __rv_uprobe_attach(inode, path, offset, entry_fn, ret_fn, priv);
-}
-EXPORT_SYMBOL_GPL(rv_uprobe_attach_path);
-
/**
- * rv_uprobe_attach - resolve binpath and register an uprobe
+ * rv_uprobe_register - resolve binpath and register an uprobe
*/
-struct rv_uprobe *rv_uprobe_attach(const char *binpath, loff_t offset,
- int (*entry_fn)(struct rv_uprobe *p, struct pt_regs *regs, __u64 *data),
- int (*ret_fn)(struct rv_uprobe *p, unsigned long func,
- struct pt_regs *regs, __u64 *data),
- void *priv)
+int rv_uprobe_register(const char *binpath, loff_t offset, struct rv_uprobe *p,
+ int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs, __u64 *data),
+ int (*ret_handler)(struct uprobe_consumer *self, unsigned long func,
+ struct pt_regs *regs, __u64 *data))
{
- struct rv_uprobe *p;
+ struct inode *inode;
struct path path;
int ret;
+ if (!handler && !ret_handler)
+ return -EINVAL;
+
ret = kern_path(binpath, LOOKUP_FOLLOW, &path);
if (ret)
- return ERR_PTR(ret);
+ return ret;
if (!d_is_reg(path.dentry)) {
path_put(&path);
- return ERR_PTR(-EINVAL);
+ return -EINVAL;
}
- p = rv_uprobe_attach_path(&path, offset, entry_fn, ret_fn, priv);
+ inode = d_real_inode(path.dentry);
+
+ p->uc.handler = handler;
+ p->uc.ret_handler = ret_handler;
+ p->inode = inode;
+
+ p->uprobe = uprobe_register(inode, offset, 0, &p->uc);
path_put(&path);
- return p;
+
+ if (IS_ERR(p->uprobe)) {
+ ret = PTR_ERR(p->uprobe);
+ p->uprobe = NULL;
+ p->inode = NULL;
+ return ret;
+ }
+
+ return 0;
}
-EXPORT_SYMBOL_GPL(rv_uprobe_attach);
+EXPORT_SYMBOL_GPL(rv_uprobe_register);
/**
- * rv_uprobe_detach - synchronously unregister an uprobe and free it
+ * rv_uprobe_unregister - synchronously unregister a uprobe
*/
-void rv_uprobe_detach(struct rv_uprobe *p)
+void rv_uprobe_unregister(struct rv_uprobe *p)
{
- if (!p)
+ if (!p || IS_ERR_OR_NULL(p->uprobe))
return;
rv_uprobe_unregister_nosync(p);
rv_uprobe_sync();
- rv_uprobe_free(p);
}
-EXPORT_SYMBOL_GPL(rv_uprobe_detach);
+EXPORT_SYMBOL_GPL(rv_uprobe_unregister);
/**
* rv_uprobe_unregister_nosync - dequeue an uprobe without waiting
*/
void rv_uprobe_unregister_nosync(struct rv_uprobe *p)
{
- struct rv_uprobe_impl *impl;
-
- if (!p)
+ if (!p || IS_ERR_OR_NULL(p->uprobe))
return;
- impl = container_of(p, struct rv_uprobe_impl, pub);
- uprobe_unregister_nosync(impl->uprobe, &impl->uc);
+ uprobe_unregister_nosync(p->uprobe, &p->uc);
+ p->uprobe = NULL;
+ p->inode = NULL;
}
EXPORT_SYMBOL_GPL(rv_uprobe_unregister_nosync);
@@ -164,19 +89,3 @@ void rv_uprobe_sync(void)
uprobe_unregister_sync();
}
EXPORT_SYMBOL_GPL(rv_uprobe_sync);
-
-/**
- * rv_uprobe_free - release resources of a previously deregistered probe
- */
-void rv_uprobe_free(struct rv_uprobe *p)
-{
- struct rv_uprobe_impl *impl;
-
- if (!p)
- return;
-
- impl = container_of(p, struct rv_uprobe_impl, pub);
- path_put(&p->path);
- kfree(impl);
-}
-EXPORT_SYMBOL_GPL(rv_uprobe_free);
^ permalink raw reply related
* Re: [PATCH v3 05/13] verification/rvgen: Convert __fill_setup_invariants_func() to Lark
From: Nam Cao @ 2026-06-16 9:00 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
linux-kernel
In-Reply-To: <ded9af4b0cdde4696c0a295e0c622a9d39c445ee.camel@redhat.com>
Gabriele Monaco <gmonaco@redhat.com> writes:
> val is a string, doing val * 1000 repeats the string 1000 times in python!
Oh crap.
^ permalink raw reply
* [RFC PATCH v4 3/3] trace: add documentation, selftest and tooling for stackmap
From: Li Pengfei @ 2026-06-16 6:41 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
lipengfei28, zhangbo56, kernel test robot
In-Reply-To: <20260616064119.438063-1-lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Add supporting files for the ftrace stackmap feature:
Documentation/trace/ftrace-stackmap.rst:
Documentation covering design, usage, tracefs interface, binary
format, and performance characteristics. Added to the 'Core Tracing
Frameworks' toctree in Documentation/trace/index.rst. Documents:
- Reset is destructive: it requires tracing to be stopped and also
clears the ring buffer so no stale <stack_id N> survives
- Boot-time activation via trace_options=stackmap
- bits parameter range [10, 18] and worst-case memory usage
- tracefs file modes (0640 / 0440)
- Best-effort snapshot semantics for stack_map_bin, serialized
against reset via the reader_sem
- Counter naming: successes (events served), drops, success_rate;
successes/drops are best-effort and saturate on long runs
- Gravestone amplification when the pool is exhausted
tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
Functional selftest verifying:
- stackmap tracefs nodes exist
- enabling stackmap + stacktrace produces stack_id events
- stack_map_stat shows non-zero successes (a nonzero drops count is
a legitimate by-design fallback and is not treated as failure;
only zero successes alongside nonzero drops is fatal)
- reset clears entries when tracing is stopped
- reset is rejected (-EBUSY) while tracing is active
Test reads trace contents BEFORE switching back to the nop tracer
(tracer_init() unconditionally resets the ring buffer). The
function:tracer dependency is declared in '# requires:' so
ftracetest skips on kernels without CONFIG_FUNCTION_TRACER instead
of failing spuriously.
tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc:
Verifies the destructive-reset semantics and the binary ABI header:
- after 'echo 0 > stack_map', the trace buffer no longer contains
any stale <stack_id N>
- stack_map_bin begins with the expected magic and version
tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc:
Verifies the option is gated to the top-level instance: a secondary
instance neither exposes options/stackmap nor the stack_map* nodes,
and writing 'stackmap' to its aggregate trace_options file is
rejected rather than accepted as a no-op.
tools/tracing/stackmap_dump.py:
Python script to parse the binary stack_map_bin export.
Features:
- Automatic endianness detection via magic number
- Batched addr2line via stdin (avoids ARG_MAX with large stacks)
- JSON output mode (ips are always hex addresses; the ftrace
trampoline marker is shown only in the resolved symbols)
- Top-N filtering by ref_count
Binary format: all fields are native-endian. The parser detects
byte order by reading the magic value (0x46534D42 = 'FSMB').
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605160010.fakzGVVq-lkp@intel.com/
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
Documentation/trace/ftrace-stackmap.rst | 177 ++++++++++++++++++
Documentation/trace/index.rst | 1 +
.../ftrace/test.d/ftrace/stackmap-basic.tc | 111 +++++++++++
.../test.d/ftrace/stackmap-instance-gate.tc | 54 ++++++
.../ftrace/test.d/ftrace/stackmap-reset.tc | 76 ++++++++
tools/tracing/stackmap_dump.py | 164 ++++++++++++++++
6 files changed, 583 insertions(+)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
create mode 100755 tools/tracing/stackmap_dump.py
diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..8d0b5c389862
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,177 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li <lipengfei28@xiaomi.com>
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks
+ (default: 14 → 16384 stacks; valid range: 10-18).
+
+ At ``bits=18`` the kernel reserves roughly 130 MB of vmalloc memory
+ for the element pool. Each ``open()`` of ``stack_map_bin`` may
+ briefly allocate a similar amount for a snapshot. The cap is set
+ intentionally to bound memory usage.
+
+Usage
+=====
+
+Enable stack deduplication::
+
+ echo 1 > /sys/kernel/debug/tracing/options/stackmap
+ echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+ echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ``<stack_id N>`` instead of full stack traces::
+
+ sh-1234 [006] d.h.. 123.456789: <stack_id 42>
+
+To view the actual stacks::
+
+ cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+ stack_id 42 [ref 1337, depth 8]
+ [0] schedule+0x48/0xc0
+ [1] schedule_timeout+0x1c/0x30
+ ...
+
+To view statistics::
+
+ cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+ entries: 2500 / 16384
+ table_size: 32768
+ successes: 148923
+ drops: 0
+ success_rate: 100%
+
+To reset the stack map (tracing must be stopped first)::
+
+ echo 0 > /sys/kernel/debug/tracing/tracing_on
+ echo 0 > /sys/kernel/debug/tracing/stack_map
+
+Reset returns ``-EBUSY`` if tracing is currently active, or if another
+reset is already in progress.
+
+Reset is destructive to the trace buffer: because the ring buffer may
+still hold ``<stack_id N>`` events that reference soon-to-be-reused
+slots, resetting the map also resets the owning trace buffer (and its
+snapshot, if allocated). This keeps ring-buffer stack_ids and the map
+coherent. Read out any trace data you need before resetting.
+
+Boot-time activation
+====================
+
+The stackmap option can be enabled from the kernel command line::
+
+ trace_options=stackmap,stacktrace
+
+Trace events that fire before the tracefs filesystem is initialized
+(``fs_initcall`` time) fall back to recording full stack traces; once
+``ftrace_stackmap_create()`` runs, subsequent events are deduplicated.
+The crossover is automatic and lossless — no events are dropped, but
+early-boot stacks recorded before the crossover are not deduplicated.
+
+Tracefs Nodes
+=============
+
+The stack_map files are owned by root and not world-readable
+(``stack_map``: 0640; ``stack_map_stat`` and ``stack_map_bin``: 0440).
+
+``stack_map``
+ Text export of all deduplicated stacks with symbol resolution.
+ Writing ``0`` or ``reset`` clears all entries (only when tracing
+ is stopped).
+
+``stack_map_stat``
+ Statistics: entries (allocated unique stacks), table_size,
+ successes (events served), drops (events that fell back to
+ full-stack recording), and success_rate. Drops accumulate when
+ the element pool is exhausted; once that happens, slots that
+ won the cmpxchg but failed to allocate an element remain
+ "claimed but empty" and increase probe pressure for any future
+ insert hashing to the same bucket. Reset (when tracing is
+ stopped) clears these gravestones.
+
+``stack_map_bin``
+ Binary export for efficient userspace consumption. Format:
+
+ - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+ - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+ All fields are written in the kernel's native byte order.
+ Userspace tools detect endianness by reading the magic value.
+ Magic: ``0x46534D42`` ('FSMB'), Version: 1.
+
+ Trampoline frames are exported as the sentinel value
+ ``0x7fffffff`` (FTRACE_TRAMPOLINE_MARKER); all other addresses are
+ passed through ``trace_adjust_address()`` so they match the
+ ``stack_map`` text output's address-adjustment rules. Note this is
+ the same adjustment ftrace applies to its own trace output (mainly
+ relevant for persistent / last-boot buffers), not a general KASLR
+ un-offset: resolving these addresses offline still requires the
+ matching kernel's symbol information.
+
+ The export is a best-effort snapshot allocated at ``open()``;
+ concurrent inserts during the snapshot may be truncated. A
+ bounds check ensures no overflow.
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr. Cliff Click's non-blocking hash table
+algorithm:
+
+- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context
+- **Memory**: Pre-allocated element pool, zero allocation on the hot path
+ (no GFP_ATOMIC failures under memory pressure)
+- **Collision**: Linear probing with a 2x over-provisioned table; probe
+ length is bounded so worst-case insert/lookup is O(1)
+- **Scope**: Currently supports the global trace instance
+- **Hash**: 32-bit jhash with a per-instance random seed; full ``memcmp``
+ confirms matches
+
+Deduplication is best-effort, not strict: if two CPUs race in the
+insert path with the same ``key_hash`` (i.e. the same stack), the
+``cmpxchg`` loser advances by one slot and may insert the same stack
+again. Under heavy contention this can produce a small number of
+duplicate entries for the same stack; ``ref_count`` is then split
+across the duplicates. Total memory is still bounded by the element
+pool size, and lookup correctness is unaffected (each duplicate is
+a self-consistent entry with its own ``stack_id``). The trade-off is
+intentional and keeps the hot path lock-free.
+
+Performance
+===========
+
+Typical results on an aarch64 SMP system (function tracer, 2 seconds):
+
+- Unique stacks: ~3000
+- Dedup rate: 84-98% (depends on workload diversity)
+- Ring buffer savings: ~80% for stack data
+- Overhead per event: ~50ns (one jhash + hash table lookup)
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5d9bf4694d5d..ac8b1141c23a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -33,6 +33,7 @@ the Linux kernel.
ftrace
ftrace-design
ftrace-uses
+ ftrace-stackmap
kprobes
kprobetrace
fprobetrace
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
new file mode 100644
index 000000000000..64dfe7cc66bd
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
@@ -0,0 +1,111 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap basic functionality
+# requires: stack_map options/stackmap function:tracer
+
+# Test that ftrace stackmap deduplication works:
+# 1. Enable stackmap + stacktrace options
+# 2. Run function tracer briefly
+# 3. Verify trace contains <stack_id> events (read BEFORE switching
+# tracer back to nop, since tracer_init() resets the ring buffer)
+# 4. Verify stack_map has entries and at least some successes (drops is
+# a legitimate by-design fallback counter and is allowed to be nonzero;
+# only zero successes alongside nonzero drops indicates breakage)
+# 5. Verify reset is rejected (-EBUSY) while tracing is active
+# 6. Verify reset clears the map when tracing is stopped
+
+fail() {
+ echo "FAIL: $1"
+ exit_fail
+}
+
+# Restore state on any exit (success, fail, or interrupt) so a
+# half-finished test does not leave stacktrace/stackmap enabled.
+cleanup() {
+ disable_tracing 2>/dev/null
+ echo nop > current_tracer 2>/dev/null
+ echo 0 > options/stackmap 2>/dev/null
+ echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+
+# Read trace contents NOW, before switching tracer back to nop.
+# tracer_init() unconditionally calls tracing_reset_online_cpus(),
+# so the ring buffer would be empty after 'echo nop > current_tracer'.
+count=$(grep -c "<stack_id" trace || true)
+: "${count:=0}"
+if [ "$count" -eq 0 ]; then
+ fail "trace has no <stack_id> events"
+fi
+
+# Now safe to switch back and disable options
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries:=0}"
+if [ "$entries" -eq 0 ]; then
+ fail "stackmap has zero entries after tracing"
+fi
+
+successes=$(cat stack_map_stat | grep "^successes:" | awk '{print $2}')
+: "${successes:=0}"
+if [ "$successes" -eq 0 ]; then
+ fail "stackmap has zero successes"
+fi
+
+drops=$(cat stack_map_stat | grep "^drops:" | awk '{print $2}')
+: "${drops:=0}"
+# drops is a legitimate by-design fallback counter: when the map is full
+# or under heavy probe pressure, stackmap falls back to recording a full
+# stack instead of a stack_id. A nonzero drops count is therefore not a
+# failure. Only treat it as fatal if dedup never worked at all (no
+# successes), which would indicate the feature is genuinely broken rather
+# than merely under pressure.
+if [ "$successes" -eq 0 ] && [ "$drops" -ne 0 ]; then
+ fail "stackmap had $drops drops and zero successes (feature broken?)"
+fi
+
+# Check stack_map text output is parseable
+first_id=$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}')
+if [ -z "$first_id" ]; then
+ fail "stack_map output has no stack_id entries"
+fi
+
+# Test that reset is rejected while tracing is active
+enable_tracing
+if echo 0 > stack_map 2>/dev/null; then
+ disable_tracing
+ fail "stackmap reset should fail while tracing is active"
+fi
+disable_tracing
+
+# Test reset works when tracing is stopped
+echo 0 > stack_map
+entries_after=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries_after:=-1}"
+if [ "$entries_after" -ne 0 ]; then
+ fail "stackmap reset did not clear entries (got $entries_after)"
+fi
+
+echo "stackmap basic test passed: $entries unique stacks, $successes successes, $drops drops"
+exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
new file mode 100644
index 000000000000..28810ba20432
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
@@ -0,0 +1,54 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap option is gated to the top-level trace instance
+# requires: stack_map options/stackmap instances
+
+# The 'stackmap' option is added to TOP_LEVEL_TRACE_FLAGS, matching the
+# convention used for global-only options like 'printk' and 'record-cmd'.
+# Verify that:
+# 1. The global instance exposes options/stackmap and the stack_map* nodes.
+# 2. A newly created secondary instance under instances/ does NOT expose
+# options/stackmap or stack_map* nodes.
+
+fail() {
+ echo "FAIL: $1"
+ rmdir instances/test_stackmap_gate 2>/dev/null
+ exit_fail
+}
+
+# 1. Global instance must expose the option and the nodes
+test -e options/stackmap || fail "options/stackmap missing on global instance"
+test -e stack_map || fail "stack_map missing on global instance"
+test -e stack_map_stat || fail "stack_map_stat missing on global instance"
+test -e stack_map_bin || fail "stack_map_bin missing on global instance"
+
+# 2. Create a secondary instance and verify it does NOT see the option
+# or the stack_map* nodes.
+mkdir instances/test_stackmap_gate || fail "could not create secondary instance"
+
+if [ -e instances/test_stackmap_gate/options/stackmap ]; then
+ fail "secondary instance unexpectedly exposes options/stackmap"
+fi
+
+for f in stack_map stack_map_stat stack_map_bin; do
+ if [ -e instances/test_stackmap_gate/$f ]; then
+ fail "secondary instance unexpectedly has $f"
+ fi
+done
+
+# 3. The aggregate trace_options file still reaches set_tracer_flag(),
+# so writing 'stackmap' there must be rejected on a secondary
+# instance. Otherwise the bit could appear set in trace_options
+# while the hot path silently falls back to a full stack trace
+# (tr->stackmap == NULL).
+if echo stackmap > instances/test_stackmap_gate/trace_options 2>/dev/null; then
+ fail "secondary instance accepted 'echo stackmap > trace_options'"
+fi
+if grep -qw stackmap instances/test_stackmap_gate/trace_options; then
+ fail "secondary instance trace_options reports stackmap as set"
+fi
+
+rmdir instances/test_stackmap_gate || fail "could not remove secondary instance"
+
+echo "stackmap option gating to top-level instance works"
+exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
new file mode 100644
index 000000000000..803cc282f9ab
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
@@ -0,0 +1,76 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap reset clears the trace buffer and ABI header
+# requires: stack_map options/stackmap function:tracer
+
+# Lock in the two things most likely to regress in the stackmap ABI /
+# lifetime:
+# 1. Resetting the stackmap (echo 0 > stack_map, tracing stopped) also
+# clears the trace buffer, so no stale <stack_id N> can be left
+# dangling against an emptied map.
+# 2. The stack_map_bin header carries the expected magic ('FSMB' =
+# 0x46534D42) and version (1).
+
+fail() {
+ echo "FAIL: $1"
+ exit_fail
+}
+
+cleanup() {
+ disable_tracing 2>/dev/null
+ echo nop > current_tracer 2>/dev/null
+ echo 0 > options/stackmap 2>/dev/null
+ echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+
+# Sanity: the buffer must contain stack_id events before reset, otherwise
+# the post-reset emptiness check below would be meaningless.
+before=$(grep -c "<stack_id" trace || true)
+: "${before:=0}"
+if [ "$before" -eq 0 ]; then
+ fail "no <stack_id> events captured before reset"
+fi
+
+# Reset while tracing is stopped. This must succeed AND clear the trace
+# buffer (destructive reset semantics).
+echo 0 > stack_map || fail "reset rejected while tracing stopped"
+
+after=$(grep -c "<stack_id" trace || true)
+: "${after:=0}"
+if [ "$after" -ne 0 ]; then
+ fail "trace still has $after stale <stack_id> events after reset"
+fi
+
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries:=-1}"
+if [ "$entries" -ne 0 ]; then
+ fail "stackmap still has $entries entries after reset"
+fi
+
+# Binary export header: magic 'FSMB' (0x46534D42) + version 1.
+# od -tx4 renders the 32-bit words in the target's native byte order,
+# which matches what the kernel wrote, so the comparison is endian-safe.
+if command -v od >/dev/null 2>&1; then
+ magic=$(od -An -tx4 -N4 stack_map_bin | tr -d ' \n')
+ if [ "$magic" != "46534d42" ]; then
+ fail "stack_map_bin bad magic: 0x$magic (expected 46534d42)"
+ fi
+ ver=$(od -An -tx4 -j4 -N4 stack_map_bin | tr -d ' \n')
+ if [ "$ver" != "00000001" ]; then
+ fail "stack_map_bin bad version: 0x$ver (expected 00000001)"
+ fi
+fi
+
+echo "stackmap reset test passed: cleared $before stack_id events, ABI header ok"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..2d9c49b776e6
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,164 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+ # Pull from device and parse
+ adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+ python3 stackmap_dump.py /tmp/stack_map.bin
+
+ # With vmlinux for offline symbol resolution
+ python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+ # JSON output for tooling
+ python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x46534D42 # 'FSMB'
+HEADER_SIZE = 16 # 4 x u32
+ENTRY_SIZE = 16 # 4 x u32
+
+# __ftrace_trace_stack() replaces trampoline addresses with this marker
+# (FTRACE_TRAMPOLINE_MARKER == (unsigned long)INT_MAX) before the stack
+# is stored, so the binary export carries it verbatim.
+FTRACE_TRAMPOLINE_MARKER = 0x7fffffff
+TRAMPOLINE_LABEL = '[FTRACE TRAMPOLINE]'
+
+
+def detect_endianness(data):
+ """Detect byte order from magic number in header."""
+ if len(data) < 4:
+ raise ValueError("File too small")
+ magic_le = struct.unpack_from('<I', data, 0)[0]
+ if magic_le == MAGIC:
+ return '<'
+ magic_be = struct.unpack_from('>I', data, 0)[0]
+ if magic_be == MAGIC:
+ return '>'
+ raise ValueError(f"Bad magic: 0x{magic_le:08x} (neither LE nor BE)")
+
+
+def batch_addr2line(vmlinux, addrs):
+ """Resolve multiple addresses in one addr2line invocation."""
+ if not addrs:
+ return {}
+ try:
+ # Feed addresses on stdin to avoid ARG_MAX limits with large
+ # numbers of addresses (one stack can have 30+ frames; a
+ # snapshot can have thousands of unique stacks).
+ stdin = '\n'.join(hex(a) for a in addrs) + '\n'
+ result = subprocess.run(
+ ['addr2line', '-f', '-e', vmlinux],
+ input=stdin, capture_output=True, text=True, timeout=60
+ )
+ lines = result.stdout.split('\n')
+ # addr2line outputs 2 lines per address: function name + source location
+ symbols = {}
+ for i, addr in enumerate(addrs):
+ idx = i * 2
+ if idx < len(lines) and lines[idx] and lines[idx] != '??':
+ symbols[addr] = lines[idx]
+ return symbols
+ except (subprocess.TimeoutExpired, FileNotFoundError) as e:
+ print(f"warning: addr2line failed: {e}", file=sys.stderr)
+ return {}
+
+
+def parse_stackmap_bin(data):
+ """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+ if len(data) < HEADER_SIZE:
+ raise ValueError("File too small for header")
+
+ endian = detect_endianness(data)
+ header_fmt = f'{endian}IIII'
+ entry_fmt = f'{endian}IIII'
+
+ magic, version, nr_stacks, _ = struct.unpack_from(header_fmt, data, 0)
+ if version != 1:
+ raise ValueError(f"Unsupported version: {version}")
+
+ offset = HEADER_SIZE
+ for _ in range(nr_stacks):
+ if offset + ENTRY_SIZE > len(data):
+ break
+ stack_id, nr, ref_count, _ = struct.unpack_from(entry_fmt, data, offset)
+ offset += ENTRY_SIZE
+
+ ips_size = nr * 8
+ if offset + ips_size > len(data):
+ break
+ ips = struct.unpack_from(f'{endian}{nr}Q', data, offset)
+ offset += ips_size
+
+ yield stack_id, ref_count, list(ips)
+
+
+def main():
+ parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+ parser.add_argument('file', help='Path to stack_map_bin file')
+ parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+ parser.add_argument('--json', action='store_true', help='JSON output')
+ parser.add_argument('--top', type=int, default=0,
+ help='Show only top N stacks by ref_count')
+ args = parser.parse_args()
+
+ with open(args.file, 'rb') as f:
+ data = f.read()
+
+ stacks = list(parse_stackmap_bin(data))
+
+ if args.top > 0:
+ stacks.sort(key=lambda x: x[1], reverse=True)
+ stacks = stacks[:args.top]
+
+ # Batch symbol resolution
+ symbols = {}
+ if args.vmlinux:
+ all_addrs = set()
+ for _, _, ips in stacks:
+ all_addrs.update(ip for ip in ips
+ if ip != FTRACE_TRAMPOLINE_MARKER)
+ symbols = batch_addr2line(args.vmlinux, list(all_addrs))
+
+ def render(ip):
+ if ip == FTRACE_TRAMPOLINE_MARKER:
+ return TRAMPOLINE_LABEL
+ return symbols.get(ip, f'0x{ip:x}')
+
+ if args.json:
+ output = []
+ for stack_id, ref_count, ips in stacks:
+ entry = {
+ 'stack_id': stack_id,
+ 'ref_count': ref_count,
+ 'ips': [f'0x{ip:x}' for ip in ips]
+ }
+ if args.vmlinux:
+ entry['symbols'] = [render(ip) for ip in ips]
+ output.append(entry)
+ print(json.dumps(output, indent=2))
+ else:
+ for stack_id, ref_count, ips in stacks:
+ print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+ for i, ip in enumerate(ips):
+ if ip == FTRACE_TRAMPOLINE_MARKER:
+ print(f" [{i}] {TRAMPOLINE_LABEL}")
+ continue
+ sym = symbols.get(ip, '')
+ if sym:
+ sym = f' {sym}'
+ print(f" [{i}] 0x{ip:x}{sym}")
+ print()
+
+ print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+ main()
--
2.34.1
^ permalink raw reply related
* [RFC PATCH v4 2/3] trace: integrate stackmap into ftrace stack recording path
From: Li Pengfei @ 2026-06-16 6:41 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
lipengfei28, zhangbo56
In-Reply-To: <20260616064119.438063-1-lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.
Changes:
- New TRACE_STACK_ID in trace_type enum and stack_id_entry in
trace_entries.h.
- New TRACE_ITER(STACKMAP) trace option flag; when CONFIG_FTRACE_STACKMAP
is disabled, TRACE_ITER_STACKMAP_BIT is defined as -1 so that
TRACE_ITER(STACKMAP) evaluates to 0 (following the existing pattern
used by TRACE_ITER_PROF_TEXT_OFFSET).
- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and ZEROED_TRACE_FLAGS
so it is only exposed under the top-level trace instance, matching
the convention already used for global-only options such as 'printk'
and 'record-cmd'. Secondary instances under tracing/instances/*/
do not see the option in their options/ directory.
- set_tracer_flag() additionally rejects enabling STACKMAP on a
secondary instance. The per-option file is hidden on secondary
instances, but a write to the aggregate trace_options file still
reaches set_tracer_flag(); without this check the bit could be
accepted and then become a silent no-op in the hot path (where
tr->stackmap is NULL). This closes the global-instance-only gate
at the write path, not just in the tracefs layout.
- __ftrace_trace_stack() reserves the TRACE_STACK_ID ring-buffer slot
BEFORE calling ftrace_stackmap_get_id(), so the map (and its
ref_count / success counters) is only mutated when a ring-buffer
event will actually reference the entry. If the reservation fails
it falls back to a full stack; if get_id() fails it discards the
reserved slot and falls back. A stack deeper than
FTRACE_STACKMAP_MAX_DEPTH skips the map entirely (get_id() would
return -E2BIG) and records a full stack, so deep traces are never
truncated or merged.
- Stackmap pointer read with smp_load_acquire(), published with
smp_store_release() to ensure proper initialization ordering. The
hot path falls back to a full stack whenever tr->stackmap is NULL.
- ftrace_stackmap_create() takes the owning trace_array so the
stackmap can later clear that trace_array's buffers during reset.
- Added stack_id print handler in trace_output.c and TRACE_STACK_ID
to trace_valid_entry() in trace_selftest.c so ftrace startup
selftests accept the new entry type when the stackmap option is
enabled.
Failure-atomic init and boot-time activation:
- The global stackmap and its tracefs files are created during
tracer_init_tracefs(). stack_map is the single required file (it is
both the resolver and the reset interface); it is created BEFORE the
map pointer is published with smp_store_release(), so an observed
non-NULL tr->stackmap implies the resolver/reset file exists. If
stack_map cannot be created the map is destroyed and never published.
- A small init-state (PENDING / DONE / FAILED) lets set_tracer_flag()
distinguish "not initialized yet" from "init failed". Boot-time
options (trace_options=stackmap,stacktrace) are applied before the
tracefs init work runs; the flag is allowed to be set while init is
PENDING (the hot path falls back until the map is published, then the
boot-set option takes effect), and is only rejected once init has
permanently FAILED. On failure the STACKMAP flag is also cleared from
the global instance so options/stackmap never reports an enabled
no-op.
Fallback behavior: if stackmap returns an error (pool exhausted,
resetting, NULL pointer, or a too-deep stack), the full stack trace is
recorded as before -- no new failure modes introduced.
Per-instance stackmap support is left as a follow-up; gating the
option to the global instance (both in the tracefs layout and at the
set_tracer_flag() write path) makes the global-only scope explicit.
Usage:
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/trace.c | 216 +++++++++++++++++++++++++++++++++-
kernel/trace/trace.h | 17 +++
kernel/trace/trace_entries.h | 15 +++
kernel/trace/trace_output.c | 23 ++++
kernel/trace/trace_selftest.c | 1 +
5 files changed, 269 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..e00bee5d0e01 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
#include "trace.h"
#include "trace_output.h"
+#include "trace_stackmap.h"
#ifdef CONFIG_FTRACE_STARTUP_TEST
/*
@@ -509,12 +510,13 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
/* trace_options that are only supported by global_trace */
#define TOP_LEVEL_TRACE_FLAGS (TRACE_ITER(PRINTK) | \
TRACE_ITER(PRINTK_MSGONLY) | TRACE_ITER(RECORD_CMD) | \
- TRACE_ITER(PROF_TEXT_OFFSET) | FPROFILE_DEFAULT_FLAGS)
+ TRACE_ITER(PROF_TEXT_OFFSET) | TRACE_ITER(STACKMAP) | \
+ FPROFILE_DEFAULT_FLAGS)
/* trace_flags that are default zero for instances */
#define ZEROED_TRACE_FLAGS \
(TRACE_ITER(EVENT_FORK) | TRACE_ITER(FUNC_FORK) | TRACE_ITER(TRACE_PRINTK) | \
- TRACE_ITER(COPY_MARKER))
+ TRACE_ITER(COPY_MARKER) | TRACE_ITER(STACKMAP))
/*
* The global_trace is the descriptor that holds the top-level tracing
@@ -1562,7 +1564,7 @@ void tracing_reset_online_cpus(struct array_buffer *buf)
ring_buffer_record_enable(buffer);
}
-static void tracing_reset_all_cpus(struct array_buffer *buf)
+void tracing_reset_all_cpus(struct array_buffer *buf)
{
struct trace_buffer *buffer = buf->buffer;
@@ -2184,6 +2186,75 @@ void __ftrace_trace_stack(struct trace_array *tr,
}
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+ /*
+ * If stackmap dedup is enabled, try to store only the stack_id
+ * in the ring buffer instead of the full stack trace.
+ *
+ * Reserve the TRACE_STACK_ID ring-buffer slot BEFORE inserting
+ * into the stackmap. This guarantees the map is only mutated
+ * (and its ref_count / success counters bumped) when a
+ * ring-buffer event will actually reference the entry:
+ * - reservation fails -> fall back to full stack, map untouched
+ * - get_id() fails -> discard the reserved slot, fall back
+ * so stack_map_stat counters stay consistent with what the ring
+ * buffer holds, and a failed reservation never consumes a map
+ * slot for an event that records a full stack anyway.
+ */
+ if (tr->trace_flags & TRACE_ITER(STACKMAP)) {
+ struct ftrace_stackmap *smap;
+ struct stack_id_entry *sid_entry;
+ int sid;
+
+ /*
+ * Pairs with the smp_store_release() that publishes the
+ * fully initialized global stackmap at tracefs init.
+ */
+ smap = smp_load_acquire(&tr->stackmap);
+ if (!smap)
+ goto full_stack;
+
+ /*
+ * The stackmap stores at most FTRACE_STACKMAP_MAX_DEPTH
+ * frames per entry. A deeper trace would be truncated, and
+ * two distinct stacks that share the first MAX_DEPTH frames
+ * would hash and compare equal, silently merging into one
+ * stack_id. Keep the conservative full-stack path for deep
+ * traces so no information is lost or misattributed.
+ */
+ if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+ goto full_stack;
+
+ event = __trace_buffer_lock_reserve(buffer, TRACE_STACK_ID,
+ sizeof(*sid_entry), trace_ctx);
+ if (!event)
+ goto full_stack;
+
+ sid = ftrace_stackmap_get_id(smap, fstack->calls, nr_entries);
+ if (sid < 0) {
+ /*
+ * Pool exhausted or a reset is in progress. Discard
+ * the reserved stack_id slot and record the full
+ * stack instead, so the event still gets a trace.
+ */
+ __trace_event_discard_commit(buffer, event);
+ goto full_stack;
+ }
+
+ sid_entry = ring_buffer_event_data(event);
+ sid_entry->stack_id = sid;
+ /*
+ * stack_id is a synthetic side-event attached to a
+ * primary trace event that was already subject to
+ * filtering. No per-event filter is defined for
+ * TRACE_STACK_ID, so commit unconditionally.
+ */
+ __buffer_unlock_commit(buffer, event);
+ goto out;
+ }
+full_stack:
+#endif
+
event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
struct_size(entry, caller, nr_entries),
trace_ctx);
@@ -3979,6 +4050,33 @@ int trace_keep_overwrite(struct tracer *tracer, u64 mask, int set)
return 0;
}
+#ifdef CONFIG_FTRACE_STACKMAP
+/*
+ * Tracks tracefs-time initialization of the global stackmap so that
+ * set_tracer_flag() can distinguish "not initialized yet" from
+ * "initialization permanently failed".
+ *
+ * Boot-time options (trace_options=stackmap,stacktrace) are applied
+ * very early, before tracer_init_tracefs() creates and publishes the
+ * map. We must allow the STACKMAP flag to be set during that window
+ * (the hot path falls back to a full stack while tr->stackmap is NULL,
+ * then starts using the map once it is published). We must, however,
+ * reject the enable once init has *failed*, so options/stackmap never
+ * reports an enabled no-op.
+ *
+ * Written once from the tracefs init work before any concurrent
+ * userspace writer to trace_options can run, then only read; a plain
+ * int is therefore sufficient.
+ */
+enum {
+ STACKMAP_INIT_PENDING, /* tracer_init_tracefs() not run yet */
+ STACKMAP_INIT_DONE, /* map published, stack_map file created */
+ STACKMAP_INIT_FAILED, /* permanent failure, never available */
+};
+
+static int stackmap_init_state = STACKMAP_INIT_PENDING;
+#endif
+
int set_tracer_flag(struct trace_array *tr, u64 mask, int enabled)
{
switch (mask) {
@@ -3993,6 +4091,33 @@ int set_tracer_flag(struct trace_array *tr, u64 mask, int enabled)
if (!!(tr->trace_flags & mask) == !!enabled)
return 0;
+#ifdef CONFIG_FTRACE_STACKMAP
+ /*
+ * STACKMAP is intentionally global-instance-only: the dedup map,
+ * its tracefs files (stack_map / stack_map_stat / stack_map_bin)
+ * and the lifetime/reset semantics are tied to the global trace
+ * array. options/stackmap is hidden on secondary instances via
+ * TOP_LEVEL_TRACE_FLAGS, but writes still reach set_tracer_flag()
+ * through the aggregate trace_options file. Reject the enable on
+ * a secondary instance so it cannot be silently accepted and then
+ * become a no-op in the hot path (where tr->stackmap is NULL and
+ * the code falls back to a full stack trace).
+ *
+ * On the global instance, allow the enable while init is still
+ * pending (boot-time trace_options=stackmap is applied before the
+ * tracefs init work creates the map; the hot path falls back
+ * until the map is published). Only reject once init has
+ * permanently failed, so options/stackmap never reports an
+ * enabled no-op. READ_ONCE() suffices: this only inspects the
+ * init state, it does not dereference the map (the hot path uses
+ * smp_load_acquire(&tr->stackmap) for that).
+ */
+ if (mask == TRACE_ITER(STACKMAP) && enabled &&
+ (tr != &global_trace ||
+ READ_ONCE(stackmap_init_state) == STACKMAP_INIT_FAILED))
+ return -EINVAL;
+#endif
+
/* Give the tracer a chance to approve the change */
if (tr->current_trace->flag_changed)
if (tr->current_trace->flag_changed(tr, mask, !!enabled))
@@ -9222,6 +9347,91 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
NULL, &tracing_dyn_info_fops);
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+ {
+ struct ftrace_stackmap *smap;
+ struct dentry *map_file;
+
+ smap = ftrace_stackmap_create(&global_trace);
+ if (!IS_ERR(smap)) {
+ /*
+ * Failure-atomic init: stack_map is the single
+ * required tracefs file (it doubles as the reset
+ * interface and the human-readable resolver). If
+ * we cannot create it, the hot path must not be
+ * able to emit <stack_id N> events that no one can
+ * resolve or clear, so refuse to publish the map
+ * and tear it down.
+ *
+ * Create stack_map BEFORE smp_store_release() so an
+ * observed non-NULL global_trace.stackmap implies
+ * its resolver/reset file exists.
+ */
+ map_file = trace_create_file("stack_map",
+ TRACE_MODE_WRITE, NULL,
+ smap,
+ &ftrace_stackmap_fops);
+ if (!map_file) {
+ pr_warn("ftrace stackmap init: stack_map create failed, dedup disabled\n");
+ ftrace_stackmap_destroy(smap);
+ /*
+ * Permanent failure. Record it and clear a
+ * STACKMAP flag that a boot-time
+ * trace_options=stackmap may have set, so
+ * options/stackmap does not report an
+ * enabled no-op and later userspace enables
+ * return -EINVAL.
+ */
+ WRITE_ONCE(stackmap_init_state,
+ STACKMAP_INIT_FAILED);
+ global_trace.trace_flags &=
+ ~TRACE_ITER(STACKMAP);
+ } else {
+ /*
+ * smp_store_release pairs with the
+ * smp_load_acquire() in
+ * __ftrace_trace_stack(). Publishing only
+ * after the required file exists keeps
+ * "smap visible" => "resolver/reset
+ * available".
+ */
+ smp_store_release(&global_trace.stackmap,
+ smap);
+ WRITE_ONCE(stackmap_init_state,
+ STACKMAP_INIT_DONE);
+ /*
+ * stat and bin are auxiliary observability
+ * surfaces. If they fail to be created we
+ * keep dedup enabled (the kernel side still
+ * works, and stack_map alone is enough to
+ * resolve and reset); trace_create_file()
+ * already pr_warn()s on failure.
+ */
+ trace_create_file("stack_map_stat",
+ TRACE_MODE_READ, NULL,
+ smap,
+ &ftrace_stackmap_stat_fops);
+ trace_create_file("stack_map_bin",
+ TRACE_MODE_READ, NULL,
+ smap,
+ &ftrace_stackmap_bin_fops);
+ }
+ } else {
+ pr_warn("ftrace stackmap init failed, dedup disabled\n");
+ /*
+ * global_trace is statically defined; its stackmap
+ * field is zero-initialized via BSS, so leaving it
+ * NULL ensures the smp_load_acquire() in
+ * __ftrace_trace_stack() falls back to full stack.
+ * Mark init failed and clear any boot-time STACKMAP
+ * flag so userspace enables are rejected rather than
+ * becoming silent no-ops.
+ */
+ WRITE_ONCE(stackmap_init_state, STACKMAP_INIT_FAILED);
+ global_trace.trace_flags &= ~TRACE_ITER(STACKMAP);
+ }
+ }
+#endif
create_trace_instances(NULL);
update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..95db43bfc747 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
TRACE_TIMERLAT,
TRACE_RAW_DATA,
TRACE_FUNC_REPEATS,
+ TRACE_STACK_ID,
__TRACE_LAST_TYPE,
};
@@ -453,6 +454,9 @@ struct trace_array {
struct cond_snapshot *cond_snapshot;
#endif
struct trace_func_repeats __percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+ struct ftrace_stackmap *stackmap;
+#endif
/*
* On boot up, the ring buffer is set to the minimum size, so that
* we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
TRACE_GRAPH_RET); \
IF_ASSIGN(var, ent, struct func_repeats_entry, \
TRACE_FUNC_REPEATS); \
+ IF_ASSIGN(var, ent, struct stack_id_entry, \
+ TRACE_STACK_ID); \
__ftrace_bad_type(); \
} while (0)
@@ -689,6 +695,7 @@ extern int tracing_disabled;
int tracer_init(struct tracer *t, struct trace_array *tr);
int tracing_is_enabled(void);
void tracing_reset_online_cpus(struct array_buffer *buf);
+void tracing_reset_all_cpus(struct array_buffer *buf);
void tracing_reset_all_online_cpus(void);
void tracing_reset_all_online_cpus_unlocked(void);
int tracing_open_generic(struct inode *inode, struct file *filp);
@@ -1449,7 +1456,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
# define STACK_FLAGS
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS \
+ C(STACKMAP, "stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP_BIT -1
+#endif
+
#ifdef CONFIG_FUNCTION_PROFILER
+
# define PROFILER_FLAGS \
C(PROF_TEXT_OFFSET, "prof-text-offset"),
# ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1522,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
FUNCTION_FLAGS \
FGRAPH_FLAGS \
STACK_FLAGS \
+ STACKMAP_FLAGS \
BRANCH_FLAGS \
PROFILER_FLAGS \
FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
(void *)__entry->caller[6], (void *)__entry->caller[7])
);
+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+ TRACE_STACK_ID,
+
+ F_STRUCT(
+ __field( int, stack_id )
+ ),
+
+ F_printk("<stack_id %d>", __entry->stack_id)
+);
+
/*
* trace_printk entry:
*/
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
.funcs = &trace_user_stack_funcs,
};
+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+ int flags, struct trace_event *event)
+{
+ struct stack_id_entry *field;
+ struct trace_seq *s = &iter->seq;
+
+ trace_assign_type(field, iter->ent);
+ trace_seq_printf(s, "<stack_id %d>\n", field->stack_id);
+
+ return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+ .trace = trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+ .type = TRACE_STACK_ID,
+ .funcs = &trace_stack_id_funcs,
+};
+
/* TRACE_HWLAT */
static enum print_line_t
trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
&trace_wake_event,
&trace_stack_event,
&trace_user_stack_event,
+ &trace_stack_id_event,
&trace_bputs_event,
&trace_bprint_event,
&trace_print_event,
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 929c84075315..0c97065b0d68 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -14,6 +14,7 @@ static inline int trace_valid_entry(struct trace_entry *entry)
case TRACE_CTX:
case TRACE_WAKE:
case TRACE_STACK:
+ case TRACE_STACK_ID:
case TRACE_PRINT:
case TRACE_BRANCH:
case TRACE_GRAPH_ENT:
--
2.34.1
^ permalink raw reply related
* [RFC PATCH v4 1/3] trace: add lock-free stackmap for stack trace deduplication
From: Li Pengfei @ 2026-06-16 6:41 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
lipengfei28, zhangbo56
In-Reply-To: <20260616064119.438063-1-lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full
stack traces (80-160 bytes each) in the ring buffer for every event,
ftrace can store a 4-byte stack_id when the stackmap option is enabled.
The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:
- Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- Pre-allocated element pool (zero allocation on hot path)
- Linear probing with 2x over-provisioned table; probe length is
bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup
is O(1) even when the table is heavily loaded with claimed-but-
empty slots from pool exhaustion
- Single global instance (initialized for the global trace array)
The Kconfig depends on ARCH_HAVE_NMI_SAFE_CMPXCHG, matching the
existing tracing_map / hist_triggers requirement: the lock-free
hot path uses cmpxchg in a context that may be reached from NMI.
The stackmap is exported via three tracefs nodes:
- stack_map: text export with symbol resolution (mode 0640)
- stack_map_stat: counters (entries, successes, drops, success_rate)
- stack_map_bin: binary export (magic 0x46534D42 'FSMB', version 1,
all fields native-endian)
ftrace_stackmap_get_id() never truncates: a stack deeper than
FTRACE_STACKMAP_MAX_DEPTH (64) returns -E2BIG so the caller records a
full stack instead. This prevents two distinct traces that share their
first 64 frames from being merged into one stack_id.
Hot-path counters use per-CPU local_t (NMI-safe single-instruction
increments) instead of atomic64_t. atomic64_t falls back to
raw_spinlock_t-based emulation on 32-bit GENERIC_ATOMIC64 systems,
which would deadlock if an NMI hit while the spinlock was held.
local_t avoids this hazard. All counters saturate rather than wrap on
long (from-boot, multi-hour) traces: ref_count via
atomic_add_unless(.., INT_MAX) and successes/drops via
local_add_unless(.., LONG_MAX).
Reset semantics:
- Reset is a control-path operation only allowed when tracing is
stopped on the owning trace_array. Online reset (with tracing
active) is intentionally not supported.
- Reset is destructive: under the reader_sem write lock it clears the
owning trace_array's ring buffer (and snapshot buffer) BEFORE the
map, so an external observer never sees "trace still has
<stack_id N> but the map is already empty". The buffers are cleared
with tracing_reset_all_cpus() rather than _online_cpus() so a
TRACE_STACK_ID written by a now-offline CPU cannot survive a reset.
- Reset uses atomic_cmpxchg() to claim the resetting flag, then
verifies tracer_tracing_is_on() returns false.
- synchronize_rcu() drains in-flight get_id() callers from the
ftrace callback path (which runs preempt-disabled).
- The reader_sem (rw_semaphore) serializes the clearing against
tracefs readers (seq_file iteration and stack_map_bin snapshot),
which run in process context and aren't covered by
synchronize_rcu(). Readers take it shared, reset takes it
exclusive, so a reset cannot tear an iteration in progress. The
hot path doesn't take this lock.
- Reset clears the resetting flag with atomic_set_release() so a
subsequent get_id() observes a fully cleared map.
- get_id() uses atomic_read_acquire() on resetting so subsequent
loads of entry->key/val are properly ordered after the check
(control dependencies only order stores per LKMM).
- Concurrent reset, or reset while tracing is active, returns -EBUSY.
Concurrency notes:
- entry->val publication uses smp_store_release() paired with
smp_load_acquire() in all dereferencing readers.
- entry->key reads (in get_id, seq_start/next, bin_open) use
READ_ONCE() to avoid LKMM data races with the cmpxchg writer.
- elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before
use in seq_show and bin_open.
- Pool exhaustion: stackmap_get_elt() short-circuits via
atomic_read() before the contended atomic RMW, avoiding cacheline
contention once the pool is full. Slots that win cmpxchg but
cannot get an elt are left 'claimed but empty'; subsequent
lookups treat val==NULL as a miss and probe past them.
Hash key:
- Per-instance random seed stored in the stackmap struct (no
global state), seeded at create time.
- 32-bit jhash is forced to 1 if it lands on 0 (which is the
free-slot sentinel). Full memcmp confirms matches.
Memory:
- Single flat vmalloc for the element pool (no per-elt kzalloc).
- bits parameter clamped to [10, 18]: at the maximum bits=18, the
element pool is ~135 MB and a stack_map_bin snapshot may briefly
allocate another ~135 MB.
- struct stackmap_bin_snapshot uses u64 (not size_t) for its size
field so data[] is 8-byte aligned on both 32-bit and 64-bit
architectures, avoiding alignment faults when writing u64 IPs
on strict-alignment architectures.
Kernel command line parameter:
- ftrace_stackmap.bits=N: set map capacity (2^N unique stacks,
range 10-18, default 14)
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/Kconfig | 22 +
kernel/trace/Makefile | 1 +
kernel/trace/trace_stackmap.c | 889 ++++++++++++++++++++++++++++++++++
kernel/trace/trace_stackmap.h | 57 +++
4 files changed, 969 insertions(+)
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..e49cae886ff0 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,28 @@ config STACK_TRACER
Say N if unsure.
+config FTRACE_STACKMAP
+ bool "Ftrace stack map deduplication"
+ depends on TRACING
+ depends on STACKTRACE
+ depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
+ select KALLSYMS
+ help
+ This enables a global stack trace hash table for ftrace, inspired
+ by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+ only a stack_id in the ring buffer instead of the full stack trace,
+ significantly reducing trace buffer usage when the same call stacks
+ appear repeatedly.
+
+ The deduplicated stacks are exported via:
+ /sys/kernel/debug/tracing/stack_map
+
+ Writing to this file resets the stack map. Reading shows all unique
+ stacks with their stack_id and reference count.
+
+ Say Y if you want to reduce ftrace buffer usage for stack traces.
+ Say N if unsure.
+
config TRACE_PREEMPT_TOGGLE
bool
help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 8d3d96e847d8..c2d9b2bf895a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..9e9fdf85071d
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,889 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ * - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
+ * - Pre-allocated element pool (zero allocation on hot path)
+ * - Linear probing with 2x over-provisioned table; probe length
+ * bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup
+ * cost constant even when the table is heavily loaded
+ * - Single global instance (initialized for the global trace array)
+ *
+ * Reset is a control-path operation, only allowed when tracing is
+ * stopped on the owning trace_array. The protocol is:
+ *
+ * - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights
+ * and blocks new get_id() callers (they observe resetting=1 and
+ * return -EINVAL).
+ * - trace_types_lock serializes the tracer_tracing_is_on() check and
+ * the destructive ring-buffer reset against tracefs writes to
+ * tracing_on.
+ * - synchronize_rcu() drains in-flight get_id() callers from the
+ * ftrace callback path, which runs with preemption disabled.
+ *
+ * Online reset (with tracing active) is intentionally not supported
+ * to keep the design simple and the proof obligations small.
+ *
+ * The 32-bit jhash of the stack IPs is the hash table key. On hash
+ * collision, linear probing finds the next slot and full memcmp
+ * confirms the match.
+ *
+ * Concurrent userspace readers (cat stack_map / stack_map_bin) get
+ * a best-effort snapshot. They are coherent with the hot path
+ * (smp_load_acquire on entry->val); they are also serialized
+ * against reset via smap->reader_sem (readers take it in shared
+ * mode, reset in exclusive mode), so a reset cannot tear an
+ * iteration in progress -- it waits for active readers to drop
+ * the rwsem before clearing the map. The hot path is coordinated
+ * with reset separately, via acquire/release on smap->resetting.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/jhash.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/local_lock.h>
+#include <linux/percpu.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/log2.h>
+#include <asm/local.h>
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Bound the linear-probe scan length. With a 2x over-provisioned table,
+ * a well-distributed hash gives very short probe chains. Capping at 64
+ * keeps worst-case lookup O(1) even when the table is heavily loaded
+ * with claimed-but-empty slots from pool exhaustion.
+ */
+#define FTRACE_STACKMAP_MAX_PROBE 64
+
+/*
+ * Memory ordering of entry->val: published with smp_store_release()
+ * by the inserter; consumed with smp_load_acquire() by every reader
+ * that dereferences the elt (get_id, seq_show, bin_open). This pairs
+ * the writes to elt->{nr,ips,ref_count} (initialized BEFORE the
+ * publish) with the reads of those fields (which happen AFTER the
+ * load). seq_start / seq_next only test val for NULL and use the
+ * acquire load purely to keep memory ordering symmetric.
+ */
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+ u32 nr; /* actual number of IPs */
+ atomic_t ref_count;
+ unsigned long ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+ u32 key; /* 0 = free, non-zero = jhash */
+ struct stackmap_elt *val; /* NULL until fully published */
+};
+
+static struct stackmap_elt *stackmap_load_elt(struct stackmap_entry *entry)
+{
+ /*
+ * Pairs with the smp_store_release() that publishes entry->val
+ * after fully initializing the element payload.
+ */
+ return smp_load_acquire(&entry->val);
+}
+
+struct ftrace_stackmap {
+ struct trace_array *tr; /* owning trace_array */
+ unsigned int map_bits;
+ unsigned int map_size; /* 1 << (map_bits + 1) */
+ unsigned int max_elts; /* 1 << map_bits */
+ u32 hash_seed; /* per-instance jhash seed */
+ atomic_t next_elt; /* index into elts pool */
+ struct stackmap_entry *entries; /* hash table */
+ struct stackmap_elt *elts; /* flat element pool */
+ atomic_t resetting;
+ /*
+ * Reader/reset serialization. Held in shared mode (read lock)
+ * across seq_file iteration and binary snapshot construction;
+ * held in exclusive mode (write lock) by reset's clearing
+ * phase. The hot path (get_id) does not take this lock — it
+ * uses smp_load_acquire/smp_store_release on entry->val and
+ * the resetting flag for the lock-free protocol.
+ */
+ struct rw_semaphore reader_sem;
+ /*
+ * Per-CPU counters using local_t. local_t increments are NMI-
+ * safe on all architectures (single-instruction or interrupt-
+ * masked) and avoid the raw_spinlock_t fallback that
+ * atomic64_t uses on 32-bit GENERIC_ATOMIC64 — which would
+ * deadlock if an NMI hit while the spinlock was held.
+ */
+ local_t __percpu *successes; /* events served (hits + new inserts) */
+ local_t __percpu *drops;
+};
+
+/*
+ * Cap the bits parameter to keep worst-case allocations bounded:
+ * bits=18 → 256K elts, 512K slots, ~130 MB elt pool, ~130 MB bin
+ * export.
+ * Smaller workloads should use the default (14) which gives 16K elts
+ * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel
+ * parameter for higher unique-stack capacity.
+ */
+#define FTRACE_STACKMAP_BITS_MIN 10
+#define FTRACE_STACKMAP_BITS_MAX 18
+#define FTRACE_STACKMAP_BITS_DEFAULT 14
+
+static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT;
+static int __init stackmap_bits_setup(char *str)
+{
+ unsigned long val;
+
+ if (kstrtoul(str, 0, &val))
+ return -EINVAL;
+ val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX);
+ stackmap_map_bits = val;
+ return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+ int idx;
+
+ /*
+ * Fast-path early-out once the pool is fully consumed. Avoids
+ * the contended atomic RMW on next_elt for every traced event
+ * after the pool is exhausted.
+ */
+ if (atomic_read(&smap->next_elt) >= smap->max_elts)
+ return NULL;
+
+ idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+ if (idx < smap->max_elts)
+ return &smap->elts[idx];
+ return NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr)
+{
+ struct ftrace_stackmap *smap;
+ unsigned int bits;
+
+ smap = kzalloc_obj(*smap, GFP_KERNEL);
+ if (!smap)
+ return ERR_PTR(-ENOMEM);
+
+ /* Defensive clamp: reject bogus bits even if early_param is bypassed. */
+ bits = clamp_val(stackmap_map_bits,
+ FTRACE_STACKMAP_BITS_MIN,
+ FTRACE_STACKMAP_BITS_MAX);
+
+ smap->tr = tr;
+ smap->map_bits = bits;
+ smap->max_elts = 1U << bits;
+ smap->map_size = 1U << (bits + 1); /* 2x over-provision */
+
+ smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+ if (!smap->entries) {
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /*
+ * Single large vmalloc of the element pool, indexed flat.
+ * At bits=18 this is 256K * sizeof(struct stackmap_elt). The
+ * struct is ~520 B (8 + 4 + 4 + 64*8), so total ~135 MB.
+ */
+ smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts);
+ if (!smap->elts) {
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ smap->successes = alloc_percpu(local_t);
+ if (!smap->successes) {
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+ smap->drops = alloc_percpu(local_t);
+ if (!smap->drops) {
+ free_percpu(smap->successes);
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ smap->hash_seed = get_random_u32();
+ atomic_set(&smap->next_elt, 0);
+ atomic_set(&smap->resetting, 0);
+ init_rwsem(&smap->reader_sem);
+
+ return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+ if (!smap || IS_ERR(smap))
+ return;
+ free_percpu(smap->drops);
+ free_percpu(smap->successes);
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+}
+
+/**
+ * ftrace_stackmap_reset - clear all entries in the stackmap
+ * @smap: the stackmap to reset
+ *
+ * Returns 0 on success, -EBUSY if another reset is already in
+ * progress, or if tracing is currently active on the owning
+ * trace_array.
+ *
+ * Online reset (with tracing active) is not supported. Caller must
+ * stop tracing first (echo 0 > tracing_on).
+ *
+ * Caller is process context (typically sysfs write handler).
+ *
+ * Protocol:
+ * 1. Atomically claim reset rights via cmpxchg on @resetting.
+ * 2. Take trace_types_lock to serialize against tracefs writes to
+ * tracing_on.
+ * 3. Verify tracing is stopped on @smap->tr; if not, release the
+ * claim and return -EBUSY. The resetting flag itself blocks
+ * any subsequent get_id() callers.
+ * 4. synchronize_rcu() drains in-flight get_id() callers from the
+ * ftrace callback path (which runs preempt-disabled).
+ * 5. Reset the ring buffer(s), then memset entries, elts, and
+ * counters.
+ * 6. Release the resetting flag with release semantics so any new
+ * get_id() observes a fully cleared map.
+ */
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+ struct trace_array *tr;
+ int ret = 0;
+
+ if (!smap)
+ return 0;
+
+ if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0)
+ return -EBUSY;
+
+ mutex_lock(&trace_types_lock);
+
+ tr = smap->tr;
+ if (tr && tracer_tracing_is_on(tr)) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ /*
+ * synchronize_rcu() itself is a full barrier; no extra smp_mb()
+ * is needed before it. It drains in-flight ftrace callbacks that
+ * may have already passed the resetting check with the old value.
+ */
+ synchronize_rcu();
+
+ /*
+ * Take the reader_sem in exclusive mode. This serializes the
+ * memset against any tracefs reader (seq_file iteration or
+ * stack_map_bin snapshot) that may currently hold the rwsem
+ * for read. synchronize_rcu() already drained the hot path;
+ * this rwsem covers process-context readers that aren't
+ * preempt-disabled.
+ */
+ down_write(&smap->reader_sem);
+
+ /*
+ * Clear the ring buffer(s) BEFORE the map, both under the write
+ * lock. The ring buffer may still hold TRACE_STACK_ID events
+ * whose stack_id points at slots we are about to free/reuse.
+ * Resetting the buffer first guarantees an external observer
+ * never sees the inconsistent "trace still has <stack_id N> but
+ * the map is already empty" window: it sees either (old buffer,
+ * old map) or (cleared buffer, old map) or (cleared buffer,
+ * cleared map) -- never (old buffer, cleared map).
+ *
+ * Use tracing_reset_all_cpus() (not _online_cpus) so per-CPU
+ * buffers belonging to currently offline CPUs are also cleared.
+ * The ring buffer is allocated per-possible-CPU; an offline CPU's
+ * buffer can still hold a TRACE_STACK_ID event written before
+ * the CPU went offline. tracing_reset_online_cpus() iterates
+ * for_each_online_buffer_cpu() and would leave that data behind
+ * to be observed once the CPU comes back online (or by the
+ * trace reader, which iterates all allocated CPU buffers),
+ * recreating the stale-stack_id window we are trying to close.
+ *
+ * Since reset requires tracing to be stopped, this makes "reset"
+ * an explicitly destructive operation on the owning trace_array,
+ * keeping ring-buffer stack_ids and the map coherent.
+ */
+ if (tr) {
+ tracing_reset_all_cpus(&tr->array_buffer);
+#ifdef CONFIG_TRACER_SNAPSHOT
+ if (tr->allocated_snapshot)
+ tracing_reset_all_cpus(&tr->snapshot_buffer);
+#endif
+ }
+
+ memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+ memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts);
+
+ atomic_set(&smap->next_elt, 0);
+ {
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ local_set(per_cpu_ptr(smap->successes, cpu), 0);
+ local_set(per_cpu_ptr(smap->drops, cpu), 0);
+ }
+ }
+
+ up_write(&smap->reader_sem);
+
+out_unlock:
+ mutex_unlock(&trace_types_lock);
+
+ /* Release resetting=0 so new get_id() observes a cleared map. */
+ atomic_set_release(&smap->resetting, 0);
+ return ret;
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries)
+{
+ u32 key_hash, idx, test_key, trace_len;
+ struct stackmap_entry *entry;
+ struct stackmap_elt *val;
+ int probes = 0;
+
+ /*
+ * atomic_read_acquire() pairs with atomic_set_release() in the
+ * reset path. This ensures that subsequent reads of entry->key
+ * and entry->val are ordered after this check; without acquire,
+ * the CPU would only have a control dependency, which orders
+ * subsequent stores but not loads (per LKMM).
+ */
+ if (!smap || !nr_entries || atomic_read_acquire(&smap->resetting))
+ return -EINVAL;
+ /*
+ * Never truncate: a stack deeper than the map can hold must not be
+ * silently shortened, or two distinct traces sharing their first
+ * FTRACE_STACKMAP_MAX_DEPTH frames would be merged into one
+ * stack_id. The caller is expected to fall back to a full stack
+ * trace for such events. Reject defensively in case of a future
+ * caller that forgets this contract.
+ */
+ if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+ return -E2BIG;
+
+ trace_len = nr_entries * sizeof(unsigned long);
+ /*
+ * jhash2() requires the length in u32 units and the data to be
+ * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+ * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+ * directly; the cast to u32* is safe because ips[] is naturally
+ * aligned to sizeof(unsigned long) >= 4.
+ */
+ key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+ smap->hash_seed);
+ if (key_hash == 0)
+ key_hash = 1; /* 0 means free slot */
+
+ idx = key_hash >> (32 - (smap->map_bits + 1));
+
+ while (probes < FTRACE_STACKMAP_MAX_PROBE) {
+ idx &= (smap->map_size - 1);
+ entry = &smap->entries[idx];
+ /*
+ * READ_ONCE() to avoid LKMM data race with concurrent
+ * cmpxchg(&entry->key, 0, key_hash) on this slot.
+ */
+ test_key = READ_ONCE(entry->key);
+
+ if (test_key == key_hash) {
+ val = stackmap_load_elt(entry);
+ /*
+ * READ_ONCE(val->nr) keeps style consistent with
+ * the seq_show / bin_open readers. nr is write-once
+ * (set before publish, never modified afterwards),
+ * so the load is data-race-free, but READ_ONCE
+ * silences any analysis tool that flags a plain
+ * read of a field that is also read under acquire
+ * elsewhere.
+ */
+ if (val && READ_ONCE(val->nr) == nr_entries &&
+ memcmp(val->ips, ips, trace_len) == 0) {
+ /*
+ * ref_count is a best-effort popularity
+ * counter. On a long (from-boot, multi-hour)
+ * trace a hot stack can be hit billions of
+ * times. atomic_add_unless() gives true
+ * saturation at INT_MAX even under concurrent
+ * hits on multiple CPUs (a plain
+ * check-then-inc could let several CPUs past
+ * the check near the cap and still wrap).
+ */
+ atomic_add_unless(&val->ref_count, 1, INT_MAX);
+ /*
+ * successes/drops are best-effort throughput
+ * counters. Saturate at LONG_MAX so they do
+ * not wrap on long runs (notably where local_t
+ * is 32-bit), matching ref_count's behaviour.
+ */
+ local_add_unless(this_cpu_ptr(smap->successes),
+ 1, LONG_MAX);
+ return (int)idx;
+ }
+ /*
+ * val == NULL: another CPU is mid-insert, or this
+ * slot is "claimed but empty" (pool exhausted).
+ * val != NULL but mismatch: 32-bit hash collision
+ * with a different stack. In both cases, advance.
+ */
+ } else if (!test_key) {
+ /*
+ * Free slot: try to claim it.
+ *
+ * If two CPUs race here with the same key_hash
+ * (same stack), one loses the cmpxchg, advances,
+ * and may insert the same stack at a later slot.
+ * This can produce a small number of duplicate
+ * entries under heavy contention. The trade-off
+ * is accepted to keep the hot path lock-free;
+ * ref_count is split across the duplicates and
+ * total memory cost is bounded by the element
+ * pool size.
+ */
+ if (cmpxchg(&entry->key, 0, key_hash) == 0) {
+ struct stackmap_elt *elt;
+
+ elt = stackmap_get_elt(smap);
+ if (!elt) {
+ /*
+ * Pool exhausted. We claimed this
+ * slot with cmpxchg but cannot fill
+ * it. Leave key set so the slot
+ * stays "claimed but empty" — future
+ * lookups treat val==NULL as a miss
+ * and probe past it. Cannot revert
+ * key=0 without racing other CPUs.
+ */
+ local_add_unless(this_cpu_ptr(smap->drops),
+ 1, LONG_MAX);
+ return -ENOSPC;
+ }
+
+ elt->nr = nr_entries;
+ atomic_set(&elt->ref_count, 1);
+ memcpy(elt->ips, ips, trace_len);
+
+ /*
+ * Publish elt with release semantics so the
+ * reader's smp_load_acquire can safely
+ * dereference val->nr / val->ips.
+ */
+ smp_store_release(&entry->val, elt);
+ local_add_unless(this_cpu_ptr(smap->successes),
+ 1, LONG_MAX);
+ return (int)idx;
+ }
+ /* cmpxchg failed; another CPU claimed this slot. */
+ }
+
+ idx++;
+ probes++;
+ }
+
+ local_add_unless(this_cpu_ptr(smap->drops), 1, LONG_MAX);
+ return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+ struct ftrace_stackmap *smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ /*
+ * Take the reader_sem to serialize against ftrace_stackmap_reset(),
+ * which holds it for write while clearing the table. Released in
+ * stackmap_seq_stop(), which seq_file calls regardless of whether
+ * start() returned an element or NULL (per Documentation/filesystems
+ * /seq_file.rst: "the iterator value returned by start() or next()
+ * is guaranteed to be passed to a subsequent next() or stop()").
+ */
+ down_read(&smap->reader_sem);
+ for (i = *pos; i < smap->map_size; i++) {
+ if (READ_ONCE(smap->entries[i].key) &&
+ stackmap_load_elt(&smap->entries[i])) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ for (i = *pos + 1; i < smap->map_size; i++) {
+ if (READ_ONCE(smap->entries[i].key) &&
+ stackmap_load_elt(&smap->entries[i])) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ /*
+ * Advance *pos past the end so that on the next read() the
+ * subsequent stackmap_seq_start() call returns NULL and the
+ * iteration terminates. Without this, seq_read() would loop
+ * on the last element.
+ */
+ *pos = smap->map_size;
+ return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+
+ /*
+ * seq_file invokes stop() unconditionally after each iteration
+ * pass (see seq_read_iter / traverse), even when start() returned
+ * NULL. Always release here, balanced against the down_read in
+ * stackmap_seq_start().
+ */
+ if (smap)
+ up_read(&smap->reader_sem);
+}
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+ struct stackmap_entry *entry = v;
+ struct stackmap_seq_private *priv = m->private;
+ struct stackmap_elt *elt;
+ u32 idx = entry - priv->smap->entries;
+ u32 i, nr;
+
+ elt = stackmap_load_elt(entry);
+ if (!elt)
+ return 0;
+
+ nr = READ_ONCE(elt->nr);
+ if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+ nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+ seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+ idx, atomic_read(&elt->ref_count), nr);
+ for (i = 0; i < nr; i++) {
+ unsigned long ip = elt->ips[i];
+
+ /*
+ * Mirror trace_stack_print(): __ftrace_trace_stack()
+ * may replace trampoline addresses with
+ * FTRACE_TRAMPOLINE_MARKER before the stack reaches the
+ * map, and normal addresses must go through
+ * trace_adjust_address() (KASLR / module text delta)
+ * before symbolization. Without this the export would
+ * print a bogus symbol for the marker and unadjusted
+ * addresses for everything else.
+ */
+ if (ip == FTRACE_TRAMPOLINE_MARKER) {
+ seq_printf(m, " [%u] [FTRACE TRAMPOLINE]\n", i);
+ continue;
+ }
+ seq_printf(m, " [%u] %pS\n", i,
+ (void *)trace_adjust_address(priv->smap->tr, ip));
+ }
+ seq_putc(m, '\n');
+ return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+ .start = stackmap_seq_start,
+ .next = stackmap_seq_next,
+ .stop = stackmap_seq_stop,
+ .show = stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+ struct stackmap_seq_private *priv;
+ struct seq_file *m;
+ int ret;
+
+ ret = seq_open_private(file, &stackmap_seq_ops,
+ sizeof(struct stackmap_seq_private));
+ if (ret)
+ return ret;
+ m = file->private_data;
+ priv = m->private;
+ priv->smap = inode->i_private;
+ return 0;
+}
+
+/*
+ * Accept exactly "0" or "reset" (optionally followed by a single newline).
+ */
+static bool stackmap_write_is_reset(const char *buf, size_t n)
+{
+ if (n > 0 && buf[n - 1] == '\n')
+ n--;
+ return (n == 1 && buf[0] == '0') ||
+ (n == 5 && memcmp(buf, "reset", 5) == 0);
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct seq_file *m = file->private_data;
+ struct stackmap_seq_private *priv = m->private;
+ char buf[8];
+ size_t n = min(count, sizeof(buf) - 1);
+ int ret;
+
+ if (n == 0)
+ return -EINVAL;
+ if (copy_from_user(buf, ubuf, n))
+ return -EFAULT;
+ buf[n] = '\0';
+
+ if (!stackmap_write_is_reset(buf, n))
+ return -EINVAL;
+
+ /*
+ * ftrace_stackmap_reset() atomically claims reset rights via
+ * cmpxchg and returns -EBUSY if another reset is in progress
+ * or if tracing is active.
+ */
+ ret = ftrace_stackmap_reset(priv->smap);
+ if (ret)
+ return ret;
+ return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+ .open = stackmap_open,
+ .read = seq_read,
+ .write = stackmap_write,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+ struct ftrace_stackmap *smap = m->private;
+ u64 successes = 0, drops = 0;
+ u32 entries;
+ int cpu;
+
+ if (!smap) {
+ seq_puts(m, "stackmap not initialized\n");
+ return 0;
+ }
+
+ entries = atomic_read(&smap->next_elt);
+ for_each_possible_cpu(cpu) {
+ successes += local_read(per_cpu_ptr(smap->successes, cpu));
+ drops += local_read(per_cpu_ptr(smap->drops, cpu));
+ }
+
+ seq_printf(m, "entries: %u / %u\n", entries, smap->max_elts);
+ seq_printf(m, "table_size: %u\n", smap->map_size);
+ seq_printf(m, "successes: %llu\n", successes);
+ seq_printf(m, "drops: %llu\n", drops);
+ if (successes + drops > 0)
+ seq_printf(m, "success_rate: %llu%%\n",
+ successes * 100 / (successes + drops));
+ return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+ .open = stackmap_stat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+ /*
+ * Use u64 (not size_t) so data[] is 8-byte aligned on both
+ * 32-bit and 64-bit architectures. The IP array within data[]
+ * is accessed as u64*, which would alignment-fault on strict
+ * architectures (e.g. older ARM, SPARC) if data[] started at
+ * a 4-byte boundary.
+ */
+ u64 size;
+ char data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+ struct ftrace_stackmap *smap = inode->i_private;
+ struct stackmap_bin_snapshot *snap;
+ struct ftrace_stackmap_bin_header *hdr;
+ size_t alloc_size, off;
+ u32 nr_entries, i, nr_stacks;
+
+ if (!smap)
+ return -ENODEV;
+
+ /*
+ * Worst-case allocation size: every populated entry uses a
+ * full-depth stack. The (+1) gives one slack slot in case a
+ * concurrent insert lands between this snapshot and iteration.
+ * The loop below performs an explicit bounds check anyway.
+ *
+ * At bits=18 this caps at ~135 MB. The file is mode 0440
+ * (TRACE_MODE_READ), so only privileged users can open it.
+ */
+ nr_entries = atomic_read(&smap->next_elt);
+ alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+ (sizeof(struct ftrace_stackmap_bin_entry) +
+ FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+
+ snap = vmalloc(sizeof(*snap) + alloc_size);
+ if (!snap)
+ return -ENOMEM;
+
+ hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+ hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+ hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+ hdr->reserved = 0;
+ off = sizeof(*hdr);
+ nr_stacks = 0;
+
+ /*
+ * Take reader_sem to serialize against ftrace_stackmap_reset(),
+ * which clears the table and elt pool under the write lock.
+ */
+ down_read(&smap->reader_sem);
+
+ for (i = 0; i < smap->map_size; i++) {
+ struct stackmap_entry *entry = &smap->entries[i];
+ struct stackmap_elt *elt;
+ struct ftrace_stackmap_bin_entry *e;
+ u64 *ips_out;
+ u32 k, nr;
+
+ if (!READ_ONCE(entry->key))
+ continue;
+ elt = stackmap_load_elt(entry);
+ if (!elt)
+ continue;
+
+ nr = READ_ONCE(elt->nr);
+ if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+ nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+ /* Bounds check: stop if we would overflow the allocation. */
+ if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size)
+ break;
+
+ e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+ e->stack_id = i;
+ e->nr = nr;
+ e->ref_count = atomic_read(&elt->ref_count);
+ e->reserved = 0;
+ off += sizeof(*e);
+
+ ips_out = (u64 *)(snap->data + off);
+ for (k = 0; k < nr; k++) {
+ unsigned long ip = elt->ips[k];
+
+ /*
+ * Emit the trampoline marker verbatim so userspace
+ * can render it as [FTRACE TRAMPOLINE]; pass every
+ * other address through trace_adjust_address() so the
+ * binary export follows the same address-adjustment
+ * rules as the text export.
+ */
+ if (ip == FTRACE_TRAMPOLINE_MARKER)
+ ips_out[k] = (u64)FTRACE_TRAMPOLINE_MARKER;
+ else
+ ips_out[k] = (u64)trace_adjust_address(smap->tr, ip);
+ }
+ off += nr * sizeof(u64);
+ nr_stacks++;
+ }
+
+ up_read(&smap->reader_sem);
+
+ hdr->nr_stacks = nr_stacks;
+ snap->size = off;
+ file->private_data = snap;
+ return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct stackmap_bin_snapshot *snap = file->private_data;
+
+ if (!snap)
+ return -EINVAL;
+ return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+ vfree(file->private_data);
+ return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+ .open = stackmap_bin_open,
+ .read = stackmap_bin_read,
+ .llseek = default_llseek,
+ .release = stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..7c2e5ab9d36d
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define FTRACE_STACKMAP_MAX_DEPTH 64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC 0x46534D42 /* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION 1
+
+struct ftrace_stackmap_bin_header {
+ u32 magic;
+ u32 version;
+ u32 nr_stacks;
+ u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+ u32 stack_id;
+ u32 nr;
+ u32 ref_count;
+ u32 reserved;
+ /* followed by u64 ips[nr] */
+};
+
+struct trace_array;
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries);
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *
+ftrace_stackmap_create(struct trace_array *tr) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+ unsigned long *ips, unsigned int n)
+{ return -EOPNOTSUPP; }
+static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
--
2.34.1
^ permalink raw reply related
* [RFC PATCH v4 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Li Pengfei @ 2026-06-16 6:41 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, Mark Rutland, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
lipengfei28, zhangbo56
From: Pengfei Li <lipengfei28@xiaomi.com>
Hi Masami, Steven, all,
This is v4 of the ftrace stackmap series. It is sent as a new thread.
v3: https://lore.kernel.org/all/cover.1779769138.git.lipengfei28@xiaomi.com/
The series adds stack trace deduplication to ftrace. When the
'stackmap' option is enabled alongside 'stacktrace', the ring buffer
stores a 4-byte stack_id instead of a full kernel stack trace, and the
full stacks are exported once via tracefs (stack_map / stack_map_bin).
Rebased onto v7.1-rc5 (e8c2f9fdadee).
Motivation
==========
The target use case is long-duration, from-boot kernel tracing where
the same stacks recur enormously often and the bottleneck is ring
buffer space, not CPU.
Concretely: tracing the slab allocator from boot for hours to study
memory aging and to catch the allocation backtraces behind a usage
peak. With a stacktrace trigger on the slab tracepoints, every event
today carries a full kernel stack (~80-160 bytes). On a fixed-size
ring buffer that bounds how far back in time the trace reaches: the
buffer wraps in seconds to minutes and the early-boot history -- the
part we care about -- is overwritten before it can be consumed.
In this workload the set of distinct stacks is small and highly
repetitive, so storing a 4-byte stack_id per event and the full stack
only once dramatically increases the time span a given buffer covers.
The intended operating model is exactly the low-overhead one ftrace is
good at: let the trace run for a long time producing a comparatively
small, dense log, then resolve stack_ids offline (cat stack_map, or
parse stack_map_bin with the included tool) during analysis.
This is complementary to, not a replacement for, the existing full
stack recording: deep stacks and the early pre-init window still fall
back to full stacks (see below).
Effect on retention
====================
Same fixed per-CPU buffer, slab allocation workload with a shallow
kernel stack (kmem_cache_alloc), stackmap OFF vs ON:
retained events bytes/event time span
stackmap OFF 645,068 ~104 B 15.0 s
stackmap ON 1,397,741 ~48 B 27.7 s
2.17x 2.17x 1.85x
The buffer holds ~2.17x more events and reaches ~1.85x further back in
time for the same memory. The win grows with stack depth and with how
repetitive the stacks are; for deep, highly-repeated stacks the
per-event size approaches the 4-byte stack_id plus event header.
Changes since v3
================
Correctness:
- Deep stacks are never silently truncated or merged. A stack deeper
than FTRACE_STACKMAP_MAX_DEPTH (64) now falls back to a full stack
trace; ftrace_stackmap_get_id() returns -E2BIG rather than
truncating, so two distinct stacks sharing their first 64 frames
can no longer collapse to one stack_id.
- reset is now genuinely destructive and coherent: under the
reader_sem write lock it clears the owning trace_array's ring
buffer (and snapshot) BEFORE clearing the map, and uses
tracing_reset_all_cpus() rather than _online_cpus() so a
TRACE_STACK_ID written by a now-offline CPU cannot survive and
dangle against a cleared map.
- __ftrace_trace_stack() reserves the TRACE_STACK_ID ring-buffer
slot before inserting into the map, so stack_map_stat counters and
ref_count stay consistent with what the ring buffer actually
references (failed reservation -> full stack, map untouched).
- ref_count / successes / drops now saturate (INT_MAX / LONG_MAX)
instead of wrapping on multi-hour, billions-of-hits traces.
Global-instance gating:
- Enabling 'stackmap' on a secondary instance via the aggregate
trace_options file is now rejected, not just hidden in the
per-instance options/ directory.
- tracefs init is failure-atomic: the required stack_map file is
created before the map pointer is published; if it cannot be
created the map is destroyed and never published. An init-state
(PENDING/DONE/FAILED) lets boot-time trace_options=stackmap set
the flag before the map exists (hot path falls back until it is
published) while still rejecting enables after a permanent init
failure, so options/stackmap never reports an enabled no-op.
ABI / tooling:
- Binary magic corrected to 0x46534D42 ('FSMB'); version is 1 (first
upstream ABI). Documentation, tool and selftest updated to match.
- Text and binary exports now follow the same trampoline-marker and
trace_adjust_address() handling as the normal stack print path.
- stackmap_dump.py emits hex addresses in 'ips' and shows the ftrace
trampoline marker only in the resolved 'symbols'.
Selftests:
- New stackmap-reset.tc: verifies reset clears stale <stack_id N>
from the trace buffer and checks the stack_map_bin magic/version.
- stackmap-instance-gate.tc extended to verify the trace_options
write path is rejected on a secondary instance.
- stackmap-basic.tc no longer treats a nonzero drops count as a
failure (drops is a by-design fallback); only zero successes with
nonzero drops is fatal.
Open questions for maintainers
==============================
Two design points where I would value direction before polishing
further:
1. Eager vs lazy allocation. The element pool is allocated at
fs_initcall when CONFIG_FTRACE_STACKMAP=y, regardless of whether
userspace ever enables the option (~8 MB at the default bits=14,
up to ~135 MB at bits=18). This keeps the hot path allocation-free
with no allocation-failure path under tracing pressure. Is eager
allocation acceptable, or would you prefer lazy allocation on the
first 'echo 1 > options/stackmap'?
2. Binary ABI now or later. stack_map_bin is a new tracefs binary
interface (magic 0x46534D42, version 1). Is it acceptable to
introduce it now, or would you prefer the first version ship with
the text stack_map interface only and add the binary export once
trace-cmd / libtraceevent integration is designed?
Test results
============
QEMU (aarch64 virt, v7.1-rc5 + this series), boot to init smoke test:
- stackmap functional suite: 16/16 PASS, including reset clearing the
trace buffer (stale <stack_id> count 48 -> 0), stack_map_bin
magic/version, global-vs-secondary instance gating, and the
trace_options rejection on a secondary instance.
- boot-time activation (trace_options=stackmap,stacktrace on the
kernel cmdline): 3/3 PASS -- the option survives the
pre-initialization window and the map is live once published.
- ftrace startup self-tests pass with the new TRACE_STACK_ID entry.
Device retention numbers above were collected on a Xiaomi SM8850
(ARM64) running an Android workload, comparing the same buffer with
the option off and on.
Known limitations
=================
- Per-instance stackmap support is not included; the option is gated
to the global instance (in the tracefs layout and at the
set_tracer_flag() write path). Per-instance maps are a follow-up.
- Deduplication is best-effort, not strict: under heavy concurrent
contention two CPUs racing with the same stack hash may each claim a
different slot, producing a few duplicate entries; ref_count is then
split across them. This keeps the hot path lock-free.
- The stackmap covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, serialized against reset
but not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once the
binary format settles (see open question 2).
Usage
=====
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
Pengfei Li (3):
trace: add lock-free stackmap for stack trace deduplication
trace: integrate stackmap into ftrace stack recording path
trace: add documentation, selftest and tooling for stackmap
Documentation/trace/ftrace-stackmap.rst | 177 ++++
Documentation/trace/index.rst | 1 +
kernel/trace/Kconfig | 22 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 216 ++++-
kernel/trace/trace.h | 17 +
kernel/trace/trace_entries.h | 15 +
kernel/trace/trace_output.c | 23 +
kernel/trace/trace_selftest.c | 1 +
kernel/trace/trace_stackmap.c | 889 ++++++++++++++++++
kernel/trace/trace_stackmap.h | 57 ++
.../ftrace/test.d/ftrace/stackmap-basic.tc | 111 +++
.../test.d/ftrace/stackmap-instance-gate.tc | 54 ++
.../ftrace/test.d/ftrace/stackmap-reset.tc | 76 ++
tools/tracing/stackmap_dump.py | 164 ++++
15 files changed, 1821 insertions(+), 3 deletions(-)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
create mode 100755 tools/tracing/stackmap_dump.py
--
2.34.1
^ permalink raw reply
* Re: [PATCH] tracing: eprobe: read the complete FILTER_PTR_STRING pointer
From: Masami Hiramatsu @ 2026-06-16 2:09 UTC (permalink / raw)
To: Martin Kaiser; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <20260615145500.2662456-1-martin@kaiser.cx>
On Mon, 15 Jun 2026 16:54:12 +0200
Martin Kaiser <martin@kaiser.cx> wrote:
> For a char * element in an event, the FILTER_PTR_STRING filter type is
> used. When the event occurs, a pointer is stored in the ringbuffer.
>
> If an eprobe references such a char * element of a "base event" and
> decodes the pointer as string, the pointer cannot be dereferenced.
>
> $ echo 'e syscalls.sys_enter_openat $filename:string' > \
> /sys/kernel/tracing/dynamic_events
> $ trace-cmd start -e eprobes
> $ trace-cmd show
> ... : sys_enter_openat: (syscalls.sys_enter_openat) arg1=(fault)
>
> The problem is in get_event_field
>
> val = (unsigned long)(*(char *)addr);
>
> addr points to the position in the ringbuffer where the pointer was
> stored. We must read the complete pointer, not just the lowest byte.
>
> Fix the assignment, make the example above work.
>
Ah, this is a bit complicated. It seems to work with sched_switch event
as commit f04dec93466a ("tracing/eprobes: Fix reading of string fields"):
echo 'e:sw sched/sched_switch comm=$next_comm:string' > dynamic_events
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
sh-162 [002] d..3. 54.027213: sw: (sched.sched_switch) comm="swapper/2"
<idle>-0 [007] d..3. 54.034573: sw: (sched.sched_switch) comm="rcu_preempt"
rcu_preempt-15 [007] d..3. 54.034589: sw: (sched.sched_switch) comm="swapper/7"
Maybe comm is stored as a fixed string information in the event record?
/sys/kernel/tracing # cat events/sched/sched_switch/format
name: sched_switch
ID: 254
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char prev_comm[16]; offset:8; size:16; signed:0;
field:pid_t prev_pid; offset:24; size:4; signed:1;
field:int prev_prio; offset:28; size:4; signed:1;
field:long prev_state; offset:32; size:8; signed:1;
field:char next_comm[16]; offset:40; size:16; signed:0;
field:pid_t next_pid; offset:56; size:4; signed:1;
field:int next_prio; offset:60; size:4; signed:1;
But the filename is a pointer.
/sys/kernel/tracing # cat events/syscalls/sys_enter_openat/format
name: sys_enter_openat
ID: 705
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int __syscall_nr; offset:8; size:4; signed:1;
field:int dfd; offset:16; size:8; signed:0;
field:const char * filename; offset:24; size:8; signed:0;
field:int flags; offset:32; size:8; signed:0;
field:umode_t mode; offset:40; size:8; signed:0;
field:__data_loc char[] __filename_val; offset:48; size:4; signed:0;
In this case, the filename field should use __data_loc directly instead of
pointing data on the ring buffer.
Can you try
echo 'e syscalls.sys_enter_openat $__filename_val:string' > \
/sys/kernel/tracing/dynamic_events
Instead?
I think better solution is fixing sycall tracer.
Thanks,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v4 6/7] tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
From: Masami Hiramatsu @ 2026-06-16 1:15 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178148609402.185520.8189233495763938815.stgit@devnote2>
On Mon, 15 Jun 2026 10:14:54 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> + case FETCH_OP_DEREF_CPU:
> + val = (unsigned long)this_cpu_ptr((void __percpu *)val);
> + ret = probe_mem_read(&val, (void *)val, sizeof(val));
> + break;
> + case FETCH_OP_CPU_PTR:
> + val = (unsigned long)this_cpu_ptr((void __percpu *)val);
> + ret = 0;
> + break;
Hmm, maybe I can just convert the FETCH_OP_DEREF_CPU to
FETCH_OP_CPU_PTR + FETCH_OP_DEREF to simply the code.
> + default:
> + lval = llval;
> + goto out;
> + }
> if (ret)
> return ret;
> + llval = lval;
> code++;
> } while (1);
> +out:
>
> s3 = code;
> stage3:
> @@ -181,6 +195,10 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
> case FETCH_OP_ST_UMEM:
> probe_mem_read_user(dest, (void *)val + code->offset, code->size);
> break;
> + case FETCH_OP_ST_CPUMEM:
> + val = (unsigned long)this_cpu_ptr((void __percpu *)val);
> + probe_mem_read(dest, (void *)val, code->size);
> + break;
Then, I can just drop this change.
Thanks,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH 3/3] rv/reactors: add KUnit tests for reactor_panic
From: wen.yang @ 2026-06-15 16:44 UTC (permalink / raw)
To: Gabriele Monaco; +Cc: Nam Cao, linux-trace-kernel, linux-kernel, Wen Yang
In-Reply-To: <cover.1781541556.git.wen.yang@linux.dev>
From: Wen Yang <wen.yang@linux.dev>
Add KUnit tests for the panic reactor covering:
- Reactor registration and unregistration lifecycle
- Panic notifier chain reachability
The real rv_panic_reaction() calls vpanic(), which is __noreturn and
halts the system. KUnit cannot test across that boundary. Instead, the
test drives atomic_notifier_call_chain(&panic_notifier_list, ...) directly
with a high-priority mock notifier that returns NOTIFY_STOP, verifying
that the panic notification propagates without triggering real handlers
(kdump, watchdog, reboot).
The mock notifier busy-waits to simulate real handler execution time
(e.g., crash_save_vmcoreinfo, emergency_restart preamble) under the
panic context constraints.
Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
kernel/trace/rv/Kconfig | 10 +++
kernel/trace/rv/Makefile | 1 +
kernel/trace/rv/reactor_panic_kunit.c | 106 ++++++++++++++++++++++++++
3 files changed, 117 insertions(+)
create mode 100644 kernel/trace/rv/reactor_panic_kunit.c
diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index ff47895c897f..6c6c43c5f86c 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -121,3 +121,13 @@ config RV_REACT_PANIC
help
Enables the panic reactor. The panic reactor emits a printk()
message if an exception is found and panic()s the system.
+
+config RV_REACT_PANIC_KUNIT
+ bool "KUnit tests for reactor_panic" if !KUNIT_ALL_TESTS
+ depends on RV_REACT_PANIC && KUNIT
+ default KUNIT_ALL_TESTS
+ help
+ This builds KUnit tests for the panic reactor. These are only
+ for development and testing, not for regular kernel use cases.
+
+ If unsure, say N.
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index ef0a2dcb927c..2ebfe5e5068c 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -25,3 +25,4 @@ obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
obj-$(CONFIG_RV_REACT_PRINTK_KUNIT) += reactor_printk_kunit.o
obj-$(CONFIG_RV_REACT_PANIC) += reactor_panic.o
+obj-$(CONFIG_RV_REACT_PANIC_KUNIT) += reactor_panic_kunit.o
diff --git a/kernel/trace/rv/reactor_panic_kunit.c b/kernel/trace/rv/reactor_panic_kunit.c
new file mode 100644
index 000000000000..f9a09ae7aaad
--- /dev/null
+++ b/kernel/trace/rv/reactor_panic_kunit.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit tests for reactor_panic
+ *
+ */
+
+#include <kunit/test.h>
+#include <linux/rv.h>
+#include <linux/panic_notifier.h>
+#include <linux/notifier.h>
+#include <linux/limits.h>
+#include <linux/sched/clock.h>
+#include <linux/processor.h>
+
+/* Simulated execution time for mock panic notifier (nanoseconds). */
+#define RV_PANIC_NOTIFIER_EXEC_NS 2000000ULL
+
+/* Test state */
+static struct {
+ bool notifier_called;
+} panic_test_state;
+
+/*
+ * Mock panic notifier callback.
+ *
+ * Runs at INT_MAX priority and returns NOTIFY_STOP to prevent real panic
+ * handlers (kdump, watchdog) from executing during the test. Busy-waits
+ * RV_PANIC_NOTIFIER_EXEC_NS to simulate a real handler's execution time.
+ */
+static int mock_panic_notifier_fn(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ char *msg = data;
+ u64 start = sched_clock();
+
+ panic_test_state.notifier_called = true;
+ pr_emerg("KUnit: reactor_panic test intercepted panic notifier: %s\n",
+ msg ? msg : "(no message)");
+
+ while (sched_clock() - start < RV_PANIC_NOTIFIER_EXEC_NS)
+ cpu_relax();
+
+ return NOTIFY_STOP;
+}
+
+static struct notifier_block mock_panic_nb = {
+ .notifier_call = mock_panic_notifier_fn,
+ .priority = INT_MAX,
+};
+
+static struct rv_reactor mock_panic_reactor = {
+ .name = "test_panic",
+ .description = "test panic reactor",
+};
+
+static int reactor_panic_kunit_init(struct kunit *test)
+{
+ panic_test_state.notifier_called = false;
+ return 0;
+}
+
+/* Test 1: register and unregister reactor */
+static void test_panic_register_unregister(struct kunit *test)
+{
+ int ret;
+
+ ret = rv_register_reactor(&mock_panic_reactor);
+ KUNIT_EXPECT_EQ(test, ret, 0);
+ KUNIT_EXPECT_STREQ(test, mock_panic_reactor.name, "test_panic");
+
+ rv_unregister_reactor(&mock_panic_reactor);
+}
+
+/*
+ * Test 2: panic notifier chain is reachable.
+ *
+ * vpanic() calls atomic_notifier_call_chain(&panic_notifier_list, ...).
+ * Drive the chain directly to verify panic notifiers receive the notification —
+ * the observable side-effect of reactor_panic without halting the system.
+ */
+static void test_panic_notifier_called(struct kunit *test)
+{
+ atomic_notifier_chain_register(&panic_notifier_list, &mock_panic_nb);
+ atomic_notifier_call_chain(&panic_notifier_list, 0,
+ "panic violation message");
+ atomic_notifier_chain_unregister(&panic_notifier_list, &mock_panic_nb);
+
+ KUNIT_EXPECT_TRUE(test, panic_test_state.notifier_called);
+}
+
+static struct kunit_case reactor_panic_kunit_cases[] = {
+ KUNIT_CASE(test_panic_register_unregister),
+ KUNIT_CASE(test_panic_notifier_called),
+ {}
+};
+
+static struct kunit_suite reactor_panic_kunit_suite = {
+ .name = "rv_reactor_panic",
+ .init = reactor_panic_kunit_init,
+ .test_cases = reactor_panic_kunit_cases,
+};
+
+kunit_test_suite(reactor_panic_kunit_suite);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("KUnit tests for reactor_panic");
--
2.25.1
^ permalink raw reply related
* [PATCH 0/3] rv/reactors: fix lockdep warning and add KUnit tests
From: wen.yang @ 2026-06-15 16:44 UTC (permalink / raw)
To: Gabriele Monaco; +Cc: Nam Cao, linux-trace-kernel, linux-kernel, Wen Yang
From: Wen Yang <wen.yang@linux.dev>
We occasionally hit a lockdep "Invalid wait context" warning in production
environments when rv_react() callbacks are interrupted.
The bug is intermittent in production. KUnit tests with busy-wait callbacks
can reproduce it by holding the CPU long enough for a timer interrupt to fire
during rv_react(), exposing the lockdep constraint violation:
[ 44.820913] =============================
[ 44.820923] [ BUG: Invalid wait context ]
[ 44.821137] 7.1.0-rc7-next-20260612-virtme #6 Tainted: G N
[ 44.821203] -----------------------------
[ 44.821211] kunit_try_catch/209 is trying to lock:
[ 44.821244] ffff8a743ed3e8a0 (&rq->__lock){-...}-{2:2}, at: __schedule+0x102/0x13d0
[ 44.821688] other info that might help us debug this:
[ 44.821708] context-{5:5}
[ 44.821730] 1 lock held by kunit_try_catch/209:
[ 44.821745] #0: ffffffffb6ba62c0 (rv_react_map-wait-type-override){+.+.}-{1:1}, at: rv_react+0x9d/0xf0
[ 44.821803] stack backtrace:
[ 44.822110] CPU: 10 UID: 0 PID: 209 Comm: kunit_try_catch Tainted: G N 7.1.0-rc7-next-20260612-virtme #6 PREEMPT_{RT,(full)}
[ 44.822197] Tainted: [N]=TEST
[ 44.822210] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 44.822328] Call Trace:
[ 44.822377] <TASK>
[ 44.822806] dump_stack_lvl+0x78/0xe0
[ 44.822860] __lock_acquire+0x926/0x1c90
[ 44.822888] lock_acquire+0xd3/0x310
[ 44.822901] ? __schedule+0x102/0x13d0
[ 44.822919] ? rcu_qs+0x2d/0x1a0
[ 44.822954] _raw_spin_lock_nested+0x36/0x50
[ 44.822966] ? __schedule+0x102/0x13d0
[ 44.822979] __schedule+0x102/0x13d0
[ 44.822993] ? mark_held_locks+0x40/0x70
[ 44.823009] preempt_schedule_irq+0x37/0x70
[ 44.823018] irqentry_exit+0x1da/0x8c0
[ 44.823032] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 44.823093] RIP: 0010:mock_printk_react+0x2a/0x50
[ 44.823250] Code: f3 0f 1e fa 0f 1f 44 00 00 41 54 49 89 f4 55 48 89 fd 53 e8 18 8b db ff 4c 89 e6 48 89 ef 48 89 c3 e8 fa 8e ed ff eb 02 f3 90 <e8> 01 8b db ff 48 29 d8 48 3d 3f 4b 4c 00 76 ee 5b 5d 41 5c c3 cc
[ 44.823303] RSP: 0018:ffffd1c3c0733d38 EFLAGS: 00000297
[ 44.823332] RAX: 00000000000119f3 RBX: 0000000a74e60d1c RCX: 000000000000001f
[ 44.823342] RDX: 0000000000000000 RSI: 000000003348c8a2 RDI: ffffffffc1abbfd9
[ 44.823351] RBP: ffffffffb671b613 R08: 0000000000000002 R09: 0000000000000000
[ 44.823359] R10: 0000000000000001 R11: 0000000000000000 R12: ffffd1c3c0733d60
[ 44.823367] R13: ffffffffb575a5fd R14: ffffd1c3c0017be8 R15: ffffd1c3c00179f8
[ 44.823397] ? rv_react+0x9d/0xf0
[ 44.823437] ? mock_printk_react+0x2f/0x50
[ 44.823448] rv_react+0xb4/0xf0
[ 44.823455] ? rv_react+0x9d/0xf0
[ 44.823476] test_printk_react_called+0x83/0xb0
[ 44.823486] ? __pfx_mock_printk_react+0x10/0x10
[ 44.823502] ? __pfx_mock_printk_react+0x10/0x10
[ 44.823513] kunit_try_run_case+0x97/0x190
[ 44.823534] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
[ 44.823544] kunit_generic_run_threadfn_adapter+0x21/0x40
[ 44.823551] kthread+0x124/0x160
[ 44.823562] ? __pfx_kthread+0x10/0x10
[ 44.823574] ret_from_fork+0x291/0x3b0
[ 44.823585] ? __pfx_kthread+0x10/0x10
[ 44.823595] ret_from_fork_asm+0x1a/0x30
[ 44.823641] </TASK>
Patch 1 fixes the lockdep bug by correcting rv_react()'s wait_type_inner
from LD_WAIT_CONFIG (which inherits the outer context) to LD_WAIT_SPIN
(the tightest constraint callbacks must satisfy).
Patch 2 adds KUnit tests for reactor_printk. The busy-wait in the mock
callback reproduces the timer interrupt scenario that exposes the bug.
Patch 3 adds KUnit tests for reactor_panic, exercising the panic notifier
chain without halting the system.
Tested with CONFIG_PROVE_LOCKING=y and CONFIG_KUNIT=y.
Wen Yang (3):
rv/reactors: fix lockdep "Invalid wait context" in rv_react()
rv/reactors: add KUnit tests for reactor_printk
rv/reactors: add KUnit tests for reactor_panic
kernel/trace/rv/Kconfig | 20 ++++
kernel/trace/rv/Makefile | 2 +
kernel/trace/rv/reactor_panic_kunit.c | 106 +++++++++++++++++++++
kernel/trace/rv/reactor_printk_kunit.c | 123 +++++++++++++++++++++++++
kernel/trace/rv/rv_reactors.c | 8 +-
5 files changed, 258 insertions(+), 1 deletion(-)
create mode 100644 kernel/trace/rv/reactor_panic_kunit.c
create mode 100644 kernel/trace/rv/reactor_printk_kunit.c
--
2.25.1
^ permalink raw reply
* [PATCH 2/3] rv/reactors: add KUnit tests for reactor_printk
From: wen.yang @ 2026-06-15 16:44 UTC (permalink / raw)
To: Gabriele Monaco; +Cc: Nam Cao, linux-trace-kernel, linux-kernel, Wen Yang
In-Reply-To: <cover.1781541556.git.wen.yang@linux.dev>
From: Wen Yang <wen.yang@linux.dev>
Add KUnit tests for the printk reactor covering:
- Reactor registration and unregistration lifecycle
- React callback invocation via rv_react()
- Double registration rejection
- Multiple register/unregister cycles
The mock callback calls vprintk_deferred() — the same path as the real
reactor — then busy-waits to simulate I/O back-pressure, exercising the
LD_WAIT_FREE constraint of rv_react() under load.
Signed-off-by: Wen Yang <wen.yang@linux.dev>
---
kernel/trace/rv/Kconfig | 10 ++
kernel/trace/rv/Makefile | 1 +
kernel/trace/rv/reactor_printk_kunit.c | 123 +++++++++++++++++++++++++
3 files changed, 134 insertions(+)
create mode 100644 kernel/trace/rv/reactor_printk_kunit.c
diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index 3884b14df375..ff47895c897f 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -104,6 +104,16 @@ config RV_REACT_PRINTK
Enables the printk reactor. The printk reactor emits a printk()
message if an exception is found.
+config RV_REACT_PRINTK_KUNIT
+ bool "KUnit tests for reactor_printk" if !KUNIT_ALL_TESTS
+ depends on RV_REACT_PRINTK && KUNIT
+ default KUNIT_ALL_TESTS
+ help
+ This builds KUnit tests for the printk reactor. These are only
+ for development and testing, not for regular kernel use cases.
+
+ If unsure, say N.
+
config RV_REACT_PANIC
bool "Panic reactor"
depends on RV_REACTORS
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index 94498da35b37..ef0a2dcb927c 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -23,4 +23,5 @@ obj-$(CONFIG_RV_MON_NOMISS) += monitors/nomiss/nomiss.o
# Add new monitors here
obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
+obj-$(CONFIG_RV_REACT_PRINTK_KUNIT) += reactor_printk_kunit.o
obj-$(CONFIG_RV_REACT_PANIC) += reactor_panic.o
diff --git a/kernel/trace/rv/reactor_printk_kunit.c b/kernel/trace/rv/reactor_printk_kunit.c
new file mode 100644
index 000000000000..933aa5602226
--- /dev/null
+++ b/kernel/trace/rv/reactor_printk_kunit.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit tests for reactor_printk
+ *
+ */
+
+#include <kunit/test.h>
+#include <linux/rv.h>
+#include <linux/printk.h>
+#include <linux/sched/clock.h>
+#include <linux/processor.h>
+
+/*
+ * Simulated execution time for mock_printk_react (sched_clock units,
+ * nanoseconds). Models the time a real printk reactor callback may consume
+ * under I/O pressure, exercising the LD_WAIT_FREE constraint of rv_react().
+ */
+#define MOCK_REACT_DURATION_NS 5000000ULL
+
+/*
+ * Mock react callback mirroring rv_printk_reaction().
+ *
+ * Calls vprintk_deferred() — the same path as the real reactor — then holds
+ * the CPU for MOCK_REACT_DURATION_NS via a sched_clock() timed busy-loop,
+ * simulating a callback that is slow due to I/O back-pressure.
+ * sched_clock() is notrace and lock-free; no sleep or lock acquisition is
+ * performed, satisfying the LD_WAIT_FREE constraint of rv_react().
+ */
+__printf(1, 0) static void mock_printk_react(const char *msg, va_list args)
+{
+ u64 start = sched_clock();
+
+ vprintk_deferred(msg, args);
+
+ while (sched_clock() - start < MOCK_REACT_DURATION_NS)
+ cpu_relax();
+}
+
+static struct rv_reactor mock_printk_reactor = {
+ .name = "test_printk",
+ .description = "test printk reactor",
+ .react = mock_printk_react,
+};
+
+/* Test 1: register and unregister reactor */
+static void test_printk_register_unregister(struct kunit *test)
+{
+ int ret;
+
+ ret = rv_register_reactor(&mock_printk_reactor);
+ KUNIT_EXPECT_EQ(test, ret, 0);
+ KUNIT_EXPECT_STREQ(test, mock_printk_reactor.name, "test_printk");
+
+ rv_unregister_reactor(&mock_printk_reactor);
+}
+
+/* Test 2: react callback is invoked via rv_react() */
+static void test_printk_react_called(struct kunit *test)
+{
+ struct rv_reactor reactor = {
+ .name = "printk_cb_test",
+ .react = mock_printk_react,
+ };
+ struct rv_monitor monitor = {
+ .name = "test_monitor",
+ .reactor = &reactor,
+ .react = mock_printk_react,
+ };
+
+ rv_react(&monitor, "printk violation message");
+}
+
+/* Test 3: double registration should fail */
+static void test_printk_double_register(struct kunit *test)
+{
+ int ret;
+
+ ret = rv_register_reactor(&mock_printk_reactor);
+ KUNIT_ASSERT_EQ(test, ret, 0);
+
+ ret = rv_register_reactor(&mock_printk_reactor);
+ KUNIT_EXPECT_NE(test, ret, 0);
+
+ rv_unregister_reactor(&mock_printk_reactor);
+}
+
+/* Test 4: register/unregister cycle */
+static void test_printk_register_cycle(struct kunit *test)
+{
+ int ret, i;
+
+ for (i = 0; i < 5; i++) {
+ ret = rv_register_reactor(&mock_printk_reactor);
+ KUNIT_EXPECT_EQ(test, ret, 0);
+
+ rv_unregister_reactor(&mock_printk_reactor);
+ }
+}
+
+/* Test 5: react callback is not NULL (printk reactors must provide react) */
+static void test_printk_react_not_null(struct kunit *test)
+{
+ KUNIT_EXPECT_NOT_NULL(test, mock_printk_reactor.react);
+}
+
+static struct kunit_case reactor_printk_kunit_cases[] = {
+ KUNIT_CASE(test_printk_register_unregister),
+ KUNIT_CASE(test_printk_react_called),
+ KUNIT_CASE(test_printk_double_register),
+ KUNIT_CASE(test_printk_register_cycle),
+ KUNIT_CASE(test_printk_react_not_null),
+ {}
+};
+
+static struct kunit_suite reactor_printk_kunit_suite = {
+ .name = "rv_reactor_printk",
+ .test_cases = reactor_printk_kunit_cases,
+};
+
+kunit_test_suite(reactor_printk_kunit_suite);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("KUnit tests for reactor_printk");
--
2.25.1
^ permalink raw reply related
* [PATCH 1/3] rv/reactors: fix lockdep "Invalid wait context" in rv_react()
From: wen.yang @ 2026-06-15 16:44 UTC (permalink / raw)
To: Gabriele Monaco
Cc: Nam Cao, linux-trace-kernel, linux-kernel, Wen Yang,
Thomas Weißschuh
In-Reply-To: <cover.1781541556.git.wen.yang@linux.dev>
From: Wen Yang <wen.yang@linux.dev>
The DEFINE_WAIT_OVERRIDE_MAP() macro creates a lockdep map with
wait_type_inner = LD_WAIT_CONFIG, which inherits the outer context's
wait type. When rv_react() is called from a LD_WAIT_FREE context
(e.g., a KUnit test with busy-wait), and the reactor callback triggers
a timer interrupt during the busy-loop, the interrupt exit path attempts
to schedule (preempt_schedule_irq -> __schedule -> rq->__lock), which is
LD_WAIT_SPIN. Lockdep then reports:
[ BUG: Invalid wait context ]
context-{5:5}
1 lock held by kunit_try_catch/209:
#0: rv_react_map-wait-type-override at rv_react+0x9d/0xf0
The wait_type_override map allowed the outer LD_WAIT_FREE to propagate
inward, but scheduling from an interrupt is LD_WAIT_SPIN, violating the
constraint.
Fix by explicitly setting wait_type_inner = LD_WAIT_SPIN, which is the
tightest constraint rv_react() callbacks must satisfy: they may not
sleep (LD_WAIT_SLEEP) or use mutexes, but can use spinlocks and be
interrupted. This matches the documented LD_WAIT_FREE constraint.
Fixes: 69d8895cb9a9 ("rv: Add explicit lockdep context for reactors")
Signed-off-by: Wen Yang <wen.yang@linux.dev>
Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
kernel/trace/rv/rv_reactors.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/rv/rv_reactors.c b/kernel/trace/rv/rv_reactors.c
index 460af07f7aba..423f843bbd68 100644
--- a/kernel/trace/rv/rv_reactors.c
+++ b/kernel/trace/rv/rv_reactors.c
@@ -465,7 +465,13 @@ int init_rv_reactors(struct dentry *root_dir)
void rv_react(struct rv_monitor *monitor, const char *msg, ...)
{
- static DEFINE_WAIT_OVERRIDE_MAP(rv_react_map, LD_WAIT_FREE);
+#ifdef CONFIG_LOCKDEP
+ static struct lockdep_map rv_react_map = {
+ .name = "rv_react",
+ .wait_type_outer = LD_WAIT_FREE,
+ .wait_type_inner = LD_WAIT_SPIN,
+ };
+#endif
va_list args;
if (!rv_reacting_on() || !monitor->react)
--
2.25.1
^ permalink raw reply related
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-15 15:38 UTC (permalink / raw)
To: Vlastimil Babka (SUSE), Gregory Price
Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman, Matthew Wilcox
In-Reply-To: <94d6c446-a8a6-485e-bb3c-ee809ebb1d3b@kernel.org>
On 6/15/26 17:27, Vlastimil Babka (SUSE) wrote:
> On 6/15/26 17:18, David Hildenbrand (Arm) wrote:
>> On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
>>>
>>> I think the memalloc approach is dangerous due to unexpected nesting. There
>>> might be nested page allocations in page allocation itself (due to some
>>> debugging option). But also interrupts do not change what "current" points
>>> to. Suddenly those could start requesting folios and/or private nodes and be
>>> surprised, I'm afraid.
>>
>> Yeah, we'd need some way to distinguish the main allocation from these other
>> (nested) allocations.
>
> That goes against the very principle of scopes. And I don't see how, except
> via a ... flag to the main allocation :D
Unless we teach the handful of debug callpaths to set a custom context. Have
some memalloc_context_save/restore.
I'd assume that the number of such nested allocations we can trigger from the
buddy (through some callbacks like kasan and such) should be rather limited.
I also wonder for a second whether we could use of the _noprof vs. !_noprof
mechanism.
Essentially, calling a !_noprof variant ("external allocation interface") could
mean "save/restore folio allocation". Just a thought.
--
Cheers,
David
^ permalink raw reply
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-15 15:37 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Vlastimil Babka (SUSE), Balbir Singh, lsf-pc, linux-kernel,
linux-cxl, cgroups, linux-mm, linux-trace-kernel, damon,
kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
zhengqi.arch, terry.bowman, Matthew Wilcox
In-Reply-To: <fdbdc9f7-d142-4880-b429-065d5056cabb@kernel.org>
On Mon, Jun 15, 2026 at 05:18:55PM +0200, David Hildenbrand (Arm) wrote:
> On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
> >
> > I think the memalloc approach is dangerous due to unexpected nesting. There
> > might be nested page allocations in page allocation itself (due to some
> > debugging option). But also interrupts do not change what "current" points
> > to. Suddenly those could start requesting folios and/or private nodes and be
> > surprised, I'm afraid.
>
> Yeah, we'd need some way to distinguish the main allocation from these other
> (nested) allocations.
>
>
> >
> > The memalloc scopes only work well when they restrict the context wrt
> > reclaim, and allocations in IRQ have to be already restricted heavily
> > (atomic) so further memalloc restrictions don't do anything in practice. But
> > to make them change other aspects of the allocations like this won't work.
>
> I was assuming that memalloc_pin_save() would already violate that, but really
> it only restricts where movable allocations land, and that doesn't matter for
> other kernel allocations.
>
> Do you see any other way to make something like an allocation context work, and
> avoid introducing more GFP flags?
>
One thought would be a way to switch what fallback list is used, and
then have specific fallback lists for certain contexts.
Right now there is a single example of this: __GFP_THISNODE
|= __GFP_THISNODE => NOFALLBACK
&= ~__GFP_THISNODE => FALLBACK
We could add an interface with the desired fallback list based as an
argument, and let get_page_from_freelist to prefer that over the default
global lists.
Omit all special nodes from FALLBACK/NOFALLBACK and make the special
contexts provide the fallback-base that should be used.
On my current branch i think that would include modifying, in totality:
alloc_folio_mpol()
alloc_demotion_folio()
alloc_migration_target()
And i'm pretty sure that all just nests nicely.
We might not even need memalloc... hmmm
~Gregory
^ permalink raw reply
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Vlastimil Babka (SUSE) @ 2026-06-15 15:27 UTC (permalink / raw)
To: David Hildenbrand (Arm), Gregory Price
Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman, Matthew Wilcox
In-Reply-To: <fdbdc9f7-d142-4880-b429-065d5056cabb@kernel.org>
On 6/15/26 17:18, David Hildenbrand (Arm) wrote:
> On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
>> On 6/12/26 17:29, Gregory Price wrote:
>>> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
>>>> ... snip ...
>>>>
>>>> I will still probably send the next RFC version tomorrow or friday,
>>>> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
>>>>
>>>> Also, I made a new `anondax` driver which enables userland testing
>>>> of this functionality without any specialty hardware.
>>>>
>>>
>>> (apologies for the length of this email: this will all be covered in
>>> the coming cover letter, but I just wanted to share a bit of a preview)
>>>
>>> ===
>>>
>>> Just another small update - I am planning to post the RFC today once i
>>> get some mild cleanup done. It will be based on the dax atomic hotplug
>>>
>>> https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/
>>>
>>> But a couple specific details regarding the memalloc pieces that i've
>>> learned the past couple of days playing with it.
>>>
>>> 1) memalloc_folio is required to ensure non-folio allocations don't land
>>> on the private node, even if it happens within a memalloc_private
>>> context. Since memalloc_folio may be useful in contexts outside of
>>> private nodes, I kept this as a separate flag.
>>>
>>> If we think there will *never* be additional users of memalloc_folio,
>>> then we could fold _folio into _private to save the flag for now and
>>> add it back when we actually need it.
>>>
>>> 2) memalloc_private is needed to unlock private nodes, but in the
>>> original NOFALLBACK-only design, you also needed __GFP_THISNODE.
>>>
>>> This is *highly* restrictive. I found when playing with mbind that
>>> MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
>>> implies a bug).
>>>
>>> That leads me to #3
>>
>> I think the memalloc approach is dangerous due to unexpected nesting. There
>> might be nested page allocations in page allocation itself (due to some
>> debugging option). But also interrupts do not change what "current" points
>> to. Suddenly those could start requesting folios and/or private nodes and be
>> surprised, I'm afraid.
>
> Yeah, we'd need some way to distinguish the main allocation from these other
> (nested) allocations.
That goes against the very principle of scopes. And I don't see how, except
via a ... flag to the main allocation :D
>>
>> The memalloc scopes only work well when they restrict the context wrt
>> reclaim, and allocations in IRQ have to be already restricted heavily
>> (atomic) so further memalloc restrictions don't do anything in practice. But
>> to make them change other aspects of the allocations like this won't work.
>
> I was assuming that memalloc_pin_save() would already violate that, but really
> it only restricts where movable allocations land, and that doesn't matter for
> other kernel allocations.
Hm yeah its suboptimal, as it can turn a movable allocation unmovable. But
shouldn't cause outright bugs.
> Do you see any other way to make something like an allocation context work, and
> avoid introducing more GFP flags?
Yeah, the idea of augomenting gfp flags with alloc_flags that are no longer
strictly internal to the page allocator, seems like a way to achieve what we
need.
^ permalink raw reply
* Re: [PATCH v3 6/9] rv/tlob: add tlob hybrid automaton monitor
From: Gabriele Monaco @ 2026-06-15 15:24 UTC (permalink / raw)
To: wen.yang; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <629023dbcc4389fcc6ec46d88c98eb19aa0abc36.1780847473.git.wen.yang@linux.dev>
On Mon, 2026-06-08 at 00:13 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
>
> Add tlob (task latency over budget), a per-task hybrid automaton RV
> monitor that tracks elapsed wall-clock time across a user-delimited
> code section and emits error_env_tlob when the elapsed time exceeds a
> configurable budget.
>
> The monitor uses RV_MON_PER_OBJ with three states (running, waiting,
> sleeping) driven by sched_switch and sched_wakeup tracepoints, and a
> single clock invariant clk_elapsed < budget enforced by an hrtimer
> (HRTIMER_MODE_REL_HARD). On violation, detail_env_tlob provides a
> per-state time breakdown (running_ns, waiting_ns, sleeping_ns).
>
> Per-task state is managed via DA_ALLOC_POOL to avoid allocation on
> the
> scheduler tracepoint path. Uprobe pairs are registered through the
> tracefs monitor file as "p PATH:OFFSET_START OFFSET_STOP
> threshold=NS".
>
> Also adds ha_cancel_timer_sync() to ha_monitor.h, a blocking cancel
> variant needed by tlob's stop_task path to ensure the hrtimer
> callback
> has completed before the per-task monitor state is freed.
>
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com>
I'm not sure this applies to be fair, I'm reviewing the patch and as
part of that I'm suggesting things, but the main idea is yours [1].
Unless you mess up, I'm going to appear as reviewer anyway ;)
[1] - https://www.kernel.org/doc/html/latest/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
...
> diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
> index e2e0033a00b9..ed2de31d0312 100644
> --- a/kernel/trace/rv/Kconfig
> +++ b/kernel/trace/rv/Kconfig
> @@ -85,6 +85,7 @@ source "kernel/trace/rv/monitors/sleep/Kconfig"
> source "kernel/trace/rv/monitors/stall/Kconfig"
> source "kernel/trace/rv/monitors/deadline/Kconfig"
> source "kernel/trace/rv/monitors/nomiss/Kconfig"
> +source "kernel/trace/rv/monitors/tlob/Kconfig"
> # Add new deadline monitors here
Your line should be after the marker for deadline monitors, this is
required for the Kconfig to look good with nested monitors.
Just let rvgen handle the way it looks (V2 was fine, just above the
general monitor marker):
# Add new deadline monitors here
+source "kernel/trace/rv/monitors/tlob/Kconfig"
# Add new monitors here
>
> # Add new monitors here
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c
> b/kernel/trace/rv/monitors/tlob/tlob.c
> new file mode 100644
> index 000000000000..d8e0c4794720
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
> @@ -0,0 +1,968 @@
> +#include <linux/hrtimer.h>
> +#include <linux/kernel.h>
> +#include <linux/ktime.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/namei.h>
> +#include <linux/rv.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/tracefs.h>
> +#include <kunit/visibility.h>
> +#include <rv/instrumentation.h>
> +#include <rv/rv_uprobe.h>
You added a few includes I don't believe you need: hrtimer, ktime, sched
(included in events/sched) and tracefs (included in rv).
You can keep the others as they come from the template, all monitors
include them anyway.
> +#include "../../rv.h"
This is <rv.h> the includepath allows that just like rv_trace.
> +
> +#define MODULE_NAME "tlob"
> +
> +#include <trace/events/sched.h>
> +#include <rv_trace.h>
...
> +/* Type for da_monitor_storage.target; must be defined before the
> includes. */
> +typedef struct tlob_task_state *monitor_target;
> +
> +/* Forward-declared so da_monitor_reset_hook works before
> ha_monitor.h. */
> +static inline void tlob_reset_notify(struct da_monitor *da_mon);
> +#define da_monitor_reset_hook tlob_reset_notify
> +
> +/* Override EVENT_NONE_LBL so the timer-fired violation shows
> "budget_exceeded". */
> +#define EVENT_NONE_LBL "budget_exceeded"
> +
> +#include "tlob.h"
> +
> +/*
> + * DA_MON_POOL_SIZE must be defined HERE: after tlob.h (which
> defines
> + * TLOB_MAX_MONITORED) and before #include <rv/ha_monitor.h> (which
> + * transitively includes da_monitor.h and expands
> __da_monitor_init_pool
> + * using this macro). Placing the define before tlob.h or after
> + * ha_monitor.h both cause a build error.
> + */
I don't think you need this comment here. I would add a comment in case
moving it somewhere else would silently change the behaviour, but here
the compiler is going to tell you what isn't defined if you really want
to move it around.
Same for the one line comments above but those are less invasive.
> +#define DA_MON_POOL_SIZE TLOB_MAX_MONITORED
> +
> +/*
> + * Forward-declare tlob_extra_cleanup so the #define below is valid
> when
> + * da_monitor.h (included via ha_monitor.h) expands da_extra_cleanup
> inside
> + * da_monitor_destroy(). The full definition follows after
> ha_monitor.h.
> + */
I'd argue the same here. Though defining da_extra_cleanup() before is
the actual catch (if you do after, the compiler won't tell you it's
using a no-op). If you really want to leave a line for that, feel free
to but I wouldn't be that verbose.
You can combine it to da_monitor_reset_hook() where the same rules
apply.
> +static inline void tlob_extra_cleanup(struct da_monitor *da_mon);
> +#define da_extra_cleanup tlob_extra_cleanup
> +
> +#include <rv/ha_monitor.h>
> +
> +/*
> + * Called from da_monitor_reset() on both normal stop and hrtimer
> expiry.
> + * On violation (stopping==0), emits detail_env_tlob.
> + */
> +static inline void tlob_reset_notify(struct da_monitor *da_mon)
> +{
> + struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
> + struct tlob_task_state *ws;
> +
> + ha_monitor_reset_env(da_mon);
Could have the rest run only if trace_detail_env_tlob_enabled() and
later use trace_call__detail_env_tlob() (the raw tracepoint without
enabled check).
So if the tracepoint is disabled, this code almost doesn't exist.
> +
> + ws = ha_get_target(ha_mon);
> + if (!ws)
> + return;
> +
> + /*
> + * Emit per-state breakdown on budget violation only.
> + * stopping==0: timer callback owns this path (genuine
> overrun).
> + * stopping==1: normal stop claimed ownership first; skip.
> + */
> + if (!atomic_read(&ws->stopping)) {
> + unsigned int curr_state = READ_ONCE(da_mon-
> >curr_state);
> + u64 running_ns, waiting_ns, sleeping_ns, partial_ns;
> + unsigned long flags;
> +
> + /*
> + * Snapshot accumulators; partial_ns covers
> curr_state time
> + * not yet folded in (transition-out pending).
> + */
> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
> + partial_ns = ktime_get_ns() - ktime_to_ns(ws-
> >last_ts);
> + running_ns = ws->running_ns +
> + (curr_state == running_tlob ?
> partial_ns : 0);
> + waiting_ns = ws->waiting_ns +
> + (curr_state == waiting_tlob ?
> partial_ns : 0);
> + sleeping_ns = ws->sleeping_ns +
> + (curr_state == sleeping_tlob ?
> partial_ns : 0);
> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +
> + trace_detail_env_tlob(da_get_id(da_mon), ws-
> >threshold_ns,
> + running_ns, waiting_ns,
> sleeping_ns);
> + }
> +}
> +
> +#define BUDGET_NS(ha_mon) (ha_get_target(ha_mon)->threshold_ns)
> +
> +/* HA constraint functions (called by ha_monitor_handle_constraint)
> */
> +
> +static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_tlob env,
> u64 time_ns)
> +{
> + if (env == clk_elapsed_tlob)
> + return ha_get_clk_ns(ha_mon, env, time_ns);
> + return ENV_INVALID_VALUE;
> +}
So here you removed ha_reset_env() because that's done implicitly in the
start. It seems to work, but you're going to need to define it if we
apply an automatic reset at start (what I mentioned in the other review).
It's kinda idiomatic to have resets whenever you have clocks, yours is a
special case because you don't really expect start (the only event with a
reset) to happen again.
It's ok to redefine the functions the way you did, though you risk
leaving them out of sync in case you update the model (rvgen's output is
going to be quite different). Not sure if that's ever going to happen
anyway.
> +
> +/*
> + * ha_verify_invariants - clk_elapsed < BUDGET_NS must hold in all
> states.
> + *
> + * The invariant is uniform across running/waiting/sleeping; check
> it
> + * unconditionally rather than enumerating each state.
> + */
> +static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
> + enum states curr_state, enum
> events event,
> + enum states next_state, u64
> time_ns)
> +{
The compiler is probably going to produce something not that different
if you keep the ifs, but this is smaller and cleaner. Again, you only
need to track it manually but it looks alright now.
> + return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob,
> time_ns);
> +}
> +
> +/*
> + * Convert invariant (deadline) to guard (reset anchor) on state
> transitions.
> + *
> + * The conversion is identical for every departing state; skip only
> self-loops.
> + */
> +static inline void ha_convert_inv_guard(struct ha_monitor *ha_mon,
> + enum states curr_state, enum
> events event,
> + enum states next_state, u64
> time_ns)
> +{
> + if (curr_state != next_state)
> + ha_inv_to_guard(ha_mon, clk_elapsed_tlob,
> BUDGET_NS(ha_mon), time_ns);
> +}
> +
> +/* No per-event guard conditions for tlob; invariants suffice. */
> +static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
> + enum states curr_state, enum
> events event,
> + enum states next_state, u64
> time_ns)
> +{
Yeah you're not quite using a reset and you probably shouldn't so this
is alright too.
> + return true;
> +}
> +
> +/*
> + * Arm or cancel the HA budget timer on state transitions.
> + *
> + * The timer must run in every monitored state
> (running/waiting/sleeping),
> + * so arm it whenever next_state is any of the three. On a self-
> loop caused
> + * by a non-start event the timer is already running; skip the
> redundant
> + * restart. On a true state change the old timer is implicitly
> superseded by
> + * the new ha_start_timer_ns() call.
> + *
> + * Guard on stopping: sched_switch events can arrive after
> ha_cancel_timer_sync,
> + * restarting the timer and triggering an ODEBUG "activate active"
> splat.
> + * The _acquire pairs with the cmpxchg_release in tlob_stop_task.
> + */
> +static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
> + enum states curr_state, enum
> events event,
> + enum states next_state, u64
> time_ns)
> +{
> + if (next_state == curr_state && event != start_tlob)
> + return;
> +
stopping is only ever set in the cleanup phase, right? You don't really
need to cancel timers because you should have done it already, what
about the slightly more readable:
if (atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
return;
if (next_state < state_max_tlob)
ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon), time_ns);
else
ha_cancel_timer(ha_mon);
> + if (next_state < state_max_tlob) {
> + if (!atomic_read_acquire(&ha_get_target(ha_mon)-
> >stopping))
> + ha_start_timer_ns(ha_mon, clk_elapsed_tlob,
> BUDGET_NS(ha_mon), time_ns);
> + } else {
> + ha_cancel_timer(ha_mon);
> + }
> +}
...
> +/*
> + * da_extra_cleanup - per-task teardown called by
> da_monitor_destroy().
> + *
> + * Claims cleanup ownership via CAS; cancels the budget timer;
> decrements the
> + * monitored-task counter; and schedules the slab free via
> call_rcu().
> + * Must run before da_monitor_reset() (i.e. before hash_del_rcu())
> so that
> + * ha_cancel_timer_sync() can safely access the still-registered
> ha_monitor.
> + */
> +static inline void tlob_extra_cleanup(struct da_monitor *da_mon)
> +{
> + struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
> + struct tlob_task_state *ws = ha_get_target(ha_mon);
> +
> + if (!ws)
> + return;
> +
> + if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
> + return;
> +
> + ha_cancel_timer_sync(ha_mon);
You shouldn't need to do this since that's standard HA functionality and
is stopped already by the reset + sync RCU sequence.
> + put_task_struct(ws->task);
> + call_rcu(&ws->rcu, tlob_free_rcu);
And since you can sleep and already waited for a grace period, you can
probably just free now.
This is the same point I had in da_monitor_destroy_pool() vs
da_monitor_destroy_kmalloc(): let's align what we do with sync/RCU.
> +}
> +
> +/*
> + * __tlob_acc - accumulate elapsed ns into one per-state counter.
> + *
> + * Looks up the task's tlob_task_state under RCU, adds the interval
> + * [ws->last_ts, now] to the field at @offset within the state
> struct,
> + * and updates last_ts. Returns true if the task is monitored.
> + *
> + * entry_lock is a raw spinlock so this is safe from hardirq
> context.
> + */
> +static inline bool __tlob_acc(struct task_struct *task, ktime_t now,
> + size_t offset)
> +{
> + struct tlob_task_state *ws;
> + unsigned long flags;
> +
> + scoped_guard(rcu) {
> + ws = da_get_target_by_id(task->pid);
> + if (!ws)
> + return false;
> + raw_spin_lock_irqsave(&ws->entry_lock, flags);
> + *(u64 *)((char *)ws + offset) +=
> ktime_to_ns(ktime_sub(now, ws->last_ts));
I don't really like this, you have a few options to avoid raw pointer
arithmetics:
1. a macro:
#define __tlob_acc(task, now, field) do { \
...
ws->field += ktime_stuff(now); \
...
}
static inline bool tlob_acc_running(struct task_struct *task, ktime_t now)
{
return __tlob_acc(task, now, running_ns);
}
2. use an array in the struct and pass the index around:
enum tlob_acc {
running,
waiting,
sleeping,
max,
}
struct tlob_task_state {
...
u64 accs_ns[max];
...
};
bool __tlob_acc(struct task_struct *task, ktime_t now, enum tlob_acc acc)
{
...
ws->accs_ns[acc] += ktime_stuff(now);
...
}
static inline bool tlob_acc_running(struct task_struct *task, ktime_t now)
{
return __tlob_acc(task, now, running);
}
Some folks don't like macros, so perhaps options 2 is cleaner.
You may not even need the helper functions since calling the generic
tlob_acc like this is still readable.
If you go down that path, you may also want to simplify the function
using normal (unscoped) guards for both RCU and the lock, since you take
them until the end of the function (won't work with the macro!).
> + ws->last_ts = now;
> + raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> + }
> + return true;
> +}
> +
> +/* Accumulate running_ns for prev; returns true if prev is
> monitored. */
> +static inline bool tlob_acc_running(struct task_struct *task,
> ktime_t now)
> +{
> + return __tlob_acc(task, now, offsetof(struct
> tlob_task_state, running_ns));
> +}
> +
> +/* Accumulate waiting_ns for next; returns true if next is
> monitored. */
> +static inline bool tlob_acc_waiting(struct task_struct *task,
> ktime_t now)
> +{
> + return __tlob_acc(task, now, offsetof(struct
> tlob_task_state, waiting_ns));
> +}
> +
> +/*
> + * handle_sched_switch - advance the DA on every context switch.
> + *
> + * Generates three DA events:
> + * prev, prev_state != 0 -> sleep_tlob (running -> sleeping)
> + * prev, prev_state == 0 -> preempt_tlob (running -> waiting)
> + * next -> switch_in_tlob (waiting -> running)
> + *
> + * A single ktime_get() at handler entry is shared by both acc calls
> so that
> + * prev's running_ns and next's waiting_ns share the same context-
> switch
> + * timestamp; neither absorbs handler overhead into its accumulator.
> + *
> + * No waiting->sleeping edge exists: a task can only block
> voluntarily
> + * (call schedule()) while it is executing on CPU, which corresponds
> to
> + * the running DA state. A task in the waiting state is
> TASK_RUNNING in
> + * kernel terms (on the runqueue) and cannot block itself.
> + *
> + * da_handle_event() is called unconditionally: it skips tasks that
> have no
> + * monitor entry in the hash table.
> + */
> +static void handle_sched_switch(void *data, bool preempt_unused,
> + struct task_struct *prev,
> + struct task_struct *next,
> + unsigned int prev_state)
> +{
> + ktime_t now = ktime_get();
> + bool prev_preempted = (prev_state == 0);
> +
> + /*
> + * No guard on tlob_num_monitored here: da_handle_event()
> internally
> + * calls da_monitor_handling_event() which checks both
> rv_monitoring_on()
> + * and da_monitoring(da_mon). The hash lookup inside
> da_get_monitor()
> + * simply returns NULL for unmonitored tasks, which is
> equally fast as
> + * an atomic_read() guard. By omitting the guard we avoid
> touching the
> + * tlob_num_monitored cacheline on every global context-
> switch.
> + */
> + if (tlob_acc_running(prev, now))
> + da_handle_event(prev->pid, NULL,
> + prev_preempted ? preempt_tlob :
> sleep_tlob);
> + if (tlob_acc_waiting(next, now))
> + da_handle_event(next->pid, NULL, switch_in_tlob);
> +}
> +
> +/* Accumulate sleeping_ns on wakeup; returns true if task is
> monitored. */
> +static inline bool tlob_acc_sleeping(struct task_struct *task,
> ktime_t now)
> +{
> + return __tlob_acc(task, now, offsetof(struct
> tlob_task_state, sleeping_ns));
> +}
> +
> +/*
> + * handle_sched_wakeup - sleeping -> waiting transition.
> + *
> + * try_to_wake_up() skips TASK_RUNNING tasks, so this never fires
> for a
> + * task already in running or waiting state.
> + */
> +static void handle_sched_wakeup(void *data, struct task_struct *p)
> +{
> + ktime_t now = ktime_get();
> +
> + /* Same reasoning as handle_sched_switch: rely on hash-
> lookup fast path. */
> + if (tlob_acc_sleeping(p, now))
> + da_handle_event(p->pid, NULL, wakeup_tlob);
> +}
> +
> +/*
> + * handle_sched_process_exit - clean up if a task exits without
> TRACE_STOP.
> + *
> + * Called in do_exit() context; the task still has a valid pid here.
> + * tlob_stop_task() returns -ESRCH if the task is not monitored,
> which is fine.
> + */
> +static void handle_sched_process_exit(void *data, struct task_struct
> *p,
> + bool group_dead)
> +{
> + tlob_stop_task(p);
> +}
> +
> +
> +
> +/**
> + * tlob_start_task - begin monitoring @task with budget
> @threshold_ns ns.
> + * @task: Task to monitor; may be current or another task.
> + * @threshold_ns: Latency budget in nanoseconds (wall-clock; running
> + waiting + sleeping).
> + * Must be in [1000, TLOB_MAX_THRESHOLD_NS].
> + *
> + * Returns 0, -ENODEV, -ERANGE, -EALREADY, -ENOMEM, or -ENOSPC.
> + */
> +int tlob_start_task(struct task_struct *task, u64 threshold_ns)
> +{
> + struct tlob_task_state *ws;
> +
> + if (!da_monitor_enabled())
> + return -ENODEV;
> +
> + if (threshold_ns < 1000 || threshold_ns >
Why not some TLOB_MIN_THRESHOLD_NS (that can of course be 1000)?
> TLOB_MAX_THRESHOLD_NS)
> + return -ERANGE;
> +
> + /* Serialise duplicate-check + pool-slot claim for the same
> pid. */
> + guard(mutex)(&tlob_start_mutex);
> +
> + if (da_get_target_by_id(task->pid))
> + return -EALREADY;
> +
> + ws = kmem_cache_zalloc(tlob_state_cache, GFP_KERNEL);
> + if (!ws)
> + return -ENOMEM;
> +
> + ws->task = task;
> + get_task_struct(task);
> + ws->threshold_ns = threshold_ns;
> + ws->last_ts = ktime_get();
> + raw_spin_lock_init(&ws->entry_lock);
> +
> + /*
> + * da_handle_start_run_event() claims a pool slot via
> da_prepare_storage(),
> + * initialises the monitor, and delivers start_tlob in one
> step: the
> + * generated ha_setup_invariants() resets clk_elapsed and
> arms the timer.
> + * Returns 0 if the pool is exhausted (-ENOSPC).
> + */
> + if (!da_handle_start_run_event(task->pid, ws, start_tlob)) {
> + put_task_struct(task);
> + kmem_cache_free(tlob_state_cache, ws);
> + return -ENOSPC;
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_start_task);
> +
> +/**
> + * tlob_stop_task - stop monitoring @task.
> + * @task: Task to stop.
> + *
> + * CAS on ws->stopping (0->1) under RCU claims cleanup ownership;
> + * the winner cancels the timer synchronously and frees all
> resources.
> + *
> + * Returns 0, -EOVERFLOW (budget exceeded), -ESRCH (not monitored),
> + * or -EAGAIN (concurrent caller claimed cleanup).
> + */
> +int tlob_stop_task(struct task_struct *task)
> +{
> + struct da_monitor *da_mon;
> + struct ha_monitor *ha_mon;
> + struct tlob_task_state *ws;
> + bool budget_exceeded;
> +
> + scoped_guard(rcu) {
> + ws = da_get_target_by_id(task->pid);
> + if (!ws)
> + return -ESRCH;
> +
> + da_mon = da_get_monitor(task->pid, NULL);
> + if (unlikely(!da_mon)) {
> + /* ws in hash but da_mon gone; internal
> inconsistency. */
> + WARN_ON_ONCE(1);
if (unlikely(WARN_ON_ONCE(!da_mon)))
> + return -ESRCH;
> + }
> +
> + ha_mon = to_ha_monitor(da_mon);
> +
> + /*
> + * CAS (0->1) claims cleanup ownership under RCU (ws
> guaranteed valid).
> + * _release pairs with atomic_read_acquire in
> ha_setup_invariants.
> + */
> + if (atomic_cmpxchg_release(&ws->stopping, 0, 1) !=
> 0)
> + return -EAGAIN;
> + }
> +
> + /* Wait for in-flight timer callback before reading
> da_monitoring. */
> + ha_cancel_timer_sync(ha_mon);
> +
> + /* Timer fired first -> budget exceeded; otherwise reset
> normally. */
> + scoped_guard(rcu) {
> + budget_exceeded = !da_monitoring(da_mon);
> + if (!budget_exceeded)
> + da_monitor_reset(da_mon);
> + }
> + da_destroy_storage(task->pid);
> +
> + put_task_struct(ws->task);
> + call_rcu(&ws->rcu, tlob_free_rcu);
> + return budget_exceeded ? -EOVERFLOW : 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_stop_task);
> +
> +
> +static int tlob_uprobe_entry_handler(struct rv_uprobe *p, struct
> pt_regs *regs,
> + __u64 *data)
> +{
> + struct tlob_uprobe_binding *b = p->priv;
> +
> + tlob_start_task(current, b->threshold_ns);
> + return 0;
> +}
> +
> +static int tlob_uprobe_stop_handler(struct rv_uprobe *p, struct
> pt_regs *regs,
> + __u64 *data)
> +{
> + tlob_stop_task(current);
> + return 0;
> +}
> +
> +/*
> + * Register start + stop entry uprobes for a binding.
> + * Called with tlob_uprobe_mutex held.
> + */
> +static int tlob_add_uprobe(u64 threshold_ns, const char *binpath,
> + loff_t offset_start, loff_t offset_stop)
> +{
You can use cleanup helpers here to save a couple of gotos (making the
code less error prone and more readable):
> + struct tlob_uprobe_binding *b, *tmp_b;
struct tlob_uprobe_binding __free(kfree) *b = NULL, *tmp_b;
> + char pathbuf[TLOB_MAX_PATH];
> + struct path path;
struct path path __free(path_put) = {};
> + char *canon;
> + int ret;
> +
> + if (binpath[0] != '/')
> + return -EINVAL;
> +
> + b = kzalloc_obj(*b, GFP_KERNEL);
> + if (!b)
> + return -ENOMEM;
> +
> + b->threshold_ns = threshold_ns;
> + b->offset_start = offset_start;
> + b->offset_stop = offset_stop;
> +
> + ret = kern_path(binpath, LOOKUP_FOLLOW, &path);
> + if (ret)
> + goto err_free;
> +
> + if (!d_is_reg(path.dentry)) {
> + ret = -EINVAL;
> + goto err_path;
> + }
> +
> + /* Reject duplicate start offset for the same binary. */
> + list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
> + if (tmp_b->offset_start == offset_start &&
> + tmp_b->start_probe->path.dentry == path.dentry)
> {
> + ret = -EEXIST;
> + goto err_path;
> + }
> + }
> +
> + canon = d_path(&path, pathbuf, sizeof(pathbuf));
> + if (IS_ERR(canon)) {
> + ret = PTR_ERR(canon);
> + goto err_path;
> + }
> + strscpy(b->binpath, canon, sizeof(b->binpath));
> +
> + /* Both probes share b (priv) and path; attach_path refs
> path itself. */
> + b->start_probe = rv_uprobe_attach_path(&path, offset_start,
> +
> tlob_uprobe_entry_handler, NULL, b);
> + if (IS_ERR(b->start_probe)) {
> + ret = PTR_ERR(b->start_probe);
> + b->start_probe = NULL;
> + goto err_path;
> + }
> +
> + b->stop_probe = rv_uprobe_attach_path(&path, offset_stop,
> +
> tlob_uprobe_stop_handler, NULL, b);
> + if (IS_ERR(b->stop_probe)) {
> + ret = PTR_ERR(b->stop_probe);
> + b->stop_probe = NULL;
> + goto err_start;
> + }
> +
> + path_put(&path);
> + list_add_tail(&b->list, &tlob_uprobe_list);
And finally tell the compiler you want to keep b in the happy path.
list_add_tail(&no_free_ptr(b)->list, &tlob_uprobe_list);
Not tested, but you got the gist, check include/linux/cleanup.h and grep
around for users if it doesn't work as expected.
> + return 0;
> +
> +err_start:
> + rv_uprobe_detach(b->start_probe);
This will need to stay, unless you want to DEFINE_FREE() for this, may
be fun but not required.
> +err_path:
> + path_put(&path);
> +err_free:
> + kfree(b);
> + return ret;
> +}
...
> +
> +/* Parse "-PATH:OFFSET_START" (ftrace uprobe_events removal
> convention). */
This isn't quite a standard function comment (and you don't really need
to make it so, it wouldn't add much value), but I'd say at least make it
a multi-line comment which is more common in this location:
/*
* Parse ...
*/
...
> +static void __tlob_destroy_monitor(void)
> +{
> + rv_this.enabled = 0;
> + /*
> + * Remove uprobes first; rv_uprobe_sync() inside ensures all
> in-flight
> + * handlers have finished before we proceed.
> + */
> + tlob_remove_all_uprobes();
> +
> + /*
> + * da_monitor_destroy() iterates any remaining entries via
> da_extra_cleanup
> + * (tlob_extra_cleanup), cancels their timers, and frees
> their state.
> + * rcu_barrier() inside drains both da_pool_return_cb and
> tlob_free_rcu
> + * callbacks before the pool arrays are freed.
> + */
I'm not sure we need this many comments, the behaviour may change and we
don't need to describe what the function does before calling it.
If anything say /why/ you call a function before another, but I'd say
it's trivial here so you can just drop both comments.
> + ha_monitor_destroy();
> + kmem_cache_destroy(tlob_state_cache);
> + tlob_state_cache = NULL;
> +}
...
> +static int __init register_tlob(void)
> +{
> + int ret;
> +
> + ret = rv_register_monitor(&rv_this, NULL);
> + if (ret)
> + return ret;
> +
> + if (rv_this.root_d) {
> + if (!tracefs_create_file("monitor", 0644,
We have rv_create_file() , it's exactly the same but let's be consistent
so if we ever decide to overload it, this won't stay behind.
> rv_this.root_d, NULL,
> + &tlob_monitor_fops)) {
> + rv_unregister_monitor(&rv_this);
> + return -ENOMEM;
> + }
> + }
> +
> + return 0;
> +}
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.h
> b/kernel/trace/rv/monitors/tlob/tlob.h
> new file mode 100644
> index 000000000000..b6724e629c69
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob.h
> @@ -0,0 +1,148 @@
...
> +
> +enum states_tlob {
> + running_tlob,
> + waiting_tlob,
> + sleeping_tlob,
> + state_max_tlob,
> +};
This is personal preference, so feel free to ignore it. The header file
is (mostly) automatically generated and rvgen currently gives you a
different order for enums/arrays. Maybe just reorder all those according
to what rvgen does (just run it without the -a option and diff what
comes out).
The main reason is again to make it easy to update the file in case the
model/framework changes.
Thanks,
Gabriele
^ permalink raw reply
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-15 15:20 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: David Hildenbrand (Arm), Balbir Singh, lsf-pc, linux-kernel,
linux-cxl, cgroups, linux-mm, linux-trace-kernel, damon,
kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
zhengqi.arch, terry.bowman, Matthew Wilcox
In-Reply-To: <9f1815b0-896b-44ab-9e6d-9316d8f11033@kernel.org>
On Mon, Jun 15, 2026 at 04:38:43PM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/12/26 17:29, Gregory Price wrote:
> >
> > 1) memalloc_folio is required to ensure non-folio allocations don't land
> > on the private node, even if it happens within a memalloc_private
> > context. Since memalloc_folio may be useful in contexts outside of
> > private nodes, I kept this as a separate flag.
> >
> > If we think there will *never* be additional users of memalloc_folio,
> > then we could fold _folio into _private to save the flag for now and
> > add it back when we actually need it.
> >
> > 2) memalloc_private is needed to unlock private nodes, but in the
> > original NOFALLBACK-only design, you also needed __GFP_THISNODE.
> >
> > This is *highly* restrictive. I found when playing with mbind that
> > MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
> > implies a bug).
> >
> > That leads me to #3
>
> I think the memalloc approach is dangerous due to unexpected nesting. There
> might be nested page allocations in page allocation itself (due to some
> debugging option). But also interrupts do not change what "current" points
> to. Suddenly those could start requesting folios and/or private nodes and be
> surprised, I'm afraid.
>
> The memalloc scopes only work well when they restrict the context wrt
> reclaim, and allocations in IRQ have to be already restricted heavily
> (atomic) so further memalloc restrictions don't do anything in practice. But
> to make them change other aspects of the allocations like this won't work.
>
Reduced to practice I have found success, however what you are
describing could probably be resolved by re-introducing fallback list
isolation. If private nodes are not in fallback lists, and they're not
N_MEMORY, then they're unreachable via nodemask-fallbacks, and a
specific node has to be requested. For everything else memalloc locks
them out regardless.
In v5 I actually stripped this all the way back to just memalloc flags
and implemented a bunch of pressure tests to try to detect leakage - and
I was not able to do so - even with all nodes in each other's fallback
lists.
We can tack on both fallback list isolation and __GFP_THISNODE
requirements on top without ABI implications if we find that is
insufficient.
The only place I think this will matter is in the reclaim / demotion
code, would need to rework the allocation code to handle private nodes
more explicitly. This has no ABI implications AND the entire demotion
logic in vmscan.c is utterly broken anyway and needs a rewrite.
I'm running a mass build test at the moment, and it's looking clean, I'm
expecting to be able to test the new code today or tomorrow.
~Gregory
^ permalink raw reply
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-15 15:18 UTC (permalink / raw)
To: Vlastimil Babka (SUSE), Gregory Price
Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman, Matthew Wilcox
In-Reply-To: <9f1815b0-896b-44ab-9e6d-9316d8f11033@kernel.org>
On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
> On 6/12/26 17:29, Gregory Price wrote:
>> On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
>>> ... snip ...
>>>
>>> I will still probably send the next RFC version tomorrow or friday,
>>> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
>>>
>>> Also, I made a new `anondax` driver which enables userland testing
>>> of this functionality without any specialty hardware.
>>>
>>
>> (apologies for the length of this email: this will all be covered in
>> the coming cover letter, but I just wanted to share a bit of a preview)
>>
>> ===
>>
>> Just another small update - I am planning to post the RFC today once i
>> get some mild cleanup done. It will be based on the dax atomic hotplug
>>
>> https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/
>>
>> But a couple specific details regarding the memalloc pieces that i've
>> learned the past couple of days playing with it.
>>
>> 1) memalloc_folio is required to ensure non-folio allocations don't land
>> on the private node, even if it happens within a memalloc_private
>> context. Since memalloc_folio may be useful in contexts outside of
>> private nodes, I kept this as a separate flag.
>>
>> If we think there will *never* be additional users of memalloc_folio,
>> then we could fold _folio into _private to save the flag for now and
>> add it back when we actually need it.
>>
>> 2) memalloc_private is needed to unlock private nodes, but in the
>> original NOFALLBACK-only design, you also needed __GFP_THISNODE.
>>
>> This is *highly* restrictive. I found when playing with mbind that
>> MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
>> implies a bug).
>>
>> That leads me to #3
>
> I think the memalloc approach is dangerous due to unexpected nesting. There
> might be nested page allocations in page allocation itself (due to some
> debugging option). But also interrupts do not change what "current" points
> to. Suddenly those could start requesting folios and/or private nodes and be
> surprised, I'm afraid.
Yeah, we'd need some way to distinguish the main allocation from these other
(nested) allocations.
>
> The memalloc scopes only work well when they restrict the context wrt
> reclaim, and allocations in IRQ have to be already restricted heavily
> (atomic) so further memalloc restrictions don't do anything in practice. But
> to make them change other aspects of the allocations like this won't work.
I was assuming that memalloc_pin_save() would already violate that, but really
it only restricts where movable allocations land, and that doesn't matter for
other kernel allocations.
Do you see any other way to make something like an allocation context work, and
avoid introducing more GFP flags?
--
Cheers,
David
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox