Linux Trace Kernel

* [PATCH v2] tracing: fix CFI violation in probestub test
From: Eva Kurchatova @ 2026-06-02 13:54 UTC
  To: mhiramat, rostedt
  Cc: linux-trace-kernel, linux-kernel, mathieu.desnoyers, peterz,
	jpoimboe, samitolvanen, eva.kurchatova

When multiple callbacks are registered on the same tracepoint, they are
called indirectly via the traceiter helper.

Pointers to __probestub_* callbacks reside in the __tracepoints section,
which is excluded from ENDBR checks in objtool, causing objtool to
assume those functions are never called indirectly.

Registering multiple callbacks using the sched_wakeup test results in
a #CP exception on a CFI-enabled machine, due to the missing ENDBR in
__probestub_sched_wakeup.

Fix this by adding a CFI_NOSEAL annotation to the probestub definition.

Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks")
Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
---
 include/linux/tracepoint.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 763eea4d80d8..38e9f49a71b7 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -20,6 +20,7 @@
 #include <linux/rcupdate_trace.h>
 #include <linux/tracepoint-defs.h>
 #include <linux/static_call.h>
+#include <asm/cfi.h>
 
 struct module;
 struct tracepoint;
@@ -389,6 +390,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 	void __probestub_##_name(void *__data, proto)			\
 	{								\
 	}								\
+	/*								\
+	 * Annotate the probestub 'CFI_NOSEAL' to stop objtool from	\
+	 * requesting the kernel remove the ENDBR, because the only	\
+	 * references to the function are in the __tracepoint section,	\
+	 * that objtool doesn't scan.					\
+	 */								\
+	CFI_NOSEAL(__probestub_##_name);				\
 	DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);	\
 	DEFINE_RUST_DO_TRACE(_name, TP_PROTO(proto), TP_ARGS(args))
 
-- 
2.54.0
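
The indirect-call pattern described in the commit message above can be
illustrated with a small userspace model. This is only a sketch under
stated assumptions: every demo_* name is hypothetical and stands in for
the kernel's traceiter helper, its callback array, and a probestub. It
shows that each registered callback, including an empty stub, is reached
through a function pointer, which is why such a target needs to keep its
ENDBR on IBT-enabled hardware.

```c
#include <stddef.h>

/* Hypothetical callback type: private data plus one tracepoint argument. */
typedef void (*demo_probe_t)(void *data, int arg);

/* Hypothetical registration record, one per registered callback. */
struct demo_tp_func {
	demo_probe_t func;
	void *data;
};

/* Empty stub, standing in for a __probestub_<name> function. */
static void demo_probestub(void *data, int arg)
{
	(void)data;
	(void)arg;
}

/* Ordinary callback that accumulates the argument into *data. */
static void demo_counter(void *data, int arg)
{
	*(int *)data += arg;
}

/*
 * Stand-in for the traceiter helper: walk a NULL-terminated array and
 * invoke each callback through its pointer. Returns the number of
 * callbacks that were called.
 */
static int demo_traceiter(const struct demo_tp_func *funcs, int arg)
{
	int calls = 0;

	for (; funcs->func; funcs++) {
		/* Indirect call: this is the kind of call IBT checks. */
		funcs->func(funcs->data, arg);
		calls++;
	}
	return calls;
}
```

In this model the stub is called exactly like any other callback, so a
tool that concludes "never indirectly called" from a partial scan of
references would be wrong about it.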



* [syzbot] [trace?] KASAN: use-after-free Write in ring_buffer_read_page
From: syzbot @ 2026-06-02 13:45 UTC
  To: linux-kernel, linux-trace-kernel, mathieu.desnoyers, mhiramat,
	rostedt, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    e7ae89a0c97c Linux 7.1-rc5
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16f06e2e580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=58acee1ac5406016
dashboard link: https://syzkaller.appspot.com/bug?extid=2dd9d02f60775ce5c1fb
compiler:       gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/9b0c5b4e3645/disk-e7ae89a0.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/ed163d3ad68b/vmlinux-e7ae89a0.xz
kernel image: https://storage.googleapis.com/syzbot-assets/f2408b333334/bzImage-e7ae89a0.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+2dd9d02f60775ce5c1fb@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in ring_buffer_read_page+0xd51/0x15a0 kernel/trace/ring_buffer.c:7059
Write of size 16308 at addr ffff88805ceb404c by task syz.3.1872/14532

CPU: 0 UID: 0 PID: 14532 Comm: syz.3.1872 Tainted: G             L      syzkaller #0 PREEMPT(full) 
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0x13d/0x4b0 mm/kasan/report.c:482
 kasan_report+0xdf/0x1d0 mm/kasan/report.c:595
 check_region_inline mm/kasan/generic.c:186 [inline]
 kasan_check_range+0x10f/0x1e0 mm/kasan/generic.c:200
 __asan_memset+0x23/0x50 mm/kasan/shadow.c:84
 ring_buffer_read_page+0xd51/0x15a0 kernel/trace/ring_buffer.c:7059
 tracing_buffers_read+0x2bf/0xaf0 kernel/trace/trace.c:7129
 vfs_read+0x1e4/0xb30 fs/read_write.c:572
 ksys_read+0x12a/0x250 fs/read_write.c:717
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x10b/0x830 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb1aad9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fb1abca5028 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 00007fb1ab015fa0 RCX: 00007fb1aad9ce59
RDX: 0000000000001000 RSI: 00002000000002c0 RDI: 0000000000000008
RBP: 00007fb1aae32d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fb1ab016038 R14: 00007fb1ab015fa0 R15: 00007ffec139a1b8
 </TASK>

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5ceb4
flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x44dc0(GFP_KERNEL|__GFP_ZERO|__GFP_RETRY_MAYFAIL|__GFP_COMP), pid 5959, tgid 5958 (syz.1.37), ts 95924498456, free_ts 95918916027
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0xfd/0x120 mm/page_alloc.c:1858
 prep_new_page mm/page_alloc.c:1866 [inline]
 get_page_from_freelist+0x11a6/0x33b0 mm/page_alloc.c:3946
 __alloc_frozen_pages_noprof+0x27c/0x2bc0 mm/page_alloc.c:5226
 __alloc_pages_noprof+0xb/0x110 mm/page_alloc.c:5260
 __alloc_pages_node_noprof include/linux/gfp.h:289 [inline]
 alloc_pages_node_noprof include/linux/gfp.h:316 [inline]
 alloc_cpu_data+0x60/0x130 kernel/trace/ring_buffer.c:406
 ring_buffer_alloc_read_page+0x430/0x560 kernel/trace/ring_buffer.c:6801
 tracing_buffers_read+0x603/0xaf0 kernel/trace/trace.c:7110
 vfs_read+0x1e4/0xb30 fs/read_write.c:572
 ksys_read+0x12a/0x250 fs/read_write.c:717
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x10b/0x830 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
page last free pid 5946 tgid 5945 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1402 [inline]
 __free_frozen_pages+0x747/0x1040 mm/page_alloc.c:2943
 tlb_batch_list_free mm/mmu_gather.c:161 [inline]
 tlb_finish_mmu+0x27d/0x810 mm/mmu_gather.c:552
 exit_mmap+0x454/0xa10 mm/mmap.c:1313
 __mmput+0x12a/0x410 kernel/fork.c:1178
 mmput+0x67/0x80 kernel/fork.c:1201
 exit_mm kernel/exit.c:582 [inline]
 do_exit+0x8b2/0x2af0 kernel/exit.c:964
 do_group_exit+0xd5/0x2a0 kernel/exit.c:1119
 get_signal+0x20ff/0x2210 kernel/signal.c:3037
 arch_do_signal_or_restart+0x91/0x7a0 arch/x86/kernel/signal.c:337
 __exit_to_user_mode_loop kernel/entry/common.c:64 [inline]
 exit_to_user_mode_loop+0x8b/0x4f0 kernel/entry/common.c:98
 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:207 [inline]
 syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:230 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:318 [inline]
 do_syscall_64+0x6f2/0x830 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Memory state around the buggy address:
 ffff88805ceb4f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff88805ceb4f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88805ceb5000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                   ^
 ffff88805ceb5080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88805ceb5100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup


* Re: [PATCH] tracing/events: Expand ring buffer for in-kernel event enables
From: Steven Rostedt @ 2026-06-02 13:00 UTC
  To: Manjunath Patil
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260601233716.2517987-1-manjunath.b.patil@oracle.com>

On Mon,  1 Jun 2026 16:24:43 -0700
Manjunath Patil <manjunath.b.patil@oracle.com> wrote:

> Ftrace keeps trace arrays at a boot-minimum ring-buffer size until
> tracing is used. Tracefs event-enable paths already call
> tracing_update_buffers() before enabling events, but the exported
> in-kernel helpers trace_set_clr_event() and trace_array_set_clr_event()
> directly enable events through __ftrace_set_clr_event().
> 
> This can leave events enabled by in-kernel users recording into the tiny
> boot-minimum buffer instead of the configured default-sized buffer. Any
> caller that enables events through these exported helpers observes
> different buffer-expansion behavior than a userspace tracefs event enable.
> 
> Expand the relevant trace array before enabling events through the
> exported in-kernel helpers, matching the tracefs event-enable behavior.
> Disabling events remains unchanged.

The above explains everything correctly, but you left out what needs this?

Internal code should not be using the main ring buffer except for
debugging, in which case you can use trace_printk(), which will cause the
tracing buffers to be expanded by default.

Other areas of the kernel should create their own trace array which will be
created expanded by default too.

-- Steve


* Re: [PATCH] rtla: Fix parsing of multi-character short options
From: Tomas Glozar @ 2026-06-02 12:56 UTC
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel
In-Reply-To: <20260602125506.3325345-1-tglozar@redhat.com>

On Tue, Jun 2, 2026 at 14:55, Tomas Glozar <tglozar@redhat.com> wrote:
>
> A bug was reported where the parsing of multi-character short options,
> be it a short option with an argument specified without space (e.g.
> "-p100") or multiple short options in one argument (e.g. -un), ignores
> options specific to individual tools.
>
> Furthermore, if the rest of the option is supposed to be an argument, it
> gets reinterpreted as a string of options. For example, -p100 gets
> interpreted as -100, which, due to a hackish implementation, timerlat
> hist reads as --no-thread --no-irq --no-irq, causing rtla to error
> out:
>
> $ rtla timerlat hist -p100
> no-irq and no-thread set, there is nothing to do here
>
> This behavior is caused by getopt_long() being called twice on each
> argument, once in common_parse_options(), once in [tool]_parse_args():
>
> - common_parse_options() calls getopt_long() with an array of options
>   common for all rtla tools, while suppressing errors (opterr = 0).
> - If the option fails to parse, common_parse_options() returns 0.
> - If 0 is returned from common_parse_options(), [tool]_parse_args()
>   calls getopt_long() again, with its own set of options.
>
> * [tool] means one of {osnoise,timerlat}_{top,hist}
>
> At least in glibc, getopt_long() increments its internal nextchar
> variable even if the option is not recognized. That means that in the
> case of "-p100", common_parse_options() sets nextchar pointing to '1',
> and timerlat_hist_parse_args() sees '1', not 'p'; the same then repeats
> for the first and second '0'.
>
> As there is no way to restore the correct internal state of
> getopt_long() reliably, fix the issue by merging the common options back
> to the longopt array and option string of the [tool]_parse_args()
> functions using a macro; only the switch part is left in the original
> function, which is renamed to set_common_option().
>
> Fixes: 850cd24cb6d6 ("tools/rtla: Add common_parse_options()")
> Reported-by: John Kacur <jkacur@redhat.com>
> Signed-off-by: Tomas Glozar <tglozar@redhat.com>
> ---

Forgot to add a note to the original email: this fix is only for 7.1.
7.0 needs the commit tweaked, and 7.2 will remove the command line
parsing logic entirely and replace it with libsubcmd, where this
works.

Tomas



* [PATCH] rtla: Fix parsing of multi-character short options
From: Tomas Glozar @ 2026-06-02 12:55 UTC
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel

A bug was reported where the parsing of multi-character short options,
be it a short option with an argument specified without space (e.g.
"-p100") or multiple short options in one argument (e.g. -un), ignores
options specific to individual tools.

Furthermore, if the rest of the option is supposed to be an argument, it
gets reinterpreted as a string of options. For example, -p100 gets
interpreted as -100, which, due to a hackish implementation, timerlat
hist reads as --no-thread --no-irq --no-irq, causing rtla to error
out:

$ rtla timerlat hist -p100
no-irq and no-thread set, there is nothing to do here

This behavior is caused by getopt_long() being called twice on each
argument, once in common_parse_options(), once in [tool]_parse_args():

- common_parse_options() calls getopt_long() with an array of options
  common for all rtla tools, while suppressing errors (opterr = 0).
- If the option fails to parse, common_parse_options() returns 0.
- If 0 is returned from common_parse_options(), [tool]_parse_args()
  calls getopt_long() again, with its own set of options.

* [tool] means one of {osnoise,timerlat}_{top,hist}

At least in glibc, getopt_long() increments its internal nextchar
variable even if the option is not recognized. That means that in the
case of "-p100", common_parse_options() sets nextchar pointing to '1',
and timerlat_hist_parse_args() sees '1', not 'p'; the same then repeats
for the first and second '0'.

As there is no way to restore the correct internal state of
getopt_long() reliably, fix the issue by merging the common options back
to the longopt array and option string of the [tool]_parse_args()
functions using a macro; only the switch part is left in the original
function, which is renamed to set_common_option().

Fixes: 850cd24cb6d6 ("tools/rtla: Add common_parse_options()")
Reported-by: John Kacur <jkacur@redhat.com>
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
 tools/tracing/rtla/src/common.c        | 28 +++++---------------------
 tools/tracing/rtla/src/common.h        | 12 ++++++++++-
 tools/tracing/rtla/src/osnoise_hist.c  |  7 ++++---
 tools/tracing/rtla/src/osnoise_top.c   |  7 ++++---
 tools/tracing/rtla/src/timerlat_hist.c |  7 ++++---
 tools/tracing/rtla/src/timerlat_top.c  |  7 ++++---
 6 files changed, 32 insertions(+), 36 deletions(-)

diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
index 35e3d3aa922e..bc9d01ddd102 100644
--- a/tools/tracing/rtla/src/common.c
+++ b/tools/tracing/rtla/src/common.c
@@ -84,37 +84,20 @@ int getopt_auto(int argc, char **argv, const struct option *long_opts)
 }
 
 /*
- * common_parse_options - parse common command line options
+ * set_common_option - set common options
  *
+ * @c: option character
  * @argc: argument count
  * @argv: argument vector
  * @common: common parameters structure
  *
  * Parse command line options that are common to all rtla tools.
  *
- * Returns: non zero if a common option was parsed, or 0
- * if the option should be handled by tool-specific parsing.
+ * Returns: 1 if the option was set, 0 otherwise.
  */
-int common_parse_options(int argc, char **argv, struct common_params *common)
+int set_common_option(int c, int argc, char **argv, struct common_params *common)
 {
 	struct trace_events *tevent;
-	int saved_state = optind;
-	int c;
-
-	static struct option long_options[] = {
-		{"cpus",                required_argument,      0, 'c'},
-		{"cgroup",              optional_argument,      0, 'C'},
-		{"debug",               no_argument,            0, 'D'},
-		{"duration",            required_argument,      0, 'd'},
-		{"event",               required_argument,      0, 'e'},
-		{"house-keeping",       required_argument,      0, 'H'},
-		{"priority",            required_argument,      0, 'P'},
-		{0, 0, 0, 0}
-	};
-
-	opterr = 0;
-	c = getopt_auto(argc, argv, long_options);
-	opterr = 1;
 
 	switch (c) {
 	case 'c':
@@ -154,11 +137,10 @@ int common_parse_options(int argc, char **argv, struct common_params *common)
 		common->set_sched = 1;
 		break;
 	default:
-		optind = saved_state;
 		return 0;
 	}
 
-	return c;
+	return 1;
 }
 
 /*
diff --git a/tools/tracing/rtla/src/common.h b/tools/tracing/rtla/src/common.h
index 51665db4ffce..8921807bda98 100644
--- a/tools/tracing/rtla/src/common.h
+++ b/tools/tracing/rtla/src/common.h
@@ -178,7 +178,17 @@ int osnoise_set_stop_total_us(struct osnoise_context *context,
 			      long long stop_total_us);
 
 int getopt_auto(int argc, char **argv, const struct option *long_opts);
-int common_parse_options(int argc, char **argv, struct common_params *common);
+
+#define COMMON_OPTIONS \
+	{"cpus",                required_argument,      0, 'c'},\
+	{"cgroup",              optional_argument,      0, 'C'},\
+	{"debug",               no_argument,            0, 'D'},\
+	{"duration",            required_argument,      0, 'd'},\
+	{"event",               required_argument,      0, 'e'},\
+	{"house-keeping",       required_argument,      0, 'H'},\
+	{"priority",            required_argument,      0, 'P'}
+int set_common_option(int c, int argc, char **argv, struct common_params *common);
+
 int common_apply_config(struct osnoise_tool *tool, struct common_params *params);
 int top_main_loop(struct osnoise_tool *tool);
 int hist_main_loop(struct osnoise_tool *tool);
diff --git a/tools/tracing/rtla/src/osnoise_hist.c b/tools/tracing/rtla/src/osnoise_hist.c
index 8ad816b80265..cb4ce58c5987 100644
--- a/tools/tracing/rtla/src/osnoise_hist.c
+++ b/tools/tracing/rtla/src/osnoise_hist.c
@@ -475,6 +475,7 @@ static struct common_params
 
 	while (1) {
 		static struct option long_options[] = {
+			COMMON_OPTIONS,
 			{"auto",		required_argument,	0, 'a'},
 			{"bucket-size",		required_argument,	0, 'b'},
 			{"entries",		required_argument,	0, 'E'},
@@ -498,15 +499,15 @@ static struct common_params
 			{0, 0, 0, 0}
 		};
 
-		if (common_parse_options(argc, argv, &params->common))
-			continue;
-
 		c = getopt_auto(argc, argv, long_options);
 
 		/* detect the end of the options. */
 		if (c == -1)
 			break;
 
+		if (set_common_option(c, argc, argv, &params->common))
+			continue;
+
 		switch (c) {
 		case 'a':
 			/* set sample stop to auto_thresh */
diff --git a/tools/tracing/rtla/src/osnoise_top.c b/tools/tracing/rtla/src/osnoise_top.c
index 244bdce022ad..e65312ec26c4 100644
--- a/tools/tracing/rtla/src/osnoise_top.c
+++ b/tools/tracing/rtla/src/osnoise_top.c
@@ -328,6 +328,7 @@ struct common_params *osnoise_top_parse_args(int argc, char **argv)
 
 	while (1) {
 		static struct option long_options[] = {
+			COMMON_OPTIONS,
 			{"auto",		required_argument,	0, 'a'},
 			{"help",		no_argument,		0, 'h'},
 			{"period",		required_argument,	0, 'p'},
@@ -346,15 +347,15 @@ struct common_params *osnoise_top_parse_args(int argc, char **argv)
 			{0, 0, 0, 0}
 		};
 
-		if (common_parse_options(argc, argv, &params->common))
-			continue;
-
 		c = getopt_auto(argc, argv, long_options);
 
 		/* Detect the end of the options. */
 		if (c == -1)
 			break;
 
+		if (set_common_option(c, argc, argv, &params->common))
+			continue;
+
 		switch (c) {
 		case 'a':
 			/* set sample stop to auto_thresh */
diff --git a/tools/tracing/rtla/src/timerlat_hist.c b/tools/tracing/rtla/src/timerlat_hist.c
index 79142af4f566..4b6708e333b8 100644
--- a/tools/tracing/rtla/src/timerlat_hist.c
+++ b/tools/tracing/rtla/src/timerlat_hist.c
@@ -785,6 +785,7 @@ static struct common_params
 
 	while (1) {
 		static struct option long_options[] = {
+			COMMON_OPTIONS,
 			{"auto",		required_argument,	0, 'a'},
 			{"bucket-size",		required_argument,	0, 'b'},
 			{"entries",		required_argument,	0, 'E'},
@@ -819,11 +820,11 @@ static struct common_params
 			{0, 0, 0, 0}
 		};
 
-		if (common_parse_options(argc, argv, &params->common))
-			continue;
-
 		c = getopt_auto(argc, argv, long_options);
 
+		if (set_common_option(c, argc, argv, &params->common))
+			continue;
+
 		/* detect the end of the options. */
 		if (c == -1)
 			break;
diff --git a/tools/tracing/rtla/src/timerlat_top.c b/tools/tracing/rtla/src/timerlat_top.c
index 64cbdcc878b0..91f88bbebad9 100644
--- a/tools/tracing/rtla/src/timerlat_top.c
+++ b/tools/tracing/rtla/src/timerlat_top.c
@@ -549,6 +549,7 @@ static struct common_params
 
 	while (1) {
 		static struct option long_options[] = {
+			COMMON_OPTIONS,
 			{"auto",		required_argument,	0, 'a'},
 			{"help",		no_argument,		0, 'h'},
 			{"irq",			required_argument,	0, 'i'},
@@ -577,11 +578,11 @@ static struct common_params
 			{0, 0, 0, 0}
 		};
 
-		if (common_parse_options(argc, argv, &params->common))
-			continue;
-
 		c = getopt_auto(argc, argv, long_options);
 
+		if (set_common_option(c, argc, argv, &params->common))
+			continue;
+
 		/* detect the end of the options. */
 		if (c == -1)
 			break;
-- 
2.54.0
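
The single-pass approach described in the commit message above can be
sketched as a standalone program fragment. This is a minimal sketch, not
the rtla code: the demo_* names, the two structures, and the reduced
option set are all hypothetical. It shows the shape of the fix, where
the common entries are spliced into the tool's own table by a macro, the
scanner is called once per option, and only the switch for common
options stays in a shared helper.

```c
#include <getopt.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the common and tool parameter structures. */
struct demo_common {
	int debug;
	const char *cpus;
};

struct demo_params {
	struct demo_common common;
	long period;
	int nano;
};

/* Shared table entries, spliced into each tool's own option array. */
#define DEMO_COMMON_OPTIONS \
	{"cpus",  required_argument, 0, 'c'}, \
	{"debug", no_argument,       0, 'D'}

/* Only the switch is shared: returns 1 if c was a common option. */
static int demo_set_common_option(int c, struct demo_common *common)
{
	switch (c) {
	case 'c':
		common->cpus = optarg;
		return 1;
	case 'D':
		common->debug = 1;
		return 1;
	default:
		return 0;
	}
}

/* Single pass: one getopt_long() call per option. Returns 0 on success. */
static int demo_parse_args(int argc, char **argv, struct demo_params *p)
{
	static const struct option long_options[] = {
		DEMO_COMMON_OPTIONS,
		{"period", required_argument, 0, 'p'},
		{"nano",   no_argument,       0, 'n'},
		{0, 0, 0, 0}
	};
	int c;

	optind = 0;	/* restart scanning from the first argument */
	while ((c = getopt_long(argc, argv, "c:Dp:n", long_options, NULL)) != -1) {
		if (demo_set_common_option(c, &p->common))
			continue;

		switch (c) {
		case 'p':
			p->period = strtol(optarg, NULL, 10);
			break;
		case 'n':
			p->nano = 1;
			break;
		default:
			return -1;	/* unknown option or missing argument */
		}
	}
	return 0;
}
```

Because a single scanner sees every option character, an attached
argument such as -p100 and a cluster such as -nD are both handled by the
usual getopt rules, with no state shared between two competing callers.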



* Re: [PATCH v2 4/8] riscv: ftrace: always preserve s0 in dynamic ftrace register frame
From: Shuai Xue @ 2026-06-02 11:37 UTC
  To: Wang Han, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, zhuo.song, jkchen, linux-riscv,
	linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
	linux-perf-users
In-Reply-To: <20260528082310.1994388-5-wanghan@linux.alibaba.com>

On 5/28/26 4:23 PM, Wang Han wrote:
> The dynamic ftrace entry/exit only saved s0 (the architectural frame
> pointer) when HAVE_FUNCTION_GRAPH_FP_TEST was selected. The upcoming
> reliable frame-pointer unwinder needs s0 to be present in
> ftrace_regs unconditionally so it can use the frame pointer as the
> function-graph return-address cookie regardless of FP_TEST.

Nit: A preferred commit log:

struct __arch_ftrace_regs declares s0 unconditionally, and both
ftrace_regs_get_frame_pointer() and ftrace_partial_regs() read it
unconditionally. But the SAVE_ABI_REGS / RESTORE_ABI_REGS macros in
mcount-dyn.S only stored s0 under HAVE_FUNCTION_GRAPH_FP_TEST
(CONFIG_FUNCTION_GRAPH_TRACER && CONFIG_FRAME_POINTER). With
CONFIG_FRAME_POINTER=n the slot held whatever was on the stack before,
so any callback going through ftrace_partial_regs() saw a garbage
regs->s0. RISC-V kernels default to FRAME_POINTER=y, which is why
this has not bitten in practice.

Save and restore s0 unconditionally in the dynamic ftrace ABI register
frame. This fixes the latent garbage-s0 case, brings the dynamic ftrace
path in line with the static _mcount path (mcount.S SAVE_ABI_STATE
already saves s0 unconditionally), and matches the frame layout already
documented in the comment above SAVE_ABI_REGS. It is also a
prerequisite for the upcoming reliable unwinder, which reads
ftrace_regs_get_frame_pointer(fregs) directly.

The cost is one extra REG_S/REG_L pair per traced call, negligible
compared to the overall ftrace cost; the existing FREGS_SIZE_ON_STACK
already reserved the slot, so no extra stack space is used.

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>

Thanks.
Shuai


* Re: [PATCH v2 3/8] riscv: stacktrace: disable KASAN instrumentation for stacktrace.o
From: Shuai Xue @ 2026-06-02 11:22 UTC
  To: Wang Han, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, zhuo.song, jkchen, linux-riscv,
	linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
	linux-perf-users
In-Reply-To: <20260528082310.1994388-4-wanghan@linux.alibaba.com>



On 5/28/26 4:23 PM, Wang Han wrote:
> KASAN records stack traces for every alloc/free, which means it walks
> the unwinder very frequently. Instrumenting the stack trace collection
> code itself adds substantial overhead and makes the traces themselves
> noisier.
> 
> Mark stacktrace.o as not KASAN-instrumented, matching the arm, arm64
> and x86 treatment of their stack unwinding code. This is a prerequisite
> preference for the upcoming reliable unwinder, but the change is valid
> on its own.
> 
> Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
> ---
>   arch/riscv/kernel/Makefile | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index cabb99cadfb6..1cb6c9ab2981 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -44,6 +44,11 @@ CFLAGS_REMOVE_return_address.o	= $(CC_FLAGS_FTRACE)
>   CFLAGS_REMOVE_sbi_ecall.o = $(CC_FLAGS_FTRACE)
>   endif
>   
> +# When KASAN is enabled, a stack trace is recorded for every alloc/free, which
> +# can significantly impact performance. Avoid instrumenting the stack trace
> +# collection code to minimize this impact.
> +KASAN_SANITIZE_stacktrace.o := n
> +

I checked the three referenced arches:
   - arm    (arch/arm/kernel/Makefile):    KASAN only
   - arm64  (arch/arm64/kernel/Makefile):  KASAN only
   - x86    (arch/x86/kernel/Makefile):    KASAN *and* KCOV
            (KCOV_INSTRUMENT_stacktrace.o := n, plus dumpstack and
             the unwind_*.o TUs)

So as written, this patch matches arm/arm64 but NOT x86. KCOV
instruments every basic-block edge; the unwinder is a hot path
(doubly so under KASAN, where it runs on every alloc/free), so the
same rationale that justifies disabling KASAN applies to KCOV. I'd
suggest making the claim true by adding:

  KCOV_INSTRUMENT_stacktrace.o := n

(RISC-V keeps its entire unwinder in stacktrace.o, so unlike x86
there's no dumpstack/unwind_*.o to also annotate — the single TU
covers the equivalent scope.)

Alternatively, if you'd rather keep it minimal, just drop "and x86"
from the changelog so the claim matches the code.


Thanks.
Shuai


* Re: [PATCH 1/2] rtla/timerlat: Fix parsing of short options with attached arguments
From: Tomas Glozar @ 2026-06-02 11:18 UTC
  To: John Kacur
  Cc: Steven Rostedt, linux-trace-kernel, Costa Shulyupin,
	Wander Lairson Costa, Crystal Wood, Luis Claudio R . Goncalves,
	linux-kernel
In-Reply-To: <20260601211538.381649-1-jkacur@redhat.com>

On Mon, Jun 1, 2026 at 23:15, John Kacur <jkacur@redhat.com> wrote:
>
> The timerlat hist command fails to parse short options with attached
> numeric arguments (e.g., -p100) due to conflicts between digit characters
> used as option values and numeric arguments to other options.
>
> This issue was discovered when testing rtla 7.1.0-rc6 with rteval,
> which passes arguments in the compact -p100 format. The rteval tests
> failed with the confusing error "no-irq and no-thread set, there is
> nothing to do here" even though neither option was specified.
>
> The root cause is two-fold:
>
> 1. Digit characters ('0'-'9') were used as short option values for
>    long-only options like --no-irq, --no-thread, etc. This caused
>    getopt_auto() to generate an option string like 'a:b:...:u0123456:7:8:9'
>    which made getopt treat digits as valid option characters.
>
> 2. The two-phase option parsing approach (alternating calls between
>    common_parse_options() and local option parsing) confused getopt's
>    internal state when encountering arguments like -p100.
>

What actually happens is that the call to getopt_long() in
common_parse_options() does not recognize -p, but it still increments
the internal static variable nextchar by 1 before falling back to
timerlat_hist_parse_args()'s getopt_long(). That means -p is ignored
entirely and timerlat_hist_parse_args() only sees -100.

Note that an option argument is not required to trigger the bug:

$ rtla timerlat hist -nu -c 0 -d 1s
# RTLA timerlat histogram
# Time unit is microseconds (us)
# Duration:   0 00:00:02
(rtla 7.0)

vs:

$ rtla timerlat hist -nu -c 0 -d 1s
# RTLA timerlat histogram
# Time unit is nanoseconds (ns)
# Duration:   0 00:00:02
(rtla 6.19)

Again, the nanosecond option (-n) gets dropped because
common_parse_options() advances nextchar past it.
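
The way a shared cursor drops an option can be modeled with a toy
scanner. This is a sketch for illustration only, not glibc's
implementation: the toy_* names and the option strings are hypothetical,
and the model covers a single argv element. It captures the one property
that matters here, namely that the cursor advances even when the
character is rejected.

```c
#include <stddef.h>
#include <string.h>

/* Toy cursor shared by every caller, like getopt's internal nextchar. */
static const char *toy_nextchar;

/* Begin scanning one argv element such as "-nu". */
static void toy_begin(const char *arg)
{
	toy_nextchar = arg + 1;	/* skip the leading '-' */
}

/*
 * Return the next option character if optstring contains it, '?' if it
 * does not, or -1 once the element is exhausted. The cursor advances
 * in every case, including the '?' case.
 */
static int toy_getopt(const char *optstring)
{
	char c;

	if (!toy_nextchar || *toy_nextchar == '\0')
		return -1;
	c = *toy_nextchar++;
	return strchr(optstring, c) ? c : '?';
}

/*
 * Two-phase parse of one element: try the common set first and fall
 * back to the tool set when the common set rejects the character.
 * Stores what the tool phase saw in seen[] and returns the count.
 */
static size_t toy_two_phase(const char *arg, char *seen, size_t max)
{
	size_t n = 0;
	int c;

	toy_begin(arg);
	for (;;) {
		c = toy_getopt("cDd");	/* phase 1: common options */
		if (c == -1)
			break;
		if (c != '?')
			continue;	/* handled as a common option */
		c = toy_getopt("nu");	/* phase 2: tool options */
		if (c == -1)
			break;
		if (n < max)
			seen[n++] = (char)c;
	}
	return n;
}
```

With "-nu", the common phase consumes 'n' and rejects it, so the tool
phase only ever sees 'u', even though 'n' is a valid tool option.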

> When a user passed -p100, getopt would incorrectly parse it as four
> separate options: -p, -1, -0, and -0, silently setting no_irq and
> no_thread flags instead of recognizing "100" as the period argument.
>
> The two-phase parsing was introduced in commit 850cd24cb6d6 ("tools/rtla:
> Add common_parse_options()") which first appeared in v7.0-rc1. Prior to
> that commit, -p100 worked correctly. The digit characters as option
> values existed since the original timerlat implementation, but only
> became problematic when combined with the two-phase parsing approach.
>

Note that RTLA documentation only ever mentions the syntax "-p 100".
Nevertheless, this is a real regression, and it's not unreasonable for
users to assume the syntax without the space also works, as is common
for most commands on Un*x, for example, gcc's -I/include/path syntax.

> Fix this by:
>
> 1. Eliminating digit characters from the option string by filtering them
>    out in getopt_auto(). This prevents conflicts with numeric arguments.
>
> 2. Refactoring timerlat_hist_parse_args() to use single-pass option
>    parsing. Instead of alternating between common_parse_options() and
>    local parsing, merge all options (common and local) into a single
>    option table and parse them in one pass. This matches the approach
>    used by cyclictest and other tools.

This is a partial revert of the common_parse_options() patchset [1]. It
does fix the bug, but only for one tool (timerlat hist).
getopt_long()'s design does not allow the caller to restore its internal
nextchar variable; the state can be reset (by calling it with
optind = 0, not 1 as the documentation says) but that would require a
lot of work, as we'd have to calculate and restore the original
nextchar. It might be easiest to revert the entire consolidation
patchset [1], if that's worth it.

[1] https://lore.kernel.org/linux-trace-kernel/20251209100047.2692515-1-costa.shul@redhat.com/T/#u

<truncated>



Tomas


^ permalink raw reply

* Re: [PATCH v2 2/8] riscv: stacktrace: Add frame record metadata
From: Shuai Xue @ 2026-06-02 11:18 UTC (permalink / raw)
  To: Wang Han, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, zhuo.song, jkchen, linux-riscv,
	linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
	linux-perf-users
In-Reply-To: <20260528082310.1994388-3-wanghan@linux.alibaba.com>



On 5/28/26 4:23 PM, Wang Han wrote:
> Reliable frame-pointer unwinding needs an explicit way to identify
> exception boundaries and the final entry frame. The existing unwinder
> infers those boundaries from return addresses, which is too loose for a
> future reliable unwinder.
> 
> Add a small metadata frame record to pt_regs and initialize it on
> exception entry, kernel thread fork, user fork, and early idle task
> setup. The record uses a zero {fp, ra} sentinel plus a type field so a
> later unwinder can distinguish a final user-to-kernel boundary from a
> nested kernel pt_regs boundary.
> 
> This follows the arm64 metadata frame-record model, adapted to the
> RISC-V {fp, ra} frame record convention.
> 
> The metadata is established at the RISC-V entry boundaries that need an
> explicit unwind marker:
> 
>    * exception entry clears the metadata {fp, ra} pair and uses SPP
>      (or MPP in M-mode) to record whether the pt_regs frame is the final
>      user-to-kernel boundary or a nested kernel boundary;
>    * _start_kernel builds the init task's final metadata record, while
>      the secondary CPU path sets up s0 before smp_callin() so idle-task
>      unwinding does not inherit an undefined caller frame;
>    * copy_thread creates matching final metadata records for new kernel
>      and user tasks, and keeps s0 available for the frame-pointer chain;
>    * call_on_irq_stack still reserves an aligned stack slot, but links the
>      saved {fp, ra} with the raw frame-record size so s0 points at the
>      RISC-V frame record rather than past the alignment padding.
> 
> These changes keep s0 reserved for the frame-pointer chain at task and
> stack-switch boundaries.
> 
> Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
> ---
>   arch/riscv/include/asm/ptrace.h           |  9 ++++
>   arch/riscv/include/asm/stacktrace/frame.h | 53 +++++++++++++++++++++++
>   arch/riscv/kernel/asm-offsets.c           |  4 ++
>   arch/riscv/kernel/entry.S                 | 30 +++++++++++--
>   arch/riscv/kernel/head.S                  | 23 ++++++++++
>   arch/riscv/kernel/process.c               | 31 ++++++++++++-
>   6 files changed, 144 insertions(+), 6 deletions(-)
>   create mode 100644 arch/riscv/include/asm/stacktrace/frame.h
> 
> diff --git a/arch/riscv/include/asm/ptrace.h b/arch/riscv/include/asm/ptrace.h
> index addc8188152f..4b9b0f279214 100644
> --- a/arch/riscv/include/asm/ptrace.h
> +++ b/arch/riscv/include/asm/ptrace.h
> @@ -8,6 +8,7 @@
>   
>   #include <uapi/asm/ptrace.h>
>   #include <asm/csr.h>
> +#include <asm/stacktrace/frame.h>
>   #include <linux/compiler.h>
>   
>   #ifndef __ASSEMBLER__
> @@ -53,6 +54,14 @@ struct pt_regs {
>   	unsigned long cause;
>   	/* a0 value before the syscall */
>   	unsigned long orig_a0;
> +
> +	/*
> +	 * This frame record is entirely zeroed on exception entry, allowing the
> +	 * unwinder to identify exception boundaries. The type field encodes
> +	 * whether the exception was taken from user (FINAL) or kernel (PT_REGS)
> +	 * mode.
> +	 */
> +	struct frame_record_meta stackframe;
>   };
>   
>   #define PTRACE_SYSEMU			0x1f
> diff --git a/arch/riscv/include/asm/stacktrace/frame.h b/arch/riscv/include/asm/stacktrace/frame.h
> new file mode 100644
> index 000000000000..5720a6c65fe8
> --- /dev/null
> +++ b/arch/riscv/include/asm/stacktrace/frame.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __ASM_RISCV_STACKTRACE_FRAME_H
> +#define __ASM_RISCV_STACKTRACE_FRAME_H
> +
> +/*
> + * See: arch/arm64/include/asm/stacktrace/frame.h for the reference
> + * implementation.
> + */
> +
> +/*
> + * - FRAME_META_TYPE_NONE
> + *
> + *   This value is reserved.
> + *
> + * - FRAME_META_TYPE_FINAL
> + *
> + *   The record is the last entry on the stack.
> + *   Unwinding should terminate successfully.
> + *
> + * - FRAME_META_TYPE_PT_REGS
> + *
> + *   The record is embedded within a struct pt_regs, recording the registers at
> + *   an arbitrary point in time.
> + *   Unwinding should consume pt_regs::epc, followed by pt_regs::ra.
> + *
> + * Note: all other values are reserved and should result in unwinding
> + * terminating with an error.
> + */
> +#define FRAME_META_TYPE_NONE		0
> +#define FRAME_META_TYPE_FINAL		1
> +#define FRAME_META_TYPE_PT_REGS		2
> +
> +#ifndef __ASSEMBLER__
> +/*
> + * A standard RISC-V frame record.
> + */
> +struct frame_record {
> +	unsigned long fp;
> +	unsigned long ra;
> +};
> +
> +/*
> + * A metadata frame record indicating a special unwind.
> + * The record::{fp,ra} fields must be zero to indicate the presence of
> + * metadata.
> + */
> +struct frame_record_meta {
> +	struct frame_record record;
> +	unsigned long type;
> +};
> +#endif /* __ASSEMBLER__ */
> +
> +#endif /* __ASM_RISCV_STACKTRACE_FRAME_H */
> diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> index af827448a609..8dfcb5a44bb8 100644
> --- a/arch/riscv/kernel/asm-offsets.c
> +++ b/arch/riscv/kernel/asm-offsets.c
> @@ -131,6 +131,9 @@ void asm_offsets(void)
>   	OFFSET(PT_BADADDR, pt_regs, badaddr);
>   	OFFSET(PT_CAUSE, pt_regs, cause);
>   
> +	DEFINE(S_STACKFRAME,		offsetof(struct pt_regs, stackframe));
> +	DEFINE(S_STACKFRAME_TYPE,	offsetof(struct pt_regs, stackframe.type));
> +
>   	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
>   
>   	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> @@ -501,6 +504,7 @@ void asm_offsets(void)
>   	OFFSET(SBI_HART_BOOT_STACK_PTR_OFFSET, sbi_hart_boot_data, stack_ptr);
>   
>   	DEFINE(STACKFRAME_SIZE_ON_STACK, ALIGN(sizeof(struct stackframe), STACK_ALIGN));
> +	DEFINE(STACKFRAME_RECORD_SIZE, sizeof(struct stackframe));
>   	OFFSET(STACKFRAME_FP, stackframe, fp);
>   	OFFSET(STACKFRAME_RA, stackframe, ra);
>   #ifdef CONFIG_FUNCTION_TRACER
> diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> index d011fb51c59a..9cae0e1eba1c 100644
> --- a/arch/riscv/kernel/entry.S
> +++ b/arch/riscv/kernel/entry.S
> @@ -11,6 +11,7 @@
>   #include <asm/asm.h>
>   #include <asm/csr.h>
>   #include <asm/scs.h>
> +#include <asm/stacktrace/frame.h>
>   #include <asm/unistd.h>
>   #include <asm/page.h>
>   #include <asm/thread_info.h>
> @@ -193,6 +194,27 @@ SYM_CODE_START(handle_exception)
>   	REG_S s4, PT_CAUSE(sp)
>   	REG_S s5, PT_TP(sp)
>   
> +	/*
> +	 * Create a metadata frame record. The unwinder will use this to
> +	 * identify and unwind exception boundaries.
> +	 */
> +	REG_S zero, (S_STACKFRAME + STACKFRAME_FP)(sp) /* stackframe.record.fp = 0 */
> +	REG_S zero, (S_STACKFRAME + STACKFRAME_RA)(sp) /* stackframe.record.ra = 0 */
> +#ifdef CONFIG_RISCV_M_MODE
> +	li t0, SR_MPP
> +	and t0, s1, t0
> +#else
> +	andi t0, s1, SR_SPP
> +#endif
> +	bnez t0, 1f
> +	li t0, FRAME_META_TYPE_FINAL
> +	j 2f
> +1:
> +	li t0, FRAME_META_TYPE_PT_REGS
> +2:
> +	REG_S t0, S_STACKFRAME_TYPE(sp)
> +	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
> +

One spot for symmetry (non-blocking, robustness only):

handle_kernel_stack_overflow in entry.S allocates a full
PT_SIZE_ON_STACK frame (including the new stackframe metadata fields)
but, unlike .Lsave_context, never initialises stackframe.{record,type}
nor repoints s0 at the metadata. Those three words are therefore left
as whatever was on the overflow_stack.

In practice this is currently harmless: handle_bad_stack() only calls
__show_regs() (register dump, no unwind) followed by panic(), so the
reliable unwinder never actually consumes that metadata today. So this
is not an active bug — purely a robustness / symmetry gap.

It would still be worth initialising it, because the moment someone
adds a dump_stack() here, or another CPU NMI-backtraces this task, or a
kdump image is walked offline via the frame-record chain, the garbage
type byte would mislead the unwinder. Since the overflow path is by
definition entered from kernel context, FRAME_META_TYPE_PT_REGS is the
right type, and it has the nice property that the unwinder will resume
from frame_pointer(regs)==regs->s0 (the pre-overflow s0 is already
saved into PT_S0 by save_from_x6_to_x31), giving the pre-overflow call
chain instead of a hard stop.
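
To make the concern concrete, here is a rough user-space sketch of how
a consumer of the metadata record might classify it. The struct layout
and type values are copied from the patch; classify_record() and the
enum are hypothetical names, and this is not the unwinder from the
series:

```c
#define FRAME_META_TYPE_NONE		0
#define FRAME_META_TYPE_FINAL		1
#define FRAME_META_TYPE_PT_REGS		2

struct frame_record {
	unsigned long fp;
	unsigned long ra;
};

struct frame_record_meta {
	struct frame_record record;
	unsigned long type;
};

/* Hypothetical result codes for one unwind step. */
enum unwind_step {
	UNWIND_NORMAL,	/* ordinary {fp, ra} record, keep walking */
	UNWIND_FINAL,	/* end of stack, terminate successfully */
	UNWIND_PT_REGS,	/* resume from the embedding pt_regs */
	UNWIND_ERROR,	/* reserved or unknown type, stop with an error */
};

static enum unwind_step classify_record(const struct frame_record_meta *meta)
{
	/* A non-zero fp or ra means this is not a metadata record. */
	if (meta->record.fp || meta->record.ra)
		return UNWIND_NORMAL;

	/* The type field is only meaningful behind the zero sentinel. */
	switch (meta->type) {
	case FRAME_META_TYPE_FINAL:
		return UNWIND_FINAL;
	case FRAME_META_TYPE_PT_REGS:
		return UNWIND_PT_REGS;
	default:
		return UNWIND_ERROR;
	}
}
```

With uninitialised words from the overflow stack, the same record can
land in any of the four cases, which is the robustness gap described
above.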


>   	/*
>   	 * Set the scratch register to 0, so that if a recursive exception
>   	 * occurs, the exception vector knows it came from the kernel
> @@ -357,8 +379,8 @@ ASM_NOKPROBE(handle_kernel_stack_overflow)
>   
>   SYM_CODE_START(ret_from_fork_kernel_asm)
>   	call schedule_tail
> -	move a0, s1 /* fn_arg */
> -	move a1, s0 /* fn */
> +	move a0, s3 /* fn_arg */
> +	move a1, s2 /* fn */
>   	move a2, sp /* pt_regs */
>   	call ret_from_fork_kernel
>   	j ret_from_exception
> @@ -383,7 +405,7 @@ SYM_FUNC_START(call_on_irq_stack)
>   	addi	sp, sp, -STACKFRAME_SIZE_ON_STACK
>   	REG_S	ra, STACKFRAME_RA(sp)
>   	REG_S	s0, STACKFRAME_FP(sp)
> -	addi	s0, sp, STACKFRAME_SIZE_ON_STACK
> +	addi	s0, sp, STACKFRAME_RECORD_SIZE
>   
>   	/* Switch to the per-CPU shadow call stack */
>   	scs_save_current
> @@ -399,7 +421,7 @@ SYM_FUNC_START(call_on_irq_stack)
>   	scs_load_current
>   
>   	/* Switch back to the thread stack and restore ra and s0 */
> -	addi	sp, s0, -STACKFRAME_SIZE_ON_STACK
> +	addi	sp, s0, -STACKFRAME_RECORD_SIZE

Worth calling out explicitly that this is more than a cosmetic refactor:
on RV32 the previous code is actually wrong, and this hunk fixes it.

   STACKFRAME_SIZE_ON_STACK = ALIGN(sizeof(struct stackframe), STACK_ALIGN)
   STACKFRAME_RECORD_SIZE   = sizeof(struct stackframe)

   RV64: sizeof(stackframe) == STACK_ALIGN == 16, so the two are equal
         and the old code happened to work.
   RV32: sizeof(stackframe) == 8 but STACK_ALIGN == 16, so the old
         "addi s0, sp, STACKFRAME_SIZE_ON_STACK" left s0 pointing 8 bytes
         past the saved {fp, ra} pair, into the alignment padding. An FP
         unwinder that derives the frame record from s0 (e.g. via
         "(struct stackframe *)s0 - 1" or fixed -8/-16(s0) loads) would
         then read garbage instead of the saved fp/ra at the IRQ-stack
         boundary.

After the change s0 lands exactly at the end of the {fp, ra} record on
both RV32 and RV64, while the aligned slot is still reserved by the
unchanged "addi sp, sp, -STACKFRAME_SIZE_ON_STACK" / matching restore.
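
The arithmetic can be checked with a small stand-alone sketch. It
assumes STACK_ALIGN is 16 on both RV32 and RV64 and that the record is
two XLEN-sized words, as stated above; padding_past_record() is a
made-up helper name:

```c
#define STACK_ALIGN		16UL
#define ALIGN_UP(x, a)		(((x) + (a) - 1) & ~((a) - 1))

/*
 * Distance in bytes between the end of the {fp, ra} record and the end
 * of the aligned stack slot, i.e. how far the old
 * "addi s0, sp, STACKFRAME_SIZE_ON_STACK" overshot the record.
 */
static unsigned long padding_past_record(unsigned long xlen_bytes)
{
	/* Models STACKFRAME_RECORD_SIZE: two registers, fp and ra. */
	unsigned long record_size = 2 * xlen_bytes;
	/* Models STACKFRAME_SIZE_ON_STACK. */
	unsigned long slot_size = ALIGN_UP(record_size, STACK_ALIGN);

	return slot_size - record_size;
}
```

padding_past_record(8) gives 0 for RV64 and padding_past_record(4)
gives 8 for RV32, matching the two cases listed above.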

Could you mention this in the v3 commit message? It's load-bearing
context for anyone bisecting an RV32 unwind regression later, and it
also justifies why the change is correct to apply ahead of the reliable
unwinder rather than folded into it.

>   	REG_L	ra, STACKFRAME_RA(sp)
>   	REG_L	s0, STACKFRAME_FP(sp)
>   	addi	sp, sp, STACKFRAME_SIZE_ON_STACK
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index f6a8ca49e627..00e16a24f149 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -14,6 +14,7 @@
>   #include <asm/hwcap.h>
>   #include <asm/image.h>
>   #include <asm/scs.h>
> +#include <asm/stacktrace/frame.h>
>   #include <asm/usercfi.h>
>   #include "efi-header.S"
>   
> @@ -177,6 +178,14 @@ secondary_start_sbi:
>   	REG_S a0, (a1)
>   1:
>   #endif
> +
> +	/*
> +	 * Set up the frame pointer for the secondary idle task so reliable
> +	 * stack unwinding terminates at the metadata frame in task_pt_regs().
> +	 * Without this, the first frame records can inherit an undefined caller
> +	 * fp and unwind past smp_callin() into .Lsecondary_park.
> +	 */
> +	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
>   	scs_load_current
>   	call smp_callin
>   #endif /* CONFIG_SMP */
> @@ -305,6 +314,20 @@ SYM_CODE_START(_start_kernel)
>   	la tp, init_task
>   	la sp, init_thread_union + THREAD_SIZE
>   	addi sp, sp, -PT_SIZE_ON_STACK
> +
> +	/*
> +	 * Set up a metadata frame record for the init task so that
> +	 * the unwinder can identify the outermost frame by its
> +	 * {fp, ra} = {0, 0} sentinel at the bottom of pt_regs.
> +	 * fp/s0 points above the metadata record (RISC-V
> +	 * convention).
> +	 */
> +	REG_S zero, (S_STACKFRAME + STACKFRAME_FP)(sp)
> +	REG_S zero, (S_STACKFRAME + STACKFRAME_RA)(sp)
> +	li t0, FRAME_META_TYPE_FINAL
> +	REG_S t0, S_STACKFRAME_TYPE(sp)
> +	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
> +
>   #if defined(CONFIG_RISCV_SBI) && defined(CONFIG_RISCV_USER_CFI)
>   	li a7, SBI_EXT_FWFT
>   	li a6, SBI_EXT_FWFT_SET
> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
> index b2df7f72241a..5212926b926b 100644
> --- a/arch/riscv/kernel/process.c
> +++ b/arch/riscv/kernel/process.c
> @@ -258,8 +258,23 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>   		/* Supervisor/Machine, irqs on: */
>   		childregs->status = SR_PP | SR_PIE;
>   
> -		p->thread.s[0] = (unsigned long)args->fn;
> -		p->thread.s[1] = (unsigned long)args->fn_arg;
> +		/*
> +		 * Set up a metadata frame record at the bottom of the
> +		 * stack for the unwinder. Use FRAME_META_TYPE_FINAL
> +		 * since this is the outermost kernel entry for the new
> +		 * task. The frame_record::{fp,ra} are already zero from
> +		 * memset().
> +		 *
> +		 * fp/s0 points above the metadata record (RISC-V
> +		 * convention). fn and fn_arg are passed via s2/s3,
> +		 * keeping s0 available for the frame pointer chain.
> +		 */
> +		childregs->stackframe.type = FRAME_META_TYPE_FINAL;
> +
> +		p->thread.s[0] = (unsigned long)(&childregs->stackframe)
> +				+ sizeof(struct frame_record);
> +		p->thread.s[2] = (unsigned long)args->fn;
> +		p->thread.s[3] = (unsigned long)args->fn_arg;
>   		p->thread.ra = (unsigned long)ret_from_fork_kernel_asm;
>   	} else {
>   		/* allocate new shadow stack if needed. In case of CLONE_VM we have to */
> @@ -278,6 +293,18 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>   		if (clone_flags & CLONE_SETTLS)
>   			childregs->tp = tls;
>   		childregs->a0 = 0; /* Return value of fork() */
> +
> +		/*
> +		 * Set up the unwind boundary: ensure the metadata
> +		 * frame record has its {fp,ra} sentinel zeroed and
> +		 * point fp/s0 above the metadata record. The type
> +		 * field is inherited from the parent's pt_regs.
> +		 */
> +		childregs->stackframe.record.fp = 0;
> +		childregs->stackframe.record.ra = 0;

This relies on the parent always entering kernel via handle_exception
on a user->kernel boundary (which writes FRAME_META_TYPE_FINAL).
That is true for fork()/clone() today, but:

   - The kernel-thread path right above explicitly assigns type =
     FINAL, so the user-thread path looks asymmetric and like a
     possible omission to anyone reading it cold.
   - A future caller invoking kernel_clone() from a nested-kernel
     context (parent pt_regs.type == PT_REGS) would silently produce
     a broken unwind boundary on the new task.

Recommend explicitly setting it here too:

       childregs->stackframe.type = FRAME_META_TYPE_FINAL;

Even if currently redundant, it is one assignment, costs nothing, is
self-documenting, and fails closed instead of open.
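
As a rough user-space model of the difference (struct layout and type
values taken from the patch; init_user_child_meta() is a hypothetical
name, and the struct copy stands in for the child's pt_regs being
copied from the parent):

```c
#define FRAME_META_TYPE_FINAL		1
#define FRAME_META_TYPE_PT_REGS		2

struct frame_record {
	unsigned long fp;
	unsigned long ra;
};

struct frame_record_meta {
	struct frame_record record;
	unsigned long type;
};

static void init_user_child_meta(struct frame_record_meta *child,
				 const struct frame_record_meta *parent)
{
	/* The child starts as a copy of the parent's record. */
	*child = *parent;

	/* Zero the {fp, ra} sentinel, as the patch already does. */
	child->record.fp = 0;
	child->record.ra = 0;

	/*
	 * Suggested addition: do not rely on the inherited type. A new
	 * user task always sits at a final user-to-kernel boundary.
	 */
	child->type = FRAME_META_TYPE_FINAL;
}
```

Without the last assignment, a parent record of type PT_REGS would be
inherited unchanged by the child.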


Thanks.
Shuai

^ permalink raw reply

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-06-02 10:58 UTC (permalink / raw)
  To: Lance Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260531071845.10875-1-lance.yang@linux.dev>

On Sun, May 31, 2026 at 1:19 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
> On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> [...]
> >@@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >       if (result == SCAN_SUCCEED) {
> >               /* collapse_huge_page expects the lock to be dropped before calling */
> >               mmap_read_unlock(mm);
> >-              result = collapse_huge_page(mm, start_addr, referenced,
> >-                                          unmapped, cc, HPAGE_PMD_ORDER);
> >-              /* collapse_huge_page will return with the mmap_lock released */
> >+              nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> >+                                           unmapped, cc, enabled_orders);
> >+              /* mmap_lock was released above, set lock_dropped */
> >               *lock_dropped = true;
> >+              result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>
> Hmm ... don't we lose the allocation-failure result here?
>
> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> in khugepaged_do_scan().
>
> Now if allocation fails and nr_collapsed stays 0, we just return
> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?

Ok I did the error propagation! I think I handled both of these cases
you brought up pretty easily.

However, I don't know what to do in the following case: we successfully
collapsed some portion of the PMD, but during that process we also
hit an allocation failure. Is it best to back off entirely, or can we
treat some forward progress as a sign that we can keep trying collapses
without sleeping?

Basically, do we prioritize SCAN_ALLOC_HUGE_PAGE_FAIL or the
successful collapses as the returned value?

This is what I currently have:
done:
    if (collapsed)
        return SCAN_SUCCEED;
    if (alloc_failed)
        return SCAN_ALLOC_HUGE_PAGE_FAIL;
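
Spelled out as a stand-alone sketch, with a reduced stand-in enum
rather than the real scan_result, a made-up helper name, and the
assumption that everything else falls through to SCAN_FAIL:

```c
#include <stdbool.h>

/* Reduced stand-in for the kernel's enum scan_result. */
enum scan_result {
	SCAN_FAIL,
	SCAN_SUCCEED,
	SCAN_ALLOC_HUGE_PAGE_FAIL,
};

/* Hypothetical helper: forward progress wins over an allocation failure. */
static enum scan_result pick_collapse_result(int nr_collapsed, bool alloc_failed)
{
	if (nr_collapsed)
		return SCAN_SUCCEED;
	if (alloc_failed)
		return SCAN_ALLOC_HUGE_PAGE_FAIL;
	/* Assumption: any other outcome is reported as a plain failure. */
	return SCAN_FAIL;
}
```

Swapping the first two checks would give the other policy, where any
allocation failure makes khugepaged back off even after partial
progress.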

Thanks,
-- Nico

>
> Cheers, Lance
>


^ permalink raw reply

* Re: [PATCH mm-unstable v18 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped
From: Nico Pache @ 2026-06-02 10:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <ah2Ro54tMDMsPevk@lucifer>

On Mon, Jun 1, 2026 at 8:13 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 09:00:00AM -0600, Nico Pache wrote:
> > Currently the collapse_huge_page function requires the mmap_read_lock to
> > enter with it held, and exit with it dropped. This function moves the
> > unlock into its parent caller, and changes this semantic to requiring it
> > to enter/exit with it always unlocked.
> >
> > In future patches, we need this expectation, as in mTHP collapse we
> > may already have dropped the lock, and do not want to conditionally
> > check for this by passing through the lock_dropped variable.
> >
> > No functional change is expected as one of the first things the
> > collapse_huge_page function does is drop this lock before allocating the
> > hugepage.
> >
> > Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> One small nit below, otherwise LGTM, so:
>
> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>

Thank you for reviewing!

>
> > ---
> >  mm/khugepaged.c | 16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index e98ba5b15163..fab35d318641 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1208,6 +1208,12 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> >       return SCAN_SUCCEED;
> >  }
> >
> > +/*
> > + * collapse_huge_page expects the mmap_lock to be unlocked before entering and
> > + * will always return with the lock unlocked, to avoid holding the mmap_lock
> > + * while allocating a THP, as that could trigger direct reclaim/compaction.
> > + * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > + */
> >  static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >               int referenced, int unmapped, struct collapse_control *cc)
> >  {
> > @@ -1223,14 +1229,6 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > -     /*
> > -      * Before allocating the hugepage, release the mmap_lock read lock.
> > -      * The allocation can take potentially a long time if it involves
> > -      * sync compaction, and we do not need to hold the mmap_lock during
> > -      * that. We will recheck the vma after taking it again in write mode.
> > -      */
> > -     mmap_read_unlock(mm);
> > -
>
> NIT: Maybe worth an mmap_assert_locked()?

But it will already be unlocked here. The contract is that we enter
unlocked and exit unlocked.
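
A toy user-space model of that contract, with a boolean standing in
for the mmap_lock and all names made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for mm_struct; the bool models "mmap_lock is held". */
struct toy_mm {
	bool mmap_locked;
};

static void toy_collapse_huge_page(struct toy_mm *mm)
{
	/* Contract: the caller has already dropped the lock. */
	assert(!mm->mmap_locked);

	/* The folio allocation would happen here, without the lock. */

	/* Retake the lock to recheck the VMA, then drop it again. */
	mm->mmap_locked = true;
	mm->mmap_locked = false;

	/* Contract: always return with the lock dropped. */
	assert(!mm->mmap_locked);
}

static void toy_collapse_scan_pmd(struct toy_mm *mm, bool *lock_dropped)
{
	/* The scan itself runs with the lock held. */
	assert(mm->mmap_locked);

	/* The unlock now lives in the caller, not in the callee. */
	mm->mmap_locked = false;
	toy_collapse_huge_page(mm);
	*lock_dropped = true;
}
```

So any assertion at the top of collapse_huge_page() would have to check
that the lock is not held, rather than that it is.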

Cheers,
-- Nico

>
> >       result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> > @@ -1535,6 +1533,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >  out_unmap:
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> > +             /* collapse_huge_page expects the lock to be dropped before calling */
> > +             mmap_read_unlock(mm);
> >               result = collapse_huge_page(mm, start_addr, referenced,
> >                                           unmapped, cc);
> >               /* collapse_huge_page will return with the mmap_lock released */
> > --
> > 2.54.0
> >
>
> Cheers, Lorenzo
>


^ permalink raw reply

* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: David Hildenbrand (Arm) @ 2026-06-02  9:41 UTC (permalink / raw)
  To: Miaohe Lin, Breno Leitao
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang, Andrew Morton,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <33ef8821-c809-b7d1-ea77-6e8a07a6e784@huawei.com>

On 6/2/26 05:08, Miaohe Lin wrote:
> On 2026/6/1 21:22, David Hildenbrand (Arm) wrote:
>> On 6/1/26 14:28, Miaohe Lin wrote:
>>>
>>> Thanks for your patch.
>>>
>>>
>>> Once shake_page finds a lightweight range-based way to shrink slab, slab pages could be freed
>>> into buddy and above PageSlab test should be removed then. Maybe add a TODO or XXX here?
>>>
>>>
>>> I'm not sure but is it safe or a common way to test PageReserved, PageSlab,
>>> PageTable and PageLargeKmalloc without extra page refcnt?
>>
>> Checking typed pages in a racy fashion is fine (PageSlab, PageTable,
>> PageLargeKmalloc).
> 
> Got it. Thanks.
> 
>> Checking PageReserved in a racy fashion is fine as well. TESTPAGEFLAG() will
>> allow checking it on compound pages.
> 
> It seems PageReserved is not intended to be set on compound pages. I see there are PF_NO_COMPOUND
> in its definition: PAGEFLAG(Reserved, reserved, PF_NO_COMPOUND).
> 
>>
>> For PageLargeKmalloc, we would want to check the head page, though. The page
>> type is only stored for the head page.
> 
> Maybe we should check the head page for PageSlab and PageTable too? alloc_slab_page only
> set PageSlab on the head page and __pagetable_ctor uses __folio_set_pgtable to set PageTable
> on folio.
> 
>>
>> So maybe we want to lookup the compound head (if any) and perform the type
>> checks against that?
> 
> Maybe we should or we might miss some pages that could have been handled. And
> if compound head is required, should we hold an extra page refcnt to guard against
> possible folio split race?

Races are fine. We might miss some pages, but that can happen on races either way.


I'd just do something like

if (PageReserved(page))
	return true;

head = compound_head(page);
return PageSlab(head) || ...;
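
Fleshed out as a user-space toy model, purely to illustrate the shape
of the check. The flag bits, struct and function names are invented,
and the set of type checks after PageSlab is assumed from the ones
discussed above (PageTable, PageLargeKmalloc):

```c
#include <stdbool.h>
#include <stddef.h>

/* Invented flag bits standing in for the real page flags and types. */
enum toy_page_flags {
	TOY_RESERVED		= 1 << 0,
	TOY_SLAB		= 1 << 1,
	TOY_TABLE		= 1 << 2,
	TOY_LARGE_KMALLOC	= 1 << 3,
};

struct toy_page {
	unsigned int flags;
	struct toy_page *head;	/* NULL unless this is a tail page */
};

static struct toy_page *toy_compound_head(struct toy_page *page)
{
	return page->head ? page->head : page;
}

static bool toy_page_is_unhandlable(struct toy_page *page)
{
	struct toy_page *head;

	/* Reserved is tested on the page itself. */
	if (page->flags & TOY_RESERVED)
		return true;

	/* Page types are only stored on the head page. */
	head = toy_compound_head(page);
	return (head->flags & (TOY_SLAB | TOY_TABLE | TOY_LARGE_KMALLOC)) != 0;
}
```

The lookup is racy by design in this sketch as well: no reference is
taken on the head page.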
	

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v4 08/13] rv: Ensure synchronous cleanup for HA monitors
From: Nam Cao @ 2026-06-02  9:17 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260601153840.124372-9-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:

> HA monitors may start timers, all cleanup functions currently stop the
> timers asynchronously to avoid sleeping in the wrong context.
> Nothing makes sure running callbacks terminate on cleanup.
>
> Run the entire HA timer callback in an RCU read-side critical section,
> this way we can simply synchronize_rcu() with any pending timer and are
> sure any cleanup using kfree_rcu() runs after callbacks terminated.
> Additionally make sure any unlikely callback running late won't run any
> code if the monitor is marked as disabled or if destruction started.
> Use memory barriers to serialise with racing resets.
>
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

Reviewed-by: Nam Cao <namcao@linutronix.de>

^ permalink raw reply

* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Suzuki K Poulose @ 2026-06-02  9:10 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <d01cf1ec-b85d-4af6-9810-8107c0e2a4ec@arm.com>



On 02/06/2026 09:55, Suzuki K Poulose wrote:
> On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> All-shared guest_memfd used to be only supported for non-CoCo VMs where
>> preparation doesn't apply. INIT_SHARED is about to be supported for
>> non-CoCo VMs in a later patch in this series.
> 
> nit: s/non-CoCo/CoCo ?
> 
>>
>> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
>> guest_memfd in a later patch in this series.
>>
>> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
>> shared folio for a CoCo VM where preparation applies.
>>
>> Add a check to make sure that preparation is only performed for private
>> folios.
>>
>> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
>> conversion to shared.
>>
>> Signed-off-by: Michael Roth <michael.roth@amd.com>
> 
> nit: Missing Co-Developed-by: ?
> 
>> Reviewed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>>   virt/kvm/guest_memfd.c | 9 ++++++---
>>   1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 78e5435967341..adf57a3a1f5dd 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -894,6 +894,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>                int *max_order)
>>   {
>>       pgoff_t index = kvm_gmem_get_index(slot, gfn);
>> +    struct inode *inode;
>>       struct folio *folio;
>>       int r = 0;
>> @@ -901,7 +902,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>       if (!file)
>>           return -EFAULT;
>> -    filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>> +    inode = file_inode(file);
>> +    filemap_invalidate_lock_shared(inode->i_mapping);
>>       folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
>>       if (IS_ERR(folio)) {
>> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>           folio_mark_uptodate(folio);
>>       }
>> -    r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>> +    if (kvm_gmem_is_private_mem(inode, index))
> 
> Don't we need to make sure the entire folio is private? Not just the
> page at the index?
>      if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?

Or rather, we should go through the individual pages and apply the
prepare for ones that are private ?
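
Roughly like the following toy model, where plain bool arrays stand in
for the per-index private attribute and the prepared state; none of
these names exist in guest_memfd:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Walk every page of a (toy) folio and prepare only the private ones.
 * Returns the number of pages that were prepared.
 */
static size_t toy_prepare_private_pages(const bool *is_private,
					bool *prepared, size_t nr_pages)
{
	size_t i, nr_prepared = 0;

	for (i = 0; i < nr_pages; i++) {
		/* Shared pages are left untouched. */
		if (!is_private[i])
			continue;

		prepared[i] = true;
		nr_prepared++;
	}

	return nr_prepared;
}
```

Whether per-page preparation is actually possible depends on what the
arch hook expects, which this sketch does not model.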

Suzuki

> 
> Suzuki
> 
>> +        r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>>       folio_unlock(folio);
>> @@ -924,7 +927,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>           folio_put(folio);
>>   out:
>> -    filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>> +    filemap_invalidate_unlock_shared(inode->i_mapping);
>>       return r;
>>   }
>>   EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>>
> 


^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-02  8:57 UTC (permalink / raw)
  To: Balbir Singh
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <ah47NNhuiClgGCdn@parvat>

On Tue, Jun 02, 2026 at 12:16:50PM +1000, Balbir Singh wrote:
> On Sun, May 24, 2026 at 09:50:06PM -0400, Gregory Price wrote:
> > 
> > I'm debating on whether to include OPS_MEMPOLICY in the initial version
> > if only because it's not intuitive how it interacts with pagecache. That
> > needs more time to bake.
> >
> 
> It makes sense to look at it and then decide if it makes sense.
>

I am thinking I will ship without any OPS flags at all for now and then
have the introduction of ops as a separate series.

> > alloc_pages_node() is the kernel interface
> 
> I was thinking we wouldn't need explicit flags and that allocations would
> happen from user space using __GFP_THISNODE to the node or via a nodemask
> based on nodes of interest. Is there a reason to add this flag, a system
> might have more than one source of N_MEMORY_PRIVATE?
> 

There are a few things to unpack here.  I've discussed this many times on
list and at LSF, but to reiterate:

1) __GFP_THISNODE is insufficient to enforce isolation and otherwise
   not particularly useful.  Additionally, from userland, it's not
   something you can actually set.

   for node in possible_nodes:
       alloc_pages_node(private_node, __GFP_THISNODE)

   In fact it's the opposite semantic of what we want.
   THISNODE says: "Do not fall back to OTHER nodes".

   The semantic we want is "Do not allow allocations from private
   nodes UNLESS we specifically request" (__GFP_PRIVATE).

   __GFP_THISNODE does not actually buy you anything here, and it's
   worse than that: in the scenario where a private node makes its way
   into the preferred slot (via possible_nodes or some other nodemask),
   the allocator cannot fall back to a node it can access.

   __GFP_THISNODE cannot be overloaded to do anything useful here.

2) We're trying not to expose *ANY* userland APIs for this, at all.

   The ultimate goal here should be one of two things:

   1) fd = open(/dev/xxx, ...);
      mem = mmap(fd, ...);
      mem[0] = 0xDEADBEEF; /* Fault device page into page table */

      In this case, the driver is responsible for doing the
      alloc_pages_node() call.

   or

   2) mem = mmap(NULL, ..., ANON);
      mbind(mem, ..., private_node);
      mem[0] = 0xDEADBEEF; /* Fault device page into page table */

      in this case mempolicy.c is responsible for doing the
      alloc_pages_node() call via the _mpol() alloc variants.
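
As a rough illustration of point 1, the two semantics can be compared in
a user-space toy model. This is not kernel code: MODEL_GFP_THISNODE,
MODEL_GFP_PRIVATE, struct model_node and model_alloc_node() are
hypothetical names invented for this sketch, and the fallback order is
simplified to "preferred node first, then the rest in node order".

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 4
#define NO_NODE  (-1)

/* Hypothetical stand-ins for the GFP bits discussed above. */
#define MODEL_GFP_THISNODE 0x1u
#define MODEL_GFP_PRIVATE  0x2u

struct model_node {
	bool is_private;	/* models membership in N_MEMORY_PRIVATE */
	bool has_free;		/* whether the node can satisfy the request */
};

/* Try the preferred node first, then fall back in node order. */
static int model_alloc_node(const struct model_node *nodes, int preferred,
			    unsigned int gfp)
{
	assert(preferred >= 0 && preferred < NR_NODES);

	for (int i = 0; i < NR_NODES; i++) {
		int nid = (preferred + i) % NR_NODES;

		/* THISNODE: never look past the preferred node. */
		if ((gfp & MODEL_GFP_THISNODE) && nid != preferred)
			break;
		/* Private nodes are skipped unless explicitly requested. */
		if (nodes[nid].is_private && !(gfp & MODEL_GFP_PRIVATE))
			continue;
		if (nodes[nid].has_free)
			return nid;
	}
	return NO_NODE;
}
```

In this model, with node 2 private and preferred, a plain request falls
back to node 3, a MODEL_GFP_THISNODE request fails outright, and only
MODEL_GFP_PRIVATE gets memory from node 2.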

Additional OPT flags (reclaim, compaction, whatever) would
(optionally) allow mm/ to operate on the device memory with, for
example, mmu_notifier callbacks to tell the device to invalidate
whatever it's caching about that page.

This would all be relatively transparent to userland; all userland
"knows" is that it's getting memory from a device (/dev/xxx) or a
node it's otherwise aware of hosting device memory somehow.

~Gregory

^ permalink raw reply

* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Suzuki K Poulose @ 2026-06-02  8:55 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-7-2f0fae496530@google.com>

On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> All-shared guest_memfd used to be only supported for non-CoCo VMs where
> preparation doesn't apply. INIT_SHARED is about to be supported for
> non-CoCo VMs in a later patch in this series.

nit: s/non-CoCo/CoCo ?

> 
> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> guest_memfd in a later patch in this series.
> 
> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> shared folio for a CoCo VM where preparation applies.
> 
> Add a check to make sure that preparation is only performed for private
> folios.
> 
> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
> conversion to shared.
> 
> Signed-off-by: Michael Roth <michael.roth@amd.com>

nit: Missing Co-developed-by: ?

> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>   virt/kvm/guest_memfd.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 78e5435967341..adf57a3a1f5dd 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -894,6 +894,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   		     int *max_order)
>   {
>   	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> +	struct inode *inode;
>   	struct folio *folio;
>   	int r = 0;
>   
> @@ -901,7 +902,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   	if (!file)
>   		return -EFAULT;
>   
> -	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> +	inode = file_inode(file);
> +	filemap_invalidate_lock_shared(inode->i_mapping);
>   
>   	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
>   	if (IS_ERR(folio)) {
> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   		folio_mark_uptodate(folio);
>   	}
>   
> -	r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> +	if (kvm_gmem_is_private_mem(inode, index))

Don't we need to make sure the entire folio is private, not just the
page at the index?
	if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?
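
To illustrate the concern with a user-space toy model (not kernel code):
the priv[] array, model_range_is_private() and model_folio_start() below
are hypothetical names for this sketch, and it assumes a naturally
aligned folio whose page count is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: priv[i] says whether page-cache index i is private.
 * Returns true only if every index in [start, start + nr) is private.
 */
static bool model_range_is_private(const bool *priv, size_t nr_total,
				   size_t start, size_t nr)
{
	assert(start <= nr_total && nr <= nr_total - start);

	for (size_t i = start; i < start + nr; i++) {
		if (!priv[i])
			return false;
	}
	return true;
}

/*
 * First index covered by the folio that contains 'index', assuming the
 * folio has 'nr' pages (a power of two) and is naturally aligned.
 */
static size_t model_folio_start(size_t index, size_t nr)
{
	assert(nr != 0 && (nr & (nr - 1)) == 0);
	return index & ~(nr - 1);
}
```

With priv = { 1, 1, 0, 1, ... } and a four-page folio, the page at index
1 is private while the folio as a whole is not, which is the case a
single-index check would miss.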

Suzuki

> +		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>   
>   	folio_unlock(folio);
>   
> @@ -924,7 +927,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   		folio_put(folio);
>   
>   out:
> -	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> +	filemap_invalidate_unlock_shared(inode->i_mapping);
>   	return r;
>   }
>   EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
> 


^ permalink raw reply

* Re: [PATCH v4 06/13] rv: Do not rely on clean monitor when initialising HA
From: Nam Cao @ 2026-06-02  8:52 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel
  Cc: Wen Yang
In-Reply-To: <20260601153840.124372-7-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> Hybrid Automata monitors hook into the DA implementation when doing
> da_monitor_reset(). This function is called both on initialisation and
> teardown, HA monitors try to cancel a timer only when it's initialised
> relying on the da_mon->monitoring flag. This flag could however be
> corrupted during initialisation. This happens for instance on per-task
> monitors that share the same storage with different types of monitors
> like LTL or in case of races during a previous teardown.
>
> Stop relying on the monitoring flag during initialisation, assume that it
> can have any value, so use a separate da_reset_state() skipping timer
> cancellation.
> New monitors (e.g. new tasks) are always zero-initialised so it is safe
> to rely on the monitoring flag for those.
>
> Reported-by: Wen Yang <wen.yang@linux.dev>
> Closes: https://lore.kernel.org/lkml/d02c656aada7d071f083460a5c9a454363669b61.1778522945.git.wen.yang@linux.dev
> Suggested-by: Nam Cao <namcao@linutronix.de>
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Reviewed-by: Wen Yang <wen.yang@linux.dev>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

Reviewed-by: Nam Cao <namcao@linutronix.de>

^ permalink raw reply

* Re: [PATCH v8 4/6] mm/memory-failure: add panic option for unrecoverable pages
From: Miaohe Lin @ 2026-06-02  7:05 UTC (permalink / raw)
  To: Breno Leitao
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <20260527-ecc_panic-v8-4-9ea0cfa16bb0@debian.org>

On 2026/5/27 22:06, Breno Leitao wrote:
> Add a sysctl panic_on_unrecoverable_memory_failure (disabled by
> default) that triggers a kernel panic when memory_failure()
> encounters pages that cannot be recovered.  This provides a clean
> crash with useful debug information rather than allowing silent
> data corruption or a delayed crash at an unrelated code path.
> 
> Panic eligibility is intentionally narrow: only MF_MSG_KERNEL with
> result == MF_IGNORED panics.  After the previous patch, MF_MSG_KERNEL
> covers PG_reserved pages and the kernel-owned pages promoted from
> get_hwpoison_page() via -ENOTRECOVERABLE (slab, page tables,
> large-kmalloc).
> 
> All other action types are excluded:
> 
> - MF_MSG_GET_HWPOISON and MF_MSG_KERNEL_HIGH_ORDER can be reached by
>   transient refcount races with the page allocator (an in-flight buddy
>   allocation has refcount 0 and is no longer on the buddy free list,
>   briefly), and panicking on them would risk killing the box for what
>   is actually a recoverable userspace page.
> 
> - MF_MSG_UNKNOWN means identify_page_state() could not classify the
>   page; that is precisely the wrong basis for a panic decision.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  mm/memory-failure.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 14c0a958638c..dcd53dbc6aec 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
>  
>  static int sysctl_enable_soft_offline __read_mostly = 1;
>  
> +static int sysctl_panic_on_unrecoverable_mf __read_mostly;
> +
>  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>  
>  static bool hw_memory_failure __read_mostly = false;
> @@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
>  		.proc_handler	= proc_dointvec_minmax,
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= SYSCTL_ONE,
> +	},
> +	{
> +		.procname	= "panic_on_unrecoverable_memory_failure",
> +		.data		= &sysctl_panic_on_unrecoverable_mf,
> +		.maxlen		= sizeof(sysctl_panic_on_unrecoverable_mf),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_ONE,
>  	}
>  };
>  
> @@ -1255,6 +1266,15 @@ static void update_per_node_mf_stats(unsigned long pfn,
>  	++mf_stats->total;
>  }
>  
> +static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
> +				      enum mf_result result)
> +{
> +	if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
> +		return false;
> +
> +	return type == MF_MSG_KERNEL;

Would it be more straightforward to write as something like:

if (!sysctl_panic_on_unrecoverable_mf)
	return false;

return (type == MF_MSG_KERNEL && result == MF_IGNORED);
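
For reference, the two forms should be logically equivalent. A small
user-space check can confirm that over every input combination. The enum
names and values below are stand-ins invented for this sketch, not the
kernel's definitions, and the sysctl is modeled as a plain bool argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in enums: only the cases relevant to the comparison are modeled. */
enum model_type { MODEL_MSG_KERNEL, MODEL_MSG_OTHER };
enum model_result { MODEL_IGNORED, MODEL_OTHER_RESULT };

/* Shape used in the patch. */
static bool panic_form_patch(bool sysctl_on, enum model_type type,
			     enum model_result result)
{
	if (!sysctl_on || result != MODEL_IGNORED)
		return false;

	return type == MODEL_MSG_KERNEL;
}

/* Shape suggested in this review. */
static bool panic_form_suggested(bool sysctl_on, enum model_type type,
				 enum model_result result)
{
	if (!sysctl_on)
		return false;

	return (type == MODEL_MSG_KERNEL && result == MODEL_IGNORED);
}

/* Exhaustively compare both forms over all modeled inputs. */
static void check_forms_agree(void)
{
	for (int s = 0; s <= 1; s++)
		for (int t = MODEL_MSG_KERNEL; t <= MODEL_MSG_OTHER; t++)
			for (int r = MODEL_IGNORED; r <= MODEL_OTHER_RESULT; r++)
				assert(panic_form_patch(s, (enum model_type)t,
							(enum model_result)r) ==
				       panic_form_suggested(s, (enum model_type)t,
							    (enum model_result)r));
}
```

So the choice between them is purely about readability.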

Thanks.
.

^ permalink raw reply

* Re: [PATCH v8 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
From: Miaohe Lin @ 2026-06-02  3:31 UTC (permalink / raw)
  To: Breno Leitao
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <20260527-ecc_panic-v8-3-9ea0cfa16bb0@debian.org>

On 2026/5/27 22:06, Breno Leitao wrote:
> The previous patch teaches get_any_page() to return -ENOTRECOVERABLE
> for stable unhandlable kernel pages (PG_reserved, slab, page tables,
> large-kmalloc).  memory_failure() still folds every negative return
> into MF_MSG_GET_HWPOISON, so callers that want to react to the
> unrecoverable cases (a panic option, smarter logging) cannot tell
> them apart from transient page-allocator races.
> 
> Turn the post-call branch into a switch over the get_hwpoison_page()
> return code: map -ENOTRECOVERABLE to MF_MSG_KERNEL and any other
> negative return to MF_MSG_GET_HWPOISON.  case 0 keeps the existing
> free-buddy / kernel-high-order handling and case 1 falls through to
> the rest of memory_failure() unchanged.
> 
> The MF_MSG_KERNEL label and tracepoint string are kept as
> "reserved kernel page" to avoid breaking userspace tools that match
> on those literals; the enum value still adequately tags the failure
> even though it now also covers slab, page tables and large-kmalloc
> pages.
> 
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Breno Leitao <leitao@debian.org>

Acked-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks.
.

^ permalink raw reply

* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Miaohe Lin @ 2026-06-02  3:08 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Breno Leitao
  Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang, Andrew Morton,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <e3d023f1-ab6e-4424-b304-55f1294480c3@kernel.org>

On 2026/6/1 21:22, David Hildenbrand (Arm) wrote:
> On 6/1/26 14:28, Miaohe Lin wrote:
>> On 2026/5/27 22:06, Breno Leitao wrote:
>>> get_any_page() collapses every HWPoisonHandlable() rejection into a
>>> single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
>>> -> retry path.  That is correct for the transient case (a userspace
>>> folio briefly off LRU during migration or compaction, which a later
>>> shake can drag back), but wrong for stable kernel-owned pages: slab,
>>> page-table, large-kmalloc and PG_reserved pages will never become
>>> HWPoisonHandlable(), so the retry loop is wasted work and the final
>>> -EIO loses the "this is structurally unrecoverable" information.
>>> memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
>>> panic-on-unrecoverable sysctl deliberately does not act on.
>>>
>>> Introduce HWPoisonKernelOwned(), a small predicate that positively
>>> identifies pages the hwpoison handler cannot recover from:
>>>
>>>   HWPoisonKernelOwned(p, flags) :=
>>>       !(MF_SOFT_OFFLINE && page_has_movable_ops(p)) &&
>>>       (PageReserved(p) || PageSlab(p) ||
>>>        PageTable(p)    || PageLargeKmalloc(p))
>>>
>>> The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors the
>>> same exception in HWPoisonHandlable(): soft-offline is allowed to
>>> migrate movable_ops pages even though they are not on the LRU, and
>>> we must not pre-empt that with an unrecoverable verdict.
>>>
>>> The list is intentionally not exhaustive.  vmalloc and kernel-stack
>>> pages, for example, do not carry a page_type bit and would need a
>>> different oracle; they keep going through the existing retry path
>>> unchanged.  This is the smallest set we can identify with certainty
>>> by page type.
>>>
>>> Wire the helper into the top of get_any_page() to short-circuit
>>> those pages before the retry loop runs.  On a hit, drop the caller's
>>> MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
>>> straight away.  Pages outside the helper's positive list still take
>>> the existing retry path and return -EIO, leaving operator-visible
>>> behaviour for those cases unchanged.
>>>
>>> Extend the unhandlable-page pr_err() to fire for either errno and
>>> update the get_hwpoison_page() kerneldoc to document the new return.
>>>
>>> memory_failure() still folds every negative return into
>>> MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
>>> this patch on its own only changes the errno that soft_offline_page()
>>> can propagate to its callers.  A follow-up wires -ENOTRECOVERABLE
>>> through memory_failure() and reports MF_MSG_KERNEL for the
>>> unrecoverable cases, which is what the
>>> panic_on_unrecoverable_memory_failure sysctl observes.
>>
>> Thanks for your patch.
>>
>>>
>>> Suggested-by: David Hildenbrand <david@kernel.org>
>>> Suggested-by: Lance Yang <lance.yang@linux.dev>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>>  mm/memory-failure.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>>>  1 file changed, 40 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index f4d3e6e20e13..8f63bdfeff8f 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -1325,6 +1325,28 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
>>>  	return PageLRU(page) || is_free_buddy_page(page);
>>>  }
>>>  
>>> +/*
>>> + * Positive identification of pages the hwpoison handler cannot recover.
>>> + * These page types are owned by kernel internals (no userspace mapping
>>> + * to unmap, no file mapping to invalidate, no migration target), so the
>>> + * shake_page() / retry loop in get_any_page() can never turn them into
>>> + * something HWPoisonHandlable() will accept.  Short-circuit them to
>>> + * -ENOTRECOVERABLE so callers can panic on operator request instead of
>>> + * spinning through retries that exit as a transient-looking -EIO.
>>> + *
>>> + * The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors
>>> + * HWPoisonHandlable(): soft-offline is allowed to migrate movable_ops
>>> + * pages even though they are not on the LRU.
>>> + */
>>> +static inline bool HWPoisonKernelOwned(struct page *page, unsigned long flags)
>>> +{
>>> +	if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
>>> +		return false;
>>> +
>>> +	return PageReserved(page) || PageSlab(page) ||
>>
>> Once shake_page finds a lightweight range-based way to shrink slab, slab pages could be freed
>> into buddy and above PageSlab test should be removed then. Maybe add a TODO or XXX here?
>>
>>> +	       PageTable(page) || PageLargeKmalloc(page);
>>
>> I'm not sure but is it safe or a common way to test PageReserved, PageSlab,
>> PageTable and PageLargeKmalloc without extra page refcnt?
> 
> Checking typed pages in a racy fashion is fine (PageSlab, PageTable,
> PageLargeKmalloc).

Got it. Thanks.

> Checking PageReserved in a racy fashion is fine as well. TESTPAGEFLAG() will
> allow checking it on compound pages.

It seems PageReserved is not intended to be set on compound pages. I see there is PF_NO_COMPOUND
in its definition: PAGEFLAG(Reserved, reserved, PF_NO_COMPOUND).

> 
> For PageLargeKmalloc, we would want to check the head page, though. The page
> type is only stored for the head page.

Maybe we should check the head page for PageSlab and PageTable too? alloc_slab_page only
sets PageSlab on the head page and __pagetable_ctor uses __folio_set_pgtable to set PageTable
on the folio.

> 
> So maybe we want to lookup the compound head (if any) and perform the type
> checks against that?

Maybe we should or we might miss some pages that could have been handled. And
if compound head is required, should we hold an extra page refcnt to guard against
possible folio split race?
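
A user-space toy model of the head-page lookup being discussed, just to
pin down the idea. struct model_page, model_compound_head() and
model_kernel_owned() are hypothetical names for this sketch; it assumes
the page type is stored only on the head page and it deliberately ignores
the refcount / folio split question raised above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_page_type {
	MODEL_NONE, MODEL_SLAB, MODEL_TABLE, MODEL_LARGE_KMALLOC,
};

struct model_page {
	struct model_page *head;	/* NULL for a head or order-0 page */
	enum model_page_type type;	/* only meaningful on the head page */
	bool reserved;			/* per-page flag in this model */
};

/* Resolve a tail page to its head; other pages map to themselves. */
static const struct model_page *model_compound_head(const struct model_page *p)
{
	assert(p != NULL);
	return p->head ? p->head : p;
}

/* Check the reserved flag on the page itself, the type on its head. */
static bool model_kernel_owned(const struct model_page *p)
{
	const struct model_page *head = model_compound_head(p);

	if (p->reserved)
		return true;

	return head->type != MODEL_NONE;
}
```

In this model a tail page of a slab folio carries MODEL_NONE itself, so a
check against the tail alone would miss it, while the check against the
head identifies it.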

Thanks.
.



^ permalink raw reply

* Re: [RESEND][PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-06-02  2:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Mark Rutland, Peter Zijlstra, Namhyung Kim, Takaya Saeki,
	Douglas Raillard, Tom Zanussi, Andrew Morton, Thomas Gleixner,
	Ian Rogers, Jiri Olsa
In-Reply-To: <20260601202546.564e867b@gandalf.local.home>

On Mon, 1 Jun 2026 20:25:46 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> Add syntax to the parsing of eprobes to be able to typecast a trace event
> field that is a pointer to a structure.
> 
> Currently, a dereference must be a number, where the user has to figure
> out manually the offset of a member of a structure that they want to
> dereference.
> 
> But for event probes that record a field that happens to be a pointer to
> a structure, it cannot dereference these values with BTF naming, but
> must use numerical offsets.
> 
> For example, to find out what device a sk_buff is pointing to in the
> net_dev_xmit trace event, one must first use gdb to find the offsets of the
> members of the structures:
> 
>  (gdb) p &((struct sk_buff *)0)->dev
>  $1 = (struct net_device **) 0x10
>  (gdb) p &((struct net_device *)0)->name
>  $2 = (char (*)[16]) 0x118
> 
> And then use the raw numbers to dereference:
> 
>   # echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events
> 
> If BTF is in the kernel, then instead, the skbaddr can be typecast to
> sk_buff and use the normal dereference logic.
> 
>   # echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
>   # echo 1 > events/eprobes/xmit/enable
>   # cat trace
> [..]
>     sshd-session-1022    [000] b..2.   860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
>     sshd-session-1022    [000] b..2.   860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
>     sshd-session-1022    [000] b..2.   860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
>     sshd-session-1022    [000] b..2.   860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
>     sshd-session-1022    [000] b..2.   860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
>     sshd-session-1022    [000] b..2.   860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
>     sshd-session-1022    [000] b..2.   860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
>     sshd-session-1022    [000] b..2.   860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
>     sshd-session-1022    [000] b..2.   860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
>     sshd-session-1022    [000] b..2.   860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
>     sshd-session-1022    [000] b..2.   860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
>     sshd-session-1022    [000] b..2.   860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"
> 
> The syntax is simply: (STRUCT)FIELD->MEMBER[->MEMBER..]
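
To make the shape of that syntax concrete, here is a stand-alone
user-space sketch that splits such an expression into its parts. It is
not the kernel parser: struct typecast_expr, copy_token() and
parse_typecast() are hypothetical names, the size limits are arbitrary,
and it assumes any trailing ":type" suffix has already been stripped.

```c
#include <assert.h>
#include <string.h>

#define MAX_MEMBERS 8
#define NAME_LEN    64

struct typecast_expr {
	char struct_name[NAME_LEN];
	char field[NAME_LEN];
	char members[MAX_MEMBERS][NAME_LEN];
	int nr_members;
};

/* Copy [start, end) into dst; fail if the token is empty or too long. */
static int copy_token(char *dst, const char *start, const char *end)
{
	size_t len = (size_t)(end - start);

	if (len == 0 || len >= NAME_LEN)
		return -1;
	memcpy(dst, start, len);
	dst[len] = '\0';
	return 0;
}

/* Parse "(STRUCT)FIELD->MEMBER[->MEMBER..]"; 0 on success, -1 if malformed. */
static int parse_typecast(const char *arg, struct typecast_expr *out)
{
	const char *close, *cur, *arrow;

	assert(arg && out);
	memset(out, 0, sizeof(*out));

	if (arg[0] != '(')
		return -1;
	close = strchr(arg, ')');
	if (!close || copy_token(out->struct_name, arg + 1, close))
		return -1;

	/* FIELD runs from just after ')' up to the first "->". */
	cur = close + 1;
	arrow = strstr(cur, "->");
	if (!arrow || copy_token(out->field, cur, arrow))
		return -1;

	/* Each "->" introduces one more MEMBER. */
	while (arrow) {
		const char *next;

		cur = arrow + 2;
		next = strstr(cur, "->");
		if (out->nr_members == MAX_MEMBERS)
			return -1;
		if (copy_token(out->members[out->nr_members], cur,
			       next ? next : cur + strlen(cur)))
			return -1;
		out->nr_members++;
		arrow = next;
	}
	return 0;
}
```

Feeding it "(sk_buff)skbaddr->dev->name" yields the struct name sk_buff,
the field skbaddr, and the members dev and name.
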
> 
> Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
> to know what they are for.
> 
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> 
> [ Resend with base-id below, maybe Sashiko will apply it to the correct tree! ]

Sashiko still failed to apply this... Not sure why.

https://sashiko.dev/#/message/20260601202546.564e867b%40gandalf.local.home

Maybe better to configure Sashiko via github or sashiko-ml?
https://github.com/sashiko-dev/sashiko/blob/main/MAINTAINERS_GUIDE.md

Anyway, at least for me, this looks good.

Thanks,

> 
> base-id: 585abc02be3d3ab82fbcc4dbcbbf0ceb61a02129
> 
> Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
> 
> - Add error message in parse_btf_args() for failed parsing of TEVENT.
>   (Sashiko)
> 
> - Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
>   The flag was redundant and added unnecessary complexity.
> 
> - Restructure to keep the lifetime of the TYPECAST to the end of
>   traceprobe_parse_probe_arg_body(). This allows the last_type to stay
>   around in case there's not a type parameter and then btf can still be
>   used.
>   (Sashiko and Masami Hiramatsu)
> 
>  Documentation/trace/eprobetrace.rst |   4 +
>  kernel/trace/trace_probe.c          | 173 +++++++++++++++++++++++-----
>  kernel/trace/trace_probe.h          |   5 +-
>  3 files changed, 154 insertions(+), 28 deletions(-)
> 
> diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
> index 89b5157cfab8..fe3602540569 100644
> --- a/Documentation/trace/eprobetrace.rst
> +++ b/Documentation/trace/eprobetrace.rst
> @@ -46,6 +46,10 @@ Synopsis of eprobe_events
>  		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
>                    "string", "ustring", "symbol", "symstr" and "bitfield" are
>                    supported.
> +  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
> +                  a pointer to STRUCT and then dereference the pointer defined by
> +                  ->MEMBER. Note that when this is used, the FIELD name does not
> +                  need to be prefixed with a '$'.
>  
>  Types
>  -----
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index 695310571b08..fd1caa1f9723 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -332,6 +332,23 @@ static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
>  	return -ENOENT;
>  }
>  
> +static int parse_trace_event(char *arg, struct fetch_insn *code,
> +			     struct traceprobe_parse_context *ctx)
> +{
> +	int ret;
> +
> +	if (code->data)
> +		return -EFAULT;
> +	ret = parse_trace_event_arg(arg, code, ctx);
> +	if (!ret)
> +		return 0;
> +	if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
> +		code->op = FETCH_OP_COMM;
> +		return 0;
> +	}
> +	return -EINVAL;
> +}
> +
>  #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
>  
>  static u32 btf_type_int(const struct btf_type *t)
> @@ -376,11 +393,16 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
>  		&& BTF_INT_BITS(intdata) == 8;
>  }
>  
> +static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
> +{
> +	return ctx->struct_btf ? : ctx->btf;
> +}
> +
>  static int check_prepare_btf_string_fetch(char *typename,
>  				struct fetch_insn **pcode,
>  				struct traceprobe_parse_context *ctx)
>  {
> -	struct btf *btf = ctx->btf;
> +	struct btf *btf = ctx_btf(ctx);
>  
>  	if (!btf || !ctx->last_type)
>  		return 0;
> @@ -506,6 +528,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
>  	return 0;
>  }
>  
> +static void clear_struct_btf(struct traceprobe_parse_context *ctx)
> +{
> +	if (ctx->struct_btf) {
> +		btf_put(ctx->struct_btf);
> +		ctx->struct_btf = NULL;
> +		ctx->last_struct = NULL;
> +	}
> +}
> +
>  static void clear_btf_context(struct traceprobe_parse_context *ctx)
>  {
>  	if (ctx->btf) {
> @@ -554,22 +585,29 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
>  	struct fetch_insn *code = *pcode;
>  	const struct btf_member *field;
>  	u32 bitoffs, anon_offs;
> +	bool is_struct = ctx->struct_btf != NULL;
> +	struct btf *btf = ctx_btf(ctx);
>  	char *next;
>  	int is_ptr;
>  	s32 tid;
>  
>  	do {
> -		/* Outer loop for solving arrow operator ('->') */
> -		if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
> -			trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
> -			return -EINVAL;
> -		}
> -		/* Convert a struct pointer type to a struct type */
> -		type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
> -		if (!type) {
> -			trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> -			return -EINVAL;
> +		if (!is_struct) {
> +			/* Outer loop for solving arrow operator ('->') */
> +			if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
> +				trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
> +				return -EINVAL;
> +			}
> +
> +			/* Convert a struct pointer type to a struct type */
> +			type = btf_type_skip_modifiers(btf, type->type, &tid);
> +			if (!type) {
> +				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> +				return -EINVAL;
> +			}
>  		}
> +		/* Only the first type can skip being a pointer */
> +		is_struct = false;
>  
>  		bitoffs = 0;
>  		do {
> @@ -580,7 +618,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
>  				return is_ptr;
>  
>  			anon_offs = 0;
> -			field = btf_find_struct_member(ctx->btf, type, fieldname,
> +			field = btf_find_struct_member(btf, type, fieldname,
>  						       &anon_offs);
>  			if (IS_ERR(field)) {
>  				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> @@ -602,7 +640,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
>  				ctx->last_bitsize = 0;
>  			}
>  
> -			type = btf_type_skip_modifiers(ctx->btf, field->type, &tid);
> +			type = btf_type_skip_modifiers(btf, field->type, &tid);
>  			if (!type) {
>  				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
>  				return -EINVAL;
> @@ -640,7 +678,7 @@ static int parse_btf_arg(char *varname,
>  	int i, is_ptr, ret;
>  	u32 tid;
>  
> -	if (WARN_ON_ONCE(!ctx->funcname))
> +	if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
>  		return -EINVAL;
>  
>  	is_ptr = split_next_field(varname, &field, ctx);
> @@ -653,6 +691,19 @@ static int parse_btf_arg(char *varname,
>  		return -EOPNOTSUPP;
>  	}
>  
> +	if (ctx->flags & TPARG_FL_TEVENT) {
> +		ret = parse_trace_event(varname, code, ctx);
> +		if (ret < 0) {
> +			trace_probe_log_err(ctx->offset, BAD_ATTACH_ARG);
> +			return ret;
> +		}
> +		/* TEVENT is only here via a typecast */
> +		if (WARN_ON_ONCE(ctx->struct_btf == NULL))
> +			return -EINVAL;
> +		type = ctx->last_struct;
> +		goto found_type;
> +	}
> +
>  	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
>  		code->op = FETCH_OP_RETVAL;
>  		/* Check whether the function return type is not void */
> @@ -709,6 +760,7 @@ static int parse_btf_arg(char *varname,
>  
>  found:
>  	type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
> +found_type:
>  	if (!type) {
>  		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
>  		return -EINVAL;
> @@ -727,7 +779,7 @@ static int parse_btf_arg(char *varname,
>  static const struct fetch_type *find_fetch_type_from_btf_type(
>  					struct traceprobe_parse_context *ctx)
>  {
> -	struct btf *btf = ctx->btf;
> +	struct btf *btf = ctx_btf(ctx);
>  	const char *typestr = NULL;
>  
>  	if (btf && ctx->last_type)
> @@ -758,7 +810,67 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
>  	return 0;
>  }
>  
> -#else
> +static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
> +{
> +	struct btf *btf = NULL;
> +	int id;
> +
> +	/* A struct_btf should only be used by a single argument */
> +	if (WARN_ON_ONCE(ctx->struct_btf)) {
> +		btf_put(ctx->struct_btf);
> +		ctx->struct_btf = NULL;
> +	}
> +
> +	id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
> +	if (id < 0)
> +		return id;
> +	ctx->struct_btf = btf;
> +	ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
> +	return 0;
> +}
> +
> +static int handle_typecast(char *arg, struct fetch_insn **pcode,
> +			   struct fetch_insn *end,
> +			   struct traceprobe_parse_context *ctx)
> +{
> +	char *tmp;
> +	int ret;
> +
> +	/* Currently this only works for eprobes */
> +	if (!(ctx->flags & TPARG_FL_TEVENT)) {
> +		trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
> +		return -EINVAL;
> +	}
> +
> +	tmp = strchr(arg, ')');
> +	if (!tmp) {
> +		trace_probe_log_err(ctx->offset + strlen(arg),
> +				    DEREF_OPEN_BRACE);
> +		return -EINVAL;
> +	}
> +	*tmp = '\0';
> +	ret = query_btf_struct(arg + 1, ctx);
> +	*tmp = ')';
> +
> +	if (ret < 0) {
> +		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
> +		return -EINVAL;
> +	}
> +
> +	tmp++;
> +
> +	ctx->offset += tmp - arg;
> +	ret = parse_btf_arg(tmp, pcode, end, ctx);
> +	return ret;
> +}
> +
> +#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
> +
> +static void clear_struct_btf(struct traceprobe_parse_context *ctx)
> +{
> +	ctx->struct_btf = NULL;
> +}
> +
>  static void clear_btf_context(struct traceprobe_parse_context *ctx)
>  {
>  	ctx->btf = NULL;
> @@ -794,7 +906,15 @@ static int check_prepare_btf_string_fetch(char *typename,
>  	return 0;
>  }
>  
> -#endif
> +static int handle_typecast(char *arg, struct fetch_insn **pcode,
> +			   struct fetch_insn *end,
> +			   struct traceprobe_parse_context *ctx)
> +{
> +	trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
> +	return -EOPNOTSUPP;
> +}
> +
> +#endif /* CONFIG_PROBE_EVENTS_BTF_ARGS */
>  
>  #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
>  
> @@ -948,16 +1068,9 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
>  	int len;
>  
>  	if (ctx->flags & TPARG_FL_TEVENT) {
> -		if (code->data)
> -			return -EFAULT;
> -		ret = parse_trace_event_arg(arg, code, ctx);
> -		if (!ret)
> -			return 0;
> -		if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
> -			code->op = FETCH_OP_COMM;
> -			return 0;
> -		}
> -		goto inval;
> +		if (parse_trace_event(arg, code, ctx) < 0)
> +			goto inval;
> +		return 0;
>  	}
>  
>  	if (str_has_prefix(arg, "retval")) {
> @@ -1224,6 +1337,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
>  				code->op = FETCH_OP_IMM;
>  		}
>  		break;
> +	case '(':
> +		ret = handle_typecast(arg, pcode, end, ctx);
> +		break;
>  	default:
>  		if (isalpha(arg[0]) || arg[0] == '_') {	/* BTF variable */
>  			if (!tparg_is_function_entry(ctx->flags) &&
> @@ -1556,6 +1672,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
>  	}
>  	kfree(tmp);
>  
> +	/* struct_btf should not be passed to other arguments */
> +	clear_struct_btf(ctx);
> +
>  	return ret;
>  }
>  
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 1076f1df347b..15758cc11fc6 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -422,7 +422,9 @@ struct traceprobe_parse_context {
>  	const struct btf_param *params;	/* Parameter of the function */
>  	s32 nr_params;			/* The number of the parameters */
>  	struct btf *btf;		/* The BTF to be used */
> +	struct btf *struct_btf;		/* The BTF to be used for structs */
>  	const struct btf_type *last_type;	/* Saved type */
> +	const struct btf_type *last_struct;	/* Saved structure */
>  	u32 last_bitoffs;		/* Saved bitoffs */
>  	u32 last_bitsize;		/* Saved bitsize */
>  	struct trace_probe *tp;
> @@ -563,7 +565,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
>  	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
>  	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
>  	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
> -	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
> +	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),  \
> +	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"),
>  
>  #undef C
>  #define C(a, b)		TP_ERR_##a
> -- 
> 2.53.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-02  2:16 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <ahOqzpzAua96HVkn@gourry-fedora-PF4VCD3F>

On Sun, May 24, 2026 at 09:50:06PM -0400, Gregory Price wrote:
> On Thu, May 21, 2026 at 04:23:28PM +1000, Balbir Singh wrote:
> > On Sun, Feb 22, 2026 at 03:48:15AM -0500, Gregory Price wrote:
> > > Topic type: MM
> > > 
> > > Presenter: Gregory Price <gourry@gourry.net>
> > > 
> > > This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
> > > managed by the buddy allocator but excluded from normal allocations.
> > > 
> > > I present it with an end-to-end Compressed RAM service (mm/cram.c)
> > > that would otherwise not be possible (or would be considerably more
> > > difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
> > > 
> > 
> > Do we have updates/notes from the meeting?
> > 
> 
> I have been on leave since LSF, but I do have some notes posted:
> 
> https://lore.kernel.org/linux-mm/af9i7dkNvGGxPHzu@gourry-fedora-PF4VCD3F/
> https://lore.kernel.org/linux-mm/agYJcRgOHho8upVv@gourry-fedora-PF4VCD3F/
> 
> I will be trying to post an updated set stripped down without the GFP
> flag as a first pass w/o RFC tags and no UAPI implications so that
> device folks can play with this upstream.
> 
> I'm debating on whether to include OPS_MEMPOLICY in the initial version
> if only because it's not intuitive how it interacts with pagecache. That
> needs more time to bake.
>

It makes sense to look at it and then decide if it makes sense.

> > > 
> > > page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
> > 
> > Do we want to provide kernel level control over allocation of private
> > pages, I assumed that only user space applications? I would assume
> > node affinity would be the way to do so, unless we have multiple
> > 
> 
> alloc_pages_node() is the kernel interface

I was thinking we wouldn't need explicit flags and that allocations would
happen from user space using __GFP_THISNODE to the node or via a nodemask
based on nodes of interest. Is there a reason to add this flag, a system
might have more than one source of N_MEMORY_PRIVATE?

> 
> > > 
> > > /* Ok but I want to do something useful with it */
> > > static const struct node_private_ops ops = {
> > >         .migrate_to     = my_migrate_to,
> > >         .folio_migrate  = my_folio_migrate,
> > >         .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
> > > };
> > > node_private_set_ops(nid, &ops);
> > >
> > 
> > Could you explain this further? Why do OPS_MIGRATION
> > and OPS_MEMPOLICY need to be set explicitly?
> >
> 
> Both of these have been removed from the upcoming version, but in this
> RFC version I was testing OPS_MIGRATION as an explicit flag that meant
> "migrate.c can touch the folios" while OPS_MEMPOLICY meant "mempolicy.c
> can touch the folios".
> 
> As it turns out, OPS_MIGRATION is not a useful filter, as it doesn't
> actually filter anything (anything using OPS_MIGRATION would also need
> its own filter flag, so better to just drop it and do per-server
> opt-ins).
> 

Thanks,
Balbir


^ permalink raw reply

* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
From: Lance Yang @ 2026-06-02  1:53 UTC (permalink / raw)
  To: Lorenzo Stoakes, Alexander Gordeev
  Cc: Andrew Morton, Gerald Schaefer, Nico Pache, linux-doc,
	linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	linux-s390, linux-next
In-Reply-To: <ah2z26OzPktchVeT@lucifer>



On 2026/6/2 01:08, Lorenzo Stoakes wrote:
> On Mon, Jun 01, 2026 at 05:58:08PM +0200, Alexander Gordeev wrote:
>> On Fri, May 22, 2026 at 01:47:24PM -0700, Andrew Morton wrote:
>>
>> Hi Andrew et al,
>>
>>> On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
>>>
>>>> The following series provides khugepaged with the capability to collapse
>>>> anonymous memory regions to mTHPs.
>>>
>>> Thanks, I've updated mm.git's mm-unstable branch to this version.
>>>
>>> It sounds like I might be dropping it soon, haven't started looking at
>>> that yet.  But let's at least eyeball the latest version at this time.
>>>
>>> Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
>>> well, thanks.  The AI checking made a few allegations:
>>
>> This series appears to cause hangs on s390 in linux-next.
>> The issue is not easily reproducible, so it is not yet confirmed.
>> Any ideas for a reliable reproducer that exercises the code path below?
>>
>>      [ 2749.385719] sysrq: Show Blocked State
>>      [ 2749.385730] task:khugepaged      state:D stack:0     pid:209   tgid:209   ppid:2      task_flags:0x200040 flags:0x00000000
>>      [ 2749.385735] Call Trace:
>>      [ 2749.385736]  [<0000017f63c8b226>] __schedule+0x316/0x890
>>      [ 2749.385740]  [<0000017f63c8b7dc>] schedule+0x3c/0xc0
>>      [ 2749.385743]  [<0000017f63c8b888>] schedule_preempt_disabled+0x28/0x40
>>      [ 2749.385746]  [<0000017f63c902ea>] rwsem_down_write_slowpath+0x2fa/0x8b0
>>      [ 2749.385749]  [<0000017f63c90910>] down_write+0x70/0x80
>>      [ 2749.385752]  [<0000017f6313407a>] collapse_huge_page+0x2ea/0x9e0
>>      [ 2749.385755]  [<0000017f6313491e>] mthp_collapse+0x1ae/0x1f0
>>      [ 2749.385757]  [<0000017f63134fda>] collapse_scan_pmd+0x67a/0x8f0
>>      [ 2749.385760]  [<0000017f6313751a>] collapse_single_pmd+0x15a/0x260
>>      [ 2749.385762]  [<0000017f6313792c>] collapse_scan_mm_slot.constprop.0+0x30c/0x470
>>      [ 2749.385765]  [<0000017f63137cb6>] khugepaged+0x226/0x240
>>      [ 2749.385768]  [<0000017f62db3128>] kthread+0x148/0x170
>>      [ 2749.385770]  [<0000017f62d2c238>] __ret_from_fork+0x48/0x220
>>      [ 2749.385772]  [<0000017f63c95d0a>] ret_from_fork+0xa/0x30
>>
>> Thanks!
> 
> Hi Alexander,
> 
> Thanks for the report.
> 
> It's a pity it's non-repro, I had Claude have a look at it and it couldn't find
> a definite issue with the code at v18, all the locks seem balanced internally.
> 
> Things it highlighted FWIW:
> 
> - Far more mmap_write_lock()'s being taken - the stack-based approach calls
>    collapse_huge_page() multiple times per-PMD each of which entails an mmap read
>    lock/unlock and mmap write lock.
> 
> - anon_vma write lock held for a much longer period over partial collapse.
> 
> So maybe these are triggering issues rather than being the cause of them per-se?
> 
> If you happen to see it again could you give the output for:
> 
> 'echo t > /proc/sysrq-trigger' so we can track who holds the contended lock and
> get more details on it?
> 
> Also the .config would be useful.
> 
> I'm guessing you've also not enabled mTHP in any way on the system?
> 
> Repro-wise you could also:
> 
> # echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> # echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
> 
> To get khugepaged going more aggressively:
> 
> $ for f in /sys/kernel/mm/transparent_hugepage/hugepages-*; do echo always | sudo tee $f/enabled; done
> 
> Then maybe some stress-ng like sudo stress-ng --vm 4 --vm-bytes 2G --vm-method
> all --timeout 5m (or maybe something more refined :)?
> 
> Maybe some of this will help repro more reliably?
> 

Cool!

Maybe also worth trying with CONFIG_DETECT_HUNG_TASK=y and
CONFIG_DETECT_HUNG_TASK_BLOCKER=y.

# detect after 10s in D state instead of default 120s
echo 10 > /proc/sys/kernel/hung_task_timeout_secs

# optional: check more often; 0 means same as timeout
echo 0 > /proc/sys/kernel/hung_task_check_interval_secs

With that enabled, the kernel should hopefully tell us which task likely
owns the rwsem. If it is writer-owned, I would expect that to be fairly
reliable.

Cheers, Lance

^ permalink raw reply

* [RESEND][PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-06-02  0:25 UTC (permalink / raw)
  To: LKML, Linux Trace Kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
	Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa

From: Steven Rostedt <rostedt@goodmis.org>

Add syntax to the parsing of eprobes to be able to typecast a trace event
field that is a pointer to a structure.

Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
dereference.

But an event probe that records a field that happens to be a pointer to
a structure cannot dereference these values with BTF naming; it must
use numerical offsets.

For example, to find out what device a sk_buff is pointing to in the
net_dev_xmit trace event, one must first use gdb to find the offsets of the
members of the structures:

 (gdb) p &((struct sk_buff *)0)->dev
 $1 = (struct net_device **) 0x10
 (gdb) p &((struct net_device *)0)->name
 $2 = (char (*)[16]) 0x118

And then use the raw numbers to dereference:

  # echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events
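
As a rough userspace illustration of what the raw-offset form above asks
the probe to do, here is a minimal sketch. The mock_* structures are
hypothetical stand-ins, padded only so that their offsets match the gdb
output (0x10 and 0x118); they are not the real sk_buff or net_device
layouts, and fetch_dev_name() is an invented name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in: padded so that 'name' sits at offset 0x118. */
struct mock_net_device {
	char pad[0x118];
	char name[16];
};

/* Hypothetical stand-in: padded so that 'dev' sits at offset 0x10. */
struct mock_sk_buff {
	char pad[0x10];
	struct mock_net_device *dev;
};

/*
 * Mimics +0x118(+0x10($skbaddr)):string: load the pointer stored at
 * skbaddr + 0x10, then treat (that pointer + 0x118) as a C string.
 */
static const char *fetch_dev_name(const void *skbaddr)
{
	const struct mock_net_device *dev;

	assert(skbaddr != NULL);
	/* Inner dereference: read the 'dev' pointer out of the mock skb. */
	memcpy(&dev, (const char *)skbaddr + offsetof(struct mock_sk_buff, dev),
	       sizeof(dev));
	/* Outer dereference: address of the name array inside the device. */
	return (const char *)dev + offsetof(struct mock_net_device, name);
}
```

Both offsets are baked into the probe definition, so any change to the
structure layout silently breaks it. That is the fragility the BTF
typecast is meant to avoid.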

If BTF is in the kernel, then instead, the skbaddr can be typecast to
sk_buff and use the normal dereference logic.

  # echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
  # echo 1 > events/eprobes/xmit/enable
  # cat trace
[..]
    sshd-session-1022    [000] b..2.   860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"

The syntax is simply: (STRUCT)FIELD->MEMBER[->MEMBER..]
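
To make the pieces of a typecast argument such as
(sk_buff)skbaddr->dev->name concrete, here is a small userspace sketch
that splits the string in place. It only loosely follows the strchr()
approach of handle_typecast() in the patch; split_typecast() and its
out-parameters are hypothetical names, and unlike the kernel code it
does not restore the ')' afterwards or consult BTF at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split "(STRUCT)FIELD->MEMBER[->MEMBER..]" in place.
 * On success, *sname, *field and *members point into 'arg'.
 * Returns 0 on success, -1 if the string does not match the syntax.
 */
static int split_typecast(char *arg, char **sname, char **field,
			  char **members)
{
	char *close, *arrow;

	assert(arg && sname && field && members);

	if (arg[0] != '(')
		return -1;

	/* The struct name runs from after '(' to the first ')'. */
	close = strchr(arg, ')');
	if (!close || close == arg + 1)
		return -1;
	*close = '\0';
	*sname = arg + 1;

	/* The event field name runs up to the first "->". */
	*field = close + 1;
	arrow = strstr(*field, "->");
	if (!arrow || arrow == *field)
		return -1;
	*arrow = '\0';

	/* Whatever follows is the member chain, e.g. "dev->name". */
	*members = arrow + 2;
	if (**members == '\0')
		return -1;
	return 0;
}
```

For the example above this yields "sk_buff", "skbaddr" and "dev->name".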

Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
to know what they are for.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---

[ Resend with base-id below, maybe Sashiko will apply it to the correct tree! ]

base-id: 585abc02be3d3ab82fbcc4dbcbbf0ceb61a02129

Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora

- Add error message in parse_btf_arg() for failed parsing of TEVENT.
  (Sashiko)

- Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
  The flag was redundant and added unnecessary complexity.

- Restructure to keep the lifetime of the TYPECAST to the end of
  traceprobe_parse_probe_arg_body(). This allows the last_type to stay
  around in case there's not a type parameter and then btf can still be
  used.
  (Sashiko and Masami Hiramatsu)
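
For readers following the lifetime change, here is a rough userspace
model of the idea: the struct BTF reference is taken when a typecast is
seen and dropped once at the end of parsing that argument, so it never
leaks into the next argument. All mock_* names are hypothetical; the
real code uses bpf_find_btf_id() and btf_put(), which are only imitated
here with a plain counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted struct btf. */
struct mock_btf {
	int refcnt;
};

/* Hypothetical stand-in for the two new context fields. */
struct mock_ctx {
	struct mock_btf *struct_btf;
	const void *last_struct;
};

/* Imitates taking a reference while looking up the struct type. */
static void mock_set_struct_btf(struct mock_ctx *ctx, struct mock_btf *btf,
				const void *type)
{
	btf->refcnt++;
	ctx->struct_btf = btf;
	ctx->last_struct = type;
}

/* Imitates clear_struct_btf(): safe to call when nothing is held. */
static void mock_clear_struct_btf(struct mock_ctx *ctx)
{
	if (ctx->struct_btf) {
		assert(ctx->struct_btf->refcnt > 0);
		ctx->struct_btf->refcnt--;
		ctx->struct_btf = NULL;
		ctx->last_struct = NULL;
	}
}

/* One argument: optionally take the reference, always release at the end. */
static int mock_parse_one_arg(struct mock_ctx *ctx, struct mock_btf *btf,
			      int has_typecast)
{
	if (has_typecast)
		mock_set_struct_btf(ctx, btf, btf);

	/* ... field and type resolution would consult ctx->struct_btf ... */

	mock_clear_struct_btf(ctx);
	return 0;
}
```

Releasing in one place at the end of the argument, rather than inside
the typecast handler, is what lets later type resolution still see the
saved type.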

 Documentation/trace/eprobetrace.rst |   4 +
 kernel/trace/trace_probe.c          | 173 +++++++++++++++++++++++-----
 kernel/trace/trace_probe.h          |   5 +-
 3 files changed, 154 insertions(+), 28 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 89b5157cfab8..fe3602540569 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -46,6 +46,10 @@ Synopsis of eprobe_events
 		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
                   "string", "ustring", "symbol", "symstr" and "bitfield" are
                   supported.
+  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+                  a pointer to STRUCT and then dereference the pointer defined by
+                  ->MEMBER. Note that when this is used, the FIELD name does not
+                  need to be prefixed with a '$'.
 
 Types
 -----
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 695310571b08..fd1caa1f9723 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -332,6 +332,23 @@ static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
 	return -ENOENT;
 }
 
+static int parse_trace_event(char *arg, struct fetch_insn *code,
+			     struct traceprobe_parse_context *ctx)
+{
+	int ret;
+
+	if (code->data)
+		return -EFAULT;
+	ret = parse_trace_event_arg(arg, code, ctx);
+	if (!ret)
+		return 0;
+	if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
+		code->op = FETCH_OP_COMM;
+		return 0;
+	}
+	return -EINVAL;
+}
+
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
 
 static u32 btf_type_int(const struct btf_type *t)
@@ -376,11 +393,16 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
 		&& BTF_INT_BITS(intdata) == 8;
 }
 
+static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
+{
+	return ctx->struct_btf ? : ctx->btf;
+}
+
 static int check_prepare_btf_string_fetch(char *typename,
 				struct fetch_insn **pcode,
 				struct traceprobe_parse_context *ctx)
 {
-	struct btf *btf = ctx->btf;
+	struct btf *btf = ctx_btf(ctx);
 
 	if (!btf || !ctx->last_type)
 		return 0;
@@ -506,6 +528,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
 	return 0;
 }
 
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+	if (ctx->struct_btf) {
+		btf_put(ctx->struct_btf);
+		ctx->struct_btf = NULL;
+		ctx->last_struct = NULL;
+	}
+}
+
 static void clear_btf_context(struct traceprobe_parse_context *ctx)
 {
 	if (ctx->btf) {
@@ -554,22 +585,29 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 	struct fetch_insn *code = *pcode;
 	const struct btf_member *field;
 	u32 bitoffs, anon_offs;
+	bool is_struct = ctx->struct_btf != NULL;
+	struct btf *btf = ctx_btf(ctx);
 	char *next;
 	int is_ptr;
 	s32 tid;
 
 	do {
-		/* Outer loop for solving arrow operator ('->') */
-		if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
-			trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
-			return -EINVAL;
-		}
-		/* Convert a struct pointer type to a struct type */
-		type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
-		if (!type) {
-			trace_probe_log_err(ctx->offset, BAD_BTF_TID);
-			return -EINVAL;
+		if (!is_struct) {
+			/* Outer loop for solving arrow operator ('->') */
+			if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
+				trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+				return -EINVAL;
+			}
+
+			/* Convert a struct pointer type to a struct type */
+			type = btf_type_skip_modifiers(btf, type->type, &tid);
+			if (!type) {
+				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+				return -EINVAL;
+			}
 		}
+		/* Only the first type can skip being a pointer */
+		is_struct = false;
 
 		bitoffs = 0;
 		do {
@@ -580,7 +618,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 				return is_ptr;
 
 			anon_offs = 0;
-			field = btf_find_struct_member(ctx->btf, type, fieldname,
+			field = btf_find_struct_member(btf, type, fieldname,
 						       &anon_offs);
 			if (IS_ERR(field)) {
 				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -602,7 +640,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 				ctx->last_bitsize = 0;
 			}
 
-			type = btf_type_skip_modifiers(ctx->btf, field->type, &tid);
+			type = btf_type_skip_modifiers(btf, field->type, &tid);
 			if (!type) {
 				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 				return -EINVAL;
@@ -640,7 +678,7 @@ static int parse_btf_arg(char *varname,
 	int i, is_ptr, ret;
 	u32 tid;
 
-	if (WARN_ON_ONCE(!ctx->funcname))
+	if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
 		return -EINVAL;
 
 	is_ptr = split_next_field(varname, &field, ctx);
@@ -653,6 +691,19 @@ static int parse_btf_arg(char *varname,
 		return -EOPNOTSUPP;
 	}
 
+	if (ctx->flags & TPARG_FL_TEVENT) {
+		ret = parse_trace_event(varname, code, ctx);
+		if (ret < 0) {
+			trace_probe_log_err(ctx->offset, BAD_ATTACH_ARG);
+			return ret;
+		}
+		/* TEVENT is only here via a typecast */
+		if (WARN_ON_ONCE(ctx->struct_btf == NULL))
+			return -EINVAL;
+		type = ctx->last_struct;
+		goto found_type;
+	}
+
 	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
 		code->op = FETCH_OP_RETVAL;
 		/* Check whether the function return type is not void */
@@ -709,6 +760,7 @@ static int parse_btf_arg(char *varname,
 
 found:
 	type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+found_type:
 	if (!type) {
 		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 		return -EINVAL;
@@ -727,7 +779,7 @@ static int parse_btf_arg(char *varname,
 static const struct fetch_type *find_fetch_type_from_btf_type(
 					struct traceprobe_parse_context *ctx)
 {
-	struct btf *btf = ctx->btf;
+	struct btf *btf = ctx_btf(ctx);
 	const char *typestr = NULL;
 
 	if (btf && ctx->last_type)
@@ -758,7 +810,67 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
 	return 0;
 }
 
-#else
+static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
+{
+	struct btf *btf = NULL;
+	int id;
+
+	/* A struct_btf should only be used by a single argument */
+	if (WARN_ON_ONCE(ctx->struct_btf)) {
+		btf_put(ctx->struct_btf);
+		ctx->struct_btf = NULL;
+	}
+
+	id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+	if (id < 0)
+		return id;
+	ctx->struct_btf = btf;
+	ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
+	return 0;
+}
+
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+			   struct fetch_insn *end,
+			   struct traceprobe_parse_context *ctx)
+{
+	char *tmp;
+	int ret;
+
+	/* Currently this only works for eprobes */
+	if (!(ctx->flags & TPARG_FL_TEVENT)) {
+		trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
+		return -EINVAL;
+	}
+
+	tmp = strchr(arg, ')');
+	if (!tmp) {
+		trace_probe_log_err(ctx->offset + strlen(arg),
+				    DEREF_OPEN_BRACE);
+		return -EINVAL;
+	}
+	*tmp = '\0';
+	ret = query_btf_struct(arg + 1, ctx);
+	*tmp = ')';
+
+	if (ret < 0) {
+		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+		return -EINVAL;
+	}
+
+	tmp++;
+
+	ctx->offset += tmp - arg;
+	ret = parse_btf_arg(tmp, pcode, end, ctx);
+	return ret;
+}
+
+#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
+
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+	ctx->struct_btf = NULL;
+}
+
 static void clear_btf_context(struct traceprobe_parse_context *ctx)
 {
 	ctx->btf = NULL;
@@ -794,7 +906,15 @@ static int check_prepare_btf_string_fetch(char *typename,
 	return 0;
 }
 
-#endif
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+			   struct fetch_insn *end,
+			   struct traceprobe_parse_context *ctx)
+{
+	trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+	return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_PROBE_EVENTS_BTF_ARGS */
 
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 
@@ -948,16 +1068,9 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
 	int len;
 
 	if (ctx->flags & TPARG_FL_TEVENT) {
-		if (code->data)
-			return -EFAULT;
-		ret = parse_trace_event_arg(arg, code, ctx);
-		if (!ret)
-			return 0;
-		if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
-			code->op = FETCH_OP_COMM;
-			return 0;
-		}
-		goto inval;
+		if (parse_trace_event(arg, code, ctx) < 0)
+			goto inval;
+		return 0;
 	}
 
 	if (str_has_prefix(arg, "retval")) {
@@ -1224,6 +1337,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 				code->op = FETCH_OP_IMM;
 		}
 		break;
+	case '(':
+		ret = handle_typecast(arg, pcode, end, ctx);
+		break;
 	default:
 		if (isalpha(arg[0]) || arg[0] == '_') {	/* BTF variable */
 			if (!tparg_is_function_entry(ctx->flags) &&
@@ -1556,6 +1672,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
 	}
 	kfree(tmp);
 
+	/* struct_btf should not be passed to other arguments */
+	clear_struct_btf(ctx);
+
 	return ret;
 }
 
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 1076f1df347b..15758cc11fc6 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -422,7 +422,9 @@ struct traceprobe_parse_context {
 	const struct btf_param *params;	/* Parameter of the function */
 	s32 nr_params;			/* The number of the parameters */
 	struct btf *btf;		/* The BTF to be used */
+	struct btf *struct_btf;		/* The BTF to be used for structs */
 	const struct btf_type *last_type;	/* Saved type */
+	const struct btf_type *last_struct;	/* Saved structure */
 	u32 last_bitoffs;		/* Saved bitoffs */
 	u32 last_bitsize;		/* Saved bitsize */
 	struct trace_probe *tp;
@@ -563,7 +565,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
 	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
 	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
-	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
+	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),  \
+	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"),
 
 #undef C
 #define C(a, b)		TP_ERR_##a
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-06-02  0:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Mark Rutland, Peter Zijlstra, Namhyung Kim, Takaya Saeki,
	Douglas Raillard, Tom Zanussi, Andrew Morton, Thomas Gleixner,
	Ian Rogers, Jiri Olsa
In-Reply-To: <20260601133129.4a1e9dec@gandalf.local.home>

On Mon, 1 Jun 2026 13:31:29 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 1 Jun 2026 13:07:46 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
> > 
> > - Add error message in parse_btf_arg() for failed parsing of TEVENT.
> >   (Sashiko)
> > 
> > - Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
> >   The flag was redundant and added unnecessary complexity.
> > 
> > - Restructure to keep the lifetime of the TYPECAST to the end of
> >   traceprobe_parse_probe_arg_body(). This allows the last_type to stay
> >   around in case there's not a type parameter and then btf can still be
> >   used.
> >   (Sashiko and Masami Hiramatsu)
> 
> And I rebased onto probes/for-next
> 

Thanks, but it seems Sashiko failed to apply it (because it is using the
linux-trace/HEAD branch?). Hmm, we may always need a "base-id" tag.

Thanks,

> -- Steve


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply
