* Re: [PATCH v2] selftests/ftrace: Fix trace_marker_raw test on 64K page kernels
From: Tianchen Ding @ 2026-05-29 2:59 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, Shuah Khan, linux-kernel,
linux-trace-kernel, linux-kselftest
In-Reply-To: <20260528091348.71ae3aa3@fedora>
On 5/28/26 9:13 PM, Steven Rostedt wrote:
> On Thu, 28 May 2026 10:24:17 +0800
> Tianchen Ding <dtcccc@linux.alibaba.com> wrote:
>
>>
>> diff --git a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
>> index 8e905d4fe6dd..f68f1901f65f 100644
>> --- a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
>> +++ b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
>> @@ -43,8 +43,11 @@ write_buffer() {
>> id=$1
>> size=$2
>>
>> - # write the string into the raw marker
>> - make_str $id $size > trace_marker_raw
>> + # Pipe through dd to ensure a single atomic write() syscall
>> + # on architectures with 64K pages, where shell's printf builtin
>> + # uses stdio buffering which may split the output into multiple
>> + # writes.
>> + make_str $id $size | dd of=trace_marker_raw bs=`expr $size + 4` iflag=fullblock
>
> I was looking at this more, and I'm not comfortable with the hard coded
> 4 above. I'd rather use the length of the string. Something like:
>
> str=`make_str $id $size`
> len=${#str}
> echo "$str" | dd of=trace_marker_raw bs=$len iflag=fullblock
>
> -- Steve
>
Capturing make_str output into a shell variable doesn't work because make_str
outputs raw binary that may contain NUL bytes, and shell command substitution
silently strips them.
However, the val variable inside make_str doesn't hold actual NUL bytes — it
holds the text of escape sequences (e.g., the literal characters
\003\000\000\000). The binary conversion only happens at the final printf
"${val}${data}".
We can take advantage of this by having make_str return the escape-sequence text
instead of binary, and letting write_buffer handle the conversion:
make_str() {
	...
	printf '%s' "${val}${data}"
}

write_buffer() {
	id=$1
	size=$2

	str=`make_str $id $size`
	len=$(printf "$str" | wc -c)
	printf "$str" | dd of=trace_marker_raw bs=$len iflag=fullblock
}
This way str holds only printable escape-sequence text (no NUL), printf "$str"
converts it to real binary through the pipe, and wc -c measures the true binary
length.
>> }
>>
>>
^ permalink raw reply
* [PATCH] ring-buffer: Better comment the use of RB_MISSED_EVENTS
From: Steven Rostedt @ 2026-05-29 2:37 UTC (permalink / raw)
To: LKML, Linux trace kernel; +Cc: Masami Hiramatsu, Mathieu Desnoyers
From: Steven Rostedt <rostedt@goodmis.org>
If the persistent ring buffer is detected on boot up to have a corrupted
sub-buffer, that sub-buffer is cleared to zero and its commit value has
the RB_MISSED_EVENTS bit set. That bit allows the "trace",
"trace_pipe" and "trace_pipe_raw" files to report that events were dropped
by outputting "[LOST EVENTS]".
Only in this case does that bit get set in the writeable portion of the
ring buffer. When events are dropped in the normal ring buffer, that
information is stored in the cpu_buffer descriptor and the
RB_MISSED_EVENTS is set in the buffer page at the time the page is
consumed. It is never set in the writeable portion of the buffer.
Add comments to describe this better, as it can be confusing to know when
RB_MISSED_EVENTS is set in the commit portion of the buffer page.
Link: https://lore.kernel.org/all/20260529001500.14178455a046a5cbc6180861@kernel.org/
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 910f6b3adf74..06fb365bb86e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1929,6 +1929,12 @@ static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
*/
if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
local_set(&bpage->entries, 0);
+ /*
+ * Note, the RB_MISSED_EVENTS is only set inside the main write
+ * buffer by this verification logic. The normal ring buffer
+ * has this bit set when the page is read and passed to the
+ * consumers.
+ */
local_set(&dpage->commit, RB_MISSED_EVENTS);
dpage->time_stamp = prev_ts ? prev_ts : next_ts;
ret = -1;
@@ -7232,6 +7238,14 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
local_add(RB_MISSED_STORED, &dpage->commit);
size += sizeof(missed_events);
}
+ /*
+ * Note, for the persistent ring buffer, the RB_MISSED_EVENTS
+ * may have been set in the main buffer via the verification code.
+ * But here, dpage is a copy of that page and has not yet had
+ * the RB_MISSED_EVENTS set. As for the normal buffers,
+ * the main write buffer does not set these bits and it needs
+ * to be set here.
+ */
local_add(RB_MISSED_EVENTS, &dpage->commit);
}
--
2.53.0
^ permalink raw reply related
* Re: [PATCH] tracing/probes: Point the error offset correctly for eprobe argument error
From: Steven Rostedt @ 2026-05-29 2:29 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Shuah Khan, Mathieu Desnoyers, linux-kernel, linux-trace-kernel,
linux-kselftest
In-Reply-To: <177967567399.209006.1451571244515632097.stgit@devnote2>
On Mon, 25 May 2026 11:21:14 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
> Fix to point the error offset correctly for eprobe argument error.
> In the cleanup commit 1b8b0cd754cd ("tracing/probes: Move event parameter
> fetching code to common parser"), due to incorrect backward compatibility
> aimed at conforming to the test specifications, the error location was set
> to 0 when a non-existent formal parameter was specified for Eprobe.
> However, this should be corrected in both the test and the implementation
> to point to the correct error position.
>
> Fixes: 1b8b0cd754cd ("tracing/probes: Move event parameter fetching code to common parser")
> Cc: stable@vger.kernel.org
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
-- Steve
^ permalink raw reply
* Re: [PATCH v2] unwind: Add sframe_(un)register() system calls
From: Steven Rostedt @ 2026-05-29 2:20 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
Mathieu Desnoyers, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
Vasily Gorbik, Thomas Weißschuh
In-Reply-To: <CAEf4BzZh4qyPiMbpZPeVGx+HFNjBjAHTsNOx5wE7RWidM-iphA@mail.gmail.com>
On Thu, 28 May 2026 16:01:06 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> [...]
>
> > * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index a627acc8fb5f..17042d7e5e87 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
> > #define __NR_rseq_slice_yield 471
> > __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
> >
> > +#define __NR_sframe_register 472
> > +__SYSCALL(__NR_sframe_register, sys_sframe_register)
> > +#define __NR_sframe_unregister 473
> > +__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
> > +
> > #undef __NR_syscalls
> > -#define __NR_syscalls 472
> > +#define __NR_syscalls 474
> >
> > /*
> > * 32 bit systems traditionally used different
> > diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
> > new file mode 100644
> > index 000000000000..d3c9f88b024b
> > --- /dev/null
> > +++ b/include/uapi/linux/sframe.h
> > @@ -0,0 +1,12 @@
> > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_SFRAME_H
> > +#define _UAPI_LINUX_SFRAME_H
> > +
> > +struct sframe_setup {
>
> I'd add `u64 flags;` field for easier and nicer extensibility. Check
> in the kernel that it is set to zero, future kernels will allow some
> of the bits to be set.
That sounds reasonable.
>
> And I still think that prctl() instead of a separate sframe-specific
> syscall is the way to go. I see no reason for sframe-specific set of
> syscalls just to set a bit of extra metadata for the entire process.
> That seems to be the job of prctl().
I personally do not have a preference. I've just heard from many
others that they want to avoid extending an ioctl()-like system call
or even creating a new multiplexer syscall.
If we can reach a consensus on using prctl() or adding a separate system
call, I'll go with whatever that is.
>
> > + __u64 sframe_start;
> > + __u64 sframe_size;
> > + __u64 text_start;
> > + __u64 text_size;
> > +};
> > +
>
> [...]
>
> > +
> > +/**
> > + * sys_sframe_register - register an address for user space stacktrace walking.
> > + * @data: Structure of sframe data used to register the sframe section
> > + * @size: The size of the given structure.
> > + *
> > + * This system call is used by dynamic library utilities to inform the kernel
> > + * of meta data that it loaded that can be used by the kernel to know how
> > + * to stack walk the given text locations.
> > + *
> > + * Return: 0 if successful, otherwise a negative error.
> > + */
> > +SYSCALL_DEFINE2(sframe_register, struct sframe_setup __user *, data, size_t, size)
> > +{
> > + struct sframe_setup sframe;
> > +
> > + if (sizeof(sframe) != size)
> > + return -EINVAL;
>
> This seems overly aggressive. It seems like the pattern is to allow
> sizes both smaller and bigger:
> - if user-provided size is smaller than what kernel knows about,
> treat missing fields as zeroes
Well, that could work with unregister, but for register that isn't
quite useful, as all fields should be filled (well, if we add flags,
that may not be 100% true).
> - if user-provided size is bigger, then check that space after
> fields that kernel recognizes are all zeroes.
That is dangerous. A zero with greater size could mean something. If
the size is greater than expected it should simply fail and let user
space call it again with the older version.
>
> This allows extensibility without having to change user space code all
> the time. Old code will provide smaller struct without new (presumably
> optional) fields, while newer code can use newer and larger struct
> size, but as long as it clears extra fields old kernel will be fine
> with that.
The old size will always work, thus old code will always continue to
work. If we extend the system call, then it must handle both the older
size as well as the newer size. User space would not need to change. It
would only change if it wanted to use a new feature, and if it wants to
work with older kernels it would need to try the bigger size first and
if that fails, it knows the kernel doesn't support that new feature and
then user space can figure out what to do. Either use the old system
call or abort.
-- Steve
>
> > +
> > + if (copy_from_user(&sframe, data, size))
> > + return -EFAULT;
> > +
> > + return sframe_add_section(sframe.sframe_start,
> > + sframe.sframe_start + sframe.sframe_size,
> > + sframe.text_start,
> > + sframe.text_start + sframe.text_size);
> > +}
> > +
>
> [...]
^ permalink raw reply
* Re: [PATCH v2] unwind: Add sframe_(un)register() system calls
From: Andrii Nakryiko @ 2026-05-28 23:01 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
Mathieu Desnoyers, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
Vasily Gorbik, Thomas Weißschuh
In-Reply-To: <20260528151023.00f5ec4e@gandalf.local.home>
On Thu, May 28, 2026 at 12:09 PM Steven Rostedt <rostedt@kernel.org> wrote:
>
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Add system calls to register and unregister sframes that can be used by
> dynamic linkers to tell the kernel where the sframe section is in memory
> for libraries it loads.
>
> Both system calls take a pointer to a new structure:
>
> struct sframe_setup {
> __u64 sframe_start;
> __u64 sframe_size;
> __u64 text_start;
> __u64 text_size;
> };
>
> and a size of the passed in structure. If the system call needs to be
> extended, then the structure could be changed and the size of that
> structure will tell the kernel that it is the new version. If the kernel
> does not recognize the structure size, it will return -EINVAL.
>
> sframe_start - The virtual address of the sframe section
> sframe_size - The length of the sframe section
> text_start - the text section the sframe represents
> text_size - the length of the text section
>
> If other stack tracing functionality is added, it will require a new
> system call.
>
> The unregister only needs the sframe_start and requires all the rest of
> the fields to be 0. In the future, if more can be done, then user space
> can update the other values and check the return code to see if the kernel
> supports it.
>
> Also added a DEFINE_GUARD() for mmap_write_lock. There was one for
> mmap_read_lock but not for mmap_write_lock.
>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>
> Changes since v1: https://patch.msgid.link/20260521183532.7a145c8a@gandalf.local.home
>
> - Use mmap_write_lock() instead of mmap_read_lock() for mutual
> exclusiveness. (Jens Remus)
>
> - Guard mtree_insert_range() with mmap_write_lock. (Jens Remus)
>
> - Added a guard for mmap_write_lock() similar to the one for mmap_read_lock.
>
> - Have syscall prototype use structure pointer instead of void (Thomas Weißschuh)
>
> - Use __u64 instead of unsigned long for struct members (Thomas Weißschuh)
>
> - Use size_t instead of int for structure size in syscall argument.
> (Thomas Weißschuh)
>
> arch/alpha/kernel/syscalls/syscall.tbl | 2 +
> arch/arm/tools/syscall.tbl | 2 +
> arch/arm64/tools/syscall_32.tbl | 2 +
> arch/m68k/kernel/syscalls/syscall.tbl | 2 +
> arch/microblaze/kernel/syscalls/syscall.tbl | 2 +
> arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +
> arch/mips/kernel/syscalls/syscall_n64.tbl | 2 +
> arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +
> arch/parisc/kernel/syscalls/syscall.tbl | 2 +
> arch/powerpc/kernel/syscalls/syscall.tbl | 2 +
> arch/s390/kernel/syscalls/syscall.tbl | 3 +
> arch/sh/kernel/syscalls/syscall.tbl | 2 +
> arch/sparc/kernel/syscalls/syscall.tbl | 2 +
> arch/x86/entry/syscalls/syscall_32.tbl | 2 +
> arch/x86/entry/syscalls/syscall_64.tbl | 2 +
> arch/xtensa/kernel/syscalls/syscall.tbl | 2 +
> include/linux/mmap_lock.h | 3 +
> include/linux/syscalls.h | 3 +
> include/uapi/asm-generic/unistd.h | 7 ++-
> include/uapi/linux/sframe.h | 12 ++++
> kernel/sys_ni.c | 3 +
> kernel/unwind/sframe.c | 69 +++++++++++++++++++--
> scripts/syscall.tbl | 2 +
> 23 files changed, 126 insertions(+), 6 deletions(-)
> create mode 100644 include/uapi/linux/sframe.h
>
[...]
> * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index a627acc8fb5f..17042d7e5e87 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
> #define __NR_rseq_slice_yield 471
> __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
>
> +#define __NR_sframe_register 472
> +__SYSCALL(__NR_sframe_register, sys_sframe_register)
> +#define __NR_sframe_unregister 473
> +__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
> +
> #undef __NR_syscalls
> -#define __NR_syscalls 472
> +#define __NR_syscalls 474
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
> new file mode 100644
> index 000000000000..d3c9f88b024b
> --- /dev/null
> +++ b/include/uapi/linux/sframe.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_SFRAME_H
> +#define _UAPI_LINUX_SFRAME_H
> +
> +struct sframe_setup {
I'd add `u64 flags;` field for easier and nicer extensibility. Check
in the kernel that it is set to zero, future kernels will allow some
of the bits to be set.
And I still think that prctl() instead of a separate sframe-specific
syscall is the way to go. I see no reason for sframe-specific set of
syscalls just to set a bit of extra metadata for the entire process.
That seems to be the job of prctl().
> + __u64 sframe_start;
> + __u64 sframe_size;
> + __u64 text_start;
> + __u64 text_size;
> +};
> +
[...]
> +
> +/**
> + * sys_sframe_register - register an address for user space stacktrace walking.
> + * @data: Structure of sframe data used to register the sframe section
> + * @size: The size of the given structure.
> + *
> + * This system call is used by dynamic library utilities to inform the kernel
> + * of meta data that it loaded that can be used by the kernel to know how
> + * to stack walk the given text locations.
> + *
> + * Return: 0 if successful, otherwise a negative error.
> + */
> +SYSCALL_DEFINE2(sframe_register, struct sframe_setup __user *, data, size_t, size)
> +{
> + struct sframe_setup sframe;
> +
> + if (sizeof(sframe) != size)
> + return -EINVAL;
This seems overly aggressive. It seems like the pattern is to allow
sizes both smaller and bigger:
- if user-provided size is smaller than what kernel knows about,
treat missing fields as zeroes
- if user-provided size is bigger, then check that space after
fields that kernel recognizes are all zeroes.
This allows extensibility without having to change user space code all
the time. Old code will provide smaller struct without new (presumably
optional) fields, while newer code can use newer and larger struct
size, but as long as it clears extra fields old kernel will be fine
with that.
> +
> + if (copy_from_user(&sframe, data, size))
> + return -EFAULT;
> +
> + return sframe_add_section(sframe.sframe_start,
> + sframe.sframe_start + sframe.sframe_size,
> + sframe.text_start,
> + sframe.text_start + sframe.text_size);
> +}
> +
[...]
^ permalink raw reply
* Re: [PATCH] tracing: fix CFI violation in probestub helper
From: Steven Rostedt @ 2026-05-28 20:49 UTC (permalink / raw)
To: Eva Kurchatova
Cc: mhiramat, linux-trace-kernel, linux-kernel, mathieu.desnoyers,
peterz, jpoimboe, samitolvanen
In-Reply-To: <20260524154301.21119-1-eva.kurchatova@virtuozzo.com>
On Sun, 24 May 2026 18:43:01 +0300
Eva Kurchatova <eva.kurchatova@virtuozzo.com> wrote:
> When multiple callbacks are registered on the same tracepoint, probestub
> will be indirectly called via traceiter helper.
>
> Pointer to probestub callback resides in __tracepoints section, which is
> excluded from ENDBR checks in objtool. Pointers to regfunc/unregfunc
> callbacks reside in extended structure however, which is not affected.
>
> Registering multiple callbacks will result in a #CP exception due to
> missed ENDBR in __probestub helper on a CFI-enabled machine.
>
> Fix this by adding CFI_NOSEAL annotation to probestub declaration.
>
> Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks")
> Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
Wait! The probestub is not in the __tracepoints section. At least it
shouldn't be. Are you sure there's not another issue here?
#define __DEFINE_TRACE_EXT(_name, _ext, proto, args) \
static const char __tpstrtab_##_name[] \
__section("__tracepoints_strings") = #_name; \
extern struct static_call_key STATIC_CALL_KEY(tp_func_##_name); \
int __traceiter_##_name(void *__data, proto); \
void __probestub_##_name(void *__data, proto); \
struct tracepoint __tracepoint_##_name __used \
__section("__tracepoints") = { \
Here the structure __tracepoint_##_name is in the __tracepoints section.
.name = __tpstrtab_##_name, \
.key = STATIC_KEY_FALSE_INIT, \
.static_call_key = &STATIC_CALL_KEY(tp_func_##_name), \
.static_call_tramp = STATIC_CALL_TRAMP_ADDR(tp_func_##_name), \
.iterator = &__traceiter_##_name, \
.probestub = &__probestub_##_name, \
.funcs = NULL, \
.ext = _ext, \
}; \
__TRACEPOINT_ENTRY(_name); \
int __traceiter_##_name(void *__data, proto) \
{ \
struct tracepoint_func *it_func_ptr; \
void *it_func; \
\
it_func_ptr = \
rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
if (it_func_ptr) { \
do { \
it_func = READ_ONCE((it_func_ptr)->func); \
__data = (it_func_ptr)->data; \
((void(*)(void *, proto))(it_func))(__data, args); \
} while ((++it_func_ptr)->func); \
} \
return 0; \
} \
void __probestub_##_name(void *__data, proto) \
{ \
}
But above, probestub is just a function defined wherever the tracepoint is
created.
In fact, it's just there for fprobes to work. It doesn't get called if you
add more than one callback to the tracepoint. So your explanation is totally
bogus.
Do you actually see a crash? Or is this just some AI slop that told you
this is a bug?
-- Steve
> ---
> include/linux/tracepoint.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 583d962abcc3..5a32a709759c 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -19,6 +19,7 @@
> #include <linux/rcupdate.h>
> #include <linux/tracepoint-defs.h>
> #include <linux/static_call.h>
> +#include <asm/cfi.h>
>
> struct module;
> struct tracepoint;
> @@ -356,6 +357,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> void __probestub_##_name(void *__data, proto) \
> { \
> } \
> + CFI_NOSEAL(__probestub_##_name); \
> DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);
>
> #define DEFINE_TRACE_FN(_name, _reg, _unreg, _proto, _args) \
^ permalink raw reply
* Re: [RFC PATCH 2/2] tracing: Record and show boot ID in last_boot_info
From: Steven Rostedt @ 2026-05-28 20:36 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Theodore Ts'o, Jason A . Donenfeld, Mathieu Desnoyers,
linux-kernel, linux-trace-kernel
In-Reply-To: <20260524104439.ec01284998cae6d4a5053e61@kernel.org>
On Sun, 24 May 2026 10:44:39 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > If the get_boot_id() is accepted by the random folks, then I'm fine with
> > this change.
>
> Yeah, BTW, Sashiko found this can be initialized before we get enough
> entropy for random seed. Maybe we need one more delay.
Well, maybe for adding the boot_id later, but the code that initializes the
buffers needs to stay early. With the backup instance, the persistent ring
buffer can restart tracing immediately.
-- Steve
^ permalink raw reply
* Re: [PATCHv2] trace: allocate fields with elt struct
From: Steven Rostedt @ 2026-05-28 20:22 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Rosen Penev, linux-trace-kernel, Mathieu Desnoyers,
open list:TRACING
In-Reply-To: <20260526134317.394c1cf3060e89df36662ecc@kernel.org>
On Tue, 26 May 2026 13:43:17 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > > #define DEFINE_TRACING_MAP_CMP_FN(type) \
> > > -static int tracing_map_cmp_##type(void *val_a, void *val_b) \
> > > +static int tracing_map_cmp_##type(const void *val_a, const void *val_b) \
> > > { \
> > > - type a = (type)(*(u64 *)val_a); \
> > > - type b = (type)(*(u64 *)val_b); \
> > > + type a = (type)(*(const u64 *)val_a); \
> > > + type b = (type)(*(const u64 *)val_b); \
> > > \
> > > return (a > b) ? 1 : ((a < b) ? -1 : 0); \
> > > }
> > This is a pre-existing issue, but does unconditionally reading 8 bytes
> > via the u64 cast cause unaligned access exceptions on architectures that
> > do not support them?
> > Additionally, for fields near the end of the dynamically allocated elt->key
> > buffer, can this trigger KASAN slab-out-of-bounds reads?
> > Also, on big-endian architectures, reading a smaller integer as a 64-bit
> > value and casting it down extracts the least-significant bytes rather than
> > the correct field value. Could this result in completely incorrect sorting
> > for small types?
>
> Steve, it seems this comes from your commit 106f41f5a302 ("tracing: Have
> the histogram compare functions convert to u64 first").
>
> I think neither of them is a problem, but could you check it?
This should not be a problem because the pointer being passed in was a
number to begin with. In fact that commit you shared was to fix this
compare on big endian machines. The typecast was specifically made to allow
big endian to work here.
The value is already in an 8-byte (64-bit) memory location. It is copied into
it as a 64-bit number. Hence it has to be read as a 64-bit number for the
conversions.
A short would be copied into the location via:
u64 location;
location = short_word;
On big endian, for a short word of 0xabcd, it would be in the memory as:
0x00 0x00 0x00 0x00 0x00 0x00 0xab 0xcd
On little endian, it would be:
0xcd 0xab 0x00 0x00 0x00 0x00 0x00 0x00
For big endian to work, it would need to read that first into a 64 bit word
and then convert it back to short.
Thus, Sashiko doesn't know enough here to comment.
-- Steve
^ permalink raw reply
* [RESEND][PATCH v2] unwind: Add sframe_(un)register() system calls
From: Steven Rostedt @ 2026-05-28 19:16 UTC (permalink / raw)
To: LKML, Linux Trace Kernel, bpf
Cc: Masami Hiramatsu, Mathieu Desnoyers, Jens Remus, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Linus Torvalds, Andrew Morton,
Florian Weimer, Kees Cook, Carlos O'Donell, Sam James,
Dylan Hatch, Borislav Petkov, Dave Hansen, David Hildenbrand,
H. Peter Anvin, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
Heiko Carstens, Vasily Gorbik, Thomas Weißschuh
From: Steven Rostedt <rostedt@goodmis.org>
Add system calls to register and unregister sframes that can be used by
dynamic linkers to tell the kernel where the sframe section is in memory
for libraries it loads.
Both system calls take a pointer to a new structure:
struct sframe_setup {
__u64 sframe_start;
__u64 sframe_size;
__u64 text_start;
__u64 text_size;
};
and a size of the passed in structure. If the system call needs to be
extended, then the structure could be changed and the size of that
structure will tell the kernel that it is the new version. If the kernel
does not recognize the structure size, it will return -EINVAL.
sframe_start - The virtual address of the sframe section
sframe_size - The length of the sframe section
text_start - the text section the sframe represents
text_size - the length of the text section
If other stack tracing functionality is added, it will require a new
system call.
The unregister only needs the sframe_start and requires all the rest of
the fields to be 0. In the future, if more can be done, then user space
can update the other values and check the return code to see if the kernel
supports it.
Also added a DEFINE_GUARD() for mmap_write_lock. There was one for
mmap_read_lock but not for mmap_write_lock.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
[ Resend with Indu's current email address. ]
Changes since v1: https://patch.msgid.link/20260521183532.7a145c8a@gandalf.local.home
- Use mmap_write_lock() instead of mmap_read_lock() for mutual
exclusiveness. (Jens Remus)
- Guard mtree_insert_range() with mmap_write_lock. (Jens Remus)
- Added a guard for mmap_write_lock() similar to the one for mmap_read_lock.
- Have syscall prototype use structure pointer instead of void (Thomas Weißschuh)
- Use __u64 instead of unsigned long for struct members (Thomas Weißschuh)
- Use size_t instead of int for structure size in syscall argument.
(Thomas Weißschuh)
arch/alpha/kernel/syscalls/syscall.tbl | 2 +
arch/arm/tools/syscall.tbl | 2 +
arch/arm64/tools/syscall_32.tbl | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 2 +
arch/microblaze/kernel/syscalls/syscall.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 2 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +
arch/parisc/kernel/syscalls/syscall.tbl | 2 +
arch/powerpc/kernel/syscalls/syscall.tbl | 2 +
arch/s390/kernel/syscalls/syscall.tbl | 3 +
arch/sh/kernel/syscalls/syscall.tbl | 2 +
arch/sparc/kernel/syscalls/syscall.tbl | 2 +
arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
arch/xtensa/kernel/syscalls/syscall.tbl | 2 +
include/linux/mmap_lock.h | 3 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 7 ++-
include/uapi/linux/sframe.h | 12 ++++
kernel/sys_ni.c | 3 +
kernel/unwind/sframe.c | 69 +++++++++++++++++++--
scripts/syscall.tbl | 2 +
23 files changed, 126 insertions(+), 6 deletions(-)
create mode 100644 include/uapi/linux/sframe.h
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index f31b7afffc34..f0639b831f2a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -511,3 +511,5 @@
579 common file_setattr sys_file_setattr
580 common listns sys_listns
581 common rseq_slice_yield sys_rseq_slice_yield
+582 common sframe_register sys_sframe_register
+583 common sframe_unregister sys_sframe_unregister
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 94351e22bfcf..887b242ffb25 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -486,3 +486,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
index 62d93d88e0fe..c820f1ff718c 100644
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -483,3 +483,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 248934257101..4c7f17f0364b 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -471,3 +471,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 223d26303627..e8dc2cc149f4 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -477,3 +477,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 7430714e2b8f..d0bae05d16af 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -410,3 +410,5 @@
469 n32 file_setattr sys_file_setattr
470 n32 listns sys_listns
471 n32 rseq_slice_yield sys_rseq_slice_yield
+472 n32 sframe_register sys_sframe_register
+473 n32 sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 630aab9e5425..2e200de6a58c 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -386,3 +386,5 @@
469 n64 file_setattr sys_file_setattr
470 n64 listns sys_listns
471 n64 rseq_slice_yield sys_rseq_slice_yield
+472 n64 sframe_register sys_sframe_register
+473 n64 sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 128653112284..0e3b82011ae2 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -459,3 +459,5 @@
469 o32 file_setattr sys_file_setattr
470 o32 listns sys_listns
471 o32 rseq_slice_yield sys_rseq_slice_yield
+472 o32 sframe_register sys_sframe_register
+473 o32 sframe_unregister sys_sframe_unregister
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c6331dad9461..e0758ef8667d 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -470,3 +470,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4fcc7c58a105..eda40c4f4f2f 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -562,3 +562,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 nospu rseq_slice_yield sys_rseq_slice_yield
+472 nospu sframe_register sys_sframe_register
+473 nospu sframe_unregister sys_sframe_unregister
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 09a7ef04d979..52519e2acdc8 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -398,3 +398,6 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common stacktrace_setup sys_stacktrace_setup
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 70b315cbe710..62ac7b1b4dd4 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -475,3 +475,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 7e71bf7fcd14..f92273ae608a 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f832ebd2d79b..409a50df3b21 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -477,3 +477,5 @@
469 i386 file_setattr sys_file_setattr
470 i386 listns sys_listns
471 i386 rseq_slice_yield sys_rseq_slice_yield
+472 i386 sframe_register sys_sframe_register
+473 i386 sframe_unregister sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..9b7c5a449751 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,8 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index a9bca4e484de..037b8040f69d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -442,3 +442,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 04b8f61ece5d..6650c89a13ab 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -579,6 +579,9 @@ static inline void mmap_write_unlock(struct mm_struct *mm)
up_write(&mm->mmap_lock);
}
+DEFINE_GUARD(mmap_write_lock, struct mm_struct *,
+ mmap_write_lock(_T), mmap_write_unlock(_T))
+
static inline void mmap_write_downgrade(struct mm_struct *mm)
{
__mmap_lock_trace_acquire_returned(mm, false, true);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f5639d5ac331..ad3c8d6b6471 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -79,6 +79,7 @@ struct mnt_id_req;
struct ns_id_req;
struct xattr_args;
struct file_attr;
+struct sframe_setup;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -999,6 +1000,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
u32 size, u32 flags);
asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_sframe_register(struct sframe_setup *data, size_t size);
+asmlinkage long sys_sframe_unregister(struct sframe_setup *data, size_t size);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..17042d7e5e87 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
#define __NR_rseq_slice_yield 471
__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+#define __NR_sframe_register 472
+__SYSCALL(__NR_sframe_register, sys_sframe_register)
+#define __NR_sframe_unregister 473
+__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
+
#undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 474
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
new file mode 100644
index 000000000000..d3c9f88b024b
--- /dev/null
+++ b/include/uapi/linux/sframe.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_SFRAME_H
+#define _UAPI_LINUX_SFRAME_H
+
+struct sframe_setup {
+ __u64 sframe_start;
+ __u64 sframe_size;
+ __u64 text_start;
+ __u64 text_size;
+};
+
+#endif /* _UAPI_LINUX_SFRAME_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index add3032da16f..eca5293f5d40 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -394,3 +394,6 @@ COND_SYSCALL(rseq_slice_yield);
COND_SYSCALL(uretprobe);
COND_SYSCALL(uprobe);
+
+COND_SYSCALL(sframe_register);
+COND_SYSCALL(sframe_unregister);
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index db88d993dff1..84bd762a1080 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -12,8 +12,10 @@
#include <linux/mm.h>
#include <linux/string_helpers.h>
#include <linux/sframe.h>
+#include <linux/syscalls.h>
#include <asm/unwind_user_sframe.h>
#include <linux/unwind_user_types.h>
+#include <uapi/linux/sframe.h>
#include "sframe.h"
#include "sframe_debug.h"
@@ -817,8 +819,10 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
if (ret)
goto err_free;
- ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end - 1,
- sec, GFP_KERNEL_ACCOUNT);
+ scoped_guard(mmap_write_lock, mm) {
+ ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end - 1,
+ sec, GFP_KERNEL_ACCOUNT);
+ }
if (ret) {
dbg_sec("mtree_insert_range failed: text=%lx-%lx\n",
sec->text_start, sec->text_end);
@@ -842,9 +846,11 @@ static void sframe_free_srcu(struct rcu_head *rcu)
static int __sframe_remove_section(struct mm_struct *mm,
struct sframe_section *sec)
{
- if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
- dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
- return -EINVAL;
+ scoped_guard(mmap_write_lock, mm) {
+ if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
+ dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
+ return -EINVAL;
+ }
}
call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
@@ -936,3 +942,56 @@ void sframe_free_mm(struct mm_struct *mm)
mtree_destroy(&mm->sframe_mt);
}
+
+/**
+ * sys_sframe_register - register an address for user space stacktrace walking.
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * This system call is used by dynamic library utilities to inform the kernel
+ * of meta data that it loaded that can be used by the kernel to know how
+ * to stack walk the given text locations.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_register, struct sframe_setup __user *, data, size_t, size)
+{
+ struct sframe_setup sframe;
+
+ if (sizeof(sframe) != size)
+ return -EINVAL;
+
+ if (copy_from_user(&sframe, data, size))
+ return -EFAULT;
+
+ return sframe_add_section(sframe.sframe_start,
+ sframe.sframe_start + sframe.sframe_size,
+ sframe.text_start,
+ sframe.text_start + sframe.text_size);
+}
+
+/**
+ * sys_sframe_unregister - unregister an sframe address
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * The data->sframe_start is the only value that is used. The rest must
+ * be zero.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_unregister, struct sframe_setup __user *, data, size_t, size)
+{
+ struct sframe_setup sframe;
+
+ if (sizeof(sframe) != size)
+ return -EINVAL;
+
+ if (copy_from_user(&sframe, data, size))
+ return -EFAULT;
+
+ if (sframe.sframe_size || sframe.text_start || sframe.text_size)
+ return -EINVAL;
+
+ return sframe_remove_section(sframe.sframe_start);
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..46ec22b50042 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
--
2.53.0
^ permalink raw reply related
* [PATCH v2] unwind: Add sframe_(un)register() system calls
From: Steven Rostedt @ 2026-05-28 19:10 UTC (permalink / raw)
To: LKML, Linux Trace Kernel, bpf
Cc: Masami Hiramatsu, Mathieu Desnoyers, Jens Remus, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Linus Torvalds, Andrew Morton,
Florian Weimer, Kees Cook, Carlos O'Donell, Sam James,
Dylan Hatch, Borislav Petkov, Dave Hansen, David Hildenbrand,
H. Peter Anvin, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
Heiko Carstens, Vasily Gorbik, Thomas Weißschuh
From: Steven Rostedt <rostedt@goodmis.org>
Add system calls to register and unregister sframes that can be used by
dynamic linkers to tell the kernel where the sframe section is in memory
for libraries it loads.
Both system calls take a pointer to a new structure:
struct sframe_setup {
__u64 sframe_start;
__u64 sframe_size;
__u64 text_start;
__u64 text_size;
};
and a size of the passed in structure. If the system call needs to be
extended, then the structure could be changed and the size of that
structure will tell the kernel that it is the new version. If the kernel
does not recognize the structure size, it will return -EINVAL.
sframe_start - The virtual address of the sframe section
sframe_size - The length of the sframe section
text_start - the text section the sframe represents
text_size - the length of the text section
If other stack tracing functionality is added, it will require a new
system call.
The unregister only needs the sframe_start and requires all the rest of
the fields to be 0. In the future, if more can be done, then user space
can update the other values and check the return code to see if the kernel
supports it.
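As a rough user-space illustration of the calling convention described above (not part of the patch): the struct below is a local mirror of the proposed struct sframe_setup, the SKETCH_* names and helpers are hypothetical, and the numbers 472/473 are taken from the generic table in this patch (alpha uses 582/583), so they are assumptions until the series is merged.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed syscall numbers, copied from the tables in this patch. */
#define SKETCH_NR_SFRAME_REGISTER	472
#define SKETCH_NR_SFRAME_UNREGISTER	473

/* Local mirror of the proposed uapi structure: four 64-bit fields. */
struct sframe_setup_sketch {
	uint64_t sframe_start;
	uint64_t sframe_size;
	uint64_t text_start;
	uint64_t text_size;
};

/* The kernel side compares the passed size against sizeof() its struct. */
static_assert(sizeof(struct sframe_setup_sketch) == 32,
	      "layout must match four __u64 fields");

/* Fill every field for a register request. */
void sketch_fill_register(struct sframe_setup_sketch *s,
			  uint64_t sframe_start, uint64_t sframe_size,
			  uint64_t text_start, uint64_t text_size)
{
	s->sframe_start = sframe_start;
	s->sframe_size = sframe_size;
	s->text_start = text_start;
	s->text_size = text_size;
}

/* Unregister uses only sframe_start; all other fields must be zero. */
void sketch_fill_unregister(struct sframe_setup_sketch *s,
			    uint64_t sframe_start)
{
	memset(s, 0, sizeof(*s));
	s->sframe_start = sframe_start;
}

/* Issue the raw system call; returns 0 or a negative errno value. */
long sketch_sframe_call(long nr, const struct sframe_setup_sketch *s)
{
	if (syscall(nr, s, sizeof(*s)) < 0)
		return -errno;
	return 0;
}
```

A dynamic linker would presumably call sketch_sframe_call(SKETCH_NR_SFRAME_REGISTER, &s) after mapping a library and treat -EINVAL as "structure size or field combination not recognized by this kernel". A kernel without these system calls would be expected to fail the call with -ENOSYS.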
Also added a DEFINE_GUARD() for mmap_write_lock. There was one for
mmap_read_lock but not for mmap_write_lock.
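For readers unfamiliar with the guard helpers, here is a simplified user-space model of what a scoped guard does, assuming a GCC or Clang compiler with the cleanup attribute. The sketch_* names and the toy lock are hypothetical and only mimic the shape of DEFINE_GUARD()/scoped_guard(); they are not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy lock that only records its state. */
struct sketch_lock {
	int held;
	int acquisitions;
};

void sketch_lock_acquire(struct sketch_lock *l)
{
	l->held = 1;
	l->acquisitions++;
}

/* Cleanup handler: receives a pointer to the guard variable. */
void sketch_lock_release(struct sketch_lock **l)
{
	(*l)->held = 0;
}

/*
 * Run the following block once with the lock held. The lock is dropped
 * when the guard variable goes out of scope, including on early return.
 */
#define SKETCH_SCOPED_GUARD(lockp)					\
	for (struct sketch_lock *_g					\
		__attribute__((cleanup(sketch_lock_release))) =	\
		(sketch_lock_acquire(lockp), (lockp)),			\
		*_done = NULL; !_done; _done = (void *)1)

/* Mirrors the shape of __sframe_remove_section() in this patch. */
int sketch_erase(struct sketch_lock *l, int erase_fails)
{
	SKETCH_SCOPED_GUARD(l) {
		if (erase_fails)
			return -EINVAL;	/* lock is still released here */
	}
	return 0;
}
```

The point of the pattern is that the error return inside the guarded block does not need its own unlock call.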
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
Changes since v1: https://patch.msgid.link/20260521183532.7a145c8a@gandalf.local.home
- Use mmap_write_lock() instead of mmap_read_lock() for mutual
exclusion. (Jens Remus)
- Guard mtree_insert_range() with mmap_write_lock. (Jens Remus)
- Added a guard for mmap_write_lock() similar to the one for mmap_read_lock.
- Have syscall prototype use structure pointer instead of void (Thomas Weißschuh)
- Use __u64 instead of unsigned long for struct members (Thomas Weißschuh)
- Use size_t instead of int for structure size in syscall argument.
(Thomas Weißschuh)
arch/alpha/kernel/syscalls/syscall.tbl | 2 +
arch/arm/tools/syscall.tbl | 2 +
arch/arm64/tools/syscall_32.tbl | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 2 +
arch/microblaze/kernel/syscalls/syscall.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 2 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +
arch/parisc/kernel/syscalls/syscall.tbl | 2 +
arch/powerpc/kernel/syscalls/syscall.tbl | 2 +
arch/s390/kernel/syscalls/syscall.tbl | 3 +
arch/sh/kernel/syscalls/syscall.tbl | 2 +
arch/sparc/kernel/syscalls/syscall.tbl | 2 +
arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
arch/xtensa/kernel/syscalls/syscall.tbl | 2 +
include/linux/mmap_lock.h | 3 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 7 ++-
include/uapi/linux/sframe.h | 12 ++++
kernel/sys_ni.c | 3 +
kernel/unwind/sframe.c | 69 +++++++++++++++++++--
scripts/syscall.tbl | 2 +
23 files changed, 126 insertions(+), 6 deletions(-)
create mode 100644 include/uapi/linux/sframe.h
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index f31b7afffc34..f0639b831f2a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -511,3 +511,5 @@
579 common file_setattr sys_file_setattr
580 common listns sys_listns
581 common rseq_slice_yield sys_rseq_slice_yield
+582 common sframe_register sys_sframe_register
+583 common sframe_unregister sys_sframe_unregister
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 94351e22bfcf..887b242ffb25 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -486,3 +486,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
index 62d93d88e0fe..c820f1ff718c 100644
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -483,3 +483,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 248934257101..4c7f17f0364b 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -471,3 +471,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 223d26303627..e8dc2cc149f4 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -477,3 +477,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 7430714e2b8f..d0bae05d16af 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -410,3 +410,5 @@
469 n32 file_setattr sys_file_setattr
470 n32 listns sys_listns
471 n32 rseq_slice_yield sys_rseq_slice_yield
+472 n32 sframe_register sys_sframe_register
+473 n32 sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 630aab9e5425..2e200de6a58c 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -386,3 +386,5 @@
469 n64 file_setattr sys_file_setattr
470 n64 listns sys_listns
471 n64 rseq_slice_yield sys_rseq_slice_yield
+472 n64 sframe_register sys_sframe_register
+473 n64 sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 128653112284..0e3b82011ae2 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -459,3 +459,5 @@
469 o32 file_setattr sys_file_setattr
470 o32 listns sys_listns
471 o32 rseq_slice_yield sys_rseq_slice_yield
+472 o32 sframe_register sys_sframe_register
+473 o32 sframe_unregister sys_sframe_unregister
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c6331dad9461..e0758ef8667d 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -470,3 +470,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4fcc7c58a105..eda40c4f4f2f 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -562,3 +562,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 nospu rseq_slice_yield sys_rseq_slice_yield
+472 nospu sframe_register sys_sframe_register
+473 nospu sframe_unregister sys_sframe_unregister
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 09a7ef04d979..52519e2acdc8 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -398,3 +398,6 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common stacktrace_setup sys_stacktrace_setup
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 70b315cbe710..62ac7b1b4dd4 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -475,3 +475,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 7e71bf7fcd14..f92273ae608a 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f832ebd2d79b..409a50df3b21 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -477,3 +477,5 @@
469 i386 file_setattr sys_file_setattr
470 i386 listns sys_listns
471 i386 rseq_slice_yield sys_rseq_slice_yield
+472 i386 sframe_register sys_sframe_register
+473 i386 sframe_unregister sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..9b7c5a449751 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,8 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index a9bca4e484de..037b8040f69d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -442,3 +442,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 04b8f61ece5d..6650c89a13ab 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -579,6 +579,9 @@ static inline void mmap_write_unlock(struct mm_struct *mm)
up_write(&mm->mmap_lock);
}
+DEFINE_GUARD(mmap_write_lock, struct mm_struct *,
+ mmap_write_lock(_T), mmap_write_unlock(_T))
+
static inline void mmap_write_downgrade(struct mm_struct *mm)
{
__mmap_lock_trace_acquire_returned(mm, false, true);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f5639d5ac331..ad3c8d6b6471 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -79,6 +79,7 @@ struct mnt_id_req;
struct ns_id_req;
struct xattr_args;
struct file_attr;
+struct sframe_setup;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -999,6 +1000,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
u32 size, u32 flags);
asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_sframe_register(struct sframe_setup *data, size_t size);
+asmlinkage long sys_sframe_unregister(struct sframe_setup *data, size_t size);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..17042d7e5e87 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
#define __NR_rseq_slice_yield 471
__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+#define __NR_sframe_register 472
+__SYSCALL(__NR_sframe_register, sys_sframe_register)
+#define __NR_sframe_unregister 473
+__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
+
#undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 474
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
new file mode 100644
index 000000000000..d3c9f88b024b
--- /dev/null
+++ b/include/uapi/linux/sframe.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_SFRAME_H
+#define _UAPI_LINUX_SFRAME_H
+
+struct sframe_setup {
+ __u64 sframe_start;
+ __u64 sframe_size;
+ __u64 text_start;
+ __u64 text_size;
+};
+
+#endif /* _UAPI_LINUX_SFRAME_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index add3032da16f..eca5293f5d40 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -394,3 +394,6 @@ COND_SYSCALL(rseq_slice_yield);
COND_SYSCALL(uretprobe);
COND_SYSCALL(uprobe);
+
+COND_SYSCALL(sframe_register);
+COND_SYSCALL(sframe_unregister);
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index db88d993dff1..84bd762a1080 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -12,8 +12,10 @@
#include <linux/mm.h>
#include <linux/string_helpers.h>
#include <linux/sframe.h>
+#include <linux/syscalls.h>
#include <asm/unwind_user_sframe.h>
#include <linux/unwind_user_types.h>
+#include <uapi/linux/sframe.h>
#include "sframe.h"
#include "sframe_debug.h"
@@ -817,8 +819,10 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
if (ret)
goto err_free;
- ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end - 1,
- sec, GFP_KERNEL_ACCOUNT);
+ scoped_guard(mmap_write_lock, mm) {
+ ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end - 1,
+ sec, GFP_KERNEL_ACCOUNT);
+ }
if (ret) {
dbg_sec("mtree_insert_range failed: text=%lx-%lx\n",
sec->text_start, sec->text_end);
@@ -842,9 +846,11 @@ static void sframe_free_srcu(struct rcu_head *rcu)
static int __sframe_remove_section(struct mm_struct *mm,
struct sframe_section *sec)
{
- if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
- dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
- return -EINVAL;
+ scoped_guard(mmap_write_lock, mm) {
+ if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
+ dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
+ return -EINVAL;
+ }
}
call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
@@ -936,3 +942,56 @@ void sframe_free_mm(struct mm_struct *mm)
mtree_destroy(&mm->sframe_mt);
}
+
+/**
+ * sys_sframe_register - register an address for user space stacktrace walking.
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * This system call is used by dynamic library utilities to inform the kernel
+ * of meta data that it loaded that can be used by the kernel to know how
+ * to stack walk the given text locations.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_register, struct sframe_setup __user *, data, size_t, size)
+{
+ struct sframe_setup sframe;
+
+ if (sizeof(sframe) != size)
+ return -EINVAL;
+
+ if (copy_from_user(&sframe, data, size))
+ return -EFAULT;
+
+ return sframe_add_section(sframe.sframe_start,
+ sframe.sframe_start + sframe.sframe_size,
+ sframe.text_start,
+ sframe.text_start + sframe.text_size);
+}
+
+/**
+ * sys_sframe_unregister - unregister an sframe address
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * The data->sframe_start is the only value that is used. The rest must
+ * be zero.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_unregister, struct sframe_setup __user *, data, size_t, size)
+{
+ struct sframe_setup sframe;
+
+ if (sizeof(sframe) != size)
+ return -EINVAL;
+
+ if (copy_from_user(&sframe, data, size))
+ return -EFAULT;
+
+ if (sframe.sframe_size || sframe.text_start || sframe.text_size)
+ return -EINVAL;
+
+ return sframe_remove_section(sframe.sframe_start);
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..46ec22b50042 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
--
2.53.0
^ permalink raw reply related
* [BUG] tracing/uprobe: GPF in path_put() via __free() on alloc failure
From: Farhad Alemi @ 2026-05-28 17:22 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, Mathieu Desnoyers,
linux-trace-kernel, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 4332 bytes --]
Hello Masami and the linux-trace-kernel team,
I am reporting a tracing/uprobe error-path crash found by syzkaller
with fault injection.
Summary:
A write(2) to /sys/kernel/tracing/uprobe_events that fails inside
alloc_trace_uprobe() leaves the caller's `tu` variable holding
ERR_PTR(-ENOMEM). The cleanup macro at kernel/trace/trace_uprobe.c:536:
DEFINE_FREE(free_trace_uprobe, struct trace_uprobe *,
if (_T) free_trace_uprobe(_T))
guards only against NULL, not IS_ERR. free_trace_uprobe() (invoked by
the __free() helper at __trace_uprobe_create's return) has the same
guard shape -- `if (!tu) return;` -- and then calls path_put(&tu->path)
on the ERR_PTR-valued tu. KASAN catches the resulting dereference as a
null-ptr-deref-in-range at path_put+0x29.
Observed on:
- Linux v7.1-rc3-200-g70eda68668d1-dirty (where the bug was originally
found), x86_64, QEMU Q35
- KASAN enabled; panic_on_warn set; CONFIG_FAULT_INJECTION enabled
- The only local dirty file in my tree is drivers/tty/serial/serial_core.c,
containing a local ttyS0 console guard for the fuzzing harness. It is
unrelated to kernel/trace/.
- Trigger requires CAP_SYS_ADMIN to open /sys/kernel/tracing/uprobe_events
for write (mode 0640, TRACE_MODE_WRITE) plus CONFIG_FAULT_INJECTION
with a forced kmalloc failure inside alloc_trace_uprobe().
- Source inspection of linus/master at commit e8c2f9fdadee
(v7.1-rc4-754-ge8c2f9fdadee) shows the buggy structure is unchanged:
DEFINE_FREE(free_trace_uprobe, ..., if (_T) free_trace_uprobe(_T)) at
trace_uprobe.c:536 has only a NULL guard, free_trace_uprobe() at
trace_uprobe.c:369 still has only `if (!tu) return;`, and
__trace_uprobe_create() declares `struct trace_uprobe *tu
__free(free_trace_uprobe) = NULL` and assigns the alloc_trace_uprobe()
return value into tu before any IS_ERR check.
Impact:
With CONFIG_FAULT_INJECTION enabled, a fail_nth-injected allocation
failure inside alloc_trace_uprobe() (kzalloc_flex at line 341 returns
NULL, the function returns ERR_PTR(-ENOMEM)) causes the
__free(free_trace_uprobe) cleanup in __trace_uprobe_create() to
dereference the ERR_PTR via path_put():
Oops: general protection fault, probably for non-canonical address
0xdffffc0000000008: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000040-0x0000000000000047]
RIP: 0010:path_put+0x29/0x60 fs/namei.c:717
The R12 register in the crash dump shows the smoking gun:
R12 = 0xfffffffffffffff4 = -ENOMEM = ERR_PTR(-12) -- i.e. the `tu`
pointer being freed is an ERR_PTR, not a real object.
Relevant stack:
path_put+0x29/0x60 fs/namei.c:717
free_trace_uprobe kernel/trace/trace_uprobe.c:374 [inline]
__free_free_trace_uprobe kernel/trace/trace_uprobe.c:536 [inline]
__trace_uprobe_create+0x53c/0xe40 kernel/trace/trace_uprobe.c:725
trace_probe_create+0xce/0x130 kernel/trace/trace_probe.c:2252
dyn_event_create+0x4f/0x70 kernel/trace/trace_dynevent.c:128
create_or_delete_trace_uprobe+0x65/0xa0 kernel/trace/trace_uprobe.c:739
trace_parse_run_command+0x1f3/0x380 kernel/trace/trace.c:9565
vfs_write+0x29f/0xb90 fs/read_write.c:686
ksys_write+0x155/0x270 fs/read_write.c:740
Expected behavior:
Any one of these closes the hole:
1. Tighten free_trace_uprobe()'s entry guard:
if (IS_ERR_OR_NULL(tu))
return;
2. Or change the DEFINE_FREE macro to skip ERR_PTR values:
DEFINE_FREE(free_trace_uprobe, struct trace_uprobe *,
if (!IS_ERR_OR_NULL(_T)) free_trace_uprobe(_T))
3. Or, after the IS_ERR(tu) check in __trace_uprobe_create(), assign
`tu = NULL;` before returning so the __free helper sees NULL and
skips the path_put.
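The pattern behind option 1 can be modeled in userspace. This is only a sketch under stated assumptions: the ERR_PTR helpers are simplified stand-ins for the kernel's, `struct probe`, alloc_probe() and create_probe() are hypothetical names, and the GCC/Clang `cleanup` attribute stands in for the kernel's __free() annotation. It is not the proposed patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

/* Hypothetical object standing in for struct trace_uprobe. */
struct probe {
	int dummy;
};

static int probes_freed;

/* Guard rejects both NULL and ERR_PTR values before touching the object. */
static void free_probe(struct probe *p)
{
	if (IS_ERR_OR_NULL(p))
		return;
	probes_freed++;
	free(p);
}

/* Cleanup callback: receives a pointer to the annotated variable. */
static void probe_cleanup(struct probe **pp)
{
	free_probe(*pp);
}

/* Returns ERR_PTR(-ENOMEM) on (simulated) allocation failure. */
static struct probe *alloc_probe(int fail)
{
	struct probe *p = fail ? NULL : calloc(1, sizeof(*p));

	return p ? p : ERR_PTR(-ENOMEM);
}

static int create_probe(int fail)
{
	struct probe *p __attribute__((cleanup(probe_cleanup))) = NULL;

	p = alloc_probe(fail);
	if (IS_ERR(p))
		return (int)PTR_ERR(p);  /* cleanup still runs with p == ERR_PTR */
	return 0;                        /* cleanup frees the real object */
}
```

With the guard in place, create_probe(1) returns -ENOMEM and the cleanup callback is a no-op. With only a NULL check, the same call would hand the error value to free().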
Reproducer:
I have attached the generated C reproducer as reproducer.c, the syzkaller
program as reproducer.syz, and the console report as crash-report.txt.
Novelty check:
I searched syzbot dashboard data across upstream, fixed, invalid, stable,
and Android namespaces, and searched lore.kernel.org for "path_put" +
"trace_uprobe", "free_trace_uprobe", and "__trace_uprobe_create" + "GPF" /
"KASAN". I did not find an exact match. Adjacent uprobe_unregister /
bpf_uprobe_multi_link UAF reports have different free paths.
I appreciate your time and consideration, and I'm grateful for your
work on this subsystem.
Regards,
Farhad
[-- Attachment #2: crash-report.txt --]
[-- Type: text/plain, Size: 4494 bytes --]
RBP: 00007ffccc56d910 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f528c605fa0 R14: 00007f528c605fa0 R15: 0000000000001e91
</TASK>
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000008: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000040-0x0000000000000047]
CPU: 0 UID: 0 PID: 3563 Comm: syz.2.17 Not tainted 7.1.0-rc3-00200-g70eda68668d1-dirty #1 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:path_put+0x29/0x60 fs/namei.c:717
Code: 90 f3 0f 1e fa 0f 1f 44 00 00 41 56 53 48 89 fb 49 be 00 00 00 00 00 fc ff df e8 22 91 8a ff 48 8d 7b 08 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 05 e8 db 50 f4 ff 48 8b 7b 08 e8 c2 89 03 00 48
RSP: 0018:ffffc900035bf9e8 EFLAGS: 00010203
RAX: 0000000000000008 RBX: 000000000000003c RCX: ffff88810c36a500
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000044
RBP: ffffc900035bfb70 R08: ffffffff9165b777 R09: 0000000000000000
R10: ffffffff9165b760 R11: fffffbfff22cb6ef R12: fffffffffffffff4
R13: dffffc0000000000 R14: dffffc0000000000 R15: 0000000000000000
FS: 0000555560a86500(0000) GS:ffff8882ab6b6000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f528c240760 CR3: 00000001214cb000 CR4: 0000000000750ef0
PKRU: 00000000
Call Trace:
<TASK>
free_trace_uprobe kernel/trace/trace_uprobe.c:374 [inline]
__free_free_trace_uprobe kernel/trace/trace_uprobe.c:536 [inline]
__trace_uprobe_create+0x53c/0xe40 kernel/trace/trace_uprobe.c:725
trace_probe_create+0xce/0x130 kernel/trace/trace_probe.c:2252
dyn_event_create+0x4f/0x70 kernel/trace/trace_dynevent.c:128
create_or_delete_trace_uprobe+0x65/0xa0 kernel/trace/trace_uprobe.c:739
trace_parse_run_command+0x1f3/0x380 kernel/trace/trace.c:9565
vfs_write+0x29f/0xb90 fs/read_write.c:686
ksys_write+0x155/0x270 fs/read_write.c:740
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f528c37778d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffccc56d8a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f528c605fa0 RCX: 00007f528c37778d
RDX: 0000000000000022 RSI: 0000200000001100 RDI: 0000000000000003
RBP: 00007ffccc56d910 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f528c605fa0 R14: 00007f528c605fa0 R15: 0000000000001e91
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:path_put+0x29/0x60 fs/namei.c:717
Code: 90 f3 0f 1e fa 0f 1f 44 00 00 41 56 53 48 89 fb 49 be 00 00 00 00 00 fc ff df e8 22 91 8a ff 48 8d 7b 08 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 05 e8 db 50 f4 ff 48 8b 7b 08 e8 c2 89 03 00 48
RSP: 0018:ffffc900035bf9e8 EFLAGS: 00010203
RAX: 0000000000000008 RBX: 000000000000003c RCX: ffff88810c36a500
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000044
RBP: ffffc900035bfb70 R08: ffffffff9165b777 R09: 0000000000000000
R10: ffffffff9165b760 R11: fffffbfff22cb6ef R12: fffffffffffffff4
R13: dffffc0000000000 R14: dffffc0000000000 R15: 0000000000000000
FS: 0000555560a86500(0000) GS:ffff8882ab6b6000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f528c240760 CR3: 00000001214cb000 CR4: 0000000000750ef0
PKRU: 00000000
----------------
Code disassembly (best guess):
0: 90 nop
1: f3 0f 1e fa endbr64
5: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
a: 41 56 push %r14
c: 53 push %rbx
d: 48 89 fb mov %rdi,%rbx
10: 49 be 00 00 00 00 00 movabs $0xdffffc0000000000,%r14
17: fc ff df
1a: e8 22 91 8a ff call 0xff8a9141
1f: 48 8d 7b 08 lea 0x8(%rbx),%rdi
23: 48 89 f8 mov %rdi,%rax
26: 48 c1 e8 03 shr $0x3,%rax
* 2a: 42 80 3c 30 00 cmpb $0x0,(%rax,%r14,1) <-- trapping instruction
2f: 74 05 je 0x36
31: e8 db 50 f4 ff call 0xfff45111
36: 48 8b 7b 08 mov 0x8(%rbx),%rdi
3a: e8 c2 89 03 00 call 0x38a01
3f: 48 rex.W
[-- Attachment #3: reproducer.c --]
[-- Type: application/octet-stream, Size: 6291 bytes --]
// autogenerated by syzkaller (https://github.com/google/syzkaller)
#define _GNU_SOURCE
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
static bool write_file(const char* file, const char* what, ...)
{
char buf[1024];
va_list args;
va_start(args, what);
vsnprintf(buf, sizeof(buf), what, args);
va_end(args);
buf[sizeof(buf) - 1] = 0;
int len = strlen(buf);
int fd = open(file, O_WRONLY | O_CLOEXEC);
if (fd == -1)
return false;
if (write(fd, buf, len) != len) {
int err = errno;
close(fd);
errno = err;
return false;
}
close(fd);
return true;
}
static int inject_fault(int nth)
{
int fd;
fd = open("/proc/thread-self/fail-nth", O_RDWR);
if (fd == -1)
exit(1);
char buf[16];
sprintf(buf, "%d", nth);
if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf))
exit(1);
return fd;
}
static const char* setup_fault()
{
int fd = open("/proc/self/make-it-fail", O_WRONLY);
if (fd == -1)
return "CONFIG_FAULT_INJECTION is not enabled";
close(fd);
fd = open("/proc/thread-self/fail-nth", O_WRONLY);
if (fd == -1)
return "kernel does not have systematic fault injection support";
close(fd);
static struct {
const char* file;
const char* val;
bool fatal;
} files[] = {
{"/sys/kernel/debug/failslab/ignore-gfp-wait", "N", true},
{"/sys/kernel/debug/fail_futex/ignore-private", "N", false},
{"/sys/kernel/debug/fail_page_alloc/ignore-gfp-highmem", "N", false},
{"/sys/kernel/debug/fail_page_alloc/ignore-gfp-wait", "N", false},
{"/sys/kernel/debug/fail_page_alloc/min-order", "0", false},
};
unsigned i;
for (i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
if (!write_file(files[i].file, files[i].val)) {
if (files[i].fatal)
return "failed to write fault injection file";
}
}
return NULL;
}
static void setup_sysctl()
{
int cad_pid = fork();
if (cad_pid < 0)
exit(1);
if (cad_pid == 0) {
for (;;)
sleep(100);
}
char tmppid[32];
snprintf(tmppid, sizeof(tmppid), "%d", cad_pid);
struct {
const char* name;
const char* data;
} files[] = {
{"/sys/kernel/debug/x86/nmi_longest_ns", "10000000000"},
{"/proc/sys/kernel/hung_task_check_interval_secs", "20"},
{"/proc/sys/net/core/bpf_jit_kallsyms", "1"},
{"/proc/sys/net/core/bpf_jit_harden", "0"},
{"/proc/sys/kernel/kptr_restrict", "0"},
{"/proc/sys/kernel/softlockup_all_cpu_backtrace", "1"},
{"/proc/sys/fs/mount-max", "100"},
{"/proc/sys/vm/oom_dump_tasks", "0"},
{"/proc/sys/debug/exception-trace", "0"},
{"/proc/sys/kernel/printk", "7 4 1 3"},
{"/proc/sys/kernel/keys/gc_delay", "1"},
{"/proc/sys/vm/oom_kill_allocating_task", "1"},
{"/proc/sys/kernel/ctrl-alt-del", "0"},
{"/proc/sys/kernel/cad_pid", tmppid},
};
for (size_t i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
if (!write_file(files[i].name, files[i].data)) {
}
}
kill(cad_pid, SIGKILL);
while (waitpid(cad_pid, NULL, 0) != cad_pid)
;
}
uint64_t r[1] = {0xffffffffffffffff};
int main(void)
{
syscall(__NR_mmap, /*addr=*/0x1ffffffff000ul, /*len=*/0x1000ul, /*prot=*/0ul,
/*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul,
/*fd=*/(intptr_t)-1, /*offset=*/0ul);
syscall(__NR_mmap, /*addr=*/0x200000000000ul, /*len=*/0x1000000ul,
/*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/ 7ul,
/*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul,
/*fd=*/(intptr_t)-1, /*offset=*/0ul);
syscall(__NR_mmap, /*addr=*/0x200001000000ul, /*len=*/0x1000ul, /*prot=*/0ul,
/*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul,
/*fd=*/(intptr_t)-1, /*offset=*/0ul);
setup_sysctl();
const char* reason;
(void)reason;
if ((reason = setup_fault()))
printf("the reproducer may not work as expected: fault injection setup "
"failed: %s\n",
reason);
intptr_t res = 0;
if (write(1, "executing program\n", sizeof("executing program\n") - 1)) {
}
// mkdir arguments: [
// path: ptr[in, buffer] {
// buffer: {2e 2f 70 6d 32 39 30 38 65 00} (length 0xa)
// }
// mode: open_mode = 0x1ff (8 bytes)
// ]
memcpy((void*)0x200000001000, "./pm2908e\000", 10);
syscall(
__NR_mkdir, /*path=*/0x200000001000ul,
/*mode=S_IXOTH|S_IWOTH|S_IROTH|S_IXGRP|S_IWGRP|S_IRGRP|S_IXUSR|S_IWUSR|0x100*/
0x1fful);
// mount arguments: [
// src: nil
// dst: ptr[in, buffer] {
// buffer: {2e 2f 70 6d 32 39 30 38 65 00} (length 0xa)
// }
// type: ptr[in, buffer] {
// buffer: {74 72 61 63 65 66 73 00} (length 0x8)
// }
// flags: mount_flags = 0x0 (8 bytes)
// data: nil
// ]
memcpy((void*)0x200000001040, "./pm2908e\000", 10);
memcpy((void*)0x200000001080, "tracefs\000", 8);
syscall(__NR_mount, /*src=*/0ul, /*dst=*/0x200000001040ul,
/*type=*/0x200000001080ul, /*flags=*/0ul, /*data=*/0ul);
// openat arguments: [
// fd: fd_dir (resource)
// file: ptr[in, buffer] {
// buffer: {2e 2f 70 6d 32 39 30 38 65 2f 75 70 72 6f 62 65 5f 65 76 65
// 6e 74 73 00} (length 0x18)
// }
// flags: open_flags = 0x1 (4 bytes)
// mode: open_mode = 0x0 (2 bytes)
// ]
// returns fd
memcpy((void*)0x2000000010c0, "./pm2908e/uprobe_events\000", 24);
res = syscall(__NR_openat, /*fd=*/0xffffff9c, /*file=*/0x2000000010c0ul,
/*flags=O_WRONLY*/ 1, /*mode=*/0);
if (res != -1)
r[0] = res;
// write arguments: [
// fd: fd (resource)
// buf: ptr[in, buffer] {
// buffer: {70 3a 75 70 72 6f 62 65 73 2f 70 6d 32 39 30 38 65 20 2f 70
// 72 6f 63 2f 73 65 6c 66 2f 65 78 65 3a 30} (length 0x22)
// }
// count: len = 0x22 (8 bytes)
// ]
memcpy((void*)0x200000001100, "p:uprobes/pm2908e /proc/self/exe:0", 34);
inject_fault(12);
syscall(__NR_write, /*fd=*/r[0], /*buf=*/0x200000001100ul, /*count=*/0x22ul);
return 0;
}
[-- Attachment #4: reproducer.syz --]
[-- Type: application/octet-stream, Size: 760 bytes --]
# {Threaded:false Repeat:false RepeatTimes:0 Procs:1 Slowdown:1 Sandbox: SandboxArg:0 Leak:false NetInjection:false NetDevices:false NetReset:false Cgroups:false BinfmtMisc:false CloseFDs:false KCSAN:false DevlinkPCI:false NicVF:false USB:false VhciInjection:false Wifi:false IEEE802154:false Sysctl:true Swap:false UseTmpDir:false HandleSegv:false Trace:false CallComments:true LegacyOptions:{Collide:false Fault:false FaultCall:0 FaultNth:0}}
mkdir(&(0x7f0000001000)='./pm2908e\x00', 0x1ff)
mount(0x0, &(0x7f0000001040)='./pm2908e\x00', &(0x7f0000001080)='tracefs\x00', 0x0, 0x0)
r0 = openat(0xffffffffffffff9c, &(0x7f00000010c0)='./pm2908e/uprobe_events\x00', 0x1, 0x0)
write(r0, &(0x7f0000001100)='p:uprobes/pm2908e /proc/self/exe:0', 0x22) (fail_nth: 12)
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-05-28 17:11 UTC (permalink / raw)
To: Wei Yang
Cc: Andrew Morton, linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, anshuman.khandual, apopple, baohua,
baolin.wang, byungchul, catalin.marinas, cl, corbet, dave.hansen,
david, dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh,
jglisse, joshua.hahnjy, kas, lance.yang, liam, ljs,
mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260528084211.wsdrvbvxvkddokb5@master>
On Thu, May 28, 2026 at 2:42 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Tue, May 26, 2026 at 06:07:38AM -0600, Nico Pache wrote:
> >On Tue, May 26, 2026 at 12:57 AM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Mon, May 25, 2026 at 12:10:41PM -0700, Andrew Morton wrote:
> >> >On Mon, 25 May 2026 08:15:53 -0600 Nico Pache <npache@redhat.com> wrote:
> >> >
> >> >> Can you please append the following fixup that reverts one of the
> >> >> changes requested in V17. The issue with the change is described
> >> >> below.
> >> >
> >> >OK. fyi, what I received was badly mangled: wordwrapping, tabs messed
> >> >up, etc.
> >> >
> >> >Here's my reconstruction:
> >> >
> >>
> >> Hi, Nico
> >>
> >> I tried to reply your mail, but found it has some encoding problem, so reply
> >> here.
> >
> >Yeah sorry I didnt properly configure my email client after getting a
> >new laptop.
> >
> >>
> >> >
> >> >Author: Nico Pache <npache@redhat.com>
> >> >Subject: fix potential use-after-free of vma in mthp_collapse()
> >> >Date: Mon May 25 07:38:59 2026 -0600
> >> >
> >> >Between V17 and v18, one reviewer (Wei) brought up that we are not doing
> >> >the uffd-armed check until deep in the collapse operation. While not
> >> >functionally incorrect, it can lead to unnecessary work.
> >>
> >> So we decide to tolerate the behavioral change?
> >
> >Yes, I believe it is ok for now. Either way we needed to remove the
> >potential UAF. It only affects the behavior if mTHP is enabled, so the
> >legacy behavior is kept. And the uffd case is limited.
> >
> >My future work involves further optimizing and cleaning up khugepaged.
> >I'll make this part of the goal too. My first thought is to do the
> >revalidation at every order (between the locks dropping); but that
> >essentially pays the same penalty... I can't think of a clean solution
> >at the moment.
>
> One way come into my mind is add a @was_uffd_armed field in collapse_control
> and updates it in hugepage_vma_revalidate() when latest vma is retrieved.
>
> Still not elegant enough.
So our issue is that userfaultfd_armed is at the VMA granularity.
Ideally we want PMD/PTE granularity, but we only have that for wp. I'm
still investigating all the nuances of uffd and its interactions
with khugepaged (something I've been meaning to understand better
anyway). But from what I understand so far, we actually can use the
bitmap and the was_uffd_armed to optimize this further. It solves the
issue and has a rather small race window, which can just be handled by
the revalidation later on, probably eliminating most of the potential
cases.
IIUC, filling a region with previously empty/zero pages is only an
issue for MODE_MISSING and MODE_WP with WP_UNPOPULATED set as well. I
have a work-in-progress commit to improve all this uffd handling.
I think what I have is a good middle ground. It improves the current
functionality and closes the gap we have with the new mthp_collapse:
best of both worlds. If the race window is hit, we will pay the
penalty, but that should be greatly reduced. I will send out an RFC
for this targeting mm-new once I have everything verified and cleaned
up :)
Cheers,
-- Nico
>
> >
> >Does that sound ok?
> >
>
> Not sure. I can't imagine the impact it would have.
>
> >Cheers,
> >-- Nico
>
>
> --
> Wei Yang
> Help you, Help me
>
* Re: [PATCH 0/4] bootconfig: embed kernel.* cmdline at build time
From: Breno Leitao @ 2026-05-28 16:14 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, kernel-team
In-Reply-To: <20260529001519.14ca9dbe92fb2622249137c6@kernel.org>
On Fri, May 29, 2026 at 12:15:19AM +0000, Masami Hiramatsu wrote:
> On Wed, 27 May 2026 09:41:33 -0700
> Breno Leitao <leitao@debian.org> wrote:
>
> > The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
> > already landed; this series wires the rendered cmdline into the kernel.
> >
> > Motivation: today the embedded bootconfig is parsed at runtime, after
> > parse_early_param() has already run, so early_param() handlers can't
> > see embedded values. Folding the kernel.* subtree into the cmdline at
> > build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
> > users without forcing them to maintain two cmdline sources.
> >
> > Behaviorally, the "kernel" subtree is rendered to a flat string at
> > build time and stashed in .init.rodata. setup_arch() prepends it to
> > boot_command_line before parse_early_param() runs. Overflow is a soft
> > error: the helper logs and leaves boot_command_line untouched rather
> > than panicking, so an oversized embedded bconf cannot brick a boot.
> >
>
> Thanks Breno, yes, that is what I think about.
> Let me check it. And could you also check Sashiko's comments?
>
> https://sashiko.dev/#/patchset/20260527-bootconfig_using_tools-v1-0-b6906a86e7d5%40debian.org
Ack, I will have a look at them, thanks for confirming the direction is
correct.
--breno
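The soft-overflow behavior described in the quoted cover letter (prepend if it fits, otherwise log and leave the command line untouched) can be illustrated with a small userspace sketch. The function name prepend_cmdline() and its signature are hypothetical and are not taken from the series; the real helper is xbc_prepend_embedded_cmdline(), whose implementation is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Prepend "embedded" to the NUL-terminated string in cmdline, separated
 * by a single space. On overflow, log and leave cmdline unchanged.
 * Returns true if cmdline now starts with the embedded string (or if
 * there was nothing to prepend).
 */
static bool prepend_cmdline(char *cmdline, size_t size, const char *embedded)
{
	size_t elen = strlen(embedded);
	size_t clen = strlen(cmdline);
	size_t sep = (elen && clen) ? 1 : 0;

	assert(clen < size);

	if (elen == 0)
		return true;

	/* +1 for the terminating NUL. */
	if (elen + sep + clen + 1 > size) {
		fprintf(stderr, "embedded cmdline too long, ignoring\n");
		return false;
	}

	/* Shift the existing contents (including the NUL) to make room. */
	memmove(cmdline + elen + sep, cmdline, clen + 1);
	memcpy(cmdline, embedded, elen);
	if (sep)
		cmdline[elen] = ' ';
	return true;
}
```

Checking the total length before moving any bytes is what keeps the failure path side-effect free, so a too-large embedded string degrades to the original command line instead of a truncated one.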
* Re: [PATCH v21 8/9] ring-buffer: Show persistent buffer dropped events in trace file
From: Steven Rostedt @ 2026-05-28 15:26 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Ian Rogers
In-Reply-To: <20260529001500.14178455a046a5cbc6180861@kernel.org>
On Fri, 29 May 2026 00:15:00 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> Yeah, the relationship between the persistent ring buffer (which is mapped)
> and the RB_MISSED_EVENTS is a bit unclear.
I'm going to pull these patches into the ring-buffer for-next branch and
start testing them. If they pass, I'll push them up to the linux-trace repo.
I'll send out this patch on top of them.
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 910f6b3adf74..5de6f352249a 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1929,6 +1929,12 @@ static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
*/
if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
local_set(&bpage->entries, 0);
+ /*
+ * Note, the RB_MISSED_EVENTS is only set inside the main write
+ * buffer by this verification logic. The normal ring buffer
+ * has this bit set when the page is read and passed to the
+ * consumers.
+ */
local_set(&dpage->commit, RB_MISSED_EVENTS);
dpage->time_stamp = prev_ts ? prev_ts : next_ts;
ret = -1;
@@ -7232,6 +7238,14 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
local_add(RB_MISSED_STORED, &dpage->commit);
size += sizeof(missed_events);
}
+ /*
+ * Note, for the persistent ring buffer, the RB_MISSED_EVENTS
+ * may have been set in the main buffer via the verification code.
+ * But here, dpage is a copy of that page and has not yet had
+ * the RB_MISSED_EVENTS set. As for the normal buffers,
+ * the main write buffer does not set these bits and it needs
+ * to be set here.
+ */
local_add(RB_MISSED_EVENTS, &dpage->commit);
}
-- Steve
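For readers following the flag handling, here is a userspace sketch of how status bits share the commit field with the data size. The bit positions are assumed to match the RB_MISSED_* definitions in kernel/trace/ring_buffer.c, and a 64-bit integer stands in for local_t; the helper names are hypothetical. The point it tries to show is that the flag is added rather than OR-ed, so it matters that the copy does not already carry it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match kernel/trace/ring_buffer.c; treat as an assumption. */
#define RB_MISSED_EVENTS (1ULL << 31)  /* events were lost before this page */
#define RB_MISSED_STORED (1ULL << 30)  /* lost count stored after the data */
#define RB_MISSED_MASK   (3ULL << 30)

/* Number of data bytes on the page, with the flag bits stripped. */
static uint64_t commit_size(uint64_t commit)
{
	return commit & ~RB_MISSED_MASK;
}

static bool commit_missed(uint64_t commit)
{
	return (commit & RB_MISSED_EVENTS) != 0;
}

/*
 * Models local_add(RB_MISSED_EVENTS, &dpage->commit). Only valid when
 * the flag is not already set, because addition would carry into the
 * next bit instead of leaving the flag in place.
 */
static uint64_t commit_add_missed(uint64_t commit)
{
	assert(!commit_missed(commit));
	return commit + RB_MISSED_EVENTS;
}
```

Adding the flag to a commit value of 100 yields a value whose size is still 100 and whose missed bit is set. Adding it a second time would clear bit 31 and corrupt the size, which is why the comment about the copied page not yet having the bit is relevant.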
* Re: [PATCH 0/4] bootconfig: embed kernel.* cmdline at build time
From: Masami Hiramatsu @ 2026-05-28 15:15 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, kernel-team
In-Reply-To: <20260527-bootconfig_using_tools-v1-0-b6906a86e7d5@debian.org>
On Wed, 27 May 2026 09:41:33 -0700
Breno Leitao <leitao@debian.org> wrote:
> The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
> already landed; this series wires the rendered cmdline into the kernel.
>
> Motivation: today the embedded bootconfig is parsed at runtime, after
> parse_early_param() has already run, so early_param() handlers can't
> see embedded values. Folding the kernel.* subtree into the cmdline at
> build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
> users without forcing them to maintain two cmdline sources.
>
> Behaviorally, the "kernel" subtree is rendered to a flat string at
> build time and stashed in .init.rodata. setup_arch() prepends it to
> boot_command_line before parse_early_param() runs. Overflow is a soft
> error: the helper logs and leaves boot_command_line untouched rather
> than panicking, so an oversized embedded bconf cannot brick a boot.
>
Thanks Breno, yes, that is what I think about.
Let me check it. And could you also check Sashiko's comments?
https://sashiko.dev/#/patchset/20260527-bootconfig_using_tools-v1-0-b6906a86e7d5%40debian.org
Thanks,
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Breno Leitao (4):
> bootconfig: return 0 from xbc_snprint_cmdline() for a leaf root
> bootconfig: render embedded bootconfig as a kernel cmdline at build time
> bootconfig: add xbc_prepend_embedded_cmdline() helper
> x86/setup: prepend embedded bootconfig cmdline before parse_early_param
>
> Makefile | 5 ++++
> arch/x86/Kconfig | 1 +
> arch/x86/kernel/setup.c | 3 +++
> include/linux/bootconfig.h | 7 ++++++
> init/Kconfig | 33 ++++++++++++++++++++++++++
> init/main.c | 19 ++++++++++++---
> lib/Makefile | 16 +++++++++++++
> lib/bootconfig.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++
> lib/embedded-cmdline.S | 16 +++++++++++++
> tools/bootconfig/Makefile | 2 +-
> 10 files changed, 156 insertions(+), 4 deletions(-)
> ---
> base-commit: e7e28506af98ce4e1059e5ec59334b335c00a246
> change-id: 20260508-bootconfig_using_tools-cfa7aa9d6a5a
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH v21 8/9] ring-buffer: Show persistent buffer dropped events in trace file
From: Masami Hiramatsu @ 2026-05-28 15:15 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Ian Rogers
In-Reply-To: <20260527093507.00ac35d8@gandalf.local.home>
On Wed, 27 May 2026 09:35:07 -0400
Steven Rostedt <rostedt@kernel.org> wrote:
> On Wed, 27 May 2026 12:47:21 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > Yeah, for the persistent ring buffer, it does not happen.
> > But there seems RB_MISSED_EVENTS bit can be cleared in
> > "else" path (after applying 1-8 patches)?
>
> Note, *only* the persistent ring buffer adds RB_MISSED_EVENTS to the pages
> in the write buffer. In the normal buffer, these bits are only set by this
> function. That is, they would not be set from the swap of pages.
>
Ah, OK. For normal ring buffers, it is only set by the reader.
> >
> > ----------
> > if (read || (len < (commit - read)) ||
> > cpu_buffer->reader_page == cpu_buffer->commit_page ||
> > force_memcpy) { // <-- persistent ring buffer sets force_memcpy = true.
> > [...]
> > } else {
> > /* update the entry counter */
> > [...]
> > if (!missed_events && rb_data_page_commit(dpage) & RB_MISSED_EVENTS)
> > missed_events = -1;
> > //^-- we check RB_MISSED_EVENTS bit on @dpage->commit and set missed_events = -1.
> >
> > /*
> > * Use the real_end for the data size,
> > * This gives us a chance to store the lost events
> > * on the page.
> > */
> > if (reader->real_end)
> > local_set(&dpage->commit, reader->real_end);
> > // ^- only if @reader->real_end, RB_MISSED_EVENTS bit is dropped.
>
> Because this isn't a persistent ring buffer (if it was, as you noted,
> force_memcpy would be true and we wouldn't enter the else path), the
> RB_MISSED_EVENTS bit in the commit would never be set here. It is *only* set
> by the verifier of the persistent ring buffer logic.
OK, I got it.
Thanks for the confirmation!
>
> > }
> >
> > cpu_buffer->lost_events = 0;
> >
> > commit = rb_data_page_commit(dpage);
> > /*
> > * Set a flag in the commit field if we lost events
> > */
> > if (missed_events) {
> > /*
> > * If there is room at the end of the page to save the
> > * missed events, then record it there.
> > */
> > if (missed_events > 0 &&
> > buffer->subbuf_size - commit >= sizeof(missed_events)) {
> > memcpy(&dpage->data[commit], &missed_events,
> > sizeof(missed_events));
> > local_add(RB_MISSED_STORED, &dpage->commit);
> > commit += sizeof(missed_events);
> > }
> > local_add(RB_MISSED_EVENTS, &dpage->commit); // <-- @dpage->commit is updated.
> > }
>
> And this is the first place it would get set.
>
> But yeah, it is very confusing and needs better comments.
Yeah, the relationship between the persistent ring buffer (which is mapped)
and the RB_MISSED_EVENTS is a bit unclear.
Thanks,
>
> Thanks,
>
> -- Steve
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH v2 8/8] selftests/livepatch: Add RISC-V syscall wrapper prefix
From: Marcos Paulo de Souza @ 2026-05-28 13:33 UTC (permalink / raw)
To: Wang Han, Paul Walmsley, Palmer Dabbelt, Albert Ou
Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, oliver.yang, xueshuai, zhuo.song, jkchen,
linux-riscv, linux-kernel, linux-trace-kernel, live-patching,
linux-kselftest, linux-perf-users
In-Reply-To: <20260528082310.1994388-9-wanghan@linux.alibaba.com>
On Thu, 2026-05-28 at 16:23 +0800, Wang Han wrote:
> The syscall livepatch selftest resolves and patches a syscall wrapper
> symbol. To use that test for RISC-V livepatch validation, add the
> RISC-V FN_PREFIX definition for ARCH_HAS_SYSCALL_WRAPPER.
>
> Without this macro, the syscall livepatch selftest cannot resolve the
> RISC-V target symbol, and the syscall-related livepatch test fails on
> RISC-V.
>
> Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
> ---
> .../testing/selftests/livepatch/test_modules/test_klp_syscall.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c b/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
> index dd802783ea84..275e4b10cf59 100644
> --- a/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
> +++ b/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
> @@ -18,6 +18,8 @@
> #define FN_PREFIX __s390x_
> #elif defined(__aarch64__)
> #define FN_PREFIX __arm64_
> +#elif defined(__riscv)
> +#define FN_PREFIX __riscv_
> #else
> /* powerpc does not select ARCH_HAS_SYSCALL_WRAPPER */
> #define FN_PREFIX
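To show why a missing FN_PREFIX leaves the test unable to resolve its target, here is a standalone sketch of how a per-architecture prefix can be turned into a symbol name at compile time. The s390x, arm64 and riscv prefixes come from the quoted hunk; the x86-64 prefix, the STRINGIFY macros, the helper names and the choice of sys_getpid as the target are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#if defined(__x86_64__)
#define FN_PREFIX __x64_            /* assumed; not visible in the quoted hunk */
#elif defined(__s390x__)
#define FN_PREFIX __s390x_
#elif defined(__aarch64__)
#define FN_PREFIX __arm64_
#elif defined(__riscv)
#define FN_PREFIX __riscv_
#else
#define FN_PREFIX                   /* no syscall wrapper prefix */
#endif

/* Two levels so that FN_PREFIX is expanded before being stringified. */
#define STRINGIFY_1(...) #__VA_ARGS__
#define STRINGIFY(...)   STRINGIFY_1(__VA_ARGS__)

/* Name of the wrapper symbol to look up, e.g. "__riscv_sys_getpid". */
static const char *wrapped_symbol(void)
{
	return STRINGIFY(FN_PREFIX) "sys_getpid";
}

static bool ends_with(const char *s, const char *suffix)
{
	size_t slen = strlen(s), xlen = strlen(suffix);

	return slen >= xlen && strcmp(s + slen - xlen, suffix) == 0;
}

/* The name always ends in the bare syscall name, whatever the prefix. */
static void check_wrapped_symbol(void)
{
	assert(ends_with(wrapped_symbol(), "sys_getpid"));
}
```

On an architecture that selects ARCH_HAS_SYSCALL_WRAPPER but falls through to the empty definition, the generated name would be the unprefixed one, which does not match the wrapper symbol the kernel actually exports.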
* Re: [PATCH v2 1/8] scripts/sorttable: Handle RISC-V patchable ftrace entries
From: Steven Rostedt @ 2026-05-28 13:21 UTC (permalink / raw)
To: Wang Han
Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
Masami Hiramatsu, Mark Rutland, Catalin Marinas, Chen Pei,
Andy Chiu, Björn Töpel, Deepak Gupta, Puranjay Mohan,
Conor Dooley, Josh Poimboeuf, Jiri Kosina, Miroslav Benes,
Petr Mladek, Joe Lawrence, Shuah Khan, Peter Zijlstra,
Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, oliver.yang,
xueshuai, zhuo.song, jkchen, linux-riscv, linux-kernel,
linux-trace-kernel, live-patching, linux-kselftest,
linux-perf-users
In-Reply-To: <20260528082310.1994388-2-wanghan@linux.alibaba.com>
On Thu, 28 May 2026 16:23:03 +0800
Wang Han <wanghan@linux.alibaba.com> wrote:
> RISC-V uses -fpatchable-function-entry=8,4 when the compressed ISA is
> enabled and -fpatchable-function-entry=4,2 otherwise. In both cases, the
> patchable NOP area starts 8 bytes before the function symbol address.
> The __mcount_loc entries therefore point at the patchable NOP area
> associated with a function, while nm reports the function symbol at the
> entry address used for the function range check.
>
> After RISC-V selected HAVE_BUILDTIME_MCOUNT_SORT, sorttable started
> applying that range check at build time. Without allowing entries just
> before the reported function address, the mcount sorter treats valid
> RISC-V ftrace callsites as invalid weak-function entries and writes
> them back as zero. The resulting kernel boots with no ftrace entries,
> breaking dynamic ftrace and users such as livepatch.
>
> The failure is silent during the final link because zeroing weak-function
> entries is an expected sorttable operation. At boot, those zero entries
> are skipped by ftrace_process_locs(), so the only obvious symptom is that
> the vmlinux ftrace table has lost valid callsites and ftrace users cannot
> attach to them.
>
> CONFIG_FTRACE_SORT_STARTUP_TEST also reports the table as sorted in this
> state: it only checks that the __mcount_loc entries are in ascending
> order, which a fully zeroed table trivially satisfies. The original
> commit relied on this check and did not see the regression.
>
> On an affected RISC-V QEMU boot with both CONFIG_FTRACE_SORT_STARTUP_TEST
> and CONFIG_FTRACE_STARTUP_TEST enabled, the sort check still passes
> while ftrace reports zero usable entries and the early selftests fail:
>
> [ 0.000000] ftrace section at ffffffff8101da98 sorted properly
> [ 0.000000] ftrace: allocating 0 entries in 128 pages
> [ 0.054999] Testing tracer function: .. no entries found ..FAILED!
> [ 0.172407] tracer: function failed selftest, disabling
> [ 0.178186] Failed to init function_graph tracer, init returned -19
>
> Handle RISC-V like arm64 for the function-range check and allow
> patchable entries up to 8 bytes before the function address.
>
> With this fix, a RISC-V QEMU smoke boot with ftrace startup tests shows
> the vmlinux ftrace table is populated and dynamic ftrace still works:
>
> [ 0.000000] ftrace: allocating 46749 entries in 184 pages
> [ 0.051115] Testing tracer function: PASSED
> [ 1.283782] Testing dynamic ftrace: PASSED
> [ 6.275456] Testing tracer function_graph: PASSED
>
> Fixes: 0ca1724b56af ("riscv: ftrace: select HAVE_BUILDTIME_MCOUNT_SORT")
> Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> Link: https://lore.kernel.org/all/20260527113028.4b21a5de@fedora/
> Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
-- Steve
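The range check discussed in the commit message can be sketched as follows. This is an illustration under assumptions, not sorttable's code: the function names are hypothetical, and NOP sizes of 2 bytes (compressed ISA) and 4 bytes (otherwise) are assumed in order to reproduce the 8-byte figure from the two -fpatchable-function-entry settings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes of patchable NOPs placed before the function symbol. */
static unsigned int patchable_bytes_before(unsigned int nops_before,
					   unsigned int nop_size)
{
	return nops_before * nop_size;
}

/*
 * An mcount entry belongs to a function if it lies inside the function
 * body or within "before" bytes ahead of the symbol address.
 */
static bool entry_in_function(uint64_t entry, uint64_t start,
			      uint64_t size, uint64_t before)
{
	assert(start >= before);
	return entry >= start - before && entry < start + size;
}
```

With `before` set to 0, an entry 8 bytes ahead of the symbol falls outside every function and would be treated as a weak-function leftover and zeroed. With `before` set to 8 it is accepted, which is the behavior the patch description asks for on RISC-V.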
* Re: [RFC PATCH 03/10] rcu/segcblist: Change gp_seq to struct rcu_gp_oldstate gp_seq_full
From: Frederic Weisbecker @ 2026-05-28 13:15 UTC (permalink / raw)
To: Puranjay Mohan
Cc: rcu, linux-kernel, linux-trace-kernel, Paul E. McKenney,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Masami Hiramatsu, Davidlohr Bueso
In-Reply-To: <20260417231203.785172-4-puranjay@kernel.org>
Le Fri, Apr 17, 2026 at 04:11:51PM -0700, Puranjay Mohan a écrit :
> This commit renames the ->gp_seq[] field in struct rcu_segcblist to
> ->gp_seq_full[] and changes its type from unsigned long to struct
> rcu_gp_oldstate. This prepares the callback tracking infrastructure to
> support both normal and expedited grace periods.
>
> All function signatures are updated to pass struct rcu_gp_oldstate
> pointers: rcu_segcblist_nextgp(), rcu_segcblist_advance(), and
> rcu_segcblist_accelerate() now take struct rcu_gp_oldstate * instead of
> unsigned long. All callers are updated to use the .rgos_norm field for
> comparisons and assignments.
>
> The SRCU and Tasks RCU wrappers now construct an rcu_gp_oldstate with
> just .rgos_norm set and forward to the core functions.
>
> No functional change: only the .rgos_norm field is used in place of
> gp_seq.
>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
> include/linux/rcu_segcblist.h | 2 +-
> include/trace/events/rcu.h | 5 +++--
> kernel/rcu/rcu_segcblist.c | 30 +++++++++++++++++-------------
> kernel/rcu/rcu_segcblist.h | 6 +++---
> kernel/rcu/tree.c | 25 ++++++++++++++-----------
> kernel/rcu/tree_nocb.h | 29 +++++++++++++++--------------
> 6 files changed, 53 insertions(+), 44 deletions(-)
>
> diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
> index 2fdc2208f1ca..59c68f2ba113 100644
> --- a/include/linux/rcu_segcblist.h
> +++ b/include/linux/rcu_segcblist.h
> @@ -190,7 +190,7 @@ struct rcu_cblist {
> struct rcu_segcblist {
> struct rcu_head *head;
> struct rcu_head **tails[RCU_CBLIST_NSEGS];
> - unsigned long gp_seq[RCU_CBLIST_NSEGS];
> + struct rcu_gp_oldstate gp_seq_full[RCU_CBLIST_NSEGS];
That could stay as gp_seq.
> #ifdef CONFIG_RCU_NOCB_CPU
> atomic_long_t len;
> #else
> diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
> index 421f1dadb5e5..00e164db8b74 100644
> --- a/kernel/rcu/rcu_segcblist.c
> +++ b/kernel/rcu/rcu_segcblist.c
> @@ -496,7 +496,7 @@ static void rcu_segcblist_advance_compact(struct rcu_segcblist *rsclp, int i)
> * Advance the callbacks in the specified rcu_segcblist structure based
> * on the current value passed in for the grace-period counter.
> */
> -void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq)
> +void rcu_segcblist_advance(struct rcu_segcblist *rsclp, struct
> rcu_gp_oldstate *rgosp)
I don't think we need to rename everything to rgos*, especially as it's not as
intuitive as gp_seq.
> {
> int i;
>
> @@ -1229,7 +1231,8 @@ static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp)
> * Find all callbacks whose ->gp_seq numbers indicate that they
> * are ready to invoke, and put them into the RCU_DONE_TAIL sublist.
> */
> - rcu_segcblist_advance(&rdp->cblist, rnp->gp_seq);
> + rgos.rgos_norm = rnp->gp_seq;
Can we shorten that rgos_norm field to just norm?
So the above would parse better as:
gp_seq->norm = rnp->gp_seq
> + rcu_segcblist_advance(&rdp->cblist, &rgos);
>
> /* Classify any remaining callbacks. */
> return rcu_accelerate_cbs(rnp, rdp);
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 1047b30cd46b..1837eedfb8c2 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -433,7 +433,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> bool lazy)
> {
> unsigned long c;
> - unsigned long cur_gp_seq;
> + struct rcu_gp_oldstate cur_gp_seq_full;
This could stay as cur_gp_seq.
> unsigned long j = jiffies;
> long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> long lazy_len = READ_ONCE(rdp->lazy_len);
> @@ -501,8 +501,8 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> return false; // Caller must enqueue the callback.
> }
> if (j != rdp->nocb_gp_adv_time &&
> - rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
> - rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) {
> + rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full) &&
> + rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq_full.rgos_norm)) {
Because cur_gp_seq.norm would parse much better.
> rcu_advance_cbs_nowake(rdp->mynode, rdp);
> rdp->nocb_gp_adv_time = j;
> }
> @@ -659,7 +659,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> {
> bool bypass = false;
> int __maybe_unused cpu = my_rdp->cpu;
> - unsigned long cur_gp_seq;
> + struct rcu_gp_oldstate cur_gp_seq_full;
Ditto
> unsigned long flags;
> bool gotcbs = false;
> unsigned long j = jiffies;
> @@ -730,8 +730,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> needwake_gp = false;
> if (!rcu_segcblist_restempty(&rdp->cblist,
> RCU_NEXT_READY_TAIL) ||
> - (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
> - rcu_seq_done(&rnp->gp_seq, cur_gp_seq))) {
> + (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq_full) &&
> + rcu_seq_done(&rnp->gp_seq, cur_gp_seq_full.rgos_norm))) {
> raw_spin_lock_rcu_node(rnp); /* irqs disabled. */
> needwake_gp = rcu_advance_cbs(rnp, rdp);
> wasempty = rcu_segcblist_restempty(&rdp->cblist,
> @@ -877,7 +877,7 @@ static inline bool nocb_cb_wait_cond(struct rcu_data *rdp)
> static void nocb_cb_wait(struct rcu_data *rdp)
> {
> struct rcu_segcblist *cblist = &rdp->cblist;
> - unsigned long cur_gp_seq;
> + struct rcu_gp_oldstate cur_gp_seq_full;
Ditto.
> unsigned long flags;
> bool needwake_gp = false;
> struct rcu_node *rnp = rdp->mynode;
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH v2] selftests/ftrace: Fix trace_marker_raw test on 64K page kernels
From: Steven Rostedt @ 2026-05-28 13:13 UTC (permalink / raw)
To: Tianchen Ding
Cc: Masami Hiramatsu, Mathieu Desnoyers, Shuah Khan, linux-kernel,
linux-trace-kernel, linux-kselftest
In-Reply-To: <20260528022417.1813745-1-dtcccc@linux.alibaba.com>
On Thu, 28 May 2026 10:24:17 +0800
Tianchen Ding <dtcccc@linux.alibaba.com> wrote:
>
> diff --git a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
> index 8e905d4fe6dd..f68f1901f65f 100644
> --- a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
> +++ b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
> @@ -43,8 +43,11 @@ write_buffer() {
> id=$1
> size=$2
>
> - # write the string into the raw marker
> - make_str $id $size > trace_marker_raw
> + # Pipe through dd to ensure a single atomic write() syscall
> + # on architectures with 64K pages, where shell's printf builtin
> + # uses stdio buffering which may split the output into multiple
> + # writes.
> + make_str $id $size | dd of=trace_marker_raw bs=`expr $size + 4` iflag=fullblock
I was looking at this more, and I'm not comfortable with the hard-coded
4 above. I'd rather use the length of the string. Something like:
str=`make_str $id $size`
len=${#str}
echo "$str" | dd of=trace_marker_raw bs=$len iflag=fullblock
-- Steve
> }
>
>
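For readers following the "hard-coded 4" concern, here is a minimal C sketch of the same idea, assuming the record is a 4-byte id followed by the payload (as the `expr $size + 4` in the patch suggests). The helper name write_record_once() is hypothetical and not part of the selftest. It derives the length from the buffer it builds and hands the whole record to one write() call.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the selftest): build an id + payload
 * record in one buffer and submit it with a single write() call.
 */
static ssize_t write_record_once(int fd, uint32_t id, size_t payload_len)
{
	/* Length is derived from the pieces instead of a bare "4". */
	size_t len = sizeof(id) + payload_len;
	unsigned char *buf = malloc(len);
	ssize_t ret;

	if (!buf)
		return -1;

	/* Native-endian id; these bytes may legitimately be NUL. */
	memcpy(buf, &id, sizeof(id));
	memset(buf + sizeof(id), 'X', payload_len);

	/* One syscall for the whole record. */
	ret = write(fd, buf, len);
	free(buf);
	return ret;
}
```

A caller would pass a descriptor opened on trace_marker_raw and compare the return value against sizeof(id) + payload_len to confirm the record was accepted in one piece.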
* Re: [PATCHv6 bpf-next 10/29] bpf: Add multi tracing attach types
From: Jiri Olsa @ 2026-05-28 13:13 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <23ab5817f619274a252b35119e760817c383e3c20f9ce9facccbf4b4fec9dd58@mail.kernel.org>
On Wed, May 27, 2026 at 12:59:43PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 435ae6efe961..27ad0c265c95 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -41,6 +41,7 @@
> > #include <linux/overflow.h>
> > #include <linux/cookie.h>
> > #include <linux/verification.h>
> > +#include <linux/btf_ids.h>
> >
> > #include <net/netfilter/nf_bpf_link.h>
> > #include <net/netkit.h>
>
> [ ... ]
>
> > @@ -2904,6 +2914,11 @@ static int bpf_prog_mark_insn_arrays_ready(struct bpf_prog *prog)
> > return 0;
> > }
> >
> > +extern int bpf_multi_func(void);
> > +int __init __used bpf_multi_func(void) { return 0; }
>
> The signature change from v5 to v6 creates a potential issue. In v5,
> bpf_multi_func had 6 arguments (matching the initial verifier setup):
>
> int __init __used bpf_multi_func(u64 r1, u64 r2, u64 r3,
> u64 r4, u64 r5, u64 r6)
>
> Now it has no arguments. The Sashiko AI review in v5 identified that
> context array bounds checking could fail when bpf_multi_func's argument
> count doesn't match the attached function's argument count, and the
> author acknowledged the need to disable direct context access.
>
> Since the initial verification happens against bpf_multi_func (which now
> has 0 arguments), but the verifier previously allowed direct ctx[0-5]
> accesses based on the 6-argument version, is there still a path where
> direct context accesses could be allowed? If so, would accessing ctx[0]
> on a function with 0 arguments lead to out-of-bounds kernel stack reads?
>
> The earlier review suggested either:
> (a) explicitly disabling direct memory accesses to the ctx array for
> FENTRY_MULTI and FEXIT_MULTI, or
> (b) enforcing the use of bounds-checked helpers like bpf_get_func_arg()
>
> Was this implemented somewhere in the patch series?
It's implemented by declaring bpf_multi_func without any arguments, so the
verifier won't allow access to any argument, including ctx[0], which is the
1st argument.
jirka
>
> > +
> > +BTF_ID_LIST_GLOBAL_SINGLE(bpf_multi_func_btf_id, func, bpf_multi_func)
>
> [ ... ]
>
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index c8d980fdd709..58450cb88beb 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
>
> [ ... ]
>
> > @@ -18837,6 +18839,11 @@ static int check_attach_modify_return(unsigned long addr, const char *func_name)
> >
> > #endif /* CONFIG_FUNCTION_ERROR_INJECTION */
> >
> > +static bool is_tracing_multi_id(const struct bpf_prog *prog, u32 btf_id)
> > +{
> > + return is_tracing_multi(prog->expected_attach_type) && bpf_multi_func_btf_id[0] == btf_id;
> > +}
> > +
> > int bpf_check_attach_target(struct bpf_verifier_log *log,
> > const struct bpf_prog *prog,
> > const struct bpf_prog *tgt_prog,
>
> [ ... ]
>
> > @@ -19094,7 +19105,18 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
> > if (ret < 0)
> > return ret;
> >
> > - if (tgt_prog) {
> > + /*
> > + * *.multi programs don't need an address during program
> > + * verification, we just take the module ref if needed.
> > + */
> > + if (is_tracing_multi_id(prog, btf_id)) {
> > + if (btf_is_module(btf)) {
> > + mod = btf_try_get_module(btf);
> > + if (!mod)
> > + return -ENOENT;
> > + }
> > + addr = 0;
> > + } else if (tgt_prog) {
> > if (subprog == 0)
> > addr = (long) tgt_prog->bpf_func;
> > else
>
> [ ... ]
>
> > @@ -19122,6 +19144,12 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
> > ret = -EINVAL;
> > switch (prog->type) {
> > case BPF_PROG_TYPE_TRACING:
> > + /* *.multi sleepable programs will pass initial sleepable check,
> > + * the actual attached btf ids are checked later during the link
> > + * attachment.
> > + */
> > + if (is_tracing_multi_id(prog, btf_id))
> > + ret = 0;
> > if (!check_attach_sleepable(btf_id, addr, tname))
> > ret = 0;
>
> The multi types are not included in btf_validate_prog_ctx_type() in
> kernel/bpf/btf.c. That function validates which attach types allow u64*
> as their ctx parameter:
>
> kernel/bpf/btf.c:btf_validate_prog_ctx_type() {
> case BPF_TRACE_FENTRY:
> case BPF_TRACE_FEXIT:
> case BPF_MODIFY_RETURN:
> ...
> }
>
> Since BPF_TRACE_FENTRY_MULTI and BPF_TRACE_FEXIT_MULTI also use u64*
> context, do they need to be added to that validation list? Or if direct
> context access should be disabled for multi types (per the earlier
> review concern), should they be handled with special rejection logic?
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26509800686
* Re: [PATCHv6 bpf-next 22/29] selftests/bpf: Add tracing multi skel/pattern/ids module attach tests
From: Jiri Olsa @ 2026-05-28 13:10 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <20af6dea23834f831d06cf601a91bf8848d78091b46a913ed1b24302576983e1@mail.kernel.org>
On Wed, May 27, 2026 at 12:59:39PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> > index e290513358d9..746a8398211a 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
>
> [ ... ]
>
> > @@ -241,6 +250,96 @@ static void test_link_api_ids(void)
>
> [ ... ]
>
> > +static void test_module_link_api_ids(void)
> > +{
> > + size_t cnt = ARRAY_SIZE(bpf_testmod_fentry_test);
> > + LIBBPF_OPTS(bpf_tracing_multi_opts, opts);
> > + struct tracing_multi_module *skel = NULL;
> > + __u32 *ids;
> > +
> > + skel = tracing_multi_module__open_and_load();
> > + if (!ASSERT_OK_PTR(skel, "tracing_multi_module__open_and_load"))
> > + return;
> > +
> > + skel->bss->pid = getpid();
> > +
> > + ids = get_ids(bpf_testmod_fentry_test, cnt, "bpf_testmod");
> > + if (!ASSERT_OK_PTR(ids, "get_ids"))
> > + goto cleanup;
>
> This new test calls get_ids() but there's an unaddressed review comment
> from sashiko-bot about a memory leak in that function. Looking at get_ids()
> in the same file:
>
> for (i = 0; i < funcs_cnt; i++) {
> if (!tsearch(&funcs[i], &root, compare)) {
> ASSERT_FAIL("tsearch failed");
> err = -1;
> goto out;
> }
> }
>
> ...
>
> if (err) {
> free(ids);
> ids = NULL;
> }
>
> out:
> tdestroy(root, tdestroy_free_nop);
> btf__free(vmlinux_btf);
> btf__free(btf);
> return ids;
>
> If tsearch() fails, the code sets err = -1 and jumps to the out label.
> However, the out label is positioned after the cleanup block that frees ids
> (the "if (err)" block), not before it. This means the cleanup code is
> bypassed when we goto out, causing two issues:
>
> 1. The ids array is leaked
>
> 2. The function returns a non-NULL pointer to allocated but potentially
> invalid memory, which will incorrectly pass the ASSERT_OK_PTR check above
>
> Should the out label be moved before the cleanup block so that the error
> path properly frees the ids array and sets it to NULL?
yes, will fix
jirka
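A minimal, self-contained sketch of the label placement discussed above, under stated assumptions: get_ids_sketch() and its fail_early flag are hypothetical stand-ins for get_ids() and the tsearch() failure, not the selftest code. The point is only that the "out" label sits before the error cleanup, so an early "goto out" frees the array and returns NULL.

```c
#include <stdlib.h>

/*
 * Hypothetical stand-in for get_ids(): 'fail_early' simulates the
 * tsearch() failure path from the review comment.
 */
static unsigned int *get_ids_sketch(size_t cnt, int fail_early)
{
	unsigned int *ids;
	int err = 0;
	size_t i;

	ids = calloc(cnt, sizeof(*ids));
	if (!ids)
		return NULL;

	if (fail_early) {
		err = -1;
		goto out;
	}

	for (i = 0; i < cnt; i++)
		ids[i] = (unsigned int)i + 1;

out:
	/* The label precedes the error cleanup, so 'goto out' reaches it. */
	if (err) {
		free(ids);
		ids = NULL;
	}
	/* Shared teardown (tdestroy()/btf__free() in the real function) would follow. */
	return ids;
}
```

With this layout the caller's ASSERT_OK_PTR-style check sees NULL on the failure path instead of a pointer to a partially filled array.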
* Re: [PATCH bpf-next] x86/ftrace: relocate %rip-relative percpu refs in dynamic trampolines
From: Steven Rostedt @ 2026-05-28 13:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Alexis Lothoré (eBPF Foundation), Masami Hiramatsu,
Mark Rutland, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Uros Bizjak, Thomas Petazzoni,
Ingo Molnar, linux-kernel, linux-trace-kernel, bpf, ebpf,
Bastien Curutchet
In-Reply-To: <20260528090231.45d9b28f@fedora>
On Thu, 28 May 2026 09:02:31 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 27 May 2026 23:11:35 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > I went and had a quick grep through the tree to see if there are more
> > sites that were missed in the conversion (commit 17bce3b2ae2d), but I
> > couldn't find another one.
> >
> > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> Peter, do you want me to take this through my tree, or do you prefer it
> goes through the x86 tree?
>
Never mind. I now see it is in tip.
-- Steve
* Re: [PATCH bpf-next] x86/ftrace: relocate %rip-relative percpu refs in dynamic trampolines
From: Steven Rostedt @ 2026-05-28 13:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Alexis Lothoré (eBPF Foundation), Masami Hiramatsu,
Mark Rutland, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Uros Bizjak, Thomas Petazzoni,
Ingo Molnar, linux-kernel, linux-trace-kernel, bpf, ebpf,
Bastien Curutchet
In-Reply-To: <20260527211135.GA343181@noisy.programming.kicks-ass.net>
On Wed, 27 May 2026 23:11:35 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> I went and had a quick grep through the tree to see if there are more
> sites that were missed in the conversion (commit 17bce3b2ae2d), but I
> couldn't find another one.
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Peter, do you want me to take this through my tree, or do you prefer it
goes through the x86 tree?
-- Steve
* Re: [PATCHv4 13/13] selftests/bpf: Add tests for forked/cloned optimized uprobes
From: Jakub Sitnicki @ 2026-05-28 13:00 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <20260526205840.173790-14-jolsa@kernel.org>
On Tue, May 26, 2026 at 10:58 PM +02, Jiri Olsa wrote:
> Adding tests for forked/cloned optimized uprobes and make
> sure the child can properly execute optimized probe for
> both fork (dups mm) and clone with CLONE_VM.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
> .../selftests/bpf/prog_tests/uprobe_syscall.c | 88 +++++++++++++++++++
> 1 file changed, 88 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index efff0c515184..033d32b4cc27 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> @@ -4,6 +4,8 @@
>
> #ifdef __x86_64__
>
> +#define _GNU_SOURCE
> +#include <sched.h>
> #include <unistd.h>
> #include <asm/ptrace.h>
> #include <linux/compiler.h>
> @@ -936,6 +938,88 @@ static void test_uprobe_error(void)
> ASSERT_EQ(errno, EPROTO, "errno");
> }
>
> +__attribute__((aligned(16)))
> +__nocf_check __weak __naked void uprobe_fork_test(void)
> +{
> + asm volatile (
> + ".byte 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00\n" /* nop10 */
> + "ret\n"
> + );
> +}
> +
> +static int child_func(void *arg)
Nit: Could annotate with noreturn:
#include <stdnoreturn.h>
/* ... */
static noreturn int child_func(void *arg)
> +{
> + struct uprobe_syscall_executed *skel = arg;
> +
> + /* Make sure the child's probe is still there and optimized.. */
> + if (memcmp(uprobe_fork_test, lea_rsp, sizeof(lea_rsp)))
> + _exit(1);
> +
> + skel->bss->pid = getpid();
> +
> + /* .. and it executes properly. */
> + uprobe_fork_test();
> +
> + if (skel->bss->executed != 3)
> + _exit(2);
> +
> + _exit(0);
> +}
[...]
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>