* Re: [GIT PULL] RTLA changes for 7.2
From: Steven Rostedt @ 2026-05-29 13:56 UTC (permalink / raw)
To: Tomas Glozar; +Cc: Costa Shulyupin, Crystal Wood, LKML, linux-trace-kernel
In-Reply-To: <20260529130643.3080315-1-tglozar@redhat.com>
On Fri, 29 May 2026 15:06:43 +0200
Tomas Glozar <tglozar@redhat.com> wrote:
> - Fix discrepancy in --dump-tasks option
>
> Due to a mistake, rtla-timerlat-hist used the CLI syntax "--dump-task"
> instead of the documented "--dump-tasks". Change the option to match
> both documentation and the other timerlat tool, rtla-timerlat-top.
Is there any concern that scripts might be using the old option?
I wonder if you should keep the old option for backward compatibility,
but do not document that it exists.
-- Steve
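For illustration only, a minimal sketch of what keeping the old spelling as an undocumented alias could look like. It uses plain getopt_long() to stay self-contained; rtla moves to libsubcmd in this pull, so the real change would look different. The helper name parse_dump_tasks() and the 'D' value are assumptions, not rtla code.

```c
#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not rtla code: returns true if either the documented
 * "--dump-tasks" or the legacy "--dump-task" spelling is present in argv.
 */
static bool parse_dump_tasks(int argc, char **argv)
{
	/* Both spellings map to the same value; only the first is documented. */
	static const struct option long_options[] = {
		{ "dump-tasks", no_argument, NULL, 'D' },
		{ "dump-task",  no_argument, NULL, 'D' },	/* undocumented alias */
		{ NULL, 0, NULL, 0 },
	};
	bool dump_tasks = false;
	int c;

	assert(argc >= 1 && argv != NULL);

	optind = 1;	/* restart scanning so the helper can be called repeatedly */
	opterr = 0;	/* keep the sketch quiet about unknown options */
	while ((c = getopt_long(argc, argv, "", long_options, NULL)) != -1) {
		if (c == 'D')
			dump_tasks = true;
	}
	return dump_tasks;
}
```

The alias costs one table entry and never has to appear in the help text or the man page, which is roughly the trade-off suggested above.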
* Re: [PATCH v2] unwind: Add sframe_(un)register() system calls
From: Steven Rostedt @ 2026-05-29 13:31 UTC (permalink / raw)
To: Heiko Carstens
Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
Mathieu Desnoyers, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Vasily Gorbik,
Thomas Weißschuh
In-Reply-To: <20260529082840.26496A2c-hca@linux.ibm.com>
On Fri, 29 May 2026 10:28:40 +0200
Heiko Carstens <hca@linux.ibm.com> wrote:
> > diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> > index 09a7ef04d979..52519e2acdc8 100644
> > --- a/arch/s390/kernel/syscalls/syscall.tbl
> > +++ b/arch/s390/kernel/syscalls/syscall.tbl
> > @@ -398,3 +398,6 @@
> > 469 common file_setattr sys_file_setattr
> > 470 common listns sys_listns
> > 471 common rseq_slice_yield sys_rseq_slice_yield
> > +472 common stacktrace_setup sys_stacktrace_setup
> > +472 common sframe_register sys_sframe_register
> > +473 common sframe_unregister sys_sframe_unregister
>
> What is stacktrace_setup? And why only for s390? Looks like a leftover.
Oops! Yes, it's a leftover. Thanks for noticing.
-- Steve
* Re: [PATCH v6] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-29 13:19 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: LKML, Linux trace kernel, Mathieu Desnoyers, Mark Rutland,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
Jiri Olsa
In-Reply-To: <20260529132508.2e98cab925fdb1fa7be21a9b@kernel.org>
On Fri, 29 May 2026 13:25:08 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> Thus, we should report a kernel bug if !(ctx->flags & TPARG_FL_TYPECAST)
> here. Something like this:
>
> 	if (WARN_ON_ONCE(!(ctx->flags & TPARG_FL_TYPECAST)))
> 		return -EINVAL;
> 	type = ctx->last_struct;
> 	goto found_type;
OK, will update it in v7.
Thanks,
-- Steve
* [GIT PULL] RTLA changes for 7.2
From: Tomas Glozar @ 2026-05-29 13:06 UTC (permalink / raw)
To: Steven Rostedt
Cc: Costa Shulyupin, Crystal Wood, LKML, linux-trace-kernel,
Tomas Glozar
Steven,
The following changes since commit 5200f5f493f79f14bbdc349e402a40dfb32f23c8:
Linux 7.1-rc4 (2026-05-17 13:59:58 -0700)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/tglozar/linux.git tags/rtla-v7.2
for you to fetch changes up to db956bcf8d681b5a01ebe04c79f6a7b29b9934f9:
rtla: Document tests in README (2026-05-29 09:40:54 +0200)
----------------------------------------------------------------
RTLA patches for v7.2
- Fix discrepancy in --dump-tasks option
Due to a mistake, rtla-timerlat-hist used the CLI syntax "--dump-task"
instead of the documented "--dump-tasks". Change the option to match
both documentation and the other timerlat tool, rtla-timerlat-top.
- Extend coverage of runtime tests
Cover both top and hist tools in all applicable test cases, add tests
for a few uncovered options, and extend checks for some existing tests.
- Add unit tests for actions
rtla's actions feature is implemented in its own source file and contains
non-trivial parsing logic. Cover it with unit tests.
- Stop record trace on interrupt
Fix a bug where an interval exists after receiving a signal in which
the main instance is stopped but the record instance is not, leading to
discrepancies in reported results and sometimes rtla hanging.
- Restore continue flag in actions_perform()
Fix a bug where rtla always continues tracing after hitting a threshold
even if the continue action was triggered just once, and add tests
verifying that the flag is reset properly.
- Migrate command line interface to libsubcmd
Replace rtla's argument parsing using getopt_long() with libsubcmd, used
by perf and objtool, to reuse existing code and auto-generate better
help messages. Extensive unit tests are included to detect regressions.
- Add -A/--aligned option to timerlat tools
Add an option to align timerlat threads, based on the recently
introduced TIMERLAT_ALIGN option of the timerlat tracer, together with
unit tests and documentation.
- Document tests in README
Document how to run unit and runtime tests in rtla's README.txt,
including the dependencies needed to run them.
---
Two of the commits:
- 534d9a93dbff2 tools subcmd: support optarg as separate argument
- da62fc3458462 tools subcmd: allow parsing distinct --opt and --no-opt
make minor changes to libsubcmd code needed for rtla. libsubcmd does not
have a MAINTAINERS entry, but generally follows perf commit message
style, so I also followed that instead of the tracing subsystem style.
The tag was built and tested (make && make unit-tests && sudo make
check) on a 7.1-rc5 kernel; the same was done after a test-merge into
next-20260528. No new issues were found.
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
----------------------------------------------------------------
Costa Shulyupin (1):
tools/rtla: Fix --dump-tasks usage in timerlat
Crystal Wood (1):
rtla: Stop the record trace on interrupt
Tomas Glozar (24):
rtla/tests: Cover both top and hist tools where possible
rtla/tests: Add get_workload_pids() helper
rtla/tests: Check -c/--cpus thread affinity
rtla/tests: Use negative match when testing --aa-only
rtla/tests: Extend timerlat top --aa-only coverage
rtla/tests: Cover all hist options in runtime tests
rtla/tests: Add runtime test for -H/--house-keeping
rtla/tests: Add runtime test for -k and -u options
rtla/tests: Add runtime tests for -C/--cgroup
rtla/tests: Add unit tests for actions module
rtla/actions: Restore continue flag in actions_perform()
rtla/tests: Add unit test for restoring continue flag
rtla/tests: Run runtime tests in temporary directory
rtla/tests: Add runtime tests for restoring continue flag
rtla: Add libsubcmd dependency
tools subcmd: support optarg as separate argument
tools subcmd: allow parsing distinct --opt and --no-opt
rtla: Parse cmdline using libsubcmd
rtla/tests: Add unit tests for _parse_args() functions
rtla/tests: Add unit tests for CLI option callbacks
rtla/timerlat: Add -A/--aligned CLI option
rtla/tests: Add unit tests for -A/--aligned option
Documentation/rtla: Add -A/--aligned option
rtla: Document tests in README
Documentation/tools/rtla/common_appendix.txt | 7 +-
.../tools/rtla/common_timerlat_options.txt | 11 +
tools/lib/subcmd/parse-options.c | 63 +-
tools/lib/subcmd/parse-options.h | 4 +
tools/tracing/rtla/.gitignore | 3 +
tools/tracing/rtla/Makefile | 66 +-
tools/tracing/rtla/README.txt | 30 +
tools/tracing/rtla/src/Build | 2 +-
tools/tracing/rtla/src/actions.c | 2 +
tools/tracing/rtla/src/cli.c | 539 +++++++++++++++
tools/tracing/rtla/src/cli.h | 9 +
tools/tracing/rtla/src/cli_p.h | 687 ++++++++++++++++++++
tools/tracing/rtla/src/common.c | 128 +---
tools/tracing/rtla/src/common.h | 36 +-
tools/tracing/rtla/src/osnoise.c | 158 ++++-
tools/tracing/rtla/src/osnoise.h | 6 +
tools/tracing/rtla/src/osnoise_hist.c | 221 +------
tools/tracing/rtla/src/osnoise_top.c | 200 +-----
tools/tracing/rtla/src/rtla.c | 89 ---
tools/tracing/rtla/src/timerlat.c | 29 +-
tools/tracing/rtla/src/timerlat.h | 8 +-
tools/tracing/rtla/src/timerlat_hist.c | 317 +--------
tools/tracing/rtla/src/timerlat_top.c | 285 +-------
tools/tracing/rtla/src/utils.c | 28 +-
tools/tracing/rtla/src/utils.h | 9 +-
tools/tracing/rtla/tests/engine.sh | 27 +
tools/tracing/rtla/tests/hwnoise.t | 2 +-
tools/tracing/rtla/tests/osnoise.t | 77 ++-
.../rtla/tests/scripts/check-cgroup-match.sh | 17 +
tools/tracing/rtla/tests/scripts/check-cpus.sh | 9 +
.../rtla/tests/scripts/check-housekeeping-cpus.sh | 4 +
tools/tracing/rtla/tests/scripts/check-priority.sh | 8 +-
.../tests/scripts/check-user-kernel-threads.sh | 16 +
.../rtla/tests/scripts/lib/get_workload_pids.sh | 11 +
tools/tracing/rtla/tests/timerlat.t | 117 ++--
tools/tracing/rtla/tests/unit/Build | 8 +-
tools/tracing/rtla/tests/unit/Makefile.unit | 6 +-
tools/tracing/rtla/tests/unit/actions.c | 393 +++++++++++
tools/tracing/rtla/tests/unit/cli_opt_callback.c | 716 ++++++++++++++++++++
tools/tracing/rtla/tests/unit/cli_params_assert.h | 68 ++
tools/tracing/rtla/tests/unit/osnoise_hist_cli.c | 557 ++++++++++++++++
tools/tracing/rtla/tests/unit/osnoise_top_cli.c | 503 ++++++++++++++
tools/tracing/rtla/tests/unit/timerlat_hist_cli.c | 722 +++++++++++++++++++++
tools/tracing/rtla/tests/unit/timerlat_top_cli.c | 654 +++++++++++++++++++
tools/tracing/rtla/tests/unit/unit_tests.c | 120 +---
tools/tracing/rtla/tests/unit/utils.c | 106 +++
46 files changed, 5591 insertions(+), 1487 deletions(-)
create mode 100644 tools/tracing/rtla/src/cli.c
create mode 100644 tools/tracing/rtla/src/cli.h
create mode 100644 tools/tracing/rtla/src/cli_p.h
delete mode 100644 tools/tracing/rtla/src/rtla.c
create mode 100755 tools/tracing/rtla/tests/scripts/check-cgroup-match.sh
create mode 100755 tools/tracing/rtla/tests/scripts/check-cpus.sh
create mode 100755 tools/tracing/rtla/tests/scripts/check-housekeeping-cpus.sh
create mode 100755 tools/tracing/rtla/tests/scripts/check-user-kernel-threads.sh
create mode 100644 tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh
create mode 100644 tools/tracing/rtla/tests/unit/actions.c
create mode 100644 tools/tracing/rtla/tests/unit/cli_opt_callback.c
create mode 100644 tools/tracing/rtla/tests/unit/cli_params_assert.h
create mode 100644 tools/tracing/rtla/tests/unit/osnoise_hist_cli.c
create mode 100644 tools/tracing/rtla/tests/unit/osnoise_top_cli.c
create mode 100644 tools/tracing/rtla/tests/unit/timerlat_hist_cli.c
create mode 100644 tools/tracing/rtla/tests/unit/timerlat_top_cli.c
create mode 100644 tools/tracing/rtla/tests/unit/utils.c
* Re: [PATCH v2] selftests/ftrace: Fix trace_marker_raw test on 64K page kernels
From: Steven Rostedt @ 2026-05-29 13:15 UTC (permalink / raw)
To: Tianchen Ding
Cc: Masami Hiramatsu, Mathieu Desnoyers, Shuah Khan, linux-kernel,
linux-trace-kernel, linux-kselftest
In-Reply-To: <bcb52d91-0440-4e73-86af-997e8b723711@linux.alibaba.com>
On Fri, 29 May 2026 10:59:34 +0800
Tianchen Ding <dtcccc@linux.alibaba.com> wrote:
> We can take advantage of this by having make_str return the escape-sequence text
> instead of binary, and letting write_buffer handle the conversion:
>
> make_str() {
> ...
> printf '%s' "${val}${data}"
> }
>
> write_buffer() {
> id=$1
> size=$2
>
> str=`make_str $id $size`
> len=$(printf "$str" | wc -c)
> printf "$str" | dd of=trace_marker_raw bs=$len iflag=fullblock
> }
>
> This way str holds only printable escape-sequence text (no NUL), printf "$str"
> converts it to real binary through the pipe, and wc -c measures the true binary
> length.
This is quite hacky, but at least it removes the hardcoded assumptions.
OK, you can send a v3 that does that.
Thanks,
-- Steve
* Re: [PATCH v2] unwind: Add sframe_(un)register() system calls
From: Heiko Carstens @ 2026-05-29 8:28 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
Mathieu Desnoyers, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Vasily Gorbik,
Thomas Weißschuh
In-Reply-To: <20260528151023.00f5ec4e@gandalf.local.home>
On Thu, May 28, 2026 at 03:10:23PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Add system calls to register and unregister sframes that can be used by
> dynamic linkers to tell the kernel where the sframe section is in memory
> for libraries it loads.
...
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index 09a7ef04d979..52519e2acdc8 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -398,3 +398,6 @@
> 469 common file_setattr sys_file_setattr
> 470 common listns sys_listns
> 471 common rseq_slice_yield sys_rseq_slice_yield
> +472 common stacktrace_setup sys_stacktrace_setup
> +472 common sframe_register sys_sframe_register
> +473 common sframe_unregister sys_sframe_unregister
What is stacktrace_setup? And why only for s390? Looks like a leftover.
* Re: [PATCH v2 02/12] rv: Fix read_lock scope in per-task DA cleanup
From: Gabriele Monaco @ 2026-05-29 6:08 UTC (permalink / raw)
To: Nam Cao; +Cc: Wen Yang, linux-kernel, Steven Rostedt, linux-trace-kernel
In-Reply-To: <87ldd3orft.fsf@yellow.woof>
On Thu, 2026-05-28 at 10:43 +0200, Nam Cao wrote:
> Gabriele Monaco <gmonaco@redhat.com> writes:
> > The da_monitor_reset_all() function for per-task monitors takes
> > tasklist_lock while iterating over tasks, then keeps it also while
> > iterating over idle tasks (one per CPU). The latter is not
> > necessary
> > since the lock needs to guard only for_each_process_thread().
> >
> > Use a scoped_guard for more compact syntax and adjust the scope
> > only
> > where the lock is necessary.
> >
> > Fixes: 30984ccf31b7f ("rv: Refactor da_monitor to minimise macros")
> > Fixes: 8259cb14a7068 ("rv: Reset per-task monitors also for idle
> > tasks")
>
> Fixes: tag "indicates that the patch fixes a bug in a previous
> commit". There is no bug here, so I don't think Fixes tags are
> applicable.
Yeah, good point, that isn't a real bug. We're just holding the lock for
a bit too long, but there's no harm in that. Will remove the tags.
Thanks,
Gabriele
>
> > Reviewed-by: Wen Yang <wen.yang@linux.dev>
> > Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
>
> Reviewed-by: Nam Cao <namcao@linutronix.de>
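To make the scope change discussed above concrete, here is a user-space sketch of a scoped guard. It is not the kernel's scoped_guard(); it imitates the idea with the GCC/Clang cleanup attribute and a toy lock that only counts readers. Every name in it (toy_rwlock, toy_scoped_read_guard, toy_reset_all) is an assumption made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lock: just counts readers so the scope of the guard can be observed. */
struct toy_rwlock {
	int readers;
};

static void toy_read_lock(struct toy_rwlock *l)
{
	l->readers++;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
	assert(l->readers > 0);
	l->readers--;
}

/* Called automatically when the guard variable goes out of scope. */
static void toy_guard_release(struct toy_rwlock **l)
{
	toy_read_unlock(*l);
}

/* The body runs exactly once; the lock is dropped on leaving the block. */
#define toy_scoped_read_guard(lock)						\
	for (struct toy_rwlock *_g __attribute__((cleanup(toy_guard_release))) =	\
		(toy_read_lock(lock), (lock)), *_once = (lock);			\
	     _once; _once = NULL)

struct toy_reset_stats {
	int tasks_seen_locked;		/* iterations that ran with the lock held */
	int idle_seen_unlocked;		/* iterations that ran without it */
};

static struct toy_reset_stats toy_reset_all(struct toy_rwlock *tasklist,
					    int nr_tasks, int nr_cpus)
{
	struct toy_reset_stats st = { 0, 0 };

	/* Only the walk over all tasks needs the lock. */
	toy_scoped_read_guard(tasklist) {
		for (int i = 0; i < nr_tasks; i++)
			st.tasks_seen_locked += (tasklist->readers == 1);
	}

	/* The per-CPU idle tasks are handled after the guard has ended. */
	for (int cpu = 0; cpu < nr_cpus; cpu++)
		st.idle_seen_unlocked += (tasklist->readers == 0);

	return st;
}
```

The point of the pattern is that the unlock cannot be forgotten on any exit path from the block, while the lock is visibly limited to the loop that needs it.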
* Re: [PATCH v6] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-05-29 4:25 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux trace kernel, Masami Hiramatsu, Mathieu Desnoyers,
Mark Rutland, Peter Zijlstra, Namhyung Kim, Takaya Saeki,
Douglas Raillard, Tom Zanussi, Andrew Morton, Thomas Gleixner,
Ian Rogers, Jiri Olsa
In-Reply-To: <20260521225033.56458336@fedora>
On Thu, 21 May 2026 22:50:33 -0400
Steven Rostedt <rostedt@kernel.org> wrote:
> + if (ctx->flags & TPARG_FL_TEVENT) {
> + int ret;
> +
> + ret = parse_trace_event(varname, code, ctx);
> + if (ret < 0)
> + return ret;
> +
> + if (ctx->flags & TPARG_FL_TYPECAST) {
> + type = ctx->last_struct;
> + goto found_type;
> + }
> + return 0;
This is a somewhat complicated, but buggy, case.
parse_btf_arg() is normally not used for eprobe arguments because those
require a '$' prefix. It is called for eprobes only when the argument
has a typecast.
Thus, we should report a kernel bug if !(ctx->flags & TPARG_FL_TYPECAST)
here. Something like this:
	if (WARN_ON_ONCE(!(ctx->flags & TPARG_FL_TYPECAST)))
		return -EINVAL;
	type = ctx->last_struct;
	goto found_type;
Thanks,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
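A user-space sketch of the "warn once, then fail" shape suggested above, for readers without a kernel tree at hand. The flag values, toy_warn_on_once() and toy_parse_event_arg() are all stand-ins invented for this example; the kernel's own warning macros behave differently in detail (they dump a backtrace, for one).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the TPARG_FL_* bits; the values are made up. */
#define TOY_FL_TEVENT	(1u << 0)
#define TOY_FL_TYPECAST	(1u << 1)

static int toy_warnings;	/* number of warnings actually printed */

/* Imitation of warn-once: report only the first hit, always return cond. */
static bool toy_warn_on_once(bool cond, bool *warned, const char *what)
{
	assert(warned != NULL);
	if (cond && !*warned) {
		*warned = true;
		toy_warnings++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Returns 0 if the typecast path may proceed, -EINVAL on the bug case. */
static int toy_parse_event_arg(unsigned int flags)
{
	static bool warned;

	if (!(flags & TOY_FL_TEVENT))
		return 1;	/* not an event probe; other lookup paths apply */

	/* Note the parentheses: "!flags & X" would negate flags first. */
	if (toy_warn_on_once(!(flags & TOY_FL_TYPECAST), &warned,
			     "event argument reached BTF parsing without a typecast"))
		return -EINVAL;

	return 0;	/* the real code would continue at found_type */
}
```

Calling toy_parse_event_arg(TOY_FL_TEVENT) repeatedly keeps returning -EINVAL but prints the warning only once, which is the behaviour wanted for a condition that should never happen.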
* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Tengda Wu @ 2026-05-29 3:39 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
linux-trace-kernel, linux-kernel
In-Reply-To: <20260526123719.482f07a3843e207e22d95378@kernel.org>
Hi Masami,
thanks for the review and feedback.
On 2026/5/26 11:37, Masami Hiramatsu wrote:
> On Mon, 25 May 2026 21:22:53 +0800
> Tengda Wu <wutengda@huaweicloud.com> wrote:
>
>> When a task calls schedule() to yield the CPU, its state remains
>> TASK_RUNNING, but its stack is frozen and safe to walk.
>>
>> Replace task_is_running(tsk) with tsk->on_cpu to avoid overly
>> conservative rejections.
>
> Please see the Sashiko's comment.
>
> https://sashiko.dev/#/patchset/20260525132253.1889726-1-wutengda%40huaweicloud.com
>
> When calling Unwind on a task other than the current, IMHO, it is
> the responsibility of the caller of this function to ensure that the
> stack trace of that task is safe.
Agree.
> We also should not use tsk->on_cpu, but should use task_on_cpu(tsk).
>
> BTW, should task_on_cpu() use READ_ONCE() etc?
> wait_task_inactive() seems a bit fragile.
>
> Thanks,
>
It seems that using task_on_cpu() is not necessary here because:
1. It requires an additional 'rq' parameter not available in the rethook context.
2. It just returns p->on_cpu, which is identical to our current use of tsk->on_cpu.
/* file: kernel/sched/sched.h */
static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
{
return p->on_cpu;
}
Given these constraints, staying with tsk->on_cpu seems more straightforward
for the rethook context.
Thanks,
Tengda
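As a toy model of the difference between the two checks in the patch below: the struct and helpers here are invented for illustration and are not kernel definitions. The sketch assumes, as the patch description states, that a task can be in the running state without currently executing on a CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TASK_RUNNING	0	/* runnable; not necessarily executing */
#define TOY_TASK_SLEEPING	1

/* Hypothetical miniature of a task: scheduler state plus an on-CPU marker. */
struct toy_task {
	unsigned int state;
	int on_cpu;	/* non-zero only while the task executes on a CPU */
};

/* Old check: refuse any other task whose state says "running". */
static bool toy_can_walk_by_state(const struct toy_task *tsk, bool is_current)
{
	assert(tsk != NULL);
	return is_current || tsk->state != TOY_TASK_RUNNING;
}

/* New check: refuse only other tasks that are executing right now. */
static bool toy_can_walk_by_on_cpu(const struct toy_task *tsk, bool is_current)
{
	assert(tsk != NULL);
	return is_current || !tsk->on_cpu;
}
```

A task with state TOY_TASK_RUNNING and on_cpu == 0 is rejected by the first helper and accepted by the second. Neither helper makes the walk safe on its own; as noted in the quoted review, the caller still has to keep the task from starting to run during the walk.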
>>
>> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
>> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
>> ---
>> kernel/trace/rethook.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
>> index 5a8bdf88999a..bd5e5f455e85 100644
>> --- a/kernel/trace/rethook.c
>> +++ b/kernel/trace/rethook.c
>> @@ -250,7 +250,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>> if (WARN_ON_ONCE(!cur))
>> return 0;
>>
>> - if (tsk != current && task_is_running(tsk))
>> + if (tsk != current && tsk->on_cpu)
>> return 0;
>>
>> do {
>> --
>> 2.34.1
>>
>>
>
>
* Re: [PATCH v2] selftests/ftrace: Fix trace_marker_raw test on 64K page kernels
From: Tianchen Ding @ 2026-05-29 2:59 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, Shuah Khan, linux-kernel,
linux-trace-kernel, linux-kselftest
In-Reply-To: <20260528091348.71ae3aa3@fedora>
On 5/28/26 9:13 PM, Steven Rostedt wrote:
> On Thu, 28 May 2026 10:24:17 +0800
> Tianchen Ding <dtcccc@linux.alibaba.com> wrote:
>
>>
>> diff --git a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
>> index 8e905d4fe6dd..f68f1901f65f 100644
>> --- a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
>> +++ b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
>> @@ -43,8 +43,11 @@ write_buffer() {
>> id=$1
>> size=$2
>>
>> - # write the string into the raw marker
>> - make_str $id $size > trace_marker_raw
>> + # Pipe through dd to ensure a single atomic write() syscall
>> + # on architectures with 64K pages, where shell's printf builtin
>> + # uses stdio buffering which may split the output into multiple
>> + # writes.
>> + make_str $id $size | dd of=trace_marker_raw bs=`expr $size + 4` iflag=fullblock
>
> I was looking at this more, and I'm not comfortable with the hard coded
> 4 above. I'd rather use the length of the string. Something like:
>
> str=`make_str $id $size`
> len=${#str}
> echo "$str" | dd of=trace_marker_raw bs=$len iflag=fullblock
>
> -- Steve
>
Capturing make_str output into a shell variable doesn't work because make_str
outputs raw binary that may contain NUL bytes, and shell command substitution
silently strips them.
However, the val variable inside make_str doesn't hold actual NUL bytes — it
holds the text of escape sequences (e.g., the literal characters
\003\000\000\000). The binary conversion only happens at the final printf
"${val}${data}".
We can take advantage of this by having make_str return the escape-sequence text
instead of binary, and letting write_buffer handle the conversion:
make_str() {
...
printf '%s' "${val}${data}"
}
write_buffer() {
id=$1
size=$2
str=`make_str $id $size`
len=$(printf "$str" | wc -c)
printf "$str" | dd of=trace_marker_raw bs=$len iflag=fullblock
}
This way str holds only printable escape-sequence text (no NUL), printf "$str"
converts it to real binary through the pipe, and wc -c measures the true binary
length.
>> }
>>
>>
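The two requirements in this exchange (the payload goes out in a single write(), and the length is derived from the data rather than hard-coded) can also be shown outside the shell. The C sketch below is not part of the selftest. It assumes a payload layout of a 4-byte id in native byte order followed by filler bytes, inferred from the "$size + 4" and "\003\000\000\000" examples above; write_raw_marker() is a made-up name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Build [4-byte id][size filler bytes] in memory and hand it to the kernel
 * with exactly one write() call. Returns what write() returned, or -1 if
 * the buffer could not be allocated.
 */
static ssize_t write_raw_marker(int fd, uint32_t id, size_t size)
{
	size_t len = sizeof(id) + size;	/* length comes from the data itself */
	unsigned char *buf = malloc(len);
	ssize_t ret;

	assert(fd >= 0);
	if (!buf)
		return -1;

	memcpy(buf, &id, sizeof(id));		/* may contain NUL bytes */
	memset(buf + sizeof(id), 'X', size);	/* filler payload */

	ret = write(fd, buf, len);		/* one syscall, no stdio buffering */
	free(buf);
	return ret;
}
```

Because the buffer never passes through a C string or a shell variable, embedded NUL bytes are not a problem here, and there is no buffering layer that could split the output into several writes.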
* [PATCH] ring-buffer: Better comment the use of RB_MISSED_EVENTS
From: Steven Rostedt @ 2026-05-29 2:37 UTC (permalink / raw)
To: LKML, Linux trace kernel; +Cc: Masami Hiramatsu, Mathieu Desnoyers
From: Steven Rostedt <rostedt@goodmis.org>
If the persistent ring buffer is detected on boot up to have a corrupted
sub-buffer, that sub-buffer is cleared to zero and its commit value has
the RB_MISSED_EVENTS bit set. That bit lets the "trace", "trace_pipe"
and "trace_pipe_raw" files report that events were dropped by
outputting "[LOST EVENTS]".
Only in this case does that bit get set in the writeable portion of the
ring buffer. When events are dropped in the normal ring buffer, that
information is stored in the cpu_buffer descriptor and the
RB_MISSED_EVENTS is set in the buffer page at the time the page is
consumed. It is never set in the writeable portion of the buffer.
Add comments to describe this better as it can be confusing to know when
RB_MISSED_EVENTS is set in the commit portion of the buffer page.
Link: https://lore.kernel.org/all/20260529001500.14178455a046a5cbc6180861@kernel.org/
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 910f6b3adf74..06fb365bb86e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1929,6 +1929,12 @@ static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
*/
if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
local_set(&bpage->entries, 0);
+ /*
+ * Note, the RB_MISSED_EVENTS is only set inside the main write
+ * buffer by this verification logic. The normal ring buffer
+ * has this bit set when the page is read and passed to the
+ * consumers.
+ */
local_set(&dpage->commit, RB_MISSED_EVENTS);
dpage->time_stamp = prev_ts ? prev_ts : next_ts;
ret = -1;
@@ -7232,6 +7238,14 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
local_add(RB_MISSED_STORED, &dpage->commit);
size += sizeof(missed_events);
}
+ /*
+ * Note, for the persistent ring buffer, the RB_MISSED_EVENTS
+ * may have been set in the main buffer via the verification code.
+ * But here, dpage is a copy of that page and has not yet had
+ * the RB_MISSED_EVENTS set. As for the normal buffers,
+ * the main write buffer does not set these bits and it needs
+ * to be set here.
+ */
local_add(RB_MISSED_EVENTS, &dpage->commit);
}
--
2.53.0
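For readers following along without the source, a small user-space model of a commit word that carries flag bits above the data size. The bit positions and all TOY_* and toy_* names are assumptions chosen for illustration; the authoritative definitions are in kernel/trace/ring_buffer.c. The model uses a bitwise OR where the patched code uses local_add().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; see ring_buffer.c for the real ones. */
#define TOY_MISSED_EVENTS	(1u << 31)
#define TOY_MISSED_STORED	(1u << 30)
#define TOY_MISSED_MASK		(TOY_MISSED_EVENTS | TOY_MISSED_STORED)

/* Size of the data on the page, with the flag bits stripped. */
static uint32_t toy_commit_size(uint32_t commit)
{
	return commit & ~TOY_MISSED_MASK;
}

static bool toy_commit_lost_events(uint32_t commit)
{
	return (commit & TOY_MISSED_EVENTS) != 0;
}

/* Boot-time validation path: a corrupted sub-buffer is emptied and flagged. */
static uint32_t toy_commit_mark_corrupted(void)
{
	return TOY_MISSED_EVENTS;	/* zero bytes of data, flag set */
}

/* Read path: the flag is added to the copy handed to the consumer. */
static uint32_t toy_commit_mark_on_read(uint32_t commit, bool missed)
{
	assert(toy_commit_size(commit) == commit);	/* copy carries no flags yet */
	return missed ? (commit | TOY_MISSED_EVENTS) : commit;
}
```

The two marking helpers correspond to the two places the new comments describe: one sets the bit in the main buffer during validation, the other sets it only on the page copy given to readers.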
* Re: [PATCH] tracing/probes: Point the error offset correctly for eprobe argument error
From: Steven Rostedt @ 2026-05-29 2:29 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Shuah Khan, Mathieu Desnoyers, linux-kernel, linux-trace-kernel,
linux-kselftest
In-Reply-To: <177967567399.209006.1451571244515632097.stgit@devnote2>
On Mon, 25 May 2026 11:21:14 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
> Fix to point the error offset correctly for eprobe argument error.
> In the cleanup commit 1b8b0cd754cd ("tracing/probes: Move event parameter
> fetching code to common parser"), due to incorrect backward compatibility
> aimed at conforming to the test specifications, the error location was set
> to 0 when a non-existent formal parameter was specified for Eprobe.
> However, this should be corrected in both the test and the implementation
> to point to the correct error position.
>
> Fixes: 1b8b0cd754cd ("tracing/probes: Move event parameter fetching code to common parser")
> Cc: stable@vger.kernel.org
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
-- Steve
* Re: [PATCH v2] unwind: Add sframe_(un)register() system calls
From: Steven Rostedt @ 2026-05-29 2:20 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
Mathieu Desnoyers, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
Vasily Gorbik, Thomas Weißschuh
In-Reply-To: <CAEf4BzZh4qyPiMbpZPeVGx+HFNjBjAHTsNOx5wE7RWidM-iphA@mail.gmail.com>
On Thu, 28 May 2026 16:01:06 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> [...]
>
> > * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index a627acc8fb5f..17042d7e5e87 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
> > #define __NR_rseq_slice_yield 471
> > __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
> >
> > +#define __NR_sframe_register 472
> > +__SYSCALL(__NR_sframe_register, sys_sframe_register)
> > +#define __NR_sframe_unregister 473
> > +__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
> > +
> > #undef __NR_syscalls
> > -#define __NR_syscalls 472
> > +#define __NR_syscalls 474
> >
> > /*
> > * 32 bit systems traditionally used different
> > diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
> > new file mode 100644
> > index 000000000000..d3c9f88b024b
> > --- /dev/null
> > +++ b/include/uapi/linux/sframe.h
> > @@ -0,0 +1,12 @@
> > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_SFRAME_H
> > +#define _UAPI_LINUX_SFRAME_H
> > +
> > +struct sframe_setup {
>
> I'd add `u64 flags;` field for easier and nicer extensibility. Check
> in the kernel that it is set to zero, future kernels will allow some
> of the bits to be set.
That sounds reasonable.
>
> And I still think that prctl() instead of a separate sframe-specific
> syscall is the way to go. I see no reason for sframe-specific set of
> syscalls just to set a bit of extra metadata for the entire process.
> That seems to be the job of prctl().
I personally do not have a preference. I've just heard a lot from
others that they want to avoid extending an ioctl()-like system call
or even creating a new multiplexer syscall.
If we can get a consensus on either using prctl() or adding a separate
system call, I'll go with whatever that is.
>
> > + __u64 sframe_start;
> > + __u64 sframe_size;
> > + __u64 text_start;
> > + __u64 text_size;
> > +};
> > +
>
> [...]
>
> > +
> > +/**
> > + * sys_sframe_register - register an address for user space stacktrace walking.
> > + * @data: Structure of sframe data used to register the sframe section
> > + * @size: The size of the given structure.
> > + *
> > + * This system call is used by dynamic library utilities to inform the kernel
> > + * of meta data that it loaded that can be used by the kernel to know how
> > + * to stack walk the given text locations.
> > + *
> > + * Return: 0 if successful, otherwise a negative error.
> > + */
> > +SYSCALL_DEFINE2(sframe_register, struct sframe_setup __user *, data, size_t, size)
> > +{
> > + struct sframe_setup sframe;
> > +
> > + if (sizeof(sframe) != size)
> > + return -EINVAL;
>
> This seems overly aggressive. It seems like the pattern is to allow
> sizes both smaller and bigger:
> - if user-provided size is smaller than what kernel knows about,
> treat missing fields as zeroes
Well, that could work with unregister, but for register that isn't
quite useful, as all fields should be filled (well, if we add flags,
that may not be 100% true).
> - if user-provided size is bigger, then check that space after
> fields that kernel recognizes are all zeroes.
That is dangerous. A zero with greater size could mean something. If
the size is greater than expected it should simply fail and let user
space call it again with the older version.
>
> This allows extensibility without having to change user space code all
> the time. Old code will provide smaller struct without new (presumably
> optional) fields, while newer code can use newer and larger struct
> size, but as long as it clears extra fields old kernel will be fine
> with that.
The old size will always work, thus old code will always continue to
work. If we extend the system call, then it must handle both the older
size as well as the newer size. User space would not need to change. It
would only change if it wanted to use a new feature, and if it wants to
work with older kernels it would need to try the bigger size first and
if that fails, it knows the kernel doesn't support that new feature and
then user space can figure out what to do. Either use the old system
call or abort.
-- Steve
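A user-space sketch of the "try the bigger size first, fall back to the old one" flow described above. Nothing here calls the proposed system call: the kernels are mocked by plain functions, the v2 layout with a trailing flags field is purely hypothetical, and all toy_* names are invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The layout from the patch under discussion. */
struct toy_sframe_setup_v1 {
	uint64_t sframe_start, sframe_size, text_start, text_size;
};

/* Hypothetical future layout: v1 fields first, one new field appended. */
struct toy_sframe_setup_v2 {
	uint64_t sframe_start, sframe_size, text_start, text_size;
	uint64_t flags;
};

_Static_assert(offsetof(struct toy_sframe_setup_v2, flags) ==
	       sizeof(struct toy_sframe_setup_v1), "v2 must extend v1");

typedef int (*toy_register_fn)(const void *data, size_t size);

/* Mock of a kernel that only knows v1: any other size is rejected. */
static int toy_old_kernel_register(const void *data, size_t size)
{
	assert(data != NULL);
	return size == sizeof(struct toy_sframe_setup_v1) ? 0 : -EINVAL;
}

/* Mock of a later kernel that accepts both sizes. */
static int toy_new_kernel_register(const void *data, size_t size)
{
	assert(data != NULL);
	if (size == sizeof(struct toy_sframe_setup_v1) ||
	    size == sizeof(struct toy_sframe_setup_v2))
		return 0;
	return -EINVAL;
}

/* Returns the structure size the kernel accepted, or a negative error. */
static long toy_register_with_fallback(toy_register_fn reg,
				       const struct toy_sframe_setup_v2 *want)
{
	struct toy_sframe_setup_v1 v1;
	int ret;

	ret = reg(want, sizeof(*want));		/* newest layout first */
	if (ret == 0)
		return (long)sizeof(*want);
	if (ret != -EINVAL)
		return ret;			/* a real error, not "unknown size" */

	memcpy(&v1, want, sizeof(v1));		/* drop the fields v1 lacks */
	ret = reg(&v1, sizeof(v1));
	return ret ? ret : (long)sizeof(v1);
}
```

The caller learns from the return value which layout was accepted, so it can decide whether losing the new field is acceptable or whether it should abort.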
>
> > +
> > + if (copy_from_user(&sframe, data, size))
> > + return -EFAULT;
> > +
> > + return sframe_add_section(sframe.sframe_start,
> > + sframe.sframe_start + sframe.sframe_size,
> > + sframe.text_start,
> > + sframe.text_start + sframe.text_size);
> > +}
> > +
>
> [...]
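For comparison, a sketch of the size-tolerant convention proposed in the quoted review: a shorter user structure is zero-extended, and a longer one is accepted only if the bytes the receiver does not know about are all zero. The kernel helper copy_struct_from_user() follows this convention; the function below is a plain user-space imitation with an invented name, not that helper. Note that the reply above argues against accepting larger sizes for this system call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a caller-provided structure of usize bytes into a receiver
 * structure of ksize bytes.
 *  - usize < ksize: missing trailing fields read as zero.
 *  - usize > ksize: the unknown tail must be all zero, else -E2BIG.
 */
static int toy_copy_struct(void *dst, size_t ksize,
			   const void *src, size_t usize)
{
	const unsigned char *p = src;
	size_t n = usize < ksize ? usize : ksize;

	assert(dst != NULL && src != NULL);

	/* Reject non-zero data in fields this receiver does not understand. */
	for (size_t i = ksize; i < usize; i++) {
		if (p[i])
			return -E2BIG;
	}

	memset(dst, 0, ksize);	/* zero-fill anything the caller left out */
	memcpy(dst, src, n);
	return 0;
}
```

With this scheme an old caller keeps working against a newer receiver, and a new caller works against an older receiver as long as it leaves the new fields at zero. The objection raised in the reply is that a zero in a new field might itself carry meaning.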
* Re: [PATCH v2] unwind: Add sframe_(un)register() system calls
From: Andrii Nakryiko @ 2026-05-28 23:01 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
Mathieu Desnoyers, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
Vasily Gorbik, Thomas Weißschuh
In-Reply-To: <20260528151023.00f5ec4e@gandalf.local.home>
On Thu, May 28, 2026 at 12:09 PM Steven Rostedt <rostedt@kernel.org> wrote:
>
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Add system calls to register and unregister sframes that can be used by
> dynamic linkers to tell the kernel where the sframe section is in memory
> for libraries it loads.
>
> Both system calls take a pointer to a new structure:
>
> struct sframe_setup {
> __u64 sframe_start;
> __u64 sframe_size;
> __u64 text_start;
> __u64 text_size;
> };
>
> and a size of the passed in structure. If the system call needs to be
> extended, then the structure could be changed and the size of that
> structure will tell the kernel that it is the new version. If the kernel
> does not recognize the structure size, it will return -EINVAL.
>
> sframe_start - The virtual address of the sframe section
> sframe_size - The length of the sframe section
> text_start - the text section the sframe represents
> text_size - the length of the text section
>
> If other stack tracing functionality is added, it will require a new
> system call.
>
> The unregister only needs the sframe_start and requires all the rest of
> the fields to be 0. In the future, if more can be done, then user space
> can update the other values and check the return code to see if the kernel
> supports it.
>
> Also added a DEFINE_GUARD() for mmap_write_lock. There was one for
> mmap_read_lock but not for mmap_write_lock.
>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>
> Changes since v1: https://patch.msgid.link/20260521183532.7a145c8a@gandalf.local.home
>
> - Use mmap_write_lock() instead of mmap_read_lock() for mutual
> exclusiveness. (Jens Remus)
>
> - Guard mtree_insert_range() with mmap_write_lock. (Jens Remus)
>
> - Added a guard for mmap_write_lock() similar to the one for mmap_read_lock.
>
> - Have syscall prototype use structure pointer instead of void (Thomas Weißschuh)
>
> - Use __u64 instead of unsigned long for struct members (Thomas Weißschuh)
>
> - Use size_t instead of int for structure size in syscall argument.
> (Thomas Weißschuh)
>
> arch/alpha/kernel/syscalls/syscall.tbl | 2 +
> arch/arm/tools/syscall.tbl | 2 +
> arch/arm64/tools/syscall_32.tbl | 2 +
> arch/m68k/kernel/syscalls/syscall.tbl | 2 +
> arch/microblaze/kernel/syscalls/syscall.tbl | 2 +
> arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +
> arch/mips/kernel/syscalls/syscall_n64.tbl | 2 +
> arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +
> arch/parisc/kernel/syscalls/syscall.tbl | 2 +
> arch/powerpc/kernel/syscalls/syscall.tbl | 2 +
> arch/s390/kernel/syscalls/syscall.tbl | 3 +
> arch/sh/kernel/syscalls/syscall.tbl | 2 +
> arch/sparc/kernel/syscalls/syscall.tbl | 2 +
> arch/x86/entry/syscalls/syscall_32.tbl | 2 +
> arch/x86/entry/syscalls/syscall_64.tbl | 2 +
> arch/xtensa/kernel/syscalls/syscall.tbl | 2 +
> include/linux/mmap_lock.h | 3 +
> include/linux/syscalls.h | 3 +
> include/uapi/asm-generic/unistd.h | 7 ++-
> include/uapi/linux/sframe.h | 12 ++++
> kernel/sys_ni.c | 3 +
> kernel/unwind/sframe.c | 69 +++++++++++++++++++--
> scripts/syscall.tbl | 2 +
> 23 files changed, 126 insertions(+), 6 deletions(-)
> create mode 100644 include/uapi/linux/sframe.h
>
[...]
> * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index a627acc8fb5f..17042d7e5e87 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
> #define __NR_rseq_slice_yield 471
> __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
>
> +#define __NR_sframe_register 472
> +__SYSCALL(__NR_sframe_register, sys_sframe_register)
> +#define __NR_sframe_unregister 473
> +__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
> +
> #undef __NR_syscalls
> -#define __NR_syscalls 472
> +#define __NR_syscalls 474
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
> new file mode 100644
> index 000000000000..d3c9f88b024b
> --- /dev/null
> +++ b/include/uapi/linux/sframe.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_SFRAME_H
> +#define _UAPI_LINUX_SFRAME_H
> +
> +struct sframe_setup {
I'd add a `u64 flags;` field for easier and nicer extensibility. Check
in the kernel that it is set to zero; future kernels can then allow some
of the bits to be set.
And I still think that prctl() instead of a separate sframe-specific
syscall is the way to go. I see no reason for an sframe-specific set of
syscalls just to set a bit of extra metadata for the entire process.
That seems to be the job of prctl().
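The flags check suggested above could look roughly like the following user-space sketch. The struct name `sframe_setup_v2`, the position of `flags`, the `SFRAME_KNOWN_FLAGS` mask and the helper `sframe_check_flags()` are all hypothetical; none of them appear in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the patch's four fields plus a flags word. */
struct sframe_setup_v2 {
	uint64_t flags;
	uint64_t sframe_start;
	uint64_t sframe_size;
	uint64_t text_start;
	uint64_t text_size;
};

/* No flag bits are defined yet, so every set bit is unknown. */
#define SFRAME_KNOWN_FLAGS 0ULL

/* Reject any flag bit this version does not understand. */
int sframe_check_flags(const struct sframe_setup_v2 *s)
{
	assert(s != NULL);

	if (s->flags & ~SFRAME_KNOWN_FLAGS)
		return -EINVAL;
	return 0;
}
```

A later version would only have to widen `SFRAME_KNOWN_FLAGS`; callers that pass zero keep working unchanged.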
> + __u64 sframe_start;
> + __u64 sframe_size;
> + __u64 text_start;
> + __u64 text_size;
> +};
> +
[...]
> +
> +/**
> + * sys_sframe_register - register an address for user space stacktrace walking.
> + * @data: Structure of sframe data used to register the sframe section
> + * @size: The size of the given structure.
> + *
> + * This system call is used by dynamic library utilities to inform the kernel
> + * of meta data that it loaded that can be used by the kernel to know how
> + * to stack walk the given text locations.
> + *
> + * Return: 0 if successful, otherwise a negative error.
> + */
> +SYSCALL_DEFINE2(sframe_register, struct sframe_setup __user *, data, size_t, size)
> +{
> + struct sframe_setup sframe;
> +
> + if (sizeof(sframe) != size)
> + return -EINVAL;
This seems overly aggressive. The usual pattern is to accept sizes both
smaller and bigger than what the kernel knows:
- if the user-provided size is smaller than what the kernel knows about,
treat the missing fields as zeroes
- if the user-provided size is bigger, check that the space after the
fields the kernel recognizes is all zeroes.
This allows extensibility without having to change user space code all
the time. Old code will provide a smaller struct without the new (presumably
optional) fields, while newer code can use a newer and larger struct
size, and as long as it clears the extra fields an old kernel will be fine
with that.
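The two size rules above can be sketched in user space as follows. In the kernel the existing helper copy_struct_from_user() implements this pattern; here memcpy() stands in for the user copy, the struct uses uint64_t in place of __u64, and the name `copy_struct_compat()` is made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sframe_setup {
	uint64_t sframe_start;
	uint64_t sframe_size;
	uint64_t text_start;
	uint64_t text_size;
};

/*
 * Copy a caller-supplied struct of usize bytes into a ksize-byte struct.
 * Returns 0 on success, -E2BIG if the caller set bytes we do not know.
 */
int copy_struct_compat(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t n = usize < ksize ? usize : ksize;
	const unsigned char *tail;
	size_t i;

	assert(dst != NULL && src != NULL);
	tail = (const unsigned char *)src + n;

	/* Bigger struct from newer user space: the unknown tail must be zero. */
	for (i = 0; i < usize - n; i++)
		if (tail[i])
			return -E2BIG;

	/* Smaller struct from older user space: missing fields read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, n);
	return 0;
}
```

With this shape the `sizeof(sframe) != size` check disappears, and only a size with non-zero unknown bytes is rejected.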
> +
> + if (copy_from_user(&sframe, data, size))
> + return -EFAULT;
> +
> + return sframe_add_section(sframe.sframe_start,
> + sframe.sframe_start + sframe.sframe_size,
> + sframe.text_start,
> + sframe.text_start + sframe.text_size);
> +}
> +
[...]
* Re: [PATCH] tracing: fix CFI violation in probestub helper
From: Steven Rostedt @ 2026-05-28 20:49 UTC
To: Eva Kurchatova
Cc: mhiramat, linux-trace-kernel, linux-kernel, mathieu.desnoyers,
peterz, jpoimboe, samitolvanen
In-Reply-To: <20260524154301.21119-1-eva.kurchatova@virtuozzo.com>
On Sun, 24 May 2026 18:43:01 +0300
Eva Kurchatova <eva.kurchatova@virtuozzo.com> wrote:
> When multiple callbacks are registered on the same tracepoint, probestub
> will be indirectly called via traceiter helper.
>
> Pointer to probestub callback resides in __tracepoints section, which is
> excluded from ENDBR checks in objtool. Pointers to regfunc/unregfunc
> callbacks reside in extended structure however, which is not affected.
>
> Registering multiple callbacks will result in a #CP exception due to
> missed ENDBR in __probestub helper on a CFI-enabled machine.
>
> Fix this by adding CFI_NOSEAL annotation to probestub declaration.
>
> Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks")
> Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
Wait! The probestub is not in the __tracepoints section. At least it
shouldn't be. Are you sure there's not another issue here?
#define __DEFINE_TRACE_EXT(_name, _ext, proto, args) \
static const char __tpstrtab_##_name[] \
__section("__tracepoints_strings") = #_name; \
extern struct static_call_key STATIC_CALL_KEY(tp_func_##_name); \
int __traceiter_##_name(void *__data, proto); \
void __probestub_##_name(void *__data, proto); \
struct tracepoint __tracepoint_##_name __used \
__section("__tracepoints") = { \
Here the structure __tracepoint_##_name is in the __tracepoints section.
.name = __tpstrtab_##_name, \
.key = STATIC_KEY_FALSE_INIT, \
.static_call_key = &STATIC_CALL_KEY(tp_func_##_name), \
.static_call_tramp = STATIC_CALL_TRAMP_ADDR(tp_func_##_name), \
.iterator = &__traceiter_##_name, \
.probestub = &__probestub_##_name, \
.funcs = NULL, \
.ext = _ext, \
}; \
__TRACEPOINT_ENTRY(_name); \
int __traceiter_##_name(void *__data, proto) \
{ \
struct tracepoint_func *it_func_ptr; \
void *it_func; \
\
it_func_ptr = \
rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
if (it_func_ptr) { \
do { \
it_func = READ_ONCE((it_func_ptr)->func); \
__data = (it_func_ptr)->data; \
((void(*)(void *, proto))(it_func))(__data, args); \
} while ((++it_func_ptr)->func); \
} \
return 0; \
} \
void __probestub_##_name(void *__data, proto) \
{ \
}
But above, probestub is just a function defined wherever the tracepoint is
created.
In fact, it's just there for fprobes to work. It doesn't get called if you
add more than one callback to the tracepoint. So your explanation is totally
bogus.
Do you actually see a crash? Or is this just some AI slop that told you
this is a bug?
-- Steve
> ---
> include/linux/tracepoint.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 583d962abcc3..5a32a709759c 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -19,6 +19,7 @@
> #include <linux/rcupdate.h>
> #include <linux/tracepoint-defs.h>
> #include <linux/static_call.h>
> +#include <asm/cfi.h>
>
> struct module;
> struct tracepoint;
> @@ -356,6 +357,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> void __probestub_##_name(void *__data, proto) \
> { \
> } \
> + CFI_NOSEAL(__probestub_##_name); \
> DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);
>
> #define DEFINE_TRACE_FN(_name, _reg, _unreg, _proto, _args) \
* Re: [RFC PATCH 2/2] tracing: Record and show boot ID in last_boot_info
From: Steven Rostedt @ 2026-05-28 20:36 UTC
To: Masami Hiramatsu (Google)
Cc: Theodore Ts'o, Jason A . Donenfeld, Mathieu Desnoyers,
linux-kernel, linux-trace-kernel
In-Reply-To: <20260524104439.ec01284998cae6d4a5053e61@kernel.org>
On Sun, 24 May 2026 10:44:39 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > If the get_boot_id() is accepted by the random folks, then I'm fine with
> > this change.
>
> Yeah, BTW, Sashiko found this can be initialized before we get enough
> entropy for random seed. Maybe we need one more delay.
Well, maybe for adding the boot_id later, but the code that initializes the
buffers needs to stay early. With the backup instance, the persistent ring
buffer can restart tracing immediately.
-- Steve
* Re: [PATCHv2] trace: allocate fields with elt struct
From: Steven Rostedt @ 2026-05-28 20:22 UTC
To: Masami Hiramatsu (Google)
Cc: Rosen Penev, linux-trace-kernel, Mathieu Desnoyers,
open list:TRACING
In-Reply-To: <20260526134317.394c1cf3060e89df36662ecc@kernel.org>
On Tue, 26 May 2026 13:43:17 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > > #define DEFINE_TRACING_MAP_CMP_FN(type) \
> > > -static int tracing_map_cmp_##type(void *val_a, void *val_b) \
> > > +static int tracing_map_cmp_##type(const void *val_a, const void *val_b) \
> > > { \
> > > - type a = (type)(*(u64 *)val_a); \
> > > - type b = (type)(*(u64 *)val_b); \
> > > + type a = (type)(*(const u64 *)val_a); \
> > > + type b = (type)(*(const u64 *)val_b); \
> > > \
> > > return (a > b) ? 1 : ((a < b) ? -1 : 0); \
> > > }
> > This is a pre-existing issue, but does unconditionally reading 8 bytes
> > via the u64 cast cause unaligned access exceptions on architectures that
> > do not support them?
> > Additionally, for fields near the end of the dynamically allocated elt->key
> > buffer, can this trigger KASAN slab-out-of-bounds reads?
> > Also, on big-endian architectures, reading a smaller integer as a 64-bit
> > value and casting it down extracts the least-significant bytes rather than
> > the correct field value. Could this result in completely incorrect sorting
> > for small types?
>
> Steve, it seems this comes from your commit 106f41f5a302 ("tracing: Have
> the histogram compare functions convert to u64 first").
>
> I think neither of them is a problem, but could you check it?
This should not be a problem because the pointer being passed in was a
number to begin with. In fact that commit you shared was to fix this
compare on big endian machines. The typecast was specifically made to allow
big endian to work here.
The value is already in an 8-byte (64-bit) memory location. It is copied into
it as a 64-bit number. Hence it has to be read as a 64-bit number for the
conversions.
A short would be copied into the location via:
u64 location;
location = short_word;
On big endian, for a short word of 0xabcd, it would be in the memory as:
0x00 0x00 0x00 0x00 0x00 0x00 0xab 0xcd
on little endian, it would be:
0xcd 0xab 0x00 0x00 0x00 0x00 0x00 0x00
For big endian to work, it would need to read that first into a 64 bit word
and then convert it back to short.
Thus, Sashiko doesn't know enough here to comment.
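As a small user-space illustration of the point above: the key is widened into a 64-bit slot when stored, so the compare has to read all 64 bits before narrowing. The helper names `store_short()` and `cmp_short()` are made up for this sketch; `cmp_short()` follows the shape of DEFINE_TRACING_MAP_CMP_FN(short) from the quoted hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a short key the way described above: widened into a u64 slot. */
void store_short(void *slot, short v)
{
	uint64_t wide = (uint64_t)v;

	assert(slot != NULL);
	memcpy(slot, &wide, sizeof(wide));
}

/* Read the full 64-bit slot first, then narrow back to short. */
int cmp_short(const void *val_a, const void *val_b)
{
	short a = (short)(*(const uint64_t *)val_a);
	short b = (short)(*(const uint64_t *)val_b);

	return (a > b) ? 1 : ((a < b) ? -1 : 0);
}
```

Reading the slot through a `short *` instead would pick up only the first two bytes, which hold the value on little endian but not on big endian.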
-- Steve
* [RESEND][PATCH v2] unwind: Add sframe_(un)register() system calls
From: Steven Rostedt @ 2026-05-28 19:16 UTC
To: LKML, Linux Trace Kernel, bpf
Cc: Masami Hiramatsu, Mathieu Desnoyers, Jens Remus, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Linus Torvalds, Andrew Morton,
Florian Weimer, Kees Cook, Carlos O'Donell, Sam James,
Dylan Hatch, Borislav Petkov, Dave Hansen, David Hildenbrand,
H. Peter Anvin, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
Heiko Carstens, Vasily Gorbik, Thomas Weißschuh
From: Steven Rostedt <rostedt@goodmis.org>
Add system calls to register and unregister sframes that can be used by
dynamic linkers to tell the kernel where the sframe section is in memory
for libraries it loads.
Both system calls take a pointer to a new structure:
struct sframe_setup {
__u64 sframe_start;
__u64 sframe_size;
__u64 text_start;
__u64 text_size;
};
and a size of the passed in structure. If the system call needs to be
extended, then the structure could be changed and the size of that
structure will tell the kernel that it is the new version. If the kernel
does not recognize the structure size, it will return -EINVAL.
sframe_start - The virtual address of the sframe section
sframe_size - The length of the sframe section
text_start - The start of the text section the sframe represents
text_size - The length of that text section
If other stack tracing functionality is added, it will require a new
system call.
The unregister only needs the sframe_start and requires all the rest of
the fields to be 0. In the future, if more can be done, then user space
can update the other values and check the return code to see if the kernel
supports it.
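From the caller's side, filling in the structure for the two calls might look like the sketch below. This is only an illustration: the syscall numbers are the ones proposed in this patch for most architectures (alpha differs), no released kernel has them, and the helper and wrapper names are invented here. The struct uses uint64_t in place of __u64.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Numbers as proposed in this patch; an assumption on any real system. */
#define NR_sframe_register	472
#define NR_sframe_unregister	473

struct sframe_setup {
	uint64_t sframe_start;
	uint64_t sframe_size;
	uint64_t text_start;
	uint64_t text_size;
};

/* Register needs all four fields. */
void sframe_fill_register(struct sframe_setup *s,
			  const void *sframe, size_t sframe_len,
			  const void *text, size_t text_len)
{
	assert(s != NULL);
	s->sframe_start = (uint64_t)(uintptr_t)sframe;
	s->sframe_size = sframe_len;
	s->text_start = (uint64_t)(uintptr_t)text;
	s->text_size = text_len;
}

/* Unregister uses only sframe_start; the other fields must stay zero. */
void sframe_fill_unregister(struct sframe_setup *s, const void *sframe)
{
	assert(s != NULL);
	memset(s, 0, sizeof(*s));
	s->sframe_start = (uint64_t)(uintptr_t)sframe;
}

/* Hypothetical wrappers; without the patch these will likely fail (ENOSYS). */
long sframe_register(const struct sframe_setup *s)
{
	return syscall(NR_sframe_register, s, sizeof(*s));
}

long sframe_unregister(const struct sframe_setup *s)
{
	return syscall(NR_sframe_unregister, s, sizeof(*s));
}
```

A dynamic linker would call the register wrapper after mapping a library and the unregister wrapper before unmapping it.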
Also added a DEFINE_GUARD() for mmap_write_lock. There was one for
mmap_read_lock but not for mmap_write_lock.
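For readers unfamiliar with guards: the kernel's DEFINE_GUARD() builds on the compiler's cleanup attribute, which runs a function when a variable goes out of scope. Below is a user-space sketch of the same idea with a toy lock. All names here (`toy_lock`, `TOY_GUARD`, `guarded_update()`) are invented for illustration, and it needs GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

/* A toy lock that only records whether it is held. */
struct toy_lock {
	int held;
};

void toy_lock_acquire(struct toy_lock *l)
{
	assert(l != NULL && !l->held);
	l->held = 1;
}

void toy_lock_release(struct toy_lock *l)
{
	assert(l != NULL && l->held);
	l->held = 0;
}

/* Called automatically with a pointer to the guard variable. */
void toy_guard_release(struct toy_lock **l)
{
	toy_lock_release(*l);
}

/* Take the lock now; drop it when the guard variable leaves scope. */
#define TOY_GUARD(name, lock)						\
	struct toy_lock *name						\
		__attribute__((cleanup(toy_guard_release))) = (lock);	\
	toy_lock_acquire(name)

/* Early returns need no explicit unlock; the guard handles both paths. */
int guarded_update(struct toy_lock *lock, int *counter, int delta)
{
	TOY_GUARD(g, lock);

	if (delta < 0)
		return -1;
	*counter += delta;
	return 0;
}
```

This is why the `return -EINVAL` inside `scoped_guard(mmap_write_lock, mm)` in the diff below does not leak the lock.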
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
[ Resend with Indu's current email address. ]
Changes since v1: https://patch.msgid.link/20260521183532.7a145c8a@gandalf.local.home
- Use mmap_write_lock() instead of mmap_read_lock() for mutual
exclusiveness. (Jens Remus)
- Guard mtree_insert_range() with mmap_write_lock. (Jens Remus)
- Added a guard for mmap_write_lock() similar to the one for mmap_read_lock.
- Have syscall prototype use structure pointer instead of void (Thomas Weißschuh)
- Use __u64 instead of unsigned long for struct members (Thomas Weißschuh)
- Use size_t instead of int for structure size in syscall argument.
(Thomas Weißschuh)
arch/alpha/kernel/syscalls/syscall.tbl | 2 +
arch/arm/tools/syscall.tbl | 2 +
arch/arm64/tools/syscall_32.tbl | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 2 +
arch/microblaze/kernel/syscalls/syscall.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 2 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +
arch/parisc/kernel/syscalls/syscall.tbl | 2 +
arch/powerpc/kernel/syscalls/syscall.tbl | 2 +
arch/s390/kernel/syscalls/syscall.tbl | 3 +
arch/sh/kernel/syscalls/syscall.tbl | 2 +
arch/sparc/kernel/syscalls/syscall.tbl | 2 +
arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
arch/xtensa/kernel/syscalls/syscall.tbl | 2 +
include/linux/mmap_lock.h | 3 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 7 ++-
include/uapi/linux/sframe.h | 12 ++++
kernel/sys_ni.c | 3 +
kernel/unwind/sframe.c | 69 +++++++++++++++++++--
scripts/syscall.tbl | 2 +
23 files changed, 126 insertions(+), 6 deletions(-)
create mode 100644 include/uapi/linux/sframe.h
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index f31b7afffc34..f0639b831f2a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -511,3 +511,5 @@
579 common file_setattr sys_file_setattr
580 common listns sys_listns
581 common rseq_slice_yield sys_rseq_slice_yield
+582 common sframe_register sys_sframe_register
+583 common sframe_unregister sys_sframe_unregister
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 94351e22bfcf..887b242ffb25 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -486,3 +486,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
index 62d93d88e0fe..c820f1ff718c 100644
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -483,3 +483,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 248934257101..4c7f17f0364b 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -471,3 +471,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 223d26303627..e8dc2cc149f4 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -477,3 +477,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 7430714e2b8f..d0bae05d16af 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -410,3 +410,5 @@
469 n32 file_setattr sys_file_setattr
470 n32 listns sys_listns
471 n32 rseq_slice_yield sys_rseq_slice_yield
+472 n32 sframe_register sys_sframe_register
+473 n32 sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 630aab9e5425..2e200de6a58c 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -386,3 +386,5 @@
469 n64 file_setattr sys_file_setattr
470 n64 listns sys_listns
471 n64 rseq_slice_yield sys_rseq_slice_yield
+472 n64 sframe_register sys_sframe_register
+473 n64 sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 128653112284..0e3b82011ae2 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -459,3 +459,5 @@
469 o32 file_setattr sys_file_setattr
470 o32 listns sys_listns
471 o32 rseq_slice_yield sys_rseq_slice_yield
+472 o32 sframe_register sys_sframe_register
+473 o32 sframe_unregister sys_sframe_unregister
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c6331dad9461..e0758ef8667d 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -470,3 +470,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4fcc7c58a105..eda40c4f4f2f 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -562,3 +562,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 nospu rseq_slice_yield sys_rseq_slice_yield
+472 nospu sframe_register sys_sframe_register
+473 nospu sframe_unregister sys_sframe_unregister
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 09a7ef04d979..52519e2acdc8 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -398,3 +398,6 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common stacktrace_setup sys_stacktrace_setup
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 70b315cbe710..62ac7b1b4dd4 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -475,3 +475,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 7e71bf7fcd14..f92273ae608a 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f832ebd2d79b..409a50df3b21 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -477,3 +477,5 @@
469 i386 file_setattr sys_file_setattr
470 i386 listns sys_listns
471 i386 rseq_slice_yield sys_rseq_slice_yield
+472 i386 sframe_register sys_sframe_register
+473 i386 sframe_unregister sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..9b7c5a449751 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,8 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index a9bca4e484de..037b8040f69d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -442,3 +442,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 04b8f61ece5d..6650c89a13ab 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -579,6 +579,9 @@ static inline void mmap_write_unlock(struct mm_struct *mm)
up_write(&mm->mmap_lock);
}
+DEFINE_GUARD(mmap_write_lock, struct mm_struct *,
+ mmap_write_lock(_T), mmap_write_unlock(_T))
+
static inline void mmap_write_downgrade(struct mm_struct *mm)
{
__mmap_lock_trace_acquire_returned(mm, false, true);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f5639d5ac331..ad3c8d6b6471 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -79,6 +79,7 @@ struct mnt_id_req;
struct ns_id_req;
struct xattr_args;
struct file_attr;
+struct sframe_setup;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -999,6 +1000,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
u32 size, u32 flags);
asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_sframe_register(struct sframe_setup *data, size_t size);
+asmlinkage long sys_sframe_unregister(struct sframe_setup *data, size_t size);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..17042d7e5e87 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
#define __NR_rseq_slice_yield 471
__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+#define __NR_sframe_register 472
+__SYSCALL(__NR_sframe_register, sys_sframe_register)
+#define __NR_sframe_unregister 473
+__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
+
#undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 474
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
new file mode 100644
index 000000000000..d3c9f88b024b
--- /dev/null
+++ b/include/uapi/linux/sframe.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_SFRAME_H
+#define _UAPI_LINUX_SFRAME_H
+
+struct sframe_setup {
+ __u64 sframe_start;
+ __u64 sframe_size;
+ __u64 text_start;
+ __u64 text_size;
+};
+
+#endif /* _UAPI_LINUX_SFRAME_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index add3032da16f..eca5293f5d40 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -394,3 +394,6 @@ COND_SYSCALL(rseq_slice_yield);
COND_SYSCALL(uretprobe);
COND_SYSCALL(uprobe);
+
+COND_SYSCALL(sframe_register);
+COND_SYSCALL(sframe_unregister);
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index db88d993dff1..84bd762a1080 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -12,8 +12,10 @@
#include <linux/mm.h>
#include <linux/string_helpers.h>
#include <linux/sframe.h>
+#include <linux/syscalls.h>
#include <asm/unwind_user_sframe.h>
#include <linux/unwind_user_types.h>
+#include <uapi/linux/sframe.h>
#include "sframe.h"
#include "sframe_debug.h"
@@ -817,8 +819,10 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
if (ret)
goto err_free;
- ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end - 1,
- sec, GFP_KERNEL_ACCOUNT);
+ scoped_guard(mmap_write_lock, mm) {
+ ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end - 1,
+ sec, GFP_KERNEL_ACCOUNT);
+ }
if (ret) {
dbg_sec("mtree_insert_range failed: text=%lx-%lx\n",
sec->text_start, sec->text_end);
@@ -842,9 +846,11 @@ static void sframe_free_srcu(struct rcu_head *rcu)
static int __sframe_remove_section(struct mm_struct *mm,
struct sframe_section *sec)
{
- if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
- dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
- return -EINVAL;
+ scoped_guard(mmap_write_lock, mm) {
+ if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
+ dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
+ return -EINVAL;
+ }
}
call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
@@ -936,3 +942,56 @@ void sframe_free_mm(struct mm_struct *mm)
mtree_destroy(&mm->sframe_mt);
}
+
+/**
+ * sys_sframe_register - register an address for user space stacktrace walking.
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * This system call is used by dynamic library utilities to inform the kernel
+ * of meta data that it loaded that can be used by the kernel to know how
+ * to stack walk the given text locations.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_register, struct sframe_setup __user *, data, size_t, size)
+{
+ struct sframe_setup sframe;
+
+ if (sizeof(sframe) != size)
+ return -EINVAL;
+
+ if (copy_from_user(&sframe, data, size))
+ return -EFAULT;
+
+ return sframe_add_section(sframe.sframe_start,
+ sframe.sframe_start + sframe.sframe_size,
+ sframe.text_start,
+ sframe.text_start + sframe.text_size);
+}
+
+/**
+ * sys_sframe_unregister - unregister an sframe address
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * The data->sframe_start is the only value that is used. The rest must
+ * be zero.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_unregister, struct sframe_setup __user *, data, size_t, size)
+{
+ struct sframe_setup sframe;
+
+ if (sizeof(sframe) != size)
+ return -EINVAL;
+
+ if (copy_from_user(&sframe, data, size))
+ return -EFAULT;
+
+ if (sframe.sframe_size || sframe.text_start || sframe.text_size)
+ return -EINVAL;
+
+ return sframe_remove_section(sframe.sframe_start);
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..46ec22b50042 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
--
2.53.0
* [PATCH v2] unwind: Add sframe_(un)register() system calls
From: Steven Rostedt @ 2026-05-28 19:10 UTC
To: LKML, Linux Trace Kernel, bpf
Cc: Masami Hiramatsu, Mathieu Desnoyers, Jens Remus, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Linus Torvalds, Andrew Morton,
Florian Weimer, Kees Cook, Carlos O'Donell, Sam James,
Dylan Hatch, Borislav Petkov, Dave Hansen, David Hildenbrand,
H. Peter Anvin, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
Heiko Carstens, Vasily Gorbik, Thomas Weißschuh
From: Steven Rostedt <rostedt@goodmis.org>
Add system calls to register and unregister sframes that can be used by
dynamic linkers to tell the kernel where the sframe section is in memory
for libraries it loads.
Both system calls take a pointer to a new structure:
struct sframe_setup {
__u64 sframe_start;
__u64 sframe_size;
__u64 text_start;
__u64 text_size;
};
and a size of the passed in structure. If the system call needs to be
extended, then the structure could be changed and the size of that
structure will tell the kernel that it is the new version. If the kernel
does not recognize the structure size, it will return -EINVAL.
sframe_start - The virtual address of the sframe section
sframe_size - The length of the sframe section
text_start - The start of the text section the sframe represents
text_size - The length of that text section
If other stack tracing functionality is added, it will require a new
system call.
The unregister only needs the sframe_start and requires all the rest of
the fields to be 0. In the future, if more can be done, then user space
can update the other values and check the return code to see if the kernel
supports it.
Also added a DEFINE_GUARD() for mmap_write_lock. There was one for
mmap_read_lock but not for mmap_write_lock.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
Changes since v1: https://patch.msgid.link/20260521183532.7a145c8a@gandalf.local.home
- Use mmap_write_lock() instead of mmap_read_lock() for mutual
exclusiveness. (Jens Remus)
- Guard mtree_insert_range() with mmap_write_lock. (Jens Remus)
- Added a guard for mmap_write_lock() similar to the one for mmap_read_lock.
- Have syscall prototype use structure pointer instead of void (Thomas Weißschuh)
- Use __u64 instead of unsigned long for struct members (Thomas Weißschuh)
- Use size_t instead of int for structure size in syscall argument.
(Thomas Weißschuh)
arch/alpha/kernel/syscalls/syscall.tbl | 2 +
arch/arm/tools/syscall.tbl | 2 +
arch/arm64/tools/syscall_32.tbl | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 2 +
arch/microblaze/kernel/syscalls/syscall.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 2 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +
arch/parisc/kernel/syscalls/syscall.tbl | 2 +
arch/powerpc/kernel/syscalls/syscall.tbl | 2 +
arch/s390/kernel/syscalls/syscall.tbl | 3 +
arch/sh/kernel/syscalls/syscall.tbl | 2 +
arch/sparc/kernel/syscalls/syscall.tbl | 2 +
arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
arch/xtensa/kernel/syscalls/syscall.tbl | 2 +
include/linux/mmap_lock.h | 3 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 7 ++-
include/uapi/linux/sframe.h | 12 ++++
kernel/sys_ni.c | 3 +
kernel/unwind/sframe.c | 69 +++++++++++++++++++--
scripts/syscall.tbl | 2 +
23 files changed, 126 insertions(+), 6 deletions(-)
create mode 100644 include/uapi/linux/sframe.h
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index f31b7afffc34..f0639b831f2a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -511,3 +511,5 @@
579 common file_setattr sys_file_setattr
580 common listns sys_listns
581 common rseq_slice_yield sys_rseq_slice_yield
+582 common sframe_register sys_sframe_register
+583 common sframe_unregister sys_sframe_unregister
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 94351e22bfcf..887b242ffb25 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -486,3 +486,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
index 62d93d88e0fe..c820f1ff718c 100644
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -483,3 +483,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 248934257101..4c7f17f0364b 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -471,3 +471,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 223d26303627..e8dc2cc149f4 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -477,3 +477,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 7430714e2b8f..d0bae05d16af 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -410,3 +410,5 @@
469 n32 file_setattr sys_file_setattr
470 n32 listns sys_listns
471 n32 rseq_slice_yield sys_rseq_slice_yield
+472 n32 sframe_register sys_sframe_register
+473 n32 sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 630aab9e5425..2e200de6a58c 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -386,3 +386,5 @@
469 n64 file_setattr sys_file_setattr
470 n64 listns sys_listns
471 n64 rseq_slice_yield sys_rseq_slice_yield
+472 n64 sframe_register sys_sframe_register
+473 n64 sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 128653112284..0e3b82011ae2 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -459,3 +459,5 @@
469 o32 file_setattr sys_file_setattr
470 o32 listns sys_listns
471 o32 rseq_slice_yield sys_rseq_slice_yield
+472 o32 sframe_register sys_sframe_register
+473 o32 sframe_unregister sys_sframe_unregister
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c6331dad9461..e0758ef8667d 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -470,3 +470,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4fcc7c58a105..eda40c4f4f2f 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -562,3 +562,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 nospu rseq_slice_yield sys_rseq_slice_yield
+472 nospu sframe_register sys_sframe_register
+473 nospu sframe_unregister sys_sframe_unregister
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 09a7ef04d979..52519e2acdc8 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -398,3 +398,6 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common stacktrace_setup sys_stacktrace_setup
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 70b315cbe710..62ac7b1b4dd4 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -475,3 +475,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 7e71bf7fcd14..f92273ae608a 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f832ebd2d79b..409a50df3b21 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -477,3 +477,5 @@
469 i386 file_setattr sys_file_setattr
470 i386 listns sys_listns
471 i386 rseq_slice_yield sys_rseq_slice_yield
+472 i386 sframe_register sys_sframe_register
+473 i386 sframe_unregister sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..9b7c5a449751 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,8 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index a9bca4e484de..037b8040f69d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -442,3 +442,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 04b8f61ece5d..6650c89a13ab 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -579,6 +579,9 @@ static inline void mmap_write_unlock(struct mm_struct *mm)
up_write(&mm->mmap_lock);
}
+DEFINE_GUARD(mmap_write_lock, struct mm_struct *,
+ mmap_write_lock(_T), mmap_write_unlock(_T))
+
static inline void mmap_write_downgrade(struct mm_struct *mm)
{
__mmap_lock_trace_acquire_returned(mm, false, true);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f5639d5ac331..ad3c8d6b6471 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -79,6 +79,7 @@ struct mnt_id_req;
struct ns_id_req;
struct xattr_args;
struct file_attr;
+struct sframe_setup;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -999,6 +1000,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
u32 size, u32 flags);
asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_sframe_register(struct sframe_setup __user *data, size_t size);
+asmlinkage long sys_sframe_unregister(struct sframe_setup __user *data, size_t size);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..17042d7e5e87 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
#define __NR_rseq_slice_yield 471
__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+#define __NR_sframe_register 472
+__SYSCALL(__NR_sframe_register, sys_sframe_register)
+#define __NR_sframe_unregister 473
+__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
+
#undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 474
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
new file mode 100644
index 000000000000..d3c9f88b024b
--- /dev/null
+++ b/include/uapi/linux/sframe.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_SFRAME_H
+#define _UAPI_LINUX_SFRAME_H
+
+struct sframe_setup {
+ __u64 sframe_start;
+ __u64 sframe_size;
+ __u64 text_start;
+ __u64 text_size;
+};
+
+#endif /* _UAPI_LINUX_SFRAME_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index add3032da16f..eca5293f5d40 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -394,3 +394,6 @@ COND_SYSCALL(rseq_slice_yield);
COND_SYSCALL(uretprobe);
COND_SYSCALL(uprobe);
+
+COND_SYSCALL(sframe_register);
+COND_SYSCALL(sframe_unregister);
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index db88d993dff1..84bd762a1080 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -12,8 +12,10 @@
#include <linux/mm.h>
#include <linux/string_helpers.h>
#include <linux/sframe.h>
+#include <linux/syscalls.h>
#include <asm/unwind_user_sframe.h>
#include <linux/unwind_user_types.h>
+#include <uapi/linux/sframe.h>
#include "sframe.h"
#include "sframe_debug.h"
@@ -817,8 +819,10 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
if (ret)
goto err_free;
- ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end - 1,
- sec, GFP_KERNEL_ACCOUNT);
+ scoped_guard(mmap_write_lock, mm) {
+ ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end - 1,
+ sec, GFP_KERNEL_ACCOUNT);
+ }
if (ret) {
dbg_sec("mtree_insert_range failed: text=%lx-%lx\n",
sec->text_start, sec->text_end);
@@ -842,9 +846,11 @@ static void sframe_free_srcu(struct rcu_head *rcu)
static int __sframe_remove_section(struct mm_struct *mm,
struct sframe_section *sec)
{
- if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
- dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
- return -EINVAL;
+ scoped_guard(mmap_write_lock, mm) {
+ if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
+ dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
+ return -EINVAL;
+ }
}
call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
@@ -936,3 +942,56 @@ void sframe_free_mm(struct mm_struct *mm)
mtree_destroy(&mm->sframe_mt);
}
+
+/**
+ * sys_sframe_register - register an address for user space stacktrace walking.
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * This system call is used by dynamic library utilities to inform the kernel
+ * of meta data that it loaded that can be used by the kernel to know how
+ * to stack walk the given text locations.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_register, struct sframe_setup __user *, data, size_t, size)
+{
+ struct sframe_setup sframe;
+
+ if (sizeof(sframe) != size)
+ return -EINVAL;
+
+ if (copy_from_user(&sframe, data, size))
+ return -EFAULT;
+
+ return sframe_add_section(sframe.sframe_start,
+ sframe.sframe_start + sframe.sframe_size,
+ sframe.text_start,
+ sframe.text_start + sframe.text_size);
+}
+
+/**
+ * sys_sframe_unregister - unregister an sframe address
+ * @data: Structure of sframe data used to unregister the sframe section
+ * @size: The size of the given structure.
+ *
+ * The data->sframe_start is the only value that is used. The rest must
+ * be zero.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_unregister, struct sframe_setup __user *, data, size_t, size)
+{
+ struct sframe_setup sframe;
+
+ if (sizeof(sframe) != size)
+ return -EINVAL;
+
+ if (copy_from_user(&sframe, data, size))
+ return -EFAULT;
+
+ if (sframe.sframe_size || sframe.text_start || sframe.text_size)
+ return -EINVAL;
+
+ return sframe_remove_section(sframe.sframe_start);
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..46ec22b50042 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
--
2.53.0
^ permalink raw reply related
* [BUG] tracing/uprobe: GPF in path_put() via __free() on alloc failure
From: Farhad Alemi @ 2026-05-28 17:22 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Steven Rostedt, Oleg Nesterov, Peter Zijlstra, Mathieu Desnoyers,
linux-trace-kernel, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 4332 bytes --]
Hello Masami and the linux-trace-kernel team,
I am reporting a tracing/uprobe error-path crash found by syzkaller
with fault injection.
Summary:
A write(2) to /sys/kernel/tracing/uprobe_events that fails inside
alloc_trace_uprobe() leaves the caller's `tu` variable holding
ERR_PTR(-ENOMEM). The cleanup macro at kernel/trace/trace_uprobe.c:536:
DEFINE_FREE(free_trace_uprobe, struct trace_uprobe *,
if (_T) free_trace_uprobe(_T))
guards only against NULL, not IS_ERR. free_trace_uprobe() (invoked by
the __free() helper at __trace_uprobe_create's return) has the same
guard shape -- `if (!tu) return;` -- and then calls path_put(&tu->path)
on the ERR_PTR-valued tu. KASAN catches the resulting dereference as a
null-ptr-deref-in-range at path_put+0x29.
Observed on:
- Linux v7.1-rc3-200-g70eda68668d1-dirty (where the bug was originally
found), x86_64, QEMU Q35
- KASAN enabled; panic_on_warn set; CONFIG_FAULT_INJECTION enabled
- The only local dirty file in my tree is drivers/tty/serial/serial_core.c,
containing a local ttyS0 console guard for the fuzzing harness. It is
unrelated to kernel/trace/.
- Trigger requires CAP_SYS_ADMIN to open /sys/kernel/tracing/uprobe_events
for write (mode 0640, TRACE_MODE_WRITE) plus CONFIG_FAULT_INJECTION
with a forced kmalloc failure inside alloc_trace_uprobe().
- Source inspection of linus/master at commit e8c2f9fdadee
(v7.1-rc4-754-ge8c2f9fdadee) shows the buggy structure is unchanged:
DEFINE_FREE(free_trace_uprobe, ..., if (_T) free_trace_uprobe(_T)) at
trace_uprobe.c:536 has only a NULL guard, free_trace_uprobe() at
trace_uprobe.c:369 still has only `if (!tu) return;`, and
__trace_uprobe_create() declares `struct trace_uprobe *tu
__free(free_trace_uprobe) = NULL` and assigns the alloc_trace_uprobe()
return value into tu before any IS_ERR check.
Impact:
With CONFIG_FAULT_INJECTION enabled, a fail_nth-injected allocation
failure inside alloc_trace_uprobe() (kzalloc_flex at line 341 returns
NULL, the function returns ERR_PTR(-ENOMEM)) causes the
__free(free_trace_uprobe) cleanup in __trace_uprobe_create() to
dereference the ERR_PTR via path_put():
Oops: general protection fault, probably for non-canonical address
0xdffffc0000000008: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000040-0x0000000000000047]
RIP: 0010:path_put+0x29/0x60 fs/namei.c:717
The R12 register in the crash dump shows the smoking gun:
R12 = 0xfffffffffffffff4 = -ENOMEM = ERR_PTR(-12) -- i.e. the `tu`
pointer being freed is an ERR_PTR, not a real object.
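The register values in the dump are consistent with each other, which a little arithmetic makes visible. The sketch below recomputes them. Note the assumptions: the 0x48 offset of the path member is inferred from RBX minus R12 in this dump rather than taken from the structure definition, and the shadow offset is the 0xdffffc0000000000 constant visible in R13/R14. All function names are made up for this illustration.

```c
#include <stdint.h>

#define ENOMEM_VAL 12ULL	/* errno value of ENOMEM on Linux */
#define KASAN_SHADOW_OFFSET_SEEN 0xdffffc0000000000ULL	/* from R13/R14 */
#define PATH_OFFSET_INFERRED 0x48ULL	/* assumption: RBX - R12 in the dump */

/* ERR_PTR(-ENOMEM) viewed as a 64-bit value. */
static inline uint64_t err_ptr_enomem(void)
{
	return (uint64_t)0 - ENOMEM_VAL;
}

/* &tu->path when tu is the error pointer; wraps around to a small address. */
static inline uint64_t bogus_path_addr(void)
{
	return err_ptr_enomem() + PATH_OFFSET_INFERRED;
}

/* Address of the second pointer in struct path, which path_put() loads. */
static inline uint64_t bogus_dentry_addr(void)
{
	return bogus_path_addr() + 8;
}

/* KASAN shadow byte address checked before that load. */
static inline uint64_t kasan_shadow_addr(uint64_t addr)
{
	return (addr >> 3) + KASAN_SHADOW_OFFSET_SEEN;
}
```

With these inputs the results line up with R12 (0xfffffffffffffff4), RBX (0x3c), RDI (0x44) and the faulting non-canonical address 0xdffffc0000000008 from the report.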
Relevant stack:
path_put+0x29/0x60 fs/namei.c:717
free_trace_uprobe kernel/trace/trace_uprobe.c:374 [inline]
__free_free_trace_uprobe kernel/trace/trace_uprobe.c:536 [inline]
__trace_uprobe_create+0x53c/0xe40 kernel/trace/trace_uprobe.c:725
trace_probe_create+0xce/0x130 kernel/trace/trace_probe.c:2252
dyn_event_create+0x4f/0x70 kernel/trace/trace_dynevent.c:128
create_or_delete_trace_uprobe+0x65/0xa0 kernel/trace/trace_uprobe.c:739
trace_parse_run_command+0x1f3/0x380 kernel/trace/trace.c:9565
vfs_write+0x29f/0xb90 fs/read_write.c:686
ksys_write+0x155/0x270 fs/read_write.c:740
Expected behavior:
Any one of these closes the hole:
1. Tighten free_trace_uprobe()'s entry guard:
if (IS_ERR_OR_NULL(tu))
return;
2. Or change the DEFINE_FREE macro to skip ERR_PTR values:
DEFINE_FREE(free_trace_uprobe, struct trace_uprobe *,
if (!IS_ERR_OR_NULL(_T)) free_trace_uprobe(_T))
3. Or, after the IS_ERR(tu) check in __trace_uprobe_create(), assign
`tu = NULL;` before returning so the __free helper sees NULL and
skips the path_put.
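To make the difference between the two guard shapes concrete, here is a small userspace model of the error-pointer convention. The helpers are simplified re-implementations written for this sketch (the kernel's own live in include/linux/err.h), and struct obj, cleanup_null_only() and cleanup_err_or_null() are hypothetical stand-ins for free_trace_uprobe() with and without the proposed check.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace versions of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

struct obj {
	int unused;
};

/* Models the current guard: returns 1 if it would go on to touch *o. */
static inline int cleanup_null_only(struct obj *o)
{
	if (!o)
		return 0;
	return 1;	/* an ERR_PTR value reaches this point */
}

/* Models options 1 and 2: error pointers are skipped as well. */
static inline int cleanup_err_or_null(struct obj *o)
{
	if (IS_ERR_OR_NULL(o))
		return 0;
	return 1;
}
```

In this model an ERR_PTR(-12) value passes the NULL-only check and would be dereferenced, while the IS_ERR_OR_NULL() variant rejects it and still accepts a real object.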
Reproducer:
I attached the generated C reproducer as reproducer.c. I also attached the
syzkaller program as reproducer.syz and the console
report as crash-report.txt.
Novelty check:
I searched syzbot dashboard data across upstream, fixed, invalid, stable,
and Android namespaces, and searched lore.kernel.org for "path_put" +
"trace_uprobe", "free_trace_uprobe", and "__trace_uprobe_create" + "GPF" /
"KASAN". I did not find an exact match. Adjacent uprobe_unregister /
bpf_uprobe_multi_link UAF reports have different free paths.
I appreciate your time and consideration, and I'm grateful for your
work on this subsystem.
Regards,
Farhad
[-- Attachment #2: crash-report.txt --]
[-- Type: text/plain, Size: 4494 bytes --]
RBP: 00007ffccc56d910 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f528c605fa0 R14: 00007f528c605fa0 R15: 0000000000001e91
</TASK>
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000008: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000040-0x0000000000000047]
CPU: 0 UID: 0 PID: 3563 Comm: syz.2.17 Not tainted 7.1.0-rc3-00200-g70eda68668d1-dirty #1 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:path_put+0x29/0x60 fs/namei.c:717
Code: 90 f3 0f 1e fa 0f 1f 44 00 00 41 56 53 48 89 fb 49 be 00 00 00 00 00 fc ff df e8 22 91 8a ff 48 8d 7b 08 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 05 e8 db 50 f4 ff 48 8b 7b 08 e8 c2 89 03 00 48
RSP: 0018:ffffc900035bf9e8 EFLAGS: 00010203
RAX: 0000000000000008 RBX: 000000000000003c RCX: ffff88810c36a500
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000044
RBP: ffffc900035bfb70 R08: ffffffff9165b777 R09: 0000000000000000
R10: ffffffff9165b760 R11: fffffbfff22cb6ef R12: fffffffffffffff4
R13: dffffc0000000000 R14: dffffc0000000000 R15: 0000000000000000
FS: 0000555560a86500(0000) GS:ffff8882ab6b6000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f528c240760 CR3: 00000001214cb000 CR4: 0000000000750ef0
PKRU: 00000000
Call Trace:
<TASK>
free_trace_uprobe kernel/trace/trace_uprobe.c:374 [inline]
__free_free_trace_uprobe kernel/trace/trace_uprobe.c:536 [inline]
__trace_uprobe_create+0x53c/0xe40 kernel/trace/trace_uprobe.c:725
trace_probe_create+0xce/0x130 kernel/trace/trace_probe.c:2252
dyn_event_create+0x4f/0x70 kernel/trace/trace_dynevent.c:128
create_or_delete_trace_uprobe+0x65/0xa0 kernel/trace/trace_uprobe.c:739
trace_parse_run_command+0x1f3/0x380 kernel/trace/trace.c:9565
vfs_write+0x29f/0xb90 fs/read_write.c:686
ksys_write+0x155/0x270 fs/read_write.c:740
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f528c37778d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffccc56d8a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f528c605fa0 RCX: 00007f528c37778d
RDX: 0000000000000022 RSI: 0000200000001100 RDI: 0000000000000003
RBP: 00007ffccc56d910 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f528c605fa0 R14: 00007f528c605fa0 R15: 0000000000001e91
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:path_put+0x29/0x60 fs/namei.c:717
Code: 90 f3 0f 1e fa 0f 1f 44 00 00 41 56 53 48 89 fb 49 be 00 00 00 00 00 fc ff df e8 22 91 8a ff 48 8d 7b 08 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 05 e8 db 50 f4 ff 48 8b 7b 08 e8 c2 89 03 00 48
RSP: 0018:ffffc900035bf9e8 EFLAGS: 00010203
RAX: 0000000000000008 RBX: 000000000000003c RCX: ffff88810c36a500
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000044
RBP: ffffc900035bfb70 R08: ffffffff9165b777 R09: 0000000000000000
R10: ffffffff9165b760 R11: fffffbfff22cb6ef R12: fffffffffffffff4
R13: dffffc0000000000 R14: dffffc0000000000 R15: 0000000000000000
FS: 0000555560a86500(0000) GS:ffff8882ab6b6000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f528c240760 CR3: 00000001214cb000 CR4: 0000000000750ef0
PKRU: 00000000
----------------
Code disassembly (best guess):
0: 90 nop
1: f3 0f 1e fa endbr64
5: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
a: 41 56 push %r14
c: 53 push %rbx
d: 48 89 fb mov %rdi,%rbx
10: 49 be 00 00 00 00 00 movabs $0xdffffc0000000000,%r14
17: fc ff df
1a: e8 22 91 8a ff call 0xff8a9141
1f: 48 8d 7b 08 lea 0x8(%rbx),%rdi
23: 48 89 f8 mov %rdi,%rax
26: 48 c1 e8 03 shr $0x3,%rax
* 2a: 42 80 3c 30 00 cmpb $0x0,(%rax,%r14,1) <-- trapping instruction
2f: 74 05 je 0x36
31: e8 db 50 f4 ff call 0xfff45111
36: 48 8b 7b 08 mov 0x8(%rbx),%rdi
3a: e8 c2 89 03 00 call 0x38a01
3f: 48 rex.W
[-- Attachment #3: reproducer.c --]
[-- Type: application/octet-stream, Size: 6291 bytes --]
// autogenerated by syzkaller (https://github.com/google/syzkaller)
#define _GNU_SOURCE
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
static bool write_file(const char* file, const char* what, ...)
{
char buf[1024];
va_list args;
va_start(args, what);
vsnprintf(buf, sizeof(buf), what, args);
va_end(args);
buf[sizeof(buf) - 1] = 0;
int len = strlen(buf);
int fd = open(file, O_WRONLY | O_CLOEXEC);
if (fd == -1)
return false;
if (write(fd, buf, len) != len) {
int err = errno;
close(fd);
errno = err;
return false;
}
close(fd);
return true;
}
static int inject_fault(int nth)
{
int fd;
fd = open("/proc/thread-self/fail-nth", O_RDWR);
if (fd == -1)
exit(1);
char buf[16];
sprintf(buf, "%d", nth);
if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf))
exit(1);
return fd;
}
static const char* setup_fault()
{
int fd = open("/proc/self/make-it-fail", O_WRONLY);
if (fd == -1)
return "CONFIG_FAULT_INJECTION is not enabled";
close(fd);
fd = open("/proc/thread-self/fail-nth", O_WRONLY);
if (fd == -1)
return "kernel does not have systematic fault injection support";
close(fd);
static struct {
const char* file;
const char* val;
bool fatal;
} files[] = {
{"/sys/kernel/debug/failslab/ignore-gfp-wait", "N", true},
{"/sys/kernel/debug/fail_futex/ignore-private", "N", false},
{"/sys/kernel/debug/fail_page_alloc/ignore-gfp-highmem", "N", false},
{"/sys/kernel/debug/fail_page_alloc/ignore-gfp-wait", "N", false},
{"/sys/kernel/debug/fail_page_alloc/min-order", "0", false},
};
unsigned i;
for (i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
if (!write_file(files[i].file, files[i].val)) {
if (files[i].fatal)
return "failed to write fault injection file";
}
}
return NULL;
}
static void setup_sysctl()
{
int cad_pid = fork();
if (cad_pid < 0)
exit(1);
if (cad_pid == 0) {
for (;;)
sleep(100);
}
char tmppid[32];
snprintf(tmppid, sizeof(tmppid), "%d", cad_pid);
struct {
const char* name;
const char* data;
} files[] = {
{"/sys/kernel/debug/x86/nmi_longest_ns", "10000000000"},
{"/proc/sys/kernel/hung_task_check_interval_secs", "20"},
{"/proc/sys/net/core/bpf_jit_kallsyms", "1"},
{"/proc/sys/net/core/bpf_jit_harden", "0"},
{"/proc/sys/kernel/kptr_restrict", "0"},
{"/proc/sys/kernel/softlockup_all_cpu_backtrace", "1"},
{"/proc/sys/fs/mount-max", "100"},
{"/proc/sys/vm/oom_dump_tasks", "0"},
{"/proc/sys/debug/exception-trace", "0"},
{"/proc/sys/kernel/printk", "7 4 1 3"},
{"/proc/sys/kernel/keys/gc_delay", "1"},
{"/proc/sys/vm/oom_kill_allocating_task", "1"},
{"/proc/sys/kernel/ctrl-alt-del", "0"},
{"/proc/sys/kernel/cad_pid", tmppid},
};
for (size_t i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
if (!write_file(files[i].name, files[i].data)) {
}
}
kill(cad_pid, SIGKILL);
while (waitpid(cad_pid, NULL, 0) != cad_pid)
;
}
uint64_t r[1] = {0xffffffffffffffff};
int main(void)
{
syscall(__NR_mmap, /*addr=*/0x1ffffffff000ul, /*len=*/0x1000ul, /*prot=*/0ul,
/*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul,
/*fd=*/(intptr_t)-1, /*offset=*/0ul);
syscall(__NR_mmap, /*addr=*/0x200000000000ul, /*len=*/0x1000000ul,
/*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/ 7ul,
/*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul,
/*fd=*/(intptr_t)-1, /*offset=*/0ul);
syscall(__NR_mmap, /*addr=*/0x200001000000ul, /*len=*/0x1000ul, /*prot=*/0ul,
/*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul,
/*fd=*/(intptr_t)-1, /*offset=*/0ul);
setup_sysctl();
const char* reason;
(void)reason;
if ((reason = setup_fault()))
printf("the reproducer may not work as expected: fault injection setup "
"failed: %s\n",
reason);
intptr_t res = 0;
if (write(1, "executing program\n", sizeof("executing program\n") - 1)) {
}
// mkdir arguments: [
// path: ptr[in, buffer] {
// buffer: {2e 2f 70 6d 32 39 30 38 65 00} (length 0xa)
// }
// mode: open_mode = 0x1ff (8 bytes)
// ]
memcpy((void*)0x200000001000, "./pm2908e\000", 10);
syscall(
__NR_mkdir, /*path=*/0x200000001000ul,
/*mode=S_IXOTH|S_IWOTH|S_IROTH|S_IXGRP|S_IWGRP|S_IRGRP|S_IXUSR|S_IWUSR|0x100*/
0x1fful);
// mount arguments: [
// src: nil
// dst: ptr[in, buffer] {
// buffer: {2e 2f 70 6d 32 39 30 38 65 00} (length 0xa)
// }
// type: ptr[in, buffer] {
// buffer: {74 72 61 63 65 66 73 00} (length 0x8)
// }
// flags: mount_flags = 0x0 (8 bytes)
// data: nil
// ]
memcpy((void*)0x200000001040, "./pm2908e\000", 10);
memcpy((void*)0x200000001080, "tracefs\000", 8);
syscall(__NR_mount, /*src=*/0ul, /*dst=*/0x200000001040ul,
/*type=*/0x200000001080ul, /*flags=*/0ul, /*data=*/0ul);
// openat arguments: [
// fd: fd_dir (resource)
// file: ptr[in, buffer] {
// buffer: {2e 2f 70 6d 32 39 30 38 65 2f 75 70 72 6f 62 65 5f 65 76 65
// 6e 74 73 00} (length 0x18)
// }
// flags: open_flags = 0x1 (4 bytes)
// mode: open_mode = 0x0 (2 bytes)
// ]
// returns fd
memcpy((void*)0x2000000010c0, "./pm2908e/uprobe_events\000", 24);
res = syscall(__NR_openat, /*fd=*/0xffffff9c, /*file=*/0x2000000010c0ul,
/*flags=O_WRONLY*/ 1, /*mode=*/0);
if (res != -1)
r[0] = res;
// write arguments: [
// fd: fd (resource)
// buf: ptr[in, buffer] {
// buffer: {70 3a 75 70 72 6f 62 65 73 2f 70 6d 32 39 30 38 65 20 2f 70
// 72 6f 63 2f 73 65 6c 66 2f 65 78 65 3a 30} (length 0x22)
// }
// count: len = 0x22 (8 bytes)
// ]
memcpy((void*)0x200000001100, "p:uprobes/pm2908e /proc/self/exe:0", 34);
inject_fault(12);
syscall(__NR_write, /*fd=*/r[0], /*buf=*/0x200000001100ul, /*count=*/0x22ul);
return 0;
}
[-- Attachment #4: reproducer.syz --]
[-- Type: application/octet-stream, Size: 760 bytes --]
# {Threaded:false Repeat:false RepeatTimes:0 Procs:1 Slowdown:1 Sandbox: SandboxArg:0 Leak:false NetInjection:false NetDevices:false NetReset:false Cgroups:false BinfmtMisc:false CloseFDs:false KCSAN:false DevlinkPCI:false NicVF:false USB:false VhciInjection:false Wifi:false IEEE802154:false Sysctl:true Swap:false UseTmpDir:false HandleSegv:false Trace:false CallComments:true LegacyOptions:{Collide:false Fault:false FaultCall:0 FaultNth:0}}
mkdir(&(0x7f0000001000)='./pm2908e\x00', 0x1ff)
mount(0x0, &(0x7f0000001040)='./pm2908e\x00', &(0x7f0000001080)='tracefs\x00', 0x0, 0x0)
r0 = openat(0xffffffffffffff9c, &(0x7f00000010c0)='./pm2908e/uprobe_events\x00', 0x1, 0x0)
write(r0, &(0x7f0000001100)='p:uprobes/pm2908e /proc/self/exe:0', 0x22) (fail_nth: 12)
^ permalink raw reply
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-05-28 17:11 UTC (permalink / raw)
To: Wei Yang
Cc: Andrew Morton, linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, anshuman.khandual, apopple, baohua,
baolin.wang, byungchul, catalin.marinas, cl, corbet, dave.hansen,
david, dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh,
jglisse, joshua.hahnjy, kas, lance.yang, liam, ljs,
mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260528084211.wsdrvbvxvkddokb5@master>
On Thu, May 28, 2026 at 2:42 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Tue, May 26, 2026 at 06:07:38AM -0600, Nico Pache wrote:
> >On Tue, May 26, 2026 at 12:57 AM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Mon, May 25, 2026 at 12:10:41PM -0700, Andrew Morton wrote:
> >> >On Mon, 25 May 2026 08:15:53 -0600 Nico Pache <npache@redhat.com> wrote:
> >> >
> >> >> Can you please append the following fixup that reverts one of the
> >> >> changes requested in V17. The issue with the change is described
> >> >> below.
> >> >
> >> >OK. fyi, what I received was badly mangled: wordwrapping, tabs messed
> >> >up, etc.
> >> >
> >> >Here's my reconstruction:
> >> >
> >>
> >> Hi, Nico
> >>
> >> I tried to reply your mail, but found it has some encoding problem, so reply
> >> here.
> >
> >Yeah sorry I didnt properly configure my email client after getting a
> >new laptop.
> >
> >>
> >> >
> >> >Author: Nico Pache <npache@redhat.com>
> >> >Subject: fix potential use-after-free of vma in mthp_collapse()
> >> >Date: Mon May 25 07:38:59 2026 -0600
> >> >
> >> >Between V17 and v18, one reviewer (Wei) brought up that we are not doing
> >> >the uffd-armed check until deep in the collapse operation. While not
> >> >functionally incorrect, it can lead to unnecessary work.
> >>
> >> So we decide to tolerate the behavioral change?
> >
> >Yes, I believe it is ok for now. Either way we needed to remove the
> >potential UAF. It only affects the behavior if mTHP is enabled, so the
> >legacy behavior is kept. And the uffd case is limited.
> >
> >My future work involves further optimizing and cleaning up khugepaged.
> >I'll make this part of the goal too. My first thought is to do the
> >revalidation at every order (between the locks dropping); but that
> >essentially pays the same penalty... I can't think of a clean solution
> >at the moment.
>
> One way come into my mind is add a @was_uffd_armed field in collapse_control
> and updates it in hugepage_vma_revalidate() when latest vma is retrieved.
>
> Still not elegant enough.
So our issue is that userfaultfd_armed is at the VMA granularity.
Ideally we want PMD/PTE granularity, but we only have that for wp. I'm
just still investigating all the nuances of uffd and its interactions
with khugepaged (something I've been meaning to understand more of
anyway). But from what I understand so far, we can actually use the
bitmap and the was_uffd_armed to optimize this further. It solves the
issue and has a rather small race window, which can just be handled by
the revalidation later on, probably eliminating most of the potential
cases.
IIUC, filling a region with previously empty/zero pages is only an
issue for MODE_MISSING and MODE_WP with WP_UNPOPULATED set as well. I
have a work in progress commit to improve all this uffd handling.
I think what I have is a good middle ground. It improves the current
functionality and closes the gap we have with the new mthp_collapse, so
we get the best of both worlds. If the race window is hit, we will pay the
penalty, but that should be greatly reduced. I will send out an RFC
for this targeting mm-new once I have everything verified and cleaned
up :)
Cheers,
-- Nico
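
The @was_uffd_armed idea discussed above can be sketched in userspace C.
This is only an illustration under stated assumptions: the demo_* structs
and functions are hypothetical stand-ins, not the khugepaged code, and the
real hugepage_vma_revalidate() does far more than cache one flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA; only the uffd state matters here. */
struct demo_vma {
	bool uffd_armed;
};

/* Hypothetical stand-in for collapse_control with the proposed field. */
struct demo_collapse_control {
	bool was_uffd_armed;
};

/*
 * Sketch of the revalidation step: while the VMA is known to be valid,
 * snapshot its uffd state into the control structure.
 * Returns 0 on success, -1 if there is no VMA to look at.
 */
static int demo_revalidate(struct demo_collapse_control *cc,
			   const struct demo_vma *vma)
{
	assert(cc);
	if (!vma)
		return -1;
	cc->was_uffd_armed = vma->uffd_armed;
	return 0;
}

/*
 * Later code consults the cached value instead of dereferencing a VMA
 * pointer that may have gone stale once the lock was dropped.
 */
static bool demo_may_fill_empty(const struct demo_collapse_control *cc)
{
	assert(cc);
	return !cc->was_uffd_armed;
}
```

The cached value is only as fresh as the last revalidation, which is the
small race window mentioned above; a later revalidation is what catches a
change in the uffd state.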
>
> >
> >Does that sound ok?
> >
>
> Not sure. I can't imagine the impact it would have.
>
> >Cheers,
> >-- Nico
>
>
> --
> Wei Yang
> Help you, Help me
>
^ permalink raw reply
* Re: [PATCH 0/4] bootconfig: embed kernel.* cmdline at build time
From: Breno Leitao @ 2026-05-28 16:14 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, kernel-team
In-Reply-To: <20260529001519.14ca9dbe92fb2622249137c6@kernel.org>
On Fri, May 29, 2026 at 12:15:19AM +0000, Masami Hiramatsu wrote:
> On Wed, 27 May 2026 09:41:33 -0700
> Breno Leitao <leitao@debian.org> wrote:
>
> > The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
> > already landed; this series wires the rendered cmdline into the kernel.
> >
> > Motivation: today the embedded bootconfig is parsed at runtime, after
> > parse_early_param() has already run, so early_param() handlers can't
> > see embedded values. Folding the kernel.* subtree into the cmdline at
> > build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
> > users without forcing them to maintain two cmdline sources.
> >
> > Behaviorally, the "kernel" subtree is rendered to a flat string at
> > build time and stashed in .init.rodata. setup_arch() prepends it to
> > boot_command_line before parse_early_param() runs. Overflow is a soft
> > error: the helper logs and leaves boot_command_line untouched rather
> > than panicking, so an oversized embedded bconf cannot brick a boot.
> >
>
> Thanks Breno, yes, that is what I think about.
> Let me check it. And could you also check Sashiko's comments?
>
> https://sashiko.dev/#/patchset/20260527-bootconfig_using_tools-v1-0-b6906a86e7d5%40debian.org
Ack, I will have a look at them, thanks for confirming the direction is
correct.
--breno
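
The "soft error on overflow" behaviour described in the cover letter can be
sketched in userspace C. This is not the series' xbc_prepend_embedded_cmdline()
implementation; demo_prepend_cmdline() and DEMO_CMDLINE_SIZE are hypothetical
names, and the logging call is a placeholder for whatever the kernel helper
actually does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_CMDLINE_SIZE 64	/* hypothetical buffer size for the sketch */

/*
 * Prepend 'embedded' (plus a separating space) to 'cmdline', which has
 * room for 'size' bytes including the terminating NUL.
 * Returns 0 on success. If the result would not fit, logs a message,
 * leaves 'cmdline' untouched and returns -1 instead of aborting.
 */
static int demo_prepend_cmdline(char *cmdline, size_t size,
				const char *embedded)
{
	size_t elen, clen, sep;

	assert(cmdline && embedded);
	elen = strlen(embedded);
	clen = strlen(cmdline);

	if (elen == 0)
		return 0;	/* nothing to prepend */

	sep = clen ? 1 : 0;	/* a space is only needed between two parts */
	if (elen + sep + clen + 1 > size) {
		fprintf(stderr, "embedded cmdline too long, ignoring\n");
		return -1;
	}

	/* Shift the existing string (with its NUL) to make room in front. */
	memmove(cmdline + elen + sep, cmdline, clen + 1);
	memcpy(cmdline, embedded, elen);
	if (sep)
		cmdline[elen] = ' ';
	return 0;
}
```

Checking the length before touching the buffer is what keeps the failure
case harmless: the original command line is still intact when -1 is returned.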
^ permalink raw reply
* Re: [PATCH v21 8/9] ring-buffer: Show persistent buffer dropped events in trace file
From: Steven Rostedt @ 2026-05-28 15:26 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Ian Rogers
In-Reply-To: <20260529001500.14178455a046a5cbc6180861@kernel.org>
On Fri, 29 May 2026 00:15:00 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> Yeah, the relationship between the persistent ring buffer (which is mapped)
> and the RB_MISSED_EVENTS is a bit unclear.
I'm going to pull these patches into the ring-buffer for-next branch and
start testing them. If they pass, I'll push them up to the linux-trace repo.
I'll send out this patch on top of it.
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 910f6b3adf74..5de6f352249a 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1929,6 +1929,12 @@ static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
*/
if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
local_set(&bpage->entries, 0);
+ /*
+ * Note, the RB_MISSED_EVENTS is only set inside the main write
+ * buffer by this verification logic. The normal ring buffer
+ * has this bit set when the page is read and passed to the
+ * consumers.
+ */
local_set(&dpage->commit, RB_MISSED_EVENTS);
dpage->time_stamp = prev_ts ? prev_ts : next_ts;
ret = -1;
@@ -7232,6 +7238,14 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
local_add(RB_MISSED_STORED, &dpage->commit);
size += sizeof(missed_events);
}
+ /*
+ * Note, for the persistent ring buffer, the RB_MISSED_EVENTS
+ * may have been set in the main buffer via the verification code.
+ * But here, dpage is a copy of that page and has not yet had
+ * the RB_MISSED_EVENTS set. As for the normal buffers,
+ * the main write buffer does not set these bits and it needs
+ * to be set here.
+ */
local_add(RB_MISSED_EVENTS, &dpage->commit);
}
-- Steve
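
The interplay the two comments describe (a plain store to the commit word
drops the flag bits, so the reader notes the flag first and adds it back
afterwards) can be sketched in userspace C. This is a simplified model, not
the kernel code: the demo_* names are hypothetical, and the flag values are
assumed to match the kernel's RB_MISSED_* definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits kept at the top of the commit word (values assumed). */
#define RB_MISSED_EVENTS	(1u << 31)
#define RB_MISSED_STORED	(1u << 30)
#define RB_MISSED_MASK		(3u << 30)

/* Hypothetical stand-in for a data page's commit field. */
struct demo_page {
	uint32_t commit;
};

/* Overwrites the whole word, as local_set() would: flag bits are lost. */
static void demo_set(struct demo_page *p, uint32_t v)
{
	p->commit = v;
}

/* Adds to the word, as local_add() would: used to set a clear flag bit. */
static void demo_add(struct demo_page *p, uint32_t v)
{
	p->commit += v;
}

static uint32_t demo_size(const struct demo_page *p)
{
	return p->commit & ~RB_MISSED_MASK;
}

static int demo_missed(const struct demo_page *p)
{
	return !!(p->commit & RB_MISSED_EVENTS);
}

/*
 * Reader-side sketch: note a flag that was set earlier (for example by a
 * verifier), reset the size, then add the flag back if events were missed.
 */
static void demo_read_page(struct demo_page *p, uint32_t real_end, int lost)
{
	int missed = lost;

	assert((real_end & RB_MISSED_MASK) == 0);
	if (!missed && demo_missed(p))
		missed = -1;		/* events were dropped, count unknown */
	demo_set(p, real_end);		/* this store clears any flag bits */
	if (missed)
		demo_add(p, RB_MISSED_EVENTS);
}
```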
^ permalink raw reply related
* Re: [PATCH 0/4] bootconfig: embed kernel.* cmdline at build time
From: Masami Hiramatsu @ 2026-05-28 15:15 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, kernel-team
In-Reply-To: <20260527-bootconfig_using_tools-v1-0-b6906a86e7d5@debian.org>
On Wed, 27 May 2026 09:41:33 -0700
Breno Leitao <leitao@debian.org> wrote:
> The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
> already landed; this series wires the rendered cmdline into the kernel.
>
> Motivation: today the embedded bootconfig is parsed at runtime, after
> parse_early_param() has already run, so early_param() handlers can't
> see embedded values. Folding the kernel.* subtree into the cmdline at
> build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
> users without forcing them to maintain two cmdline sources.
>
> Behaviorally, the "kernel" subtree is rendered to a flat string at
> build time and stashed in .init.rodata. setup_arch() prepends it to
> boot_command_line before parse_early_param() runs. Overflow is a soft
> error: the helper logs and leaves boot_command_line untouched rather
> than panicking, so an oversized embedded bconf cannot brick a boot.
>
Thanks Breno, yes, that is what I had in mind.
Let me check it. And could you also check Sashiko's comments?
https://sashiko.dev/#/patchset/20260527-bootconfig_using_tools-v1-0-b6906a86e7d5%40debian.org
Thanks,
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Breno Leitao (4):
> bootconfig: return 0 from xbc_snprint_cmdline() for a leaf root
> bootconfig: render embedded bootconfig as a kernel cmdline at build time
> bootconfig: add xbc_prepend_embedded_cmdline() helper
> x86/setup: prepend embedded bootconfig cmdline before parse_early_param
>
> Makefile | 5 ++++
> arch/x86/Kconfig | 1 +
> arch/x86/kernel/setup.c | 3 +++
> include/linux/bootconfig.h | 7 ++++++
> init/Kconfig | 33 ++++++++++++++++++++++++++
> init/main.c | 19 ++++++++++++---
> lib/Makefile | 16 +++++++++++++
> lib/bootconfig.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++
> lib/embedded-cmdline.S | 16 +++++++++++++
> tools/bootconfig/Makefile | 2 +-
> 10 files changed, 156 insertions(+), 4 deletions(-)
> ---
> base-commit: e7e28506af98ce4e1059e5ec59334b335c00a246
> change-id: 20260508-bootconfig_using_tools-cfa7aa9d6a5a
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v21 8/9] ring-buffer: Show persistent buffer dropped events in trace file
From: Masami Hiramatsu @ 2026-05-28 15:15 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Ian Rogers
In-Reply-To: <20260527093507.00ac35d8@gandalf.local.home>
On Wed, 27 May 2026 09:35:07 -0400
Steven Rostedt <rostedt@kernel.org> wrote:
> On Wed, 27 May 2026 12:47:21 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > Yeah, for the persistent ring buffer, it does not happen.
> > But there seems RB_MISSED_EVENTS bit can be cleared in
> > "else" path (after applying 1-8 patches)?
>
> Note, *only* the persistent ring buffer adds RB_MISSED_EVENTS to the pages
> in the write buffer. In the normal buffer, these bits are only set by this
> function. That is, they would not be set from the swap of pages.
>
Ah, OK. For normal ring buffers, it is only set by the reader.
> >
> > ----------
> > if (read || (len < (commit - read)) ||
> > cpu_buffer->reader_page == cpu_buffer->commit_page ||
> > force_memcpy) { // <-- persistent ring buffer sets force_memcpy = true.
> > [...]
> > } else {
> > /* update the entry counter */
> > [...]
> > if (!missed_events && rb_data_page_commit(dpage) & RB_MISSED_EVENTS)
> > missed_events = -1;
> > //^-- we check RB_MISSED_EVENTS bit on @dpage->commit and set missed_events = -1.
> >
> > /*
> > * Use the real_end for the data size,
> > * This gives us a chance to store the lost events
> > * on the page.
> > */
> > if (reader->real_end)
> > local_set(&dpage->commit, reader->real_end);
> > // ^- only if @reader->real_end, RB_MISSED_EVENTS bit is dropped.
>
> Because this isn't a persistent ring buffer (if it was, as you noted,
> force_memcpy would be true and we wouldn't enter the else path), the
> RB_MISSED_EVENTS bit in the commit would never be set here. It is *only* set
> by the verifier of the persistent ring buffer logic.
OK, I got it.
Thanks for the confirmation!
>
> > }
> >
> > cpu_buffer->lost_events = 0;
> >
> > commit = rb_data_page_commit(dpage);
> > /*
> > * Set a flag in the commit field if we lost events
> > */
> > if (missed_events) {
> > /*
> > * If there is room at the end of the page to save the
> > * missed events, then record it there.
> > */
> > if (missed_events > 0 &&
> > buffer->subbuf_size - commit >= sizeof(missed_events)) {
> > memcpy(&dpage->data[commit], &missed_events,
> > sizeof(missed_events));
> > local_add(RB_MISSED_STORED, &dpage->commit);
> > commit += sizeof(missed_events);
> > }
> > local_add(RB_MISSED_EVENTS, &dpage->commit); // <-- @dpage->commit is updated.
> > }
>
> And this is the first place it would get set.
>
> But yeah, it is very confusing and needs better comments.
Yeah, the relationship between the persistent ring buffer (which is mapped)
and the RB_MISSED_EVENTS is a bit unclear.
Thanks,
>
> Thanks,
>
> -- Steve
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply