* Re: [PATCH] mm/madvise: preserve uprobe breakpoints across MADV_DONTNEED
From: Jann Horn @ 2026-04-30 15:22 UTC (permalink / raw)
To: Daniel Walker (danielwa)
Cc: Oleg Nesterov, David Hildenbrand (Arm),
Darko Tominac -X (dtominac - GLOBALLOGIC INC at Cisco),
Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
Vlastimil Babka, xe-linux-external(mailer list),
linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
linux-perf-users@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <afJz3l6435DHLorn@goliath>
On Wed, Apr 29, 2026 at 11:11 PM Daniel Walker (danielwa)
<danielwa@cisco.com> wrote:
> On Wed, Apr 29, 2026 at 05:24:25PM +0200, Oleg Nesterov wrote:
> > On 04/29, David Hildenbrand (Arm) wrote:
> > >
> > > On 4/29/26 15:15, Darko Tominac wrote:
> > >
> > > > uprobe infrastructure does not re-instrument pages on individual page
> > > > faults (uprobe_mmap() is only called during VMA creation, not on
> > > > page-in), the breakpoints are silently lost once the discarded pages are
> > > > re-read from the backing file. The probes stop firing with no error
> > > > indication, and the only recovery is to unregister and re-register the
> > > > affected uprobes.
> > >
> > > Right. Don't MADV_DONTNEED uprobes, just like you are not supposed to
> > > MADV_DONTNEED debugger breakpoints/set data etc. :)
> >
> > Agreed, thanks.
>
>
> Shouldn't there be some sort of compensation or notification for this, or is each person that
> hits this suppose to just scratch their head and send a patch that's rejected?
I guess we could add a pr_warn_once() that warns when
madvise(MADV_DONTNEED) is called on a read+execute file mapping,
and/or (as David said) add an explicit note in the madvise() manpage
about how that can interfere with software breakpoints and uprobes?
^ permalink raw reply
* [PATCH v2] tracepoint: add lockdep rcu_is_watching() check to trace_##name##_enabled()
From: David Carlier @ 2026-04-30 14:41 UTC (permalink / raw)
To: linux-trace-kernel
Cc: linux-kernel, rostedt, mhiramat, mathieu.desnoyers, peterz,
vineeth, David Carlier
In-Reply-To: <20260430053511.8767-1-devnexen@gmail.com>
trace_call__##name() was added by commit 677a3d82b640 ("tracepoint: Add
trace_call__##name() API") for callers that already gate on
trace_##name##_enabled(), letting them skip the redundant
static_branch_unlikely() re-evaluation. The LOCKDEP rcu_is_watching()
WARN_ONCE() that trace_##name() carries on the disabled path was not
mirrored, so the cold-path coverage added by commit c2679254b9c9
("tracing: Make tracepoint lockdep check actually test something") is
lost for trace_call__##name() callers when the tracepoint is disabled.
When the tracepoint is enabled, the rcu_dereference inside
__DO_TRACE_CALL() already trips under PROVE_RCU, so the warning is
only needed on the !enabled path. The natural place is the gate that
trace_call__##name() callers consult: trace_##name##_enabled().
Fixes: 677a3d82b640 ("tracepoint: Add trace_call__##name() API")
Cc: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: David Carlier <devnexen@gmail.com>
---
include/linux/tracepoint.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 578e520b6ee6..3b3ab19c2627 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -293,6 +293,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
static inline bool \
trace_##name##_enabled(void) \
{ \
+ if (IS_ENABLED(CONFIG_LOCKDEP)) { \
+ WARN_ONCE(!rcu_is_watching(), \
+ "RCU not watching for tracepoint"); \
+ } \
return static_branch_unlikely(&__tracepoint_##name.key);\
}
--
2.53.0
^ permalink raw reply related
* Re: [PATCH] tracepoint: add lockdep rcu_is_watching() check to trace_call__##name()
From: Steven Rostedt @ 2026-04-30 13:35 UTC (permalink / raw)
To: David Carlier
Cc: linux-trace-kernel, linux-kernel, Vineeth Pillai (Google),
Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra
In-Reply-To: <20260430053511.8767-1-devnexen@gmail.com>
On Thu, 30 Apr 2026 06:35:10 +0100
David Carlier <devnexen@gmail.com> wrote:
> Without it, sites converted from trace_foo(args) to trace_call__foo(args)
> lose the WARN_ONCE(!rcu_is_watching()) coverage under CONFIG_LOCKDEP when
> the tracepoint is enabled - the case that commit c2679254b9c9 ("tracing:
> Make tracepoint lockdep check actually test something") added the warning
> for.
>
> Mirror the same block in both trace_call__##name() bodies, gated by
> (cond) in __DECLARE_TRACE to match its trace_##name() and ungated in
> __DECLARE_TRACE_SYSCALL.
If it gets called without rcu watching, the rcu dereference in
__DO_TRACE_CALL() will trigger.
No need to add the warning here. The reason the trace_foo() had it, was to
trigger when the tracepoint WASN'T enabled.
Now, perhaps where the warning should go, is in the trace_foo_enabled()
code.
-- Steve
^ permalink raw reply
* Re: [PATCH v6] tracing: Bound synthetic-field strings with seq_buf
From: Steven Rostedt @ 2026-04-30 13:32 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Pengpeng Hou, Mathieu Desnoyers, Tom Zanussi, linux-trace-kernel,
linux-kernel
In-Reply-To: <20260430141414.b6fb7a49f4eae3cac847fb54@kernel.org>
On Thu, 30 Apr 2026 14:14:14 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> (BTW, we need to use __free(kfree) in this file too.)
I saw that too, but that can be a separate patch.
-- Steve
^ permalink raw reply
* Re: [PATCH] mm/madvise: preserve uprobe breakpoints across MADV_DONTNEED
From: David Hildenbrand (Arm) @ 2026-04-30 9:54 UTC (permalink / raw)
To: Oleg Nesterov, Daniel Walker (danielwa)
Cc: Darko Tominac -X (dtominac - GLOBALLOGIC INC at Cisco),
Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, xe-linux-external(mailer list),
linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
linux-perf-users@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <afMeAs6upm8kLCYb@redhat.com>
On 4/30/26 11:16, Oleg Nesterov wrote:
> On 04/29, Daniel Walker (danielwa) wrote:
>>
>> Shouldn't there be some sort of compensation or notification for this, or is each person that
>> hits this suppose to just scratch their head and send a patch that's rejected?
>
> Not sure... uprobes assumes that it owns the modified page and nobody
> can modify/unmap it, including the probed application or debugger.
>
> But may be MADV_DONTNEED should not set zap_details.even_cows == true
> by default ?
It should. If you do mmap(MAP_PRIVATE, fd), and modify some pages, MADV_DONTNEED
must zap these pages.
There is a recurring confusion about MADV_DONTNEED.
The man page is relatively clear about the semantics ("subsequent accesses of
pages in the range will succeed, but will result in either repopulating the
memory ... or zero-fill-on-demand pages for anonymous private mappings").
But the initial sentence "Do not expect access in the near future. (For the
time being, the application is finished with the given range, so the kernel can
free resources associated with it.)" mostly only makes sense for shared mappings.
Maybe we could emphasize that MADV_DONTNEED is not a hint but something
destructive in private mappings.
And that MADV_DONTNEED is not for memory reclaim, just like munmap() is not for
memory reclaim.
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH] mm/madvise: preserve uprobe breakpoints across MADV_DONTNEED
From: Oleg Nesterov @ 2026-04-30 9:16 UTC (permalink / raw)
To: Daniel Walker (danielwa), David Hildenbrand (Arm)
Cc: Darko Tominac -X (dtominac - GLOBALLOGIC INC at Cisco),
Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, xe-linux-external(mailer list),
linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
linux-perf-users@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <afJz3l6435DHLorn@goliath>
On 04/29, Daniel Walker (danielwa) wrote:
>
> Shouldn't there be some sort of compensation or notification for this, or is each person that
> hits this suppose to just scratch their head and send a patch that's rejected?
Not sure... uprobes assumes that it owns the modified page and nobody
can modify/unmap it, including the probed application or debugger.
But may be MADV_DONTNEED should not set zap_details.even_cows == true
by default ?
Oleg.
^ permalink raw reply
* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Geert Uytterhoeven @ 2026-04-30 7:26 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Madhavan Srinivasan, Michael Ellerman, LKML,
Linux Trace Kernel, Masami Hiramatsu, Jens Remus, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Linus Torvalds, Andrew Morton,
Florian Weimer, Kees Cook, Carlos O'Donell, Sam James,
Dylan Hatch, Borislav Petkov, Dave Hansen, David Hildenbrand,
H. Peter Anvin, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
Heiko Carstens, Vasily Gorbik
In-Reply-To: <b743d575-0842-4718-8dc9-23c3c97b4a92@efficios.com>
On Wed, 29 Apr 2026 at 20:35, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> On 2026-04-29 11:47, Steven Rostedt wrote:
> > On Wed, 29 Apr 2026 11:43:55 -0400
> > Steven Rostedt <rostedt@kernel.org> wrote:
> >
> >> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> >> index 4fcc7c58a105..005441233932 100644
> >> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> >> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> >> @@ -562,3 +562,4 @@
> >> 469 common file_setattr sys_file_setattr
> >> 470 common listns sys_listns
> >> 471 nospu rseq_slice_yield sys_rseq_slice_yield
> >> +472 nospu stacktrace_setup sys_stacktrace_setup
> >
> > BTW, I have no idea what "nospu" is. I did some searches for PowerPC and
> > spu vs nospu and the only thing I found was a stackoverflow asking about
> > it, and the response was "don't worry about it".
> >
> > I only picked nospu because I noticed that rseq_slice_yield used it, but
> > that doesn't give me a good feeling that it's correct.
>
> AFAIR PowerPC SPUs are the Cell "Synergistic Processing Units", e.g.
> for Playstation 3.
Exactly. System call numbers are the same for normal (PPU = PowerPC)
CPUs and SPUs, but SPUs don't need all syscalls. Hence the option to
exclude a syscall for the SPUs by marking it "nospu".
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
^ permalink raw reply
* [PATCH] tracepoint: add lockdep rcu_is_watching() check to trace_call__##name()
From: David Carlier @ 2026-04-30 5:35 UTC (permalink / raw)
To: linux-trace-kernel
Cc: linux-kernel, David Carlier, Vineeth Pillai (Google),
Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
Steven Rostedt
trace_call__##name() was added by commit 677a3d82b640 ("tracepoint: Add
trace_call__##name() API") to let callers that already gate on
trace_##name##_enabled() skip the redundant static_branch_unlikely()
re-evaluation. The changelog states that the new helper retains the
LOCKDEP RCU-watching assertion carried by trace_##name(), but the
WARN_ONCE() was omitted from both definitions.
Without it, sites converted from trace_foo(args) to trace_call__foo(args)
lose the WARN_ONCE(!rcu_is_watching()) coverage under CONFIG_LOCKDEP when
the tracepoint is enabled - the case that commit c2679254b9c9 ("tracing:
Make tracepoint lockdep check actually test something") added the warning
for.
Mirror the same block in both trace_call__##name() bodies, gated by
(cond) in __DECLARE_TRACE to match its trace_##name() and ungated in
__DECLARE_TRACE_SYSCALL.
Fixes: 677a3d82b640 ("tracepoint: Add trace_call__##name() API")
Cc: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: David Carlier <devnexen@gmail.com>
---
include/linux/tracepoint.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 578e520b6ee6..048e0035d4fa 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -318,6 +318,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
static inline void trace_call__##name(proto) \
{ \
__do_trace_##name(args); \
+ if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
+ WARN_ONCE(!rcu_is_watching(), \
+ "RCU not watching for tracepoint"); \
+ } \
}
#define __DECLARE_TRACE_SYSCALL(name, proto, args, data_proto) \
@@ -342,6 +346,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
{ \
might_fault(); \
__do_trace_##name(args); \
+ if (IS_ENABLED(CONFIG_LOCKDEP)) { \
+ WARN_ONCE(!rcu_is_watching(), \
+ "RCU not watching for tracepoint"); \
+ } \
}
/*
--
2.53.0
^ permalink raw reply related
* Re: [PATCH v6] tracing: Bound synthetic-field strings with seq_buf
From: Masami Hiramatsu @ 2026-04-30 5:14 UTC (permalink / raw)
To: Pengpeng Hou
Cc: Steven Rostedt, Mathieu Desnoyers, Tom Zanussi,
linux-trace-kernel, linux-kernel
In-Reply-To: <20260430043350.57928-1-pengpeng@iscas.ac.cn>
On Thu, 30 Apr 2026 12:33:50 +0800
Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:
> The synthetic field helpers build a prefixed synthetic variable name and
> a generated hist command in fixed MAX_FILTER_STR_VAL buffers. The
> current code appends those strings with raw strcat(), so long key lists,
> field names, or saved filters can run past the end of the staging
> buffers.
>
> Build both strings with seq_buf and propagate -E2BIG if either the
> synthetic variable name or the generated command exceeds
> MAX_FILTER_STR_VAL. This keeps the existing tracing-side limit while
> using the helper intended for bounded command construction.
>
Good catch! This looks good to me.
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thank you,
(BTW, we need to use __free(kfree) in this file too.)
> Fixes: 02205a6752f2 ("tracing: Add support for 'field variables'")
> Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
> ---
> Changes since v5: https://lore.kernel.org/all/20260424070104.1-tracing-synth-v5-pengpeng@iscas.ac.cn/
> - start a new thread for the new patch revision
> - use a lore link for the previous version in the changelog
> - simplify the synthetic-name construction with seq_buf_printf()
> - keep saved_filter as a normal local variable and avoid an anonymous block
>
> kernel/trace/trace_events_hist.c | 41 ++++++++++++++++++++++----------
> 1 file changed, 29 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
> index 0dbbf6cca9bc..aa8e7f043ac0 100644
> --- a/kernel/trace/trace_events_hist.c
> +++ b/kernel/trace/trace_events_hist.c
> @@ -8,6 +8,7 @@
> #include <linux/module.h>
> #include <linux/kallsyms.h>
> #include <linux/security.h>
> +#include <linux/seq_buf.h>
> #include <linux/mutex.h>
> #include <linux/slab.h>
> #include <linux/stacktrace.h>
> @@ -2968,14 +2969,23 @@ find_synthetic_field_var(struct hist_trigger_data *target_hist_data,
> char *system, char *event_name, char *field_name)
> {
> struct hist_field *event_var;
> + struct seq_buf s;
> char *synthetic_name;
>
> synthetic_name = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
> if (!synthetic_name)
> return ERR_PTR(-ENOMEM);
>
> - strcpy(synthetic_name, "synthetic_");
> - strcat(synthetic_name, field_name);
> + seq_buf_init(&s, synthetic_name, MAX_FILTER_STR_VAL);
> + seq_buf_printf(&s, "synthetic_%s", field_name);
> +
> + /* Terminate synthetic_name with a NUL. */
> + seq_buf_str(&s);
> +
> + if (seq_buf_has_overflowed(&s)) {
> + kfree(synthetic_name);
> + return ERR_PTR(-E2BIG);
> + }
>
> event_var = find_event_var(target_hist_data, system, event_name, synthetic_name);
>
> @@ -3020,6 +3030,7 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
> struct trace_event_file *file;
> struct hist_field *key_field;
> struct hist_field *event_var;
> + struct seq_buf s;
> char *saved_filter;
> char *cmd;
> int ret;
> @@ -3065,28 +3076,34 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
> return ERR_PTR(-ENOMEM);
> }
>
> + seq_buf_init(&s, cmd, MAX_FILTER_STR_VAL);
> +
> /* Use the same keys as the compatible histogram */
> - strcat(cmd, "keys=");
> + seq_buf_puts(&s, "keys=");
>
> for_each_hist_key_field(i, hist_data) {
> key_field = hist_data->fields[i];
> if (!first)
> - strcat(cmd, ",");
> - strcat(cmd, key_field->field->name);
> + seq_buf_putc(&s, ',');
> + seq_buf_puts(&s, key_field->field->name);
> first = false;
> }
>
> /* Create the synthetic field variable specification */
> - strcat(cmd, ":synthetic_");
> - strcat(cmd, field_name);
> - strcat(cmd, "=");
> - strcat(cmd, field_name);
> + seq_buf_printf(&s, ":synthetic_%s=%s", field_name, field_name);
>
> /* Use the same filter as the compatible histogram */
> saved_filter = find_trigger_filter(hist_data, file);
> - if (saved_filter) {
> - strcat(cmd, " if ");
> - strcat(cmd, saved_filter);
> + if (saved_filter)
> + seq_buf_printf(&s, " if %s", saved_filter);
> +
> + /* Terminate cmd with a NUL. */
> + seq_buf_str(&s);
> +
> + if (seq_buf_has_overflowed(&s)) {
> + kfree(cmd);
> + kfree(var_hist);
> + return ERR_PTR(-E2BIG);
> }
>
> var_hist->cmd = kstrdup(cmd, GFP_KERNEL);
> --
> 2.50.1 (Apple Git-155)
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH v6] tracing: Bound synthetic-field strings with seq_buf
From: Pengpeng Hou @ 2026-04-30 4:33 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, Tom Zanussi, linux-trace-kernel, linux-kernel,
Pengpeng Hou
The synthetic field helpers build a prefixed synthetic variable name and
a generated hist command in fixed MAX_FILTER_STR_VAL buffers. The
current code appends those strings with raw strcat(), so long key lists,
field names, or saved filters can run past the end of the staging
buffers.
Build both strings with seq_buf and propagate -E2BIG if either the
synthetic variable name or the generated command exceeds
MAX_FILTER_STR_VAL. This keeps the existing tracing-side limit while
using the helper intended for bounded command construction.
Fixes: 02205a6752f2 ("tracing: Add support for 'field variables'")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
Changes since v5: https://lore.kernel.org/all/20260424070104.1-tracing-synth-v5-pengpeng@iscas.ac.cn/
- start a new thread for the new patch revision
- use a lore link for the previous version in the changelog
- simplify the synthetic-name construction with seq_buf_printf()
- keep saved_filter as a normal local variable and avoid an anonymous block
kernel/trace/trace_events_hist.c | 41 ++++++++++++++++++++++----------
1 file changed, 29 insertions(+), 12 deletions(-)
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 0dbbf6cca9bc..aa8e7f043ac0 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -8,6 +8,7 @@
#include <linux/module.h>
#include <linux/kallsyms.h>
#include <linux/security.h>
+#include <linux/seq_buf.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/stacktrace.h>
@@ -2968,14 +2969,23 @@ find_synthetic_field_var(struct hist_trigger_data *target_hist_data,
char *system, char *event_name, char *field_name)
{
struct hist_field *event_var;
+ struct seq_buf s;
char *synthetic_name;
synthetic_name = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
if (!synthetic_name)
return ERR_PTR(-ENOMEM);
- strcpy(synthetic_name, "synthetic_");
- strcat(synthetic_name, field_name);
+ seq_buf_init(&s, synthetic_name, MAX_FILTER_STR_VAL);
+ seq_buf_printf(&s, "synthetic_%s", field_name);
+
+ /* Terminate synthetic_name with a NUL. */
+ seq_buf_str(&s);
+
+ if (seq_buf_has_overflowed(&s)) {
+ kfree(synthetic_name);
+ return ERR_PTR(-E2BIG);
+ }
event_var = find_event_var(target_hist_data, system, event_name, synthetic_name);
@@ -3020,6 +3030,7 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
struct trace_event_file *file;
struct hist_field *key_field;
struct hist_field *event_var;
+ struct seq_buf s;
char *saved_filter;
char *cmd;
int ret;
@@ -3065,28 +3076,34 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
return ERR_PTR(-ENOMEM);
}
+ seq_buf_init(&s, cmd, MAX_FILTER_STR_VAL);
+
/* Use the same keys as the compatible histogram */
- strcat(cmd, "keys=");
+ seq_buf_puts(&s, "keys=");
for_each_hist_key_field(i, hist_data) {
key_field = hist_data->fields[i];
if (!first)
- strcat(cmd, ",");
- strcat(cmd, key_field->field->name);
+ seq_buf_putc(&s, ',');
+ seq_buf_puts(&s, key_field->field->name);
first = false;
}
/* Create the synthetic field variable specification */
- strcat(cmd, ":synthetic_");
- strcat(cmd, field_name);
- strcat(cmd, "=");
- strcat(cmd, field_name);
+ seq_buf_printf(&s, ":synthetic_%s=%s", field_name, field_name);
/* Use the same filter as the compatible histogram */
saved_filter = find_trigger_filter(hist_data, file);
- if (saved_filter) {
- strcat(cmd, " if ");
- strcat(cmd, saved_filter);
+ if (saved_filter)
+ seq_buf_printf(&s, " if %s", saved_filter);
+
+ /* Terminate cmd with a NUL. */
+ seq_buf_str(&s);
+
+ if (seq_buf_has_overflowed(&s)) {
+ kfree(cmd);
+ kfree(var_hist);
+ return ERR_PTR(-E2BIG);
}
var_hist->cmd = kstrdup(cmd, GFP_KERNEL);
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* [PATCH v19 7/7] ring-buffer: Cleanup buffer_data_page related code
From: Masami Hiramatsu (Google) @ 2026-04-30 3:29 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177751968499.2136606.17388366710182662849.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Code cleanup related to buffer_data_page for readability,
which includes:
- Introduce rb_data_page_commit() and rb_data_page_size()
- Use 'dpage' for buffer_data_page, instead of 'bpage' because
'bpage' is used for buffer_page.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/ring_buffer.c | 112 ++++++++++++++++++++++++--------------------
1 file changed, 60 insertions(+), 52 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 236fee6bc8f6..afbea7d918e4 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -364,21 +364,30 @@ struct buffer_page {
#define RB_WRITE_MASK 0xfffff
#define RB_WRITE_INTCNT (1 << 20)
-static void rb_init_page(struct buffer_data_page *bpage)
+static void rb_init_data_page(struct buffer_data_page *bpage)
{
local_set(&bpage->commit, 0);
bpage->time_stamp = 0;
}
+static __always_inline long rb_data_page_commit(struct buffer_data_page *dpage)
+{
+ return local_read(&dpage->commit);
+}
+
+static __always_inline long rb_data_page_size(struct buffer_data_page *dpage)
+{
+ return rb_data_page_commit(dpage) & ~RB_MISSED_MASK;
+}
+
static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
{
- return local_read(&bpage->page->commit);
+ return rb_data_page_commit(bpage->page);
}
-/* Size is determined by what has been committed */
static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
{
- return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+ return rb_data_page_size(bpage->page);
}
static void free_buffer_page(struct buffer_page *bpage)
@@ -419,7 +428,7 @@ static struct buffer_data_page *alloc_cpu_data(int cpu, int order)
return NULL;
dpage = page_address(page);
- rb_init_page(dpage);
+ rb_init_data_page(dpage);
return dpage;
}
@@ -659,7 +668,7 @@ static void verify_event(struct ring_buffer_per_cpu *cpu_buffer,
do {
if (page == tail_page || WARN_ON_ONCE(stop++ > 100))
done = true;
- commit = local_read(&page->page->commit);
+ commit = rb_page_commit(page);
write = local_read(&page->write);
if (addr >= (unsigned long)&page->page->data[commit] &&
addr < (unsigned long)&page->page->data[write])
@@ -1906,7 +1915,7 @@ static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
* Even after clearing these bits, a commit value greater than the
* subbuf_size is considered invalid.
*/
- tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ tail = rb_data_page_size(dpage);
if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
else
@@ -2138,12 +2147,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Reset the reader page */
local_set(&cpu_buffer->reader_page->entries, 0);
- rb_init_page(cpu_buffer->reader_page->page);
+ rb_init_data_page(cpu_buffer->reader_page->page);
/* Reset all the subbuffers */
for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
local_set(&head_page->entries, 0);
- rb_init_page(head_page->page);
+ rb_init_data_page(head_page->page);
}
}
@@ -2203,7 +2212,7 @@ static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages, int sc
*/
for (i = 0; i < meta->nr_subbufs; i++) {
meta->buffers[i] = i;
- rb_init_page(subbuf);
+ rb_init_data_page(subbuf);
subbuf += meta->subbuf_size;
}
}
@@ -2255,7 +2264,7 @@ static int rbm_show(struct seq_file *m, void *v)
val -= 2;
dpage = rb_range_buffer(cpu_buffer, val);
seq_printf(m, "buffer[%ld]: %d (commit: %ld)\n",
- val, meta->buffers[val], dpage ? local_read(&dpage->commit) : -1);
+ val, meta->buffers[val], dpage ? rb_data_page_commit(dpage) : -1);
return 0;
}
@@ -2646,7 +2655,7 @@ static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
dpage = (void *)(ptr + idx * subbuf_size);
/* Skip unused pages */
- if (!local_read(&dpage->commit))
+ if (!rb_data_page_commit(dpage))
continue;
/*
@@ -2658,7 +2667,7 @@ static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
invalid++;
} else {
/* Count total commit bytes. */
- entry_bytes += local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ entry_bytes += rb_data_page_size(dpage);
}
}
@@ -4187,8 +4196,7 @@ rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
local_set(&cpu_buffer->commit_page->page->commit,
rb_page_write(cpu_buffer->commit_page));
RB_WARN_ON(cpu_buffer,
- local_read(&cpu_buffer->commit_page->page->commit) &
- ~RB_WRITE_MASK);
+ rb_page_commit(cpu_buffer->commit_page) & ~RB_WRITE_MASK);
barrier();
}
@@ -4560,7 +4568,7 @@ static const char *show_interrupt_level(void)
return show_irq_str(level);
}
-static void dump_buffer_page(struct buffer_data_page *bpage,
+static void dump_buffer_page(struct buffer_data_page *dpage,
struct rb_event_info *info,
unsigned long tail)
{
@@ -4568,12 +4576,12 @@ static void dump_buffer_page(struct buffer_data_page *bpage,
u64 ts, delta;
int e;
- ts = bpage->time_stamp;
+ ts = dpage->time_stamp;
pr_warn(" [%lld] PAGE TIME STAMP\n", ts);
for (e = 0; e < tail; e += rb_event_length(event)) {
- event = (struct ring_buffer_event *)(bpage->data + e);
+ event = (struct ring_buffer_event *)(dpage->data + e);
switch (event->type_len) {
@@ -4623,7 +4631,7 @@ static atomic_t ts_dump;
} \
atomic_inc(&cpu_buffer->record_disabled); \
pr_warn(fmt, ##__VA_ARGS__); \
- dump_buffer_page(bpage, info, tail); \
+ dump_buffer_page(dpage, info, tail); \
atomic_dec(&ts_dump); \
/* There's some cases in boot up that this can happen */ \
if (WARN_ON_ONCE(system_state != SYSTEM_BOOTING)) \
@@ -4639,16 +4647,16 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
struct rb_event_info *info,
unsigned long tail)
{
- struct buffer_data_page *bpage;
+ struct buffer_data_page *dpage;
u64 ts, delta;
bool full = false;
int ret;
- bpage = info->tail_page->page;
+ dpage = info->tail_page->page;
if (tail == CHECK_FULL_PAGE) {
full = true;
- tail = local_read(&bpage->commit);
+ tail = rb_data_page_commit(dpage);
} else if (info->add_timestamp &
(RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE)) {
/* Ignore events with absolute time stamps */
@@ -4659,7 +4667,7 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
* Do not check the first event (skip possible extends too).
* Also do not check if previous events have not been committed.
*/
- if (tail <= 8 || tail > local_read(&bpage->commit))
+ if (tail <= 8 || tail > rb_data_page_commit(dpage))
return;
/*
@@ -4668,7 +4676,7 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
if (atomic_inc_return(this_cpu_ptr(&checking)) != 1)
goto out;
- ret = rb_read_data_buffer(bpage, tail, cpu_buffer->cpu, &ts, &delta);
+ ret = rb_read_data_buffer(dpage, tail, cpu_buffer->cpu, &ts, &delta);
if (ret < 0) {
if (delta < ts) {
buffer_warn_return("[CPU: %d]ABSOLUTE TIME WENT BACKWARDS: last ts: %lld absolute ts: %lld clock:%pS\n",
@@ -6459,7 +6467,7 @@ static void rb_clear_buffer_page(struct buffer_page *page)
{
local_set(&page->write, 0);
local_set(&page->entries, 0);
- rb_init_page(page->page);
+ rb_init_data_page(page->page);
page->read = 0;
}
@@ -6944,7 +6952,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
local_irq_restore(flags);
if (bpage->data) {
- rb_init_page(bpage->data);
+ rb_init_data_page(bpage->data);
} else {
bpage->data = alloc_cpu_data(cpu, cpu_buffer->buffer->subbuf_order);
if (!bpage->data) {
@@ -6969,8 +6977,8 @@ void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
struct buffer_data_read_page *data_page)
{
struct ring_buffer_per_cpu *cpu_buffer;
- struct buffer_data_page *bpage = data_page->data;
- struct page *page = virt_to_page(bpage);
+ struct buffer_data_page *dpage = data_page->data;
+ struct page *page = virt_to_page(dpage);
unsigned long flags;
if (!buffer || !buffer->buffers || !buffer->buffers[cpu])
@@ -6990,15 +6998,15 @@ void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
arch_spin_lock(&cpu_buffer->lock);
if (!cpu_buffer->free_page) {
- cpu_buffer->free_page = bpage;
- bpage = NULL;
+ cpu_buffer->free_page = dpage;
+ dpage = NULL;
}
arch_spin_unlock(&cpu_buffer->lock);
local_irq_restore(flags);
out:
- free_pages((unsigned long)bpage, data_page->order);
+ free_pages((unsigned long)dpage, data_page->order);
kfree(data_page);
}
EXPORT_SYMBOL_GPL(ring_buffer_free_read_page);
@@ -7043,7 +7051,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
{
struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
struct ring_buffer_event *event;
- struct buffer_data_page *bpage;
+ struct buffer_data_page *dpage;
struct buffer_page *reader;
unsigned long missed_events;
unsigned int commit;
@@ -7069,8 +7077,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
if (data_page->order != buffer->subbuf_order)
return -1;
- bpage = data_page->data;
- if (!bpage)
+ dpage = data_page->data;
+ if (!dpage)
return -1;
guard(raw_spinlock_irqsave)(&cpu_buffer->reader_lock);
@@ -7136,7 +7144,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
* We have already ensured there's enough space if this
* is a time extend. */
size = rb_event_length(event);
- memcpy(bpage->data + pos, rpage->data + rpos, size);
+ memcpy(dpage->data + pos, rpage->data + rpos, size);
len -= size;
@@ -7152,9 +7160,9 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
size = rb_event_ts_length(event);
} while (len >= size);
- /* update bpage */
- local_set(&bpage->commit, pos);
- bpage->time_stamp = save_timestamp;
+ /* update dpage */
+ local_set(&dpage->commit, pos);
+ dpage->time_stamp = save_timestamp;
/* we copied everything to the beginning */
read = 0;
@@ -7164,13 +7172,13 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
cpu_buffer->read_bytes += rb_page_size(reader);
/* swap the pages */
- rb_init_page(bpage);
- bpage = reader->page;
+ rb_init_data_page(dpage);
+ dpage = reader->page;
reader->page = data_page->data;
local_set(&reader->write, 0);
local_set(&reader->entries, 0);
reader->read = 0;
- data_page->data = bpage;
+ data_page->data = dpage;
/*
* Use the real_end for the data size,
@@ -7178,12 +7186,12 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
* on the page.
*/
if (reader->real_end)
- local_set(&bpage->commit, reader->real_end);
+ local_set(&dpage->commit, reader->real_end);
}
cpu_buffer->lost_events = 0;
- commit = local_read(&bpage->commit);
+ commit = rb_data_page_commit(dpage);
/*
* Set a flag in the commit field if we lost events
*/
@@ -7192,19 +7200,19 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
* missed events, then record it there.
*/
if (buffer->subbuf_size - commit >= sizeof(missed_events)) {
- memcpy(&bpage->data[commit], &missed_events,
+ memcpy(&dpage->data[commit], &missed_events,
sizeof(missed_events));
- local_add(RB_MISSED_STORED, &bpage->commit);
+ local_add(RB_MISSED_STORED, &dpage->commit);
commit += sizeof(missed_events);
}
- local_add(RB_MISSED_EVENTS, &bpage->commit);
+ local_add(RB_MISSED_EVENTS, &dpage->commit);
}
/*
* This page may be off to user land. Zero it out here.
*/
if (commit < buffer->subbuf_size)
- memset(&bpage->data[commit], 0, buffer->subbuf_size - commit);
+ memset(&dpage->data[commit], 0, buffer->subbuf_size - commit);
return read;
}
@@ -7835,7 +7843,7 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
if (missed_events) {
if (cpu_buffer->reader_page != cpu_buffer->commit_page) {
- struct buffer_data_page *bpage = reader->page;
+ struct buffer_data_page *dpage = reader->page;
unsigned int commit;
/*
* Use the real_end for the data size,
@@ -7843,18 +7851,18 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
* on the page.
*/
if (reader->real_end)
- local_set(&bpage->commit, reader->real_end);
+ local_set(&dpage->commit, reader->real_end);
/*
* If there is room at the end of the page to save the
* missed events, then record it there.
*/
commit = rb_page_size(reader);
if (buffer->subbuf_size - commit >= sizeof(missed_events)) {
- memcpy(&bpage->data[commit], &missed_events,
+ memcpy(&dpage->data[commit], &missed_events,
sizeof(missed_events));
- local_add(RB_MISSED_STORED, &bpage->commit);
+ local_add(RB_MISSED_STORED, &dpage->commit);
}
- local_add(RB_MISSED_EVENTS, &bpage->commit);
+ local_add(RB_MISSED_EVENTS, &dpage->commit);
} else if (!WARN_ONCE(cpu_buffer->reader_page == cpu_buffer->tail_page,
"Reader on commit with %ld missed events",
missed_events)) {
^ permalink raw reply related
* [PATCH v19 6/7] ring-buffer: Cleanup persistent ring buffer validation
From: Masami Hiramatsu (Google) @ 2026-04-30 3:28 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177751968499.2136606.17388366710182662849.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cleanup rb_meta_validate_events() function to make it easier to read.
This includes the following cleanups:
- Introduce rb_validatation_state to hold working variables in
validation.
- Move repleated validation state updates into rb_validate_buffer().
- Move reader_page injection code outside of rb_meta_validate_events().
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v19:
- Add a kerneldoc about rb_validate_buffer().
---
kernel/trace/ring_buffer.c | 198 ++++++++++++++++++++++++--------------------
1 file changed, 107 insertions(+), 91 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2654315c1b7c..236fee6bc8f6 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1883,8 +1883,16 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
- struct ring_buffer_cpu_meta *meta, u64 prev_ts, u64 next_ts)
+struct rb_validation_state {
+ unsigned long entries;
+ unsigned long entry_bytes;
+ int discarded;
+ u64 ts;
+};
+
+static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
+ struct ring_buffer_cpu_meta *meta,
+ u64 prev_ts, u64 next_ts)
{
struct buffer_data_page *dpage = bpage->page;
unsigned long long ts;
@@ -1922,16 +1930,94 @@ static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
return ret;
}
+/**
+ * rb_validate_buffer - validates a single buffer page and updates the state.
+ * @bpage: buffer page to validate
+ * @cpu_buffer: cpu_buffer this page belongs to
+ * @meta: meta of the cpu_buffer
+ * @state: validation state
+ * @prev_ts: previous buffer's timestamp (optional)
+ * @next_ts: next buffer's timestamp (optional)
+ *
+ * If the page is invalid (wrong event length or timestamp), it increments the
+ * discarded counter and warns it. Otherwise, it updates the validation state.
+ */
+static void rb_validate_buffer(struct buffer_page *bpage,
+ struct ring_buffer_per_cpu *cpu_buffer,
+ struct ring_buffer_cpu_meta *meta,
+ struct rb_validation_state *state,
+ u64 prev_ts, u64 next_ts)
+{
+ int ret;
+
+ ret = __rb_validate_buffer(bpage, cpu_buffer->cpu, meta, prev_ts, next_ts);
+ if (ret < 0) {
+ if (!state->discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ state->discarded++;
+ } else {
+ /* If the buffer has content, update pages_touched */
+ if (ret)
+ local_inc(&cpu_buffer->pages_touched);
+
+ state->entries += ret;
+ state->entry_bytes += rb_page_size(bpage);
+ state->ts = bpage->page->time_stamp;
+ }
+}
+
+static void rb_meta_inject_reader_page(struct ring_buffer_per_cpu *cpu_buffer,
+ struct ring_buffer_cpu_meta *meta,
+ struct buffer_page *orig_head,
+ struct buffer_page *head_page)
+{
+ struct buffer_page *bpage = orig_head;
+ int i;
+
+ rb_dec_page(&bpage);
+ /*
+ * Insert the reader_page before the original head page.
+ * Since the list encode RB_PAGE flags, general list
+ * operations should be avoided.
+ */
+ cpu_buffer->reader_page->list.next = &orig_head->list;
+ cpu_buffer->reader_page->list.prev = orig_head->list.prev;
+ orig_head->list.prev = &cpu_buffer->reader_page->list;
+ bpage->list.next = &cpu_buffer->reader_page->list;
+
+ /* Make the head_page the reader page */
+ cpu_buffer->reader_page = head_page;
+ bpage = head_page;
+ rb_inc_page(&head_page);
+ head_page->list.prev = bpage->list.prev;
+ rb_dec_page(&bpage);
+ bpage->list.next = &head_page->list;
+ rb_set_list_to_head(&bpage->list);
+ cpu_buffer->pages = &head_page->list;
+
+ cpu_buffer->head_page = head_page;
+ meta->head_buffer = (unsigned long)head_page->page;
+
+ /* Reset all the indexes */
+ bpage = cpu_buffer->reader_page;
+ meta->buffers[0] = rb_meta_subbuf_idx(meta, bpage->page);
+ bpage->id = 0;
+
+ for (i = 1, bpage = head_page; i < meta->nr_subbufs;
+ i++, rb_inc_page(&bpage)) {
+ meta->buffers[i] = rb_meta_subbuf_idx(meta, bpage->page);
+ bpage->id = i;
+ }
+}
+
/* If the meta data has been validated, now validate the events */
static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
{
struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
struct buffer_page *head_page, *orig_head, *orig_reader;
- unsigned long entry_bytes = 0;
- unsigned long entries = 0;
- int discarded = 0;
+ struct rb_validation_state state = { 0 };
int ret;
- u64 ts;
int i;
if (!meta || !meta->head_buffer)
@@ -1941,25 +2027,16 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_reader = cpu_buffer->reader_page;
/* Do the head page first */
- ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, 0);
+ ret = __rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, 0);
if (ret < 0) {
pr_info("Ring buffer meta [%d] invalid head page detected\n",
cpu_buffer->cpu);
goto skip_rewind;
}
- ts = head_page->page->time_stamp;
+ state.ts = head_page->page->time_stamp;
/* Do the reader page - reader must be previous to head. */
- ret = rb_validate_buffer(orig_reader, cpu_buffer->cpu, meta, 0, ts);
- if (ret < 0) {
- pr_info("Ring buffer meta [%d] invalid reader page detected\n",
- cpu_buffer->cpu);
- discarded++;
- } else {
- entries += ret;
- entry_bytes += rb_page_size(orig_reader);
- ts = orig_reader->page->time_stamp;
- }
+ rb_validate_buffer(orig_reader, cpu_buffer, meta, &state, 0, state.ts);
/*
* Try to rewind the head so that we can read the pages which are already
@@ -1983,19 +2060,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
* Skip if the page is invalid, or its timestamp is newer than the
* previous valid page.
*/
- ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, ts);
- if (ret < 0) {
- if (!discarded)
- pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
- cpu_buffer->cpu);
- discarded++;
- } else {
- entries += ret;
- entry_bytes += rb_page_size(head_page);
- if (ret > 0)
- local_inc(&cpu_buffer->pages_touched);
- ts = head_page->page->time_stamp;
- }
+ rb_validate_buffer(head_page, cpu_buffer, meta, &state, 0, state.ts);
}
if (i)
pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2009,43 +2074,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
* into the location just before the original head page.
*/
if (head_page != orig_head) {
- struct buffer_page *bpage = orig_head;
-
- rb_dec_page(&bpage);
- /*
- * Insert the reader_page before the original head page.
- * Since the list encode RB_PAGE flags, general list
- * operations should be avoided.
- */
- cpu_buffer->reader_page->list.next = &orig_head->list;
- cpu_buffer->reader_page->list.prev = orig_head->list.prev;
- orig_head->list.prev = &cpu_buffer->reader_page->list;
- bpage->list.next = &cpu_buffer->reader_page->list;
-
- /* Make the head_page the reader page */
- cpu_buffer->reader_page = head_page;
- bpage = head_page;
- rb_inc_page(&head_page);
- head_page->list.prev = bpage->list.prev;
- rb_dec_page(&bpage);
- bpage->list.next = &head_page->list;
- rb_set_list_to_head(&bpage->list);
- cpu_buffer->pages = &head_page->list;
-
- cpu_buffer->head_page = head_page;
- meta->head_buffer = (unsigned long)head_page->page;
-
- /* Reset all the indexes */
- bpage = cpu_buffer->reader_page;
- meta->buffers[0] = rb_meta_subbuf_idx(meta, bpage->page);
- bpage->id = 0;
-
- for (i = 1, bpage = head_page; i < meta->nr_subbufs;
- i++, rb_inc_page(&bpage)) {
- meta->buffers[i] = rb_meta_subbuf_idx(meta, bpage->page);
- bpage->id = i;
- }
-
+ rb_meta_inject_reader_page(cpu_buffer, meta, orig_head, head_page);
/* We'll restart verifying from orig_head */
head_page = orig_head;
}
@@ -2057,7 +2086,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Nothing more to do, the only page is the reader page */
goto done;
}
- ts = head_page->page->time_stamp;
+ state.ts = head_page->page->time_stamp;
/* Iterate until finding the commit page */
for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
@@ -2066,21 +2095,8 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == orig_reader)
continue;
- ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, ts, 0);
- if (ret < 0) {
- if (!discarded)
- pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
- cpu_buffer->cpu);
- discarded++;
- } else {
- /* If the buffer has content, update pages_touched */
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
+ rb_validate_buffer(head_page, cpu_buffer, meta, &state, state.ts, 0);
- entries += ret;
- entry_bytes += rb_page_size(head_page);
- ts = head_page->page->time_stamp;
- }
if (head_page == cpu_buffer->commit_page)
break;
}
@@ -2091,25 +2107,25 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
goto invalid;
}
done:
- local_set(&cpu_buffer->entries, entries);
- local_set(&cpu_buffer->entries_bytes, entry_bytes);
+ local_set(&cpu_buffer->entries, state.entries);
+ local_set(&cpu_buffer->entries_bytes, state.entry_bytes);
pr_info("Ring buffer meta [%d] is from previous boot!", cpu_buffer->cpu);
- if (discarded)
- pr_cont(" (%d pages discarded)", discarded);
+ if (state.discarded)
+ pr_cont(" (%d pages discarded)", state.discarded);
pr_cont("\n");
#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
if (meta->nr_invalid)
pr_warn("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
cpu_buffer->cpu,
- (discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
- discarded, meta->nr_invalid);
+ (state.discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+ state.discarded, meta->nr_invalid);
if (meta->entry_bytes)
pr_warn("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
cpu_buffer->cpu,
- (entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
- (long)entry_bytes, (long)meta->entry_bytes);
+ (state.entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+ (long)state.entry_bytes, (long)meta->entry_bytes);
meta->nr_invalid = 0;
meta->entry_bytes = 0;
#endif
^ permalink raw reply related
* [PATCH v19 5/7] ring-buffer: Show commit numbers in buffer_meta file
From: Masami Hiramatsu (Google) @ 2026-04-30 3:28 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177751968499.2136606.17388366710182662849.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
In addition to the index number, show the commit numbers of
each data page in the per_cpu buffer_meta file.
This is useful for understanding the current status of the
persistent ring buffer. (Note that this file is shown
only for persistent ring buffer and its backup instance)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v17:
- Added NULL check for dpage in rbm_show in ring_buffer.c.
Changes in v16:
- update description.
---
kernel/trace/ring_buffer.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c7cd93b66e0a..2654315c1b7c 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2224,6 +2224,7 @@ static int rbm_show(struct seq_file *m, void *v)
struct ring_buffer_per_cpu *cpu_buffer = m->private;
struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
unsigned long val = (unsigned long)v;
+ struct buffer_data_page *dpage;
if (val == 1) {
seq_printf(m, "head_buffer: %d\n",
@@ -2236,7 +2237,9 @@ static int rbm_show(struct seq_file *m, void *v)
}
val -= 2;
- seq_printf(m, "buffer[%ld]: %d\n", val, meta->buffers[val]);
+ dpage = rb_range_buffer(cpu_buffer, val);
+ seq_printf(m, "buffer[%ld]: %d (commit: %ld)\n",
+ val, meta->buffers[val], dpage ? local_read(&dpage->commit) : -1);
return 0;
}
^ permalink raw reply related
* [PATCH v19 4/7] ring-buffer: Add persistent ring buffer invalid-page inject test
From: Masami Hiramatsu (Google) @ 2026-04-30 3:28 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177751968499.2136606.17388366710182662849.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a self-corrupting test for the persistent ring buffer.
This will inject an erroneous value to some sub-buffer pages (where
the index is even or multiples of 5) in the persistent ring buffer
when the kernel panics, and checks whether the number of detected
invalid pages and the total entry_bytes are the same as the recorded
values after reboot.
This ensures that the kernel can correctly recover a partially
corrupted persistent ring buffer after a reboot or panic.
The test only runs on the persistent ring buffer whose name is
"ptracingtest". The user has to fill it with events before a
kernel panic.
To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT
and add the following kernel cmdline:
reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
panic=1
Run the following commands after the 1st boot:
cd /sys/kernel/tracing/instances/ptracingtest
echo 1 > tracing_on
echo 1 > events/enable
sleep 3
echo c > /proc/sysrq-trigger
After panic message, the kernel will reboot and run the verification
on the persistent ring buffer, e.g.
Ring buffer meta [2] invalid buffer page detected
Ring buffer meta [2] is from previous boot! (318 pages discarded)
Ring buffer testing [2] invalid pages: PASSED (318/318)
Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v18:
- Fix to mask RB_MISSED_* flags when counting entry_bytes.
Changes in v17:
- In rb_test_inject_invalid_pages(), changed entry_bytes and
idx to unsigned long
- Added NULL checks for cpu_buffer and meta.
- In allocate_trace_buffer(), added a NULL check for tr->name
before comparing it with strcmp.
Changes in v16:
- Update description and comments according to review comments.
Changes in v15:
- Use pr_warn() for test result.
- Inject errors on the page index is multiples of 5 so that
this can reproduce contiguous empty pages.
Changes in v14:
- Rename config to CONFIG_RING_BUFFER_PERSISTENT_INJECT.
- Clear meta->nr_invalid/entry_bytes after testing.
- Add test commands in config comment.
Changes in v10:
- Add entry_bytes test.
- Do not compile test code if CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n.
Changes in v9:
- Test also reader pages.
---
include/linux/ring_buffer.h | 1 +
kernel/trace/Kconfig | 34 +++++++++++++++++++
kernel/trace/ring_buffer.c | 79 +++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.c | 4 ++
4 files changed, 118 insertions(+)
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 994f52b34344..0670742b2d60 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -238,6 +238,7 @@ int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
+ RB_FL_TESTING = 1 << 1,
};
#ifdef CONFIG_RING_BUFFER
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..084f34dc6c9f 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1202,6 +1202,40 @@ config RING_BUFFER_VALIDATE_TIME_DELTAS
Only say Y if you understand what this does, and you
still want it enabled. Otherwise say N
+config RING_BUFFER_PERSISTENT_INJECT
+ bool "Enable persistent ring buffer error injection test"
+ depends on RING_BUFFER
+ help
+ This option will have the kernel check if the persistent ring
+ buffer is named "ptracingtest". and if so, it will corrupt some
+ of its pages on a kernel panic. This is used to test if the
+ persistent ring buffer can recover from some of its sub-buffers
+ being corrupted.
+ To use this, boot a kernel with a "ptracingtest" persistent
+ ring buffer, e.g.
+
+ reserve_mem=20M:2M:trace trace_instance=ptracingtest@trace panic=1
+
+ And after the 1st boot, run the following commands:
+
+ cd /sys/kernel/tracing/instances/ptracingtest
+ echo 1 > events/enable
+ echo 1 > tracing_on
+ sleep 3
+ echo c > /proc/sysrq-trigger
+
+ After the panic message, the kernel will reboot and will show
+ the test results in the console output.
+
+ Note that events for the test ring buffer needs to be enabled
+ prior to crashing the kernel so that the ring buffer has content
+ that the test will corrupt.
+ As the test will corrupt events in the "ptracingtest" persistent
+ ring buffer, it should not be used for any other purpose other
+ than this test.
+
+ If unsure, say N
+
config MMIOTRACE_TEST
tristate "Test module for mmiotrace"
depends on MMIOTRACE && m
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9a82eaa684c1..c7cd93b66e0a 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -64,6 +64,10 @@ struct ring_buffer_cpu_meta {
unsigned long commit_buffer;
__u32 subbuf_size;
__u32 nr_subbufs;
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+ __u32 nr_invalid;
+ __u32 entry_bytes;
+#endif
int buffers[];
};
@@ -2094,6 +2098,21 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (discarded)
pr_cont(" (%d pages discarded)", discarded);
pr_cont("\n");
+
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+ if (meta->nr_invalid)
+ pr_warn("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
+ cpu_buffer->cpu,
+ (discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+ discarded, meta->nr_invalid);
+ if (meta->entry_bytes)
+ pr_warn("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
+ cpu_buffer->cpu,
+ (entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+ (long)entry_bytes, (long)meta->entry_bytes);
+ meta->nr_invalid = 0;
+ meta->entry_bytes = 0;
+#endif
return;
invalid:
@@ -2574,12 +2593,72 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
+{
+ struct ring_buffer_per_cpu *cpu_buffer;
+ struct ring_buffer_cpu_meta *meta;
+ struct buffer_data_page *dpage;
+ unsigned long entry_bytes = 0;
+ unsigned long ptr;
+ int subbuf_size;
+ int invalid = 0;
+ int cpu;
+ int i;
+
+ if (!(buffer->flags & RB_FL_TESTING))
+ return;
+
+ guard(preempt)();
+ cpu = smp_processor_id();
+
+ cpu_buffer = buffer->buffers[cpu];
+ if (!cpu_buffer)
+ return;
+ meta = cpu_buffer->ring_meta;
+ if (!meta)
+ return;
+
+ ptr = (unsigned long)rb_subbufs_from_meta(meta);
+ subbuf_size = meta->subbuf_size;
+
+ for (i = 0; i < meta->nr_subbufs; i++) {
+ unsigned long idx = meta->buffers[i];
+
+ dpage = (void *)(ptr + idx * subbuf_size);
+ /* Skip unused pages */
+ if (!local_read(&dpage->commit))
+ continue;
+
+ /*
+ * Invalidate even pages or multiples of 5. This will cause 3
+ * contiguous invalidated(empty) pages.
+ */
+ if (!(i & 0x1) || !(i % 5)) {
+ local_add(subbuf_size + 1, &dpage->commit);
+ invalid++;
+ } else {
+ /* Count total commit bytes. */
+ entry_bytes += local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ }
+ }
+
+ pr_info("Inject invalidated %d pages on CPU%d, total size: %ld\n",
+ invalid, cpu, (long)entry_bytes);
+ meta->nr_invalid = invalid;
+ meta->entry_bytes = entry_bytes;
+}
+#else /* !CONFIG_RING_BUFFER_PERSISTENT_INJECT */
+#define rb_test_inject_invalid_pages(buffer) do { } while (0)
+#endif
+
/* Stop recording on a persistent buffer and flush cache if needed. */
static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
{
struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
ring_buffer_record_off(buffer);
+ rb_test_inject_invalid_pages(buffer);
arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
return NOTIFY_DONE;
}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..4573f65d68ce 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8383,6 +8383,8 @@ static void setup_trace_scratch(struct trace_array *tr,
memset(tscratch, 0, size);
}
+#define TRACE_TEST_PTRACING_NAME "ptracingtest"
+
int allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size)
{
enum ring_buffer_flags rb_flags;
@@ -8394,6 +8396,8 @@ int allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int
buf->tr = tr;
if (tr->range_addr_start && tr->range_addr_size) {
+ if (tr->name && !strcmp(tr->name, TRACE_TEST_PTRACING_NAME))
+ rb_flags |= RB_FL_TESTING;
/* Add scratch buffer to handle 128 modules */
buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
tr->range_addr_start,
^ permalink raw reply related
* [PATCH v19 3/7] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-04-30 3:28 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177751968499.2136606.17388366710182662849.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when rewinding the persistent ring buffer
instead of stopping the rewinding the ring buffer. The skipped
buffers are cleared.
To ensure the rewinding stops at the unused page, this also clears
buffer_data_page::time_stamp when tracing resets the buffer. This
allows us to identify unused pages and empty pages.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v19:
- Cleanup rb_validate_buffer() so that it is more readable,
and add a comment about the timestamp validation.
Changes in v18:
- Reset timestamp of reader_page when the entire cpu_buffer is
invalid.
- Minor update by new fix.
Changes in v17:
- Fix to verify head_page at first before using its timestamp.
- Reset timestamp if the page is invalid.
Changes in v12:
- Fix build error.
Changes in v11:
- Reset timestamp when the buffer is invalid.
- When rewinding, skip subbuf page if timestamp is wrong and
check timestamp after validating buffer data page.
Changes in v10:
- Newly added.
---
kernel/trace/ring_buffer.c | 102 +++++++++++++++++++++++++++-----------------
1 file changed, 63 insertions(+), 39 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2ac09b1abe9f..9a82eaa684c1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -363,6 +363,7 @@ struct buffer_page {
static void rb_init_page(struct buffer_data_page *bpage)
{
local_set(&bpage->commit, 0);
+ bpage->time_stamp = 0;
}
static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
@@ -1878,12 +1879,14 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
- struct ring_buffer_cpu_meta *meta)
+static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
+ struct ring_buffer_cpu_meta *meta, u64 prev_ts, u64 next_ts)
{
+ struct buffer_data_page *dpage = bpage->page;
unsigned long long ts;
unsigned long tail;
u64 delta;
+ int ret;
/*
* When a sub-buffer is recovered from a read, the commit value may
@@ -1892,9 +1895,27 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
* subbuf_size is considered invalid.
*/
tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
- if (tail > meta->subbuf_size - BUF_PAGE_HDR_SIZE)
- return -1;
- return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+ if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
+ ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+ else
+ ret = -1;
+
+ /*
+ * The timestamp must be greater than @prev_ts and smaller than @next_ts.
+ * Since this function works in both forward (verify) and reverse (unwind)
+ * loop, we don't know both @prev_ts and @next_ts at the same time.
+ * So use the known boundary as the boundary.
+ */
+ if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
+ local_set(&bpage->entries, 0);
+ local_set(&dpage->commit, 0);
+ dpage->time_stamp = prev_ts ? prev_ts : next_ts;
+ ret = -1;
+ } else {
+ local_set(&bpage->entries, ret);
+ }
+
+ return ret;
}
/* If the meta data has been validated, now validate the events */
@@ -1915,25 +1936,29 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_head = head_page = cpu_buffer->head_page;
orig_reader = cpu_buffer->reader_page;
- /* Do the reader page first */
- ret = rb_validate_buffer(orig_reader->page, cpu_buffer->cpu, meta);
+ /* Do the head page first */
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, 0);
+ if (ret < 0) {
+ pr_info("Ring buffer meta [%d] invalid head page detected\n",
+ cpu_buffer->cpu);
+ goto skip_rewind;
+ }
+ ts = head_page->page->time_stamp;
+
+ /* Do the reader page - reader must be previous to head. */
+ ret = rb_validate_buffer(orig_reader, cpu_buffer->cpu, meta, 0, ts);
if (ret < 0) {
pr_info("Ring buffer meta [%d] invalid reader page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&orig_reader->entries, 0);
- local_set(&orig_reader->page->commit, 0);
} else {
entries += ret;
entry_bytes += rb_page_size(orig_reader);
- local_set(&orig_reader->entries, ret);
+ ts = orig_reader->page->time_stamp;
}
- ts = head_page->page->time_stamp;
-
/*
- * Try to rewind the head so that we can read the pages which already
+ * Try to rewind the head so that we can read the pages which are already
* read in the previous boot.
*/
if (head_page == cpu_buffer->tail_page)
@@ -1946,26 +1971,27 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->tail_page)
break;
- /* Ensure the page has older data than head. */
- if (ts < head_page->page->time_stamp)
+ /* Rewind until unused page (no timestamp, no commit). */
+ if (!head_page->page->time_stamp && rb_page_commit(head_page) == 0)
break;
- ts = head_page->page->time_stamp;
- /* Ensure the page has correct timestamp and some data. */
- if (!ts || rb_page_commit(head_page) == 0)
- break;
-
- /* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
- if (ret < 0)
- break;
-
- /* Recover the number of entries and update stats. */
- local_set(&head_page->entries, ret);
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
- entries += ret;
- entry_bytes += rb_page_size(head_page);
+ /*
+ * Skip if the page is invalid, or its timestamp is newer than the
+ * previous valid page.
+ */
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, ts);
+ if (ret < 0) {
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ if (ret > 0)
+ local_inc(&cpu_buffer->pages_touched);
+ ts = head_page->page->time_stamp;
+ }
}
if (i)
pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2027,6 +2053,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Nothing more to do, the only page is the reader page */
goto done;
}
+ ts = head_page->page->time_stamp;
/* Iterate until finding the commit page */
for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
@@ -2035,15 +2062,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == orig_reader)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, ts, 0);
if (ret < 0) {
if (!discarded)
pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
} else {
/* If the buffer has content, update pages_touched */
if (ret)
@@ -2051,7 +2075,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
entries += ret;
entry_bytes += rb_page_size(head_page);
- local_set(&head_page->entries, ret);
+ ts = head_page->page->time_stamp;
}
if (head_page == cpu_buffer->commit_page)
break;
@@ -2079,12 +2103,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Reset the reader page */
local_set(&cpu_buffer->reader_page->entries, 0);
- local_set(&cpu_buffer->reader_page->page->commit, 0);
+ rb_init_page(cpu_buffer->reader_page->page);
/* Reset all the subbuffers */
for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
+ rb_init_page(head_page->page);
}
}
^ permalink raw reply related
* [PATCH v19 2/7] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-04-30 3:28 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177751968499.2136606.17388366710182662849.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when validating the persistent ring buffer
instead of discarding the entire ring buffer. Only skipped buffers
are invalidated (cleared).
If the cache data in memory fails to be synchronized during a reboot,
the persistent ring buffer may become partially corrupted, but other
sub-buffers may still contain readable event data. Only discard the
subbuffers that are found to be corrupted.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v19:
- Limit max number of loops to nr_pages in __rb_get_reader_page()
and update comment.
Changes in v18:
- Minor update by the new fix.
- Fix to substract BUF_PAGE_HDR_SIZE from meta->subbuf_size
to make the limit of commit size.
Changes in v17:
- Fix to use rb_page_size() of rewound pages for entry_bytes.
Changes in v15:
- Skip reader_page loop check on persistent ring buffer because
there can be contiguous empty(invalidated) pages.
- Do not show discarded page number information if it is 0.
Changes in v11:
- Fix a typo.
Changes in v9:
- Add meta->subbuf_size check.
- Fix a typo.
- Handle invalid reader_page case.
Changes in v8:
- Add comment in rb_valudate_buffer()
- Clear the RB_MISSED_* flags in rb_valudate_buffer() instead of
skipping subbuf.
- Remove unused subbuf local variable from rb_cpu_meta_valid().
Changes in v7:
- Combined with Handling RB_MISSED_* flags patch, focus on validation at boot.
- Remove checking subbuffer data when validating metadata, because it should be done
later.
- Do not mark the discarded sub buffer page but just reset it.
Changes in v6:
- Show invalid page detection message once per CPU.
Changes in v5:
- Instead of showing errors for each page, just show the number
of discarded pages at last.
Changes in v3:
- Record missed data event on commit.
---
kernel/trace/ring_buffer.c | 116 +++++++++++++++++++++++++++-----------------
1 file changed, 70 insertions(+), 46 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7288383b1f27..2ac09b1abe9f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -370,6 +370,12 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
return local_read(&bpage->page->commit);
}
+/* Size is determined by what has been committed */
+static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
+{
+ return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
/* Range pages are not to be freed */
@@ -1762,7 +1768,6 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
unsigned long *subbuf_mask)
{
int subbuf_size = PAGE_SIZE;
- struct buffer_data_page *subbuf;
unsigned long buffers_start;
unsigned long buffers_end;
int i;
@@ -1770,6 +1775,11 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
if (!subbuf_mask)
return false;
+ if (meta->subbuf_size != PAGE_SIZE) {
+ pr_info("Ring buffer boot meta [%d] invalid subbuf_size\n", cpu);
+ return false;
+ }
+
buffers_start = meta->first_buffer;
buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
@@ -1786,11 +1796,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- subbuf = rb_subbufs_from_meta(meta);
-
bitmap_clear(subbuf_mask, 0, meta->nr_subbufs);
- /* Is the meta buffers and the subbufs themselves have correct data? */
+ /*
+ * Ensure the meta::buffers array has correct data. The data in each subbufs
+ * are checked later in rb_meta_validate_events().
+ */
for (i = 0; i < meta->nr_subbufs; i++) {
if (meta->buffers[i] < 0 ||
meta->buffers[i] >= meta->nr_subbufs) {
@@ -1798,18 +1809,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
- pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu);
- return false;
- }
-
if (test_bit(meta->buffers[i], subbuf_mask)) {
pr_info("Ring buffer boot meta [%d] array has duplicates\n", cpu);
return false;
}
set_bit(meta->buffers[i], subbuf_mask);
- subbuf = (void *)subbuf + subbuf_size;
}
return true;
@@ -1873,13 +1878,22 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+ struct ring_buffer_cpu_meta *meta)
{
unsigned long long ts;
+ unsigned long tail;
u64 delta;
- int tail;
- tail = local_read(&dpage->commit);
+ /*
+ * When a sub-buffer is recovered from a read, the commit value may
+ * have RB_MISSED_* bits set, as these bits are reset on reuse.
+ * Even after clearing these bits, a commit value greater than the
+ * subbuf_size is considered invalid.
+ */
+ tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ if (tail > meta->subbuf_size - BUF_PAGE_HDR_SIZE)
+ return -1;
return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
}
@@ -1890,6 +1904,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
struct buffer_page *head_page, *orig_head, *orig_reader;
unsigned long entry_bytes = 0;
unsigned long entries = 0;
+ int discarded = 0;
int ret;
u64 ts;
int i;
@@ -1901,14 +1916,19 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_reader = cpu_buffer->reader_page;
/* Do the reader page first */
- ret = rb_validate_buffer(orig_reader->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(orig_reader->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer reader page is invalid\n");
- goto invalid;
+ pr_info("Ring buffer meta [%d] invalid reader page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&orig_reader->entries, 0);
+ local_set(&orig_reader->page->commit, 0);
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(orig_reader);
+ local_set(&orig_reader->entries, ret);
}
- entries += ret;
- entry_bytes += local_read(&orig_reader->page->commit);
- local_set(&orig_reader->entries, ret);
ts = head_page->page->time_stamp;
@@ -1936,7 +1956,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
break;
/* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0)
break;
@@ -1945,7 +1965,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (ret)
local_inc(&cpu_buffer->pages_touched);
entries += ret;
- entry_bytes += rb_page_commit(head_page);
+ entry_bytes += rb_page_size(head_page);
}
if (i)
pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2015,21 +2035,24 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == orig_reader)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer meta [%d] invalid buffer page\n",
- cpu_buffer->cpu);
- goto invalid;
- }
-
- /* If the buffer has content, update pages_touched */
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
-
- entries += ret;
- entry_bytes += local_read(&head_page->page->commit);
- local_set(&head_page->entries, ret);
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&head_page->entries, 0);
+ local_set(&head_page->page->commit, 0);
+ } else {
+ /* If the buffer has content, update pages_touched */
+ if (ret)
+ local_inc(&cpu_buffer->pages_touched);
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ local_set(&head_page->entries, ret);
+ }
if (head_page == cpu_buffer->commit_page)
break;
}
@@ -2043,7 +2066,10 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
local_set(&cpu_buffer->entries, entries);
local_set(&cpu_buffer->entries_bytes, entry_bytes);
- pr_info("Ring buffer meta [%d] is from previous boot!\n", cpu_buffer->cpu);
+ pr_info("Ring buffer meta [%d] is from previous boot!", cpu_buffer->cpu);
+ if (discarded)
+ pr_cont(" (%d pages discarded)", discarded);
+ pr_cont("\n");
return;
invalid:
@@ -3330,12 +3356,6 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
return NULL;
}
-/* Size is determined by what has been committed */
-static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
-{
- return rb_page_commit(bpage) & ~RB_MISSED_MASK;
-}
-
static __always_inline unsigned
rb_commit_index(struct ring_buffer_per_cpu *cpu_buffer)
{
@@ -5635,8 +5655,9 @@ __rb_get_reader_page_from_remote(struct ring_buffer_per_cpu *cpu_buffer)
static struct buffer_page *
__rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
{
- struct buffer_page *reader = NULL;
+ int max_loops = cpu_buffer->ring_meta ? cpu_buffer->nr_pages : 3;
unsigned long bsize = READ_ONCE(cpu_buffer->buffer->subbuf_size);
+ struct buffer_page *reader = NULL;
unsigned long overwrite;
unsigned long flags;
int nr_loops = 0;
@@ -5648,11 +5669,14 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
again:
/*
* This should normally only loop twice. But because the
- * start of the reader inserts an empty page, it causes
- * a case where we will loop three times. There should be no
- * reason to loop four times (that I know of).
+ * start of the reader inserts an empty page, it causes a
+ * case where we will loop three times. There should be no
+ * reason to loop four times unless the ring buffer is a
+ * recovered persistent ring buffer. For persistent ring buffers,
+ * invalid pages are reset during recovery, so there may be more
+ * than 3 contiguous pages can be empty, but less than nr_pages.
*/
- if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
+ if (RB_WARN_ON(cpu_buffer, ++nr_loops > max_loops)) {
reader = NULL;
goto out;
}
^ permalink raw reply related
* [PATCH v19 1/7] ring-buffer: Flush and stop persistent ring buffer on panic
From: Masami Hiramatsu (Google) @ 2026-04-30 3:28 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177751968499.2136606.17388366710182662849.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
On real hardware, panic and machine reboot may not flush hardware cache
to memory. This means the persistent ring buffer, which relies on a
coherent state of memory, may not have its events written to the buffer
and they may be lost. Moreover, there may be inconsistency with the
counters which are used for validation of the integrity of the
persistent ring buffer which may cause all data to be discarded.
To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.
Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
Changes in v13:
- Fix a rebase conflict.
Changes in v11:
- Do nothing by default since flush_cache_vmap() does nothing on x86
but it can cause deadlock on some architectures via on_each_cpu()
because other CPUs will be stoppped when panic notifier is called.
Changes in v9:
- Fix typo of & to &&.
- Fix typo of "Generic"
Changes in v6:
- Introduce asm/ring_buffer.h for arch_ring_buffer_flush_range().
- Use flush_cache_vmap() instead of flush_cache_all().
Changes in v5:
- Use ring_buffer_record_off() instead of ring_buffer_record_disable().
- Use flush_cache_all() to ensure flush all cache.
Changes in v3:
- update patch description.
---
arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/ring_buffer.h | 10 ++++++++++
arch/csky/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/loongarch/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/x86/include/asm/Kbuild | 1 +
arch/xtensa/include/asm/Kbuild | 1 +
include/asm-generic/ring_buffer.h | 13 +++++++++++++
kernel/trace/ring_buffer.c | 22 ++++++++++++++++++++++
23 files changed, 65 insertions(+)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 483965c5a4de..b154b4e3dfa8 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += agp.h
generic-y += asm-offsets.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index 4c69522e0328..483caacc6988 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -5,5 +5,6 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 03657ff8fbe3..decad5f2c826 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -3,6 +3,7 @@ generic-y += early_ioremap.h
generic-y += extable.h
generic-y += flat.h
generic-y += parport.h
+generic-y += ring_buffer.h
generated-y += mach-types.h
generated-y += unistd-nr.h
diff --git a/arch/arm64/include/asm/ring_buffer.h b/arch/arm64/include/asm/ring_buffer.h
new file mode 100644
index 000000000000..62316c406888
--- /dev/null
+++ b/arch/arm64/include/asm/ring_buffer.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_ARM64_RING_BUFFER_H
+#define _ASM_ARM64_RING_BUFFER_H
+
+#include <asm/cacheflush.h>
+
+/* Flush D-cache on persistent ring buffer */
+#define arch_ring_buffer_flush_range(start, end) dcache_clean_pop(start, end)
+
+#endif /* _ASM_ARM64_RING_BUFFER_H */
diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild
index 3a5c7f6e5aac..7dca0c6cdc84 100644
--- a/arch/csky/include/asm/Kbuild
+++ b/arch/csky/include/asm/Kbuild
@@ -9,6 +9,7 @@ generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
generic-y += text-patching.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index 1efa1e993d4b..0f887d4238ed 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += extable.h
generic-y += iomap.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild
index 9034b583a88a..7e92957baf6a 100644
--- a/arch/loongarch/include/asm/Kbuild
+++ b/arch/loongarch/include/asm/Kbuild
@@ -10,5 +10,6 @@ generic-y += qrwlock.h
generic-y += user.h
generic-y += ioctl.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
generic-y += statfs.h
generic-y += text-patching.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index b282e0dd8dc1..62543bf305ff 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -3,5 +3,6 @@ generated-y += syscall_table.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += text-patching.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index 7178f990e8b3..0030309b47ad 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += syscalls.h
generic-y += tlb.h
generic-y += user.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 684569b2ecd6..9771c3d85074 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -12,5 +12,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 28004301c236..0a2530964413 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += cmpxchg.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index cef49d60d74c..8aa34621702d 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -8,4 +8,5 @@ generic-y += spinlock_types.h
generic-y += spinlock.h
generic-y += qrwlock_types.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index 4fb596d94c89..d48d158f7241 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 2e23533b67e3..805b5aeebb6f 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -5,4 +5,5 @@ generated-y += syscall_table_spu.h
generic-y += agp.h
generic-y += mcs_spinlock.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += early_ioremap.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index bd5fc9403295..7721b63642f4 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -14,5 +14,6 @@ generic-y += ticket_spinlock.h
generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 80bad7de7a04..0c1fc47c3ba0 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,3 +7,4 @@ generated-y += unistd_nr.h
generic-y += asm-offsets.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 4d3f10ed8275..f0403d3ee8ab 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -3,4 +3,5 @@ generated-y += syscall_table.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 17ee8a273aa6..49c6bb326b75 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 1b9b82bbe322..2a1629ba8140 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -17,6 +17,7 @@ generic-y += module.lds.h
generic-y += parport.h
generic-y += percpu.h
generic-y += preempt.h
+generic-y += ring_buffer.h
generic-y += runtime-const.h
generic-y += softirq_stack.h
generic-y += switch_to.h
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4566000e15c4..078fd2c0d69d 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -14,3 +14,4 @@ generic-y += early_ioremap.h
generic-y += fprobe.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 13fe45dea296..e57af619263a 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -6,5 +6,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/include/asm-generic/ring_buffer.h b/include/asm-generic/ring_buffer.h
new file mode 100644
index 000000000000..201d2aee1005
--- /dev/null
+++ b/include/asm-generic/ring_buffer.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Generic arch dependent ring_buffer macros.
+ */
+#ifndef __ASM_GENERIC_RING_BUFFER_H__
+#define __ASM_GENERIC_RING_BUFFER_H__
+
+#include <linux/cacheflush.h>
+
+/* Flush cache on ring buffer range if needed. Do nothing by default. */
+#define arch_ring_buffer_flush_range(start, end) do { } while (0)
+
+#endif /* __ASM_GENERIC_RING_BUFFER_H__ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 5326924615a4..7288383b1f27 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7,6 +7,7 @@
#include <linux/ring_buffer_types.h>
#include <linux/sched/isolation.h>
#include <linux/trace_recursion.h>
+#include <linux/panic_notifier.h>
#include <linux/trace_events.h>
#include <linux/ring_buffer.h>
#include <linux/trace_clock.h>
@@ -31,6 +32,7 @@
#include <linux/oom.h>
#include <linux/mm.h>
+#include <asm/ring_buffer.h>
#include <asm/local64.h>
#include <asm/local.h>
#include <asm/setup.h>
@@ -559,6 +561,7 @@ struct trace_buffer {
unsigned long range_addr_start;
unsigned long range_addr_end;
+ struct notifier_block flush_nb;
struct ring_buffer_meta *meta;
@@ -2521,6 +2524,16 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+/* Stop recording on a persistent buffer and flush cache if needed. */
+static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
+{
+ struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
+
+ ring_buffer_record_off(buffer);
+ arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
+ return NOTIFY_DONE;
+}
+
static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
int order, unsigned long start,
unsigned long end,
@@ -2651,6 +2664,12 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
mutex_init(&buffer->mutex);
+ /* Persistent ring buffer needs to flush cache before reboot. */
+ if (start && end) {
+ buffer->flush_nb.notifier_call = rb_flush_buffer_cb;
+ atomic_notifier_chain_register(&panic_notifier_list, &buffer->flush_nb);
+ }
+
return_ptr(buffer);
fail_free_buffers:
@@ -2749,6 +2768,9 @@ ring_buffer_free(struct trace_buffer *buffer)
{
int cpu;
+ if (buffer->range_addr_start && buffer->range_addr_end)
+ atomic_notifier_chain_unregister(&panic_notifier_list, &buffer->flush_nb);
+
cpuhp_state_remove_instance(CPUHP_TRACE_RB_PREPARE, &buffer->node);
irq_work_sync(&buffer->irq_work.work);
^ permalink raw reply related
* [PATCH v19 0/7] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu (Google) @ 2026-04-30 3:28 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
Hi,
Here is the 19th version of improvement patches for making persistent
ring buffers robust to failures.
The previous version is here:
https://lore.kernel.org/all/177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com/
This version rebased on v7.1-rc1 and cleanup code.
[2/7] Limit max number of loops to nr_pages in __rb_get_reader_page()
and update comment.
[3/7] Cleanup rb_validate_buffer() so that it is more readable,
and add a comment about the timestamp validation.
[6/7] Add a kerneldoc about rb_validate_buffer().
Thank you,
Masami Hiramatsu (Google) (7):
ring-buffer: Flush and stop persistent ring buffer on panic
ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
ring-buffer: Add persistent ring buffer invalid-page inject test
ring-buffer: Show commit numbers in buffer_meta file
ring-buffer: Cleanup persistent ring buffer validation
ring-buffer: Cleanup buffer_data_page related code
arch/alpha/include/asm/Kbuild | 1
arch/arc/include/asm/Kbuild | 1
arch/arm/include/asm/Kbuild | 1
arch/arm64/include/asm/ring_buffer.h | 10 +
arch/csky/include/asm/Kbuild | 1
arch/hexagon/include/asm/Kbuild | 1
arch/loongarch/include/asm/Kbuild | 1
arch/m68k/include/asm/Kbuild | 1
arch/microblaze/include/asm/Kbuild | 1
arch/mips/include/asm/Kbuild | 1
arch/nios2/include/asm/Kbuild | 1
arch/openrisc/include/asm/Kbuild | 1
arch/parisc/include/asm/Kbuild | 1
arch/powerpc/include/asm/Kbuild | 1
arch/riscv/include/asm/Kbuild | 1
arch/s390/include/asm/Kbuild | 1
arch/sh/include/asm/Kbuild | 1
arch/sparc/include/asm/Kbuild | 1
arch/um/include/asm/Kbuild | 1
arch/x86/include/asm/Kbuild | 1
arch/xtensa/include/asm/Kbuild | 1
include/asm-generic/ring_buffer.h | 13 +
include/linux/ring_buffer.h | 1
kernel/trace/Kconfig | 34 ++
kernel/trace/ring_buffer.c | 490 +++++++++++++++++++++++-----------
kernel/trace/trace.c | 4
26 files changed, 415 insertions(+), 157 deletions(-)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH] kprobes: Remove dead child probes from aggrprobe list on module unload
From: Shijia Hu @ 2026-04-30 1:39 UTC (permalink / raw)
To: mhiramat; +Cc: rostedt, naveen, davem, akpm, linux-kernel, linux-trace-kernel
In-Reply-To: <20260429115645.793f15ea@gandalf.local.home>
On Wed, 29 Apr 2026 17:40:53 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> Yes, this cause UAF because that module has a bug. Please call
> unregister_kprobe().
On Wed, 29 Apr 2026 11:56:45 -0400
Steve Rostedt <rostedt@goodmis.org> wrote:
> Yes, this is one of those...
>
> Patient: Doctor it hurts me when I do this
> Doctor: Then don't do that
>
> ... reports.
>
> No, the kernel isn't responsible for fixing buggy modules.
Agreed. I withdraw this patch.
Thank you both for the review.
Shijia
^ permalink raw reply
* Re: [PATCH RFC v5 00/53] guest_memfd: In-place conversion support
From: Michael Roth @ 2026-04-29 23:51 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260428-gmem-inplace-conversion-v5-0-d8608ccfca22@google.com>
On Tue, Apr 28, 2026 at 04:24:55PM -0700, Ackerley Tng via B4 Relay wrote:
> [Some people who received this message don't often get email from devnull+ackerleytng.google.com@kernel.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> This is RFC v5 of guest_memfd in-place conversion support.
>
> Up till now, guest_memfd supports the entire inode worth of memory being
> used as all-shared, or all-private. CoCo VMs may request guest memory to be
> converted between private and shared states, and the only way to support
> that currently would be to have the userspace VMM provide two sources of
> backing memory from completely different areas of physical memory.
>
> pKVM has a use case for in-place sharing: the guest and host may be
> cooperating on given data, and pKVM doesn't protect data through
> encryption, so copying that given data between different areas of physical
> memory as part of conversions would be unnecessary work.
>
> This series also serves as a foundation for guest_memfd huge page
> support. Now, guest_memfd only supports PAGE_SIZE pages, so if two sources
> of backing memory are used, the userspace VMM could maintain a steady total
> memory utilized by punching out the pages that are not used. When huge
> pages are available in guest_memfd, even if the backing memory source
> supports hole punching within a huge page, punching out pages to maintain
> the total memory utilized by a VM would be introducing lots of
> fragmentation.
>
> In-place conversion avoids fragmentation by allowing the same physical
> memory to be used for both shared and private memory, with guest_memfd
> tracks the shared/private status of all the pages at a per-page
> granularity.
>
> The central principle, which guest_memfd continues to uphold, is that any
> guest-private page will not be mappable to host userspace. All pages will
> be mmap()-able in host userspace, but accesses to guest-private pages (as
> tracked by guest_memfd) will result in a SIGBUS.
>
> This series introduces a guest_memfd ioctl (not kvm, vm or vcpu, but
> guest_memfd ioctl) that allows userspace to set memory
> attributes (shared/private) directly through the guest_memfd. This is the
> appropriate interface because shared/private-ness is a property of memory
> and hence the request should be sent directly to the memory provider -
> guest_memfd.
>
> Tested with both CONFIG_KVM_VM_MEMORY_ATTRIBUTES enabled and disabled:
>
> + tools/testing/selftests/kvm/guest_memfd_test.c
> + tools/testing/selftests/kvm/pre_fault_memory_test.c
> + tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> + tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> + tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
> + tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
>
> Updates for this revision:
>
> + For TDX and SNP, PRESERVE supported only before VM is finalized only for
> to_private conversions.
> + This allows PRESERVE to be used as part of the VM memory
> loading/encryption flow
> + Only support PRESERVE for to_private conversions (to_shared on
> populated memory on TDX would cause zeroing)
> + Relaxed constraints for SNP and TDX to allow NULL to be passed as
> source address.
> + Dropped KVM_CAP_MEMORY_ATTRIBUTES2. KVM_CAP_MEMORY_ATTRIBUTES reports
> attributes supported by the KVM_SET_MEMORY_ATTRIBUTES VM ioctl, and
> KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES reports attributes supported bt the
> KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl.
> + KVM_SET_MEMORY_ATTRIBUTES2 is not supported by the VM ioctl
> + Resolve locking issue when kvm_gmem_get_attribute() is called from
> kvm_mmu_zap_collapsible_spte() by bugging the VM. guest_memfd memslots
> don't support dirty tracking, so the locking issue is not on an
> accessible code path.
> + Moved guest_memfd_conversions_test.c to only be compiled and tested for
> x86, since it depends so heavily on KVM_X86_SW_PROTECTED_VM's as a
> testing vehicle
>
> TODOs
>
> + Perhaps further clarify PRESERVE flag: [8]
I made a super-long-winded reply to that thread, but to summarize:
PRESERVE flag has different enumeration/behavior/enforcement for pre-launch
vs. post-launch, and similar considerations might come into play for
other flags, so to make it easier to enumerate what flags are available
for pre-launch/post-launch, maybe we could have 2 capabilities instead
of 1:
KVM_CAP_MEMORY_ATTRIBUTES2_PRE_LAUNCH_FLAGS
KVM_CAP_MEMORY_ATTRIBUTES2_FLAGS
where SNP/TDX would only advertise PRESERVE for PRE_LAUNCH, and pKVM I
guess would enumerate it for both (or maybe just POST_LAUNCH?)
That lets us keep the flags definitions more straightforward but still
allows userspace to easily enumerate what exactly should be available at
pre vs. post launch time, and give us some flexibility to detail
variations in behavior between the 2 phases without documenting
edge-cases in terms of VM types.
> + Resolve issue where guest_memfd_conversions_test, which uses the
> kselftest framework, doesn't perform teardown on assertion
> failure. Please see proposal at [9]
> + Test with TDX selftests. We're in the process of rebasing TDX selftests
> on this series and will post updates when that's tested.
>
> I would like feedback on:
>
> + Content modes: 0 (MODE_UNSPECIFIED), ZERO, and PRESERVE. Is that all
> good, or does anyone think there is a use case for something else?
> + Should the content modes apply even if no attribute changes are required?
> + See notes added in "KVM: guest_memfd: Apply content modes while
> setting memory attributes"
Looking at the example you have there:
+ Note: These content modes apply to the entire requested range, not
+ just the parts of the range that underwent conversion. For example, if
+ this was the initial state:
+
+ * [0x0000, 0x1000): shared
+ * [0x1000, 0x2000): private
+ * [0x2000, 0x3000): shared
+ and range [0x0000, 0x3000) was set to shared, the content mode would
+ apply to all memory in [0x0000, 0x3000), not just the range that
+ underwent conversion [0x1000, 0x2000).
Userspace would be aware of whether the range contains pages that were
already set to private, so if it really wants to set the just the
[0x1000, 0x2000) range to shared with appropriate content mode, it is
fully able to do so by just issuing the ioctl for that specific range.
If it attempts to issue it for the entire range, it only seems like it
would defy normal expectations and cause confusion to skip ranges, and
I'm not sure it gains us anything useful in exchange for that potential
confusion.
> + Possibly related: should setting attributes be allowed if some
> sub-range requested already has the requested attribute?
As it is now, userspace has that capability (to use finer-grained ranges
if it doesn't want to re-issue unecessary/unwanted conversions), similar
to above. And KVM internally will just issue kvm_arch_gmem_prepare()
calls so that architecture-specific handling can deal with this case
(e.g. SNP's sev_gmem_prepare() already checks if the corresponding
attribute is set in the RMP table and just skips it otherwise). So I
don't think we really gain anything but added complexity if we try to
make gmem more selective about it.
-Mike
> + Structure of how various content modes are checked for support or
> applied? I used overridable weak functions for architectures that haven't
> defined support, and defined overrides for x86 to show how I think it would
> work. For CoCo platforms, I only implemented TDX for illustration purposes
> and might need help with the other platforms. Should I have used
> kvm_x86_ops? I tried and found myself defining lots of boilerplate.
> + The use of private_mem_conversions_test.sh to run different options in
> private_mem_conversions_test. If this makes sense, I'll adjust the
> Makefile to have private_mem_conversions_test tested only via the script.
>
> This series is based on kvm/next, and here's the tree for your convenience:
>
> https://github.com/googleprodkernel/linux-cc/commits/guest_memfd-inplace-conversion-v5
>
> Older series:
>
> + RFCv4 is at [7]
> + RFCv3 is at [6]
> + RFCv2 is at [5]
> + RFCv1 is at [4]
> + Previous versions of this feature, part of other series, are available at
> [1][2][3].
>
> [1] https://lore.kernel.org/all/bd163de3118b626d1005aa88e71ef2fb72f0be0f.1726009989.git.ackerleytng@google.com/
> [2] https://lore.kernel.org/all/20250117163001.2326672-6-tabba@google.com/
> [3] https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com/
> [4] https://lore.kernel.org/all/cover.1760731772.git.ackerleytng@google.com/T/
> [5] https://lore.kernel.org/all/cover.1770071243.git.ackerleytng@google.com/T/
> [6] https://lore.kernel.org/r/20260313-gmem-inplace-conversion-v3-0-5fc12a70ec89@google.com/T/
> [7] https://lore.kernel.org/all/20260326-gmem-inplace-conversion-v4-0-e202fe950ffd@google.com/T/
> [8] https://lore.kernel.org/all/CAEvNRgGbMhkX310CkFY_M5x-zod=BDTiuznrZ0XvFPUK7weL1A@mail.gmail.com/
> [9] https://lore.kernel.org/all/20260414-selftest-global-metadata-v1-0-fd223922bc57@google.com/T/
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> Ackerley Tng (34):
> KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
> KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
> KVM: guest_memfd: Only prepare folios for private pages
> KVM: Move kvm_supported_mem_attributes() to kvm_host.h
> KVM: guest_memfd: Add basic support for KVM_SET_MEMORY_ATTRIBUTES2
> KVM: guest_memfd: Ensure pages are not in use before conversion
> KVM: guest_memfd: Call arch invalidate hooks on conversion
> KVM: guest_memfd: Return early if range already has requested attributes
> KVM: guest_memfd: Advertise KVM_SET_MEMORY_ATTRIBUTES2 ioctl
> KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
> KVM: guest_memfd: Use actual size for invalidation in kvm_gmem_release()
> KVM: guest_memfd: Determine invalidation filter from memory attributes
> KVM: guest_memfd: Introduce default handlers for content modes
> KVM: guest_memfd: Apply content modes while setting memory attributes
> KVM: x86: Support SW_PROTECTED_VM in applying content modes
> KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
> KVM: x86: Support SNP and TDX applying content modes
> KVM: x86: Bug CoCo VM on page fault before finalizing
> KVM: Add CAP to enumerate supported SET_MEMORY_ATTRIBUTES2 flags
> KVM: selftests: Test basic single-page conversion flow
> KVM: selftests: Test conversion flow when INIT_SHARED
> KVM: selftests: Test conversion precision in guest_memfd
> KVM: selftests: Test conversion before allocation
> KVM: selftests: Convert with allocated folios in different layouts
> KVM: selftests: Test that truncation does not change shared/private status
> KVM: selftests: Test conversion with elevated page refcount
> KVM: selftests: Test that conversion to private does not support ZERO
> KVM: selftests: Support checking that data not equal expected
> KVM: selftests: Test that not specifying a conversion flag scrambles memory contents
> KVM: selftests: Reset shared memory after hole-punching
> KVM: selftests: Provide function to look up guest_memfd details from gpa
> KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
> KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd
> KVM: selftests: Add script to exercise private_mem_conversions_test
>
> Michael Roth (1):
> KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
>
> Sean Christopherson (18):
> KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
> KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
> KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
> KVM: Stub in ability to disable per-VM memory attribute tracking
> KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
> KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
> KVM: Let userspace disable per-VM mem attributes, enable per-gmem attributes
> KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
> KVM: selftests: Create gmem fd before "regular" fd when adding memslot
> KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset}
> KVM: selftests: Add support for mmap() on guest_memfd in core library
> KVM: selftests: Add selftests global for guest memory attributes capability
> KVM: selftests: Add helpers for calling ioctls on guest_memfd
> KVM: selftests: Test that shared/private status is consistent across processes
> KVM: selftests: Provide common function to set memory attributes
> KVM: selftests: Check fd/flags provided to mmap() when setting up memslot
> KVM: selftests: Update pre-fault test to work with per-guest_memfd attributes
> KVM: selftests: Update private memory exits test to work with per-gmem attributes
>
> Documentation/virt/kvm/api.rst | 139 ++++-
> .../virt/kvm/x86/amd-memory-encryption.rst | 19 +-
> Documentation/virt/kvm/x86/intel-tdx.rst | 4 +
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/kvm/Kconfig | 15 +-
> arch/x86/kvm/mmu/mmu.c | 20 +-
> arch/x86/kvm/svm/sev.c | 18 +-
> arch/x86/kvm/vmx/tdx.c | 8 +-
> arch/x86/kvm/x86.c | 145 ++++-
> include/linux/kvm_host.h | 74 ++-
> include/trace/events/kvm.h | 4 +-
> include/uapi/linux/kvm.h | 21 +
> mm/swap.c | 2 +
> tools/testing/selftests/kvm/Makefile.kvm | 5 +
> tools/testing/selftests/kvm/include/kvm_util.h | 141 ++++-
> tools/testing/selftests/kvm/include/test_util.h | 34 +-
> .../selftests/kvm/kvm_has_gmem_attributes.c | 17 +
> tools/testing/selftests/kvm/lib/kvm_util.c | 130 +++--
> tools/testing/selftests/kvm/lib/test_util.c | 7 -
> tools/testing/selftests/kvm/lib/x86/sev.c | 2 +-
> .../testing/selftests/kvm/pre_fault_memory_test.c | 4 +-
> .../kvm/x86/guest_memfd_conversions_test.c | 552 +++++++++++++++++++
> .../kvm/x86/private_mem_conversions_test.c | 55 +-
> .../kvm/x86/private_mem_conversions_test.sh | 128 +++++
> .../selftests/kvm/x86/private_mem_kvm_exits_test.c | 38 +-
> virt/kvm/Kconfig | 3 +-
> virt/kvm/guest_memfd.c | 591 ++++++++++++++++++++-
> virt/kvm/kvm_main.c | 87 ++-
> 28 files changed, 2075 insertions(+), 190 deletions(-)
> ---
> base-commit: 39f1c201b93f4ff71631bac72cff6eb155f976a4
> change-id: 20260225-gmem-inplace-conversion-bd0dbd39753a
>
> Best regards,
> --
> Ackerley Tng <ackerleytng@google.com>
>
>
^ permalink raw reply
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-04-29 23:21 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <CAEvNRgGbMhkX310CkFY_M5x-zod=BDTiuznrZ0XvFPUK7weL1A@mail.gmail.com>
On Fri, Apr 24, 2026 at 12:08:45PM -0700, Ackerley Tng wrote:
> Michael Roth <michael.roth@amd.com> writes:
>
> Thank you for your patches!
>
> >
> > [...snip...]
> >
> >>
> >> I also did some minor updates (prefixed with a "[squash]" tag) to advertise
> >> the KVM_SET_MEMORY_ATTRIBUTES2_PRESERVED flag so it can be used by
> >
> > Though I'm not sure how we deal with it if SNP/TDX at some point become
> > capable of using the PRESERVED flag *after* populate... but maybe that's
> > too unlikely to worry about? If we wanted to address it though, we could
> > have both PRESERVED and PRESERVED_BEFORE_LAUNCH so they can be
> > enumerated separately from the start.
> >
>
> Not sure how likely it is, but if SNP and TDX can honor PRESERVE
> semantics after populate, I think we could implement support under a new
> flag like CIPHER.
That works, but it still makes things *slightly* awkward due to special-casing
the PRESERVE semantics for 1 guest type vs. another.
For instance, for the QEMU patches what I'd like to do is be able to
issue conversions with the PRESERVE flag set, and it makes sense that
when the function that handles that gets called (in this case,
guest_memfd_set_memory_attributes_fd()), it asserts that QEMU
checked for the corresponding capability/flag (via
KVM_CAP_MEMORY_ATTRIBUTES2_FLAGS) earlier on during QEMU init time
and stored that into kvm_supported_memory_attributes2_flag to check
against later in cases where we try to use PRESERVE:
https://github.com/AMDESE/qemu/blob/1308d4bcd8dd8acd151243e96ff0be4098389b93/accel/kvm/kvm-all.c#L1655
but that check is sort of incomplete: for use-cases like pKVM it makes
sense to allow conversion+PRESERVE at all times, but for SNP/TDX we'd
ideally have an additional check like:
static int guest_memfd_set_memory_attributes_fd(..., attrs, flags)
{
struct kvm_memory_attributes2 attrs;
...
assert(kvm_supported_memory_attributes & attrs);
assert(kvm_supported_memory_attributes2_flags & flags);
/* SNP and TDX only support PRESERVE prior to launch) */
if (vm_type_is_snp_or_tdx() && (flags & PRESERVE))
assert(vcpus_not_yet_started_and_LAUNCH_FINISH_etc_not_yet_called);
...
}
maybe that's fine, but there's just some asymmetry in the fact that
the capabilities checks are essentially self-documenting/self-enforcing
in this regard, except for PRESERVE+SNP/TDX where userspace needs to
sprinkle some additional vm_type-specific logic.
So that's why I was sort of thinking PRESERVE might deserve a
PRESERVE_BEFORE_LAUNCH or something so the kernel can enumerate the
support rather than userspace needing to read into the documentation /
implementation to implement finer-grained checks/etc.
>
> CIPHER can then be used to mean "do the encryption or decryption", and
> for platforms not supporting encryption, they'd stick with PRESERVE?
It's all theoretical, but I'd imagine that *if* SNP/TDX ever added
support for allowing userspace to write encrypted memory into a guest
range, it would be something like:
- write plain-text data to corresponding GPA
- issue shared-to-private conversion with CIPHER (or PRESERVE) bit
set, which will then be augmented to make some sort of firmware
call to encrypt the memory with guest key and make it visible to
guest
So, at a high-level, if host writes 0xbeef to shared memory, then upon
completion of the converison ioctl, 0xbeef then be visible to the guest
via encrypted/private memory, which seems to match the semantics of
PRESERVE perfectly:
+``KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE``
+
+ On conversion, KVM guarantees memory contents will be preserved with
+ respect to the last written unencrypted value. As a concrete
+ example, if the host writes ``0xbeef`` to shared memory and converts
+ the memory to private, the guest will also read ``0xbeef``, even if
+ the in-memory data is encrypted as part of the conversion. And vice
+ versa, if the guest writes ``0xbeef`` to private memory and then
+ converts the memory to shared, the host (and guest) will read
+ ``0xbeef`` (if the memory is accessible).
So it seems like a waste to have to add a CIPHER that would essentially
have the same semantics (even though the architectural implementation
details are different), and then complicate the documentation of
PRESERVE to have vm-specific behaviors that userspace needs to account
for, rather than leaving PRESERVE as-is and adding a
PRESERVE_BEFORE_LAUNCH that neatly encapsulates the specialness of the
SNP/TDX flow without complicating the more generally-defined flags like
PRESERVE that could in theory become usable by all variants in the future.
But then that sort of has me wondering if PRESERVE should be
PRESERVE_AFTER_LAUNCH (to cover cases where maybe SNP/TDX gain this
ability in the future, but can only use it post-launch, where other
architectures might allow it both before/after.)
...but that seems to be getting into a more general issue of how we
determine what is available before-launch/after-launch, so maybe instead
of introducing new flags, we just have 2 capabilities that returns what
flags are available at each particular phase, e.g.
KVM_CAP_MEMORY_ATTRIBUTES2_PRE_LAUNCH_FLAGS
KVM_CAP_MEMORY_ATTRIBUTES2_POST_LAUNCH_FLAGS
And then userspace can easily probe/lock down what it expects to be
available prior to launching the guest vs after without any vm-specific
logic, and we can continue to just stick with PRESERVE/ZERO. (and if
we do end up need to further distinguish pre vs. post-launch behavior,
this gives us some flexibility to handle that as well).
>
> Should we redefine the semantics of PRESERVE to be "ensure that memory
> contents don't change while guest_memfd tracking is being updated" and
> avoid making a commitment on how the guest should read the memory?
>
> The above update would be aligned with ZERO not being allowed for
> conversions to private (because KVM/guest_memfd does not make guarantees
> about the contract between the host and guest.
It's a little awkward here too, because PRESERVE is sort of making a
contract between host/guest: we are providing trusted data to the guest.
However, there's an attestation proces that allows the guest to enforce
that for pre-launch and ensure we are following through on that
contract. PRESERVE for post-launch...I'm not sure how that could be
enforced, but I think this is further reason to allow pre vs. post
launch flags to be enumerated separately.
Thanks,
Mike
>
> This way, all of those (ZERO, PRESERVE) will focus on KVM's interface
> with the host.
>
> This lines up for SW_PROTECTED_VMs too, since reading memory that didn't
> change in the guest is the contract between SW_PROTECTED_VMs and the
> host.
>
> >> userspace for SNP/TDX in the kvm_gmem_populate() path as agreed upon
> >> during PUCK.
> >>
> >>
> >> [...snip...]
> >>
^ permalink raw reply
* Re: [PATCH] trace_printk: replace _______STR with __UNIQUE_ID(STR)
From: David Laight @ 2026-04-29 21:47 UTC (permalink / raw)
To: Steven Rostedt
Cc: Qian-Yu Lin, mhiramat, mathieu.desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <20260429134226.49e95e9d@robin>
On Wed, 29 Apr 2026 13:42:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 30 Apr 2026 00:57:07 +0800
> Qian-Yu Lin <tiffany019230@gmail.com> wrote:
>
> > The macro trace_printk() uses a hardcoded identifier _______STR
> > within a statement expression, which can lead to variable name
> > shadowing if a caller happens to use the same name in its scope.
>
> Has this ever been a problem?
>
> >
> > Following the pattern in commit 24ba53017e18 ("rcu: Replace ________p1
> > and _________p1 with __UNIQUE_ID(rcu)") and commit 589a9785ee3a
> > ("min/max: remove sparse warnings when they're nested"), replace the
> > hardcoded identifier with __UNIQUE_ID(STR).
min and max do get nested - so it is important that the 'local'
variables have different names - otherwise you can get invalid expansions.
No one is going to have: trace_printk(fmt1, trace_printk(ftm2, ...), ...)
it just doesn't make sense.
There is a slight problem the ____________________________STR might
be used by an entirely different #define.
Just using _trace_printk_str[] would make this pretty unlikely.
It would also have the advantage of making the .i file a bit easier
to read, all the UNIQUE names in min/max output make it really hard
to see what the output actually means.
> >
> > Since __UNIQUE_ID() must be expanded once to remain consistent across
> > declaration and sizeof() within the statement expression, introduce a
> > nested helper macro ___trace_printk.
>
> Hmm, so we are replacing one name with underscores with another name
> with underscores?
>
> >
> > Signed-off-by: Qian-Yu Lin <tiffany019230@gmail.com>
> > ---
> > include/linux/trace_printk.h | 10 +++++++---
> > 1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
> > index 2670ec7f4262..060eccb40838 100644
> > --- a/include/linux/trace_printk.h
> > +++ b/include/linux/trace_printk.h
> > @@ -2,6 +2,7 @@
> > #ifndef _LINUX_TRACE_PRINTK_H
> > #define _LINUX_TRACE_PRINTK_H
> >
> > +#include <linux/compiler.h>
>
> People are already saying that trace_printk.h slows down the compile.
> Does this add any overhead to the compile?
A little - nothing is free.
David
>
> -- Steve
>
>
> > #include <linux/compiler_attributes.h>
> > #include <linux/instruction_pointer.h>
> > #include <linux/stddef.h>
> > @@ -84,15 +85,18 @@ do { \
> > * let gcc optimize the rest.
> > */
> >
> > -#define trace_printk(fmt, ...) \
> > +#define ___trace_printk(fmt, str, ...) \
> > do { \
> > - char _______STR[] = __stringify((__VA_ARGS__)); \
> > - if (sizeof(_______STR) > 3) \
> > + char str[] = __stringify((__VA_ARGS__)); \
> > + if (sizeof(str) > 3) \
> > do_trace_printk(fmt, ##__VA_ARGS__); \
> > else \
> > trace_puts(fmt); \
> > } while (0)
> >
> > +#define trace_printk(fmt, ...) \
> > + ___trace_printk(fmt, __UNIQUE_ID(str), ##__VA_ARGS__)
> > +
> > #define do_trace_printk(fmt, args...) \
> > do { \
> > static const char *trace_printk_fmt __used \
>
>
^ permalink raw reply
* Re: [PATCH] mm/madvise: preserve uprobe breakpoints across MADV_DONTNEED
From: Daniel Walker (danielwa) @ 2026-04-29 21:11 UTC (permalink / raw)
To: Oleg Nesterov
Cc: David Hildenbrand (Arm),
Darko Tominac -X (dtominac - GLOBALLOGIC INC at Cisco),
Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, xe-linux-external(mailer list),
linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
linux-perf-users@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <afIiqR5CqYRId0N0@redhat.com>
On Wed, Apr 29, 2026 at 05:24:25PM +0200, Oleg Nesterov wrote:
> On 04/29, David Hildenbrand (Arm) wrote:
> >
> > On 4/29/26 15:15, Darko Tominac wrote:
> >
> > > uprobe infrastructure does not re-instrument pages on individual page
> > > faults (uprobe_mmap() is only called during VMA creation, not on
> > > page-in), the breakpoints are silently lost once the discarded pages are
> > > re-read from the backing file. The probes stop firing with no error
> > > indication, and the only recovery is to unregister and re-register the
> > > affected uprobes.
> >
> > Right. Don't MADV_DONTNEED uprobes, just like you are not supposed to
> > MADV_DONTNEED debugger breakpoints/set data etc. :)
>
> Agreed, thanks.
Shouldn't there be some sort of compensation or notification for this, or is each person that
hits this suppose to just scratch their head and send a patch that's rejected?
Daniel
^ permalink raw reply
* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-04-29 20:03 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Jens Remus,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Linus Torvalds, Andrew Morton, Florian Weimer, Kees Cook,
Carlos O'Donell, Sam James, Dylan Hatch, Borislav Petkov,
Dave Hansen, David Hildenbrand, H. Peter Anvin, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
Vlastimil Babka, Heiko Carstens, Vasily Gorbik
In-Reply-To: <20260429145815.69dcb1c0@gandalf.local.home>
On Wed, 29 Apr 2026 14:58:15 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> > If the main executable doesn't have a sframe section, I don't see why this
> > shouldn't be allowed to add one.
>
> The address is added via mtree_insert_range() which states it will return
> -EEXISTS if the range is occupied.
I confirmed this. I changed the test program to load its own sframe section
again, and it errors out with -EEXISTS.
-- Steve
^ permalink raw reply
* Re: [RFC][PATCH] unwind: Add stacktrace_setup system call
From: Steven Rostedt @ 2026-04-29 18:58 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Jens Remus,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Linus Torvalds, Andrew Morton, Florian Weimer, Kees Cook,
Carlos O'Donell, Sam James, Dylan Hatch, Borislav Petkov,
Dave Hansen, David Hildenbrand, H. Peter Anvin, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
Vlastimil Babka, Heiko Carstens, Vasily Gorbik
In-Reply-To: <20260429145512.461e09fe@gandalf.local.home>
On Wed, 29 Apr 2026 14:55:12 -0400
Steven Rostedt <rostedt@kernel.org> wrote:
> > > + *
> > > + * This system call is used by dynamic library utilities to inform the kernel
> >
> > Out of curiosity: what would happen if the application chooses to invoke
> > this system call to register sframe for the main executable ? Should it be
> > allowed ?
>
> I haven't tried, and should look at the code. Ideally, no two sframes
> sections should cover the same code. I think it should error out if it does.
>
> If the main executable doesn't have a sframe section, I don't see why this
> shouldn't be allowed to add one.
The address is added via mtree_insert_range() which states it will return
-EEXISTS if the range is occupied.
-- Steve
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox