* Re: [PATCH v14 4/5] ring-buffer: Reset RB_MISSED_* flags on persistent ring buffer
From: Masami Hiramatsu @ 2026-03-31 1:43 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, Ian Rogers
In-Reply-To: <20260330143613.42fe5640@gandalf.local.home>
On Mon, 30 Mar 2026 14:36:13 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 30 Mar 2026 21:50:20 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
>
> > From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> >
> > Reset RB_MISSED_* flags when the persistent ring buffer is
> > validated at boot. Since these flags are used only in reading
> > process, such process should be stopped when reboot and never
> > be restarted. Thus, these flags are meaningless in the next
> > boot. Moreover, it can confuse the read process after reboot.
>
> Is it meaningless on a second boot?
>
> Let's say you have a crash, and there's an invalid buffer. On the next boot
> it is flagged as invalid with the RB_MISSED flag. But then you reboot again
> before looking at the buffer. The next boot will clear this flag. Now
> looking at the persistent ring buffer will not show any missed events.
>
> Ideally, it shouldn't matter how many reboots are made. If the persistent
> ring buffer hasn't started again, it should always show the same output.
Hmm, OK. I'll drop this.
Thanks!
>
> -- Steve
>
>
> >
> > Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> > Changes in v14:
> > - Newly added.
> > ---
> > kernel/trace/ring_buffer.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index e5178239f2f9..5049cf13021e 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c
> > @@ -1903,6 +1903,7 @@ static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
> > local_set(&bpage->page->commit, 0);
> > } else {
> > local_set(&bpage->entries, ret);
> > + local_set(&bpage->page->commit, tail);
> > }
> >
> > return ret;
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v13 4/4] ring-buffer: Add persistent ring buffer selftest
From: Masami Hiramatsu @ 2026-03-31 1:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, Ian Rogers
In-Reply-To: <20260327164748.67b6453d@gandalf.local.home>
On Fri, 27 Mar 2026 16:47:48 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 27 Mar 2026 16:25:08 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > Also, I noticed that there's nothing that reads the RB_MISSING as I thought
> > it might. I'll have to look into how to pass that info to the trace output.
>
> And when I cat /sys/kernel/tracing/instances/ptracingtest/per_cpu/cpuX/trace_pipe
>
> (where X is the failed buffer)
>
> It triggered an infinite loop of:
>
> [ 206.549217] ------------[ cut here ]------------
> [ 206.550907] WARNING: kernel/trace/ring_buffer.c:5751 at __rb_get_reader_page+0xa6b/0x1040, CPU#2: cat/1197
> [ 206.554111] Modules linked in:
> [ 206.555331] CPU: 2 UID: 0 PID: 1197 Comm: cat Tainted: G W 7.0.0-rc4-test-00028-g7b37f48b2c57-dirty #276 PREEMPT(full)
> [ 206.559048] Tainted: [W]=WARN
> [ 206.560244] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
> [ 206.563212] RIP: 0010:__rb_get_reader_page+0xa6b/0x1040
> [ 206.564964] Code: ff df 48 c1 ea 03 80 3c 02 00 0f 85 4a 05 00 00 48 8b 43 10 be 04 00 00 00 4c 8d 60 08 4c 89 e7 e8 9a 2d 63 00 f0 41 ff 04 24 <0f> 0b e9 36 fb ff ff e8 29 39 05 00 fb 0f 1f 44 00 00 4d 85 f6 0f
> [ 206.572295] RSP: 0018:ffff888112a77938 EFLAGS: 00010006
> [ 206.574095] RAX: 0000000000000001 RBX: ffff888100d6e000 RCX: 0000000000000001
> [ 206.576458] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff88810027b808
> [ 206.578749] RBP: 1ffff1102254ef34 R08: ffffffff909a1556 R09: ffffed102004f701
> [ 206.581020] R10: ffffed102004f702 R11: ffff88823443a000 R12: ffff88810027b808
> [ 206.583312] R13: ffff888100f65f00 R14: ffff888100f65f00 R15: dffffc0000000000
> [ 206.585647] FS: 00007f98e4d80780(0000) GS:ffff88829e3c2000(0000) knlGS:0000000000000000
> [ 206.588246] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 206.590179] CR2: 00007f98e4d3e000 CR3: 000000012272e006 CR4: 0000000000172ef0
> [ 206.592444] Call Trace:
> [ 206.593518] <TASK>
> [ 206.594436] ? __pfx___rb_get_reader_page+0x10/0x10
> [ 206.596148] ? lock_acquire+0x1b2/0x340
> [ 206.597599] rb_buffer_peek+0x37e/0x520
> [ 206.598954] ring_buffer_peek+0xe9/0x310
> [ 206.601956] peek_next_entry+0x15a/0x280
> [ 206.603420] __find_next_entry+0x39f/0x530
> [ 206.604918] ? __pfx___mutex_lock+0x10/0x10
> [ 206.606474] ? rcu_is_watching+0x15/0xb0
> [ 206.616049] ? __pfx___find_next_entry+0x10/0x10
> [ 206.617741] ? preempt_count_sub+0x10c/0x1c0
> [ 206.619242] ? __pfx_down_read+0x10/0x10
> [ 206.620687] trace_find_next_entry_inc+0x2f/0x240
> [ 206.622351] tracing_read_pipe+0x4e7/0xc60
> [ 206.623852] ? rw_verify_area+0x353/0x5f0
> [ 206.625325] vfs_read+0x171/0xb20
> [ 206.626592] ? __lock_acquire+0x487/0x2220
> [ 206.628135] ? __pfx___handle_mm_fault+0x10/0x10
> [ 206.629784] ? __pfx_vfs_read+0x10/0x10
> [ 206.632696] ? __pfx_css_rstat_updated+0x10/0x10
> [ 206.634351] ? rcu_is_watching+0x15/0xb0
> [ 206.635835] ? trace_preempt_on+0x126/0x160
> [ 206.637362] ? preempt_count_sub+0x10c/0x1c0
> [ 206.638880] ? count_memcg_events+0x10a/0x4b0
> [ 206.640455] ? find_held_lock+0x2b/0x80
> [ 206.641908] ? rcu_read_unlock+0x17/0x60
> [ 206.643340] ? lock_release+0x1ab/0x320
> [ 206.644812] ksys_read+0xff/0x200
> [ 206.646127] ? __pfx_ksys_read+0x10/0x10
> [ 206.647651] do_syscall_64+0x117/0x16c0
> [ 206.649035] ? irqentry_exit+0xd9/0x690
> [ 206.650548] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 206.652331] RIP: 0033:0x7f98e4e14eb2
> [ 206.653743] Code: 18 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 1a 83 e2 39 83 fa 08 75 12 e8 2b ff ff ff 0f 1f 00 49 89 ca 48 8b 44 24 20 0f 05 <48> 83 c4 18 c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 10 ff 74 24 18
> [ 206.659364] RSP: 002b:00007ffdc0a8d930 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
> [ 206.663251] RAX: ffffffffffffffda RBX: 0000000000040000 RCX: 00007f98e4e14eb2
> [ 206.665614] RDX: 0000000000040000 RSI: 00007f98e4d3f000 RDI: 0000000000000003
> [ 206.668022] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000000
> [ 206.670306] R10: 0000000000000000 R11: 0000000000000202 R12: 00007f98e4d3f000
> [ 206.672624] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000040000
> [ 206.674941] </TASK>
> [ 206.675927] irq event stamp: 7898
> [ 206.677154] hardirqs last enabled at (7897): [<ffffffff90991f6f>] ring_buffer_empty_cpu+0x19f/0x2f0
> [ 206.680088] hardirqs last disabled at (7898): [<ffffffff909a277d>] ring_buffer_peek+0x17d/0x310
> [ 206.682881] softirqs last enabled at (7888): [<ffffffff9056cffc>] handle_softirqs+0x5bc/0x7c0
> [ 206.685710] softirqs last disabled at (7879): [<ffffffff9056d322>] __irq_exit_rcu+0x112/0x230
> [ 206.688483] ---[ end trace 0000000000000000 ]---
>
> OK, that RB_MISSED_EVENTS is causing an issue. Something else we need to
> look into. The warning is that __rb_get_reader_page() is trying more than 3
> times. Thus I think it's constantly swapping the head page and the reader
> page. Something to investigate.
I think this happens because invalidated buffer is empty. After recovering
persistent ring buffer, there can be contiguous empty buffers on the
ring buffer. Thus the reader needs to find next non-empty buffer
on the list.
Thank you,
>
> So, I'm holding off pulling in these patches. I may take the first one
> though.
>
> -- Steve
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Masami Hiramatsu @ 2026-03-31 3:58 UTC (permalink / raw)
To: Breno Leitao
Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <acqJk-zbyjIiy6hJ@gmail.com>
On Mon, 30 Mar 2026 08:04:23 -0700
Breno Leitao <leitao@debian.org> wrote:
> On Mon, Mar 30, 2026 at 06:15:17AM -0700, Breno Leitao wrote:
> > On Fri, Mar 27, 2026 at 10:37:44PM +0900, Masami Hiramatsu wrote:
> > > On Fri, 27 Mar 2026 03:06:41 -0700
> > > Breno Leitao <leitao@debian.org> wrote:
> >
> > > > > To fix this, we need to change setup_arch() for each architecture so
> > > > > that it calls this bootconfig_apply_early_params().
> > > >
> > > > Could we instead integrate this into parse_early_param() itself? That
> > > > approach would avoid the need to modify each architecture individually.
> > >
> > > Ah, indeed.
> >
> > I investigated integrating bootconfig into parse_early_param() and hit a
> > blocker: xbc_init() and xbc_make_cmdline() depend on memblock_alloc(), but on
> > most architectures (x86, arm64, arm, s390, riscv) parse_early_param() is called
> > from setup_arch() _before_ memblock is initialized.
>
> That said, I'd like to propose a simpler approach as a first step:
>
> 1) Keep calling bootconfig_apply_early_params() from setup_boot_config().
> This is the least intrusive approach and expands bootconfig support to
> additional early boot parameters.
>
> 2) Document that architecture-specific early parameters might be ignored.
> If a parameter is consumed early enough (during setup_arch()), it will
> not see the bootconfig value.
We have to carefully do this because what parameter is arch specific or not
depends on architecture and undocumented.
How about introducing a new Kconfig, which supports early params by
embedded bootconfig?
>
> 3) Ensure that early bootconfig parameters don't overwrite the boot command
> line. For example, if the boot command line has foo=bar and bootconfig
> later has foo=baz, the command line value should take precedence.
> This prevents early boot code (in setup_arch()) from seeing a parameter
> value that will be changed later.
OK, this also needs to be considered. Currently we just pass the bootconfig
parameters right before bootloader given parameters as "extra_command_line"
if "bootconfig" in cmdline or CONFIG_BOOT_CONFIG_FORCE=y.
[boot_config(.kernel)]<command_line>[ -- [boot_config(.init)][init_command_line]]
This is currently expected behavior. The bootconfig parameters are
expected to be overridden by command_line or command_line are appended.
If we change this for early params, we also should change the expected
output of /proc/cmdline too. I think we have 2 options;
- As before, we expect the parameters provided by the boot configuration
to be processed first and then overridden later by the command line.
Or,
- ignore all parameters which is given from the command line, this also
updates existing setup_boot_config() (means xbc_snprint_cmdline() ).
Anyway, this behavior change will also be a bit critical... We have
to announce it.
>
>
> If that is OK, that is what I have right now:
Let me comment on it.
> diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst
> index f712758472d5c..6ed852a0c66d8 100644
> --- a/Documentation/admin-guide/bootconfig.rst
> +++ b/Documentation/admin-guide/bootconfig.rst
> @@ -169,6 +169,15 @@ Boot Kernel With a Boot Config
> There are two options to boot the kernel with bootconfig: attaching the
> bootconfig to the initrd image or embedding it in the kernel itself.
>
> +Early options (those registered with ``early_param()``) may only be
> +specified in the embedded bootconfig, because the initrd is not yet
> +available when early parameters are processed.
> +
> +Note that embedded bootconfig is parsed after ``setup_arch()``, so
> +early options that are consumed during architecture initialization
> +(e.g., ``mem=``, ``memmap=``, ``earlycon``, ``noapic``, ``nolapic``,
> +``acpi=``, ``numa=``, ``iommu=``) may not take effect from bootconfig.
> +
This is easy to explain, but it's quite troublesome for users to
determine which parameters are unavailable. Currently we can identify
it by `git grep early_param -- arch/${ARCH}`. But it is setup in
setup_arch() we need to track the source code. (Or ask AI :))
> Attaching a Boot Config to Initrd
> ---------------------------------
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 7484cd703bc1a..34adcc1feb9b6 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1525,6 +1525,16 @@ config BOOT_CONFIG_EMBED
> image. But if the system doesn't support initrd, this option will
> help you by embedding a bootconfig file while building the kernel.
>
> + Unlike bootconfig attached to initrd, the embedded bootconfig also
> + supports early options (those registered with early_param()). Any
> + kernel.* key in the embedded bootconfig is applied before
> + parse_early_param() runs. Early options in initrd bootconfig will
> + not be applied. Early options consumed during setup_arch() (e.g.
> + mem=, memmap=, earlycon, noapic, acpi=, numa=, iommu=) may not
> + take effect. If the same early option
> + appears in both bootconfig and the kernel command line, the
> + command line value takes precedence.
> +
As a compromise, how about making this a separate Kconfig?
config BOOT_CONFIG_EMBED_EARLY_PARAM
bool "Support early_params in embedded bootconfig"
and afterwards, we can introduce
depends on ARCH_BOOT_CONFIG_EMBED_EARLY_PARAM
Thank you,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v14 5/5] ring-buffer: Add persistent ring buffer selftest
From: Masami Hiramatsu @ 2026-03-31 4:12 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, Ian Rogers
In-Reply-To: <20260330162419.5e4708fb@gandalf.local.home>
On Mon, 30 Mar 2026 16:24:19 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 30 Mar 2026 21:50:27 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
>
> > @@ -2558,12 +2577,64 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
> > kfree(cpu_buffer);
> > }
> >
> > +#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
> > +static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
> > +{
> > + struct ring_buffer_per_cpu *cpu_buffer;
> > + struct ring_buffer_cpu_meta *meta;
> > + struct buffer_data_page *dpage;
> > + u32 entry_bytes = 0;
> > + unsigned long ptr;
> > + int subbuf_size;
> > + int invalid = 0;
> > + int cpu;
> > + int i;
> > +
> > + if (!(buffer->flags & RB_FL_TESTING))
> > + return;
> > +
> > + guard(preempt)();
> > + cpu = smp_processor_id();
> > +
> > + cpu_buffer = buffer->buffers[cpu];
> > + meta = cpu_buffer->ring_meta;
> > + ptr = (unsigned long)rb_subbufs_from_meta(meta);
> > + subbuf_size = meta->subbuf_size;
> > +
> > + for (i = 0; i < meta->nr_subbufs; i++) {
> > + int idx = meta->buffers[i];
> > +
> > + dpage = (void *)(ptr + idx * subbuf_size);
> > + /* Skip unused pages */
> > + if (!local_read(&dpage->commit))
> > + continue;
> > +
> > + /* Invalidate even pages. */
> > + if (!(i & 0x1)) {
> > + local_add(subbuf_size + 1, &dpage->commit);
> > + invalid++;
> > + } else {
> > + /* Count total commit bytes. */
> > + entry_bytes += local_read(&dpage->commit);
> > + }
> > + }
> > +
> > + pr_info("Inject invalidated %d pages on CPU%d, total size: %ld\n",
> > + invalid, cpu, (long)entry_bytes);
>
> This is only enabled when testing. Let's make that a pr_warn() as we really
> do want to be able to see it. And it should warn that it is invalidating pages!
> (warn as in pr_warn, it doesn't need a warn as in WARN()).
OK. Let me update it.
Thanks!
>
> -- Steve
>
>
> > + meta->nr_invalid = invalid;
> > + meta->entry_bytes = entry_bytes;
> > +}
> > +#else /* !CONFIG_RING_BUFFER_PERSISTENT_INJECT */
> > +#define rb_test_inject_invalid_pages(buffer) do { } while (0)
> > +#endif
> > +
> > /* Stop recording on a persistent buffer and flush cache if needed. */
> > static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
> > {
> > struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
> >
> > ring_buffer_record_off(buffer);
> > + rb_test_inject_invalid_pages(buffer);
> > arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
> > return NOTIFY_DONE;
> > }
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH] tracing: always provide a prototype for tracing_alloc_snapshot()
From: Bartosz Golaszewski @ 2026-03-31 8:20 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-trace-kernel, Bartosz Golaszewski
The tracing_alloc_snapshot() symbol is always exported even with
!CONFIG_TRACER_SNAPSHOT so the prototype too must be always visible or
we'll see the following warning:
kernel/trace/trace.c:820:5: warning: no previous prototype for ‘tracing_alloc_snapshot’ [-Wmissing-prototypes]
820 | int tracing_alloc_snapshot(void)
| ^~~~~~~~~~~~~~~~~~~~~~
Fixes: bade44fe5462 ("tracing: Move snapshot code out of trace.c and into trace_snapshot.c")
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
---
kernel/trace/trace.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 6abd9e16ef21..e8612b8b0a34 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -2275,6 +2275,8 @@ static inline void __init trace_event_init(void) { }
static inline void trace_event_update_all(struct trace_eval_map **map, int len) { }
#endif
+int tracing_alloc_snapshot(void);
+
#ifdef CONFIG_TRACER_SNAPSHOT
extern const struct file_operations snapshot_fops;
extern const struct file_operations snapshot_raw_fops;
@@ -2282,7 +2284,6 @@ extern const struct file_operations snapshot_raw_fops;
/* Used when creating instances */
int trace_allocate_snapshot(struct trace_array *tr, int size);
-int tracing_alloc_snapshot(void);
void tracing_snapshot_cond(struct trace_array *tr, void *cond_data);
int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, cond_update_fn_t update);
int tracing_snapshot_cond_disable(struct trace_array *tr);
--
2.47.3
^ permalink raw reply related
* [PATCH v15 0/5] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu (Google) @ 2026-03-31 8:35 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
Hi,
Here is the 15th version of improvement patches for making persistent
ring buffers robust to failures.
The previous version is here:
https://lore.kernel.org/all/177487498530.3463592.12715592581212799257.stgit@mhiramat.tok.corp.google.com/
This version fixes warning if reader see contiguous multiple
empty (invalidated) pages on the recovered persistent ring buffer
[2/5], do not show discarded pages if it is 0 [2/5], use pr_warn()
for showing test result [4/5], and inject errors to the pages which
is multiples of 5 too[4/5].
This also drops "Reset RB_MISSED_* flags" patch, and add a patch
to show commit number on each data page [5/5] for debugging.
In this version, I added arm64 maitainers to request their
review for [1/5], which introduces asm/ring_buffer.h for flushing
the cache before reboot.
Thank you,
---
Masami Hiramatsu (Google) (5):
ring-buffer: Flush and stop persistent ring buffer on panic
ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
ring-buffer: Add persistent ring buffer invalid-page inject test
ring-buffer: Show commit numbers in buffer_meta file
arch/alpha/include/asm/Kbuild | 1
arch/arc/include/asm/Kbuild | 1
arch/arm/include/asm/Kbuild | 1
arch/arm64/include/asm/ring_buffer.h | 10 +
arch/csky/include/asm/Kbuild | 1
arch/hexagon/include/asm/Kbuild | 1
arch/loongarch/include/asm/Kbuild | 1
arch/m68k/include/asm/Kbuild | 1
arch/microblaze/include/asm/Kbuild | 1
arch/mips/include/asm/Kbuild | 1
arch/nios2/include/asm/Kbuild | 1
arch/openrisc/include/asm/Kbuild | 1
arch/parisc/include/asm/Kbuild | 1
arch/powerpc/include/asm/Kbuild | 1
arch/riscv/include/asm/Kbuild | 1
arch/s390/include/asm/Kbuild | 1
arch/sh/include/asm/Kbuild | 1
arch/sparc/include/asm/Kbuild | 1
arch/um/include/asm/Kbuild | 1
arch/x86/include/asm/Kbuild | 1
arch/xtensa/include/asm/Kbuild | 1
include/asm-generic/ring_buffer.h | 13 ++
include/linux/ring_buffer.h | 1
kernel/trace/Kconfig | 31 ++++
kernel/trace/ring_buffer.c | 258 ++++++++++++++++++++++++++--------
kernel/trace/trace.c | 4 +
26 files changed, 273 insertions(+), 64 deletions(-)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH v15 1/5] ring-buffer: Flush and stop persistent ring buffer on panic
From: Masami Hiramatsu (Google) @ 2026-03-31 8:36 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177494615421.71933.3679132057004156013.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
On real hardware, panic and machine reboot may not flush hardware cache
to memory. This means the persistent ring buffer, which relies on a
coherent state of memory, may not have its events written to the buffer
and they may be lost. Moreover, there may be inconsistency with the
counters which are used for validation of the integrity of the
persistent ring buffer which may cause all data to be discarded.
To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.
Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v13:
- Fix a rebase conflict.
Changes in v11:
- Do nothing by default since flush_cache_vmap() does nothing on x86
but it can cause deadlock on some architectures via on_each_cpu()
because other CPUs will be stoppped when panic notifier is called.
Changes in v9:
- Fix typo of & to &&.
- Fix typo of "Generic"
Changes in v6:
- Introduce asm/ring_buffer.h for arch_ring_buffer_flush_range().
- Use flush_cache_vmap() instead of flush_cache_all().
Changes in v5:
- Use ring_buffer_record_off() instead of ring_buffer_record_disable().
- Use flush_cache_all() to ensure flush all cache.
Changes in v3:
- update patch description.
---
arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/ring_buffer.h | 10 ++++++++++
arch/csky/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/loongarch/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/x86/include/asm/Kbuild | 1 +
arch/xtensa/include/asm/Kbuild | 1 +
include/asm-generic/ring_buffer.h | 13 +++++++++++++
kernel/trace/ring_buffer.c | 22 ++++++++++++++++++++++
23 files changed, 65 insertions(+)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 483965c5a4de..b154b4e3dfa8 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += agp.h
generic-y += asm-offsets.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index 4c69522e0328..483caacc6988 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -5,5 +5,6 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 03657ff8fbe3..decad5f2c826 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -3,6 +3,7 @@ generic-y += early_ioremap.h
generic-y += extable.h
generic-y += flat.h
generic-y += parport.h
+generic-y += ring_buffer.h
generated-y += mach-types.h
generated-y += unistd-nr.h
diff --git a/arch/arm64/include/asm/ring_buffer.h b/arch/arm64/include/asm/ring_buffer.h
new file mode 100644
index 000000000000..62316c406888
--- /dev/null
+++ b/arch/arm64/include/asm/ring_buffer.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_ARM64_RING_BUFFER_H
+#define _ASM_ARM64_RING_BUFFER_H
+
+#include <asm/cacheflush.h>
+
+/* Flush D-cache on persistent ring buffer */
+#define arch_ring_buffer_flush_range(start, end) dcache_clean_pop(start, end)
+
+#endif /* _ASM_ARM64_RING_BUFFER_H */
diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild
index 3a5c7f6e5aac..7dca0c6cdc84 100644
--- a/arch/csky/include/asm/Kbuild
+++ b/arch/csky/include/asm/Kbuild
@@ -9,6 +9,7 @@ generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
generic-y += text-patching.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index 1efa1e993d4b..0f887d4238ed 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += extable.h
generic-y += iomap.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild
index 9034b583a88a..7e92957baf6a 100644
--- a/arch/loongarch/include/asm/Kbuild
+++ b/arch/loongarch/include/asm/Kbuild
@@ -10,5 +10,6 @@ generic-y += qrwlock.h
generic-y += user.h
generic-y += ioctl.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
generic-y += statfs.h
generic-y += text-patching.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index b282e0dd8dc1..62543bf305ff 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -3,5 +3,6 @@ generated-y += syscall_table.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += text-patching.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index 7178f990e8b3..0030309b47ad 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += syscalls.h
generic-y += tlb.h
generic-y += user.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 684569b2ecd6..9771c3d85074 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -12,5 +12,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 28004301c236..0a2530964413 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += cmpxchg.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index cef49d60d74c..8aa34621702d 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -8,4 +8,5 @@ generic-y += spinlock_types.h
generic-y += spinlock.h
generic-y += qrwlock_types.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index 4fb596d94c89..d48d158f7241 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 2e23533b67e3..805b5aeebb6f 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -5,4 +5,5 @@ generated-y += syscall_table_spu.h
generic-y += agp.h
generic-y += mcs_spinlock.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += early_ioremap.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index bd5fc9403295..7721b63642f4 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -14,5 +14,6 @@ generic-y += ticket_spinlock.h
generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 80bad7de7a04..0c1fc47c3ba0 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,3 +7,4 @@ generated-y += unistd_nr.h
generic-y += asm-offsets.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 4d3f10ed8275..f0403d3ee8ab 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -3,4 +3,5 @@ generated-y += syscall_table.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 17ee8a273aa6..49c6bb326b75 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 1b9b82bbe322..2a1629ba8140 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -17,6 +17,7 @@ generic-y += module.lds.h
generic-y += parport.h
generic-y += percpu.h
generic-y += preempt.h
+generic-y += ring_buffer.h
generic-y += runtime-const.h
generic-y += softirq_stack.h
generic-y += switch_to.h
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4566000e15c4..078fd2c0d69d 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -14,3 +14,4 @@ generic-y += early_ioremap.h
generic-y += fprobe.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 13fe45dea296..e57af619263a 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -6,5 +6,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/include/asm-generic/ring_buffer.h b/include/asm-generic/ring_buffer.h
new file mode 100644
index 000000000000..201d2aee1005
--- /dev/null
+++ b/include/asm-generic/ring_buffer.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Generic arch dependent ring_buffer macros.
+ */
+#ifndef __ASM_GENERIC_RING_BUFFER_H__
+#define __ASM_GENERIC_RING_BUFFER_H__
+
+#include <linux/cacheflush.h>
+
+/* Flush cache on ring buffer range if needed. Do nothing by default. */
+#define arch_ring_buffer_flush_range(start, end) do { } while (0)
+
+#endif /* __ASM_GENERIC_RING_BUFFER_H__ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8b6c39bba56d..3e793bd1c134 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7,6 +7,7 @@
#include <linux/ring_buffer_types.h>
#include <linux/sched/isolation.h>
#include <linux/trace_recursion.h>
+#include <linux/panic_notifier.h>
#include <linux/trace_events.h>
#include <linux/ring_buffer.h>
#include <linux/trace_clock.h>
@@ -31,6 +32,7 @@
#include <linux/oom.h>
#include <linux/mm.h>
+#include <asm/ring_buffer.h>
#include <asm/local64.h>
#include <asm/local.h>
#include <asm/setup.h>
@@ -559,6 +561,7 @@ struct trace_buffer {
unsigned long range_addr_start;
unsigned long range_addr_end;
+ struct notifier_block flush_nb;
struct ring_buffer_meta *meta;
@@ -2520,6 +2523,16 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+/* Stop recording on a persistent buffer and flush cache if needed. */
+static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
+{
+ struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
+
+ ring_buffer_record_off(buffer);
+ arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
+ return NOTIFY_DONE;
+}
+
static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
int order, unsigned long start,
unsigned long end,
@@ -2650,6 +2663,12 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
mutex_init(&buffer->mutex);
+ /* Persistent ring buffer needs to flush cache before reboot. */
+ if (start && end) {
+ buffer->flush_nb.notifier_call = rb_flush_buffer_cb;
+ atomic_notifier_chain_register(&panic_notifier_list, &buffer->flush_nb);
+ }
+
return_ptr(buffer);
fail_free_buffers:
@@ -2748,6 +2767,9 @@ ring_buffer_free(struct trace_buffer *buffer)
{
int cpu;
+ if (buffer->range_addr_start && buffer->range_addr_end)
+ atomic_notifier_chain_unregister(&panic_notifier_list, &buffer->flush_nb);
+
cpuhp_state_remove_instance(CPUHP_TRACE_RB_PREPARE, &buffer->node);
irq_work_sync(&buffer->irq_work.work);
^ permalink raw reply related
* [PATCH v15 2/5] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-03-31 8:36 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177494615421.71933.3679132057004156013.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when validating the persistent ring buffer
instead of discarding the entire ring buffer. Only skipped buffers
are invalidated (cleared).
If the cache data in memory fails to be synchronized during a reboot,
the persistent ring buffer may become partially corrupted, but other
sub-buffers may still contain readable event data. Only discard the
subbuffers that are found to be corrupted.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v15:
- Skip reader_page loop check on persistent ring buffer because
there can be contiguous empty(invalidated) pages.
- Do not show discarded page number information if it is 0.
Changes in v11:
- Fix a typo.
Changes in v9:
- Add meta->subbuf_size check.
- Fix a typo.
- Handle invalid reader_page case.
Changes in v8:
- Add comment in rb_valudate_buffer()
- Clear the RB_MISSED_* flags in rb_valudate_buffer() instead of
skipping subbuf.
- Remove unused subbuf local variable from rb_cpu_meta_valid().
Changes in v7:
- Combined with Handling RB_MISSED_* flags patch, focus on validation at boot.
- Remove checking subbuffer data when validating metadata, because it should be done
later.
- Do not mark the discarded sub buffer page but just reset it.
Changes in v6:
- Show invalid page detection message once per CPU.
Changes in v5:
- Instead of showing errors for each page, just show the number
of discarded pages at last.
Changes in v3:
- Record missed data event on commit.
---
kernel/trace/ring_buffer.c | 109 ++++++++++++++++++++++++++------------------
1 file changed, 65 insertions(+), 44 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3e793bd1c134..2a6254edae5f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -370,6 +370,12 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
return local_read(&bpage->page->commit);
}
+/* Size is determined by what has been committed */
+static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
+{
+ return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
/* Range pages are not to be freed */
@@ -1762,7 +1768,6 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
unsigned long *subbuf_mask)
{
int subbuf_size = PAGE_SIZE;
- struct buffer_data_page *subbuf;
unsigned long buffers_start;
unsigned long buffers_end;
int i;
@@ -1770,6 +1775,11 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
if (!subbuf_mask)
return false;
+ if (meta->subbuf_size != PAGE_SIZE) {
+ pr_info("Ring buffer boot meta [%d] invalid subbuf_size\n", cpu);
+ return false;
+ }
+
buffers_start = meta->first_buffer;
buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
@@ -1786,11 +1796,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- subbuf = rb_subbufs_from_meta(meta);
-
bitmap_clear(subbuf_mask, 0, meta->nr_subbufs);
- /* Is the meta buffers and the subbufs themselves have correct data? */
+ /*
+ * Ensure the meta::buffers array has correct data. The data in each subbufs
+ * are checked later in rb_meta_validate_events().
+ */
for (i = 0; i < meta->nr_subbufs; i++) {
if (meta->buffers[i] < 0 ||
meta->buffers[i] >= meta->nr_subbufs) {
@@ -1798,18 +1809,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
- pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu);
- return false;
- }
-
if (test_bit(meta->buffers[i], subbuf_mask)) {
pr_info("Ring buffer boot meta [%d] array has duplicates\n", cpu);
return false;
}
set_bit(meta->buffers[i], subbuf_mask);
- subbuf = (void *)subbuf + subbuf_size;
}
return true;
@@ -1873,13 +1878,22 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+ struct ring_buffer_cpu_meta *meta)
{
unsigned long long ts;
+ unsigned long tail;
u64 delta;
- int tail;
- tail = local_read(&dpage->commit);
+ /*
+ * When a sub-buffer is recovered from a read, the commit value may
+ * have RB_MISSED_* bits set, as these bits are reset on reuse.
+ * Even after clearing these bits, a commit value greater than the
+ * subbuf_size is considered invalid.
+ */
+ tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ if (tail > meta->subbuf_size)
+ return -1;
return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
}
@@ -1890,6 +1904,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
struct buffer_page *head_page, *orig_head;
unsigned long entry_bytes = 0;
unsigned long entries = 0;
+ int discarded = 0;
int ret;
u64 ts;
int i;
@@ -1900,14 +1915,19 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_head = head_page = cpu_buffer->head_page;
/* Do the reader page first */
- ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer reader page is invalid\n");
- goto invalid;
+ pr_info("Ring buffer meta [%d] invalid reader page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&cpu_buffer->reader_page->entries, 0);
+ local_set(&cpu_buffer->reader_page->page->commit, 0);
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(cpu_buffer->reader_page);
+ local_set(&cpu_buffer->reader_page->entries, ret);
}
- entries += ret;
- entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
- local_set(&cpu_buffer->reader_page->entries, ret);
ts = head_page->page->time_stamp;
@@ -1935,7 +1955,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
break;
/* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0)
break;
@@ -2014,21 +2034,24 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->reader_page)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer meta [%d] invalid buffer page\n",
- cpu_buffer->cpu);
- goto invalid;
- }
-
- /* If the buffer has content, update pages_touched */
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
-
- entries += ret;
- entry_bytes += local_read(&head_page->page->commit);
- local_set(&head_page->entries, ret);
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&head_page->entries, 0);
+ local_set(&head_page->page->commit, 0);
+ } else {
+ /* If the buffer has content, update pages_touched */
+ if (ret)
+ local_inc(&cpu_buffer->pages_touched);
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ local_set(&head_page->entries, ret);
+ }
if (head_page == cpu_buffer->commit_page)
break;
}
@@ -2042,7 +2065,10 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
local_set(&cpu_buffer->entries, entries);
local_set(&cpu_buffer->entries_bytes, entry_bytes);
- pr_info("Ring buffer meta [%d] is from previous boot!\n", cpu_buffer->cpu);
+ pr_info("Ring buffer meta [%d] is from previous boot!", cpu_buffer->cpu);
+ if (discarded)
+ pr_cont(" (%d pages discarded)", discarded);
+ pr_cont("\n");
return;
invalid:
@@ -3329,12 +3355,6 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
return NULL;
}
-/* Size is determined by what has been committed */
-static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
-{
- return rb_page_commit(bpage) & ~RB_MISSED_MASK;
-}
-
static __always_inline unsigned
rb_commit_index(struct ring_buffer_per_cpu *cpu_buffer)
{
@@ -5647,11 +5667,12 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
again:
/*
* This should normally only loop twice. But because the
- * start of the reader inserts an empty page, it causes
- * a case where we will loop three times. There should be no
- * reason to loop four times (that I know of).
+ * start of the reader inserts an empty page, it causes a
+ * case where we will loop three times. There should be no
+ * reason to loop four times unless the ring buffer is a
+ * recovered persistent ring buffer.
*/
- if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
+ if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3 && !cpu_buffer->ring_meta)) {
reader = NULL;
goto out;
}
^ permalink raw reply related
* [PATCH v15 3/5] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-03-31 8:36 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177494615421.71933.3679132057004156013.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when rewinding the persistent ring buffer
instead of stopping the rewinding the ring buffer. The skipped
buffers are cleared.
To ensure the rewinding stops at the unused page, this also clears
buffer_data_page::time_stamp when tracing resets the buffer. This
allows us to identify unused pages and empty pages.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v12:
- Fix build error.
Changes in v11:
- Reset timestamp when the buffer is invalid.
- When rewinding, skip subbuf page if timestamp is wrong and
check timestamp after validating buffer data page.
Changes in v10:
- Newly added.
---
kernel/trace/ring_buffer.c | 76 +++++++++++++++++++++++++-------------------
1 file changed, 43 insertions(+), 33 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2a6254edae5f..5ff632ca3858 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -363,6 +363,7 @@ struct buffer_page {
static void rb_init_page(struct buffer_data_page *bpage)
{
local_set(&bpage->commit, 0);
+ bpage->time_stamp = 0;
}
static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
@@ -1878,12 +1879,14 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
struct ring_buffer_cpu_meta *meta)
{
+ struct buffer_data_page *dpage = bpage->page;
unsigned long long ts;
unsigned long tail;
u64 delta;
+ int ret = -1;
/*
* When a sub-buffer is recovered from a read, the commit value may
@@ -1892,9 +1895,17 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
* subbuf_size is considered invalid.
*/
tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
- if (tail > meta->subbuf_size)
- return -1;
- return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+ if (tail <= meta->subbuf_size)
+ ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+
+ if (ret < 0) {
+ local_set(&bpage->entries, 0);
+ local_set(&bpage->page->commit, 0);
+ } else {
+ local_set(&bpage->entries, ret);
+ }
+
+ return ret;
}
/* If the meta data has been validated, now validate the events */
@@ -1915,18 +1926,14 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_head = head_page = cpu_buffer->head_page;
/* Do the reader page first */
- ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
+ ret = rb_validate_buffer(cpu_buffer->reader_page, cpu_buffer->cpu, meta);
if (ret < 0) {
pr_info("Ring buffer meta [%d] invalid reader page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&cpu_buffer->reader_page->entries, 0);
- local_set(&cpu_buffer->reader_page->page->commit, 0);
} else {
entries += ret;
entry_bytes += rb_page_size(cpu_buffer->reader_page);
- local_set(&cpu_buffer->reader_page->entries, ret);
}
ts = head_page->page->time_stamp;
@@ -1945,26 +1952,33 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->tail_page)
break;
- /* Ensure the page has older data than head. */
- if (ts < head_page->page->time_stamp)
- break;
-
- ts = head_page->page->time_stamp;
- /* Ensure the page has correct timestamp and some data. */
- if (!ts || rb_page_commit(head_page) == 0)
- break;
-
- /* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
- if (ret < 0)
+ /* Rewind until unused page (no timestamp, no commit). */
+ if (!head_page->page->time_stamp && rb_page_commit(head_page) == 0)
break;
- /* Recover the number of entries and update stats. */
- local_set(&head_page->entries, ret);
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
- entries += ret;
- entry_bytes += rb_page_commit(head_page);
+ /*
+ * Skip if the page is invalid, or its timestamp is newer than the
+ * previous valid page.
+ */
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta);
+ if (ret >= 0 && ts < head_page->page->time_stamp) {
+ local_set(&head_page->entries, 0);
+ local_set(&head_page->page->commit, 0);
+ head_page->page->time_stamp = ts;
+ ret = -1;
+ }
+ if (ret < 0) {
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ if (ret > 0)
+ local_inc(&cpu_buffer->pages_touched);
+ ts = head_page->page->time_stamp;
+ }
}
if (i)
pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2034,15 +2048,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->reader_page)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta);
if (ret < 0) {
if (!discarded)
pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
} else {
/* If the buffer has content, update pages_touched */
if (ret)
@@ -2050,7 +2061,6 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
entries += ret;
entry_bytes += rb_page_size(head_page);
- local_set(&head_page->entries, ret);
}
if (head_page == cpu_buffer->commit_page)
break;
@@ -2083,7 +2093,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Reset all the subbuffers */
for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
+ rb_init_page(head_page->page);
}
}
^ permalink raw reply related
* [PATCH v15 4/5] ring-buffer: Add persistent ring buffer invalid-page inject test
From: Masami Hiramatsu (Google) @ 2026-03-31 8:36 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177494615421.71933.3679132057004156013.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a self-destractive test for the persistent ring buffer.
This will inject erroneous value to some sub-buffer pages (where
the index is even or multiples of 5) in the persistent ring buffer
when kernel gets panic, and check whether the number of detected
invalid pages and the total entry_bytes are the same as recorded
values after reboot.
This can ensure the kernel correctly recover partially corrupted
persistent ring buffer when boot.
The test only runs on the persistent ring buffer whose name is
"ptracingtest". And user has to fill it up with events before
kernel panics.
To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT
and you have to setup the kernel cmdline;
reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
panic=1
And run following commands after the 1st boot;
cd /sys/kernel/tracing/instances/ptracingtest
echo 1 > tracing_on
echo 1 > events/enable
sleep 3
echo c > /proc/sysrq-trigger
After panic message, the kernel will reboot and run the verification
on the persistent ring buffer, e.g.
Ring buffer meta [2] invalid buffer page detected
Ring buffer meta [2] is from previous boot! (318 pages discarded)
Ring buffer testing [2] invalid pages: PASSED (318/318)
Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v15:
- Use pr_warn() for test result.
- Inject errors on the page index is multiples of 5 so that
this can reproduce contiguous empty pages.
Changes in v14:
- Rename config to CONFIG_RING_BUFFER_PERSISTENT_INJECT.
- Clear meta->nr_invalid/entry_bytes after testing.
- Add test commands in config comment.
Changes in v10:
- Add entry_bytes test.
- Do not compile test code if CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n.
Changes in v9:
- Test also reader pages.
---
include/linux/ring_buffer.h | 1 +
kernel/trace/Kconfig | 31 ++++++++++++++++++
kernel/trace/ring_buffer.c | 74 +++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.c | 4 ++
4 files changed, 110 insertions(+)
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 994f52b34344..0670742b2d60 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -238,6 +238,7 @@ int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
+ RB_FL_TESTING = 1 << 1,
};
#ifdef CONFIG_RING_BUFFER
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..07305ed6d745 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1202,6 +1202,37 @@ config RING_BUFFER_VALIDATE_TIME_DELTAS
Only say Y if you understand what this does, and you
still want it enabled. Otherwise say N
+config RING_BUFFER_PERSISTENT_INJECT
+ bool "Enable persistent ring buffer error injection test"
+ depends on RING_BUFFER
+ help
+ Run a selftest on the persistent ring buffer which names
+ "ptracingtest" (and its backup) when panic_on_reboot by
+ invalidating ring buffer pages.
+ To use this, boot kernel with "ptracingtest" persistent
+ ring buffer, e.g.
+
+ reserve_mem=20M:2M:trace trace_instance=ptracingtest@trace panic=1
+
+ And after the 1st boot, run test command, like;
+
+ cd /sys/kernel/tracing/instances/ptracingtest
+ echo 1 > events/enable
+ echo 1 > tracing_on
+ sleep 3
+ echo c > /proc/sysrq-trigger
+
+ After panic message, the kernel reboots and show test results
+ on the boot log.
+
+ Note that user has to enable events on the persistent ring
+ buffer manually to fill up ring buffers before rebooting.
+ Since this invalidates the data on test target ring buffer,
+ "ptracingtest" persistent ring buffer must not be used for
+ actual tracing, but only for testing.
+
+ If unsure, say N
+
config MMIOTRACE_TEST
tristate "Test module for mmiotrace"
depends on MMIOTRACE && m
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 5ff632ca3858..fb098b0b4505 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -64,6 +64,10 @@ struct ring_buffer_cpu_meta {
unsigned long commit_buffer;
__u32 subbuf_size;
__u32 nr_subbufs;
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+ __u32 nr_invalid;
+ __u32 entry_bytes;
+#endif
int buffers[];
};
@@ -2079,6 +2083,21 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (discarded)
pr_cont(" (%d pages discarded)", discarded);
pr_cont("\n");
+
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+ if (meta->nr_invalid)
+ pr_warn("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
+ cpu_buffer->cpu,
+ (discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+ discarded, meta->nr_invalid);
+ if (meta->entry_bytes)
+ pr_warn("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
+ cpu_buffer->cpu,
+ (entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+ (long)entry_bytes, (long)meta->entry_bytes);
+ meta->nr_invalid = 0;
+ meta->entry_bytes = 0;
+#endif
return;
invalid:
@@ -2559,12 +2578,67 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
+{
+ struct ring_buffer_per_cpu *cpu_buffer;
+ struct ring_buffer_cpu_meta *meta;
+ struct buffer_data_page *dpage;
+ u32 entry_bytes = 0;
+ unsigned long ptr;
+ int subbuf_size;
+ int invalid = 0;
+ int cpu;
+ int i;
+
+ if (!(buffer->flags & RB_FL_TESTING))
+ return;
+
+ guard(preempt)();
+ cpu = smp_processor_id();
+
+ cpu_buffer = buffer->buffers[cpu];
+ meta = cpu_buffer->ring_meta;
+ ptr = (unsigned long)rb_subbufs_from_meta(meta);
+ subbuf_size = meta->subbuf_size;
+
+ for (i = 0; i < meta->nr_subbufs; i++) {
+ int idx = meta->buffers[i];
+
+ dpage = (void *)(ptr + idx * subbuf_size);
+ /* Skip unused pages */
+ if (!local_read(&dpage->commit))
+ continue;
+
+ /*
+ * Invalidate even pages or multiples of 5. This will lead 3
+ * contiguous invalidated(empty) pages.
+ */
+ if (!(i & 0x1) || !(i % 5)) {
+ local_add(subbuf_size + 1, &dpage->commit);
+ invalid++;
+ } else {
+ /* Count total commit bytes. */
+ entry_bytes += local_read(&dpage->commit);
+ }
+ }
+
+ pr_info("Inject invalidated %d pages on CPU%d, total size: %ld\n",
+ invalid, cpu, (long)entry_bytes);
+ meta->nr_invalid = invalid;
+ meta->entry_bytes = entry_bytes;
+}
+#else /* !CONFIG_RING_BUFFER_PERSISTENT_INJECT */
+#define rb_test_inject_invalid_pages(buffer) do { } while (0)
+#endif
+
/* Stop recording on a persistent buffer and flush cache if needed. */
static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
{
struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
ring_buffer_record_off(buffer);
+ rb_test_inject_invalid_pages(buffer);
arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
return NOTIFY_DONE;
}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4189ec9df6a5..108b0d16badf 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9366,6 +9366,8 @@ static void setup_trace_scratch(struct trace_array *tr,
memset(tscratch, 0, size);
}
+#define TRACE_TEST_PTRACING_NAME "ptracingtest"
+
static int
allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned long size)
{
@@ -9378,6 +9380,8 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned
buf->tr = tr;
if (tr->range_addr_start && tr->range_addr_size) {
+ if (!strcmp(tr->name, TRACE_TEST_PTRACING_NAME))
+ rb_flags |= RB_FL_TESTING;
/* Add scratch buffer to handle 128 modules */
buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
tr->range_addr_start,
^ permalink raw reply related
* [PATCH v15 5/5] ring-buffer: Show commit numbers in buffer_meta file
From: Masami Hiramatsu (Google) @ 2026-03-31 8:36 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177494615421.71933.3679132057004156013.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
In addition to the index number, show the commit numbers of
each data page in per_cpu buffer_meta file.
This is useful for understanding the current status of the
persistent ring buffer. (Note that this file is shown
only for persistent ring buffer and its backup instance)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/ring_buffer.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index fb098b0b4505..5b40fea6b15d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2209,6 +2209,7 @@ static int rbm_show(struct seq_file *m, void *v)
struct ring_buffer_per_cpu *cpu_buffer = m->private;
struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
unsigned long val = (unsigned long)v;
+ struct buffer_data_page *dpage;
if (val == 1) {
seq_printf(m, "head_buffer: %d\n",
@@ -2221,7 +2222,9 @@ static int rbm_show(struct seq_file *m, void *v)
}
val -= 2;
- seq_printf(m, "buffer[%ld]: %d\n", val, meta->buffers[val]);
+ dpage = rb_range_buffer(cpu_buffer, val);
+ seq_printf(m, "buffer[%ld]: %d (commit: %ld)\n",
+ val, meta->buffers[val], local_read(&dpage->commit));
return 0;
}
^ permalink raw reply related
* Re: [PATCH] blktrace: reject buf_size smaller than blk_io_trace
From: Deepanshu Kartikey @ 2026-03-31 8:47 UTC (permalink / raw)
To: axboe, rostedt, mhiramat, mathieu.desnoyers
Cc: linux-block, linux-kernel, linux-trace-kernel,
syzbot+ed8bc247f231c1a48e21
In-Reply-To: <20260322051838.1137822-1-kartikey406@gmail.com>
On Sun, Mar 22, 2026 at 10:48 AM Deepanshu Kartikey
<kartikey406@gmail.com> wrote:
>
> blk_trace_setup() accepts any non-zero buf_size.
> If buf_size < sizeof(struct blk_io_trace), relay_reserve()
> always returns NULL and all trace events are silently dropped.
>
> Reject such values early with -EINVAL.
>
> Reported-by: syzbot+ed8bc247f231c1a48e21@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=ed8bc247f231c1a48e21
> Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
> ---
> kernel/trace/blktrace.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 8cd2520b4c99..6cc7d83ed1c2 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -773,7 +773,7 @@ int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
> if (ret)
> return -EFAULT;
>
> - if (!buts.buf_size || !buts.buf_nr)
> + if (buts.buf_size < sizeof(struct blk_io_trace) || !buts.buf_nr)
> return -EINVAL;
>
> buts2 = (struct blk_user_trace_setup2) {
> --
> 2.43.0
>
Gentle ping on this patch . Let me know if anything else required
Thanks
^ permalink raw reply
* Re: [PATCH] tracing: always provide a prototype for tracing_alloc_snapshot()
From: Masami Hiramatsu @ 2026-03-31 9:20 UTC (permalink / raw)
To: Bartosz Golaszewski
Cc: Steven Rostedt, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <20260331082001.31345-1-bartosz.golaszewski@oss.qualcomm.com>
On Tue, 31 Mar 2026 10:20:01 +0200
Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com> wrote:
> The tracing_alloc_snapshot() symbol is always exported even with
> !CONFIG_TRACER_SNAPSHOT so the prototype too must be always visible or
> we'll see the following warning:
>
> kernel/trace/trace.c:820:5: warning: no previous prototype for ‘tracing_alloc_snapshot’ [-Wmissing-prototypes]
> 820 | int tracing_alloc_snapshot(void)
> | ^~~~~~~~~~~~~~~~~~~~~~
>
> Fixes: bade44fe5462 ("tracing: Move snapshot code out of trace.c and into trace_snapshot.c")
> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Good catch!
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thanks!
> ---
> kernel/trace/trace.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 6abd9e16ef21..e8612b8b0a34 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -2275,6 +2275,8 @@ static inline void __init trace_event_init(void) { }
> static inline void trace_event_update_all(struct trace_eval_map **map, int len) { }
> #endif
>
> +int tracing_alloc_snapshot(void);
> +
> #ifdef CONFIG_TRACER_SNAPSHOT
> extern const struct file_operations snapshot_fops;
> extern const struct file_operations snapshot_raw_fops;
> @@ -2282,7 +2284,6 @@ extern const struct file_operations snapshot_raw_fops;
> /* Used when creating instances */
> int trace_allocate_snapshot(struct trace_array *tr, int size);
>
> -int tracing_alloc_snapshot(void);
> void tracing_snapshot_cond(struct trace_array *tr, void *cond_data);
> int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, cond_update_fn_t update);
> int tracing_snapshot_cond_disable(struct trace_array *tr);
> --
> 2.47.3
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v8 3/6] tracefs: Check file permission even if user has CAP_DAC_OVERRIDE
From: Masami Hiramatsu @ 2026-03-31 9:24 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260305123314.41ab284d@gandalf.local.home>
On Thu, 5 Mar 2026 12:33:14 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 12 Feb 2026 15:15:15 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > > With this still not working this late in the game, it will need to wait
> > > until the next merge window. I'll take the first two patches of this
> > > series now though.
> >
> > OK. I will send the next version without the first 2 patches.
>
> Hi Masami,
>
> Did you send a new version of this yet?
Oops, I missed this reply. Let me update it.
Thanks,
>
> -- Steve
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v4 1/5] tracing/lock: Remove unnecessary linux/sched.h include
From: Usama Arif @ 2026-03-31 10:11 UTC (permalink / raw)
To: Dmitry Ilvokhin
Cc: Usama Arif, Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
Waiman Long, Thomas Bogendoerfer, Juergen Gross, Ajay Kaher,
Alexey Makhalov, Broadcom internal kernel review list,
Thomas Gleixner, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Arnd Bergmann, Dennis Zhou, Tejun Heo,
Christoph Lameter, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, linux-kernel, linux-mips, virtualization,
linux-arch, linux-mm, linux-trace-kernel, kernel-team
In-Reply-To: <5593ed9718b1a6e4ec51d99772c485734029d4d4.1774536681.git.d@ilvokhin.com>
On Thu, 26 Mar 2026 15:10:00 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> None of the trace events in lock.h reference anything from
> linux/sched.h. Remove the unnecessary include.
>
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> ---
> include/trace/events/lock.h | 1 -
> 1 file changed, 1 deletion(-)
>
Acked-by: Usama Arif <usama.arif@linux.dev>
^ permalink raw reply
* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Kiryl Shutsemau @ 2026-03-31 10:18 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Breno Leitao, Jonathan Corbet, Shuah Khan, linux-kernel,
linux-trace-kernel, linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <20260325232204.05edbb21c7602b6408ca007b@kernel.org>
On Wed, Mar 25, 2026 at 11:22:04PM +0900, Masami Hiramatsu wrote:
> > diff --git a/init/main.c b/init/main.c
> > index 453ac9dff2da0..14a04c283fa48 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -416,9 +416,64 @@ static int __init warn_bootconfig(char *str)
> > return 0;
> > }
> >
> > +/*
> > + * do_early_param() is defined later in this file but called from
> > + * bootconfig_apply_early_params() below, so we need a forward declaration.
> > + */
> > +static int __init do_early_param(char *param, char *val,
> > + const char *unused, void *arg);
> > +
> > +/*
> > + * bootconfig_apply_early_params - dispatch kernel.* keys from the embedded
> > + * bootconfig as early_param() calls.
> > + *
> > + * early_param() handlers must run before most of the kernel initialises
> > + * (e.g. before the GIC driver reads irqchip.gicv3_pseudo_nmi). A bootconfig
> > + * attached to the initrd arrives too late for this because the initrd is not
> > + * mapped yet when early params are processed. The embedded bootconfig lives
> > + * in the kernel image itself (.init.data), so it is always reachable.
> > + *
> > + * This function is called from setup_boot_config() which runs in
> > + * start_kernel() before parse_early_param(), making the timing correct.
> > + */
> > +static void __init bootconfig_apply_early_params(void)
>
> [sashiko comment]
> | Does this run early enough for architectural parameters?
> | While setup_boot_config() runs before parse_early_param() in start_kernel(),
> | it runs after setup_arch(). setup_boot_config() relies on xbc_init() which
> | uses the memblock allocator, requiring setup_arch() to have already
> | initialized it.
> | However, the kernel expects many early parameters (like mem=, earlycon,
> | noapic, and iommu) to be parsed during setup_arch() via the architecture's
> | call to parse_early_param(). Since setup_arch() completes before
> | setup_boot_config() runs, will these architectural early parameters be
> | silently ignored because the decisions they influence were already
> | finalized?
>
> This is the major reason that I did not support early parameter
> in bootconfig. Some archs initialize kernel_cmdline in setup_arch()
> and setup early parameters in it.
> To fix this, we need to change setup_arch() for each architecture so
> that it calls this bootconfig_apply_early_params().
Hi Masami,
I have a question about bootconfig design. Is it necessary to parse the
bootconfig at boot time?
We could avoid a lot of complexity if we flattened the bootconfig into a
simple key-value list before embedding it in the kernel image or
attaching it to the initrd. That would eliminate the need for memory
allocations and allow the config to be used earlier during boot.
What am I missing?
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCH v4 0/5] locking: contended_release tracepoint instrumentation
From: Usama Arif @ 2026-03-31 10:27 UTC (permalink / raw)
To: Dmitry Ilvokhin
Cc: Usama Arif, Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
Waiman Long, Thomas Bogendoerfer, Juergen Gross, Ajay Kaher,
Alexey Makhalov, Broadcom internal kernel review list,
Thomas Gleixner, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Arnd Bergmann, Dennis Zhou, Tejun Heo,
Christoph Lameter, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, linux-kernel, linux-mips, virtualization,
linux-arch, linux-mm, linux-trace-kernel, kernel-team
In-Reply-To: <cover.1774536681.git.d@ilvokhin.com>
On Thu, 26 Mar 2026 15:09:59 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> The existing contention_begin/contention_end tracepoints fire on the
> waiter side. The lock holder's identity and stack can be captured at
> contention_begin time (e.g. perf lock contention --lock-owner), but
> this reflects the holder's state when a waiter arrives, not when the
> lock is actually released.
>
> This series adds a contended_release tracepoint that fires on the
> holder side when a lock with waiters is released. This provides:
>
> - Hold time estimation: when the holder's own acquisition was
> contended, its contention_end (acquisition) and contended_release
> can be correlated to measure how long the lock was held under
> contention.
>
> - The holder's stack at release time, which may differ from what perf lock
> contention --lock-owner captures if the holder does significant work between
> the waiter's arrival and the unlock.
>
> Note: for reader/writer locks, the tracepoint fires for every reader
> releasing while a writer is waiting, not only for the last reader.
>
Would it be better to reorder the patches? It would help with git
bisectability as well. Move the refractoring work in patch 4 and
5 (excluding adding the tracepoints ofcourse) earlier, and then add
all the tracepoints in the same commit at the end? It would help
in the future with git blame to see where all the tracepoints
were added as well.
^ permalink raw reply
* [PATCH] tracing: remove duplicate latency_fsnotify() stub
From: Arnd Bergmann @ 2026-03-31 10:30 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Arnd Bergmann, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
From: Arnd Bergmann <arnd@arndb.de>
During the move, an extra copy of latency_fsnotify() crept in:
kernel/trace/trace_snapshot.c:395:20: error: redefinition of 'latency_fsnotify'
395 | static inline void latency_fsnotify(struct trace_array *tr) { }
| ^~~~~~~~~~~~~~~~
In file included from kernel/trace/trace_snapshot.c:6:
kernel/trace/trace.h:858:20: note: previous definition of 'latency_fsnotify' with type 'void(struct trace_array *)'
858 | static inline void latency_fsnotify(struct trace_array *tr) { }
| ^~~~~~~~~~~~~~~~
The function is still called from the hwlat and osnoise tracers, so the
copy in the header file is the one that has to stay.
Remove the extra one from trace_snapshot.c
Fixes: bade44fe5462 ("tracing: Move snapshot code out of trace.c and into trace_snapshot.c")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Please fold into the commit that moves the code, if possible
---
kernel/trace/trace_snapshot.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/kernel/trace/trace_snapshot.c b/kernel/trace/trace_snapshot.c
index 8865b2ef2264..a54e9533e79d 100644
--- a/kernel/trace/trace_snapshot.c
+++ b/kernel/trace/trace_snapshot.c
@@ -391,8 +391,6 @@ void latency_fsnotify(struct trace_array *tr)
*/
irq_work_queue(&tr->fsnotify_irqwork);
}
-#else
-static inline void latency_fsnotify(struct trace_array *tr) { }
#endif /* LATENCY_FS_NOTIFY */
static const struct file_operations tracing_max_lat_fops;
--
2.39.5
^ permalink raw reply related
* Re: [PATCH v4 3/5] locking: Add contended_release tracepoint to sleepable locks
From: Usama Arif @ 2026-03-31 10:34 UTC (permalink / raw)
To: Dmitry Ilvokhin
Cc: Usama Arif, Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng,
Waiman Long, Thomas Bogendoerfer, Juergen Gross, Ajay Kaher,
Alexey Makhalov, Broadcom internal kernel review list,
Thomas Gleixner, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Arnd Bergmann, Dennis Zhou, Tejun Heo,
Christoph Lameter, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, linux-kernel, linux-mips, virtualization,
linux-arch, linux-mm, linux-trace-kernel, kernel-team
In-Reply-To: <d2e5763278812499335b22a013aafb4979e3324b.1774536681.git.d@ilvokhin.com>
On Thu, 26 Mar 2026 15:10:02 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> Add the contended_release trace event. This tracepoint fires on the
> holder side when a contended lock is released, complementing the
> existing contention_begin/contention_end tracepoints which fire on the
> waiter side.
>
> This enables correlating lock hold time under contention with waiter
> events by lock address.
>
> Add trace_contended_release() calls to the slowpath unlock paths of
> sleepable locks: mutex, rtmutex, semaphore, rwsem, percpu-rwsem, and
> RT-specific rwbase locks.
>
> Where possible, trace_contended_release() fires before the lock is
> released and before the waiter is woken. For some lock types, the
> tracepoint fires after the release but before the wake. Making the
> placement consistent across all lock types is not worth the added
> complexity.
>
> For reader/writer locks, the tracepoint fires for every reader releasing
> while a writer is waiting, not only for the last reader.
>
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> ---
> include/trace/events/lock.h | 17 +++++++++++++++++
> kernel/locking/mutex.c | 4 ++++
> kernel/locking/percpu-rwsem.c | 11 +++++++++++
> kernel/locking/rtmutex.c | 1 +
> kernel/locking/rwbase_rt.c | 6 ++++++
> kernel/locking/rwsem.c | 10 ++++++++--
> kernel/locking/semaphore.c | 4 ++++
> 7 files changed, 51 insertions(+), 2 deletions(-)
>
> diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
> index da978f2afb45..1ded869cd619 100644
> --- a/include/trace/events/lock.h
> +++ b/include/trace/events/lock.h
> @@ -137,6 +137,23 @@ TRACE_EVENT(contention_end,
> TP_printk("%p (ret=%d)", __entry->lock_addr, __entry->ret)
> );
>
> +TRACE_EVENT(contended_release,
> +
> + TP_PROTO(void *lock),
> +
> + TP_ARGS(lock),
> +
> + TP_STRUCT__entry(
> + __field(void *, lock_addr)
> + ),
> +
> + TP_fast_assign(
> + __entry->lock_addr = lock;
> + ),
> +
> + TP_printk("%p", __entry->lock_addr)
> +);
> +
> #endif /* _TRACE_LOCK_H */
>
> /* This part must be outside protection */
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 427187ff02db..6c2c9312eb8f 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -997,6 +997,9 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> wake_q_add(&wake_q, next);
> }
>
> + if (trace_contended_release_enabled() && waiter)
> + trace_contended_release(lock);
> +
This won't compile? waiter is declared in the if block, so you are using
it outside scope here.
^ permalink raw reply
* [PATCH] rv: Allow epoll in rtapp-sleep monitor
From: Nam Cao @ 2026-03-31 10:49 UTC (permalink / raw)
To: Gabriele Monaco; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel, Nam Cao
Since commit 0c43094f8cc9 ("eventpoll: Replace rwlock with spinlock"),
epoll_wait is real-time-safe syscall for sleeping.
Add epoll_wait to the list of rt-safe sleeping APIs.
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
kernel/trace/rv/monitors/sleep/sleep.c | 8 ++
kernel/trace/rv/monitors/sleep/sleep.h | 98 ++++++++++++-----------
tools/verification/models/rtapp/sleep.ltl | 1 +
3 files changed, 61 insertions(+), 46 deletions(-)
diff --git a/kernel/trace/rv/monitors/sleep/sleep.c b/kernel/trace/rv/monitors/sleep/sleep.c
index c1347da69e9d..59091863c17c 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.c
+++ b/kernel/trace/rv/monitors/sleep/sleep.c
@@ -49,6 +49,7 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
ltl_atom_set(mon, LTL_FUTEX_WAIT, false);
+ ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
ltl_atom_set(mon, LTL_FUTEX_LOCK_PI, false);
ltl_atom_set(mon, LTL_BLOCK_ON_RT_MUTEX, false);
}
@@ -63,6 +64,7 @@ static void ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bo
ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
+ ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
if (strstarts(task->comm, "migration/"))
ltl_atom_set(mon, LTL_TASK_IS_MIGRATION, true);
@@ -162,6 +164,11 @@ static void handle_sys_enter(void *data, struct pt_regs *regs, long id)
break;
}
break;
+#ifdef __NR_epoll_wait
+ case __NR_epoll_wait:
+ ltl_atom_set(mon, LTL_EPOLL_WAIT, true);
+ break;
+#endif
}
}
@@ -174,6 +181,7 @@ static void handle_sys_exit(void *data, struct pt_regs *regs, long ret)
ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, false);
ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
+ ltl_atom_set(mon, LTL_EPOLL_WAIT, false);
ltl_atom_update(current, LTL_CLOCK_NANOSLEEP, false);
}
diff --git a/kernel/trace/rv/monitors/sleep/sleep.h b/kernel/trace/rv/monitors/sleep/sleep.h
index 2ab46fd218d2..95dc2727c059 100644
--- a/kernel/trace/rv/monitors/sleep/sleep.h
+++ b/kernel/trace/rv/monitors/sleep/sleep.h
@@ -15,6 +15,7 @@ enum ltl_atom {
LTL_ABORT_SLEEP,
LTL_BLOCK_ON_RT_MUTEX,
LTL_CLOCK_NANOSLEEP,
+ LTL_EPOLL_WAIT,
LTL_FUTEX_LOCK_PI,
LTL_FUTEX_WAIT,
LTL_KERNEL_THREAD,
@@ -40,6 +41,7 @@ static const char *ltl_atom_str(enum ltl_atom atom)
"ab_sl",
"bl_on_rt_mu",
"cl_na",
+ "ep_wa",
"fu_lo_pi",
"fu_wa",
"ker_th",
@@ -75,39 +77,41 @@ static_assert(RV_NUM_BA_STATES <= RV_MAX_BA_STATES);
static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
{
- bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
- bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
- bool val40 = task_is_rcu || task_is_migration;
- bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
- bool val41 = futex_lock_pi || val40;
- bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
- bool val5 = block_on_rt_mutex || val41;
- bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
- bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
- bool val32 = abort_sleep || kthread_should_stop;
bool woken_by_nmi = test_bit(LTL_WOKEN_BY_NMI, mon->atoms);
- bool val33 = woken_by_nmi || val32;
bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
- bool val34 = woken_by_hardirq || val33;
bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
mon->atoms);
- bool val14 = woken_by_equal_or_higher_prio || val34;
bool wake = test_bit(LTL_WAKE, mon->atoms);
- bool val13 = !wake;
- bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
+ bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
+ bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
+ bool sleep = test_bit(LTL_SLEEP, mon->atoms);
+ bool rt = test_bit(LTL_RT, mon->atoms);
+ bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
bool nanosleep_clock_monotonic = test_bit(LTL_NANOSLEEP_CLOCK_MONOTONIC, mon->atoms);
- bool val24 = nanosleep_clock_monotonic || nanosleep_clock_tai;
- bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
- bool val25 = nanosleep_timer_abstime && val24;
- bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
- bool val18 = clock_nanosleep && val25;
+ bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
+ bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
- bool val9 = futex_wait || val18;
+ bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
+ bool epoll_wait = test_bit(LTL_EPOLL_WAIT, mon->atoms);
+ bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
+ bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
+ bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
+ bool val42 = task_is_rcu || task_is_migration;
+ bool val43 = futex_lock_pi || val42;
+ bool val5 = block_on_rt_mutex || val43;
+ bool val34 = abort_sleep || kthread_should_stop;
+ bool val35 = woken_by_nmi || val34;
+ bool val36 = woken_by_hardirq || val35;
+ bool val14 = woken_by_equal_or_higher_prio || val36;
+ bool val13 = !wake;
+ bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
+ bool val27 = nanosleep_timer_abstime && val26;
+ bool val18 = clock_nanosleep && val27;
+ bool val20 = val18 || epoll_wait;
+ bool val9 = futex_wait || val20;
bool val11 = val9 || kernel_thread;
- bool sleep = test_bit(LTL_SLEEP, mon->atoms);
bool val2 = !sleep;
- bool rt = test_bit(LTL_RT, mon->atoms);
bool val1 = !rt;
bool val3 = val1 || val2;
@@ -124,39 +128,41 @@ static void ltl_start(struct task_struct *task, struct ltl_monitor *mon)
static void
ltl_possible_next_states(struct ltl_monitor *mon, unsigned int state, unsigned long *next)
{
- bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
- bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
- bool val40 = task_is_rcu || task_is_migration;
- bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
- bool val41 = futex_lock_pi || val40;
- bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
- bool val5 = block_on_rt_mutex || val41;
- bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
- bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
- bool val32 = abort_sleep || kthread_should_stop;
bool woken_by_nmi = test_bit(LTL_WOKEN_BY_NMI, mon->atoms);
- bool val33 = woken_by_nmi || val32;
bool woken_by_hardirq = test_bit(LTL_WOKEN_BY_HARDIRQ, mon->atoms);
- bool val34 = woken_by_hardirq || val33;
bool woken_by_equal_or_higher_prio = test_bit(LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO,
mon->atoms);
- bool val14 = woken_by_equal_or_higher_prio || val34;
bool wake = test_bit(LTL_WAKE, mon->atoms);
- bool val13 = !wake;
- bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
+ bool task_is_rcu = test_bit(LTL_TASK_IS_RCU, mon->atoms);
+ bool task_is_migration = test_bit(LTL_TASK_IS_MIGRATION, mon->atoms);
+ bool sleep = test_bit(LTL_SLEEP, mon->atoms);
+ bool rt = test_bit(LTL_RT, mon->atoms);
+ bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
bool nanosleep_clock_tai = test_bit(LTL_NANOSLEEP_CLOCK_TAI, mon->atoms);
bool nanosleep_clock_monotonic = test_bit(LTL_NANOSLEEP_CLOCK_MONOTONIC, mon->atoms);
- bool val24 = nanosleep_clock_monotonic || nanosleep_clock_tai;
- bool nanosleep_timer_abstime = test_bit(LTL_NANOSLEEP_TIMER_ABSTIME, mon->atoms);
- bool val25 = nanosleep_timer_abstime && val24;
- bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
- bool val18 = clock_nanosleep && val25;
+ bool kthread_should_stop = test_bit(LTL_KTHREAD_SHOULD_STOP, mon->atoms);
+ bool kernel_thread = test_bit(LTL_KERNEL_THREAD, mon->atoms);
bool futex_wait = test_bit(LTL_FUTEX_WAIT, mon->atoms);
- bool val9 = futex_wait || val18;
+ bool futex_lock_pi = test_bit(LTL_FUTEX_LOCK_PI, mon->atoms);
+ bool epoll_wait = test_bit(LTL_EPOLL_WAIT, mon->atoms);
+ bool clock_nanosleep = test_bit(LTL_CLOCK_NANOSLEEP, mon->atoms);
+ bool block_on_rt_mutex = test_bit(LTL_BLOCK_ON_RT_MUTEX, mon->atoms);
+ bool abort_sleep = test_bit(LTL_ABORT_SLEEP, mon->atoms);
+ bool val42 = task_is_rcu || task_is_migration;
+ bool val43 = futex_lock_pi || val42;
+ bool val5 = block_on_rt_mutex || val43;
+ bool val34 = abort_sleep || kthread_should_stop;
+ bool val35 = woken_by_nmi || val34;
+ bool val36 = woken_by_hardirq || val35;
+ bool val14 = woken_by_equal_or_higher_prio || val36;
+ bool val13 = !wake;
+ bool val26 = nanosleep_clock_monotonic || nanosleep_clock_tai;
+ bool val27 = nanosleep_timer_abstime && val26;
+ bool val18 = clock_nanosleep && val27;
+ bool val20 = val18 || epoll_wait;
+ bool val9 = futex_wait || val20;
bool val11 = val9 || kernel_thread;
- bool sleep = test_bit(LTL_SLEEP, mon->atoms);
bool val2 = !sleep;
- bool rt = test_bit(LTL_RT, mon->atoms);
bool val1 = !rt;
bool val3 = val1 || val2;
diff --git a/tools/verification/models/rtapp/sleep.ltl b/tools/verification/models/rtapp/sleep.ltl
index 6379bbeb6212..6f26c4810f78 100644
--- a/tools/verification/models/rtapp/sleep.ltl
+++ b/tools/verification/models/rtapp/sleep.ltl
@@ -5,6 +5,7 @@ RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
RT_VALID_SLEEP_REASON = FUTEX_WAIT
or RT_FRIENDLY_NANOSLEEP
+ or EPOLL_WAIT
RT_FRIENDLY_NANOSLEEP = CLOCK_NANOSLEEP
and NANOSLEEP_TIMER_ABSTIME
--
2.47.3
^ permalink raw reply related
* Re: [PATCH] tracing: Move snapshot code out of trace.c and into trace_snapshot.c
From: Arnd Bergmann @ 2026-03-31 10:53 UTC (permalink / raw)
To: Steven Rostedt
Cc: kernel test robot, LKML, Linux trace kernel, llvm, oe-kbuild-all,
Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260330120516.73aede9b@gandalf.local.home>
On Mon, Mar 30, 2026, at 18:05, Steven Rostedt wrote:
> On Mon, 30 Mar 2026 16:06:44 +0200 "Arnd Bergmann" <arnd@arndb.de> wrote:
>
>> I saw the same thing and worked around it by removing the function.
>> I then noticed that a bunch of code surrounding it is also unused
>> and I removed that as well (see below). This version passes
>> my randconfig build tests, but I suspect it is still wrong,
>> since the code never had any callers and I don't understand
>> why.
>
> Note, this code is in include/linux/tracing_printk.h, and is for debugging
> purposes (just like trace_printk() is). Hence, it shouldn't be removed.
>
> The purpose is to call tracing_snapshot() when your code detects something
> isn't right (but it doesn't crash), and this will take a snapshot of the
> current trace that lead up to the anomaly.
Right, I assumed it had to be something like that, just didn't immediately
see what. I've sent a patch to just remove the duplicate inline
function now.
> If anything, I should add more to Documentation/trace/debugging.rst about it.
Or maybe a samples module that serves to show how the interface
gets used?
Arnd
^ permalink raw reply
* Re: [PATCH v4 3/5] locking: Add contended_release tracepoint to sleepable locks
From: Dmitry Ilvokhin @ 2026-03-31 12:16 UTC (permalink / raw)
To: Usama Arif
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
virtualization, linux-arch, linux-mm, linux-trace-kernel,
kernel-team
In-Reply-To: <20260331103451.1070175-1-usama.arif@linux.dev>
On Tue, Mar 31, 2026 at 03:34:50AM -0700, Usama Arif wrote:
> On Thu, 26 Mar 2026 15:10:02 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
>
> > Add the contended_release trace event. This tracepoint fires on the
> > holder side when a contended lock is released, complementing the
> > existing contention_begin/contention_end tracepoints which fire on the
> > waiter side.
> >
> > This enables correlating lock hold time under contention with waiter
> > events by lock address.
> >
> > Add trace_contended_release() calls to the slowpath unlock paths of
> > sleepable locks: mutex, rtmutex, semaphore, rwsem, percpu-rwsem, and
> > RT-specific rwbase locks.
> >
> > Where possible, trace_contended_release() fires before the lock is
> > released and before the waiter is woken. For some lock types, the
> > tracepoint fires after the release but before the wake. Making the
> > placement consistent across all lock types is not worth the added
> > complexity.
> >
> > For reader/writer locks, the tracepoint fires for every reader releasing
> > while a writer is waiting, not only for the last reader.
> >
> > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > ---
> > include/trace/events/lock.h | 17 +++++++++++++++++
> > kernel/locking/mutex.c | 4 ++++
> > kernel/locking/percpu-rwsem.c | 11 +++++++++++
> > kernel/locking/rtmutex.c | 1 +
> > kernel/locking/rwbase_rt.c | 6 ++++++
> > kernel/locking/rwsem.c | 10 ++++++++--
> > kernel/locking/semaphore.c | 4 ++++
> > 7 files changed, 51 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
> > index da978f2afb45..1ded869cd619 100644
> > --- a/include/trace/events/lock.h
> > +++ b/include/trace/events/lock.h
> > @@ -137,6 +137,23 @@ TRACE_EVENT(contention_end,
> > TP_printk("%p (ret=%d)", __entry->lock_addr, __entry->ret)
> > );
> >
> > +TRACE_EVENT(contended_release,
> > +
> > + TP_PROTO(void *lock),
> > +
> > + TP_ARGS(lock),
> > +
> > + TP_STRUCT__entry(
> > + __field(void *, lock_addr)
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->lock_addr = lock;
> > + ),
> > +
> > + TP_printk("%p", __entry->lock_addr)
> > +);
> > +
> > #endif /* _TRACE_LOCK_H */
> >
> > /* This part must be outside protection */
> > diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> > index 427187ff02db..6c2c9312eb8f 100644
> > --- a/kernel/locking/mutex.c
> > +++ b/kernel/locking/mutex.c
> > @@ -997,6 +997,9 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> > wake_q_add(&wake_q, next);
> > }
> >
> > + if (trace_contended_release_enabled() && waiter)
> > + trace_contended_release(lock);
> > +
>
> This won't compile? waiter is declared in the if block, so you are using
> it outside scope here.
>
Thanks for the feedback, Usama.
waiter is declared at function scope, right on top. It's also assigned
before the if block, so it's still in scope at the tracepoint.
^ permalink raw reply
* Re: [PATCH v4 0/5] locking: contended_release tracepoint instrumentation
From: Dmitry Ilvokhin @ 2026-03-31 12:32 UTC (permalink / raw)
To: Usama Arif
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
Broadcom internal kernel review list, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
virtualization, linux-arch, linux-mm, linux-trace-kernel,
kernel-team
In-Reply-To: <20260331102704.921355-1-usama.arif@linux.dev>
On Tue, Mar 31, 2026 at 03:27:03AM -0700, Usama Arif wrote:
> On Thu, 26 Mar 2026 15:09:59 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
>
> > The existing contention_begin/contention_end tracepoints fire on the
> > waiter side. The lock holder's identity and stack can be captured at
> > contention_begin time (e.g. perf lock contention --lock-owner), but
> > this reflects the holder's state when a waiter arrives, not when the
> > lock is actually released.
> >
> > This series adds a contended_release tracepoint that fires on the
> > holder side when a lock with waiters is released. This provides:
> >
> > - Hold time estimation: when the holder's own acquisition was
> > contended, its contention_end (acquisition) and contended_release
> > can be correlated to measure how long the lock was held under
> > contention.
> >
> > - The holder's stack at release time, which may differ from what perf lock
> > contention --lock-owner captures if the holder does significant work between
> > the waiter's arrival and the unlock.
> >
> > Note: for reader/writer locks, the tracepoint fires for every reader
> > releasing while a writer is waiting, not only for the last reader.
> >
>
> Would it be better to reorder the patches? It would help with git
> bisectability as well. Move the refractoring work in patch 4 and
> 5 (excluding adding the tracepoints ofcourse) earlier, and then add
> all the tracepoints in the same commit at the end? It would help
> in the future with git blame to see where all the tracepoints
> were added as well.
Thanks for the suggestion, Usama.
I'd prefer to keep the current ordering: each refactoring commit is
immediately followed by the commit that uses it. For example,
queued_spin_release() is factored out right before the commit that adds
the tracepoint to spinning locks. This makes the motivation for the
refactoring clear and should also ease the review since the context is
still fresh.
Bisectability should be fine as-is, each commit compiles and works
independently, since the refactoring patches do not introduce behavioral
changes on their own.
^ permalink raw reply
* Re: [PATCH v8 12/12] rv: Add nomiss deadline monitor
From: Juri Lelli @ 2026-03-31 12:32 UTC (permalink / raw)
To: Gabriele Monaco
Cc: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
Jonathan Corbet, Masami Hiramatsu, linux-trace-kernel, linux-doc,
Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <20260330111010.153663-13-gmonaco@redhat.com>
Hello,
On 30/03/26 13:10, Gabriele Monaco wrote:
> Add the deadline monitors collection to validate the deadline scheduler,
> both for deadline tasks and servers.
>
> The currently implemented monitors are:
> * nomiss:
> validate dl entities run to completion before their deadiline
>
> Reviewed-by: Nam Cao <namcao@linutronix.de>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Looks good to me.
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Best,
Juri
^ permalink raw reply
* Re: [PATCH] tracing: Move snapshot code out of trace.c and into trace_snapshot.c
From: Steven Rostedt @ 2026-03-31 12:34 UTC (permalink / raw)
To: Arnd Bergmann
Cc: kernel test robot, LKML, Linux trace kernel, llvm, oe-kbuild-all,
Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <6efebcb6-194e-4c21-8808-fdf09160eac0@app.fastmail.com>
On Tue, 31 Mar 2026 12:53:38 +0200
"Arnd Bergmann" <arnd@arndb.de> wrote:
> Right, I assumed it had to be something like that, just didn't immediately
> see what. I've sent a patch to just remove the duplicate inline
> function now.
Ah, I had already sent a patch to fix the duplicate because I saw the
kernel test robot's report.
https://lore.kernel.org/all/20260330205859.24c0aae3@gandalf.local.home/
-- Steve
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox