* [RESEND PATCH v16 0/5] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu (Google) @ 2026-04-07 1:12 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
[Resend this series with base-commit tag so that bot can apply this correctly]
Hi,
Here is the 16th version of improvement patches for making persistent
ring buffers robust to failures.
The previous version is here:
https://lore.kernel.org/all/177494615421.71933.3679132057004156013.stgit@mhiramat.tok.corp.google.com/
This version adds Catalin's Ack [1/5] and update description and
document[4/5][5/5]. Also, rebased on ring-buffer/for-next.
Thank you,
Masami Hiramatsu (Google) (5):
ring-buffer: Flush and stop persistent ring buffer on panic
ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
ring-buffer: Add persistent ring buffer invalid-page inject test
ring-buffer: Show commit numbers in buffer_meta file
arch/alpha/include/asm/Kbuild | 1
arch/arc/include/asm/Kbuild | 1
arch/arm/include/asm/Kbuild | 1
arch/arm64/include/asm/ring_buffer.h | 10 +
arch/csky/include/asm/Kbuild | 1
arch/hexagon/include/asm/Kbuild | 1
arch/loongarch/include/asm/Kbuild | 1
arch/m68k/include/asm/Kbuild | 1
arch/microblaze/include/asm/Kbuild | 1
arch/mips/include/asm/Kbuild | 1
arch/nios2/include/asm/Kbuild | 1
arch/openrisc/include/asm/Kbuild | 1
arch/parisc/include/asm/Kbuild | 1
arch/powerpc/include/asm/Kbuild | 1
arch/riscv/include/asm/Kbuild | 1
arch/s390/include/asm/Kbuild | 1
arch/sh/include/asm/Kbuild | 1
arch/sparc/include/asm/Kbuild | 1
arch/um/include/asm/Kbuild | 1
arch/x86/include/asm/Kbuild | 1
arch/xtensa/include/asm/Kbuild | 1
include/asm-generic/ring_buffer.h | 13 ++
include/linux/ring_buffer.h | 1
kernel/trace/Kconfig | 34 ++++
kernel/trace/ring_buffer.c | 258 ++++++++++++++++++++++++++--------
kernel/trace/trace.c | 4 +
26 files changed, 276 insertions(+), 64 deletions(-)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
base-commit: 3515572dd068895ffd241b8a69399a0ebfac7593
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [RESEND PATCH v16 1/5] ring-buffer: Flush and stop persistent ring buffer on panic
From: Masami Hiramatsu (Google) @ 2026-04-07 1:12 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177552432201.853249.5125045538812833325.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
On real hardware, panic and machine reboot may not flush hardware cache
to memory. This means the persistent ring buffer, which relies on a
coherent state of memory, may not have its events written to the buffer
and they may be lost. Moreover, there may be inconsistency with the
counters which are used for validation of the integrity of the
persistent ring buffer which may cause all data to be discarded.
To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.
Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
---
Changes in v13:
- Fix a rebase conflict.
Changes in v11:
- Do nothing by default since flush_cache_vmap() does nothing on x86
but it can cause deadlock on some architectures via on_each_cpu()
because other CPUs will be stoppped when panic notifier is called.
Changes in v9:
- Fix typo of & to &&.
- Fix typo of "Generic"
Changes in v6:
- Introduce asm/ring_buffer.h for arch_ring_buffer_flush_range().
- Use flush_cache_vmap() instead of flush_cache_all().
Changes in v5:
- Use ring_buffer_record_off() instead of ring_buffer_record_disable().
- Use flush_cache_all() to ensure flush all cache.
Changes in v3:
- update patch description.
---
arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/ring_buffer.h | 10 ++++++++++
arch/csky/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/loongarch/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/x86/include/asm/Kbuild | 1 +
arch/xtensa/include/asm/Kbuild | 1 +
include/asm-generic/ring_buffer.h | 13 +++++++++++++
kernel/trace/ring_buffer.c | 22 ++++++++++++++++++++++
23 files changed, 65 insertions(+)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 483965c5a4de..b154b4e3dfa8 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += agp.h
generic-y += asm-offsets.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index 4c69522e0328..483caacc6988 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -5,5 +5,6 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 03657ff8fbe3..decad5f2c826 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -3,6 +3,7 @@ generic-y += early_ioremap.h
generic-y += extable.h
generic-y += flat.h
generic-y += parport.h
+generic-y += ring_buffer.h
generated-y += mach-types.h
generated-y += unistd-nr.h
diff --git a/arch/arm64/include/asm/ring_buffer.h b/arch/arm64/include/asm/ring_buffer.h
new file mode 100644
index 000000000000..62316c406888
--- /dev/null
+++ b/arch/arm64/include/asm/ring_buffer.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_ARM64_RING_BUFFER_H
+#define _ASM_ARM64_RING_BUFFER_H
+
+#include <asm/cacheflush.h>
+
+/* Flush D-cache on persistent ring buffer */
+#define arch_ring_buffer_flush_range(start, end) dcache_clean_pop(start, end)
+
+#endif /* _ASM_ARM64_RING_BUFFER_H */
diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild
index 3a5c7f6e5aac..7dca0c6cdc84 100644
--- a/arch/csky/include/asm/Kbuild
+++ b/arch/csky/include/asm/Kbuild
@@ -9,6 +9,7 @@ generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
generic-y += text-patching.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index 1efa1e993d4b..0f887d4238ed 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += extable.h
generic-y += iomap.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild
index 9034b583a88a..7e92957baf6a 100644
--- a/arch/loongarch/include/asm/Kbuild
+++ b/arch/loongarch/include/asm/Kbuild
@@ -10,5 +10,6 @@ generic-y += qrwlock.h
generic-y += user.h
generic-y += ioctl.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
generic-y += statfs.h
generic-y += text-patching.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index b282e0dd8dc1..62543bf305ff 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -3,5 +3,6 @@ generated-y += syscall_table.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += text-patching.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index 7178f990e8b3..0030309b47ad 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += syscalls.h
generic-y += tlb.h
generic-y += user.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 684569b2ecd6..9771c3d85074 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -12,5 +12,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 28004301c236..0a2530964413 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += cmpxchg.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index cef49d60d74c..8aa34621702d 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -8,4 +8,5 @@ generic-y += spinlock_types.h
generic-y += spinlock.h
generic-y += qrwlock_types.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index 4fb596d94c89..d48d158f7241 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 2e23533b67e3..805b5aeebb6f 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -5,4 +5,5 @@ generated-y += syscall_table_spu.h
generic-y += agp.h
generic-y += mcs_spinlock.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += early_ioremap.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index bd5fc9403295..7721b63642f4 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -14,5 +14,6 @@ generic-y += ticket_spinlock.h
generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 80bad7de7a04..0c1fc47c3ba0 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,3 +7,4 @@ generated-y += unistd_nr.h
generic-y += asm-offsets.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 4d3f10ed8275..f0403d3ee8ab 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -3,4 +3,5 @@ generated-y += syscall_table.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 17ee8a273aa6..49c6bb326b75 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 1b9b82bbe322..2a1629ba8140 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -17,6 +17,7 @@ generic-y += module.lds.h
generic-y += parport.h
generic-y += percpu.h
generic-y += preempt.h
+generic-y += ring_buffer.h
generic-y += runtime-const.h
generic-y += softirq_stack.h
generic-y += switch_to.h
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4566000e15c4..078fd2c0d69d 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -14,3 +14,4 @@ generic-y += early_ioremap.h
generic-y += fprobe.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 13fe45dea296..e57af619263a 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -6,5 +6,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/include/asm-generic/ring_buffer.h b/include/asm-generic/ring_buffer.h
new file mode 100644
index 000000000000..201d2aee1005
--- /dev/null
+++ b/include/asm-generic/ring_buffer.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Generic arch dependent ring_buffer macros.
+ */
+#ifndef __ASM_GENERIC_RING_BUFFER_H__
+#define __ASM_GENERIC_RING_BUFFER_H__
+
+#include <linux/cacheflush.h>
+
+/* Flush cache on ring buffer range if needed. Do nothing by default. */
+#define arch_ring_buffer_flush_range(start, end) do { } while (0)
+
+#endif /* __ASM_GENERIC_RING_BUFFER_H__ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2caa5d3d0ae9..4d5817286791 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7,6 +7,7 @@
#include <linux/ring_buffer_types.h>
#include <linux/sched/isolation.h>
#include <linux/trace_recursion.h>
+#include <linux/panic_notifier.h>
#include <linux/trace_events.h>
#include <linux/ring_buffer.h>
#include <linux/trace_clock.h>
@@ -31,6 +32,7 @@
#include <linux/oom.h>
#include <linux/mm.h>
+#include <asm/ring_buffer.h>
#include <asm/local64.h>
#include <asm/local.h>
#include <asm/setup.h>
@@ -559,6 +561,7 @@ struct trace_buffer {
unsigned long range_addr_start;
unsigned long range_addr_end;
+ struct notifier_block flush_nb;
struct ring_buffer_meta *meta;
@@ -2520,6 +2523,16 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+/* Stop recording on a persistent buffer and flush cache if needed. */
+static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
+{
+ struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
+
+ ring_buffer_record_off(buffer);
+ arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
+ return NOTIFY_DONE;
+}
+
static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
int order, unsigned long start,
unsigned long end,
@@ -2650,6 +2663,12 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
mutex_init(&buffer->mutex);
+ /* Persistent ring buffer needs to flush cache before reboot. */
+ if (start && end) {
+ buffer->flush_nb.notifier_call = rb_flush_buffer_cb;
+ atomic_notifier_chain_register(&panic_notifier_list, &buffer->flush_nb);
+ }
+
return_ptr(buffer);
fail_free_buffers:
@@ -2748,6 +2767,9 @@ ring_buffer_free(struct trace_buffer *buffer)
{
int cpu;
+ if (buffer->range_addr_start && buffer->range_addr_end)
+ atomic_notifier_chain_unregister(&panic_notifier_list, &buffer->flush_nb);
+
cpuhp_state_remove_instance(CPUHP_TRACE_RB_PREPARE, &buffer->node);
irq_work_sync(&buffer->irq_work.work);
^ permalink raw reply related
* [RESEND PATCH v16 2/5] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-04-07 1:12 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177552432201.853249.5125045538812833325.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when validating the persistent ring buffer
instead of discarding the entire ring buffer. Only skipped buffers
are invalidated (cleared).
If the cache data in memory fails to be synchronized during a reboot,
the persistent ring buffer may become partially corrupted, but other
sub-buffers may still contain readable event data. Only discard the
subbuffers that are found to be corrupted.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v15:
- Skip reader_page loop check on persistent ring buffer because
there can be contiguous empty(invalidated) pages.
- Do not show discarded page number information if it is 0.
Changes in v11:
- Fix a typo.
Changes in v9:
- Add meta->subbuf_size check.
- Fix a typo.
- Handle invalid reader_page case.
Changes in v8:
- Add comment in rb_valudate_buffer()
- Clear the RB_MISSED_* flags in rb_valudate_buffer() instead of
skipping subbuf.
- Remove unused subbuf local variable from rb_cpu_meta_valid().
Changes in v7:
- Combined with Handling RB_MISSED_* flags patch, focus on validation at boot.
- Remove checking subbuffer data when validating metadata, because it should be done
later.
- Do not mark the discarded sub buffer page but just reset it.
Changes in v6:
- Show invalid page detection message once per CPU.
Changes in v5:
- Instead of showing errors for each page, just show the number
of discarded pages at last.
Changes in v3:
- Record missed data event on commit.
---
kernel/trace/ring_buffer.c | 109 ++++++++++++++++++++++++++------------------
1 file changed, 65 insertions(+), 44 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 4d5817286791..0c284094f7d0 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -370,6 +370,12 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
return local_read(&bpage->page->commit);
}
+/* Size is determined by what has been committed */
+static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
+{
+ return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
/* Range pages are not to be freed */
@@ -1762,7 +1768,6 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
unsigned long *subbuf_mask)
{
int subbuf_size = PAGE_SIZE;
- struct buffer_data_page *subbuf;
unsigned long buffers_start;
unsigned long buffers_end;
int i;
@@ -1770,6 +1775,11 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
if (!subbuf_mask)
return false;
+ if (meta->subbuf_size != PAGE_SIZE) {
+ pr_info("Ring buffer boot meta [%d] invalid subbuf_size\n", cpu);
+ return false;
+ }
+
buffers_start = meta->first_buffer;
buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
@@ -1786,11 +1796,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- subbuf = rb_subbufs_from_meta(meta);
-
bitmap_clear(subbuf_mask, 0, meta->nr_subbufs);
- /* Is the meta buffers and the subbufs themselves have correct data? */
+ /*
+ * Ensure the meta::buffers array has correct data. The data in each subbufs
+ * are checked later in rb_meta_validate_events().
+ */
for (i = 0; i < meta->nr_subbufs; i++) {
if (meta->buffers[i] < 0 ||
meta->buffers[i] >= meta->nr_subbufs) {
@@ -1798,18 +1809,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
- pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu);
- return false;
- }
-
if (test_bit(meta->buffers[i], subbuf_mask)) {
pr_info("Ring buffer boot meta [%d] array has duplicates\n", cpu);
return false;
}
set_bit(meta->buffers[i], subbuf_mask);
- subbuf = (void *)subbuf + subbuf_size;
}
return true;
@@ -1873,13 +1878,22 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+ struct ring_buffer_cpu_meta *meta)
{
unsigned long long ts;
+ unsigned long tail;
u64 delta;
- int tail;
- tail = local_read(&dpage->commit);
+ /*
+ * When a sub-buffer is recovered from a read, the commit value may
+ * have RB_MISSED_* bits set, as these bits are reset on reuse.
+ * Even after clearing these bits, a commit value greater than the
+ * subbuf_size is considered invalid.
+ */
+ tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ if (tail > meta->subbuf_size)
+ return -1;
return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
}
@@ -1890,6 +1904,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
struct buffer_page *head_page, *orig_head;
unsigned long entry_bytes = 0;
unsigned long entries = 0;
+ int discarded = 0;
int ret;
u64 ts;
int i;
@@ -1900,14 +1915,19 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_head = head_page = cpu_buffer->head_page;
/* Do the reader page first */
- ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer reader page is invalid\n");
- goto invalid;
+ pr_info("Ring buffer meta [%d] invalid reader page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&cpu_buffer->reader_page->entries, 0);
+ local_set(&cpu_buffer->reader_page->page->commit, 0);
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(cpu_buffer->reader_page);
+ local_set(&cpu_buffer->reader_page->entries, ret);
}
- entries += ret;
- entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
- local_set(&cpu_buffer->reader_page->entries, ret);
ts = head_page->page->time_stamp;
@@ -1935,7 +1955,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
break;
/* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0)
break;
@@ -2014,21 +2034,24 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->reader_page)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer meta [%d] invalid buffer page\n",
- cpu_buffer->cpu);
- goto invalid;
- }
-
- /* If the buffer has content, update pages_touched */
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
-
- entries += ret;
- entry_bytes += local_read(&head_page->page->commit);
- local_set(&head_page->entries, ret);
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&head_page->entries, 0);
+ local_set(&head_page->page->commit, 0);
+ } else {
+ /* If the buffer has content, update pages_touched */
+ if (ret)
+ local_inc(&cpu_buffer->pages_touched);
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ local_set(&head_page->entries, ret);
+ }
if (head_page == cpu_buffer->commit_page)
break;
}
@@ -2042,7 +2065,10 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
local_set(&cpu_buffer->entries, entries);
local_set(&cpu_buffer->entries_bytes, entry_bytes);
- pr_info("Ring buffer meta [%d] is from previous boot!\n", cpu_buffer->cpu);
+ pr_info("Ring buffer meta [%d] is from previous boot!", cpu_buffer->cpu);
+ if (discarded)
+ pr_cont(" (%d pages discarded)", discarded);
+ pr_cont("\n");
return;
invalid:
@@ -3329,12 +3355,6 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
return NULL;
}
-/* Size is determined by what has been committed */
-static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
-{
- return rb_page_commit(bpage) & ~RB_MISSED_MASK;
-}
-
static __always_inline unsigned
rb_commit_index(struct ring_buffer_per_cpu *cpu_buffer)
{
@@ -5647,11 +5667,12 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
again:
/*
* This should normally only loop twice. But because the
- * start of the reader inserts an empty page, it causes
- * a case where we will loop three times. There should be no
- * reason to loop four times (that I know of).
+ * start of the reader inserts an empty page, it causes a
+ * case where we will loop three times. There should be no
+ * reason to loop four times unless the ring buffer is a
+ * recovered persistent ring buffer.
*/
- if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
+ if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3 && !cpu_buffer->ring_meta)) {
reader = NULL;
goto out;
}
^ permalink raw reply related
* [RESEND PATCH v16 3/5] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-04-07 1:12 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177552432201.853249.5125045538812833325.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when rewinding the persistent ring buffer
instead of stopping the rewinding the ring buffer. The skipped
buffers are cleared.
To ensure the rewinding stops at the unused page, this also clears
buffer_data_page::time_stamp when tracing resets the buffer. This
allows us to identify unused pages and empty pages.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v12:
- Fix build error.
Changes in v11:
- Reset timestamp when the buffer is invalid.
- When rewinding, skip subbuf page if timestamp is wrong and
check timestamp after validating buffer data page.
Changes in v10:
- Newly added.
---
kernel/trace/ring_buffer.c | 76 +++++++++++++++++++++++++-------------------
1 file changed, 43 insertions(+), 33 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 0c284094f7d0..518a05df6ef7 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -363,6 +363,7 @@ struct buffer_page {
static void rb_init_page(struct buffer_data_page *bpage)
{
local_set(&bpage->commit, 0);
+ bpage->time_stamp = 0;
}
static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
@@ -1878,12 +1879,14 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
struct ring_buffer_cpu_meta *meta)
{
+ struct buffer_data_page *dpage = bpage->page;
unsigned long long ts;
unsigned long tail;
u64 delta;
+ int ret = -1;
/*
* When a sub-buffer is recovered from a read, the commit value may
@@ -1892,9 +1895,17 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
* subbuf_size is considered invalid.
*/
tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
- if (tail > meta->subbuf_size)
- return -1;
- return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+ if (tail <= meta->subbuf_size)
+ ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+
+ if (ret < 0) {
+ local_set(&bpage->entries, 0);
+ local_set(&bpage->page->commit, 0);
+ } else {
+ local_set(&bpage->entries, ret);
+ }
+
+ return ret;
}
/* If the meta data has been validated, now validate the events */
@@ -1915,18 +1926,14 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_head = head_page = cpu_buffer->head_page;
/* Do the reader page first */
- ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
+ ret = rb_validate_buffer(cpu_buffer->reader_page, cpu_buffer->cpu, meta);
if (ret < 0) {
pr_info("Ring buffer meta [%d] invalid reader page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&cpu_buffer->reader_page->entries, 0);
- local_set(&cpu_buffer->reader_page->page->commit, 0);
} else {
entries += ret;
entry_bytes += rb_page_size(cpu_buffer->reader_page);
- local_set(&cpu_buffer->reader_page->entries, ret);
}
ts = head_page->page->time_stamp;
@@ -1945,26 +1952,33 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->tail_page)
break;
- /* Ensure the page has older data than head. */
- if (ts < head_page->page->time_stamp)
- break;
-
- ts = head_page->page->time_stamp;
- /* Ensure the page has correct timestamp and some data. */
- if (!ts || rb_page_commit(head_page) == 0)
- break;
-
- /* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
- if (ret < 0)
+ /* Rewind until unused page (no timestamp, no commit). */
+ if (!head_page->page->time_stamp && rb_page_commit(head_page) == 0)
break;
- /* Recover the number of entries and update stats. */
- local_set(&head_page->entries, ret);
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
- entries += ret;
- entry_bytes += rb_page_commit(head_page);
+ /*
+ * Skip if the page is invalid, or its timestamp is newer than the
+ * previous valid page.
+ */
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta);
+ if (ret >= 0 && ts < head_page->page->time_stamp) {
+ local_set(&head_page->entries, 0);
+ local_set(&head_page->page->commit, 0);
+ head_page->page->time_stamp = ts;
+ ret = -1;
+ }
+ if (ret < 0) {
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ if (ret > 0)
+ local_inc(&cpu_buffer->pages_touched);
+ ts = head_page->page->time_stamp;
+ }
}
if (i)
pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2034,15 +2048,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->reader_page)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta);
if (ret < 0) {
if (!discarded)
pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
} else {
/* If the buffer has content, update pages_touched */
if (ret)
@@ -2050,7 +2061,6 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
entries += ret;
entry_bytes += rb_page_size(head_page);
- local_set(&head_page->entries, ret);
}
if (head_page == cpu_buffer->commit_page)
break;
@@ -2083,7 +2093,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Reset all the subbuffers */
for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
+ rb_init_page(head_page->page);
}
}
^ permalink raw reply related
* [RESEND PATCH v16 4/5] ring-buffer: Add persistent ring buffer invalid-page inject test
From: Masami Hiramatsu (Google) @ 2026-04-07 1:12 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177552432201.853249.5125045538812833325.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a self-corrupting test for the persistent ring buffer.
This will inject an erroneous value to some sub-buffer pages (where
the index is even or multiples of 5) in the persistent ring buffer
when the kernel panics, and checks whether the number of detected
invalid pages and the total entry_bytes are the same as the recorded
values after reboot.
This ensures that the kernel can correctly recover a partially
corrupted persistent ring buffer after a reboot or panic.
The test only runs on the persistent ring buffer whose name is
"ptracingtest". The user has to fill it with events before a
kernel panic.
To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT
and add the following kernel cmdline:
reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
panic=1
Run the following commands after the 1st boot:
cd /sys/kernel/tracing/instances/ptracingtest
echo 1 > tracing_on
echo 1 > events/enable
sleep 3
echo c > /proc/sysrq-trigger
After panic message, the kernel will reboot and run the verification
on the persistent ring buffer, e.g.
Ring buffer meta [2] invalid buffer page detected
Ring buffer meta [2] is from previous boot! (318 pages discarded)
Ring buffer testing [2] invalid pages: PASSED (318/318)
Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v16:
- Update description and comments according to review comments.
Changes in v15:
- Use pr_warn() for test result.
- Inject errors on the page index is multiples of 5 so that
this can reproduce contiguous empty pages.
Changes in v14:
- Rename config to CONFIG_RING_BUFFER_PERSISTENT_INJECT.
- Clear meta->nr_invalid/entry_bytes after testing.
- Add test commands in config comment.
Changes in v10:
- Add entry_bytes test.
- Do not compile test code if CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n.
Changes in v9:
- Test also reader pages.
---
include/linux/ring_buffer.h | 1 +
kernel/trace/Kconfig | 34 ++++++++++++++++++++
kernel/trace/ring_buffer.c | 74 +++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.c | 4 ++
4 files changed, 113 insertions(+)
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 994f52b34344..0670742b2d60 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -238,6 +238,7 @@ int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
+ RB_FL_TESTING = 1 << 1,
};
#ifdef CONFIG_RING_BUFFER
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..084f34dc6c9f 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1202,6 +1202,40 @@ config RING_BUFFER_VALIDATE_TIME_DELTAS
Only say Y if you understand what this does, and you
still want it enabled. Otherwise say N
+config RING_BUFFER_PERSISTENT_INJECT
+ bool "Enable persistent ring buffer error injection test"
+ depends on RING_BUFFER
+ help
+ This option will have the kernel check if the persistent ring
+ buffer is named "ptracingtest". and if so, it will corrupt some
+ of its pages on a kernel panic. This is used to test if the
+ persistent ring buffer can recover from some of its sub-buffers
+ being corrupted.
+ To use this, boot a kernel with a "ptracingtest" persistent
+ ring buffer, e.g.
+
+ reserve_mem=20M:2M:trace trace_instance=ptracingtest@trace panic=1
+
+ And after the 1st boot, run the following commands:
+
+ cd /sys/kernel/tracing/instances/ptracingtest
+ echo 1 > events/enable
+ echo 1 > tracing_on
+ sleep 3
+ echo c > /proc/sysrq-trigger
+
+ After the panic message, the kernel will reboot and will show
+ the test results in the console output.
+
+ Note that events for the test ring buffer needs to be enabled
+ prior to crashing the kernel so that the ring buffer has content
+ that the test will corrupt.
+ As the test will corrupt events in the "ptracingtest" persistent
+ ring buffer, it should not be used for any other purpose other
+ than this test.
+
+ If unsure, say N
+
config MMIOTRACE_TEST
tristate "Test module for mmiotrace"
depends on MMIOTRACE && m
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 518a05df6ef7..e56fe9dcc7d7 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -64,6 +64,10 @@ struct ring_buffer_cpu_meta {
unsigned long commit_buffer;
__u32 subbuf_size;
__u32 nr_subbufs;
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+ __u32 nr_invalid;
+ __u32 entry_bytes;
+#endif
int buffers[];
};
@@ -2079,6 +2083,21 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (discarded)
pr_cont(" (%d pages discarded)", discarded);
pr_cont("\n");
+
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+ if (meta->nr_invalid)
+ pr_warn("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
+ cpu_buffer->cpu,
+ (discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+ discarded, meta->nr_invalid);
+ if (meta->entry_bytes)
+ pr_warn("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
+ cpu_buffer->cpu,
+ (entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+ (long)entry_bytes, (long)meta->entry_bytes);
+ meta->nr_invalid = 0;
+ meta->entry_bytes = 0;
+#endif
return;
invalid:
@@ -2559,12 +2578,67 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
+{
+ struct ring_buffer_per_cpu *cpu_buffer;
+ struct ring_buffer_cpu_meta *meta;
+ struct buffer_data_page *dpage;
+ u32 entry_bytes = 0;
+ unsigned long ptr;
+ int subbuf_size;
+ int invalid = 0;
+ int cpu;
+ int i;
+
+ if (!(buffer->flags & RB_FL_TESTING))
+ return;
+
+ guard(preempt)();
+ cpu = smp_processor_id();
+
+ cpu_buffer = buffer->buffers[cpu];
+ meta = cpu_buffer->ring_meta;
+ ptr = (unsigned long)rb_subbufs_from_meta(meta);
+ subbuf_size = meta->subbuf_size;
+
+ for (i = 0; i < meta->nr_subbufs; i++) {
+ int idx = meta->buffers[i];
+
+ dpage = (void *)(ptr + idx * subbuf_size);
+ /* Skip unused pages */
+ if (!local_read(&dpage->commit))
+ continue;
+
+ /*
+ * Invalidate even pages or multiples of 5. This will cause 3
+ * contiguous invalidated(empty) pages.
+ */
+ if (!(i & 0x1) || !(i % 5)) {
+ local_add(subbuf_size + 1, &dpage->commit);
+ invalid++;
+ } else {
+ /* Count total commit bytes. */
+ entry_bytes += local_read(&dpage->commit);
+ }
+ }
+
+ pr_info("Inject invalidated %d pages on CPU%d, total size: %ld\n",
+ invalid, cpu, (long)entry_bytes);
+ meta->nr_invalid = invalid;
+ meta->entry_bytes = entry_bytes;
+}
+#else /* !CONFIG_RING_BUFFER_PERSISTENT_INJECT */
+#define rb_test_inject_invalid_pages(buffer) do { } while (0)
+#endif
+
/* Stop recording on a persistent buffer and flush cache if needed. */
static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
{
struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
ring_buffer_record_off(buffer);
+ rb_test_inject_invalid_pages(buffer);
arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
return NOTIFY_DONE;
}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e9455d46ec16..96101d276d13 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9436,6 +9436,8 @@ static void setup_trace_scratch(struct trace_array *tr,
memset(tscratch, 0, size);
}
+#define TRACE_TEST_PTRACING_NAME "ptracingtest"
+
static int
allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned long size)
{
@@ -9448,6 +9450,8 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned
buf->tr = tr;
if (tr->range_addr_start && tr->range_addr_size) {
+ if (!strcmp(tr->name, TRACE_TEST_PTRACING_NAME))
+ rb_flags |= RB_FL_TESTING;
/* Add scratch buffer to handle 128 modules */
buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
tr->range_addr_start,
^ permalink raw reply related
* [RESEND PATCH v16 5/5] ring-buffer: Show commit numbers in buffer_meta file
From: Masami Hiramatsu (Google) @ 2026-04-07 1:12 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177552432201.853249.5125045538812833325.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
In addition to the index number, show the commit numbers of
each data page in the per_cpu buffer_meta file.
This is useful for understanding the current status of the
persistent ring buffer. (Note that this file is shown
only for persistent ring buffer and its backup instance)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v16:
- update description.
---
kernel/trace/ring_buffer.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index e56fe9dcc7d7..4bf83b7805da 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2209,6 +2209,7 @@ static int rbm_show(struct seq_file *m, void *v)
struct ring_buffer_per_cpu *cpu_buffer = m->private;
struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
unsigned long val = (unsigned long)v;
+ struct buffer_data_page *dpage;
if (val == 1) {
seq_printf(m, "head_buffer: %d\n",
@@ -2221,7 +2222,9 @@ static int rbm_show(struct seq_file *m, void *v)
}
val -= 2;
- seq_printf(m, "buffer[%ld]: %d\n", val, meta->buffers[val]);
+ dpage = rb_range_buffer(cpu_buffer, val);
+ seq_printf(m, "buffer[%ld]: %d (commit: %ld)\n",
+ val, meta->buffers[val], local_read(&dpage->commit));
return 0;
}
^ permalink raw reply related
* Re: [RFC PATCH 0/4] trace, livepatch: Allow kprobe return overriding for livepatched functions
From: Yafang Shao @ 2026-04-07 2:21 UTC (permalink / raw)
To: Song Liu
Cc: jpoimboe, jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
mathieu.desnoyers, kpsingh, mattbobrowski, jolsa, ast, daniel,
andrii, martin.lau, eddyz87, memxor, yonghong.song, live-patching,
linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <CAPhsuW73qFybHgOnZ=oFC1PvdWkYWDk7gsAoiBXe4xWYagPrmA@mail.gmail.com>
On Tue, Apr 7, 2026 at 2:26 AM Song Liu <song@kernel.org> wrote:
>
> On Mon, Apr 6, 2026 at 3:55 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Sat, Apr 4, 2026 at 12:07 AM Song Liu <song@kernel.org> wrote:
> > >
> > > Hi Yafang,
> > >
> > > On Thu, Apr 2, 2026 at 2:26 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > Livepatching allows for rapid experimentation with new kernel features
> > > > without interrupting production workloads. However, static livepatches lack
> > > > the flexibility required to tune features based on task-specific attributes,
> > > > such as cgroup membership, which is critical in multi-tenant k8s
> > > > environments. Furthermore, hardcoding logic into a livepatch prevents
> > > > dynamic adjustments based on the runtime environment.
> > > >
> > > > To address this, we propose a hybrid approach using BPF. Our production use
> > > > case involves:
> > > >
> > > > 1. Deploying a Livepatch function to serve as a stable BPF hook.
> > > >
> > > > 2. Utilizing bpf_override_return() to dynamically modify the return value
> > > > of that hook based on the current task's context.
> > >
> > > Could you please provide a specific use case that can benefit from this?
> > > AFAICT, livepatch is more flexible but risky (may cause crash); while
> > > BPF is safe, but less flexible. The combination you are proposing seems
> > > to get the worse of the two sides. Maybe it can indeed get the benefit of
> > > both sides in some cases, but I cannot think of such examples.
> > >
> >
> > Here is an example we recently deployed on our production servers:
> >
> > https://lore.kernel.org/bpf/CALOAHbDnNba_w_nWH3-S9GAXw0+VKuLTh1gy5hy9Yqgeo4C0iA@mail.gmail.com/
> >
> > In one of our specific clusters, we needed to send BGP traffic out
> > through specific NICs based on the destination IP. To achieve this
> > without interrupting service, we live-patched
> > bond_xmit_3ad_xor_slave_get(), added a new hook called
> > bond_get_slave_hook(), and then ran a BPF program attached to that
> > hook to select the outgoing NIC from the SKB. This allowed us to
> > rapidly deploy the feature with zero downtime.
>
> I guess the idea here is: keep the risk part simple, and implement
> it in module/livepatch, then use BPF for the flexible and programmable
> part safe.
Right
>
> Can we use struct_ops instead of bpf_override_return for this case?
> This should make the solution more flexible.
Upstreaming struct_ops based BPF hooks is a challenging process, as
seen in these examples:
https://lwn.net/Articles/1054030/
https://lwn.net/Articles/1043548/
Even when successful, upstreaming can take a significant amount of
time—often longer than our production requirements allow. To bridge
this gap, we developed this livepatch+BPF solution. This allows us to
rapidly deploy new features without maintaining custom hooks in our
local kernel. Because these livepatch-based hooks are lightweight,
they minimize maintenance overhead and simplify kernel upgrades (e.g.,
from 6.1 to 6.18).
That said, we would still prefer to have our hooks accepted upstream
to eliminate the need for self-maintenance entirely.
--
Regards
Yafang
^ permalink raw reply
* Re: [RFC PATCH 0/4] trace, livepatch: Allow kprobe return overriding for livepatched functions
From: Song Liu @ 2026-04-07 2:46 UTC (permalink / raw)
To: Yafang Shao
Cc: jpoimboe, jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
mathieu.desnoyers, kpsingh, mattbobrowski, jolsa, ast, daniel,
andrii, martin.lau, eddyz87, memxor, yonghong.song, live-patching,
linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <CALOAHbC0hqk+yrUZay01EBRNOHgyj1MAavzNK-06XJKK9ARMqQ@mail.gmail.com>
On Mon, Apr 6, 2026 at 7:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Apr 7, 2026 at 2:26 AM Song Liu <song@kernel.org> wrote:
> >
> > On Mon, Apr 6, 2026 at 3:55 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Sat, Apr 4, 2026 at 12:07 AM Song Liu <song@kernel.org> wrote:
> > > >
> > > > Hi Yafang,
> > > >
> > > > On Thu, Apr 2, 2026 at 2:26 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > Livepatching allows for rapid experimentation with new kernel features
> > > > > without interrupting production workloads. However, static livepatches lack
> > > > > the flexibility required to tune features based on task-specific attributes,
> > > > > such as cgroup membership, which is critical in multi-tenant k8s
> > > > > environments. Furthermore, hardcoding logic into a livepatch prevents
> > > > > dynamic adjustments based on the runtime environment.
> > > > >
> > > > > To address this, we propose a hybrid approach using BPF. Our production use
> > > > > case involves:
> > > > >
> > > > > 1. Deploying a Livepatch function to serve as a stable BPF hook.
> > > > >
> > > > > 2. Utilizing bpf_override_return() to dynamically modify the return value
> > > > > of that hook based on the current task's context.
> > > >
> > > > Could you please provide a specific use case that can benefit from this?
> > > > AFAICT, livepatch is more flexible but risky (may cause crash); while
> > > > BPF is safe, but less flexible. The combination you are proposing seems
> > > > to get the worse of the two sides. Maybe it can indeed get the benefit of
> > > > both sides in some cases, but I cannot think of such examples.
> > > >
> > >
> > > Here is an example we recently deployed on our production servers:
> > >
> > > https://lore.kernel.org/bpf/CALOAHbDnNba_w_nWH3-S9GAXw0+VKuLTh1gy5hy9Yqgeo4C0iA@mail.gmail.com/
> > >
> > > In one of our specific clusters, we needed to send BGP traffic out
> > > through specific NICs based on the destination IP. To achieve this
> > > without interrupting service, we live-patched
> > > bond_xmit_3ad_xor_slave_get(), added a new hook called
> > > bond_get_slave_hook(), and then ran a BPF program attached to that
> > > hook to select the outgoing NIC from the SKB. This allowed us to
> > > rapidly deploy the feature with zero downtime.
> >
> > I guess the idea here is: keep the risk part simple, and implement
> > it in module/livepatch, then use BPF for the flexible and programmable
> > part safe.
>
> Right
>
> >
> > Can we use struct_ops instead of bpf_override_return for this case?
> > This should make the solution more flexible.
>
> Upstreaming struct_ops based BPF hooks is a challenging process, as
> seen in these examples:
>
> https://lwn.net/Articles/1054030/
> https://lwn.net/Articles/1043548/
>
> Even when successful, upstreaming can take a significant amount of
> time—often longer than our production requirements allow. To bridge
> this gap, we developed this livepatch+BPF solution. This allows us to
> rapidly deploy new features without maintaining custom hooks in our
> local kernel. Because these livepatch-based hooks are lightweight,
> they minimize maintenance overhead and simplify kernel upgrades (e.g.,
> from 6.1 to 6.18).
I didn't mean upstream struct_ops.
We can define the struct_ops in an OOT kernel module. Then we
can attach BPF programs to the struct_ops. We may need
livepatch to connect the new struct_ops to original kernel logic.
I think kernel side of this solution is mostly available, but we may
need some work on the toolchain side.
Does this make sense?
Thanks,
Song
> That said, we would still prefer to have our hooks accepted upstream
> to eliminate the need for self-maintenance entirely.
^ permalink raw reply
* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Song Liu @ 2026-04-07 2:54 UTC (permalink / raw)
To: Joe Lawrence
Cc: Yafang Shao, Dylan Hatch, jpoimboe, jikos, mbenes, pmladek,
rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
bpf
In-Reply-To: <adQhpBC2W9I6QW-g@redhat.com>
On Mon, Apr 6, 2026 at 2:12 PM Joe Lawrence <joe.lawrence@redhat.com> wrote:
[...]
> > > > - The regular livepatches are cumulative, have the replace flag; and
> > > > are replaceable.
> > > > - The occasional "off-band" livepatches do not have the replace flag,
> > > > and are not replaceable.
> > > >
> > > > With this setup, for systems with off-band livepatches loaded, we can
> > > > still release a cumulative livepatch to replace the previous cumulative
> > > > livepatch. Is this the expected use case?
> > >
> > > That matches our expected use case.
> >
> > If we really want to serve use cases like this, I think we can introduce
> > some replace tag concept: Each livepatch will have a tag, u32 number.
> > Newly loaded livepatch will only replace existing livepatch with the
> > same tag. We can even reuse the existing "bool replace" in klp_patch,
> > and make it u32: replace=0 means no replace; replace > 0 are the
> > replace tag.
> >
> > For current users of cumulative patches, all the livepatch will have the
> > same tag, say 1. For your use case, you can assign each user a
> > unique tag. Then all these users can do atomic upgrades of their
> > own livepatches.
> >
> > We may also need to check whether two livepatches of different tags
> > touch the same kernel function. When that happens, the later
> > livepatch should fail to load.
> >
> > Does this make sense?
> >
>
> I haven't been following the thread carefully, but could the Livepatch
> system state API (see Documentation/livepatch/system-state.rst) be
> leveraged somehow instead of adding further replace semantics?
AFAICT, system state will not help Yafang's use case.
Thanks,
Song
^ permalink raw reply
* Re: [RFC PATCH 0/4] trace, livepatch: Allow kprobe return overriding for livepatched functions
From: Yafang Shao @ 2026-04-07 3:13 UTC (permalink / raw)
To: Song Liu
Cc: jpoimboe, jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
mathieu.desnoyers, kpsingh, mattbobrowski, jolsa, ast, daniel,
andrii, martin.lau, eddyz87, memxor, yonghong.song, live-patching,
linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <CAPhsuW5MN6ikKmxgqby5RJ3_gvjJ4B77X74OvfbTQoFO8iUgzA@mail.gmail.com>
On Tue, Apr 7, 2026 at 10:47 AM Song Liu <song@kernel.org> wrote:
>
> On Mon, Apr 6, 2026 at 7:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, Apr 7, 2026 at 2:26 AM Song Liu <song@kernel.org> wrote:
> > >
> > > On Mon, Apr 6, 2026 at 3:55 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Sat, Apr 4, 2026 at 12:07 AM Song Liu <song@kernel.org> wrote:
> > > > >
> > > > > Hi Yafang,
> > > > >
> > > > > On Thu, Apr 2, 2026 at 2:26 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > Livepatching allows for rapid experimentation with new kernel features
> > > > > > without interrupting production workloads. However, static livepatches lack
> > > > > > the flexibility required to tune features based on task-specific attributes,
> > > > > > such as cgroup membership, which is critical in multi-tenant k8s
> > > > > > environments. Furthermore, hardcoding logic into a livepatch prevents
> > > > > > dynamic adjustments based on the runtime environment.
> > > > > >
> > > > > > To address this, we propose a hybrid approach using BPF. Our production use
> > > > > > case involves:
> > > > > >
> > > > > > 1. Deploying a Livepatch function to serve as a stable BPF hook.
> > > > > >
> > > > > > 2. Utilizing bpf_override_return() to dynamically modify the return value
> > > > > > of that hook based on the current task's context.
> > > > >
> > > > > Could you please provide a specific use case that can benefit from this?
> > > > > AFAICT, livepatch is more flexible but risky (may cause crash); while
> > > > > BPF is safe, but less flexible. The combination you are proposing seems
> > > > > to get the worse of the two sides. Maybe it can indeed get the benefit of
> > > > > both sides in some cases, but I cannot think of such examples.
> > > > >
> > > >
> > > > Here is an example we recently deployed on our production servers:
> > > >
> > > > https://lore.kernel.org/bpf/CALOAHbDnNba_w_nWH3-S9GAXw0+VKuLTh1gy5hy9Yqgeo4C0iA@mail.gmail.com/
> > > >
> > > > In one of our specific clusters, we needed to send BGP traffic out
> > > > through specific NICs based on the destination IP. To achieve this
> > > > without interrupting service, we live-patched
> > > > bond_xmit_3ad_xor_slave_get(), added a new hook called
> > > > bond_get_slave_hook(), and then ran a BPF program attached to that
> > > > hook to select the outgoing NIC from the SKB. This allowed us to
> > > > rapidly deploy the feature with zero downtime.
> > >
> > > I guess the idea here is: keep the risk part simple, and implement
> > > it in module/livepatch, then use BPF for the flexible and programmable
> > > part safe.
> >
> > Right
> >
> > >
> > > Can we use struct_ops instead of bpf_override_return for this case?
> > > This should make the solution more flexible.
> >
> > Upstreaming struct_ops based BPF hooks is a challenging process, as
> > seen in these examples:
> >
> > https://lwn.net/Articles/1054030/
> > https://lwn.net/Articles/1043548/
> >
> > Even when successful, upstreaming can take a significant amount of
> > time—often longer than our production requirements allow. To bridge
> > this gap, we developed this livepatch+BPF solution. This allows us to
> > rapidly deploy new features without maintaining custom hooks in our
> > local kernel. Because these livepatch-based hooks are lightweight,
> > they minimize maintenance overhead and simplify kernel upgrades (e.g.,
> > from 6.1 to 6.18).
>
> I didn't mean upstream struct_ops.
>
> We can define the struct_ops in an OOT kernel module. Then we
> can attach BPF programs to the struct_ops. We may need
> livepatch to connect the new struct_ops to original kernel logic.
>
> I think kernel side of this solution is mostly available, but we may
> need some work on the toolchain side.
>
> Does this make sense?
Are there actual benefits to using struct_ops instead of
bpf_override_return? So far, I’ve only found it adds complexity
without much gain.
Can we add something like ALLOW_LIVEPATCH_ERROR_INJECTION() to allow
error injection on functions defined inside a livepatch?
--
Regards
Yafang
^ permalink raw reply
* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Yafang Shao @ 2026-04-07 3:16 UTC (permalink / raw)
To: Song Liu
Cc: Joe Lawrence, Dylan Hatch, jpoimboe, jikos, mbenes, pmladek,
rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
bpf
In-Reply-To: <CAPhsuW66tuF+QZ0pVheWb5sC4NQ-9CXikq=zMrPBXTHcsVPjdg@mail.gmail.com>
On Tue, Apr 7, 2026 at 10:54 AM Song Liu <song@kernel.org> wrote:
>
> On Mon, Apr 6, 2026 at 2:12 PM Joe Lawrence <joe.lawrence@redhat.com> wrote:
> [...]
> > > > > - The regular livepatches are cumulative, have the replace flag; and
> > > > > are replaceable.
> > > > > - The occasional "off-band" livepatches do not have the replace flag,
> > > > > and are not replaceable.
> > > > >
> > > > > With this setup, for systems with off-band livepatches loaded, we can
> > > > > still release a cumulative livepatch to replace the previous cumulative
> > > > > livepatch. Is this the expected use case?
> > > >
> > > > That matches our expected use case.
> > >
> > > If we really want to serve use cases like this, I think we can introduce
> > > some replace tag concept: Each livepatch will have a tag, u32 number.
> > > Newly loaded livepatch will only replace existing livepatch with the
> > > same tag. We can even reuse the existing "bool replace" in klp_patch,
> > > and make it u32: replace=0 means no replace; replace > 0 are the
> > > replace tag.
> > >
> > > For current users of cumulative patches, all the livepatch will have the
> > > same tag, say 1. For your use case, you can assign each user a
> > > unique tag. Then all these users can do atomic upgrades of their
> > > own livepatches.
> > >
> > > We may also need to check whether two livepatches of different tags
> > > touch the same kernel function. When that happens, the later
> > > livepatch should fail to load.
That sounds like a viable solution. I'll look into it and see how we
can implement it.
> > >
> > > Does this make sense?
> > >
> >
> > I haven't been following the thread carefully, but could the Livepatch
> > system state API (see Documentation/livepatch/system-state.rst) be
> > leveraged somehow instead of adding further replace semantics?
>
> AFAICT, system state will not help Yafang's use case.
Right.
--
Regards
Yafang
^ permalink raw reply
* [PATCH] tracing/hist: bound expression string construction
From: Pengpeng Hou @ 2026-04-07 6:09 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, linux-trace-kernel, linux-kernel, pengpeng
expr_str() allocates a fixed MAX_FILTER_STR_VAL buffer and then builds
expression names with a series of raw strcat() appends. Nested operands,
constants and field flags can push the rendered string past that fixed
limit before the name is attached to the hist field.
Convert the construction helpers to explicit bounded appends and
propagate failures back to the expression parser when the rendered name
would exceed MAX_FILTER_STR_VAL.
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
kernel/trace/trace_events_hist.c | 101 +++++++++++++++++++++++--------
1 file changed, 76 insertions(+), 25 deletions(-)
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 73ea180cad55..caaa262360d2 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1738,85 +1738,121 @@ static const char *get_hist_field_flags(struct hist_field *hist_field)
return flags_str;
}
-static void expr_field_str(struct hist_field *field, char *expr)
+static bool expr_append(char *expr, size_t *len, const char *str)
{
- if (field->flags & HIST_FIELD_FL_VAR_REF)
- strcat(expr, "$");
- else if (field->flags & HIST_FIELD_FL_CONST) {
+ size_t str_len = strlen(str);
+
+ if (*len + str_len >= MAX_FILTER_STR_VAL)
+ return false;
+
+ memcpy(expr + *len, str, str_len + 1);
+ *len += str_len;
+ return true;
+}
+
+static bool expr_field_str(struct hist_field *field, char *expr, size_t *len)
+{
+ if (field->flags & HIST_FIELD_FL_VAR_REF) {
+ if (!expr_append(expr, len, "$"))
+ return false;
+ } else if (field->flags & HIST_FIELD_FL_CONST) {
char str[HIST_CONST_DIGITS_MAX];
+ int ret;
+
+ ret = snprintf(str, sizeof(str), "%llu", field->constant);
+ if (ret >= sizeof(str))
+ return false;
- snprintf(str, HIST_CONST_DIGITS_MAX, "%llu", field->constant);
- strcat(expr, str);
+ if (!expr_append(expr, len, str))
+ return false;
}
- strcat(expr, hist_field_name(field, 0));
+ if (!expr_append(expr, len, hist_field_name(field, 0)))
+ return false;
if (field->flags && !(field->flags & HIST_FIELD_FL_VAR_REF)) {
const char *flags_str = get_hist_field_flags(field);
if (flags_str) {
- strcat(expr, ".");
- strcat(expr, flags_str);
+ if (!expr_append(expr, len, ".") ||
+ !expr_append(expr, len, flags_str))
+ return false;
}
}
+
+ return true;
}
static char *expr_str(struct hist_field *field, unsigned int level)
{
char *expr;
+ size_t len = 0;
if (level > 1)
- return NULL;
+ return ERR_PTR(-EINVAL);
expr = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
if (!expr)
- return NULL;
+ return ERR_PTR(-ENOMEM);
if (!field->operands[0]) {
- expr_field_str(field, expr);
+ if (!expr_field_str(field, expr, &len))
+ goto free;
return expr;
}
if (field->operator == FIELD_OP_UNARY_MINUS) {
char *subexpr;
- strcat(expr, "-(");
+ if (!expr_append(expr, &len, "-("))
+ goto free;
subexpr = expr_str(field->operands[0], ++level);
if (!subexpr) {
- kfree(expr);
- return NULL;
+ goto free;
+ }
+ if (!expr_append(expr, &len, subexpr) ||
+ !expr_append(expr, &len, ")")) {
+ kfree(subexpr);
+ goto free;
}
- strcat(expr, subexpr);
- strcat(expr, ")");
kfree(subexpr);
return expr;
}
- expr_field_str(field->operands[0], expr);
+ if (!expr_field_str(field->operands[0], expr, &len))
+ goto free;
switch (field->operator) {
case FIELD_OP_MINUS:
- strcat(expr, "-");
+ if (!expr_append(expr, &len, "-"))
+ goto free;
break;
case FIELD_OP_PLUS:
- strcat(expr, "+");
+ if (!expr_append(expr, &len, "+"))
+ goto free;
break;
case FIELD_OP_DIV:
- strcat(expr, "/");
+ if (!expr_append(expr, &len, "/"))
+ goto free;
break;
case FIELD_OP_MULT:
- strcat(expr, "*");
+ if (!expr_append(expr, &len, "*"))
+ goto free;
break;
default:
- kfree(expr);
- return NULL;
+ goto free;
}
- expr_field_str(field->operands[1], expr);
+ if (!expr_field_str(field->operands[1], expr, &len))
+ goto free;
return expr;
+
+free:
+ kfree(expr);
+ return ERR_PTR(-E2BIG);
}
/*
@@ -2630,6 +2666,11 @@ static struct hist_field *parse_unary(struct hist_trigger_data *hist_data,
expr->is_signed = operand1->is_signed;
expr->operator = FIELD_OP_UNARY_MINUS;
expr->name = expr_str(expr, 0);
+ if (IS_ERR(expr->name)) {
+ ret = PTR_ERR(expr->name);
+ expr->name = NULL;
+ goto free;
+ }
expr->type = kstrdup_const(operand1->type, GFP_KERNEL);
if (!expr->type) {
ret = -ENOMEM;
@@ -2842,6 +2883,11 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
destroy_hist_field(operand1, 0);
expr->name = expr_str(expr, 0);
+ if (IS_ERR(expr->name)) {
+ ret = PTR_ERR(expr->name);
+ expr->name = NULL;
+ goto free_expr;
+ }
} else {
/* The operand sizes should be the same, so just pick one */
expr->size = operand1->size;
@@ -2855,6 +2901,11 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
}
expr->name = expr_str(expr, 0);
+ if (IS_ERR(expr->name)) {
+ ret = PTR_ERR(expr->name);
+ expr->name = NULL;
+ goto free_expr;
+ }
}
return expr;
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* [PATCH net-next v2 0/2] mptcp: autotune related improvement
From: Matthieu Baerts (NGI0) @ 2026-04-07 8:45 UTC (permalink / raw)
To: Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Florian Westphal
Cc: netdev, mptcp, linux-kernel, Matthieu Baerts (NGI0),
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-trace-kernel
Here are two patches from Paolo that have been crafted a couple of
months ago, but needed more validation because they were indirectly
causing instabilities in the sefltests. The root cause has been fixed in
'net' recently in commit 8c09412e584d ("selftests: mptcp: more stable
simult_flows tests").
These patches refactor the receive space and RTT estimator, overall
making DRS more correct while avoiding receive buffer drifting to
tcp_rmem[2], which in turn makes the throughput more stable and less
bursty, especially with high bandwidth and low delay environments.
Note that the first patch addresses a very old issue. 'net-next' is
targeted because the change is quite invasive and based on a recent
backlog refactor. The 'Fixes' tag is then there more as a FYI, because
backporting this patch will quickly be blocked due to large conflicts.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
Changes in v2:
- Patch 1: add missing READ_ONCE() and remove unused entry. (AI)
- Link to v1: https://patch.msgid.link/20260309-net-next-mptcp-reduce-rbuf-v1-0-8f471206f9c5@kernel.org
---
Paolo Abeni (2):
mptcp: better mptcp-level RTT estimator
mptcp: add receive queue awareness in tcp_rcv_space_adjust()
include/trace/events/mptcp.h | 2 +-
net/mptcp/protocol.c | 71 +++++++++++++++++++++++++-------------------
net/mptcp/protocol.h | 37 ++++++++++++++++++++++-
3 files changed, 77 insertions(+), 33 deletions(-)
---
base-commit: c149d90e260ca1b6b9175468955a15c4d95a9f3b
change-id: 20260306-net-next-mptcp-reduce-rbuf-4166ba6fb763
Best regards,
--
Matthieu Baerts (NGI0) <matttbe@kernel.org>
^ permalink raw reply
* [PATCH net-next v2 1/2] mptcp: better mptcp-level RTT estimator
From: Matthieu Baerts (NGI0) @ 2026-04-07 8:45 UTC (permalink / raw)
To: Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Florian Westphal
Cc: netdev, mptcp, linux-kernel, Matthieu Baerts (NGI0),
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-trace-kernel
In-Reply-To: <20260407-net-next-mptcp-reduce-rbuf-v2-0-0d1d135bf6f6@kernel.org>
From: Paolo Abeni <pabeni@redhat.com>
The current MPTCP-level RTT estimator has several issues. On high speed
links, the MPTCP-level receive buffer auto-tuning happens with a
frequency well above the TCP-level's one. That in turn can cause
excessive/unneeded receive buffer increase.
On such links, the initial rtt_us value is considerably higher than the
actual delay, and the current mptcp_rcv_space_adjust() updates
msk->rcvq_space.rtt_us with a period equal to the such field previous
value. If the initial rtt_us is 40ms, its first update will happen after
40ms, even if the subflows see actual RTT orders of magnitude lower.
Additionally:
- setting the msk RTT to the maximum among all the subflows RTTs makes
DRS constantly overshooting the rcvbuf size when a subflow has
considerable higher latency than the other(s).
- during unidirectional bulk transfers with multiple active subflows,
the TCP-level RTT estimator occasionally sees considerably higher
value than the real link delay, i.e. when the packet scheduler reacts
to an incoming ACK on given subflow pushing data on a different
subflow.
- currently inactive but still open subflows (i.e. switched to backup
mode) are always considered when computing the msk-level RTT.
Address the all the issues above with a more accurate RTT estimation
strategy: the MPTCP-level RTT is set to the minimum of all the subflows
actually feeding data into the MPTCP receive buffer, using a small
sliding window.
While at it, also use EWMA to compute the msk-level scaling_ratio, to
that MPTCP can avoid traversing the subflow list is
mptcp_rcv_space_adjust().
Use some care to avoid updating msk and ssk level fields too often.
Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: linux-trace-kernel@vger.kernel.org
---
v2:
- samples[0] was read without READ_ONCE, prev_rtt_us was not used (AI).
---
include/trace/events/mptcp.h | 2 +-
net/mptcp/protocol.c | 63 ++++++++++++++++++++++++--------------------
net/mptcp/protocol.h | 37 +++++++++++++++++++++++++-
3 files changed, 72 insertions(+), 30 deletions(-)
diff --git a/include/trace/events/mptcp.h b/include/trace/events/mptcp.h
index 269d949b2025..04521acba483 100644
--- a/include/trace/events/mptcp.h
+++ b/include/trace/events/mptcp.h
@@ -219,7 +219,7 @@ TRACE_EVENT(mptcp_rcvbuf_grow,
__be32 *p32;
__entry->time = time;
- __entry->rtt_us = msk->rcvq_space.rtt_us >> 3;
+ __entry->rtt_us = mptcp_rtt_us_est(msk) >> 3;
__entry->copied = msk->rcvq_space.copied;
__entry->inq = mptcp_inq_hint(sk);
__entry->space = msk->rcvq_space.space;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 2f4776a4f06a..70a090a95299 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -879,6 +879,32 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
return moved;
}
+static void mptcp_rcv_rtt_update(struct mptcp_sock *msk,
+ struct mptcp_subflow_context *subflow)
+{
+ const struct tcp_sock *tp = tcp_sk(subflow->tcp_sock);
+ u32 rtt_us = tp->rcv_rtt_est.rtt_us;
+ int id;
+
+ /* Update once per subflow per rcvwnd to avoid touching the msk
+ * too often.
+ */
+ if (!rtt_us || tp->rcv_rtt_est.seq == subflow->prev_rtt_seq)
+ return;
+
+ subflow->prev_rtt_seq = tp->rcv_rtt_est.seq;
+
+ /* Pairs with READ_ONCE() in mptcp_rtt_us_est(). */
+ id = msk->rcv_rtt_est.next_sample;
+ WRITE_ONCE(msk->rcv_rtt_est.samples[id], rtt_us);
+ if (++msk->rcv_rtt_est.next_sample == MPTCP_RTT_SAMPLES)
+ msk->rcv_rtt_est.next_sample = 0;
+
+ /* EWMA among the incoming subflows */
+ msk->scaling_ratio = ((msk->scaling_ratio << 3) - msk->scaling_ratio +
+ tp->scaling_ratio) >> 3;
+}
+
void mptcp_data_ready(struct sock *sk, struct sock *ssk)
{
struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
@@ -892,6 +918,7 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
return;
mptcp_data_lock(sk);
+ mptcp_rcv_rtt_update(msk, subflow);
if (!sock_owned_by_user(sk)) {
/* Wake-up the reader only for in-sequence data */
if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
@@ -2095,7 +2122,6 @@ static void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
msk->rcvspace_init = 1;
msk->rcvq_space.copied = 0;
- msk->rcvq_space.rtt_us = 0;
/* initial rcv_space offering made to peer */
msk->rcvq_space.space = min_t(u32, tp->rcv_wnd,
@@ -2106,15 +2132,15 @@ static void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
/* receive buffer autotuning. See tcp_rcv_space_adjust for more information.
*
- * Only difference: Use highest rtt estimate of the subflows in use.
+ * Only difference: Use lowest rtt estimate of the subflows in use, see
+ * mptcp_rcv_rtt_update() and mptcp_rtt_us_est().
*/
static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
{
struct mptcp_subflow_context *subflow;
struct sock *sk = (struct sock *)msk;
- u8 scaling_ratio = U8_MAX;
- u32 time, advmss = 1;
- u64 rtt_us, mstamp;
+ u32 time, rtt_us;
+ u64 mstamp;
msk_owned_by_me(msk);
@@ -2129,29 +2155,8 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
mstamp = mptcp_stamp();
time = tcp_stamp_us_delta(mstamp, READ_ONCE(msk->rcvq_space.time));
- rtt_us = msk->rcvq_space.rtt_us;
- if (rtt_us && time < (rtt_us >> 3))
- return;
-
- rtt_us = 0;
- mptcp_for_each_subflow(msk, subflow) {
- const struct tcp_sock *tp;
- u64 sf_rtt_us;
- u32 sf_advmss;
-
- tp = tcp_sk(mptcp_subflow_tcp_sock(subflow));
-
- sf_rtt_us = READ_ONCE(tp->rcv_rtt_est.rtt_us);
- sf_advmss = READ_ONCE(tp->advmss);
-
- rtt_us = max(sf_rtt_us, rtt_us);
- advmss = max(sf_advmss, advmss);
- scaling_ratio = min(tp->scaling_ratio, scaling_ratio);
- }
-
- msk->rcvq_space.rtt_us = rtt_us;
- msk->scaling_ratio = scaling_ratio;
- if (time < (rtt_us >> 3) || rtt_us == 0)
+ rtt_us = mptcp_rtt_us_est(msk);
+ if (rtt_us == U32_MAX || time < (rtt_us >> 3))
return;
if (msk->rcvq_space.copied <= msk->rcvq_space.space)
@@ -3015,6 +3020,7 @@ static void __mptcp_init_sock(struct sock *sk)
msk->timer_ival = TCP_RTO_MIN;
msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
msk->backlog_len = 0;
+ mptcp_init_rtt_est(msk);
WRITE_ONCE(msk->first, NULL);
inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
@@ -3460,6 +3466,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
msk->bytes_retrans = 0;
msk->rcvspace_init = 0;
msk->fastclosing = 0;
+ mptcp_init_rtt_est(msk);
/* for fallback's sake */
WRITE_ONCE(msk->ack_seq, 0);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 1208f317ac33..e1d4783db02f 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -269,6 +269,13 @@ struct mptcp_data_frag {
struct page *page;
};
+/* Arbitrary compromise between as low as possible to react timely to subflow
+ * close event and as big as possible to avoid being fouled by biased large
+ * samples due to peer sending data on a different subflow WRT to the incoming
+ * ack.
+ */
+#define MPTCP_RTT_SAMPLES 5
+
/* MPTCP connection sock */
struct mptcp_sock {
/* inet_connection_sock must be the first member */
@@ -341,11 +348,17 @@ struct mptcp_sock {
*/
struct mptcp_pm_data pm;
struct mptcp_sched_ops *sched;
+
+ /* Most recent rtt_us observed by in use incoming subflows. */
+ struct {
+ u32 samples[MPTCP_RTT_SAMPLES];
+ u32 next_sample;
+ } rcv_rtt_est;
+
struct {
int space; /* bytes copied in last measurement window */
int copied; /* bytes copied in this measurement window */
u64 time; /* start time of measurement window */
- u64 rtt_us; /* last maximum rtt of subflows */
} rcvq_space;
u8 scaling_ratio;
bool allow_subflows;
@@ -423,6 +436,27 @@ static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
return msk->first_pending;
}
+static inline void mptcp_init_rtt_est(struct mptcp_sock *msk)
+{
+ int i;
+
+ for (i = 0; i < MPTCP_RTT_SAMPLES; ++i)
+ msk->rcv_rtt_est.samples[i] = U32_MAX;
+ msk->rcv_rtt_est.next_sample = 0;
+ msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+}
+
+static inline u32 mptcp_rtt_us_est(const struct mptcp_sock *msk)
+{
+ u32 rtt_us = READ_ONCE(msk->rcv_rtt_est.samples[0]);
+ int i;
+
+ /* Lockless access of collected samples. */
+ for (i = 1; i < MPTCP_RTT_SAMPLES; ++i)
+ rtt_us = min(rtt_us, READ_ONCE(msk->rcv_rtt_est.samples[i]));
+ return rtt_us;
+}
+
static inline struct mptcp_data_frag *mptcp_send_next(struct sock *sk)
{
struct mptcp_sock *msk = mptcp_sk(sk);
@@ -524,6 +558,7 @@ struct mptcp_subflow_context {
u32 map_data_len;
__wsum map_data_csum;
u32 map_csum_len;
+ u32 prev_rtt_seq;
u32 request_mptcp : 1, /* send MP_CAPABLE */
request_join : 1, /* send MP_JOIN */
request_bkup : 1,
--
2.53.0
^ permalink raw reply related
* [PATCH] ring-buffer: Preserve true payload lengths in long data events
From: Cao Ruichuang @ 2026-04-07 9:15 UTC (permalink / raw)
To: rostedt, mhiramat; +Cc: mathieu.desnoyers, linux-kernel, linux-trace-kernel
Long ring buffer data records currently store the aligned in-buffer size in
their length field. That makes ring_buffer_event_length() report padded
sizes, and small TRACE_PRINT / TRACE_RAW_DATA records lose their true
payload length entirely when they use the short type_len encoding.
Teach long data events to keep the true payload size in array[0], and let
the ring buffer derive the aligned in-buffer size separately when it needs
to walk or discard records. Then add a long-reserve helper and use it for
TRACE_PRINT and TRACE_RAW_DATA so their zero-length-array tails always
preserve the real payload size.
The temporary filtered-event buffer keeps the same long-record payload
length semantics, and a QEMU runtime reproducer for trace_marker_raw now
reports the expected byte counts again.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=210173
Signed-off-by: Cao Ruichuang <create0818@163.com>
---
include/linux/ring_buffer.h | 2 ++
kernel/trace/ring_buffer.c | 56 ++++++++++++++++++++++++++-----------
kernel/trace/trace.c | 8 +++---
kernel/trace/trace.h | 15 ++++++++++
kernel/trace/trace_printk.c | 8 +++---
5 files changed, 65 insertions(+), 24 deletions(-)
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index d862fa610..a4e46cb53 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -137,6 +137,8 @@ void ring_buffer_change_overwrite(struct trace_buffer *buffer, int val);
struct ring_buffer_event *ring_buffer_lock_reserve(struct trace_buffer *buffer,
unsigned long length);
+struct ring_buffer_event *ring_buffer_lock_reserve_long(struct trace_buffer *buffer,
+ unsigned long length);
int ring_buffer_unlock_commit(struct trace_buffer *buffer);
int ring_buffer_write(struct trace_buffer *buffer,
unsigned long length, void *data);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 170170bd8..c9ade62df 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -206,10 +206,14 @@ rb_event_data_length(struct ring_buffer_event *event)
unsigned length;
if (event->type_len)
- length = event->type_len * RB_ALIGNMENT;
- else
- length = event->array[0];
- return length + RB_EVNT_HDR_SIZE;
+ return event->type_len * RB_ALIGNMENT + RB_EVNT_HDR_SIZE;
+
+ /*
+ * Long records store the true payload size in array[0], but still
+ * consume an aligned amount of space in the buffer.
+ */
+ length = event->array[0] + RB_EVNT_HDR_SIZE + sizeof(event->array[0]);
+ return ALIGN(length, RB_ARCH_ALIGNMENT);
}
/*
@@ -276,12 +280,13 @@ unsigned ring_buffer_event_length(struct ring_buffer_event *event)
if (extended_time(event))
event = skip_time_extend(event);
+ if (!event->type_len)
+ return event->array[0];
+
length = rb_event_length(event);
if (event->type_len > RINGBUF_TYPE_DATA_TYPE_LEN_MAX)
return length;
length -= RB_EVNT_HDR_SIZE;
- if (length > RB_MAX_SMALL_DATA + sizeof(event->array[0]))
- length -= sizeof(event->array[0]);
return length;
}
EXPORT_SYMBOL_GPL(ring_buffer_event_length);
@@ -463,9 +468,11 @@ struct rb_event_info {
u64 delta;
u64 before;
u64 after;
+ unsigned long data_length;
unsigned long length;
struct buffer_page *tail_page;
int add_timestamp;
+ bool force_long;
};
/*
@@ -3796,14 +3803,15 @@ rb_update_event(struct ring_buffer_per_cpu *cpu_buffer,
event->time_delta = delta;
length -= RB_EVNT_HDR_SIZE;
- if (length > RB_MAX_SMALL_DATA || RB_FORCE_8BYTE_ALIGNMENT) {
+ if (length > RB_MAX_SMALL_DATA || RB_FORCE_8BYTE_ALIGNMENT ||
+ info->force_long) {
event->type_len = 0;
- event->array[0] = length;
+ event->array[0] = info->data_length;
} else
event->type_len = DIV_ROUND_UP(length, RB_ALIGNMENT);
}
-static unsigned rb_calculate_event_length(unsigned length)
+static unsigned int rb_calculate_event_length(unsigned int length, bool force_long)
{
struct ring_buffer_event event; /* Used only for sizeof array */
@@ -3811,7 +3819,7 @@ static unsigned rb_calculate_event_length(unsigned length)
if (!length)
length++;
- if (length > RB_MAX_SMALL_DATA || RB_FORCE_8BYTE_ALIGNMENT)
+ if (length > RB_MAX_SMALL_DATA || RB_FORCE_8BYTE_ALIGNMENT || force_long)
length += sizeof(event.array[0]);
length += RB_EVNT_HDR_SIZE;
@@ -4605,7 +4613,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
static __always_inline struct ring_buffer_event *
rb_reserve_next_event(struct trace_buffer *buffer,
struct ring_buffer_per_cpu *cpu_buffer,
- unsigned long length)
+ unsigned long length, bool force_long)
{
struct ring_buffer_event *event;
struct rb_event_info info;
@@ -4641,7 +4649,9 @@ rb_reserve_next_event(struct trace_buffer *buffer,
}
#endif
- info.length = rb_calculate_event_length(length);
+ info.length = rb_calculate_event_length(length, force_long);
+ info.data_length = length ? : 1;
+ info.force_long = force_long;
if (ring_buffer_time_stamp_abs(cpu_buffer->buffer)) {
add_ts_default = RB_ADD_STAMP_ABSOLUTE;
@@ -4698,8 +4708,9 @@ rb_reserve_next_event(struct trace_buffer *buffer,
* Must be paired with ring_buffer_unlock_commit, unless NULL is returned.
* If NULL is returned, then nothing has been allocated or locked.
*/
-struct ring_buffer_event *
-ring_buffer_lock_reserve(struct trace_buffer *buffer, unsigned long length)
+static struct ring_buffer_event *
+__ring_buffer_lock_reserve(struct trace_buffer *buffer, unsigned long length,
+ bool force_long)
{
struct ring_buffer_per_cpu *cpu_buffer;
struct ring_buffer_event *event;
@@ -4727,7 +4738,7 @@ ring_buffer_lock_reserve(struct trace_buffer *buffer, unsigned long length)
if (unlikely(trace_recursive_lock(cpu_buffer)))
goto out;
- event = rb_reserve_next_event(buffer, cpu_buffer, length);
+ event = rb_reserve_next_event(buffer, cpu_buffer, length, force_long);
if (!event)
goto out_unlock;
@@ -4739,8 +4750,21 @@ ring_buffer_lock_reserve(struct trace_buffer *buffer, unsigned long length)
preempt_enable_notrace();
return NULL;
}
+
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct trace_buffer *buffer, unsigned long length)
+{
+ return __ring_buffer_lock_reserve(buffer, length, false);
+}
EXPORT_SYMBOL_GPL(ring_buffer_lock_reserve);
+struct ring_buffer_event *
+ring_buffer_lock_reserve_long(struct trace_buffer *buffer, unsigned long length)
+{
+ return __ring_buffer_lock_reserve(buffer, length, true);
+}
+EXPORT_SYMBOL_GPL(ring_buffer_lock_reserve_long);
+
/*
* Decrement the entries to the page that an event is on.
* The event does not even need to exist, only the pointer
@@ -4874,7 +4898,7 @@ int ring_buffer_write(struct trace_buffer *buffer,
if (unlikely(trace_recursive_lock(cpu_buffer)))
return -EBUSY;
- event = rb_reserve_next_event(buffer, cpu_buffer, length);
+ event = rb_reserve_next_event(buffer, cpu_buffer, length, false);
if (!event)
goto out_unlock;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index a626211ce..ffc1b1e9c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6503,8 +6503,8 @@ static ssize_t write_marker_to_buffer(struct trace_array *tr, const char *buf,
size = cnt + meta_size;
buffer = tr->array_buffer.buffer;
- event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
- tracing_gen_ctx());
+ event = __trace_buffer_lock_reserve_long(buffer, TRACE_PRINT, size,
+ tracing_gen_ctx());
if (unlikely(!event)) {
/*
* If the size was greater than what was allowed, then
@@ -6917,8 +6917,8 @@ static ssize_t write_raw_marker_to_buffer(struct trace_array *tr,
if (size > ring_buffer_max_event_size(buffer))
return -EINVAL;
- event = __trace_buffer_lock_reserve(buffer, TRACE_RAW_DATA, size,
- tracing_gen_ctx());
+ event = __trace_buffer_lock_reserve_long(buffer, TRACE_RAW_DATA, size,
+ tracing_gen_ctx());
if (!event)
/* Ring buffer disabled, return as if not open for write */
return -EBADF;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index b8f380458..da55717c9 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1613,6 +1613,21 @@ __trace_buffer_lock_reserve(struct trace_buffer *buffer,
return event;
}
+static __always_inline struct ring_buffer_event *
+__trace_buffer_lock_reserve_long(struct trace_buffer *buffer,
+ int type,
+ unsigned long len,
+ unsigned int trace_ctx)
+{
+ struct ring_buffer_event *event;
+
+ event = ring_buffer_lock_reserve_long(buffer, len);
+ if (event != NULL)
+ trace_event_setup(event, type, trace_ctx);
+
+ return event;
+}
+
static __always_inline void
__buffer_unlock_commit(struct trace_buffer *buffer, struct ring_buffer_event *event)
{
diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c
index 9f67ce42e..1441b2bd4 100644
--- a/kernel/trace/trace_printk.c
+++ b/kernel/trace/trace_printk.c
@@ -444,8 +444,8 @@ int __trace_array_puts(struct trace_array *tr, unsigned long ip,
trace_ctx = tracing_gen_ctx();
buffer = tr->array_buffer.buffer;
guard(ring_buffer_nest)(buffer);
- event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, alloc,
- trace_ctx);
+ event = __trace_buffer_lock_reserve_long(buffer, TRACE_PRINT, alloc,
+ trace_ctx);
if (!event)
return 0;
@@ -725,8 +725,8 @@ int __trace_array_vprintk(struct trace_buffer *buffer,
size = sizeof(*entry) + len + 1;
scoped_guard(ring_buffer_nest, buffer) {
- event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
- trace_ctx);
+ event = __trace_buffer_lock_reserve_long(buffer, TRACE_PRINT, size,
+ trace_ctx);
if (!event)
goto out;
entry = ring_buffer_event_data(event);
--
2.39.5 (Apple Git-154)
^ permalink raw reply related
* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Yafang Shao @ 2026-04-07 9:45 UTC (permalink / raw)
To: Song Liu
Cc: Joe Lawrence, Dylan Hatch, jpoimboe, jikos, mbenes, pmladek,
rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
bpf
In-Reply-To: <CALOAHbDN_t-ZRO0g9_sQFCv0J6SPDFfwJCcwSzd4ww5XRkU0QA@mail.gmail.com>
On Tue, Apr 7, 2026 at 11:16 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Apr 7, 2026 at 10:54 AM Song Liu <song@kernel.org> wrote:
> >
> > On Mon, Apr 6, 2026 at 2:12 PM Joe Lawrence <joe.lawrence@redhat.com> wrote:
> > [...]
> > > > > > - The regular livepatches are cumulative, have the replace flag; and
> > > > > > are replaceable.
> > > > > > - The occasional "off-band" livepatches do not have the replace flag,
> > > > > > and are not replaceable.
> > > > > >
> > > > > > With this setup, for systems with off-band livepatches loaded, we can
> > > > > > still release a cumulative livepatch to replace the previous cumulative
> > > > > > livepatch. Is this the expected use case?
> > > > >
> > > > > That matches our expected use case.
> > > >
> > > > If we really want to serve use cases like this, I think we can introduce
> > > > some replace tag concept: Each livepatch will have a tag, u32 number.
> > > > Newly loaded livepatch will only replace existing livepatch with the
> > > > same tag. We can even reuse the existing "bool replace" in klp_patch,
> > > > and make it u32: replace=0 means no replace; replace > 0 are the
> > > > replace tag.
> > > >
> > > > For current users of cumulative patches, all the livepatch will have the
> > > > same tag, say 1. For your use case, you can assign each user a
> > > > unique tag. Then all these users can do atomic upgrades of their
> > > > own livepatches.
> > > >
> > > > We may also need to check whether two livepatches of different tags
> > > > touch the same kernel function. When that happens, the later
> > > > livepatch should fail to load.
>
> That sounds like a viable solution. I'll look into it and see how we
> can implement it.
Does the following change look good to you ?
Subject: [PATCH] livepatch: Support scoped atomic replace using replace tags
Extend the replace attribute from a boolean to a u32 to act as a replace
tag. This introduces the following semantics:
replace = 0: Atomic replace is disabled. However, this patch remains
eligible to be superseded by others.
replace > 0: Enables tagged replace (default is 1). A newly loaded
livepatch will only replace existing patches that share the
same tag.
To maintain backward compatibility, a patch with replace == 0 does not
trigger an outgoing atomic replace, but remains eligible to be superseded
by any incoming patch with a valid replace tag.
Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
.../livepatch/cumulative-patches.rst | 20 +++++++-----
Documentation/livepatch/livepatch.rst | 31 +++++++++++++------
include/linux/livepatch.h | 8 +++--
kernel/livepatch/core.c | 4 +++
scripts/livepatch/init.c | 6 +---
scripts/livepatch/klp-build | 11 +++++--
6 files changed, 53 insertions(+), 27 deletions(-)
diff --git a/Documentation/livepatch/cumulative-patches.rst
b/Documentation/livepatch/cumulative-patches.rst
index 1931f318976a..06e90dc5967c 100644
--- a/Documentation/livepatch/cumulative-patches.rst
+++ b/Documentation/livepatch/cumulative-patches.rst
@@ -12,23 +12,26 @@ modified the same function in different ways.
An elegant solution comes with the feature called "Atomic Replace". It allows
creation of so called "Cumulative Patches". They include all wanted changes
-from all older livepatches and completely replace them in one transition.
+from older livepatches with a matching tag and replace them in one transition.
Usage
-----
-The atomic replace can be enabled by setting "replace" flag in struct
klp_patch,
-for example::
+he atomic replace can be enabled by setting a non-zero value to the "replace"
+attribute in ``struct klp_patch``. This value acts as a **replace tag**,
+defining the scope of the replacement.
+
+For example::
static struct klp_patch patch = {
.mod = THIS_MODULE,
.objs = objs,
- .replace = true,
+ .replace = 1,
};
All processes are then migrated to use the code only from the new patch.
-Once the transition is finished, all older patches are automatically
-disabled.
+Once the transition is finished, all older patches with the same replace tag
+are automatically disabled. Patches with different tags remain active.
Ftrace handlers are transparently removed from functions that are no
longer modified by the new cumulative patch.
@@ -62,9 +65,10 @@ Limitations:
------------
- Once the operation finishes, there is no straightforward way
- to reverse it and restore the replaced patches atomically.
+ to reverse it and restore the replaced patches (with the same tag)
+ atomically.
- A good practice is to set .replace flag in any released livepatch.
+ A good practice is to set a consistent .replace tag in related livepatches.
Then re-adding an older livepatch is equivalent to downgrading
to that patch. This is safe as long as the livepatches do _not_ do
extra modifications in (un)patching callbacks or in the module_init()
diff --git a/Documentation/livepatch/livepatch.rst
b/Documentation/livepatch/livepatch.rst
index acb90164929e..1fc1543a22c3 100644
--- a/Documentation/livepatch/livepatch.rst
+++ b/Documentation/livepatch/livepatch.rst
@@ -347,15 +347,28 @@ to '0'.
5.3. Replacing
--------------
-All enabled patches might get replaced by a cumulative patch that
-has the .replace flag set.
-
-Once the new patch is enabled and the 'transition' finishes then
-all the functions (struct klp_func) associated with the replaced
-patches are removed from the corresponding struct klp_ops. Also
-the ftrace handler is unregistered and the struct klp_ops is
-freed when the related function is not modified by the new patch
-and func_stack list becomes empty.
+All currently enabled patches may be superseded by a cumulative patch that
+has the ``.replace`` attribute enabled. The behavior of the replacement
+depends on the value assigned to the replace tag:
+
+replace = 0
+ Atomic replace is disabled. However, this patch remains eligible to be
+ superseded by others.
+
+replace > 0
+ Enables tagged atomic replace. Once the new patch is enabled and the
+ transition finishes, the livepatching core identifies all existing
+ patches that share the same replace tag.
+
+Once the transition is complete, all functions (``struct klp_func``)
+associated with the matching replaced patches are removed from the
+corresponding ``struct klp_ops``. If a function is no longer modified by
+the new patch and its ``func_stack`` list becomes empty, the ftrace
+handler is unregistered and the ``struct klp_ops`` is freed.
+
+Patches with a different replace tag are not affected by this process
+and remain active. This allows for the independent management and
+stacking of multiple, non-conflicting livepatch sets.
See Documentation/livepatch/cumulative-patches.rst for more details.
diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index ba9e3988c07c..417c67a17b99 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -123,7 +123,11 @@ struct klp_state {
* @mod: reference to the live patch module
* @objs: object entries for kernel objects to be patched
* @states: system states that can get modified
- * @replace: replace all actively used patches
+ * @replace: replace tag:
+ * = 0: Atomic replace is disabled; however, this patch remains
+ * eligible to be superseded by others.
+ * > 0: Atomic replace is enabled. Only existing patches with a
+ * matching replace tag will be superseded.
* @list: list node for global list of actively used patches
* @kobj: kobject for sysfs resources
* @obj_list: dynamic list of the object entries
@@ -137,7 +141,7 @@ struct klp_patch {
struct module *mod;
struct klp_object *objs;
struct klp_state *states;
- bool replace;
+ unsigned int replace;
/* internal */
struct list_head list;
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 28d15ba58a26..e4e5c03b0724 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -793,6 +793,8 @@ void klp_free_replaced_patches_async(struct
klp_patch *new_patch)
klp_for_each_patch_safe(old_patch, tmp_patch) {
if (old_patch == new_patch)
return;
+ if (old_patch->replace && old_patch->replace !=
new_patch->replace)
+ continue;
klp_free_patch_async(old_patch);
}
}
@@ -1194,6 +1196,8 @@ void klp_unpatch_replaced_patches(struct
klp_patch *new_patch)
klp_for_each_patch(old_patch) {
if (old_patch == new_patch)
return;
+ if (old_patch->replace && old_patch->replace !=
new_patch->replace)
+ continue;
old_patch->enabled = false;
klp_unpatch_objects(old_patch);
diff --git a/scripts/livepatch/init.c b/scripts/livepatch/init.c
index f14d8c8fb35f..cd00e278a1d2 100644
--- a/scripts/livepatch/init.c
+++ b/scripts/livepatch/init.c
@@ -72,11 +72,7 @@ static int __init livepatch_mod_init(void)
/* TODO patch->states */
-#ifdef KLP_NO_REPLACE
- patch->replace = false;
-#else
- patch->replace = true;
-#endif
+ patch->replace = KLP_REPLACE_TAG;
return klp_enable_patch(patch);
diff --git a/scripts/livepatch/klp-build b/scripts/livepatch/klp-build
index 7b82c7503c2b..9f6a7673304f 100755
--- a/scripts/livepatch/klp-build
+++ b/scripts/livepatch/klp-build
@@ -118,6 +118,7 @@ Options:
-j, --jobs=<jobs> Build jobs to run simultaneously
[default: $JOBS]
-o, --output=<file.ko> Output file [default: livepatch-<patch-name>.ko]
--no-replace Disable livepatch atomic replace
+ -t, --replace-tag=<tag> Set the atomic replace tag for this livepatch
-v, --verbose Pass V=1 to kernel/module builds
Advanced Options:
@@ -142,8 +143,8 @@ process_args() {
local long
local args
- short="hfj:o:vdS:T"
- long="help,show-first-changed,jobs:,output:,no-replace,verbose,debug,short-circuit:,keep-tmp"
+ short="hfj:o:t:vdS:T"
+ long="help,show-first-changed,jobs:,output:,no-replace,replace-tag:,verbose,debug,short-circuit:,keep-tmp"
args=$(getopt --options "$short" --longoptions "$long" -- "$@") || {
echo; usage; exit
@@ -176,6 +177,10 @@ process_args() {
REPLACE=0
shift
;;
+ -t | --replace-tag)
+ REPLACE="$2"
+ shift 2
+ ;;
-v | --verbose)
VERBOSE="V=1"
shift
@@ -759,7 +764,7 @@ build_patch_module() {
cflags=("-ffunction-sections")
cflags+=("-fdata-sections")
- [[ $REPLACE -eq 0 ]] && cflags+=("-DKLP_NO_REPLACE")
+ cflags+=("-DKLP_REPLACE_TAG=$REPLACE")
cmd=("make")
cmd+=("$VERBOSE")
--
Regards
Yafang
^ permalink raw reply related
* [PATCH] selftests/ftrace: Check exact trace_marker_raw payload lengths
From: CaoRuichuang @ 2026-04-07 10:12 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, shuah
Cc: linux-kernel, linux-trace-kernel, linux-kselftest, Cao Ruichuang
From: Cao Ruichuang <create0818@163.com>
trace_marker_raw.tc currently depends on awk strtonum() and
assumes that the printed raw-data byte count is rounded up to four
bytes.
That makes the test fail on systems that use mawk, and it no longer
matches the raw_data trace output we want to validate after preserving
true payload lengths for long records.
Rewrite the test to capture a small sequence of raw marker writes and
check the exact number of printed payload bytes in order. While doing
that, use od for the endian probe, switch to a fixed raw marker id so
the test only varies payload length, and disable pause-on-trace while
streaming trace_pipe.
Signed-off-by: Cao Ruichuang <create0818@163.com>
---
.../ftrace/test.d/00basic/trace_marker_raw.tc | 93 ++++++++++++-------
1 file changed, 59 insertions(+), 34 deletions(-)
diff --git a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
index a2c42e13f..3b37890f8 100644
--- a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
+++ b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
@@ -1,11 +1,11 @@
#!/bin/sh
# SPDX-License-Identifier: GPL-2.0
# description: Basic tests on writing to trace_marker_raw
-# requires: trace_marker_raw
+# requires: trace_marker_raw od:program
# flags: instance
is_little_endian() {
- if lscpu | grep -q 'Little Endian'; then
+ if [ "$(printf '\001\000\000\000' | od -An -tu4 | tr -d '[:space:]')" = "1" ]; then
echo 1;
else
echo 0;
@@ -34,7 +34,7 @@ make_str() {
data=`printf -- 'X%.0s' $(seq $cnt)`
- printf "${val}${data}"
+ printf "%b%s" "${val}" "${data}"
}
write_buffer() {
@@ -47,36 +47,68 @@ write_buffer() {
test_multiple_writes() {
+ out_file=$TMPDIR/trace_marker_raw.out
+ match_file=$TMPDIR/trace_marker_raw.lines
+ wait_iter=0
+ pause_on_trace=
+
+ if [ -f options/pause-on-trace ]; then
+ pause_on_trace=`cat options/pause-on-trace`
+ echo 0 > options/pause-on-trace
+ fi
+
+ : > trace
+ cat trace_pipe > $out_file &
+ reader_pid=$!
+ sleep 1
+
+ # Write sizes that cover both the short and long raw-data encodings
+ # without overflowing the trace buffer before we can verify them.
+ for i in `seq 1 12`; do
+ write_buffer 0x12345678 $i
+ done
- # Write a bunch of data where the id is the count of
- # data to write
- for i in `seq 1 10` `seq 101 110` `seq 1001 1010`; do
- write_buffer $i $i
+ while [ "`grep -c ' buf:' $out_file 2> /dev/null || true`" -lt 12 ]; do
+ wait_iter=$((wait_iter + 1))
+ if [ $wait_iter -ge 10 ]; then
+ kill $reader_pid 2> /dev/null || true
+ wait $reader_pid 2> /dev/null || true
+ if [ -n "$pause_on_trace" ]; then
+ echo $pause_on_trace > options/pause-on-trace
+ fi
+ return 1
+ fi
+ sleep 1
done
# add a little buffer
echo stop > trace_marker
+ sleep 1
+ kill $reader_pid 2> /dev/null || true
+ wait $reader_pid 2> /dev/null || true
+ if [ -n "$pause_on_trace" ]; then
+ echo $pause_on_trace > options/pause-on-trace
+ fi
- # Check to make sure the number of entries is the id (rounded up by 4)
- awk '/.*: # [0-9a-f]* / {
- print;
- cnt = -1;
- for (i = 0; i < NF; i++) {
- # The counter is after the "#" marker
- if ( $i == "#" ) {
- i++;
- cnt = strtonum("0x" $i);
- num = NF - (i + 1);
- # The number of items is always rounded up by 4
- cnt2 = int((cnt + 3) / 4) * 4;
- if (cnt2 != num) {
- exit 1;
- }
- break;
- }
- }
- }
- // { if (NR > 30) { exit 0; } } ' trace_pipe;
+ grep ' buf:' $out_file > $match_file || return 1
+ if [ "`wc -l < $match_file`" -ne 12 ]; then
+ cat $match_file
+ return 1
+ fi
+
+ # Check to make sure the number of byte values matches the id exactly.
+ for expected in `seq 1 12`; do
+ line=`sed -n "${expected}p" $match_file`
+ if [ -z "$line" ]; then
+ return 1
+ fi
+ rest=${line#* buf: }
+ set -- $rest
+ if [ "$#" -ne "$expected" ]; then
+ echo "$line"
+ return 1
+ fi
+ done
}
@@ -107,13 +139,6 @@ test_buffer() {
ORIG=`cat buffer_size_kb`
-# test_multiple_writes test needs at least 12KB buffer
-NEW_SIZE=12
-
-if [ ${ORIG} -lt ${NEW_SIZE} ]; then
- echo ${NEW_SIZE} > buffer_size_kb
-fi
-
test_buffer
if ! test_multiple_writes; then
echo ${ORIG} > buffer_size_kb
--
2.39.5 (Apple Git-154)
^ permalink raw reply related
* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Breno Leitao @ 2026-04-07 10:19 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Jonathan Corbet, Shuah Khan, linux-kernel, linux-trace-kernel,
linux-doc, oss, paulmck, rostedt, kernel-team, Kiryl Shutsemau
In-Reply-To: <20260403114519.14e326a4bba019373bf3ff09@kernel.org>
On Fri, Apr 03, 2026 at 11:45:19AM +0900, Masami Hiramatsu wrote:
> > I'm still uncertain about this approach. The goal is to identify and
> > categorize the early parameters that are parsed prior to bootconfig
> > initialization.
>
> Yes, if we support early parameters in bootconfig, we need to clarify
> which parameters are inherently unsupportable, and document it.
> Currently it is easy to say that it does not support the parameter
> defined with "early_param()". Similary, maybe we should introduce
> "arch_param()" or something like it (or support all of them).
>
> >
> > Moreover, this work could become obsolete if bootconfig's initialization
> > point shifts earlier or later in the boot sequence, necessitating
> > another comprehensive analysis.
>
> If we can init it before calling setup_arch(), yes, we don't need to
> check it. So that is another option. Do you think it is feasible to
> support all of them? (Of course, theologically we can do, but the
> question is the use case and requirements.)
I don't believe all early parameters can be supported by bootconfig.
Some are inherently incompatible as far as I understand, while others
depend on bootconfig's initialization point in the boot sequence.
Regarding documenting arch_param(), I need clarification: should we
document parameters that are currently called before bootconfig (as of
today), or those that fundamentally cannot be called by bootconfig
regardless of its location?
> > Do you have any additional recommendations if I proceed with this
> > approach?
>
> OK,
>
> First of all, even if we enable early parameter support in bootconfig,
> this is only possible if bootconfig is embedded. In that case, we can
> pass memory that has been pre-allocated at compile time to bootconfig
> as a working area. However, this will consume a lot of memory, so it
> needs to be selectable in Kconfig.
>
> If you're going to embed this, as Kiryl pointed out[1], it might be better
> to pass pre-normalized (or compiled) data and avoid using a parser.
> Compilation itself is relatively easy if you utilize the tools/bootconfig.
> (However, in this case, there doesn't seem to be much point in using
> bootconfig in the first place because we also can use embed kernel
> cmdline.)
>
> [1] https://lore.kernel.org/all/acueCFv4neO7zQGI@thinkstation/
>
> Can you clarify the main reason of requesting this feature and
> examples?
My primary objective is to enable early configuration of
`irqchip.gicv3_pseudo_nmi`, allowing the kernel to support pseudo NMI
on arm64 by default.
^ permalink raw reply
* [PATCH] tracing/fprobe: Check the same type fprobe on table as the unregistered one
From: Masami Hiramatsu (Google) @ 2026-04-07 10:24 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
introduced a different ftrace_ops for entry-only fprobes.
However, when unregistering an fprobe, the kernel only checks if another
fprobe exists at the same address, without checking which type of fprobe
it is.
If different fprobes are registered at the same address, the same address
will be registered in both fgraph_ops and ftrace_ops, but only one of
them will be deleted when unregistering. (the one removed first will not
be deleted from the ops).
This results in junk entries remaining in either fgraph_ops or ftrace_ops.
For example:
=======
cd /sys/kernel/tracing
# 'Add entry and exit events on the same place'
echo 'f:event1 vfs_read' >> dynamic_events
echo 'f:event2 vfs_read%return' >> dynamic_events
# 'Enable both of them'
echo 1 > events/fprobes/enable
cat enabled_functions
vfs_read (2) ->arch_ftrace_ops_list_func+0x0/0x210
# 'Disable and remove exit event'
echo 0 > events/fprobes/event2/enable
echo -:event2 >> dynamic_events
# 'Disable and remove all events'
echo 0 > events/fprobes/enable
echo > dynamic_events
# 'Add another event'
echo 'f:event3 vfs_open%return' > dynamic_events
cat dynamic_events
f:fprobes/event3 vfs_open%return
echo 1 > events/fprobes/enable
cat enabled_functions
vfs_open (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
vfs_read (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
=======
As you can see, an entry for the vfs_read remains.
To fix this issue, when unregistering, the kernel should also check if
there is the same type of fprobes still exist at the same address, and
if not, delete its entry from either fgraph_ops or ftrace_ops.
Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/fprobe.c | 77 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 62 insertions(+), 15 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index dcadf1d23b8a..7f75e6e4462c 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -85,11 +85,9 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node)
return rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
}
-/* Return true if there are synonims */
-static bool delete_fprobe_node(struct fprobe_hlist_node *node)
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
{
lockdep_assert_held(&fprobe_mutex);
- bool ret;
/* Avoid double deleting */
if (READ_ONCE(node->fp) != NULL) {
@@ -97,13 +95,6 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
rhltable_remove(&fprobe_ip_table, &node->hlist,
fprobe_rht_params);
}
-
- rcu_read_lock();
- ret = !!rhltable_lookup(&fprobe_ip_table, &node->addr,
- fprobe_rht_params);
- rcu_read_unlock();
-
- return ret;
}
/* Check existence of the fprobe */
@@ -337,6 +328,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
return !fp->exit_handler;
}
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
+{
+ struct rhlist_head *head, *pos;
+ struct fprobe_hlist_node *node;
+ struct fprobe *fp;
+
+ guard(rcu)();
+ head = rhltable_lookup(&fprobe_ip_table, &ip,
+ fprobe_rht_params);
+ if (!head)
+ return false;
+ /* We have to check the same type on the list. */
+ rhl_for_each_entry_rcu(node, pos, head, hlist) {
+ if (node->addr != ip)
+ break;
+ fp = READ_ONCE(node->fp);
+ if (likely(fp)) {
+ if ((!ftrace && fp->exit_handler) ||
+ (ftrace && !fp->exit_handler))
+ return true;
+ }
+ }
+
+ return false;
+}
+
#ifdef CONFIG_MODULES
static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
int reset)
@@ -360,6 +377,29 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
return false;
}
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
+{
+ struct rhlist_head *head, *pos;
+ struct fprobe_hlist_node *node;
+ struct fprobe *fp;
+
+ guard(rcu)();
+ head = rhltable_lookup(&fprobe_ip_table, &ip,
+ fprobe_rht_params);
+ if (!head)
+ return false;
+ /* We only need to check fp is there. */
+ rhl_for_each_entry_rcu(node, pos, head, hlist) {
+ if (node->addr != ip)
+ break;
+ fp = READ_ONCE(node->fp);
+ if (likely(fp))
+ return true;
+ }
+
+ return false;
+}
+
#ifdef CONFIG_MODULES
static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
int reset)
@@ -574,15 +614,20 @@ static int fprobe_addr_list_add(struct fprobe_addr_list *alist, unsigned long ad
static void fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
struct fprobe_addr_list *alist)
{
+ lockdep_assert_in_rcu_read_lock();
+
if (!within_module(node->addr, mod))
return;
- if (delete_fprobe_node(node))
- return;
+
+ delete_fprobe_node(node);
/*
- * If failed to update alist, just continue to update hlist.
+ * Ignore failure of updating alist, but continue to update hlist.
* Therefore, at list user handler will not hit anymore.
+ * And don't care the type here, because all fprobes on the same
+ * address must be removed eventually.
*/
- fprobe_addr_list_add(alist, node->addr);
+ if (!rhltable_lookup(&fprobe_ip_table, &node->addr, fprobe_rht_params))
+ fprobe_addr_list_add(alist, node->addr);
}
/* Handle module unloading to manage fprobe_ip_table. */
@@ -943,7 +988,9 @@ int unregister_fprobe(struct fprobe *fp)
/* Remove non-synonim ips from table and hash */
count = 0;
for (i = 0; i < hlist_array->size; i++) {
- if (!delete_fprobe_node(&hlist_array->array[i]))
+ delete_fprobe_node(&hlist_array->array[i]);
+ if (!fprobe_exists_on_hash(hlist_array->array[i].addr,
+ fprobe_is_ftrace(fp)))
addrs[count++] = hlist_array->array[i].addr;
}
del_fprobe_hash(fp);
^ permalink raw reply related
* [PATCH] selftests/ftrace: Drop invalid top-level local in test_ownership
From: CaoRuichuang @ 2026-04-07 10:26 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, shuah
Cc: linux-kernel, linux-trace-kernel, linux-kselftest, Cao Ruichuang
From: Cao Ruichuang <create0818@163.com>
test_ownership.tc is sourced by ftracetest under /bin/sh.
The script currently declares mount_point with local at file scope,
which makes /bin/sh abort with "local: not in a function" before the
test can reach the eventfs ownership checks.
Replace the top-level local declaration with a normal shell variable so
kernels that support the gid= tracefs mount option can run the test at
all.
Signed-off-by: Cao Ruichuang <create0818@163.com>
---
tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc b/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
index e71cc3ad0..6d00d3c0f 100644
--- a/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
+++ b/tools/testing/selftests/ftrace/test.d/00basic/test_ownership.tc
@@ -6,7 +6,7 @@
original_group=`stat -c "%g" .`
original_owner=`stat -c "%u" .`
-local mount_point=$(get_mount_point)
+mount_point=$(get_mount_point)
mount_options=$(get_mnt_options "$mount_point")
--
2.39.5 (Apple Git-154)
^ permalink raw reply related
* [PATCH] selftests/ftrace: Account for fprobe attachment at creation
From: Cao Ruichuang @ 2026-04-07 11:57 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, shuah
Cc: linux-kernel, linux-trace-kernel, linux-kselftest
add_remove_fprobe.tc assumes that enabling an fprobe event is what adds
its target function to enabled_functions.
On the current kernel, the fprobe target already appears in
enabled_functions as soon as the event is created, and enabling the
event does not change that count again. That makes the test fail even
though the event lifecycle itself works.
Record the attachment baseline after creating the probe events and only
check that enabling them keeps the expected functions attached. The
cleanup checks still verify that removing the events returns
enabled_functions to its original state.
Signed-off-by: Cao Ruichuang <create0818@163.com>
---
.../test.d/dynevent/add_remove_fprobe.tc | 28 +++++++++++++------
1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc
index 47067a5e3..ff08bd1ac 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc
@@ -26,23 +26,29 @@ grep -q myevent2 dynamic_events
grep -q myevent3 dynamic_events
test -d events/fprobes/myevent1
test -d events/fprobes/myevent2
-
-echo 1 > events/fprobes/myevent1/enable
-# Make sure the event is attached.
grep -q $PLACE enabled_functions
+grep -q $PLACE2 enabled_functions
cnt=`cat enabled_functions | wc -l`
-if [ $cnt -eq $ocnt ]; then
+if [ $cnt -le $ocnt ]; then
+ exit_fail
+fi
+
+echo 1 > events/fprobes/myevent1/enable
+cnt1=`cat enabled_functions | wc -l`
+if [ $cnt1 -ne $cnt ]; then
exit_fail
fi
echo 1 > events/fprobes/myevent2/enable
cnt2=`cat enabled_functions | wc -l`
+if [ $cnt2 -ne $cnt1 ]; then
+ exit_fail
+fi
echo 1 > events/fprobes/myevent3/enable
-# If the function is different, the attached function should be increased
grep -q $PLACE2 enabled_functions
cnt=`cat enabled_functions | wc -l`
-if [ $cnt -eq $cnt2 ]; then
+if [ $cnt -ne $cnt2 ]; then
exit_fail
fi
@@ -62,11 +68,15 @@ if [ $cnt -ne $ocnt ]; then
fi
echo "f:myevent4 $PLACE" >> dynamic_events
+grep -q $PLACE enabled_functions
+cnt=`cat enabled_functions | wc -l`
+if [ $cnt -le $ocnt ]; then
+ exit_fail
+fi
echo 1 > events/fprobes/myevent4/enable
-# Should only have one enabled
-cnt=`cat enabled_functions | wc -l`
-if [ $cnt -ne $((ocnt + 1)) ]; then
+cnt2=`cat enabled_functions | wc -l`
+if [ $cnt2 -ne $cnt ]; then
exit_fail
fi
--
2.39.5 (Apple Git-154)
^ permalink raw reply related
* [PATCH] selftests/ftrace: Fix BAD_TP_NAME marker in fprobe_syntax_errors
From: Cao Ruichuang @ 2026-04-07 12:11 UTC (permalink / raw)
To: rostedt, mhiramat
Cc: mathieu.desnoyers, shuah, linux-kernel, linux-trace-kernel,
linux-kselftest
The BAD_TP_NAME check currently expects the error marker for
`t kmem/kfree` to point at the slash. Current kernels report
this parse error at the start of the invalid tracepoint name,
so the selftest fails even though the input is still rejected.
Move the caret to match the reported error position.
Signed-off-by: Cao Ruichuang <create0818@163.com>
---
.../selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
index fee479295..e51f642c3 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
@@ -27,7 +27,7 @@ check_error 'f:^foo.1/bar vfs_read' # BAD_GROUP_NAME
check_error 'f:^ vfs_read' # NO_EVENT_NAME
check_error 'f:foo/^12345678901234567890123456789012345678901234567890123456789012345 vfs_read' # EVENT_TOO_LONG
check_error 'f:foo/^bar.1 vfs_read' # BAD_EVENT_NAME
-check_error 't kmem^/kfree' # BAD_TP_NAME
+check_error 't ^kmem/kfree' # BAD_TP_NAME
check_error 'f vfs_read ^$stack10000' # BAD_STACK_NUM
--
2.39.5 (Apple Git-154)
^ permalink raw reply related
* Re: [RFC PATCH bpf-next v5 1/2] tracing: Prefer vmlinux symbols over module symbols for unqualified kprobes
From: Jiri Olsa @ 2026-04-07 12:59 UTC (permalink / raw)
To: bot+bpf-ci
Cc: andrey.grodzovsky, bpf, linux-trace-kernel, ast, daniel, andrii,
rostedt, mhiramat, ihor.solodrai, emil, linux-open-source,
martin.lau, eddyz87, yonghong.song, clm
In-Reply-To: <669779524c357c9790196e8e9ce8da8b21a71756c96e828252fead311fdc44c3@mail.kernel.org>
On Mon, Apr 06, 2026 at 08:15:10PM +0000, bot+bpf-ci@kernel.org wrote:
> > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
> > Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
>
> This commit fixes a bug introduced when module symbol counting was
> added to number_of_same_symbols(). Would it be worth adding a
> Fixes: tag?
>
> Suggested:
> Fixes: 926fe783c8a6 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well")
there's also:
9d8616034f16 tracing/kprobes: Add symbol counting check when module loads
jirka
>
> That commit added module_kallsyms_on_each_symbol() to
> number_of_same_symbols(), which caused unqualified kprobe targets
> matching both vmlinux and a module to return count > 1 and fail
> with -EADDRNOTAVAIL.
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24047967861
^ permalink raw reply
* Re: [RFC PATCH bpf-next v5 2/2] selftests/bpf: Add tests for duplicate kprobe symbol handling
From: Jiri Olsa @ 2026-04-07 12:59 UTC (permalink / raw)
To: Andrey Grodzovsky
Cc: bpf, linux-trace-kernel, ast, daniel, andrii, rostedt, mhiramat,
ihor.solodrai, emil, linux-open-source
In-Reply-To: <20260406193158.754498-3-andrey.grodzovsky@crowdstrike.com>
On Mon, Apr 06, 2026 at 03:31:58PM -0400, Andrey Grodzovsky wrote:
SNIP
> +static void test_attach_probe_dup_sym(enum probe_attach_mode attach_mode)
> +{
> + DECLARE_LIBBPF_OPTS(bpf_kprobe_opts, kprobe_opts);
> + struct bpf_link *kprobe_link, *kretprobe_link;
> + struct test_attach_probe_manual *skel;
> + int err;
> +
> + /* Load module with duplicate symbol */
> + err = load_module("bpf_testmod_dup_sym.ko", false);
> + if (!ASSERT_OK(err, "load_bpf_testmod_dup_sym")) {
> + test__skip();
> + return;
> + }
> +
> + skel = test_attach_probe_manual__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "skel_dup_sym_open_and_load"))
> + goto unload_module;
> +
> + /* manual-attach kprobe/kretprobe with duplicate symbol present */
> + kprobe_opts.attach_mode = attach_mode;
> + kprobe_opts.retprobe = false;
> + kprobe_link = bpf_program__attach_kprobe_opts(skel->progs.handle_kprobe,
> + SYS_NANOSLEEP_KPROBE_NAME,
> + &kprobe_opts);
> + if (!ASSERT_OK_PTR(kprobe_link, "attach_kprobe_dup_sym"))
> + goto cleanup;
> + skel->links.handle_kprobe = kprobe_link;
> +
> + kprobe_opts.retprobe = true;
> + kretprobe_link = bpf_program__attach_kprobe_opts(skel->progs.handle_kretprobe,
> + SYS_NANOSLEEP_KPROBE_NAME,
> + &kprobe_opts);
maybe add tests for attaching the shadow module function as well?
> diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod_dup_sym.c b/tools/testing/selftests/bpf/test_kmods/bpf_testmod_dup_sym.c
> new file mode 100644
> index 000000000000..0e12f68afe3a
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod_dup_sym.c
> @@ -0,0 +1,48 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2025 CrowdStrike */
> +/* Test module for duplicate kprobe symbol handling */
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +
> +/* Duplicate symbol to test kprobe attachment with duplicate symbols.
> + * This creates a duplicate of the syscall wrapper used in attach_probe tests.
> + * The libbpf fix should handle this by preferring the vmlinux symbol.
> + * This function should NEVER be called - kprobes should attach to vmlinux version.
> + */
> +#ifdef __x86_64__
> +int __x64_sys_nanosleep(void);
> +noinline int __x64_sys_nanosleep(void)
> +#elif defined(__s390x__)
> +int __s390x_sys_nanosleep(void);
> +noinline int __s390x_sys_nanosleep(void)
> +#elif defined(__aarch64__)
> +int __arm64_sys_nanosleep(void);
> +noinline int __arm64_sys_nanosleep(void)
> +#elif defined(__riscv)
> +int __riscv_sys_nanosleep(void);
> +noinline int __riscv_sys_nanosleep(void)
> +#else
> +int sys_nanosleep(void);
> +noinline int sys_nanosleep(void)
> +#endif
could we use module_fentry_shadow instead? it's in kernel and in bpf_testmod
for fentry shadowing test.. it's not executed via test_run but it could be
added or we just don't run it
jirka
> +{
> + WARN_ONCE(1, "bpf_testmod_dup_sym: dummy nanosleep symbol called - this should never execute!\n");
> + return -EINVAL;
> +}
> +
> +static int __init bpf_testmod_dup_sym_init(void)
> +{
> + return 0;
> +}
> +
> +static void __exit bpf_testmod_dup_sym_exit(void)
> +{
> +}
> +
> +module_init(bpf_testmod_dup_sym_init);
> +module_exit(bpf_testmod_dup_sym_exit);
> +
> +MODULE_AUTHOR("Andrey Grodzovsky");
> +MODULE_DESCRIPTION("BPF selftest duplicate symbol module");
> +MODULE_LICENSE("GPL");
> --
> 2.34.1
>
^ permalink raw reply
* Re: [RFC PATCH bpf-next v5 1/2] tracing: Prefer vmlinux symbols over module symbols for unqualified kprobes
From: Jiri Olsa @ 2026-04-07 12:59 UTC (permalink / raw)
To: Andrey Grodzovsky
Cc: bpf, linux-trace-kernel, ast, daniel, andrii, rostedt, mhiramat,
ihor.solodrai, emil, linux-open-source
In-Reply-To: <20260406193158.754498-2-andrey.grodzovsky@crowdstrike.com>
On Mon, Apr 06, 2026 at 03:31:57PM -0400, Andrey Grodzovsky wrote:
> When an unqualified kprobe target exists in both vmlinux and a loaded
> module, number_of_same_symbols() returns a count greater than 1,
> causing kprobe attachment to fail with -EADDRNOTAVAIL even though the
> vmlinux symbol is unambiguous.
>
> When no module qualifier is given and the symbol is found in vmlinux,
> return the vmlinux-only count without scanning loaded modules. This
> preserves the existing behavior for all other cases:
> - Symbol only in a module: vmlinux count is 0, falls through to module
> scan as before.
> - Symbol qualified with MOD:SYM: mod != NULL, unchanged path.
> - Symbol ambiguous within vmlinux itself: count > 1 is returned as-is.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
> Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
lgtm, kprobe_multi seems to behave like that already, maybe you could add test for that as well
Acked-by: Jiri Olsa <jolsa@kernel.org>
jirka
> ---
> kernel/trace/trace_kprobe.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index a5dbb72528e0..99c41ea8b6d7 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -765,6 +765,13 @@ static unsigned int number_of_same_symbols(const char *mod, const char *func_nam
> if (!mod)
> kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx.count);
>
> + /* If the symbol is found in vmlinux, use vmlinux resolution only.
> + * This prevents module symbols from shadowing vmlinux symbols
> + * and causing -EADDRNOTAVAIL for unqualified kprobe targets.
> + */
> + if (!mod && ctx.count > 0)
> + return ctx.count;
> +
> module_kallsyms_on_each_symbol(mod, count_mod_symbols, &ctx);
>
> return ctx.count;
> --
> 2.34.1
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox