* Re: [PATCH v17 1/5] ring-buffer: Flush and stop persistent ring buffer on panic
From: Geert Uytterhoeven @ 2026-04-23 7:28 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, Catalin Marinas, Will Deacon, Mathieu Desnoyers,
linux-kernel, linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177687459412.932171.8121855108122534476.stgit@mhiramat.tok.corp.google.com>
On Wed, 22 Apr 2026 at 18:26, Masami Hiramatsu (Google)
<mhiramat@kernel.org> wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
> On real hardware, panic and machine reboot may not flush hardware cache
> to memory. This means the persistent ring buffer, which relies on a
> coherent state of memory, may not have its events written to the buffer
> and they may be lost. Moreover, there may be inconsistency with the
> counters which are used for validation of the integrity of the
> persistent ring buffer which may cause all data to be discarded.
>
> To avoid this issue, stop recording of the ring buffer on panic and
> flush the cache of the ring buffer's memory.
>
> Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
> Cc: stable@vger.kernel.org
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
> arch/m68k/include/asm/Kbuild | 1 +
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
^ permalink raw reply
* Re: [PATCH v13 00/18] unwind_deferred: Implement sframe handling
From: Indu Bhagat @ 2026-04-23 7:00 UTC (permalink / raw)
To: Jens Remus
Cc: linux-kernel, linux-trace-kernel, bpf, x86, linux-mm,
Steven Rostedt, Jens Remus, Josh Poimboeuf, Masami Hiramatsu,
Mathieu Desnoyers, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Jose E. Marchesi, Beau Belgrave, Linus Torvalds,
Andrew Morton, Florian Weimer, Kees Cook, Carlos O'Donell,
Sam James, Dylan Hatch, Borislav Petkov, Dave Hansen,
David Hildenbrand, H. Peter Anvin, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
Vlastimil Babka, Heiko Carstens, Vasily Gorbik, ibhagatgnu
In-Reply-To: <20260127150554.2760964-1-jremus@linux.ibm.com>
On Tue, Jan 27, 2026 at 7:32 AM Jens Remus <jremus@linux.ibm.com> wrote:
>
> This is the implementation of parsing the SFrame V3 stack trace information
> from an .sframe section in an ELF file. It's a continuation of Josh's and
> Steve's work that can be found here:
>
> https://lore.kernel.org/all/cover.1737511963.git.jpoimboe@kernel.org/
> https://lore.kernel.org/all/20250827201548.448472904@kernel.org/
>
> Currently the only way to get a user space stack trace from a stack
> walk (and not just copying large amount of user stack into the kernel
> ring buffer) is to use frame pointers. This has a few issues. The biggest
> one is that compiling frame pointers into every application and library
> has been shown to cause performance overhead.
>
> Another issue is that the format of the frames may not always be consistent
> between different compilers and some architectures (s390) has no defined
> format to do a reliable stack walk. The only way to perform user space
> profiling on these architectures is to copy the user stack into the kernel
> buffer.
>
> SFrame [1] is now supported in binutils (x86-64, ARM64, and s390). There is
> discussions going on about supporting SFrame in LLVM. SFrame acts more like
> ORC, and lives in the ELF executable file as its own section. Like ORC it
> has two tables where the first table is sorted by instruction pointers (IP)
> and using the current IP and finding it's entry in the first table, it will
> take you to the second table which will tell you where the return address
> of the current function is located and then you can use that address to
> look it up in the first table to find the return address of that function,
> and so on. This performs a user space stack walk.
>
> Now because the .sframe section lives in the ELF file it needs to be faulted
> into memory when it is used. This means that walking the user space stack
> requires being in a faultable context. As profilers like perf request a stack
> trace in interrupt or NMI context, it cannot do the walking when it is
> requested. Instead it must be deferred until it is safe to fault in user
> space. One place this is known to be safe is when the task is about to return
> back to user space.
>
> This series makes the deferred unwind user code implement SFrame format V3
> and enables it on x86-64.
>
> [1]: https://sourceware.org/binutils/wiki/sframe
>
>
> This series applies on top of the tip perf/core branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core
>
> The to be stack-traced user space programs (and libraries) need to be
> built with the recent SFrame stack trace information format V3, as
> generated by the upcoming binutils 2.46 with assembler option --gsframe.
> It can be built from source from the binutils-2_46-branch branch:
>
> git://sourceware.org/git/binutils-gdb.git binutils-2_46-branch
>
> Namhyung Kim's related perf tools deferred callchain support can be used
> for testing ("perf record --call-graph fp,defer" and "perf report/script").
>
>
> Changes since v12 (see patch notes for details):
> - Rebase on tip perf/core branch (d55c571e4333).
> - Add support for SFrame V3, including its new flexible FDEs. SFrame V2
> is not supported.
>
> Changes since v11 (see patch notes for details):
> - Rebase on tip master branch (f8fdee44bf2f) with Namhyung Kim's
> perf/defer-callchain-v4 branch merged on top.
> - Adjust to Peter's latest undwind user enhancements.
> - Simplify logic by using an internal SFrame FDE representation, whose
> FDE function start address field is an address instead of a PC-relative
> offset (from FDE).
> - Rename struct sframe_fre to sframe_fre_internal to align with
> struct sframe_fde_internal.
> - Remove unused pt_regs from unwind_user_next_common() and its
> callers. (Peter)
> - Simplify unwind_user_next_sframe(). (Peter)
> - Fix a few checkpatch errors and warnings.
> - Minor cleanups (e.g. move includes, fix indentation).
>
> Changes since v10:
> - Support for SFrame V2 PC-relative FDE function start address.
> - Support for SFrame V2 representing RA undefined as indication for
> outermost frames.
>
>
> Patches 1, 4, 11, and 17 have been updated to exclusively support the
> latest SFrame V3 stack trace information format, that is generated by
> the upcoming binutils 2.46 release. Old SFrame V2 sections get rejected
> with dynamic debug message "bad/unsupported sframe header".
>
> Patches 7 and 8 add support to unwind user (sframe) for outermost frames.
>
> Patches 12-15 add support to unwind user (sframe) for the new SFrame V3
> flexible FDEs.
>
> Patch 16 improves the performance of searching the SFrame FRE for an IP.
>
Thanks Jens for your work on this. Apart from some of those minor
renames you are already planning on doing (as you mentioned in the
meeting today), the SFrame related bits look OK to me.
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
> Regards,
> Jens
>
>
> Jens Remus (7):
> unwind_user: Stop when reaching an outermost frame
> unwind_user/sframe: Add support for outermost frame indication
> unwind_user: Enable archs that pass RA in a register
> unwind_user: Flexible FP/RA recovery rules
> unwind_user: Flexible CFA recovery rules
> unwind_user/sframe: Add support for SFrame V3 flexible FDEs
> unwind_user/sframe: Separate reading of FRE from reading of FRE data
> words
>
> Josh Poimboeuf (11):
> unwind_user/sframe: Add support for reading .sframe headers
> unwind_user/sframe: Store .sframe section data in per-mm maple tree
> x86/uaccess: Add unsafe_copy_from_user() implementation
> unwind_user/sframe: Add support for reading .sframe contents
> unwind_user/sframe: Detect .sframe sections in executables
> unwind_user/sframe: Wire up unwind_user to sframe
> unwind_user/sframe: Remove .sframe section on detected corruption
> unwind_user/sframe: Show file name in debug output
> unwind_user/sframe: Add .sframe validation option
> unwind_user/sframe/x86: Enable sframe unwinding on x86
> unwind_user/sframe: Add prctl() interface for registering .sframe
> sections
>
> MAINTAINERS | 1 +
> arch/Kconfig | 23 +
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/mmu.h | 2 +-
> arch/x86/include/asm/uaccess.h | 39 +-
> arch/x86/include/asm/unwind_user.h | 69 +-
> arch/x86/include/asm/unwind_user_sframe.h | 12 +
> fs/binfmt_elf.c | 48 +-
> include/linux/mm_types.h | 3 +
> include/linux/sframe.h | 60 ++
> include/linux/unwind_user.h | 18 +
> include/linux/unwind_user_types.h | 46 +-
> include/uapi/linux/elf.h | 1 +
> include/uapi/linux/prctl.h | 6 +-
> kernel/fork.c | 10 +
> kernel/sys.c | 9 +
> kernel/unwind/Makefile | 3 +-
> kernel/unwind/sframe.c | 840 ++++++++++++++++++++++
> kernel/unwind/sframe.h | 87 +++
> kernel/unwind/sframe_debug.h | 68 ++
> kernel/unwind/user.c | 105 ++-
> mm/init-mm.c | 2 +
> 22 files changed, 1414 insertions(+), 39 deletions(-)
> create mode 100644 arch/x86/include/asm/unwind_user_sframe.h
> create mode 100644 include/linux/sframe.h
> create mode 100644 kernel/unwind/sframe.c
> create mode 100644 kernel/unwind/sframe.h
> create mode 100644 kernel/unwind/sframe_debug.h
>
> --
> 2.51.0
>
>
^ permalink raw reply
* Re: [moderation] KCSAN: data-race in filemap_read / filemap_splice_read (3)
From: syzbot @ 2026-04-23 5:05 UTC (permalink / raw)
To: adilger.kernel, akpm, almaz.alexandrovich, baolin.wang, hughd,
jack, jiayuan.chen, jiayuan.chen, linux-ext4, linux-fsdevel,
linux-kernel, linux-mm, linux-trace-kernel, mathieu.desnoyers,
mhiramat, ntfs3, rostedt, syzkaller-upstream-moderation, tytso,
willy
In-Reply-To: <699fd494.050a0220.2fcbed.0000.GAE@google.com>
Auto-closing this bug as obsolete.
Crashes did not happen for a while, no reproducer and no activity.
^ permalink raw reply
* Re: [PATCH net v1] net: validate skb->napi_id in RX tracepoints
From: patchwork-bot+netdevbpf @ 2026-04-23 3:40 UTC (permalink / raw)
To: Kohei Enju
Cc: netdev, linux-trace-kernel, davem, edumazet, kuba, pabeni, horms,
rostedt, mhiramat, mathieu.desnoyers
In-Reply-To: <20260420105427.162816-1-kohei@enjuk.jp>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 20 Apr 2026 10:54:23 +0000 you wrote:
> Since commit 2bd82484bb4c ("xps: fix xps for stacked devices"),
> skb->napi_id shares storage with sender_cpu. RX tracepoints using
> net_dev_rx_verbose_template read skb->napi_id directly and can therefore
> report sender_cpu values as if they were NAPI IDs.
>
> For example, on the loopback path this can report 1 as napi_id, where 1
> comes from raw_smp_processor_id() + 1 in the XPS path:
>
> [...]
Here is the summary with links:
- [net,v1] net: validate skb->napi_id in RX tracepoints
https://git.kernel.org/netdev/net/c/3bfcf396081a
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* [PATCH v17 5/5] ring-buffer: Show commit numbers in buffer_meta file
From: Masami Hiramatsu (Google) @ 2026-04-22 16:17 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177687458572.932171.10907864814735342737.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
In addition to the index number, show the commit numbers of
each data page in the per_cpu buffer_meta file.
This is useful for understanding the current status of the
persistent ring buffer. (Note that this file is shown
only for persistent ring buffer and its backup instance)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v17:
- Added NULL check for dpage in rbm_show in ring_buffer.c.
Changes in v16:
- update description.
---
kernel/trace/ring_buffer.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 31448c5ea791..15dcbf554d49 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2215,6 +2215,7 @@ static int rbm_show(struct seq_file *m, void *v)
struct ring_buffer_per_cpu *cpu_buffer = m->private;
struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
unsigned long val = (unsigned long)v;
+ struct buffer_data_page *dpage;
if (val == 1) {
seq_printf(m, "head_buffer: %d\n",
@@ -2227,7 +2228,9 @@ static int rbm_show(struct seq_file *m, void *v)
}
val -= 2;
- seq_printf(m, "buffer[%ld]: %d\n", val, meta->buffers[val]);
+ dpage = rb_range_buffer(cpu_buffer, val);
+ seq_printf(m, "buffer[%ld]: %d (commit: %ld)\n",
+ val, meta->buffers[val], dpage ? local_read(&dpage->commit) : -1);
return 0;
}
^ permalink raw reply related
* [PATCH v17 4/5] ring-buffer: Add persistent ring buffer invalid-page inject test
From: Masami Hiramatsu (Google) @ 2026-04-22 16:16 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177687458572.932171.10907864814735342737.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a self-corrupting test for the persistent ring buffer.
This will inject an erroneous value to some sub-buffer pages (where
the index is even or multiples of 5) in the persistent ring buffer
when the kernel panics, and checks whether the number of detected
invalid pages and the total entry_bytes are the same as the recorded
values after reboot.
This ensures that the kernel can correctly recover a partially
corrupted persistent ring buffer after a reboot or panic.
The test only runs on the persistent ring buffer whose name is
"ptracingtest". The user has to fill it with events before a
kernel panic.
To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT
and add the following kernel cmdline:
reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
panic=1
Run the following commands after the 1st boot:
cd /sys/kernel/tracing/instances/ptracingtest
echo 1 > tracing_on
echo 1 > events/enable
sleep 3
echo c > /proc/sysrq-trigger
After panic message, the kernel will reboot and run the verification
on the persistent ring buffer, e.g.
Ring buffer meta [2] invalid buffer page detected
Ring buffer meta [2] is from previous boot! (318 pages discarded)
Ring buffer testing [2] invalid pages: PASSED (318/318)
Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v17:
- In rb_test_inject_invalid_pages(), changed entry_bytes and
idx to unsigned long
- Added NULL checks for cpu_buffer and meta.
- In allocate_trace_buffer(), added a NULL check for tr->name
before comparing it with strcmp.
Changes in v16:
- Update description and comments according to review comments.
Changes in v15:
- Use pr_warn() for test result.
- Inject errors on the page index is multiples of 5 so that
this can reproduce contiguous empty pages.
Changes in v14:
- Rename config to CONFIG_RING_BUFFER_PERSISTENT_INJECT.
- Clear meta->nr_invalid/entry_bytes after testing.
- Add test commands in config comment.
Changes in v10:
- Add entry_bytes test.
- Do not compile test code if CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n.
Changes in v9:
- Test also reader pages.
---
include/linux/ring_buffer.h | 1 +
kernel/trace/Kconfig | 34 +++++++++++++++++++
kernel/trace/ring_buffer.c | 79 +++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.c | 4 ++
4 files changed, 118 insertions(+)
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 994f52b34344..0670742b2d60 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -238,6 +238,7 @@ int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
+ RB_FL_TESTING = 1 << 1,
};
#ifdef CONFIG_RING_BUFFER
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..084f34dc6c9f 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1202,6 +1202,40 @@ config RING_BUFFER_VALIDATE_TIME_DELTAS
Only say Y if you understand what this does, and you
still want it enabled. Otherwise say N
+config RING_BUFFER_PERSISTENT_INJECT
+ bool "Enable persistent ring buffer error injection test"
+ depends on RING_BUFFER
+ help
+ This option will have the kernel check if the persistent ring
+ buffer is named "ptracingtest". and if so, it will corrupt some
+ of its pages on a kernel panic. This is used to test if the
+ persistent ring buffer can recover from some of its sub-buffers
+ being corrupted.
+ To use this, boot a kernel with a "ptracingtest" persistent
+ ring buffer, e.g.
+
+ reserve_mem=20M:2M:trace trace_instance=ptracingtest@trace panic=1
+
+ And after the 1st boot, run the following commands:
+
+ cd /sys/kernel/tracing/instances/ptracingtest
+ echo 1 > events/enable
+ echo 1 > tracing_on
+ sleep 3
+ echo c > /proc/sysrq-trigger
+
+ After the panic message, the kernel will reboot and will show
+ the test results in the console output.
+
+ Note that events for the test ring buffer needs to be enabled
+ prior to crashing the kernel so that the ring buffer has content
+ that the test will corrupt.
+ As the test will corrupt events in the "ptracingtest" persistent
+ ring buffer, it should not be used for any other purpose other
+ than this test.
+
+ If unsure, say N
+
config MMIOTRACE_TEST
tristate "Test module for mmiotrace"
depends on MMIOTRACE && m
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 0e3d2d037d4d..31448c5ea791 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -64,6 +64,10 @@ struct ring_buffer_cpu_meta {
unsigned long commit_buffer;
__u32 subbuf_size;
__u32 nr_subbufs;
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+ __u32 nr_invalid;
+ __u32 entry_bytes;
+#endif
int buffers[];
};
@@ -2085,6 +2089,21 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (discarded)
pr_cont(" (%d pages discarded)", discarded);
pr_cont("\n");
+
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+ if (meta->nr_invalid)
+ pr_warn("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
+ cpu_buffer->cpu,
+ (discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+ discarded, meta->nr_invalid);
+ if (meta->entry_bytes)
+ pr_warn("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
+ cpu_buffer->cpu,
+ (entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+ (long)entry_bytes, (long)meta->entry_bytes);
+ meta->nr_invalid = 0;
+ meta->entry_bytes = 0;
+#endif
return;
invalid:
@@ -2565,12 +2584,72 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
+{
+ struct ring_buffer_per_cpu *cpu_buffer;
+ struct ring_buffer_cpu_meta *meta;
+ struct buffer_data_page *dpage;
+ unsigned long entry_bytes = 0;
+ unsigned long ptr;
+ int subbuf_size;
+ int invalid = 0;
+ int cpu;
+ int i;
+
+ if (!(buffer->flags & RB_FL_TESTING))
+ return;
+
+ guard(preempt)();
+ cpu = smp_processor_id();
+
+ cpu_buffer = buffer->buffers[cpu];
+ if (!cpu_buffer)
+ return;
+ meta = cpu_buffer->ring_meta;
+ if (!meta)
+ return;
+
+ ptr = (unsigned long)rb_subbufs_from_meta(meta);
+ subbuf_size = meta->subbuf_size;
+
+ for (i = 0; i < meta->nr_subbufs; i++) {
+ unsigned long idx = meta->buffers[i];
+
+ dpage = (void *)(ptr + idx * subbuf_size);
+ /* Skip unused pages */
+ if (!local_read(&dpage->commit))
+ continue;
+
+ /*
+ * Invalidate even pages or multiples of 5. This will cause 3
+ * contiguous invalidated(empty) pages.
+ */
+ if (!(i & 0x1) || !(i % 5)) {
+ local_add(subbuf_size + 1, &dpage->commit);
+ invalid++;
+ } else {
+ /* Count total commit bytes. */
+ entry_bytes += local_read(&dpage->commit);
+ }
+ }
+
+ pr_info("Inject invalidated %d pages on CPU%d, total size: %ld\n",
+ invalid, cpu, (long)entry_bytes);
+ meta->nr_invalid = invalid;
+ meta->entry_bytes = entry_bytes;
+}
+#else /* !CONFIG_RING_BUFFER_PERSISTENT_INJECT */
+#define rb_test_inject_invalid_pages(buffer) do { } while (0)
+#endif
+
/* Stop recording on a persistent buffer and flush cache if needed. */
static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
{
struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
ring_buffer_record_off(buffer);
+ rb_test_inject_invalid_pages(buffer);
arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
return NOTIFY_DONE;
}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e9455d46ec16..d972b24cd73b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9436,6 +9436,8 @@ static void setup_trace_scratch(struct trace_array *tr,
memset(tscratch, 0, size);
}
+#define TRACE_TEST_PTRACING_NAME "ptracingtest"
+
static int
allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned long size)
{
@@ -9448,6 +9450,8 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned
buf->tr = tr;
if (tr->range_addr_start && tr->range_addr_size) {
+ if (tr->name && !strcmp(tr->name, TRACE_TEST_PTRACING_NAME))
+ rb_flags |= RB_FL_TESTING;
/* Add scratch buffer to handle 128 modules */
buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
tr->range_addr_start,
^ permalink raw reply related
* [PATCH v17 3/5] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-04-22 16:16 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177687458572.932171.10907864814735342737.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when rewinding the persistent ring buffer
instead of stopping the rewinding the ring buffer. The skipped
buffers are cleared.
To ensure the rewinding stops at the unused page, this also clears
buffer_data_page::time_stamp when tracing resets the buffer. This
allows us to identify unused pages and empty pages.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v17:
- Fix to verify head_page at first before using its timestamp.
- Reset timestamp if the page is invalid.
Changes in v12:
- Fix build error.
Changes in v11:
- Reset timestamp when the buffer is invalid.
- When rewinding, skip subbuf page if timestamp is wrong and
check timestamp after validating buffer data page.
Changes in v10:
- Newly added.
---
kernel/trace/ring_buffer.c | 92 ++++++++++++++++++++++++++------------------
1 file changed, 54 insertions(+), 38 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 26507b93cf40..0e3d2d037d4d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -363,6 +363,7 @@ struct buffer_page {
static void rb_init_page(struct buffer_data_page *bpage)
{
local_set(&bpage->commit, 0);
+ bpage->time_stamp = 0;
}
static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
@@ -1878,12 +1879,14 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
- struct ring_buffer_cpu_meta *meta)
+static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
+ struct ring_buffer_cpu_meta *meta, u64 prev_ts, u64 next_ts)
{
+ struct buffer_data_page *dpage = bpage->page;
unsigned long long ts;
unsigned long tail;
u64 delta;
+ int ret = -1;
/*
* When a sub-buffer is recovered from a read, the commit value may
@@ -1892,9 +1895,19 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
* subbuf_size is considered invalid.
*/
tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
- if (tail > meta->subbuf_size)
- return -1;
- return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+ if (tail <= meta->subbuf_size)
+ ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+
+ if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
+ local_set(&bpage->entries, 0);
+ local_set(&bpage->page->commit, 0);
+ bpage->page->time_stamp = prev_ts ? prev_ts : next_ts;
+ ret = -1;
+ } else {
+ local_set(&bpage->entries, ret);
+ }
+
+ return ret;
}
/* If the meta data has been validated, now validate the events */
@@ -1914,25 +1927,29 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_head = head_page = cpu_buffer->head_page;
- /* Do the reader page first */
- ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
+ /* Do the head page first */
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, 0);
+ if (ret < 0) {
+ pr_info("Ring buffer meta [%d] invalid head page detected\n",
+ cpu_buffer->cpu);
+ goto skip_rewind;
+ }
+ ts = head_page->page->time_stamp;
+
+ /* Do the reader page - reader must be previous to head. */
+ ret = rb_validate_buffer(cpu_buffer->reader_page, cpu_buffer->cpu, meta, 0, ts);
if (ret < 0) {
pr_info("Ring buffer meta [%d] invalid reader page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&cpu_buffer->reader_page->entries, 0);
- local_set(&cpu_buffer->reader_page->page->commit, 0);
} else {
entries += ret;
entry_bytes += rb_page_size(cpu_buffer->reader_page);
- local_set(&cpu_buffer->reader_page->entries, ret);
+ ts = cpu_buffer->reader_page->page->time_stamp;
}
- ts = head_page->page->time_stamp;
-
/*
- * Try to rewind the head so that we can read the pages which already
+ * Try to rewind the head so that we can read the pages which are already
* read in the previous boot.
*/
if (head_page == cpu_buffer->tail_page)
@@ -1945,26 +1962,27 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->tail_page)
break;
- /* Ensure the page has older data than head. */
- if (ts < head_page->page->time_stamp)
- break;
-
- ts = head_page->page->time_stamp;
- /* Ensure the page has correct timestamp and some data. */
- if (!ts || rb_page_commit(head_page) == 0)
+ /* Rewind until unused page (no timestamp, no commit). */
+ if (!head_page->page->time_stamp && rb_page_commit(head_page) == 0)
break;
- /* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
- if (ret < 0)
- break;
-
- /* Recover the number of entries and update stats. */
- local_set(&head_page->entries, ret);
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
- entries += ret;
- entry_bytes += rb_page_size(head_page);
+ /*
+ * Skip if the page is invalid, or its timestamp is newer than the
+ * previous valid page.
+ */
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, ts);
+ if (ret < 0) {
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ if (ret > 0)
+ local_inc(&cpu_buffer->pages_touched);
+ ts = head_page->page->time_stamp;
+ }
}
if (i)
pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2026,6 +2044,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Nothing more to do, the only page is the reader page */
goto done;
}
+ ts = head_page->page->time_stamp;
/* Iterate until finding the commit page */
for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
@@ -2034,15 +2053,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->reader_page)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, ts, 0);
if (ret < 0) {
if (!discarded)
pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
} else {
/* If the buffer has content, update pages_touched */
if (ret)
@@ -2050,7 +2066,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
entries += ret;
entry_bytes += rb_page_size(head_page);
- local_set(&head_page->entries, ret);
+ ts = head_page->page->time_stamp;
}
if (head_page == cpu_buffer->commit_page)
break;
@@ -2083,7 +2099,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Reset all the subbuffers */
for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
+ rb_init_page(head_page->page);
}
}
^ permalink raw reply related
* [PATCH v17 2/5] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-04-22 16:16 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177687458572.932171.10907864814735342737.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when validating the persistent ring buffer
instead of discarding the entire ring buffer. Only skipped buffers
are invalidated (cleared).
If the cache data in memory fails to be synchronized during a reboot,
the persistent ring buffer may become partially corrupted, but other
sub-buffers may still contain readable event data. Only discard the
subbuffers that are found to be corrupted.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v17:
- Fix to use rb_page_size() of rewound pages for entry_bytes.
Changes in v15:
- Skip reader_page loop check on persistent ring buffer because
there can be contiguous empty(invalidated) pages.
- Do not show discarded page number information if it is 0.
Changes in v11:
- Fix a typo.
Changes in v9:
- Add meta->subbuf_size check.
- Fix a typo.
- Handle invalid reader_page case.
Changes in v8:
- Add comment in rb_valudate_buffer()
- Clear the RB_MISSED_* flags in rb_valudate_buffer() instead of
skipping subbuf.
- Remove unused subbuf local variable from rb_cpu_meta_valid().
Changes in v7:
- Combined with Handling RB_MISSED_* flags patch, focus on validation at boot.
- Remove checking subbuffer data when validating metadata, because it should be done
later.
- Do not mark the discarded sub buffer page but just reset it.
Changes in v6:
- Show invalid page detection message once per CPU.
Changes in v5:
- Instead of showing errors for each page, just show the number
of discarded pages at last.
Changes in v3:
- Record missed data event on commit.
---
kernel/trace/ring_buffer.c | 111 ++++++++++++++++++++++++++------------------
1 file changed, 66 insertions(+), 45 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index b5ed4c72643e..26507b93cf40 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -370,6 +370,12 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
return local_read(&bpage->page->commit);
}
+/* Size is determined by what has been committed */
+static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
+{
+ return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
/* Range pages are not to be freed */
@@ -1762,7 +1768,6 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
unsigned long *subbuf_mask)
{
int subbuf_size = PAGE_SIZE;
- struct buffer_data_page *subbuf;
unsigned long buffers_start;
unsigned long buffers_end;
int i;
@@ -1770,6 +1775,11 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
if (!subbuf_mask)
return false;
+ if (meta->subbuf_size != PAGE_SIZE) {
+ pr_info("Ring buffer boot meta [%d] invalid subbuf_size\n", cpu);
+ return false;
+ }
+
buffers_start = meta->first_buffer;
buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
@@ -1786,11 +1796,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- subbuf = rb_subbufs_from_meta(meta);
-
bitmap_clear(subbuf_mask, 0, meta->nr_subbufs);
- /* Is the meta buffers and the subbufs themselves have correct data? */
+ /*
+ * Ensure the meta::buffers array has correct data. The data in each subbufs
+ * are checked later in rb_meta_validate_events().
+ */
for (i = 0; i < meta->nr_subbufs; i++) {
if (meta->buffers[i] < 0 ||
meta->buffers[i] >= meta->nr_subbufs) {
@@ -1798,18 +1809,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
- pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu);
- return false;
- }
-
if (test_bit(meta->buffers[i], subbuf_mask)) {
pr_info("Ring buffer boot meta [%d] array has duplicates\n", cpu);
return false;
}
set_bit(meta->buffers[i], subbuf_mask);
- subbuf = (void *)subbuf + subbuf_size;
}
return true;
@@ -1873,13 +1878,22 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+ struct ring_buffer_cpu_meta *meta)
{
unsigned long long ts;
+ unsigned long tail;
u64 delta;
- int tail;
- tail = local_read(&dpage->commit);
+ /*
+ * When a sub-buffer is recovered from a read, the commit value may
+ * have RB_MISSED_* bits set, as these bits are reset on reuse.
+ * Even after clearing these bits, a commit value greater than the
+ * subbuf_size is considered invalid.
+ */
+ tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ if (tail > meta->subbuf_size)
+ return -1;
return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
}
@@ -1890,6 +1904,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
struct buffer_page *head_page, *orig_head;
unsigned long entry_bytes = 0;
unsigned long entries = 0;
+ int discarded = 0;
int ret;
u64 ts;
int i;
@@ -1900,14 +1915,19 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_head = head_page = cpu_buffer->head_page;
/* Do the reader page first */
- ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer reader page is invalid\n");
- goto invalid;
+ pr_info("Ring buffer meta [%d] invalid reader page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&cpu_buffer->reader_page->entries, 0);
+ local_set(&cpu_buffer->reader_page->page->commit, 0);
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(cpu_buffer->reader_page);
+ local_set(&cpu_buffer->reader_page->entries, ret);
}
- entries += ret;
- entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
- local_set(&cpu_buffer->reader_page->entries, ret);
ts = head_page->page->time_stamp;
@@ -1935,7 +1955,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
break;
/* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0)
break;
@@ -1944,7 +1964,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (ret)
local_inc(&cpu_buffer->pages_touched);
entries += ret;
- entry_bytes += rb_page_commit(head_page);
+ entry_bytes += rb_page_size(head_page);
}
if (i)
pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2014,21 +2034,24 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->reader_page)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer meta [%d] invalid buffer page\n",
- cpu_buffer->cpu);
- goto invalid;
- }
-
- /* If the buffer has content, update pages_touched */
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
-
- entries += ret;
- entry_bytes += local_read(&head_page->page->commit);
- local_set(&head_page->entries, ret);
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&head_page->entries, 0);
+ local_set(&head_page->page->commit, 0);
+ } else {
+ /* If the buffer has content, update pages_touched */
+ if (ret)
+ local_inc(&cpu_buffer->pages_touched);
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ local_set(&head_page->entries, ret);
+ }
if (head_page == cpu_buffer->commit_page)
break;
}
@@ -2042,7 +2065,10 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
local_set(&cpu_buffer->entries, entries);
local_set(&cpu_buffer->entries_bytes, entry_bytes);
- pr_info("Ring buffer meta [%d] is from previous boot!\n", cpu_buffer->cpu);
+ pr_info("Ring buffer meta [%d] is from previous boot!", cpu_buffer->cpu);
+ if (discarded)
+ pr_cont(" (%d pages discarded)", discarded);
+ pr_cont("\n");
return;
invalid:
@@ -3329,12 +3355,6 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
return NULL;
}
-/* Size is determined by what has been committed */
-static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
-{
- return rb_page_commit(bpage) & ~RB_MISSED_MASK;
-}
-
static __always_inline unsigned
rb_commit_index(struct ring_buffer_per_cpu *cpu_buffer)
{
@@ -5647,11 +5667,12 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
again:
/*
* This should normally only loop twice. But because the
- * start of the reader inserts an empty page, it causes
- * a case where we will loop three times. There should be no
- * reason to loop four times (that I know of).
+ * start of the reader inserts an empty page, it causes a
+ * case where we will loop three times. There should be no
+ * reason to loop four times unless the ring buffer is a
+ * recovered persistent ring buffer.
*/
- if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
+ if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3 && !cpu_buffer->ring_meta)) {
reader = NULL;
goto out;
}
^ permalink raw reply related
* [PATCH v17 1/5] ring-buffer: Flush and stop persistent ring buffer on panic
From: Masami Hiramatsu (Google) @ 2026-04-22 16:16 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177687458572.932171.10907864814735342737.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
On real hardware, panic and machine reboot may not flush hardware cache
to memory. This means the persistent ring buffer, which relies on a
coherent state of memory, may not have its events written to the buffer
and they may be lost. Moreover, there may be inconsistency with the
counters which are used for validation of the integrity of the
persistent ring buffer which may cause all data to be discarded.
To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.
Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
---
Changes in v13:
- Fix a rebase conflict.
Changes in v11:
- Do nothing by default since flush_cache_vmap() does nothing on x86
but it can cause deadlock on some architectures via on_each_cpu()
because other CPUs will be stoppped when panic notifier is called.
Changes in v9:
- Fix typo of & to &&.
- Fix typo of "Generic"
Changes in v6:
- Introduce asm/ring_buffer.h for arch_ring_buffer_flush_range().
- Use flush_cache_vmap() instead of flush_cache_all().
Changes in v5:
- Use ring_buffer_record_off() instead of ring_buffer_record_disable().
- Use flush_cache_all() to ensure flush all cache.
Changes in v3:
- update patch description.
---
arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/ring_buffer.h | 10 ++++++++++
arch/csky/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/loongarch/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/x86/include/asm/Kbuild | 1 +
arch/xtensa/include/asm/Kbuild | 1 +
include/asm-generic/ring_buffer.h | 13 +++++++++++++
kernel/trace/ring_buffer.c | 22 ++++++++++++++++++++++
23 files changed, 65 insertions(+)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 483965c5a4de..b154b4e3dfa8 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += agp.h
generic-y += asm-offsets.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index 4c69522e0328..483caacc6988 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -5,5 +5,6 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 03657ff8fbe3..decad5f2c826 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -3,6 +3,7 @@ generic-y += early_ioremap.h
generic-y += extable.h
generic-y += flat.h
generic-y += parport.h
+generic-y += ring_buffer.h
generated-y += mach-types.h
generated-y += unistd-nr.h
diff --git a/arch/arm64/include/asm/ring_buffer.h b/arch/arm64/include/asm/ring_buffer.h
new file mode 100644
index 000000000000..62316c406888
--- /dev/null
+++ b/arch/arm64/include/asm/ring_buffer.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_ARM64_RING_BUFFER_H
+#define _ASM_ARM64_RING_BUFFER_H
+
+#include <asm/cacheflush.h>
+
+/* Flush D-cache on persistent ring buffer */
+#define arch_ring_buffer_flush_range(start, end) dcache_clean_pop(start, end)
+
+#endif /* _ASM_ARM64_RING_BUFFER_H */
diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild
index 3a5c7f6e5aac..7dca0c6cdc84 100644
--- a/arch/csky/include/asm/Kbuild
+++ b/arch/csky/include/asm/Kbuild
@@ -9,6 +9,7 @@ generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
generic-y += text-patching.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index 1efa1e993d4b..0f887d4238ed 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += extable.h
generic-y += iomap.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild
index 9034b583a88a..7e92957baf6a 100644
--- a/arch/loongarch/include/asm/Kbuild
+++ b/arch/loongarch/include/asm/Kbuild
@@ -10,5 +10,6 @@ generic-y += qrwlock.h
generic-y += user.h
generic-y += ioctl.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
generic-y += statfs.h
generic-y += text-patching.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index b282e0dd8dc1..62543bf305ff 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -3,5 +3,6 @@ generated-y += syscall_table.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += text-patching.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index 7178f990e8b3..0030309b47ad 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += syscalls.h
generic-y += tlb.h
generic-y += user.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 684569b2ecd6..9771c3d85074 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -12,5 +12,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 28004301c236..0a2530964413 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += cmpxchg.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index cef49d60d74c..8aa34621702d 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -8,4 +8,5 @@ generic-y += spinlock_types.h
generic-y += spinlock.h
generic-y += qrwlock_types.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index 4fb596d94c89..d48d158f7241 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 2e23533b67e3..805b5aeebb6f 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -5,4 +5,5 @@ generated-y += syscall_table_spu.h
generic-y += agp.h
generic-y += mcs_spinlock.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += early_ioremap.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index bd5fc9403295..7721b63642f4 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -14,5 +14,6 @@ generic-y += ticket_spinlock.h
generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 80bad7de7a04..0c1fc47c3ba0 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,3 +7,4 @@ generated-y += unistd_nr.h
generic-y += asm-offsets.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 4d3f10ed8275..f0403d3ee8ab 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -3,4 +3,5 @@ generated-y += syscall_table.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 17ee8a273aa6..49c6bb326b75 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 1b9b82bbe322..2a1629ba8140 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -17,6 +17,7 @@ generic-y += module.lds.h
generic-y += parport.h
generic-y += percpu.h
generic-y += preempt.h
+generic-y += ring_buffer.h
generic-y += runtime-const.h
generic-y += softirq_stack.h
generic-y += switch_to.h
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4566000e15c4..078fd2c0d69d 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -14,3 +14,4 @@ generic-y += early_ioremap.h
generic-y += fprobe.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 13fe45dea296..e57af619263a 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -6,5 +6,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/include/asm-generic/ring_buffer.h b/include/asm-generic/ring_buffer.h
new file mode 100644
index 000000000000..201d2aee1005
--- /dev/null
+++ b/include/asm-generic/ring_buffer.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Generic arch dependent ring_buffer macros.
+ */
+#ifndef __ASM_GENERIC_RING_BUFFER_H__
+#define __ASM_GENERIC_RING_BUFFER_H__
+
+#include <linux/cacheflush.h>
+
+/* Flush cache on ring buffer range if needed. Do nothing by default. */
+#define arch_ring_buffer_flush_range(start, end) do { } while (0)
+
+#endif /* __ASM_GENERIC_RING_BUFFER_H__ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cef49f8871d2..b5ed4c72643e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7,6 +7,7 @@
#include <linux/ring_buffer_types.h>
#include <linux/sched/isolation.h>
#include <linux/trace_recursion.h>
+#include <linux/panic_notifier.h>
#include <linux/trace_events.h>
#include <linux/ring_buffer.h>
#include <linux/trace_clock.h>
@@ -31,6 +32,7 @@
#include <linux/oom.h>
#include <linux/mm.h>
+#include <asm/ring_buffer.h>
#include <asm/local64.h>
#include <asm/local.h>
#include <asm/setup.h>
@@ -559,6 +561,7 @@ struct trace_buffer {
unsigned long range_addr_start;
unsigned long range_addr_end;
+ struct notifier_block flush_nb;
struct ring_buffer_meta *meta;
@@ -2520,6 +2523,16 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+/* Stop recording on a persistent buffer and flush cache if needed. */
+static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
+{
+ struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
+
+ ring_buffer_record_off(buffer);
+ arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
+ return NOTIFY_DONE;
+}
+
static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
int order, unsigned long start,
unsigned long end,
@@ -2650,6 +2663,12 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
mutex_init(&buffer->mutex);
+ /* Persistent ring buffer needs to flush cache before reboot. */
+ if (start && end) {
+ buffer->flush_nb.notifier_call = rb_flush_buffer_cb;
+ atomic_notifier_chain_register(&panic_notifier_list, &buffer->flush_nb);
+ }
+
return_ptr(buffer);
fail_free_buffers:
@@ -2748,6 +2767,9 @@ ring_buffer_free(struct trace_buffer *buffer)
{
int cpu;
+ if (buffer->range_addr_start && buffer->range_addr_end)
+ atomic_notifier_chain_unregister(&panic_notifier_list, &buffer->flush_nb);
+
cpuhp_state_remove_instance(CPUHP_TRACE_RB_PREPARE, &buffer->node);
irq_work_sync(&buffer->irq_work.work);
^ permalink raw reply related
* [PATCH v17 0/5] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu (Google) @ 2026-04-22 16:16 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
Hi,
Here is the 17th version of improvement patches for making persistent
ring buffers robust to failures.
The previous version is here:
https://lore.kernel.org/all/177547105523.259641.14385891517704197263.stgit@mhiramat.tok.corp.google.com/
This version fixes some review comments from Sashiko[1], which
includes:
[2/5] Fix to use rb_page_size() of rewound pages for entry_bytes.
[3/5] - Fix to verify head_page at first before using its timestamp.
- Reset timestamp if the page is invalid.
[4/5] - In rb_test_inject_invalid_pages(), changed entry_bytes and
idx to unsigned long
- Added NULL checks for cpu_buffer and meta.
- In allocate_trace_buffer(), added a NULL check for tr->name
before comparing it with strcmp.
[5/5] Added NULL check for dpage in rbm_show in ring_buffer.c.
[1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com
Thank you,
Masami Hiramatsu (Google) (5):
ring-buffer: Flush and stop persistent ring buffer on panic
ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
ring-buffer: Add persistent ring buffer invalid-page inject test
ring-buffer: Show commit numbers in buffer_meta file
arch/alpha/include/asm/Kbuild | 1
arch/arc/include/asm/Kbuild | 1
arch/arm/include/asm/Kbuild | 1
arch/arm64/include/asm/ring_buffer.h | 10 +
arch/csky/include/asm/Kbuild | 1
arch/hexagon/include/asm/Kbuild | 1
arch/loongarch/include/asm/Kbuild | 1
arch/m68k/include/asm/Kbuild | 1
arch/microblaze/include/asm/Kbuild | 1
arch/mips/include/asm/Kbuild | 1
arch/nios2/include/asm/Kbuild | 1
arch/openrisc/include/asm/Kbuild | 1
arch/parisc/include/asm/Kbuild | 1
arch/powerpc/include/asm/Kbuild | 1
arch/riscv/include/asm/Kbuild | 1
arch/s390/include/asm/Kbuild | 1
arch/sh/include/asm/Kbuild | 1
arch/sparc/include/asm/Kbuild | 1
arch/um/include/asm/Kbuild | 1
arch/x86/include/asm/Kbuild | 1
arch/xtensa/include/asm/Kbuild | 1
include/asm-generic/ring_buffer.h | 13 ++
include/linux/ring_buffer.h | 1
kernel/trace/Kconfig | 34 ++++
kernel/trace/ring_buffer.c | 275 ++++++++++++++++++++++++++--------
kernel/trace/trace.c | 4
26 files changed, 290 insertions(+), 67 deletions(-)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
base-commit: 6170922f137231b98fc568571befef63e1edff3f
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH] ftrace: fix use-after-free of mod->name in function_stat_show()
From: Xiang Gao @ 2026-04-22 9:35 UTC (permalink / raw)
To: rostedt
Cc: mhiramat, mark.rutland, mathieu.desnoyers, linux-kernel,
linux-trace-kernel, gaoxiang17, Xiang Gao
In-Reply-To: <20260417101814.22d5c21b@fedora>
On Fri, 17 Apr 2026 10:18:14 -0400, Steven Rostedt wrote:
> Was AI used for any part of this patch? Including finding the bug? If
> so, it must be disclosed.
Yes, AI was used. Claude (claude-opus-4-7) assisted in both finding
the bug and drafting the fix. I reviewed the analysis and took
responsibility for the submission, but I should have disclosed this
up front per Documentation/process/coding-assistants.rst. I
apologize for the oversight, and I will add an
Assisted-by: Claude:claude-opus-4-7 tag in the follow-up.
> Just move guard(rcu) out of this if statement to include the below
> reference. No need to make the code worse. This really looks like
> AI slop :-(
You are right. Hoisting guard(rcu)() to the top of the
if (tr->trace_flags & TRACE_ITER(PROF_TEXT_OFFSET)) {
block so its scope covers the single snprintf() after the if/else is
the correct fix -- +1/-1, net zero, instead of duplicating snprintf()
into both branches as I did. I should have recognized this instead of
submitting the first plausible-looking approach.
I will send a follow-up patch that restores the single snprintf()
after the if/else and hoists guard(rcu)() to cover it, with the
Subject capitalized ("ftrace: Fix ...") and
Assisted-by: Claude:claude-opus-4-7 added.
Thanks for the review and for pushing back on the approach.
Xiang
^ permalink raw reply
* [PATCH v2] tracepoint: Fix typo in tracepoint.h comment
From: Sheng Che Peng @ 2026-04-22 2:18 UTC (permalink / raw)
To: rostedt, mathieu.desnoyers
Cc: linux-trace-kernel, linux-kernel, Sheng Che Peng
Change "my" to "may" in the description of subsystem configurations.
Signed-off-by: Sheng Che Peng <synte4028@gmail.com>
---
include/linux/tracepoint.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 578e520b6ee6c..763eea4d80d87 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -202,7 +202,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
#define TP_CONDITION(args...) args
/*
- * Individual subsystem my have a separate configuration to
+ * Individual subsystem may have a separate configuration to
* enable their tracepoints. By default, this file will create
* the tracepoints if CONFIG_TRACEPOINTS is defined. If a subsystem
* wants to be able to disable its tracepoints from being created
--
2.34.1
^ permalink raw reply related
* Re: [PATCH net v1] net: validate skb->napi_id in RX tracepoints
From: Jiayuan Chen @ 2026-04-22 1:55 UTC (permalink / raw)
To: Kohei Enju
Cc: netdev, linux-trace-kernel, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <aeYTImGIRuOf72mi@x1>
On 4/20/26 7:54 PM, Kohei Enju wrote:
> On 04/20 19:27, Jiayuan Chen wrote:
>> On 4/20/26 6:54 PM, Kohei Enju wrote:
>>> Since commit 2bd82484bb4c ("xps: fix xps for stacked devices"),
>>> skb->napi_id shares storage with sender_cpu. RX tracepoints using
>>> net_dev_rx_verbose_template read skb->napi_id directly and can therefore
>>> report sender_cpu values as if they were NAPI IDs.
>>>
>>> For example, on the loopback path this can report 1 as napi_id, where 1
>> So I think veth_forward_skb->__netif_rx could be affected as well?
> Yes. Just in case, I've confirmed the same behavior in the veth path.
> The mentioned loopback path is just a single example of possibly
> affected paths.
>
> Thanks,
> Kohei
>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
^ permalink raw reply
* Re: [PATCH] kernel/trace/ftrace: introduce ftrace module notifier
From: Song Liu @ 2026-04-21 22:40 UTC (permalink / raw)
To: Song Chen
Cc: Petr Mladek, Steven Rostedt, Miroslav Benes, mcgrof, petr.pavlu,
da.gomez, samitolvanen, atomlin, mhiramat, mark.rutland,
mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, live-patching
In-Reply-To: <4037aa19-1b01-4076-b823-5cc0e43becac@189.cn>
Hi,
I am replying partially to make sure folks know there are two
persons with the same first name.
On Sun, Apr 12, 2026 at 7:11 AM Song Chen <chensong_2000@189.cn> wrote:
[...]
> >
> > + We would need to make sure that it does not break some
> > existing "hidden" dependencies.
> >
> Thanks so much, this is the solution i'm working on. I replaced next
> with a list_head in notifier_block and implemented
> anotifier_call_chain_reverse to address the order issues, like your
> suggestion. And a new robust revision for rolling back.
I personally don't think there is strong enough motivation to make
changes like the following. If there is indeed strong motivations,
please make it clear in the next revision.
Thanks,
Song
^ permalink raw reply
* Re: [PATCH] kernel: trace: do not generate undefsyms_base.c
From: Nathan Chancellor @ 2026-04-21 21:51 UTC (permalink / raw)
To: Paolo Bonzini
Cc: linux-kernel, kvm, torvalds, rostedt, Marc Zyngier, Arnd Bergmann,
linux-trace-kernel
In-Reply-To: <20260421100455.324333-1-pbonzini@redhat.com>
On Tue, Apr 21, 2026 at 11:04:55AM +0100, Paolo Bonzini wrote:
> The code to autogenerate undefsyms_base.c in the Makefile is larger
> than the file itself.
>
> Remove the "echo" indirection that creates the file, which keeps
> the build system sane and makes it much easier to edit it if/when
> new situations arrive.
>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yeah, I don't really know how I did not see this originally :/ tunnel
vision is real I suppose.
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
> ---
> kernel/trace/.gitignore | 1 -
> kernel/trace/Makefile | 35 ++++-------------------------------
> kernel/trace/undefsyms_base.c | 28 ++++++++++++++++++++++++++++
> 3 files changed, 32 insertions(+), 32 deletions(-)
> delete mode 100644 kernel/trace/.gitignore
> create mode 100644 kernel/trace/undefsyms_base.c
>
> diff --git a/kernel/trace/.gitignore b/kernel/trace/.gitignore
> deleted file mode 100644
> index 6adbb09d6deb..000000000000
> --- a/kernel/trace/.gitignore
> +++ /dev/null
> @@ -1 +0,0 @@
> -/undefsyms_base.c
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index 4d4229e5eec4..1decdce8cbef 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -133,41 +133,14 @@ obj-$(CONFIG_TRACE_REMOTE) += trace_remote.o
> obj-$(CONFIG_SIMPLE_RING_BUFFER) += simple_ring_buffer.o
> obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o
>
> -#
> # simple_ring_buffer is used by the pKVM hypervisor which does not have access
> # to all kernel symbols. Fail the build if forbidden symbols are found.
> -#
> -# undefsyms_base generates a set of compiler and tooling-generated symbols that can
> -# safely be ignored for simple_ring_buffer.
> -#
> -filechk_undefsyms_base = \
> - echo '$(pound)include <linux/atomic.h>'; \
> - echo '$(pound)include <linux/string.h>'; \
> - echo '$(pound)include <asm/page.h>'; \
> - echo 'static char page[PAGE_SIZE] __aligned(PAGE_SIZE);'; \
> - echo 'void undefsyms_base(void *p, int n);'; \
> - echo 'void undefsyms_base(void *p, int n) {'; \
> - echo ' char buffer[256] = { 0 };'; \
> - echo ' u32 u = 0;'; \
> - echo ' memset((char * volatile)page, 8, PAGE_SIZE);'; \
> - echo ' memset((char * volatile)buffer, 8, sizeof(buffer));'; \
> - echo ' memcpy((void * volatile)p, buffer, sizeof(buffer));'; \
> - echo ' cmpxchg((u32 * volatile)&u, 0, 8);'; \
> - echo ' WARN_ON(n == 0xdeadbeef);'; \
> - echo '}'
> -
> -$(obj)/undefsyms_base.c: FORCE
> - $(call filechk,undefsyms_base)
> -
> -clean-files += undefsyms_base.c
> -
> -$(obj)/undefsyms_base.o: $(obj)/undefsyms_base.c
>
> +# Basic compiler and tooling-generated symbols that can safely be left
> +# undefined. Ensure KASAN is enabled to avoid logic that may disable
> +# FORTIFY_SOURCE when KASAN is not enabled. undefsyms_base.o does not
> +# automatically get KASAN flags because it is not linked into vmlinux.
> targets += undefsyms_base.o
> -
> -# Ensure KASAN is enabled to avoid logic that may disable FORTIFY_SOURCE when
> -# KASAN is not enabled. undefsyms_base.o does not automatically get KASAN flags
> -# because it is not linked into vmlinux.
> KASAN_SANITIZE_undefsyms_base.o := y
>
> UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \
> diff --git a/kernel/trace/undefsyms_base.c b/kernel/trace/undefsyms_base.c
> new file mode 100644
> index 000000000000..e65baf58e6ff
> --- /dev/null
> +++ b/kernel/trace/undefsyms_base.c
> @@ -0,0 +1,28 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * simple_ring_buffer is used by the pKVM hypervisor which does not have access
> + * to all kernel symbols. Whatever is undefined when compiling this file is
> + * compiler and tooling-generated symbols that can safely be ignored for
> + * simple_ring_buffer.
> + */
> +
> +#include <linux/atomic.h>
> +#include <linux/string.h>
> +#include <asm/page.h>
> +
> +void undefsyms_base(void *p, int n);
> +
> +static char page[PAGE_SIZE] __aligned(PAGE_SIZE);
> +
> +void undefsyms_base(void *p, int n)
> +{
> + char buffer[256] = { 0 };
> +
> + u32 u = 0;
> + memset((char * volatile)page, 8, PAGE_SIZE);
> + memset((char * volatile)buffer, 8, sizeof(buffer));
> + memcpy((void * volatile)p, buffer, sizeof(buffer));
> + cmpxchg((u32 * volatile)&u, 0, 8);
> + WARN_ON(n == 0xdeadbeef);
> +}
> --
> 2.53.0
>
^ permalink raw reply
* Re: [PATCH] kernel: trace: do not generate undefsyms_base.c
From: Steven Rostedt @ 2026-04-21 19:56 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, linux-kernel, kvm, torvalds, Arnd Bergmann,
Nathan Chancellor, linux-trace-kernel
In-Reply-To: <CABgObfam470W_Cq-jFHXc7H=xkr4CRMOGrbuLd-HtkepvwA3AQ@mail.gmail.com>
On Tue, 21 Apr 2026 15:21:19 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:
> On Tue, Apr 21, 2026 at 3:18 PM Marc Zyngier <maz@kernel.org> wrote:
> > On Tue, 21 Apr 2026 15:02:09 +0100, Steven Rostedt <rostedt@goodmis.org> wrote:
> > > Marc beat you to it by a few minutes ;-)
> > >
> > > https://lore.kernel.org/all/20260421095446.2951646-1-maz@kernel.org/
> > >
> > > Although he didn't Cc linux-trace-kernel@vger.kernel.org :-p
> > >
> > > But anyway, it ended up in my internal Patchwork. I'll take his as they are
> > > both pretty much the same patch.
> >
> > On the other hand, Paolo's patch has the SPDX tag on the new file,
> > which I forgot to add. Your call, anyway.
Ah, OK. I'll take Paolo's.
>
> Yeah if you want to get the best of both worlds, take the commit
> message from Marc's and the code from mine; he has more complete
> trailers whereas I have slightly cleaner formatting and more comments.
> But otherwise it's not a big deal.
I'll update the subject and change log slightly, but otherwise, it's your
patch.
-- Steve
^ permalink raw reply
* Re: [PATCH net v1] net: validate skb->napi_id in RX tracepoints
From: Simon Horman @ 2026-04-21 16:33 UTC (permalink / raw)
To: Kohei Enju
Cc: netdev, linux-trace-kernel, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers
In-Reply-To: <20260420105427.162816-1-kohei@enjuk.jp>
On Mon, Apr 20, 2026 at 10:54:23AM +0000, Kohei Enju wrote:
> Since commit 2bd82484bb4c ("xps: fix xps for stacked devices"),
> skb->napi_id shares storage with sender_cpu. RX tracepoints using
> net_dev_rx_verbose_template read skb->napi_id directly and can therefore
> report sender_cpu values as if they were NAPI IDs.
>
> For example, on the loopback path this can report 1 as napi_id, where 1
> comes from raw_smp_processor_id() + 1 in the XPS path:
>
> # bpftrace -e 'tracepoint:net:netif_rx_entry{ print(args->napi_id); }'
> # taskset -c 0 ping -c 1 ::1
>
> Report only valid NAPI IDs in these tracepoints and use 0 otherwise.
>
> Fixes: 2bd82484bb4c ("xps: fix xps for stacked devices")
> Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Simon Horman <horms@kernel.org>
> ---
> include/trace/events/net.h | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/trace/events/net.h b/include/trace/events/net.h
> index fdd9ad474ce3..dbc2c5598e35 100644
> --- a/include/trace/events/net.h
> +++ b/include/trace/events/net.h
> @@ -10,6 +10,7 @@
> #include <linux/if_vlan.h>
> #include <linux/ip.h>
> #include <linux/tracepoint.h>
> +#include <net/busy_poll.h>
>
> TRACE_EVENT(net_dev_start_xmit,
>
> @@ -208,7 +209,8 @@ DECLARE_EVENT_CLASS(net_dev_rx_verbose_template,
> TP_fast_assign(
> __assign_str(name);
> #ifdef CONFIG_NET_RX_BUSY_POLL
> - __entry->napi_id = skb->napi_id;
> + __entry->napi_id = napi_id_valid(skb->napi_id) ?
> + skb->napi_id : 0;
Note to self: they key is that if the storage at napi_id is
being used as a sender_cpu then napi_id_valid because
the valid values for a sender_cpu are disjoint from those
of a valid napi_id. This can be seen clearly in the
implementation of napi_id_valid() and the comment above it.
> #else
> __entry->napi_id = 0;
> #endif
> --
> 2.51.0
>
^ permalink raw reply
* Re: [PATCH] kernel: trace: do not generate undefsyms_base.c
From: Paolo Bonzini @ 2026-04-21 14:21 UTC (permalink / raw)
To: Marc Zyngier
Cc: Steven Rostedt, linux-kernel, kvm, torvalds, Arnd Bergmann,
Nathan Chancellor, linux-trace-kernel
In-Reply-To: <86ik9k1jqs.wl-maz@kernel.org>
On Tue, Apr 21, 2026 at 3:18 PM Marc Zyngier <maz@kernel.org> wrote:
> On Tue, 21 Apr 2026 15:02:09 +0100, Steven Rostedt <rostedt@goodmis.org> wrote:
> > Marc beat you to it by a few minutes ;-)
> >
> > https://lore.kernel.org/all/20260421095446.2951646-1-maz@kernel.org/
> >
> > Although he didn't Cc linux-trace-kernel@vger.kernel.org :-p
> >
> > But anyway, it ended up in my internal Patchwork. I'll take his as they are
> > both pretty much the same patch.
>
> On the other hand, Paolo's patch has the SPDX tag on the new file,
> which I forgot to add. Your call, anyway.
Yeah if you want to get the best of both worlds, take the commit
message from Marc's and the code from mine; he has more complete
trailers whereas I have slightly cleaner formatting and more comments.
But otherwise it's not a big deal.
Paolo
^ permalink raw reply
* Re: [PATCH] kernel: trace: do not generate undefsyms_base.c
From: Marc Zyngier @ 2026-04-21 14:16 UTC (permalink / raw)
To: Steven Rostedt
Cc: Paolo Bonzini, linux-kernel, kvm, torvalds, Arnd Bergmann,
Nathan Chancellor, linux-trace-kernel
In-Reply-To: <20260421100209.7535e26f@gandalf.local.home>
On Tue, 21 Apr 2026 15:02:09 +0100,
Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 21 Apr 2026 11:04:55 +0100
> Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> > The code to autogenerate undefsyms_base.c in the Makefile is larger
> > than the file itself.
> >
> > Remove the "echo" indirection that creates the file, which keeps
> > the build system sane and makes it much easier to edit it if/when
> > new situations arrive.
> >
> > Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>
> Marc beat you to it by a few minutes ;-)
>
> https://lore.kernel.org/all/20260421095446.2951646-1-maz@kernel.org/
>
> Although he didn't Cc linux-trace-kernel@vger.kernel.org :-p
>
> But anyway, it ended up in my internal Patchwork. I'll take his as they are
> both pretty much the same patch.
On the other hand, Paolo's patch has the SPDX tag on the new file,
which I forgot to add. Your call, anyway.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply
* Re: [PATCH] kernel: trace: do not generate undefsyms_base.c
From: Steven Rostedt @ 2026-04-21 14:02 UTC (permalink / raw)
To: Paolo Bonzini
Cc: linux-kernel, kvm, torvalds, Marc Zyngier, Arnd Bergmann,
Nathan Chancellor, linux-trace-kernel
In-Reply-To: <20260421100455.324333-1-pbonzini@redhat.com>
On Tue, 21 Apr 2026 11:04:55 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:
> The code to autogenerate undefsyms_base.c in the Makefile is larger
> than the file itself.
>
> Remove the "echo" indirection that creates the file, which keeps
> the build system sane and makes it much easier to edit it if/when
> new situations arrive.
>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marc beat you to it by a few minutes ;-)
https://lore.kernel.org/all/20260421095446.2951646-1-maz@kernel.org/
Although he didn't Cc linux-trace-kernel@vger.kernel.org :-p
But anyway, it ended up in my internal Patchwork. I'll take his as they are
both pretty much the same patch.
Thanks,
-- Steve
^ permalink raw reply
* [PATCH] kernel: trace: do not generate undefsyms_base.c
From: Paolo Bonzini @ 2026-04-21 10:04 UTC (permalink / raw)
To: linux-kernel, kvm
Cc: torvalds, rostedt, Marc Zyngier, Arnd Bergmann, Nathan Chancellor,
linux-trace-kernel
The code to autogenerate undefsyms_base.c in the Makefile is larger
than the file itself.
Remove the "echo" indirection that creates the file, which keeps
the build system sane and makes it much easier to edit it if/when
new situations arrive.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
kernel/trace/.gitignore | 1 -
kernel/trace/Makefile | 35 ++++-------------------------------
kernel/trace/undefsyms_base.c | 28 ++++++++++++++++++++++++++++
3 files changed, 32 insertions(+), 32 deletions(-)
delete mode 100644 kernel/trace/.gitignore
create mode 100644 kernel/trace/undefsyms_base.c
diff --git a/kernel/trace/.gitignore b/kernel/trace/.gitignore
deleted file mode 100644
index 6adbb09d6deb..000000000000
--- a/kernel/trace/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-/undefsyms_base.c
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 4d4229e5eec4..1decdce8cbef 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -133,41 +133,14 @@ obj-$(CONFIG_TRACE_REMOTE) += trace_remote.o
obj-$(CONFIG_SIMPLE_RING_BUFFER) += simple_ring_buffer.o
obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o
-#
# simple_ring_buffer is used by the pKVM hypervisor which does not have access
# to all kernel symbols. Fail the build if forbidden symbols are found.
-#
-# undefsyms_base generates a set of compiler and tooling-generated symbols that can
-# safely be ignored for simple_ring_buffer.
-#
-filechk_undefsyms_base = \
- echo '$(pound)include <linux/atomic.h>'; \
- echo '$(pound)include <linux/string.h>'; \
- echo '$(pound)include <asm/page.h>'; \
- echo 'static char page[PAGE_SIZE] __aligned(PAGE_SIZE);'; \
- echo 'void undefsyms_base(void *p, int n);'; \
- echo 'void undefsyms_base(void *p, int n) {'; \
- echo ' char buffer[256] = { 0 };'; \
- echo ' u32 u = 0;'; \
- echo ' memset((char * volatile)page, 8, PAGE_SIZE);'; \
- echo ' memset((char * volatile)buffer, 8, sizeof(buffer));'; \
- echo ' memcpy((void * volatile)p, buffer, sizeof(buffer));'; \
- echo ' cmpxchg((u32 * volatile)&u, 0, 8);'; \
- echo ' WARN_ON(n == 0xdeadbeef);'; \
- echo '}'
-
-$(obj)/undefsyms_base.c: FORCE
- $(call filechk,undefsyms_base)
-
-clean-files += undefsyms_base.c
-
-$(obj)/undefsyms_base.o: $(obj)/undefsyms_base.c
+# Basic compiler and tooling-generated symbols that can safely be left
+# undefined. Ensure KASAN is enabled to avoid logic that may disable
+# FORTIFY_SOURCE when KASAN is not enabled. undefsyms_base.o does not
+# automatically get KASAN flags because it is not linked into vmlinux.
targets += undefsyms_base.o
-
-# Ensure KASAN is enabled to avoid logic that may disable FORTIFY_SOURCE when
-# KASAN is not enabled. undefsyms_base.o does not automatically get KASAN flags
-# because it is not linked into vmlinux.
KASAN_SANITIZE_undefsyms_base.o := y
UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \
diff --git a/kernel/trace/undefsyms_base.c b/kernel/trace/undefsyms_base.c
new file mode 100644
index 000000000000..e65baf58e6ff
--- /dev/null
+++ b/kernel/trace/undefsyms_base.c
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * simple_ring_buffer is used by the pKVM hypervisor which does not have access
+ * to all kernel symbols. Whatever is undefined when compiling this file is
+ * compiler and tooling-generated symbols that can safely be ignored for
+ * simple_ring_buffer.
+ */
+
+#include <linux/atomic.h>
+#include <linux/string.h>
+#include <asm/page.h>
+
+void undefsyms_base(void *p, int n);
+
+static char page[PAGE_SIZE] __aligned(PAGE_SIZE);
+
+void undefsyms_base(void *p, int n)
+{
+ char buffer[256] = { 0 };
+
+ u32 u = 0;
+ memset((char * volatile)page, 8, PAGE_SIZE);
+ memset((char * volatile)buffer, 8, sizeof(buffer));
+ memcpy((void * volatile)p, buffer, sizeof(buffer));
+ cmpxchg((u32 * volatile)&u, 0, 8);
+ WARN_ON(n == 0xdeadbeef);
+}
--
2.53.0
^ permalink raw reply related
* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Petr Mladek @ 2026-04-21 9:05 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: chensong_2000, rafael, lenb, mturquette, sboyd, viresh.kumar, agk,
snitzer, mpatocka, bmarzins, song, yukuai, linan122, jason.wessel,
danielt, dianders, horms, davem, edumazet, kuba, pabeni, paulmck,
frederic, mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin,
jpoimboe, jikos, mbenes, joe.lawrence, rostedt, mark.rutland,
mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260420144429.57b45f2beece690bceea96ec@kernel.org>
On Mon 2026-04-20 14:44:29, Masami Hiramatsu wrote:
> Hi Song,
>
> On Wed, 15 Apr 2026 15:01:37 +0800
> chensong_2000@189.cn wrote:
>
> > From: Song Chen <chensong_2000@189.cn>
> >
> > The current notifier chain implementation uses a single-linked list
> > (struct notifier_block *next), which only supports forward traversal
> > in priority order. This makes it difficult to handle cleanup/teardown
> > scenarios that require notifiers to be called in reverse priority order.
>
> What about introducing a new notification callback API that allows you
> to describe dependencies between callback functions?
>
> For example, when registering a callback, you could register a string
> as an ID and specify whether to call it before or after that ID,
> or you could register a comparison function that is called when adding
> to a list. (I prefer @name and @depends fields so that it can be easily
> maintained.)
This looks too complex. It would make sense only
when this API has more users.
Also this won't be enough for the ftrace/livepatch callbacks.
They need to be ordered against against each other. But they
also need to be called before/after all other callbacks.
For example, when the module is loaded:
+ 1st frace
+ 2nd livepatch
+ then other notifiers
See the commit c1bf08ac26e92122 ("ftrace: Be first to run code
modification on modules").
> This would allow for better dependency building when adding to the list.
> >
> > A concrete example is the ordering dependency between ftrace and
> > livepatch during module load/unload. see the detail here [1].
>
> If this only concerns notification callback issues with the ftrace
> and livepatch modules, it's far more robust to simply call the
> necessary processing directly when the modules load and unload,
> rather than registering notification callbacks externally.
>
> There are fprobe, kprobe and its trace-events, all of them are using
> ftrace as its fundation layer. In this case, I always needs to
> consider callback order when a module is unloaded.
>
> If ftrace is working as a part of module callbacks, it will conflict
> with fprobe/kprobe module callback. Of course we can reorder it with
> modifying its priority. But this is ugly, because when we introduce
> a new other feature which depends on another layer, we need to
> reorder the callback's priority number on the list.
>
> Based on the above, I don't think this can be resolved simply by
> changing the list of notification callbacks to a bidirectional list.
I agree. I would keep it as is (hardcoded).
Best Regards,
Petr
^ permalink raw reply
* Re: [PATCHv5 bpf-next 20/28] libbpf: Add support to create tracing multi link
From: Jiri Olsa @ 2026-04-21 8:57 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <658609fdf8d0b9ee59a5a516fd7dd2460ef7cb52b9fca779094dfb761e43aa82@mail.kernel.org>
On Sat, Apr 18, 2026 at 06:10:58AM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 49368ee161b1..9bf7e8754c7b 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
>
> [ ... ]
>
> > @@ -12510,6 +12592,197 @@ bool btf_type_is_traceable_func(const struct btf *btf, const struct btf_type *t
> > return true;
> > }
> >
> > +static int collect_func_ids_by_glob(const struct bpf_program *prog, const char *pattern, __u32 **ids)
> > +{
> > + struct bpf_object *obj = prog->obj;
> > + const struct module_btf *mod;
> > + struct btf *btf = NULL;
> > + const char *sep;
> > + int err;
> > +
> > + err = bpf_object__load_vmlinux_btf(obj, true);
> > + if (err)
> > + return err;
> > +
> > + /* In case we have module specified, we will find its btf and use that. */
> > + sep = strchr(pattern, ':');
> > + if (sep) {
> > + mod = find_attach_module(obj, pattern);
> > + if (!mod) {
> > + err = -EINVAL;
> > + goto cleanup;
> > + }
> > + btf = mod->btf;
> > + pattern = sep + 1;
> > + } else {
> > + /* Program is loaded for kernel module. */
> > + if (prog->attach_btf_obj_fd)
> > + return -EINVAL;
> > + btf = obj->btf_vmlinux;
> > + }
> > +
> > + err = collect_btf_func_ids_by_glob(btf, pattern, ids);
> > +
> > +cleanup:
> > + bpf_object_cleanup_btf(obj);
> > + return err;
> > +}
>
> Does this function leak the vmlinux BTF when prog->attach_btf_obj_fd is
> set? The early return bypasses the cleanup label that calls
> bpf_object_cleanup_btf(obj). Every other error path goes through
> cleanup, which suggests the BTF loaded by bpf_object__load_vmlinux_btf()
> needs to be cleaned up before returning.
>
> The design pattern here is that bpf_object__load_vmlinux_btf(obj, true)
> is called specifically at attach-time (after the object has already been
> post-load-cleaned), so obj->btf_vmlinux is freshly loaded. Skipping the
> cleanup leaves that freshly-loaded btf_vmlinux attached to the bpf_object
> until bpf_object__close().
>
> Should this be 'err = -EINVAL; goto cleanup;' to match the other error
> paths?
yea, should be 'goto cleanup', will fix
jirka
^ permalink raw reply
* Re: [PATCHv5 bpf-next 12/28] bpf: Add bpf_trampoline_multi_attach/detach functions
From: Jiri Olsa @ 2026-04-21 8:56 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <0fefc9ca058787dddef9c125163a0182c6ddd1f47d2754f2ddbb5a82bfc9126a@mail.kernel.org>
On Sat, Apr 18, 2026 at 06:10:57AM +0000, bot+bpf-ci@kernel.org wrote:
SNIP
> > + err = __bpf_trampoline_link_prog(&mnode->node, mnode->trampoline, NULL,
> > + &trampoline_multi_ops, data);
> > + if (err) {
> > + rollback_cnt = i;
> > + goto rollback_unlink;
> > + }
> > + }
>
> When user-provided ids[] contains duplicate BTF IDs (or distinct IDs
> that resolve to the same trampoline key), multiple nodes point to the
> same struct bpf_trampoline. The link loop above then calls
> bpf_trampoline_multi_attach_init() more than once on that trampoline,
> overwriting the saved old_image with the newly assigned cur_image from
> the previous iteration.
>
> Scenario with ids[0] == ids[1] and trampoline X starting with OLD_X:
>
> i=0: attach_init(X) saves old_image=OLD_X
> __bpf_trampoline_link_prog() -> modify_fentry_multi() sets
> X->cur_image=NEW_X (OLD_X refcount not dropped, intent is for
> multi_attach_free() to release it later).
>
> i=1: attach_init(X) re-runs on the same trampoline and overwrites
> old_image=NEW_X (the only saved reference to OLD_X is lost).
> __bpf_trampoline_link_prog() returns -EBUSY (duplicate prog).
> rollback_cnt=i=1; goto rollback_unlink.
>
> rollback_unlink calls bpf_trampoline_multi_attach_rollback(X) once:
>
> static void bpf_trampoline_multi_attach_rollback(struct bpf_trampoline *tr)
> {
> if (tr->cur_image)
> bpf_tramp_image_put(tr->cur_image); /* puts NEW_X */
> tr->cur_image = tr->multi_attach.old_image; /* = NEW_X (stale!) */
> ...
> }
>
> Result: OLD_X leaks (its refcount is never decremented). cur_image
> points at NEW_X which was just released via bpf_tramp_image_put().
>
> The rollback_put loop calls bpf_trampoline_put() which frees the
> trampoline (prog was removed, progs_hlist is empty), so the dangling
> NEW_X pointer is cleaned up. However OLD_X remains allocated via
> module_alloc forever.
>
> Reachability: commit dbf2afe2f603 ("bpf: Add support for tracing multi
> link") copies ids[] from userspace without deduplication. The series'
> selftests (commit 411fb40d4b2a "selftests/bpf: Add tracing multi attach
> fails test", 'fail#7 (kernel) attach with duplicate id') exercise this
> path explicitly, expecting -EBUSY.
>
> The trigger requires the shared trampoline to already have cur_image !=
> NULL when the first iteration runs (i.e., another program attached to
> the same function), so modify_fentry_multi() is used and there is a
> real OLD_X to leak. The first-time-attach case (register_fentry_multi,
> OLD_X=NULL) only leaves cur_image dangling briefly until the trampoline
> is freed.
>
> Can the code save old_image/old_flags once per trampoline instead of
> once per node, or store them per-mnode to avoid overwriting the saved
> state?
I think it's better to ensure the btf ids array does not contain
duplicate entries.. wil check
>
> > +
> > + rollback_cnt = link->nodes_cnt;
> > + if (ftrace_hash_count(data->reg)) {
> > + err = update_ftrace_direct_add(&direct_ops, data->reg);
> > + if (err)
> > + goto rollback_unlink;
> > + }
> > +
> > + if (ftrace_hash_count(data->modify)) {
> > + err = update_ftrace_direct_mod(&direct_ops, data->modify, true);
> > + if (err) {
> > + WARN_ON_ONCE(update_ftrace_direct_del(&direct_ops, data->reg));
> > + ^^^^^^^^^^
> > + goto rollback_unlink;
> > + }
> > + }
>
> The recovery path unconditionally calls update_ftrace_direct_del with
> data->reg, but update_ftrace_direct_add is only invoked when data->reg
> is non-empty:
>
> if (ftrace_hash_count(data->reg)) {
> err = update_ftrace_direct_add(&direct_ops, data->reg);
> ...
> }
>
> When update_ftrace_direct_mod fails and all trampolines already had
> cur_image != NULL (so nothing was added to data->reg),
> update_ftrace_direct_del is called with an empty hash.
> update_ftrace_direct_del at kernel/trace/ftrace.c:6511 checks:
>
> if (!hash_count(hash))
> return -EINVAL;
>
> so it returns -EINVAL and the WARN_ON_ONCE fires spuriously. Should
> this guard match the add path?
>
> if (ftrace_hash_count(data->reg))
> WARN_ON_ONCE(update_ftrace_direct_del(&direct_ops, data->reg));
yes, replied in the other email
SNIP
> > +int bpf_check_attach_btf_id_multi(struct btf *btf, struct bpf_prog *prog, u32 btf_id,
> > + struct bpf_attach_target_info *tgt_info)
> > +{
> > + const struct btf_type *t;
> > + unsigned long addr;
> > + const char *tname;
> > + int err;
> > +
> > + if (!btf_id || !btf)
> > + return -EINVAL;
> > +
> > + /* Check noreturn attachment. */
> > + if (prog->expected_attach_type == BPF_TRACE_FEXIT_MULTI ||
> > + btf_id_set_contains(&noreturn_deny, btf_id))
> > + return -EINVAL;
>
> Operator precedence issue in the noreturn check. The two subexpressions
> are joined by ||, so the function returns -EINVAL whenever EITHER:
>
> (a) the program's expected_attach_type is BPF_TRACE_FEXIT_MULTI
> (regardless of whether btf_id targets a noreturn function), OR
>
> (b) the btf_id is in the noreturn_deny set (regardless of program
> attach type, rejecting FENTRY_MULTI attachments to noreturn
> functions as well).
>
> The intent is to reject only FEXIT-like multi attachments to noreturn
> functions. The non-multi path in check_attach_btf_id() correctly uses
> &&:
>
> } else if ((prog->expected_attach_type == BPF_TRACE_FEXIT ||
> prog->expected_attach_type == BPF_TRACE_FSESSION ||
> prog->expected_attach_type == BPF_MODIFY_RETURN) &&
> btf_id_set_contains(&noreturn_deny, btf_id)) {
> ...
> return -EINVAL;
> }
>
> Consequences once bpf_trampoline_multi_attach() is wired up from
> user-space (commit dbf2afe2f603 "bpf: Add support for tracing multi
> link"):
>
> 1. Every BPF_TRACE_FEXIT_MULTI program attach request is rejected
> with -EINVAL, even when the target function returns normally. This
> disables BPF_TRACE_FEXIT_MULTI entirely.
>
> 2. Any tracing_multi program (including BPF_TRACE_FENTRY_MULTI) is
> rejected when its btf_id is in noreturn_deny (do_exit,
> __x64_sys_exit, make_task_dead, etc.). Rejecting FENTRY_MULTI on
> noreturn functions is an unintended behavioral restriction.
>
> Forward search in the git range shows commit bfcad202ce99 ("bpf: Add
> support for tracing_multi link session") rewrites this hunk to:
>
> if ((prog->expected_attach_type == BPF_TRACE_FEXIT_MULTI ||
> prog->expected_attach_type == BPF_TRACE_FSESSION_MULTI) &&
> btf_id_set_contains(&noreturn_deny, btf_id))
> return -EINVAL;
>
> The parenthesization + && in the later commit confirms this is a
> precedence bug. Should this use && instead?
yes, replied in the other email
jirka
^ permalink raw reply
* Re: [PATCHv5 bpf-next 12/28] bpf: Add bpf_trampoline_multi_attach/detach functions
From: Jiri Olsa @ 2026-04-21 8:56 UTC (permalink / raw)
To: bot+bpf-ci
Cc: ast, daniel, andrii, bpf, linux-trace-kernel, kafai, eddyz87,
songliubraving, yhs, menglong8.dong, rostedt, martin.lau,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <11459a06ee7d070bdbdc3699d59adcdca12d5178eaa6ebd15286e0f51cc27141@mail.kernel.org>
On Fri, Apr 17, 2026 at 08:22:36PM +0000, bot+bpf-ci@kernel.org wrote:
SNIP
> > + if (ftrace_hash_count(data->modify)) {
> > + err = update_ftrace_direct_mod(&direct_ops, data->modify, true);
> > + if (err) {
> > + WARN_ON_ONCE(update_ftrace_direct_del(&direct_ops, data->reg));
> > + goto rollback_unlink;
> > + }
> > + }
>
> Does update_ftrace_direct_del need a guard here?
>
> When update_ftrace_direct_mod fails, the recovery code unconditionally
> calls update_ftrace_direct_del(&direct_ops, data->reg). However,
> update_ftrace_direct_add is itself only invoked when data->reg is
> non-empty:
>
> if (ftrace_hash_count(data->reg)) {
> err = update_ftrace_direct_add(&direct_ops, data->reg);
> ...
> }
>
> If all trampolines being multi-attached already had cur_image != NULL,
> nothing is ever added to data->reg, and the add call is skipped. Yet
> the recovery path still calls update_ftrace_direct_del with the empty
> data->reg.
>
> update_ftrace_direct_del() at kernel/trace/ftrace.c:6511 does:
>
> if (!hash_count(hash))
> return -EINVAL;
>
> so it returns -EINVAL, and WARN_ON_ONCE fires. No further damage is
> done since the rollback_unlink path still unwinds correctly, but this
> is a spurious warning that becomes visible whenever
> update_ftrace_direct_mod fails and data->reg happens to be empty.
>
> The undo should match the guard used for the add:
>
> if (err) {
> if (ftrace_hash_count(data->reg))
> WARN_ON_ONCE(update_ftrace_direct_del(&direct_ops, data->reg));
> goto rollback_unlink;
> }
right, that's the fix
>
> > +
> > + for_each_mnode(mnode, link)
> > + bpf_trampoline_multi_attach_free(mnode->trampoline);
> > +
> > + trampoline_unlock_all();
> > +
> > + remove_tracing_multi_data(data);
> > + return 0;
> > +
> > +rollback_unlink:
> > + for_each_mnode_cnt(mnode, link, rollback_cnt) {
> > + bpf_trampoline_remove_prog(mnode->trampoline, &mnode->node);
> > + bpf_trampoline_multi_attach_rollback(mnode->trampoline);
> > + }
> > +
> > + trampoline_unlock_all();
> > +
> > + clear_tracing_multi_data(data);
> > + rollback_cnt = link->nodes_cnt;
> > +
> > +rollback_put:
> > + for_each_mnode_cnt(mnode, link, rollback_cnt)
> > + bpf_trampoline_put(mnode->trampoline);
> > +
> > + return err;
> > +}
SNIP
> > + /* Check noreturn attachment. */
> > + if (prog->expected_attach_type == BPF_TRACE_FEXIT_MULTI ||
> > + btf_id_set_contains(&noreturn_deny, btf_id))
> > + return -EINVAL;
>
> Should this be && instead of ||?
>
> This condition rejects all BPF_TRACE_FEXIT_MULTI programs regardless
> of whether the target is on the noreturn_deny list. The C operator
> precedence makes this `A || B`, so if the program's expected_attach_type
> is BPF_TRACE_FEXIT_MULTI, the entire condition is TRUE and the
> function returns -EINVAL.
>
> This breaks FEXIT_MULTI attachment support. The caller
> bpf_trampoline_multi_attach is the multi-attach path that supports both
> BPF_TRACE_FENTRY_MULTI and BPF_TRACE_FEXIT_MULTI (see is_tracing_multi()
> in include/linux/bpf.h).
>
> The equivalent check in the single-target path, check_attach_btf_id,
> correctly uses &&:
>
> } else if ((prog->expected_attach_type == BPF_TRACE_FEXIT ||
> prog->expected_attach_type == BPF_TRACE_FSESSION ||
> prog->expected_attach_type == BPF_MODIFY_RETURN) &&
> btf_id_set_contains(&noreturn_deny, btf_id)) {
> ...
> }
>
> The intended semantics are to reject FEXIT_MULTI attachments only when
> the target is on noreturn_deny. The correct operator is &&:
>
> if (prog->expected_attach_type == BPF_TRACE_FEXIT_MULTI &&
> btf_id_set_contains(&noreturn_deny, btf_id))
> return -EINVAL;
>
> Additional evidence:
> 1. The comment says "Check noreturn attachment" but the current code
> rejects all FEXIT_MULTI regardless of noreturn status.
> 2. The indentation of the second line matches the alignment pattern
> used when lines are joined by && rather than ||.
> 3. bpf_check_attach_target handles case BPF_TRACE_FEXIT_MULTI as a
> valid attach type in its switch.
>
> Note: In this commit, bpf_trampoline_multi_attach has no callers yet;
> it is infrastructure wired up by a later commit. Once the link creation
> path invokes bpf_trampoline_multi_attach for a FEXIT_MULTI program,
> every such attach will fail with -EINVAL.
guilty as charged, should be &&, will fix
jirka
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox