* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Yafang Shao @ 2026-04-07 3:16 UTC
To: Song Liu
Cc: Joe Lawrence, Dylan Hatch, jpoimboe, jikos, mbenes, pmladek,
rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
bpf
In-Reply-To: <CAPhsuW66tuF+QZ0pVheWb5sC4NQ-9CXikq=zMrPBXTHcsVPjdg@mail.gmail.com>
On Tue, Apr 7, 2026 at 10:54 AM Song Liu <song@kernel.org> wrote:
>
> On Mon, Apr 6, 2026 at 2:12 PM Joe Lawrence <joe.lawrence@redhat.com> wrote:
> [...]
> > > > > - The regular livepatches are cumulative, have the replace flag; and
> > > > > are replaceable.
> > > > > - The occasional "off-band" livepatches do not have the replace flag,
> > > > > and are not replaceable.
> > > > >
> > > > > With this setup, for systems with off-band livepatches loaded, we can
> > > > > still release a cumulative livepatch to replace the previous cumulative
> > > > > livepatch. Is this the expected use case?
> > > >
> > > > That matches our expected use case.
> > >
> > > If we really want to serve use cases like this, I think we can introduce
> > > some replace tag concept: Each livepatch will have a tag, u32 number.
> > > Newly loaded livepatch will only replace existing livepatch with the
> > > same tag. We can even reuse the existing "bool replace" in klp_patch,
> > > and make it u32: replace=0 means no replace; replace > 0 are the
> > > replace tag.
> > >
> > > For current users of cumulative patches, all the livepatch will have the
> > > same tag, say 1. For your use case, you can assign each user a
> > > unique tag. Then all these users can do atomic upgrades of their
> > > own livepatches.
> > >
> > > We may also need to check whether two livepatches of different tags
> > > touch the same kernel function. When that happens, the later
> > > livepatch should fail to load.
That sounds like a viable solution. I'll look into it and see how we
can implement it.
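To make the proposed semantics concrete, here is a minimal user-space
sketch in C. It is not kernel code: struct model_patch, model_load() and
the function-name lists are hypothetical stand-ins, and it assumes that
replace=0 is treated like any other tag in the conflict check, which the
proposal above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PATCHES 8

/* Hypothetical model of a livepatch; not the kernel's struct klp_patch. */
struct model_patch {
	const char *name;
	unsigned int replace;		/* 0: no replace; > 0: replace tag */
	const char *const *funcs;	/* NULL-terminated list of patched functions */
	bool enabled;
};

static struct model_patch *loaded[MAX_PATCHES];
static size_t nr_loaded;

/* True if the two patches list at least one common function name. */
static bool touches_same_func(const struct model_patch *a,
			      const struct model_patch *b)
{
	for (size_t i = 0; a->funcs[i]; i++)
		for (size_t j = 0; b->funcs[j]; j++)
			if (!strcmp(a->funcs[i], b->funcs[j]))
				return true;
	return false;
}

/* Returns 0 on success, -1 if the patch must be refused. */
static int model_load(struct model_patch *patch)
{
	size_t i, kept = 0;

	assert(patch && patch->funcs);

	/* A patch with a different tag must not touch the same function. */
	for (i = 0; i < nr_loaded; i++)
		if (loaded[i]->replace != patch->replace &&
		    touches_same_func(loaded[i], patch))
			return -1;

	/* Drop every loaded patch that carries the same non-zero tag. */
	for (i = 0; i < nr_loaded; i++) {
		if (patch->replace && loaded[i]->replace == patch->replace) {
			loaded[i]->enabled = false;
			continue;
		}
		loaded[kept++] = loaded[i];
	}

	/* kept == MAX_PATCHES implies nothing was dropped above. */
	if (kept == MAX_PATCHES)
		return -1;

	loaded[kept++] = patch;
	nr_loaded = kept;
	patch->enabled = true;
	return 0;
}
```

In this model, loading a second patch with tag 1 disables the first
tag-1 patch but leaves a tag-2 patch alone, and a tag-3 patch that lists
a function already owned by tag 1 is refused.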
> > >
> > > Does this make sense?
> > >
> >
> > I haven't been following the thread carefully, but could the Livepatch
> > system state API (see Documentation/livepatch/system-state.rst) be
> > leveraged somehow instead of adding further replace semantics?
>
> AFAICT, system state will not help Yafang's use case.
Right.
--
Regards
Yafang
* Re: [RFC PATCH 0/4] trace, livepatch: Allow kprobe return overriding for livepatched functions
From: Yafang Shao @ 2026-04-07 3:13 UTC
To: Song Liu
Cc: jpoimboe, jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
mathieu.desnoyers, kpsingh, mattbobrowski, jolsa, ast, daniel,
andrii, martin.lau, eddyz87, memxor, yonghong.song, live-patching,
linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <CAPhsuW5MN6ikKmxgqby5RJ3_gvjJ4B77X74OvfbTQoFO8iUgzA@mail.gmail.com>
On Tue, Apr 7, 2026 at 10:47 AM Song Liu <song@kernel.org> wrote:
>
> On Mon, Apr 6, 2026 at 7:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, Apr 7, 2026 at 2:26 AM Song Liu <song@kernel.org> wrote:
> > >
> > > On Mon, Apr 6, 2026 at 3:55 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Sat, Apr 4, 2026 at 12:07 AM Song Liu <song@kernel.org> wrote:
> > > > >
> > > > > Hi Yafang,
> > > > >
> > > > > On Thu, Apr 2, 2026 at 2:26 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > Livepatching allows for rapid experimentation with new kernel features
> > > > > > without interrupting production workloads. However, static livepatches lack
> > > > > > the flexibility required to tune features based on task-specific attributes,
> > > > > > such as cgroup membership, which is critical in multi-tenant k8s
> > > > > > environments. Furthermore, hardcoding logic into a livepatch prevents
> > > > > > dynamic adjustments based on the runtime environment.
> > > > > >
> > > > > > To address this, we propose a hybrid approach using BPF. Our production use
> > > > > > case involves:
> > > > > >
> > > > > > 1. Deploying a Livepatch function to serve as a stable BPF hook.
> > > > > >
> > > > > > 2. Utilizing bpf_override_return() to dynamically modify the return value
> > > > > > of that hook based on the current task's context.
> > > > >
> > > > > Could you please provide a specific use case that can benefit from this?
> > > > > AFAICT, livepatch is more flexible but risky (may cause crash); while
> > > > > BPF is safe, but less flexible. The combination you are proposing seems
> > > > > to get the worse of the two sides. Maybe it can indeed get the benefit of
> > > > > both sides in some cases, but I cannot think of such examples.
> > > > >
> > > >
> > > > Here is an example we recently deployed on our production servers:
> > > >
> > > > https://lore.kernel.org/bpf/CALOAHbDnNba_w_nWH3-S9GAXw0+VKuLTh1gy5hy9Yqgeo4C0iA@mail.gmail.com/
> > > >
> > > > In one of our specific clusters, we needed to send BGP traffic out
> > > > through specific NICs based on the destination IP. To achieve this
> > > > without interrupting service, we live-patched
> > > > bond_xmit_3ad_xor_slave_get(), added a new hook called
> > > > bond_get_slave_hook(), and then ran a BPF program attached to that
> > > > hook to select the outgoing NIC from the SKB. This allowed us to
> > > > rapidly deploy the feature with zero downtime.
> > >
> > > I guess the idea here is: keep the risk part simple, and implement
> > > it in module/livepatch, then use BPF for the flexible and programmable
> > > part safe.
> >
> > Right
> >
> > >
> > > Can we use struct_ops instead of bpf_override_return for this case?
> > > This should make the solution more flexible.
> >
> > Upstreaming struct_ops based BPF hooks is a challenging process, as
> > seen in these examples:
> >
> > https://lwn.net/Articles/1054030/
> > https://lwn.net/Articles/1043548/
> >
> > Even when successful, upstreaming can take a significant amount of
> > time—often longer than our production requirements allow. To bridge
> > this gap, we developed this livepatch+BPF solution. This allows us to
> > rapidly deploy new features without maintaining custom hooks in our
> > local kernel. Because these livepatch-based hooks are lightweight,
> > they minimize maintenance overhead and simplify kernel upgrades (e.g.,
> > from 6.1 to 6.18).
>
> I didn't mean upstream struct_ops.
>
> We can define the struct_ops in an OOT kernel module. Then we
> can attach BPF programs to the struct_ops. We may need
> livepatch to connect the new struct_ops to original kernel logic.
>
> I think kernel side of this solution is mostly available, but we may
> need some work on the toolchain side.
>
> Does this make sense?
Are there actual benefits to using struct_ops instead of
bpf_override_return? So far, I’ve only found it adds complexity
without much gain.
Can we add something like ALLOW_LIVEPATCH_ERROR_INJECTION() to allow
error injection on functions defined inside a livepatch?
--
Regards
Yafang
* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Song Liu @ 2026-04-07 2:54 UTC
To: Joe Lawrence
Cc: Yafang Shao, Dylan Hatch, jpoimboe, jikos, mbenes, pmladek,
rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
bpf
In-Reply-To: <adQhpBC2W9I6QW-g@redhat.com>
On Mon, Apr 6, 2026 at 2:12 PM Joe Lawrence <joe.lawrence@redhat.com> wrote:
[...]
> > > > - The regular livepatches are cumulative, have the replace flag; and
> > > > are replaceable.
> > > > - The occasional "off-band" livepatches do not have the replace flag,
> > > > and are not replaceable.
> > > >
> > > > With this setup, for systems with off-band livepatches loaded, we can
> > > > still release a cumulative livepatch to replace the previous cumulative
> > > > livepatch. Is this the expected use case?
> > >
> > > That matches our expected use case.
> >
> > If we really want to serve use cases like this, I think we can introduce
> > some replace tag concept: Each livepatch will have a tag, u32 number.
> > Newly loaded livepatch will only replace existing livepatch with the
> > same tag. We can even reuse the existing "bool replace" in klp_patch,
> > and make it u32: replace=0 means no replace; replace > 0 are the
> > replace tag.
> >
> > For current users of cumulative patches, all the livepatch will have the
> > same tag, say 1. For your use case, you can assign each user a
> > unique tag. Then all these users can do atomic upgrades of their
> > own livepatches.
> >
> > We may also need to check whether two livepatches of different tags
> > touch the same kernel function. When that happens, the later
> > livepatch should fail to load.
> >
> > Does this make sense?
> >
>
> I haven't been following the thread carefully, but could the Livepatch
> system state API (see Documentation/livepatch/system-state.rst) be
> leveraged somehow instead of adding further replace semantics?
AFAICT, system state will not help Yafang's use case.
Thanks,
Song
* Re: [RFC PATCH 0/4] trace, livepatch: Allow kprobe return overriding for livepatched functions
From: Song Liu @ 2026-04-07 2:46 UTC
To: Yafang Shao
Cc: jpoimboe, jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
mathieu.desnoyers, kpsingh, mattbobrowski, jolsa, ast, daniel,
andrii, martin.lau, eddyz87, memxor, yonghong.song, live-patching,
linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <CALOAHbC0hqk+yrUZay01EBRNOHgyj1MAavzNK-06XJKK9ARMqQ@mail.gmail.com>
On Mon, Apr 6, 2026 at 7:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Apr 7, 2026 at 2:26 AM Song Liu <song@kernel.org> wrote:
> >
> > On Mon, Apr 6, 2026 at 3:55 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Sat, Apr 4, 2026 at 12:07 AM Song Liu <song@kernel.org> wrote:
> > > >
> > > > Hi Yafang,
> > > >
> > > > On Thu, Apr 2, 2026 at 2:26 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > Livepatching allows for rapid experimentation with new kernel features
> > > > > without interrupting production workloads. However, static livepatches lack
> > > > > the flexibility required to tune features based on task-specific attributes,
> > > > > such as cgroup membership, which is critical in multi-tenant k8s
> > > > > environments. Furthermore, hardcoding logic into a livepatch prevents
> > > > > dynamic adjustments based on the runtime environment.
> > > > >
> > > > > To address this, we propose a hybrid approach using BPF. Our production use
> > > > > case involves:
> > > > >
> > > > > 1. Deploying a Livepatch function to serve as a stable BPF hook.
> > > > >
> > > > > 2. Utilizing bpf_override_return() to dynamically modify the return value
> > > > > of that hook based on the current task's context.
> > > >
> > > > Could you please provide a specific use case that can benefit from this?
> > > > AFAICT, livepatch is more flexible but risky (may cause crash); while
> > > > BPF is safe, but less flexible. The combination you are proposing seems
> > > > to get the worse of the two sides. Maybe it can indeed get the benefit of
> > > > both sides in some cases, but I cannot think of such examples.
> > > >
> > >
> > > Here is an example we recently deployed on our production servers:
> > >
> > > https://lore.kernel.org/bpf/CALOAHbDnNba_w_nWH3-S9GAXw0+VKuLTh1gy5hy9Yqgeo4C0iA@mail.gmail.com/
> > >
> > > In one of our specific clusters, we needed to send BGP traffic out
> > > through specific NICs based on the destination IP. To achieve this
> > > without interrupting service, we live-patched
> > > bond_xmit_3ad_xor_slave_get(), added a new hook called
> > > bond_get_slave_hook(), and then ran a BPF program attached to that
> > > hook to select the outgoing NIC from the SKB. This allowed us to
> > > rapidly deploy the feature with zero downtime.
> >
> > I guess the idea here is: keep the risk part simple, and implement
> > it in module/livepatch, then use BPF for the flexible and programmable
> > part safe.
>
> Right
>
> >
> > Can we use struct_ops instead of bpf_override_return for this case?
> > This should make the solution more flexible.
>
> Upstreaming struct_ops based BPF hooks is a challenging process, as
> seen in these examples:
>
> https://lwn.net/Articles/1054030/
> https://lwn.net/Articles/1043548/
>
> Even when successful, upstreaming can take a significant amount of
> time—often longer than our production requirements allow. To bridge
> this gap, we developed this livepatch+BPF solution. This allows us to
> rapidly deploy new features without maintaining custom hooks in our
> local kernel. Because these livepatch-based hooks are lightweight,
> they minimize maintenance overhead and simplify kernel upgrades (e.g.,
> from 6.1 to 6.18).
I didn't mean upstream struct_ops.
We can define the struct_ops in an OOT kernel module. Then we
can attach BPF programs to the struct_ops. We may need
livepatch to connect the new struct_ops to original kernel logic.
I think kernel side of this solution is mostly available, but we may
need some work on the toolchain side.
Does this make sense?
Thanks,
Song
> That said, we would still prefer to have our hooks accepted upstream
> to eliminate the need for self-maintenance entirely.
* Re: [RFC PATCH 0/4] trace, livepatch: Allow kprobe return overriding for livepatched functions
From: Yafang Shao @ 2026-04-07 2:21 UTC
To: Song Liu
Cc: jpoimboe, jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
mathieu.desnoyers, kpsingh, mattbobrowski, jolsa, ast, daniel,
andrii, martin.lau, eddyz87, memxor, yonghong.song, live-patching,
linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <CAPhsuW73qFybHgOnZ=oFC1PvdWkYWDk7gsAoiBXe4xWYagPrmA@mail.gmail.com>
On Tue, Apr 7, 2026 at 2:26 AM Song Liu <song@kernel.org> wrote:
>
> On Mon, Apr 6, 2026 at 3:55 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Sat, Apr 4, 2026 at 12:07 AM Song Liu <song@kernel.org> wrote:
> > >
> > > Hi Yafang,
> > >
> > > On Thu, Apr 2, 2026 at 2:26 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > Livepatching allows for rapid experimentation with new kernel features
> > > > without interrupting production workloads. However, static livepatches lack
> > > > the flexibility required to tune features based on task-specific attributes,
> > > > such as cgroup membership, which is critical in multi-tenant k8s
> > > > environments. Furthermore, hardcoding logic into a livepatch prevents
> > > > dynamic adjustments based on the runtime environment.
> > > >
> > > > To address this, we propose a hybrid approach using BPF. Our production use
> > > > case involves:
> > > >
> > > > 1. Deploying a Livepatch function to serve as a stable BPF hook.
> > > >
> > > > 2. Utilizing bpf_override_return() to dynamically modify the return value
> > > > of that hook based on the current task's context.
> > >
> > > Could you please provide a specific use case that can benefit from this?
> > > AFAICT, livepatch is more flexible but risky (may cause crash); while
> > > BPF is safe, but less flexible. The combination you are proposing seems
> > > to get the worse of the two sides. Maybe it can indeed get the benefit of
> > > both sides in some cases, but I cannot think of such examples.
> > >
> >
> > Here is an example we recently deployed on our production servers:
> >
> > https://lore.kernel.org/bpf/CALOAHbDnNba_w_nWH3-S9GAXw0+VKuLTh1gy5hy9Yqgeo4C0iA@mail.gmail.com/
> >
> > In one of our specific clusters, we needed to send BGP traffic out
> > through specific NICs based on the destination IP. To achieve this
> > without interrupting service, we live-patched
> > bond_xmit_3ad_xor_slave_get(), added a new hook called
> > bond_get_slave_hook(), and then ran a BPF program attached to that
> > hook to select the outgoing NIC from the SKB. This allowed us to
> > rapidly deploy the feature with zero downtime.
>
> I guess the idea here is: keep the risk part simple, and implement
> it in module/livepatch, then use BPF for the flexible and programmable
> part safe.
Right
>
> Can we use struct_ops instead of bpf_override_return for this case?
> This should make the solution more flexible.
Upstreaming struct_ops based BPF hooks is a challenging process, as
seen in these examples:
https://lwn.net/Articles/1054030/
https://lwn.net/Articles/1043548/
Even when successful, upstreaming can take a significant amount of
time—often longer than our production requirements allow. To bridge
this gap, we developed this livepatch+BPF solution. This allows us to
rapidly deploy new features without maintaining custom hooks in our
local kernel. Because these livepatch-based hooks are lightweight,
they minimize maintenance overhead and simplify kernel upgrades (e.g.,
from 6.1 to 6.18).
That said, we would still prefer to have our hooks accepted upstream
to eliminate the need for self-maintenance entirely.
--
Regards
Yafang
* [RESEND PATCH v16 5/5] ring-buffer: Show commit numbers in buffer_meta file
From: Masami Hiramatsu (Google) @ 2026-04-07 1:12 UTC
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177552432201.853249.5125045538812833325.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
In addition to the index number, show the commit numbers of
each data page in the per_cpu buffer_meta file.
This is useful for understanding the current status of the
persistent ring buffer. Note that this file is shown only for
the persistent ring buffer and its backup instance.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v16:
- update description.
---
kernel/trace/ring_buffer.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index e56fe9dcc7d7..4bf83b7805da 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2209,6 +2209,7 @@ static int rbm_show(struct seq_file *m, void *v)
struct ring_buffer_per_cpu *cpu_buffer = m->private;
struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
unsigned long val = (unsigned long)v;
+ struct buffer_data_page *dpage;
if (val == 1) {
seq_printf(m, "head_buffer: %d\n",
@@ -2221,7 +2222,9 @@ static int rbm_show(struct seq_file *m, void *v)
}
val -= 2;
- seq_printf(m, "buffer[%ld]: %d\n", val, meta->buffers[val]);
+ dpage = rb_range_buffer(cpu_buffer, val);
+ seq_printf(m, "buffer[%ld]: %d (commit: %ld)\n",
+ val, meta->buffers[val], local_read(&dpage->commit));
return 0;
}
* [RESEND PATCH v16 4/5] ring-buffer: Add persistent ring buffer invalid-page inject test
From: Masami Hiramatsu (Google) @ 2026-04-07 1:12 UTC
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177552432201.853249.5125045538812833325.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a self-corrupting test for the persistent ring buffer.
When the kernel panics, this injects an erroneous value into some
sub-buffer pages of the persistent ring buffer (those whose index
is even or a multiple of 5). After reboot, it checks whether the
number of detected invalid pages and the total entry_bytes match
the recorded values.
This ensures that the kernel can correctly recover a partially
corrupted persistent ring buffer after a reboot or panic.
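As an illustration of which pages the "even or multiple of 5" rule
selects, here is a small user-space sketch in C. The helper names
inject_target(), count_injected() and longest_injected_run() are
hypothetical, and the sketch assumes every position holds data, whereas
the test itself skips pages whose commit is 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Same predicate as the injection test: even positions or multiples of 5. */
static bool inject_target(int i)
{
	return !(i & 0x1) || !(i % 5);
}

/* Number of positions in [0, nr) that the predicate selects. */
static int count_injected(int nr)
{
	int count = 0;

	assert(nr >= 0);
	for (int i = 0; i < nr; i++)
		if (inject_target(i))
			count++;
	return count;
}

/* Longest run of consecutive selected positions in [0, nr). */
static int longest_injected_run(int nr)
{
	int best = 0, run = 0;

	assert(nr >= 0);
	for (int i = 0; i < nr; i++) {
		run = inject_target(i) ? run + 1 : 0;
		if (run > best)
			best = run;
	}
	return best;
}
```

Positions 4, 5 and 6 are all selected (4 and 6 are even, 5 is a multiple
of 5), which is how the rule produces three contiguous invalidated pages.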
The test only runs on the persistent ring buffer whose name is
"ptracingtest". The user has to fill it with events before a
kernel panic.
To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT
and add the following kernel cmdline:
reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
panic=1
Run the following commands after the 1st boot:
cd /sys/kernel/tracing/instances/ptracingtest
echo 1 > tracing_on
echo 1 > events/enable
sleep 3
echo c > /proc/sysrq-trigger
After the panic message, the kernel will reboot and run the verification
on the persistent ring buffer, e.g.
Ring buffer meta [2] invalid buffer page detected
Ring buffer meta [2] is from previous boot! (318 pages discarded)
Ring buffer testing [2] invalid pages: PASSED (318/318)
Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v16:
- Update description and comments according to review comments.
Changes in v15:
- Use pr_warn() for test result.
- Inject errors where the page index is a multiple of 5 so that
this can reproduce contiguous empty pages.
Changes in v14:
- Rename config to CONFIG_RING_BUFFER_PERSISTENT_INJECT.
- Clear meta->nr_invalid/entry_bytes after testing.
- Add test commands in config comment.
Changes in v10:
- Add entry_bytes test.
- Do not compile test code if CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n.
Changes in v9:
- Test also reader pages.
---
include/linux/ring_buffer.h | 1 +
kernel/trace/Kconfig | 34 ++++++++++++++++++++
kernel/trace/ring_buffer.c | 74 +++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.c | 4 ++
4 files changed, 113 insertions(+)
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 994f52b34344..0670742b2d60 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -238,6 +238,7 @@ int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
+ RB_FL_TESTING = 1 << 1,
};
#ifdef CONFIG_RING_BUFFER
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..084f34dc6c9f 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1202,6 +1202,40 @@ config RING_BUFFER_VALIDATE_TIME_DELTAS
Only say Y if you understand what this does, and you
still want it enabled. Otherwise say N
+config RING_BUFFER_PERSISTENT_INJECT
+ bool "Enable persistent ring buffer error injection test"
+ depends on RING_BUFFER
+ help
+ This option will have the kernel check if the persistent ring
+ buffer is named "ptracingtest", and if so, it will corrupt some
+ of its pages on a kernel panic. This is used to test if the
+ persistent ring buffer can recover from some of its sub-buffers
+ being corrupted.
+ To use this, boot a kernel with a "ptracingtest" persistent
+ ring buffer, e.g.
+
+ reserve_mem=20M:2M:trace trace_instance=ptracingtest@trace panic=1
+
+ And after the 1st boot, run the following commands:
+
+ cd /sys/kernel/tracing/instances/ptracingtest
+ echo 1 > events/enable
+ echo 1 > tracing_on
+ sleep 3
+ echo c > /proc/sysrq-trigger
+
+ After the panic message, the kernel will reboot and will show
+ the test results in the console output.
+
+ Note that events for the test ring buffer need to be enabled
+ prior to crashing the kernel so that the ring buffer has content
+ that the test will corrupt.
+ As the test will corrupt events in the "ptracingtest" persistent
+ ring buffer, it should not be used for any other purpose other
+ than this test.
+
+ If unsure, say N
+
config MMIOTRACE_TEST
tristate "Test module for mmiotrace"
depends on MMIOTRACE && m
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 518a05df6ef7..e56fe9dcc7d7 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -64,6 +64,10 @@ struct ring_buffer_cpu_meta {
unsigned long commit_buffer;
__u32 subbuf_size;
__u32 nr_subbufs;
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+ __u32 nr_invalid;
+ __u32 entry_bytes;
+#endif
int buffers[];
};
@@ -2079,6 +2083,21 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (discarded)
pr_cont(" (%d pages discarded)", discarded);
pr_cont("\n");
+
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+ if (meta->nr_invalid)
+ pr_warn("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
+ cpu_buffer->cpu,
+ (discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+ discarded, meta->nr_invalid);
+ if (meta->entry_bytes)
+ pr_warn("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
+ cpu_buffer->cpu,
+ (entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+ (long)entry_bytes, (long)meta->entry_bytes);
+ meta->nr_invalid = 0;
+ meta->entry_bytes = 0;
+#endif
return;
invalid:
@@ -2559,12 +2578,67 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
+{
+ struct ring_buffer_per_cpu *cpu_buffer;
+ struct ring_buffer_cpu_meta *meta;
+ struct buffer_data_page *dpage;
+ u32 entry_bytes = 0;
+ unsigned long ptr;
+ int subbuf_size;
+ int invalid = 0;
+ int cpu;
+ int i;
+
+ if (!(buffer->flags & RB_FL_TESTING))
+ return;
+
+ guard(preempt)();
+ cpu = smp_processor_id();
+
+ cpu_buffer = buffer->buffers[cpu];
+ meta = cpu_buffer->ring_meta;
+ ptr = (unsigned long)rb_subbufs_from_meta(meta);
+ subbuf_size = meta->subbuf_size;
+
+ for (i = 0; i < meta->nr_subbufs; i++) {
+ int idx = meta->buffers[i];
+
+ dpage = (void *)(ptr + idx * subbuf_size);
+ /* Skip unused pages */
+ if (!local_read(&dpage->commit))
+ continue;
+
+ /*
+ * Invalidate even pages or multiples of 5. This will cause 3
+ * contiguous invalidated(empty) pages.
+ */
+ if (!(i & 0x1) || !(i % 5)) {
+ local_add(subbuf_size + 1, &dpage->commit);
+ invalid++;
+ } else {
+ /* Count total commit bytes. */
+ entry_bytes += local_read(&dpage->commit);
+ }
+ }
+
+ pr_info("Inject invalidated %d pages on CPU%d, total size: %ld\n",
+ invalid, cpu, (long)entry_bytes);
+ meta->nr_invalid = invalid;
+ meta->entry_bytes = entry_bytes;
+}
+#else /* !CONFIG_RING_BUFFER_PERSISTENT_INJECT */
+#define rb_test_inject_invalid_pages(buffer) do { } while (0)
+#endif
+
/* Stop recording on a persistent buffer and flush cache if needed. */
static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
{
struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
ring_buffer_record_off(buffer);
+ rb_test_inject_invalid_pages(buffer);
arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
return NOTIFY_DONE;
}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e9455d46ec16..96101d276d13 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9436,6 +9436,8 @@ static void setup_trace_scratch(struct trace_array *tr,
memset(tscratch, 0, size);
}
+#define TRACE_TEST_PTRACING_NAME "ptracingtest"
+
static int
allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned long size)
{
@@ -9448,6 +9450,8 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned
buf->tr = tr;
if (tr->range_addr_start && tr->range_addr_size) {
+ if (!strcmp(tr->name, TRACE_TEST_PTRACING_NAME))
+ rb_flags |= RB_FL_TESTING;
/* Add scratch buffer to handle 128 modules */
buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
tr->range_addr_start,
* [RESEND PATCH v16 3/5] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-04-07 1:12 UTC
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177552432201.853249.5125045538812833325.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when rewinding the persistent ring buffer
instead of stopping the rewind. The skipped buffers are cleared.
To ensure the rewinding stops at the unused page, this also clears
buffer_data_page::time_stamp when tracing resets the buffer. This
allows us to identify unused pages and empty pages.
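The rewind rule described above can be modelled in user space roughly as
follows. This is a simplified sketch, not the kernel code: struct
model_page, struct rewind_result and model_rewind() are hypothetical
names, and validation is reduced to the commit-size check, whereas
rb_validate_buffer() also reads the events in the page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a sub-buffer's header fields. */
struct model_page {
	uint64_t time_stamp;	/* 0 if the page was never used or was reset */
	uint32_t commit;	/* committed bytes */
};

struct rewind_result {
	int visited;		/* pages examined before hitting an unused page */
	int discarded;		/* pages skipped as invalid */
	uint64_t entry_bytes;	/* bytes recovered from valid pages */
};

/*
 * pages[0] is the first page behind the head, pages[1] the one before it,
 * and so on. head_ts is the timestamp of the current head page.
 */
static struct rewind_result model_rewind(struct model_page *pages, size_t nr,
					 uint64_t head_ts, uint32_t subbuf_size)
{
	struct rewind_result res = { 0, 0, 0 };
	uint64_t ts = head_ts;

	assert(pages || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		struct model_page *p = &pages[i];
		int bad = p->commit > subbuf_size;

		/* Unused page (no timestamp, no commit): stop rewinding. */
		if (!p->time_stamp && !p->commit)
			break;
		res.visited++;

		/* A page newer than the previous valid page is also invalid. */
		if (!bad && ts < p->time_stamp) {
			p->time_stamp = ts;
			bad = 1;
		}
		if (bad) {
			/* Clear the page and keep going instead of stopping. */
			p->commit = 0;
			res.discarded++;
			continue;
		}
		res.entry_bytes += p->commit;
		ts = p->time_stamp;
	}
	return res;
}
```

A cleared page keeps a non-zero timestamp, so it stays distinguishable
from an unused page, which has both fields at zero.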
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v12:
- Fix build error.
Changes in v11:
- Reset timestamp when the buffer is invalid.
- When rewinding, skip subbuf page if timestamp is wrong and
check timestamp after validating buffer data page.
Changes in v10:
- Newly added.
---
kernel/trace/ring_buffer.c | 76 +++++++++++++++++++++++++-------------------
1 file changed, 43 insertions(+), 33 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 0c284094f7d0..518a05df6ef7 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -363,6 +363,7 @@ struct buffer_page {
static void rb_init_page(struct buffer_data_page *bpage)
{
local_set(&bpage->commit, 0);
+ bpage->time_stamp = 0;
}
static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
@@ -1878,12 +1879,14 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
struct ring_buffer_cpu_meta *meta)
{
+ struct buffer_data_page *dpage = bpage->page;
unsigned long long ts;
unsigned long tail;
u64 delta;
+ int ret = -1;
/*
* When a sub-buffer is recovered from a read, the commit value may
@@ -1892,9 +1895,17 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
* subbuf_size is considered invalid.
*/
tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
- if (tail > meta->subbuf_size)
- return -1;
- return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+ if (tail <= meta->subbuf_size)
+ ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+
+ if (ret < 0) {
+ local_set(&bpage->entries, 0);
+ local_set(&bpage->page->commit, 0);
+ } else {
+ local_set(&bpage->entries, ret);
+ }
+
+ return ret;
}
/* If the meta data has been validated, now validate the events */
@@ -1915,18 +1926,14 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_head = head_page = cpu_buffer->head_page;
/* Do the reader page first */
- ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
+ ret = rb_validate_buffer(cpu_buffer->reader_page, cpu_buffer->cpu, meta);
if (ret < 0) {
pr_info("Ring buffer meta [%d] invalid reader page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&cpu_buffer->reader_page->entries, 0);
- local_set(&cpu_buffer->reader_page->page->commit, 0);
} else {
entries += ret;
entry_bytes += rb_page_size(cpu_buffer->reader_page);
- local_set(&cpu_buffer->reader_page->entries, ret);
}
ts = head_page->page->time_stamp;
@@ -1945,26 +1952,33 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->tail_page)
break;
- /* Ensure the page has older data than head. */
- if (ts < head_page->page->time_stamp)
- break;
-
- ts = head_page->page->time_stamp;
- /* Ensure the page has correct timestamp and some data. */
- if (!ts || rb_page_commit(head_page) == 0)
- break;
-
- /* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
- if (ret < 0)
+ /* Rewind until unused page (no timestamp, no commit). */
+ if (!head_page->page->time_stamp && rb_page_commit(head_page) == 0)
break;
- /* Recover the number of entries and update stats. */
- local_set(&head_page->entries, ret);
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
- entries += ret;
- entry_bytes += rb_page_commit(head_page);
+ /*
+ * Skip if the page is invalid, or its timestamp is newer than the
+ * previous valid page.
+ */
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta);
+ if (ret >= 0 && ts < head_page->page->time_stamp) {
+ local_set(&head_page->entries, 0);
+ local_set(&head_page->page->commit, 0);
+ head_page->page->time_stamp = ts;
+ ret = -1;
+ }
+ if (ret < 0) {
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ if (ret > 0)
+ local_inc(&cpu_buffer->pages_touched);
+ ts = head_page->page->time_stamp;
+ }
}
if (i)
pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2034,15 +2048,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->reader_page)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta);
if (ret < 0) {
if (!discarded)
pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
} else {
/* If the buffer has content, update pages_touched */
if (ret)
@@ -2050,7 +2061,6 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
entries += ret;
entry_bytes += rb_page_size(head_page);
- local_set(&head_page->entries, ret);
}
if (head_page == cpu_buffer->commit_page)
break;
@@ -2083,7 +2093,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Reset all the subbuffers */
for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
+ rb_init_page(head_page->page);
}
}
^ permalink raw reply related
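As a rough illustration of the rewind rule in the hunk above (not part of the patch), the loop can be modelled in user space. This is a minimal sketch under stated assumptions: `struct demo_page`, `struct demo_result` and `demo_rewind()` are hypothetical names, the `events` field stands in for the return value of rb_validate_buffer(), and `pages[0]` is the page immediately before the head.

```c
#include <stdint.h>

/* Hypothetical stand-in for a sub-buffer seen during rewind. */
struct demo_page {
	uint64_t time_stamp;	/* 0 means never written */
	uint32_t commit;	/* committed bytes */
	int events;		/* events a validator would find, -1 if corrupted */
};

struct demo_result {
	int rewound;		/* pages walked before stopping */
	int discarded;		/* pages skipped as invalid */
	int entries;		/* events recovered from valid pages */
};

/*
 * Walk backwards from the head. Stop only at an unused page; skip
 * (rather than stop at) a page that is corrupted or whose timestamp
 * is newer than the previous valid page.
 */
static struct demo_result demo_rewind(struct demo_page *pages, int nr,
				      uint64_t head_ts)
{
	struct demo_result r = { 0, 0, 0 };
	uint64_t ts = head_ts;

	for (int i = 0; i < nr; i++) {
		struct demo_page *p = &pages[i];

		/* Rewind until an unused page (no timestamp, no commit). */
		if (!p->time_stamp && p->commit == 0)
			break;

		int ret = p->events;

		/* A page newer than the previous valid one is reset and skipped. */
		if (ret >= 0 && ts < p->time_stamp) {
			p->commit = 0;
			p->time_stamp = ts;
			ret = -1;
		}

		if (ret < 0) {
			r.discarded++;
		} else {
			r.entries += ret;
			ts = p->time_stamp;
		}
		r.rewound++;
	}
	return r;
}
```

In this model a corrupted page in the middle no longer ends the rewind, so older valid pages behind it are still recovered.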
* [RESEND PATCH v16 2/5] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-04-07 1:12 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177552432201.853249.5125045538812833325.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when validating the persistent ring buffer
instead of discarding the entire ring buffer. Only the skipped
sub-buffers are invalidated (cleared).
If cached data is not written back to memory during a reboot, the
persistent ring buffer may be partially corrupted while other
sub-buffers still contain readable event data. Discard only the
sub-buffers that are found to be corrupted.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v15:
- Skip the reader_page loop check on a persistent ring buffer because
there can be contiguous empty (invalidated) pages.
- Do not show discarded page number information if it is 0.
Changes in v11:
- Fix a typo.
Changes in v9:
- Add meta->subbuf_size check.
- Fix a typo.
- Handle invalid reader_page case.
Changes in v8:
- Add comment in rb_validate_buffer()
- Clear the RB_MISSED_* flags in rb_validate_buffer() instead of
skipping subbuf.
- Remove unused subbuf local variable from rb_cpu_meta_valid().
Changes in v7:
- Combined with Handling RB_MISSED_* flags patch, focus on validation at boot.
- Remove checking subbuffer data when validating metadata, because it should be done
later.
- Do not mark the discarded sub buffer page but just reset it.
Changes in v6:
- Show invalid page detection message once per CPU.
Changes in v5:
- Instead of showing errors for each page, just show the number
of discarded pages at the end.
Changes in v3:
- Record missed data event on commit.
---
kernel/trace/ring_buffer.c | 109 ++++++++++++++++++++++++++------------------
1 file changed, 65 insertions(+), 44 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 4d5817286791..0c284094f7d0 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -370,6 +370,12 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
return local_read(&bpage->page->commit);
}
+/* Size is determined by what has been committed */
+static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
+{
+ return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
/* Range pages are not to be freed */
@@ -1762,7 +1768,6 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
unsigned long *subbuf_mask)
{
int subbuf_size = PAGE_SIZE;
- struct buffer_data_page *subbuf;
unsigned long buffers_start;
unsigned long buffers_end;
int i;
@@ -1770,6 +1775,11 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
if (!subbuf_mask)
return false;
+ if (meta->subbuf_size != PAGE_SIZE) {
+ pr_info("Ring buffer boot meta [%d] invalid subbuf_size\n", cpu);
+ return false;
+ }
+
buffers_start = meta->first_buffer;
buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
@@ -1786,11 +1796,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- subbuf = rb_subbufs_from_meta(meta);
-
bitmap_clear(subbuf_mask, 0, meta->nr_subbufs);
- /* Is the meta buffers and the subbufs themselves have correct data? */
+ /*
+ * Ensure the meta::buffers array has correct data. The data in each subbufs
+ * are checked later in rb_meta_validate_events().
+ */
for (i = 0; i < meta->nr_subbufs; i++) {
if (meta->buffers[i] < 0 ||
meta->buffers[i] >= meta->nr_subbufs) {
@@ -1798,18 +1809,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
- pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu);
- return false;
- }
-
if (test_bit(meta->buffers[i], subbuf_mask)) {
pr_info("Ring buffer boot meta [%d] array has duplicates\n", cpu);
return false;
}
set_bit(meta->buffers[i], subbuf_mask);
- subbuf = (void *)subbuf + subbuf_size;
}
return true;
@@ -1873,13 +1878,22 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+ struct ring_buffer_cpu_meta *meta)
{
unsigned long long ts;
+ unsigned long tail;
u64 delta;
- int tail;
- tail = local_read(&dpage->commit);
+ /*
+ * When a sub-buffer is recovered from a read, the commit value may
+ * have RB_MISSED_* bits set, as these bits are reset on reuse.
+ * Even after clearing these bits, a commit value greater than the
+ * subbuf_size is considered invalid.
+ */
+ tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ if (tail > meta->subbuf_size)
+ return -1;
return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
}
@@ -1890,6 +1904,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
struct buffer_page *head_page, *orig_head;
unsigned long entry_bytes = 0;
unsigned long entries = 0;
+ int discarded = 0;
int ret;
u64 ts;
int i;
@@ -1900,14 +1915,19 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_head = head_page = cpu_buffer->head_page;
/* Do the reader page first */
- ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer reader page is invalid\n");
- goto invalid;
+ pr_info("Ring buffer meta [%d] invalid reader page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&cpu_buffer->reader_page->entries, 0);
+ local_set(&cpu_buffer->reader_page->page->commit, 0);
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(cpu_buffer->reader_page);
+ local_set(&cpu_buffer->reader_page->entries, ret);
}
- entries += ret;
- entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
- local_set(&cpu_buffer->reader_page->entries, ret);
ts = head_page->page->time_stamp;
@@ -1935,7 +1955,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
break;
/* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0)
break;
@@ -2014,21 +2034,24 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->reader_page)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer meta [%d] invalid buffer page\n",
- cpu_buffer->cpu);
- goto invalid;
- }
-
- /* If the buffer has content, update pages_touched */
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
-
- entries += ret;
- entry_bytes += local_read(&head_page->page->commit);
- local_set(&head_page->entries, ret);
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&head_page->entries, 0);
+ local_set(&head_page->page->commit, 0);
+ } else {
+ /* If the buffer has content, update pages_touched */
+ if (ret)
+ local_inc(&cpu_buffer->pages_touched);
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ local_set(&head_page->entries, ret);
+ }
if (head_page == cpu_buffer->commit_page)
break;
}
@@ -2042,7 +2065,10 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
local_set(&cpu_buffer->entries, entries);
local_set(&cpu_buffer->entries_bytes, entry_bytes);
- pr_info("Ring buffer meta [%d] is from previous boot!\n", cpu_buffer->cpu);
+ pr_info("Ring buffer meta [%d] is from previous boot!", cpu_buffer->cpu);
+ if (discarded)
+ pr_cont(" (%d pages discarded)", discarded);
+ pr_cont("\n");
return;
invalid:
@@ -3329,12 +3355,6 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
return NULL;
}
-/* Size is determined by what has been committed */
-static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
-{
- return rb_page_commit(bpage) & ~RB_MISSED_MASK;
-}
-
static __always_inline unsigned
rb_commit_index(struct ring_buffer_per_cpu *cpu_buffer)
{
@@ -5647,11 +5667,12 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
again:
/*
* This should normally only loop twice. But because the
- * start of the reader inserts an empty page, it causes
- * a case where we will loop three times. There should be no
- * reason to loop four times (that I know of).
+ * start of the reader inserts an empty page, it causes a
+ * case where we will loop three times. There should be no
+ * reason to loop four times unless the ring buffer is a
+ * recovered persistent ring buffer.
*/
- if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
+ if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3 && !cpu_buffer->ring_meta)) {
reader = NULL;
goto out;
}
^ permalink raw reply related
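The commit check that this patch adds to rb_validate_buffer() can be sketched in user space as follows (an illustration only, not part of the patch). The RB_MISSED_* values below are assumed to occupy the top two bits of the commit word, and `struct demo_subbuf` and the `demo_*()` helpers are hypothetical names.

```c
#include <stdint.h>

/* Assumed flag layout: missed-event flags live in the top two bits. */
#define RB_MISSED_EVENTS	(1u << 31)
#define RB_MISSED_STORED	(1u << 30)
#define RB_MISSED_MASK		(RB_MISSED_EVENTS | RB_MISSED_STORED)

/* Hypothetical stand-in for the per-sub-buffer header. */
struct demo_subbuf {
	uint32_t commit;	/* committed bytes, possibly with flag bits */
};

/* Committed size with the flag bits stripped, like rb_page_size(). */
static uint32_t demo_page_size(const struct demo_subbuf *sb)
{
	return sb->commit & ~RB_MISSED_MASK;
}

/* Return 0 if the commit value is plausible, -1 if it must be discarded. */
static int demo_validate_commit(const struct demo_subbuf *sb,
				uint32_t subbuf_size)
{
	return demo_page_size(sb) > subbuf_size ? -1 : 0;
}

/*
 * Validate every sub-buffer. Invalid ones are cleared and counted
 * instead of aborting the whole recovery. Returns the number discarded
 * and stores the recovered byte count in *bytes.
 */
static int demo_validate_all(struct demo_subbuf *sbs, int nr,
			     uint32_t subbuf_size, unsigned long *bytes)
{
	int discarded = 0;

	*bytes = 0;
	for (int i = 0; i < nr; i++) {
		if (demo_validate_commit(&sbs[i], subbuf_size) < 0) {
			sbs[i].commit = 0;	/* discard only this one */
			discarded++;
		} else {
			*bytes += demo_page_size(&sbs[i]);
		}
	}
	return discarded;
}
```

Masking before the comparison matters: without it, a sub-buffer that merely carries a missed-events flag would look larger than the sub-buffer size and be discarded.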
* [RESEND PATCH v16 1/5] ring-buffer: Flush and stop persistent ring buffer on panic
From: Masami Hiramatsu (Google) @ 2026-04-07 1:12 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177552432201.853249.5125045538812833325.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
On real hardware, a panic and the following machine reboot may not flush
the hardware cache to memory. The persistent ring buffer relies on a
coherent state of memory, so events that are still in the cache may
never be written to the buffer and may be lost. Moreover, the counters
used to validate the integrity of the persistent ring buffer may become
inconsistent, which may cause all data to be discarded.
To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.
Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
---
Changes in v13:
- Fix a rebase conflict.
Changes in v11:
- Do nothing by default, since flush_cache_vmap() does nothing on x86
and can cause a deadlock on some architectures via on_each_cpu(),
because other CPUs are already stopped when the panic notifier is called.
Changes in v9:
- Fix typo of & to &&.
- Fix typo of "Generic"
Changes in v6:
- Introduce asm/ring_buffer.h for arch_ring_buffer_flush_range().
- Use flush_cache_vmap() instead of flush_cache_all().
Changes in v5:
- Use ring_buffer_record_off() instead of ring_buffer_record_disable().
- Use flush_cache_all() to ensure all caches are flushed.
Changes in v3:
- update patch description.
---
arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/ring_buffer.h | 10 ++++++++++
arch/csky/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/loongarch/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/x86/include/asm/Kbuild | 1 +
arch/xtensa/include/asm/Kbuild | 1 +
include/asm-generic/ring_buffer.h | 13 +++++++++++++
kernel/trace/ring_buffer.c | 22 ++++++++++++++++++++++
23 files changed, 65 insertions(+)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 483965c5a4de..b154b4e3dfa8 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += agp.h
generic-y += asm-offsets.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index 4c69522e0328..483caacc6988 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -5,5 +5,6 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 03657ff8fbe3..decad5f2c826 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -3,6 +3,7 @@ generic-y += early_ioremap.h
generic-y += extable.h
generic-y += flat.h
generic-y += parport.h
+generic-y += ring_buffer.h
generated-y += mach-types.h
generated-y += unistd-nr.h
diff --git a/arch/arm64/include/asm/ring_buffer.h b/arch/arm64/include/asm/ring_buffer.h
new file mode 100644
index 000000000000..62316c406888
--- /dev/null
+++ b/arch/arm64/include/asm/ring_buffer.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_ARM64_RING_BUFFER_H
+#define _ASM_ARM64_RING_BUFFER_H
+
+#include <asm/cacheflush.h>
+
+/* Flush D-cache on persistent ring buffer */
+#define arch_ring_buffer_flush_range(start, end) dcache_clean_pop(start, end)
+
+#endif /* _ASM_ARM64_RING_BUFFER_H */
diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild
index 3a5c7f6e5aac..7dca0c6cdc84 100644
--- a/arch/csky/include/asm/Kbuild
+++ b/arch/csky/include/asm/Kbuild
@@ -9,6 +9,7 @@ generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
generic-y += text-patching.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index 1efa1e993d4b..0f887d4238ed 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += extable.h
generic-y += iomap.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild
index 9034b583a88a..7e92957baf6a 100644
--- a/arch/loongarch/include/asm/Kbuild
+++ b/arch/loongarch/include/asm/Kbuild
@@ -10,5 +10,6 @@ generic-y += qrwlock.h
generic-y += user.h
generic-y += ioctl.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
generic-y += statfs.h
generic-y += text-patching.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index b282e0dd8dc1..62543bf305ff 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -3,5 +3,6 @@ generated-y += syscall_table.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += text-patching.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index 7178f990e8b3..0030309b47ad 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += syscalls.h
generic-y += tlb.h
generic-y += user.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 684569b2ecd6..9771c3d85074 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -12,5 +12,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 28004301c236..0a2530964413 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += cmpxchg.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index cef49d60d74c..8aa34621702d 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -8,4 +8,5 @@ generic-y += spinlock_types.h
generic-y += spinlock.h
generic-y += qrwlock_types.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index 4fb596d94c89..d48d158f7241 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 2e23533b67e3..805b5aeebb6f 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -5,4 +5,5 @@ generated-y += syscall_table_spu.h
generic-y += agp.h
generic-y += mcs_spinlock.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += early_ioremap.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index bd5fc9403295..7721b63642f4 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -14,5 +14,6 @@ generic-y += ticket_spinlock.h
generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 80bad7de7a04..0c1fc47c3ba0 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,3 +7,4 @@ generated-y += unistd_nr.h
generic-y += asm-offsets.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 4d3f10ed8275..f0403d3ee8ab 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -3,4 +3,5 @@ generated-y += syscall_table.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 17ee8a273aa6..49c6bb326b75 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 1b9b82bbe322..2a1629ba8140 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -17,6 +17,7 @@ generic-y += module.lds.h
generic-y += parport.h
generic-y += percpu.h
generic-y += preempt.h
+generic-y += ring_buffer.h
generic-y += runtime-const.h
generic-y += softirq_stack.h
generic-y += switch_to.h
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4566000e15c4..078fd2c0d69d 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -14,3 +14,4 @@ generic-y += early_ioremap.h
generic-y += fprobe.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 13fe45dea296..e57af619263a 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -6,5 +6,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/include/asm-generic/ring_buffer.h b/include/asm-generic/ring_buffer.h
new file mode 100644
index 000000000000..201d2aee1005
--- /dev/null
+++ b/include/asm-generic/ring_buffer.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Generic arch dependent ring_buffer macros.
+ */
+#ifndef __ASM_GENERIC_RING_BUFFER_H__
+#define __ASM_GENERIC_RING_BUFFER_H__
+
+#include <linux/cacheflush.h>
+
+/* Flush cache on ring buffer range if needed. Do nothing by default. */
+#define arch_ring_buffer_flush_range(start, end) do { } while (0)
+
+#endif /* __ASM_GENERIC_RING_BUFFER_H__ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2caa5d3d0ae9..4d5817286791 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7,6 +7,7 @@
#include <linux/ring_buffer_types.h>
#include <linux/sched/isolation.h>
#include <linux/trace_recursion.h>
+#include <linux/panic_notifier.h>
#include <linux/trace_events.h>
#include <linux/ring_buffer.h>
#include <linux/trace_clock.h>
@@ -31,6 +32,7 @@
#include <linux/oom.h>
#include <linux/mm.h>
+#include <asm/ring_buffer.h>
#include <asm/local64.h>
#include <asm/local.h>
#include <asm/setup.h>
@@ -559,6 +561,7 @@ struct trace_buffer {
unsigned long range_addr_start;
unsigned long range_addr_end;
+ struct notifier_block flush_nb;
struct ring_buffer_meta *meta;
@@ -2520,6 +2523,16 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+/* Stop recording on a persistent buffer and flush cache if needed. */
+static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
+{
+ struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
+
+ ring_buffer_record_off(buffer);
+ arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
+ return NOTIFY_DONE;
+}
+
static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
int order, unsigned long start,
unsigned long end,
@@ -2650,6 +2663,12 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
mutex_init(&buffer->mutex);
+ /* Persistent ring buffer needs to flush cache before reboot. */
+ if (start && end) {
+ buffer->flush_nb.notifier_call = rb_flush_buffer_cb;
+ atomic_notifier_chain_register(&panic_notifier_list, &buffer->flush_nb);
+ }
+
return_ptr(buffer);
fail_free_buffers:
@@ -2748,6 +2767,9 @@ ring_buffer_free(struct trace_buffer *buffer)
{
int cpu;
+ if (buffer->range_addr_start && buffer->range_addr_end)
+ atomic_notifier_chain_unregister(&panic_notifier_list, &buffer->flush_nb);
+
cpuhp_state_remove_instance(CPUHP_TRACE_RB_PREPARE, &buffer->node);
irq_work_sync(&buffer->irq_work.work);
^ permalink raw reply related
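The shape of the panic hook in this patch can be illustrated with a small user-space model (a sketch only, not part of the patch). Two things here are assumptions: the `#ifndef` fallback replaces the Kbuild `generic-y` mechanism that the patch really uses to select the generic header, and `struct demo_notifier`, `struct demo_buffer` and the `demo_*()` functions are hypothetical stand-ins for the kernel types.

```c
#include <stddef.h>

/*
 * An architecture would define this to a real cache flush; the
 * generic fallback does nothing, as in asm-generic/ring_buffer.h.
 */
#ifndef arch_ring_buffer_flush_range
#define arch_ring_buffer_flush_range(start, end) do { } while (0)
#endif

/* Recover the enclosing structure from a pointer to one of its members. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct notifier_block. */
struct demo_notifier {
	int (*call)(struct demo_notifier *nb, unsigned long event, void *data);
};

/* Hypothetical stand-in for struct trace_buffer. */
struct demo_buffer {
	unsigned long range_start;
	unsigned long range_end;
	int recording;
	struct demo_notifier flush_nb;	/* embedded, so the callback can find us */
};

/* Stop recording first, then flush the buffer's memory range. */
static int demo_flush_cb(struct demo_notifier *nb, unsigned long event,
			 void *data)
{
	struct demo_buffer *buf =
		demo_container_of(nb, struct demo_buffer, flush_nb);

	(void)event;
	(void)data;
	buf->recording = 0;
	arch_ring_buffer_flush_range(buf->range_start, buf->range_end);
	return 0;
}

/* Only a buffer backed by a fixed address range gets the hook. */
static void demo_buffer_init(struct demo_buffer *buf, unsigned long start,
			     unsigned long end)
{
	buf->range_start = start;
	buf->range_end = end;
	buf->recording = 1;
	buf->flush_nb.call = (start && end) ? demo_flush_cb : NULL;
}
```

Embedding the notifier in the buffer means no extra lookup table is needed: the callback receives only the notifier pointer and derives the buffer from it.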
* [RESEND PATCH v16 0/5] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu (Google) @ 2026-04-07 1:12 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
[Resending this series with a base-commit tag so that the bot can apply it correctly]
Hi,
Here is the 16th version of the patches that make persistent ring
buffers robust to failures.
The previous version is here:
https://lore.kernel.org/all/177494615421.71933.3679132057004156013.stgit@mhiramat.tok.corp.google.com/
This version adds Catalin's Ack to [1/5] and updates the description and
documentation in [4/5] and [5/5]. It is also rebased on ring-buffer/for-next.
Thank you,
Masami Hiramatsu (Google) (5):
ring-buffer: Flush and stop persistent ring buffer on panic
ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
ring-buffer: Add persistent ring buffer invalid-page inject test
ring-buffer: Show commit numbers in buffer_meta file
arch/alpha/include/asm/Kbuild | 1
arch/arc/include/asm/Kbuild | 1
arch/arm/include/asm/Kbuild | 1
arch/arm64/include/asm/ring_buffer.h | 10 +
arch/csky/include/asm/Kbuild | 1
arch/hexagon/include/asm/Kbuild | 1
arch/loongarch/include/asm/Kbuild | 1
arch/m68k/include/asm/Kbuild | 1
arch/microblaze/include/asm/Kbuild | 1
arch/mips/include/asm/Kbuild | 1
arch/nios2/include/asm/Kbuild | 1
arch/openrisc/include/asm/Kbuild | 1
arch/parisc/include/asm/Kbuild | 1
arch/powerpc/include/asm/Kbuild | 1
arch/riscv/include/asm/Kbuild | 1
arch/s390/include/asm/Kbuild | 1
arch/sh/include/asm/Kbuild | 1
arch/sparc/include/asm/Kbuild | 1
arch/um/include/asm/Kbuild | 1
arch/x86/include/asm/Kbuild | 1
arch/xtensa/include/asm/Kbuild | 1
include/asm-generic/ring_buffer.h | 13 ++
include/linux/ring_buffer.h | 1
kernel/trace/Kconfig | 34 ++++
kernel/trace/ring_buffer.c | 258 ++++++++++++++++++++++++++--------
kernel/trace/trace.c | 4 +
26 files changed, 276 insertions(+), 64 deletions(-)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
base-commit: 3515572dd068895ffd241b8a69399a0ebfac7593
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Joe Lawrence @ 2026-04-06 21:12 UTC (permalink / raw)
To: Song Liu
Cc: Yafang Shao, Dylan Hatch, jpoimboe, jikos, mbenes, pmladek,
rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
bpf
In-Reply-To: <CAPhsuW4B00-grg9XJa+AO3xgGwM_u8FC+GH3JrkYZOJx4PuV8Q@mail.gmail.com>
On Mon, Apr 06, 2026 at 11:11:27AM -0700, Song Liu wrote:
> On Mon, Apr 6, 2026 at 4:08 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Sat, Apr 4, 2026 at 5:36 AM Song Liu <song@kernel.org> wrote:
> > >
> > > On Fri, Apr 3, 2026 at 1:55 PM Dylan Hatch <dylanbhatch@google.com> wrote:
> > > [...]
> > > > > IIRC, the use case for this change is when multiple users load various
> > > > > livepatch modules on the same system. I still don't believe this is the
> > > > > right way to manage livepatches. That said, I won't really NACK this
> > > > > if other folks think this is a useful option.
> > > >
> > > > In our production fleet, we apply exactly one cumulative livepatch
> > > > module, and we use per-kernel build "livepatch release" branches to
> > > > track the contents of these cumulative livepatches. This model has
> > > > worked relatively well for us, but there are some painpoints.
> > > >
> > > > We are often under pressure to selectively deploy a livepatch fix to
> > > > certain subpopulations of production. If the subpopulation is running
> > > > the same build of everything else, this would require us to introduce
> > > > another branching factor to the "livepatch release" branches --
> > > > something we do not support due to the added toil and complexity.
> > > >
> > > > However, if we had the ability to build "off-band" livepatch modules
> > > > that were marked as non-replaceable, we could support these selective
> > > > patches without the additional branching factor. I will have to
> > > > circulate the idea internally, but to me this seems like a very useful
> > > > option to have in certain cases.
> > >
> > > IIUC, the plan is:
> > >
> > > - The regular livepatches are cumulative, have the replace flag; and
> > > are replaceable.
> > > - The occasional "off-band" livepatches do not have the replace flag,
> > > and are not replaceable.
> > >
> > > With this setup, for systems with off-band livepatches loaded, we can
> > > still release a cumulative livepatch to replace the previous cumulative
> > > livepatch. Is this the expected use case?
> >
> > That matches our expected use case.
>
> If we really want to serve use cases like this, I think we can introduce
> some replace tag concept: Each livepatch will have a tag, u32 number.
> Newly loaded livepatch will only replace existing livepatch with the
> same tag. We can even reuse the existing "bool replace" in klp_patch,
> and make it u32: replace=0 means no replace; replace > 0 are the
> replace tag.
>
> For current users of cumulative patches, all the livepatch will have the
> same tag, say 1. For your use case, you can assign each user a
> unique tag. Then all these users can do atomic upgrades of their
> own livepatches.
>
> We may also need to check whether two livepatches of different tags
> touch the same kernel function. When that happens, the later
> livepatch should fail to load.
>
> Does this make sense?
>
I haven't been following the thread carefully, but could the Livepatch
system state API (see Documentation/livepatch/system-state.rst) be
leveraged somehow instead of adding further replace semantics?
--
Joe
^ permalink raw reply
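For readers following the thread, the "replace tag" idea quoted above can be sketched as below. This is only an illustration of the proposal as worded in the discussion, not an existing livepatch API: `struct demo_patch`, `demo_replaces()` and `demo_conflicts()` are hypothetical, and the sketch assumes that two patches with the same tag never conflict (tag 0 patches stack as they do today).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_FUNCS 4

/* Hypothetical patch descriptor: replace == 0 means "no replace". */
struct demo_patch {
	uint32_t replace;			/* replace tag */
	const char *funcs[DEMO_MAX_FUNCS];	/* patched function names */
	int nr_funcs;
};

/* A new patch replaces an old one only if both carry the same non-zero tag. */
static bool demo_replaces(const struct demo_patch *new_patch,
			  const struct demo_patch *old_patch)
{
	return new_patch->replace != 0 &&
	       new_patch->replace == old_patch->replace;
}

/*
 * Patches with different tags must not touch the same function;
 * if they do, loading the later patch should fail.
 */
static bool demo_conflicts(const struct demo_patch *new_patch,
			   const struct demo_patch *old_patch)
{
	if (new_patch->replace == old_patch->replace)
		return false;

	for (int i = 0; i < new_patch->nr_funcs; i++)
		for (int j = 0; j < old_patch->nr_funcs; j++)
			if (!strcmp(new_patch->funcs[i], old_patch->funcs[j]))
				return true;
	return false;
}
```

With this rule, each tag forms its own cumulative series that can be upgraded atomically, while the overlap check keeps the series from interfering with each other.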
* Re: [RFC PATCH bpf-next v5 1/2] tracing: Prefer vmlinux symbols over module symbols for unqualified kprobes
From: bot+bpf-ci @ 2026-04-06 20:15 UTC (permalink / raw)
To: andrey.grodzovsky, bpf, linux-trace-kernel
Cc: ast, daniel, andrii, jolsa, rostedt, mhiramat, ihor.solodrai,
emil, linux-open-source, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260406193158.754498-2-andrey.grodzovsky@crowdstrike.com>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
> Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
This commit fixes a bug introduced when module symbol counting was
added to number_of_same_symbols(). Would it be worth adding a
Fixes: tag?
Suggested:
Fixes: 926fe783c8a6 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well")
That commit added module_kallsyms_on_each_symbol() to
number_of_same_symbols(), which caused unqualified kprobe targets
matching both vmlinux and a module to return count > 1 and fail
with -EADDRNOTAVAIL.
---
AI reviewed your patch. Please fix the bug or reply by email explaining why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24047967861
^ permalink raw reply
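The failure mode described in this review, and the preference named in the patch subject, can be sketched as follows. This is a hedged model only: the patch body is not shown in this message, so `demo_count_all()`, `demo_count_prefer_vmlinux()` and `demo_resolve()` are hypothetical helpers that take match counts as plain numbers rather than walking kallsyms.

```c
#include <errno.h>

/* Behaviour described for the earlier commit: vmlinux and module matches add up. */
static unsigned int demo_count_all(unsigned int vmlinux_matches,
				   unsigned int module_matches)
{
	return vmlinux_matches + module_matches;
}

/*
 * Assumed reading of "prefer vmlinux symbols": module matches are only
 * considered when vmlinux has no symbol of that name.
 */
static unsigned int demo_count_prefer_vmlinux(unsigned int vmlinux_matches,
					      unsigned int module_matches)
{
	return vmlinux_matches ? vmlinux_matches : module_matches;
}

/* An unqualified kprobe target must resolve to exactly one symbol. */
static int demo_resolve(unsigned int count)
{
	if (count == 0)
		return -ENOENT;
	if (count > 1)
		return -EADDRNOTAVAIL;
	return 0;
}
```

Under these assumptions, a name that exists once in vmlinux and once in a module is rejected by the summing count but accepted when vmlinux is preferred.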
* Re: [PATCH bpf v3 2/2] selftests/bpf: Add test to ensure kprobe_multi is not sleepable
From: Kumar Kartikeya Dwivedi @ 2026-04-06 20:11 UTC (permalink / raw)
To: Jiri Olsa
Cc: Varun R Mallya, bpf, ast, daniel, yonghong.song, rostedt,
mhiramat, linux-kernel, linux-trace-kernel
In-Reply-To: <ac47BIEUBBkTch31@krava>
On Thu, 2 Apr 2026 at 11:46, Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Apr 02, 2026 at 12:50:10AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Wed, 1 Apr 2026 at 21:11, Varun R Mallya <varunrmallya@gmail.com> wrote:
> > >
> > > Add a selftest to ensure that kprobe_multi programs cannot be attached
> > > using the BPF_F_SLEEPABLE flag. This test succeeds when the kernel
> > > rejects attachment of kprobe_multi when the BPF_F_SLEEPABLE flag is set.
> > >
> > > Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
> > > ---
> >
> > Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> >
> > > .../bpf/prog_tests/kprobe_multi_test.c | 41 +++++++++++++++++++
> > > .../bpf/progs/kprobe_multi_sleepable.c | 13 ++++++
> > > 2 files changed, 54 insertions(+)
> > > create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
> > > index 78c974d4ea33..f02fec2b6fda 100644
> > > --- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
> > > +++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
> > > @@ -10,6 +10,7 @@
> > > #include "kprobe_multi_session_cookie.skel.h"
> > > #include "kprobe_multi_verifier.skel.h"
> > > #include "kprobe_write_ctx.skel.h"
> > > +#include "kprobe_multi_sleepable.skel.h"
> > > #include "bpf/libbpf_internal.h"
> > > #include "bpf/hashmap.h"
> > >
> > > @@ -633,6 +634,44 @@ static void test_attach_write_ctx(void)
> > > }
> > > #endif
> > >
> > > +static void test_attach_multi_sleepable(void)
> > > +{
> > > + struct kprobe_multi_sleepable *skel;
> > > + int err;
> > > +
> > > + skel = kprobe_multi_sleepable__open();
> > > + if (!ASSERT_OK_PTR(skel, "kprobe_multi_sleepable__open"))
> > > + return;
> > > +
> > > + err = bpf_program__set_flags(skel->progs.handle_kprobe_multi_sleepable,
> > > + BPF_F_SLEEPABLE);
> > > + if (!ASSERT_OK(err, "bpf_program__set_flags"))
> > > + goto cleanup;
> > > +
> > > + /* Load should succeed even with BPF_F_SLEEPABLE for KPROBE types */
> > > + err = kprobe_multi_sleepable__load(skel);
> > > + if (!ASSERT_OK(err, "kprobe_multi_sleepable__load"))
> > > + goto cleanup;
> > > +
> > > + /* Attachment must fail for kprobe.multi + BPF_F_SLEEPABLE.
> > > + * Also chosen a stable symbol to send into opts
> > > + */
> > > + LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
> > > + const char *sym = "vfs_read";
> > > +
> > > + opts.syms = &sym;
> > > + opts.cnt = 1;
> > > +
> > > + skel->links.handle_kprobe_multi_sleepable =
> > > + bpf_program__attach_kprobe_multi_opts(skel->progs.handle_kprobe_multi_sleepable,
> > > + NULL, &opts);
> > > + ASSERT_ERR_PTR(skel->links.handle_kprobe_multi_sleepable,
> > > + "bpf_program__attach_kprobe_multi_opts");
> >
> > Nit: While vfs_read will likely remain stable, the check could
> > probably be stronger to distinguish an attach error from -EINVAL?
> > I added a typo to vfs_read and it still passed, because it failed to
> > attach instead of getting rejected on unfixed kernel.
> > May not be a big deal since vfs_read is unlikely to break.
> > I verified it works by adding bpf_copy_from_user to the program and
> > attaching to SYS_PREFIX sys_getpid and invoking the splat though, so
> > LGTM otherwise.
>
> why not use bpf_fentry_test2 ? you could also put it in pattern argument
> and bypass opts completely (up to you)
>
> also there's test_attach_api_fails test, please move it over there
>
Varun, the selftest is still not applied, only the fix. Please follow
up and target bpf-next tree this time.
Thanks.
* [RFC PATCH bpf-next v5 0/2] tracing: Fix kprobe attachment when module shadows vmlinux symbol
From: Andrey Grodzovsky @ 2026-04-06 19:31 UTC (permalink / raw)
To: bpf, linux-trace-kernel
Cc: ast, daniel, andrii, jolsa, rostedt, mhiramat, ihor.solodrai,
emil, linux-open-source
When a kernel module exports a symbol with the same name as an existing
vmlinux symbol, kprobe attachment fails with -EADDRNOTAVAIL because
number_of_same_symbols() counts matches across both vmlinux and all
loaded modules, returning a count greater than 1.
This series takes a different approach from v1-v4, which implemented a
libbpf-side fallback parsing /proc/kallsyms and retrying with the
absolute address. That approach was rejected (Andrii Nakryiko, Ihor
Solodrai) because ambiguous symbol resolution does not belong in libbpf,
and because it did not cover the kprobe_multi path.
Following Ihor's suggestion, this series fixes the root cause in the
kernel: when an unqualified symbol name is given and the symbol is found
in vmlinux, prefer the vmlinux symbol and do not scan loaded modules.
This makes the skeleton auto-attach path work transparently with no
libbpf changes needed.
Patch 1: Kernel fix - return vmlinux-only count from
number_of_same_symbols() when the symbol is found in vmlinux,
preventing module shadows from causing -EADDRNOTAVAIL.
Patch 2: Selftests with bpf_testmod_dup_sym.ko test module validating
kprobe attachment across all four attach modes with a duplicate
symbol present. Unchanged from v4.
Changes since v4 [1]:
- Completely rework the approach: move fix from libbpf to the kernel
(number_of_same_symbols() in trace_kprobe.c) as suggested by Ihor
Solodrai. No libbpf changes needed.
- When mod==NULL and vmlinux contains the symbol (count > 0), return
the vmlinux-only count immediately, skipping module scan entirely.
- Preserves all existing semantics: MOD:SYM qualification unchanged,
module-only symbols unchanged, vmlinux-ambiguous symbols unchanged.
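The lookup order described above can be modelled in user space. This is
only a sketch under stated assumptions: struct sym_entry, demo_syms,
count_matches() and model_number_of_same_symbols() are hypothetical
names, and the flat table stands in for kallsyms. It is not the code in
trace_kprobe.c.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flat symbol table; mod == NULL marks a vmlinux symbol. */
struct sym_entry {
	const char *mod;
	const char *name;
};

static const struct sym_entry demo_syms[] = {
	{ NULL,      "sys_nanosleep" },	/* vmlinux copy */
	{ "dup_mod", "sys_nanosleep" },	/* module symbol with the same name */
	{ "dup_mod", "only_in_mod" },	/* module-only symbol */
	{ NULL,      "twice" },		/* ambiguous within vmlinux itself */
	{ NULL,      "twice" },
};

#define DEMO_NSYMS (sizeof(demo_syms) / sizeof(demo_syms[0]))

/*
 * Count matches either in vmlinux (in_vmlinux != 0) or in modules.
 * For modules, a non-NULL mod restricts the scan to that module.
 */
static unsigned int count_matches(int in_vmlinux, const char *mod,
				  const char *func_name)
{
	unsigned int count = 0;
	size_t i;

	for (i = 0; i < DEMO_NSYMS; i++) {
		const struct sym_entry *s = &demo_syms[i];

		if (strcmp(s->name, func_name) != 0)
			continue;
		if (in_vmlinux) {
			if (!s->mod)
				count++;
		} else if (s->mod && (!mod || strcmp(s->mod, mod) == 0)) {
			count++;
		}
	}
	return count;
}

static unsigned int model_number_of_same_symbols(const char *mod,
						 const char *func_name)
{
	unsigned int count = 0;

	if (!mod)
		count = count_matches(1, NULL, func_name);

	/* The v5 change: a vmlinux hit ends the search for unqualified names. */
	if (!mod && count > 0)
		return count;

	return count + count_matches(0, mod, func_name);
}
```

With this table, an unqualified lookup of sys_nanosleep yields 1. Without
the early return the model would add the module match and yield 2, which
is the case that ends in -EADDRNOTAVAIL.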
[1] https://lore.kernel.org/bpf/20260302210532.381083-1-andrey.grodzovsky@crowdstrike.com/
Andrey Grodzovsky (2):
tracing: Prefer vmlinux symbols over module symbols for unqualified
kprobes
selftests/bpf: Add tests for duplicate kprobe symbol handling
kernel/trace/trace_kprobe.c | 7 +++
tools/testing/selftests/bpf/Makefile | 2 +-
.../selftests/bpf/prog_tests/attach_probe.c | 63 +++++++++++++++++++
.../testing/selftests/bpf/test_kmods/Makefile | 2 +-
.../bpf/test_kmods/bpf_testmod_dup_sym.c | 48 ++++++++++++++
5 files changed, 120 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/bpf/test_kmods/bpf_testmod_dup_sym.c
--
2.34.1
* Re: [syzbot] [block?] [trace?] INFO: task hung in blk_trace_startstop
From: syzbot @ 2026-04-06 19:55 UTC (permalink / raw)
To: axboe, linux-block, linux-kernel, linux-trace-kernel,
mathieu.desnoyers, mhiramat, rostedt, syzkaller-bugs
In-Reply-To: <691367ae.a70a0220.22f260.0141.GAE@google.com>
syzbot has found a reproducer for the following issue on:
HEAD commit: 591cd656a1bf Linux 7.0-rc7
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=129136ba580000
kernel config: https://syzkaller.appspot.com/x/.config?x=6754c86e8d9e4c91
dashboard link: https://syzkaller.appspot.com/bug?extid=774863666ef5b025c9d0
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1268ad4e580000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1580b3da580000
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/6382829d7cc5/disk-591cd656.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/17a325d524d5/vmlinux-591cd656.xz
kernel image: https://storage.googleapis.com/syzbot-assets/0a06ea295210/bzImage-591cd656.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+774863666ef5b025c9d0@syzkaller.appspotmail.com
INFO: task syz.2.19:6128 blocked for more than 143 seconds.
Not tainted syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz.2.19 state:D stack:27960 pid:6128 tgid:6125 ppid:5955 task_flags:0x400040 flags:0x00080002
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5298 [inline]
__schedule+0x15dd/0x52d0 kernel/sched/core.c:6911
__schedule_loop kernel/sched/core.c:6993 [inline]
schedule+0x164/0x360 kernel/sched/core.c:7008
schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:7065
__mutex_lock_common kernel/locking/mutex.c:692 [inline]
__mutex_lock+0x7fe/0x1300 kernel/locking/mutex.c:776
blk_debugfs_lock_nomemsave block/blk.h:740 [inline]
blk_trace_startstop+0x8f/0x610 kernel/trace/blktrace.c:903
blk_trace_ioctl+0x314/0x920 kernel/trace/blktrace.c:949
blkdev_common_ioctl+0x13a7/0x3250 block/ioctl.c:724
blkdev_ioctl+0x528/0x740 block/ioctl.c:798
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f7f6ab9c819
RSP: 002b:00007f7f6baad028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f7f6ae16090 RCX: 00007f7f6ab9c819
RDX: 0000000000000000 RSI: 0000000000001274 RDI: 0000000000000003
RBP: 00007f7f6ac32c91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
---
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.
* [RFC PATCH bpf-next v5 2/2] selftests/bpf: Add tests for duplicate kprobe symbol handling
From: Andrey Grodzovsky @ 2026-04-06 19:31 UTC (permalink / raw)
To: bpf, linux-trace-kernel
Cc: ast, daniel, andrii, jolsa, rostedt, mhiramat, ihor.solodrai,
emil, linux-open-source
In-Reply-To: <20260406193158.754498-1-andrey.grodzovsky@crowdstrike.com>
Add bpf_testmod_dup_sym.ko test module that creates a duplicate
nanosleep symbol to test kprobe attachment when a module exports
a symbol with the same name as a vmlinux symbol.
Add test_attach_probe_dup_sym() to attach_probe tests that loads
the duplicate symbol module and validates kprobe attachment succeeds
across all four attach modes: default, legacy, perf_event_open, and
link — relying on the kernel fix to vmlinux-prefer unqualified symbol
resolution.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
---
tools/testing/selftests/bpf/Makefile | 2 +-
.../selftests/bpf/prog_tests/attach_probe.c | 63 +++++++++++++++++++
.../testing/selftests/bpf/test_kmods/Makefile | 2 +-
.../bpf/test_kmods/bpf_testmod_dup_sym.c | 48 ++++++++++++++
4 files changed, 113 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/bpf/test_kmods/bpf_testmod_dup_sym.c
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index f75c4f52c028..cceb3fcc97a2 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -121,7 +121,7 @@ TEST_PROGS_EXTENDED := \
ima_setup.sh verify_sig_setup.sh
TEST_KMODS := bpf_testmod.ko bpf_test_no_cfi.ko bpf_test_modorder_x.ko \
- bpf_test_modorder_y.ko bpf_test_rqspinlock.ko
+ bpf_test_modorder_y.ko bpf_test_rqspinlock.ko bpf_testmod_dup_sym.ko
TEST_KMOD_TARGETS = $(addprefix $(OUTPUT)/,$(TEST_KMODS))
# Compile but not part of 'make run_tests'
diff --git a/tools/testing/selftests/bpf/prog_tests/attach_probe.c b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
index 12a841afda68..04b177ee3adf 100644
--- a/tools/testing/selftests/bpf/prog_tests/attach_probe.c
+++ b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
@@ -4,6 +4,7 @@
#include "test_attach_probe_manual.skel.h"
#include "test_attach_probe.skel.h"
#include "kprobe_write_ctx.skel.h"
+#include "testing_helpers.h"
/* this is how USDT semaphore is actually defined, except volatile modifier */
volatile unsigned short uprobe_ref_ctr __attribute__((unused)) __attribute((section(".probes")));
@@ -197,6 +198,59 @@ static void test_attach_kprobe_legacy_by_addr_reject(void)
test_attach_probe_manual__destroy(skel);
}
+/* Test kprobe attachment with duplicate symbols.
+ * This test loads bpf_testmod_dup_sym.ko which creates a duplicate
+ * __x64_sys_nanosleep symbol. The kernel fix should prefer the vmlinux
+ * symbol over the module symbol when attaching kprobes.
+ */
+static void test_attach_probe_dup_sym(enum probe_attach_mode attach_mode)
+{
+ DECLARE_LIBBPF_OPTS(bpf_kprobe_opts, kprobe_opts);
+ struct bpf_link *kprobe_link, *kretprobe_link;
+ struct test_attach_probe_manual *skel;
+ int err;
+
+ /* Load module with duplicate symbol */
+ err = load_module("bpf_testmod_dup_sym.ko", false);
+ if (!ASSERT_OK(err, "load_bpf_testmod_dup_sym")) {
+ test__skip();
+ return;
+ }
+
+ skel = test_attach_probe_manual__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_dup_sym_open_and_load"))
+ goto unload_module;
+
+ /* manual-attach kprobe/kretprobe with duplicate symbol present */
+ kprobe_opts.attach_mode = attach_mode;
+ kprobe_opts.retprobe = false;
+ kprobe_link = bpf_program__attach_kprobe_opts(skel->progs.handle_kprobe,
+ SYS_NANOSLEEP_KPROBE_NAME,
+ &kprobe_opts);
+ if (!ASSERT_OK_PTR(kprobe_link, "attach_kprobe_dup_sym"))
+ goto cleanup;
+ skel->links.handle_kprobe = kprobe_link;
+
+ kprobe_opts.retprobe = true;
+ kretprobe_link = bpf_program__attach_kprobe_opts(skel->progs.handle_kretprobe,
+ SYS_NANOSLEEP_KPROBE_NAME,
+ &kprobe_opts);
+ if (!ASSERT_OK_PTR(kretprobe_link, "attach_kretprobe_dup_sym"))
+ goto cleanup;
+ skel->links.handle_kretprobe = kretprobe_link;
+
+ /* trigger & validate kprobe && kretprobe */
+ usleep(1);
+
+ ASSERT_EQ(skel->bss->kprobe_res, 1, "check_kprobe_dup_sym_res");
+ ASSERT_EQ(skel->bss->kretprobe_res, 2, "check_kretprobe_dup_sym_res");
+
+cleanup:
+ test_attach_probe_manual__destroy(skel);
+unload_module:
+ unload_module("bpf_testmod_dup_sym", false);
+}
+
/* attach uprobe/uretprobe long event name testings */
static void test_attach_uprobe_long_event_name(void)
{
@@ -559,6 +613,15 @@ void test_attach_probe(void)
if (test__start_subtest("kprobe-legacy-by-addr-reject"))
test_attach_kprobe_legacy_by_addr_reject();
+ if (test__start_subtest("dup-sym-default"))
+ test_attach_probe_dup_sym(PROBE_ATTACH_MODE_DEFAULT);
+ if (test__start_subtest("dup-sym-legacy"))
+ test_attach_probe_dup_sym(PROBE_ATTACH_MODE_LEGACY);
+ if (test__start_subtest("dup-sym-perf"))
+ test_attach_probe_dup_sym(PROBE_ATTACH_MODE_PERF);
+ if (test__start_subtest("dup-sym-link"))
+ test_attach_probe_dup_sym(PROBE_ATTACH_MODE_LINK);
+
if (test__start_subtest("auto"))
test_attach_probe_auto(skel);
if (test__start_subtest("kprobe-sleepable"))
diff --git a/tools/testing/selftests/bpf/test_kmods/Makefile b/tools/testing/selftests/bpf/test_kmods/Makefile
index 63c4d3f6a12f..938c462a103b 100644
--- a/tools/testing/selftests/bpf/test_kmods/Makefile
+++ b/tools/testing/selftests/bpf/test_kmods/Makefile
@@ -8,7 +8,7 @@ Q = @
endif
MODULES = bpf_testmod.ko bpf_test_no_cfi.ko bpf_test_modorder_x.ko \
- bpf_test_modorder_y.ko bpf_test_rqspinlock.ko
+ bpf_test_modorder_y.ko bpf_test_rqspinlock.ko bpf_testmod_dup_sym.ko
$(foreach m,$(MODULES),$(eval obj-m += $(m:.ko=.o)))
diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod_dup_sym.c b/tools/testing/selftests/bpf/test_kmods/bpf_testmod_dup_sym.c
new file mode 100644
index 000000000000..0e12f68afe3a
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod_dup_sym.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 CrowdStrike */
+/* Test module for duplicate kprobe symbol handling */
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+
+/* Duplicate symbol to test kprobe attachment with duplicate symbols.
+ * This creates a duplicate of the syscall wrapper used in attach_probe tests.
+ * The kernel fix should handle this by preferring the vmlinux symbol.
+ * This function should NEVER be called - kprobes should attach to vmlinux version.
+ */
+#ifdef __x86_64__
+int __x64_sys_nanosleep(void);
+noinline int __x64_sys_nanosleep(void)
+#elif defined(__s390x__)
+int __s390x_sys_nanosleep(void);
+noinline int __s390x_sys_nanosleep(void)
+#elif defined(__aarch64__)
+int __arm64_sys_nanosleep(void);
+noinline int __arm64_sys_nanosleep(void)
+#elif defined(__riscv)
+int __riscv_sys_nanosleep(void);
+noinline int __riscv_sys_nanosleep(void)
+#else
+int sys_nanosleep(void);
+noinline int sys_nanosleep(void)
+#endif
+{
+ WARN_ONCE(1, "bpf_testmod_dup_sym: dummy nanosleep symbol called - this should never execute!\n");
+ return -EINVAL;
+}
+
+static int __init bpf_testmod_dup_sym_init(void)
+{
+ return 0;
+}
+
+static void __exit bpf_testmod_dup_sym_exit(void)
+{
+}
+
+module_init(bpf_testmod_dup_sym_init);
+module_exit(bpf_testmod_dup_sym_exit);
+
+MODULE_AUTHOR("Andrey Grodzovsky");
+MODULE_DESCRIPTION("BPF selftest duplicate symbol module");
+MODULE_LICENSE("GPL");
--
2.34.1
* [RFC PATCH bpf-next v5 1/2] tracing: Prefer vmlinux symbols over module symbols for unqualified kprobes
From: Andrey Grodzovsky @ 2026-04-06 19:31 UTC (permalink / raw)
To: bpf, linux-trace-kernel
Cc: ast, daniel, andrii, jolsa, rostedt, mhiramat, ihor.solodrai,
emil, linux-open-source
In-Reply-To: <20260406193158.754498-1-andrey.grodzovsky@crowdstrike.com>
When an unqualified kprobe target exists in both vmlinux and a loaded
module, number_of_same_symbols() returns a count greater than 1,
causing kprobe attachment to fail with -EADDRNOTAVAIL even though the
vmlinux symbol is unambiguous.
When no module qualifier is given and the symbol is found in vmlinux,
return the vmlinux-only count without scanning loaded modules. This
preserves the existing behavior for all other cases:
- Symbol only in a module: vmlinux count is 0, falls through to module
scan as before.
- Symbol qualified with MOD:SYM: mod != NULL, unchanged path.
- Symbol ambiguous within vmlinux itself: count > 1 is returned as-is.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
---
kernel/trace/trace_kprobe.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index a5dbb72528e0..99c41ea8b6d7 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -765,6 +765,13 @@ static unsigned int number_of_same_symbols(const char *mod, const char *func_nam
if (!mod)
kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx.count);
+ /* If the symbol is found in vmlinux, use vmlinux resolution only.
+ * This prevents module symbols from shadowing vmlinux symbols
+ * and causing -EADDRNOTAVAIL for unqualified kprobe targets.
+ */
+ if (!mod && ctx.count > 0)
+ return ctx.count;
+
module_kallsyms_on_each_symbol(mod, count_mod_symbols, &ctx);
return ctx.count;
--
2.34.1
* Re: [RFC PATCH 0/4] trace, livepatch: Allow kprobe return overriding for livepatched functions
From: Song Liu @ 2026-04-06 18:26 UTC (permalink / raw)
To: Yafang Shao
Cc: jpoimboe, jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
mathieu.desnoyers, kpsingh, mattbobrowski, jolsa, ast, daniel,
andrii, martin.lau, eddyz87, memxor, yonghong.song, live-patching,
linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <CALOAHbDG8=eUV53kF+xn=izs2rpydCk=a9RznU-EEOzmkB8mQg@mail.gmail.com>
On Mon, Apr 6, 2026 at 3:55 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, Apr 4, 2026 at 12:07 AM Song Liu <song@kernel.org> wrote:
> >
> > Hi Yafang,
> >
> > On Thu, Apr 2, 2026 at 2:26 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > Livepatching allows for rapid experimentation with new kernel features
> > > without interrupting production workloads. However, static livepatches lack
> > > the flexibility required to tune features based on task-specific attributes,
> > > such as cgroup membership, which is critical in multi-tenant k8s
> > > environments. Furthermore, hardcoding logic into a livepatch prevents
> > > dynamic adjustments based on the runtime environment.
> > >
> > > To address this, we propose a hybrid approach using BPF. Our production use
> > > case involves:
> > >
> > > 1. Deploying a Livepatch function to serve as a stable BPF hook.
> > >
> > > 2. Utilizing bpf_override_return() to dynamically modify the return value
> > > of that hook based on the current task's context.
> >
> > Could you please provide a specific use case that can benefit from this?
> > AFAICT, livepatch is more flexible but risky (may cause crash); while
> > BPF is safe, but less flexible. The combination you are proposing seems
> > to get the worse of the two sides. Maybe it can indeed get the benefit of
> > both sides in some cases, but I cannot think of such examples.
> >
>
> Here is an example we recently deployed on our production servers:
>
> https://lore.kernel.org/bpf/CALOAHbDnNba_w_nWH3-S9GAXw0+VKuLTh1gy5hy9Yqgeo4C0iA@mail.gmail.com/
>
> In one of our specific clusters, we needed to send BGP traffic out
> through specific NICs based on the destination IP. To achieve this
> without interrupting service, we live-patched
> bond_xmit_3ad_xor_slave_get(), added a new hook called
> bond_get_slave_hook(), and then ran a BPF program attached to that
> hook to select the outgoing NIC from the SKB. This allowed us to
> rapidly deploy the feature with zero downtime.
I guess the idea here is: keep the risky part simple and implement
it in a module/livepatch, then use BPF to keep the flexible,
programmable part safe.
Can we use struct_ops instead of bpf_override_return for this case?
This should make the solution more flexible.
Thanks,
Song
* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Song Liu @ 2026-04-06 18:11 UTC (permalink / raw)
To: Yafang Shao
Cc: Dylan Hatch, jpoimboe, jikos, mbenes, pmladek, joe.lawrence,
rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
bpf
In-Reply-To: <CALOAHbCbcw2jpjk9JD9yyf+SMpQ-s9FAonSaz7Gs4XUeP+w+2g@mail.gmail.com>
On Mon, Apr 6, 2026 at 4:08 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, Apr 4, 2026 at 5:36 AM Song Liu <song@kernel.org> wrote:
> >
> > On Fri, Apr 3, 2026 at 1:55 PM Dylan Hatch <dylanbhatch@google.com> wrote:
> > [...]
> > > > IIRC, the use case for this change is when multiple users load various
> > > > livepatch modules on the same system. I still don't believe this is the
> > > > right way to manage livepatches. That said, I won't really NACK this
> > > > if other folks think this is a useful option.
> > >
> > > In our production fleet, we apply exactly one cumulative livepatch
> > > module, and we use per-kernel build "livepatch release" branches to
> > > track the contents of these cumulative livepatches. This model has
> > > worked relatively well for us, but there are some painpoints.
> > >
> > > We are often under pressure to selectively deploy a livepatch fix to
> > > certain subpopulations of production. If the subpopulation is running
> > > the same build of everything else, this would require us to introduce
> > > another branching factor to the "livepatch release" branches --
> > > something we do not support due to the added toil and complexity.
> > >
> > > However, if we had the ability to build "off-band" livepatch modules
> > > that were marked as non-replaceable, we could support these selective
> > > patches without the additional branching factor. I will have to
> > > circulate the idea internally, but to me this seems like a very useful
> > > option to have in certain cases.
> >
> > IIUC, the plan is:
> >
> > - The regular livepatches are cumulative, have the replace flag; and
> > are replaceable.
> > - The occasional "off-band" livepatches do not have the replace flag,
> > and are not replaceable.
> >
> > With this setup, for systems with off-band livepatches loaded, we can
> > still release a cumulative livepatch to replace the previous cumulative
> > livepatch. Is this the expected use case?
>
> That matches our expected use case.
If we really want to serve use cases like this, I think we can introduce
a replace tag concept: each livepatch has a tag, a u32 number. A newly
loaded livepatch only replaces existing livepatches with the same tag.
We can even reuse the existing "bool replace" in klp_patch and make it
a u32: replace=0 means no replace; replace > 0 is the replace tag.
For current users of cumulative patches, all the livepatch will have the
same tag, say 1. For your use case, you can assign each user a
unique tag. Then all these users can do atomic upgrades of their
own livepatches.
We may also need to check whether two livepatches of different tags
touch the same kernel function. When that happens, the later
livepatch should fail to load.
Does this make sense?
Thanks,
Song
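The replace tag idea above could be sketched roughly as follows.
Everything here is an assumption for illustration: struct patch_model,
tag_replaces() and tag_conflicts() are hypothetical and not part of the
livepatch API, and the conflict rule is just a literal reading of the
proposal (different tags plus a shared function means the later patch
is refused).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for struct klp_patch. */
struct patch_model {
	uint32_t replace;		/* 0: no replace; > 0: replace tag */
	const char *const *funcs;	/* NULL-terminated list of patched functions */
};

/* Illustrative patches only; the function names are arbitrary examples. */
static const char *const cumulative_funcs[] = { "vfs_read", "vfs_write", NULL };
static const char *const offband_funcs[] = { "bond_get_slave_hook", NULL };
static const char *const clash_funcs[] = { "vfs_read", NULL };

static const struct patch_model cumulative_v1 = { 1, cumulative_funcs };
static const struct patch_model cumulative_v2 = { 1, cumulative_funcs };
static const struct patch_model offband = { 2, offband_funcs };
static const struct patch_model clashing = { 2, clash_funcs };

/* A new patch replaces an existing one only when both carry the same non-zero tag. */
static bool tag_replaces(const struct patch_model *new_patch,
			 const struct patch_model *old_patch)
{
	return new_patch->replace != 0 &&
	       new_patch->replace == old_patch->replace;
}

/* Patches with different tags must not touch the same kernel function. */
static bool tag_conflicts(const struct patch_model *new_patch,
			  const struct patch_model *old_patch)
{
	size_t i, j;

	if (new_patch->replace == old_patch->replace)
		return false;

	for (i = 0; new_patch->funcs[i]; i++)
		for (j = 0; old_patch->funcs[j]; j++)
			if (strcmp(new_patch->funcs[i], old_patch->funcs[j]) == 0)
				return true;
	return false;
}
```

In this model cumulative_v2 replaces cumulative_v1 and leaves offband
alone, while clashing would be refused because it shares vfs_read with a
patch that carries a different tag.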
* [PATCH] tracing: preserve module tracepoint strings
From: Cao Ruichuang @ 2026-04-06 17:09 UTC (permalink / raw)
To: rostedt
Cc: mhiramat, mathieu.desnoyers, mcgrof, petr.pavlu, da.gomez,
samitolvanen, atomlin, linux-kernel, linux-trace-kernel,
linux-modules
tracepoint_string() is documented as exporting constant strings
through printk_formats, including when it is used from modules.
That currently does not work.
A small test module that calls
tracepoint_string("tracepoint_string_test_module_string") loads
successfully and gets a pointer back, but the string never appears
in /sys/kernel/tracing/printk_formats. The loader only collects
__trace_printk_fmt from modules and ignores __tracepoint_str.
Collect module __tracepoint_str entries too, copy them to stable
tracing-managed storage like module trace_printk formats, and let
trace_is_tracepoint_string() recognize those copied strings. This
makes module tracepoint strings visible through printk_formats and
keeps them accepted by the trace string safety checks.
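The acceptance check described above can be modelled in user space. This
is a sketch, not the kernel code: struct saved_fmt, the MODEL_* flags,
demo_fmts and model_is_tracepoint_string() are hypothetical names, and a
fixed array stands in for trace_bprintk_fmt_list. It only shows that a
string is accepted by pointer identity against a copied entry that
carries the tracepoint type bit.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BPRINTK_TYPE	(1u << 0)
#define MODEL_TRACEPOINT_TYPE	(1u << 1)

/* Hypothetical stand-in for one saved (copied) module string. */
struct saved_fmt {
	const char *fmt;
	unsigned int type;
};

/* Distinct objects, so equal text does not imply equal address. */
static const char printk_fmt[] = "value=%d\n";
static const char tp_copy[] = "tracepoint_string_test_module_string";
static const char tp_same_text[] = "tracepoint_string_test_module_string";

static const struct saved_fmt demo_fmts[] = {
	{ printk_fmt, MODEL_BPRINTK_TYPE },
	{ tp_copy,    MODEL_TRACEPOINT_TYPE },
};

/* Accept str only if it is the stored pointer of a tracepoint-typed entry. */
static bool model_is_tracepoint_string(const char *str)
{
	size_t i;

	for (i = 0; i < sizeof(demo_fmts) / sizeof(demo_fmts[0]); i++) {
		if ((demo_fmts[i].type & MODEL_TRACEPOINT_TYPE) &&
		    str == demo_fmts[i].fmt)
			return true;
	}
	return false;
}
```

Here tp_copy is accepted, while tp_same_text (same text, different
address) and printk_fmt (wrong type bit) are not.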
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217196
Signed-off-by: Cao Ruichuang <create0818@163.com>
---
include/linux/module.h | 2 ++
kernel/module/main.c | 4 +++
kernel/trace/trace_printk.c | 63 ++++++++++++++++++++++++++++---------
3 files changed, 54 insertions(+), 15 deletions(-)
diff --git a/include/linux/module.h b/include/linux/module.h
index 14f391b18..e475466a7 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -515,6 +515,8 @@ struct module {
#ifdef CONFIG_TRACING
unsigned int num_trace_bprintk_fmt;
const char **trace_bprintk_fmt_start;
+ unsigned int num_tracepoint_strings;
+ const char **tracepoint_strings_start;
#endif
#ifdef CONFIG_EVENT_TRACING
struct trace_event_call **trace_events;
diff --git a/kernel/module/main.c b/kernel/module/main.c
index c3ce106c7..d7d890138 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2672,6 +2672,10 @@ static int find_module_sections(struct module *mod, struct load_info *info)
mod->trace_bprintk_fmt_start = section_objs(info, "__trace_printk_fmt",
sizeof(*mod->trace_bprintk_fmt_start),
&mod->num_trace_bprintk_fmt);
+ mod->tracepoint_strings_start =
+ section_objs(info, "__tracepoint_str",
+ sizeof(*mod->tracepoint_strings_start),
+ &mod->num_tracepoint_strings);
#endif
#ifdef CONFIG_DYNAMIC_FTRACE
/* sechdrs[0].sh_size is always zero */
diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c
index 5ea5e0d76..9f67ce42e 100644
--- a/kernel/trace/trace_printk.c
+++ b/kernel/trace/trace_printk.c
@@ -22,8 +22,9 @@
#ifdef CONFIG_MODULES
/*
- * modules trace_printk()'s formats are autosaved in struct trace_bprintk_fmt
- * which are queued on trace_bprintk_fmt_list.
+ * modules trace_printk() formats and tracepoint_string() strings are
+ * autosaved in struct trace_bprintk_fmt, which are queued on
+ * trace_bprintk_fmt_list.
*/
static LIST_HEAD(trace_bprintk_fmt_list);
@@ -33,8 +34,12 @@ static DEFINE_MUTEX(btrace_mutex);
struct trace_bprintk_fmt {
struct list_head list;
const char *fmt;
+ unsigned int type;
};
+#define TRACE_BPRINTK_TYPE BIT(0)
+#define TRACE_TRACEPOINT_TYPE BIT(1)
+
static inline struct trace_bprintk_fmt *lookup_format(const char *fmt)
{
struct trace_bprintk_fmt *pos;
@@ -49,22 +54,24 @@ static inline struct trace_bprintk_fmt *lookup_format(const char *fmt)
return NULL;
}
-static
-void hold_module_trace_bprintk_format(const char **start, const char **end)
+static void hold_module_trace_format(const char **start, const char **end,
+ unsigned int type)
{
const char **iter;
char *fmt;
/* allocate the trace_printk per cpu buffers */
- if (start != end)
+ if ((type & TRACE_BPRINTK_TYPE) && start != end)
trace_printk_init_buffers();
mutex_lock(&btrace_mutex);
for (iter = start; iter < end; iter++) {
struct trace_bprintk_fmt *tb_fmt = lookup_format(*iter);
if (tb_fmt) {
- if (!IS_ERR(tb_fmt))
+ if (!IS_ERR(tb_fmt)) {
+ tb_fmt->type |= type;
*iter = tb_fmt->fmt;
+ }
continue;
}
@@ -76,6 +83,7 @@ void hold_module_trace_bprintk_format(const char **start, const char **end)
list_add_tail(&tb_fmt->list, &trace_bprintk_fmt_list);
strcpy(fmt, *iter);
tb_fmt->fmt = fmt;
+ tb_fmt->type = type;
} else
kfree(tb_fmt);
}
@@ -85,17 +93,28 @@ void hold_module_trace_bprintk_format(const char **start, const char **end)
mutex_unlock(&btrace_mutex);
}
-static int module_trace_bprintk_format_notify(struct notifier_block *self,
- unsigned long val, void *data)
+static int module_trace_format_notify(struct notifier_block *self,
+ unsigned long val, void *data)
{
struct module *mod = data;
+
+ if (val != MODULE_STATE_COMING)
+ return NOTIFY_OK;
+
if (mod->num_trace_bprintk_fmt) {
const char **start = mod->trace_bprintk_fmt_start;
const char **end = start + mod->num_trace_bprintk_fmt;
- if (val == MODULE_STATE_COMING)
- hold_module_trace_bprintk_format(start, end);
+ hold_module_trace_format(start, end, TRACE_BPRINTK_TYPE);
+ }
+
+ if (mod->num_tracepoint_strings) {
+ const char **start = mod->tracepoint_strings_start;
+ const char **end = start + mod->num_tracepoint_strings;
+
+ hold_module_trace_format(start, end, TRACE_TRACEPOINT_TYPE);
}
+
return NOTIFY_OK;
}
@@ -171,8 +190,8 @@ static void format_mod_stop(void)
#else /* !CONFIG_MODULES */
__init static int
-module_trace_bprintk_format_notify(struct notifier_block *self,
- unsigned long val, void *data)
+module_trace_format_notify(struct notifier_block *self,
+ unsigned long val, void *data)
{
return NOTIFY_OK;
}
@@ -193,8 +212,8 @@ void trace_printk_control(bool enabled)
}
__initdata_or_module static
-struct notifier_block module_trace_bprintk_format_nb = {
- .notifier_call = module_trace_bprintk_format_notify,
+struct notifier_block module_trace_format_nb = {
+ .notifier_call = module_trace_format_notify,
};
int __trace_bprintk(unsigned long ip, const char *fmt, ...)
@@ -254,11 +273,25 @@ EXPORT_SYMBOL_GPL(__ftrace_vprintk);
bool trace_is_tracepoint_string(const char *str)
{
const char **ptr = __start___tracepoint_str;
+#ifdef CONFIG_MODULES
+ struct trace_bprintk_fmt *tb_fmt;
+#endif
for (ptr = __start___tracepoint_str; ptr < __stop___tracepoint_str; ptr++) {
if (str == *ptr)
return true;
}
+
+#ifdef CONFIG_MODULES
+ mutex_lock(&btrace_mutex);
+ list_for_each_entry(tb_fmt, &trace_bprintk_fmt_list, list) {
+ if ((tb_fmt->type & TRACE_TRACEPOINT_TYPE) && str == tb_fmt->fmt) {
+ mutex_unlock(&btrace_mutex);
+ return true;
+ }
+ }
+ mutex_unlock(&btrace_mutex);
+#endif
return false;
}
@@ -824,7 +857,7 @@ fs_initcall(init_trace_printk_function_export);
static __init int init_trace_printk(void)
{
- return register_module_notifier(&module_trace_bprintk_format_nb);
+ return register_module_notifier(&module_trace_format_nb);
}
early_initcall(init_trace_printk);
--
2.39.5 (Apple Git-154)
* [PATCH v2] ring-buffer: report header_page overwrite as char
From: Cao Ruichuang @ 2026-04-06 16:53 UTC (permalink / raw)
To: rostedt; +Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260406162843.41592-1-create0818@163.com>
The header_page tracefs metadata currently reports overwrite as an
int field with size 1. That makes parsers warn about a type and
size mismatch even though the field is only used as a one-byte flag
within commit.
Keep the shared offset with commit as-is, but report overwrite as
char so the declared type matches the hardcoded size. The signedness
is already carried separately by the emitted signed field.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216999
Signed-off-by: Cao Ruichuang <create0818@163.com>
---
kernel/trace/ring_buffer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 170170bd8..6811dfffa 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -627,11 +627,11 @@ int ring_buffer_print_page_header(struct trace_buffer *buffer, struct trace_seq
(unsigned int)sizeof(field.commit),
(unsigned int)is_signed_type(long));
- trace_seq_printf(s, "\tfield: int overwrite;\t"
+ trace_seq_printf(s, "\tfield: char overwrite;\t"
"offset:%u;\tsize:%u;\tsigned:%u;\n",
(unsigned int)offsetof(typeof(field), commit),
1,
- (unsigned int)is_signed_type(long));
+ (unsigned int)is_signed_type(char));
trace_seq_printf(s, "\tfield: char data;\t"
"offset:%u;\tsize:%u;\tsigned:%u;\n",
--
2.39.5 (Apple Git-154)
^ permalink raw reply related
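The mismatch described in the changelog above can be illustrated with a small userspace check. This is only a sketch of the kind of consistency test a format parser might apply; it is not the parser from the Bugzilla report. The helper names declared_type_size() and field_size_matches() are hypothetical, and the offset value 8 in the usage note is an illustrative assumption.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: natural size in bytes of a few C type names as
 * they may appear in a tracefs "field:" declaration. Returns 0 for any
 * type this sketch does not know about.
 */
static size_t declared_type_size(const char *type)
{
	if (strcmp(type, "char") == 0 ||
	    strcmp(type, "signed char") == 0 ||
	    strcmp(type, "unsigned char") == 0)
		return sizeof(char);
	if (strcmp(type, "int") == 0)
		return sizeof(int);
	if (strcmp(type, "long") == 0)
		return sizeof(long);
	return 0;
}

/*
 * Hypothetical check for one header_page line such as
 *   "\tfield: char overwrite;\toffset:8;\tsize:1;\tsigned:1;\n"
 * Returns 1 if the declared type agrees with size:, 0 on a mismatch,
 * and -1 if the line cannot be parsed or the type is unknown.
 */
static int field_size_matches(const char *line)
{
	char decl[64];
	unsigned int offset, size, is_signed;
	char *space;
	size_t natural;

	if (sscanf(line, " field: %63[^;]; offset:%u; size:%u; signed:%u;",
		   decl, &offset, &size, &is_signed) != 4)
		return -1;

	/* Split "type name" at the last space; the type may contain spaces. */
	space = strrchr(decl, ' ');
	if (!space)
		return -1;
	*space = '\0';

	natural = declared_type_size(decl);
	if (natural == 0)
		return -1;

	return natural == size;
}
```

On a platform where sizeof(int) is 4, passing the old "field: int overwrite" line with size:1 to field_size_matches() reports a mismatch, while the "field: char overwrite" line from this patch passes. The signed: value is parsed but deliberately not compared against the type name, since signedness is carried by its own attribute.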
* Re: [PATCH] ring-buffer: report header_page overwrite as signed char
From: Steven Rostedt @ 2026-04-06 16:45 UTC (permalink / raw)
To: CaoRuichuang
Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260406162843.41592-1-create0818@163.com>
On Tue, 7 Apr 2026 00:28:43 +0800
CaoRuichuang <create0818@163.com> wrote:
> The header_page tracefs metadata currently reports overwrite as an
> int field with size 1. That makes parsers warn about a type and
> size mismatch even though the field is only used as a one-byte flag
> within commit.
>
> Keep the shared offset with commit as-is, but report overwrite as
> signed char so the declared type matches the hardcoded size and the
> emitted signedness.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216999
> Signed-off-by: CaoRuichuang <create0818@163.com>
> ---
> kernel/trace/ring_buffer.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 170170bd8..c4c2361b0 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -627,11 +627,11 @@ int ring_buffer_print_page_header(struct trace_buffer *buffer, struct trace_seq
> (unsigned int)sizeof(field.commit),
> (unsigned int)is_signed_type(long));
>
> - trace_seq_printf(s, "\tfield: int overwrite;\t"
> + trace_seq_printf(s, "\tfield: signed char overwrite;\t"
From the Bugzilla, the issue was with the Rust parser. Would the warning
also go away if the "int" were switched to "char" rather than "signed char"?
The "signed" is redundant, as signedness is already specified in the fields.
-- Steve
> "offset:%u;\tsize:%u;\tsigned:%u;\n",
> (unsigned int)offsetof(typeof(field), commit),
> 1,
> - (unsigned int)is_signed_type(long));
> + (unsigned int)is_signed_type(signed char));
>
> trace_seq_printf(s, "\tfield: char data;\t"
> "offset:%u;\tsize:%u;\tsigned:%u;\n",
^ permalink raw reply
* [PATCH] ring-buffer: report header_page overwrite as signed char
From: CaoRuichuang @ 2026-04-06 16:28 UTC (permalink / raw)
To: rostedt; +Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
The header_page tracefs metadata currently reports overwrite as an
int field with size 1. That makes parsers warn about a type and
size mismatch even though the field is only used as a one-byte flag
within commit.
Keep the shared offset with commit as-is, but report overwrite as
signed char so the declared type matches the hardcoded size and the
emitted signedness.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216999
Signed-off-by: CaoRuichuang <create0818@163.com>
---
kernel/trace/ring_buffer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 170170bd8..c4c2361b0 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -627,11 +627,11 @@ int ring_buffer_print_page_header(struct trace_buffer *buffer, struct trace_seq
(unsigned int)sizeof(field.commit),
(unsigned int)is_signed_type(long));
- trace_seq_printf(s, "\tfield: int overwrite;\t"
+ trace_seq_printf(s, "\tfield: signed char overwrite;\t"
"offset:%u;\tsize:%u;\tsigned:%u;\n",
(unsigned int)offsetof(typeof(field), commit),
1,
- (unsigned int)is_signed_type(long));
+ (unsigned int)is_signed_type(signed char));
trace_seq_printf(s, "\tfield: char data;\t"
"offset:%u;\tsize:%u;\tsigned:%u;\n",
--
2.39.5 (Apple Git-154)
^ permalink raw reply related
* [PATCH] tracing/ipi: report ipi_raise target CPUs as cpumask
From: CaoRuichuang @ 2026-04-06 16:24 UTC (permalink / raw)
To: rostedt; +Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
Bugzilla 217447 points out that ftrace bitmask fields still use the
legacy dynamic-array format, which makes trace consumers treat them
as unsigned long arrays instead of bitmaps.
This is visible in the ipi events today: ipi_send_cpumask already
reports its CPU mask as '__data_loc cpumask_t', but ipi_raise still
exposes target_cpus as '__data_loc unsigned long[]'.
Switch ipi_raise to __cpumask() and the matching helpers so its
tracefs format matches the existing cpumask representation used by
the other ipi event. The underlying storage size stays the same, but
trace data consumers can now recognize the field as a cpumask
directly.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217447
Signed-off-by: CaoRuichuang <create0818@163.com>
---
include/trace/events/ipi.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/trace/events/ipi.h b/include/trace/events/ipi.h
index 9912f0ded..fae4f8eac 100644
--- a/include/trace/events/ipi.h
+++ b/include/trace/events/ipi.h
@@ -68,16 +68,16 @@ TRACE_EVENT(ipi_raise,
TP_ARGS(mask, reason),
TP_STRUCT__entry(
- __bitmask(target_cpus, nr_cpumask_bits)
+ __cpumask(target_cpus)
__field(const char *, reason)
),
TP_fast_assign(
- __assign_bitmask(target_cpus, cpumask_bits(mask), nr_cpumask_bits);
+ __assign_cpumask(target_cpus, cpumask_bits(mask));
__entry->reason = reason;
),
- TP_printk("target_mask=%s (%s)", __get_bitmask(target_cpus), __entry->reason)
+ TP_printk("target_mask=%s (%s)", __get_cpumask(target_cpus), __entry->reason)
);
DECLARE_EVENT_CLASS(ipi_handler,
--
2.39.5 (Apple Git-154)
^ permalink raw reply related
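To show why the declared type matters to trace consumers, here is a minimal userspace sketch of how a reader might classify the two declarations quoted in the changelog above and decode the 32-bit __data_loc word that precedes the dynamic data. The names dyn_kind, classify_data_loc() and data_loc_decode() are hypothetical and do not come from any existing library. The sketch assumes the usual __data_loc layout: offset in the low 16 bits, length in the high 16 bits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical classification of a dynamic field declaration. */
enum dyn_kind {
	DYN_UNKNOWN,
	DYN_CPUMASK,		/* "__data_loc cpumask_t name" */
	DYN_ULONG_ARRAY,	/* "__data_loc unsigned long[] name" */
};

/* Return 1 if s begins with prefix, 0 otherwise. */
static int starts_with(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

/*
 * Classify the declaration part of a format "field:" entry, for
 * example "__data_loc cpumask_t target_cpus".
 */
static enum dyn_kind classify_data_loc(const char *decl)
{
	static const char tag[] = "__data_loc ";

	if (!starts_with(decl, tag))
		return DYN_UNKNOWN;
	decl += sizeof(tag) - 1;

	if (starts_with(decl, "cpumask_t"))
		return DYN_CPUMASK;
	if (starts_with(decl, "unsigned long[]"))
		return DYN_ULONG_ARRAY;
	return DYN_UNKNOWN;
}

/*
 * Split a __data_loc word into the offset of the payload from the
 * start of the event record and the payload length in bytes.
 */
static void data_loc_decode(uint32_t word, uint16_t *offset, uint16_t *len)
{
	*offset = (uint16_t)(word & 0xffff);
	*len = (uint16_t)(word >> 16);
}
```

With this kind of check, the storage decoded by data_loc_decode() is the same for both declarations; only the result of classify_data_loc() changes, which is what lets a consumer print the field as a CPU mask instead of a raw array of unsigned long.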