* Re: [PATCH 7.2 v16 00/13] khugepaged: mTHP support
From: Andrew Morton @ 2026-04-23 20:30 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>
On Sun, 19 Apr 2026 12:57:37 -0600 Nico Pache <npache@redhat.com> wrote:
> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.
Thanks, I added this to mm.git's mm-new branch for testing while review
is being completed. I added notes regarding Usana's comments, so they
don't get lost.
^ permalink raw reply
* Re: [PATCH] rtla: Document tests in README
From: Crystal Wood @ 2026-04-23 22:44 UTC (permalink / raw)
To: Tomas Glozar, Steven Rostedt
Cc: John Kacur, Luis Goncalves, Costa Shulyupin, Wander Lairson Costa,
LKML, linux-trace-kernel
In-Reply-To: <20260423130759.882247-1-tglozar@redhat.com>
On Thu, 2026-04-23 at 15:07 +0200, Tomas Glozar wrote:
> RTLA tests are not documented anywhere. Mention both runtime and unit
> tests in the README, with instructions on how to run them and a list of
> dependencies and required system configuration.
>
> Signed-off-by: Tomas Glozar <tglozar@redhat.com>
> ---
> tools/tracing/rtla/README.txt | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
> diff --git a/tools/tracing/rtla/README.txt b/tools/tracing/rtla/README.txt
> index a9faee4dbb3a..8a782cd2c171 100644
> --- a/tools/tracing/rtla/README.txt
> +++ b/tools/tracing/rtla/README.txt
> @@ -42,4 +42,34 @@ For development, we suggest the following steps for compiling rtla:
> $ make
> $ sudo make install
>
> +Running tests
> +
> +RTLA has two test suites: a runtime test suite and a unit test suite.
> +
> +The runtime test suite is available as "make check" (root required) and has
> +the following dependencies, in addition to RTLA build dependencies:
> +
> +- Perl
> +- Test::Harness / TAP::Harness
> +- bash
> +- coreutils
> +- ldd
> +- util-linux
> +- procps(-ng)
> +- bpftool (if rtla is built against libbpf)
> +
> +as well as the following required system configuration:
> +
> +- CONFIG_OSNOISE_TRACER=y
> +- CONFIG_TIMERLAT_TRACER=y
> +- tracefs mounted and readable at /sys/kernel/tracing
> +
> +The unit test suite is available as "make unit-tests" and has the following
> +dependencies:
> +
> +- libcheck
> +
> +Unlike the runtime test suite, root is not required to run unit tests, nor is
> +a tracefs/osnoise/timerlat-capable kernel required.
> +
Should add something explaining how to install "Test::Harness /
TAP::Harness" for those who aren't familiar with the Perl ecosystem.
-Crystal
^ permalink raw reply
* [PATCH v5] tracing: Bound synthetic-field strings with seq_buf
From: Pengpeng Hou @ 2026-04-23 15:33 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, Tom Zanussi, linux-trace-kernel, linux-kernel,
pengpeng
In-Reply-To: <20260417223001.1-tracing-synth-v4-pengpeng@iscas.ac.cn>
The synthetic field helpers build a prefixed synthetic variable name and
a generated hist command in fixed MAX_FILTER_STR_VAL buffers. The
current code appends those strings with raw strcat(), so long key lists,
field names, or saved filters can run past the end of the staging
buffers.
Build both strings with seq_buf and propagate -E2BIG if either the
synthetic variable name or the generated command exceeds
MAX_FILTER_STR_VAL. This keeps the existing tracing-side limit while
using the helper intended for bounded command construction.
Fixes: 02205a6752f2 ("tracing: Add support for 'field variables'")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
Changes since v4:
- add the requested blank lines around seq_buf_str() comments
- add the seq_buf_str() comment for the generated command buffer too
- keep saved_filter scoped next to its point of use
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 0dbbf6cca9bc..87429567417f 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -8,6 +8,7 @@
#include <linux/module.h>
#include <linux/kallsyms.h>
#include <linux/security.h>
+#include <linux/seq_buf.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/stacktrace.h>
@@ -2968,14 +2969,24 @@ find_synthetic_field_var(struct hist_trigger_data *target_hist_data,
char *system, char *event_name, char *field_name)
{
struct hist_field *event_var;
+ struct seq_buf s;
char *synthetic_name;
synthetic_name = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
if (!synthetic_name)
return ERR_PTR(-ENOMEM);
- strcpy(synthetic_name, "synthetic_");
- strcat(synthetic_name, field_name);
+ seq_buf_init(&s, synthetic_name, MAX_FILTER_STR_VAL);
+ seq_buf_puts(&s, "synthetic_");
+ seq_buf_puts(&s, field_name);
+
+ /* Terminate synthetic_name with a NUL. */
+ seq_buf_str(&s);
+
+ if (seq_buf_has_overflowed(&s)) {
+ kfree(synthetic_name);
+ return ERR_PTR(-E2BIG);
+ }
event_var = find_event_var(target_hist_data, system, event_name, synthetic_name);
@@ -3020,7 +3031,7 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
struct trace_event_file *file;
struct hist_field *key_field;
struct hist_field *event_var;
- char *saved_filter;
+ struct seq_buf s;
char *cmd;
int ret;
@@ -3065,28 +3076,37 @@ create_field_var_hist(struct hist_trigger_data *target_hist_data,
return ERR_PTR(-ENOMEM);
}
+ seq_buf_init(&s, cmd, MAX_FILTER_STR_VAL);
+
/* Use the same keys as the compatible histogram */
- strcat(cmd, "keys=");
+ seq_buf_puts(&s, "keys=");
for_each_hist_key_field(i, hist_data) {
key_field = hist_data->fields[i];
if (!first)
- strcat(cmd, ",");
- strcat(cmd, key_field->field->name);
+ seq_buf_putc(&s, ',');
+ seq_buf_puts(&s, key_field->field->name);
first = false;
}
/* Create the synthetic field variable specification */
- strcat(cmd, ":synthetic_");
- strcat(cmd, field_name);
- strcat(cmd, "=");
- strcat(cmd, field_name);
+ seq_buf_printf(&s, ":synthetic_%s=%s", field_name, field_name);
/* Use the same filter as the compatible histogram */
- saved_filter = find_trigger_filter(hist_data, file);
- if (saved_filter) {
- strcat(cmd, " if ");
- strcat(cmd, saved_filter);
+ {
+ char *saved_filter = find_trigger_filter(hist_data, file);
+
+ if (saved_filter)
+ seq_buf_printf(&s, " if %s", saved_filter);
+ }
+
+ /* Terminate cmd with a NUL. */
+ seq_buf_str(&s);
+
+ if (seq_buf_has_overflowed(&s)) {
+ kfree(cmd);
+ kfree(var_hist);
+ return ERR_PTR(-E2BIG);
}
var_hist->cmd = kstrdup(cmd, GFP_KERNEL);
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* Re: [PATCH] mm/vmscan: add balance_pgdat begin/end tracepoints
From: SUVONOV BUNYOD @ 2026-04-24 0:46 UTC (permalink / raw)
To: Shakeel Butt
Cc: akpm, hannes, rostedt, mhiramat, david, mhocko, zhengqi arch, ljs,
mathieu desnoyers, linux-mm, linux-trace-kernel, linux-kernel
In-Reply-To: <aepavbLy7H3odp6p@linux.dev>
Thank you for reviewing Shakeel,
> Do we need to trace highest_zoneidx at the end? Can it change within
> balance_pgdat()?
highest_zoneidx does not change within a balance_pgdat() invocation. It
is passed in as an argument and remains the classzone bound used for the
balancing checks throughout the function.
I kept highest_zoneidx in the end tracepoint to make the outcome event
self-contained. In principle, begin/end correlation is possible, but
under sustained memory pressure kswapd reclaim can be frequent enough
that consumers may prefer to analyze end events directly, and any
dependence on matching begin/end becomes less convenient and less robust
in the presence of filtering or dropped trace records.
Since nr_reclaimed and the final order are only known at the end, having
highest_zoneidx there allows end-only analysis without correlating with
the begin event.
For example, it lets users answer questions like:
- this pass reclaimed too much or too little memory; what highest_zoneidx
did that result correspond to?
- how much reclaim was done when balancing up to ZONE_NORMAL vs other
classzone bounds?
- when highest_zoneidx == ZONE_NORMAL, how often did reclaim finish at
order=0?
So it is there because it provides context for the end-of-reclaim result.
Do you think this is sufficient justification? If not, then I can drop it
from the end tracepoint in v2.
----- Original Message -----
From: "Shakeel Butt" <shakeel.butt@linux.dev>
To: "Bunyod Suvonov" <b.suvonov@sjtu.edu.cn>
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, rostedt@goodmis.org, mhiramat@kernel.org, david@kernel.org, mhocko@kernel.org, "zhengqi arch" <zhengqi.arch@bytedance.com>, ljs@kernel.org, "mathieu desnoyers" <mathieu.desnoyers@efficios.com>, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, linux-kernel@vger.kernel.org
Sent: Friday, April 24, 2026 1:46:55 AM
Subject: Re: [PATCH] mm/vmscan: add balance_pgdat begin/end tracepoints
On Thu, Apr 23, 2026 at 06:37:53PM +0800, Bunyod Suvonov wrote:
> Vmscan has six main reclaim entry points: try_to_free_pages() for
> direct reclaim, try_to_free_mem_cgroup_pages() for memcg reclaim,
> mem_cgroup_shrink_node() for memcg soft limit reclaim, node_reclaim()
> for node reclaim, shrink_all_memory() for hibernation reclaim, and
> balance_pgdat() for kswapd reclaim.
>
> All of them, except for shrink_all_memory() and balance_pgdat(), already
> have begin/end tracepoints. This makes it harder to trace which reclaim
> path is responsible for memory reclaim activity, because kswapd reclaim
> cannot be identified as cleanly as other reclaim entry points, even
> though it is the main background reclaim path under memory pressure.
> There may be no need to trace shrink_all_memory() as it is primarily
> used during hibernation. So this patch adds the missing tracepoint pair
> for balance_pgdat().
>
> The begin tracepoint records the node id, requested reclaim order, and
> highest_zoneidx. The end tracepoint records the node id, reclaim order
> that balance_pgdat() finished with, highest_zoneidx, and nr_reclaimed.
Do we need to trace highest_zoneidx at the end? Can it change within
balance_pgdat()?
> Together, they show the requested reclaim order and zone bound, whether
> reclaim fell back to a lower order, and how much reclaim work was done.
>
> Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn>
Overall looks good.
^ permalink raw reply
* Re: [PATCH] mm/vmscan: add balance_pgdat begin/end tracepoints
From: Shakeel Butt @ 2026-04-24 2:15 UTC (permalink / raw)
To: SUVONOV BUNYOD
Cc: akpm, hannes, rostedt, mhiramat, david, mhocko, zhengqi arch, ljs,
mathieu desnoyers, linux-mm, linux-trace-kernel, linux-kernel
In-Reply-To: <1868971658.1970916.1776991584726.JavaMail.zimbra@sjtu.edu.cn>
On Fri, Apr 24, 2026 at 08:46:24AM +0800, SUVONOV BUNYOD wrote:
> Thank you for reviewing Shakeel,
>
> > Do we need to trace highest_zoneidx at the end? Can it change within
> > balance_pgdat()?
>
> highest_zoneidx does not change within a balance_pgdat() invocation. It
> is passed in as an argument and remains the classzone bound used for the
> balancing checks throughout the function.
>
> I kept highest_zoneidx in the end tracepoint to make the outcome event
> self-contained. In principle, begin/end correlation is possible, but
> under sustained memory pressure kswapd reclaim can be frequent enough
> that consumers may prefer to analyze end events directly, and any
> dependence on matching begin/end becomes less convenient and less robust
> in the presence of filtering or dropped trace records.
>
> Since nr_reclaimed and the final order are only known at the end, having
> highest_zoneidx there allows end-only analysis without correlating with
> the begin event.
>
> For example, it lets users answer questions like:
> - this pass reclaimed too much or too little memory; what highest_zoneidx
> did that result correspond to?
> - how much reclaim was done when balancing up to ZONE_NORMAL vs other
> classzone bounds?
> - when highest_zoneidx == ZONE_NORMAL, how often did reclaim finish at
> order=0?
>
> So it is there because it provides context for the end-of-reclaim result.
> Do you think this is sufficient justification? If not, then I can drop it
> from the end tracepoint in v2.
I think it is ok but let's add this reasoning in the commit message.
^ permalink raw reply
* [PATCH v2] mm/vmscan: add balance_pgdat begin/end tracepoints
From: Bunyod Suvonov @ 2026-04-24 3:14 UTC (permalink / raw)
To: akpm, hannes, rostedt, mhiramat
Cc: david, mhocko, zhengqi.arch, shakeel.butt, ljs, mathieu.desnoyers,
linux-mm, linux-trace-kernel, linux-kernel, Bunyod Suvonov
In-Reply-To: <20260423103753.546582-1-b.suvonov@sjtu.edu.cn>
Vmscan has six main reclaim entry points: try_to_free_pages() for
direct reclaim, try_to_free_mem_cgroup_pages() for memcg reclaim,
mem_cgroup_shrink_node() for memcg soft limit reclaim, node_reclaim()
for node reclaim, shrink_all_memory() for hibernation reclaim, and
balance_pgdat() for kswapd reclaim.
All of them, except for shrink_all_memory() and balance_pgdat(), already
have begin/end tracepoints. This makes it harder to trace which reclaim
path is responsible for memory reclaim activity, because kswapd reclaim
cannot be identified as cleanly as other reclaim entry points, even
though it is the main background reclaim path under memory pressure.
There may be no need to trace shrink_all_memory() as it is primarily
used during hibernation. So this patch adds the missing tracepoint pair
for balance_pgdat().
The begin tracepoint records the node id, requested reclaim order, and
the requested classzone bound (highest_zoneidx). The end tracepoint
records the node id, the reclaim order that balance_pgdat() finished
with, the requested classzone bound, and nr_reclaimed. Together, they
show the requested reclaim order and classzone bound, whether reclaim
fell back to a lower order, and how much reclaim work was done.
The end tracepoint also records highest_zoneidx even though it does not
change within a balance_pgdat() invocation. This keeps the end event
self-contained, so users can analyze reclaim results directly from end
events without depending on begin/end correlation, which is less
convenient when tracing is filtered or records are dropped. It also
makes it straightforward to relate nr_reclaimed and the final reclaim
order to the requested classzone bound.
Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn>
---
v2:
- explain why highest_zoneidx is kept in the end tracepoint
include/trace/events/vmscan.h | 52 +++++++++++++++++++++++++++++++++++
mm/vmscan.c | 5 ++++
2 files changed, 57 insertions(+)
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 4445a8d9218d..b4bf7b8def1f 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -96,6 +96,58 @@ TRACE_EVENT(mm_vmscan_kswapd_wake,
__entry->order)
);
+TRACE_EVENT(mm_vmscan_balance_pgdat_begin,
+
+ TP_PROTO(int nid, int order, int highest_zoneidx),
+
+ TP_ARGS(nid, order, highest_zoneidx),
+
+ TP_STRUCT__entry(
+ __field(int, nid)
+ __field(int, order)
+ __field(int, highest_zoneidx)
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->order = order;
+ __entry->highest_zoneidx = highest_zoneidx;
+ ),
+
+ TP_printk("nid=%d order=%d highest_zoneidx=%-8s",
+ __entry->nid,
+ __entry->order,
+ __print_symbolic(__entry->highest_zoneidx, ZONE_TYPE))
+);
+
+TRACE_EVENT(mm_vmscan_balance_pgdat_end,
+
+ TP_PROTO(int nid, int order, int highest_zoneidx,
+ unsigned long nr_reclaimed),
+
+ TP_ARGS(nid, order, highest_zoneidx, nr_reclaimed),
+
+ TP_STRUCT__entry(
+ __field(int, nid)
+ __field(int, order)
+ __field(int, highest_zoneidx)
+ __field(unsigned long, nr_reclaimed)
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->order = order;
+ __entry->highest_zoneidx = highest_zoneidx;
+ __entry->nr_reclaimed = nr_reclaimed;
+ ),
+
+ TP_printk("nid=%d order=%d highest_zoneidx=%-8s nr_reclaimed=%lu",
+ __entry->nid,
+ __entry->order,
+ __print_symbolic(__entry->highest_zoneidx, ZONE_TYPE),
+ __entry->nr_reclaimed)
+);
+
TRACE_EVENT(mm_vmscan_wakeup_kswapd,
TP_PROTO(int nid, int zid, int order, gfp_t gfp_flags),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..b2d89ed69d22 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7121,6 +7121,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
.may_unmap = 1,
};
+ trace_mm_vmscan_balance_pgdat_begin(pgdat->node_id, order,
+ highest_zoneidx);
set_task_reclaim_state(current, &sc.reclaim_state);
psi_memstall_enter(&pflags);
__fs_reclaim_acquire(_THIS_IP_);
@@ -7314,6 +7316,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
psi_memstall_leave(&pflags);
set_task_reclaim_state(current, NULL);
+ trace_mm_vmscan_balance_pgdat_end(pgdat->node_id, sc.order,
+ highest_zoneidx, sc.nr_reclaimed);
+
/*
* Return the order kswapd stopped reclaiming at as
* prepare_kswapd_sleep() takes it into account. If another caller
--
2.53.0
^ permalink raw reply related
* Re: [PATCH v2] mm/vmscan: add balance_pgdat begin/end tracepoints
From: Shakeel Butt @ 2026-04-24 3:16 UTC (permalink / raw)
To: Bunyod Suvonov, akpm, hannes, rostedt, mhiramat
Cc: david, mhocko, zhengqi.arch, ljs, mathieu.desnoyers, linux-mm,
linux-trace-kernel, linux-kernel, Bunyod Suvonov
In-Reply-To: <20260424031418.174597-1-b.suvonov@sjtu.edu.cn>
April 23, 2026 at 8:14 PM, "Bunyod Suvonov" <b.suvonov@sjtu.edu.cn mailto:b.suvonov@sjtu.edu.cn?to=%22Bunyod%20Suvonov%22%20%3Cb.suvonov%40sjtu.edu.cn%3E > wrote:
>
> Vmscan has six main reclaim entry points: try_to_free_pages() for
> direct reclaim, try_to_free_mem_cgroup_pages() for memcg reclaim,
> mem_cgroup_shrink_node() for memcg soft limit reclaim, node_reclaim()
> for node reclaim, shrink_all_memory() for hibernation reclaim, and
> balance_pgdat() for kswapd reclaim.
>
> All of them, except for shrink_all_memory() and balance_pgdat(), already
> have begin/end tracepoints. This makes it harder to trace which reclaim
> path is responsible for memory reclaim activity, because kswapd reclaim
> cannot be identified as cleanly as other reclaim entry points, even
> though it is the main background reclaim path under memory pressure.
> There may be no need to trace shrink_all_memory() as it is primarily
> used during hibernation. So this patch adds the missing tracepoint pair
> for balance_pgdat().
>
> The begin tracepoint records the node id, requested reclaim order, and
> the requested classzone bound (highest_zoneidx). The end tracepoint
> records the node id, the reclaim order that balance_pgdat() finished
> with, the requested classzone bound, and nr_reclaimed. Together, they
> show the requested reclaim order and classzone bound, whether reclaim
> fell back to a lower order, and how much reclaim work was done.
>
> The end tracepoint also records highest_zoneidx even though it does not
> change within a balance_pgdat() invocation. This keeps the end event
> self-contained, so users can analyze reclaim results directly from end
> events without depending on begin/end correlation, which is less
> convenient when tracing is filtered or records are dropped. It also
> makes it straightforward to relate nr_reclaimed and the final reclaim
> order to the requested classzone bound.
>
> Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
^ permalink raw reply
* [PATCH v18 0/8] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu (Google) @ 2026-04-24 6:51 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
Hi,
Here is the 18th version of improvement patches for making persistent
ring buffers robust to failures.
The previous version is here:
https://lore.kernel.org/all/177687458572.932171.10907864814735342737.stgit@mhiramat.tok.corp.google.com/
This version fixes a newly found bug and some review comments from
Sashiko[1], also, add 2 cleanups, which includes:
[1/8] Do not double count the reader_page when verifying persistent
ring buffer.
[2/8] Add Geert's Ack (Thanks!)
[3/8] Fix to substract BUF_PAGE_HDR_SIZE from meta->subbuf_size
to make the limit of commit size.
[4/8] Reset timestamp of reader_page when the entire cpu_buffer is
invalid.
[5/8] In rb_test_inject_invalid_pages(), changed entry_bytes and
idx to unsigned long.
[7/8] Cleanup persistent ring buffer validation code.
[8/8] Cleanup buffer_data_page related code.
[1] https://sashiko.dev/#/patchset/177687458572.932171.10907864814735342737.stgit%40mhiramat.tok.corp.google.com
Thank you,
Masami Hiramatsu (Google) (8):
ring-buffer: Do not double count the reader_page
ring-buffer: Flush and stop persistent ring buffer on panic
ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
ring-buffer: Add persistent ring buffer invalid-page inject test
ring-buffer: Show commit numbers in buffer_meta file
ring-buffer: Cleanup persistent ring buffer validation
ring-buffer: Cleanup buffer_data_page related code
arch/alpha/include/asm/Kbuild | 1
arch/arc/include/asm/Kbuild | 1
arch/arm/include/asm/Kbuild | 1
arch/arm64/include/asm/ring_buffer.h | 10 +
arch/csky/include/asm/Kbuild | 1
arch/hexagon/include/asm/Kbuild | 1
arch/loongarch/include/asm/Kbuild | 1
arch/m68k/include/asm/Kbuild | 1
arch/microblaze/include/asm/Kbuild | 1
arch/mips/include/asm/Kbuild | 1
arch/nios2/include/asm/Kbuild | 1
arch/openrisc/include/asm/Kbuild | 1
arch/parisc/include/asm/Kbuild | 1
arch/powerpc/include/asm/Kbuild | 1
arch/riscv/include/asm/Kbuild | 1
arch/s390/include/asm/Kbuild | 1
arch/sh/include/asm/Kbuild | 1
arch/sparc/include/asm/Kbuild | 1
arch/um/include/asm/Kbuild | 1
arch/x86/include/asm/Kbuild | 1
arch/xtensa/include/asm/Kbuild | 1
include/asm-generic/ring_buffer.h | 13 +
include/linux/ring_buffer.h | 1
kernel/trace/Kconfig | 34 ++
kernel/trace/ring_buffer.c | 472 +++++++++++++++++++++++-----------
kernel/trace/trace.c | 4
26 files changed, 395 insertions(+), 159 deletions(-)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
base-commit: 6170922f137231b98fc568571befef63e1edff3f
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH v18 1/8] ring-buffer: Do not double count the reader_page
From: Masami Hiramatsu (Google) @ 2026-04-24 6:52 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Since the cpu_buffer->reader_page is updated if there are unwound
pages. After that update, we should skip the page if it is the
original reader_page, because the original reader_page is already
checked.
Fixes: ca296d32ece3 ("tracing: ring_buffer: Rewind persistent ring buffer on reboot")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v18:
- Newly added.
---
kernel/trace/ring_buffer.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cef49f8871d2..5326924615a4 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1884,7 +1884,7 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
{
struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
- struct buffer_page *head_page, *orig_head;
+ struct buffer_page *head_page, *orig_head, *orig_reader;
unsigned long entry_bytes = 0;
unsigned long entries = 0;
int ret;
@@ -1895,16 +1895,17 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
return;
orig_head = head_page = cpu_buffer->head_page;
+ orig_reader = cpu_buffer->reader_page;
/* Do the reader page first */
- ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(orig_reader->page, cpu_buffer->cpu);
if (ret < 0) {
pr_info("Ring buffer reader page is invalid\n");
goto invalid;
}
entries += ret;
- entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
- local_set(&cpu_buffer->reader_page->entries, ret);
+ entry_bytes += local_read(&orig_reader->page->commit);
+ local_set(&orig_reader->entries, ret);
ts = head_page->page->time_stamp;
@@ -2007,8 +2008,8 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Iterate until finding the commit page */
for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
- /* Reader page has already been done */
- if (head_page == cpu_buffer->reader_page)
+ /* The original reader page has already been checked/counted. */
+ if (head_page == orig_reader)
continue;
ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
^ permalink raw reply related
* [PATCH v18 2/8] ring-buffer: Flush and stop persistent ring buffer on panic
From: Masami Hiramatsu (Google) @ 2026-04-24 6:52 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
On real hardware, panic and machine reboot may not flush hardware cache
to memory. This means the persistent ring buffer, which relies on a
coherent state of memory, may not have its events written to the buffer
and they may be lost. Moreover, there may be inconsistency with the
counters which are used for validation of the integrity of the
persistent ring buffer which may cause all data to be discarded.
To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.
Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
Changes in v13:
- Fix a rebase conflict.
Changes in v11:
- Do nothing by default since flush_cache_vmap() does nothing on x86
but it can cause deadlock on some architectures via on_each_cpu()
because other CPUs will be stoppped when panic notifier is called.
Changes in v9:
- Fix typo of & to &&.
- Fix typo of "Generic"
Changes in v6:
- Introduce asm/ring_buffer.h for arch_ring_buffer_flush_range().
- Use flush_cache_vmap() instead of flush_cache_all().
Changes in v5:
- Use ring_buffer_record_off() instead of ring_buffer_record_disable().
- Use flush_cache_all() to ensure flush all cache.
Changes in v3:
- update patch description.
---
arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/ring_buffer.h | 10 ++++++++++
arch/csky/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/loongarch/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/x86/include/asm/Kbuild | 1 +
arch/xtensa/include/asm/Kbuild | 1 +
include/asm-generic/ring_buffer.h | 13 +++++++++++++
kernel/trace/ring_buffer.c | 22 ++++++++++++++++++++++
23 files changed, 65 insertions(+)
create mode 100644 arch/arm64/include/asm/ring_buffer.h
create mode 100644 include/asm-generic/ring_buffer.h
diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 483965c5a4de..b154b4e3dfa8 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += agp.h
generic-y += asm-offsets.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index 4c69522e0328..483caacc6988 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -5,5 +5,6 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 03657ff8fbe3..decad5f2c826 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -3,6 +3,7 @@ generic-y += early_ioremap.h
generic-y += extable.h
generic-y += flat.h
generic-y += parport.h
+generic-y += ring_buffer.h
generated-y += mach-types.h
generated-y += unistd-nr.h
diff --git a/arch/arm64/include/asm/ring_buffer.h b/arch/arm64/include/asm/ring_buffer.h
new file mode 100644
index 000000000000..62316c406888
--- /dev/null
+++ b/arch/arm64/include/asm/ring_buffer.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_ARM64_RING_BUFFER_H
+#define _ASM_ARM64_RING_BUFFER_H
+
+#include <asm/cacheflush.h>
+
+/* Flush D-cache on persistent ring buffer */
+#define arch_ring_buffer_flush_range(start, end) dcache_clean_pop(start, end)
+
+#endif /* _ASM_ARM64_RING_BUFFER_H */
diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild
index 3a5c7f6e5aac..7dca0c6cdc84 100644
--- a/arch/csky/include/asm/Kbuild
+++ b/arch/csky/include/asm/Kbuild
@@ -9,6 +9,7 @@ generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
generic-y += text-patching.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index 1efa1e993d4b..0f887d4238ed 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -5,4 +5,5 @@ generic-y += extable.h
generic-y += iomap.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild
index 9034b583a88a..7e92957baf6a 100644
--- a/arch/loongarch/include/asm/Kbuild
+++ b/arch/loongarch/include/asm/Kbuild
@@ -10,5 +10,6 @@ generic-y += qrwlock.h
generic-y += user.h
generic-y += ioctl.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
generic-y += statfs.h
generic-y += text-patching.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index b282e0dd8dc1..62543bf305ff 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -3,5 +3,6 @@ generated-y += syscall_table.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += text-patching.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index 7178f990e8b3..0030309b47ad 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += syscalls.h
generic-y += tlb.h
generic-y += user.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 684569b2ecd6..9771c3d85074 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -12,5 +12,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 28004301c236..0a2530964413 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -5,6 +5,7 @@ generic-y += cmpxchg.h
generic-y += extable.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += spinlock.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index cef49d60d74c..8aa34621702d 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -8,4 +8,5 @@ generic-y += spinlock_types.h
generic-y += spinlock.h
generic-y += qrwlock_types.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index 4fb596d94c89..d48d158f7241 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 2e23533b67e3..805b5aeebb6f 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -5,4 +5,5 @@ generated-y += syscall_table_spu.h
generic-y += agp.h
generic-y += mcs_spinlock.h
generic-y += qrwlock.h
+generic-y += ring_buffer.h
generic-y += early_ioremap.h
diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index bd5fc9403295..7721b63642f4 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -14,5 +14,6 @@ generic-y += ticket_spinlock.h
generic-y += qrwlock.h
generic-y += qrwlock_types.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += vmlinux.lds.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 80bad7de7a04..0c1fc47c3ba0 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,3 +7,4 @@ generated-y += unistd_nr.h
generic-y += asm-offsets.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 4d3f10ed8275..f0403d3ee8ab 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -3,4 +3,5 @@ generated-y += syscall_table.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += parport.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 17ee8a273aa6..49c6bb326b75 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -4,4 +4,5 @@ generated-y += syscall_table_64.h
generic-y += agp.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
+generic-y += ring_buffer.h
generic-y += text-patching.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 1b9b82bbe322..2a1629ba8140 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -17,6 +17,7 @@ generic-y += module.lds.h
generic-y += parport.h
generic-y += percpu.h
generic-y += preempt.h
+generic-y += ring_buffer.h
generic-y += runtime-const.h
generic-y += softirq_stack.h
generic-y += switch_to.h
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4566000e15c4..078fd2c0d69d 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -14,3 +14,4 @@ generic-y += early_ioremap.h
generic-y += fprobe.h
generic-y += mcs_spinlock.h
generic-y += mmzone.h
+generic-y += ring_buffer.h
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 13fe45dea296..e57af619263a 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -6,5 +6,6 @@ generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
+generic-y += ring_buffer.h
generic-y += user.h
generic-y += text-patching.h
diff --git a/include/asm-generic/ring_buffer.h b/include/asm-generic/ring_buffer.h
new file mode 100644
index 000000000000..201d2aee1005
--- /dev/null
+++ b/include/asm-generic/ring_buffer.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Generic arch dependent ring_buffer macros.
+ */
+#ifndef __ASM_GENERIC_RING_BUFFER_H__
+#define __ASM_GENERIC_RING_BUFFER_H__
+
+#include <linux/cacheflush.h>
+
+/* Flush cache on ring buffer range if needed. Do nothing by default. */
+#define arch_ring_buffer_flush_range(start, end) do { } while (0)
+
+#endif /* __ASM_GENERIC_RING_BUFFER_H__ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 5326924615a4..7288383b1f27 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7,6 +7,7 @@
#include <linux/ring_buffer_types.h>
#include <linux/sched/isolation.h>
#include <linux/trace_recursion.h>
+#include <linux/panic_notifier.h>
#include <linux/trace_events.h>
#include <linux/ring_buffer.h>
#include <linux/trace_clock.h>
@@ -31,6 +32,7 @@
#include <linux/oom.h>
#include <linux/mm.h>
+#include <asm/ring_buffer.h>
#include <asm/local64.h>
#include <asm/local.h>
#include <asm/setup.h>
@@ -559,6 +561,7 @@ struct trace_buffer {
unsigned long range_addr_start;
unsigned long range_addr_end;
+ struct notifier_block flush_nb;
struct ring_buffer_meta *meta;
@@ -2521,6 +2524,16 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+/* Stop recording on a persistent buffer and flush cache if needed. */
+static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
+{
+ struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
+
+ ring_buffer_record_off(buffer);
+ arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
+ return NOTIFY_DONE;
+}
+
static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
int order, unsigned long start,
unsigned long end,
@@ -2651,6 +2664,12 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
mutex_init(&buffer->mutex);
+ /* Persistent ring buffer needs to flush cache before reboot. */
+ if (start && end) {
+ buffer->flush_nb.notifier_call = rb_flush_buffer_cb;
+ atomic_notifier_chain_register(&panic_notifier_list, &buffer->flush_nb);
+ }
+
return_ptr(buffer);
fail_free_buffers:
@@ -2749,6 +2768,9 @@ ring_buffer_free(struct trace_buffer *buffer)
{
int cpu;
+ if (buffer->range_addr_start && buffer->range_addr_end)
+ atomic_notifier_chain_unregister(&panic_notifier_list, &buffer->flush_nb);
+
cpuhp_state_remove_instance(CPUHP_TRACE_RB_PREPARE, &buffer->node);
irq_work_sync(&buffer->irq_work.work);
^ permalink raw reply related
* [PATCH v18 3/8] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-04-24 6:52 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when validating the persistent ring buffer
instead of discarding the entire ring buffer. Only skipped buffers
are invalidated (cleared).
If the cache data in memory fails to be synchronized during a reboot,
the persistent ring buffer may become partially corrupted, but other
sub-buffers may still contain readable event data. Only discard the
subbuffers that are found to be corrupted.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v18:
- Minor update by the new fix.
- Fix to substract BUF_PAGE_HDR_SIZE from meta->subbuf_size
to make the limit of commit size.
Changes in v17:
- Fix to use rb_page_size() of rewound pages for entry_bytes.
Changes in v15:
- Skip reader_page loop check on persistent ring buffer because
there can be contiguous empty(invalidated) pages.
- Do not show discarded page number information if it is 0.
Changes in v11:
- Fix a typo.
Changes in v9:
- Add meta->subbuf_size check.
- Fix a typo.
- Handle invalid reader_page case.
Changes in v8:
- Add comment in rb_valudate_buffer()
- Clear the RB_MISSED_* flags in rb_valudate_buffer() instead of
skipping subbuf.
- Remove unused subbuf local variable from rb_cpu_meta_valid().
Changes in v7:
- Combined with Handling RB_MISSED_* flags patch, focus on validation at boot.
- Remove checking subbuffer data when validating metadata, because it should be done
later.
- Do not mark the discarded sub buffer page but just reset it.
Changes in v6:
- Show invalid page detection message once per CPU.
Changes in v5:
- Instead of showing errors for each page, just show the number
of discarded pages at last.
Changes in v3:
- Record missed data event on commit.
---
kernel/trace/ring_buffer.c | 111 ++++++++++++++++++++++++++------------------
1 file changed, 66 insertions(+), 45 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7288383b1f27..404c1fcac0ae 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -370,6 +370,12 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
return local_read(&bpage->page->commit);
}
+/* Size is determined by what has been committed */
+static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
+{
+ return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+}
+
static void free_buffer_page(struct buffer_page *bpage)
{
/* Range pages are not to be freed */
@@ -1762,7 +1768,6 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
unsigned long *subbuf_mask)
{
int subbuf_size = PAGE_SIZE;
- struct buffer_data_page *subbuf;
unsigned long buffers_start;
unsigned long buffers_end;
int i;
@@ -1770,6 +1775,11 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
if (!subbuf_mask)
return false;
+ if (meta->subbuf_size != PAGE_SIZE) {
+ pr_info("Ring buffer boot meta [%d] invalid subbuf_size\n", cpu);
+ return false;
+ }
+
buffers_start = meta->first_buffer;
buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
@@ -1786,11 +1796,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- subbuf = rb_subbufs_from_meta(meta);
-
bitmap_clear(subbuf_mask, 0, meta->nr_subbufs);
- /* Is the meta buffers and the subbufs themselves have correct data? */
+ /*
+ * Ensure the meta::buffers array has correct data. The data in each subbufs
+ * are checked later in rb_meta_validate_events().
+ */
for (i = 0; i < meta->nr_subbufs; i++) {
if (meta->buffers[i] < 0 ||
meta->buffers[i] >= meta->nr_subbufs) {
@@ -1798,18 +1809,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
return false;
}
- if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
- pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu);
- return false;
- }
-
if (test_bit(meta->buffers[i], subbuf_mask)) {
pr_info("Ring buffer boot meta [%d] array has duplicates\n", cpu);
return false;
}
set_bit(meta->buffers[i], subbuf_mask);
- subbuf = (void *)subbuf + subbuf_size;
}
return true;
@@ -1873,13 +1878,22 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+ struct ring_buffer_cpu_meta *meta)
{
unsigned long long ts;
+ unsigned long tail;
u64 delta;
- int tail;
- tail = local_read(&dpage->commit);
+ /*
+ * When a sub-buffer is recovered from a read, the commit value may
+ * have RB_MISSED_* bits set, as these bits are reset on reuse.
+ * Even after clearing these bits, a commit value greater than the
+ * subbuf_size is considered invalid.
+ */
+ tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ if (tail > meta->subbuf_size - BUF_PAGE_HDR_SIZE)
+ return -1;
return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
}
@@ -1890,6 +1904,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
struct buffer_page *head_page, *orig_head, *orig_reader;
unsigned long entry_bytes = 0;
unsigned long entries = 0;
+ int discarded = 0;
int ret;
u64 ts;
int i;
@@ -1901,14 +1916,19 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_reader = cpu_buffer->reader_page;
/* Do the reader page first */
- ret = rb_validate_buffer(orig_reader->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(orig_reader->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer reader page is invalid\n");
- goto invalid;
+ pr_info("Ring buffer meta [%d] invalid reader page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&orig_reader->entries, 0);
+ local_set(&orig_reader->page->commit, 0);
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(orig_reader);
+ local_set(&orig_reader->entries, ret);
}
- entries += ret;
- entry_bytes += local_read(&orig_reader->page->commit);
- local_set(&orig_reader->entries, ret);
ts = head_page->page->time_stamp;
@@ -1936,7 +1956,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
break;
/* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0)
break;
@@ -1945,7 +1965,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (ret)
local_inc(&cpu_buffer->pages_touched);
entries += ret;
- entry_bytes += rb_page_commit(head_page);
+ entry_bytes += rb_page_size(head_page);
}
if (i)
pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2015,21 +2035,24 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == orig_reader)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+ ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
if (ret < 0) {
- pr_info("Ring buffer meta [%d] invalid buffer page\n",
- cpu_buffer->cpu);
- goto invalid;
- }
-
- /* If the buffer has content, update pages_touched */
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
-
- entries += ret;
- entry_bytes += local_read(&head_page->page->commit);
- local_set(&head_page->entries, ret);
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ /* Instead of discard whole ring buffer, discard only this sub-buffer. */
+ local_set(&head_page->entries, 0);
+ local_set(&head_page->page->commit, 0);
+ } else {
+ /* If the buffer has content, update pages_touched */
+ if (ret)
+ local_inc(&cpu_buffer->pages_touched);
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ local_set(&head_page->entries, ret);
+ }
if (head_page == cpu_buffer->commit_page)
break;
}
@@ -2043,7 +2066,10 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
local_set(&cpu_buffer->entries, entries);
local_set(&cpu_buffer->entries_bytes, entry_bytes);
- pr_info("Ring buffer meta [%d] is from previous boot!\n", cpu_buffer->cpu);
+ pr_info("Ring buffer meta [%d] is from previous boot!", cpu_buffer->cpu);
+ if (discarded)
+ pr_cont(" (%d pages discarded)", discarded);
+ pr_cont("\n");
return;
invalid:
@@ -3330,12 +3356,6 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
return NULL;
}
-/* Size is determined by what has been committed */
-static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
-{
- return rb_page_commit(bpage) & ~RB_MISSED_MASK;
-}
-
static __always_inline unsigned
rb_commit_index(struct ring_buffer_per_cpu *cpu_buffer)
{
@@ -5648,11 +5668,12 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
again:
/*
* This should normally only loop twice. But because the
- * start of the reader inserts an empty page, it causes
- * a case where we will loop three times. There should be no
- * reason to loop four times (that I know of).
+ * start of the reader inserts an empty page, it causes a
+ * case where we will loop three times. There should be no
+ * reason to loop four times unless the ring buffer is a
+ * recovered persistent ring buffer.
*/
- if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
+ if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3 && !cpu_buffer->ring_meta)) {
reader = NULL;
goto out;
}
^ permalink raw reply related
* [PATCH v18 4/8] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-04-24 6:52 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Skip invalid sub-buffers when rewinding the persistent ring buffer
instead of stopping the rewinding the ring buffer. The skipped
buffers are cleared.
To ensure the rewinding stops at the unused page, this also clears
buffer_data_page::time_stamp when tracing resets the buffer. This
allows us to identify unused pages and empty pages.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v18:
- Reset timestamp of reader_page when the entire cpu_buffer is
invalid.
- Minor update by new fix.
Changes in v17:
- Fix to verify head_page at first before using its timestamp.
- Reset timestamp if the page is invalid.
Changes in v12:
- Fix build error.
Changes in v11:
- Reset timestamp when the buffer is invalid.
- When rewinding, skip subbuf page if timestamp is wrong and
check timestamp after validating buffer data page.
Changes in v10:
- Newly added.
---
kernel/trace/ring_buffer.c | 94 ++++++++++++++++++++++++++------------------
1 file changed, 55 insertions(+), 39 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 404c1fcac0ae..899a78b307e6 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -363,6 +363,7 @@ struct buffer_page {
static void rb_init_page(struct buffer_data_page *bpage)
{
local_set(&bpage->commit, 0);
+ bpage->time_stamp = 0;
}
static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
@@ -1878,12 +1879,14 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
- struct ring_buffer_cpu_meta *meta)
+static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
+ struct ring_buffer_cpu_meta *meta, u64 prev_ts, u64 next_ts)
{
+ struct buffer_data_page *dpage = bpage->page;
unsigned long long ts;
unsigned long tail;
u64 delta;
+ int ret = -1;
/*
* When a sub-buffer is recovered from a read, the commit value may
@@ -1892,9 +1895,19 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
* subbuf_size is considered invalid.
*/
tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
- if (tail > meta->subbuf_size - BUF_PAGE_HDR_SIZE)
- return -1;
- return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+ if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
+ ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+
+ if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
+ local_set(&bpage->entries, 0);
+ local_set(&bpage->page->commit, 0);
+ bpage->page->time_stamp = prev_ts ? prev_ts : next_ts;
+ ret = -1;
+ } else {
+ local_set(&bpage->entries, ret);
+ }
+
+ return ret;
}
/* If the meta data has been validated, now validate the events */
@@ -1915,25 +1928,29 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_head = head_page = cpu_buffer->head_page;
orig_reader = cpu_buffer->reader_page;
- /* Do the reader page first */
- ret = rb_validate_buffer(orig_reader->page, cpu_buffer->cpu, meta);
+ /* Do the head page first */
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, 0);
+ if (ret < 0) {
+ pr_info("Ring buffer meta [%d] invalid head page detected\n",
+ cpu_buffer->cpu);
+ goto skip_rewind;
+ }
+ ts = head_page->page->time_stamp;
+
+ /* Do the reader page - reader must be previous to head. */
+ ret = rb_validate_buffer(orig_reader, cpu_buffer->cpu, meta, 0, ts);
if (ret < 0) {
pr_info("Ring buffer meta [%d] invalid reader page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&orig_reader->entries, 0);
- local_set(&orig_reader->page->commit, 0);
} else {
entries += ret;
entry_bytes += rb_page_size(orig_reader);
- local_set(&orig_reader->entries, ret);
+ ts = orig_reader->page->time_stamp;
}
- ts = head_page->page->time_stamp;
-
/*
- * Try to rewind the head so that we can read the pages which already
+ * Try to rewind the head so that we can read the pages which are already
* read in the previous boot.
*/
if (head_page == cpu_buffer->tail_page)
@@ -1946,26 +1963,27 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == cpu_buffer->tail_page)
break;
- /* Ensure the page has older data than head. */
- if (ts < head_page->page->time_stamp)
- break;
-
- ts = head_page->page->time_stamp;
- /* Ensure the page has correct timestamp and some data. */
- if (!ts || rb_page_commit(head_page) == 0)
+ /* Rewind until unused page (no timestamp, no commit). */
+ if (!head_page->page->time_stamp && rb_page_commit(head_page) == 0)
break;
- /* Stop rewind if the page is invalid. */
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
- if (ret < 0)
- break;
-
- /* Recover the number of entries and update stats. */
- local_set(&head_page->entries, ret);
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
- entries += ret;
- entry_bytes += rb_page_size(head_page);
+ /*
+ * Skip if the page is invalid, or its timestamp is newer than the
+ * previous valid page.
+ */
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, ts);
+ if (ret < 0) {
+ if (!discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ discarded++;
+ } else {
+ entries += ret;
+ entry_bytes += rb_page_size(head_page);
+ if (ret > 0)
+ local_inc(&cpu_buffer->pages_touched);
+ ts = head_page->page->time_stamp;
+ }
}
if (i)
pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2027,6 +2045,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Nothing more to do, the only page is the reader page */
goto done;
}
+ ts = head_page->page->time_stamp;
/* Iterate until finding the commit page */
for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
@@ -2035,15 +2054,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == orig_reader)
continue;
- ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
+ ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, ts, 0);
if (ret < 0) {
if (!discarded)
pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
cpu_buffer->cpu);
discarded++;
- /* Instead of discard whole ring buffer, discard only this sub-buffer. */
- local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
} else {
/* If the buffer has content, update pages_touched */
if (ret)
@@ -2051,7 +2067,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
entries += ret;
entry_bytes += rb_page_size(head_page);
- local_set(&head_page->entries, ret);
+ ts = head_page->page->time_stamp;
}
if (head_page == cpu_buffer->commit_page)
break;
@@ -2079,12 +2095,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Reset the reader page */
local_set(&cpu_buffer->reader_page->entries, 0);
- local_set(&cpu_buffer->reader_page->page->commit, 0);
+ rb_init_page(cpu_buffer->reader_page->page);
/* Reset all the subbuffers */
for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
local_set(&head_page->entries, 0);
- local_set(&head_page->page->commit, 0);
+ rb_init_page(head_page->page);
}
}
^ permalink raw reply related
* [PATCH v18 5/8] ring-buffer: Add persistent ring buffer invalid-page inject test
From: Masami Hiramatsu (Google) @ 2026-04-24 6:52 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a self-corrupting test for the persistent ring buffer.
This will inject an erroneous value to some sub-buffer pages (where
the index is even or multiples of 5) in the persistent ring buffer
when the kernel panics, and checks whether the number of detected
invalid pages and the total entry_bytes are the same as the recorded
values after reboot.
This ensures that the kernel can correctly recover a partially
corrupted persistent ring buffer after a reboot or panic.
The test only runs on the persistent ring buffer whose name is
"ptracingtest". The user has to fill it with events before a
kernel panic.
To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT
and add the following kernel cmdline:
reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
panic=1
Run the following commands after the 1st boot:
cd /sys/kernel/tracing/instances/ptracingtest
echo 1 > tracing_on
echo 1 > events/enable
sleep 3
echo c > /proc/sysrq-trigger
After panic message, the kernel will reboot and run the verification
on the persistent ring buffer, e.g.
Ring buffer meta [2] invalid buffer page detected
Ring buffer meta [2] is from previous boot! (318 pages discarded)
Ring buffer testing [2] invalid pages: PASSED (318/318)
Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v18:
- Fix to mask RB_MISSED_* flags when counting entry_bytes.
Changes in v17:
- In rb_test_inject_invalid_pages(), changed entry_bytes and
idx to unsigned long
- Added NULL checks for cpu_buffer and meta.
- In allocate_trace_buffer(), added a NULL check for tr->name
before comparing it with strcmp.
Changes in v16:
- Update description and comments according to review comments.
Changes in v15:
- Use pr_warn() for test result.
- Inject errors on the page index is multiples of 5 so that
this can reproduce contiguous empty pages.
Changes in v14:
- Rename config to CONFIG_RING_BUFFER_PERSISTENT_INJECT.
- Clear meta->nr_invalid/entry_bytes after testing.
- Add test commands in config comment.
Changes in v10:
- Add entry_bytes test.
- Do not compile test code if CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n.
Changes in v9:
- Test also reader pages.
---
include/linux/ring_buffer.h | 1 +
kernel/trace/Kconfig | 34 +++++++++++++++++++
kernel/trace/ring_buffer.c | 79 +++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.c | 4 ++
4 files changed, 118 insertions(+)
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 994f52b34344..0670742b2d60 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -238,6 +238,7 @@ int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
+ RB_FL_TESTING = 1 << 1,
};
#ifdef CONFIG_RING_BUFFER
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..084f34dc6c9f 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1202,6 +1202,40 @@ config RING_BUFFER_VALIDATE_TIME_DELTAS
Only say Y if you understand what this does, and you
still want it enabled. Otherwise say N
+config RING_BUFFER_PERSISTENT_INJECT
+ bool "Enable persistent ring buffer error injection test"
+ depends on RING_BUFFER
+ help
+ This option will have the kernel check if the persistent ring
+ buffer is named "ptracingtest". and if so, it will corrupt some
+ of its pages on a kernel panic. This is used to test if the
+ persistent ring buffer can recover from some of its sub-buffers
+ being corrupted.
+ To use this, boot a kernel with a "ptracingtest" persistent
+ ring buffer, e.g.
+
+ reserve_mem=20M:2M:trace trace_instance=ptracingtest@trace panic=1
+
+ And after the 1st boot, run the following commands:
+
+ cd /sys/kernel/tracing/instances/ptracingtest
+ echo 1 > events/enable
+ echo 1 > tracing_on
+ sleep 3
+ echo c > /proc/sysrq-trigger
+
+ After the panic message, the kernel will reboot and will show
+ the test results in the console output.
+
+ Note that events for the test ring buffer needs to be enabled
+ prior to crashing the kernel so that the ring buffer has content
+ that the test will corrupt.
+ As the test will corrupt events in the "ptracingtest" persistent
+ ring buffer, it should not be used for any other purpose other
+ than this test.
+
+ If unsure, say N
+
config MMIOTRACE_TEST
tristate "Test module for mmiotrace"
depends on MMIOTRACE && m
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 899a78b307e6..e8d2b3457d7f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -64,6 +64,10 @@ struct ring_buffer_cpu_meta {
unsigned long commit_buffer;
__u32 subbuf_size;
__u32 nr_subbufs;
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+ __u32 nr_invalid;
+ __u32 entry_bytes;
+#endif
int buffers[];
};
@@ -2086,6 +2090,21 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (discarded)
pr_cont(" (%d pages discarded)", discarded);
pr_cont("\n");
+
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+ if (meta->nr_invalid)
+ pr_warn("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
+ cpu_buffer->cpu,
+ (discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+ discarded, meta->nr_invalid);
+ if (meta->entry_bytes)
+ pr_warn("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
+ cpu_buffer->cpu,
+ (entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+ (long)entry_bytes, (long)meta->entry_bytes);
+ meta->nr_invalid = 0;
+ meta->entry_bytes = 0;
+#endif
return;
invalid:
@@ -2566,12 +2585,72 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
+{
+ struct ring_buffer_per_cpu *cpu_buffer;
+ struct ring_buffer_cpu_meta *meta;
+ struct buffer_data_page *dpage;
+ unsigned long entry_bytes = 0;
+ unsigned long ptr;
+ int subbuf_size;
+ int invalid = 0;
+ int cpu;
+ int i;
+
+ if (!(buffer->flags & RB_FL_TESTING))
+ return;
+
+ guard(preempt)();
+ cpu = smp_processor_id();
+
+ cpu_buffer = buffer->buffers[cpu];
+ if (!cpu_buffer)
+ return;
+ meta = cpu_buffer->ring_meta;
+ if (!meta)
+ return;
+
+ ptr = (unsigned long)rb_subbufs_from_meta(meta);
+ subbuf_size = meta->subbuf_size;
+
+ for (i = 0; i < meta->nr_subbufs; i++) {
+ unsigned long idx = meta->buffers[i];
+
+ dpage = (void *)(ptr + idx * subbuf_size);
+ /* Skip unused pages */
+ if (!local_read(&dpage->commit))
+ continue;
+
+ /*
+ * Invalidate even pages or multiples of 5. This will cause 3
+ * contiguous invalidated(empty) pages.
+ */
+ if (!(i & 0x1) || !(i % 5)) {
+ local_add(subbuf_size + 1, &dpage->commit);
+ invalid++;
+ } else {
+ /* Count total commit bytes. */
+ entry_bytes += local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ }
+ }
+
+ pr_info("Inject invalidated %d pages on CPU%d, total size: %ld\n",
+ invalid, cpu, (long)entry_bytes);
+ meta->nr_invalid = invalid;
+ meta->entry_bytes = entry_bytes;
+}
+#else /* !CONFIG_RING_BUFFER_PERSISTENT_INJECT */
+#define rb_test_inject_invalid_pages(buffer) do { } while (0)
+#endif
+
/* Stop recording on a persistent buffer and flush cache if needed. */
static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
{
struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
ring_buffer_record_off(buffer);
+ rb_test_inject_invalid_pages(buffer);
arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
return NOTIFY_DONE;
}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e9455d46ec16..d972b24cd73b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9436,6 +9436,8 @@ static void setup_trace_scratch(struct trace_array *tr,
memset(tscratch, 0, size);
}
+#define TRACE_TEST_PTRACING_NAME "ptracingtest"
+
static int
allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned long size)
{
@@ -9448,6 +9450,8 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned
buf->tr = tr;
if (tr->range_addr_start && tr->range_addr_size) {
+ if (tr->name && !strcmp(tr->name, TRACE_TEST_PTRACING_NAME))
+ rb_flags |= RB_FL_TESTING;
/* Add scratch buffer to handle 128 modules */
buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
tr->range_addr_start,
^ permalink raw reply related
* [PATCH v18 6/8] ring-buffer: Show commit numbers in buffer_meta file
From: Masami Hiramatsu (Google) @ 2026-04-24 6:52 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
In addition to the index number, show the commit numbers of
each data page in the per_cpu buffer_meta file.
This is useful for understanding the current status of the
persistent ring buffer. (Note that this file is shown
only for persistent ring buffer and its backup instance)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v17:
- Added NULL check for dpage in rbm_show in ring_buffer.c.
Changes in v16:
- update description.
---
kernel/trace/ring_buffer.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index e8d2b3457d7f..de653a8e3cec 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2216,6 +2216,7 @@ static int rbm_show(struct seq_file *m, void *v)
struct ring_buffer_per_cpu *cpu_buffer = m->private;
struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
unsigned long val = (unsigned long)v;
+ struct buffer_data_page *dpage;
if (val == 1) {
seq_printf(m, "head_buffer: %d\n",
@@ -2228,7 +2229,9 @@ static int rbm_show(struct seq_file *m, void *v)
}
val -= 2;
- seq_printf(m, "buffer[%ld]: %d\n", val, meta->buffers[val]);
+ dpage = rb_range_buffer(cpu_buffer, val);
+ seq_printf(m, "buffer[%ld]: %d (commit: %ld)\n",
+ val, meta->buffers[val], dpage ? local_read(&dpage->commit) : -1);
return 0;
}
^ permalink raw reply related
* [PATCH v18 7/8] ring-buffer: Cleanup persistent ring buffer validation
From: Masami Hiramatsu (Google) @ 2026-04-24 6:52 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cleanup rb_meta_validate_events() function to make it easier to read.
This includes the following cleanups:
- Introduce rb_validatation_state to hold working variables in
validation.
- Move repleated validation state updates into rb_validate_buffer().
- Move reader_page injection code outside of rb_meta_validate_events().
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/ring_buffer.c | 186 ++++++++++++++++++++++----------------------
1 file changed, 95 insertions(+), 91 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index de653a8e3cec..9850a0d8d24b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1883,8 +1883,16 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
return events;
}
-static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
- struct ring_buffer_cpu_meta *meta, u64 prev_ts, u64 next_ts)
+struct rb_validation_state {
+ unsigned long entries;
+ unsigned long entry_bytes;
+ int discarded;
+ u64 ts;
+};
+
+static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
+ struct ring_buffer_cpu_meta *meta,
+ u64 prev_ts, u64 next_ts)
{
struct buffer_data_page *dpage = bpage->page;
unsigned long long ts;
@@ -1914,16 +1922,82 @@ static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
return ret;
}
+static void rb_validate_buffer(struct buffer_page *bpage,
+ struct ring_buffer_per_cpu *cpu_buffer,
+ struct ring_buffer_cpu_meta *meta,
+ struct rb_validation_state *state,
+ u64 prev_ts, u64 next_ts)
+{
+ int ret;
+
+ ret = __rb_validate_buffer(bpage, cpu_buffer->cpu, meta, prev_ts, next_ts);
+ if (ret < 0) {
+ if (!state->discarded)
+ pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+ cpu_buffer->cpu);
+ state->discarded++;
+ } else {
+ /* If the buffer has content, update pages_touched */
+ if (ret)
+ local_inc(&cpu_buffer->pages_touched);
+
+ state->entries += ret;
+ state->entry_bytes += rb_page_size(bpage);
+ state->ts = bpage->page->time_stamp;
+ }
+}
+
+static void rb_meta_inject_reader_page(struct ring_buffer_per_cpu *cpu_buffer,
+ struct ring_buffer_cpu_meta *meta,
+ struct buffer_page *orig_head,
+ struct buffer_page *head_page)
+{
+ struct buffer_page *bpage = orig_head;
+ int i;
+
+ rb_dec_page(&bpage);
+ /*
+ * Insert the reader_page before the original head page.
+ * Since the list encode RB_PAGE flags, general list
+ * operations should be avoided.
+ */
+ cpu_buffer->reader_page->list.next = &orig_head->list;
+ cpu_buffer->reader_page->list.prev = orig_head->list.prev;
+ orig_head->list.prev = &cpu_buffer->reader_page->list;
+ bpage->list.next = &cpu_buffer->reader_page->list;
+
+ /* Make the head_page the reader page */
+ cpu_buffer->reader_page = head_page;
+ bpage = head_page;
+ rb_inc_page(&head_page);
+ head_page->list.prev = bpage->list.prev;
+ rb_dec_page(&bpage);
+ bpage->list.next = &head_page->list;
+ rb_set_list_to_head(&bpage->list);
+ cpu_buffer->pages = &head_page->list;
+
+ cpu_buffer->head_page = head_page;
+ meta->head_buffer = (unsigned long)head_page->page;
+
+ /* Reset all the indexes */
+ bpage = cpu_buffer->reader_page;
+ meta->buffers[0] = rb_meta_subbuf_idx(meta, bpage->page);
+ bpage->id = 0;
+
+ for (i = 1, bpage = head_page; i < meta->nr_subbufs;
+ i++, rb_inc_page(&bpage)) {
+ meta->buffers[i] = rb_meta_subbuf_idx(meta, bpage->page);
+ bpage->id = i;
+ }
+}
+
/* If the meta data has been validated, now validate the events */
static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
{
struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
struct buffer_page *head_page, *orig_head, *orig_reader;
- unsigned long entry_bytes = 0;
- unsigned long entries = 0;
- int discarded = 0;
+ struct rb_validation_state state = { 0 };
int ret;
- u64 ts;
int i;
if (!meta || !meta->head_buffer)
@@ -1933,25 +2007,16 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
orig_reader = cpu_buffer->reader_page;
/* Do the head page first */
- ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, 0);
+ ret = __rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, 0);
if (ret < 0) {
pr_info("Ring buffer meta [%d] invalid head page detected\n",
cpu_buffer->cpu);
goto skip_rewind;
}
- ts = head_page->page->time_stamp;
+ state.ts = head_page->page->time_stamp;
/* Do the reader page - reader must be previous to head. */
- ret = rb_validate_buffer(orig_reader, cpu_buffer->cpu, meta, 0, ts);
- if (ret < 0) {
- pr_info("Ring buffer meta [%d] invalid reader page detected\n",
- cpu_buffer->cpu);
- discarded++;
- } else {
- entries += ret;
- entry_bytes += rb_page_size(orig_reader);
- ts = orig_reader->page->time_stamp;
- }
+ rb_validate_buffer(orig_reader, cpu_buffer, meta, &state, 0, state.ts);
/*
* Try to rewind the head so that we can read the pages which are already
@@ -1975,19 +2040,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
* Skip if the page is invalid, or its timestamp is newer than the
* previous valid page.
*/
- ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, ts);
- if (ret < 0) {
- if (!discarded)
- pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
- cpu_buffer->cpu);
- discarded++;
- } else {
- entries += ret;
- entry_bytes += rb_page_size(head_page);
- if (ret > 0)
- local_inc(&cpu_buffer->pages_touched);
- ts = head_page->page->time_stamp;
- }
+ rb_validate_buffer(head_page, cpu_buffer, meta, &state, 0, state.ts);
}
if (i)
pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2001,43 +2054,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
* into the location just before the original head page.
*/
if (head_page != orig_head) {
- struct buffer_page *bpage = orig_head;
-
- rb_dec_page(&bpage);
- /*
- * Insert the reader_page before the original head page.
- * Since the list encode RB_PAGE flags, general list
- * operations should be avoided.
- */
- cpu_buffer->reader_page->list.next = &orig_head->list;
- cpu_buffer->reader_page->list.prev = orig_head->list.prev;
- orig_head->list.prev = &cpu_buffer->reader_page->list;
- bpage->list.next = &cpu_buffer->reader_page->list;
-
- /* Make the head_page the reader page */
- cpu_buffer->reader_page = head_page;
- bpage = head_page;
- rb_inc_page(&head_page);
- head_page->list.prev = bpage->list.prev;
- rb_dec_page(&bpage);
- bpage->list.next = &head_page->list;
- rb_set_list_to_head(&bpage->list);
- cpu_buffer->pages = &head_page->list;
-
- cpu_buffer->head_page = head_page;
- meta->head_buffer = (unsigned long)head_page->page;
-
- /* Reset all the indexes */
- bpage = cpu_buffer->reader_page;
- meta->buffers[0] = rb_meta_subbuf_idx(meta, bpage->page);
- bpage->id = 0;
-
- for (i = 1, bpage = head_page; i < meta->nr_subbufs;
- i++, rb_inc_page(&bpage)) {
- meta->buffers[i] = rb_meta_subbuf_idx(meta, bpage->page);
- bpage->id = i;
- }
-
+ rb_meta_inject_reader_page(cpu_buffer, meta, orig_head, head_page);
/* We'll restart verifying from orig_head */
head_page = orig_head;
}
@@ -2049,7 +2066,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Nothing more to do, the only page is the reader page */
goto done;
}
- ts = head_page->page->time_stamp;
+ state.ts = head_page->page->time_stamp;
/* Iterate until finding the commit page */
for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
@@ -2058,21 +2075,8 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
if (head_page == orig_reader)
continue;
- ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, ts, 0);
- if (ret < 0) {
- if (!discarded)
- pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
- cpu_buffer->cpu);
- discarded++;
- } else {
- /* If the buffer has content, update pages_touched */
- if (ret)
- local_inc(&cpu_buffer->pages_touched);
+ rb_validate_buffer(head_page, cpu_buffer, meta, &state, state.ts, 0);
- entries += ret;
- entry_bytes += rb_page_size(head_page);
- ts = head_page->page->time_stamp;
- }
if (head_page == cpu_buffer->commit_page)
break;
}
@@ -2083,25 +2087,25 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
goto invalid;
}
done:
- local_set(&cpu_buffer->entries, entries);
- local_set(&cpu_buffer->entries_bytes, entry_bytes);
+ local_set(&cpu_buffer->entries, state.entries);
+ local_set(&cpu_buffer->entries_bytes, state.entry_bytes);
pr_info("Ring buffer meta [%d] is from previous boot!", cpu_buffer->cpu);
- if (discarded)
- pr_cont(" (%d pages discarded)", discarded);
+ if (state.discarded)
+ pr_cont(" (%d pages discarded)", state.discarded);
pr_cont("\n");
#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
if (meta->nr_invalid)
pr_warn("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
cpu_buffer->cpu,
- (discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
- discarded, meta->nr_invalid);
+ (state.discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+ state.discarded, meta->nr_invalid);
if (meta->entry_bytes)
pr_warn("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
cpu_buffer->cpu,
- (entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
- (long)entry_bytes, (long)meta->entry_bytes);
+ (state.entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+ (long)state.entry_bytes, (long)meta->entry_bytes);
meta->nr_invalid = 0;
meta->entry_bytes = 0;
#endif
^ permalink raw reply related
* [PATCH v18 8/8] ring-buffer: Cleanup buffer_data_page related code
From: Masami Hiramatsu (Google) @ 2026-04-24 6:53 UTC (permalink / raw)
To: Steven Rostedt, Catalin Marinas, Will Deacon
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Code cleanup related to buffer_data_page for readability,
which includes:
- Introduce rb_data_page_commit() and rb_data_page_size()
- Use 'dpage' for buffer_data_page, instead of 'bpage' because
'bpage' is used for buffer_page.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/ring_buffer.c | 112 ++++++++++++++++++++++++--------------------
1 file changed, 60 insertions(+), 52 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9850a0d8d24b..cb524b2afb7b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -364,21 +364,30 @@ struct buffer_page {
#define RB_WRITE_MASK 0xfffff
#define RB_WRITE_INTCNT (1 << 20)
-static void rb_init_page(struct buffer_data_page *bpage)
+static void rb_init_data_page(struct buffer_data_page *bpage)
{
local_set(&bpage->commit, 0);
bpage->time_stamp = 0;
}
+static __always_inline long rb_data_page_commit(struct buffer_data_page *dpage)
+{
+ return local_read(&dpage->commit);
+}
+
+static __always_inline long rb_data_page_size(struct buffer_data_page *dpage)
+{
+ return rb_data_page_commit(dpage) & ~RB_MISSED_MASK;
+}
+
static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
{
- return local_read(&bpage->page->commit);
+ return rb_data_page_commit(bpage->page);
}
-/* Size is determined by what has been committed */
static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
{
- return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+ return rb_data_page_size(bpage->page);
}
static void free_buffer_page(struct buffer_page *bpage)
@@ -419,7 +428,7 @@ static struct buffer_data_page *alloc_cpu_data(int cpu, int order)
return NULL;
dpage = page_address(page);
- rb_init_page(dpage);
+ rb_init_data_page(dpage);
return dpage;
}
@@ -659,7 +668,7 @@ static void verify_event(struct ring_buffer_per_cpu *cpu_buffer,
do {
if (page == tail_page || WARN_ON_ONCE(stop++ > 100))
done = true;
- commit = local_read(&page->page->commit);
+ commit = rb_page_commit(page);
write = local_read(&page->write);
if (addr >= (unsigned long)&page->page->data[commit] &&
addr < (unsigned long)&page->page->data[write])
@@ -1906,7 +1915,7 @@ static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
* Even after clearing these bits, a commit value greater than the
* subbuf_size is considered invalid.
*/
- tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ tail = rb_data_page_size(dpage);
if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
@@ -2118,12 +2127,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
/* Reset the reader page */
local_set(&cpu_buffer->reader_page->entries, 0);
- rb_init_page(cpu_buffer->reader_page->page);
+ rb_init_data_page(cpu_buffer->reader_page->page);
/* Reset all the subbuffers */
for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
local_set(&head_page->entries, 0);
- rb_init_page(head_page->page);
+ rb_init_data_page(head_page->page);
}
}
@@ -2183,7 +2192,7 @@ static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages, int sc
*/
for (i = 0; i < meta->nr_subbufs; i++) {
meta->buffers[i] = i;
- rb_init_page(subbuf);
+ rb_init_data_page(subbuf);
subbuf += meta->subbuf_size;
}
}
@@ -2235,7 +2244,7 @@ static int rbm_show(struct seq_file *m, void *v)
val -= 2;
dpage = rb_range_buffer(cpu_buffer, val);
seq_printf(m, "buffer[%ld]: %d (commit: %ld)\n",
- val, meta->buffers[val], dpage ? local_read(&dpage->commit) : -1);
+ val, meta->buffers[val], dpage ? rb_data_page_commit(dpage) : -1);
return 0;
}
@@ -2626,7 +2635,7 @@ static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
dpage = (void *)(ptr + idx * subbuf_size);
/* Skip unused pages */
- if (!local_read(&dpage->commit))
+ if (!rb_data_page_commit(dpage))
continue;
/*
@@ -2638,7 +2647,7 @@ static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
invalid++;
} else {
/* Count total commit bytes. */
- entry_bytes += local_read(&dpage->commit) & ~RB_MISSED_MASK;
+ entry_bytes += rb_data_page_size(dpage);
}
}
@@ -4167,8 +4176,7 @@ rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
local_set(&cpu_buffer->commit_page->page->commit,
rb_page_write(cpu_buffer->commit_page));
RB_WARN_ON(cpu_buffer,
- local_read(&cpu_buffer->commit_page->page->commit) &
- ~RB_WRITE_MASK);
+ rb_page_commit(cpu_buffer->commit_page) & ~RB_WRITE_MASK);
barrier();
}
@@ -4540,7 +4548,7 @@ static const char *show_interrupt_level(void)
return show_irq_str(level);
}
-static void dump_buffer_page(struct buffer_data_page *bpage,
+static void dump_buffer_page(struct buffer_data_page *dpage,
struct rb_event_info *info,
unsigned long tail)
{
@@ -4548,12 +4556,12 @@ static void dump_buffer_page(struct buffer_data_page *bpage,
u64 ts, delta;
int e;
- ts = bpage->time_stamp;
+ ts = dpage->time_stamp;
pr_warn(" [%lld] PAGE TIME STAMP\n", ts);
for (e = 0; e < tail; e += rb_event_length(event)) {
- event = (struct ring_buffer_event *)(bpage->data + e);
+ event = (struct ring_buffer_event *)(dpage->data + e);
switch (event->type_len) {
@@ -4603,7 +4611,7 @@ static atomic_t ts_dump;
} \
atomic_inc(&cpu_buffer->record_disabled); \
pr_warn(fmt, ##__VA_ARGS__); \
- dump_buffer_page(bpage, info, tail); \
+ dump_buffer_page(dpage, info, tail); \
atomic_dec(&ts_dump); \
/* There's some cases in boot up that this can happen */ \
if (WARN_ON_ONCE(system_state != SYSTEM_BOOTING)) \
@@ -4619,16 +4627,16 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
struct rb_event_info *info,
unsigned long tail)
{
- struct buffer_data_page *bpage;
+ struct buffer_data_page *dpage;
u64 ts, delta;
bool full = false;
int ret;
- bpage = info->tail_page->page;
+ dpage = info->tail_page->page;
if (tail == CHECK_FULL_PAGE) {
full = true;
- tail = local_read(&bpage->commit);
+ tail = rb_data_page_commit(dpage);
} else if (info->add_timestamp &
(RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE)) {
/* Ignore events with absolute time stamps */
@@ -4639,7 +4647,7 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
* Do not check the first event (skip possible extends too).
* Also do not check if previous events have not been committed.
*/
- if (tail <= 8 || tail > local_read(&bpage->commit))
+ if (tail <= 8 || tail > rb_data_page_commit(dpage))
return;
/*
@@ -4648,7 +4656,7 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
if (atomic_inc_return(this_cpu_ptr(&checking)) != 1)
goto out;
- ret = rb_read_data_buffer(bpage, tail, cpu_buffer->cpu, &ts, &delta);
+ ret = rb_read_data_buffer(dpage, tail, cpu_buffer->cpu, &ts, &delta);
if (ret < 0) {
if (delta < ts) {
buffer_warn_return("[CPU: %d]ABSOLUTE TIME WENT BACKWARDS: last ts: %lld absolute ts: %lld clock:%pS\n",
@@ -6436,7 +6444,7 @@ static void rb_clear_buffer_page(struct buffer_page *page)
{
local_set(&page->write, 0);
local_set(&page->entries, 0);
- rb_init_page(page->page);
+ rb_init_data_page(page->page);
page->read = 0;
}
@@ -6921,7 +6929,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
local_irq_restore(flags);
if (bpage->data) {
- rb_init_page(bpage->data);
+ rb_init_data_page(bpage->data);
} else {
bpage->data = alloc_cpu_data(cpu, cpu_buffer->buffer->subbuf_order);
if (!bpage->data) {
@@ -6946,8 +6954,8 @@ void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
struct buffer_data_read_page *data_page)
{
struct ring_buffer_per_cpu *cpu_buffer;
- struct buffer_data_page *bpage = data_page->data;
- struct page *page = virt_to_page(bpage);
+ struct buffer_data_page *dpage = data_page->data;
+ struct page *page = virt_to_page(dpage);
unsigned long flags;
if (!buffer || !buffer->buffers || !buffer->buffers[cpu])
@@ -6967,15 +6975,15 @@ void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
arch_spin_lock(&cpu_buffer->lock);
if (!cpu_buffer->free_page) {
- cpu_buffer->free_page = bpage;
- bpage = NULL;
+ cpu_buffer->free_page = dpage;
+ dpage = NULL;
}
arch_spin_unlock(&cpu_buffer->lock);
local_irq_restore(flags);
out:
- free_pages((unsigned long)bpage, data_page->order);
+ free_pages((unsigned long)dpage, data_page->order);
kfree(data_page);
}
EXPORT_SYMBOL_GPL(ring_buffer_free_read_page);
@@ -7020,7 +7028,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
{
struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
struct ring_buffer_event *event;
- struct buffer_data_page *bpage;
+ struct buffer_data_page *dpage;
struct buffer_page *reader;
unsigned long missed_events;
unsigned int commit;
@@ -7046,8 +7054,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
if (data_page->order != buffer->subbuf_order)
return -1;
- bpage = data_page->data;
- if (!bpage)
+ dpage = data_page->data;
+ if (!dpage)
return -1;
guard(raw_spinlock_irqsave)(&cpu_buffer->reader_lock);
@@ -7113,7 +7121,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
* We have already ensured there's enough space if this
* is a time extend. */
size = rb_event_length(event);
- memcpy(bpage->data + pos, rpage->data + rpos, size);
+ memcpy(dpage->data + pos, rpage->data + rpos, size);
len -= size;
@@ -7129,9 +7137,9 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
size = rb_event_ts_length(event);
} while (len >= size);
- /* update bpage */
- local_set(&bpage->commit, pos);
- bpage->time_stamp = save_timestamp;
+ /* update dpage */
+ local_set(&dpage->commit, pos);
+ dpage->time_stamp = save_timestamp;
/* we copied everything to the beginning */
read = 0;
@@ -7141,13 +7149,13 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
cpu_buffer->read_bytes += rb_page_size(reader);
/* swap the pages */
- rb_init_page(bpage);
- bpage = reader->page;
+ rb_init_data_page(dpage);
+ dpage = reader->page;
reader->page = data_page->data;
local_set(&reader->write, 0);
local_set(&reader->entries, 0);
reader->read = 0;
- data_page->data = bpage;
+ data_page->data = dpage;
/*
* Use the real_end for the data size,
@@ -7155,12 +7163,12 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
* on the page.
*/
if (reader->real_end)
- local_set(&bpage->commit, reader->real_end);
+ local_set(&dpage->commit, reader->real_end);
}
cpu_buffer->lost_events = 0;
- commit = local_read(&bpage->commit);
+ commit = rb_data_page_commit(dpage);
/*
* Set a flag in the commit field if we lost events
*/
@@ -7169,19 +7177,19 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
* missed events, then record it there.
*/
if (buffer->subbuf_size - commit >= sizeof(missed_events)) {
- memcpy(&bpage->data[commit], &missed_events,
+ memcpy(&dpage->data[commit], &missed_events,
sizeof(missed_events));
- local_add(RB_MISSED_STORED, &bpage->commit);
+ local_add(RB_MISSED_STORED, &dpage->commit);
commit += sizeof(missed_events);
}
- local_add(RB_MISSED_EVENTS, &bpage->commit);
+ local_add(RB_MISSED_EVENTS, &dpage->commit);
}
/*
* This page may be off to user land. Zero it out here.
*/
if (commit < buffer->subbuf_size)
- memset(&bpage->data[commit], 0, buffer->subbuf_size - commit);
+ memset(&dpage->data[commit], 0, buffer->subbuf_size - commit);
return read;
}
@@ -7812,7 +7820,7 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
if (missed_events) {
if (cpu_buffer->reader_page != cpu_buffer->commit_page) {
- struct buffer_data_page *bpage = reader->page;
+ struct buffer_data_page *dpage = reader->page;
unsigned int commit;
/*
* Use the real_end for the data size,
@@ -7820,18 +7828,18 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
* on the page.
*/
if (reader->real_end)
- local_set(&bpage->commit, reader->real_end);
+ local_set(&dpage->commit, reader->real_end);
/*
* If there is room at the end of the page to save the
* missed events, then record it there.
*/
commit = rb_page_size(reader);
if (buffer->subbuf_size - commit >= sizeof(missed_events)) {
- memcpy(&bpage->data[commit], &missed_events,
+ memcpy(&dpage->data[commit], &missed_events,
sizeof(missed_events));
- local_add(RB_MISSED_STORED, &bpage->commit);
+ local_add(RB_MISSED_STORED, &dpage->commit);
}
- local_add(RB_MISSED_EVENTS, &bpage->commit);
+ local_add(RB_MISSED_EVENTS, &dpage->commit);
} else if (!WARN_ONCE(cpu_buffer->reader_page == cpu_buffer->tail_page,
"Reader on commit with %ld missed events",
missed_events)) {
^ permalink raw reply related
* Re: [PATCH v18 0/8] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu @ 2026-04-24 7:06 UTC (permalink / raw)
To: Masami Hiramatsu (Google), Steven Rostedt
Cc: Steven Rostedt, Catalin Marinas, Will Deacon, Mathieu Desnoyers,
linux-kernel, linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>
Hi Steve,
I added a fix related this series as the 1st one. It can be merged
independently.
Thanks,
On Fri, 24 Apr 2026 15:51:59 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> Hi,
>
> Here is the 18th version of improvement patches for making persistent
> ring buffers robust to failures.
> The previous version is here:
>
> https://lore.kernel.org/all/177687458572.932171.10907864814735342737.stgit@mhiramat.tok.corp.google.com/
>
> This version fixes a newly found bug and some review comments from
> Sashiko[1], also, add 2 cleanups, which includes:
> [1/8] Do not double count the reader_page when verifying persistent
> ring buffer.
> [2/8] Add Geert's Ack (Thanks!)
> [3/8] Fix to substract BUF_PAGE_HDR_SIZE from meta->subbuf_size
> to make the limit of commit size.
> [4/8] Reset timestamp of reader_page when the entire cpu_buffer is
> invalid.
> [5/8] In rb_test_inject_invalid_pages(), changed entry_bytes and
> idx to unsigned long.
> [7/8] Cleanup persistent ring buffer validation code.
> [8/8] Cleanup buffer_data_page related code.
>
> [1] https://sashiko.dev/#/patchset/177687458572.932171.10907864814735342737.stgit%40mhiramat.tok.corp.google.com
>
> Thank you,
>
> Masami Hiramatsu (Google) (8):
> ring-buffer: Do not double count the reader_page
> ring-buffer: Flush and stop persistent ring buffer on panic
> ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
> ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
> ring-buffer: Add persistent ring buffer invalid-page inject test
> ring-buffer: Show commit numbers in buffer_meta file
> ring-buffer: Cleanup persistent ring buffer validation
> ring-buffer: Cleanup buffer_data_page related code
>
>
> arch/alpha/include/asm/Kbuild | 1
> arch/arc/include/asm/Kbuild | 1
> arch/arm/include/asm/Kbuild | 1
> arch/arm64/include/asm/ring_buffer.h | 10 +
> arch/csky/include/asm/Kbuild | 1
> arch/hexagon/include/asm/Kbuild | 1
> arch/loongarch/include/asm/Kbuild | 1
> arch/m68k/include/asm/Kbuild | 1
> arch/microblaze/include/asm/Kbuild | 1
> arch/mips/include/asm/Kbuild | 1
> arch/nios2/include/asm/Kbuild | 1
> arch/openrisc/include/asm/Kbuild | 1
> arch/parisc/include/asm/Kbuild | 1
> arch/powerpc/include/asm/Kbuild | 1
> arch/riscv/include/asm/Kbuild | 1
> arch/s390/include/asm/Kbuild | 1
> arch/sh/include/asm/Kbuild | 1
> arch/sparc/include/asm/Kbuild | 1
> arch/um/include/asm/Kbuild | 1
> arch/x86/include/asm/Kbuild | 1
> arch/xtensa/include/asm/Kbuild | 1
> include/asm-generic/ring_buffer.h | 13 +
> include/linux/ring_buffer.h | 1
> kernel/trace/Kconfig | 34 ++
> kernel/trace/ring_buffer.c | 472 +++++++++++++++++++++++-----------
> kernel/trace/trace.c | 4
> 26 files changed, 395 insertions(+), 159 deletions(-)
> create mode 100644 arch/arm64/include/asm/ring_buffer.h
> create mode 100644 include/asm-generic/ring_buffer.h
>
>
> base-commit: 6170922f137231b98fc568571befef63e1edff3f
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v2 2/2] module/kallsyms: sort function symbols and use binary search
From: Stanislaw Gruszka @ 2026-04-24 9:13 UTC (permalink / raw)
To: Petr Pavlu
Cc: linux-modules, Sami Tolvanen, Luis Chamberlain, linux-kernel,
linux-trace-kernel, live-patching, Daniel Gomez, Aaron Tomlin,
Steven Rostedt, Masami Hiramatsu, Jordan Rome, Viktor Malik
In-Reply-To: <11c8e139-f9f3-4b22-863a-4e021a3947e7@suse.com>
Hi Petr,
thanks for the review.
On Thu, Apr 23, 2026 at 04:00:04PM +0200, Petr Pavlu wrote:
> On 3/27/26 12:00 PM, Stanislaw Gruszka wrote:
> > Module symbol lookup via find_kallsyms_symbol() performs a linear scan
> > over the entire symtab when resolving an address. The number of symbols
> > in module symtabs has grown over the years, largely due to additional
> > metadata in non-standard sections, making this lookup very slow.
> >
> > Improve this by separating function symbols during module load, placing
> > them at the beginning of the symtab, sorting them by address, and using
> > binary search when resolving addresses in module text.
> >
> > This also should improve times for linear symbol name lookups, as valid
> > function symbols are now located at the beginning of the symtab.
> >
> > The cost of sorting is small relative to module load time. In repeated
> > module load tests [1], depending on .config options, this change
> > increases load time between 2% and 4%. With cold caches, the difference
> > is not measurable, as memory access latency dominates.
> >
> > The sorting theoretically could be done in compile time, but much more
> > complicated as we would have to simulate kernel addresses resolution
> > for symbols, and then correct relocation entries. That would be risky
> > if get out of sync.
> >
> > The improvement can be observed when listing ftrace filter functions.
> >
> > Before:
> >
> > root@nano:~# time cat /sys/kernel/tracing/available_filter_functions | wc -l
> > 74908
> >
> > real 0m1.315s
> > user 0m0.000s
> > sys 0m1.312s
> >
> > After:
> >
> > root@nano:~# time cat /sys/kernel/tracing/available_filter_functions | wc -l
> > 74911
> >
> > real 0m0.167s
> > user 0m0.004s
> > sys 0m0.175s
> >
> > (there are three more symbols introduced by the patch)
> >
> > For livepatch modules, the symtab layout is preserved and the existing
> > linear search is used. For this case, it should be possible to keep
> > the original ELF symtab instead of copying it 1:1, but that is outside
> > the scope of this patch.
> >
> > Link: https://gist.github.com/sgruszka/09f3fb1dad53a97b1aad96e1927ab117 [1]
> > Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
>
> Sorry for the delay reviewing this patch.
No problem.
> > ---
> > v1 -> v2:
> > - fix searching data symbols for CONFIG_KALLSYMS_ALL
> > - use kallsyms_symbol_value() in elf_sym_cmp()
> >
> > include/linux/module.h | 1 +
> > kernel/module/internal.h | 1 +
> > kernel/module/kallsyms.c | 171 +++++++++++++++++++++++++++++----------
> > 3 files changed, 130 insertions(+), 43 deletions(-)
> >
> > diff --git a/include/linux/module.h b/include/linux/module.h
> > index ac254525014c..67c053afa882 100644
> > --- a/include/linux/module.h
> > +++ b/include/linux/module.h
> > @@ -379,6 +379,7 @@ struct module_memory {
> > struct mod_kallsyms {
> > Elf_Sym *symtab;
> > unsigned int num_symtab;
> > + unsigned int num_func_syms;
> > char *strtab;
> > char *typetab;
> > };
> > diff --git a/kernel/module/internal.h b/kernel/module/internal.h
> > index 618202578b42..6a4d498619b1 100644
> > --- a/kernel/module/internal.h
> > +++ b/kernel/module/internal.h
> > @@ -73,6 +73,7 @@ struct load_info {
> > bool sig_ok;
> > #ifdef CONFIG_KALLSYMS
> > unsigned long mod_kallsyms_init_off;
> > + unsigned long num_func_syms;
> > #endif
> > #ifdef CONFIG_MODULE_DECOMPRESS
> > #ifdef CONFIG_MODULE_STATS
> > diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c
> > index f23126d804b2..d69e99e67707 100644
> > --- a/kernel/module/kallsyms.c
> > +++ b/kernel/module/kallsyms.c
> > @@ -10,6 +10,7 @@
> > #include <linux/kallsyms.h>
> > #include <linux/buildid.h>
> > #include <linux/bsearch.h>
> > +#include <linux/sort.h>
> > #include "internal.h"
> >
> > /* Lookup exported symbol in given range of kernel_symbols */
> > @@ -103,6 +104,95 @@ static bool is_core_symbol(const Elf_Sym *src, const Elf_Shdr *sechdrs,
> > return true;
> > }
> >
> > +static inline bool is_func_symbol(const Elf_Sym *sym)
> > +{
> > + return sym->st_shndx != SHN_UNDEF && sym->st_size != 0 &&
> > + ELF_ST_TYPE(sym->st_info) == STT_FUNC;
> > +}
> > +
> > +static unsigned int bsearch_func_symbol(struct mod_kallsyms *kallsyms,
> > + unsigned long addr,
> > + unsigned long *bestval,
> > + unsigned long *nextval)
> > +
> > +{
> > + unsigned int mid, low = 1, high = kallsyms->num_func_syms + 1;
> > + unsigned int best = 0;
> > + unsigned long thisval;
> > +
> > + while (low < high) {
> > + mid = low + (high - low) / 2;
> > + thisval = kallsyms_symbol_value(&kallsyms->symtab[mid]);
> > +
> > + if (thisval <= addr) {
> > + *bestval = thisval;
> > + best = mid;
> > + low = mid + 1;
>
> If thisval == addr, the search moves to the right and finds the last
> symbol with the same address. I believe it should do the opposite and
> return the first symbol to match the behavior of
> search_kallsyms_symbol().
In the case of multiple symbols sharing the same address, we have
to pick one and ignore the others. I don’t think it matters much which
one is chosen in practice. Also, I expect function symbol addresses
to be unique, so this shouldn’t be a real issue.
> > + } else {
> > + *nextval = thisval;
> > + high = mid;
> > + }
> > + }
> > +
> > + return best;
> > +}
> > +
> > +static const char *kallsyms_symbol_name(struct mod_kallsyms *kallsyms,
> > + unsigned int symnum)
> > +{
> > + return kallsyms->strtab + kallsyms->symtab[symnum].st_name;
> > +}
> > +
> > +static unsigned int search_kallsyms_symbol(struct mod_kallsyms *kallsyms,
> > + unsigned long addr,
> > + unsigned long *bestval,
> > + unsigned long *nextval)
> > +{
> > + unsigned int i, best = 0;
> > +
> > + /*
> > + * Scan for closest preceding symbol and next symbol. (ELF starts
> > + * real symbols at 1). Skip the initial function symbols range
> > + * if num_func_syms is non-zero, those are handled separately for
> > + * the core TEXT segment lookup.
> > + */
> > + for (i = 1 + kallsyms->num_func_syms; i < kallsyms->num_symtab; i++) {
> > + const Elf_Sym *sym = &kallsyms->symtab[i];
> > + unsigned long thisval = kallsyms_symbol_value(sym);
> > +
> > + if (sym->st_shndx == SHN_UNDEF)
> > + continue;
> > +
> > + /*
> > + * We ignore unnamed symbols: they're uninformative
> > + * and inserted at a whim.
> > + */
> > + if (*kallsyms_symbol_name(kallsyms, i) == '\0' ||
> > + is_mapping_symbol(kallsyms_symbol_name(kallsyms, i)))
> > + continue;
> > +
> > + if (thisval <= addr && thisval > *bestval) {
> > + best = i;
> > + *bestval = thisval;
> > + }
> > + if (thisval > addr && thisval < *nextval)
> > + *nextval = thisval;
> > + }
> > +
> > + return best;
> > +}
> > +
> > +static int elf_sym_cmp(const void *a, const void *b)
> > +{
> > + unsigned long val_a = kallsyms_symbol_value((const Elf_Sym *)a);
> > + unsigned long val_b = kallsyms_symbol_value((const Elf_Sym *)b);
> > +
> > + if (val_a < val_b)
> > + return -1;
> > +
> > + return val_a > val_b;
>
> Does this comparison function and the sort() call result in stable
> sorting? If val_a and val_b are the same, the sorting should preserve
> the original order.
The kernel’s sort() implementation is not stable.
> > +}
> > +
> > /*
> > * We only allocate and copy the strings needed by the parts of symtab
> > * we keep. This is simple, but has the effect of making multiple
> > @@ -115,9 +205,10 @@ void layout_symtab(struct module *mod, struct load_info *info)
> > Elf_Shdr *symsect = info->sechdrs + info->index.sym;
> > Elf_Shdr *strsect = info->sechdrs + info->index.str;
> > const Elf_Sym *src;
> > - unsigned int i, nsrc, ndst, strtab_size = 0;
> > + unsigned int i, nsrc, ndst, nfunc, strtab_size = 0;
> > struct module_memory *mod_mem_data = &mod->mem[MOD_DATA];
> > struct module_memory *mod_mem_init_data = &mod->mem[MOD_INIT_DATA];
> > + bool is_lp_mod = is_livepatch_module(mod);
> >
> > /* Put symbol section at end of init part of module. */
> > symsect->sh_flags |= SHF_ALLOC;
> > @@ -129,12 +220,14 @@ void layout_symtab(struct module *mod, struct load_info *info)
> > nsrc = symsect->sh_size / sizeof(*src);
> >
> > /* Compute total space required for the core symbols' strtab. */
> > - for (ndst = i = 0; i < nsrc; i++) {
> > - if (i == 0 || is_livepatch_module(mod) ||
> > + for (ndst = nfunc = i = 0; i < nsrc; i++) {
> > + if (i == 0 || is_lp_mod ||
> > is_core_symbol(src + i, info->sechdrs, info->hdr->e_shnum,
> > info->index.pcpu)) {
> > strtab_size += strlen(&info->strtab[src[i].st_name]) + 1;
> > ndst++;
> > + if (!is_lp_mod && is_func_symbol(src + i))
> > + nfunc++;
> > }
> > }
> >
> > @@ -156,6 +249,7 @@ void layout_symtab(struct module *mod, struct load_info *info)
> > mod_mem_init_data->size = ALIGN(mod_mem_init_data->size,
> > __alignof__(struct mod_kallsyms));
> > info->mod_kallsyms_init_off = mod_mem_init_data->size;
> > + info->num_func_syms = nfunc;
> >
> > mod_mem_init_data->size += sizeof(struct mod_kallsyms);
> > info->init_typeoffs = mod_mem_init_data->size;
> > @@ -169,7 +263,7 @@ void layout_symtab(struct module *mod, struct load_info *info)
> > */
> > void add_kallsyms(struct module *mod, const struct load_info *info)
> > {
> > - unsigned int i, ndst;
> > + unsigned int i, di, nfunc, ndst;
> > const Elf_Sym *src;
> > Elf_Sym *dst;
> > char *s;
> > @@ -178,6 +272,7 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
> > void *data_base = mod->mem[MOD_DATA].base;
> > void *init_data_base = mod->mem[MOD_INIT_DATA].base;
> > struct mod_kallsyms *kallsyms;
> > + bool is_lp_mod = is_livepatch_module(mod);
> >
> > kallsyms = init_data_base + info->mod_kallsyms_init_off;
>
> This code is followed by the initialization of kallsyms:
>
> kallsyms->symtab = (void *)symsec->sh_addr;
> kallsyms->num_symtab = symsec->sh_size / sizeof(Elf_Sym);
> /* Make sure we get permanent strtab: don't use info->strtab. */
> kallsyms->strtab = (void *)info->sechdrs[info->index.str].sh_addr;
> kallsyms->typetab = init_data_base + info->init_typeoffs;
>
> I suggest adding 'kallsyms->num_func_syms = 0;' after the initialization
> of kallsyms->num_symtab.
I relied on zeroed memory initialization, but I can add this explicitly
for clarity.
> > @@ -194,19 +289,28 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
> > mod->core_kallsyms.symtab = dst = data_base + info->symoffs;
> > mod->core_kallsyms.strtab = s = data_base + info->stroffs;
> > mod->core_kallsyms.typetab = data_base + info->core_typeoffs;
> > +
> > strtab_size = info->core_typeoffs - info->stroffs;
> > src = kallsyms->symtab;
> > - for (ndst = i = 0; i < kallsyms->num_symtab; i++) {
> > + ndst = info->num_func_syms + 1;
> > +
> > + for (nfunc = i = 0; i < kallsyms->num_symtab; i++) {
> > kallsyms->typetab[i] = elf_type(src + i, info);
> > - if (i == 0 || is_livepatch_module(mod) ||
> > + if (i == 0 || is_lp_mod ||
> > is_core_symbol(src + i, info->sechdrs, info->hdr->e_shnum,
> > info->index.pcpu)) {
> > ssize_t ret;
> >
> > - mod->core_kallsyms.typetab[ndst] =
> > - kallsyms->typetab[i];
> > - dst[ndst] = src[i];
> > - dst[ndst++].st_name = s - mod->core_kallsyms.strtab;
> > + if (i == 0)
> > + di = 0;
> > + else if (!is_lp_mod && is_func_symbol(src + i))
> > + di = 1 + nfunc++;
> > + else
> > + di = ndst++;
> > +
> > + mod->core_kallsyms.typetab[di] = kallsyms->typetab[i];
> > + dst[di] = src[i];
> > + dst[di].st_name = s - mod->core_kallsyms.strtab;
> > ret = strscpy(s, &kallsyms->strtab[src[i].st_name],
> > strtab_size);
> > if (ret < 0)
> > @@ -216,9 +320,13 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
> > }
> > }
> >
> > + WARN_ON_ONCE(nfunc != info->num_func_syms);
> > + sort(dst + 1, nfunc, sizeof(Elf_Sym), elf_sym_cmp, NULL);
> > +
>
> The code sorts mod->core_kallsyms.symtab but mod->core_kallsyms.typetab
> is not reordered accordingly.
Right, but for function symbols the typetab entries are all 't',
so swapping them does not change the type value. The 'T' vs 't'
distinction is handled later when printing (based on export status).
But the comment explaining skiping adjusting of
mod->core_kallsyms.typetab is needed.
> > /* Set up to point into init section. */
> > rcu_assign_pointer(mod->kallsyms, kallsyms);
> > mod->core_kallsyms.num_symtab = ndst;
> > + mod->core_kallsyms.num_func_syms = nfunc;
> > }
> >
> > #if IS_ENABLED(CONFIG_STACKTRACE_BUILD_ID)
> > @@ -241,11 +349,6 @@ void init_build_id(struct module *mod, const struct load_info *info)
> > }
> > #endif
> >
> > -static const char *kallsyms_symbol_name(struct mod_kallsyms *kallsyms, unsigned int symnum)
> > -{
> > - return kallsyms->strtab + kallsyms->symtab[symnum].st_name;
> > -}
> > -
> > /*
> > * Given a module and address, find the corresponding symbol and return its name
> > * while providing its size and offset if needed.
> > @@ -255,7 +358,10 @@ static const char *find_kallsyms_symbol(struct module *mod,
> > unsigned long *size,
> > unsigned long *offset)
> > {
> > - unsigned int i, best = 0;
> > + unsigned int (*search)(struct mod_kallsyms *kallsyms,
> > + unsigned long addr, unsigned long *bestval,
> > + unsigned long *nextval);
> > + unsigned int best;
> > unsigned long nextval, bestval;
> > struct mod_kallsyms *kallsyms = rcu_dereference(mod->kallsyms);
> > struct module_memory *mod_mem = NULL;
> > @@ -266,6 +372,11 @@ static const char *find_kallsyms_symbol(struct module *mod,
> > continue;
> > #endif
> > if (within_module_mem_type(addr, mod, type)) {
> > + if (type == MOD_TEXT && kallsyms->num_func_syms > 0)
> > + search = bsearch_func_symbol;
>
> I'm not sure if it is ok to limit the search only to function symbols
> when the address lies in MOD_TEXT. The text can theoretically contain
> non-function symbols.
Yes, the patch assumes that the only valid symbols in the MOD_TEXT
are functions. If there are defined OBJECT symbols in .text, the patch
would break lookup for those.
While it’s theoretically possible (e.g. hand-written assembly placing
data in .text ?), I’m not sure this is a practical concern. In general,
having data in executable segments is discouraged for security reasons.
> Could this optimization be adjusted to sort all
> MOD_TEXT symbols (excluding anonymous and mapping symbols) and move them
> to the front of the symbol table?
That’s possible. We could track .text sections indices in
__layout_sections() and include all valid symbols from those sections,
and also reorder typetab accordingly.
However, this adds complexity. I would prefer to first confirm whether
OBJECT symbols in MOD_TEXT is a real issue before going in that direction.
Regards
Stanislaw
> > + else
> > + search = search_kallsyms_symbol;
> > +
> > mod_mem = &mod->mem[type];
> > break;
> > }
> > @@ -278,33 +389,7 @@ static const char *find_kallsyms_symbol(struct module *mod,
> > nextval = (unsigned long)mod_mem->base + mod_mem->size;
> > bestval = (unsigned long)mod_mem->base - 1;
> >
> > - /*
> > - * Scan for closest preceding symbol, and next symbol. (ELF
> > - * starts real symbols at 1).
> > - */
> > - for (i = 1; i < kallsyms->num_symtab; i++) {
> > - const Elf_Sym *sym = &kallsyms->symtab[i];
> > - unsigned long thisval = kallsyms_symbol_value(sym);
> > -
> > - if (sym->st_shndx == SHN_UNDEF)
> > - continue;
> > -
> > - /*
> > - * We ignore unnamed symbols: they're uninformative
> > - * and inserted at a whim.
> > - */
> > - if (*kallsyms_symbol_name(kallsyms, i) == '\0' ||
> > - is_mapping_symbol(kallsyms_symbol_name(kallsyms, i)))
> > - continue;
> > -
> > - if (thisval <= addr && thisval > bestval) {
> > - best = i;
> > - bestval = thisval;
> > - }
> > - if (thisval > addr && thisval < nextval)
> > - nextval = thisval;
> > - }
> > -
> > + best = search(kallsyms, addr, &bestval, &nextval);
> > if (!best)
> > return NULL;
> >
>
> --
> Thanks,
> Petr
^ permalink raw reply
* Re: [PATCH 7.2 v16 00/13] khugepaged: mTHP support
From: Andrew Morton @ 2026-04-24 13:58 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>
On Sun, 19 Apr 2026 12:57:37 -0600 Nico Pache <npache@redhat.com> wrote:
> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.
Lots of stuff here:
https://sashiko.dev/#/patchset/20260419185750.260784-1-npache@redhat.com
It's going to take some time. Hopefully worthwhile.
As always, it's useful to hear about the usefulness of the AI review.
^ permalink raw reply
* [PATCH] rtla/tests: Add unit tests for actions module
From: Tomas Glozar @ 2026-04-24 14:02 UTC (permalink / raw)
To: Steven Rostedt, Tomas Glozar
Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
Wander Lairson Costa, LKML, linux-trace-kernel
Add unit tests covering all functions in the actions module, including
both valid and invalid inputs and all action types, except for
actions_perform(), where only shell and continue actions are tested.
To support testing multiple modules, the unit test build was modified so
that it links the entire rtla-in.o file. For this to work, the main()
function in rtla.c was declared weak, so that the unit test main is able
to override it.
Other included minor changes to unit tests are:
- Make unit test output verbose to show which tests are being run, now
that we have more than 3 tests.
- Add unit_tests file to .gitignore.
- Split unit test sources to one file per test suite, and keep only
main() function in unit_tests.c.
- Fix Makefile dependencies so that "make unit-tests" will rebuild the
binary with the changes in the commit.
Also with the linking the entire rtla-in.o file, it now has rtla's
nr_cpus symbol, so the declaration in utils unit tests is made extern.
Assisted-by: Composer:composer-2-fast
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
tools/tracing/rtla/.gitignore | 1 +
tools/tracing/rtla/src/rtla.c | 3 +
tools/tracing/rtla/tests/unit/Build | 3 +-
tools/tracing/rtla/tests/unit/Makefile.unit | 6 +-
tools/tracing/rtla/tests/unit/actions.c | 380 ++++++++++++++++++++
tools/tracing/rtla/tests/unit/unit_tests.c | 107 +-----
tools/tracing/rtla/tests/unit/utils.c | 106 ++++++
7 files changed, 502 insertions(+), 104 deletions(-)
create mode 100644 tools/tracing/rtla/tests/unit/actions.c
create mode 100644 tools/tracing/rtla/tests/unit/utils.c
diff --git a/tools/tracing/rtla/.gitignore b/tools/tracing/rtla/.gitignore
index 4d39d64ac08c..231fb8d67f97 100644
--- a/tools/tracing/rtla/.gitignore
+++ b/tools/tracing/rtla/.gitignore
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
rtla
rtla-static
+unit_tests
fixdep
feature
FEATURE-DUMP
diff --git a/tools/tracing/rtla/src/rtla.c b/tools/tracing/rtla/src/rtla.c
index 7635c70123ab..3398250076ea 100644
--- a/tools/tracing/rtla/src/rtla.c
+++ b/tools/tracing/rtla/src/rtla.c
@@ -61,6 +61,9 @@ int run_command(int argc, char **argv, int start_position)
return 1;
}
+/* Set main as weak to allow overriding it for building unit test binary */
+#pragma weak main
+
int main(int argc, char *argv[])
{
int retval;
diff --git a/tools/tracing/rtla/tests/unit/Build b/tools/tracing/rtla/tests/unit/Build
index 5f1e531ea8c9..2749f4cf202a 100644
--- a/tools/tracing/rtla/tests/unit/Build
+++ b/tools/tracing/rtla/tests/unit/Build
@@ -1,2 +1,3 @@
+unit_tests-y += utils.o
+unit_tests-y += actions.o
unit_tests-y += unit_tests.o
-unit_tests-y +=../../src/utils.o
diff --git a/tools/tracing/rtla/tests/unit/Makefile.unit b/tools/tracing/rtla/tests/unit/Makefile.unit
index 2088c9cc3571..bacb00164e46 100644
--- a/tools/tracing/rtla/tests/unit/Makefile.unit
+++ b/tools/tracing/rtla/tests/unit/Makefile.unit
@@ -3,10 +3,10 @@
UNIT_TESTS := $(OUTPUT)unit_tests
UNIT_TESTS_IN := $(UNIT_TESTS)-in.o
-$(UNIT_TESTS): $(UNIT_TESTS_IN)
- $(QUIET_LINK)$(CC) $(LDFLAGS) -o $@ $^ -lcheck
+$(UNIT_TESTS): $(UNIT_TESTS_IN) $(RTLA_IN)
+ $(QUIET_LINK)$(CC) $(LDFLAGS) -o $@ $^ $(EXTLIBS) -lcheck
-$(UNIT_TESTS_IN):
+$(UNIT_TESTS_IN): fixdep
make $(build)=unit_tests
unit-tests: FORCE
diff --git a/tools/tracing/rtla/tests/unit/actions.c b/tools/tracing/rtla/tests/unit/actions.c
new file mode 100644
index 000000000000..a5808ab71a4d
--- /dev/null
+++ b/tools/tracing/rtla/tests/unit/actions.c
@@ -0,0 +1,380 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <check.h>
+#include <signal.h>
+
+#include "../../src/actions.h"
+
+static struct actions actions_fixture;
+
+static void actions_fixture_setup(void)
+{
+ actions_init(&actions_fixture);
+}
+
+static void actions_fixture_teardown(void)
+{
+ actions_destroy(&actions_fixture);
+}
+
+START_TEST(test_actions_init)
+{
+ struct actions actions;
+
+ actions_init(&actions);
+
+ ck_assert_int_eq(actions.len, 0);
+ ck_assert_int_eq(actions.size, action_default_size);
+ ck_assert(!actions.continue_flag);
+ ck_assert_ptr_eq(actions.trace_output_inst, NULL);
+}
+END_TEST
+
+START_TEST(test_actions_destroy)
+{
+ struct actions actions;
+
+ actions_init(&actions);
+ actions_destroy(&actions);
+}
+END_TEST
+
+START_TEST(test_actions_reallocate)
+{
+ struct actions actions;
+ int i;
+
+ actions_init(&actions);
+
+ ck_assert_int_eq(actions.len, 0);
+ ck_assert_int_eq(actions.size, action_default_size);
+
+ /* Fill size of actions array */
+ for (i = 0; i < action_default_size; i++)
+ actions_add_continue(&actions);
+
+ ck_assert_int_eq(actions.len, action_default_size);
+ ck_assert_int_eq(actions.size, action_default_size);
+
+ /* Add one more action to trigger reallocation */
+ actions_add_continue(&actions);
+
+ ck_assert_int_eq(actions.len, action_default_size + 1);
+ ck_assert_int_eq(actions.size, action_default_size * 2);
+
+ actions_destroy(&actions);
+}
+END_TEST
+
+START_TEST(test_actions_add_trace_output)
+{
+ actions_add_trace_output(&actions_fixture, "trace_output.txt");
+
+ ck_assert_int_eq(actions_fixture.len, 1);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_TRACE_OUTPUT);
+ ck_assert_str_eq(actions_fixture.list[0].trace_output, "trace_output.txt");
+ ck_assert(actions_fixture.present[ACTION_TRACE_OUTPUT]);
+}
+END_TEST
+
+START_TEST(test_actions_add_signal)
+{
+ actions_add_signal(&actions_fixture, SIGINT, 1234);
+
+ ck_assert_int_eq(actions_fixture.len, 1);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_SIGNAL);
+ ck_assert_int_eq(actions_fixture.list[0].signal, SIGINT);
+ ck_assert_int_eq(actions_fixture.list[0].pid, 1234);
+ ck_assert(actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_add_shell)
+{
+ actions_add_shell(&actions_fixture, "echo Hello");
+
+ ck_assert_int_eq(actions_fixture.len, 1);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_SHELL);
+ ck_assert_str_eq(actions_fixture.list[0].command, "echo Hello");
+ ck_assert(actions_fixture.present[ACTION_SHELL]);
+}
+END_TEST
+
+START_TEST(test_actions_add_continue)
+{
+ actions_add_continue(&actions_fixture);
+
+ ck_assert_int_eq(actions_fixture.len, 1);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_CONTINUE);
+ ck_assert(actions_fixture.present[ACTION_CONTINUE]);
+}
+END_TEST
+
+START_TEST(test_actions_add_multiple_same_action)
+{
+ actions_add_trace_output(&actions_fixture, "trace1.txt");
+ actions_add_trace_output(&actions_fixture, "trace2.txt");
+
+ ck_assert_int_eq(actions_fixture.len, 2);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_TRACE_OUTPUT);
+ ck_assert_str_eq(actions_fixture.list[0].trace_output, "trace1.txt");
+ ck_assert_int_eq(actions_fixture.list[1].type, ACTION_TRACE_OUTPUT);
+ ck_assert_str_eq(actions_fixture.list[1].trace_output, "trace2.txt");
+ ck_assert(actions_fixture.present[ACTION_TRACE_OUTPUT]);
+}
+END_TEST
+
+START_TEST(test_actions_add_multiple_different_action)
+{
+ actions_add_trace_output(&actions_fixture, "trace_output.txt");
+ actions_add_signal(&actions_fixture, SIGINT, 1234);
+
+ ck_assert_int_eq(actions_fixture.len, 2);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_TRACE_OUTPUT);
+ ck_assert_str_eq(actions_fixture.list[0].trace_output, "trace_output.txt");
+ ck_assert(actions_fixture.present[ACTION_TRACE_OUTPUT]);
+ ck_assert_int_eq(actions_fixture.list[1].type, ACTION_SIGNAL);
+ ck_assert_int_eq(actions_fixture.list[1].signal, SIGINT);
+ ck_assert_int_eq(actions_fixture.list[1].pid, 1234);
+ ck_assert(actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_trace_output)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "trace", "trace.txt"), 0);
+
+ ck_assert_int_eq(actions_fixture.len, 1);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_TRACE_OUTPUT);
+ ck_assert_str_eq(actions_fixture.list[0].trace_output, "trace.txt");
+ ck_assert(actions_fixture.present[ACTION_TRACE_OUTPUT]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_trace_output_arg)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "trace,file=trace2.txt", "trace1.txt"), 0);
+
+ ck_assert_int_eq(actions_fixture.len, 1);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_TRACE_OUTPUT);
+ ck_assert_str_eq(actions_fixture.list[0].trace_output, "trace2.txt");
+ ck_assert(actions_fixture.present[ACTION_TRACE_OUTPUT]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_trace_output_arg_bad)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "trace,foo=bar", "trace_output.txt"), -1);
+
+ ck_assert_int_eq(actions_fixture.len, 0);
+ ck_assert(!actions_fixture.present[ACTION_TRACE_OUTPUT]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "signal,num=1,pid=1234", NULL), 0);
+
+ ck_assert_int_eq(actions_fixture.len, 1);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_SIGNAL);
+ ck_assert_int_eq(actions_fixture.list[0].signal, 1);
+ ck_assert_int_eq(actions_fixture.list[0].pid, 1234);
+ ck_assert(actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal_swapped)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "signal,pid=1234,num=1", NULL), 0);
+
+ ck_assert_int_eq(actions_fixture.len, 1);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_SIGNAL);
+ ck_assert_int_eq(actions_fixture.list[0].signal, 1);
+ ck_assert_int_eq(actions_fixture.list[0].pid, 1234);
+ ck_assert(actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal_parent)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "signal,pid=parent,num=1", NULL), 0);
+
+ ck_assert_int_eq(actions_fixture.len, 1);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_SIGNAL);
+ ck_assert_int_eq(actions_fixture.list[0].signal, 1);
+ ck_assert_int_eq(actions_fixture.list[0].pid, -1);
+ ck_assert(actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal_no_arg)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "signal", NULL), -1);
+
+ ck_assert_int_eq(actions_fixture.len, 0);
+ ck_assert(!actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal_no_pid)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "signal,num=1", NULL), -1);
+
+ ck_assert_int_eq(actions_fixture.len, 0);
+ ck_assert(!actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal_no_num)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "signal,pid=1234", NULL), -1);
+
+ ck_assert_int_eq(actions_fixture.len, 0);
+ ck_assert(!actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal_arg_bad)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "signal,foo=bar", NULL), -1);
+
+ ck_assert_int_eq(actions_fixture.len, 0);
+ ck_assert(!actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_shell)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "shell,command=echo Hello", NULL), 0);
+
+ ck_assert_int_eq(actions_fixture.len, 1);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_SHELL);
+ ck_assert_str_eq(actions_fixture.list[0].command, "echo Hello");
+ ck_assert(actions_fixture.present[ACTION_SHELL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_shell_no_arg)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "shell", NULL), -1);
+
+ ck_assert_int_eq(actions_fixture.len, 0);
+ ck_assert(!actions_fixture.present[ACTION_SHELL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_shell_arg_bad)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "shell,foo=bar", NULL), -1);
+ ck_assert_int_eq(actions_fixture.len, 0);
+ ck_assert(!actions_fixture.present[ACTION_SHELL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_continue)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "continue", NULL), 0);
+
+ ck_assert_int_eq(actions_fixture.len, 1);
+ ck_assert_int_eq(actions_fixture.list[0].type, ACTION_CONTINUE);
+ ck_assert(actions_fixture.present[ACTION_CONTINUE]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_continue_arg_bad)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "continue,foo=bar", NULL), -1);
+
+ ck_assert_int_eq(actions_fixture.len, 0);
+ ck_assert(!actions_fixture.present[ACTION_CONTINUE]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_invalid)
+{
+ ck_assert_int_eq(actions_parse(&actions_fixture, "foobar", NULL), -1);
+
+ ck_assert_int_eq(actions_fixture.len, 0);
+}
+END_TEST
+
+START_TEST(test_actions_perform_continue)
+{
+ actions_add_continue(&actions_fixture);
+ ck_assert_int_eq(actions_perform(&actions_fixture), 0);
+
+ ck_assert(actions_fixture.continue_flag);
+}
+END_TEST
+
+START_TEST(test_actions_perform_continue_after_successful_shell_command)
+{
+ actions_add_shell(&actions_fixture, "exit 0");
+ actions_add_continue(&actions_fixture);
+ ck_assert_int_eq(actions_perform(&actions_fixture), 0 << 8);
+
+ ck_assert(actions_fixture.continue_flag);
+}
+END_TEST
+
+START_TEST(test_actions_perform_continue_after_failed_shell_command)
+{
+ actions_add_shell(&actions_fixture, "exit 1");
+ actions_add_continue(&actions_fixture);
+ ck_assert_int_eq(actions_perform(&actions_fixture), 1 << 8);
+
+ ck_assert(!actions_fixture.continue_flag);
+}
+END_TEST
+
+Suite *actions_suite(void)
+{
+ Suite *s = suite_create("actions");
+ TCase *tc;
+
+ tc = tcase_create("alloc");
+ tcase_add_test(tc, test_actions_init);
+ tcase_add_test(tc, test_actions_destroy);
+ tcase_add_test(tc, test_actions_reallocate);
+ suite_add_tcase(s, tc);
+
+ tc = tcase_create("add");
+ tcase_add_checked_fixture(tc, actions_fixture_setup, actions_fixture_teardown);
+ tcase_add_test(tc, test_actions_add_trace_output);
+ tcase_add_test(tc, test_actions_add_signal);
+ tcase_add_test(tc, test_actions_add_shell);
+ tcase_add_test(tc, test_actions_add_continue);
+ tcase_add_test(tc, test_actions_add_multiple_same_action);
+ tcase_add_test(tc, test_actions_add_multiple_different_action);
+ suite_add_tcase(s, tc);
+
+ tc = tcase_create("parse");
+ tcase_add_checked_fixture(tc, actions_fixture_setup, actions_fixture_teardown);
+ tcase_add_test(tc, test_actions_parse_trace_output);
+ tcase_add_test(tc, test_actions_parse_trace_output_arg);
+ tcase_add_test(tc, test_actions_parse_trace_output_arg_bad);
+ tcase_add_test(tc, test_actions_parse_signal);
+ tcase_add_test(tc, test_actions_parse_signal_swapped);
+ tcase_add_test(tc, test_actions_parse_signal_parent);
+ tcase_add_test(tc, test_actions_parse_signal_no_arg);
+ tcase_add_test(tc, test_actions_parse_signal_no_pid);
+ tcase_add_test(tc, test_actions_parse_signal_no_num);
+ tcase_add_test(tc, test_actions_parse_signal_arg_bad);
+ tcase_add_test(tc, test_actions_parse_shell);
+ tcase_add_test(tc, test_actions_parse_shell_no_arg);
+ tcase_add_test(tc, test_actions_parse_shell_arg_bad);
+ tcase_add_test(tc, test_actions_parse_continue);
+ tcase_add_test(tc, test_actions_parse_continue_arg_bad);
+ tcase_add_test(tc, test_actions_parse_invalid);
+ suite_add_tcase(s, tc);
+
+ tc = tcase_create("perform");
+ tcase_add_checked_fixture(tc, actions_fixture_setup, actions_fixture_teardown);
+ tcase_add_test(tc, test_actions_perform_continue);
+ tcase_add_test(tc, test_actions_perform_continue_after_successful_shell_command);
+ tcase_add_test(tc, test_actions_perform_continue_after_failed_shell_command);
+ suite_add_tcase(s, tc);
+
+ return s;
+}
diff --git a/tools/tracing/rtla/tests/unit/unit_tests.c b/tools/tracing/rtla/tests/unit/unit_tests.c
index f3c6d89e3300..f87d761f9b12 100644
--- a/tools/tracing/rtla/tests/unit/unit_tests.c
+++ b/tools/tracing/rtla/tests/unit/unit_tests.c
@@ -2,115 +2,22 @@
#define _GNU_SOURCE
#include <check.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <sched.h>
-#include <limits.h>
-#include <unistd.h>
-#include <sys/sysinfo.h>
+#include <stdbool.h>
#include "../../src/utils.h"
-int nr_cpus;
-START_TEST(test_strtoi)
-{
- int result;
- char buf[64];
-
- ck_assert_int_eq(strtoi("123", &result), 0);
- ck_assert_int_eq(result, 123);
- ck_assert_int_eq(strtoi(" -456", &result), 0);
- ck_assert_int_eq(result, -456);
-
- snprintf(buf, sizeof(buf), "%d", INT_MAX);
- ck_assert_int_eq(strtoi(buf, &result), 0);
- snprintf(buf, sizeof(buf), "%ld", (long)INT_MAX + 1);
- ck_assert_int_eq(strtoi(buf, &result), -1);
-
- ck_assert_int_eq(strtoi("", &result), -1);
- ck_assert_int_eq(strtoi("123abc", &result), -1);
- ck_assert_int_eq(strtoi("123 ", &result), -1);
-}
-END_TEST
-
-START_TEST(test_parse_cpu_set)
-{
- cpu_set_t set;
+Suite *utils_suite(void);
+Suite *actions_suite(void);
- nr_cpus = 8;
- ck_assert_int_eq(parse_cpu_set("0", &set), 0);
- ck_assert(CPU_ISSET(0, &set));
- ck_assert(!CPU_ISSET(1, &set));
-
- ck_assert_int_eq(parse_cpu_set("0,2", &set), 0);
- ck_assert(CPU_ISSET(0, &set));
- ck_assert(CPU_ISSET(2, &set));
-
- ck_assert_int_eq(parse_cpu_set("0-3", &set), 0);
- ck_assert(CPU_ISSET(0, &set));
- ck_assert(CPU_ISSET(1, &set));
- ck_assert(CPU_ISSET(2, &set));
- ck_assert(CPU_ISSET(3, &set));
-
- ck_assert_int_eq(parse_cpu_set("1-3,5", &set), 0);
- ck_assert(!CPU_ISSET(0, &set));
- ck_assert(CPU_ISSET(1, &set));
- ck_assert(CPU_ISSET(2, &set));
- ck_assert(CPU_ISSET(3, &set));
- ck_assert(!CPU_ISSET(4, &set));
- ck_assert(CPU_ISSET(5, &set));
-
- ck_assert_int_eq(parse_cpu_set("-1", &set), 1);
- ck_assert_int_eq(parse_cpu_set("abc", &set), 1);
- ck_assert_int_eq(parse_cpu_set("9999", &set), 1);
-}
-END_TEST
-
-START_TEST(test_parse_prio)
-{
- struct sched_attr attr;
-
- ck_assert_int_eq(parse_prio("f:50", &attr), 0);
- ck_assert_uint_eq(attr.sched_policy, SCHED_FIFO);
- ck_assert_uint_eq(attr.sched_priority, 50U);
-
- ck_assert_int_eq(parse_prio("r:30", &attr), 0);
- ck_assert_uint_eq(attr.sched_policy, SCHED_RR);
-
- ck_assert_int_eq(parse_prio("o:0", &attr), 0);
- ck_assert_uint_eq(attr.sched_policy, SCHED_OTHER);
- ck_assert_int_eq(attr.sched_nice, 0);
-
- ck_assert_int_eq(parse_prio("d:10ms:100ms", &attr), 0);
- ck_assert_uint_eq(attr.sched_policy, 6U);
-
- ck_assert_int_eq(parse_prio("f:999", &attr), -1);
- ck_assert_int_eq(parse_prio("o:-20", &attr), -1);
- ck_assert_int_eq(parse_prio("d:100ms:10ms", &attr), -1);
- ck_assert_int_eq(parse_prio("x:50", &attr), -1);
-}
-END_TEST
-
-Suite *utils_suite(void)
-{
- Suite *s = suite_create("utils");
- TCase *tc = tcase_create("core");
-
- tcase_add_test(tc, test_strtoi);
- tcase_add_test(tc, test_parse_cpu_set);
- tcase_add_test(tc, test_parse_prio);
-
- suite_add_tcase(s, tc);
- return s;
-}
-
-int main(void)
+int main(int argc, char *argv[])
{
int num_failed;
SRunner *sr;
sr = srunner_create(utils_suite());
- srunner_run_all(sr, CK_NORMAL);
+ srunner_add_suite(sr, actions_suite());
+
+ srunner_run_all(sr, CK_VERBOSE);
num_failed = srunner_ntests_failed(sr);
srunner_free(sr);
diff --git a/tools/tracing/rtla/tests/unit/utils.c b/tools/tracing/rtla/tests/unit/utils.c
new file mode 100644
index 000000000000..ce53cab49457
--- /dev/null
+++ b/tools/tracing/rtla/tests/unit/utils.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <check.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sched.h>
+#include <limits.h>
+#include <unistd.h>
+#include <sys/sysinfo.h>
+
+#include "../../src/utils.h"
+
+extern int nr_cpus;
+
+START_TEST(test_strtoi)
+{
+ int result;
+ char buf[64];
+
+ ck_assert_int_eq(strtoi("123", &result), 0);
+ ck_assert_int_eq(result, 123);
+ ck_assert_int_eq(strtoi(" -456", &result), 0);
+ ck_assert_int_eq(result, -456);
+
+ snprintf(buf, sizeof(buf), "%d", INT_MAX);
+ ck_assert_int_eq(strtoi(buf, &result), 0);
+ snprintf(buf, sizeof(buf), "%ld", (long)INT_MAX + 1);
+ ck_assert_int_eq(strtoi(buf, &result), -1);
+
+ ck_assert_int_eq(strtoi("", &result), -1);
+ ck_assert_int_eq(strtoi("123abc", &result), -1);
+ ck_assert_int_eq(strtoi("123 ", &result), -1);
+}
+END_TEST
+
+START_TEST(test_parse_cpu_set)
+{
+ cpu_set_t set;
+
+ nr_cpus = 8;
+ ck_assert_int_eq(parse_cpu_set("0", &set), 0);
+ ck_assert(CPU_ISSET(0, &set));
+ ck_assert(!CPU_ISSET(1, &set));
+
+ ck_assert_int_eq(parse_cpu_set("0,2", &set), 0);
+ ck_assert(CPU_ISSET(0, &set));
+ ck_assert(CPU_ISSET(2, &set));
+
+ ck_assert_int_eq(parse_cpu_set("0-3", &set), 0);
+ ck_assert(CPU_ISSET(0, &set));
+ ck_assert(CPU_ISSET(1, &set));
+ ck_assert(CPU_ISSET(2, &set));
+ ck_assert(CPU_ISSET(3, &set));
+
+ ck_assert_int_eq(parse_cpu_set("1-3,5", &set), 0);
+ ck_assert(!CPU_ISSET(0, &set));
+ ck_assert(CPU_ISSET(1, &set));
+ ck_assert(CPU_ISSET(2, &set));
+ ck_assert(CPU_ISSET(3, &set));
+ ck_assert(!CPU_ISSET(4, &set));
+ ck_assert(CPU_ISSET(5, &set));
+
+ ck_assert_int_eq(parse_cpu_set("-1", &set), 1);
+ ck_assert_int_eq(parse_cpu_set("abc", &set), 1);
+ ck_assert_int_eq(parse_cpu_set("9999", &set), 1);
+}
+END_TEST
+
+START_TEST(test_parse_prio)
+{
+ struct sched_attr attr;
+
+ ck_assert_int_eq(parse_prio("f:50", &attr), 0);
+ ck_assert_uint_eq(attr.sched_policy, SCHED_FIFO);
+ ck_assert_uint_eq(attr.sched_priority, 50U);
+
+ ck_assert_int_eq(parse_prio("r:30", &attr), 0);
+ ck_assert_uint_eq(attr.sched_policy, SCHED_RR);
+
+ ck_assert_int_eq(parse_prio("o:0", &attr), 0);
+ ck_assert_uint_eq(attr.sched_policy, SCHED_OTHER);
+ ck_assert_int_eq(attr.sched_nice, 0);
+
+ ck_assert_int_eq(parse_prio("d:10ms:100ms", &attr), 0);
+ ck_assert_uint_eq(attr.sched_policy, 6U);
+
+ ck_assert_int_eq(parse_prio("f:999", &attr), -1);
+ ck_assert_int_eq(parse_prio("o:-20", &attr), -1);
+ ck_assert_int_eq(parse_prio("d:100ms:10ms", &attr), -1);
+ ck_assert_int_eq(parse_prio("x:50", &attr), -1);
+}
+END_TEST
+
+Suite *utils_suite(void)
+{
+ Suite *s = suite_create("utils");
+ TCase *tc = tcase_create("core");
+
+ tcase_add_test(tc, test_strtoi);
+ tcase_add_test(tc, test_parse_cpu_set);
+ tcase_add_test(tc, test_parse_prio);
+
+ suite_add_tcase(s, tc);
+ return s;
+}
--
2.53.0
^ permalink raw reply related
* Re: [PATCH 7.2 v16 00/13] khugepaged: mTHP support
From: Matthew Brost @ 2026-04-24 14:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
mathieu.desnoyers, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260424065828.031775921990de37f83a2468@linux-foundation.org>
On Fri, Apr 24, 2026 at 06:58:28AM -0700, Andrew Morton wrote:
> On Sun, 19 Apr 2026 12:57:37 -0600 Nico Pache <npache@redhat.com> wrote:
>
> > The following series provides khugepaged with the capability to collapse
> > anonymous memory regions to mTHPs.
>
> Lots of stuff here:
> https://sashiko.dev/#/patchset/20260419185750.260784-1-npache@redhat.com
>
> It's going to take some time. Hopefully worthwhile.
>
> As always, it's useful to hear about the usefulness of the AI review.
Drive by comment.
On the DRM side sashiko batting average is about .500 but even on misses
it is generally is helpful in questioning assumptions made in patches.
Matt sashiko
^ permalink raw reply
* Re: [PATCH 7.2 v16 00/13] khugepaged: mTHP support
From: Andrew Morton @ 2026-04-24 14:19 UTC (permalink / raw)
To: Matthew Brost
Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
mathieu.desnoyers, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <aet4nz/Ljn0kDjDk@gsse-cloud1.jf.intel.com>
On Fri, 24 Apr 2026 07:05:19 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> On Fri, Apr 24, 2026 at 06:58:28AM -0700, Andrew Morton wrote:
> > On Sun, 19 Apr 2026 12:57:37 -0600 Nico Pache <npache@redhat.com> wrote:
> >
> > > The following series provides khugepaged with the capability to collapse
> > > anonymous memory regions to mTHPs.
> >
> > Lots of stuff here:
> > https://sashiko.dev/#/patchset/20260419185750.260784-1-npache@redhat.com
> >
> > It's going to take some time. Hopefully worthwhile.
> >
> > As always, it's useful to hear about the usefulness of the AI review.
>
> Drive by comment.
>
> On the DRM side sashiko batting average is about .500 but even on misses
> it is generally is helpful in questioning assumptions made in patches.
Interesting, thanks.
Personally, not adding bugs to Linux is so damn important, I'd be happy
with a lot less than 50%.
> Matt sashiko
"A sashiko mat is a decorative or functional mat, such as a coaster,
table mat, or place mat, made using traditional Japanese, functional,
and meditative embroidery stitching".
So there.
^ permalink raw reply
* Re: [PATCH 7.2 v16 00/13] khugepaged: mTHP support
From: Matthew Brost @ 2026-04-24 14:24 UTC (permalink / raw)
To: Andrew Morton
Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
mathieu.desnoyers, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260424071930.62318a9294e07c99ba0ff8a2@linux-foundation.org>
On Fri, Apr 24, 2026 at 07:19:30AM -0700, Andrew Morton wrote:
> On Fri, 24 Apr 2026 07:05:19 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
>
> > On Fri, Apr 24, 2026 at 06:58:28AM -0700, Andrew Morton wrote:
> > > On Sun, 19 Apr 2026 12:57:37 -0600 Nico Pache <npache@redhat.com> wrote:
> > >
> > > > The following series provides khugepaged with the capability to collapse
> > > > anonymous memory regions to mTHPs.
> > >
> > > Lots of stuff here:
> > > https://sashiko.dev/#/patchset/20260419185750.260784-1-npache@redhat.com
> > >
> > > It's going to take some time. Hopefully worthwhile.
> > >
> > > As always, it's useful to hear about the usefulness of the AI review.
> >
> > Drive by comment.
> >
> > On the DRM side sashiko batting average is about .500 but even on misses
> > it is generally is helpful in questioning assumptions made in patches.
>
> Interesting, thanks.
>
> Personally, not adding bugs to Linux is so damn important, I'd be happy
> with a lot less than 50%.
>
I'm convinced enough that any patch I post/merge or even RB, I read
sashiko first.
Matt
> > Matt sashiko
>
> "A sashiko mat is a decorative or functional mat, such as a coaster,
> table mat, or place mat, made using traditional Japanese, functional,
> and meditative embroidery stitching".
>
> So there.
^ permalink raw reply
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Ackerley Tng @ 2026-04-24 19:08 UTC (permalink / raw)
To: Michael Roth
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <3blpenhpvysb2ig7efegedx4v3flppl5ftnz6vhpqlatfk3ycn@vmmhs7mvjieg>
Michael Roth <michael.roth@amd.com> writes:
Thank you for your patches!
>
> [...snip...]
>
>>
>> I also did some minor updates (prefixed with a "[squash]" tag) to advertise
>> the KVM_SET_MEMORY_ATTRIBUTES2_PRESERVED flag so it can be used by
>
> Though I'm not sure how we deal with it if SNP/TDX at some point become
> capable of using the PRESERVED flag *after* populate... but maybe that's
> too unlikely to worry about? If we wanted to address it though, we could
> have both PRESERVED and PRESERVED_BEFORE_LAUNCH so they can be
> enumerated separately from the start.
>
Not sure how likely it is, but if SNP and TDX can honor PRESERVE
semantics after populate, I think we could implement support under a new
flag like CIPHER.
CIPHER can then be used to mean "do the encryption or decryption", and
for platforms not supporting encryption, they'd stick with PRESERVE?
Should we redefine the semantics of PRESERVE to be "ensure that memory
contents don't change while guest_memfd tracking is being updated" and
avoid making a commitment on how the guest should read the memory?
The above update would be aligned with ZERO not being allowed for
conversions to private (because KVM/guest_memfd does not make guarantees
about the contract between the host and guest.
This way, all of those (ZERO, PRESERVE) will focus on KVM's interface
with the host.
This lines up for SW_PROTECTED_VMs too, since reading memory that didn't
change in the guest is the contract between SW_PROTECTED_VMs and the
host.
>> userspace for SNP/TDX in the kvm_gmem_populate() path as agreed upon
>> during PUCK.
>>
>>
>> [...snip...]
>>
^ permalink raw reply
* [PATCH] tracing: simplify pages allocation
From: Rosen Penev @ 2026-04-25 1:44 UTC (permalink / raw)
To: linux-trace-kernel
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Kees Cook,
Gustavo A. R. Silva, open list:TRACING,
open list:KERNEL HARDENING (not covered by other areas):Keyword:b__counted_by(_le|_be)?b
Change to a flexible array member to allocate together with the array
struct.
Simplifies code slightly by removing no longer correct null checks for
pages and removing kfrees.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
kernel/trace/tracing_map.c | 32 +++++++++++---------------------
kernel/trace/tracing_map.h | 2 +-
2 files changed, 12 insertions(+), 22 deletions(-)
diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
index bf1a507695b6..627cc3fdf69e 100644
--- a/kernel/trace/tracing_map.c
+++ b/kernel/trace/tracing_map.c
@@ -288,9 +288,6 @@ static void tracing_map_array_clear(struct tracing_map_array *a)
{
unsigned int i;
- if (!a->pages)
- return;
-
for (i = 0; i < a->n_pages; i++)
memset(a->pages[i], 0, PAGE_SIZE);
}
@@ -302,44 +299,37 @@ static void tracing_map_array_free(struct tracing_map_array *a)
if (!a)
return;
- if (!a->pages)
- goto free;
-
for (i = 0; i < a->n_pages; i++) {
if (!a->pages[i])
break;
kmemleak_free(a->pages[i]);
free_page((unsigned long)a->pages[i]);
}
-
- kfree(a->pages);
-
- free:
- kfree(a);
}
static struct tracing_map_array *tracing_map_array_alloc(unsigned int n_elts,
unsigned int entry_size)
{
struct tracing_map_array *a;
+ unsigned int entry_size_shift;
+ unsigned int entries_per_page;
+ unsigned int n_pages;
unsigned int i;
- a = kzalloc_obj(*a);
+ entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
+ entries_per_page = PAGE_SIZE / (1 << entry_size_shift);
+ n_pages = max(1, n_elts / entries_per_page);
+
+ a = kzalloc_flex(*a, pages, n_pages);
if (!a)
return NULL;
- a->entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
- a->entries_per_page = PAGE_SIZE / (1 << a->entry_size_shift);
- a->n_pages = n_elts / a->entries_per_page;
- if (!a->n_pages)
- a->n_pages = 1;
+ a->entry_size_shift = entry_size_shift;
+ a->entries_per_page = entries_per_page;
+ a->n_pages = n_pages;
a->entry_shift = fls(a->entries_per_page) - 1;
a->entry_mask = (1 << a->entry_shift) - 1;
- a->pages = kcalloc(a->n_pages, sizeof(void *), GFP_KERNEL);
- if (!a->pages)
- goto free;
-
for (i = 0; i < a->n_pages; i++) {
a->pages[i] = (void *)get_zeroed_page(GFP_KERNEL);
if (!a->pages[i])
diff --git a/kernel/trace/tracing_map.h b/kernel/trace/tracing_map.h
index 99c37eeebc16..18a02959d77b 100644
--- a/kernel/trace/tracing_map.h
+++ b/kernel/trace/tracing_map.h
@@ -167,7 +167,7 @@ struct tracing_map_array {
unsigned int entry_shift;
unsigned int entry_mask;
unsigned int n_pages;
- void **pages;
+ void *pages[] __counted_by(n_pages);
};
#define TRACING_MAP_ARRAY_ELT(array, idx) \
--
2.54.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox