Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
From: Jeff Layton @ 2026-04-26 14:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Christoph Hellwig, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260426052854.8372fb9d4c616f16a8aa0a0f@linux-foundation.org>

On Sun, 2026-04-26 at 05:28 -0700, Andrew Morton wrote:
> Naive questions...
> 
> On Sun, 26 Apr 2026 07:56:08 -0400 Jeff Layton <jlayton@kernel.org> wrote:
> 
> > The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> > filemap_flush_range() on every write, submitting writeback inline in
> > the writer's context.  Perf lock contention profiling shows the
> > performance problem is not lock contention but the writeback submission
> > work itself — walking the page tree and submitting I/O blocks the writer
> > for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> > (dontcache).
> 
> So in the current case, when generic_write_sync() returns, all that
> memory is written back and clean&reclaimable (or freed?), yes?
> 

No. Before returning, it submits the I/Os for the portion that it wrote
rather than leaving it to the flusher to take care of things, but it
doesn't wait for the I/Os to complete.

> > Replace the inline filemap_flush_range() call with a flusher kick that
> > drains dirty pages in the background.  This moves writeback submission
> > completely off the writer's hot path.
> 
> Whereas after this change, that pagecache is probably still dirty,
> unreclaimable, waiting for the flusher to do its thing?
> 

Correct, but that's sort of the case today too since DONTCACHE I/Os
don't wait for the completion. With this change we're just deferring
the I/O submission to the flusher thread (which should hopefully soon
wake and take care of business). If the flusher thread can't keep up,
then eventually balance_dirty_pages() will kick in and start slowing
things down.

> So is there potential that the system will get all gummed up with
> dirty, to-be-written-soon pagecache?  Is there something which limits
> this buildup?
> 

Today in this situation, the writers are limited by the backing device
throughput. Once the I/O submission queues are full, then the DONTCACHE
writers end up stacking up on those. With this change, the writers will
be more limited by traditional VM limits in this situation. 

In the test runs I did, the peak pagecache with DONTCACHE writes was
higher than with the unpatched version but still considerably less than
with normal buffered I/O. That's the cost of deferring the I/O
submission to the flusher.

One thing we could consider is going back to submitting the writes
inline when the number of dirty pages is high. But, that could have a
detrimental effect on performance too.

> > ...
> > 
> > dontcache-bench results on dual-socket Xeon Gold 6138 (80 CPUs, 256 GB
> > RAM, Samsung MZ1LB1T9HALS 1.7 TB NVMe, local XFS, io_uring, file size
> > ~503 GB, compared to a v6.19-ish baseline):
> > 
> >   Single-client sequential write (MB/s):
> >                        baseline    patched     change
> >   buffered              1449.8     1440.1      -0.7%
> >   dontcache             1347.9     1461.5      +8.4%
> >   direct                1450.0     1440.1      -0.7%
> > 
> >   Single-client sequential write latency (us):
> >                        baseline    patched     change
> >   dontcache p50         3031.0    10551.3    +248.1%
> >   dontcache p99        74973.2    21626.9     -71.2%
> >   dontcache p99.9      85459.0    23199.7     -72.9%
> > 
> >   Single-client random write (MB/s):
> >                        baseline    patched     change
> >   dontcache              284.2      295.4      +3.9%
> > 
> >   Single-client random write p99.9 latency (us):
> >                        baseline    patched     change
> >   dontcache             2277.4      872.4     -61.7%
> > 
> >   Multi-writer aggregate throughput (MB/s):
> >                        baseline    patched     change
> >   buffered              1619.5     1611.2      -0.5%
> >   dontcache             1281.1     1629.4     +27.2%
> >   direct                1545.4     1609.4      +4.1%
> > 
> >   Mixed-mode noisy neighbor (dontcache writer + buffered readers):
> >                        baseline    patched     change
> >   writer (MB/s)         1297.6     1471.1     +13.4%
> >   readers avg (MB/s)     855.0      462.4     -45.9%
> 
> These results look ambiguous.  Sometimes better, sometimes worse?
> 
> > nfsd-io-bench results on same hardware (XFS on NVMe, NFSv3 via fio
> > NFS engine with libnfs, 1024 NFSD threads, pool_mode=pernode,
> > file size ~502 GB, compared to v6.19-ish baseline):
> > 
> >   Single-client sequential write (MB/s):
> >                        baseline    patched     change
> >   buffered              4844.2     4653.4      -3.9%
> >   dontcache             3028.3     3723.1     +22.9%
> >   direct                 957.6      987.8      +3.2%
> > 
> >   Single-client sequential write p99.9 latency (us):
> >                        baseline    patched     change
> >   dontcache            759169.0   175112.2     -76.9%
> > 
> >   Single-client random write (MB/s):
> >                        baseline    patched     change
> >   dontcache              590.0     1561.0    +164.6%
> > 
> >   Multi-writer aggregate throughput (MB/s):
> >                        baseline    patched     change
> >   buffered              9636.3     9422.9      -2.2%
> >   dontcache             1894.9     9442.6    +398.3%
> >   direct                 809.6      975.1     +20.4%
> > 
> >   Noisy neighbor (dontcache writer + random readers):
> >                        baseline    patched     change
> >   writer (MB/s)         1854.5     4063.6    +119.1%
> >   readers avg (MB/s)     131.2      101.6     -22.5%
> 
> Ditto but less so.
> 
> > The NFS results show even larger improvements than the local benchmarks.
> > Multi-writer dontcache throughput improves nearly 5x, matching buffered
> > I/O. Dirty page footprint drops 85-95% in sequential workloads vs.
> > buffered.
> 
> It sounds that you like the results, so OK ;)

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Song Chen @ 2026-04-26 13:56 UTC (permalink / raw)
  To: Petr Mladek, Masami Hiramatsu
  Cc: chensong_2000, rafael, lenb, mturquette, sboyd, viresh.kumar, agk,
	snitzer, mpatocka, bmarzins, song, yukuai, linan122, jason.wessel,
	danielt, dianders, horms, davem, edumazet, kuba, pabeni, paulmck,
	frederic, mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin,
	jpoimboe, jikos, mbenes, joe.lawrence, rostedt, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <aec90caYZDHDAHgw@pathway.suse.cz>

Hi,

On 4/21/26 17:05, Petr Mladek wrote:
> On Mon 2026-04-20 14:44:29, Masami Hiramatsu wrote:
>> Hi Song,
>>
>> On Wed, 15 Apr 2026 15:01:37 +0800
>> chensong_2000@189.cn wrote:
>>
>>> From: Song Chen <chensong_2000@189.cn>
>>>
>>> The current notifier chain implementation uses a single-linked list
>>> (struct notifier_block *next), which only supports forward traversal
>>> in priority order. This makes it difficult to handle cleanup/teardown
>>> scenarios that require notifiers to be called in reverse priority order.
>>
>> What about introducing a new notification callback API that allows you
>> to describe dependencies between callback functions?
>>
>> For example, when registering a callback, you could register a string
>> as an ID and specify whether to call it before or after that ID,
>> or you could register a comparison function that is called when adding
>> to a list. (I prefer @name and @depends fields so that it can be easily
>> maintained.)
> 
> This looks too complex. It would make sense only
> when this API has more users.
> 
> Also this won't be enough for the ftrace/livepatch callbacks.
> They need to be ordered against against each other. But they
> also need to be called before/after all other callbacks.
> For example, when the module is loaded:
> 
>     + 1st frace
>     + 2nd livepatch
>     + then other notifiers
> 
> See the commit c1bf08ac26e92122 ("ftrace: Be first to run code
> modification on modules").
> 
>> This would allow for better dependency building when adding to the list.
>   
>>>
>>> A concrete example is the ordering dependency between ftrace and
>>> livepatch during module load/unload. see the detail here [1].
>>
>> If this only concerns notification callback issues with the ftrace
>> and livepatch modules, it's far more robust to simply call the
>> necessary processing directly when the modules load and unload,
>> rather than registering notification callbacks externally.
>>
>> There are fprobe, kprobe and its trace-events, all of them are using
>> ftrace as its fundation layer. In this case, I always needs to
>> consider callback order when a module is unloaded.
>>
>> If ftrace is working as a part of module callbacks, it will conflict
>> with fprobe/kprobe module callback. Of course we can reorder it with
>> modifying its priority. But this is ugly, because when we introduce
>> a new other feature which depends on another layer, we need to
>> reorder the callback's priority number on the list.
>>
>> Based on the above, I don't think this can be resolved simply by
>> changing the list of notification callbacks to a bidirectional list.
> 
> I agree. I would keep it as is (hardcoded).
> 
> Best Regards,
> Petr
> 


Thanks for the feedback, the necessity doesn't convincing enough. I will 
try the proposal from Masami Hiramatsu.

Best regards,

Song


^ permalink raw reply

* Re: [PATCH v3 3/4] testing: add nfsd-io-bench NFS server benchmark suite
From: Andrew Morton @ 2026-04-26 12:34 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Christoph Hellwig, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260426-dontcache-v3-3-79eb37da9547@kernel.org>

On Sun, 26 Apr 2026 07:56:09 -0400 Jeff Layton <jlayton@kernel.org> wrote:

> Add a benchmark suite for testing NFSD I/O mode performance using fio
> with the libnfs backend against an NFS server on localhost.  Tests
> buffered, dontcache, and direct I/O modes via NFSD debugfs controls.
> 
> Includes:
>  - fio job files for sequential/random read/write, multi-writer,
>    noisy-neighbor, and latency-sensitive reader workloads
>  - run-benchmarks.sh: orchestrates test matrix with mode switching
>  - parse-results.sh: extracts metrics from fio JSON output
>  - setup-server.sh: configures NFS export for testing
> 
> Assisted-by: Claude:claude-opus-4-6

OK, question.

>  10 files changed, 1024 insertions(+)

Seems that this code was largely machine-generated.  So I assume that
you're in possession of the scripts/prompts/whatever which were used to
generate this code.

(Can you please briefly describe the process which you used here?)

So how are we to maintain this?  Will other developers have to go in
and hack this machine-generated output by hand?  Or would it be better
to provide (in-tree) other developers with the means to regenerate this code,
presumably using Claude?

IOW, this feels a bit like shipping the .s file without giving us the .c
file!

^ permalink raw reply

* Re: [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
From: Andrew Morton @ 2026-04-26 12:28 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Christoph Hellwig, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260426-dontcache-v3-2-79eb37da9547@kernel.org>

Naive questions...

On Sun, 26 Apr 2026 07:56:08 -0400 Jeff Layton <jlayton@kernel.org> wrote:

> The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> filemap_flush_range() on every write, submitting writeback inline in
> the writer's context.  Perf lock contention profiling shows the
> performance problem is not lock contention but the writeback submission
> work itself — walking the page tree and submitting I/O blocks the writer
> for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> (dontcache).

So in the current case, when generic_write_sync() returns, all that
memory is written back and clean&reclaimable (or freed?), yes?

> Replace the inline filemap_flush_range() call with a flusher kick that
> drains dirty pages in the background.  This moves writeback submission
> completely off the writer's hot path.

Whereas after this change, that pagecache is probably still dirty,
unreclaimable, waiting for the flusher to do its thing?

So is there potential that the system will get all gummed up with
dirty, to-be-written-soon pagecache?  Is there something which limits
this buildup?

> ...
>
> dontcache-bench results on dual-socket Xeon Gold 6138 (80 CPUs, 256 GB
> RAM, Samsung MZ1LB1T9HALS 1.7 TB NVMe, local XFS, io_uring, file size
> ~503 GB, compared to a v6.19-ish baseline):
> 
>   Single-client sequential write (MB/s):
>                        baseline    patched     change
>   buffered              1449.8     1440.1      -0.7%
>   dontcache             1347.9     1461.5      +8.4%
>   direct                1450.0     1440.1      -0.7%
> 
>   Single-client sequential write latency (us):
>                        baseline    patched     change
>   dontcache p50         3031.0    10551.3    +248.1%
>   dontcache p99        74973.2    21626.9     -71.2%
>   dontcache p99.9      85459.0    23199.7     -72.9%
> 
>   Single-client random write (MB/s):
>                        baseline    patched     change
>   dontcache              284.2      295.4      +3.9%
> 
>   Single-client random write p99.9 latency (us):
>                        baseline    patched     change
>   dontcache             2277.4      872.4     -61.7%
> 
>   Multi-writer aggregate throughput (MB/s):
>                        baseline    patched     change
>   buffered              1619.5     1611.2      -0.5%
>   dontcache             1281.1     1629.4     +27.2%
>   direct                1545.4     1609.4      +4.1%
> 
>   Mixed-mode noisy neighbor (dontcache writer + buffered readers):
>                        baseline    patched     change
>   writer (MB/s)         1297.6     1471.1     +13.4%
>   readers avg (MB/s)     855.0      462.4     -45.9%

These results look ambiguous.  Sometimes better, sometimes worse?

> nfsd-io-bench results on same hardware (XFS on NVMe, NFSv3 via fio
> NFS engine with libnfs, 1024 NFSD threads, pool_mode=pernode,
> file size ~502 GB, compared to v6.19-ish baseline):
> 
>   Single-client sequential write (MB/s):
>                        baseline    patched     change
>   buffered              4844.2     4653.4      -3.9%
>   dontcache             3028.3     3723.1     +22.9%
>   direct                 957.6      987.8      +3.2%
> 
>   Single-client sequential write p99.9 latency (us):
>                        baseline    patched     change
>   dontcache            759169.0   175112.2     -76.9%
> 
>   Single-client random write (MB/s):
>                        baseline    patched     change
>   dontcache              590.0     1561.0    +164.6%
> 
>   Multi-writer aggregate throughput (MB/s):
>                        baseline    patched     change
>   buffered              9636.3     9422.9      -2.2%
>   dontcache             1894.9     9442.6    +398.3%
>   direct                 809.6      975.1     +20.4%
> 
>   Noisy neighbor (dontcache writer + random readers):
>                        baseline    patched     change
>   writer (MB/s)         1854.5     4063.6    +119.1%
>   readers avg (MB/s)     131.2      101.6     -22.5%

Ditto but less so.

> The NFS results show even larger improvements than the local benchmarks.
> Multi-writer dontcache throughput improves nearly 5x, matching buffered
> I/O. Dirty page footprint drops 85-95% in sequential workloads vs.
> buffered.

It sounds that you like the results, so OK ;)


^ permalink raw reply

* [PATCH v3 4/4] testing: add dontcache-bench local filesystem benchmark suite
From: Jeff Layton @ 2026-04-26 11:56 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Christoph Hellwig, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel, Jeff Layton
In-Reply-To: <20260426-dontcache-v3-0-79eb37da9547@kernel.org>

Add a benchmark suite for testing IOCB_DONTCACHE on local filesystems
via fio's io_uring engine with the RWF_DONTCACHE flag.

The suite mirrors the nfsd-io-bench test matrix but uses io_uring with
the "uncached" fio option instead of NFSD debugfs mode switching:
 - uncached=0: standard buffered I/O
 - uncached=1: RWF_DONTCACHE
 - Mode 2 uses O_DIRECT via fio's --direct=1

Includes fio job files, run-benchmarks.sh, and parse-results.sh.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 .../dontcache-bench/fio-jobs/lat-reader.fio        |  12 +
 .../dontcache-bench/fio-jobs/multi-write.fio       |   9 +
 .../dontcache-bench/fio-jobs/noisy-writer.fio      |  12 +
 .../testing/dontcache-bench/fio-jobs/rand-read.fio |  13 +
 .../dontcache-bench/fio-jobs/rand-write.fio        |  13 +
 .../testing/dontcache-bench/fio-jobs/seq-read.fio  |  13 +
 .../testing/dontcache-bench/fio-jobs/seq-write.fio |  13 +
 .../dontcache-bench/scripts/parse-results.sh       | 238 +++++++++
 .../dontcache-bench/scripts/run-benchmarks.sh      | 562 +++++++++++++++++++++
 9 files changed, 885 insertions(+)

diff --git a/tools/testing/dontcache-bench/fio-jobs/lat-reader.fio b/tools/testing/dontcache-bench/fio-jobs/lat-reader.fio
new file mode 100644
index 000000000000..e221e7aedec9
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/lat-reader.fio
@@ -0,0 +1,12 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=4k
+numjobs=1
+time_based=0
+rw=read
+log_avg_msec=1000
+write_bw_log=latreader
+write_lat_log=latreader
+
+[latreader]
diff --git a/tools/testing/dontcache-bench/fio-jobs/multi-write.fio b/tools/testing/dontcache-bench/fio-jobs/multi-write.fio
new file mode 100644
index 000000000000..8fc0770f5860
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/multi-write.fio
@@ -0,0 +1,9 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=1M
+numjobs=1
+time_based=0
+rw=write
+
+[multiwrite]
diff --git a/tools/testing/dontcache-bench/fio-jobs/noisy-writer.fio b/tools/testing/dontcache-bench/fio-jobs/noisy-writer.fio
new file mode 100644
index 000000000000..4524eebd4642
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/noisy-writer.fio
@@ -0,0 +1,12 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=1M
+numjobs=1
+time_based=0
+rw=write
+log_avg_msec=1000
+write_bw_log=noisywriter
+write_lat_log=noisywriter
+
+[noisywriter]
diff --git a/tools/testing/dontcache-bench/fio-jobs/rand-read.fio b/tools/testing/dontcache-bench/fio-jobs/rand-read.fio
new file mode 100644
index 000000000000..e281fa82b86a
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/rand-read.fio
@@ -0,0 +1,13 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=4k
+numjobs=1
+iodepth=16
+time_based=0
+rw=randread
+log_avg_msec=1000
+write_bw_log=randread
+write_lat_log=randread
+
+[randread]
diff --git a/tools/testing/dontcache-bench/fio-jobs/rand-write.fio b/tools/testing/dontcache-bench/fio-jobs/rand-write.fio
new file mode 100644
index 000000000000..cf53bc6f14b9
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/rand-write.fio
@@ -0,0 +1,13 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=4k
+numjobs=1
+iodepth=16
+time_based=0
+rw=randwrite
+log_avg_msec=1000
+write_bw_log=randwrite
+write_lat_log=randwrite
+
+[randwrite]
diff --git a/tools/testing/dontcache-bench/fio-jobs/seq-read.fio b/tools/testing/dontcache-bench/fio-jobs/seq-read.fio
new file mode 100644
index 000000000000..ef87921465a7
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/seq-read.fio
@@ -0,0 +1,13 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=1M
+numjobs=1
+iodepth=16
+time_based=0
+rw=read
+log_avg_msec=1000
+write_bw_log=seqread
+write_lat_log=seqread
+
+[seqread]
diff --git a/tools/testing/dontcache-bench/fio-jobs/seq-write.fio b/tools/testing/dontcache-bench/fio-jobs/seq-write.fio
new file mode 100644
index 000000000000..da3082f9b391
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/seq-write.fio
@@ -0,0 +1,13 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=1M
+numjobs=1
+iodepth=16
+time_based=0
+rw=write
+log_avg_msec=1000
+write_bw_log=seqwrite
+write_lat_log=seqwrite
+
+[seqwrite]
diff --git a/tools/testing/dontcache-bench/scripts/parse-results.sh b/tools/testing/dontcache-bench/scripts/parse-results.sh
new file mode 100755
index 000000000000..0427d411db04
--- /dev/null
+++ b/tools/testing/dontcache-bench/scripts/parse-results.sh
@@ -0,0 +1,238 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Parse fio JSON output and generate comparison tables.
+#
+# Usage: ./parse-results.sh <results-dir>
+
+set -euo pipefail
+
+if [ $# -lt 1 ]; then
+	echo "Usage: $0 <results-dir>"
+	exit 1
+fi
+
+RESULTS_DIR="$1"
+
+if ! command -v jq &>/dev/null; then
+	echo "ERROR: jq is required"
+	exit 1
+fi
+
+# Extract metrics from a single fio JSON result
+extract_metrics() {
+	local json_file=$1
+	local rw_type=$2  # read or write
+
+	if [ ! -f "$json_file" ]; then
+		echo "N/A N/A N/A N/A N/A N/A"
+		return
+	fi
+
+	jq -r --arg rw "$rw_type" '
+		.jobs[0][$rw] as $d |
+		[
+			(($d.bw // 0) / 1024 | . * 10 | round / 10),    # MB/s
+			($d.iops // 0),                                    # IOPS
+			((($d.clat_ns.mean // 0) / 1000) | . * 10 | round / 10), # avg lat us
+			(($d.clat_ns.percentile["50.000000"] // 0) / 1000), # p50 us
+			(($d.clat_ns.percentile["99.000000"] // 0) / 1000), # p99 us
+			(($d.clat_ns.percentile["99.900000"] // 0) / 1000)  # p99.9 us
+		] | @tsv
+	' "$json_file" 2>/dev/null || echo "N/A N/A N/A N/A N/A N/A"
+}
+
+# Extract server CPU from vmstat log (average sys%)
+extract_cpu() {
+	local vmstat_log=$1
+	if [ ! -f "$vmstat_log" ]; then
+		echo "N/A"
+		return
+	fi
+	# vmstat columns: us sy id wa st — skip header lines
+	awk 'NR>2 {sum+=$14; n++} END {if(n>0) printf "%.1f", sum/n; else print "N/A"}' \
+		"$vmstat_log" 2>/dev/null || echo "N/A"
+}
+
+# Extract peak dirty pages from meminfo log
+extract_peak_dirty() {
+	local meminfo_log=$1
+	if [ ! -f "$meminfo_log" ]; then
+		echo "N/A"
+		return
+	fi
+	grep "^Dirty:" "$meminfo_log" | awk '{print $2}' | sort -n | tail -1 || echo "N/A"
+}
+
+# Extract peak cached from meminfo log
+extract_peak_cached() {
+	local meminfo_log=$1
+	if [ ! -f "$meminfo_log" ]; then
+		echo "N/A"
+		return
+	fi
+	grep "^Cached:" "$meminfo_log" | awk '{print $2}' | sort -n | tail -1 || echo "N/A"
+}
+
+print_separator() {
+	printf '%*s\n' 120 '' | tr ' ' '-'
+}
+
+########################################################################
+# Deliverable 1: Single-client results
+########################################################################
+echo ""
+echo "=================================================================="
+echo "  Deliverable 1: Single-Client fio Benchmarks"
+echo "=================================================================="
+echo ""
+
+for workload in seq-write rand-write seq-read rand-read; do
+	case $workload in
+	seq-write|rand-write) rw_type="write" ;;
+	seq-read|rand-read)   rw_type="read" ;;
+	esac
+
+	echo "--- $workload ---"
+	printf "%-16s %10s %10s %10s %10s %10s %10s %10s %12s %12s\n" \
+		"Mode" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)" "Sys CPU%" "PeakDirty(kB)" "PeakCache(kB)"
+	print_separator
+
+	for mode in buffered dontcache direct; do
+		dir="${RESULTS_DIR}/${workload}/${mode}"
+		json_file=$(find "$dir" -name '*.json' -not -name 'client*' 2>/dev/null | head -1 || true)
+		if [ -z "$json_file" ]; then
+			printf "%-16s %10s\n" "$mode" "(no data)"
+			continue
+		fi
+
+		read -r mbps iops avg_lat p50 p99 p999 <<< \
+			"$(extract_metrics "$json_file" "$rw_type")"
+		cpu=$(extract_cpu "${dir}/vmstat.log")
+		dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+		cached=$(extract_peak_cached "${dir}/meminfo.log")
+
+		printf "%-16s %10s %10s %10s %10s %10s %10s %10s %12s %12s\n" \
+			"$mode" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999" \
+			"$cpu" "${dirty:-N/A}" "${cached:-N/A}"
+	done
+	echo ""
+done
+
+########################################################################
+# Deliverable 2: Multi-client results
+########################################################################
+echo "=================================================================="
+echo "  Deliverable 2: Noisy-Neighbor Benchmarks"
+echo "=================================================================="
+echo ""
+
+# Scenario A: Multiple writers
+echo "--- Scenario A: Multiple Writers ---"
+for mode in buffered dontcache direct; do
+	dir="${RESULTS_DIR}/multi-write/${mode}"
+	if [ ! -d "$dir" ]; then
+		continue
+	fi
+
+	echo "  Mode: $mode"
+	printf "  %-10s %10s %10s %10s %10s %10s %10s\n" \
+		"Client" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+	total_bw=0
+	count=0
+	for json_file in "${dir}"/client*.json; do
+		[ -f "$json_file" ] || continue
+		client=$(basename "$json_file" .json)
+		read -r mbps iops avg_lat p50 p99 p999 <<< \
+			"$(extract_metrics "$json_file" "write")"
+		printf "  %-10s %10s %10s %10s %10s %10s %10s\n" \
+			"$client" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+		total_bw=$(echo "$total_bw + ${mbps:-0}" | bc 2>/dev/null || echo "$total_bw")
+		count=$(( count + 1 ))
+	done
+
+	cpu=$(extract_cpu "${dir}/vmstat.log")
+	dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+	printf "  Aggregate BW: %s MB/s | Sys CPU: %s%% | Peak Dirty: %s kB\n" \
+		"$total_bw" "$cpu" "${dirty:-N/A}"
+	echo ""
+done
+
+# Scenario C: Noisy neighbor
+echo "--- Scenario C: Noisy Writer + Latency-Sensitive Readers ---"
+for mode in buffered dontcache direct; do
+	dir="${RESULTS_DIR}/noisy-neighbor/${mode}"
+	if [ ! -d "$dir" ]; then
+		continue
+	fi
+
+	echo "  Mode: $mode"
+	printf "  %-14s %10s %10s %10s %10s %10s %10s\n" \
+		"Job" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+	# Writer
+	if [ -f "${dir}/noisy_writer.json" ]; then
+		read -r mbps iops avg_lat p50 p99 p999 <<< \
+			"$(extract_metrics "${dir}/noisy_writer.json" "write")"
+		printf "  %-14s %10s %10s %10s %10s %10s %10s\n" \
+			"Bulk writer" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+	fi
+
+	# Readers
+	for json_file in "${dir}"/reader*.json; do
+		[ -f "$json_file" ] || continue
+		reader=$(basename "$json_file" .json)
+		read -r mbps iops avg_lat p50 p99 p999 <<< \
+			"$(extract_metrics "$json_file" "read")"
+		printf "  %-14s %10s %10s %10s %10s %10s %10s\n" \
+			"$reader" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+	done
+
+	cpu=$(extract_cpu "${dir}/vmstat.log")
+	dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+	printf "  Sys CPU: %s%% | Peak Dirty: %s kB\n" "$cpu" "${dirty:-N/A}"
+	echo ""
+done
+
+# Scenario D: Mixed-mode noisy neighbor
+echo "--- Scenario D: Mixed-Mode Noisy Writer + Readers ---"
+for dir in "${RESULTS_DIR}"/noisy-neighbor-mixed/*/; do
+	[ -d "$dir" ] || continue
+	label=$(basename "$dir")
+
+	echo "  Mode: $label"
+	printf "  %-14s %10s %10s %10s %10s %10s %10s\n" \
+		"Job" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+	# Writer
+	if [ -f "${dir}/noisy_writer.json" ]; then
+		read -r mbps iops avg_lat p50 p99 p999 <<< \
+			"$(extract_metrics "${dir}/noisy_writer.json" "write")"
+		printf "  %-14s %10s %10s %10s %10s %10s %10s\n" \
+			"Bulk writer" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+	fi
+
+	# Readers
+	for json_file in "${dir}"/reader*.json; do
+		[ -f "$json_file" ] || continue
+		reader=$(basename "$json_file" .json)
+		read -r mbps iops avg_lat p50 p99 p999 <<< \
+			"$(extract_metrics "$json_file" "read")"
+		printf "  %-14s %10s %10s %10s %10s %10s %10s\n" \
+			"$reader" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+	done
+
+	cpu=$(extract_cpu "${dir}/vmstat.log")
+	dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+	printf "  Sys CPU: %s%% | Peak Dirty: %s kB\n" "$cpu" "${dirty:-N/A}"
+	echo ""
+done
+
+echo "=================================================================="
+echo "  System Info"
+echo "=================================================================="
+if [ -f "${RESULTS_DIR}/sysinfo.txt" ]; then
+	head -6 "${RESULTS_DIR}/sysinfo.txt"
+fi
+echo ""
diff --git a/tools/testing/dontcache-bench/scripts/run-benchmarks.sh b/tools/testing/dontcache-bench/scripts/run-benchmarks.sh
new file mode 100755
index 000000000000..11bf400ef092
--- /dev/null
+++ b/tools/testing/dontcache-bench/scripts/run-benchmarks.sh
@@ -0,0 +1,562 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Local filesystem I/O mode benchmark suite.
+#
+# Runs the same test matrix as run-benchmarks.sh but on a local filesystem
+# using fio's io_uring engine with the RWF_DONTCACHE flag instead of NFSD's
+# debugfs mode knobs.
+#
+# Usage: ./run-local-benchmarks.sh [options]
+#   -t <dir>    Test directory (must be on a filesystem supporting FOP_DONTCACHE)
+#   -s <size>   File size (default: auto-sized to exceed RAM)
+#   -f <path>   Path to fio binary (default: fio in PATH)
+#   -o <dir>    Output directory for results (default: ./results/<timestamp>)
+#   -d          Dry run (print commands without executing)
+
+set -euo pipefail
+
+# Defaults
+TEST_DIR=""
+SIZE=""
+FIO_BIN="fio"
+RESULTS_DIR=""
+DRY_RUN=0
+MODES="0 1 2"
+PERF_LOCK=0
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+FIO_JOBS_DIR="${SCRIPT_DIR}/../fio-jobs"
+
+usage() {
+	echo "Usage: $0 -t <test-dir> [-s <size>] [-f <fio-path>] [-o <output-dir>] [-D] [-p] [-d]"
+	echo ""
+	echo "  -t <dir>    Test directory (required, must support RWF_DONTCACHE)"
+	echo "  -s <size>   File size (default: 2x RAM)"
+	echo "  -f <path>   Path to fio binary (default: fio)"
+	echo "  -o <dir>    Output directory (default: ./results/<timestamp>)"
+	echo "  -D          Dontcache only (skip buffered and direct tests)"
+	echo "  -p          Profile kernel lock contention with perf lock"
+	echo "  -d          Dry run"
+	exit 1
+}
+
+while getopts "t:s:f:o:Dpdh" opt; do
+	case $opt in
+	t) TEST_DIR="$OPTARG" ;;
+	s) SIZE="$OPTARG" ;;
+	f) FIO_BIN="$OPTARG" ;;
+	o) RESULTS_DIR="$OPTARG" ;;
+	D) MODES="1" ;;
+	p) PERF_LOCK=1 ;;
+	d) DRY_RUN=1 ;;
+	h) usage ;;
+	*) usage ;;
+	esac
+done
+
+if [ -z "$TEST_DIR" ]; then
+	echo "ERROR: -t <test-dir> is required"
+	usage
+fi
+
+# Auto-size to 2x RAM if not specified
+if [ -z "$SIZE" ]; then
+	mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
+	SIZE="$(( mem_kb * 2 / 1024 ))M"
+fi
+
+if [ -z "$RESULTS_DIR" ]; then
+	RESULTS_DIR="./results/local-$(date +%Y%m%d-%H%M%S)"
+fi
+
+mkdir -p "$RESULTS_DIR"
+
+log() {
+	echo "[$(date '+%H:%M:%S')] $*"
+}
+
+run_cmd() {
+	if [ "$DRY_RUN" -eq 1 ]; then
+		echo "  [DRY RUN] $*"
+	else
+		"$@"
+	fi
+}
+
+# I/O mode definitions:
+#   buffered:  direct=0, uncached=0
+#   dontcache: direct=0, uncached=1
+#   direct:    direct=1, uncached=0
+#
+# Mode name from numeric value
+mode_name() {
+	case $1 in
+	0) echo "buffered" ;;
+	1) echo "dontcache" ;;
+	2) echo "direct" ;;
+	esac
+}
+
+# Return fio command-line flags for a given mode.
+# "direct" is a standard fio option and works on the command line.
+# "uncached" is an io_uring engine option that must be in the job file,
+# so we inject it via make_job_file() below.
+mode_fio_args() {
+	case $1 in
+	0) echo "--direct=0" ;;           # buffered
+	1) echo "--direct=0" ;;           # dontcache
+	2) echo "--direct=1" ;;           # direct
+	esac
+}
+
+# Return the uncached= value for a given mode.
+mode_uncached() {
+	case $1 in
+	0) echo "0" ;;
+	1) echo "1" ;;
+	2) echo "0" ;;
+	esac
+}
+
+# Create a temporary job file with uncached=N injected into [global].
+# For uncached=0 (buffered/direct), return the original file unchanged.
+make_job_file() {
+	local job_file=$1
+	local uncached=$2
+
+	if [ "$uncached" -eq 0 ]; then
+		echo "$job_file"
+		return
+	fi
+
+	local tmp
+	tmp=$(mktemp)
+	sed "/^\[global\]/a uncached=${uncached}" "$job_file" > "$tmp"
+	echo "$tmp"
+}
+
+drop_caches() {
+	run_cmd bash -c "sync && echo 3 > /proc/sys/vm/drop_caches"
+}
+
+# perf lock profiling — uses BPF-based live contention tracing
+PERF_LOCK_PID=""
+
+start_perf_lock() {
+	local outdir=$1
+
+	if [ "$PERF_LOCK" -ne 1 ]; then
+		return
+	fi
+
+	log "Starting perf lock contention tracing"
+	perf lock contention -a -b --max-stack 8 \
+		> "${outdir}/perf-lock-contention.txt" 2>&1 &
+	PERF_LOCK_PID=$!
+}
+
+stop_perf_lock() {
+	local outdir=$1
+
+	if [ -z "$PERF_LOCK_PID" ]; then
+		return
+	fi
+
+	log "Stopping perf lock contention tracing"
+	kill -TERM "$PERF_LOCK_PID" 2>/dev/null || true
+	wait "$PERF_LOCK_PID" 2>/dev/null || true
+	PERF_LOCK_PID=""
+}
+
+# Background monitors
+VMSTAT_PID=""
+IOSTAT_PID=""
+MEMINFO_PID=""
+
+start_monitors() {
+	local outdir=$1
+	log "Starting monitors in $outdir"
+	run_cmd vmstat 1 > "${outdir}/vmstat.log" 2>&1 &
+	VMSTAT_PID=$!
+	run_cmd iostat -x 1 > "${outdir}/iostat.log" 2>&1 &
+	IOSTAT_PID=$!
+	(while true; do
+		echo "=== $(date '+%s') ==="
+		cat /proc/meminfo
+		sleep 1
+	done) > "${outdir}/meminfo.log" 2>&1 &
+	MEMINFO_PID=$!
+}
+
+stop_monitors() {
+	log "Stopping monitors"
+	kill "$VMSTAT_PID" "$IOSTAT_PID" "$MEMINFO_PID" 2>/dev/null || true
+	wait "$VMSTAT_PID" "$IOSTAT_PID" "$MEMINFO_PID" 2>/dev/null || true
+}
+
+cleanup_test_files() {
+	local filepath="${TEST_DIR}/$1"
+	log "Cleaning up $filepath"
+	run_cmd rm -f "$filepath"
+}
+
+# Run a single fio benchmark
+run_fio() {
+	local job_file=$1
+	local outdir=$2
+	local filename=$3
+	local fio_size=${4:-$SIZE}
+	local keep=${5:-}
+	local extra_args=${6:-}
+	local uncached=${7:-0}
+
+	# Inject uncached=N into the job file if needed
+	local actual_job
+	actual_job=$(make_job_file "$job_file" "$uncached")
+
+	local job_name
+	job_name=$(basename "$job_file" .fio)
+
+	log "Running fio job: $job_name -> $outdir (file=${TEST_DIR}/$filename size=$fio_size)"
+	mkdir -p "$outdir"
+
+	drop_caches
+	start_monitors "$outdir"
+	# Skip perf lock profiling for precreate/setup runs
+	[ "$keep" != "keep" ] && start_perf_lock "$outdir"
+
+	# shellcheck disable=SC2086
+	run_cmd "$FIO_BIN" "$actual_job" \
+		--output-format=json \
+		--output="${outdir}/${job_name}.json" \
+		--filename="${TEST_DIR}/$filename" \
+		--size="$fio_size" \
+		$extra_args
+
+	[ "$keep" != "keep" ] && stop_perf_lock "$outdir"
+	stop_monitors
+	log "Finished: $job_name"
+
+	# Clean up temp job file if one was created
+	[ "$actual_job" != "$job_file" ] && rm -f "$actual_job"
+
+	if [ "$keep" != "keep" ]; then
+		cleanup_test_files "$filename"
+	fi
+}
+
+########################################################################
+# Preflight
+########################################################################
+preflight() {
+	log "=== Preflight checks ==="
+
+	if ! command -v "$FIO_BIN" &>/dev/null; then
+		echo "ERROR: fio not found at $FIO_BIN"
+		exit 1
+	fi
+
+	if [ ! -d "$TEST_DIR" ]; then
+		echo "ERROR: Test directory $TEST_DIR does not exist"
+		exit 1
+	fi
+
+	# Quick check that RWF_DONTCACHE works on this filesystem
+	local testfile="${TEST_DIR}/.dontcache_test"
+	if ! "$FIO_BIN" --name=test --ioengine=io_uring --rw=write \
+		--bs=4k --size=4k --direct=0 --uncached=1 \
+		--filename="$testfile" 2>/dev/null; then
+		echo "WARNING: RWF_DONTCACHE may not be supported on $TEST_DIR"
+		echo "         (filesystem must support FOP_DONTCACHE)"
+	fi
+	rm -f "$testfile"
+
+	log "Test directory: $TEST_DIR"
+	log "File size: $SIZE"
+	log "fio binary: $FIO_BIN"
+	log "Results: $RESULTS_DIR"
+
+	# Record system info
+	{
+		echo "Timestamp: $(date +%Y%m%d-%H%M%S)"
+		echo "Kernel: $(uname -r)"
+		echo "Hostname: $(hostname)"
+		echo "Filesystem: $(df -T "$TEST_DIR" | tail -1 | awk '{print $2}')"
+		echo "File size: $SIZE"
+		echo "Test dir: $TEST_DIR"
+	} > "${RESULTS_DIR}/sysinfo.txt"
+}
+
+########################################################################
+# Deliverable 1: Single-client benchmarks
+########################################################################
+run_deliverable1() {
+	log "=========================================="
+	log "Deliverable 1: Single-client benchmarks"
+	log "=========================================="
+
+	# Sequential write
+	for mode in $MODES; do
+		local mname
+		mname=$(mode_name $mode)
+		local fio_args
+		fio_args=$(mode_fio_args $mode)
+
+		drop_caches
+		run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+			"${RESULTS_DIR}/seq-write/${mname}" \
+			"seq-write_testfile" "$SIZE" "" "$fio_args" \
+			"$(mode_uncached $mode)"
+	done
+
+	# Random write
+	for mode in $MODES; do
+		local mname
+		mname=$(mode_name $mode)
+		local fio_args
+		fio_args=$(mode_fio_args $mode)
+
+		drop_caches
+		run_fio "${FIO_JOBS_DIR}/rand-write.fio" \
+			"${RESULTS_DIR}/rand-write/${mname}" \
+			"rand-write_testfile" "$SIZE" "" "$fio_args" \
+			"$(mode_uncached $mode)"
+	done
+
+	# Sequential read — pre-create file, then read with each mode
+	log "Pre-creating sequential read test file"
+	run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+		"${RESULTS_DIR}/seq-read/precreate" \
+		"seq-read_testfile" "$SIZE" "keep"
+
+	for rmode in $MODES; do
+		local mname
+		mname=$(mode_name $rmode)
+		local fio_args
+		fio_args=$(mode_fio_args $rmode)
+		local keep="keep"
+		[ "$rmode" -eq 2 ] && keep=""
+
+		drop_caches
+		run_fio "${FIO_JOBS_DIR}/seq-read.fio" \
+			"${RESULTS_DIR}/seq-read/${mname}" \
+			"seq-read_testfile" "$SIZE" "$keep" "$fio_args" \
+			"$(mode_uncached $rmode)"
+	done
+
+	# Random read — pre-create file, then read with each mode
+	log "Pre-creating random read test file"
+	run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+		"${RESULTS_DIR}/rand-read/precreate" \
+		"rand-read_testfile" "$SIZE" "keep"
+
+	for rmode in $MODES; do
+		local mname
+		mname=$(mode_name $rmode)
+		local fio_args
+		fio_args=$(mode_fio_args $rmode)
+		local keep="keep"
+		[ "$rmode" -eq 2 ] && keep=""
+
+		drop_caches
+		run_fio "${FIO_JOBS_DIR}/rand-read.fio" \
+			"${RESULTS_DIR}/rand-read/${mname}" \
+			"rand-read_testfile" "$SIZE" "$keep" "$fio_args" \
+			"$(mode_uncached $rmode)"
+	done
+}
+
+########################################################################
+# Deliverable 2: Multi-client tests
+########################################################################
+run_deliverable2() {
+	log "=========================================="
+	log "Deliverable 2: Noisy-neighbor benchmarks"
+	log "=========================================="
+
+	local num_clients=4
+	local client_size
+	local mem_kb
+	mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
+	client_size="$(( mem_kb / 1024 / num_clients ))M"
+
+	# Scenario A: Multiple writers
+	for mode in $MODES; do
+		local mname
+		mname=$(mode_name $mode)
+		local fio_args
+		fio_args=$(mode_fio_args $mode)
+		local uncached
+		uncached=$(mode_uncached $mode)
+		local actual_job
+		actual_job=$(make_job_file "${FIO_JOBS_DIR}/multi-write.fio" "$uncached")
+		local outdir="${RESULTS_DIR}/multi-write/${mname}"
+		mkdir -p "$outdir"
+
+		drop_caches
+		start_monitors "$outdir"
+		start_perf_lock "$outdir"
+
+		local pids=()
+		for i in $(seq 1 $num_clients); do
+			# shellcheck disable=SC2086
+			run_cmd "$FIO_BIN" "$actual_job" \
+				--output-format=json \
+				--output="${outdir}/client${i}.json" \
+				--filename="${TEST_DIR}/client${i}_testfile" \
+				--size="$client_size" \
+				$fio_args &
+			pids+=($!)
+		done
+
+		local rc=0
+		for pid in "${pids[@]}"; do
+			wait "$pid" || rc=$?
+		done
+
+		stop_perf_lock "$outdir"
+		stop_monitors
+		[ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+		[ "$actual_job" != "${FIO_JOBS_DIR}/multi-write.fio" ] && rm -f "$actual_job"
+		for i in $(seq 1 $num_clients); do
+			cleanup_test_files "client${i}_testfile"
+		done
+	done
+
+	# Scenario C: Noisy writer + latency-sensitive readers
+	for mode in $MODES; do
+		local mname
+		mname=$(mode_name $mode)
+		local fio_args
+		fio_args=$(mode_fio_args $mode)
+		local uncached
+		uncached=$(mode_uncached $mode)
+		local writer_job
+		writer_job=$(make_job_file "${FIO_JOBS_DIR}/noisy-writer.fio" "$uncached")
+		local reader_job
+		reader_job=$(make_job_file "${FIO_JOBS_DIR}/lat-reader.fio" "$uncached")
+		local outdir="${RESULTS_DIR}/noisy-neighbor/${mname}"
+		mkdir -p "$outdir"
+
+		# Pre-create read files
+		for i in $(seq 1 $(( num_clients - 1 ))); do
+			log "Pre-creating read file for reader $i"
+			run_fio "${FIO_JOBS_DIR}/multi-write.fio" \
+				"${outdir}/precreate_reader${i}" \
+				"reader${i}_readfile" \
+				"512M" "keep"
+		done
+		drop_caches
+		start_monitors "$outdir"
+		start_perf_lock "$outdir"
+
+		# Noisy writer
+		# shellcheck disable=SC2086
+		run_cmd "$FIO_BIN" "$writer_job" \
+			--output-format=json \
+			--output="${outdir}/noisy_writer.json" \
+			--filename="${TEST_DIR}/bulk_testfile" \
+			--size="$SIZE" \
+			$fio_args &
+		local writer_pid=$!
+
+		# Latency-sensitive readers
+		local reader_pids=()
+		for i in $(seq 1 $(( num_clients - 1 ))); do
+			# shellcheck disable=SC2086
+			run_cmd "$FIO_BIN" "$reader_job" \
+				--output-format=json \
+				--output="${outdir}/reader${i}.json" \
+				--filename="${TEST_DIR}/reader${i}_readfile" \
+				--size="512M" \
+				$fio_args &
+			reader_pids+=($!)
+		done
+
+		local rc=0
+		wait "$writer_pid" || rc=$?
+		for pid in "${reader_pids[@]}"; do
+			wait "$pid" || rc=$?
+		done
+
+		stop_perf_lock "$outdir"
+		stop_monitors
+		[ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+		[ "$writer_job" != "${FIO_JOBS_DIR}/noisy-writer.fio" ] && rm -f "$writer_job"
+		[ "$reader_job" != "${FIO_JOBS_DIR}/lat-reader.fio" ] && rm -f "$reader_job"
+		cleanup_test_files "bulk_testfile"
+		for i in $(seq 1 $(( num_clients - 1 ))); do
+			cleanup_test_files "reader${i}_readfile"
+		done
+	done
+
+	# Scenario D: Mixed-mode noisy neighbor
+	# dontcache writes + buffered reads
+	local outdir="${RESULTS_DIR}/noisy-neighbor-mixed/dontcache-w_buffered-r"
+	mkdir -p "$outdir"
+	local writer_job
+	writer_job=$(make_job_file "${FIO_JOBS_DIR}/noisy-writer.fio" 1)
+
+	for i in $(seq 1 $(( num_clients - 1 ))); do
+		log "Pre-creating read file for reader $i"
+		run_fio "${FIO_JOBS_DIR}/multi-write.fio" \
+			"${outdir}/precreate_reader${i}" \
+			"reader${i}_readfile" \
+			"512M" "keep"
+	done
+	drop_caches
+	start_monitors "$outdir"
+	start_perf_lock "$outdir"
+
+	# Writer with dontcache
+	run_cmd "$FIO_BIN" "$writer_job" \
+		--output-format=json \
+		--output="${outdir}/noisy_writer.json" \
+		--filename="${TEST_DIR}/bulk_testfile" \
+		--size="$SIZE" \
+		--direct=0 &
+	local writer_pid=$!
+
+	# Readers with buffered (no uncached flag)
+	local reader_pids=()
+	for i in $(seq 1 $(( num_clients - 1 ))); do
+		run_cmd "$FIO_BIN" "${FIO_JOBS_DIR}/lat-reader.fio" \
+			--output-format=json \
+			--output="${outdir}/reader${i}.json" \
+			--filename="${TEST_DIR}/reader${i}_readfile" \
+			--size="512M" \
+			--direct=0 &
+		reader_pids+=($!)
+	done
+
+	local rc=0
+	wait "$writer_pid" || rc=$?
+	for pid in "${reader_pids[@]}"; do
+		wait "$pid" || rc=$?
+	done
+
+	stop_perf_lock "$outdir"
+	stop_monitors
+	[ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+	[ "$writer_job" != "${FIO_JOBS_DIR}/noisy-writer.fio" ] && rm -f "$writer_job"
+	cleanup_test_files "bulk_testfile"
+	for i in $(seq 1 $(( num_clients - 1 ))); do
+		cleanup_test_files "reader${i}_readfile"
+	done
+}
+
+########################################################################
+# Main
+########################################################################
+preflight
+run_deliverable1
+run_deliverable2
+
+log "=========================================="
+log "All benchmarks complete."
+log "Results in: $RESULTS_DIR"
+log "Parse with: scripts/parse-results.sh $RESULTS_DIR"
+log "=========================================="

-- 
2.53.0


^ permalink raw reply related

* [PATCH v3 3/4] testing: add nfsd-io-bench NFS server benchmark suite
From: Jeff Layton @ 2026-04-26 11:56 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Christoph Hellwig, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel, Jeff Layton
In-Reply-To: <20260426-dontcache-v3-0-79eb37da9547@kernel.org>

Add a benchmark suite for testing NFSD I/O mode performance using fio
with the libnfs backend against an NFS server on localhost.  Tests
buffered, dontcache, and direct I/O modes via NFSD debugfs controls.

Includes:
 - fio job files for sequential/random read/write, multi-writer,
   noisy-neighbor, and latency-sensitive reader workloads
 - run-benchmarks.sh: orchestrates test matrix with mode switching
 - parse-results.sh: extracts metrics from fio JSON output
 - setup-server.sh: configures NFS export for testing

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 .../testing/nfsd-io-bench/fio-jobs/lat-reader.fio  |  15 +
 .../testing/nfsd-io-bench/fio-jobs/multi-write.fio |  14 +
 .../nfsd-io-bench/fio-jobs/noisy-writer.fio        |  14 +
 tools/testing/nfsd-io-bench/fio-jobs/rand-read.fio |  15 +
 .../testing/nfsd-io-bench/fio-jobs/rand-write.fio  |  15 +
 tools/testing/nfsd-io-bench/fio-jobs/seq-read.fio  |  14 +
 tools/testing/nfsd-io-bench/fio-jobs/seq-write.fio |  14 +
 .../testing/nfsd-io-bench/scripts/parse-results.sh | 238 +++++++++
 .../nfsd-io-bench/scripts/run-benchmarks.sh        | 591 +++++++++++++++++++++
 .../testing/nfsd-io-bench/scripts/setup-server.sh  |  94 ++++
 10 files changed, 1024 insertions(+)

diff --git a/tools/testing/nfsd-io-bench/fio-jobs/lat-reader.fio b/tools/testing/nfsd-io-bench/fio-jobs/lat-reader.fio
new file mode 100644
index 000000000000..61af37e8b860
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/lat-reader.fio
@@ -0,0 +1,15 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=4k
+numjobs=16
+runtime=300
+time_based=1
+group_reporting=1
+rw=randread
+log_avg_msec=1000
+write_bw_log=latreader
+write_lat_log=latreader
+
+[lat_reader]
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/multi-write.fio b/tools/testing/nfsd-io-bench/fio-jobs/multi-write.fio
new file mode 100644
index 000000000000..16b792aecabb
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/multi-write.fio
@@ -0,0 +1,14 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=1M
+numjobs=16
+time_based=0
+group_reporting=1
+rw=write
+log_avg_msec=1000
+write_bw_log=multiwrite
+write_lat_log=multiwrite
+
+[writer]
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/noisy-writer.fio b/tools/testing/nfsd-io-bench/fio-jobs/noisy-writer.fio
new file mode 100644
index 000000000000..615154a7737e
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/noisy-writer.fio
@@ -0,0 +1,14 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=1M
+numjobs=16
+time_based=0
+group_reporting=1
+rw=write
+log_avg_msec=1000
+write_bw_log=noisywriter
+write_lat_log=noisywriter
+
+[bulk_writer]
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/rand-read.fio b/tools/testing/nfsd-io-bench/fio-jobs/rand-read.fio
new file mode 100644
index 000000000000..501bae7416a8
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/rand-read.fio
@@ -0,0 +1,15 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=4k
+numjobs=16
+runtime=300
+time_based=1
+group_reporting=1
+rw=randread
+log_avg_msec=1000
+write_bw_log=randread
+write_lat_log=randread
+
+[randread]
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/rand-write.fio b/tools/testing/nfsd-io-bench/fio-jobs/rand-write.fio
new file mode 100644
index 000000000000..d891d04197ae
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/rand-write.fio
@@ -0,0 +1,15 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=64k
+numjobs=16
+runtime=300
+time_based=1
+group_reporting=1
+rw=randwrite
+log_avg_msec=1000
+write_bw_log=randwrite
+write_lat_log=randwrite
+
+[randwrite]
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/seq-read.fio b/tools/testing/nfsd-io-bench/fio-jobs/seq-read.fio
new file mode 100644
index 000000000000..6e24ab355026
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/seq-read.fio
@@ -0,0 +1,14 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=1M
+numjobs=16
+time_based=0
+group_reporting=1
+rw=read
+log_avg_msec=1000
+write_bw_log=seqread
+write_lat_log=seqread
+
+[seqread]
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/seq-write.fio b/tools/testing/nfsd-io-bench/fio-jobs/seq-write.fio
new file mode 100644
index 000000000000..260858e345f5
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/seq-write.fio
@@ -0,0 +1,14 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=1M
+numjobs=16
+time_based=0
+group_reporting=1
+rw=write
+log_avg_msec=1000
+write_bw_log=seqwrite
+write_lat_log=seqwrite
+
+[seqwrite]
diff --git a/tools/testing/nfsd-io-bench/scripts/parse-results.sh b/tools/testing/nfsd-io-bench/scripts/parse-results.sh
new file mode 100755
index 000000000000..0427d411db04
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/scripts/parse-results.sh
@@ -0,0 +1,238 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Parse fio JSON output and generate comparison tables.
+#
+# Usage: ./parse-results.sh <results-dir>
+
+set -euo pipefail
+
+if [ $# -lt 1 ]; then
+	echo "Usage: $0 <results-dir>"
+	exit 1
+fi
+
+RESULTS_DIR="$1"
+
+if ! command -v jq &>/dev/null; then
+	echo "ERROR: jq is required"
+	exit 1
+fi
+
+# Extract metrics from a single fio JSON result
+extract_metrics() {
+	local json_file=$1
+	local rw_type=$2  # read or write
+
+	if [ ! -f "$json_file" ]; then
+		echo "N/A N/A N/A N/A N/A N/A"
+		return
+	fi
+
+	jq -r --arg rw "$rw_type" '
+		.jobs[0][$rw] as $d |
+		[
+			(($d.bw // 0) / 1024 | . * 10 | round / 10),    # MB/s
+			($d.iops // 0),                                    # IOPS
+			((($d.clat_ns.mean // 0) / 1000) | . * 10 | round / 10), # avg lat us
+			(($d.clat_ns.percentile["50.000000"] // 0) / 1000), # p50 us
+			(($d.clat_ns.percentile["99.000000"] // 0) / 1000), # p99 us
+			(($d.clat_ns.percentile["99.900000"] // 0) / 1000)  # p99.9 us
+		] | @tsv
+	' "$json_file" 2>/dev/null || echo "N/A N/A N/A N/A N/A N/A"
+}
+
+# Extract server CPU from vmstat log (average sys%)
+extract_cpu() {
+	local vmstat_log=$1
+	if [ ! -f "$vmstat_log" ]; then
+		echo "N/A"
+		return
+	fi
+	# vmstat columns: us sy id wa st — skip header lines
+	awk 'NR>2 {sum+=$14; n++} END {if(n>0) printf "%.1f", sum/n; else print "N/A"}' \
+		"$vmstat_log" 2>/dev/null || echo "N/A"
+}
+
+# Extract peak dirty pages from meminfo log
+extract_peak_dirty() {
+	local meminfo_log=$1
+	if [ ! -f "$meminfo_log" ]; then
+		echo "N/A"
+		return
+	fi
+	grep "^Dirty:" "$meminfo_log" | awk '{print $2}' | sort -n | tail -1 || echo "N/A"
+}
+
+# Extract peak cached from meminfo log
+extract_peak_cached() {
+	local meminfo_log=$1
+	if [ ! -f "$meminfo_log" ]; then
+		echo "N/A"
+		return
+	fi
+	grep "^Cached:" "$meminfo_log" | awk '{print $2}' | sort -n | tail -1 || echo "N/A"
+}
+
+print_separator() {
+	printf '%*s\n' 120 '' | tr ' ' '-'
+}
+
+########################################################################
+# Deliverable 1: Single-client results
+########################################################################
+echo ""
+echo "=================================================================="
+echo "  Deliverable 1: Single-Client fio Benchmarks"
+echo "=================================================================="
+echo ""
+
+for workload in seq-write rand-write seq-read rand-read; do
+	case $workload in
+	seq-write|rand-write) rw_type="write" ;;
+	seq-read|rand-read)   rw_type="read" ;;
+	esac
+
+	echo "--- $workload ---"
+	printf "%-16s %10s %10s %10s %10s %10s %10s %10s %12s %12s\n" \
+		"Mode" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)" "Sys CPU%" "PeakDirty(kB)" "PeakCache(kB)"
+	print_separator
+
+	for mode in buffered dontcache direct; do
+		dir="${RESULTS_DIR}/${workload}/${mode}"
+		json_file=$(find "$dir" -name '*.json' -not -name 'client*' 2>/dev/null | head -1 || true)
+		if [ -z "$json_file" ]; then
+			printf "%-16s %10s\n" "$mode" "(no data)"
+			continue
+		fi
+
+		read -r mbps iops avg_lat p50 p99 p999 <<< \
+			"$(extract_metrics "$json_file" "$rw_type")"
+		cpu=$(extract_cpu "${dir}/vmstat.log")
+		dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+		cached=$(extract_peak_cached "${dir}/meminfo.log")
+
+		printf "%-16s %10s %10s %10s %10s %10s %10s %10s %12s %12s\n" \
+			"$mode" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999" \
+			"$cpu" "${dirty:-N/A}" "${cached:-N/A}"
+	done
+	echo ""
+done
+
+########################################################################
+# Deliverable 2: Multi-client results
+########################################################################
+echo "=================================================================="
+echo "  Deliverable 2: Noisy-Neighbor Benchmarks"
+echo "=================================================================="
+echo ""
+
+# Scenario A: Multiple writers
+echo "--- Scenario A: Multiple Writers ---"
+for mode in buffered dontcache direct; do
+	dir="${RESULTS_DIR}/multi-write/${mode}"
+	if [ ! -d "$dir" ]; then
+		continue
+	fi
+
+	echo "  Mode: $mode"
+	printf "  %-10s %10s %10s %10s %10s %10s %10s\n" \
+		"Client" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+	total_bw=0
+	count=0
+	for json_file in "${dir}"/client*.json; do
+		[ -f "$json_file" ] || continue
+		client=$(basename "$json_file" .json)
+		read -r mbps iops avg_lat p50 p99 p999 <<< \
+			"$(extract_metrics "$json_file" "write")"
+		printf "  %-10s %10s %10s %10s %10s %10s %10s\n" \
+			"$client" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+		total_bw=$(echo "$total_bw + ${mbps:-0}" | bc 2>/dev/null || echo "$total_bw")
+		count=$(( count + 1 ))
+	done
+
+	cpu=$(extract_cpu "${dir}/vmstat.log")
+	dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+	printf "  Aggregate BW: %s MB/s | Sys CPU: %s%% | Peak Dirty: %s kB\n" \
+		"$total_bw" "$cpu" "${dirty:-N/A}"
+	echo ""
+done
+
+# Scenario C: Noisy neighbor
+echo "--- Scenario C: Noisy Writer + Latency-Sensitive Readers ---"
+for mode in buffered dontcache direct; do
+	dir="${RESULTS_DIR}/noisy-neighbor/${mode}"
+	if [ ! -d "$dir" ]; then
+		continue
+	fi
+
+	echo "  Mode: $mode"
+	printf "  %-14s %10s %10s %10s %10s %10s %10s\n" \
+		"Job" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+	# Writer
+	if [ -f "${dir}/noisy_writer.json" ]; then
+		read -r mbps iops avg_lat p50 p99 p999 <<< \
+			"$(extract_metrics "${dir}/noisy_writer.json" "write")"
+		printf "  %-14s %10s %10s %10s %10s %10s %10s\n" \
+			"Bulk writer" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+	fi
+
+	# Readers
+	for json_file in "${dir}"/reader*.json; do
+		[ -f "$json_file" ] || continue
+		reader=$(basename "$json_file" .json)
+		read -r mbps iops avg_lat p50 p99 p999 <<< \
+			"$(extract_metrics "$json_file" "read")"
+		printf "  %-14s %10s %10s %10s %10s %10s %10s\n" \
+			"$reader" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+	done
+
+	cpu=$(extract_cpu "${dir}/vmstat.log")
+	dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+	printf "  Sys CPU: %s%% | Peak Dirty: %s kB\n" "$cpu" "${dirty:-N/A}"
+	echo ""
+done
+
+# Scenario D: Mixed-mode noisy neighbor
+echo "--- Scenario D: Mixed-Mode Noisy Writer + Readers ---"
+for dir in "${RESULTS_DIR}"/noisy-neighbor-mixed/*/; do
+	[ -d "$dir" ] || continue
+	label=$(basename "$dir")
+
+	echo "  Mode: $label"
+	printf "  %-14s %10s %10s %10s %10s %10s %10s\n" \
+		"Job" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+	# Writer
+	if [ -f "${dir}/noisy_writer.json" ]; then
+		read -r mbps iops avg_lat p50 p99 p999 <<< \
+			"$(extract_metrics "${dir}/noisy_writer.json" "write")"
+		printf "  %-14s %10s %10s %10s %10s %10s %10s\n" \
+			"Bulk writer" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+	fi
+
+	# Readers
+	for json_file in "${dir}"/reader*.json; do
+		[ -f "$json_file" ] || continue
+		reader=$(basename "$json_file" .json)
+		read -r mbps iops avg_lat p50 p99 p999 <<< \
+			"$(extract_metrics "$json_file" "read")"
+		printf "  %-14s %10s %10s %10s %10s %10s %10s\n" \
+			"$reader" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+	done
+
+	cpu=$(extract_cpu "${dir}/vmstat.log")
+	dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+	printf "  Sys CPU: %s%% | Peak Dirty: %s kB\n" "$cpu" "${dirty:-N/A}"
+	echo ""
+done
+
+echo "=================================================================="
+echo "  System Info"
+echo "=================================================================="
+if [ -f "${RESULTS_DIR}/sysinfo.txt" ]; then
+	head -6 "${RESULTS_DIR}/sysinfo.txt"
+fi
+echo ""
diff --git a/tools/testing/nfsd-io-bench/scripts/run-benchmarks.sh b/tools/testing/nfsd-io-bench/scripts/run-benchmarks.sh
new file mode 100755
index 000000000000..2b0cf6e79dff
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/scripts/run-benchmarks.sh
@@ -0,0 +1,591 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# NFS server I/O mode benchmark suite
+#
+# Runs fio with the NFS ioengine against an NFS server on localhost,
+# testing buffered, dontcache, and direct I/O modes.
+#
+# Usage: ./run-benchmarks.sh [OPTIONS]
+#
+# Options:
+#   -e EXPORT_PATH   Server export path (default: /export)
+#   -s SIZE          fio file size, should be >= 2x RAM (default: auto-detect)
+#   -r RESULTS_DIR   Where to store results (default: ./results)
+#   -n NFS_VER       NFS version: 3 or 4 (default: 3)
+#   -j FIO_JOBS_DIR  Path to fio job files (default: ../fio-jobs)
+#   -d               Dry run: print commands without executing
+#   -h               Show this help
+
+set -euo pipefail
+
+# Defaults
+EXPORT_PATH="/export"
+SIZE=""
+RESULTS_DIR="./results"
+NFS_VER=3
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+FIO_JOBS_DIR="${SCRIPT_DIR}/../fio-jobs"
+DRY_RUN=0
+MODES="0 1 2"
+PERF_LOCK=0
+
+DEBUGFS_BASE="/sys/kernel/debug/nfsd"
+IO_CACHE_READ="${DEBUGFS_BASE}/io_cache_read"
+IO_CACHE_WRITE="${DEBUGFS_BASE}/io_cache_write"
+DISABLE_SPLICE="${DEBUGFS_BASE}/disable-splice-read"
+
+usage() {
+	echo "Usage: $0 [OPTIONS]"
+	echo "  -e EXPORT_PATH   Server export path (default: /export)"
+	echo "  -s SIZE          fio file size (default: 2x RAM)"
+	echo "  -r RESULTS_DIR   Results directory (default: ./results)"
+	echo "  -n NFS_VER       NFS version: 3 or 4 (default: 3)"
+	echo "  -j FIO_JOBS_DIR  Path to fio job files"
+	echo "  -D               Dontcache only (skip buffered and direct tests)"
+	echo "  -p               Profile kernel lock contention with perf lock"
+	echo "  -d               Dry run"
+	echo "  -h               Help"
+	exit 1
+}
+
+while getopts "e:s:r:n:j:Dpdh" opt; do
+	case $opt in
+	e) EXPORT_PATH="$OPTARG" ;;
+	s) SIZE="$OPTARG" ;;
+	r) RESULTS_DIR="$OPTARG" ;;
+	n) NFS_VER="$OPTARG" ;;
+	j) FIO_JOBS_DIR="$OPTARG" ;;
+	D) MODES="1" ;;
+	p) PERF_LOCK=1 ;;
+	d) DRY_RUN=1 ;;
+	h) usage ;;
+	*) usage ;;
+	esac
+done
+
+# Auto-detect size: 2x total RAM
+if [ -z "$SIZE" ]; then
+	MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
+	MEM_GB=$(( MEM_KB / 1024 / 1024 ))
+	SIZE="$(( MEM_GB * 2 ))G"
+	echo "Auto-detected RAM: ${MEM_GB}G, using file size: ${SIZE}"
+fi
+
+
+log() {
+	echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
+}
+
+run_cmd() {
+	if [ "$DRY_RUN" -eq 1 ]; then
+		echo "  [DRY RUN] $*"
+	else
+		"$@"
+	fi
+}
+
+# Preflight checks
+preflight() {
+	log "=== Preflight checks ==="
+
+	if ! command -v fio &>/dev/null; then
+		echo "ERROR: fio not found in PATH"
+		exit 1
+	fi
+
+	# Check fio has nfs ioengine
+	if ! fio --enghelp=nfs &>/dev/null; then
+		echo "ERROR: fio does not have the nfs ioengine (needs libnfs)"
+		exit 1
+	fi
+
+	# Check debugfs knobs exist
+	for knob in "$IO_CACHE_READ" "$IO_CACHE_WRITE" "$DISABLE_SPLICE"; do
+		if [ ! -f "$knob" ]; then
+			echo "ERROR: $knob not found. Is the kernel new enough?"
+			exit 1
+		fi
+	done
+
+	# Check NFS server is exporting
+	if ! showmount -e localhost 2>/dev/null | grep -q "$EXPORT_PATH"; then
+		echo "WARNING: $EXPORT_PATH not in showmount output, proceeding anyway"
+	fi
+
+	# Print system info
+	echo "Kernel:     $(uname -r)"
+	echo "RAM:        $(awk '/MemTotal/ {printf "%.1f GB", $2/1024/1024}' /proc/meminfo)"
+	echo "Export:     $EXPORT_PATH"
+	echo "NFS ver:    $NFS_VER"
+	echo "File size:  $SIZE"
+	echo "Results:    $RESULTS_DIR"
+	echo ""
+}
+
+# Set server I/O mode via debugfs
+set_io_mode() {
+	local cache_write=$1
+	local cache_read=$2
+	local splice_off=$3
+
+	log "Setting io_cache_write=$cache_write io_cache_read=$cache_read disable-splice-read=$splice_off"
+	run_cmd bash -c "echo $cache_write > $IO_CACHE_WRITE"
+	run_cmd bash -c "echo $cache_read  > $IO_CACHE_READ"
+	run_cmd bash -c "echo $splice_off  > $DISABLE_SPLICE"
+}
+
+# Drop page cache on server
+drop_caches() {
+	log "Dropping page cache"
+	run_cmd bash -c "sync && echo 3 > /proc/sys/vm/drop_caches"
+	sleep 1
+}
+
+# Start background server monitoring
+start_monitors() {
+	local outdir=$1
+
+	log "Starting server monitors in $outdir"
+	run_cmd vmstat 1 > "${outdir}/vmstat.log" 2>&1 &
+	VMSTAT_PID=$!
+
+	run_cmd iostat -x 1 > "${outdir}/iostat.log" 2>&1 &
+	IOSTAT_PID=$!
+
+	# Sample /proc/meminfo every second
+	(while true; do
+		echo "=== $(date '+%s') ==="
+		cat /proc/meminfo
+		sleep 1
+	done) > "${outdir}/meminfo.log" 2>&1 &
+	MEMINFO_PID=$!
+}
+
+# Stop background monitors
+stop_monitors() {
+	log "Stopping monitors"
+	kill "$VMSTAT_PID" "$IOSTAT_PID" "$MEMINFO_PID" 2>/dev/null || true
+	wait "$VMSTAT_PID" "$IOSTAT_PID" "$MEMINFO_PID" 2>/dev/null || true
+}
+
+# perf lock profiling — uses BPF-based live contention tracing
+PERF_LOCK_PID=""
+
+start_perf_lock() {
+	local outdir=$1
+
+	if [ "$PERF_LOCK" -ne 1 ]; then
+		return
+	fi
+
+	log "Starting perf lock contention tracing"
+	perf lock contention -a -b --max-stack 8 \
+		> "${outdir}/perf-lock-contention.txt" 2>&1 &
+	PERF_LOCK_PID=$!
+}
+
+stop_perf_lock() {
+	local outdir=$1
+
+	if [ -z "$PERF_LOCK_PID" ]; then
+		return
+	fi
+
+	log "Stopping perf lock contention tracing"
+	kill -TERM "$PERF_LOCK_PID" 2>/dev/null || true
+	wait "$PERF_LOCK_PID" 2>/dev/null || true
+	PERF_LOCK_PID=""
+}
+
+# Run a single fio benchmark.
+# nfs_url is set in the job files; we pass --filename and --size on
+# the command line to vary the target file and data volume per run.
+# Pass "keep" as 5th arg to preserve the test file after the run.
+run_fio() {
+	local job_file=$1
+	local outdir=$2
+	local filename=$3
+	local fio_size=${4:-$SIZE}
+	local keep=${5:-}
+
+	local job_name
+	job_name=$(basename "$job_file" .fio)
+
+	log "Running fio job: $job_name -> $outdir (file=$filename size=$fio_size)"
+	mkdir -p "$outdir"
+
+	drop_caches
+	start_monitors "$outdir"
+	# Skip perf lock profiling for precreate/setup runs
+	[ "$keep" != "keep" ] && start_perf_lock "$outdir"
+
+	run_cmd fio "$job_file" \
+		--output-format=json \
+		--output="${outdir}/${job_name}.json" \
+		--filename="$filename" \
+		--size="$fio_size"
+
+	[ "$keep" != "keep" ] && stop_perf_lock "$outdir"
+	stop_monitors
+
+	log "Finished: $job_name"
+
+	# Clean up test file to free disk space unless told to keep it
+	if [ "$keep" != "keep" ]; then
+		cleanup_test_files "$filename"
+	fi
+}
+
+# Remove test files from the export to free disk space
+cleanup_test_files() {
+	local filename
+	for filename in "$@"; do
+		local filepath="${EXPORT_PATH}/${filename}"
+		log "Cleaning up: $filepath"
+		run_cmd rm -f "$filepath"
+	done
+}
+
+# Ensure parent directories exist under the export for a given filename
+ensure_export_dirs() {
+	local filename
+	for filename in "$@"; do
+		local dirpath="${EXPORT_PATH}/$(dirname "$filename")"
+		if [ "$dirpath" != "${EXPORT_PATH}/." ] && [ ! -d "$dirpath" ]; then
+			log "Creating directory: $dirpath"
+			run_cmd mkdir -p "$dirpath"
+		fi
+	done
+}
+
+# Mode name from numeric value
+mode_name() {
+	case $1 in
+	0) echo "buffered" ;;
+	1) echo "dontcache" ;;
+	2) echo "direct" ;;
+	esac
+}
+
+########################################################################
+# Deliverable 1: Single-client fio benchmarks
+########################################################################
+run_deliverable1() {
+	log "=========================================="
+	log "Deliverable 1: Single-client fio benchmarks"
+	log "=========================================="
+
+	# Write test matrix:
+	# mode 0 (buffered):    splice on  (default)
+	# mode 1 (dontcache):   splice off (required)
+	# mode 2 (direct):      splice off (required)
+
+	# Sequential write
+	for wmode in $MODES; do
+		local mname
+		mname=$(mode_name $wmode)
+		local splice_off=0
+		[ "$wmode" -ne 0 ] && splice_off=1
+
+		drop_caches
+		set_io_mode "$wmode" 0 "$splice_off"
+		run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+			"${RESULTS_DIR}/seq-write/${mname}" \
+			"seq-write_testfile"
+	done
+
+	# Random write
+	for wmode in $MODES; do
+		local mname
+		mname=$(mode_name $wmode)
+		local splice_off=0
+		[ "$wmode" -ne 0 ] && splice_off=1
+
+		drop_caches
+		set_io_mode "$wmode" 0 "$splice_off"
+		run_fio "${FIO_JOBS_DIR}/rand-write.fio" \
+			"${RESULTS_DIR}/rand-write/${mname}" \
+			"rand-write_testfile"
+	done
+
+	# Sequential read — vary read mode, write stays buffered
+	# Pre-create the file for reading
+	log "Pre-creating sequential read test file"
+	set_io_mode 0 0 0
+	run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+		"${RESULTS_DIR}/seq-read/precreate" \
+		"seq-read_testfile" "$SIZE" "keep"
+
+	# shellcheck disable=SC2086
+	local last_mode
+	last_mode=$(echo $MODES | awk '{print $NF}')
+
+	for rmode in $MODES; do
+		local mname
+		mname=$(mode_name $rmode)
+		local splice_off=0
+		[ "$rmode" -ne 0 ] && splice_off=1
+		# Keep file for subsequent modes; clean up after last
+		local keep="keep"
+		[ "$rmode" = "$last_mode" ] && keep=""
+
+		drop_caches
+		set_io_mode 0 "$rmode" "$splice_off"
+		run_fio "${FIO_JOBS_DIR}/seq-read.fio" \
+			"${RESULTS_DIR}/seq-read/${mname}" \
+			"seq-read_testfile" "$SIZE" "$keep"
+	done
+
+	# Random read — vary read mode, write stays buffered
+	# Pre-create the file for reading
+	log "Pre-creating random read test file"
+	set_io_mode 0 0 0
+	run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+		"${RESULTS_DIR}/rand-read/precreate" \
+		"rand-read_testfile" "$SIZE" "keep"
+
+	for rmode in $MODES; do
+		local mname
+		mname=$(mode_name $rmode)
+		local splice_off=0
+		[ "$rmode" -ne 0 ] && splice_off=1
+		# Keep file for subsequent modes; clean up after last
+		local keep="keep"
+		[ "$rmode" = "$last_mode" ] && keep=""
+
+		drop_caches
+		set_io_mode 0 "$rmode" "$splice_off"
+		run_fio "${FIO_JOBS_DIR}/rand-read.fio" \
+			"${RESULTS_DIR}/rand-read/${mname}" \
+			"rand-read_testfile" "$SIZE" "$keep"
+	done
+}
+
+########################################################################
+# Deliverable 2: Multi-client (simulated with multiple fio jobs)
+########################################################################
+run_deliverable2() {
+	log "=========================================="
+	log "Deliverable 2: Noisy-neighbor benchmarks"
+	log "=========================================="
+
+	local num_clients=4
+	local client_size
+	local mem_kb
+	mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
+	# Each client gets RAM/num_clients so total > RAM
+	client_size="$(( mem_kb / 1024 / num_clients ))M"
+
+	# Scenario A: Multiple writers
+	for mode in $MODES; do
+		local mname
+		mname=$(mode_name $mode)
+		local splice_off=0
+		[ "$mode" -ne 0 ] && splice_off=1
+		local outdir="${RESULTS_DIR}/multi-write/${mname}"
+		mkdir -p "$outdir"
+
+		set_io_mode "$mode" "$mode" "$splice_off"
+		drop_caches
+
+		# Ensure client directories exist on export
+		for i in $(seq 1 $num_clients); do
+			ensure_export_dirs "client${i}/testfile"
+		done
+
+		start_monitors "$outdir"
+		start_perf_lock "$outdir"
+
+		# Launch N parallel fio writers
+		local pids=()
+		for i in $(seq 1 $num_clients); do
+			run_cmd fio "${FIO_JOBS_DIR}/multi-write.fio" \
+				--output-format=json \
+				--output="${outdir}/client${i}.json" \
+				--filename="client${i}/testfile" \
+				--size="$client_size" &
+			pids+=($!)
+		done
+
+		# Wait for all
+		local rc=0
+		for pid in "${pids[@]}"; do
+			wait "$pid" || rc=$?
+		done
+
+		stop_perf_lock "$outdir"
+		stop_monitors
+		[ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+		# Clean up test files
+		for i in $(seq 1 $num_clients); do
+			cleanup_test_files "client${i}/testfile"
+		done
+	done
+
+	# Scenario C: Noisy writer + latency-sensitive readers
+	for mode in $MODES; do
+		local mname
+		mname=$(mode_name $mode)
+		local splice_off=0
+		[ "$mode" -ne 0 ] && splice_off=1
+		local outdir="${RESULTS_DIR}/noisy-neighbor/${mname}"
+		mkdir -p "$outdir"
+
+		set_io_mode "$mode" "$mode" "$splice_off"
+		drop_caches
+
+		# Pre-create read files for latency readers
+		for i in $(seq 1 $(( num_clients - 1 ))); do
+			ensure_export_dirs "reader${i}/readfile"
+			log "Pre-creating read file for reader $i"
+			run_fio "${FIO_JOBS_DIR}/multi-write.fio" \
+				"${outdir}/precreate_reader${i}" \
+				"reader${i}/readfile" \
+				"512M" "keep"
+		done
+		drop_caches
+		ensure_export_dirs "bulk/testfile"
+		start_monitors "$outdir"
+		start_perf_lock "$outdir"
+
+		# Noisy writer
+		run_cmd fio "${FIO_JOBS_DIR}/noisy-writer.fio" \
+			--output-format=json \
+			--output="${outdir}/noisy_writer.json" \
+			--filename="bulk/testfile" \
+			--size="$SIZE" &
+		local writer_pid=$!
+
+		# Latency-sensitive readers
+		local reader_pids=()
+		for i in $(seq 1 $(( num_clients - 1 ))); do
+			run_cmd fio "${FIO_JOBS_DIR}/lat-reader.fio" \
+				--output-format=json \
+				--output="${outdir}/reader${i}.json" \
+				--filename="reader${i}/readfile" \
+				--size="512M" &
+			reader_pids+=($!)
+		done
+
+		local rc=0
+		wait "$writer_pid" || rc=$?
+		for pid in "${reader_pids[@]}"; do
+			wait "$pid" || rc=$?
+		done
+
+		stop_perf_lock "$outdir"
+		stop_monitors
+		[ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+		# Clean up test files
+		cleanup_test_files "bulk/testfile"
+		for i in $(seq 1 $(( num_clients - 1 ))); do
+			cleanup_test_files "reader${i}/readfile"
+		done
+	done
+	# Scenario D: Mixed-mode noisy neighbor
+	# Test write/read mode combinations where the writer uses a
+	# cache-friendly mode and readers use buffered reads to benefit
+	# from warm cache.
+	local mixed_modes=(
+		# write_mode read_mode label
+		"1 0 dontcache-w_buffered-r"
+	)
+
+	for combo in "${mixed_modes[@]}"; do
+		local wmode rmode label
+		read -r wmode rmode label <<< "$combo"
+		local splice_off=0
+		[ "$wmode" -ne 0 ] && splice_off=1
+		local outdir="${RESULTS_DIR}/noisy-neighbor-mixed/${label}"
+		mkdir -p "$outdir"
+
+		set_io_mode "$wmode" "$rmode" "$splice_off"
+		drop_caches
+
+		# Pre-create read files for latency readers
+		for i in $(seq 1 $(( num_clients - 1 ))); do
+			ensure_export_dirs "reader${i}/readfile"
+			log "Pre-creating read file for reader $i"
+			run_fio "${FIO_JOBS_DIR}/multi-write.fio" \
+				"${outdir}/precreate_reader${i}" \
+				"reader${i}/readfile" \
+				"512M" "keep"
+		done
+		drop_caches
+		ensure_export_dirs "bulk/testfile"
+		start_monitors "$outdir"
+		start_perf_lock "$outdir"
+
+		# Noisy writer
+		run_cmd fio "${FIO_JOBS_DIR}/noisy-writer.fio" \
+			--output-format=json \
+			--output="${outdir}/noisy_writer.json" \
+			--filename="bulk/testfile" \
+			--size="$SIZE" &
+		local writer_pid=$!
+
+		# Latency-sensitive readers
+		local reader_pids=()
+		for i in $(seq 1 $(( num_clients - 1 ))); do
+			run_cmd fio "${FIO_JOBS_DIR}/lat-reader.fio" \
+				--output-format=json \
+				--output="${outdir}/reader${i}.json" \
+				--filename="reader${i}/readfile" \
+				--size="512M" &
+			reader_pids+=($!)
+		done
+
+		local rc=0
+		wait "$writer_pid" || rc=$?
+		for pid in "${reader_pids[@]}"; do
+			wait "$pid" || rc=$?
+		done
+
+		stop_perf_lock "$outdir"
+		stop_monitors
+		[ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+		# Clean up test files
+		cleanup_test_files "bulk/testfile"
+		for i in $(seq 1 $(( num_clients - 1 ))); do
+			cleanup_test_files "reader${i}/readfile"
+		done
+	done
+}
+
+########################################################################
+# Main
+########################################################################
+preflight
+
+TIMESTAMP=$(date '+%Y%m%d-%H%M%S')
+RESULTS_DIR="${RESULTS_DIR}/${TIMESTAMP}"
+mkdir -p "$RESULTS_DIR"
+
+# Save system info
+{
+	echo "Timestamp: $TIMESTAMP"
+	echo "Kernel: $(uname -r)"
+	echo "Hostname: $(hostname)"
+	echo "NFS version: $NFS_VER"
+	echo "File size: $SIZE"
+	echo "Export: $EXPORT_PATH"
+	cat /proc/meminfo
+} > "${RESULTS_DIR}/sysinfo.txt"
+
+log "Results will be saved to: $RESULTS_DIR"
+
+run_deliverable1
+run_deliverable2
+
+# Reset to defaults
+set_io_mode 0 0 0
+
+log "=========================================="
+log "All benchmarks complete."
+log "Results in: $RESULTS_DIR"
+log "Run: scripts/parse-results.sh $RESULTS_DIR"
+log "=========================================="
diff --git a/tools/testing/nfsd-io-bench/scripts/setup-server.sh b/tools/testing/nfsd-io-bench/scripts/setup-server.sh
new file mode 100755
index 000000000000..0efdd74a705e
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/scripts/setup-server.sh
@@ -0,0 +1,94 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# One-time setup script for the NFS test server.
+# Run this once before running benchmarks.
+#
+# Usage: sudo ./setup-server.sh [EXPORT_PATH]
+
+set -euo pipefail
+
+EXPORT_PATH="${1:-/export}"
+FSTYPE="ext4"
+
+log() {
+	echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
+}
+
+if [ "$(id -u)" -ne 0 ]; then
+	echo "ERROR: must run as root"
+	exit 1
+fi
+
+# Check for required tools
+for cmd in fio exportfs showmount jq; do
+	if ! command -v "$cmd" &>/dev/null; then
+		echo "WARNING: $cmd not found, attempting install"
+		dnf install -y "$cmd" 2>/dev/null || \
+		apt-get install -y "$cmd" 2>/dev/null || \
+		echo "ERROR: cannot install $cmd, please install manually"
+	fi
+done
+
+# Check fio has nfs ioengine
+if ! fio --enghelp=nfs &>/dev/null; then
+	echo "ERROR: fio nfs ioengine not available."
+	echo "You may need to install fio with libnfs support."
+	echo "Try: dnf install fio libnfs-devel  (or build fio from source with --enable-nfs)"
+	exit 1
+fi
+
+# Create export directory if needed
+if [ ! -d "$EXPORT_PATH" ]; then
+	log "Creating export directory: $EXPORT_PATH"
+	mkdir -p "$EXPORT_PATH"
+fi
+
+# Create subdirectories for multi-client tests
+for i in 1 2 3 4; do
+	mkdir -p "${EXPORT_PATH}/client${i}"
+	mkdir -p "${EXPORT_PATH}/reader${i}"
+done
+mkdir -p "${EXPORT_PATH}/bulk"
+
+# Check if already exported
+if ! exportfs -s 2>/dev/null | grep -q "$EXPORT_PATH"; then
+	log "Adding NFS export for $EXPORT_PATH"
+	if ! grep -q "$EXPORT_PATH" /etc/exports 2>/dev/null; then
+		echo "${EXPORT_PATH} 127.0.0.1/32(rw,sync,no_root_squash,no_subtree_check)" >> /etc/exports
+	fi
+	exportfs -ra
+fi
+
+# Ensure NFS server is running
+if ! systemctl is-active --quiet nfs-server 2>/dev/null; then
+	log "Starting NFS server"
+	systemctl start nfs-server
+fi
+
+# Verify export
+log "Current exports:"
+showmount -e localhost
+
+# Check debugfs knobs
+log "Checking debugfs knobs:"
+DEBUGFS_BASE="/sys/kernel/debug/nfsd"
+for knob in io_cache_read io_cache_write disable-splice-read; do
+	if [ -f "${DEBUGFS_BASE}/${knob}" ]; then
+		echo "  ${knob} = $(cat "${DEBUGFS_BASE}/${knob}")"
+	else
+		echo "  ${knob}: NOT FOUND (kernel may be too old)"
+	fi
+done
+
+# Print system summary
+echo ""
+log "=== System Summary ==="
+echo "Kernel:      $(uname -r)"
+echo "RAM:         $(awk '/MemTotal/ {printf "%.1f GB", $2/1024/1024}' /proc/meminfo)"
+echo "Export:      $EXPORT_PATH"
+echo "Filesystem:  $(df -T "$EXPORT_PATH" | awk 'NR==2 {print $2}')"
+echo "Disk:        $(df -h "$EXPORT_PATH" | awk 'NR==2 {print $2, "total,", $4, "free"}')"
+echo ""
+log "Setup complete. Run benchmarks with:"
+echo "  sudo ./scripts/run-benchmarks.sh -e $EXPORT_PATH"

-- 
2.53.0


^ permalink raw reply related

* [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
From: Jeff Layton @ 2026-04-26 11:56 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Christoph Hellwig, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel, Jeff Layton
In-Reply-To: <20260426-dontcache-v3-0-79eb37da9547@kernel.org>

The IOCB_DONTCACHE writeback path in generic_write_sync() calls
filemap_flush_range() on every write, submitting writeback inline in
the writer's context.  Perf lock contention profiling shows the
performance problem is not lock contention but the writeback submission
work itself — walking the page tree and submitting I/O blocks the writer
for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
(dontcache).

Replace the inline filemap_flush_range() call with a flusher kick that
drains dirty pages in the background.  This moves writeback submission
completely off the writer's hot path.

To avoid flushing unrelated buffered dirty data, add a dedicated
WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
the new NR_DONTCACHE_DIRTY counter to determine how many pages to write
back.  The flusher writes back that many pages from the oldest dirty
inodes (not restricted to dontcache-specific inodes). This helps
preserve I/O batching while limiting the scope of expedited writeback.

Like WB_start_all, the WB_start_dontcache bit coalesces multiple
DONTCACHE writes into a single flusher wakeup without per-write
allocations.

Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
visibility, and target the correct cgroup writeback domain via
unlocked_inode_to_wb_begin().

dontcache-bench results on dual-socket Xeon Gold 6138 (80 CPUs, 256 GB
RAM, Samsung MZ1LB1T9HALS 1.7 TB NVMe, local XFS, io_uring, file size
~503 GB, compared to a v6.19-ish baseline):

  Single-client sequential write (MB/s):
                       baseline    patched     change
  buffered              1449.8     1440.1      -0.7%
  dontcache             1347.9     1461.5      +8.4%
  direct                1450.0     1440.1      -0.7%

  Single-client sequential write latency (us):
                       baseline    patched     change
  dontcache p50         3031.0    10551.3    +248.1%
  dontcache p99        74973.2    21626.9     -71.2%
  dontcache p99.9      85459.0    23199.7     -72.9%

  Single-client random write (MB/s):
                       baseline    patched     change
  dontcache              284.2      295.4      +3.9%

  Single-client random write p99.9 latency (us):
                       baseline    patched     change
  dontcache             2277.4      872.4     -61.7%

  Multi-writer aggregate throughput (MB/s):
                       baseline    patched     change
  buffered              1619.5     1611.2      -0.5%
  dontcache             1281.1     1629.4     +27.2%
  direct                1545.4     1609.4      +4.1%

  Mixed-mode noisy neighbor (dontcache writer + buffered readers):
                       baseline    patched     change
  writer (MB/s)         1297.6     1471.1     +13.4%
  readers avg (MB/s)     855.0      462.4     -45.9%

nfsd-io-bench results on same hardware (XFS on NVMe, NFSv3 via fio
NFS engine with libnfs, 1024 NFSD threads, pool_mode=pernode,
file size ~502 GB, compared to v6.19-ish baseline):

  Single-client sequential write (MB/s):
                       baseline    patched     change
  buffered              4844.2     4653.4      -3.9%
  dontcache             3028.3     3723.1     +22.9%
  direct                 957.6      987.8      +3.2%

  Single-client sequential write p99.9 latency (us):
                       baseline    patched     change
  dontcache            759169.0   175112.2     -76.9%

  Single-client random write (MB/s):
                       baseline    patched     change
  dontcache              590.0     1561.0    +164.6%

  Multi-writer aggregate throughput (MB/s):
                       baseline    patched     change
  buffered              9636.3     9422.9      -2.2%
  dontcache             1894.9     9442.6    +398.3%
  direct                 809.6      975.1     +20.4%

  Noisy neighbor (dontcache writer + random readers):
                       baseline    patched     change
  writer (MB/s)         1854.5     4063.6    +119.1%
  readers avg (MB/s)     131.2      101.6     -22.5%

The NFS results show even larger improvements than the local benchmarks.
Multi-writer dontcache throughput improves nearly 5x, matching buffered
I/O. Dirty page footprint drops 85-95% in sequential workloads vs.
buffered.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/fs-writeback.c                | 60 ++++++++++++++++++++++++++++++++++++++++
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/fs.h               |  6 ++--
 include/trace/events/writeback.h |  3 +-
 4 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a65694cbfe68..377767db48f7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1334,6 +1334,18 @@ static void wb_start_writeback(struct bdi_writeback *wb, enum wb_reason reason)
 	wb_wakeup(wb);
 }
 
+static void wb_start_dontcache_writeback(struct bdi_writeback *wb)
+{
+	if (!wb_has_dirty_io(wb))
+		return;
+
+	if (test_bit(WB_start_dontcache, &wb->state) ||
+	    test_and_set_bit(WB_start_dontcache, &wb->state))
+		return;
+
+	wb_wakeup(wb);
+}
+
 /**
  * wb_start_background_writeback - start background writeback
  * @wb: bdi_writback to write from
@@ -2373,6 +2385,28 @@ static long wb_check_start_all(struct bdi_writeback *wb)
 	return nr_pages;
 }
 
+static long wb_check_start_dontcache(struct bdi_writeback *wb)
+{
+	long nr_pages;
+
+	if (!test_bit(WB_start_dontcache, &wb->state))
+		return 0;
+
+	nr_pages = global_node_page_state(NR_DONTCACHE_DIRTY);
+	if (nr_pages) {
+		struct wb_writeback_work work = {
+			.nr_pages	= wb_split_bdi_pages(wb, nr_pages),
+			.sync_mode	= WB_SYNC_NONE,
+			.range_cyclic	= 1,
+			.reason		= WB_REASON_DONTCACHE,
+		};
+
+		nr_pages = wb_writeback(wb, &work);
+	}
+
+	clear_bit(WB_start_dontcache, &wb->state);
+	return nr_pages;
+}
 
 /*
  * Retrieve work items and do the writeback they describe
@@ -2394,6 +2428,11 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 	 */
 	wrote += wb_check_start_all(wb);
 
+	/*
+	 * Check for dontcache writeback request
+	 */
+	wrote += wb_check_start_dontcache(wb);
+
 	/*
 	 * Check for periodic writeback, kupdated() style
 	 */
@@ -2468,6 +2507,27 @@ void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
 	rcu_read_unlock();
 }
 
+/**
+ * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
+ * @mapping:	address_space that was just written to
+ *
+ * Kick the writeback flusher thread to expedite writeback of dontcache
+ * dirty pages.  Uses a dedicated WB_start_dontcache bit so that only
+ * pages tracked by NR_DONTCACHE_DIRTY are written back, rather than
+ * flushing the entire BDI's dirty pages.
+ */
+void filemap_dontcache_kick_writeback(struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+	struct bdi_writeback *wb;
+	struct wb_lock_cookie cookie = {};
+
+	wb = unlocked_inode_to_wb_begin(inode, &cookie);
+	wb_start_dontcache_writeback(wb);
+	unlocked_inode_to_wb_end(inode, &cookie);
+}
+EXPORT_SYMBOL_GPL(filemap_dontcache_kick_writeback);
+
 /*
  * Wakeup the flusher threads to start writeback of all currently dirty pages
  */
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index a06b93446d10..74f8a9977f5d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -26,6 +26,7 @@ enum wb_state {
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
 	WB_start_all,		/* nr_pages == 0 (all) work pending */
+	WB_start_dontcache,	/* dontcache writeback pending */
 };
 
 enum wb_stat_item {
@@ -55,6 +56,7 @@ enum wb_reason {
 	 */
 	WB_REASON_FORKER_THREAD,
 	WB_REASON_FOREIGN_FLUSH,
+	WB_REASON_DONTCACHE,
 
 	WB_REASON_MAX,
 };
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..df72b42a9e9b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2624,6 +2624,7 @@ extern int __must_check file_write_and_wait_range(struct file *file,
 						loff_t start, loff_t end);
 int filemap_flush_range(struct address_space *mapping, loff_t start,
 		loff_t end);
+void filemap_dontcache_kick_writeback(struct address_space *mapping);
 
 static inline int file_write_and_wait(struct file *file)
 {
@@ -2657,10 +2658,7 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
 		if (ret)
 			return ret;
 	} else if (iocb->ki_flags & IOCB_DONTCACHE) {
-		struct address_space *mapping = iocb->ki_filp->f_mapping;
-
-		filemap_flush_range(mapping, iocb->ki_pos - count,
-				iocb->ki_pos - 1);
+		filemap_dontcache_kick_writeback(iocb->ki_filp->f_mapping);
 	}
 
 	return count;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index bdac0d685a98..13ee076ccd16 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -44,7 +44,8 @@
 	EM( WB_REASON_PERIODIC,			"periodic")		\
 	EM( WB_REASON_FS_FREE_SPACE,		"fs_free_space")	\
 	EM( WB_REASON_FORKER_THREAD,		"forker_thread")	\
-	EMe(WB_REASON_FOREIGN_FLUSH,		"foreign_flush")
+	EM( WB_REASON_FOREIGN_FLUSH,		"foreign_flush")	\
+	EMe(WB_REASON_DONTCACHE,		"dontcache")
 
 WB_WORK_REASON
 

-- 
2.53.0


^ permalink raw reply related

* [PATCH v3 1/4] mm: add NR_DONTCACHE_DIRTY node page counter
From: Jeff Layton @ 2026-04-26 11:56 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Christoph Hellwig, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel, Jeff Layton
In-Reply-To: <20260426-dontcache-v3-0-79eb37da9547@kernel.org>

Add a per-node page counter that tracks the number of dirty pages with
the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE writes).

Increment the counter alongside NR_FILE_DIRTY in folio_account_dirtied()
when the folio has the dropbehind flag set, and decrement it in
folio_clear_dirty_for_io(), folio_account_cleaned(), and when a
non-DONTCACHE access clears the dropbehind flag on a dirty folio.

The counter is visible via /proc/vmstat as "nr_dontcache_dirty" and
will be used by the writeback flusher to determine how many pages to
write back when expediting writeback for IOCB_DONTCACHE writes, without
flushing the entire BDI's dirty pages.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/mmzone.h | 1 +
 mm/filemap.c           | 6 +++++-
 mm/page-writeback.c    | 7 +++++++
 mm/vmstat.c            | 1 +
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9adb2ad21da5..ed9cc61c7627 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -259,6 +259,7 @@ enum node_stat_item {
 			   only modified from process context */
 	NR_FILE_PAGES,
 	NR_FILE_DIRTY,
+	NR_DONTCACHE_DIRTY,
 	NR_WRITEBACK,
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_SHMEM_THPS,
diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..45089fde5150 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2052,8 +2052,12 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
 	if (!folio)
 		return ERR_PTR(-ENOENT);
 	/* not an uncached lookup, clear uncached if set */
-	if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE))
+	if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE)) {
+		if (folio_test_dirty(folio))
+			lruvec_stat_mod_folio(folio, NR_DONTCACHE_DIRTY,
+					      -folio_nr_pages(folio));
 		folio_clear_dropbehind(folio);
+	}
 	return folio;
 }
 EXPORT_SYMBOL(__filemap_get_folio_mpol);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 88cd53d4ba09..e1df93fb3e3b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2630,6 +2630,8 @@ static void folio_account_dirtied(struct folio *folio,
 		wb = inode_to_wb(inode);
 
 		lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, nr);
+		if (folio_test_dropbehind(folio))
+			lruvec_stat_mod_folio(folio, NR_DONTCACHE_DIRTY, nr);
 		__zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr);
 		__node_stat_mod_folio(folio, NR_DIRTIED, nr);
 		wb_stat_mod(wb, WB_RECLAIMABLE, nr);
@@ -2651,6 +2653,8 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
 	long nr = folio_nr_pages(folio);
 
 	lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
+	if (folio_test_dropbehind(folio))
+		lruvec_stat_mod_folio(folio, NR_DONTCACHE_DIRTY, -nr);
 	zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
 	wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
 	task_io_account_cancelled_write(nr * PAGE_SIZE);
@@ -2920,6 +2924,9 @@ bool folio_clear_dirty_for_io(struct folio *folio)
 		if (folio_test_clear_dirty(folio)) {
 			long nr = folio_nr_pages(folio);
 			lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
+			if (folio_test_dropbehind(folio))
+				lruvec_stat_mod_folio(folio,
+						NR_DONTCACHE_DIRTY, -nr);
 			zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
 			wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
 			ret = true;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..c3e5dfadb9a5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1240,6 +1240,7 @@ const char * const vmstat_text[] = {
 	[I(NR_FILE_MAPPED)]			= "nr_mapped",
 	[I(NR_FILE_PAGES)]			= "nr_file_pages",
 	[I(NR_FILE_DIRTY)]			= "nr_dirty",
+	[I(NR_DONTCACHE_DIRTY)]			= "nr_dontcache_dirty",
 	[I(NR_WRITEBACK)]			= "nr_writeback",
 	[I(NR_SHMEM)]				= "nr_shmem",
 	[I(NR_SHMEM_THPS)]			= "nr_shmem_hugepages",

-- 
2.53.0


^ permalink raw reply related

* [PATCH v3 0/4] mm: improve write performance with RWF_DONTCACHE
From: Jeff Layton @ 2026-04-26 11:56 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Christoph Hellwig, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel, Jeff Layton

This patch series attempts to improve write performance with
RWF_DONTCACHE. The main justification and benchmarks for the series are
in patch #2.

This version implements a scheme that Jan Kara and Christoph Hellwig
suggested during review of the earlier series: after a DONTCACHE write,
kick the flusher thread to do an amount of writeback proportional to the
amount written, but don't target any particular inode or pages when
doing writeback.

The second patch in the series has a summary of the benchmark results.
This seems to work as well or better than the earlier approaches.

The benchmarks I used are in the last two patches. I'm not sure if we
want to merge those into the tree as they are (mostly) AI slop. There
is probably a better tool for this out there.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Changes in v3:
- Track dirty DONTCACHE pages in the VM
- Have flusher write back a proportional number of pages after DONTCACHE write
- Link to v2: https://lore.kernel.org/r/20260408-dontcache-v2-0-948dec1e756b@kernel.org

Changes in v2:
- kick flusher thread instead of initiating writeback inline
- add mechanism to run 'perf lock' around the testcases
- Link to v1: https://lore.kernel.org/r/20260401-dontcache-v1-0-1f5746fab47a@kernel.org

---
Jeff Layton (4):
      mm: add NR_DONTCACHE_DIRTY node page counter
      mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
      testing: add nfsd-io-bench NFS server benchmark suite
      testing: add dontcache-bench local filesystem benchmark suite

 fs/fs-writeback.c                                  |  60 +++
 include/linux/backing-dev-defs.h                   |   2 +
 include/linux/fs.h                                 |   6 +-
 include/linux/mmzone.h                             |   1 +
 include/trace/events/writeback.h                   |   3 +-
 mm/filemap.c                                       |   6 +-
 mm/page-writeback.c                                |   7 +
 mm/vmstat.c                                        |   1 +
 .../dontcache-bench/fio-jobs/lat-reader.fio        |  12 +
 .../dontcache-bench/fio-jobs/multi-write.fio       |   9 +
 .../dontcache-bench/fio-jobs/noisy-writer.fio      |  12 +
 .../testing/dontcache-bench/fio-jobs/rand-read.fio |  13 +
 .../dontcache-bench/fio-jobs/rand-write.fio        |  13 +
 .../testing/dontcache-bench/fio-jobs/seq-read.fio  |  13 +
 .../testing/dontcache-bench/fio-jobs/seq-write.fio |  13 +
 .../dontcache-bench/scripts/parse-results.sh       | 238 +++++++++
 .../dontcache-bench/scripts/run-benchmarks.sh      | 562 ++++++++++++++++++++
 .../testing/nfsd-io-bench/fio-jobs/lat-reader.fio  |  15 +
 .../testing/nfsd-io-bench/fio-jobs/multi-write.fio |  14 +
 .../nfsd-io-bench/fio-jobs/noisy-writer.fio        |  14 +
 tools/testing/nfsd-io-bench/fio-jobs/rand-read.fio |  15 +
 .../testing/nfsd-io-bench/fio-jobs/rand-write.fio  |  15 +
 tools/testing/nfsd-io-bench/fio-jobs/seq-read.fio  |  14 +
 tools/testing/nfsd-io-bench/fio-jobs/seq-write.fio |  14 +
 .../testing/nfsd-io-bench/scripts/parse-results.sh | 238 +++++++++
 .../nfsd-io-bench/scripts/run-benchmarks.sh        | 591 +++++++++++++++++++++
 .../testing/nfsd-io-bench/scripts/setup-server.sh  |  94 ++++
 27 files changed, 1989 insertions(+), 6 deletions(-)
---
base-commit: 27d128c1cff64c3b8012cc56dd5a1391bb4f1821
change-id: 20260401-dontcache-5811efd7eaf3

Best regards,
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply

* [PATCH] mm/page_alloc: add tracepoint for PCP refills
From: Bunyod Suvonov @ 2026-04-25  9:13 UTC (permalink / raw)
  To: akpm, vbabka, linux-mm
  Cc: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel,
	linux-kernel, surenb, mhocko, jackmanb, hannes, ziy,
	Bunyod Suvonov

The page allocator already has mm_page_pcpu_drain to trace pages
drained from the per-cpu page lists back to the buddy allocator. There
is no matching tracepoint for the opposite direction, where
rmqueue_bulk() refills a PCP list from the buddy allocator.

mm_page_alloc_zone_locked is not a good substitute for this. It is
emitted from __rmqueue_smallest(), which is used both by rmqueue_bulk()
and by the direct buddy allocation path. Its percpu_refill field is
derived from the allocation order and migratetype, so it does not
reliably identify whether the allocation came from a PCP refill.

Add mm_page_pcpu_refill and emit it from rmqueue_bulk() for each page
added to the PCP list. The new tracepoint uses the same page, order and
migratetype fields as mm_page_pcpu_drain, making refill and drain
activity directly comparable.

Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn>
---
 include/trace/events/kmem.h | 23 +++++++++++++++++++++++
 mm/page_alloc.c             |  1 +
 2 files changed, 24 insertions(+)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index cd7920c81f85..16985604fc51 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -243,6 +243,29 @@ DEFINE_EVENT(mm_page, mm_page_alloc_zone_locked,
 	TP_ARGS(page, order, migratetype, percpu_refill)
 );

+TRACE_EVENT(mm_page_pcpu_refill,
+
+	TP_PROTO(struct page *page, unsigned int order, int migratetype),
+
+	TP_ARGS(page, order, migratetype),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	pfn		)
+		__field(	unsigned int,	order		)
+		__field(	int,		migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->pfn		= page ? page_to_pfn(page) : -1UL;
+		__entry->order		= order;
+		__entry->migratetype	= migratetype;
+	),
+
+	TP_printk("page=%p pfn=0x%lx order=%d migratetype=%d",
+		pfn_to_page(__entry->pfn), __entry->pfn,
+		__entry->order, __entry->migratetype)
+);
+
 TRACE_EVENT(mm_page_pcpu_drain,

 	TP_PROTO(struct page *page, unsigned int order, int migratetype),
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65e205111553..a60b73ed39a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2544,6 +2544,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * pages are ordered properly.
 		 */
 		list_add_tail(&page->pcp_list, list);
+		trace_mm_page_pcpu_refill(page, order, migratetype);
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);

-- 
2.53.0

^ permalink raw reply related

* [PATCH] tracing: simplify pages allocation
From: Rosen Penev @ 2026-04-25  1:44 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Kees Cook,
	Gustavo A. R. Silva, open list:TRACING,
	open list:KERNEL HARDENING (not covered by other areas):Keyword:b__counted_by(_le|_be)?b

Change to a flexible array member to allocate together with the array
struct.

Simplifies code slightly by removing no longer correct null checks for
pages and removing kfrees.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 kernel/trace/tracing_map.c | 32 +++++++++++---------------------
 kernel/trace/tracing_map.h |  2 +-
 2 files changed, 12 insertions(+), 22 deletions(-)

diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
index bf1a507695b6..627cc3fdf69e 100644
--- a/kernel/trace/tracing_map.c
+++ b/kernel/trace/tracing_map.c
@@ -288,9 +288,6 @@ static void tracing_map_array_clear(struct tracing_map_array *a)
 {
 	unsigned int i;
 
-	if (!a->pages)
-		return;
-
 	for (i = 0; i < a->n_pages; i++)
 		memset(a->pages[i], 0, PAGE_SIZE);
 }
@@ -302,44 +299,37 @@ static void tracing_map_array_free(struct tracing_map_array *a)
 	if (!a)
 		return;
 
-	if (!a->pages)
-		goto free;
-
 	for (i = 0; i < a->n_pages; i++) {
 		if (!a->pages[i])
 			break;
 		kmemleak_free(a->pages[i]);
 		free_page((unsigned long)a->pages[i]);
 	}
-
-	kfree(a->pages);
-
- free:
-	kfree(a);
 }
 
 static struct tracing_map_array *tracing_map_array_alloc(unsigned int n_elts,
 						  unsigned int entry_size)
 {
 	struct tracing_map_array *a;
+	unsigned int entry_size_shift;
+	unsigned int entries_per_page;
+	unsigned int n_pages;
 	unsigned int i;
 
-	a = kzalloc_obj(*a);
+	entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
+	entries_per_page = PAGE_SIZE / (1 << entry_size_shift);
+	n_pages = max(1, n_elts / entries_per_page);
+
+	a = kzalloc_flex(*a, pages, n_pages);
 	if (!a)
 		return NULL;
 
-	a->entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
-	a->entries_per_page = PAGE_SIZE / (1 << a->entry_size_shift);
-	a->n_pages = n_elts / a->entries_per_page;
-	if (!a->n_pages)
-		a->n_pages = 1;
+	a->entry_size_shift = entry_size_shift;
+	a->entries_per_page = entries_per_page;
+	a->n_pages = n_pages;
 	a->entry_shift = fls(a->entries_per_page) - 1;
 	a->entry_mask = (1 << a->entry_shift) - 1;
 
-	a->pages = kcalloc(a->n_pages, sizeof(void *), GFP_KERNEL);
-	if (!a->pages)
-		goto free;
-
 	for (i = 0; i < a->n_pages; i++) {
 		a->pages[i] = (void *)get_zeroed_page(GFP_KERNEL);
 		if (!a->pages[i])
diff --git a/kernel/trace/tracing_map.h b/kernel/trace/tracing_map.h
index 99c37eeebc16..18a02959d77b 100644
--- a/kernel/trace/tracing_map.h
+++ b/kernel/trace/tracing_map.h
@@ -167,7 +167,7 @@ struct tracing_map_array {
 	unsigned int entry_shift;
 	unsigned int entry_mask;
 	unsigned int n_pages;
-	void **pages;
+	void *pages[] __counted_by(n_pages);
 };
 
 #define TRACING_MAP_ARRAY_ELT(array, idx)				\
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Ackerley Tng @ 2026-04-24 19:08 UTC (permalink / raw)
  To: Michael Roth
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <3blpenhpvysb2ig7efegedx4v3flppl5ftnz6vhpqlatfk3ycn@vmmhs7mvjieg>

Michael Roth <michael.roth@amd.com> writes:

Thank you for your patches!

>
> [...snip...]
>
>>
>> I also did some minor updates (prefixed with a "[squash]" tag) to advertise
>> the KVM_SET_MEMORY_ATTRIBUTES2_PRESERVED flag so it can be used by
>
> Though I'm not sure how we deal with it if SNP/TDX at some point become
> capable of using the PRESERVED flag *after* populate... but maybe that's
> too unlikely to worry about? If we wanted to address it though, we could
> have both PRESERVED and PRESERVED_BEFORE_LAUNCH so they can be
> enumerated separately from the start.
>

Not sure how likely it is, but if SNP and TDX can honor PRESERVE
semantics after populate, I think we could implement support under a new
flag like CIPHER.

CIPHER can then be used to mean "do the encryption or decryption", and
for platforms not supporting encryption, they'd stick with PRESERVE?

Should we redefine the semantics of PRESERVE to be "ensure that memory
contents don't change while guest_memfd tracking is being updated" and
avoid making a commitment on how the guest should read the memory?

The above update would be aligned with ZERO not being allowed for
conversions to private (because KVM/guest_memfd does not make guarantees
about the contract between the host and guest.

This way, all of those (ZERO, PRESERVE) will focus on KVM's interface
with the host.

This lines up for SW_PROTECTED_VMs too, since reading memory that didn't
change in the guest is the contract between SW_PROTECTED_VMs and the
host.

>> userspace for SNP/TDX in the kvm_gmem_populate() path as agreed upon
>> during PUCK.
>>
>>
>> [...snip...]
>>

^ permalink raw reply

* Re: [PATCH 7.2 v16 00/13] khugepaged: mTHP support
From: Matthew Brost @ 2026-04-24 14:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260424071930.62318a9294e07c99ba0ff8a2@linux-foundation.org>

On Fri, Apr 24, 2026 at 07:19:30AM -0700, Andrew Morton wrote:
> On Fri, 24 Apr 2026 07:05:19 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> 
> > On Fri, Apr 24, 2026 at 06:58:28AM -0700, Andrew Morton wrote:
> > > On Sun, 19 Apr 2026 12:57:37 -0600 Nico Pache <npache@redhat.com> wrote:
> > > 
> > > > The following series provides khugepaged with the capability to collapse
> > > > anonymous memory regions to mTHPs.
> > > 
> > > Lots of stuff here:
> > > 	https://sashiko.dev/#/patchset/20260419185750.260784-1-npache@redhat.com
> > > 
> > > It's going to take some time.  Hopefully worthwhile.
> > > 
> > > As always, it's useful to hear about the usefulness of the AI review.
> > 
> > Drive by comment.
> > 
> > On the DRM side sashiko batting average is about .500 but even on misses
> > it is generally is helpful in questioning assumptions made in patches.
> 
> Interesting, thanks.
> 
> Personally, not adding bugs to Linux is so damn important, I'd be happy
> with a lot less than 50%.
> 

I'm convinced enough that any patch I post/merge or even RB, I read
sashiko first.

Matt

> > Matt sashiko
> 
> "A sashiko mat is a decorative or functional mat, such as a coaster,
> table mat, or place mat, made using traditional Japanese, functional,
> and meditative embroidery stitching".
> 
> So there.

^ permalink raw reply

* Re: [PATCH 7.2 v16 00/13] khugepaged: mTHP support
From: Andrew Morton @ 2026-04-24 14:19 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <aet4nz/Ljn0kDjDk@gsse-cloud1.jf.intel.com>

On Fri, 24 Apr 2026 07:05:19 -0700 Matthew Brost <matthew.brost@intel.com> wrote:

> On Fri, Apr 24, 2026 at 06:58:28AM -0700, Andrew Morton wrote:
> > On Sun, 19 Apr 2026 12:57:37 -0600 Nico Pache <npache@redhat.com> wrote:
> > 
> > > The following series provides khugepaged with the capability to collapse
> > > anonymous memory regions to mTHPs.
> > 
> > Lots of stuff here:
> > 	https://sashiko.dev/#/patchset/20260419185750.260784-1-npache@redhat.com
> > 
> > It's going to take some time.  Hopefully worthwhile.
> > 
> > As always, it's useful to hear about the usefulness of the AI review.
> 
> Drive by comment.
> 
> On the DRM side sashiko batting average is about .500 but even on misses
> it is generally is helpful in questioning assumptions made in patches.

Interesting, thanks.

Personally, not adding bugs to Linux is so damn important, I'd be happy
with a lot less than 50%.

> Matt sashiko

"A sashiko mat is a decorative or functional mat, such as a coaster,
table mat, or place mat, made using traditional Japanese, functional,
and meditative embroidery stitching".

So there.

^ permalink raw reply

* Re: [PATCH 7.2 v16 00/13] khugepaged: mTHP support
From: Matthew Brost @ 2026-04-24 14:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260424065828.031775921990de37f83a2468@linux-foundation.org>

On Fri, Apr 24, 2026 at 06:58:28AM -0700, Andrew Morton wrote:
> On Sun, 19 Apr 2026 12:57:37 -0600 Nico Pache <npache@redhat.com> wrote:
> 
> > The following series provides khugepaged with the capability to collapse
> > anonymous memory regions to mTHPs.
> 
> Lots of stuff here:
> 	https://sashiko.dev/#/patchset/20260419185750.260784-1-npache@redhat.com
> 
> It's going to take some time.  Hopefully worthwhile.
> 
> As always, it's useful to hear about the usefulness of the AI review.

Drive by comment.

On the DRM side sashiko batting average is about .500 but even on misses
it is generally is helpful in questioning assumptions made in patches.

Matt sashiko

^ permalink raw reply

* [PATCH] rtla/tests: Add unit tests for actions module
From: Tomas Glozar @ 2026-04-24 14:02 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel

Add unit tests covering all functions in the actions module, including
both valid and invalid inputs and all action types, except for
actions_perform(), where only shell and continue actions are tested.

To support testing multiple modules, the unit test build was modified so
that it links the entire rtla-in.o file. For this to work, the main()
function in rtla.c was declared weak, so that the unit test main is able
to override it.

Other included minor changes to unit tests are:

- Make unit test output verbose to show which tests are being run, now
  that we have more than 3 tests.
- Add unit_tests file to .gitignore.
- Split unit test sources to one file per test suite, and keep only
  main() function in unit_tests.c.
- Fix Makefile dependencies so that "make unit-tests" will rebuild the
  binary with the changes in the commit.

Also with the linking the entire rtla-in.o file, it now has rtla's
nr_cpus symbol, so the declaration in utils unit tests is made extern.

Assisted-by: Composer:composer-2-fast
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
 tools/tracing/rtla/.gitignore               |   1 +
 tools/tracing/rtla/src/rtla.c               |   3 +
 tools/tracing/rtla/tests/unit/Build         |   3 +-
 tools/tracing/rtla/tests/unit/Makefile.unit |   6 +-
 tools/tracing/rtla/tests/unit/actions.c     | 380 ++++++++++++++++++++
 tools/tracing/rtla/tests/unit/unit_tests.c  | 107 +-----
 tools/tracing/rtla/tests/unit/utils.c       | 106 ++++++
 7 files changed, 502 insertions(+), 104 deletions(-)
 create mode 100644 tools/tracing/rtla/tests/unit/actions.c
 create mode 100644 tools/tracing/rtla/tests/unit/utils.c

diff --git a/tools/tracing/rtla/.gitignore b/tools/tracing/rtla/.gitignore
index 4d39d64ac08c..231fb8d67f97 100644
--- a/tools/tracing/rtla/.gitignore
+++ b/tools/tracing/rtla/.gitignore
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 rtla
 rtla-static
+unit_tests
 fixdep
 feature
 FEATURE-DUMP
diff --git a/tools/tracing/rtla/src/rtla.c b/tools/tracing/rtla/src/rtla.c
index 7635c70123ab..3398250076ea 100644
--- a/tools/tracing/rtla/src/rtla.c
+++ b/tools/tracing/rtla/src/rtla.c
@@ -61,6 +61,9 @@ int run_command(int argc, char **argv, int start_position)
 	return 1;
 }
 
+/* Set main as weak to allow overriding it for building unit test binary */
+#pragma weak main
+
 int main(int argc, char *argv[])
 {
 	int retval;
diff --git a/tools/tracing/rtla/tests/unit/Build b/tools/tracing/rtla/tests/unit/Build
index 5f1e531ea8c9..2749f4cf202a 100644
--- a/tools/tracing/rtla/tests/unit/Build
+++ b/tools/tracing/rtla/tests/unit/Build
@@ -1,2 +1,3 @@
+unit_tests-y += utils.o
+unit_tests-y += actions.o
 unit_tests-y += unit_tests.o
-unit_tests-y +=../../src/utils.o
diff --git a/tools/tracing/rtla/tests/unit/Makefile.unit b/tools/tracing/rtla/tests/unit/Makefile.unit
index 2088c9cc3571..bacb00164e46 100644
--- a/tools/tracing/rtla/tests/unit/Makefile.unit
+++ b/tools/tracing/rtla/tests/unit/Makefile.unit
@@ -3,10 +3,10 @@
 UNIT_TESTS := $(OUTPUT)unit_tests
 UNIT_TESTS_IN := $(UNIT_TESTS)-in.o
 
-$(UNIT_TESTS): $(UNIT_TESTS_IN)
-	$(QUIET_LINK)$(CC) $(LDFLAGS) -o $@ $^ -lcheck
+$(UNIT_TESTS): $(UNIT_TESTS_IN) $(RTLA_IN)
+	$(QUIET_LINK)$(CC) $(LDFLAGS) -o $@ $^ $(EXTLIBS) -lcheck
 
-$(UNIT_TESTS_IN):
+$(UNIT_TESTS_IN): fixdep
 	make $(build)=unit_tests
 
 unit-tests: FORCE
diff --git a/tools/tracing/rtla/tests/unit/actions.c b/tools/tracing/rtla/tests/unit/actions.c
new file mode 100644
index 000000000000..a5808ab71a4d
--- /dev/null
+++ b/tools/tracing/rtla/tests/unit/actions.c
@@ -0,0 +1,380 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <check.h>
+#include <signal.h>
+
+#include "../../src/actions.h"
+
+static struct actions actions_fixture;
+
+static void actions_fixture_setup(void)
+{
+	actions_init(&actions_fixture);
+}
+
+static void actions_fixture_teardown(void)
+{
+	actions_destroy(&actions_fixture);
+}
+
+START_TEST(test_actions_init)
+{
+	struct actions actions;
+
+	actions_init(&actions);
+
+	ck_assert_int_eq(actions.len, 0);
+	ck_assert_int_eq(actions.size, action_default_size);
+	ck_assert(!actions.continue_flag);
+	ck_assert_ptr_eq(actions.trace_output_inst, NULL);
+}
+END_TEST
+
+START_TEST(test_actions_destroy)
+{
+	struct actions actions;
+
+	actions_init(&actions);
+	actions_destroy(&actions);
+}
+END_TEST
+
+START_TEST(test_actions_reallocate)
+{
+	struct actions actions;
+	int i;
+
+	actions_init(&actions);
+
+	ck_assert_int_eq(actions.len, 0);
+	ck_assert_int_eq(actions.size, action_default_size);
+
+	/* Fill size of actions array */
+	for (i = 0; i < action_default_size; i++)
+		actions_add_continue(&actions);
+
+	ck_assert_int_eq(actions.len, action_default_size);
+	ck_assert_int_eq(actions.size, action_default_size);
+
+	/* Add one more action to trigger reallocation */
+	actions_add_continue(&actions);
+
+	ck_assert_int_eq(actions.len, action_default_size + 1);
+	ck_assert_int_eq(actions.size, action_default_size * 2);
+
+	actions_destroy(&actions);
+}
+END_TEST
+
+START_TEST(test_actions_add_trace_output)
+{
+	actions_add_trace_output(&actions_fixture, "trace_output.txt");
+
+	ck_assert_int_eq(actions_fixture.len, 1);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_TRACE_OUTPUT);
+	ck_assert_str_eq(actions_fixture.list[0].trace_output, "trace_output.txt");
+	ck_assert(actions_fixture.present[ACTION_TRACE_OUTPUT]);
+}
+END_TEST
+
+START_TEST(test_actions_add_signal)
+{
+	actions_add_signal(&actions_fixture, SIGINT, 1234);
+
+	ck_assert_int_eq(actions_fixture.len, 1);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_SIGNAL);
+	ck_assert_int_eq(actions_fixture.list[0].signal, SIGINT);
+	ck_assert_int_eq(actions_fixture.list[0].pid, 1234);
+	ck_assert(actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_add_shell)
+{
+	actions_add_shell(&actions_fixture, "echo Hello");
+
+	ck_assert_int_eq(actions_fixture.len, 1);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_SHELL);
+	ck_assert_str_eq(actions_fixture.list[0].command, "echo Hello");
+	ck_assert(actions_fixture.present[ACTION_SHELL]);
+}
+END_TEST
+
+START_TEST(test_actions_add_continue)
+{
+	actions_add_continue(&actions_fixture);
+
+	ck_assert_int_eq(actions_fixture.len, 1);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_CONTINUE);
+	ck_assert(actions_fixture.present[ACTION_CONTINUE]);
+}
+END_TEST
+
+START_TEST(test_actions_add_multiple_same_action)
+{
+	actions_add_trace_output(&actions_fixture, "trace1.txt");
+	actions_add_trace_output(&actions_fixture, "trace2.txt");
+
+	ck_assert_int_eq(actions_fixture.len, 2);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_TRACE_OUTPUT);
+	ck_assert_str_eq(actions_fixture.list[0].trace_output, "trace1.txt");
+	ck_assert_int_eq(actions_fixture.list[1].type, ACTION_TRACE_OUTPUT);
+	ck_assert_str_eq(actions_fixture.list[1].trace_output, "trace2.txt");
+	ck_assert(actions_fixture.present[ACTION_TRACE_OUTPUT]);
+}
+END_TEST
+
+START_TEST(test_actions_add_multiple_different_action)
+{
+	actions_add_trace_output(&actions_fixture, "trace_output.txt");
+	actions_add_signal(&actions_fixture, SIGINT, 1234);
+
+	ck_assert_int_eq(actions_fixture.len, 2);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_TRACE_OUTPUT);
+	ck_assert_str_eq(actions_fixture.list[0].trace_output, "trace_output.txt");
+	ck_assert(actions_fixture.present[ACTION_TRACE_OUTPUT]);
+	ck_assert_int_eq(actions_fixture.list[1].type, ACTION_SIGNAL);
+	ck_assert_int_eq(actions_fixture.list[1].signal, SIGINT);
+	ck_assert_int_eq(actions_fixture.list[1].pid, 1234);
+	ck_assert(actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_trace_output)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "trace", "trace.txt"), 0);
+
+	ck_assert_int_eq(actions_fixture.len, 1);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_TRACE_OUTPUT);
+	ck_assert_str_eq(actions_fixture.list[0].trace_output, "trace.txt");
+	ck_assert(actions_fixture.present[ACTION_TRACE_OUTPUT]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_trace_output_arg)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "trace,file=trace2.txt", "trace1.txt"), 0);
+
+	ck_assert_int_eq(actions_fixture.len, 1);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_TRACE_OUTPUT);
+	ck_assert_str_eq(actions_fixture.list[0].trace_output, "trace2.txt");
+	ck_assert(actions_fixture.present[ACTION_TRACE_OUTPUT]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_trace_output_arg_bad)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "trace,foo=bar", "trace_output.txt"), -1);
+
+	ck_assert_int_eq(actions_fixture.len, 0);
+	ck_assert(!actions_fixture.present[ACTION_TRACE_OUTPUT]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "signal,num=1,pid=1234", NULL), 0);
+
+	ck_assert_int_eq(actions_fixture.len, 1);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_SIGNAL);
+	ck_assert_int_eq(actions_fixture.list[0].signal, 1);
+	ck_assert_int_eq(actions_fixture.list[0].pid, 1234);
+	ck_assert(actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal_swapped)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "signal,pid=1234,num=1", NULL), 0);
+
+	ck_assert_int_eq(actions_fixture.len, 1);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_SIGNAL);
+	ck_assert_int_eq(actions_fixture.list[0].signal, 1);
+	ck_assert_int_eq(actions_fixture.list[0].pid, 1234);
+	ck_assert(actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal_parent)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "signal,pid=parent,num=1", NULL), 0);
+
+	ck_assert_int_eq(actions_fixture.len, 1);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_SIGNAL);
+	ck_assert_int_eq(actions_fixture.list[0].signal, 1);
+	ck_assert_int_eq(actions_fixture.list[0].pid, -1);
+	ck_assert(actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal_no_arg)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "signal", NULL), -1);
+
+	ck_assert_int_eq(actions_fixture.len, 0);
+	ck_assert(!actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal_no_pid)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "signal,num=1", NULL), -1);
+
+	ck_assert_int_eq(actions_fixture.len, 0);
+	ck_assert(!actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal_no_num)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "signal,pid=1234", NULL), -1);
+
+	ck_assert_int_eq(actions_fixture.len, 0);
+	ck_assert(!actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_signal_arg_bad)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "signal,foo=bar", NULL), -1);
+
+	ck_assert_int_eq(actions_fixture.len, 0);
+	ck_assert(!actions_fixture.present[ACTION_SIGNAL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_shell)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "shell,command=echo Hello", NULL), 0);
+
+	ck_assert_int_eq(actions_fixture.len, 1);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_SHELL);
+	ck_assert_str_eq(actions_fixture.list[0].command, "echo Hello");
+	ck_assert(actions_fixture.present[ACTION_SHELL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_shell_no_arg)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "shell", NULL), -1);
+
+	ck_assert_int_eq(actions_fixture.len, 0);
+	ck_assert(!actions_fixture.present[ACTION_SHELL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_shell_arg_bad)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "shell,foo=bar", NULL), -1);
+	ck_assert_int_eq(actions_fixture.len, 0);
+	ck_assert(!actions_fixture.present[ACTION_SHELL]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_continue)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "continue", NULL), 0);
+
+	ck_assert_int_eq(actions_fixture.len, 1);
+	ck_assert_int_eq(actions_fixture.list[0].type, ACTION_CONTINUE);
+	ck_assert(actions_fixture.present[ACTION_CONTINUE]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_continue_arg_bad)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "continue,foo=bar", NULL), -1);
+
+	ck_assert_int_eq(actions_fixture.len, 0);
+	ck_assert(!actions_fixture.present[ACTION_CONTINUE]);
+}
+END_TEST
+
+START_TEST(test_actions_parse_invalid)
+{
+	ck_assert_int_eq(actions_parse(&actions_fixture, "foobar", NULL), -1);
+
+	ck_assert_int_eq(actions_fixture.len, 0);
+}
+END_TEST
+
+START_TEST(test_actions_perform_continue)
+{
+	actions_add_continue(&actions_fixture);
+	ck_assert_int_eq(actions_perform(&actions_fixture), 0);
+
+	ck_assert(actions_fixture.continue_flag);
+}
+END_TEST
+
+START_TEST(test_actions_perform_continue_after_successful_shell_command)
+{
+	actions_add_shell(&actions_fixture, "exit 0");
+	actions_add_continue(&actions_fixture);
+	ck_assert_int_eq(actions_perform(&actions_fixture), 0 << 8);
+
+	ck_assert(actions_fixture.continue_flag);
+}
+END_TEST
+
+START_TEST(test_actions_perform_continue_after_failed_shell_command)
+{
+	actions_add_shell(&actions_fixture, "exit 1");
+	actions_add_continue(&actions_fixture);
+	ck_assert_int_eq(actions_perform(&actions_fixture), 1 << 8);
+
+	ck_assert(!actions_fixture.continue_flag);
+}
+END_TEST
+
+Suite *actions_suite(void)
+{
+	Suite *s = suite_create("actions");
+	TCase *tc;
+
+	tc = tcase_create("alloc");
+	tcase_add_test(tc, test_actions_init);
+	tcase_add_test(tc, test_actions_destroy);
+	tcase_add_test(tc, test_actions_reallocate);
+	suite_add_tcase(s, tc);
+
+	tc = tcase_create("add");
+	tcase_add_checked_fixture(tc, actions_fixture_setup, actions_fixture_teardown);
+	tcase_add_test(tc, test_actions_add_trace_output);
+	tcase_add_test(tc, test_actions_add_signal);
+	tcase_add_test(tc, test_actions_add_shell);
+	tcase_add_test(tc, test_actions_add_continue);
+	tcase_add_test(tc, test_actions_add_multiple_same_action);
+	tcase_add_test(tc, test_actions_add_multiple_different_action);
+	suite_add_tcase(s, tc);
+
+	tc = tcase_create("parse");
+	tcase_add_checked_fixture(tc, actions_fixture_setup, actions_fixture_teardown);
+	tcase_add_test(tc, test_actions_parse_trace_output);
+	tcase_add_test(tc, test_actions_parse_trace_output_arg);
+	tcase_add_test(tc, test_actions_parse_trace_output_arg_bad);
+	tcase_add_test(tc, test_actions_parse_signal);
+	tcase_add_test(tc, test_actions_parse_signal_swapped);
+	tcase_add_test(tc, test_actions_parse_signal_parent);
+	tcase_add_test(tc, test_actions_parse_signal_no_arg);
+	tcase_add_test(tc, test_actions_parse_signal_no_pid);
+	tcase_add_test(tc, test_actions_parse_signal_no_num);
+	tcase_add_test(tc, test_actions_parse_signal_arg_bad);
+	tcase_add_test(tc, test_actions_parse_shell);
+	tcase_add_test(tc, test_actions_parse_shell_no_arg);
+	tcase_add_test(tc, test_actions_parse_shell_arg_bad);
+	tcase_add_test(tc, test_actions_parse_continue);
+	tcase_add_test(tc, test_actions_parse_continue_arg_bad);
+	tcase_add_test(tc, test_actions_parse_invalid);
+	suite_add_tcase(s, tc);
+
+	tc = tcase_create("perform");
+	tcase_add_checked_fixture(tc, actions_fixture_setup, actions_fixture_teardown);
+	tcase_add_test(tc, test_actions_perform_continue);
+	tcase_add_test(tc, test_actions_perform_continue_after_successful_shell_command);
+	tcase_add_test(tc, test_actions_perform_continue_after_failed_shell_command);
+	suite_add_tcase(s, tc);
+
+	return s;
+}
diff --git a/tools/tracing/rtla/tests/unit/unit_tests.c b/tools/tracing/rtla/tests/unit/unit_tests.c
index f3c6d89e3300..f87d761f9b12 100644
--- a/tools/tracing/rtla/tests/unit/unit_tests.c
+++ b/tools/tracing/rtla/tests/unit/unit_tests.c
@@ -2,115 +2,22 @@
 
 #define _GNU_SOURCE
 #include <check.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <sched.h>
-#include <limits.h>
-#include <unistd.h>
-#include <sys/sysinfo.h>
+#include <stdbool.h>
 
 #include "../../src/utils.h"
-int nr_cpus;
 
-START_TEST(test_strtoi)
-{
-	int result;
-	char buf[64];
-
-	ck_assert_int_eq(strtoi("123", &result), 0);
-	ck_assert_int_eq(result, 123);
-	ck_assert_int_eq(strtoi(" -456", &result), 0);
-	ck_assert_int_eq(result, -456);
-
-	snprintf(buf, sizeof(buf), "%d", INT_MAX);
-	ck_assert_int_eq(strtoi(buf, &result), 0);
-	snprintf(buf, sizeof(buf), "%ld", (long)INT_MAX + 1);
-	ck_assert_int_eq(strtoi(buf, &result), -1);
-
-	ck_assert_int_eq(strtoi("", &result), -1);
-	ck_assert_int_eq(strtoi("123abc", &result), -1);
-	ck_assert_int_eq(strtoi("123 ", &result), -1);
-}
-END_TEST
-
-START_TEST(test_parse_cpu_set)
-{
-	cpu_set_t set;
+Suite *utils_suite(void);
+Suite *actions_suite(void);
 
-	nr_cpus = 8;
-	ck_assert_int_eq(parse_cpu_set("0", &set), 0);
-	ck_assert(CPU_ISSET(0, &set));
-	ck_assert(!CPU_ISSET(1, &set));
-
-	ck_assert_int_eq(parse_cpu_set("0,2", &set), 0);
-	ck_assert(CPU_ISSET(0, &set));
-	ck_assert(CPU_ISSET(2, &set));
-
-	ck_assert_int_eq(parse_cpu_set("0-3", &set), 0);
-	ck_assert(CPU_ISSET(0, &set));
-	ck_assert(CPU_ISSET(1, &set));
-	ck_assert(CPU_ISSET(2, &set));
-	ck_assert(CPU_ISSET(3, &set));
-
-	ck_assert_int_eq(parse_cpu_set("1-3,5", &set), 0);
-	ck_assert(!CPU_ISSET(0, &set));
-	ck_assert(CPU_ISSET(1, &set));
-	ck_assert(CPU_ISSET(2, &set));
-	ck_assert(CPU_ISSET(3, &set));
-	ck_assert(!CPU_ISSET(4, &set));
-	ck_assert(CPU_ISSET(5, &set));
-
-	ck_assert_int_eq(parse_cpu_set("-1", &set), 1);
-	ck_assert_int_eq(parse_cpu_set("abc", &set), 1);
-	ck_assert_int_eq(parse_cpu_set("9999", &set), 1);
-}
-END_TEST
-
-START_TEST(test_parse_prio)
-{
-	struct sched_attr attr;
-
-	ck_assert_int_eq(parse_prio("f:50", &attr), 0);
-	ck_assert_uint_eq(attr.sched_policy, SCHED_FIFO);
-	ck_assert_uint_eq(attr.sched_priority, 50U);
-
-	ck_assert_int_eq(parse_prio("r:30", &attr), 0);
-	ck_assert_uint_eq(attr.sched_policy, SCHED_RR);
-
-	ck_assert_int_eq(parse_prio("o:0", &attr), 0);
-	ck_assert_uint_eq(attr.sched_policy, SCHED_OTHER);
-	ck_assert_int_eq(attr.sched_nice, 0);
-
-	ck_assert_int_eq(parse_prio("d:10ms:100ms", &attr), 0);
-	ck_assert_uint_eq(attr.sched_policy, 6U);
-
-	ck_assert_int_eq(parse_prio("f:999", &attr), -1);
-	ck_assert_int_eq(parse_prio("o:-20", &attr), -1);
-	ck_assert_int_eq(parse_prio("d:100ms:10ms", &attr), -1);
-	ck_assert_int_eq(parse_prio("x:50", &attr), -1);
-}
-END_TEST
-
-Suite *utils_suite(void)
-{
-	Suite *s = suite_create("utils");
-	TCase *tc = tcase_create("core");
-
-	tcase_add_test(tc, test_strtoi);
-	tcase_add_test(tc, test_parse_cpu_set);
-	tcase_add_test(tc, test_parse_prio);
-
-	suite_add_tcase(s, tc);
-	return s;
-}
-
-int main(void)
+int main(int argc, char *argv[])
 {
 	int num_failed;
 	SRunner *sr;
 
 	sr = srunner_create(utils_suite());
-	srunner_run_all(sr, CK_NORMAL);
+	srunner_add_suite(sr, actions_suite());
+
+	srunner_run_all(sr, CK_VERBOSE);
 	num_failed = srunner_ntests_failed(sr);
 
 	srunner_free(sr);
diff --git a/tools/tracing/rtla/tests/unit/utils.c b/tools/tracing/rtla/tests/unit/utils.c
new file mode 100644
index 000000000000..ce53cab49457
--- /dev/null
+++ b/tools/tracing/rtla/tests/unit/utils.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <check.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sched.h>
+#include <limits.h>
+#include <unistd.h>
+#include <sys/sysinfo.h>
+
+#include "../../src/utils.h"
+
+extern int nr_cpus;
+
+START_TEST(test_strtoi)
+{
+	int result;
+	char buf[64];
+
+	ck_assert_int_eq(strtoi("123", &result), 0);
+	ck_assert_int_eq(result, 123);
+	ck_assert_int_eq(strtoi(" -456", &result), 0);
+	ck_assert_int_eq(result, -456);
+
+	snprintf(buf, sizeof(buf), "%d", INT_MAX);
+	ck_assert_int_eq(strtoi(buf, &result), 0);
+	snprintf(buf, sizeof(buf), "%ld", (long)INT_MAX + 1);
+	ck_assert_int_eq(strtoi(buf, &result), -1);
+
+	ck_assert_int_eq(strtoi("", &result), -1);
+	ck_assert_int_eq(strtoi("123abc", &result), -1);
+	ck_assert_int_eq(strtoi("123 ", &result), -1);
+}
+END_TEST
+
+START_TEST(test_parse_cpu_set)
+{
+	cpu_set_t set;
+
+	nr_cpus = 8;
+	ck_assert_int_eq(parse_cpu_set("0", &set), 0);
+	ck_assert(CPU_ISSET(0, &set));
+	ck_assert(!CPU_ISSET(1, &set));
+
+	ck_assert_int_eq(parse_cpu_set("0,2", &set), 0);
+	ck_assert(CPU_ISSET(0, &set));
+	ck_assert(CPU_ISSET(2, &set));
+
+	ck_assert_int_eq(parse_cpu_set("0-3", &set), 0);
+	ck_assert(CPU_ISSET(0, &set));
+	ck_assert(CPU_ISSET(1, &set));
+	ck_assert(CPU_ISSET(2, &set));
+	ck_assert(CPU_ISSET(3, &set));
+
+	ck_assert_int_eq(parse_cpu_set("1-3,5", &set), 0);
+	ck_assert(!CPU_ISSET(0, &set));
+	ck_assert(CPU_ISSET(1, &set));
+	ck_assert(CPU_ISSET(2, &set));
+	ck_assert(CPU_ISSET(3, &set));
+	ck_assert(!CPU_ISSET(4, &set));
+	ck_assert(CPU_ISSET(5, &set));
+
+	ck_assert_int_eq(parse_cpu_set("-1", &set), 1);
+	ck_assert_int_eq(parse_cpu_set("abc", &set), 1);
+	ck_assert_int_eq(parse_cpu_set("9999", &set), 1);
+}
+END_TEST
+
+START_TEST(test_parse_prio)
+{
+	struct sched_attr attr;
+
+	ck_assert_int_eq(parse_prio("f:50", &attr), 0);
+	ck_assert_uint_eq(attr.sched_policy, SCHED_FIFO);
+	ck_assert_uint_eq(attr.sched_priority, 50U);
+
+	ck_assert_int_eq(parse_prio("r:30", &attr), 0);
+	ck_assert_uint_eq(attr.sched_policy, SCHED_RR);
+
+	ck_assert_int_eq(parse_prio("o:0", &attr), 0);
+	ck_assert_uint_eq(attr.sched_policy, SCHED_OTHER);
+	ck_assert_int_eq(attr.sched_nice, 0);
+
+	ck_assert_int_eq(parse_prio("d:10ms:100ms", &attr), 0);
+	ck_assert_uint_eq(attr.sched_policy, 6U);
+
+	ck_assert_int_eq(parse_prio("f:999", &attr), -1);
+	ck_assert_int_eq(parse_prio("o:-20", &attr), -1);
+	ck_assert_int_eq(parse_prio("d:100ms:10ms", &attr), -1);
+	ck_assert_int_eq(parse_prio("x:50", &attr), -1);
+}
+END_TEST
+
+Suite *utils_suite(void)
+{
+	Suite *s = suite_create("utils");
+	TCase *tc = tcase_create("core");
+
+	tcase_add_test(tc, test_strtoi);
+	tcase_add_test(tc, test_parse_cpu_set);
+	tcase_add_test(tc, test_parse_prio);
+
+	suite_add_tcase(s, tc);
+	return s;
+}
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH 7.2 v16 00/13] khugepaged: mTHP support
From: Andrew Morton @ 2026-04-24 13:58 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>

On Sun, 19 Apr 2026 12:57:37 -0600 Nico Pache <npache@redhat.com> wrote:

> The following series provides khugepaged with the capability to collapse
> anonymous memory regions to mTHPs.

Lots of stuff here:
	https://sashiko.dev/#/patchset/20260419185750.260784-1-npache@redhat.com

It's going to take some time.  Hopefully worthwhile.

As always, it's useful to hear about the usefulness of the AI review.

^ permalink raw reply

* Re: [PATCH v2 2/2] module/kallsyms: sort function symbols and use binary search
From: Stanislaw Gruszka @ 2026-04-24  9:13 UTC (permalink / raw)
  To: Petr Pavlu
  Cc: linux-modules, Sami Tolvanen, Luis Chamberlain, linux-kernel,
	linux-trace-kernel, live-patching, Daniel Gomez, Aaron Tomlin,
	Steven Rostedt, Masami Hiramatsu, Jordan Rome, Viktor Malik
In-Reply-To: <11c8e139-f9f3-4b22-863a-4e021a3947e7@suse.com>

Hi Petr,

thanks for the review.

On Thu, Apr 23, 2026 at 04:00:04PM +0200, Petr Pavlu wrote:
> On 3/27/26 12:00 PM, Stanislaw Gruszka wrote:
> > Module symbol lookup via find_kallsyms_symbol() performs a linear scan
> > over the entire symtab when resolving an address. The number of symbols
> > in module symtabs has grown over the years, largely due to additional
> > metadata in non-standard sections, making this lookup very slow.
> > 
> > Improve this by separating function symbols during module load, placing
> > them at the beginning of the symtab, sorting them by address, and using
> > binary search when resolving addresses in module text.
> > 
> > This also should improve times for linear symbol name lookups, as valid
> > function symbols are now located at the beginning of the symtab.
> > 
> > The cost of sorting is small relative to module load time. In repeated
> > module load tests [1], depending on .config options, this change
> > increases load time between 2% and 4%. With cold caches, the difference
> > is not measurable, as memory access latency dominates.
> > 
> > The sorting theoretically could be done in compile time, but much more
> > complicated as we would have to simulate kernel addresses resolution
> > for symbols, and then correct relocation entries. That would be risky
> > if get out of sync.
> > 
> > The improvement can be observed when listing ftrace filter functions.
> > 
> > Before:
> > 
> > root@nano:~# time cat /sys/kernel/tracing/available_filter_functions | wc -l
> > 74908
> > 
> > real	0m1.315s
> > user	0m0.000s
> > sys	0m1.312s
> > 
> > After:
> > 
> > root@nano:~# time cat /sys/kernel/tracing/available_filter_functions | wc -l
> > 74911
> > 
> > real	0m0.167s
> > user	0m0.004s
> > sys	0m0.175s
> > 
> > (there are three more symbols introduced by the patch)
> > 
> > For livepatch modules, the symtab layout is preserved and the existing
> > linear search is used. For this case, it should be possible to keep
> > the original ELF symtab instead of copying it 1:1, but that is outside
> > the scope of this patch.
> > 
> > Link: https://gist.github.com/sgruszka/09f3fb1dad53a97b1aad96e1927ab117 [1]
> > Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
> 
> Sorry for the delay reviewing this patch.

No problem.

> > ---
> > v1 -> v2: 
> >  - fix searching data symbols for CONFIG_KALLSYMS_ALL
> >  - use kallsyms_symbol_value() in elf_sym_cmp()
> > 
> >  include/linux/module.h   |   1 +
> >  kernel/module/internal.h |   1 +
> >  kernel/module/kallsyms.c | 171 +++++++++++++++++++++++++++++----------
> >  3 files changed, 130 insertions(+), 43 deletions(-)
> > 
> > diff --git a/include/linux/module.h b/include/linux/module.h
> > index ac254525014c..67c053afa882 100644
> > --- a/include/linux/module.h
> > +++ b/include/linux/module.h
> > @@ -379,6 +379,7 @@ struct module_memory {
> >  struct mod_kallsyms {
> >  	Elf_Sym *symtab;
> >  	unsigned int num_symtab;
> > +	unsigned int num_func_syms;
> >  	char *strtab;
> >  	char *typetab;
> >  };
> > diff --git a/kernel/module/internal.h b/kernel/module/internal.h
> > index 618202578b42..6a4d498619b1 100644
> > --- a/kernel/module/internal.h
> > +++ b/kernel/module/internal.h
> > @@ -73,6 +73,7 @@ struct load_info {
> >  	bool sig_ok;
> >  #ifdef CONFIG_KALLSYMS
> >  	unsigned long mod_kallsyms_init_off;
> > +	unsigned long num_func_syms;
> >  #endif
> >  #ifdef CONFIG_MODULE_DECOMPRESS
> >  #ifdef CONFIG_MODULE_STATS
> > diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c
> > index f23126d804b2..d69e99e67707 100644
> > --- a/kernel/module/kallsyms.c
> > +++ b/kernel/module/kallsyms.c
> > @@ -10,6 +10,7 @@
> >  #include <linux/kallsyms.h>
> >  #include <linux/buildid.h>
> >  #include <linux/bsearch.h>
> > +#include <linux/sort.h>
> >  #include "internal.h"
> >  
> >  /* Lookup exported symbol in given range of kernel_symbols */
> > @@ -103,6 +104,95 @@ static bool is_core_symbol(const Elf_Sym *src, const Elf_Shdr *sechdrs,
> >  	return true;
> >  }
> >  
> > +static inline bool is_func_symbol(const Elf_Sym *sym)
> > +{
> > +	return sym->st_shndx != SHN_UNDEF && sym->st_size != 0 &&
> > +	       ELF_ST_TYPE(sym->st_info) == STT_FUNC;
> > +}
> > +
> > +static unsigned int bsearch_func_symbol(struct mod_kallsyms *kallsyms,
> > +					unsigned long addr,
> > +					unsigned long *bestval,
> > +					unsigned long *nextval)
> > +
> > +{
> > +	unsigned int mid, low = 1, high = kallsyms->num_func_syms + 1;
> > +	unsigned int best = 0;
> > +	unsigned long thisval;
> > +
> > +	while (low < high) {
> > +		mid = low + (high - low) / 2;
> > +		thisval = kallsyms_symbol_value(&kallsyms->symtab[mid]);
> > +
> > +		if (thisval <= addr) {
> > +			*bestval = thisval;
> > +			best = mid;
> > +			low = mid + 1;
> 
> If thisval == addr, the search moves to the right and finds the last
> symbol with the same address. I believe it should do the opposite and
> return the first symbol to match the behavior of
> search_kallsyms_symbol().

In the case of multiple symbols sharing the same address, we have
to pick one and ignore the others. I don’t think it matters much which
one is chosen in practice. Also, I expect function symbol addresses
to be unique, so this shouldn’t be a real issue.

> > +		} else {
> > +			*nextval = thisval;
> > +			high = mid;
> > +		}
> > +	}
> > +
> > +	return best;
> > +}
> > +
> > +static const char *kallsyms_symbol_name(struct mod_kallsyms *kallsyms,
> > +					unsigned int symnum)
> > +{
> > +	return kallsyms->strtab + kallsyms->symtab[symnum].st_name;
> > +}
> > +
> > +static unsigned int search_kallsyms_symbol(struct mod_kallsyms *kallsyms,
> > +					   unsigned long addr,
> > +					   unsigned long *bestval,
> > +					   unsigned long *nextval)
> > +{
> > +	unsigned int i, best = 0;
> > +
> > +	/*
> > +	 * Scan for closest preceding symbol and next symbol. (ELF starts
> > +	 * real symbols at 1). Skip the initial function symbols range
> > +	 * if num_func_syms is non-zero, those are handled separately for
> > +	 * the core TEXT segment lookup.
> > +	 */
> > +	for (i = 1 + kallsyms->num_func_syms; i < kallsyms->num_symtab; i++) {
> > +		const Elf_Sym *sym = &kallsyms->symtab[i];
> > +		unsigned long thisval = kallsyms_symbol_value(sym);
> > +
> > +		if (sym->st_shndx == SHN_UNDEF)
> > +			continue;
> > +
> > +		/*
> > +		 * We ignore unnamed symbols: they're uninformative
> > +		 * and inserted at a whim.
> > +		 */
> > +		if (*kallsyms_symbol_name(kallsyms, i) == '\0' ||
> > +		    is_mapping_symbol(kallsyms_symbol_name(kallsyms, i)))
> > +			continue;
> > +
> > +		if (thisval <= addr && thisval > *bestval) {
> > +			best = i;
> > +			*bestval = thisval;
> > +		}
> > +		if (thisval > addr && thisval < *nextval)
> > +			*nextval = thisval;
> > +	}
> > +
> > +	return best;
> > +}
> > +
> > +static int elf_sym_cmp(const void *a, const void *b)
> > +{
> > +	unsigned long val_a = kallsyms_symbol_value((const Elf_Sym *)a);
> > +	unsigned long val_b = kallsyms_symbol_value((const Elf_Sym *)b);
> > +
> > +	if (val_a < val_b)
> > +		return -1;
> > +
> > +	return val_a > val_b;
> 
> Does this comparison function and the sort() call result in stable
> sorting? If val_a and val_b are the same, the sorting should preserve
> the original order.

The kernel’s sort() implementation is not stable.

> > +}
> > +
> >  /*
> >   * We only allocate and copy the strings needed by the parts of symtab
> >   * we keep.  This is simple, but has the effect of making multiple
> > @@ -115,9 +205,10 @@ void layout_symtab(struct module *mod, struct load_info *info)
> >  	Elf_Shdr *symsect = info->sechdrs + info->index.sym;
> >  	Elf_Shdr *strsect = info->sechdrs + info->index.str;
> >  	const Elf_Sym *src;
> > -	unsigned int i, nsrc, ndst, strtab_size = 0;
> > +	unsigned int i, nsrc, ndst, nfunc, strtab_size = 0;
> >  	struct module_memory *mod_mem_data = &mod->mem[MOD_DATA];
> >  	struct module_memory *mod_mem_init_data = &mod->mem[MOD_INIT_DATA];
> > +	bool is_lp_mod = is_livepatch_module(mod);
> >  
> >  	/* Put symbol section at end of init part of module. */
> >  	symsect->sh_flags |= SHF_ALLOC;
> > @@ -129,12 +220,14 @@ void layout_symtab(struct module *mod, struct load_info *info)
> >  	nsrc = symsect->sh_size / sizeof(*src);
> >  
> >  	/* Compute total space required for the core symbols' strtab. */
> > -	for (ndst = i = 0; i < nsrc; i++) {
> > -		if (i == 0 || is_livepatch_module(mod) ||
> > +	for (ndst = nfunc = i = 0; i < nsrc; i++) {
> > +		if (i == 0 || is_lp_mod ||
> >  		    is_core_symbol(src + i, info->sechdrs, info->hdr->e_shnum,
> >  				   info->index.pcpu)) {
> >  			strtab_size += strlen(&info->strtab[src[i].st_name]) + 1;
> >  			ndst++;
> > +			if (!is_lp_mod && is_func_symbol(src + i))
> > +				nfunc++;
> >  		}
> >  	}
> >  
> > @@ -156,6 +249,7 @@ void layout_symtab(struct module *mod, struct load_info *info)
> >  	mod_mem_init_data->size = ALIGN(mod_mem_init_data->size,
> >  					__alignof__(struct mod_kallsyms));
> >  	info->mod_kallsyms_init_off = mod_mem_init_data->size;
> > +	info->num_func_syms = nfunc;
> >  
> >  	mod_mem_init_data->size += sizeof(struct mod_kallsyms);
> >  	info->init_typeoffs = mod_mem_init_data->size;
> > @@ -169,7 +263,7 @@ void layout_symtab(struct module *mod, struct load_info *info)
> >   */
> >  void add_kallsyms(struct module *mod, const struct load_info *info)
> >  {
> > -	unsigned int i, ndst;
> > +	unsigned int i, di, nfunc, ndst;
> >  	const Elf_Sym *src;
> >  	Elf_Sym *dst;
> >  	char *s;
> > @@ -178,6 +272,7 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
> >  	void *data_base = mod->mem[MOD_DATA].base;
> >  	void *init_data_base = mod->mem[MOD_INIT_DATA].base;
> >  	struct mod_kallsyms *kallsyms;
> > +	bool is_lp_mod = is_livepatch_module(mod);
> >  
> >  	kallsyms = init_data_base + info->mod_kallsyms_init_off;
> 
> This code is followed by the initialization of kallsyms:
> 
> 	kallsyms->symtab = (void *)symsec->sh_addr;
> 	kallsyms->num_symtab = symsec->sh_size / sizeof(Elf_Sym);
> 	/* Make sure we get permanent strtab: don't use info->strtab. */
> 	kallsyms->strtab = (void *)info->sechdrs[info->index.str].sh_addr;
> 	kallsyms->typetab = init_data_base + info->init_typeoffs;
> 
> I suggest adding 'kallsyms->num_func_syms = 0;' after the initialization
> of kallsyms->num_symtab.

I relied on zeroed memory initialization, but I can add this explicitly
for clarity.

> > @@ -194,19 +289,28 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
> >  	mod->core_kallsyms.symtab = dst = data_base + info->symoffs;
> >  	mod->core_kallsyms.strtab = s = data_base + info->stroffs;
> >  	mod->core_kallsyms.typetab = data_base + info->core_typeoffs;
> > +
> >  	strtab_size = info->core_typeoffs - info->stroffs;
> >  	src = kallsyms->symtab;
> > -	for (ndst = i = 0; i < kallsyms->num_symtab; i++) {
> > +	ndst = info->num_func_syms + 1;
> > +
> > +	for (nfunc = i = 0; i < kallsyms->num_symtab; i++) {
> >  		kallsyms->typetab[i] = elf_type(src + i, info);
> > -		if (i == 0 || is_livepatch_module(mod) ||
> > +		if (i == 0 || is_lp_mod ||
> >  		    is_core_symbol(src + i, info->sechdrs, info->hdr->e_shnum,
> >  				   info->index.pcpu)) {
> >  			ssize_t ret;
> >  
> > -			mod->core_kallsyms.typetab[ndst] =
> > -				kallsyms->typetab[i];
> > -			dst[ndst] = src[i];
> > -			dst[ndst++].st_name = s - mod->core_kallsyms.strtab;
> > +			if (i == 0)
> > +				di = 0;
> > +			else if (!is_lp_mod && is_func_symbol(src + i))
> > +				di = 1 + nfunc++;
> > +			else
> > +				di = ndst++;
> > +
> > +			mod->core_kallsyms.typetab[di] = kallsyms->typetab[i];
> > +			dst[di] = src[i];
> > +			dst[di].st_name = s - mod->core_kallsyms.strtab;
> >  			ret = strscpy(s, &kallsyms->strtab[src[i].st_name],
> >  				      strtab_size);
> >  			if (ret < 0)
> > @@ -216,9 +320,13 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
> >  		}
> >  	}
> >  
> > +	WARN_ON_ONCE(nfunc != info->num_func_syms);
> > +	sort(dst + 1, nfunc, sizeof(Elf_Sym), elf_sym_cmp, NULL);
> > +
> 
> The code sorts mod->core_kallsyms.symtab but mod->core_kallsyms.typetab
> is not reordered accordingly.

Right, but for function symbols the typetab entries are all 't',
so swapping them does not change the type value. The 'T' vs 't'
distinction is handled later when printing (based on export status).
But the comment explaining skiping adjusting of
mod->core_kallsyms.typetab is needed.

> >  	/* Set up to point into init section. */
> >  	rcu_assign_pointer(mod->kallsyms, kallsyms);
> >  	mod->core_kallsyms.num_symtab = ndst;
> > +	mod->core_kallsyms.num_func_syms = nfunc;
> >  }
> >  
> >  #if IS_ENABLED(CONFIG_STACKTRACE_BUILD_ID)
> > @@ -241,11 +349,6 @@ void init_build_id(struct module *mod, const struct load_info *info)
> >  }
> >  #endif
> >  
> > -static const char *kallsyms_symbol_name(struct mod_kallsyms *kallsyms, unsigned int symnum)
> > -{
> > -	return kallsyms->strtab + kallsyms->symtab[symnum].st_name;
> > -}
> > -
> >  /*
> >   * Given a module and address, find the corresponding symbol and return its name
> >   * while providing its size and offset if needed.
> > @@ -255,7 +358,10 @@ static const char *find_kallsyms_symbol(struct module *mod,
> >  					unsigned long *size,
> >  					unsigned long *offset)
> >  {
> > -	unsigned int i, best = 0;
> > +	unsigned int (*search)(struct mod_kallsyms *kallsyms,
> > +			       unsigned long addr, unsigned long *bestval,
> > +			       unsigned long *nextval);
> > +	unsigned int best;
> >  	unsigned long nextval, bestval;
> >  	struct mod_kallsyms *kallsyms = rcu_dereference(mod->kallsyms);
> >  	struct module_memory *mod_mem = NULL;
> > @@ -266,6 +372,11 @@ static const char *find_kallsyms_symbol(struct module *mod,
> >  			continue;
> >  #endif
> >  		if (within_module_mem_type(addr, mod, type)) {
> > +			if (type == MOD_TEXT && kallsyms->num_func_syms > 0)
> > +				search = bsearch_func_symbol;
> 
> I'm not sure if it is ok to limit the search only to function symbols
> when the address lies in MOD_TEXT. The text can theoretically contain
> non-function symbols.

Yes, the patch assumes that the only valid symbols in the MOD_TEXT
are functions. If there are defined OBJECT symbols in .text, the patch
would break lookup for those.

While it’s theoretically possible (e.g. hand-written assembly placing
data in .text ?), I’m not sure this is a practical concern. In general,
having data in executable segments is discouraged for security reasons. 

> Could this optimization be adjusted to sort all
> MOD_TEXT symbols (excluding anonymous and mapping symbols) and move them
> to the front of the symbol table?

That’s possible. We could track .text sections indices in
__layout_sections() and include all valid symbols from those sections,
and also reorder typetab accordingly.

However, this adds complexity. I would prefer to first confirm whether
OBJECT symbols in MOD_TEXT is a real issue before going in that direction.

Regards
Stanislaw

> > +			else
> > +				search = search_kallsyms_symbol;
> > +
> >  			mod_mem = &mod->mem[type];
> >  			break;
> >  		}
> > @@ -278,33 +389,7 @@ static const char *find_kallsyms_symbol(struct module *mod,
> >  	nextval = (unsigned long)mod_mem->base + mod_mem->size;
> >  	bestval = (unsigned long)mod_mem->base - 1;
> >  
> > -	/*
> > -	 * Scan for closest preceding symbol, and next symbol. (ELF
> > -	 * starts real symbols at 1).
> > -	 */
> > -	for (i = 1; i < kallsyms->num_symtab; i++) {
> > -		const Elf_Sym *sym = &kallsyms->symtab[i];
> > -		unsigned long thisval = kallsyms_symbol_value(sym);
> > -
> > -		if (sym->st_shndx == SHN_UNDEF)
> > -			continue;
> > -
> > -		/*
> > -		 * We ignore unnamed symbols: they're uninformative
> > -		 * and inserted at a whim.
> > -		 */
> > -		if (*kallsyms_symbol_name(kallsyms, i) == '\0' ||
> > -		    is_mapping_symbol(kallsyms_symbol_name(kallsyms, i)))
> > -			continue;
> > -
> > -		if (thisval <= addr && thisval > bestval) {
> > -			best = i;
> > -			bestval = thisval;
> > -		}
> > -		if (thisval > addr && thisval < nextval)
> > -			nextval = thisval;
> > -	}
> > -
> > +	best = search(kallsyms, addr, &bestval, &nextval);
> >  	if (!best)
> >  		return NULL;
> >  
> 
> -- 
> Thanks,
> Petr

^ permalink raw reply

* Re: [PATCH v18 0/8] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu @ 2026-04-24  7:06 UTC (permalink / raw)
  To: Masami Hiramatsu (Google), Steven Rostedt
  Cc: Steven Rostedt, Catalin Marinas, Will Deacon, Mathieu Desnoyers,
	linux-kernel, linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>

Hi Steve,

I added a fix related this series as the 1st one. It can be merged
independently.

Thanks,

On Fri, 24 Apr 2026 15:51:59 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> Hi,
> 
> Here is the 18th version of improvement patches for making persistent
> ring buffers robust to failures.
> The previous version is here:
> 
> https://lore.kernel.org/all/177687458572.932171.10907864814735342737.stgit@mhiramat.tok.corp.google.com/
> 
> This version fixes a newly found bug and some review comments from
> Sashiko[1], also, add 2 cleanups, which includes:
> [1/8] Do not double count the reader_page when verifying persistent
>       ring buffer.
> [2/8] Add Geert's Ack (Thanks!)
> [3/8] Fix to substract BUF_PAGE_HDR_SIZE from meta->subbuf_size
>       to make the limit of commit size.
> [4/8] Reset timestamp of reader_page when the entire cpu_buffer is
>       invalid.
> [5/8] In rb_test_inject_invalid_pages(), changed entry_bytes and
>       idx to unsigned long.
> [7/8] Cleanup persistent ring buffer validation code.
> [8/8] Cleanup buffer_data_page related code.
> 
> [1] https://sashiko.dev/#/patchset/177687458572.932171.10907864814735342737.stgit%40mhiramat.tok.corp.google.com
> 
> Thank you,
> 
> Masami Hiramatsu (Google) (8):
>       ring-buffer: Do not double count the reader_page
>       ring-buffer: Flush and stop persistent ring buffer on panic
>       ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
>       ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
>       ring-buffer: Add persistent ring buffer invalid-page inject test
>       ring-buffer: Show commit numbers in buffer_meta file
>       ring-buffer: Cleanup persistent ring buffer validation
>       ring-buffer: Cleanup buffer_data_page related code
> 
> 
>  arch/alpha/include/asm/Kbuild        |    1 
>  arch/arc/include/asm/Kbuild          |    1 
>  arch/arm/include/asm/Kbuild          |    1 
>  arch/arm64/include/asm/ring_buffer.h |   10 +
>  arch/csky/include/asm/Kbuild         |    1 
>  arch/hexagon/include/asm/Kbuild      |    1 
>  arch/loongarch/include/asm/Kbuild    |    1 
>  arch/m68k/include/asm/Kbuild         |    1 
>  arch/microblaze/include/asm/Kbuild   |    1 
>  arch/mips/include/asm/Kbuild         |    1 
>  arch/nios2/include/asm/Kbuild        |    1 
>  arch/openrisc/include/asm/Kbuild     |    1 
>  arch/parisc/include/asm/Kbuild       |    1 
>  arch/powerpc/include/asm/Kbuild      |    1 
>  arch/riscv/include/asm/Kbuild        |    1 
>  arch/s390/include/asm/Kbuild         |    1 
>  arch/sh/include/asm/Kbuild           |    1 
>  arch/sparc/include/asm/Kbuild        |    1 
>  arch/um/include/asm/Kbuild           |    1 
>  arch/x86/include/asm/Kbuild          |    1 
>  arch/xtensa/include/asm/Kbuild       |    1 
>  include/asm-generic/ring_buffer.h    |   13 +
>  include/linux/ring_buffer.h          |    1 
>  kernel/trace/Kconfig                 |   34 ++
>  kernel/trace/ring_buffer.c           |  472 +++++++++++++++++++++++-----------
>  kernel/trace/trace.c                 |    4 
>  26 files changed, 395 insertions(+), 159 deletions(-)
>  create mode 100644 arch/arm64/include/asm/ring_buffer.h
>  create mode 100644 include/asm-generic/ring_buffer.h
> 
> 
> base-commit: 6170922f137231b98fc568571befef63e1edff3f
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH v18 8/8] ring-buffer: Cleanup buffer_data_page related code
From: Masami Hiramatsu (Google) @ 2026-04-24  6:53 UTC (permalink / raw)
  To: Steven Rostedt, Catalin Marinas, Will Deacon
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Code cleanup related to buffer_data_page for readability,
which includes:
- Introduce rb_data_page_commit() and rb_data_page_size()
- Use 'dpage' for buffer_data_page, instead of 'bpage' because
  'bpage' is used for buffer_page.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 kernel/trace/ring_buffer.c |  112 ++++++++++++++++++++++++--------------------
 1 file changed, 60 insertions(+), 52 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9850a0d8d24b..cb524b2afb7b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -364,21 +364,30 @@ struct buffer_page {
 #define RB_WRITE_MASK		0xfffff
 #define RB_WRITE_INTCNT		(1 << 20)
 
-static void rb_init_page(struct buffer_data_page *bpage)
+static void rb_init_data_page(struct buffer_data_page *bpage)
 {
 	local_set(&bpage->commit, 0);
 	bpage->time_stamp = 0;
 }
 
+static __always_inline long rb_data_page_commit(struct buffer_data_page *dpage)
+{
+	return local_read(&dpage->commit);
+}
+
+static __always_inline long rb_data_page_size(struct buffer_data_page *dpage)
+{
+	return rb_data_page_commit(dpage) & ~RB_MISSED_MASK;
+}
+
 static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
 {
-	return local_read(&bpage->page->commit);
+	return rb_data_page_commit(bpage->page);
 }
 
-/* Size is determined by what has been committed */
 static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
 {
-	return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+	return rb_data_page_size(bpage->page);
 }
 
 static void free_buffer_page(struct buffer_page *bpage)
@@ -419,7 +428,7 @@ static struct buffer_data_page *alloc_cpu_data(int cpu, int order)
 		return NULL;
 
 	dpage = page_address(page);
-	rb_init_page(dpage);
+	rb_init_data_page(dpage);
 
 	return dpage;
 }
@@ -659,7 +668,7 @@ static void verify_event(struct ring_buffer_per_cpu *cpu_buffer,
 	do {
 		if (page == tail_page || WARN_ON_ONCE(stop++ > 100))
 			done = true;
-		commit = local_read(&page->page->commit);
+		commit = rb_page_commit(page);
 		write = local_read(&page->write);
 		if (addr >= (unsigned long)&page->page->data[commit] &&
 		    addr < (unsigned long)&page->page->data[write])
@@ -1906,7 +1915,7 @@ static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
 	 * Even after clearing these bits, a commit value greater than the
 	 * subbuf_size is considered invalid.
 	 */
-	tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+	tail = rb_data_page_size(dpage);
 	if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
 		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
 
@@ -2118,12 +2127,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 
 	/* Reset the reader page */
 	local_set(&cpu_buffer->reader_page->entries, 0);
-	rb_init_page(cpu_buffer->reader_page->page);
+	rb_init_data_page(cpu_buffer->reader_page->page);
 
 	/* Reset all the subbuffers */
 	for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
 		local_set(&head_page->entries, 0);
-		rb_init_page(head_page->page);
+		rb_init_data_page(head_page->page);
 	}
 }
 
@@ -2183,7 +2192,7 @@ static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages, int sc
 		 */
 		for (i = 0; i < meta->nr_subbufs; i++) {
 			meta->buffers[i] = i;
-			rb_init_page(subbuf);
+			rb_init_data_page(subbuf);
 			subbuf += meta->subbuf_size;
 		}
 	}
@@ -2235,7 +2244,7 @@ static int rbm_show(struct seq_file *m, void *v)
 	val -= 2;
 	dpage = rb_range_buffer(cpu_buffer, val);
 	seq_printf(m, "buffer[%ld]:    %d (commit: %ld)\n",
-		   val, meta->buffers[val], dpage ? local_read(&dpage->commit) : -1);
+		   val, meta->buffers[val], dpage ? rb_data_page_commit(dpage) : -1);
 
 	return 0;
 }
@@ -2626,7 +2635,7 @@ static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
 
 		dpage = (void *)(ptr + idx * subbuf_size);
 		/* Skip unused pages */
-		if (!local_read(&dpage->commit))
+		if (!rb_data_page_commit(dpage))
 			continue;
 
 		/*
@@ -2638,7 +2647,7 @@ static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
 			invalid++;
 		} else {
 			/* Count total commit bytes. */
-			entry_bytes += local_read(&dpage->commit) & ~RB_MISSED_MASK;
+			entry_bytes += rb_data_page_size(dpage);
 		}
 	}
 
@@ -4167,8 +4176,7 @@ rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
 		local_set(&cpu_buffer->commit_page->page->commit,
 			  rb_page_write(cpu_buffer->commit_page));
 		RB_WARN_ON(cpu_buffer,
-			   local_read(&cpu_buffer->commit_page->page->commit) &
-			   ~RB_WRITE_MASK);
+			   rb_page_commit(cpu_buffer->commit_page) & ~RB_WRITE_MASK);
 		barrier();
 	}
 
@@ -4540,7 +4548,7 @@ static const char *show_interrupt_level(void)
 	return show_irq_str(level);
 }
 
-static void dump_buffer_page(struct buffer_data_page *bpage,
+static void dump_buffer_page(struct buffer_data_page *dpage,
 			     struct rb_event_info *info,
 			     unsigned long tail)
 {
@@ -4548,12 +4556,12 @@ static void dump_buffer_page(struct buffer_data_page *bpage,
 	u64 ts, delta;
 	int e;
 
-	ts = bpage->time_stamp;
+	ts = dpage->time_stamp;
 	pr_warn("  [%lld] PAGE TIME STAMP\n", ts);
 
 	for (e = 0; e < tail; e += rb_event_length(event)) {
 
-		event = (struct ring_buffer_event *)(bpage->data + e);
+		event = (struct ring_buffer_event *)(dpage->data + e);
 
 		switch (event->type_len) {
 
@@ -4603,7 +4611,7 @@ static atomic_t ts_dump;
 		}							\
 		atomic_inc(&cpu_buffer->record_disabled);		\
 		pr_warn(fmt, ##__VA_ARGS__);				\
-		dump_buffer_page(bpage, info, tail);			\
+		dump_buffer_page(dpage, info, tail);			\
 		atomic_dec(&ts_dump);					\
 		/* There's some cases in boot up that this can happen */ \
 		if (WARN_ON_ONCE(system_state != SYSTEM_BOOTING))	\
@@ -4619,16 +4627,16 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
 			 struct rb_event_info *info,
 			 unsigned long tail)
 {
-	struct buffer_data_page *bpage;
+	struct buffer_data_page *dpage;
 	u64 ts, delta;
 	bool full = false;
 	int ret;
 
-	bpage = info->tail_page->page;
+	dpage = info->tail_page->page;
 
 	if (tail == CHECK_FULL_PAGE) {
 		full = true;
-		tail = local_read(&bpage->commit);
+		tail = rb_data_page_commit(dpage);
 	} else if (info->add_timestamp &
 		   (RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE)) {
 		/* Ignore events with absolute time stamps */
@@ -4639,7 +4647,7 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
 	 * Do not check the first event (skip possible extends too).
 	 * Also do not check if previous events have not been committed.
 	 */
-	if (tail <= 8 || tail > local_read(&bpage->commit))
+	if (tail <= 8 || tail > rb_data_page_commit(dpage))
 		return;
 
 	/*
@@ -4648,7 +4656,7 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
 	if (atomic_inc_return(this_cpu_ptr(&checking)) != 1)
 		goto out;
 
-	ret = rb_read_data_buffer(bpage, tail, cpu_buffer->cpu, &ts, &delta);
+	ret = rb_read_data_buffer(dpage, tail, cpu_buffer->cpu, &ts, &delta);
 	if (ret < 0) {
 		if (delta < ts) {
 			buffer_warn_return("[CPU: %d]ABSOLUTE TIME WENT BACKWARDS: last ts: %lld absolute ts: %lld clock:%pS\n",
@@ -6436,7 +6444,7 @@ static void rb_clear_buffer_page(struct buffer_page *page)
 {
 	local_set(&page->write, 0);
 	local_set(&page->entries, 0);
-	rb_init_page(page->page);
+	rb_init_data_page(page->page);
 	page->read = 0;
 }
 
@@ -6921,7 +6929,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
 	local_irq_restore(flags);
 
 	if (bpage->data) {
-		rb_init_page(bpage->data);
+		rb_init_data_page(bpage->data);
 	} else {
 		bpage->data = alloc_cpu_data(cpu, cpu_buffer->buffer->subbuf_order);
 		if (!bpage->data) {
@@ -6946,8 +6954,8 @@ void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
 				struct buffer_data_read_page *data_page)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
-	struct buffer_data_page *bpage = data_page->data;
-	struct page *page = virt_to_page(bpage);
+	struct buffer_data_page *dpage = data_page->data;
+	struct page *page = virt_to_page(dpage);
 	unsigned long flags;
 
 	if (!buffer || !buffer->buffers || !buffer->buffers[cpu])
@@ -6967,15 +6975,15 @@ void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
 	arch_spin_lock(&cpu_buffer->lock);
 
 	if (!cpu_buffer->free_page) {
-		cpu_buffer->free_page = bpage;
-		bpage = NULL;
+		cpu_buffer->free_page = dpage;
+		dpage = NULL;
 	}
 
 	arch_spin_unlock(&cpu_buffer->lock);
 	local_irq_restore(flags);
 
  out:
-	free_pages((unsigned long)bpage, data_page->order);
+	free_pages((unsigned long)dpage, data_page->order);
 	kfree(data_page);
 }
 EXPORT_SYMBOL_GPL(ring_buffer_free_read_page);
@@ -7020,7 +7028,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 {
 	struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
 	struct ring_buffer_event *event;
-	struct buffer_data_page *bpage;
+	struct buffer_data_page *dpage;
 	struct buffer_page *reader;
 	unsigned long missed_events;
 	unsigned int commit;
@@ -7046,8 +7054,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	if (data_page->order != buffer->subbuf_order)
 		return -1;
 
-	bpage = data_page->data;
-	if (!bpage)
+	dpage = data_page->data;
+	if (!dpage)
 		return -1;
 
 	guard(raw_spinlock_irqsave)(&cpu_buffer->reader_lock);
@@ -7113,7 +7121,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 			 * We have already ensured there's enough space if this
 			 * is a time extend. */
 			size = rb_event_length(event);
-			memcpy(bpage->data + pos, rpage->data + rpos, size);
+			memcpy(dpage->data + pos, rpage->data + rpos, size);
 
 			len -= size;
 
@@ -7129,9 +7137,9 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 			size = rb_event_ts_length(event);
 		} while (len >= size);
 
-		/* update bpage */
-		local_set(&bpage->commit, pos);
-		bpage->time_stamp = save_timestamp;
+		/* update dpage */
+		local_set(&dpage->commit, pos);
+		dpage->time_stamp = save_timestamp;
 
 		/* we copied everything to the beginning */
 		read = 0;
@@ -7141,13 +7149,13 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 		cpu_buffer->read_bytes += rb_page_size(reader);
 
 		/* swap the pages */
-		rb_init_page(bpage);
-		bpage = reader->page;
+		rb_init_data_page(dpage);
+		dpage = reader->page;
 		reader->page = data_page->data;
 		local_set(&reader->write, 0);
 		local_set(&reader->entries, 0);
 		reader->read = 0;
-		data_page->data = bpage;
+		data_page->data = dpage;
 
 		/*
 		 * Use the real_end for the data size,
@@ -7155,12 +7163,12 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 		 * on the page.
 		 */
 		if (reader->real_end)
-			local_set(&bpage->commit, reader->real_end);
+			local_set(&dpage->commit, reader->real_end);
 	}
 
 	cpu_buffer->lost_events = 0;
 
-	commit = local_read(&bpage->commit);
+	commit = rb_data_page_commit(dpage);
 	/*
 	 * Set a flag in the commit field if we lost events
 	 */
@@ -7169,19 +7177,19 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 		 * missed events, then record it there.
 		 */
 		if (buffer->subbuf_size - commit >= sizeof(missed_events)) {
-			memcpy(&bpage->data[commit], &missed_events,
+			memcpy(&dpage->data[commit], &missed_events,
 			       sizeof(missed_events));
-			local_add(RB_MISSED_STORED, &bpage->commit);
+			local_add(RB_MISSED_STORED, &dpage->commit);
 			commit += sizeof(missed_events);
 		}
-		local_add(RB_MISSED_EVENTS, &bpage->commit);
+		local_add(RB_MISSED_EVENTS, &dpage->commit);
 	}
 
 	/*
 	 * This page may be off to user land. Zero it out here.
 	 */
 	if (commit < buffer->subbuf_size)
-		memset(&bpage->data[commit], 0, buffer->subbuf_size - commit);
+		memset(&dpage->data[commit], 0, buffer->subbuf_size - commit);
 
 	return read;
 }
@@ -7812,7 +7820,7 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
 
 	if (missed_events) {
 		if (cpu_buffer->reader_page != cpu_buffer->commit_page) {
-			struct buffer_data_page *bpage = reader->page;
+			struct buffer_data_page *dpage = reader->page;
 			unsigned int commit;
 			/*
 			 * Use the real_end for the data size,
@@ -7820,18 +7828,18 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
 			 * on the page.
 			 */
 			if (reader->real_end)
-				local_set(&bpage->commit, reader->real_end);
+				local_set(&dpage->commit, reader->real_end);
 			/*
 			 * If there is room at the end of the page to save the
 			 * missed events, then record it there.
 			 */
 			commit = rb_page_size(reader);
 			if (buffer->subbuf_size - commit >= sizeof(missed_events)) {
-				memcpy(&bpage->data[commit], &missed_events,
+				memcpy(&dpage->data[commit], &missed_events,
 				       sizeof(missed_events));
-				local_add(RB_MISSED_STORED, &bpage->commit);
+				local_add(RB_MISSED_STORED, &dpage->commit);
 			}
-			local_add(RB_MISSED_EVENTS, &bpage->commit);
+			local_add(RB_MISSED_EVENTS, &dpage->commit);
 		} else if (!WARN_ONCE(cpu_buffer->reader_page == cpu_buffer->tail_page,
 				      "Reader on commit with %ld missed events",
 				      missed_events)) {


^ permalink raw reply related

* [PATCH v18 7/8] ring-buffer: Cleanup persistent ring buffer validation
From: Masami Hiramatsu (Google) @ 2026-04-24  6:52 UTC (permalink / raw)
  To: Steven Rostedt, Catalin Marinas, Will Deacon
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Cleanup rb_meta_validate_events() function to make it easier to read.
This includes the following cleanups:
 - Introduce rb_validatation_state to hold working variables in
   validation.
 - Move repleated validation state updates into rb_validate_buffer().
 - Move reader_page injection code outside of rb_meta_validate_events().

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 kernel/trace/ring_buffer.c |  186 ++++++++++++++++++++++----------------------
 1 file changed, 95 insertions(+), 91 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index de653a8e3cec..9850a0d8d24b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1883,8 +1883,16 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
 	return events;
 }
 
-static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
-			      struct ring_buffer_cpu_meta *meta, u64 prev_ts, u64 next_ts)
+struct rb_validation_state {
+	unsigned long entries;
+	unsigned long entry_bytes;
+	int discarded;
+	u64 ts;
+};
+
+static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
+				struct ring_buffer_cpu_meta *meta,
+				u64 prev_ts, u64 next_ts)
 {
 	struct buffer_data_page *dpage = bpage->page;
 	unsigned long long ts;
@@ -1914,16 +1922,82 @@ static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
 	return ret;
 }
 
+static void rb_validate_buffer(struct buffer_page *bpage,
+			       struct ring_buffer_per_cpu *cpu_buffer,
+			       struct ring_buffer_cpu_meta *meta,
+			       struct rb_validation_state *state,
+			       u64 prev_ts, u64 next_ts)
+{
+	int ret;
+
+	ret = __rb_validate_buffer(bpage, cpu_buffer->cpu, meta, prev_ts, next_ts);
+	if (ret < 0) {
+		if (!state->discarded)
+			pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+				cpu_buffer->cpu);
+		state->discarded++;
+	} else {
+		/* If the buffer has content, update pages_touched */
+		if (ret)
+			local_inc(&cpu_buffer->pages_touched);
+
+		state->entries += ret;
+		state->entry_bytes += rb_page_size(bpage);
+		state->ts = bpage->page->time_stamp;
+	}
+}
+
+static void rb_meta_inject_reader_page(struct ring_buffer_per_cpu *cpu_buffer,
+				       struct ring_buffer_cpu_meta *meta,
+				       struct buffer_page *orig_head,
+				       struct buffer_page *head_page)
+{
+	struct buffer_page *bpage = orig_head;
+	int i;
+
+	rb_dec_page(&bpage);
+	/*
+	 * Insert the reader_page before the original head page.
+	 * Since the list encode RB_PAGE flags, general list
+	 * operations should be avoided.
+	 */
+	cpu_buffer->reader_page->list.next = &orig_head->list;
+	cpu_buffer->reader_page->list.prev = orig_head->list.prev;
+	orig_head->list.prev = &cpu_buffer->reader_page->list;
+	bpage->list.next = &cpu_buffer->reader_page->list;
+
+	/* Make the head_page the reader page */
+	cpu_buffer->reader_page = head_page;
+	bpage = head_page;
+	rb_inc_page(&head_page);
+	head_page->list.prev = bpage->list.prev;
+	rb_dec_page(&bpage);
+	bpage->list.next = &head_page->list;
+	rb_set_list_to_head(&bpage->list);
+	cpu_buffer->pages = &head_page->list;
+
+	cpu_buffer->head_page = head_page;
+	meta->head_buffer = (unsigned long)head_page->page;
+
+	/* Reset all the indexes */
+	bpage = cpu_buffer->reader_page;
+	meta->buffers[0] = rb_meta_subbuf_idx(meta, bpage->page);
+	bpage->id = 0;
+
+	for (i = 1, bpage = head_page; i < meta->nr_subbufs;
+	     i++, rb_inc_page(&bpage)) {
+		meta->buffers[i] = rb_meta_subbuf_idx(meta, bpage->page);
+		bpage->id = i;
+	}
+}
+
 /* If the meta data has been validated, now validate the events */
 static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
 	struct buffer_page *head_page, *orig_head, *orig_reader;
-	unsigned long entry_bytes = 0;
-	unsigned long entries = 0;
-	int discarded = 0;
+	struct rb_validation_state state = { 0 };
 	int ret;
-	u64 ts;
 	int i;
 
 	if (!meta || !meta->head_buffer)
@@ -1933,25 +2007,16 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	orig_reader = cpu_buffer->reader_page;
 
 	/* Do the head page first */
-	ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, 0);
+	ret = __rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, 0);
 	if (ret < 0) {
 		pr_info("Ring buffer meta [%d] invalid head page detected\n",
 			cpu_buffer->cpu);
 		goto skip_rewind;
 	}
-	ts = head_page->page->time_stamp;
+	state.ts = head_page->page->time_stamp;
 
 	/* Do the reader page - reader must be previous to head. */
-	ret = rb_validate_buffer(orig_reader, cpu_buffer->cpu, meta, 0, ts);
-	if (ret < 0) {
-		pr_info("Ring buffer meta [%d] invalid reader page detected\n",
-			cpu_buffer->cpu);
-		discarded++;
-	} else {
-		entries += ret;
-		entry_bytes += rb_page_size(orig_reader);
-		ts = orig_reader->page->time_stamp;
-	}
+	rb_validate_buffer(orig_reader, cpu_buffer, meta, &state, 0, state.ts);
 
 	/*
 	 * Try to rewind the head so that we can read the pages which are already
@@ -1975,19 +2040,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		 * Skip if the page is invalid, or its timestamp is newer than the
 		 * previous valid page.
 		 */
-		ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, ts);
-		if (ret < 0) {
-			if (!discarded)
-				pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
-					cpu_buffer->cpu);
-			discarded++;
-		} else {
-			entries += ret;
-			entry_bytes += rb_page_size(head_page);
-			if (ret > 0)
-				local_inc(&cpu_buffer->pages_touched);
-			ts = head_page->page->time_stamp;
-		}
+		rb_validate_buffer(head_page, cpu_buffer, meta, &state, 0, state.ts);
 	}
 	if (i)
 		pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2001,43 +2054,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	 * into the location just before the original head page.
 	 */
 	if (head_page != orig_head) {
-		struct buffer_page *bpage = orig_head;
-
-		rb_dec_page(&bpage);
-		/*
-		 * Insert the reader_page before the original head page.
-		 * Since the list encode RB_PAGE flags, general list
-		 * operations should be avoided.
-		 */
-		cpu_buffer->reader_page->list.next = &orig_head->list;
-		cpu_buffer->reader_page->list.prev = orig_head->list.prev;
-		orig_head->list.prev = &cpu_buffer->reader_page->list;
-		bpage->list.next = &cpu_buffer->reader_page->list;
-
-		/* Make the head_page the reader page */
-		cpu_buffer->reader_page = head_page;
-		bpage = head_page;
-		rb_inc_page(&head_page);
-		head_page->list.prev = bpage->list.prev;
-		rb_dec_page(&bpage);
-		bpage->list.next = &head_page->list;
-		rb_set_list_to_head(&bpage->list);
-		cpu_buffer->pages = &head_page->list;
-
-		cpu_buffer->head_page = head_page;
-		meta->head_buffer = (unsigned long)head_page->page;
-
-		/* Reset all the indexes */
-		bpage = cpu_buffer->reader_page;
-		meta->buffers[0] = rb_meta_subbuf_idx(meta, bpage->page);
-		bpage->id = 0;
-
-		for (i = 1, bpage = head_page; i < meta->nr_subbufs;
-		     i++, rb_inc_page(&bpage)) {
-			meta->buffers[i] = rb_meta_subbuf_idx(meta, bpage->page);
-			bpage->id = i;
-		}
-
+		rb_meta_inject_reader_page(cpu_buffer, meta, orig_head, head_page);
 		/* We'll restart verifying from orig_head */
 		head_page = orig_head;
 	}
@@ -2049,7 +2066,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		/* Nothing more to do, the only page is the reader page */
 		goto done;
 	}
-	ts = head_page->page->time_stamp;
+	state.ts = head_page->page->time_stamp;
 
 	/* Iterate until finding the commit page */
 	for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
@@ -2058,21 +2075,8 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		if (head_page == orig_reader)
 			continue;
 
-		ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, ts, 0);
-		if (ret < 0) {
-			if (!discarded)
-				pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
-					cpu_buffer->cpu);
-			discarded++;
-		} else {
-			/* If the buffer has content, update pages_touched */
-			if (ret)
-				local_inc(&cpu_buffer->pages_touched);
+		rb_validate_buffer(head_page, cpu_buffer, meta, &state, state.ts, 0);
 
-			entries += ret;
-			entry_bytes += rb_page_size(head_page);
-			ts = head_page->page->time_stamp;
-		}
 		if (head_page == cpu_buffer->commit_page)
 			break;
 	}
@@ -2083,25 +2087,25 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		goto invalid;
 	}
  done:
-	local_set(&cpu_buffer->entries, entries);
-	local_set(&cpu_buffer->entries_bytes, entry_bytes);
+	local_set(&cpu_buffer->entries, state.entries);
+	local_set(&cpu_buffer->entries_bytes, state.entry_bytes);
 
 	pr_info("Ring buffer meta [%d] is from previous boot!", cpu_buffer->cpu);
-	if (discarded)
-		pr_cont(" (%d pages discarded)", discarded);
+	if (state.discarded)
+		pr_cont(" (%d pages discarded)", state.discarded);
 	pr_cont("\n");
 
 #ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
 	if (meta->nr_invalid)
 		pr_warn("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
 			cpu_buffer->cpu,
-			(discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
-			discarded, meta->nr_invalid);
+			(state.discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+			state.discarded, meta->nr_invalid);
 	if (meta->entry_bytes)
 		pr_warn("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
 			cpu_buffer->cpu,
-			(entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
-			(long)entry_bytes, (long)meta->entry_bytes);
+			(state.entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+			(long)state.entry_bytes, (long)meta->entry_bytes);
 	meta->nr_invalid = 0;
 	meta->entry_bytes = 0;
 #endif


^ permalink raw reply related

* [PATCH v18 6/8] ring-buffer: Show commit numbers in buffer_meta file
From: Masami Hiramatsu (Google) @ 2026-04-24  6:52 UTC (permalink / raw)
  To: Steven Rostedt, Catalin Marinas, Will Deacon
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

In addition to the index number, show the commit numbers of
each data page in the per_cpu buffer_meta file.
This is useful for understanding the current status of the
persistent ring buffer. (Note that this file is shown
only for persistent ring buffer and its backup instance)

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v17:
 - Added NULL check for dpage in rbm_show in ring_buffer.c.
 Changes in v16:
  - update description.
---
 kernel/trace/ring_buffer.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index e8d2b3457d7f..de653a8e3cec 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2216,6 +2216,7 @@ static int rbm_show(struct seq_file *m, void *v)
 	struct ring_buffer_per_cpu *cpu_buffer = m->private;
 	struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
 	unsigned long val = (unsigned long)v;
+	struct buffer_data_page *dpage;
 
 	if (val == 1) {
 		seq_printf(m, "head_buffer:   %d\n",
@@ -2228,7 +2229,9 @@ static int rbm_show(struct seq_file *m, void *v)
 	}
 
 	val -= 2;
-	seq_printf(m, "buffer[%ld]:    %d\n", val, meta->buffers[val]);
+	dpage = rb_range_buffer(cpu_buffer, val);
+	seq_printf(m, "buffer[%ld]:    %d (commit: %ld)\n",
+		   val, meta->buffers[val], dpage ? local_read(&dpage->commit) : -1);
 
 	return 0;
 }


^ permalink raw reply related

* [PATCH v18 5/8] ring-buffer: Add persistent ring buffer invalid-page inject test
From: Masami Hiramatsu (Google) @ 2026-04-24  6:52 UTC (permalink / raw)
  To: Steven Rostedt, Catalin Marinas, Will Deacon
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add a self-corrupting test for the persistent ring buffer.

This will inject an erroneous value to some sub-buffer pages (where
the index is even or multiples of 5) in the persistent ring buffer
when the kernel panics, and checks whether the number of detected
invalid pages and the total entry_bytes are the same as the recorded
values after reboot.

This ensures that the kernel can correctly recover a partially
corrupted persistent ring buffer after a reboot or panic.

The test only runs on the persistent ring buffer whose name is
"ptracingtest". The user has to fill it with events before a
kernel panic.

To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT
and add the following kernel cmdline:

 reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
 panic=1

Run the following commands after the 1st boot:

 cd /sys/kernel/tracing/instances/ptracingtest
 echo 1 > tracing_on
 echo 1 > events/enable
 sleep 3
 echo c > /proc/sysrq-trigger

After panic message, the kernel will reboot and run the verification
on the persistent ring buffer, e.g.

 Ring buffer meta [2] invalid buffer page detected
 Ring buffer meta [2] is from previous boot! (318 pages discarded)
 Ring buffer testing [2] invalid pages: PASSED (318/318)
 Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v18:
 - Fix to mask RB_MISSED_* flags when counting entry_bytes.
 Changes in v17:
 - In rb_test_inject_invalid_pages(), changed entry_bytes and
   idx to unsigned long
 - Added NULL checks for cpu_buffer and meta.
 - In allocate_trace_buffer(), added a NULL check for tr->name
   before comparing it with strcmp.
 Changes in v16:
  - Update description and comments according to review comments.
 Changes in v15:
  - Use pr_warn() for test result.
  - Inject errors on the page index is multiples of 5 so that
    this can reproduce contiguous empty pages.
 Changes in v14:
  - Rename config to CONFIG_RING_BUFFER_PERSISTENT_INJECT.
  - Clear meta->nr_invalid/entry_bytes after testing.
  - Add test commands in config comment.
 Changes in v10:
  - Add entry_bytes test.
  - Do not compile test code if CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n.
 Changes in v9:
  - Test also reader pages.
---
 include/linux/ring_buffer.h |    1 +
 kernel/trace/Kconfig        |   34 +++++++++++++++++++
 kernel/trace/ring_buffer.c  |   79 +++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.c        |    4 ++
 4 files changed, 118 insertions(+)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 994f52b34344..0670742b2d60 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -238,6 +238,7 @@ int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
 
 enum ring_buffer_flags {
 	RB_FL_OVERWRITE		= 1 << 0,
+	RB_FL_TESTING		= 1 << 1,
 };
 
 #ifdef CONFIG_RING_BUFFER
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..084f34dc6c9f 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1202,6 +1202,40 @@ config RING_BUFFER_VALIDATE_TIME_DELTAS
 	  Only say Y if you understand what this does, and you
 	  still want it enabled. Otherwise say N
 
+config RING_BUFFER_PERSISTENT_INJECT
+	bool "Enable persistent ring buffer error injection test"
+	depends on RING_BUFFER
+	help
+	  This option will have the kernel check if the persistent ring
+	  buffer is named "ptracingtest". and if so, it will corrupt some
+	  of its pages on a kernel panic. This is used to test if the
+	  persistent ring buffer can recover from some of its sub-buffers
+	  being corrupted.
+	  To use this, boot a kernel with a "ptracingtest" persistent
+	  ring buffer, e.g.
+
+	   reserve_mem=20M:2M:trace trace_instance=ptracingtest@trace panic=1
+
+	  And after the 1st boot, run the following commands:
+
+	   cd /sys/kernel/tracing/instances/ptracingtest
+	   echo 1 > events/enable
+	   echo 1 > tracing_on
+	   sleep 3
+	   echo c > /proc/sysrq-trigger
+
+	  After the panic message, the kernel will reboot and will show
+	  the test results in the console output.
+
+	  Note that events for the test ring buffer needs to be enabled
+	  prior to crashing the kernel so that the ring buffer has content
+	  that the test will corrupt.
+	  As the test will corrupt events in the "ptracingtest" persistent
+	  ring buffer, it should not be used for any other purpose other
+	  than this test.
+
+	  If unsure, say N
+
 config MMIOTRACE_TEST
 	tristate "Test module for mmiotrace"
 	depends on MMIOTRACE && m
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 899a78b307e6..e8d2b3457d7f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -64,6 +64,10 @@ struct ring_buffer_cpu_meta {
 	unsigned long	commit_buffer;
 	__u32		subbuf_size;
 	__u32		nr_subbufs;
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+	__u32		nr_invalid;
+	__u32		entry_bytes;
+#endif
 	int		buffers[];
 };
 
@@ -2086,6 +2090,21 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	if (discarded)
 		pr_cont(" (%d pages discarded)", discarded);
 	pr_cont("\n");
+
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+	if (meta->nr_invalid)
+		pr_warn("Ring buffer testing [%d] invalid pages: %s (%d/%d)\n",
+			cpu_buffer->cpu,
+			(discarded == meta->nr_invalid) ? "PASSED" : "FAILED",
+			discarded, meta->nr_invalid);
+	if (meta->entry_bytes)
+		pr_warn("Ring buffer testing [%d] entry_bytes: %s (%ld/%ld)\n",
+			cpu_buffer->cpu,
+			(entry_bytes == meta->entry_bytes) ? "PASSED" : "FAILED",
+			(long)entry_bytes, (long)meta->entry_bytes);
+	meta->nr_invalid = 0;
+	meta->entry_bytes = 0;
+#endif
 	return;
 
  invalid:
@@ -2566,12 +2585,72 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
 	kfree(cpu_buffer);
 }
 
+#ifdef CONFIG_RING_BUFFER_PERSISTENT_INJECT
+static void rb_test_inject_invalid_pages(struct trace_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_cpu_meta *meta;
+	struct buffer_data_page *dpage;
+	unsigned long entry_bytes = 0;
+	unsigned long ptr;
+	int subbuf_size;
+	int invalid = 0;
+	int cpu;
+	int i;
+
+	if (!(buffer->flags & RB_FL_TESTING))
+		return;
+
+	guard(preempt)();
+	cpu = smp_processor_id();
+
+	cpu_buffer = buffer->buffers[cpu];
+	if (!cpu_buffer)
+		return;
+	meta = cpu_buffer->ring_meta;
+	if (!meta)
+		return;
+
+	ptr = (unsigned long)rb_subbufs_from_meta(meta);
+	subbuf_size = meta->subbuf_size;
+
+	for (i = 0; i < meta->nr_subbufs; i++) {
+		unsigned long idx = meta->buffers[i];
+
+		dpage = (void *)(ptr + idx * subbuf_size);
+		/* Skip unused pages */
+		if (!local_read(&dpage->commit))
+			continue;
+
+		/*
+		 * Invalidate even pages or multiples of 5. This will cause 3
+		 * contiguous invalidated(empty) pages.
+		 */
+		if (!(i & 0x1) || !(i % 5)) {
+			local_add(subbuf_size + 1, &dpage->commit);
+			invalid++;
+		} else {
+			/* Count total commit bytes. */
+			entry_bytes += local_read(&dpage->commit) & ~RB_MISSED_MASK;
+		}
+	}
+
+	pr_info("Inject invalidated %d pages on CPU%d, total size: %ld\n",
+		invalid, cpu, (long)entry_bytes);
+	meta->nr_invalid = invalid;
+	meta->entry_bytes = entry_bytes;
+}
+#else /* !CONFIG_RING_BUFFER_PERSISTENT_INJECT */
+#define rb_test_inject_invalid_pages(buffer)	do { } while (0)
+#endif
+
 /* Stop recording on a persistent buffer and flush cache if needed. */
 static int rb_flush_buffer_cb(struct notifier_block *nb, unsigned long event, void *data)
 {
 	struct trace_buffer *buffer = container_of(nb, struct trace_buffer, flush_nb);
 
 	ring_buffer_record_off(buffer);
+	rb_test_inject_invalid_pages(buffer);
 	arch_ring_buffer_flush_range(buffer->range_addr_start, buffer->range_addr_end);
 	return NOTIFY_DONE;
 }
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e9455d46ec16..d972b24cd73b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9436,6 +9436,8 @@ static void setup_trace_scratch(struct trace_array *tr,
 	memset(tscratch, 0, size);
 }
 
+#define TRACE_TEST_PTRACING_NAME	"ptracingtest"
+
 static int
 allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned long size)
 {
@@ -9448,6 +9450,8 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, unsigned
 	buf->tr = tr;
 
 	if (tr->range_addr_start && tr->range_addr_size) {
+		if (tr->name && !strcmp(tr->name, TRACE_TEST_PTRACING_NAME))
+			rb_flags |= RB_FL_TESTING;
 		/* Add scratch buffer to handle 128 modules */
 		buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
 						      tr->range_addr_start,


^ permalink raw reply related

* [PATCH v18 4/8] ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-04-24  6:52 UTC (permalink / raw)
  To: Steven Rostedt, Catalin Marinas, Will Deacon
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Skip invalid sub-buffers when rewinding the persistent ring buffer
instead of stopping the rewinding the ring buffer. The skipped
buffers are cleared.

To ensure the rewinding stops at the unused page, this also clears
buffer_data_page::time_stamp when tracing resets the buffer. This
allows us to identify unused pages and empty pages.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v18:
  - Reset timestamp of reader_page when the entire cpu_buffer is
    invalid.
  - Minor update by new fix.
 Changes in v17:
  - Fix to verify head_page at first before using its timestamp.
  - Reset timestamp if the page is invalid.
 Changes in v12:
   - Fix build error.
 Changes in v11:
   - Reset timestamp when the buffer is invalid.
   - When rewinding, skip subbuf page if timestamp is wrong and
     check timestamp after validating buffer data page.
 Changes in v10:
   - Newly added.
---
 kernel/trace/ring_buffer.c |   94 ++++++++++++++++++++++++++------------------
 1 file changed, 55 insertions(+), 39 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 404c1fcac0ae..899a78b307e6 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -363,6 +363,7 @@ struct buffer_page {
 static void rb_init_page(struct buffer_data_page *bpage)
 {
 	local_set(&bpage->commit, 0);
+	bpage->time_stamp = 0;
 }
 
 static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
@@ -1878,12 +1879,14 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
 	return events;
 }
 
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
-			      struct ring_buffer_cpu_meta *meta)
+static int rb_validate_buffer(struct buffer_page *bpage, int cpu,
+			      struct ring_buffer_cpu_meta *meta, u64 prev_ts, u64 next_ts)
 {
+	struct buffer_data_page *dpage = bpage->page;
 	unsigned long long ts;
 	unsigned long tail;
 	u64 delta;
+	int ret = -1;
 
 	/*
 	 * When a sub-buffer is recovered from a read, the commit value may
@@ -1892,9 +1895,19 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
 	 * subbuf_size is considered invalid.
 	 */
 	tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
-	if (tail > meta->subbuf_size - BUF_PAGE_HDR_SIZE)
-		return -1;
-	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+	if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
+		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+
+	if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
+		local_set(&bpage->entries, 0);
+		local_set(&bpage->page->commit, 0);
+		bpage->page->time_stamp = prev_ts ? prev_ts : next_ts;
+		ret = -1;
+	} else {
+		local_set(&bpage->entries, ret);
+	}
+
+	return ret;
 }
 
 /* If the meta data has been validated, now validate the events */
@@ -1915,25 +1928,29 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	orig_head = head_page = cpu_buffer->head_page;
 	orig_reader = cpu_buffer->reader_page;
 
-	/* Do the reader page first */
-	ret = rb_validate_buffer(orig_reader->page, cpu_buffer->cpu, meta);
+	/* Do the head page first */
+	ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, 0);
+	if (ret < 0) {
+		pr_info("Ring buffer meta [%d] invalid head page detected\n",
+			cpu_buffer->cpu);
+		goto skip_rewind;
+	}
+	ts = head_page->page->time_stamp;
+
+	/* Do the reader page - reader must be previous to head. */
+	ret = rb_validate_buffer(orig_reader, cpu_buffer->cpu, meta, 0, ts);
 	if (ret < 0) {
 		pr_info("Ring buffer meta [%d] invalid reader page detected\n",
 			cpu_buffer->cpu);
 		discarded++;
-		/* Instead of discard whole ring buffer, discard only this sub-buffer. */
-		local_set(&orig_reader->entries, 0);
-		local_set(&orig_reader->page->commit, 0);
 	} else {
 		entries += ret;
 		entry_bytes += rb_page_size(orig_reader);
-		local_set(&orig_reader->entries, ret);
+		ts = orig_reader->page->time_stamp;
 	}
 
-	ts = head_page->page->time_stamp;
-
 	/*
-	 * Try to rewind the head so that we can read the pages which already
+	 * Try to rewind the head so that we can read the pages which are already
 	 * read in the previous boot.
 	 */
 	if (head_page == cpu_buffer->tail_page)
@@ -1946,26 +1963,27 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		if (head_page == cpu_buffer->tail_page)
 			break;
 
-		/* Ensure the page has older data than head. */
-		if (ts < head_page->page->time_stamp)
-			break;
-
-		ts = head_page->page->time_stamp;
-		/* Ensure the page has correct timestamp and some data. */
-		if (!ts || rb_page_commit(head_page) == 0)
+		/* Rewind until unused page (no timestamp, no commit). */
+		if (!head_page->page->time_stamp && rb_page_commit(head_page) == 0)
 			break;
 
-		/* Stop rewind if the page is invalid. */
-		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
-		if (ret < 0)
-			break;
-
-		/* Recover the number of entries and update stats. */
-		local_set(&head_page->entries, ret);
-		if (ret)
-			local_inc(&cpu_buffer->pages_touched);
-		entries += ret;
-		entry_bytes += rb_page_size(head_page);
+		/*
+		 * Skip if the page is invalid, or its timestamp is newer than the
+		 * previous valid page.
+		 */
+		ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, 0, ts);
+		if (ret < 0) {
+			if (!discarded)
+				pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+					cpu_buffer->cpu);
+			discarded++;
+		} else {
+			entries += ret;
+			entry_bytes += rb_page_size(head_page);
+			if (ret > 0)
+				local_inc(&cpu_buffer->pages_touched);
+			ts = head_page->page->time_stamp;
+		}
 	}
 	if (i)
 		pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2027,6 +2045,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		/* Nothing more to do, the only page is the reader page */
 		goto done;
 	}
+	ts = head_page->page->time_stamp;
 
 	/* Iterate until finding the commit page */
 	for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
@@ -2035,15 +2054,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		if (head_page == orig_reader)
 			continue;
 
-		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
+		ret = rb_validate_buffer(head_page, cpu_buffer->cpu, meta, ts, 0);
 		if (ret < 0) {
 			if (!discarded)
 				pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
 					cpu_buffer->cpu);
 			discarded++;
-			/* Instead of discard whole ring buffer, discard only this sub-buffer. */
-			local_set(&head_page->entries, 0);
-			local_set(&head_page->page->commit, 0);
 		} else {
 			/* If the buffer has content, update pages_touched */
 			if (ret)
@@ -2051,7 +2067,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 
 			entries += ret;
 			entry_bytes += rb_page_size(head_page);
-			local_set(&head_page->entries, ret);
+			ts = head_page->page->time_stamp;
 		}
 		if (head_page == cpu_buffer->commit_page)
 			break;
@@ -2079,12 +2095,12 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 
 	/* Reset the reader page */
 	local_set(&cpu_buffer->reader_page->entries, 0);
-	local_set(&cpu_buffer->reader_page->page->commit, 0);
+	rb_init_page(cpu_buffer->reader_page->page);
 
 	/* Reset all the subbuffers */
 	for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
 		local_set(&head_page->entries, 0);
-		local_set(&head_page->page->commit, 0);
+		rb_init_page(head_page->page);
 	}
 }
 


^ permalink raw reply related

* [PATCH v18 3/8] ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
From: Masami Hiramatsu (Google) @ 2026-04-24  6:52 UTC (permalink / raw)
  To: Steven Rostedt, Catalin Marinas, Will Deacon
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177701351903.2223789.17087009302463188638.stgit@mhiramat.tok.corp.google.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Skip invalid sub-buffers when validating the persistent ring buffer
instead of discarding the entire ring buffer. Only skipped buffers
are invalidated (cleared).

If the cache data in memory fails to be synchronized during a reboot,
the persistent ring buffer may become partially corrupted, but other
sub-buffers may still contain readable event data. Only discard the
subbuffers that are found to be corrupted.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
  Changes in v18:
  - Minor update by the new fix.
  - Fix to substract BUF_PAGE_HDR_SIZE from meta->subbuf_size
    to make the limit of commit size.
  Changes in v17:
  - Fix to use rb_page_size() of rewound pages for entry_bytes.
  Changes in v15:
  - Skip reader_page loop check on persistent ring buffer because
    there can be contiguous empty(invalidated) pages.
  - Do not show discarded page number information if it is 0.
  Changes in v11:
  - Fix a typo.
  Changes in v9:
  - Add meta->subbuf_size check.
  - Fix a typo.
  - Handle invalid reader_page case.
  Changes in v8:
  - Add comment in rb_valudate_buffer()
  - Clear the RB_MISSED_* flags in rb_valudate_buffer() instead of
    skipping subbuf.
  - Remove unused subbuf local variable from rb_cpu_meta_valid().
  Changes in v7:
  - Combined with Handling RB_MISSED_* flags patch, focus on validation at boot.
  - Remove checking subbuffer data when validating metadata, because it should be done
    later.
  - Do not mark the discarded sub buffer page but just reset it.
  Changes in v6:
  - Show invalid page detection message once per CPU.
  Changes in v5:
  - Instead of showing errors for each page, just show the number
    of discarded pages at last.
  Changes in v3:
  - Record missed data event on commit.
---
 kernel/trace/ring_buffer.c |  111 ++++++++++++++++++++++++++------------------
 1 file changed, 66 insertions(+), 45 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7288383b1f27..404c1fcac0ae 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -370,6 +370,12 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
 	return local_read(&bpage->page->commit);
 }
 
+/* Size is determined by what has been committed */
+static __always_inline unsigned int rb_page_size(struct buffer_page *bpage)
+{
+	return rb_page_commit(bpage) & ~RB_MISSED_MASK;
+}
+
 static void free_buffer_page(struct buffer_page *bpage)
 {
 	/* Range pages are not to be freed */
@@ -1762,7 +1768,6 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
 			      unsigned long *subbuf_mask)
 {
 	int subbuf_size = PAGE_SIZE;
-	struct buffer_data_page *subbuf;
 	unsigned long buffers_start;
 	unsigned long buffers_end;
 	int i;
@@ -1770,6 +1775,11 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
 	if (!subbuf_mask)
 		return false;
 
+	if (meta->subbuf_size != PAGE_SIZE) {
+		pr_info("Ring buffer boot meta [%d] invalid subbuf_size\n", cpu);
+		return false;
+	}
+
 	buffers_start = meta->first_buffer;
 	buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
 
@@ -1786,11 +1796,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
 		return false;
 	}
 
-	subbuf = rb_subbufs_from_meta(meta);
-
 	bitmap_clear(subbuf_mask, 0, meta->nr_subbufs);
 
-	/* Is the meta buffers and the subbufs themselves have correct data? */
+	/*
+	 * Ensure the meta::buffers array has correct data. The data in each subbufs
+	 * are checked later in rb_meta_validate_events().
+	 */
 	for (i = 0; i < meta->nr_subbufs; i++) {
 		if (meta->buffers[i] < 0 ||
 		    meta->buffers[i] >= meta->nr_subbufs) {
@@ -1798,18 +1809,12 @@ static bool rb_cpu_meta_valid(struct ring_buffer_cpu_meta *meta, int cpu,
 			return false;
 		}
 
-		if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
-			pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu);
-			return false;
-		}
-
 		if (test_bit(meta->buffers[i], subbuf_mask)) {
 			pr_info("Ring buffer boot meta [%d] array has duplicates\n", cpu);
 			return false;
 		}
 
 		set_bit(meta->buffers[i], subbuf_mask);
-		subbuf = (void *)subbuf + subbuf_size;
 	}
 
 	return true;
@@ -1873,13 +1878,22 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
 	return events;
 }
 
-static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu,
+			      struct ring_buffer_cpu_meta *meta)
 {
 	unsigned long long ts;
+	unsigned long tail;
 	u64 delta;
-	int tail;
 
-	tail = local_read(&dpage->commit);
+	/*
+	 * When a sub-buffer is recovered from a read, the commit value may
+	 * have RB_MISSED_* bits set, as these bits are reset on reuse.
+	 * Even after clearing these bits, a commit value greater than the
+	 * subbuf_size is considered invalid.
+	 */
+	tail = local_read(&dpage->commit) & ~RB_MISSED_MASK;
+	if (tail > meta->subbuf_size - BUF_PAGE_HDR_SIZE)
+		return -1;
 	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
 }
 
@@ -1890,6 +1904,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	struct buffer_page *head_page, *orig_head, *orig_reader;
 	unsigned long entry_bytes = 0;
 	unsigned long entries = 0;
+	int discarded = 0;
 	int ret;
 	u64 ts;
 	int i;
@@ -1901,14 +1916,19 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	orig_reader = cpu_buffer->reader_page;
 
 	/* Do the reader page first */
-	ret = rb_validate_buffer(orig_reader->page, cpu_buffer->cpu);
+	ret = rb_validate_buffer(orig_reader->page, cpu_buffer->cpu, meta);
 	if (ret < 0) {
-		pr_info("Ring buffer reader page is invalid\n");
-		goto invalid;
+		pr_info("Ring buffer meta [%d] invalid reader page detected\n",
+			cpu_buffer->cpu);
+		discarded++;
+		/* Instead of discard whole ring buffer, discard only this sub-buffer. */
+		local_set(&orig_reader->entries, 0);
+		local_set(&orig_reader->page->commit, 0);
+	} else {
+		entries += ret;
+		entry_bytes += rb_page_size(orig_reader);
+		local_set(&orig_reader->entries, ret);
 	}
-	entries += ret;
-	entry_bytes += local_read(&orig_reader->page->commit);
-	local_set(&orig_reader->entries, ret);
 
 	ts = head_page->page->time_stamp;
 
@@ -1936,7 +1956,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 			break;
 
 		/* Stop rewind if the page is invalid. */
-		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
 		if (ret < 0)
 			break;
 
@@ -1945,7 +1965,7 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		if (ret)
 			local_inc(&cpu_buffer->pages_touched);
 		entries += ret;
-		entry_bytes += rb_page_commit(head_page);
+		entry_bytes += rb_page_size(head_page);
 	}
 	if (i)
 		pr_info("Ring buffer [%d] rewound %d pages\n", cpu_buffer->cpu, i);
@@ -2015,21 +2035,24 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		if (head_page == orig_reader)
 			continue;
 
-		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu, meta);
 		if (ret < 0) {
-			pr_info("Ring buffer meta [%d] invalid buffer page\n",
-				cpu_buffer->cpu);
-			goto invalid;
-		}
-
-		/* If the buffer has content, update pages_touched */
-		if (ret)
-			local_inc(&cpu_buffer->pages_touched);
-
-		entries += ret;
-		entry_bytes += local_read(&head_page->page->commit);
-		local_set(&head_page->entries, ret);
+			if (!discarded)
+				pr_info("Ring buffer meta [%d] invalid buffer page detected\n",
+					cpu_buffer->cpu);
+			discarded++;
+			/* Instead of discard whole ring buffer, discard only this sub-buffer. */
+			local_set(&head_page->entries, 0);
+			local_set(&head_page->page->commit, 0);
+		} else {
+			/* If the buffer has content, update pages_touched */
+			if (ret)
+				local_inc(&cpu_buffer->pages_touched);
 
+			entries += ret;
+			entry_bytes += rb_page_size(head_page);
+			local_set(&head_page->entries, ret);
+		}
 		if (head_page == cpu_buffer->commit_page)
 			break;
 	}
@@ -2043,7 +2066,10 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	local_set(&cpu_buffer->entries, entries);
 	local_set(&cpu_buffer->entries_bytes, entry_bytes);
 
-	pr_info("Ring buffer meta [%d] is from previous boot!\n", cpu_buffer->cpu);
+	pr_info("Ring buffer meta [%d] is from previous boot!", cpu_buffer->cpu);
+	if (discarded)
+		pr_cont(" (%d pages discarded)", discarded);
+	pr_cont("\n");
 	return;
 
  invalid:
@@ -3330,12 +3356,6 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 	return NULL;
 }
 
-/* Size is determined by what has been committed */
-static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
-{
-	return rb_page_commit(bpage) & ~RB_MISSED_MASK;
-}
-
 static __always_inline unsigned
 rb_commit_index(struct ring_buffer_per_cpu *cpu_buffer)
 {
@@ -5648,11 +5668,12 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
  again:
 	/*
 	 * This should normally only loop twice. But because the
-	 * start of the reader inserts an empty page, it causes
-	 * a case where we will loop three times. There should be no
-	 * reason to loop four times (that I know of).
+	 * start of the reader inserts an empty page, it causes a
+	 * case where we will loop three times. There should be no
+	 * reason to loop four times unless the ring buffer is a
+	 * recovered persistent ring buffer.
 	 */
-	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) {
+	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3 && !cpu_buffer->ring_meta)) {
 		reader = NULL;
 		goto out;
 	}


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox