* [PATCH v4 0/4] mm: improve write performance with RWF_DONTCACHE
@ 2026-05-01 9:49 Jeff Layton
From: Jeff Layton @ 2026-05-01 9:49 UTC
To: Alexander Viro, Christian Brauner, Jan Kara,
Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
Ritesh Harjani, Chuck Lever
Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm, Jeff Layton
Here's a new version for everyone to contemplate while traveling!
This patch series attempts to improve write performance with
RWF_DONTCACHE. The main justification and benchmarks for the series are
in patch #2.
This version implements a scheme that Jan Kara and Christoph Hellwig
suggested during review of the earlier series: after a DONTCACHE write,
kick the flusher thread to do an amount of writeback proportional to the
amount written, but don't target any particular inode or pages when
doing writeback.
This markedly improves RWF_DONTCACHE write performance in all of the
benchmarks I tested; patch #2 summarizes the results.
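For readers unfamiliar with the interface being tuned here: a dontcache
write is an ordinary pwritev2() call with RWF_DONTCACHE set. The sketch
below is illustrative only; the write_dontcache() helper and its
fallback policy are mine, not part of this series, and the flag value is
taken from the uapi header (older headers may lack it).

```c
/* Minimal userspace sketch of issuing an RWF_DONTCACHE write. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080	/* matches include/uapi/linux/fs.h */
#endif

/*
 * Try a dropbehind (dontcache) write; if the running kernel or
 * filesystem rejects the flag, fall back to a plain buffered write so
 * callers still make progress. The fallback choice is this sketch's
 * assumption, not something the series mandates.
 */
ssize_t write_dontcache(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret = pwritev2(fd, &iov, 1, off, RWF_DONTCACHE);

	if (ret < 0 && (errno == EOPNOTSUPP || errno == EINVAL))
		ret = pwritev2(fd, &iov, 1, off, 0);
	return ret;
}
```

Whether the dropbehind path is actually taken depends on kernel and
filesystem support; the helper behaves like a plain pwrite either way.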
The benchmarks I used are in the last two patches. I'm not sure if we
want to merge those into the tree as they are (mostly) AI slop. There
is probably a better tool for this out there.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Changes in v4:
- Track DONTCACHE dirty pages per bdi_writeback
- New benchmark for competing buffered and dontcache writers
- New benchmark replicating Jens' original 32 concurrent writer test
- Link to v3: https://lore.kernel.org/r/20260426-dontcache-v3-0-79eb37da9547@kernel.org
Changes in v3:
- Track dirty DONTCACHE pages in the VM
- Have flusher write back a proportional number of pages after DONTCACHE write
- Link to v2: https://lore.kernel.org/r/20260408-dontcache-v2-0-948dec1e756b@kernel.org
Changes in v2:
- kick flusher thread instead of initiating writeback inline
- add mechanism to run 'perf lock' around the testcases
- Link to v1: https://lore.kernel.org/r/20260401-dontcache-v1-0-1f5746fab47a@kernel.org
---
Jeff Layton (4):
mm: track DONTCACHE dirty pages per bdi_writeback
mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
testing: add nfsd-io-bench NFS server benchmark suite
testing: add dontcache-bench local filesystem benchmark suite
fs/fs-writeback.c | 60 ++
include/linux/backing-dev-defs.h | 3 +
include/linux/fs.h | 6 +-
include/trace/events/writeback.h | 3 +-
mm/filemap.c | 13 +-
mm/page-writeback.c | 6 +
.../dontcache-bench/fio-jobs/axboe-write.fio | 14 +
.../dontcache-bench/fio-jobs/lat-reader.fio | 12 +
.../dontcache-bench/fio-jobs/multi-write.fio | 11 +
.../dontcache-bench/fio-jobs/noisy-writer.fio | 12 +
.../testing/dontcache-bench/fio-jobs/rand-read.fio | 13 +
.../dontcache-bench/fio-jobs/rand-write.fio | 13 +
.../testing/dontcache-bench/fio-jobs/seq-read.fio | 13 +
.../testing/dontcache-bench/fio-jobs/seq-write.fio | 13 +
.../dontcache-bench/scripts/parse-results.sh | 346 +++++++++++
.../dontcache-bench/scripts/run-benchmarks.sh | 643 +++++++++++++++++++++
.../testing/nfsd-io-bench/fio-jobs/lat-reader.fio | 15 +
.../testing/nfsd-io-bench/fio-jobs/multi-write.fio | 14 +
.../nfsd-io-bench/fio-jobs/noisy-writer.fio | 14 +
tools/testing/nfsd-io-bench/fio-jobs/rand-read.fio | 15 +
.../testing/nfsd-io-bench/fio-jobs/rand-write.fio | 15 +
tools/testing/nfsd-io-bench/fio-jobs/seq-read.fio | 14 +
tools/testing/nfsd-io-bench/fio-jobs/seq-write.fio | 14 +
.../testing/nfsd-io-bench/scripts/parse-results.sh | 238 ++++++++
.../nfsd-io-bench/scripts/run-benchmarks.sh | 591 +++++++++++++++++++
.../testing/nfsd-io-bench/scripts/setup-server.sh | 94 +++
26 files changed, 2199 insertions(+), 6 deletions(-)
---
base-commit: 26fd6bff2c050196005312d1d306889220952a99
change-id: 20260401-dontcache-5811efd7eaf3
Best regards,
--
Jeff Layton <jlayton@kernel.org>
* [PATCH v4 1/4] mm: track DONTCACHE dirty pages per bdi_writeback
Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
writes).
Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
when the folio has the dropbehind flag set, and decrement it in
folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
when a non-DONTCACHE lookup clears the dropbehind flag on a dirty folio in
__filemap_get_folio_mpol(), using proper writeback domain locking.
The counter will be used by the writeback flusher to determine how many
pages to write back when expediting writeback for IOCB_DONTCACHE writes,
without flushing the entire BDI's dirty pages.
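The accounting rules above can be sketched as a small userspace model.
This is illustrative only: the wb_model names are invented here, and the
real kernel code uses per-cpu wb_stat counters under writeback-domain
locking rather than a plain long.

```c
/* Userspace model of the WB_DONTCACHE_DIRTY accounting transitions. */
#include <stdbool.h>

struct wb_model {
	long dontcache_dirty;	/* models wb_stat(wb, WB_DONTCACHE_DIRTY) */
};

/* folio_account_dirtied(): count dropbehind folios as they dirty */
static inline void model_account_dirtied(struct wb_model *wb, long nr,
					 bool dropbehind)
{
	if (dropbehind)
		wb->dontcache_dirty += nr;
}

/* folio_clear_dirty_for_io() / folio_account_cleaned(): drop the count */
static inline void model_account_cleaned(struct wb_model *wb, long nr,
					 bool dropbehind)
{
	if (dropbehind)
		wb->dontcache_dirty -= nr;
}

/* a non-DONTCACHE lookup clearing dropbehind on a still-dirty folio */
static inline void model_clear_dropbehind(struct wb_model *wb, long nr,
					  bool dirty)
{
	if (dirty)
		wb->dontcache_dirty -= nr;
}
```

The invariant the patch maintains is that the counter always equals the
number of dirty pages that still carry the dropbehind flag, so every
path that clears either "dirty" or "dropbehind" must decrement it.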
Suggested-by: Jan Kara <jack@suse.cz>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
include/linux/backing-dev-defs.h | 1 +
mm/filemap.c | 13 ++++++++++++-
mm/page-writeback.c | 6 ++++++
3 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index a06b93446d10..cb660dd37286 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -33,6 +33,7 @@ enum wb_stat_item {
WB_WRITEBACK,
WB_DIRTIED,
WB_WRITTEN,
+ WB_DONTCACHE_DIRTY,
NR_WB_STAT_ITEMS
};
diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..1c9c0d5f495f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2052,8 +2052,19 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
if (!folio)
return ERR_PTR(-ENOENT);
/* not an uncached lookup, clear uncached if set */
- if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE))
+ if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE)) {
+ if (folio_test_dirty(folio)) {
+ struct inode *inode = mapping->host;
+ struct bdi_writeback *wb;
+ struct wb_lock_cookie cookie = {};
+
+ wb = unlocked_inode_to_wb_begin(inode, &cookie);
+ wb_stat_mod(wb, WB_DONTCACHE_DIRTY,
+ -folio_nr_pages(folio));
+ unlocked_inode_to_wb_end(inode, &cookie);
+ }
folio_clear_dropbehind(folio);
+ }
return folio;
}
EXPORT_SYMBOL(__filemap_get_folio_mpol);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 88cd53d4ba09..8e520717d1f6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2630,6 +2630,8 @@ static void folio_account_dirtied(struct folio *folio,
wb = inode_to_wb(inode);
lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, nr);
+ if (folio_test_dropbehind(folio))
+ wb_stat_mod(wb, WB_DONTCACHE_DIRTY, nr);
__zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr);
__node_stat_mod_folio(folio, NR_DIRTIED, nr);
wb_stat_mod(wb, WB_RECLAIMABLE, nr);
@@ -2651,6 +2653,8 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
long nr = folio_nr_pages(folio);
lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
+ if (folio_test_dropbehind(folio))
+ wb_stat_mod(wb, WB_DONTCACHE_DIRTY, -nr);
zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
task_io_account_cancelled_write(nr * PAGE_SIZE);
@@ -2920,6 +2924,8 @@ bool folio_clear_dirty_for_io(struct folio *folio)
if (folio_test_clear_dirty(folio)) {
long nr = folio_nr_pages(folio);
lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
+ if (folio_test_dropbehind(folio))
+ wb_stat_mod(wb, WB_DONTCACHE_DIRTY, -nr);
zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
ret = true;
--
2.54.0
* [PATCH v4 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
The IOCB_DONTCACHE writeback path in generic_write_sync() calls
filemap_flush_range() on every write, submitting writeback inline in
the writer's context. Profiling with 'perf lock' shows that the problem
is not lock contention but the writeback submission work itself: walking
the page tree and submitting I/O blocks the writer for milliseconds,
inflating p99.9 latency from 23ms (buffered) to 93ms (dontcache).
Replace the inline filemap_flush_range() call with a flusher kick that
drains dirty pages in the background. This moves writeback submission
completely off the writer's hot path.
To avoid flushing unrelated buffered dirty data, add a dedicated
WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
write back. The flusher writes back that many pages from the oldest dirty
inodes (not restricted to dontcache-specific inodes). This helps
preserve I/O batching while limiting the scope of expedited writeback.
Like WB_start_all, the WB_start_dontcache bit coalesces multiple
DONTCACHE writes into a single flusher wakeup without per-write
allocations.
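The coalescing idiom borrowed from WB_start_all (a plain test_bit first
to avoid dirtying the cacheline, then test_and_set_bit) can be modeled
in userspace with C11 atomics. This is an illustrative sketch with
invented names, not the kernel code: only the caller that flips the bit
from 0 to 1 wakes the flusher, and later writes coalesce for free.

```c
/* Userspace model of the WB_start_dontcache coalescing idiom. */
#include <stdatomic.h>
#include <stdbool.h>

#define WB_START_DONTCACHE (1UL << 0)

/* returns true if this caller should wake the flusher thread */
static bool wb_kick_needs_wakeup(atomic_ulong *state)
{
	/* fast path: a kick is already queued, nothing to do */
	if (atomic_load_explicit(state, memory_order_relaxed) &
	    WB_START_DONTCACHE)
		return false;
	/* slow path: atomically set the bit; old value says who won */
	unsigned long old = atomic_fetch_or(state, WB_START_DONTCACHE);
	return !(old & WB_START_DONTCACHE);
}

/* flusher clears the bit once it has drained the dontcache pages */
static void wb_kick_done(atomic_ulong *state)
{
	atomic_fetch_and(state, ~WB_START_DONTCACHE);
}
```

The two-step check matters under heavy write load: once the bit is set,
subsequent writers take only the relaxed-load fast path and never issue
a read-modify-write on the shared word.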
Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
visibility, and target the correct cgroup writeback domain via
unlocked_inode_to_wb_begin().
dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
xfs on NVMe, fio io_uring):
Buffered and direct I/O paths are unaffected by this patchset. All
improvements are confined to the dontcache path:
Single-stream throughput (MB/s):

                          Before    After   Change
  seq-write/dontcache        298      897    +201%
  rand-write/dontcache       131      236     +80%
Tail latency improvements (seq-write/dontcache):

  p99:     135,266 us ->  23,986 us  (-82%)
  p99.9: 8,925,479 us ->  28,443 us  (-99.7%)
Multi-writer (4 jobs, sequential write):

                                Before    After   Change
  dontcache aggregate (MB/s)     2,529    4,532     +79%
  dontcache p99 (us)             8,553    1,002     -88%
  dontcache p99.9 (us)         109,314    1,057     -99%
Dontcache multi-writer throughput now matches buffered (4,532 vs
4,616 MB/s).
32-file write (Axboe test):

                                Before    After   Change
  dontcache aggregate (MB/s)     1,548    3,499    +126%
  dontcache p99 (us)            10,170      602     -94%
  Peak dirty pages (MB)          1,837      213     -88%
Dontcache now reaches 81% of buffered throughput (was 35%).
Competing writers (dontcache vs buffered, separate files):

                            Before    After
  buffered writer (MB/s)       868      433
  dontcache writer (MB/s)      415      433
  Aggregate (MB/s)           1,284      866
Previously the buffered writer starved the dontcache writer 2:1.
With per-bdi_writeback tracking, both writers now receive equal
bandwidth. The aggregate matches the buffered-vs-buffered baseline
(863 MB/s), indicating fair sharing regardless of I/O mode.
The dontcache writer's p99.9 latency collapsed from 119 ms to
33 ms (-73%), eliminating the severe periodic stalls seen in the
baseline. Both writers now share identical latency profiles,
matching the buffered-vs-buffered pattern.
The per-bdi_writeback dirty tracking dramatically reduces peak dirty
pages in dontcache workloads, with the 32-file test dropping from
1.8 GB to 213 MB. Dontcache sequential write throughput triples and
multi-writer throughput reaches parity with buffered I/O, with tail
latencies collapsing by 1-2 orders of magnitude.
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/fs-writeback.c | 60 ++++++++++++++++++++++++++++++++++++++++
include/linux/backing-dev-defs.h | 2 ++
include/linux/fs.h | 6 ++--
include/trace/events/writeback.h | 3 +-
4 files changed, 66 insertions(+), 5 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a65694cbfe68..b06a51fb5d6c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1334,6 +1334,18 @@ static void wb_start_writeback(struct bdi_writeback *wb, enum wb_reason reason)
wb_wakeup(wb);
}
+static void wb_start_dontcache_writeback(struct bdi_writeback *wb)
+{
+ if (!wb_has_dirty_io(wb))
+ return;
+
+ if (test_bit(WB_start_dontcache, &wb->state) ||
+ test_and_set_bit(WB_start_dontcache, &wb->state))
+ return;
+
+ wb_wakeup(wb);
+}
+
/**
* wb_start_background_writeback - start background writeback
* @wb: bdi_writback to write from
@@ -2373,6 +2385,28 @@ static long wb_check_start_all(struct bdi_writeback *wb)
return nr_pages;
}
+static long wb_check_start_dontcache(struct bdi_writeback *wb)
+{
+ long nr_pages;
+
+ if (!test_bit(WB_start_dontcache, &wb->state))
+ return 0;
+
+ nr_pages = wb_stat(wb, WB_DONTCACHE_DIRTY);
+ if (nr_pages) {
+ struct wb_writeback_work work = {
+ .nr_pages = nr_pages,
+ .sync_mode = WB_SYNC_NONE,
+ .range_cyclic = 1,
+ .reason = WB_REASON_DONTCACHE,
+ };
+
+ nr_pages = wb_writeback(wb, &work);
+ }
+
+ clear_bit(WB_start_dontcache, &wb->state);
+ return nr_pages;
+}
/*
* Retrieve work items and do the writeback they describe
@@ -2394,6 +2428,11 @@ static long wb_do_writeback(struct bdi_writeback *wb)
*/
wrote += wb_check_start_all(wb);
+ /*
+ * Check for dontcache writeback request
+ */
+ wrote += wb_check_start_dontcache(wb);
+
/*
* Check for periodic writeback, kupdated() style
*/
@@ -2468,6 +2507,27 @@ void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
rcu_read_unlock();
}
+/**
+ * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
+ * @mapping: address_space that was just written to
+ *
+ * Kick the writeback flusher thread to expedite writeback of dontcache
+ * dirty pages. Uses a dedicated WB_start_dontcache bit so that only
+ * pages tracked by WB_DONTCACHE_DIRTY are written back, rather than
+ * flushing the entire BDI's dirty pages.
+ */
+void filemap_dontcache_kick_writeback(struct address_space *mapping)
+{
+ struct inode *inode = mapping->host;
+ struct bdi_writeback *wb;
+ struct wb_lock_cookie cookie = {};
+
+ wb = unlocked_inode_to_wb_begin(inode, &cookie);
+ wb_start_dontcache_writeback(wb);
+ unlocked_inode_to_wb_end(inode, &cookie);
+}
+EXPORT_SYMBOL_GPL(filemap_dontcache_kick_writeback);
+
/*
* Wakeup the flusher threads to start writeback of all currently dirty pages
*/
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index cb660dd37286..4f1084937315 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -26,6 +26,7 @@ enum wb_state {
WB_writeback_running, /* Writeback is in progress */
WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */
WB_start_all, /* nr_pages == 0 (all) work pending */
+ WB_start_dontcache, /* dontcache writeback pending */
};
enum wb_stat_item {
@@ -56,6 +57,7 @@ enum wb_reason {
*/
WB_REASON_FORKER_THREAD,
WB_REASON_FOREIGN_FLUSH,
+ WB_REASON_DONTCACHE,
WB_REASON_MAX,
};
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..df72b42a9e9b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2624,6 +2624,7 @@ extern int __must_check file_write_and_wait_range(struct file *file,
loff_t start, loff_t end);
int filemap_flush_range(struct address_space *mapping, loff_t start,
loff_t end);
+void filemap_dontcache_kick_writeback(struct address_space *mapping);
static inline int file_write_and_wait(struct file *file)
{
@@ -2657,10 +2658,7 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
if (ret)
return ret;
} else if (iocb->ki_flags & IOCB_DONTCACHE) {
- struct address_space *mapping = iocb->ki_filp->f_mapping;
-
- filemap_flush_range(mapping, iocb->ki_pos - count,
- iocb->ki_pos - 1);
+ filemap_dontcache_kick_writeback(iocb->ki_filp->f_mapping);
}
return count;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index bdac0d685a98..13ee076ccd16 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -44,7 +44,8 @@
EM( WB_REASON_PERIODIC, "periodic") \
EM( WB_REASON_FS_FREE_SPACE, "fs_free_space") \
EM( WB_REASON_FORKER_THREAD, "forker_thread") \
- EMe(WB_REASON_FOREIGN_FLUSH, "foreign_flush")
+ EM( WB_REASON_FOREIGN_FLUSH, "foreign_flush") \
+ EMe(WB_REASON_DONTCACHE, "dontcache")
WB_WORK_REASON
--
2.54.0
* [PATCH v4 3/4] testing: add nfsd-io-bench NFS server benchmark suite
Add a benchmark suite for testing NFSD I/O mode performance using fio
with the libnfs backend against an NFS server on localhost. Tests
buffered, dontcache, and direct I/O modes via NFSD debugfs controls.
Includes:
- fio job files for sequential/random read/write, multi-writer,
noisy-neighbor, and latency-sensitive reader workloads
- run-benchmarks.sh: orchestrates test matrix with mode switching
- parse-results.sh: extracts metrics from fio JSON output
- setup-server.sh: configures NFS export for testing
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
.../testing/nfsd-io-bench/fio-jobs/lat-reader.fio | 15 +
.../testing/nfsd-io-bench/fio-jobs/multi-write.fio | 14 +
.../nfsd-io-bench/fio-jobs/noisy-writer.fio | 14 +
tools/testing/nfsd-io-bench/fio-jobs/rand-read.fio | 15 +
.../testing/nfsd-io-bench/fio-jobs/rand-write.fio | 15 +
tools/testing/nfsd-io-bench/fio-jobs/seq-read.fio | 14 +
tools/testing/nfsd-io-bench/fio-jobs/seq-write.fio | 14 +
.../testing/nfsd-io-bench/scripts/parse-results.sh | 238 +++++++++
.../nfsd-io-bench/scripts/run-benchmarks.sh | 591 +++++++++++++++++++++
.../testing/nfsd-io-bench/scripts/setup-server.sh | 94 ++++
10 files changed, 1024 insertions(+)
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/lat-reader.fio b/tools/testing/nfsd-io-bench/fio-jobs/lat-reader.fio
new file mode 100644
index 000000000000..61af37e8b860
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/lat-reader.fio
@@ -0,0 +1,15 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=4k
+numjobs=16
+runtime=300
+time_based=1
+group_reporting=1
+rw=randread
+log_avg_msec=1000
+write_bw_log=latreader
+write_lat_log=latreader
+
+[lat_reader]
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/multi-write.fio b/tools/testing/nfsd-io-bench/fio-jobs/multi-write.fio
new file mode 100644
index 000000000000..16b792aecabb
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/multi-write.fio
@@ -0,0 +1,14 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=1M
+numjobs=16
+time_based=0
+group_reporting=1
+rw=write
+log_avg_msec=1000
+write_bw_log=multiwrite
+write_lat_log=multiwrite
+
+[writer]
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/noisy-writer.fio b/tools/testing/nfsd-io-bench/fio-jobs/noisy-writer.fio
new file mode 100644
index 000000000000..615154a7737e
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/noisy-writer.fio
@@ -0,0 +1,14 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=1M
+numjobs=16
+time_based=0
+group_reporting=1
+rw=write
+log_avg_msec=1000
+write_bw_log=noisywriter
+write_lat_log=noisywriter
+
+[bulk_writer]
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/rand-read.fio b/tools/testing/nfsd-io-bench/fio-jobs/rand-read.fio
new file mode 100644
index 000000000000..501bae7416a8
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/rand-read.fio
@@ -0,0 +1,15 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=4k
+numjobs=16
+runtime=300
+time_based=1
+group_reporting=1
+rw=randread
+log_avg_msec=1000
+write_bw_log=randread
+write_lat_log=randread
+
+[randread]
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/rand-write.fio b/tools/testing/nfsd-io-bench/fio-jobs/rand-write.fio
new file mode 100644
index 000000000000..d891d04197ae
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/rand-write.fio
@@ -0,0 +1,15 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=64k
+numjobs=16
+runtime=300
+time_based=1
+group_reporting=1
+rw=randwrite
+log_avg_msec=1000
+write_bw_log=randwrite
+write_lat_log=randwrite
+
+[randwrite]
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/seq-read.fio b/tools/testing/nfsd-io-bench/fio-jobs/seq-read.fio
new file mode 100644
index 000000000000..6e24ab355026
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/seq-read.fio
@@ -0,0 +1,14 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=1M
+numjobs=16
+time_based=0
+group_reporting=1
+rw=read
+log_avg_msec=1000
+write_bw_log=seqread
+write_lat_log=seqread
+
+[seqread]
diff --git a/tools/testing/nfsd-io-bench/fio-jobs/seq-write.fio b/tools/testing/nfsd-io-bench/fio-jobs/seq-write.fio
new file mode 100644
index 000000000000..260858e345f5
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/fio-jobs/seq-write.fio
@@ -0,0 +1,14 @@
+[global]
+ioengine=nfs
+nfs_url=nfs://localhost/export
+direct=0
+bs=1M
+numjobs=16
+time_based=0
+group_reporting=1
+rw=write
+log_avg_msec=1000
+write_bw_log=seqwrite
+write_lat_log=seqwrite
+
+[seqwrite]
diff --git a/tools/testing/nfsd-io-bench/scripts/parse-results.sh b/tools/testing/nfsd-io-bench/scripts/parse-results.sh
new file mode 100755
index 000000000000..0427d411db04
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/scripts/parse-results.sh
@@ -0,0 +1,238 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Parse fio JSON output and generate comparison tables.
+#
+# Usage: ./parse-results.sh <results-dir>
+
+set -euo pipefail
+
+if [ $# -lt 1 ]; then
+ echo "Usage: $0 <results-dir>"
+ exit 1
+fi
+
+RESULTS_DIR="$1"
+
+if ! command -v jq &>/dev/null; then
+ echo "ERROR: jq is required"
+ exit 1
+fi
+
+# Extract metrics from a single fio JSON result
+extract_metrics() {
+ local json_file=$1
+ local rw_type=$2 # read or write
+
+ if [ ! -f "$json_file" ]; then
+ echo "N/A N/A N/A N/A N/A N/A"
+ return
+ fi
+
+ jq -r --arg rw "$rw_type" '
+ .jobs[0][$rw] as $d |
+ [
+ (($d.bw // 0) / 1024 | . * 10 | round / 10), # MB/s
+ ($d.iops // 0), # IOPS
+ ((($d.clat_ns.mean // 0) / 1000) | . * 10 | round / 10), # avg lat us
+ (($d.clat_ns.percentile["50.000000"] // 0) / 1000), # p50 us
+ (($d.clat_ns.percentile["99.000000"] // 0) / 1000), # p99 us
+ (($d.clat_ns.percentile["99.900000"] // 0) / 1000) # p99.9 us
+ ] | @tsv
+ ' "$json_file" 2>/dev/null || echo "N/A N/A N/A N/A N/A N/A"
+}
+
+# Extract server CPU from vmstat log (average sys%)
+extract_cpu() {
+ local vmstat_log=$1
+ if [ ! -f "$vmstat_log" ]; then
+ echo "N/A"
+ return
+ fi
+ # Field 14 of default vmstat output is "sy" (system CPU %); NR>2 skips the header lines
+ awk 'NR>2 {sum+=$14; n++} END {if(n>0) printf "%.1f", sum/n; else print "N/A"}' \
+ "$vmstat_log" 2>/dev/null || echo "N/A"
+}
+
+# Extract peak dirty pages from meminfo log
+extract_peak_dirty() {
+ local meminfo_log=$1
+ if [ ! -f "$meminfo_log" ]; then
+ echo "N/A"
+ return
+ fi
+ grep "^Dirty:" "$meminfo_log" | awk '{print $2}' | sort -n | tail -1 || echo "N/A"
+}
+
+# Extract peak cached from meminfo log
+extract_peak_cached() {
+ local meminfo_log=$1
+ if [ ! -f "$meminfo_log" ]; then
+ echo "N/A"
+ return
+ fi
+ grep "^Cached:" "$meminfo_log" | awk '{print $2}' | sort -n | tail -1 || echo "N/A"
+}
+
+print_separator() {
+ printf '%*s\n' 120 '' | tr ' ' '-'
+}
+
+########################################################################
+# Deliverable 1: Single-client results
+########################################################################
+echo ""
+echo "=================================================================="
+echo " Deliverable 1: Single-Client fio Benchmarks"
+echo "=================================================================="
+echo ""
+
+for workload in seq-write rand-write seq-read rand-read; do
+ case $workload in
+ seq-write|rand-write) rw_type="write" ;;
+ seq-read|rand-read) rw_type="read" ;;
+ esac
+
+ echo "--- $workload ---"
+ printf "%-16s %10s %10s %10s %10s %10s %10s %10s %12s %12s\n" \
+ "Mode" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)" "Sys CPU%" "PeakDirty(kB)" "PeakCache(kB)"
+ print_separator
+
+ for mode in buffered dontcache direct; do
+ dir="${RESULTS_DIR}/${workload}/${mode}"
+ json_file=$(find "$dir" -name '*.json' -not -name 'client*' 2>/dev/null | head -1 || true)
+ if [ -z "$json_file" ]; then
+ printf "%-16s %10s\n" "$mode" "(no data)"
+ continue
+ fi
+
+ read -r mbps iops avg_lat p50 p99 p999 <<< \
+ "$(extract_metrics "$json_file" "$rw_type")"
+ cpu=$(extract_cpu "${dir}/vmstat.log")
+ dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+ cached=$(extract_peak_cached "${dir}/meminfo.log")
+
+ printf "%-16s %10s %10s %10s %10s %10s %10s %10s %12s %12s\n" \
+ "$mode" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999" \
+ "$cpu" "${dirty:-N/A}" "${cached:-N/A}"
+ done
+ echo ""
+done
+
+########################################################################
+# Deliverable 2: Multi-client results
+########################################################################
+echo "=================================================================="
+echo " Deliverable 2: Noisy-Neighbor Benchmarks"
+echo "=================================================================="
+echo ""
+
+# Scenario A: Multiple writers
+echo "--- Scenario A: Multiple Writers ---"
+for mode in buffered dontcache direct; do
+ dir="${RESULTS_DIR}/multi-write/${mode}"
+ if [ ! -d "$dir" ]; then
+ continue
+ fi
+
+ echo " Mode: $mode"
+ printf " %-10s %10s %10s %10s %10s %10s %10s\n" \
+ "Client" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+ total_bw=0
+ count=0
+ for json_file in "${dir}"/client*.json; do
+ [ -f "$json_file" ] || continue
+ client=$(basename "$json_file" .json)
+ read -r mbps iops avg_lat p50 p99 p999 <<< \
+ "$(extract_metrics "$json_file" "write")"
+ printf " %-10s %10s %10s %10s %10s %10s %10s\n" \
+ "$client" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+ total_bw=$(echo "$total_bw + ${mbps:-0}" | bc 2>/dev/null || echo "$total_bw")
+ count=$(( count + 1 ))
+ done
+
+ cpu=$(extract_cpu "${dir}/vmstat.log")
+ dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+ printf " Aggregate BW: %s MB/s | Sys CPU: %s%% | Peak Dirty: %s kB\n" \
+ "$total_bw" "$cpu" "${dirty:-N/A}"
+ echo ""
+done
+
+# Scenario C: Noisy neighbor
+echo "--- Scenario C: Noisy Writer + Latency-Sensitive Readers ---"
+for mode in buffered dontcache direct; do
+ dir="${RESULTS_DIR}/noisy-neighbor/${mode}"
+ if [ ! -d "$dir" ]; then
+ continue
+ fi
+
+ echo " Mode: $mode"
+ printf " %-14s %10s %10s %10s %10s %10s %10s\n" \
+ "Job" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+ # Writer
+ if [ -f "${dir}/noisy_writer.json" ]; then
+ read -r mbps iops avg_lat p50 p99 p999 <<< \
+ "$(extract_metrics "${dir}/noisy_writer.json" "write")"
+ printf " %-14s %10s %10s %10s %10s %10s %10s\n" \
+ "Bulk writer" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+ fi
+
+ # Readers
+ for json_file in "${dir}"/reader*.json; do
+ [ -f "$json_file" ] || continue
+ reader=$(basename "$json_file" .json)
+ read -r mbps iops avg_lat p50 p99 p999 <<< \
+ "$(extract_metrics "$json_file" "read")"
+ printf " %-14s %10s %10s %10s %10s %10s %10s\n" \
+ "$reader" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+ done
+
+ cpu=$(extract_cpu "${dir}/vmstat.log")
+ dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+ printf " Sys CPU: %s%% | Peak Dirty: %s kB\n" "$cpu" "${dirty:-N/A}"
+ echo ""
+done
+
+# Scenario D: Mixed-mode noisy neighbor
+echo "--- Scenario D: Mixed-Mode Noisy Writer + Readers ---"
+for dir in "${RESULTS_DIR}"/noisy-neighbor-mixed/*/; do
+ [ -d "$dir" ] || continue
+ label=$(basename "$dir")
+
+ echo " Mode: $label"
+ printf " %-14s %10s %10s %10s %10s %10s %10s\n" \
+ "Job" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+ # Writer
+ if [ -f "${dir}/noisy_writer.json" ]; then
+ read -r mbps iops avg_lat p50 p99 p999 <<< \
+ "$(extract_metrics "${dir}/noisy_writer.json" "write")"
+ printf " %-14s %10s %10s %10s %10s %10s %10s\n" \
+ "Bulk writer" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+ fi
+
+ # Readers
+ for json_file in "${dir}"/reader*.json; do
+ [ -f "$json_file" ] || continue
+ reader=$(basename "$json_file" .json)
+ read -r mbps iops avg_lat p50 p99 p999 <<< \
+ "$(extract_metrics "$json_file" "read")"
+ printf " %-14s %10s %10s %10s %10s %10s %10s\n" \
+ "$reader" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+ done
+
+ cpu=$(extract_cpu "${dir}/vmstat.log")
+ dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+ printf " Sys CPU: %s%% | Peak Dirty: %s kB\n" "$cpu" "${dirty:-N/A}"
+ echo ""
+done
+
+echo "=================================================================="
+echo " System Info"
+echo "=================================================================="
+if [ -f "${RESULTS_DIR}/sysinfo.txt" ]; then
+ head -6 "${RESULTS_DIR}/sysinfo.txt"
+fi
+echo ""
diff --git a/tools/testing/nfsd-io-bench/scripts/run-benchmarks.sh b/tools/testing/nfsd-io-bench/scripts/run-benchmarks.sh
new file mode 100755
index 000000000000..2b0cf6e79dff
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/scripts/run-benchmarks.sh
@@ -0,0 +1,591 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# NFS server I/O mode benchmark suite
+#
+# Runs fio with the NFS ioengine against an NFS server on localhost,
+# testing buffered, dontcache, and direct I/O modes.
+#
+# Usage: ./run-benchmarks.sh [OPTIONS]
+#
+# Options:
+# -e EXPORT_PATH Server export path (default: /export)
+# -s SIZE fio file size, should be >= 2x RAM (default: auto-detect)
+# -r RESULTS_DIR Where to store results (default: ./results)
+# -n NFS_VER NFS version: 3 or 4 (default: 3)
+# -j FIO_JOBS_DIR Path to fio job files (default: ../fio-jobs)
+# -d Dry run: print commands without executing
+# -h Show this help
+
+set -euo pipefail
+
+# Defaults
+EXPORT_PATH="/export"
+SIZE=""
+RESULTS_DIR="./results"
+NFS_VER=3
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+FIO_JOBS_DIR="${SCRIPT_DIR}/../fio-jobs"
+DRY_RUN=0
+MODES="0 1 2"
+PERF_LOCK=0
+
+DEBUGFS_BASE="/sys/kernel/debug/nfsd"
+IO_CACHE_READ="${DEBUGFS_BASE}/io_cache_read"
+IO_CACHE_WRITE="${DEBUGFS_BASE}/io_cache_write"
+DISABLE_SPLICE="${DEBUGFS_BASE}/disable-splice-read"
+
+usage() {
+ echo "Usage: $0 [OPTIONS]"
+ echo " -e EXPORT_PATH Server export path (default: /export)"
+ echo " -s SIZE fio file size (default: 2x RAM)"
+ echo " -r RESULTS_DIR Results directory (default: ./results)"
+ echo " -n NFS_VER NFS version: 3 or 4 (default: 3)"
+ echo " -j FIO_JOBS_DIR Path to fio job files"
+ echo " -D Dontcache only (skip buffered and direct tests)"
+ echo " -p Profile kernel lock contention with perf lock"
+ echo " -d Dry run"
+ echo " -h Help"
+ exit 1
+}
+
+while getopts "e:s:r:n:j:Dpdh" opt; do
+ case $opt in
+ e) EXPORT_PATH="$OPTARG" ;;
+ s) SIZE="$OPTARG" ;;
+ r) RESULTS_DIR="$OPTARG" ;;
+ n) NFS_VER="$OPTARG" ;;
+ j) FIO_JOBS_DIR="$OPTARG" ;;
+ D) MODES="1" ;;
+ p) PERF_LOCK=1 ;;
+ d) DRY_RUN=1 ;;
+ h) usage ;;
+ *) usage ;;
+ esac
+done
+
+# Auto-detect size: 2x total RAM
+if [ -z "$SIZE" ]; then
+ MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
+ MEM_GB=$(( MEM_KB / 1024 / 1024 ))
+ # Guard against rounding down to 0G on machines with < 1 GB RAM
+ [ "$MEM_GB" -ge 1 ] || MEM_GB=1
+ SIZE="$(( MEM_GB * 2 ))G"
+ echo "Auto-detected RAM: ${MEM_GB}G, using file size: ${SIZE}"
+fi
+
+
+log() {
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
+}
+
+run_cmd() {
+ if [ "$DRY_RUN" -eq 1 ]; then
+ echo " [DRY RUN] $*"
+ else
+ "$@"
+ fi
+}
+
+# Preflight checks
+preflight() {
+ log "=== Preflight checks ==="
+
+ if ! command -v fio &>/dev/null; then
+ echo "ERROR: fio not found in PATH"
+ exit 1
+ fi
+
+ # Check fio has nfs ioengine
+ if ! fio --enghelp=nfs &>/dev/null; then
+ echo "ERROR: fio does not have the nfs ioengine (needs libnfs)"
+ exit 1
+ fi
+
+ # Check debugfs knobs exist
+ for knob in "$IO_CACHE_READ" "$IO_CACHE_WRITE" "$DISABLE_SPLICE"; do
+ if [ ! -f "$knob" ]; then
+ echo "ERROR: $knob not found. Is the kernel new enough?"
+ exit 1
+ fi
+ done
+
+ # Check NFS server is exporting
+ if ! showmount -e localhost 2>/dev/null | grep -q "$EXPORT_PATH"; then
+ echo "WARNING: $EXPORT_PATH not in showmount output, proceeding anyway"
+ fi
+
+ # Print system info
+ echo "Kernel: $(uname -r)"
+ echo "RAM: $(awk '/MemTotal/ {printf "%.1f GB", $2/1024/1024}' /proc/meminfo)"
+ echo "Export: $EXPORT_PATH"
+ echo "NFS ver: $NFS_VER"
+ echo "File size: $SIZE"
+ echo "Results: $RESULTS_DIR"
+ echo ""
+}
+
+# Set server I/O mode via debugfs
+set_io_mode() {
+ local cache_write=$1
+ local cache_read=$2
+ local splice_off=$3
+
+ log "Setting io_cache_write=$cache_write io_cache_read=$cache_read disable-splice-read=$splice_off"
+ run_cmd bash -c "echo $cache_write > $IO_CACHE_WRITE"
+ run_cmd bash -c "echo $cache_read > $IO_CACHE_READ"
+ run_cmd bash -c "echo $splice_off > $DISABLE_SPLICE"
+}
+
+# Drop page cache on server
+drop_caches() {
+ log "Dropping page cache"
+ run_cmd bash -c "sync && echo 3 > /proc/sys/vm/drop_caches"
+ sleep 1
+}
+
+# Start background server monitoring
+start_monitors() {
+ local outdir=$1
+
+ log "Starting server monitors in $outdir"
+ run_cmd vmstat 1 > "${outdir}/vmstat.log" 2>&1 &
+ VMSTAT_PID=$!
+
+ run_cmd iostat -x 1 > "${outdir}/iostat.log" 2>&1 &
+ IOSTAT_PID=$!
+
+ # Sample /proc/meminfo every second
+ (while true; do
+ echo "=== $(date '+%s') ==="
+ cat /proc/meminfo
+ sleep 1
+ done) > "${outdir}/meminfo.log" 2>&1 &
+ MEMINFO_PID=$!
+}
+
+# Stop background monitors
+stop_monitors() {
+ log "Stopping monitors"
+ kill "$VMSTAT_PID" "$IOSTAT_PID" "$MEMINFO_PID" 2>/dev/null || true
+ wait "$VMSTAT_PID" "$IOSTAT_PID" "$MEMINFO_PID" 2>/dev/null || true
+}
+
+# perf lock profiling — uses BPF-based live contention tracing
+PERF_LOCK_PID=""
+
+start_perf_lock() {
+ local outdir=$1
+
+ if [ "$PERF_LOCK" -ne 1 ]; then
+ return
+ fi
+
+ log "Starting perf lock contention tracing"
+ perf lock contention -a -b --max-stack 8 \
+ > "${outdir}/perf-lock-contention.txt" 2>&1 &
+ PERF_LOCK_PID=$!
+}
+
+stop_perf_lock() {
+ local outdir=$1
+
+ if [ -z "$PERF_LOCK_PID" ]; then
+ return
+ fi
+
+ log "Stopping perf lock contention tracing"
+ kill -TERM "$PERF_LOCK_PID" 2>/dev/null || true
+ wait "$PERF_LOCK_PID" 2>/dev/null || true
+ PERF_LOCK_PID=""
+}
+
+# Run a single fio benchmark.
+# nfs_url is set in the job files; we pass --filename and --size on
+# the command line to vary the target file and data volume per run.
+# Pass "keep" as 5th arg to preserve the test file after the run.
+run_fio() {
+ local job_file=$1
+ local outdir=$2
+ local filename=$3
+ local fio_size=${4:-$SIZE}
+ local keep=${5:-}
+
+ local job_name
+ job_name=$(basename "$job_file" .fio)
+
+ log "Running fio job: $job_name -> $outdir (file=$filename size=$fio_size)"
+ mkdir -p "$outdir"
+
+ drop_caches
+ start_monitors "$outdir"
+ # Skip perf lock profiling for precreate/setup runs
+ [ "$keep" != "keep" ] && start_perf_lock "$outdir"
+
+ run_cmd fio "$job_file" \
+ --output-format=json \
+ --output="${outdir}/${job_name}.json" \
+ --filename="$filename" \
+ --size="$fio_size"
+
+ [ "$keep" != "keep" ] && stop_perf_lock "$outdir"
+ stop_monitors
+
+ log "Finished: $job_name"
+
+ # Clean up test file to free disk space unless told to keep it
+ if [ "$keep" != "keep" ]; then
+ cleanup_test_files "$filename"
+ fi
+}
+
+# Remove test files from the export to free disk space
+cleanup_test_files() {
+ local filename
+ for filename in "$@"; do
+ local filepath="${EXPORT_PATH}/${filename}"
+ log "Cleaning up: $filepath"
+ run_cmd rm -f "$filepath"
+ done
+}
+
+# Ensure parent directories exist under the export for a given filename
+ensure_export_dirs() {
+ local filename
+ for filename in "$@"; do
+ local dirpath="${EXPORT_PATH}/$(dirname "$filename")"
+ if [ "$dirpath" != "${EXPORT_PATH}/." ] && [ ! -d "$dirpath" ]; then
+ log "Creating directory: $dirpath"
+ run_cmd mkdir -p "$dirpath"
+ fi
+ done
+}
+
+# Mode name from numeric value
+mode_name() {
+ case $1 in
+ 0) echo "buffered" ;;
+ 1) echo "dontcache" ;;
+ 2) echo "direct" ;;
+ esac
+}
+
+########################################################################
+# Deliverable 1: Single-client fio benchmarks
+########################################################################
+run_deliverable1() {
+ log "=========================================="
+ log "Deliverable 1: Single-client fio benchmarks"
+ log "=========================================="
+
+ # Write test matrix:
+ # mode 0 (buffered): splice on (default)
+ # mode 1 (dontcache): splice off (required)
+ # mode 2 (direct): splice off (required)
+
+ # Sequential write
+ for wmode in $MODES; do
+ local mname
+ mname=$(mode_name $wmode)
+ local splice_off=0
+ [ "$wmode" -ne 0 ] && splice_off=1
+
+ drop_caches
+ set_io_mode "$wmode" 0 "$splice_off"
+ run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+ "${RESULTS_DIR}/seq-write/${mname}" \
+ "seq-write_testfile"
+ done
+
+ # Random write
+ for wmode in $MODES; do
+ local mname
+ mname=$(mode_name $wmode)
+ local splice_off=0
+ [ "$wmode" -ne 0 ] && splice_off=1
+
+ drop_caches
+ set_io_mode "$wmode" 0 "$splice_off"
+ run_fio "${FIO_JOBS_DIR}/rand-write.fio" \
+ "${RESULTS_DIR}/rand-write/${mname}" \
+ "rand-write_testfile"
+ done
+
+ # Sequential read — vary read mode, write stays buffered
+ # Pre-create the file for reading
+ log "Pre-creating sequential read test file"
+ set_io_mode 0 0 0
+ run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+ "${RESULTS_DIR}/seq-read/precreate" \
+ "seq-read_testfile" "$SIZE" "keep"
+
+ local last_mode
+ # shellcheck disable=SC2086
+ last_mode=$(echo $MODES | awk '{print $NF}')
+
+ for rmode in $MODES; do
+ local mname
+ mname=$(mode_name $rmode)
+ local splice_off=0
+ [ "$rmode" -ne 0 ] && splice_off=1
+ # Keep file for subsequent modes; clean up after last
+ local keep="keep"
+ [ "$rmode" = "$last_mode" ] && keep=""
+
+ drop_caches
+ set_io_mode 0 "$rmode" "$splice_off"
+ run_fio "${FIO_JOBS_DIR}/seq-read.fio" \
+ "${RESULTS_DIR}/seq-read/${mname}" \
+ "seq-read_testfile" "$SIZE" "$keep"
+ done
+
+ # Random read — vary read mode, write stays buffered
+ # Pre-create the file for reading
+ log "Pre-creating random read test file"
+ set_io_mode 0 0 0
+ run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+ "${RESULTS_DIR}/rand-read/precreate" \
+ "rand-read_testfile" "$SIZE" "keep"
+
+ for rmode in $MODES; do
+ local mname
+ mname=$(mode_name $rmode)
+ local splice_off=0
+ [ "$rmode" -ne 0 ] && splice_off=1
+ # Keep file for subsequent modes; clean up after last
+ local keep="keep"
+ [ "$rmode" = "$last_mode" ] && keep=""
+
+ drop_caches
+ set_io_mode 0 "$rmode" "$splice_off"
+ run_fio "${FIO_JOBS_DIR}/rand-read.fio" \
+ "${RESULTS_DIR}/rand-read/${mname}" \
+ "rand-read_testfile" "$SIZE" "$keep"
+ done
+}
+
+########################################################################
+# Deliverable 2: Multi-client (simulated with multiple fio jobs)
+########################################################################
+run_deliverable2() {
+ log "=========================================="
+ log "Deliverable 2: Noisy-neighbor benchmarks"
+ log "=========================================="
+
+ local num_clients=4
+ local client_size
+ local mem_kb
+ mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
+ # Each client gets RAM/num_clients so the aggregate write is roughly total RAM
+ client_size="$(( mem_kb / 1024 / num_clients ))M"
+
+ # Scenario A: Multiple writers
+ for mode in $MODES; do
+ local mname
+ mname=$(mode_name $mode)
+ local splice_off=0
+ [ "$mode" -ne 0 ] && splice_off=1
+ local outdir="${RESULTS_DIR}/multi-write/${mname}"
+ mkdir -p "$outdir"
+
+ set_io_mode "$mode" "$mode" "$splice_off"
+ drop_caches
+
+ # Ensure client directories exist on export
+ for i in $(seq 1 $num_clients); do
+ ensure_export_dirs "client${i}/testfile"
+ done
+
+ start_monitors "$outdir"
+ start_perf_lock "$outdir"
+
+ # Launch N parallel fio writers
+ local pids=()
+ for i in $(seq 1 $num_clients); do
+ run_cmd fio "${FIO_JOBS_DIR}/multi-write.fio" \
+ --output-format=json \
+ --output="${outdir}/client${i}.json" \
+ --filename="client${i}/testfile" \
+ --size="$client_size" &
+ pids+=($!)
+ done
+
+ # Wait for all
+ local rc=0
+ for pid in "${pids[@]}"; do
+ wait "$pid" || rc=$?
+ done
+
+ stop_perf_lock "$outdir"
+ stop_monitors
+ [ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+ # Clean up test files
+ for i in $(seq 1 $num_clients); do
+ cleanup_test_files "client${i}/testfile"
+ done
+ done
+
+ # Scenario C: Noisy writer + latency-sensitive readers
+ for mode in $MODES; do
+ local mname
+ mname=$(mode_name $mode)
+ local splice_off=0
+ [ "$mode" -ne 0 ] && splice_off=1
+ local outdir="${RESULTS_DIR}/noisy-neighbor/${mname}"
+ mkdir -p "$outdir"
+
+ set_io_mode "$mode" "$mode" "$splice_off"
+ drop_caches
+
+ # Pre-create read files for latency readers
+ for i in $(seq 1 $(( num_clients - 1 ))); do
+ ensure_export_dirs "reader${i}/readfile"
+ log "Pre-creating read file for reader $i"
+ run_fio "${FIO_JOBS_DIR}/multi-write.fio" \
+ "${outdir}/precreate_reader${i}" \
+ "reader${i}/readfile" \
+ "512M" "keep"
+ done
+ drop_caches
+ ensure_export_dirs "bulk/testfile"
+ start_monitors "$outdir"
+ start_perf_lock "$outdir"
+
+ # Noisy writer
+ run_cmd fio "${FIO_JOBS_DIR}/noisy-writer.fio" \
+ --output-format=json \
+ --output="${outdir}/noisy_writer.json" \
+ --filename="bulk/testfile" \
+ --size="$SIZE" &
+ local writer_pid=$!
+
+ # Latency-sensitive readers
+ local reader_pids=()
+ for i in $(seq 1 $(( num_clients - 1 ))); do
+ run_cmd fio "${FIO_JOBS_DIR}/lat-reader.fio" \
+ --output-format=json \
+ --output="${outdir}/reader${i}.json" \
+ --filename="reader${i}/readfile" \
+ --size="512M" &
+ reader_pids+=($!)
+ done
+
+ local rc=0
+ wait "$writer_pid" || rc=$?
+ for pid in "${reader_pids[@]}"; do
+ wait "$pid" || rc=$?
+ done
+
+ stop_perf_lock "$outdir"
+ stop_monitors
+ [ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+ # Clean up test files
+ cleanup_test_files "bulk/testfile"
+ for i in $(seq 1 $(( num_clients - 1 ))); do
+ cleanup_test_files "reader${i}/readfile"
+ done
+ done
+
+ # Scenario D: Mixed-mode noisy neighbor
+ # Test write/read mode combinations where the writer uses a
+ # cache-friendly mode and readers use buffered reads to benefit
+ # from warm cache.
+ local mixed_modes=(
+ # write_mode read_mode label
+ "1 0 dontcache-w_buffered-r"
+ )
+
+ for combo in "${mixed_modes[@]}"; do
+ local wmode rmode label
+ read -r wmode rmode label <<< "$combo"
+ local splice_off=0
+ [ "$wmode" -ne 0 ] && splice_off=1
+ local outdir="${RESULTS_DIR}/noisy-neighbor-mixed/${label}"
+ mkdir -p "$outdir"
+
+ set_io_mode "$wmode" "$rmode" "$splice_off"
+ drop_caches
+
+ # Pre-create read files for latency readers
+ for i in $(seq 1 $(( num_clients - 1 ))); do
+ ensure_export_dirs "reader${i}/readfile"
+ log "Pre-creating read file for reader $i"
+ run_fio "${FIO_JOBS_DIR}/multi-write.fio" \
+ "${outdir}/precreate_reader${i}" \
+ "reader${i}/readfile" \
+ "512M" "keep"
+ done
+ drop_caches
+ ensure_export_dirs "bulk/testfile"
+ start_monitors "$outdir"
+ start_perf_lock "$outdir"
+
+ # Noisy writer
+ run_cmd fio "${FIO_JOBS_DIR}/noisy-writer.fio" \
+ --output-format=json \
+ --output="${outdir}/noisy_writer.json" \
+ --filename="bulk/testfile" \
+ --size="$SIZE" &
+ local writer_pid=$!
+
+ # Latency-sensitive readers
+ local reader_pids=()
+ for i in $(seq 1 $(( num_clients - 1 ))); do
+ run_cmd fio "${FIO_JOBS_DIR}/lat-reader.fio" \
+ --output-format=json \
+ --output="${outdir}/reader${i}.json" \
+ --filename="reader${i}/readfile" \
+ --size="512M" &
+ reader_pids+=($!)
+ done
+
+ local rc=0
+ wait "$writer_pid" || rc=$?
+ for pid in "${reader_pids[@]}"; do
+ wait "$pid" || rc=$?
+ done
+
+ stop_perf_lock "$outdir"
+ stop_monitors
+ [ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+ # Clean up test files
+ cleanup_test_files "bulk/testfile"
+ for i in $(seq 1 $(( num_clients - 1 ))); do
+ cleanup_test_files "reader${i}/readfile"
+ done
+ done
+}
+
+########################################################################
+# Main
+########################################################################
+preflight
+
+TIMESTAMP=$(date '+%Y%m%d-%H%M%S')
+RESULTS_DIR="${RESULTS_DIR}/${TIMESTAMP}"
+mkdir -p "$RESULTS_DIR"
+
+# Save system info
+{
+ echo "Timestamp: $TIMESTAMP"
+ echo "Kernel: $(uname -r)"
+ echo "Hostname: $(hostname)"
+ echo "NFS version: $NFS_VER"
+ echo "File size: $SIZE"
+ echo "Export: $EXPORT_PATH"
+ cat /proc/meminfo
+} > "${RESULTS_DIR}/sysinfo.txt"
+
+log "Results will be saved to: $RESULTS_DIR"
+
+run_deliverable1
+run_deliverable2
+
+# Reset to defaults
+set_io_mode 0 0 0
+
+log "=========================================="
+log "All benchmarks complete."
+log "Results in: $RESULTS_DIR"
+log "Run: scripts/parse-results.sh $RESULTS_DIR"
+log "=========================================="
diff --git a/tools/testing/nfsd-io-bench/scripts/setup-server.sh b/tools/testing/nfsd-io-bench/scripts/setup-server.sh
new file mode 100755
index 000000000000..0efdd74a705e
--- /dev/null
+++ b/tools/testing/nfsd-io-bench/scripts/setup-server.sh
@@ -0,0 +1,94 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# One-time setup script for the NFS test server.
+# Run this once before running benchmarks.
+#
+# Usage: sudo ./setup-server.sh [EXPORT_PATH]
+
+set -euo pipefail
+
+EXPORT_PATH="${1:-/export}"
+FSTYPE="ext4"
+
+log() {
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
+}
+
+if [ "$(id -u)" -ne 0 ]; then
+ echo "ERROR: must run as root"
+ exit 1
+fi
+
+# Check for required tools
+for cmd in fio exportfs showmount jq; do
+ if ! command -v "$cmd" &>/dev/null; then
+ echo "WARNING: $cmd not found, attempting install"
+ # Best effort: exportfs/showmount ship in nfs-utils (Fedora)
+ # or nfs-common (Debian), not in a package named after the
+ # command, so this may fail and require a manual install.
+ dnf install -y "$cmd" 2>/dev/null || \
+ apt-get install -y "$cmd" 2>/dev/null || \
+ echo "ERROR: cannot install $cmd, please install manually"
+ fi
+done
+
+# Check fio has nfs ioengine
+if ! fio --enghelp=nfs &>/dev/null; then
+ echo "ERROR: fio nfs ioengine not available."
+ echo "You may need to install fio with libnfs support."
+ echo "Try: dnf install fio libnfs-devel (or build fio from source with --enable-nfs)"
+ exit 1
+fi
+
+# Create export directory if needed
+if [ ! -d "$EXPORT_PATH" ]; then
+ log "Creating export directory: $EXPORT_PATH"
+ mkdir -p "$EXPORT_PATH"
+fi
+
+# Create subdirectories for multi-client tests
+for i in 1 2 3 4; do
+ mkdir -p "${EXPORT_PATH}/client${i}"
+ mkdir -p "${EXPORT_PATH}/reader${i}"
+done
+mkdir -p "${EXPORT_PATH}/bulk"
+
+# Check if already exported
+if ! exportfs -s 2>/dev/null | grep -q "$EXPORT_PATH"; then
+ log "Adding NFS export for $EXPORT_PATH"
+ if ! grep -q "$EXPORT_PATH" /etc/exports 2>/dev/null; then
+ echo "${EXPORT_PATH} 127.0.0.1/32(rw,sync,no_root_squash,no_subtree_check)" >> /etc/exports
+ fi
+ exportfs -ra
+fi
+
+# Ensure NFS server is running
+if ! systemctl is-active --quiet nfs-server 2>/dev/null; then
+ log "Starting NFS server"
+ systemctl start nfs-server
+fi
+
+# Verify export
+log "Current exports:"
+showmount -e localhost
+
+# Check debugfs knobs
+log "Checking debugfs knobs:"
+DEBUGFS_BASE="/sys/kernel/debug/nfsd"
+for knob in io_cache_read io_cache_write disable-splice-read; do
+ if [ -f "${DEBUGFS_BASE}/${knob}" ]; then
+ echo " ${knob} = $(cat "${DEBUGFS_BASE}/${knob}")"
+ else
+ echo " ${knob}: NOT FOUND (kernel may be too old)"
+ fi
+done
+
+# Print system summary
+echo ""
+log "=== System Summary ==="
+echo "Kernel: $(uname -r)"
+echo "RAM: $(awk '/MemTotal/ {printf "%.1f GB", $2/1024/1024}' /proc/meminfo)"
+echo "Export: $EXPORT_PATH"
+echo "Filesystem: $(df -T "$EXPORT_PATH" | awk 'NR==2 {print $2}')"
+echo "Disk: $(df -h "$EXPORT_PATH" | awk 'NR==2 {print $2, "total,", $4, "free"}')"
+echo ""
+log "Setup complete. Run benchmarks with:"
+echo " sudo ./scripts/run-benchmarks.sh -e $EXPORT_PATH"
--
2.54.0
* [PATCH v4 4/4] testing: add dontcache-bench local filesystem benchmark suite
2026-05-01 9:49 [PATCH v4 0/4] mm: improve write performance with RWF_DONTCACHE Jeff Layton
` (2 preceding siblings ...)
2026-05-01 9:49 ` [PATCH v4 3/4] testing: add nfsd-io-bench NFS server benchmark suite Jeff Layton
@ 2026-05-01 9:49 ` Jeff Layton
3 siblings, 0 replies; 9+ messages in thread
From: Jeff Layton @ 2026-05-01 9:49 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara,
Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
Ritesh Harjani, Chuck Lever
Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm, Jeff Layton
Add a benchmark suite for testing IOCB_DONTCACHE on local filesystems
via fio's io_uring engine with the RWF_DONTCACHE flag.
The suite mirrors the nfsd-io-bench test matrix but uses io_uring with
the "uncached" fio option instead of NFSD debugfs mode switching:
- uncached=0: standard buffered I/O
- uncached=1: RWF_DONTCACHE
- Mode 2 uses O_DIRECT via fio's --direct=1
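The mode-to-flag translation above can be sketched as a small helper. This is an illustration only: the helper name is hypothetical, and it assumes a fio build that supports the "uncached" option alongside --direct.

```shell
#!/bin/bash
# Hypothetical helper: translate the numeric I/O mode used by the
# benchmark scripts into the fio options described above. A sketch
# only; the actual run-benchmarks.sh may name things differently.
mode_fio_args() {
	case "$1" in
	0) echo "--uncached=0 --direct=0" ;;  # standard buffered I/O
	1) echo "--uncached=1 --direct=0" ;;  # RWF_DONTCACHE
	2) echo "--uncached=0 --direct=1" ;;  # O_DIRECT
	*) echo "unknown mode: $1" >&2; return 1 ;;
	esac
}
```

A runner could then append $(mode_fio_args "$mode") to the fio command line instead of flipping NFSD debugfs knobs.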
Additionally, this benchmark includes:
- a benchmark for competing buffered vs. dontcache writers to
different files on the same backing device.
- a benchmark mirroring Jens Axboe's original RWF_UNCACHED write test:
32 concurrent writers with 64K block size, time-based (300s), with
per-second bandwidth logging
Includes fio job files, run-benchmarks.sh, and parse-results.sh.
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
.../dontcache-bench/fio-jobs/axboe-write.fio | 14 +
.../dontcache-bench/fio-jobs/lat-reader.fio | 12 +
.../dontcache-bench/fio-jobs/multi-write.fio | 11 +
.../dontcache-bench/fio-jobs/noisy-writer.fio | 12 +
.../testing/dontcache-bench/fio-jobs/rand-read.fio | 13 +
.../dontcache-bench/fio-jobs/rand-write.fio | 13 +
.../testing/dontcache-bench/fio-jobs/seq-read.fio | 13 +
.../testing/dontcache-bench/fio-jobs/seq-write.fio | 13 +
.../dontcache-bench/scripts/parse-results.sh | 346 +++++++++++
.../dontcache-bench/scripts/run-benchmarks.sh | 643 +++++++++++++++++++++
10 files changed, 1090 insertions(+)
diff --git a/tools/testing/dontcache-bench/fio-jobs/axboe-write.fio b/tools/testing/dontcache-bench/fio-jobs/axboe-write.fio
new file mode 100644
index 000000000000..7cabcb740f0d
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/axboe-write.fio
@@ -0,0 +1,14 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=64k
+numjobs=32
+time_based
+runtime=300
+rw=write
+group_reporting=0
+filename_format=$jobname.$jobnum
+log_avg_msec=1000
+write_bw_log=axboe-write
+
+[axboe-write]
diff --git a/tools/testing/dontcache-bench/fio-jobs/lat-reader.fio b/tools/testing/dontcache-bench/fio-jobs/lat-reader.fio
new file mode 100644
index 000000000000..e221e7aedec9
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/lat-reader.fio
@@ -0,0 +1,12 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=4k
+numjobs=1
+time_based=0
+rw=read
+log_avg_msec=1000
+write_bw_log=latreader
+write_lat_log=latreader
+
+[latreader]
diff --git a/tools/testing/dontcache-bench/fio-jobs/multi-write.fio b/tools/testing/dontcache-bench/fio-jobs/multi-write.fio
new file mode 100644
index 000000000000..c9cd11ec40fd
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/multi-write.fio
@@ -0,0 +1,11 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=1M
+numjobs=4
+time_based=0
+rw=write
+group_reporting=0
+filename_format=$jobname.$jobnum
+
+[multiwrite]
diff --git a/tools/testing/dontcache-bench/fio-jobs/noisy-writer.fio b/tools/testing/dontcache-bench/fio-jobs/noisy-writer.fio
new file mode 100644
index 000000000000..4524eebd4642
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/noisy-writer.fio
@@ -0,0 +1,12 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=1M
+numjobs=1
+time_based=0
+rw=write
+log_avg_msec=1000
+write_bw_log=noisywriter
+write_lat_log=noisywriter
+
+[noisywriter]
diff --git a/tools/testing/dontcache-bench/fio-jobs/rand-read.fio b/tools/testing/dontcache-bench/fio-jobs/rand-read.fio
new file mode 100644
index 000000000000..e281fa82b86a
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/rand-read.fio
@@ -0,0 +1,13 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=4k
+numjobs=1
+iodepth=16
+time_based=0
+rw=randread
+log_avg_msec=1000
+write_bw_log=randread
+write_lat_log=randread
+
+[randread]
diff --git a/tools/testing/dontcache-bench/fio-jobs/rand-write.fio b/tools/testing/dontcache-bench/fio-jobs/rand-write.fio
new file mode 100644
index 000000000000..cf53bc6f14b9
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/rand-write.fio
@@ -0,0 +1,13 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=4k
+numjobs=1
+iodepth=16
+time_based=0
+rw=randwrite
+log_avg_msec=1000
+write_bw_log=randwrite
+write_lat_log=randwrite
+
+[randwrite]
diff --git a/tools/testing/dontcache-bench/fio-jobs/seq-read.fio b/tools/testing/dontcache-bench/fio-jobs/seq-read.fio
new file mode 100644
index 000000000000..ef87921465a7
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/seq-read.fio
@@ -0,0 +1,13 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=1M
+numjobs=1
+iodepth=16
+time_based=0
+rw=read
+log_avg_msec=1000
+write_bw_log=seqread
+write_lat_log=seqread
+
+[seqread]
diff --git a/tools/testing/dontcache-bench/fio-jobs/seq-write.fio b/tools/testing/dontcache-bench/fio-jobs/seq-write.fio
new file mode 100644
index 000000000000..da3082f9b391
--- /dev/null
+++ b/tools/testing/dontcache-bench/fio-jobs/seq-write.fio
@@ -0,0 +1,13 @@
+[global]
+ioengine=io_uring
+direct=0
+bs=1M
+numjobs=1
+iodepth=16
+time_based=0
+rw=write
+log_avg_msec=1000
+write_bw_log=seqwrite
+write_lat_log=seqwrite
+
+[seqwrite]
diff --git a/tools/testing/dontcache-bench/scripts/parse-results.sh b/tools/testing/dontcache-bench/scripts/parse-results.sh
new file mode 100755
index 000000000000..ba43a039153f
--- /dev/null
+++ b/tools/testing/dontcache-bench/scripts/parse-results.sh
@@ -0,0 +1,346 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Parse fio JSON output and generate comparison tables.
+#
+# Usage: ./parse-results.sh <results-dir>
+
+set -euo pipefail
+
+if [ $# -lt 1 ]; then
+ echo "Usage: $0 <results-dir>"
+ exit 1
+fi
+
+RESULTS_DIR="$1"
+
+if ! command -v jq &>/dev/null; then
+ echo "ERROR: jq is required"
+ exit 1
+fi
+
+# Extract metrics from a single fio JSON result
+extract_metrics() {
+ local json_file=$1
+ local rw_type=$2 # read or write
+
+ if [ ! -f "$json_file" ]; then
+ echo "N/A N/A N/A N/A N/A N/A"
+ return
+ fi
+
+ jq -r --arg rw "$rw_type" '
+ .jobs[0][$rw] as $d |
+ [
+ (($d.bw // 0) / 1024 | . * 10 | round / 10), # MB/s
+ ($d.iops // 0), # IOPS
+ ((($d.clat_ns.mean // 0) / 1000) | . * 10 | round / 10), # avg lat us
+ (($d.clat_ns.percentile["50.000000"] // 0) / 1000), # p50 us
+ (($d.clat_ns.percentile["99.000000"] // 0) / 1000), # p99 us
+ (($d.clat_ns.percentile["99.900000"] // 0) / 1000) # p99.9 us
+ ] | @tsv
+ ' "$json_file" 2>/dev/null || echo "N/A N/A N/A N/A N/A N/A"
+}
+
+# Extract server CPU from vmstat log (average sys%)
+extract_cpu() {
+ local vmstat_log=$1
+ if [ ! -f "$vmstat_log" ]; then
+ echo "N/A"
+ return
+ fi
+ # Field 14 of vmstat output is the "sy" (system CPU) column;
+ # NR>2 skips the two header lines
+ awk 'NR>2 {sum+=$14; n++} END {if(n>0) printf "%.1f", sum/n; else print "N/A"}' \
+ "$vmstat_log" 2>/dev/null || echo "N/A"
+}
+
+# Extract peak dirty pages from meminfo log
+extract_peak_dirty() {
+ local meminfo_log=$1
+ if [ ! -f "$meminfo_log" ]; then
+ echo "N/A"
+ return
+ fi
+ grep "^Dirty:" "$meminfo_log" | awk '{print $2}' | sort -n | tail -1 || echo "N/A"
+}
+
+# Extract peak cached from meminfo log
+extract_peak_cached() {
+ local meminfo_log=$1
+ if [ ! -f "$meminfo_log" ]; then
+ echo "N/A"
+ return
+ fi
+ grep "^Cached:" "$meminfo_log" | awk '{print $2}' | sort -n | tail -1 || echo "N/A"
+}
+
+print_separator() {
+ printf '%*s\n' 120 '' | tr ' ' '-'
+}
+
+########################################################################
+# Deliverable 1: Single-client results
+########################################################################
+echo ""
+echo "=================================================================="
+echo " Deliverable 1: Single-Client fio Benchmarks"
+echo "=================================================================="
+echo ""
+
+for workload in seq-write rand-write seq-read rand-read; do
+ case $workload in
+ seq-write|rand-write) rw_type="write" ;;
+ seq-read|rand-read) rw_type="read" ;;
+ esac
+
+ echo "--- $workload ---"
+ printf "%-16s %10s %10s %10s %10s %10s %10s %10s %12s %12s\n" \
+ "Mode" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)" "Sys CPU%" "PeakDirty(kB)" "PeakCache(kB)"
+ print_separator
+
+ for mode in buffered dontcache direct; do
+ dir="${RESULTS_DIR}/${workload}/${mode}"
+ json_file=$(find "$dir" -name '*.json' -not -name 'client*' 2>/dev/null | head -1 || true)
+ if [ -z "$json_file" ]; then
+ printf "%-16s %10s\n" "$mode" "(no data)"
+ continue
+ fi
+
+ read -r mbps iops avg_lat p50 p99 p999 <<< \
+ "$(extract_metrics "$json_file" "$rw_type")"
+ cpu=$(extract_cpu "${dir}/vmstat.log")
+ dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+ cached=$(extract_peak_cached "${dir}/meminfo.log")
+
+ printf "%-16s %10s %10s %10s %10s %10s %10s %10s %12s %12s\n" \
+ "$mode" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999" \
+ "$cpu" "${dirty:-N/A}" "${cached:-N/A}"
+ done
+ echo ""
+done
+
+########################################################################
+# Deliverable 2: Multi-client results
+########################################################################
+echo "=================================================================="
+echo " Deliverable 2: Noisy-Neighbor Benchmarks"
+echo "=================================================================="
+echo ""
+
+# Scenario A: Multiple writers
+echo "--- Scenario A: Multiple Writers ---"
+for mode in buffered dontcache direct; do
+ dir="${RESULTS_DIR}/multi-write/${mode}"
+ if [ ! -d "$dir" ]; then
+ continue
+ fi
+
+ json_file=$(find "$dir" -name '*.json' 2>/dev/null | head -1 || true)
+ if [ -z "$json_file" ] || [ ! -f "$json_file" ]; then
+ echo " Mode: $mode (no data)"
+ continue
+ fi
+
+ echo " Mode: $mode"
+ printf " %-10s %10s %10s %10s %10s %10s %10s\n" \
+ "Job" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+ # Parse per-job stats from the single fio JSON output
+ jq -r '.jobs[] |
+ [
+ .jobname,
+ ((.write.bw // 0) / 1024 | . * 10 | round / 10),
+ (.write.iops // 0),
+ (((.write.clat_ns.mean // 0) / 1000) | . * 10 | round / 10),
+ ((.write.clat_ns.percentile["50.000000"] // 0) / 1000),
+ ((.write.clat_ns.percentile["99.000000"] // 0) / 1000),
+ ((.write.clat_ns.percentile["99.900000"] // 0) / 1000)
+ ] | @tsv
+ ' "$json_file" 2>/dev/null | while IFS=$'\t' read -r name mbps iops avg_lat p50 p99 p999; do
+ printf " %-10s %10s %10s %10s %10s %10s %10s\n" \
+ "$name" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+ done
+
+ # Aggregate bandwidth
+ total_bw=$(jq '[.jobs[].write.bw // 0] | add / 1024 | . * 10 | round / 10' \
+ "$json_file" 2>/dev/null || echo "N/A")
+ cpu=$(extract_cpu "${dir}/vmstat.log")
+ dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+ printf " Aggregate BW: %s MB/s | Sys CPU: %s%% | Peak Dirty: %s kB\n" \
+ "$total_bw" "$cpu" "${dirty:-N/A}"
+ echo ""
+done
+
+# Scenario C: Noisy neighbor
+echo "--- Scenario C: Noisy Writer + Latency-Sensitive Readers ---"
+for mode in buffered dontcache direct; do
+ dir="${RESULTS_DIR}/noisy-neighbor/${mode}"
+ if [ ! -d "$dir" ]; then
+ continue
+ fi
+
+ echo " Mode: $mode"
+ printf " %-14s %10s %10s %10s %10s %10s %10s\n" \
+ "Job" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+ # Writer
+ if [ -f "${dir}/noisy_writer.json" ]; then
+ read -r mbps iops avg_lat p50 p99 p999 <<< \
+ "$(extract_metrics "${dir}/noisy_writer.json" "write")"
+ printf " %-14s %10s %10s %10s %10s %10s %10s\n" \
+ "Bulk writer" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+ fi
+
+ # Readers
+ for json_file in "${dir}"/reader*.json; do
+ [ -f "$json_file" ] || continue
+ reader=$(basename "$json_file" .json)
+ read -r mbps iops avg_lat p50 p99 p999 <<< \
+ "$(extract_metrics "$json_file" "read")"
+ printf " %-14s %10s %10s %10s %10s %10s %10s\n" \
+ "$reader" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+ done
+
+ cpu=$(extract_cpu "${dir}/vmstat.log")
+ dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+ printf " Sys CPU: %s%% | Peak Dirty: %s kB\n" "$cpu" "${dirty:-N/A}"
+ echo ""
+done
+
+# Scenario D: Mixed-mode noisy neighbor
+echo "--- Scenario D: Mixed-Mode Noisy Writer + Readers ---"
+for dir in "${RESULTS_DIR}"/noisy-neighbor-mixed/*/; do
+ [ -d "$dir" ] || continue
+ label=$(basename "$dir")
+
+ echo " Mode: $label"
+ printf " %-14s %10s %10s %10s %10s %10s %10s\n" \
+ "Job" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+ # Writer
+ if [ -f "${dir}/noisy_writer.json" ]; then
+ read -r mbps iops avg_lat p50 p99 p999 <<< \
+ "$(extract_metrics "${dir}/noisy_writer.json" "write")"
+ printf " %-14s %10s %10s %10s %10s %10s %10s\n" \
+ "Bulk writer" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+ fi
+
+ # Readers
+ for json_file in "${dir}"/reader*.json; do
+ [ -f "$json_file" ] || continue
+ reader=$(basename "$json_file" .json)
+ read -r mbps iops avg_lat p50 p99 p999 <<< \
+ "$(extract_metrics "$json_file" "read")"
+ printf " %-14s %10s %10s %10s %10s %10s %10s\n" \
+ "$reader" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+ done
+
+ cpu=$(extract_cpu "${dir}/vmstat.log")
+ dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+ printf " Sys CPU: %s%% | Peak Dirty: %s kB\n" "$cpu" "${dirty:-N/A}"
+ echo ""
+done
+
+# Scenario E: Competing writers
+echo "--- Scenario E: Competing Writers (Separate Files) ---"
+for dir in "${RESULTS_DIR}"/competing-writers/*/; do
+ [ -d "$dir" ] || continue
+ label=$(basename "$dir")
+
+ echo " Mode: $label"
+ printf " %-20s %10s %10s %10s %10s %10s %10s\n" \
+ "Writer" "MB/s" "IOPS" "Avg(us)" "p50(us)" "p99(us)" "p99.9(us)"
+
+ total_bw=0
+ for json_file in "${dir}"/writer*.json; do
+ [ -f "$json_file" ] || continue
+ writer=$(basename "$json_file" .json)
+ read -r mbps iops avg_lat p50 p99 p999 <<< \
+ "$(extract_metrics "$json_file" "write")"
+ printf " %-20s %10s %10s %10s %10s %10s %10s\n" \
+ "$writer" "$mbps" "$iops" "$avg_lat" "$p50" "$p99" "$p999"
+ total_bw=$(echo "$total_bw + ${mbps:-0}" | bc 2>/dev/null || echo "$total_bw")
+ done
+
+ cpu=$(extract_cpu "${dir}/vmstat.log")
+ dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+ printf " Aggregate BW: %s MB/s | Sys CPU: %s%% | Peak Dirty: %s kB\n" \
+ "$total_bw" "$cpu" "${dirty:-N/A}"
+ echo ""
+done
+
+########################################################################
+# Deliverable 3: Axboe 32-file write benchmark
+########################################################################
+echo "=================================================================="
+echo " Deliverable 3: 32-File Write (Axboe Test)"
+echo "=================================================================="
+echo ""
+
+for mode in buffered dontcache direct; do
+ dir="${RESULTS_DIR}/axboe-write/${mode}"
+ if [ ! -d "$dir" ]; then
+ continue
+ fi
+
+ json_file=$(find "$dir" -name '*.json' 2>/dev/null | head -1 || true)
+ if [ -z "$json_file" ] || [ ! -f "$json_file" ]; then
+ echo "--- $mode: (no data) ---"
+ continue
+ fi
+
+ echo "--- $mode ---"
+
+ # Aggregate stats across all 32 jobs
+ agg_bw=$(jq '[.jobs[].write.bw // 0] | add / 1024 | . * 10 | round / 10' \
+ "$json_file" 2>/dev/null || echo "N/A")
+ agg_iops=$(jq '[.jobs[].write.iops // 0] | add | round' \
+ "$json_file" 2>/dev/null || echo "N/A")
+
+ # Average latency across jobs
+ avg_lat=$(jq '[.jobs[].write.clat_ns.mean // 0] | (add / length / 1000) |
+ . * 10 | round / 10' "$json_file" 2>/dev/null || echo "N/A")
+ avg_p50=$(jq '[.jobs[].write.clat_ns.percentile["50.000000"] // 0] |
+ (add / length / 1000) | round' "$json_file" 2>/dev/null || echo "N/A")
+ avg_p99=$(jq '[.jobs[].write.clat_ns.percentile["99.000000"] // 0] |
+ (add / length / 1000) | round' "$json_file" 2>/dev/null || echo "N/A")
+ avg_p999=$(jq '[.jobs[].write.clat_ns.percentile["99.900000"] // 0] |
+ (add / length / 1000) | round' "$json_file" 2>/dev/null || echo "N/A")
+
+ printf " Aggregate BW: %s MB/s | IOPS: %s\n" "$agg_bw" "$agg_iops"
+ printf " Avg Latency: %s us | p50: %s us | p99: %s us | p99.9: %s us\n" \
+ "$avg_lat" "$avg_p50" "$avg_p99" "$avg_p999"
+
+ cpu=$(extract_cpu "${dir}/vmstat.log")
+ dirty=$(extract_peak_dirty "${dir}/meminfo.log")
+ cached=$(extract_peak_cached "${dir}/meminfo.log")
+ printf " Sys CPU: %s%% | Peak Dirty: %s kB | Peak Cached: %s kB\n" \
+ "$cpu" "${dirty:-N/A}" "${cached:-N/A}"
+
+ # Per-second bandwidth from fio bw log (shows the page-cache cliff)
+ bw_log=$(find "$dir" -name '*_bw.*.log' 2>/dev/null | head -1 || true)
+ if [ -n "$bw_log" ] && [ -f "$bw_log" ]; then
+ echo " Per-second aggregate BW (MB/s):"
+ # fio bw log fields: time (msec), bw (KiB/s), direction, blocksize.
+ # Sum samples from all per-job logs; asorti() below requires gawk.
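+ # Illustrative example: a sample logged at msec=500 maps to bucket
+ # sec = int(500/1000) + 1 = 1, one at msec=1500 to bucket 2, so each
+ # printed row is the aggregate bandwidth over one wall-clock second.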
+ for logfile in "${dir}"/*_bw.*.log; do
+ [ -f "$logfile" ] || continue
+ cat "$logfile"
+ done | awk -F',' '{
+ sec = int($1 / 1000) + 1
+ bw[sec] += $2
+ } END {
+ n = asorti(bw, sorted, "@ind_num_asc")
+ for (i = 1; i <= n; i++)
+ printf " %2ds: %.0f MB/s\n", sorted[i], bw[sorted[i]] / 1024
+ }'
+ fi
+ echo ""
+done
+
+echo "=================================================================="
+echo " System Info"
+echo "=================================================================="
+if [ -f "${RESULTS_DIR}/sysinfo.txt" ]; then
+ head -6 "${RESULTS_DIR}/sysinfo.txt"
+fi
+echo ""
diff --git a/tools/testing/dontcache-bench/scripts/run-benchmarks.sh b/tools/testing/dontcache-bench/scripts/run-benchmarks.sh
new file mode 100755
index 000000000000..e7278567e1a5
--- /dev/null
+++ b/tools/testing/dontcache-bench/scripts/run-benchmarks.sh
@@ -0,0 +1,643 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Local filesystem I/O mode benchmark suite.
+#
+# Runs the buffered/dontcache/direct test matrix on a local filesystem
+# using fio's io_uring engine with the RWF_DONTCACHE flag (rather than
+# NFSD's debugfs mode knobs).
+#
+# Usage: ./run-benchmarks.sh [options]
+#   -t <dir>   Test directory (must be on a filesystem supporting FOP_DONTCACHE)
+#   -s <size>  File size (default: auto-sized to exceed RAM)
+#   -f <path>  Path to fio binary (default: fio in PATH)
+#   -o <dir>   Output directory for results (default: ./results/<timestamp>)
+#   -D         Dontcache only (skip buffered and direct tests)
+#   -p         Profile kernel lock contention with perf lock
+#   -d         Dry run (print commands without executing)
+
+set -euo pipefail
+
+# Defaults
+TEST_DIR=""
+SIZE=""
+FIO_BIN="fio"
+RESULTS_DIR=""
+DRY_RUN=0
+MODES="0 1 2"
+PERF_LOCK=0
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+FIO_JOBS_DIR="${SCRIPT_DIR}/../fio-jobs"
+
+usage() {
+ echo "Usage: $0 -t <test-dir> [-s <size>] [-f <fio-path>] [-o <output-dir>] [-D] [-p] [-d]"
+ echo ""
+ echo " -t <dir> Test directory (required, must support RWF_DONTCACHE)"
+ echo " -s <size> File size (default: 2x RAM)"
+ echo " -f <path> Path to fio binary (default: fio)"
+ echo " -o <dir> Output directory (default: ./results/<timestamp>)"
+ echo " -D Dontcache only (skip buffered and direct tests)"
+ echo " -p Profile kernel lock contention with perf lock"
+ echo " -d Dry run"
+ exit 1
+}
+
+while getopts "t:s:f:o:Dpdh" opt; do
+ case $opt in
+ t) TEST_DIR="$OPTARG" ;;
+ s) SIZE="$OPTARG" ;;
+ f) FIO_BIN="$OPTARG" ;;
+ o) RESULTS_DIR="$OPTARG" ;;
+ D) MODES="1" ;;
+ p) PERF_LOCK=1 ;;
+ d) DRY_RUN=1 ;;
+ h) usage ;;
+ *) usage ;;
+ esac
+done
+
+if [ -z "$TEST_DIR" ]; then
+ echo "ERROR: -t <test-dir> is required"
+ usage
+fi
+
+# Auto-size to 2x RAM if not specified
+if [ -z "$SIZE" ]; then
+ mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
+ SIZE="$(( mem_kb * 2 / 1024 ))M"
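+ # Illustrative example (hypothetical host): MemTotal of 8388608 kB
+ # (8 GiB) yields SIZE="16384M", i.e. 2x RAM, so the working set
+ # cannot be absorbed by the page cache.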
+fi
+
+if [ -z "$RESULTS_DIR" ]; then
+ RESULTS_DIR="./results/local-$(date +%Y%m%d-%H%M%S)"
+fi
+
+mkdir -p "$RESULTS_DIR"
+
+log() {
+ echo "[$(date '+%H:%M:%S')] $*"
+}
+
+run_cmd() {
+ if [ "$DRY_RUN" -eq 1 ]; then
+ echo " [DRY RUN] $*"
+ else
+ "$@"
+ fi
+}
+
+# I/O mode definitions:
+# buffered: direct=0, uncached=0
+# dontcache: direct=0, uncached=1
+# direct: direct=1, uncached=0
+#
+# Mode name from numeric value
+mode_name() {
+ case $1 in
+ 0) echo "buffered" ;;
+ 1) echo "dontcache" ;;
+ 2) echo "direct" ;;
+ esac
+}
+
+# Return fio command-line flags for a given mode.
+# "direct" is a standard fio option and works on the command line.
+# "uncached" is an io_uring engine option that must be in the job file,
+# so we inject it via make_job_file() below.
+mode_fio_args() {
+ case $1 in
+ 0) echo "--direct=0" ;; # buffered
+ 1) echo "--direct=0" ;; # dontcache
+ 2) echo "--direct=1" ;; # direct
+ esac
+}
+
+# Return the uncached= value for a given mode.
+mode_uncached() {
+ case $1 in
+ 0) echo "0" ;;
+ 1) echo "1" ;;
+ 2) echo "0" ;;
+ esac
+}
+
+# Create a temporary job file with uncached=N injected into [global].
+# For uncached=0 (buffered/direct), return the original file unchanged.
+make_job_file() {
+ local job_file=$1
+ local uncached=$2
+
+ if [ "$uncached" -eq 0 ]; then
+ echo "$job_file"
+ return
+ fi
+
+ local tmp
+ tmp=$(mktemp)
+ sed "/^\[global\]/a uncached=${uncached}" "$job_file" > "$tmp"
+ echo "$tmp"
+}
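+# Illustrative example: given a job file beginning with
+#	[global]
+#	rw=write
+# make_job_file <file> 1 emits a temp copy reading
+#	[global]
+#	uncached=1
+#	rw=write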
+
+drop_caches() {
+ run_cmd bash -c "sync && echo 3 > /proc/sys/vm/drop_caches"
+}
+
+# perf lock profiling — uses BPF-based live contention tracing
+PERF_LOCK_PID=""
+
+start_perf_lock() {
+ local outdir=$1
+
+ if [ "$PERF_LOCK" -ne 1 ]; then
+ return
+ fi
+
+ log "Starting perf lock contention tracing"
+ perf lock contention -a -b --max-stack 8 \
+ > "${outdir}/perf-lock-contention.txt" 2>&1 &
+ PERF_LOCK_PID=$!
+}
+
+stop_perf_lock() {
+ local outdir=$1
+
+ if [ -z "$PERF_LOCK_PID" ]; then
+ return
+ fi
+
+ log "Stopping perf lock contention tracing"
+ kill -TERM "$PERF_LOCK_PID" 2>/dev/null || true
+ wait "$PERF_LOCK_PID" 2>/dev/null || true
+ PERF_LOCK_PID=""
+}
+
+# Background monitors
+VMSTAT_PID=""
+IOSTAT_PID=""
+MEMINFO_PID=""
+
+start_monitors() {
+ local outdir=$1
+ log "Starting monitors in $outdir"
+ run_cmd vmstat 1 > "${outdir}/vmstat.log" 2>&1 &
+ VMSTAT_PID=$!
+ run_cmd iostat -x 1 > "${outdir}/iostat.log" 2>&1 &
+ IOSTAT_PID=$!
+ (while true; do
+ echo "=== $(date '+%s') ==="
+ cat /proc/meminfo
+ sleep 1
+ done) > "${outdir}/meminfo.log" 2>&1 &
+ MEMINFO_PID=$!
+}
+
+stop_monitors() {
+ log "Stopping monitors"
+ kill "$VMSTAT_PID" "$IOSTAT_PID" "$MEMINFO_PID" 2>/dev/null || true
+ wait "$VMSTAT_PID" "$IOSTAT_PID" "$MEMINFO_PID" 2>/dev/null || true
+}
+
+cleanup_test_files() {
+ local filepath="${TEST_DIR}/$1"
+ log "Cleaning up $filepath"
+ run_cmd rm -f "$filepath"
+}
+
+# Run a single fio benchmark
+run_fio() {
+ local job_file=$1
+ local outdir=$2
+ local filename=$3
+ local fio_size=${4:-$SIZE}
+ local keep=${5:-}
+ local extra_args=${6:-}
+ local uncached=${7:-0}
+
+ # Inject uncached=N into the job file if needed
+ local actual_job
+ actual_job=$(make_job_file "$job_file" "$uncached")
+
+ local job_name
+ job_name=$(basename "$job_file" .fio)
+
+ log "Running fio job: $job_name -> $outdir (file=${TEST_DIR}/$filename size=$fio_size)"
+ mkdir -p "$outdir"
+
+ drop_caches
+ start_monitors "$outdir"
+ # Skip perf lock profiling for precreate/setup runs
+ [ "$keep" != "keep" ] && start_perf_lock "$outdir"
+
+ # shellcheck disable=SC2086
+ run_cmd "$FIO_BIN" "$actual_job" \
+ --output-format=json \
+ --output="${outdir}/${job_name}.json" \
+ --filename="${TEST_DIR}/$filename" \
+ --size="$fio_size" \
+ $extra_args
+
+ [ "$keep" != "keep" ] && stop_perf_lock "$outdir"
+ stop_monitors
+ log "Finished: $job_name"
+
+ # Clean up temp job file if one was created
+ [ "$actual_job" != "$job_file" ] && rm -f "$actual_job"
+
+ if [ "$keep" != "keep" ]; then
+ cleanup_test_files "$filename"
+ fi
+}
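+# Illustrative call: run the sequential-write job in dontcache mode on a
+# hypothetical 4G file, cleaning up the test file afterwards:
+#   run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+#           "${RESULTS_DIR}/seq-write/dontcache" \
+#           "seq-write_testfile" "4G" "" "--direct=0" 1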
+
+########################################################################
+# Preflight
+########################################################################
+preflight() {
+ log "=== Preflight checks ==="
+
+ if ! command -v "$FIO_BIN" &>/dev/null; then
+ echo "ERROR: fio not found at $FIO_BIN"
+ exit 1
+ fi
+
+ if [ ! -d "$TEST_DIR" ]; then
+ echo "ERROR: Test directory $TEST_DIR does not exist"
+ exit 1
+ fi
+
+ # Quick check that RWF_DONTCACHE works on this filesystem
+ local testfile="${TEST_DIR}/.dontcache_test"
+ if ! "$FIO_BIN" --name=test --ioengine=io_uring --rw=write \
+ --bs=4k --size=4k --direct=0 --uncached=1 \
+ --filename="$testfile" 2>/dev/null; then
+ echo "WARNING: RWF_DONTCACHE may not be supported on $TEST_DIR"
+ echo " (filesystem must support FOP_DONTCACHE)"
+ fi
+ rm -f "$testfile"
+
+ log "Test directory: $TEST_DIR"
+ log "File size: $SIZE"
+ log "fio binary: $FIO_BIN"
+ log "Results: $RESULTS_DIR"
+
+ # Record system info
+ {
+ echo "Timestamp: $(date +%Y%m%d-%H%M%S)"
+ echo "Kernel: $(uname -r)"
+ echo "Hostname: $(hostname)"
+ echo "Filesystem: $(df -T "$TEST_DIR" | tail -1 | awk '{print $2}')"
+ echo "File size: $SIZE"
+ echo "Test dir: $TEST_DIR"
+ } > "${RESULTS_DIR}/sysinfo.txt"
+}
+
+########################################################################
+# Deliverable 1: Single-client benchmarks
+########################################################################
+run_deliverable1() {
+ log "=========================================="
+ log "Deliverable 1: Single-client benchmarks"
+ log "=========================================="
+
+ # Sequential write
+ for mode in $MODES; do
+ local mname
+ mname=$(mode_name $mode)
+ local fio_args
+ fio_args=$(mode_fio_args $mode)
+
+ drop_caches
+ run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+ "${RESULTS_DIR}/seq-write/${mname}" \
+ "seq-write_testfile" "$SIZE" "" "$fio_args" \
+ "$(mode_uncached $mode)"
+ done
+
+ # Random write
+ for mode in $MODES; do
+ local mname
+ mname=$(mode_name $mode)
+ local fio_args
+ fio_args=$(mode_fio_args $mode)
+
+ drop_caches
+ run_fio "${FIO_JOBS_DIR}/rand-write.fio" \
+ "${RESULTS_DIR}/rand-write/${mname}" \
+ "rand-write_testfile" "$SIZE" "" "$fio_args" \
+ "$(mode_uncached $mode)"
+ done
+
+ # Sequential read — pre-create file, then read with each mode
+ log "Pre-creating sequential read test file"
+ run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+ "${RESULTS_DIR}/seq-read/precreate" \
+ "seq-read_testfile" "$SIZE" "keep"
+
+ for rmode in $MODES; do
+ local mname
+ mname=$(mode_name $rmode)
+ local fio_args
+ fio_args=$(mode_fio_args $rmode)
+ local keep="keep"
+ [ "$rmode" -eq 2 ] && keep=""
+
+ drop_caches
+ run_fio "${FIO_JOBS_DIR}/seq-read.fio" \
+ "${RESULTS_DIR}/seq-read/${mname}" \
+ "seq-read_testfile" "$SIZE" "$keep" "$fio_args" \
+ "$(mode_uncached $rmode)"
+ done
+
+ # Random read — pre-create file, then read with each mode
+ log "Pre-creating random read test file"
+ run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+ "${RESULTS_DIR}/rand-read/precreate" \
+ "rand-read_testfile" "$SIZE" "keep"
+
+ for rmode in $MODES; do
+ local mname
+ mname=$(mode_name $rmode)
+ local fio_args
+ fio_args=$(mode_fio_args $rmode)
+ local keep="keep"
+ [ "$rmode" -eq 2 ] && keep=""
+
+ drop_caches
+ run_fio "${FIO_JOBS_DIR}/rand-read.fio" \
+ "${RESULTS_DIR}/rand-read/${mname}" \
+ "rand-read_testfile" "$SIZE" "$keep" "$fio_args" \
+ "$(mode_uncached $rmode)"
+ done
+}
+
+########################################################################
+# Deliverable 2: Noisy-neighbor and multi-writer benchmarks
+########################################################################
+run_deliverable2() {
+ log "=========================================="
+ log "Deliverable 2: Noisy-neighbor benchmarks"
+ log "=========================================="
+
+ local num_clients=4
+ local client_size
+ local mem_kb
+ mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
+ client_size="$(( mem_kb / 1024 / num_clients ))M"
+
+ # Scenario A: Multiple writers
+ for mode in $MODES; do
+ local mname
+ mname=$(mode_name $mode)
+ local fio_args
+ fio_args=$(mode_fio_args $mode)
+
+ drop_caches
+ run_fio "${FIO_JOBS_DIR}/multi-write.fio" \
+ "${RESULTS_DIR}/multi-write/${mname}" \
+ "multi-write_testfile" "$client_size" "" "$fio_args" \
+ "$(mode_uncached $mode)"
+ done
+
+ # Scenario C: Noisy writer + latency-sensitive readers
+ for mode in $MODES; do
+ local mname
+ mname=$(mode_name $mode)
+ local fio_args
+ fio_args=$(mode_fio_args $mode)
+ local uncached
+ uncached=$(mode_uncached $mode)
+ local writer_job
+ writer_job=$(make_job_file "${FIO_JOBS_DIR}/noisy-writer.fio" "$uncached")
+ local reader_job
+ reader_job=$(make_job_file "${FIO_JOBS_DIR}/lat-reader.fio" "$uncached")
+ local outdir="${RESULTS_DIR}/noisy-neighbor/${mname}"
+ mkdir -p "$outdir"
+
+ # Pre-create read files
+ for i in $(seq 1 $(( num_clients - 1 ))); do
+ log "Pre-creating read file for reader $i"
+ run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+ "${outdir}/precreate_reader${i}" \
+ "reader${i}_readfile" \
+ "512M" "keep"
+ done
+ drop_caches
+ start_monitors "$outdir"
+ start_perf_lock "$outdir"
+
+ # Noisy writer
+ # shellcheck disable=SC2086
+ run_cmd "$FIO_BIN" "$writer_job" \
+ --output-format=json \
+ --output="${outdir}/noisy_writer.json" \
+ --filename="${TEST_DIR}/bulk_testfile" \
+ --size="$SIZE" \
+ $fio_args &
+ local writer_pid=$!
+
+ # Latency-sensitive readers
+ local reader_pids=()
+ for i in $(seq 1 $(( num_clients - 1 ))); do
+ # shellcheck disable=SC2086
+ run_cmd "$FIO_BIN" "$reader_job" \
+ --output-format=json \
+ --output="${outdir}/reader${i}.json" \
+ --filename="${TEST_DIR}/reader${i}_readfile" \
+ --size="512M" \
+ $fio_args &
+ reader_pids+=($!)
+ done
+
+ local rc=0
+ wait "$writer_pid" || rc=$?
+ for pid in "${reader_pids[@]}"; do
+ wait "$pid" || rc=$?
+ done
+
+ stop_perf_lock "$outdir"
+ stop_monitors
+ [ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+ [ "$writer_job" != "${FIO_JOBS_DIR}/noisy-writer.fio" ] && rm -f "$writer_job"
+ [ "$reader_job" != "${FIO_JOBS_DIR}/lat-reader.fio" ] && rm -f "$reader_job"
+ cleanup_test_files "bulk_testfile"
+ for i in $(seq 1 $(( num_clients - 1 ))); do
+ cleanup_test_files "reader${i}_readfile"
+ done
+ done
+
+ # Scenario D: Mixed-mode noisy neighbor
+ # dontcache writes + buffered reads
+ local outdir="${RESULTS_DIR}/noisy-neighbor-mixed/dontcache-w_buffered-r"
+ mkdir -p "$outdir"
+ local writer_job
+ writer_job=$(make_job_file "${FIO_JOBS_DIR}/noisy-writer.fio" 1)
+
+ for i in $(seq 1 $(( num_clients - 1 ))); do
+ log "Pre-creating read file for reader $i"
+ run_fio "${FIO_JOBS_DIR}/seq-write.fio" \
+ "${outdir}/precreate_reader${i}" \
+ "reader${i}_readfile" \
+ "512M" "keep"
+ done
+ drop_caches
+ start_monitors "$outdir"
+ start_perf_lock "$outdir"
+
+ # Writer with dontcache
+ run_cmd "$FIO_BIN" "$writer_job" \
+ --output-format=json \
+ --output="${outdir}/noisy_writer.json" \
+ --filename="${TEST_DIR}/bulk_testfile" \
+ --size="$SIZE" \
+ --direct=0 &
+ local writer_pid=$!
+
+ # Readers with buffered (no uncached flag)
+ local reader_pids=()
+ for i in $(seq 1 $(( num_clients - 1 ))); do
+ run_cmd "$FIO_BIN" "${FIO_JOBS_DIR}/lat-reader.fio" \
+ --output-format=json \
+ --output="${outdir}/reader${i}.json" \
+ --filename="${TEST_DIR}/reader${i}_readfile" \
+ --size="512M" \
+ --direct=0 &
+ reader_pids+=($!)
+ done
+
+ local rc=0
+ wait "$writer_pid" || rc=$?
+ for pid in "${reader_pids[@]}"; do
+ wait "$pid" || rc=$?
+ done
+
+ stop_perf_lock "$outdir"
+ stop_monitors
+ [ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+ [ "$writer_job" != "${FIO_JOBS_DIR}/noisy-writer.fio" ] && rm -f "$writer_job"
+ cleanup_test_files "bulk_testfile"
+ for i in $(seq 1 $(( num_clients - 1 ))); do
+ cleanup_test_files "reader${i}_readfile"
+ done
+
+ # Scenario E: Competing writers (dontcache vs buffered on separate files)
+ # This tests whether the dontcache flusher kick interferes with a
+ # normal buffered writer sharing the same backing device.
+ log "--- Scenario E: Competing writers (separate files) ---"
+
+ # Sub-scenario: dontcache writer vs buffered writer
+ local outdir="${RESULTS_DIR}/competing-writers/dontcache-vs-buffered"
+ mkdir -p "$outdir"
+ local dc_writer_job
+ dc_writer_job=$(make_job_file "${FIO_JOBS_DIR}/noisy-writer.fio" 1)
+
+ drop_caches
+ start_monitors "$outdir"
+ start_perf_lock "$outdir"
+
+ # Writer A: dontcache
+ run_cmd "$FIO_BIN" "$dc_writer_job" \
+ --output-format=json \
+ --output="${outdir}/writer_dontcache.json" \
+ --filename="${TEST_DIR}/writer_a_testfile" \
+ --size="$SIZE" \
+ --direct=0 &
+ local writer_a_pid=$!
+
+ # Writer B: buffered
+ run_cmd "$FIO_BIN" "${FIO_JOBS_DIR}/noisy-writer.fio" \
+ --output-format=json \
+ --output="${outdir}/writer_buffered.json" \
+ --filename="${TEST_DIR}/writer_b_testfile" \
+ --size="$SIZE" \
+ --direct=0 &
+ local writer_b_pid=$!
+
+ local rc=0
+ wait "$writer_a_pid" || rc=$?
+ wait "$writer_b_pid" || rc=$?
+
+ stop_perf_lock "$outdir"
+ stop_monitors
+ [ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+ [ "$dc_writer_job" != "${FIO_JOBS_DIR}/noisy-writer.fio" ] && rm -f "$dc_writer_job"
+ cleanup_test_files "writer_a_testfile"
+ cleanup_test_files "writer_b_testfile"
+
+ # Sub-scenario: buffered writer vs buffered writer (baseline)
+ outdir="${RESULTS_DIR}/competing-writers/buffered-vs-buffered"
+ mkdir -p "$outdir"
+
+ drop_caches
+ start_monitors "$outdir"
+ start_perf_lock "$outdir"
+
+ # Writer A: buffered
+ run_cmd "$FIO_BIN" "${FIO_JOBS_DIR}/noisy-writer.fio" \
+ --output-format=json \
+ --output="${outdir}/writer_a.json" \
+ --filename="${TEST_DIR}/writer_a_testfile" \
+ --size="$SIZE" \
+ --direct=0 &
+ writer_a_pid=$!
+
+ # Writer B: buffered
+ run_cmd "$FIO_BIN" "${FIO_JOBS_DIR}/noisy-writer.fio" \
+ --output-format=json \
+ --output="${outdir}/writer_b.json" \
+ --filename="${TEST_DIR}/writer_b_testfile" \
+ --size="$SIZE" \
+ --direct=0 &
+ writer_b_pid=$!
+
+ rc=0
+ wait "$writer_a_pid" || rc=$?
+ wait "$writer_b_pid" || rc=$?
+
+ stop_perf_lock "$outdir"
+ stop_monitors
+ [ $rc -ne 0 ] && log "WARNING: some fio jobs exited non-zero"
+
+ cleanup_test_files "writer_a_testfile"
+ cleanup_test_files "writer_b_testfile"
+}
+
+########################################################################
+# Deliverable 3: Axboe 32-file write benchmark
+########################################################################
+run_deliverable3() {
+ log "=========================================="
+ log "Deliverable 3: 32-file write (Axboe test)"
+ log "=========================================="
+
+ # Per-file size: 2x RAM / 32 so total written exceeds RAM
+ local per_file_size
+ local mem_kb
+ mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
+ per_file_size="$(( mem_kb * 2 / 1024 / 32 ))M"
+
+ for mode in $MODES; do
+ local mname
+ mname=$(mode_name $mode)
+ local fio_args
+ fio_args=$(mode_fio_args $mode)
+
+ drop_caches
+ run_fio "${FIO_JOBS_DIR}/axboe-write.fio" \
+ "${RESULTS_DIR}/axboe-write/${mname}" \
+ "axboe-write_testfile" "$per_file_size" "" "$fio_args" \
+ "$(mode_uncached $mode)"
+ done
+}
+
+########################################################################
+# Main
+########################################################################
+preflight
+run_deliverable1
+run_deliverable2
+run_deliverable3
+
+log "=========================================="
+log "All benchmarks complete."
+log "Results in: $RESULTS_DIR"
+log "Parse with: scripts/parse-results.sh $RESULTS_DIR"
+log "=========================================="
--
2.54.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v4 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
2026-05-01 9:49 ` [PATCH v4 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking Jeff Layton
@ 2026-05-01 16:44 ` Jens Axboe
2026-05-03 14:45 ` Jan Kara
1 sibling, 0 replies; 9+ messages in thread
From: Jens Axboe @ 2026-05-01 16:44 UTC (permalink / raw)
To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Ritesh Harjani,
Chuck Lever
Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm
On 5/1/26 3:49 AM, Jeff Layton wrote:
> The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> filemap_flush_range() on every write, submitting writeback inline in
> the writer's context. Perf lock contention profiling shows the
> performance problem is not lock contention but the writeback submission
> work itself — walking the page tree and submitting I/O blocks the writer
> for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> (dontcache).
>
> Replace the inline filemap_flush_range() call with a flusher kick that
> drains dirty pages in the background. This moves writeback submission
> completely off the writer's hot path.
>
> To avoid flushing unrelated buffered dirty data, add a dedicated
> WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
> write back. The flusher writes back that many pages from the oldest dirty
> inodes (not restricted to dontcache-specific inodes). This helps
> preserve I/O batching while limiting the scope of expedited writeback.
>
> Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> DONTCACHE writes into a single flusher wakeup without per-write
> allocations.
>
> Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> visibility, and target the correct cgroup writeback domain via
> unlocked_inode_to_wb_begin().
>
> dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
> xfs on NVMe, fio io_uring):
>
> Buffered and direct I/O paths are unaffected by this patchset. All
> improvements are confined to the dontcache path:
>
> Single-stream throughput (MB/s):
> Before After Change
> seq-write/dontcache 298 897 +201%
> rand-write/dontcache 131 236 +80%
>
> Tail latency improvements (seq-write/dontcache):
> p99: 135,266 us -> 23,986 us (-82%)
> p99.9: 8,925,479 us -> 28,443 us (-99.7%)
>
> Multi-writer (4 jobs, sequential write):
> Before After Change
> dontcache aggregate (MB/s) 2,529 4,532 +79%
> dontcache p99 (us) 8,553 1,002 -88%
> dontcache p99.9 (us) 109,314 1,057 -99%
>
> Dontcache multi-writer throughput now matches buffered (4,532 vs
> 4,616 MB/s).
>
> 32-file write (Axboe test):
> Before After Change
> dontcache aggregate (MB/s) 1,548 3,499 +126%
> dontcache p99 (us) 10,170 602 -94%
> Peak dirty pages (MB) 1,837 213 -88%
>
> Dontcache now reaches 81% of buffered throughput (was 35%).
>
> Competing writers (dontcache vs buffered, separate files):
> Before After
> buffered writer 868 433 MB/s
> dontcache writer 415 433 MB/s
> Aggregate 1,284 866 MB/s
>
> Previously the buffered writer starved the dontcache writer 2:1.
> With per-bdi_writeback tracking, both writers now receive equal
> bandwidth. The aggregate matches the buffered-vs-buffered baseline
> (863 MB/s), indicating fair sharing regardless of I/O mode.
>
> The dontcache writer's p99.9 latency collapsed from 119 ms to
> 33 ms (-73%), eliminating the severe periodic stalls seen in the
> baseline. Both writers now share identical latency profiles,
> matching the buffered-vs-buffered pattern.
>
> The per-bdi_writeback dirty tracking dramatically reduces peak dirty
> pages in dontcache workloads, with the 32-file test dropping from
> 1.8 GB to 213 MB. Dontcache sequential write throughput triples and
> multi-writer throughput reaches parity with buffered I/O, with tail
> latencies collapsing by 1-2 orders of magnitude.
I like this, this is the better way to kick off the writeback.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
--
Jens Axboe
* Re: [PATCH v4 1/4] mm: track DONTCACHE dirty pages per bdi_writeback
2026-05-01 9:49 ` [PATCH v4 1/4] mm: track DONTCACHE dirty pages per bdi_writeback Jeff Layton
@ 2026-05-03 14:37 ` Jan Kara
0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2026-05-03 14:37 UTC (permalink / raw)
To: Jeff Layton
Cc: Alexander Viro, Christian Brauner, Jan Kara,
Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
Ritesh Harjani, Chuck Lever, linux-fsdevel, linux-kernel,
linux-nfs, linux-mm
On Fri 01-05-26 10:49:35, Jeff Layton wrote:
> Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
> pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
> writes).
>
> Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
> when the folio has the dropbehind flag set, and decrement it in
> folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
> when a non-DONTCACHE lookup clears the dropbehind flag on a dirty folio in
> __filemap_get_folio_mpol(), using proper writeback domain locking.
>
> The counter will be used by the writeback flusher to determine how many
> pages to write back when expediting writeback for IOCB_DONTCACHE writes,
> without flushing the entire BDI's dirty pages.
>
> Suggested-by: Jan Kara <jack@suse.cz>
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> include/linux/backing-dev-defs.h | 1 +
> mm/filemap.c | 13 ++++++++++++-
> mm/page-writeback.c | 6 ++++++
> 3 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index a06b93446d10..cb660dd37286 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -33,6 +33,7 @@ enum wb_stat_item {
> WB_WRITEBACK,
> WB_DIRTIED,
> WB_WRITTEN,
> + WB_DONTCACHE_DIRTY,
> NR_WB_STAT_ITEMS
> };
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4e636647100c..1c9c0d5f495f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2052,8 +2052,19 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
> if (!folio)
> return ERR_PTR(-ENOENT);
> /* not an uncached lookup, clear uncached if set */
> - if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE))
> + if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE)) {
> + if (folio_test_dirty(folio)) {
> + struct inode *inode = mapping->host;
> + struct bdi_writeback *wb;
> + struct wb_lock_cookie cookie = {};
> +
> + wb = unlocked_inode_to_wb_begin(inode, &cookie);
> + wb_stat_mod(wb, WB_DONTCACHE_DIRTY,
> + -folio_nr_pages(folio));
> + unlocked_inode_to_wb_end(inode, &cookie);
> + }
> folio_clear_dropbehind(folio);
> + }
> return folio;
> }
> EXPORT_SYMBOL(__filemap_get_folio_mpol);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 88cd53d4ba09..8e520717d1f6 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2630,6 +2630,8 @@ static void folio_account_dirtied(struct folio *folio,
> wb = inode_to_wb(inode);
>
> lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, nr);
> + if (folio_test_dropbehind(folio))
> + wb_stat_mod(wb, WB_DONTCACHE_DIRTY, nr);
> __zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr);
> __node_stat_mod_folio(folio, NR_DIRTIED, nr);
> wb_stat_mod(wb, WB_RECLAIMABLE, nr);
> @@ -2651,6 +2653,8 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
> long nr = folio_nr_pages(folio);
>
> lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
> + if (folio_test_dropbehind(folio))
> + wb_stat_mod(wb, WB_DONTCACHE_DIRTY, -nr);
> zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
> wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
> task_io_account_cancelled_write(nr * PAGE_SIZE);
> @@ -2920,6 +2924,8 @@ bool folio_clear_dirty_for_io(struct folio *folio)
> if (folio_test_clear_dirty(folio)) {
> long nr = folio_nr_pages(folio);
> lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr);
> + if (folio_test_dropbehind(folio))
> + wb_stat_mod(wb, WB_DONTCACHE_DIRTY, -nr);
> zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
> wb_stat_mod(wb, WB_RECLAIMABLE, -nr);
> ret = true;
>
> --
> 2.54.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v4 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
2026-05-01 9:49 ` [PATCH v4 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking Jeff Layton
2026-05-01 16:44 ` Jens Axboe
@ 2026-05-03 14:45 ` Jan Kara
2026-05-03 18:41 ` Jeff Layton
1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2026-05-03 14:45 UTC (permalink / raw)
To: Jeff Layton
Cc: Alexander Viro, Christian Brauner, Jan Kara,
Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
Ritesh Harjani, Chuck Lever, linux-fsdevel, linux-kernel,
linux-nfs, linux-mm
On Fri 01-05-26 10:49:36, Jeff Layton wrote:
> The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> filemap_flush_range() on every write, submitting writeback inline in
> the writer's context. Perf lock contention profiling shows the
> performance problem is not lock contention but the writeback submission
> work itself — walking the page tree and submitting I/O blocks the writer
> for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> (dontcache).
>
> Replace the inline filemap_flush_range() call with a flusher kick that
> drains dirty pages in the background. This moves writeback submission
> completely off the writer's hot path.
>
> To avoid flushing unrelated buffered dirty data, add a dedicated
> WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
> write back. The flusher writes back that many pages from the oldest dirty
> inodes (not restricted to dontcache-specific inodes). This helps
> preserve I/O batching while limiting the scope of expedited writeback.
>
> Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> DONTCACHE writes into a single flusher wakeup without per-write
> allocations.
>
> Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> visibility, and target the correct cgroup writeback domain via
> unlocked_inode_to_wb_begin().
>
> dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
> xfs on NVMe, fio io_uring):
>
> Buffered and direct I/O paths are unaffected by this patchset. All
> improvements are confined to the dontcache path:
>
> Single-stream throughput (MB/s):
> Before After Change
> seq-write/dontcache 298 897 +201%
> rand-write/dontcache 131 236 +80%
>
> Tail latency improvements (seq-write/dontcache):
> p99: 135,266 us -> 23,986 us (-82%)
> p99.9: 8,925,479 us -> 28,443 us (-99.7%)
>
> Multi-writer (4 jobs, sequential write):
> Before After Change
> dontcache aggregate (MB/s) 2,529 4,532 +79%
> dontcache p99 (us) 8,553 1,002 -88%
> dontcache p99.9 (us) 109,314 1,057 -99%
>
> Dontcache multi-writer throughput now matches buffered (4,532 vs
> 4,616 MB/s).
>
> 32-file write (Axboe test):
> Before After Change
> dontcache aggregate (MB/s) 1,548 3,499 +126%
> dontcache p99 (us) 10,170 602 -94%
> Peak dirty pages (MB) 1,837 213 -88%
>
> Dontcache now reaches 81% of buffered throughput (was 35%).
>
> Competing writers (dontcache vs buffered, separate files):
> Before After
> buffered writer 868 433 MB/s
> dontcache writer 415 433 MB/s
> Aggregate 1,284 866 MB/s
>
> Previously the buffered writer starved the dontcache writer 2:1.
> With per-bdi_writeback tracking, both writers now receive equal
> bandwidth. The aggregate matches the buffered-vs-buffered baseline
> (863 MB/s), indicating fair sharing regardless of I/O mode.
>
> The dontcache writer's p99.9 latency collapsed from 119 ms to
> 33 ms (-73%), eliminating the severe periodic stalls seen in the
> baseline. Both writers now share identical latency profiles,
> matching the buffered-vs-buffered pattern.
>
> The per-bdi_writeback dirty tracking dramatically reduces peak dirty
> pages in dontcache workloads, with the 32-file test dropping from
> 1.8 GB to 213 MB. Dontcache sequential write throughput triples and
> multi-writer throughput reaches parity with buffered I/O, with tail
> latencies collapsing by 1-2 orders of magnitude.
>
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Nice and looks good to me now. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
One nit below:
> +/**
> + * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
> + * @mapping: address_space that was just written to
> + *
> + * Kick the writeback flusher thread to expedite writeback of dontcache
> + * dirty pages. Uses a dedicated WB_start_dontcache bit so that only
> + * pages tracked by WB_DONTCACHE_DIRTY are written back, rather than
> + * flushing the entire BDI's dirty pages.
This comment is a bit confusing as in fact we write arbitrary dirty pages.
It is only the amount of pages that is influenced by WB_DONTCACHE_DIRTY. So
I'd rephrase the last sentence like: We queue writeback for the inode's wb
for as many pages as there are dontcache pages but we don't restrict
writeback to dontcache pages only. This significantly improves performance
over either writing all wb's pages or writing only dontcache pages.
Although it doesn't guarantee quick writeback and reclaim of dontcache
pages it keeps the amount of dirty pages in check and over longer term
dontcache pages get written and reclaimed by background writeback even with
this rough heuristic.
Honza
> + */
> +void filemap_dontcache_kick_writeback(struct address_space *mapping)
> +{
> + struct inode *inode = mapping->host;
> + struct bdi_writeback *wb;
> + struct wb_lock_cookie cookie = {};
> +
> + wb = unlocked_inode_to_wb_begin(inode, &cookie);
> + wb_start_dontcache_writeback(wb);
> + unlocked_inode_to_wb_end(inode, &cookie);
> +}
> +EXPORT_SYMBOL_GPL(filemap_dontcache_kick_writeback);
> +
> /*
> * Wakeup the flusher threads to start writeback of all currently dirty pages
> */
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index cb660dd37286..4f1084937315 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -26,6 +26,7 @@ enum wb_state {
> WB_writeback_running, /* Writeback is in progress */
> WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */
> WB_start_all, /* nr_pages == 0 (all) work pending */
> + WB_start_dontcache, /* dontcache writeback pending */
> };
>
> enum wb_stat_item {
> @@ -56,6 +57,7 @@ enum wb_reason {
> */
> WB_REASON_FORKER_THREAD,
> WB_REASON_FOREIGN_FLUSH,
> + WB_REASON_DONTCACHE,
>
> WB_REASON_MAX,
> };
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 11559c513dfb..df72b42a9e9b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2624,6 +2624,7 @@ extern int __must_check file_write_and_wait_range(struct file *file,
> loff_t start, loff_t end);
> int filemap_flush_range(struct address_space *mapping, loff_t start,
> loff_t end);
> +void filemap_dontcache_kick_writeback(struct address_space *mapping);
>
> static inline int file_write_and_wait(struct file *file)
> {
> @@ -2657,10 +2658,7 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
> if (ret)
> return ret;
> } else if (iocb->ki_flags & IOCB_DONTCACHE) {
> - struct address_space *mapping = iocb->ki_filp->f_mapping;
> -
> - filemap_flush_range(mapping, iocb->ki_pos - count,
> - iocb->ki_pos - 1);
> + filemap_dontcache_kick_writeback(iocb->ki_filp->f_mapping);
> }
>
> return count;
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index bdac0d685a98..13ee076ccd16 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -44,7 +44,8 @@
> EM( WB_REASON_PERIODIC, "periodic") \
> EM( WB_REASON_FS_FREE_SPACE, "fs_free_space") \
> EM( WB_REASON_FORKER_THREAD, "forker_thread") \
> - EMe(WB_REASON_FOREIGN_FLUSH, "foreign_flush")
> + EM( WB_REASON_FOREIGN_FLUSH, "foreign_flush") \
> + EMe(WB_REASON_DONTCACHE, "dontcache")
>
> WB_WORK_REASON
>
>
> --
> 2.54.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v4 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
  2026-05-01  9:49 ` [PATCH v4 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking Jeff Layton
  2026-05-01 16:44   ` Jens Axboe
2026-05-03 14:45 ` Jan Kara
@ 2026-05-03 18:41 ` Jeff Layton
0 siblings, 0 replies; 9+ messages in thread
From: Jeff Layton @ 2026-05-03 18:41 UTC (permalink / raw)
To: Jan Kara
Cc: Alexander Viro, Christian Brauner, Matthew Wilcox (Oracle),
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
Ritesh Harjani, Chuck Lever, linux-fsdevel, linux-kernel,
linux-nfs, linux-mm
On Sun, 2026-05-03 at 16:45 +0200, Jan Kara wrote:
> On Fri 01-05-26 10:49:36, Jeff Layton wrote:
> > [ patch description and benchmark results snipped; unchanged from the
> > quoted text earlier in the thread ]
>
> Nice and looks good to me now. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> One nit below:
>
> > +/**
> > + * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
> > + * @mapping: address_space that was just written to
> > + *
> > + * Kick the writeback flusher thread to expedite writeback of dontcache
> > + * dirty pages. Uses a dedicated WB_start_dontcache bit so that only
> > + * pages tracked by WB_DONTCACHE_DIRTY are written back, rather than
> > + * flushing the entire BDI's dirty pages.
>
> This comment is a bit confusing as in fact we write arbitrary dirty pages.
> It is only the amount of pages that is influenced by WB_DONTCACHE_DIRTY. So
> I'd rephrase the last sentence like: We queue writeback for the inode's wb
> for as many pages as there are dontcache pages but we don't restrict
> writeback to dontcache pages only. This significantly improves performance
> over either writing all wb's pages or writing only dontcache pages.
> Although it doesn't guarantee quick writeback and reclaim of dontcache
> pages it keeps the amount of dirty pages in check and over longer term
> dontcache pages get written and reclaimed by background writeback even with
> this rough heuristic.
>
> Honza
>
I'll add that. Thanks for the suggestion and review!
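
Something like this, perhaps (the kernel-doc comment reworded per your
suggestion; hypothetical final wording):

```c
/**
 * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
 * @mapping: address_space that was just written to
 *
 * Kick the writeback flusher thread to expedite writeback of dontcache
 * dirty pages. We queue writeback for the inode's wb for as many pages
 * as there are dontcache pages, but we don't restrict writeback to
 * dontcache pages only. This significantly improves performance over
 * either writing all of the wb's pages or writing only dontcache pages.
 * Although it doesn't guarantee quick writeback and reclaim of dontcache
 * pages, it keeps the amount of dirty pages in check, and over the longer
 * term dontcache pages get written and reclaimed by background writeback
 * even with this rough heuristic.
 */
```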
--
Jeff Layton <jlayton@kernel.org>
end of thread, other threads: [~2026-05-03 18:41 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-01 9:49 [PATCH v4 0/4] mm: improve write performance with RWF_DONTCACHE Jeff Layton
2026-05-01 9:49 ` [PATCH v4 1/4] mm: track DONTCACHE dirty pages per bdi_writeback Jeff Layton
2026-05-03 14:37 ` Jan Kara
2026-05-01 9:49 ` [PATCH v4 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking Jeff Layton
2026-05-01 16:44 ` Jens Axboe
2026-05-03 14:45 ` Jan Kara
2026-05-03 18:41 ` Jeff Layton
2026-05-01 9:49 ` [PATCH v4 3/4] testing: add nfsd-io-bench NFS server benchmark suite Jeff Layton
2026-05-01 9:49 ` [PATCH v4 4/4] testing: add dontcache-bench local filesystem " Jeff Layton