* A comparison of the new nfsd iomodes (and an experimental one)
@ 2026-03-26 15:23 Jeff Layton
2026-03-26 15:30 ` Chuck Lever
0 siblings, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2026-03-26 15:23 UTC (permalink / raw)
To: linux-nfs, linux-fsdevel, linux-block
Cc: Chuck Lever, Mike Snitzer, Jens Axboe
I've been doing some benchmarking of the new nfsd iomodes, using
different fio-based workloads.
The results have been interesting, but one thing that stands out is
that RWF_DONTCACHE is absolutely terrible for streaming write
workloads. That prompted me to experiment with a new iomode that added
some optimizations (DONTCACHE_LAZY).
The results along with Claude's analysis are here:
https://markdownpastebin.com/?id=387375d00b5443b3a2e37d58a062331f
He gets a bit out over his skis on the upstream plan, but tl;dr is that
DONTCACHE_LAZY (which is DONTCACHE with some optimizations) outperforms
the other write iomodes.
The core DONTCACHE_LAZY patch is below. I doubt we'll want a new iomode
long-term. What we'll probably want to do is modify DONTCACHE to work
like DONTCACHE_LAZY:
-------------------8<-------------------
[PATCH] mm: add IOCB_DONTCACHE_LAZY and RWF_DONTCACHE_LAZY
IOCB_DONTCACHE flushes all dirty pages on every write via
filemap_flush_range() with nr_to_write=LONG_MAX. Under concurrent
writers, this creates severe serialization: every writer contends on
the writeback submission path, leading to catastrophic throughput
collapse (~1 GB/s vs ~10 GB/s for buffered) and multi-second tail
latency.
Add IOCB_DONTCACHE_LAZY as a gentler alternative with two mechanisms:
1. Skip-if-busy: check mapping_tagged(PAGECACHE_TAG_WRITEBACK) before
flushing. If writeback is already in progress on the mapping, the
flush is skipped entirely, eliminating writeback submission
contention between concurrent writers.
2. Proportional cap: when flushing does occur, cap nr_to_write to the
number of pages just written. This prevents any single write from
triggering a full-file flush that would starve concurrent readers.
Together these mechanisms rate-limit writeback to match the incoming
write rate while avoiding I/O bursts that cause tail latency spikes.
Like IOCB_DONTCACHE, pages touched under IOCB_DONTCACHE_LAZY are
marked for eviction (dropbehind) to keep page cache usage bounded.
Also add RWF_DONTCACHE_LAZY (0x200) as a user-visible pwritev2/io_uring
flag that maps to IOCB_DONTCACHE_LAZY. The flag follows the same
validation as RWF_DONTCACHE: the filesystem must support FOP_DONTCACHE,
DAX is not supported, and RWF_DONTCACHE and RWF_DONTCACHE_LAZY are
mutually exclusive.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/iomap/buffered-io.c | 2 +-
include/linux/fs.h | 18 ++++++++++++++++--
include/linux/pagemap.h | 2 +-
include/uapi/linux/fs.h | 6 +++++-
mm/filemap.c | 40 +++++++++++++++++++++++++++++++++++++---
5 files changed, 60 insertions(+), 8 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e3bedcbb5f1ea..069d4378bf457 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1185,7 +1185,7 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
if (iocb->ki_flags & IOCB_NOWAIT)
iter.flags |= IOMAP_NOWAIT;
- if (iocb->ki_flags & IOCB_DONTCACHE)
+ if (iocb->ki_flags & (IOCB_DONTCACHE | IOCB_DONTCACHE_LAZY))
iter.flags |= IOMAP_DONTCACHE;
while ((ret = iomap_iter(&iter, ops)) > 0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 94695ce5e25b5..04ff531473e82 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -359,6 +359,7 @@ struct readahead_control;
/* kiocb is a read or write operation submitted by fs/aio.c. */
#define IOCB_AIO_RW (1 << 22)
#define IOCB_HAS_METADATA (1 << 23)
+#define IOCB_DONTCACHE_LAZY (__force int) RWF_DONTCACHE_LAZY
/* for use in trace events */
#define TRACE_IOCB_STRINGS \
@@ -376,7 +377,8 @@ struct readahead_control;
{ IOCB_NOIO, "NOIO" }, \
{ IOCB_ALLOC_CACHE, "ALLOC_CACHE" }, \
{ IOCB_AIO_RW, "AIO_RW" }, \
- { IOCB_HAS_METADATA, "AIO_HAS_METADATA" }
+ { IOCB_HAS_METADATA, "AIO_HAS_METADATA" }, \
+ { IOCB_DONTCACHE_LAZY, "DONTCACHE_LAZY" }
struct kiocb {
struct file *ki_filp;
@@ -2589,6 +2591,8 @@ extern int __must_check file_write_and_wait_range(struct file *file,
loff_t start, loff_t end);
int filemap_flush_range(struct address_space *mapping, loff_t start,
loff_t end);
+int filemap_dontcache_writeback_range(struct address_space *mapping,
+ loff_t start, loff_t end, ssize_t nr_written);
static inline int file_write_and_wait(struct file *file)
{
@@ -2626,6 +2630,12 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
filemap_flush_range(mapping, iocb->ki_pos - count,
iocb->ki_pos - 1);
+ } else if (iocb->ki_flags & IOCB_DONTCACHE_LAZY) {
+ struct address_space *mapping = iocb->ki_filp->f_mapping;
+
+ filemap_dontcache_writeback_range(mapping,
+ iocb->ki_pos - count,
+ iocb->ki_pos - 1, count);
}
return count;
@@ -3393,13 +3403,17 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
return -EOPNOTSUPP;
}
- if (flags & RWF_DONTCACHE) {
+ if (flags & (RWF_DONTCACHE | RWF_DONTCACHE_LAZY)) {
/* file system must support it */
if (!(ki->ki_filp->f_op->fop_flags & FOP_DONTCACHE))
return -EOPNOTSUPP;
/* DAX mappings not supported */
if (IS_DAX(ki->ki_filp->f_mapping->host))
return -EOPNOTSUPP;
+ /* can't use both at once */
+ if ((flags & (RWF_DONTCACHE | RWF_DONTCACHE_LAZY)) ==
+ (RWF_DONTCACHE | RWF_DONTCACHE_LAZY))
+ return -EINVAL;
}
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
if (flags & RWF_SYNC)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9f5c4e8b4a7d3..3539a7b4ed53c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -798,7 +798,7 @@ static inline struct folio *write_begin_get_folio(const struct kiocb *iocb,
fgp_flags |= fgf_set_order(len);
- if (iocb && iocb->ki_flags & IOCB_DONTCACHE)
+ if (iocb && iocb->ki_flags & (IOCB_DONTCACHE | IOCB_DONTCACHE_LAZY))
fgp_flags |= FGP_DONTCACHE;
return __filemap_get_folio(mapping, index, fgp_flags,
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 66ca526cf786c..74f7c75901e0c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -434,10 +434,14 @@ typedef int __bitwise __kernel_rwf_t;
/* prevent pipe and socket writes from raising SIGPIPE */
#define RWF_NOSIGNAL ((__force __kernel_rwf_t)0x00000100)
+/* buffered IO that drops the cache after reading or writing data,
+ * with rate-limited writeback (skip if writeback already in progress) */
+#define RWF_DONTCACHE_LAZY ((__force __kernel_rwf_t)0x00000200)
+
/* mask of flags supported by the kernel */
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
- RWF_DONTCACHE | RWF_NOSIGNAL)
+ RWF_DONTCACHE | RWF_NOSIGNAL | RWF_DONTCACHE_LAZY)
#define PROCFS_IOCTL_MAGIC 'f'
diff --git a/mm/filemap.c b/mm/filemap.c
index 9697e12dfbdcc..448bee3f3f1ce 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -440,6 +440,40 @@ int filemap_flush_range(struct address_space *mapping, loff_t start,
}
EXPORT_SYMBOL_GPL(filemap_flush_range);
+/**
+ * filemap_dontcache_writeback_range - rate-limited writeback for dontcache I/O
+ * @mapping: target address_space
+ * @start: byte offset to start writeback
+ * @end: byte offset to end writeback (inclusive)
+ * @nr_written: number of bytes just written by the caller
+ *
+ * Kick writeback for dontcache I/O, but avoid piling on if writeback is
+ * already in progress. When writeback is kicked, limit the number of pages
+ * submitted to be proportional to the amount just written, rather than
+ * flushing the entire dirty range.
+ *
+ * This reduces tail latency compared to filemap_flush_range() which submits
+ * writeback for all dirty pages on every call, creating queue contention
+ * under concurrent writers.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int filemap_dontcache_writeback_range(struct address_space *mapping,
+ loff_t start, loff_t end,
+ ssize_t nr_written)
+{
+ long nr;
+
+ /* If writeback is already active, don't pile on */
+ if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+ return 0;
+
+ nr = (nr_written + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ return filemap_writeback(mapping, start, end, WB_SYNC_NONE, &nr,
+ WB_REASON_BACKGROUND);
+}
+EXPORT_SYMBOL_GPL(filemap_dontcache_writeback_range);
+
/**
* filemap_flush - mostly a non-blocking flush
* @mapping: target address_space
@@ -2633,7 +2667,7 @@ static int filemap_create_folio(struct kiocb *iocb, struct folio_batch *fbatch)
folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order, NULL);
if (!folio)
return -ENOMEM;
- if (iocb->ki_flags & IOCB_DONTCACHE)
+ if (iocb->ki_flags & (IOCB_DONTCACHE | IOCB_DONTCACHE_LAZY))
__folio_set_dropbehind(folio);
/*
@@ -2680,7 +2714,7 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
if (iocb->ki_flags & IOCB_NOIO)
return -EAGAIN;
- if (iocb->ki_flags & IOCB_DONTCACHE)
+ if (iocb->ki_flags & (IOCB_DONTCACHE | IOCB_DONTCACHE_LAZY))
ractl.dropbehind = 1;
page_cache_async_ra(&ractl, folio, last_index - folio->index);
return 0;
@@ -2712,7 +2746,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
return -EAGAIN;
if (iocb->ki_flags & IOCB_NOWAIT)
flags = memalloc_noio_save();
- if (iocb->ki_flags & IOCB_DONTCACHE)
+ if (iocb->ki_flags & (IOCB_DONTCACHE | IOCB_DONTCACHE_LAZY))
ractl.dropbehind = 1;
page_cache_sync_ra(&ractl, last_index - index);
if (iocb->ki_flags & IOCB_NOWAIT)
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: A comparison of the new nfsd iomodes (and an experimental one)
2026-03-26 15:23 A comparison of the new nfsd iomodes (and an experimental one) Jeff Layton
@ 2026-03-26 15:30 ` Chuck Lever
2026-03-26 16:35 ` Jeff Layton
0 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2026-03-26 15:30 UTC (permalink / raw)
To: Jeff Layton, linux-nfs, linux-fsdevel, linux-block
Cc: Mike Snitzer, Jens Axboe
On 3/26/26 11:23 AM, Jeff Layton wrote:
> I've been doing some benchmarking of the new nfsd iomodes, using
> different fio-based workloads.
>
> The results have been interesting, but one thing that stands out is
> that RWF_DONTCACHE is absolutely terrible for streaming write
> workloads. That prompted me to experiment with a new iomode that added
> some optimizations (DONTCACHE_LAZY).
>
> The results along with Claude's analysis are here:
>
> https://markdownpastebin.com/?id=387375d00b5443b3a2e37d58a062331f
>
> He gets a bit out over his skis on the upstream plan, but tl;dr is that
> DONTCACHE_LAZY (which is DONTCACHE with some optimizations) outperforms
> the other write iomodes.
The analysis of the write modes seems plausible. I'm interested to hear
what Mike and Jens have to say about that.
One thing I'd like to hear more about is why Claude felt that disabling
splice read was beneficial. My own benchmarking in that area has shown
that splice read is always a win over not using splice.
--
Chuck Lever
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: A comparison of the new nfsd iomodes (and an experimental one)
2026-03-26 15:30 ` Chuck Lever
@ 2026-03-26 16:35 ` Jeff Layton
2026-03-26 20:48 ` Mike Snitzer
0 siblings, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2026-03-26 16:35 UTC (permalink / raw)
To: Chuck Lever, linux-nfs, linux-fsdevel, linux-block
Cc: Mike Snitzer, Jens Axboe
On Thu, 2026-03-26 at 11:30 -0400, Chuck Lever wrote:
> On 3/26/26 11:23 AM, Jeff Layton wrote:
> > I've been doing some benchmarking of the new nfsd iomodes, using
> > different fio-based workloads.
> >
> > The results have been interesting, but one thing that stands out is
> > that RWF_DONTCACHE is absolutely terrible for streaming write
> > workloads. That prompted me to experiment with a new iomode that added
> > some optimizations (DONTCACHE_LAZY).
> >
> > The results along with Claude's analysis are here:
> >
> > https://markdownpastebin.com/?id=387375d00b5443b3a2e37d58a062331f
> >
> > He gets a bit out over his skis on the upstream plan, but tl;dr is that
> > DONTCACHE_LAZY (which is DONTCACHE with some optimizations) outperforms
> > the other write iomodes.
>
> The analysis of the write modes seems plausible. I'm interested to hear
> what Mike and Jens have to say about that.
>
> One thing I'd like to hear more about is why Claude felt that disabling
> splice read was beneficial. My own benchmarking in that area has shown
> that splice read is always a win over not using splice.
>
Good catch. That turns out to be a mistake in Claude's writeup.
The test scripts left splice reads enabled for buffered reads, and the
results in the analysis reflect that. I (and it) have no idea why it
would recommend disabling them, when the testing all left them enabled
for buffered reads.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: A comparison of the new nfsd iomodes (and an experimental one)
2026-03-26 16:35 ` Jeff Layton
@ 2026-03-26 20:48 ` Mike Snitzer
2026-03-27 11:32 ` Jeff Layton
0 siblings, 1 reply; 8+ messages in thread
From: Mike Snitzer @ 2026-03-26 20:48 UTC (permalink / raw)
To: Jeff Layton
Cc: Chuck Lever, linux-nfs, linux-fsdevel, linux-block, Jens Axboe
On Thu, Mar 26, 2026 at 12:35:15PM -0400, Jeff Layton wrote:
> On Thu, 2026-03-26 at 11:30 -0400, Chuck Lever wrote:
> > On 3/26/26 11:23 AM, Jeff Layton wrote:
> > > I've been doing some benchmarking of the new nfsd iomodes, using
> > > different fio-based workloads.
> > >
> > > The results have been interesting, but one thing that stands out is
> > > that RWF_DONTCACHE is absolutely terrible for streaming write
> > > workloads. That prompted me to experiment with a new iomode that added
> > > some optimizations (DONTCACHE_LAZY).
> > >
> > > The results along with Claude's analysis are here:
> > >
> > > https://markdownpastebin.com/?id=387375d00b5443b3a2e37d58a062331f
> > >
> > > He gets a bit out over his skis on the upstream plan, but tl;dr is that
> > > DONTCACHE_LAZY (which is DONTCACHE with some optimizations) outperforms
> > > the other write iomodes.
> >
> > The analysis of the write modes seems plausible. I'm interested to hear
> > what Mike and Jens have to say about that.
Thanks for doing your testing and the summary, but I cannot help but
feel like your test isn't coming close to realizing the O_DIRECT
benefits over buffered that were covered in the past, e.g.:
https://www.youtube.com/watch?v=tpPFDu9Nuuw
Can Claude be made to watch a youtube video, summarize what it learned
and then adapt its test-plan accordingly? ;)
Your bandwidth for 1MB sequential IO of 793 MB/s for O_DIRECT and
4,952 MB/s for buffered and dontcache is considerably less than the 72
GB/s offered in Jon's testbed. Your testing isn't exposing the
bottlenecks (contention) of the MM subsystem for buffered IO... I've not
yet put my finger on _why_ that is.
In Jon Flynn's testing he was using a working set of 312.5% of
available server memory, and the single client test system was using
fio with multiple threads and sync IO to write to 16 different mounts
(one per NVMe of the NFS server) with nconnect=16 and RDMA.
Raw performance of a single NVMe in Jon's testbed was over 14 GB/s --
he has the ability to drive 16 NVMe in his single NFS server. So an
order of magnitude more capable backend storage in Jon's NFS server.
Big concern is your testing isn't exposing MM bottlenecks of buffered
IO... given that, it's not really providing useful results to compare
with O_DIRECT.
Putting that aside, yes DONTCACHE as-is really isn't helpful... your
lazy variant seems much more useful.
> > One thing I'd like to hear more about is why Claude felt that disabling
> > splice read was beneficial. My own benchmarking in that area has shown
> > that splice read is always a win over not using splice.
> >
>
> Good catch. That turns out to be a mistake in Claude's writeup.
>
> The test scripts left splice reads enabled for buffered reads, and the
> results in the analysis reflect that. I (and it) have no idea why it
> would recommend disabling them, when the testing all left them enabled
> for buffered reads.
Claude had to have picked up on the mutual exclusion with splice_read
for both NFSD_IO_DONTCACHE and NFSD_IO_DIRECT io modes. So
splice_read is implicitly disabled when testing NFSD_IO_DONTCACHE
(which is buffered IO).
Mike
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: A comparison of the new nfsd iomodes (and an experimental one)
2026-03-26 20:48 ` Mike Snitzer
@ 2026-03-27 11:32 ` Jeff Layton
2026-03-27 13:19 ` Chuck Lever
0 siblings, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2026-03-27 11:32 UTC (permalink / raw)
To: Mike Snitzer
Cc: Chuck Lever, linux-nfs, linux-fsdevel, linux-block, Jens Axboe
On Thu, 2026-03-26 at 16:48 -0400, Mike Snitzer wrote:
> On Thu, Mar 26, 2026 at 12:35:15PM -0400, Jeff Layton wrote:
> > On Thu, 2026-03-26 at 11:30 -0400, Chuck Lever wrote:
> > > On 3/26/26 11:23 AM, Jeff Layton wrote:
> > > > I've been doing some benchmarking of the new nfsd iomodes, using
> > > > different fio-based workloads.
> > > >
> > > > The results have been interesting, but one thing that stands out is
> > > > that RWF_DONTCACHE is absolutely terrible for streaming write
> > > > workloads. That prompted me to experiment with a new iomode that added
> > > > some optimizations (DONTCACHE_LAZY).
> > > >
> > > > The results along with Claude's analysis are here:
> > > >
> > > > https://markdownpastebin.com/?id=387375d00b5443b3a2e37d58a062331f
> > > >
> > > > He gets a bit out over his skis on the upstream plan, but tl;dr is that
> > > > DONTCACHE_LAZY (which is DONTCACHE with some optimizations) outperforms
> > > > the other write iomodes.
> > >
> > > The analysis of the write modes seems plausible. I'm interested to hear
> > > what Mike and Jens have to say about that.
>
> Thanks for doing your testing and the summary, but I cannot help but
> feel like your test isn't coming close to realizing the O_DIRECT
> benefits over buffered that were covered in the past, e.g.:
> https://www.youtube.com/watch?v=tpPFDu9Nuuw
>
> Can Claude be made to watch a youtube video, summarize what it learned
> and then adapt its test-plan accordingly? ;)
>
I'm not sure it can. It's a good q though. I'll ask it!
> Your bandwidth for 1MB sequential IO of 793 MB/s for O_DIRECT and
> 4,952 MB/s for buffered and dontcache is considerably less than the 72
> GB/s offered in Jon's testbed. Your testing isn't exposing the
> bottlenecks (contention) of the MM subsystem for buffered IO... I've not
> yet put my finger on _why_ that is.
That may very well be, but not everyone has a box as large as the one
you and Jon were working with.
> In Jon Flynn's testing he was using a working set of 312.5% of
> available server memory, and the single client test system was using
> fio with multiple threads and sync IO to write to 16 different mounts
> (one per NVMe of the NFS server) with nconnect=16 and RDMA.
>
This test was attempting to simulate a high nconnect count from
multiple clients by using multiple fio processes. I also used the
libnfs backend to fio so I wouldn't need to worry about tuning the
kernel client.
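[Editor's note: a multi-process setup like the one described can be expressed
as a single fio job file. This is an illustrative sketch only, not the script
actually used in the benchmarks — the libnfs-based "nfs" ioengine and its
nfs_url option exist in recent fio builds, but every value here is invented.]

```ini
; Illustrative sketch -- not the job file used for the results above.
; Assumes fio built with libnfs support ("nfs" ioengine).
[global]
ioengine=nfs
nfs_url=nfs://server/export
rw=write
bs=1M
size=8g
numjobs=16        ; one fio process per simulated client connection
group_reporting

[streaming-writers]
```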
> Raw performance of a single NVMe in Jon's testbed was over 14 GB/s --
> he has the ability to drive 16 NVMe in his single NFS server. So an
> order of magnitude more capable backend storage in Jon's NFS server.
>
Very true. This box only had a single SSD. I can try to find something
with more storage though for another test.
> Big concern is your testing isn't exposing MM bottlenecks of buffered
> IO... given that, it's not really providing useful results to compare
> with O_DIRECT.
>
> Putting that aside, yes DONTCACHE as-is really isn't helpful... your
> lazy variant seems much more useful.
>
Right. I think that's the big takeaway from this. Ignoring Claude's
rambling about changing default iomodes in the server, RWF_DONTCACHE
just sucks for heavy writes. There is room for improvement there.
The big question I have is whether fixing RWF_DONTCACHE's writeback
behavior would give better results than what you were seeing with
O_DIRECT. Do you guys still have access to that test rig? I can send
you the patch if you want to test this and see how it does.
> > > One thing I'd like to hear more about is why Claude felt that disabling
> > > splice read was beneficial. My own benchmarking in that area has shown
> > > that splice read is always a win over not using splice.
> > >
> >
> > Good catch. That turns out to be a mistake in Claude's writeup.
> >
> > The test scripts left splice reads enabled for buffered reads, and the
> > results in the analysis reflect that. I (and it) have no idea why it
> > would recommend disabling them, when the testing all left them enabled
> > for buffered reads.
>
> Claude had to have picked up on the mutual exclusion with splice_read
> for both NFSD_IO_DONTCACHE and NFSD_IO_DIRECT io modes. So
> splice_read is implicitly disabled when testing NFSD_IO_DONTCACHE
> (which is buffered IO).
Yeah, Claude just got confused on the writeup. The benchmarking itself
was sound, AFAICT. Buffered reads used splice and it was disabled for
the other modes.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: A comparison of the new nfsd iomodes (and an experimental one)
2026-03-27 11:32 ` Jeff Layton
@ 2026-03-27 13:19 ` Chuck Lever
2026-03-27 16:57 ` Mike Snitzer
0 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2026-03-27 13:19 UTC (permalink / raw)
To: Jeff Layton, Mike Snitzer
Cc: linux-nfs, linux-fsdevel, linux-block, Jens Axboe
On 3/27/26 7:32 AM, Jeff Layton wrote:
> On Thu, 2026-03-26 at 16:48 -0400, Mike Snitzer wrote:
>
>> Your bandwidth for 1MB sequential IO of 793 MB/s for O_DIRECT and
>> 4,952 MB/s for buffered and dontcache is considerably less than the 72
>> GB/s offered in Jon's testbed. Your testing isn't exposing the
>> bottlenecks (contention) of the MM subsystem for buffered IO... I've not
>> yet put my finger on _why_ that is.
>
> That may very well be, but not everyone has a box as large as the one
> you and Jon were working with.
Right, and this is kind of a blocker for us. Practically speaking, Jon's
results are not reproducible.
It would be immensely helpful if the MM-tipover behavior could be
reproduced on smaller systems. Reduced physical memory size, lower
network and storage speed, and so on, so that Jeff and I can study the
issue on our own systems.
--
Chuck Lever
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: A comparison of the new nfsd iomodes (and an experimental one)
2026-03-27 13:19 ` Chuck Lever
@ 2026-03-27 16:57 ` Mike Snitzer
2026-03-28 12:37 ` Jeff Layton
0 siblings, 1 reply; 8+ messages in thread
From: Mike Snitzer @ 2026-03-27 16:57 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, linux-nfs, linux-fsdevel, linux-block, Jens Axboe,
jonathan.flynn
On Fri, Mar 27, 2026 at 09:19:07AM -0400, Chuck Lever wrote:
> On 3/27/26 7:32 AM, Jeff Layton wrote:
> > On Thu, 2026-03-26 at 16:48 -0400, Mike Snitzer wrote:
> >
> >> Your bandwidth for 1MB sequential IO of 793 MB/s for O_DIRECT and
> >> 4,952 MB/s for buffered and dontcache is considerably less than the 72
> >> GB/s offered in Jon's testbed. Your testing isn't exposing the
> >> bottlenecks (contention) of the MM subsystem for buffered IO... I've not
> >> yet put my finger on _why_ that is.
> >
> > That may very well be, but not everyone has a box as large as the one
> > you and Jon were working with.
>
> Right, and this is kind of a blocker for us. Practically speaking, Jon's
> results are not reproducible.
>
> It would be immensely helpful if the MM-tipover behavior could be
> reproduced on smaller systems. Reduced physical memory size, lower
> network and storage speed, and so on, so that Jeff and I can study the
> issue on our own systems.
Hammerspace still has the same performance lab setup, so we can
certainly reproduce if needed (and try to scale it down, etc.) but
unfortunately it's currently tied up with other important work. Jon
Flynn's testing was "only" with 2 systems (the NFS server has 16 fast
NVMe, client system connecting over 400 GbE with RDMA). But I'll see
about shaking a couple systems loose...
Might be useful for us to document the setup of the NFS server side
and give pointers for how to mount from the client system and then run
fio commandline.
I think Jens has a couple beefy servers with lots of fast NVMe. Maybe
he'd be open to testing NFS server scalability when he isn't touring
around the country watching his kid win motorsports events... JENS!? ;)
Mike
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: A comparison of the new nfsd iomodes (and an experimental one)
2026-03-27 16:57 ` Mike Snitzer
@ 2026-03-28 12:37 ` Jeff Layton
0 siblings, 0 replies; 8+ messages in thread
From: Jeff Layton @ 2026-03-28 12:37 UTC (permalink / raw)
To: Mike Snitzer, Chuck Lever
Cc: linux-nfs, linux-fsdevel, linux-block, Jens Axboe, jonathan.flynn
[-- Attachment #1: Type: text/plain, Size: 2904 bytes --]
On Fri, 2026-03-27 at 12:57 -0400, Mike Snitzer wrote:
> On Fri, Mar 27, 2026 at 09:19:07AM -0400, Chuck Lever wrote:
> > On 3/27/26 7:32 AM, Jeff Layton wrote:
> > > On Thu, 2026-03-26 at 16:48 -0400, Mike Snitzer wrote:
> > >
> > > > Your bandwidth for 1MB sequential IO of 793 MB/s for O_DIRECT and
> > > > 4,952 MB/s for buffered and dontcache is considerably less than the 72
> > > > GB/s offered in Jon's testbed. Your testing isn't exposing the
> > > > bottlenecks (contention) of the MM subsystem for buffered IO... not
> > > > yet put my finger on _why_ that is.
> > >
> > > That may very well be, but not everyone has a box as large as the one
> > > you and Jon were working with.
> >
> > Right, and this is kind of a blocker for us. Practically speaking, Jon's
> > results are not reproducible.
> >
> > It would be immensely helpful if the MM-tipover behavior could be
> > reproduced on smaller systems. Reduced physical memory size, lower
> > network and storage speed, and so on, so that Jeff and I can study the
> > issue on our own systems.
>
> Hammerspace still has the same performance lab setup, so we can
> certainly reproduce if needed (and try to scale it down, etc) but
> unfortunately its currently tied up with other important work. Jon
> Flynn's testing was "only" with 2 systems (the NFS server has 16 fast
> NVMe, client system connecting over 400 GbE with RDMA). But I'll see
> about shaking a couple systems loose...
>
> Might be useful for us to document the setup of the NFS server side
> and give pointers for how to mount from the client system and then run
> fio commandline.
>
> I think Jens has a couple beefy servers with lots of fast NVMe. Maybe
> he'd be open to testing NFS server scalability when he isn't touring
> around the country watching his kid win motorsports events.. JENS!? ;)
>
>
I'm not sure we really even need NFS in the mix to show this. Here are
the results from testing them both on local XFS (same host in this
case, just no NFS in the mix):
https://text.is/VN0K3
tl;dr: DONTCACHE_LAZY seems to be better than DONTCACHE across the
board, and it even seems to slightly edge out O_DIRECT for writes.
YMMV on your hw, of course.
The attached patch is what I'm planning to propose, if you want to give
it a spin. I'm testing one more optimization now, to see if I can get
the flush contention down even more.
Personally, I wonder if we ought to consider a proportional flush like
this in the normal buffered write codepath too. I've long thought that
we wait too long to kick off writeback: the flusher threads have a 5s
duty cycle, and all of our dirty writeback percentages were last set
when RAM sizes were much smaller.
Doing something like this in the normal buffered case might allow us to
get ahead of heavy writers more easily.
--
Jeff Layton <jlayton@kernel.org>
[-- Attachment #2: 0001-mm-fix-IOCB_DONTCACHE-write-performance-with-rate-li.patch --]
[-- Type: text/x-patch, Size: 4461 bytes --]
From 14a60064365918b38e877084d94f3a2fa4d0b747 Mon Sep 17 00:00:00 2001
From: Jeff Layton <jlayton@kernel.org>
Date: Sat, 28 Mar 2026 05:14:17 -0700
Subject: [PATCH] mm: fix IOCB_DONTCACHE write performance with rate-limited
writeback
IOCB_DONTCACHE calls filemap_flush_range() with nr_to_write=LONG_MAX
on every write, which flushes all dirty pages in the written range.
Under concurrent writers this creates severe serialization on the
writeback submission path, causing throughput to collapse to ~47% of
buffered I/O with multi-second tail latency. Even single-client
sequential writes suffer: on a 512GB file with 256GB RAM, the
aggressive flushing triggers dirty throttling that limits throughput
to 575 MB/s vs 1442 MB/s with rate-limited writeback.
Replace the filemap_flush_range() call in generic_write_sync() with a
new filemap_dontcache_writeback_range() that uses two rate-limiting
mechanisms:
1. Skip-if-busy: check mapping_tagged(PAGECACHE_TAG_WRITEBACK)
before flushing. If writeback is already in progress on the
mapping, skip the flush entirely. This eliminates writeback
submission contention between concurrent writers.
2. Proportional cap: when flushing does occur, cap nr_to_write to
the number of pages just written. This prevents any single
write from triggering a large flush that would starve concurrent
readers.
Both mechanisms are necessary: skip-if-busy alone causes I/O bursts
when the tag clears (reader p99.9 spikes 83x); proportional cap alone
still serializes on xarray locks regardless of submission size.
Pages touched under IOCB_DONTCACHE continue to be marked for eviction
(dropbehind), so page cache usage remains bounded. Ranges skipped by
the busy check are eventually flushed by background writeback or by
the next writer to find the tag clear.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
include/linux/fs.h | 7 +++++--
mm/filemap.c | 29 +++++++++++++++++++++++++++++
2 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 94695ce5e25b5..b60bcf30d62e7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2589,6 +2589,8 @@ extern int __must_check file_write_and_wait_range(struct file *file,
loff_t start, loff_t end);
int filemap_flush_range(struct address_space *mapping, loff_t start,
loff_t end);
+int filemap_dontcache_writeback_range(struct address_space *mapping,
+ loff_t start, loff_t end, ssize_t nr_written);
static inline int file_write_and_wait(struct file *file)
{
@@ -2624,8 +2626,9 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
} else if (iocb->ki_flags & IOCB_DONTCACHE) {
struct address_space *mapping = iocb->ki_filp->f_mapping;
- filemap_flush_range(mapping, iocb->ki_pos - count,
- iocb->ki_pos - 1);
+ filemap_dontcache_writeback_range(mapping,
+ iocb->ki_pos - count,
+ iocb->ki_pos - 1, count);
}
return count;
diff --git a/mm/filemap.c b/mm/filemap.c
index 9697e12dfbdcc..ed392d781c433 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -440,6 +440,35 @@ int filemap_flush_range(struct address_space *mapping, loff_t start,
}
EXPORT_SYMBOL_GPL(filemap_flush_range);
+/**
+ * filemap_dontcache_writeback_range - rate-limited writeback for dontcache I/O
+ * @mapping: target address_space
+ * @start: byte offset to start writeback
+ * @end: last byte offset (inclusive) for writeback
+ * @nr_written: number of bytes just written by the caller
+ *
+ * Rate-limited writeback for IOCB_DONTCACHE writes. Skips the flush
+ * entirely if writeback is already in progress on the mapping (skip-if-busy),
+ * and when flushing, caps nr_to_write to the number of pages just written
+ * (proportional cap). Together these avoid writeback contention between
+ * concurrent writers and prevent I/O bursts that starve readers.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int filemap_dontcache_writeback_range(struct address_space *mapping,
+ loff_t start, loff_t end, ssize_t nr_written)
+{
+ long nr;
+
+ if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+ return 0;
+
+ nr = (nr_written + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ return filemap_writeback(mapping, start, end, WB_SYNC_NONE, &nr,
+ WB_REASON_BACKGROUND);
+}
+EXPORT_SYMBOL_GPL(filemap_dontcache_writeback_range);
+
/**
* filemap_flush - mostly a non-blocking flush
* @mapping: target address_space
--
2.52.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
end of thread, other threads:[~2026-03-28 12:37 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-26 15:23 A comparison of the new nfsd iomodes (and an experimental one) Jeff Layton
2026-03-26 15:30 ` Chuck Lever
2026-03-26 16:35 ` Jeff Layton
2026-03-26 20:48 ` Mike Snitzer
2026-03-27 11:32 ` Jeff Layton
2026-03-27 13:19 ` Chuck Lever
2026-03-27 16:57 ` Mike Snitzer
2026-03-28 12:37 ` Jeff Layton
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox