public inbox for linux-fsdevel@vger.kernel.org
From: Jeff Layton <jlayton@kernel.org>
To: Mike Snitzer <snitzer@kernel.org>, Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	 linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>,
	 jonathan.flynn@hammerspace.com
Subject: Re: A comparison of the new nfsd iomodes (and an experimental one)
Date: Sat, 28 Mar 2026 08:37:30 -0400	[thread overview]
Message-ID: <cf3aba50556defa87ecbb38ba1af045ef3bc9fee.camel@kernel.org> (raw)
In-Reply-To: <aca3ApIPUGAovh_7@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2904 bytes --]

On Fri, 2026-03-27 at 12:57 -0400, Mike Snitzer wrote:
> On Fri, Mar 27, 2026 at 09:19:07AM -0400, Chuck Lever wrote:
> > On 3/27/26 7:32 AM, Jeff Layton wrote:
> > > On Thu, 2026-03-26 at 16:48 -0400, Mike Snitzer wrote:
> > > 
> > > > Your bandwidth for 1MB sequential IO of 793 MB/s for O_DIRECT and
> > > > 4,952 MB/s for buffered and dontcache is considerably less than the 72
> > > > GB/s offered in Jon's testbed.  Your testing isn't exposing the
> > > > bottlenecks (contention) of the MM subsystem for buffered IO... I've
> > > > not yet put my finger on _why_ that is.
> > > 
> > > That may very well be, but not everyone has a box as large as the one
> > > you and Jon were working with.
> > 
> > Right, and this is kind of a blocker for us. Practically speaking, Jon's
> > results are not reproducible.
> > 
> > It would be immensely helpful if the MM-tipover behavior could be
> > reproduced on smaller systems. Reduced physical memory size, lower
> > network and storage speed, and so on, so that Jeff and I can study the
> > issue on our own systems.
> 
> Hammerspace still has the same performance lab setup, so we can
> certainly reproduce if needed (and try to scale it down, etc) but
> unfortunately it's currently tied up with other important work. Jon
> Flynn's testing was "only" with 2 systems (the NFS server has 16 fast
> NVMe, client system connecting over 400 GbE with RDMA). But I'll see
> about shaking a couple systems loose...
> 
> Might be useful for us to document the setup of the NFS server side
> and give pointers for how to mount from the client system and then run
> fio commandline.
> 
> I think Jens has a couple beefy servers with lots of fast NVMe. Maybe
> he'd be open to testing NFS server scalability when he isn't touring
> around the country watching his kid win motorsports events.. JENS!? ;)
> 
> 

I'm not sure we really even need NFS in the mix to show this. Here are
the results from testing both on local XFS (same host in this case,
just no NFS in the mix):

    https://text.is/VN0K3

tl;dr: DONTCACHE_LAZY seems to be better than DONTCACHE across the
board, and it even seems to slightly edge out O_DIRECT for writes.
YMMV on your hw, of course.

The attached patch is what I'm planning to propose, if you want to give
it a spin. I'm testing one more optimization now, to see if I can get
the flush contention down even more.

Personally, I wonder if we ought to consider a proportional flush like
this in the normal buffered write codepath too. I've long thought that
we wait too long to kick off writeback: the flusher threads have a 5s
duty cycle, and all of our dirty writeback percentages were last set
when RAM sizes were much smaller.

Doing something like this in the normal buffered case might allow us to
get ahead of heavy writers more easily.
-- 
Jeff Layton <jlayton@kernel.org>

[-- Attachment #2: 0001-mm-fix-IOCB_DONTCACHE-write-performance-with-rate-li.patch --]
[-- Type: text/x-patch, Size: 4461 bytes --]

From 14a60064365918b38e877084d94f3a2fa4d0b747 Mon Sep 17 00:00:00 2001
From: Jeff Layton <jlayton@kernel.org>
Date: Sat, 28 Mar 2026 05:14:17 -0700
Subject: [PATCH] mm: fix IOCB_DONTCACHE write performance with rate-limited
 writeback

IOCB_DONTCACHE calls filemap_flush_range() with nr_to_write=LONG_MAX
on every write, which flushes all dirty pages in the written range.
Under concurrent writers this creates severe serialization on the
writeback submission path, causing throughput to collapse to ~47% of
buffered I/O with multi-second tail latency.  Even single-client
sequential writes suffer: on a 512GB file with 256GB RAM, the
aggressive flushing triggers dirty throttling that limits throughput
to 575 MB/s vs 1442 MB/s with rate-limited writeback.

Replace the filemap_flush_range() call in generic_write_sync() with a
new filemap_dontcache_writeback_range() that uses two rate-limiting
mechanisms:

  1. Skip-if-busy: check mapping_tagged(PAGECACHE_TAG_WRITEBACK)
     before flushing.  If writeback is already in progress on the
     mapping, skip the flush entirely.  This eliminates writeback
     submission contention between concurrent writers.

  2. Proportional cap: when flushing does occur, cap nr_to_write to
     the number of pages just written.  This prevents any single
     write from triggering a large flush that would starve concurrent
     readers.

Both mechanisms are necessary: skip-if-busy alone causes I/O bursts
when the tag clears (reader p99.9 spikes 83x); proportional cap alone
still serializes on xarray locks regardless of submission size.

Pages touched under IOCB_DONTCACHE continue to be marked for eviction
(dropbehind), so page cache usage remains bounded.  Ranges skipped by
the busy check are eventually flushed by background writeback or by
the next writer to find the tag clear.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/fs.h |  7 +++++--
 mm/filemap.c       | 29 +++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 94695ce5e25b5..b60bcf30d62e7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2589,6 +2589,8 @@ extern int __must_check file_write_and_wait_range(struct file *file,
 						loff_t start, loff_t end);
 int filemap_flush_range(struct address_space *mapping, loff_t start,
 		loff_t end);
+int filemap_dontcache_writeback_range(struct address_space *mapping,
+		loff_t start, loff_t end, ssize_t nr_written);
 
 static inline int file_write_and_wait(struct file *file)
 {
@@ -2624,8 +2626,9 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
 	} else if (iocb->ki_flags & IOCB_DONTCACHE) {
 		struct address_space *mapping = iocb->ki_filp->f_mapping;
 
-		filemap_flush_range(mapping, iocb->ki_pos - count,
-				iocb->ki_pos - 1);
+		filemap_dontcache_writeback_range(mapping,
+				iocb->ki_pos - count,
+				iocb->ki_pos - 1, count);
 	}
 
 	return count;
diff --git a/mm/filemap.c b/mm/filemap.c
index 9697e12dfbdcc..ed392d781c433 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -440,6 +440,35 @@ int filemap_flush_range(struct address_space *mapping, loff_t start,
 }
 EXPORT_SYMBOL_GPL(filemap_flush_range);
 
+/**
+ * filemap_dontcache_writeback_range - rate-limited writeback for dontcache I/O
+ * @mapping:	target address_space
+ * @start:	byte offset to start writeback
+ * @end:	last byte offset (inclusive) for writeback
+ * @nr_written:	number of bytes just written by the caller
+ *
+ * Rate-limited writeback for IOCB_DONTCACHE writes.  Skips the flush
+ * entirely if writeback is already in progress on the mapping (skip-if-busy),
+ * and when flushing, caps nr_to_write to the number of pages just written
+ * (proportional cap).  Together these avoid writeback contention between
+ * concurrent writers and prevent I/O bursts that starve readers.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int filemap_dontcache_writeback_range(struct address_space *mapping,
+		loff_t start, loff_t end, ssize_t nr_written)
+{
+	long nr;
+
+	if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+		return 0;
+
+	nr = (nr_written + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	return filemap_writeback(mapping, start, end, WB_SYNC_NONE, &nr,
+			WB_REASON_BACKGROUND);
+}
+EXPORT_SYMBOL_GPL(filemap_dontcache_writeback_range);
+
 /**
  * filemap_flush - mostly a non-blocking flush
  * @mapping:	target address_space
-- 
2.52.0



Thread overview: 8+ messages
2026-03-26 15:23 A comparison of the new nfsd iomodes (and an experimental one) Jeff Layton
2026-03-26 15:30 ` Chuck Lever
2026-03-26 16:35   ` Jeff Layton
2026-03-26 20:48     ` Mike Snitzer
2026-03-27 11:32       ` Jeff Layton
2026-03-27 13:19         ` Chuck Lever
2026-03-27 16:57           ` Mike Snitzer
2026-03-28 12:37             ` Jeff Layton [this message]
