All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@kernel.org>
To: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>,
	NeilBrown <neilb@ownmail.net>, Jeff Layton <jlayton@kernel.org>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	linux-nfs@vger.kernel.org
Subject: Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
Date: Tue, 11 Nov 2025 19:06:34 -0500	[thread overview]
Message-ID: <aRPPikkia5ZVZxJG@kernel.org> (raw)
In-Reply-To: <aRL5EPMD9VsG1n3D@infradead.org>

On Tue, Nov 11, 2025 at 12:51:28AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 10, 2025 at 11:41:09AM -0500, Chuck Lever wrote:
> > On 11/7/25 9:01 PM, Mike Snitzer wrote:
> > > Q: Case 2 uses DONTCACHE, so case 1 should too right?
> > > 
> > > A: NO. There is legit benefit to having case 1 use cached buffered IO
> > > when issuing its 2 subpage IOs; and that benefit doesn't cause harm> because order-0 page management is not causing the MM problems that
> > > NFSD_IO_DIRECT sets out to avoid (whereas higher order cached buffered
> > > IO is exactly what both DONTCACHE and NFSD_IO_DIRECT aim to avoid.
> > > Otherwise MM spins and spins trying to find adequate free pages,
> > > cannot so does dirty writeback and reclaim which causes kswapd and
> > > kcompactd to burn cpu, etc).
> > 
> > Paraphrasing: Each unaligned end (case 1) is always smaller than a page,
> > thus it will stick with order-0 allocations (if that byte range is not
> > already in the page cache), allocations that are, practically speaking,
> > reliable.
> > 
> > However it might be a stretch to claim that an order-0 allocation
> > /never/ drives memory reclaim.
> 
> That is of course not true.  order-0 pages of course do drive memory
> reclaim.  In fact if you're on a file system without huge folios and
> an applications that doesn't use THP, the majority of reclaim is
> driven by order 0 allocations.

Correct, but order-0 pages aren't the burden on MM that reclaim of
higher order pages causes.

To be clear I didn't say what Chuck paraphrased with "/never/", etc.
But anyway I think we're all in agreement.

> > What we still don't know is exactly what the extra cost of setting
> > DONTCACHE is, even on small writes. Maybe DONTCACHE should be cleared
> > for /all/ segments that are smaller than a page?
> 
> I suspect the best initial tweak is for every segment or entire write
> that is not page aligned in the file, as that is an indicator that
> multiple RMW cycles are possible.  At least if we're streaming, but
> we don't have that information.  That means all of these cases:
> 
>  1) writes smaller than PAGE_SIZE
>  2) writes smaller than PAGE_SIZE * 2 but not aligned to PAGE_SIZE
>  3) unaligned end segments < PAGE_SIZE
> 
> If we want to fine tune, we'd probably expand case 2 a bit as a single
> page cache operation on an order 2 pages is going to be faster than
> three I/Os most of the time, but compared to the high level discussion
> here that's minor details.

I agree, the following should accomplish this.

Would prefer we get this heuristic agreed on so it can land along with
the NFSD_IO_DIRECT WRITE support for the 6.19 merge window.

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 02ff6bd48a18..ce162ae2f985 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1277,6 +1277,12 @@ nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
 	segment->flags = iocb->ki_flags;
 }
 
+/*
+ * Unaligned IO that is smaller than order-2 is best
+ * handled using cached buffered IO rather than DONTCACHE.
+ */
+#define NFSD_DONTCACHE_MIN_BYTES ((PAGE_SIZE << 2) + 1)
+
 static unsigned int
 nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
 			  unsigned int nvecs, struct kiocb *iocb,
@@ -1284,6 +1290,7 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
 			  struct nfsd_write_dio_seg segments[3])
 {
 	u32 offset_align = nf->nf_dio_offset_align;
+	bool use_cached_buffered_fallback = false;
 	loff_t prefix_end, orig_end, middle_end;
 	u32 mem_align = nf->nf_dio_mem_align;
 	size_t prefix, middle, suffix;
@@ -1295,6 +1302,8 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
 	 * If alignments are not available, the write is too small,
 	 * or no alignment can be found, fall back to buffered I/O.
 	 */
+	if (total < NFSD_DONTCACHE_MIN_BYTES)
+		use_cached_buffered_fallback = true;
 	if (unlikely(!mem_align || !offset_align) ||
 	    unlikely(total < max(offset_align, mem_align)))
 		goto no_dio;
@@ -1333,12 +1342,29 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
 		nfsd_write_dio_seg_init(&segments[nsegs++], bvec, nvecs, total,
 					prefix + middle, suffix, iocb);
 
+	/*
+	 * Unaligned end segments will always use cached buffered IO
+	 * because they are less than PAGE_SIZE.
+	 */
+	if ((prefix || suffix) && use_cached_buffered_fallback) {
+		/*
+		 * This unaligned IO is small enough that it
+		 * warrants fall back to a single cached buffered IO.
+		 */
+		goto no_dio;
+	}
+
 	return nsegs;
 
 no_dio:
 	/* No DIO alignment possible - pack into single non-DIO segment. */
 	nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total, 0,
 				total, iocb);
+	/* Mark the I/O buffer as evict-able to reduce memory contention. */
+	if (!use_cached_buffered_fallback &&
+	    nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+		kiocb->ki_flags |= IOCB_DONTCACHE;
+
 	return 1;
 }
 
@@ -1361,16 +1387,9 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		if (kiocb->ki_flags & IOCB_DIRECT)
 			trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
 						segments[i].iter.count);
-		else {
+		else
 			trace_nfsd_write_vector(rqstp, fhp, kiocb->ki_pos,
 						segments[i].iter.count);
-			/*
-			 * Mark the I/O buffer as evict-able to reduce
-			 * memory contention.
-			 */
-			if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
-				kiocb->ki_flags |= IOCB_DONTCACHE;
-		}
 
 		host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
 		if (host_err < 0)

  parent reply	other threads:[~2025-11-12  0:06 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-07 15:34 [PATCH v11 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:34 ` [PATCH v11 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-11-07 15:34 ` [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:39   ` Christoph Hellwig
2025-11-07 15:40     ` Chuck Lever
2025-11-07 20:05       ` Mike Snitzer
2025-11-07 20:08         ` Chuck Lever
2025-11-07 20:10           ` Mike Snitzer
2025-11-07 21:58             ` NeilBrown
2025-11-07 22:24               ` Mike Snitzer
2025-11-07 23:42                 ` NeilBrown
2025-11-08  2:01                   ` Mike Snitzer
2025-11-10 16:41                     ` Chuck Lever
2025-11-10 17:57                       ` Mike Snitzer
2025-11-11  8:51                       ` Christoph Hellwig
2025-11-11 14:20                         ` Chuck Lever
2025-11-11 14:21                           ` Christoph Hellwig
2025-11-12  0:06                         ` Mike Snitzer [this message]
2025-11-12 15:02                           ` Christoph Hellwig
2025-11-12 23:14                             ` Mike Snitzer
2025-11-13  8:13                               ` Christoph Hellwig
2025-11-13 21:45                                 ` Mike Snitzer
2025-11-07 20:28     ` Chuck Lever
2025-11-07 22:16       ` Mike Snitzer
2025-11-10  9:12         ` Christoph Hellwig
2025-11-10 15:42           ` Mike Snitzer
2025-11-11  8:44             ` Christoph Hellwig
2025-11-10  9:17       ` Christoph Hellwig
2025-11-10 15:43         ` Mike Snitzer
2025-11-07 17:18   ` Mike Snitzer
2025-11-07 22:13   ` NeilBrown
2025-11-07 15:34 ` [PATCH v11 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aRPPikkia5ZVZxJG@kernel.org \
    --to=snitzer@kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=dai.ngo@oracle.com \
    --cc=hch@infradead.org \
    --cc=jlayton@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@ownmail.net \
    --cc=okorniev@redhat.com \
    --cc=tom@talpey.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.