From: Mike Snitzer <snitzer@kernel.org>
To: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>,
NeilBrown <neilb@ownmail.net>, Jeff Layton <jlayton@kernel.org>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
linux-nfs@vger.kernel.org, jonathan.flynn@hammerspace.com
Subject: Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
Date: Wed, 12 Nov 2025 18:14:15 -0500 [thread overview]
Message-ID: <aRUUx2c6YNnRYRZ_@kernel.org> (raw)
In-Reply-To: <aRShjU_Ti7f2Ci7I@infradead.org>
On Wed, Nov 12, 2025 at 07:02:37AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 11, 2025 at 07:06:34PM -0500, Mike Snitzer wrote:
> > + /* Mark the I/O buffer as evict-able to reduce memory contention. */
> > + if (!use_cached_buffered_fallback &&
> > + nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> > + kiocb->ki_flags |= IOCB_DONTCACHE;
>
> This won't work, because ki_flags get overwritten by the flags in
> the segment layer.
Thanks, yeah that was a copy-n-paste bug, should've been:
segments[0].flags |= IOCB_DONTCACHE;
I should've stated that my patch wasn't tested and was intended as
RFC.
> I walked through this to understand all the cases,
> and ended up with a slightly different version including long comments
> explaining them. This is also untested, but I'd love to have other
> cross check the logic in it:
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index ab46301da4ae..4727235ef43f 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1292,12 +1292,23 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
> unsigned int nsegs = 0;
>
> /*
> - * Check if direct I/O is feasible for this write request.
> - * If alignments are not available, the write is too small,
> - * or no alignment can be found, fall back to buffered I/O.
> + * If the file system doesn't advertise any alignment requirements,
> + * don't try to issue direct I/O. Fall back to uncached buffered
> + * I/O if possible because we'll assume it is not block based and
> + * doesn't need read-modify-write cycles.
> */
Not sure what is implied by "not block based" in this comment.
> - if (unlikely(!mem_align || !offset_align) ||
> - unlikely(total < max(offset_align, mem_align)))
> + if (unlikely(!mem_align || !offset_align)) {
> + if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> + segments[0].flags |= IOCB_DONTCACHE;
> + goto no_dio;
> + }
> +
> + /*
> + * If the I/O is smaller than the larger of the memory and logical
> + * offset alignment, it is like to require read-modify-write cycles.
> + * Issue cached buffered I/O.
> + */
Nit: s/like/likely/ typo.
> + if (unlikely(total < max(offset_align, mem_align)))
> goto no_dio;
>
> prefix_end = round_up(offset, offset_align);
> @@ -1308,7 +1319,17 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
> middle = middle_end - prefix_end;
> suffix = orig_end - middle_end;
>
> - if (!middle)
> + /*
> + * If there is no aligned middle section, or aligned part is tiny,
> + * issue a single buffered I/O write instead of splitting up the
> + * write.
> + *
> + * Note: the middle section size here is random constant. I suspect
> + * when benchmarking it we'd actually end up with a significant larger
> + * number, with the details depending on hardware.
> + */
> + if (!middle ||
> + ((prefix || suffix) && middle < PAGE_SIZE * 2))
> goto no_dio;
>
> if (prefix)
Expressing this threshold in terms of PAGE_SIZE might be a
problem for other archs like ARM (where using 64K the norm for
performance).
The "Note:" portion of the above comment might do well to address the
goal being to avoid using cached buffered for higher order
allocations. You or Chuck may have solid ideas for refining wording.
But so you're aware Jon Flynn (who kindly does the heavy lifting of
performance testing for me) agrees with you: larger will likely be
beneficial but it'll be hardware dependent. So I'm going to expose a
knob for Jon to try various values so we can see.
From Jon's results we might elect to take further action.
> @@ -1324,16 +1345,24 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
> * bvecs generated from RPC receive buffers are contiguous: After
> * the first bvec, all subsequent bvecs start at bv_offset zero
> * (page-aligned). Therefore, only the first bvec is checked.
> + *
> + * If the memory is not aligned at all, but we have a large enough
> + * logical offset-aligned middle section, try to use uncached buffered
> + * I/O for that to avoid cache pollution. If not fall back to a single
> + * cached buffered I/O for the entire write.
> */
> - if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1))
> - goto no_dio;
> - segments[nsegs].flags |= IOCB_DIRECT;
> + if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1)) {
> + if (!(nf->nf_file->f_op->fop_flags & FOP_DONTCACHE))
> + goto no_dio;
> + segments[nsegs].flags |= IOCB_DONTCACHE;
This is a new one for me, but its a nice catch: just because
IOCBD_DIRECT isn't possible for the middle doesn't mean that we should
avoid issuing the subpage segments in terms of buffered IO.
> + } else {
> + segments[nsegs].flags |= IOCB_DIRECT;
> + }
> nsegs++;
>
> if (suffix)
> nfsd_write_dio_seg_init(&segments[nsegs++], bvec, nvecs, total,
> prefix + middle, suffix, iocb);
> -
> return nsegs;
>
> no_dio:
> @@ -1362,16 +1391,9 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (kiocb->ki_flags & IOCB_DIRECT)
> trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
> segments[i].iter.count);
> - else {
> + else
> trace_nfsd_write_vector(rqstp, fhp, kiocb->ki_pos,
> segments[i].iter.count);
> - /*
> - * Mark the I/O buffer as evict-able to reduce
> - * memory contention.
> - */
> - if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> - kiocb->ki_flags |= IOCB_DONTCACHE;
> - }
>
> host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
> if (host_err < 0)
Overall this patch looks good. Would welcome seeing you submit a
formal patch. In parallel I'll work with Jon to get this tested on
Hammerspace's performance cluster to confirm all looks good.
Mike
next prev parent reply other threads:[~2025-11-12 23:14 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-07 15:34 [PATCH v11 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:34 ` [PATCH v11 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-11-07 15:34 ` [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:39 ` Christoph Hellwig
2025-11-07 15:40 ` Chuck Lever
2025-11-07 20:05 ` Mike Snitzer
2025-11-07 20:08 ` Chuck Lever
2025-11-07 20:10 ` Mike Snitzer
2025-11-07 21:58 ` NeilBrown
2025-11-07 22:24 ` Mike Snitzer
2025-11-07 23:42 ` NeilBrown
2025-11-08 2:01 ` Mike Snitzer
2025-11-10 16:41 ` Chuck Lever
2025-11-10 17:57 ` Mike Snitzer
2025-11-11 8:51 ` Christoph Hellwig
2025-11-11 14:20 ` Chuck Lever
2025-11-11 14:21 ` Christoph Hellwig
2025-11-12 0:06 ` Mike Snitzer
2025-11-12 15:02 ` Christoph Hellwig
2025-11-12 23:14 ` Mike Snitzer [this message]
2025-11-13 8:13 ` Christoph Hellwig
2025-11-13 21:45 ` Mike Snitzer
2025-11-07 20:28 ` Chuck Lever
2025-11-07 22:16 ` Mike Snitzer
2025-11-10 9:12 ` Christoph Hellwig
2025-11-10 15:42 ` Mike Snitzer
2025-11-11 8:44 ` Christoph Hellwig
2025-11-10 9:17 ` Christoph Hellwig
2025-11-10 15:43 ` Mike Snitzer
2025-11-07 17:18 ` Mike Snitzer
2025-11-07 22:13 ` NeilBrown
2025-11-07 15:34 ` [PATCH v11 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aRUUx2c6YNnRYRZ_@kernel.org \
--to=snitzer@kernel.org \
--cc=chuck.lever@oracle.com \
--cc=dai.ngo@oracle.com \
--cc=hch@infradead.org \
--cc=jlayton@kernel.org \
--cc=jonathan.flynn@hammerspace.com \
--cc=linux-nfs@vger.kernel.org \
--cc=neilb@ownmail.net \
--cc=okorniev@redhat.com \
--cc=tom@talpey.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).