Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mike Snitzer <snitzer@kernel.org>
To: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>,
	linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Jens Axboe <axboe@kernel.dk>,
	bcodding@redhat.com
Subject: Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
Date: Wed, 11 Jun 2025 16:51:03 -0400	[thread overview]
Message-ID: <aEnsN0uYB61_MyrW@kernel.org> (raw)
In-Reply-To: <1720744abfdc458bba1980e62d8fd61b06870a6e.camel@kernel.org>

On Wed, Jun 11, 2025 at 11:44:29AM -0400, Jeff Layton wrote:
> On Wed, 2025-06-11 at 11:11 -0400, Chuck Lever wrote:
> > On 6/11/25 11:07 AM, Jeff Layton wrote:
> > > On Wed, 2025-06-11 at 10:42 -0400, Chuck Lever wrote:
> > > > On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > > > > IO must be aligned, otherwise it falls back to using buffered IO.
> > > > > 
> > > > > RWF_DONTCACHE is _not_ currently used for misaligned IO (even when
> > > > > nfsd/enable-dontcache=1) because it works against us (due to RMW
> > > > > needing to read without benefit of cache), whereas buffered IO enables
> > > > > misaligned IO to be more performant.
> > > > > 
> > > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > > ---
> > > > >  fs/nfsd/vfs.c | 40 ++++++++++++++++++++++++++++++++++++----
> > > > >  1 file changed, 36 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > > > index e7cc8c6dfbad..a942609e3ab9 100644
> > > > > --- a/fs/nfsd/vfs.c
> > > > > +++ b/fs/nfsd/vfs.c
> > > > > @@ -1064,6 +1064,22 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > > >  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > > > >  }
> > > > >  
> > > > > +static bool is_dio_aligned(const struct iov_iter *iter, loff_t offset,
> > > > > +			   const u32 blocksize)
> > > > > +{
> > > > > +	u32 blocksize_mask;
> > > > > +
> > > > > +	if (!blocksize)
> > > > > +		return false;
> > > > > +
> > > > > +	blocksize_mask = blocksize - 1;
> > > > > +	if ((offset & blocksize_mask) ||
> > > > > +	    (iov_iter_alignment(iter) & blocksize_mask))
> > > > > +		return false;
> > > > > +
> > > > > +	return true;
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * nfsd_iter_read - Perform a VFS read using an iterator
> > > > >   * @rqstp: RPC transaction context
> > > > > @@ -1107,8 +1123,16 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > > >  	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> > > > >  	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> > > > >  
> > > > > -	if (nfsd_enable_dontcache)
> > > > > -		flags |= RWF_DONTCACHE;
> > > > > +	if (nfsd_enable_dontcache) {
> > > > > +		if (is_dio_aligned(&iter, offset, nf->nf_dio_read_offset_align))
> > > > > +			flags |= RWF_DIRECT;
> > > > > +		/* FIXME: not using RWF_DONTCACHE for misaligned IO because it works
> > > > > +		 * against us (due to RMW needing to read without benefit of cache),
> > > > > +		 * whereas buffered IO enables misaligned IO to be more performant.
> > > > > +		 */
> > > > > +		//else
> > > > > +		//	flags |= RWF_DONTCACHE;
> > > > > +	}
> > > > >  
> > > > >  	host_err = vfs_iter_read(file, &iter, &ppos, flags);
> > > > >  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > > > > @@ -1217,8 +1241,16 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > > >  	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> > > > >  	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> > > > >  
> > > > > -	if (nfsd_enable_dontcache)
> > > > > -		flags |= RWF_DONTCACHE;
> > > > > +	if (nfsd_enable_dontcache) {
> > > > > +		if (is_dio_aligned(&iter, offset, nf->nf_dio_offset_align))
> > > > > +			flags |= RWF_DIRECT;
> > > > > +		/* FIXME: not using RWF_DONTCACHE for misaligned IO because it works
> > > > > +		 * against us (due to RMW needing to read without benefit of cache),
> > > > > +		 * whereas buffered IO enables misaligned IO to be more performant.
> > > > > +		 */
> > > > > +		//else
> > > > > +		//	flags |= RWF_DONTCACHE;
> > > > > +	}
> > > > 
> > > > IMO adding RWF_DONTCACHE first then replacing it later in the series
> > > > with a form of O_DIRECT is confusing. Also, why add RWF_DONTCACHE here
> > > > and then take it away "because it doesn't work"?

I spoke to this in a previous reply.  I can fold patches to elininate
this distraction in v2.

> > > > But OK, your series is really a proof-of-concept. Something to work out
> > > > before it is merge-ready, I guess.
> > > > 
> > > > It is much more likely for NFS READ requests to be properly aligned.
> > > > Clients are generally good about that. NFS WRITE request alignment
> > > > is going to be arbitrary. Fwiw.

Correct, thankfully TCP reads don't misalign their payload like TCP
writes do.  As you know, the value of patch 6 is that application IO
that generates misaligned IO (as a side-effect of misaligned read
blocksize, e.g. IOR hard's 47008 blocksize) can issue reads using
O_DIRECT.

> > > > However, one thing we discussed at bake-a-thon was what to do about
> > > > unstable WRITEs. For unstable WRITEs, the server has to cache the
> > > > write data at least until the client sends a COMMIT. Otherwise the
> > > > server will have to convert all UNSTABLE writes to FILE_SYNC writes,
> > > > and that can have performance implications.
> > > > 
> > > 
> > > If we're doing synchronous, direct I/O writes then why not just respond
> > > with FILE_SYNC? The write should be on the platter by the time it
> > > returns.

For v2 I'll look to formalize responding with FILE_SYNC when
'enable-dontcache' is set.

> > Because "platter". On some devices, writes are slow.
> > 
> > For some workloads, unstable is faster. I have an experimental series
> > that makes NFSD convert all NFS WRITEs to FILE_SYNC. It was not an
> > across the board win, even with an NVMe-backed file system.
> > 
> 
> Presumably, those devices wouldn't be exported in this mode. That's
> probably a good argument for making this settable on a per-export
> basis.

Correct.  This shouldn't be used by default.  But if/when it makes
sense, it *really* sings.

> > > > One thing you might consider is to continue using the page cache for
> > > > unstable WRITEs, and then use fadvise DONTNEED after a successful
> > > > COMMIT operation to reduce page cache footprint. Unstable writes to
> > > > the same range of the file might be a problem, however.
> > > 
> > > Since the client sends almost everything UNSTABLE, that would probably
> > > erase most of the performance win. The only reason I can see to use
> > > buffered I/O in this mode would be because we had to deal with an
> > > unaligned write and need to do a RMW cycle on a block.
> > > 
> > > The big question is whether mixing buffered and direct I/O writes like
> > > this is safe across all exportable filesystems. I'm not yet convinced
> > > of that.
> > 
> > Agreed, that deserves careful scrutiny.
> > 
> 
> Like Mike is asking though, I need a better understanding of the
> potential races here:
> 
> XFS, for instance, takes the i_rwsem shared around dio writes and
> exclusive around buffered, so they should exclude each other.

> If we did all the buffered writes as RWF_SYNC, would that prevent
> corruption?

I welcome any help pinning down what must be done to ensure this
is safe ("this" being: arbitrary switching between buffered and direct
IO and associated page cache invalidation).  But to be 100% clear:
NFSD exporting XFS with enable-dontcache=1 has worked very well.

Do we need to go to the extreme of each filesystem exporting support
with a new flag like FOP_INVALIDATES_BUFFERED_VS_DIRECT?  And if set,
any evidence to the contrary is a bug?

And does the VFS have a role in ensuring it's safe or can we assume
vfs/mm/etc are intended to be safe and any core common code that
proves otherwise is a bug?

> In any case, for now at least, unless you're using RDMA, it's going to
> end up falling back to buffered writes everywhere. The data is almost
> never going to be properly aligned coming in off the wire. That might
> be fixable though.

Ben Coddington mentioned to me that soft-iwarp would allow use of RDMA
over TCP to workaround SUNRPC TCP's XDR handling always storing the
write payload in misaligned IO.  But that's purely a stop-gap
workaround, which needs testing (to see if soft-iwap negates the win
of using O_DIRECT, etc).

But a long-term better fix is absolutely needed, to be continued (in
the subthread I need to get going)...

Mike

next prev parent reply	other threads:[~2025-06-11 20:51 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
2025-06-10 20:57 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Mike Snitzer
2025-06-11  6:57   ` Christoph Hellwig
2025-06-11 10:44     ` Mike Snitzer
2025-06-11 13:04       ` Jeff Layton
2025-06-11 13:56     ` Chuck Lever
2025-06-11 14:31   ` Chuck Lever
2025-06-11 19:18     ` Mike Snitzer
2025-06-11 20:29       ` Jeff Layton
2025-06-11 21:36         ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] Mike Snitzer
2025-06-12 10:28           ` Jeff Layton
2025-06-12 11:28             ` Jeff Layton
2025-06-12 13:28             ` Chuck Lever
2025-06-12 14:17               ` Benjamin Coddington
2025-06-12 15:56                 ` Mike Snitzer
2025-06-12 15:58                   ` Chuck Lever
2025-06-12 16:12                     ` Mike Snitzer
2025-06-12 16:32                       ` Chuck Lever
2025-06-13  5:39                     ` Christoph Hellwig
2025-06-12 16:22               ` Jeff Layton
2025-06-13  5:46                 ` Christoph Hellwig
2025-06-13  9:23                   ` Mike Snitzer
2025-06-13 13:02                     ` Jeff Layton
2025-06-16 12:35                       ` Christoph Hellwig
2025-06-16 12:29                     ` Christoph Hellwig
2025-06-16 16:07                       ` Mike Snitzer
2025-06-17  4:37                         ` Christoph Hellwig
2025-06-17 20:26                           ` Mike Snitzer
2025-06-17 22:23                             ` [RFC PATCH] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec [was: Re: need SUNRPC TCP to receive into aligned pages] Mike Snitzer
2025-07-03  0:12             ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] NeilBrown
2025-06-12  7:13         ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Christoph Hellwig
2025-06-12 13:15           ` Chuck Lever
2025-06-12 13:21       ` Chuck Lever
2025-06-12 16:00         ` Mike Snitzer
2025-06-16 13:32           ` Chuck Lever
2025-06-16 16:10             ` Mike Snitzer
2025-06-17 17:22               ` Mike Snitzer
2025-06-17 17:31                 ` Chuck Lever
2025-06-19 20:19                   ` Mike Snitzer
2025-06-30 14:50                     ` Chuck Lever
2025-07-04 19:46                       ` Mike Snitzer
2025-07-04 19:49                         ` Chuck Lever
2025-06-10 20:57 ` [PATCH 2/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-06-10 20:57 ` [PATCH 3/6] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-06-10 20:57 ` [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis Mike Snitzer
2025-06-11  6:58   ` Christoph Hellwig
2025-06-11 10:51     ` Mike Snitzer
2025-06-11 14:17     ` Chuck Lever
2025-06-12  7:15       ` Christoph Hellwig
2025-06-10 20:57 ` [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes Mike Snitzer
2025-06-11  7:00   ` Christoph Hellwig
2025-06-11 12:23     ` Mike Snitzer
2025-06-11 13:30       ` Jeff Layton
2025-06-12  7:22         ` Christoph Hellwig
2025-06-12  7:23       ` Christoph Hellwig
2025-06-11 14:42   ` Chuck Lever
2025-06-11 15:07     ` Jeff Layton
2025-06-11 15:11       ` Chuck Lever
2025-06-11 15:44         ` Jeff Layton
2025-06-11 20:51           ` Mike Snitzer [this message]
2025-06-12  7:32           ` Christoph Hellwig
2025-06-12  7:28         ` Christoph Hellwig
2025-06-12  7:25       ` Christoph Hellwig
2025-06-10 20:57 ` [PATCH 6/6] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-06-11 12:55 ` [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Jeff Layton
2025-06-12  7:39   ` Christoph Hellwig
2025-06-12 20:37     ` Mike Snitzer
2025-06-13  5:31       ` Christoph Hellwig
2025-06-11 14:16 ` Chuck Lever
2025-06-11 18:02   ` Mike Snitzer
2025-06-11 19:06     ` Chuck Lever
2025-06-11 19:58       ` Mike Snitzer
2025-06-12 13:46 ` Chuck Lever
2025-06-12 19:08   ` Mike Snitzer
2025-06-12 20:17     ` Chuck Lever

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aEnsN0uYB61_MyrW@kernel.org \
    --to=snitzer@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=bcodding@redhat.com \
    --cc=chuck.lever@oracle.com \
    --cc=jlayton@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.