linux-nfs.vger.kernel.org archive mirror
From: Jeff Layton <jlayton@kernel.org>
To: Mike Snitzer <snitzer@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>, linux-nfs@vger.kernel.org
Subject: Re: [PATCH v7 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
Date: Mon, 18 Aug 2025 15:27:56 -0400
Message-ID: <eea0862de2cc9d9204527ddcbd9455422bcd042e.camel@kernel.org>
In-Reply-To: <aKN5Yy6Y08VozjwF@kernel.org>

On Mon, 2025-08-18 at 15:05 -0400, Mike Snitzer wrote:
> On Mon, Aug 18, 2025 at 10:45:47AM -0400, Jeff Layton wrote:
> > On Fri, 2025-08-15 at 10:46 -0400, Mike Snitzer wrote:
> > > If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
> > > DIO-aligned block (on either end of the READ). The expanded READ is
> > > verified to have proper offset/len (logical_block_size) and
> > > dma_alignment checking.
> > > 
> > > Must allocate and use a bounce-buffer page (called 'start_extra_page')
> > > if/when expanding the misaligned READ requires reading extra partial
> > > page at the start of the READ so that it's DIO-aligned. Otherwise that
> > > extra page at the start will make its way back to the NFS client and
> > > corruption will occur. The corruption was found, and this fix of using
> > > an extra page was verified, using the 'dt' utility:
> > >   dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
> > >      iotype=sequential pattern=iot onerr=abort oncerr=abort
> > > see: https://github.com/RobinTMiller/dt.git
> > > 
> > > Any misaligned READ that is less than 32K won't be expanded to be
> > > DIO-aligned (this heuristic just avoids excess work, like allocating
> > > start_extra_page, for smaller IO that can generally already perform
> > > well using buffered IO).
> > > 
> > > Suggested-by: Jeff Layton <jlayton@kernel.org>
> > > Suggested-by: Chuck Lever <chuck.lever@oracle.com>
> > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > ---
> > >  fs/nfsd/vfs.c              | 200 +++++++++++++++++++++++++++++++++++--
> > >  include/linux/sunrpc/svc.h |   5 +-
> > >  2 files changed, 194 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index c340708fbab4d..64732dc8985d6 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -19,6 +19,7 @@
> > >  #include <linux/splice.h>
> > >  #include <linux/falloc.h>
> > >  #include <linux/fcntl.h>
> > > +#include <linux/math.h>
> > >  #include <linux/namei.h>
> > >  #include <linux/delay.h>
> > >  #include <linux/fsnotify.h>
> > > @@ -1073,6 +1074,153 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > >  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > >  }
> > >  
> > > +struct nfsd_read_dio {
> > > +	loff_t start;
> > > +	loff_t end;
> > > +	unsigned long start_extra;
> > > +	unsigned long end_extra;
> > > +	struct page *start_extra_page;
> > > +};
> > > +
> > > +static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
> > > +{
> > > +	memset(read_dio, 0, sizeof(*read_dio));
> > > +	read_dio->start_extra_page = NULL;
> > > +}
> > > +
> > > +#define NFSD_READ_DIO_MIN_KB (32 << 10)
> > > +
> > > +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > +				  struct nfsd_file *nf, loff_t offset,
> > > +				  unsigned long len, unsigned int base,
> > > +				  struct nfsd_read_dio *read_dio)
> > > +{
> > > +	const u32 dio_blocksize = nf->nf_dio_read_offset_align;
> > > +	loff_t middle_end, orig_end = offset + len;
> > > +
> > > +	if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
> > > +		      "%s: underlying filesystem has not provided DIO alignment info\n",
> > > +		      __func__))
> > > +		return false;
> > > +	if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> > > +		      "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> > > +		      __func__, dio_blocksize, PAGE_SIZE))
> > > +		return false;
> > > +
> > > +	/* Return early if IO is irreparably misaligned (len < dio_blocksize,
> > > +	 * or base not aligned to nf_dio_mem_align).
> > > +	 * On-disk alignment is implied by the following code that expands
> > > +	 * misaligned IO to have a DIO-aligned offset and len.
> > > +	 */
> > > +	if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
> > > +		return false;
> > 
> > The small len check makes sense, but "base" at this point is the offset
> > into the first page. Here you're bailing out early if that's not
> > aligned. Isn't that contrary to what this patch is supposed to do
> > (which is expand the range so that the I/O is aligned)?
> 
> No matter whether we're expanding the read or not (that's a means to
> make the area read from disk DIO-aligned): the memory alignment is
> what it is -- so it isn't something that we can change (not without an
> extra copy). But thankfully with RDMA the memory for the READ payload
> is generally always aligned.
> 
> Chuck did say in reply to an earlier version of this patchset
> (paraphrasing, rather not go spelunking in the linux-nfs archive to
> find it): a future improvement would be to make sure the READ
> payload's memory is always aligned.
> 

Oh right, I got confused between the mem and block alignment here.

In light of that, you can add

Reviewed-by: Jeff Layton <jlayton@kernel.org>

Thread overview: 15+ messages
2025-08-15 14:46 [PATCH v7 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
2025-08-15 14:46 ` [PATCH v7 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-08-15 14:46 ` [PATCH v7 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-08-15 14:46 ` [PATCH v7 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
2025-08-18 14:33   ` Jeff Layton
2025-08-15 14:46 ` [PATCH v7 4/7] NFSD: add io_cache_write " Mike Snitzer
2025-08-18 14:34   ` Jeff Layton
2025-08-15 14:46 ` [PATCH v7 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-08-18 14:45   ` Jeff Layton
2025-08-18 19:05     ` Mike Snitzer
2025-08-18 19:27       ` Jeff Layton [this message]
2025-08-15 14:46 ` [PATCH v7 6/7] NFSD: issue WRITEs " Mike Snitzer
2025-08-18 19:46   ` Jeff Layton
2025-08-26 16:34     ` Mike Snitzer
2025-08-15 14:46 ` [PATCH v7 7/7] NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events Mike Snitzer
