Linux NFS development
From: Mike Snitzer <snitzer@kernel.org>
To: NeilBrown <neilb@ownmail.net>
Cc: Chuck Lever <cel@kernel.org>,
	jlayton@kernel.org, okorniev@redhat.com, dai.ngo@oracle.com,
	tom@talpey.com, linux-nfs@vger.kernel.org
Subject: Re: [PATCH v2 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
Date: Thu, 18 Sep 2025 11:20:31 -0400
Message-ID: <aMwjP8DrrxzOy-5-@kernel.org>
In-Reply-To: <aMwcUdWdey69k2iK@kernel.org>

On Thu, Sep 18, 2025 at 10:50:57AM -0400, Mike Snitzer wrote:
> On Thu, Sep 18, 2025 at 09:29:48AM +1000, NeilBrown wrote:
> > On Thu, 18 Sep 2025, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Add an experimental option that forces NFS READ operations to use
> > > direct I/O instead of reading through the NFS server's page cache.
> > > 
> > > There are already other layers of caching:
> > >  - The page cache on NFS clients
> > >  - The block device underlying the exported file system
> > > 
> > > The server's page cache, in many cases, is unlikely to provide
> > > additional benefit. Some benchmarks have demonstrated that the
> > > server's page cache is actively detrimental for workloads whose
> > > working set is larger than the server's available physical memory.
> > > 
> > > For instance, on small NFS servers, cached NFS file content can
> > > squeeze out local memory consumers. For large sequential workloads,
> > > an enormous amount of data flows into and out of the page cache
> > > and is consumed by NFS clients exactly once -- caching that data
> > > is expensive to do and totally valueless.
> > > 
> > > For now this is a hidden option that can be enabled on test
> > > systems for benchmarking. In the longer term, this option might
> > > be enabled persistently or per-export. When the exported file
> > > system does not support direct I/O, NFSD falls back to using
> > > either DONTCACHE or buffered I/O to fulfill NFS READ requests.
> > > 
> > > Suggested-by: Mike Snitzer <snitzer@kernel.org>
> > > Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> > > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > ---
> > >  fs/nfsd/debugfs.c |    2 +
> > >  fs/nfsd/nfsd.h    |    1 +
> > >  fs/nfsd/trace.h   |    1 +
> > >  fs/nfsd/vfs.c     |   81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  4 files changed, 85 insertions(+)
> > > 
> > > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > > index ed2b9e066206..00eb1ecef6ac 100644
> > > --- a/fs/nfsd/debugfs.c
> > > +++ b/fs/nfsd/debugfs.c
> > > @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> > >   * Contents:
> > >   *   %0: NFS READ will use buffered IO
> > >   *   %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> > > + *   %2: NFS READ will use direct IO
> > >   *
> > >   * This setting takes immediate effect for all NFS versions,
> > >   * all exports, and in all NFSD net namespaces.
> > > @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> > >  		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> > >  		break;
> > >  	case NFSD_IO_DONTCACHE:
> > > +	case NFSD_IO_DIRECT:
> > >  		/*
> > >  		 * Must disable splice_read when enabling
> > >  		 * NFSD_IO_DONTCACHE.
> > > diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> > > index ea87b42894dd..bdb60ee1f1a4 100644
> > > --- a/fs/nfsd/nfsd.h
> > > +++ b/fs/nfsd/nfsd.h
> > > @@ -157,6 +157,7 @@ enum {
> > >  	/* Any new NFSD_IO enum value must be added at the end */
> > >  	NFSD_IO_BUFFERED,
> > >  	NFSD_IO_DONTCACHE,
> > > +	NFSD_IO_DIRECT,
> > >  };
> > >  
> > >  extern u64 nfsd_io_cache_read __read_mostly;
> > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > > index 6e2c8e2aab10..bfd41236aff2 100644
> > > --- a/fs/nfsd/trace.h
> > > +++ b/fs/nfsd/trace.h
> > > @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name,	\
> > >  DEFINE_NFSD_IO_EVENT(read_start);
> > >  DEFINE_NFSD_IO_EVENT(read_splice);
> > >  DEFINE_NFSD_IO_EVENT(read_vector);
> > > +DEFINE_NFSD_IO_EVENT(read_direct);
> > >  DEFINE_NFSD_IO_EVENT(read_io_done);
> > >  DEFINE_NFSD_IO_EVENT(read_done);
> > >  DEFINE_NFSD_IO_EVENT(write_start);
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index 35880d3f1326..5cd970c1089b 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -1074,6 +1074,82 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > >  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > >  }
> > >  
> > > +/*
> > > + * The byte range of the client's READ request is expanded on both
> > > + * ends until it meets the underlying file system's direct I/O
> > > + * alignment requirements. After the internal read is complete, the
> > > + * byte range of the NFS READ payload is reduced to the byte range
> > > + * that was originally requested.
> > > + *
> > > + * Note that a direct read can be done only when the xdr_buf
> > > + * containing the NFS READ reply does not already have contents in
> > > + * its .pages array. This is due to potentially restrictive
> > > + * alignment requirements on the read buffer. When .page_len and
> > > + * @base are zero, the .pages array is guaranteed to be page-
> > > + * aligned.
> > 
> > This para is confusing.
> > It starts talking about the xdr_buf not having any contents.  Then it
> > transitions to a guarantee of page alignment.
> > 
> > If the start of the read requests isn't sufficiently aligned then a gap
> > will be created in the xdr_buf and that can only be handled at the start
> > (using page_base).
> > 
> > So as you say we need page_len to be zero.  But nowhere in the code is
> > this condition tested.
> > 
> > The closest is "!base" before the call to nfsd_direct_read() but when
> > called from nfsd4_encode_readv()
> > 
> >    base = xdr->buf->page_len & ~PAGE_MASK;
> > 
> > so ->page_len could be non-zero despite base being zero.
> 
> Hi Neil,
> 
> If we verify base is aligned relative to nf->nf_dio_mem_align; this
> incremental change should avoid the concern entirely right?
> 
> [I've verified all my tests pass with this change]

It helps if, when testing NFSD, you don't have LOCALIO enabled...
please disregard my patch ;)

The patch I provided doesn't work: it allows the iov_iter to carry
misaligned pages, and xfs_file_read_iter->iomap_dio_rw then crashes.
That is easily remedied by checking the iov_iter's alignment, but
wouldn't it be best to just refine the check that prevents calling
into nfsd_direct_read (by explicitly checking page_len)?
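
To make Neil's point concrete, here is a minimal userspace sketch of
the arithmetic, not the actual NFSD code. It assumes 4 KiB pages, and
every sketch_* name is hypothetical. It shows that the base computed
the way nfsd4_encode_readv() does is zero whenever page_len is a
whole number of pages, so a "!base" test alone does not establish
that the .pages array is empty, whereas a check that also looks at
page_len does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. The kernel's PAGE_MASK is ~(PAGE_SIZE - 1). */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/*
 * Mirrors "base = xdr->buf->page_len & ~PAGE_MASK": the offset of
 * the next free byte within the current page of the .pages array.
 */
static size_t sketch_base(size_t page_len)
{
	return page_len & ~SKETCH_PAGE_MASK;
}

/* The gate as Neil describes it in v2: only base is examined. */
static bool sketch_can_dio_base_only(size_t base)
{
	return base == 0;
}

/*
 * Hypothetical refined gate: additionally require that the .pages
 * array holds no payload yet, which is what the comment in the
 * patch says direct reads depend on.
 */
static bool sketch_can_dio_refined(size_t base, size_t page_len)
{
	return base == 0 && page_len == 0;
}
```

With page_len equal to 4096 (one full page already encoded),
sketch_base() returns 0, so the base-only gate would permit a direct
read while the refined gate would not.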

Thanks,
Mike

Thread overview: 13+ messages
2025-09-17 14:31 [PATCH v2 0/4] NFSD direct I/O read Chuck Lever
2025-09-17 14:31 ` [PATCH v2 1/4] NFSD: Add array bounds-checking in nfsd_iter_read() Chuck Lever
2025-09-17 17:51   ` Mike Snitzer
2025-09-17 14:31 ` [PATCH v2 2/4] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
2025-09-17 14:31 ` [PATCH v2 3/4] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
2025-09-17 14:32 ` [PATCH v2 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
2025-09-17 23:29   ` NeilBrown
2025-09-18 14:50     ` Mike Snitzer
2025-09-18 15:20       ` Mike Snitzer [this message]
2025-09-18 18:42     ` Chuck Lever
2025-09-18 19:01       ` Mike Snitzer
2025-09-18 16:29   ` Mike Snitzer
2025-09-18 18:27     ` Chuck Lever