From: Mike Snitzer <snitzer@kernel.org>
To: NeilBrown <neilb@ownmail.net>
Cc: Chuck Lever <cel@kernel.org>,
	jlayton@kernel.org, okorniev@redhat.com, dai.ngo@oracle.com,
	tom@talpey.com, linux-nfs@vger.kernel.org
Subject: Re: [PATCH v2 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
Date: Thu, 18 Sep 2025 11:20:31 -0400
Message-ID: <aMwjP8DrrxzOy-5-@kernel.org>
In-Reply-To: <aMwcUdWdey69k2iK@kernel.org>

On Thu, Sep 18, 2025 at 10:50:57AM -0400, Mike Snitzer wrote:
> On Thu, Sep 18, 2025 at 09:29:48AM +1000, NeilBrown wrote:
> > On Thu, 18 Sep 2025, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Add an experimental option that forces NFS READ operations to use
> > > direct I/O instead of reading through the NFS server's page cache.
> > > 
> > > There are already other layers of caching:
> > >  - The page cache on NFS clients
> > >  - The block device underlying the exported file system
> > > 
> > > The server's page cache, in many cases, is unlikely to provide
> > > additional benefit. Some benchmarks have demonstrated that the
> > > server's page cache is actively detrimental for workloads whose
> > > working set is larger than the server's available physical memory.
> > > 
> > > For instance, on small NFS servers, cached NFS file content can
> > > squeeze out local memory consumers. For large sequential workloads,
> > > an enormous amount of data flows into and out of the page cache
> > > and is consumed by NFS clients exactly once -- caching that data
> > > is expensive to do and totally valueless.
> > > 
> > > For now this is a hidden option that can be enabled on test
> > > systems for benchmarking. In the longer term, this option might
> > > be enabled persistently or per-export. When the exported file
> > > system does not support direct I/O, NFSD falls back to using
> > > either DONTCACHE or buffered I/O to fulfill NFS READ requests.
> > > 
> > > Suggested-by: Mike Snitzer <snitzer@kernel.org>
> > > Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> > > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > ---
> > >  fs/nfsd/debugfs.c |    2 +
> > >  fs/nfsd/nfsd.h    |    1 +
> > >  fs/nfsd/trace.h   |    1 +
> > >  fs/nfsd/vfs.c     |   81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  4 files changed, 85 insertions(+)
> > > 
> > > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > > index ed2b9e066206..00eb1ecef6ac 100644
> > > --- a/fs/nfsd/debugfs.c
> > > +++ b/fs/nfsd/debugfs.c
> > > @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> > >   * Contents:
> > >   *   %0: NFS READ will use buffered IO
> > >   *   %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> > > + *   %2: NFS READ will use direct IO
> > >   *
> > >   * This setting takes immediate effect for all NFS versions,
> > >   * all exports, and in all NFSD net namespaces.
> > > @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> > >  		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> > >  		break;
> > >  	case NFSD_IO_DONTCACHE:
> > > +	case NFSD_IO_DIRECT:
> > >  		/*
> > >  		 * Must disable splice_read when enabling
> > >  		 * NFSD_IO_DONTCACHE.
> > > diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> > > index ea87b42894dd..bdb60ee1f1a4 100644
> > > --- a/fs/nfsd/nfsd.h
> > > +++ b/fs/nfsd/nfsd.h
> > > @@ -157,6 +157,7 @@ enum {
> > >  	/* Any new NFSD_IO enum value must be added at the end */
> > >  	NFSD_IO_BUFFERED,
> > >  	NFSD_IO_DONTCACHE,
> > > +	NFSD_IO_DIRECT,
> > >  };
> > >  
> > >  extern u64 nfsd_io_cache_read __read_mostly;
> > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > > index 6e2c8e2aab10..bfd41236aff2 100644
> > > --- a/fs/nfsd/trace.h
> > > +++ b/fs/nfsd/trace.h
> > > @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name,	\
> > >  DEFINE_NFSD_IO_EVENT(read_start);
> > >  DEFINE_NFSD_IO_EVENT(read_splice);
> > >  DEFINE_NFSD_IO_EVENT(read_vector);
> > > +DEFINE_NFSD_IO_EVENT(read_direct);
> > >  DEFINE_NFSD_IO_EVENT(read_io_done);
> > >  DEFINE_NFSD_IO_EVENT(read_done);
> > >  DEFINE_NFSD_IO_EVENT(write_start);
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index 35880d3f1326..5cd970c1089b 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -1074,6 +1074,82 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > >  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > >  }
> > >  
> > > +/*
> > > + * The byte range of the client's READ request is expanded on both
> > > + * ends until it meets the underlying file system's direct I/O
> > > + * alignment requirements. After the internal read is complete, the
> > > + * byte range of the NFS READ payload is reduced to the byte range
> > > + * that was originally requested.
> > > + *
> > > + * Note that a direct read can be done only when the xdr_buf
> > > + * containing the NFS READ reply does not already have contents in
> > > + * its .pages array. This is due to potentially restrictive
> > > + * alignment requirements on the read buffer. When .page_len and
> > > + * @base are zero, the .pages array is guaranteed to be page-
> > > + * aligned.
> > 
> > This para is confusing.
> > It starts talking about the xdr_buf not having any contents.  Then it
> > transitions to a guarantee of page alignment.
> > 
> > If the start of the read requests isn't sufficiently aligned then a gap
> > will be created in the xdr_buf and that can only be handled at the start
> > (using page_base).
> > 
> > So as you say we need page_len to be zero.  But nowhere in the code is
> > this condition tested.
> > 
> > The closest is "!base" before the call to nfsd_direct_read() but when
> > called from nfsd4_encode_readv()
> > 
> >    base = xdr->buf->page_len & ~PAGE_MASK;
> > 
> > so ->page_len could be non-zero despite base being zero.
> 
> Hi Neil,
> 
> If we verify base is aligned relative to nf->nf_dio_mem_align; this
> incremental change should avoid the concern entirely right?
> 
> [I've verified all my tests pass with this change]

It helps if, when testing NFSD, you don't have LOCALIO enabled...
please disregard my patch ;)

The patch I provided doesn't work: it allows the iov_iter to contain
misaligned pages, and xfs_file_read_iter->iomap_dio_rw crashes. That is
easily remedied by checking the iov_iter's alignment, but wouldn't it be
best to just refine the check that prevents calling into
nfsd_direct_read (by explicitly checking page_len)?

Thanks,
Mike
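
For illustration only (this is not code from the patch or from anyone
in the thread): a minimal user-space sketch of Neil's observation that
base can be zero while page_len is not, and of the kind of explicit
page_len check suggested above. The 4096-byte page size and every
sketch_* name are assumptions made for this example; the real guard
would live in fs/nfsd/vfs.c and inspect the xdr_buf itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB pages. The kernel uses PAGE_SIZE and PAGE_MASK. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/*
 * Mirrors the expression quoted from nfsd4_encode_readv():
 *   base = xdr->buf->page_len & ~PAGE_MASK;
 * The result is the offset into the current page, not the number of
 * bytes already present in the .pages array.
 */
static unsigned long sketch_base(unsigned long page_len)
{
	return page_len & ~SKETCH_PAGE_MASK;
}

/* Hypothetical guard that tests only "!base", as described by Neil. */
static bool sketch_may_direct_read_base_only(unsigned long page_len)
{
	return sketch_base(page_len) == 0;
}

/*
 * Hypothetical refined guard: also require that .pages is empty, so
 * that a gap created by range expansion can sit at the very start.
 */
static bool sketch_may_direct_read_refined(unsigned long page_len)
{
	return page_len == 0 && sketch_base(page_len) == 0;
}
```

With page_len equal to 8192 (two full pages already encoded), the
base-only guard accepts the direct read because base is zero, while
the refined guard rejects it. Only page_len == 0 passes both.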
