From: NeilBrown <neilb@ownmail.net>
To: "Chuck Lever" <cel@kernel.org>
Cc: "Mike Snitzer" <snitzer@kernel.org>,
	"Jeff Layton" <jlayton@kernel.org>,
	"Olga Kornievskaia" <okorniev@redhat.com>,
	"Dai Ngo" <dai.ngo@oracle.com>, "Tom Talpey" <tom@talpey.com>,
	linux-nfs@vger.kernel.org, "Chuck Lever" <chuck.lever@oracle.com>
Subject: Re: [PATCH v3 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
Date: Wed, 24 Sep 2025 17:30:38 +1000
Message-ID: <175869903827.1696783.17181184352630252525@noble.neil.brown.name>
In-Reply-To: <60960803-80b3-4ca1-9fd3-16bc1bd1dbd4@kernel.org>

On Wed, 24 Sep 2025, Chuck Lever wrote:
> On 9/23/25 4:48 PM, NeilBrown wrote:
> > On Wed, 24 Sep 2025, Chuck Lever wrote:
> >> On 9/23/25 3:26 PM, Mike Snitzer wrote:
> >>> On Mon, Sep 22, 2025 at 10:11:37AM -0400, Chuck Lever wrote:
> >>>> From: Chuck Lever <chuck.lever@oracle.com>
> >>>>
> >>>> Add an experimental option that forces NFS READ operations to use
> >>>> direct I/O instead of reading through the NFS server's page cache.
> >>>>
> >>>> There are already other layers of caching:
> >>>>  - The page cache on NFS clients
> >>>>  - The block device underlying the exported file system
> >>>>
> >>>> The server's page cache, in many cases, is unlikely to provide
> >>>> additional benefit. Some benchmarks have demonstrated that the
> >>>> server's page cache is actively detrimental for workloads whose
> >>>> working set is larger than the server's available physical memory.
> >>>>
> >>>> For instance, on small NFS servers, cached NFS file content can
> >>>> squeeze out local memory consumers. For large sequential workloads,
> >>>> an enormous amount of data flows into and out of the page cache
> >>>> and is consumed by NFS clients exactly once -- caching that data
> >>>> is expensive to do and totally valueless.
> >>>>
> >>>> For now this is a hidden option that can be enabled on test
> >>>> systems for benchmarking. In the longer term, this option might
> >>>> be enabled persistently or per-export. When the exported file
> >>>> system does not support direct I/O, NFSD falls back to using
> >>>> either DONTCACHE or buffered I/O to fulfill NFS READ requests.
> >>>>
> >>>> Suggested-by: Mike Snitzer <snitzer@kernel.org>
> >>>> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> >>>> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> >>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> >>>> ---
> >>>>  fs/nfsd/debugfs.c |  2 ++
> >>>>  fs/nfsd/nfsd.h    |  1 +
> >>>>  fs/nfsd/trace.h   |  1 +
> >>>>  fs/nfsd/vfs.c     | 82 +++++++++++++++++++++++++++++++++++++++++++++++
> >>>>  4 files changed, 86 insertions(+)
> >>>>
> >>>> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> >>>> index ed2b9e066206..00eb1ecef6ac 100644
> >>>> --- a/fs/nfsd/debugfs.c
> >>>> +++ b/fs/nfsd/debugfs.c
> >>>> @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> >>>>   * Contents:
> >>>>   *   %0: NFS READ will use buffered IO
> >>>>   *   %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> >>>> + *   %2: NFS READ will use direct IO
> >>>>   *
> >>>>   * This setting takes immediate effect for all NFS versions,
> >>>>   * all exports, and in all NFSD net namespaces.
> >>>> @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> >>>>  		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> >>>>  		break;
> >>>>  	case NFSD_IO_DONTCACHE:
> >>>> +	case NFSD_IO_DIRECT:
> >>>>  		/*
> >>>>  		 * Must disable splice_read when enabling
> >>>>  		 * NFSD_IO_DONTCACHE.
> >>>> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> >>>> index ea87b42894dd..bdb60ee1f1a4 100644
> >>>> --- a/fs/nfsd/nfsd.h
> >>>> +++ b/fs/nfsd/nfsd.h
> >>>> @@ -157,6 +157,7 @@ enum {
> >>>>  	/* Any new NFSD_IO enum value must be added at the end */
> >>>>  	NFSD_IO_BUFFERED,
> >>>>  	NFSD_IO_DONTCACHE,
> >>>> +	NFSD_IO_DIRECT,
> >>>>  };
> >>>>  
> >>>>  extern u64 nfsd_io_cache_read __read_mostly;
> >>>> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> >>>> index 6e2c8e2aab10..bfd41236aff2 100644
> >>>> --- a/fs/nfsd/trace.h
> >>>> +++ b/fs/nfsd/trace.h
> >>>> @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name,	\
> >>>>  DEFINE_NFSD_IO_EVENT(read_start);
> >>>>  DEFINE_NFSD_IO_EVENT(read_splice);
> >>>>  DEFINE_NFSD_IO_EVENT(read_vector);
> >>>> +DEFINE_NFSD_IO_EVENT(read_direct);
> >>>>  DEFINE_NFSD_IO_EVENT(read_io_done);
> >>>>  DEFINE_NFSD_IO_EVENT(read_done);
> >>>>  DEFINE_NFSD_IO_EVENT(write_start);
> >>>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> >>>> index 35880d3f1326..ddcd812f0761 100644
> >>>> --- a/fs/nfsd/vfs.c
> >>>> +++ b/fs/nfsd/vfs.c
> >>>> @@ -1074,6 +1074,82 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >>>>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> >>>>  }
> >>>>  
> >>>> +/*
> >>>> + * The byte range of the client's READ request is expanded on both
> >>>> + * ends until it meets the underlying file system's direct I/O
> >>>> + * alignment requirements. After the internal read is complete, the
> >>>> + * byte range of the NFS READ payload is reduced to the byte range
> >>>> + * that was originally requested.
> >>>> + *
> >>>> + * Note that a direct read can be done only when the xdr_buf
> >>>> + * containing the NFS READ reply does not already have contents in
> >>>> + * its .pages array. This is due to potentially restrictive
> >>>> + * alignment requirements on the read buffer. When .page_len and
> >>>> + * @base are zero, the .pages array is guaranteed to be page-
> >>>> + * aligned.
> >>>> + */
> >>>
> >>> So this ^ comment (and the related conversation with Neil in a
> >>> different thread) says page_len should be 0 on entry to
> >>> nfsd_direct_read.
> >>>
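As an aside, the expansion that the quoted comment describes boils down
to rounding the requested [offset, offset + count) range outwards to the
file system's direct I/O alignment and then trimming the reply back to
the range that was originally requested.  A self-contained illustrative
sketch (made-up names, not the patch code):

#include <stdio.h>

/*
 * Illustrative only: expand a requested byte range outwards so that
 * both ends land on the direct I/O alignment boundary.  dio_align is
 * assumed to be a power of two (e.g. 4096).
 */
static void dio_expand_range(unsigned long long offset,
			     unsigned long long count,
			     unsigned long long dio_align,
			     unsigned long long *dio_start,
			     unsigned long long *dio_count)
{
	unsigned long long start = offset & ~(dio_align - 1);
	unsigned long long end = (offset + count + dio_align - 1) &
				 ~(dio_align - 1);

	*dio_start = start;
	*dio_count = end - start;
}

int main(void)
{
	unsigned long long start, count;

	/* Client asks for 100 bytes at offset 1000; alignment is 4096. */
	dio_expand_range(1000, 100, 4096, &start, &count);

	/*
	 * The internal read covers [0, 4096); the reply is then trimmed
	 * back to the 100 bytes that were originally requested.
	 */
	printf("direct read: %llu bytes at offset %llu\n", count, start);
	return 0;
}
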
> >>>> @@ -1106,6 +1182,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >>>>  	switch (nfsd_io_cache_read) {
> >>>>  	case NFSD_IO_BUFFERED:
> >>>>  		break;
> >>>> +	case NFSD_IO_DIRECT:
> >>>> +		if (nf->nf_dio_read_offset_align &&
> >>>> +		    rqstp->rq_res.page_len && !base)
> >>>> +			return nfsd_direct_read(rqstp, fhp, nf, offset,
> >>>> +						count, eof);
> >>>> +		fallthrough;
> >>>
> >>> Yet the nfsd_iter_read is only calling nfsd_direct_read() if
> >>> rqstp->rq_res.page_len is not zero, shouldn't it be
> >>> !rqstp->rq_res.page_len ?
> >>
> >> Oops, yes. I did this work last week, while out of range of my lab.
> >>
> >>
> >>> (testing confirms it should be !rqstp->rq_res.page_len)
> >>>
> >>> Hopefully with this fix you can have more confidence in staging this
> >>> in your nfsd-testing?
> >> I'm waiting only for Neil to send an R-b.
> > 
> > After noticing, like Mike, that the page_len test was inverted I went
> > looking to see where page_len was updated, to be sure that a second READ
> > request would not try to use DIRECT IO.
> > I can see that nfsd_splice_actor() updates page_len, but I cannot see
> > where it is updated when nfsd_iter_read() is used.
> > 
> > What am I missing?
> 
> It might be updated while the NFSv4 reply encoder is encoding a
> COMPOUND. If the size of the RPC reply so far is larger than the
> xdr_buf's .head, the xdr_stream will be positioned somewhere in the
> xdr_buf's .pages array.
> 
> This is precisely why splice READ can be used only for the first
> NFSv4 READ operation in a COMPOUND. Subsequent READ operations
> must use nfsd_iter_read().

Hmmmm...

nfsd4_encode_readv() calls xdr_reserve_space_vec(), passing the maxcount
from the READ request.  That maxcount is handed to xdr_reserve_space()
one page at a time, and each reservation is added to xdr->buf->page_len
(where xdr is ->rq_res_stream and xdr->buf is rq_res).

So the result is often that rq_res.page_len will be maxcount.

Then nfsd4_encode_readv() calls nfsd_iter_read() which, with this patch,
tests rq_res.page_len; at that point page_len is always non-zero, so the
test cannot do what we want.

(after the read, nfsd4_encode_readv() will call xdr_truncate_encode()
 which will reduce ->page_len to match how much was actually read).
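
To make the ordering concrete, here is a tiny self-contained model of
that page_len bookkeeping (simplified types and made-up names, not the
kernel code):

#include <assert.h>
#include <stdio.h>

/* Simplified stand-in for the reply xdr_buf: only the field we care about. */
struct xdr_buf_model {
	unsigned int page_len;
};

/*
 * Models xdr_reserve_space_vec(): reserving maxcount bytes of reply
 * space bumps page_len before any data has been read.
 */
static void reserve_reply_space(struct xdr_buf_model *buf,
				unsigned int maxcount)
{
	buf->page_len += maxcount;
}

/*
 * Models xdr_truncate_encode(): after the read, page_len is trimmed
 * back to the number of bytes actually read.
 */
static void truncate_reply(struct xdr_buf_model *buf, unsigned int unread)
{
	buf->page_len -= unread;
}

int main(void)
{
	struct xdr_buf_model res = { .page_len = 0 };
	unsigned int maxcount = 65536, actually_read = 4096;

	reserve_reply_space(&res, maxcount);

	/*
	 * This is the point where nfsd_iter_read() runs: page_len is
	 * already maxcount, so a "page_len == 0" test can never pass.
	 */
	assert(res.page_len == maxcount);

	truncate_reply(&res, maxcount - actually_read);
	printf("page_len after truncate: %u\n", res.page_len);
	return 0;
}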

I don't think we can use ->page_len to determine if it is safe to use
nfsd_direct_read().  The conditions where nfsd_direct_read() is safe are
similar to the conditions where nfsd_splice_read() is safe.  Maybe we
should use similar code.

Thanks,
NeilBrown


Thread overview: 19+ messages
2025-09-22 14:11 [PATCH v3 0/3] NFSD direct I/O read Chuck Lever
2025-09-22 14:11 ` [PATCH v3 1/3] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
2025-09-24 14:00   ` Chuck Lever
2025-09-22 14:11 ` [PATCH v3 2/3] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
2025-09-22 14:11 ` [PATCH v3 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
2025-09-23 22:26   ` Mike Snitzer
2025-09-23 22:49     ` Chuck Lever
2025-09-23 23:31       ` Mike Snitzer
2025-09-23 23:48       ` NeilBrown
2025-09-23 23:52         ` Chuck Lever
2025-09-24  7:30           ` NeilBrown [this message]
2025-09-24 14:12             ` Mike Snitzer
2025-09-24 15:18               ` Mike Snitzer
2025-09-24 18:56                 ` [RFC PATCH v3 4/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ (v3 and older) [Was: "Re: [PATCH v3 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ"] Mike Snitzer
2025-09-25  2:56                   ` Jonathan Flynn
2025-09-25 13:13                     ` Jonathan Flynn
2025-09-26  0:18               ` [PATCH v3 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ NeilBrown
2025-09-26  1:30                 ` Mike Snitzer
2025-09-24 18:56             ` Chuck Lever
