From: Mike Snitzer <snitzer@kernel.org>
To: NeilBrown <neilb@ownmail.net>
Cc: Chuck Lever <cel@kernel.org>, Jeff Layton <jlayton@kernel.org>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
linux-nfs@vger.kernel.org, Chuck Lever <chuck.lever@oracle.com>
Subject: Re: [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
Date: Tue, 9 Sep 2025 19:56:14 -0400
Message-ID: <aMC-nsD3lHSzbGPx@kernel.org>
In-Reply-To: <175746104715.2850467.8246435920764028613@noble.neil.brown.name>

On Wed, Sep 10, 2025 at 09:37:27AM +1000, NeilBrown wrote:
> On Wed, 10 Sep 2025, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> >
> > Add an experimental option that forces NFS READ operations to use
> > direct I/O instead of reading through the NFS server's page cache.
> >
> > There are already other layers of caching:
> > - The page cache on NFS clients
> > - The block device underlying the exported file system
> >
> > The server's page cache, in many cases, is unlikely to provide
> > additional benefit. Some benchmarks have demonstrated that the
> > server's page cache is actively detrimental for workloads whose
> > working set is larger than the server's available physical memory.
> >
> > For instance, on small NFS servers, cached NFS file content can
> > squeeze out local memory consumers. For large sequential workloads,
> > an enormous amount of data flows into and out of the page cache
> > and is consumed by NFS clients exactly once -- caching that data
> > is expensive to do and totally valueless.
> >
> > For now this is a hidden option that can be enabled on test
> > systems for benchmarking. In the longer term, this option might
> > be enabled persistently or per-export. When the exported file
> > system does not support direct I/O, NFSD falls back to using
> > either DONTCACHE or buffered I/O to fulfill NFS READ requests.
> >
> > Suggested-by: Mike Snitzer <snitzer@kernel.org>
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  fs/nfsd/debugfs.c |  2 ++
> >  fs/nfsd/nfsd.h    |  1 +
> >  fs/nfsd/trace.h   |  1 +
> >  fs/nfsd/vfs.c     | 78 +++++++++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 82 insertions(+)
> >
> > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > index ed2b9e066206..00eb1ecef6ac 100644
> > --- a/fs/nfsd/debugfs.c
> > +++ b/fs/nfsd/debugfs.c
> > @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> > * Contents:
> > * %0: NFS READ will use buffered IO
> > * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> > + * %2: NFS READ will use direct IO
> > *
> > * This setting takes immediate effect for all NFS versions,
> > * all exports, and in all NFSD net namespaces.
> > @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> >  		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> >  		break;
> >  	case NFSD_IO_DONTCACHE:
> > +	case NFSD_IO_DIRECT:
> >  		/*
> >  		 * Must disable splice_read when enabling
> >  		 * NFSD_IO_DONTCACHE.
> > diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> > index ea87b42894dd..bdb60ee1f1a4 100644
> > --- a/fs/nfsd/nfsd.h
> > +++ b/fs/nfsd/nfsd.h
> > @@ -157,6 +157,7 @@ enum {
> >  	/* Any new NFSD_IO enum value must be added at the end */
> >  	NFSD_IO_BUFFERED,
> >  	NFSD_IO_DONTCACHE,
> > +	NFSD_IO_DIRECT,
> >  };
> >
> > extern u64 nfsd_io_cache_read __read_mostly;
> > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > index e5af0d058fd0..88901df5fbb1 100644
> > --- a/fs/nfsd/trace.h
> > +++ b/fs/nfsd/trace.h
> > @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
> > DEFINE_NFSD_IO_EVENT(read_start);
> > DEFINE_NFSD_IO_EVENT(read_splice);
> > DEFINE_NFSD_IO_EVENT(read_vector);
> > +DEFINE_NFSD_IO_EVENT(read_direct);
> > DEFINE_NFSD_IO_EVENT(read_io_done);
> > DEFINE_NFSD_IO_EVENT(read_done);
> > DEFINE_NFSD_IO_EVENT(write_start);
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 441267d877f9..9665454743eb 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -1074,6 +1074,79 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > }
> >
> > +/*
> > + * The byte range of the client's READ request is expanded on both
> > + * ends until it meets the underlying file system's direct I/O
> > + * alignment requirements. After the internal read is complete, the
> > + * byte range of the NFS READ payload is reduced to the byte range
> > + * that was originally requested.
> > + *
> > + * Note that a direct read can be done only when the xdr_buf
> > + * containing the NFS READ reply does not already have contents in
> > + * its .pages array. This is due to potentially restrictive
> > + * alignment requirements on the read buffer. When .page_len and
> > + * @base are zero, the .pages array is guaranteed to be page-
> > + * aligned.
>
> Where do we test that this condition is met?
>
> I can see that nfsd_direct_read() is only called if "base" is zero, but
> that means the current contents of the .pages array are page-aligned,
> not that there are none.
>
> > + */
> > +static noinline_for_stack __be32
> > +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > +		 struct nfsd_file *nf, loff_t offset, unsigned long *count,
> > +		 u32 *eof)
> > +{
> > +	loff_t dio_start, dio_end;
> > +	unsigned long v, total;
> > +	struct iov_iter iter;
> > +	struct kiocb kiocb;
> > +	ssize_t host_err;
> > +	size_t len;
> > +
> > +	init_sync_kiocb(&kiocb, nf->nf_file);
> > +	kiocb.ki_flags |= IOCB_DIRECT;
> > +
> > +	/* Read a properly-aligned region of bytes into rq_bvec */
> > +	dio_start = round_down(offset, nf->nf_dio_read_offset_align);
> > +	dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
> > +
> > +	kiocb.ki_pos = dio_start;
> > +
> > +	v = 0;
> > +	total = dio_end - dio_start;
> > +	while (total) {
> > +		len = min_t(size_t, total, PAGE_SIZE);
> > +		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> > +			      len, 0);
> > +		total -= len;
> > +		++v;
> > +	}
> > +	WARN_ON_ONCE(v > rqstp->rq_maxpages);
>
> I would rather we had an early test rather than a late warn-on.
> e.g.
>    if (total > (rqstp->rq_maxpages << PAGE_SHIFT))
>            return -EINVAL /* or whatever */;

-EINVAL is the devil. ;)

My follow-up patch should avoid any possibility of this WARN_ON_ONCE
ever triggering:
https://lore.kernel.org/linux-nfs/20250909233315.80318-2-snitzer@kernel.org/
(and should/could be folded into this 3/3 patch?)
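
For illustration only, here is a user-space sketch of the kind of early
test being discussed. It is not the kernel code and not what the
follow-up patch does: the sketch_* names and the fixed 4 KiB page size
are assumptions made up for this example, and the alignment is assumed
to be a power of two. It expands the requested range the same way the
patch does and compares the result, in bytes, against the page budget.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12u

/* Round x down to a multiple of align (align must be a power of two). */
static uint64_t sketch_round_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t sketch_round_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Return true if the aligned read that covers [offset, offset + count)
 * fits in maxpages pages. Both sides of the comparison are in bytes.
 */
static bool sketch_dio_read_fits(uint64_t offset, uint64_t count,
				 uint64_t align, uint64_t maxpages)
{
	uint64_t start = sketch_round_down(offset, align);
	uint64_t end = sketch_round_up(offset + count, align);
	uint64_t total = end - start;

	return total <= (maxpages << SKETCH_PAGE_SHIFT);
}
```

With a 512-byte alignment, an 8192-byte read at offset 100 expands to
8704 bytes and so no longer fits in two pages, which is the case an
early test would catch before any pages are consumed.
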
> Otherwise it seems to be making unstated assumptions about how big the
> alignment requirements could be.

Just FYI, nfsd_direct_read's while loop code is nearly identical to
that which is in nfsd_iter_read().

So if an early return is warranted in nfsd_direct_read, it is also
warranted in nfsd_iter_read.
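
As a rough user-space sketch of the loop shape both functions share
(hypothetical sketch_* names and a fixed 4 KiB page size, not the
kernel's rq_bvec handling), this splits a byte count into page-sized
segments and bails out instead of writing past the end of the array:

```c
#include <stddef.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical stand-in for one bio_vec entry; the page offset is always 0. */
struct sketch_seg {
	size_t len;
};

/*
 * Split total bytes into page-sized segments. Returns the number of
 * segments filled in, or -1 if more than max_segs would be needed.
 */
static int sketch_fill_segs(struct sketch_seg *segs, size_t max_segs,
			    size_t total)
{
	size_t v = 0;

	while (total) {
		size_t len = total < SKETCH_PAGE_SIZE ? total : SKETCH_PAGE_SIZE;

		/* Check before storing, rather than warning afterwards. */
		if (v == max_segs)
			return -1;
		segs[v].len = len;
		total -= len;
		++v;
	}
	return (int)v;
}
```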

(I had an early return in nfsd_iter_read N iterations ago, but decided it
was not needed given it's not something that we'll ever hit.. really just a
canary in the coal mine that offers companionship until one of us dies
of black lung)

Mike