From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <cel@kernel.org>
Cc: neil@brown.name, jlayton@kernel.org, okorniev@redhat.com,
dai.ngo@oracle.com, tom@talpey.com, linux-nfs@vger.kernel.org
Subject: Re: [PATCH v2 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
Date: Thu, 18 Sep 2025 12:29:07 -0400
Message-ID: <aMwzU50fiZSN00JP@kernel.org>
In-Reply-To: <175811952039.19474.5813875056701985362.stgit@91.116.238.104.host.secureserver.net>

On Wed, Sep 17, 2025 at 10:32:00AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Add an experimental option that forces NFS READ operations to use
> direct I/O instead of reading through the NFS server's page cache.
>
> There are already other layers of caching:
> - The page cache on NFS clients
> - The block device underlying the exported file system
>
> The server's page cache, in many cases, is unlikely to provide
> additional benefit. Some benchmarks have demonstrated that the
> server's page cache is actively detrimental for workloads whose
> working set is larger than the server's available physical memory.
>
> For instance, on small NFS servers, cached NFS file content can
> squeeze out local memory consumers. For large sequential workloads,
> an enormous amount of data flows into and out of the page cache
> and is consumed by NFS clients exactly once -- caching that data
> is expensive to do and totally valueless.
>
> For now this is a hidden option that can be enabled on test
> systems for benchmarking. In the longer term, this option might
> be enabled persistently or per-export. When the exported file
> system does not support direct I/O, NFSD falls back to using
> either DONTCACHE or buffered I/O to fulfill NFS READ requests.
>
> Suggested-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/debugfs.c | 2 +
> fs/nfsd/nfsd.h | 1 +
> fs/nfsd/trace.h | 1 +
> fs/nfsd/vfs.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 85 insertions(+)
>
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index ed2b9e066206..00eb1ecef6ac 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> * Contents:
> * %0: NFS READ will use buffered IO
> * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> + * %2: NFS READ will use direct IO
> *
> * This setting takes immediate effect for all NFS versions,
> * all exports, and in all NFSD net namespaces.
> @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> nfsd_io_cache_read = NFSD_IO_BUFFERED;
> break;
> case NFSD_IO_DONTCACHE:
> + case NFSD_IO_DIRECT:
> /*
> * Must disable splice_read when enabling
> * NFSD_IO_DONTCACHE.
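(For anyone who wants to test this mode: the switch above is driven
by NFSD's debugfs knob, so selecting direct I/O reads should be
something like

	echo 2 > /sys/kernel/debug/nfsd/io_cache_read

-- the exact file name here is my assumption from the attribute
names in this file; check your tree before relying on it.)
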
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index ea87b42894dd..bdb60ee1f1a4 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -157,6 +157,7 @@ enum {
> /* Any new NFSD_IO enum value must be added at the end */
> NFSD_IO_BUFFERED,
> NFSD_IO_DONTCACHE,
> + NFSD_IO_DIRECT,
> };
>
> extern u64 nfsd_io_cache_read __read_mostly;
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index 6e2c8e2aab10..bfd41236aff2 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
> DEFINE_NFSD_IO_EVENT(read_start);
> DEFINE_NFSD_IO_EVENT(read_splice);
> DEFINE_NFSD_IO_EVENT(read_vector);
> +DEFINE_NFSD_IO_EVENT(read_direct);
> DEFINE_NFSD_IO_EVENT(read_io_done);
> DEFINE_NFSD_IO_EVENT(read_done);
> DEFINE_NFSD_IO_EVENT(write_start);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 35880d3f1326..5cd970c1089b 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1074,6 +1074,82 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> }
>
> +/*
> + * The byte range of the client's READ request is expanded on both
> + * ends until it meets the underlying file system's direct I/O
> + * alignment requirements. After the internal read is complete, the
> + * byte range of the NFS READ payload is reduced to the byte range
> + * that was originally requested.
> + *
> + * Note that a direct read can be done only when the xdr_buf
> + * containing the NFS READ reply does not already have contents in
> + * its .pages array. This is due to potentially restrictive
> + * alignment requirements on the read buffer. When .page_len and
> + * @base are zero, the .pages array is guaranteed to be page-
> + * aligned.
> + */
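[An aside to make the expand/shrink concrete, with hypothetical
numbers: given a 4096-byte nf_dio_read_offset_align, a READ at
offset=5000 for 1000 bytes expands to dio_start=4096 and
dio_end=8192, i.e. a 4096-byte internal read. Afterwards
rq_res.page_base skips the 904-byte pad and the returned count is
trimmed back to the 1000 bytes the client asked for.]
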
> +static noinline_for_stack __be32
> +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + struct nfsd_file *nf, loff_t offset, unsigned long *count,
> + u32 *eof)
> +{
> + loff_t dio_start, dio_end;
> + unsigned long v, total;
> + struct iov_iter iter;
> + struct kiocb kiocb;
> + ssize_t host_err;
> + size_t len;
> +
> + init_sync_kiocb(&kiocb, nf->nf_file);
> + kiocb.ki_flags |= IOCB_DIRECT;
> +
> + /* Read a properly-aligned region of bytes into rq_bvec */
> + dio_start = round_down(offset, nf->nf_dio_read_offset_align);
> + dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
> +
> + kiocb.ki_pos = dio_start;
> +
> + v = 0;
> + total = *count;
Hi Chuck,

Looks like you introduced a copy-n-paste bug when updating
nfsd_direct_read's while loop to follow nfsd_iter_read's lead.
Should be:

	total = dio_end - dio_start;

[NOTE: this was the reason I saw a crash with my incremental patch
that handled 'base', see:
https://lore.kernel.org/linux-nfs/aMwcUdWdey69k2iK@kernel.org/
]
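
To make the failure concrete (hypothetical numbers): with a 4096-byte
nf_dio_read_offset_align, offset=5000 and *count=1000 yield
dio_start=4096 and dio_end=8192. Starting from total = *count, the
while loop maps only 1000 bytes into rq_bvec, yet iov_iter_bvec() is
told the iterator carries dio_end - dio_start - total = 4096 bytes,
so the read walks off the end of the mapped bvec entries.
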
Thanks,
Mike
> + while (total && v < rqstp->rq_maxpages &&
> + rqstp->rq_next_page < rqstp->rq_page_end) {
> + len = min_t(size_t, total, PAGE_SIZE);
> + bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> + len, 0);
> +
> + total -= len;
> + ++rqstp->rq_next_page;
> + ++v;
> + }
> +
> + trace_nfsd_read_direct(rqstp, fhp, offset, *count - total);
> + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v,
> + dio_end - dio_start - total);
> +
> + host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
> + if (host_err >= 0) {
> + unsigned int pad = offset - dio_start;
> +
> + /* The returned payload starts after the pad */
> + rqstp->rq_res.page_base = pad;
> +
> + /* Compute the count of bytes to be returned */
> + if (host_err > pad + *count) {
> + host_err = *count;
> + } else if (host_err > pad) {
> + host_err -= pad;
> + } else {
> + host_err = 0;
> + }
> + } else if (unlikely(host_err == -EINVAL)) {
> + pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
> + host_err = -ESERVERFAULT;
> + }
> +
> + return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
> + eof, host_err);
> +}
> +
> /**
> * nfsd_iter_read - Perform a VFS read using an iterator
> * @rqstp: RPC transaction context
> @@ -1106,6 +1182,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> switch (nfsd_io_cache_read) {
> case NFSD_IO_BUFFERED:
> break;
> + case NFSD_IO_DIRECT:
> + if (nf->nf_dio_read_offset_align && !base)
> + return nfsd_direct_read(rqstp, fhp, nf, offset,
> + count, eof);
> + fallthrough;
> case NFSD_IO_DONTCACHE:
> if (file->f_op->fop_flags & FOP_DONTCACHE)
> kiocb.ki_flags = IOCB_DONTCACHE;
>
>
>