* [PATCH v4 0/4] NFSD direct I/O read
@ 2025-09-26 14:51 Chuck Lever
2025-09-26 14:51 ` [PATCH v4 1/4] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Chuck Lever @ 2025-09-26 14:51 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
The goal is to get the experimental read-side direct I/O
implementation merged sooner. We are still thinking through the
implications of mixing direct and buffered I/O when handling an
NFS WRITE that does not meet the dio alignment requirements.
Changes since v3:
* Move xdr_reserve_space_vec() call to preserve the page_len value
* Note that "add STATX_DIOALIGN and STATX_DIO_READ_ALIGN ..."
remains exactly the same as it was in the v3 series
Changes since v2:
* "Add array bounds-checking..." has been applied to nfsd-testing
* Add a page_len check before committing to use direct I/O
Changes since v1:
* Harden the loop that constructs the I/O bvec
* Address review comments
Changes from Mike's v9:
* The LOC introduced by the feature has been reduced considerably.
* A new trace point in nfsd_file_getattr reports each file's dio
alignment parameters when it is opened.
* The direct I/O path has been taken out-of-line so that it may
continue to be modified and optimized without perturbing the more
commonly-used I/O paths.
* When an exported file system does not implement direct I/O, more
commonly-used modes are employed instead to avoid returning
EOPNOTSUPP unexpectedly.
* When NFSD_IO_DIRECT is selected, NFS READs of all sizes use direct
I/O to provide better experimental data about small I/O workloads.
Chuck Lever (2):
NFSD: Relocate the xdr_reserve_space_vec() call site
NFSD: Implement NFSD_IO_DIRECT for NFS READ
Mike Snitzer (2):
NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
NFSD: pass nfsd_file to nfsd_iter_read()
fs/nfsd/debugfs.c | 2 +
fs/nfsd/filecache.c | 34 ++++++++++++++++
fs/nfsd/filecache.h | 4 ++
fs/nfsd/nfs4xdr.c | 28 +++++++++----
fs/nfsd/nfsd.h | 1 +
fs/nfsd/nfsfh.c | 4 ++
fs/nfsd/trace.h | 28 +++++++++++++
fs/nfsd/vfs.c | 89 +++++++++++++++++++++++++++++++++++++++--
fs/nfsd/vfs.h | 2 +-
include/trace/misc/fs.h | 22 ++++++++++
10 files changed, 202 insertions(+), 12 deletions(-)
--
2.51.0
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v4 1/4] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
2025-09-26 14:51 [PATCH v4 0/4] NFSD direct I/O read Chuck Lever
@ 2025-09-26 14:51 ` Chuck Lever
2025-09-26 14:51 ` [PATCH v4 2/4] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
` (2 subsequent siblings)
3 siblings, 0 replies; 11+ messages in thread
From: Chuck Lever @ 2025-09-26 14:51 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer
From: Mike Snitzer <snitzer@kernel.org>
Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get DIO alignment
attributes from the underlying filesystem and store them in the
associated nfsd_file. This is done when the nfsd_file is first
opened for each regular file.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/filecache.c | 34 ++++++++++++++++++++++++++++++++++
fs/nfsd/filecache.h | 4 ++++
fs/nfsd/nfsfh.c | 4 ++++
fs/nfsd/trace.h | 27 +++++++++++++++++++++++++++
include/trace/misc/fs.h | 22 ++++++++++++++++++++++
5 files changed, 91 insertions(+)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 75bc48031c07..78cca0d751ac 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -231,6 +231,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
refcount_set(&nf->nf_ref, 1);
nf->nf_may = need;
nf->nf_mark = NULL;
+ nf->nf_dio_mem_align = 0;
+ nf->nf_dio_offset_align = 0;
+ nf->nf_dio_read_offset_align = 0;
return nf;
}
@@ -1048,6 +1051,35 @@ nfsd_file_is_cached(struct inode *inode)
return ret;
}
+static __be32
+nfsd_file_get_dio_attrs(const struct svc_fh *fhp, struct nfsd_file *nf)
+{
+ struct inode *inode = file_inode(nf->nf_file);
+ struct kstat stat;
+ __be32 status;
+
+ /* Currently only need to get DIO alignment info for regular files */
+ if (!S_ISREG(inode->i_mode))
+ return nfs_ok;
+
+ status = fh_getattr(fhp, &stat);
+ if (status != nfs_ok)
+ return status;
+
+ trace_nfsd_file_get_dio_attrs(inode, &stat);
+
+ if (stat.result_mask & STATX_DIOALIGN) {
+ nf->nf_dio_mem_align = stat.dio_mem_align;
+ nf->nf_dio_offset_align = stat.dio_offset_align;
+ }
+ if (stat.result_mask & STATX_DIO_READ_ALIGN)
+ nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
+ else
+ nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
+
+ return nfs_ok;
+}
+
static __be32
nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
struct svc_cred *cred,
@@ -1166,6 +1198,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
}
status = nfserrno(ret);
trace_nfsd_file_open(nf, status);
+ if (status == nfs_ok)
+ status = nfsd_file_get_dio_attrs(fhp, nf);
}
} else
status = nfserr_jukebox;
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index 24ddf60e8434..e3d6ca2b6030 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -54,6 +54,10 @@ struct nfsd_file {
struct list_head nf_gc;
struct rcu_head nf_rcu;
ktime_t nf_birthtime;
+
+ u32 nf_dio_mem_align;
+ u32 nf_dio_offset_align;
+ u32 nf_dio_read_offset_align;
};
int nfsd_file_cache_init(void);
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index bd9acfdc7b01..ed85dd43da18 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -696,8 +696,12 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
.mnt = fhp->fh_export->ex_path.mnt,
.dentry = fhp->fh_dentry,
};
+ struct inode *inode = d_inode(p.dentry);
u32 request_mask = STATX_BASIC_STATS;
+ if (S_ISREG(inode->i_mode))
+ request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
+
if (fhp->fh_maxsize == NFS4_FHSIZE)
request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index a664fdf1161e..6e2c8e2aab10 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -1133,6 +1133,33 @@ TRACE_EVENT(nfsd_file_alloc,
)
);
+TRACE_EVENT(nfsd_file_get_dio_attrs,
+ TP_PROTO(
+ const struct inode *inode,
+ const struct kstat *stat
+ ),
+ TP_ARGS(inode, stat),
+ TP_STRUCT__entry(
+ __field(const void *, inode)
+ __field(unsigned long, mask)
+ __field(u32, mem_align)
+ __field(u32, offset_align)
+ __field(u32, read_offset_align)
+ ),
+ TP_fast_assign(
+ __entry->inode = inode;
+ __entry->mask = stat->result_mask;
+ __entry->mem_align = stat->dio_mem_align;
+ __entry->offset_align = stat->dio_offset_align;
+ __entry->read_offset_align = stat->dio_read_offset_align;
+ ),
+ TP_printk("inode=%p flags=%s mem_align=%u offset_align=%u read_offset_align=%u",
+ __entry->inode, show_statx_mask(__entry->mask),
+ __entry->mem_align, __entry->offset_align,
+ __entry->read_offset_align
+ )
+);
+
TRACE_EVENT(nfsd_file_acquire,
TP_PROTO(
const struct svc_rqst *rqstp,
diff --git a/include/trace/misc/fs.h b/include/trace/misc/fs.h
index 0406ebe2a80a..7ead1c61f0cb 100644
--- a/include/trace/misc/fs.h
+++ b/include/trace/misc/fs.h
@@ -141,3 +141,25 @@
{ ATTR_TIMES_SET, "TIMES_SET" }, \
{ ATTR_TOUCH, "TOUCH"}, \
{ ATTR_DELEG, "DELEG"})
+
+#define show_statx_mask(flags) \
+ __print_flags(flags, "|", \
+ { STATX_TYPE, "TYPE" }, \
+ { STATX_MODE, "MODE" }, \
+ { STATX_NLINK, "NLINK" }, \
+ { STATX_UID, "UID" }, \
+ { STATX_GID, "GID" }, \
+ { STATX_ATIME, "ATIME" }, \
+ { STATX_MTIME, "MTIME" }, \
+ { STATX_CTIME, "CTIME" }, \
+ { STATX_INO, "INO" }, \
+ { STATX_SIZE, "SIZE" }, \
+ { STATX_BLOCKS, "BLOCKS" }, \
+ { STATX_BASIC_STATS, "BASIC_STATS" }, \
+ { STATX_BTIME, "BTIME" }, \
+ { STATX_MNT_ID, "MNT_ID" }, \
+ { STATX_DIOALIGN, "DIOALIGN" }, \
+ { STATX_MNT_ID_UNIQUE, "MNT_ID_UNIQUE" }, \
+ { STATX_SUBVOL, "SUBVOL" }, \
+ { STATX_WRITE_ATOMIC, "WRITE_ATOMIC" }, \
+ { STATX_DIO_READ_ALIGN, "DIO_READ_ALIGN" })
--
2.51.0
* [PATCH v4 2/4] NFSD: pass nfsd_file to nfsd_iter_read()
2025-09-26 14:51 [PATCH v4 0/4] NFSD direct I/O read Chuck Lever
2025-09-26 14:51 ` [PATCH v4 1/4] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
@ 2025-09-26 14:51 ` Chuck Lever
2025-09-26 14:51 ` [PATCH v4 3/4] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
2025-09-26 14:51 ` [PATCH v4 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
3 siblings, 0 replies; 11+ messages in thread
From: Chuck Lever @ 2025-09-26 14:51 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer
From: Mike Snitzer <snitzer@kernel.org>
Prepare for nfsd_iter_read() to use the DIO alignment stored in
nfsd_file by passing the nfsd_file to nfsd_iter_read() rather than
just the file which is associated with the nfsd_file.

This means nfsd4_encode_readv() now also needs the nfsd_file rather
than the file. Instead of changing the file arg to be the nfsd_file,
we discard the file arg as the nfsd_file (and indeed the file) is
already available via the "read" argument.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfs4xdr.c | 8 ++++----
fs/nfsd/vfs.c | 7 ++++---
fs/nfsd/vfs.h | 2 +-
3 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 97e9e9afa80a..41f0d54d6e1b 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4467,7 +4467,7 @@ static __be32 nfsd4_encode_splice_read(
static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
struct nfsd4_read *read,
- struct file *file, unsigned long maxcount)
+ unsigned long maxcount)
{
struct xdr_stream *xdr = resp->xdr;
unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
@@ -4478,7 +4478,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
if (xdr_reserve_space_vec(xdr, maxcount) < 0)
return nfserr_resource;
- nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
+ nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
read->rd_offset, &maxcount, base,
&read->rd_eof);
read->rd_length = maxcount;
@@ -4525,7 +4525,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr) {
xdr_truncate_encode(xdr, eof_offset);
return nfserr;
@@ -5421,7 +5421,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr)
return nfserr;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 2026431500ec..35880d3f1326 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1078,7 +1078,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
* @fhp: file handle of file to be read
- * @file: opened struct file of file to be read
+ * @nf: opened struct nfsd_file of file to be read
* @offset: starting byte offset
* @count: IN: requested number of bytes; OUT: number of bytes read
* @base: offset in first page of read buffer
@@ -1091,9 +1091,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* returned.
*/
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset, unsigned long *count,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
unsigned int base, u32 *eof)
{
+ struct file *file = nf->nf_file;
unsigned long v, total;
struct iov_iter iter;
struct kiocb kiocb;
@@ -1336,7 +1337,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
else
- err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
+ err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
nfsd_file_put(nf);
trace_nfsd_read_done(rqstp, fhp, offset, *count);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 0c0292611c6d..fa46f8b5f132 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -121,7 +121,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned long *count,
u32 *eof);
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset,
+ struct nfsd_file *nf, loff_t offset,
unsigned long *count, unsigned int base,
u32 *eof);
bool nfsd_read_splice_ok(struct svc_rqst *rqstp);
--
2.51.0
* [PATCH v4 3/4] NFSD: Relocate the xdr_reserve_space_vec() call site
2025-09-26 14:51 [PATCH v4 0/4] NFSD direct I/O read Chuck Lever
2025-09-26 14:51 ` [PATCH v4 1/4] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
2025-09-26 14:51 ` [PATCH v4 2/4] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
@ 2025-09-26 14:51 ` Chuck Lever
2025-09-29 3:55 ` NeilBrown
2025-09-29 11:49 ` Jeff Layton
2025-09-26 14:51 ` [PATCH v4 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
3 siblings, 2 replies; 11+ messages in thread
From: Chuck Lever @ 2025-09-26 14:51 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
In order to detect when a direct READ is possible, we need the send
buffer's .page_len to be zero when there is nothing in the buffer's
.pages array yet.
However, when xdr_reserve_space_vec() extends the size of the
xdr_stream to accommodate a READ payload, it adds to the send
buffer's .page_len.
It should be safe to reserve the stream space /after/ the VFS read
operation completes. This is, for example, how an NFSv3 READ works:
the VFS read goes into the rq_bvec, and is then added to the send
xdr_stream later by svcxdr_encode_opaque_pages().
Nit: we could probably get rid of the xdr_truncate_encode(), now
that maxcount reflects the actual count of bytes returned by the
file system.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfs4xdr.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 41f0d54d6e1b..3a5c7a0e2db8 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4475,18 +4475,30 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
__be32 zero = xdr_zero;
__be32 nfserr;
- if (xdr_reserve_space_vec(xdr, maxcount) < 0)
- return nfserr_resource;
-
nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
read->rd_offset, &maxcount, base,
&read->rd_eof);
read->rd_length = maxcount;
if (nfserr)
return nfserr;
+
+ /*
+ * svcxdr_encode_opaque_pages() is not used here because
+ * we don't want to encode subsequent results in this
+ * COMPOUND into the xdr->buf's tail, but rather those
+ * results should follow the NFS READ payload in the
+ * buf's pages.
+ */
+ if (xdr_reserve_space_vec(xdr, maxcount) < 0)
+ return nfserr_resource;
+
+ /*
+ * Mark the buffer location of the NFS READ payload so that
+ * direct placement-capable transports send only the
+ * payload bytes out-of-band.
+ */
if (svc_encode_result_payload(resp->rqstp, starting_len, maxcount))
return nfserr_io;
- xdr_truncate_encode(xdr, starting_len + xdr_align_size(maxcount));
write_bytes_to_xdr_buf(xdr->buf, starting_len + maxcount, &zero,
xdr_pad_size(maxcount));
--
2.51.0
* [PATCH v4 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
2025-09-26 14:51 [PATCH v4 0/4] NFSD direct I/O read Chuck Lever
` (2 preceding siblings ...)
2025-09-26 14:51 ` [PATCH v4 3/4] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
@ 2025-09-26 14:51 ` Chuck Lever
2025-09-29 4:02 ` NeilBrown
2025-09-29 7:29 ` Christoph Hellwig
3 siblings, 2 replies; 11+ messages in thread
From: Chuck Lever @ 2025-09-26 14:51 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Mike Snitzer
From: Chuck Lever <chuck.lever@oracle.com>
Add an experimental option that forces NFS READ operations to use
direct I/O instead of reading through the NFS server's page cache.
There are already other layers of caching:
- The page cache on NFS clients
- The block device underlying the exported file system
The server's page cache, in many cases, is unlikely to provide
additional benefit. Some benchmarks have demonstrated that the
server's page cache is actively detrimental for workloads whose
working set is larger than the server's available physical memory.
For instance, on small NFS servers, cached NFS file content can
squeeze out local memory consumers. For large sequential workloads,
an enormous amount of data flows into and out of the page cache
and is consumed by NFS clients exactly once -- caching that data
is expensive to do and totally valueless.
For now this is a hidden option that can be enabled on test
systems for benchmarking. In the longer term, this option might
be enabled persistently or per-export. When the exported file
system does not support direct I/O, NFSD falls back to using
either DONTCACHE or buffered I/O to fulfill NFS READ requests.
Suggested-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/debugfs.c | 2 ++
fs/nfsd/nfsd.h | 1 +
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 86 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index ed2b9e066206..00eb1ecef6ac 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
* Contents:
* %0: NFS READ will use buffered IO
* %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
+ * %2: NFS READ will use direct IO
*
* This setting takes immediate effect for all NFS versions,
* all exports, and in all NFSD net namespaces.
@@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
nfsd_io_cache_read = NFSD_IO_BUFFERED;
break;
case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
/*
* Must disable splice_read when enabling
* NFSD_IO_DONTCACHE.
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index ea87b42894dd..bdb60ee1f1a4 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -157,6 +157,7 @@ enum {
/* Any new NFSD_IO enum value must be added at the end */
NFSD_IO_BUFFERED,
NFSD_IO_DONTCACHE,
+ NFSD_IO_DIRECT,
};
extern u64 nfsd_io_cache_read __read_mostly;
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 6e2c8e2aab10..bfd41236aff2 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
DEFINE_NFSD_IO_EVENT(read_start);
DEFINE_NFSD_IO_EVENT(read_splice);
DEFINE_NFSD_IO_EVENT(read_vector);
+DEFINE_NFSD_IO_EVENT(read_direct);
DEFINE_NFSD_IO_EVENT(read_io_done);
DEFINE_NFSD_IO_EVENT(read_done);
DEFINE_NFSD_IO_EVENT(write_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 35880d3f1326..2f12d68e5356 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1074,6 +1074,82 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
+/*
+ * The byte range of the client's READ request is expanded on both
+ * ends until it meets the underlying file system's direct I/O
+ * alignment requirements. After the internal read is complete, the
+ * byte range of the NFS READ payload is reduced to the byte range
+ * that was originally requested.
+ *
+ * Note that a direct read can be done only when the xdr_buf
+ * containing the NFS READ reply does not already have contents in
+ * its .pages array. This is due to potentially restrictive
+ * alignment requirements on the read buffer. When .page_len and
+ * @base are zero, the .pages array is guaranteed to be page-
+ * aligned.
+ */
+static noinline_for_stack __be32
+nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
+ u32 *eof)
+{
+ loff_t dio_start, dio_end;
+ unsigned long v, total;
+ struct iov_iter iter;
+ struct kiocb kiocb;
+ ssize_t host_err;
+ size_t len;
+
+ init_sync_kiocb(&kiocb, nf->nf_file);
+ kiocb.ki_flags |= IOCB_DIRECT;
+
+ /* Read a properly-aligned region of bytes into rq_bvec */
+ dio_start = round_down(offset, nf->nf_dio_read_offset_align);
+ dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
+
+ kiocb.ki_pos = dio_start;
+
+ v = 0;
+ total = dio_end - dio_start;
+ while (total && v < rqstp->rq_maxpages &&
+ rqstp->rq_next_page < rqstp->rq_page_end) {
+ len = min_t(size_t, total, PAGE_SIZE);
+ bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
+ len, 0);
+
+ total -= len;
+ ++rqstp->rq_next_page;
+ ++v;
+ }
+
+ trace_nfsd_read_direct(rqstp, fhp, offset, *count - total);
+ iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v,
+ dio_end - dio_start - total);
+
+ host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
+ if (host_err >= 0) {
+ unsigned int pad = offset - dio_start;
+
+ /* The returned payload starts after the pad */
+ rqstp->rq_res.page_base = pad;
+
+ /* Compute the count of bytes to be returned */
+ if (host_err > pad + *count) {
+ host_err = *count;
+ } else if (host_err > pad) {
+ host_err -= pad;
+ } else {
+ host_err = 0;
+ }
+ } else if (unlikely(host_err == -EINVAL)) {
+ pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
+ host_err = -ESERVERFAULT;
+ }
+
+ return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
+ eof, host_err);
+}
+
/**
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
@@ -1106,6 +1182,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
switch (nfsd_io_cache_read) {
case NFSD_IO_BUFFERED:
break;
+ case NFSD_IO_DIRECT:
+ if (nf->nf_dio_read_offset_align &&
+ !rqstp->rq_res.page_len && !base)
+ return nfsd_direct_read(rqstp, fhp, nf, offset,
+ count, eof);
+ fallthrough;
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags = IOCB_DONTCACHE;
--
2.51.0
* Re: [PATCH v4 3/4] NFSD: Relocate the xdr_reserve_space_vec() call site
2025-09-26 14:51 ` [PATCH v4 3/4] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
@ 2025-09-29 3:55 ` NeilBrown
2025-09-29 11:49 ` Jeff Layton
1 sibling, 0 replies; 11+ messages in thread
From: NeilBrown @ 2025-09-29 3:55 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever
On Sat, 27 Sep 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> In order to detect when a direct READ is possible, we need the send
> buffer's .page_len to be zero when there is nothing in the buffer's
> .pages array yet.
>
> However, when xdr_reserve_space_vec() extends the size of the
> xdr_stream to accommodate a READ payload, it adds to the send
> buffer's .page_len.
>
> It should be safe to reserve the stream space /after/ the VFS read
> operation completes. This is, for example, how an NFSv3 READ works:
> the VFS read goes into the rq_bvec, and is then added to the send
> xdr_stream later by svcxdr_encode_opaque_pages().
>
> Nit: we could probably get rid of the xdr_truncate_encode(), now
> that maxcount reflects the actual count of bytes returned by the
> file system.
Nit: you made the suggested change, but didn't update this comment :-)
Reviewed-by: NeilBrown <neil@brown.name>
Thanks,
NeilBrown
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/nfs4xdr.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 41f0d54d6e1b..3a5c7a0e2db8 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -4475,18 +4475,30 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
> __be32 zero = xdr_zero;
> __be32 nfserr;
>
> - if (xdr_reserve_space_vec(xdr, maxcount) < 0)
> - return nfserr_resource;
> -
> nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
> read->rd_offset, &maxcount, base,
> &read->rd_eof);
> read->rd_length = maxcount;
> if (nfserr)
> return nfserr;
> +
> + /*
> + * svcxdr_encode_opaque_pages() is not used here because
> + * we don't want to encode subsequent results in this
> + * COMPOUND into the xdr->buf's tail, but rather those
> + * results should follow the NFS READ payload in the
> + * buf's pages.
> + */
> + if (xdr_reserve_space_vec(xdr, maxcount) < 0)
> + return nfserr_resource;
> +
> + /*
> + * Mark the buffer location of the NFS READ payload so that
> + * direct placement-capable transports send only the
> + * payload bytes out-of-band.
> + */
> if (svc_encode_result_payload(resp->rqstp, starting_len, maxcount))
> return nfserr_io;
> - xdr_truncate_encode(xdr, starting_len + xdr_align_size(maxcount));
>
> write_bytes_to_xdr_buf(xdr->buf, starting_len + maxcount, &zero,
> xdr_pad_size(maxcount));
> --
> 2.51.0
>
>
* Re: [PATCH v4 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
2025-09-26 14:51 ` [PATCH v4 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
@ 2025-09-29 4:02 ` NeilBrown
2025-09-29 7:29 ` Christoph Hellwig
1 sibling, 0 replies; 11+ messages in thread
From: NeilBrown @ 2025-09-29 4:02 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever, Mike Snitzer
On Sat, 27 Sep 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Add an experimental option that forces NFS READ operations to use
> direct I/O instead of reading through the NFS server's page cache.
>
> There are already other layers of caching:
> - The page cache on NFS clients
> - The block device underlying the exported file system
>
> The server's page cache, in many cases, is unlikely to provide
> additional benefit. Some benchmarks have demonstrated that the
> server's page cache is actively detrimental for workloads whose
> working set is larger than the server's available physical memory.
>
> For instance, on small NFS servers, cached NFS file content can
> squeeze out local memory consumers. For large sequential workloads,
> an enormous amount of data flows into and out of the page cache
> and is consumed by NFS clients exactly once -- caching that data
> is expensive to do and totally valueless.
>
> For now this is a hidden option that can be enabled on test
> systems for benchmarking. In the longer term, this option might
> be enabled persistently or per-export. When the exported file
> system does not support direct I/O, NFSD falls back to using
> either DONTCACHE or buffered I/O to fulfill NFS READ requests.
>
> Suggested-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/debugfs.c | 2 ++
> fs/nfsd/nfsd.h | 1 +
> fs/nfsd/trace.h | 1 +
> fs/nfsd/vfs.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 86 insertions(+)
>
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index ed2b9e066206..00eb1ecef6ac 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> * Contents:
> * %0: NFS READ will use buffered IO
> * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> + * %2: NFS READ will use direct IO
> *
> * This setting takes immediate effect for all NFS versions,
> * all exports, and in all NFSD net namespaces.
> @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> nfsd_io_cache_read = NFSD_IO_BUFFERED;
> break;
> case NFSD_IO_DONTCACHE:
> + case NFSD_IO_DIRECT:
> /*
> * Must disable splice_read when enabling
> * NFSD_IO_DONTCACHE.
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index ea87b42894dd..bdb60ee1f1a4 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -157,6 +157,7 @@ enum {
> /* Any new NFSD_IO enum value must be added at the end */
> NFSD_IO_BUFFERED,
> NFSD_IO_DONTCACHE,
> + NFSD_IO_DIRECT,
> };
>
> extern u64 nfsd_io_cache_read __read_mostly;
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index 6e2c8e2aab10..bfd41236aff2 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
> DEFINE_NFSD_IO_EVENT(read_start);
> DEFINE_NFSD_IO_EVENT(read_splice);
> DEFINE_NFSD_IO_EVENT(read_vector);
> +DEFINE_NFSD_IO_EVENT(read_direct);
> DEFINE_NFSD_IO_EVENT(read_io_done);
> DEFINE_NFSD_IO_EVENT(read_done);
> DEFINE_NFSD_IO_EVENT(write_start);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 35880d3f1326..2f12d68e5356 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1074,6 +1074,82 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> }
>
> +/*
> + * The byte range of the client's READ request is expanded on both
> + * ends until it meets the underlying file system's direct I/O
> + * alignment requirements. After the internal read is complete, the
> + * byte range of the NFS READ payload is reduced to the byte range
> + * that was originally requested.
> + *
> + * Note that a direct read can be done only when the xdr_buf
> + * containing the NFS READ reply does not already have contents in
> + * its .pages array. This is due to potentially restrictive
> + * alignment requirements on the read buffer. When .page_len and
> + * @base are zero, the .pages array is guaranteed to be page-
> + * aligned.
> + */
> +static noinline_for_stack __be32
> +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + struct nfsd_file *nf, loff_t offset, unsigned long *count,
> + u32 *eof)
> +{
> + loff_t dio_start, dio_end;
> + unsigned long v, total;
> + struct iov_iter iter;
> + struct kiocb kiocb;
> + ssize_t host_err;
> + size_t len;
> +
> + init_sync_kiocb(&kiocb, nf->nf_file);
> + kiocb.ki_flags |= IOCB_DIRECT;
> +
> + /* Read a properly-aligned region of bytes into rq_bvec */
> + dio_start = round_down(offset, nf->nf_dio_read_offset_align);
> + dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
> +
> + kiocb.ki_pos = dio_start;
> +
> + v = 0;
> + total = dio_end - dio_start;
> + while (total && v < rqstp->rq_maxpages &&
> + rqstp->rq_next_page < rqstp->rq_page_end) {
> + len = min_t(size_t, total, PAGE_SIZE);
> + bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> + len, 0);
> +
> + total -= len;
> + ++rqstp->rq_next_page;
> + ++v;
> + }
> +
> + trace_nfsd_read_direct(rqstp, fhp, offset, *count - total);
> + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v,
> + dio_end - dio_start - total);
> +
> + host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
> + if (host_err >= 0) {
> + unsigned int pad = offset - dio_start;
> +
> + /* The returned payload starts after the pad */
> + rqstp->rq_res.page_base = pad;
> +
> + /* Compute the count of bytes to be returned */
> + if (host_err > pad + *count) {
> + host_err = *count;
> + } else if (host_err > pad) {
> + host_err -= pad;
> + } else {
> + host_err = 0;
> + }
> + } else if (unlikely(host_err == -EINVAL)) {
> + pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
> + host_err = -ESERVERFAULT;
> + }
> +
> + return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
> + eof, host_err);
> +}
> +
> /**
> * nfsd_iter_read - Perform a VFS read using an iterator
> * @rqstp: RPC transaction context
> @@ -1106,6 +1182,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> switch (nfsd_io_cache_read) {
> case NFSD_IO_BUFFERED:
> break;
> + case NFSD_IO_DIRECT:
> + if (nf->nf_dio_read_offset_align &&
> + !rqstp->rq_res.page_len && !base)
> + return nfsd_direct_read(rqstp, fhp, nf, offset,
> + count, eof);
> + fallthrough;
Reviewed-by: NeilBrown <neil@brown.name>
This is starting to make sense now - thanks.
However ... It seems that the only reason that 'base' is being passed in
to nfsd_iter_read() is so that it doesn't need to dig around in
rqstp->rq_res to find out what the current page offset is.
But now that we *are* digging around (and even directly setting
page_base etc), maybe there is no point having the 'base' argument.
And I don't think you need to test "!base" above. If page_len is zero
then "base" (which is rq_res.page_len & ~PAGE_MASK from
nfsd4_encode_readv) must also be zero. So the extra test is redundant.
Thanks,
NeilBrown
> case NFSD_IO_DONTCACHE:
> if (file->f_op->fop_flags & FOP_DONTCACHE)
> kiocb.ki_flags = IOCB_DONTCACHE;
> --
> 2.51.0
>
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v4 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
2025-09-26 14:51 ` [PATCH v4 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
2025-09-29 4:02 ` NeilBrown
@ 2025-09-29 7:29 ` Christoph Hellwig
2025-09-29 13:03 ` Chuck Lever
1 sibling, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2025-09-29 7:29 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever, Mike Snitzer
On Fri, Sep 26, 2025 at 10:51:51AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Add an experimental option that forces NFS READ operations to use
> direct I/O instead of reading through the NFS server's page cache.
>
> There are already other layers of caching:
> - The page cache on NFS clients
> - The block device underlying the exported file system
What layer of caching is in the "block device" ?
> +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + struct nfsd_file *nf, loff_t offset, unsigned long *count,
> + u32 *eof)
> +{
> + loff_t dio_start, dio_end;
> + unsigned long v, total;
> + struct iov_iter iter;
> + struct kiocb kiocb;
> + ssize_t host_err;
> + size_t len;
> +
> + init_sync_kiocb(&kiocb, nf->nf_file);
> + kiocb.ki_flags |= IOCB_DIRECT;
> +
> + /* Read a properly-aligned region of bytes into rq_bvec */
> + dio_start = round_down(offset, nf->nf_dio_read_offset_align);
> + dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
> +
> + kiocb.ki_pos = dio_start;
> +
> + v = 0;
> + total = dio_end - dio_start;
> + while (total && v < rqstp->rq_maxpages &&
> + rqstp->rq_next_page < rqstp->rq_page_end) {
> + len = min_t(size_t, total, PAGE_SIZE);
> + bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> + len, 0);
> +
> + total -= len;
> + ++rqstp->rq_next_page;
> + ++v;
> + }
> +
> + trace_nfsd_read_direct(rqstp, fhp, offset, *count - total);
> + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v,
> + dio_end - dio_start - total);
> +
> + host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
> + if (host_err >= 0) {
> + unsigned int pad = offset - dio_start;
> +
> + /* The returned payload starts after the pad */
> + rqstp->rq_res.page_base = pad;
> +
> + /* Compute the count of bytes to be returned */
> + if (host_err > pad + *count) {
> + host_err = *count;
> + } else if (host_err > pad) {
> + host_err -= pad;
> + } else {
> + host_err = 0;
> + }
No need for the braces here.
> + } else if (unlikely(host_err == -EINVAL)) {
> + pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
> + host_err = -ESERVERFAULT;
> + }
You'll probably want to print s_id to identify the file system for which
this happened. What is -ESERVERFAULT supposed to mean, btw?
> + case NFSD_IO_DIRECT:
> + if (nf->nf_dio_read_offset_align &&
I guess this is the "is direct I/O actually supported" check? I guess
it'll work, but will underreport as very few file systems actually
report the requirement at the moment. Can you add a comment explaining
the check?
* Re: [PATCH v4 3/4] NFSD: Relocate the xdr_reserve_space_vec() call site
2025-09-26 14:51 ` [PATCH v4 3/4] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
2025-09-29 3:55 ` NeilBrown
@ 2025-09-29 11:49 ` Jeff Layton
1 sibling, 0 replies; 11+ messages in thread
From: Jeff Layton @ 2025-09-29 11:49 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Fri, 2025-09-26 at 10:51 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> In order to detect when a direct READ is possible, we need the send
> buffer's .page_len to be zero when there is nothing in the buffer's
> .pages array yet.
>
> However, when xdr_reserve_space_vec() extends the size of the
> xdr_stream to accommodate a READ payload, it adds to the send
> buffer's .page_len.
>
> It should be safe to reserve the stream space /after/ the VFS read
> operation completes. This is, for example, how an NFSv3 READ works:
> the VFS read goes into the rq_bvec, and is then added to the send
> xdr_stream later by svcxdr_encode_opaque_pages().
>
> Nit: we could probably get rid of the xdr_truncate_encode(), now
> that maxcount reflects the actual count of bytes returned by the
> file system.
>
The "Nit:" makes it sound like there is follow-on work here, but you've
removed the xdr_truncate_encode(). Maybe something like this instead:
Also get rid of the xdr_truncate_encode() call now that maxcount
reflects the actual count of bytes returned by the file system.
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/nfs4xdr.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 41f0d54d6e1b..3a5c7a0e2db8 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -4475,18 +4475,30 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
> __be32 zero = xdr_zero;
> __be32 nfserr;
>
> - if (xdr_reserve_space_vec(xdr, maxcount) < 0)
> - return nfserr_resource;
> -
> nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
> read->rd_offset, &maxcount, base,
> &read->rd_eof);
> read->rd_length = maxcount;
> if (nfserr)
> return nfserr;
> +
> + /*
> + * svcxdr_encode_opaque_pages() is not used here because
> + * we don't want to encode subsequent results in this
> + * COMPOUND into the xdr->buf's tail, but rather those
> + * results should follow the NFS READ payload in the
> + * buf's pages.
> + */
> + if (xdr_reserve_space_vec(xdr, maxcount) < 0)
> + return nfserr_resource;
> +
> + /*
> + * Mark the buffer location of the NFS READ payload so that
> + * direct placement-capable transports send only the
> + * payload bytes out-of-band.
> + */
> if (svc_encode_result_payload(resp->rqstp, starting_len, maxcount))
> return nfserr_io;
> - xdr_truncate_encode(xdr, starting_len + xdr_align_size(maxcount));
>
> write_bytes_to_xdr_buf(xdr->buf, starting_len + maxcount, &zero,
> xdr_pad_size(maxcount));
The code looks fine though:
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v4 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
2025-09-29 7:29 ` Christoph Hellwig
@ 2025-09-29 13:03 ` Chuck Lever
2025-09-29 13:06 ` Chuck Lever
0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2025-09-29 13:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever, Mike Snitzer
On 9/29/25 12:29 AM, Christoph Hellwig wrote:
> On Fri, Sep 26, 2025 at 10:51:51AM -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Add an experimental option that forces NFS READ operations to use
>> direct I/O instead of reading through the NFS server's page cache.
>>
>> There are already other layers of caching:
>> - The page cache on NFS clients
>> - The block device underlying the exported file system
>
> What layer of caching is in the "block device" ?
I wrote this before Damien explained to me that the block devices
generally don't have a very significant cache at all.
>> +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> + struct nfsd_file *nf, loff_t offset, unsigned long *count,
>> + u32 *eof)
>> +{
>> + loff_t dio_start, dio_end;
>> + unsigned long v, total;
>> + struct iov_iter iter;
>> + struct kiocb kiocb;
>> + ssize_t host_err;
>> + size_t len;
>> +
>> + init_sync_kiocb(&kiocb, nf->nf_file);
>> + kiocb.ki_flags |= IOCB_DIRECT;
>> +
>> + /* Read a properly-aligned region of bytes into rq_bvec */
>> + dio_start = round_down(offset, nf->nf_dio_read_offset_align);
>> + dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
>> +
>> + kiocb.ki_pos = dio_start;
>> +
>> + v = 0;
>> + total = dio_end - dio_start;
>> + while (total && v < rqstp->rq_maxpages &&
>> + rqstp->rq_next_page < rqstp->rq_page_end) {
>> + len = min_t(size_t, total, PAGE_SIZE);
>> + bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
>> + len, 0);
>> +
>> + total -= len;
>> + ++rqstp->rq_next_page;
>> + ++v;
>> + }
>> +
>> + trace_nfsd_read_direct(rqstp, fhp, offset, *count - total);
>> + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v,
>> + dio_end - dio_start - total);
>> +
>> + host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
>> + if (host_err >= 0) {
>> + unsigned int pad = offset - dio_start;
>> +
>> + /* The returned payload starts after the pad */
>> + rqstp->rq_res.page_base = pad;
>> +
>> + /* Compute the count of bytes to be returned */
>> + if (host_err > pad + *count) {
>> + host_err = *count;
>> + } else if (host_err > pad) {
>> + host_err -= pad;
>> + } else {
>> + host_err = 0;
>> + }
>
> No need for the braces here.
>
>> + } else if (unlikely(host_err == -EINVAL)) {
>> + pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
>> + host_err = -ESERVERFAULT;
>> + }
>
> You'll probably want to print s_id to identify the file system for which
> this happened. What is -ESERVERFAULT supposed to mean, btw?
Note, this is a "should never happen" condition -- so, exceedingly rare.
Clients typically turn NFSERR_SERVERFAULT into -EREMOTEIO.
>> + case NFSD_IO_DIRECT:
>> + if (nf->nf_dio_read_offset_align &&
>
> I guess this is the "is direct I/O actually supported" check? I guess
> it'll work, but will underreport as very few file systems actually
> report the requirement at the moment. Can you add a comment explaining
> the check?
Well, it's to ensure that the vfs_getattr we did at open time actually
got a valid value. But effectively, yes, it means "is direct I/O
available for this file?"
--
Chuck Lever
* Re: [PATCH v4 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
2025-09-29 13:03 ` Chuck Lever
@ 2025-09-29 13:06 ` Chuck Lever
0 siblings, 0 replies; 11+ messages in thread
From: Chuck Lever @ 2025-09-29 13:06 UTC (permalink / raw)
To: Chuck Lever, Christoph Hellwig
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Mike Snitzer
On 9/29/25 6:03 AM, Chuck Lever wrote:
> On 9/29/25 12:29 AM, Christoph Hellwig wrote:
>> On Fri, Sep 26, 2025 at 10:51:51AM -0400, Chuck Lever wrote:
>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>
>>> Add an experimental option that forces NFS READ operations to use
>>> direct I/O instead of reading through the NFS server's page cache.
>>>
>>> There are already other layers of caching:
>>> - The page cache on NFS clients
>>> - The block device underlying the exported file system
>>
>> What layer of caching is in the "block device" ?
>
> I wrote this before Damien explained to me that the block devices
> generally don't have a very significant cache at all.
>
>
>>> +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>> + struct nfsd_file *nf, loff_t offset, unsigned long *count,
>>> + u32 *eof)
>>> +{
>>> + loff_t dio_start, dio_end;
>>> + unsigned long v, total;
>>> + struct iov_iter iter;
>>> + struct kiocb kiocb;
>>> + ssize_t host_err;
>>> + size_t len;
>>> +
>>> + init_sync_kiocb(&kiocb, nf->nf_file);
>>> + kiocb.ki_flags |= IOCB_DIRECT;
>>> +
>>> + /* Read a properly-aligned region of bytes into rq_bvec */
>>> + dio_start = round_down(offset, nf->nf_dio_read_offset_align);
>>> + dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
>>> +
>>> + kiocb.ki_pos = dio_start;
>>> +
>>> + v = 0;
>>> + total = dio_end - dio_start;
>>> + while (total && v < rqstp->rq_maxpages &&
>>> + rqstp->rq_next_page < rqstp->rq_page_end) {
>>> + len = min_t(size_t, total, PAGE_SIZE);
>>> + bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
>>> + len, 0);
>>> +
>>> + total -= len;
>>> + ++rqstp->rq_next_page;
>>> + ++v;
>>> + }
>>> +
>>> + trace_nfsd_read_direct(rqstp, fhp, offset, *count - total);
>>> + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v,
>>> + dio_end - dio_start - total);
>>> +
>>> + host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
>>> + if (host_err >= 0) {
>>> + unsigned int pad = offset - dio_start;
>>> +
>>> + /* The returned payload starts after the pad */
>>> + rqstp->rq_res.page_base = pad;
>>> +
>>> + /* Compute the count of bytes to be returned */
>>> + if (host_err > pad + *count) {
>>> + host_err = *count;
>>> + } else if (host_err > pad) {
>>> + host_err -= pad;
>>> + } else {
>>> + host_err = 0;
>>> + }
>>
>> No need for the braces here.
>>
>>> + } else if (unlikely(host_err == -EINVAL)) {
>>> + pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
>>> + host_err = -ESERVERFAULT;
>>> + }
>>
>> You'll probably want to print s_id to identify the file system for which
>> this happened. What is -ESERVERFAULT supposed to mean, btw?
It means there was an unexpected fault on the server: a software bug.
>
> Note, this is a "should never happen" condition -- so, exceedingly rare.
>
> Clients typically turn NFSERR_SERVERFAULT into -EREMOTEIO.
>
>
>>> + case NFSD_IO_DIRECT:
>>> + if (nf->nf_dio_read_offset_align &&
>>
>> I guess this is the "is direct I/O actually supported" check? I guess
>> it'll work, but will underreport as very few file systems actually
>> report the requirement at the moment. Can you add a comment explaining
>> the check?
> Well, it's to ensure that the vfs_getattr we did at open time actually
> got a valid value. But effectively, yes, it means "is direct I/O
> available for this file?"
>
>
--
Chuck Lever
end of thread, other threads:[~2025-09-29 13:10 UTC | newest]
2025-09-26 14:51 [PATCH v4 0/4] NFSD direct I/O read Chuck Lever
2025-09-26 14:51 ` [PATCH v4 1/4] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
2025-09-26 14:51 ` [PATCH v4 2/4] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
2025-09-26 14:51 ` [PATCH v4 3/4] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
2025-09-29 3:55 ` NeilBrown
2025-09-29 11:49 ` Jeff Layton
2025-09-26 14:51 ` [PATCH v4 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
2025-09-29 4:02 ` NeilBrown
2025-09-29 7:29 ` Christoph Hellwig
2025-09-29 13:03 ` Chuck Lever
2025-09-29 13:06 ` Chuck Lever