* [PATCH v2 0/4] NFSD direct I/O read
From: Chuck Lever @ 2025-09-17 14:31 UTC (permalink / raw)
To: neil, jlayton, okorniev, dai.ngo, tom; +Cc: linux-nfs
The goal is to get the experimental read-side direct I/O
implementation merged sooner. We are still thinking through the
implications of mixing direct and buffered I/O when handling an
NFS WRITE that does not meet the dio alignment requirements.
Changes since v1:
* Harden the loop that constructs the I/O bvec
* Address review comments
Changes from Mike's v9:
* The LOC introduced by the feature has been reduced considerably.
* A new trace point in nfsd_file_get_dio_attrs() reports each file's dio
alignment parameters when it is opened.
* The direct I/O path has been taken out-of-line so that it may
continue to be modified and optimized without perturbing the more
commonly used I/O paths.
* When an exported file system does not implement direct I/O, more
commonly used modes are employed instead to avoid returning
EOPNOTSUPP unexpectedly.
* When NFSD_IO_DIRECT is selected, NFS READs of all sizes use direct
I/O to provide better experimental data about small I/O workloads.
---
Chuck Lever (2):
NFSD: Add array bounds-checking in nfsd_iter_read()
NFSD: Implement NFSD_IO_DIRECT for NFS READ
Mike Snitzer (2):
NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
NFSD: pass nfsd_file to nfsd_iter_read()
fs/nfsd/debugfs.c | 2 +
fs/nfsd/filecache.c | 34 ++++++++++++++
fs/nfsd/filecache.h | 4 ++
fs/nfsd/nfs4xdr.c | 8 ++--
fs/nfsd/nfsd.h | 1 +
fs/nfsd/nfsfh.c | 4 ++
fs/nfsd/trace.h | 28 +++++++++++
fs/nfsd/vfs.c | 100 ++++++++++++++++++++++++++++++++++++----
fs/nfsd/vfs.h | 2 +-
include/trace/misc/fs.h | 22 +++++++++
10 files changed, 192 insertions(+), 13 deletions(-)
--
Chuck Lever
* [PATCH v2 1/4] NFSD: Add array bounds-checking in nfsd_iter_read()
From: Chuck Lever @ 2025-09-17 14:31 UTC (permalink / raw)
To: neil, jlayton, okorniev, dai.ngo, tom; +Cc: linux-nfs
From: Chuck Lever <chuck.lever@oracle.com>
The *count parameter does not appear to be explicitly restricted
to being smaller than rsize, so it might be possible to overrun
the rq_bvec or rq_pages arrays.
Rather than overrunning these arrays (damage done!) and then WARNING
once, let's harden the loop so that it terminates before the end of
either array is reached. This should result in a short read, which is
OK -- clients recover by sending additional READ requests for the
remaining unread bytes.
Reported-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 714777c221ed..2026431500ec 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1115,18 +1115,20 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
v = 0;
total = *count;
- while (total) {
+ while (total && v < rqstp->rq_maxpages &&
+ rqstp->rq_next_page < rqstp->rq_page_end) {
len = min_t(size_t, total, PAGE_SIZE - base);
- bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
+ bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
len, base);
+
total -= len;
+ ++rqstp->rq_next_page;
++v;
base = 0;
}
- WARN_ON_ONCE(v > rqstp->rq_maxpages);
- trace_nfsd_read_vector(rqstp, fhp, offset, *count);
- iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
+ trace_nfsd_read_vector(rqstp, fhp, offset, *count - total);
+ iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count - total);
host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
* [PATCH v2 2/4] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
From: Chuck Lever @ 2025-09-17 14:31 UTC (permalink / raw)
To: neil, jlayton, okorniev, dai.ngo, tom; +Cc: linux-nfs
From: Mike Snitzer <snitzer@kernel.org>
Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get DIO alignment
attributes from the underlying filesystem and store them in the
associated nfsd_file. This is done when the nfsd_file is first
opened for each regular file.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
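For readers unfamiliar with these attributes, here is a rough userspace
sketch (illustrative only, not part of the patch) of how the same DIO
alignment values are reported through statx(2). NFSD obtains them on the
kernel side via fh_getattr() into a struct kstat; STATX_DIO_READ_ALIGN and
stx_dio_read_offset_align require a recent kernel and userspace headers.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/stat.h>

	int main(int argc, char **argv)
	{
		struct statx stx;

		if (argc < 2)
			return 1;
		/* Request only the DIO alignment attributes */
		if (statx(AT_FDCWD, argv[1], 0,
			  STATX_DIOALIGN | STATX_DIO_READ_ALIGN, &stx) < 0) {
			perror("statx");
			return 1;
		}
		if (stx.stx_mask & STATX_DIOALIGN)
			printf("dio_mem_align=%u dio_offset_align=%u\n",
			       stx.stx_dio_mem_align, stx.stx_dio_offset_align);
		if (stx.stx_mask & STATX_DIO_READ_ALIGN)
			printf("dio_read_offset_align=%u\n",
			       stx.stx_dio_read_offset_align);
		return 0;
	}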
---
fs/nfsd/filecache.c | 34 ++++++++++++++++++++++++++++++++++
fs/nfsd/filecache.h | 4 ++++
fs/nfsd/nfsfh.c | 4 ++++
fs/nfsd/trace.h | 27 +++++++++++++++++++++++++++
include/trace/misc/fs.h | 22 ++++++++++++++++++++++
5 files changed, 91 insertions(+)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 75bc48031c07..78cca0d751ac 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -231,6 +231,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
refcount_set(&nf->nf_ref, 1);
nf->nf_may = need;
nf->nf_mark = NULL;
+ nf->nf_dio_mem_align = 0;
+ nf->nf_dio_offset_align = 0;
+ nf->nf_dio_read_offset_align = 0;
return nf;
}
@@ -1048,6 +1051,35 @@ nfsd_file_is_cached(struct inode *inode)
return ret;
}
+static __be32
+nfsd_file_get_dio_attrs(const struct svc_fh *fhp, struct nfsd_file *nf)
+{
+ struct inode *inode = file_inode(nf->nf_file);
+ struct kstat stat;
+ __be32 status;
+
+ /* Currently only need to get DIO alignment info for regular files */
+ if (!S_ISREG(inode->i_mode))
+ return nfs_ok;
+
+ status = fh_getattr(fhp, &stat);
+ if (status != nfs_ok)
+ return status;
+
+ trace_nfsd_file_get_dio_attrs(inode, &stat);
+
+ if (stat.result_mask & STATX_DIOALIGN) {
+ nf->nf_dio_mem_align = stat.dio_mem_align;
+ nf->nf_dio_offset_align = stat.dio_offset_align;
+ }
+ if (stat.result_mask & STATX_DIO_READ_ALIGN)
+ nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
+ else
+ nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
+
+ return nfs_ok;
+}
+
static __be32
nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
struct svc_cred *cred,
@@ -1166,6 +1198,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
}
status = nfserrno(ret);
trace_nfsd_file_open(nf, status);
+ if (status == nfs_ok)
+ status = nfsd_file_get_dio_attrs(fhp, nf);
}
} else
status = nfserr_jukebox;
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index 24ddf60e8434..e3d6ca2b6030 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -54,6 +54,10 @@ struct nfsd_file {
struct list_head nf_gc;
struct rcu_head nf_rcu;
ktime_t nf_birthtime;
+
+ u32 nf_dio_mem_align;
+ u32 nf_dio_offset_align;
+ u32 nf_dio_read_offset_align;
};
int nfsd_file_cache_init(void);
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 3edccc38db42..3eb724ec9566 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -696,8 +696,12 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
.mnt = fhp->fh_export->ex_path.mnt,
.dentry = fhp->fh_dentry,
};
+ struct inode *inode = d_inode(p.dentry);
u32 request_mask = STATX_BASIC_STATS;
+ if (S_ISREG(inode->i_mode))
+ request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
+
if (fhp->fh_maxsize == NFS4_FHSIZE)
request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index a664fdf1161e..6e2c8e2aab10 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -1133,6 +1133,33 @@ TRACE_EVENT(nfsd_file_alloc,
)
);
+TRACE_EVENT(nfsd_file_get_dio_attrs,
+ TP_PROTO(
+ const struct inode *inode,
+ const struct kstat *stat
+ ),
+ TP_ARGS(inode, stat),
+ TP_STRUCT__entry(
+ __field(const void *, inode)
+ __field(unsigned long, mask)
+ __field(u32, mem_align)
+ __field(u32, offset_align)
+ __field(u32, read_offset_align)
+ ),
+ TP_fast_assign(
+ __entry->inode = inode;
+ __entry->mask = stat->result_mask;
+ __entry->mem_align = stat->dio_mem_align;
+ __entry->offset_align = stat->dio_offset_align;
+ __entry->read_offset_align = stat->dio_read_offset_align;
+ ),
+ TP_printk("inode=%p flags=%s mem_align=%u offset_align=%u read_offset_align=%u",
+ __entry->inode, show_statx_mask(__entry->mask),
+ __entry->mem_align, __entry->offset_align,
+ __entry->read_offset_align
+ )
+);
+
TRACE_EVENT(nfsd_file_acquire,
TP_PROTO(
const struct svc_rqst *rqstp,
diff --git a/include/trace/misc/fs.h b/include/trace/misc/fs.h
index 0406ebe2a80a..7ead1c61f0cb 100644
--- a/include/trace/misc/fs.h
+++ b/include/trace/misc/fs.h
@@ -141,3 +141,25 @@
{ ATTR_TIMES_SET, "TIMES_SET" }, \
{ ATTR_TOUCH, "TOUCH"}, \
{ ATTR_DELEG, "DELEG"})
+
+#define show_statx_mask(flags) \
+ __print_flags(flags, "|", \
+ { STATX_TYPE, "TYPE" }, \
+ { STATX_MODE, "MODE" }, \
+ { STATX_NLINK, "NLINK" }, \
+ { STATX_UID, "UID" }, \
+ { STATX_GID, "GID" }, \
+ { STATX_ATIME, "ATIME" }, \
+ { STATX_MTIME, "MTIME" }, \
+ { STATX_CTIME, "CTIME" }, \
+ { STATX_INO, "INO" }, \
+ { STATX_SIZE, "SIZE" }, \
+ { STATX_BLOCKS, "BLOCKS" }, \
+ { STATX_BASIC_STATS, "BASIC_STATS" }, \
+ { STATX_BTIME, "BTIME" }, \
+ { STATX_MNT_ID, "MNT_ID" }, \
+ { STATX_DIOALIGN, "DIOALIGN" }, \
+ { STATX_MNT_ID_UNIQUE, "MNT_ID_UNIQUE" }, \
+ { STATX_SUBVOL, "SUBVOL" }, \
+ { STATX_WRITE_ATOMIC, "WRITE_ATOMIC" }, \
+ { STATX_DIO_READ_ALIGN, "DIO_READ_ALIGN" })
* [PATCH v2 3/4] NFSD: pass nfsd_file to nfsd_iter_read()
From: Chuck Lever @ 2025-09-17 14:31 UTC (permalink / raw)
To: neil, jlayton, okorniev, dai.ngo, tom; +Cc: linux-nfs
From: Mike Snitzer <snitzer@kernel.org>
Prepare for nfsd_iter_read() to use the DIO alignment stored in
nfsd_file by passing the nfsd_file to nfsd_iter_read() rather than
just the file which is associated with the nfsd_file.
This means nfsd4_encode_readv() now also needs the nfsd_file rather
than the file. Instead of changing the file arg to be the nfsd_file,
we discard the file arg as the nfsd_file (and indeed the file) is
already available via the "read" argument.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfs4xdr.c | 8 ++++----
fs/nfsd/vfs.c | 7 ++++---
fs/nfsd/vfs.h | 2 +-
3 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 97e9e9afa80a..41f0d54d6e1b 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4467,7 +4467,7 @@ static __be32 nfsd4_encode_splice_read(
static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
struct nfsd4_read *read,
- struct file *file, unsigned long maxcount)
+ unsigned long maxcount)
{
struct xdr_stream *xdr = resp->xdr;
unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
@@ -4478,7 +4478,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
if (xdr_reserve_space_vec(xdr, maxcount) < 0)
return nfserr_resource;
- nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
+ nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
read->rd_offset, &maxcount, base,
&read->rd_eof);
read->rd_length = maxcount;
@@ -4525,7 +4525,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr) {
xdr_truncate_encode(xdr, eof_offset);
return nfserr;
@@ -5421,7 +5421,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr)
return nfserr;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 2026431500ec..35880d3f1326 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1078,7 +1078,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
* @fhp: file handle of file to be read
- * @file: opened struct file of file to be read
+ * @nf: opened struct nfsd_file of file to be read
* @offset: starting byte offset
* @count: IN: requested number of bytes; OUT: number of bytes read
* @base: offset in first page of read buffer
@@ -1091,9 +1091,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* returned.
*/
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset, unsigned long *count,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
unsigned int base, u32 *eof)
{
+ struct file *file = nf->nf_file;
unsigned long v, total;
struct iov_iter iter;
struct kiocb kiocb;
@@ -1336,7 +1337,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
else
- err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
+ err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
nfsd_file_put(nf);
trace_nfsd_read_done(rqstp, fhp, offset, *count);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 0c0292611c6d..fa46f8b5f132 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -121,7 +121,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned long *count,
u32 *eof);
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset,
+ struct nfsd_file *nf, loff_t offset,
unsigned long *count, unsigned int base,
u32 *eof);
bool nfsd_read_splice_ok(struct svc_rqst *rqstp);
* [PATCH v2 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
From: Chuck Lever @ 2025-09-17 14:32 UTC (permalink / raw)
To: neil, jlayton, okorniev, dai.ngo, tom; +Cc: linux-nfs
From: Chuck Lever <chuck.lever@oracle.com>
Add an experimental option that forces NFS READ operations to use
direct I/O instead of reading through the NFS server's page cache.
There are already other layers of caching:
- The page cache on NFS clients
- The block device underlying the exported file system
The server's page cache, in many cases, is unlikely to provide
additional benefit. Some benchmarks have demonstrated that the
server's page cache is actively detrimental for workloads whose
working set is larger than the server's available physical memory.
For instance, on small NFS servers, cached NFS file content can
squeeze out local memory consumers. For large sequential workloads,
an enormous amount of data flows into and out of the page cache
and is consumed by NFS clients exactly once -- caching that data
is expensive to do and totally valueless.
For now this is a hidden option that can be enabled on test
systems for benchmarking. In the longer term, this option might
be enabled persistently or per-export. When the exported file
system does not support direct I/O, NFSD falls back to using
either DONTCACHE or buffered I/O to fulfill NFS READ requests.
Suggested-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
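A rough worked example (editor's illustration, hypothetical values) of the
byte-range expansion that nfsd_direct_read() below performs with the
kernel's round_down()/round_up() helpers:

	loff_t offset = 1000;		/* hypothetical client READ offset */
	unsigned long count = 3000;	/* hypothetical client READ length */
	u32 align = 512;		/* nf->nf_dio_read_offset_align */

	loff_t dio_start = round_down(offset, align);		/* 512  */
	loff_t dio_end   = round_up(offset + count, align);	/* 4096 */
	unsigned int pad = offset - dio_start;			/* 488  */

	/*
	 * The aligned read spans dio_end - dio_start = 3584 bytes, more
	 * than the 3000 bytes the client asked for. After the read
	 * completes, the reply payload is trimmed back to count bytes
	 * starting at pad.
	 */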
fs/nfsd/debugfs.c | 2 +
fs/nfsd/nfsd.h | 1 +
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 85 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index ed2b9e066206..00eb1ecef6ac 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
* Contents:
* %0: NFS READ will use buffered IO
* %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
+ * %2: NFS READ will use direct IO
*
* This setting takes immediate effect for all NFS versions,
* all exports, and in all NFSD net namespaces.
@@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
nfsd_io_cache_read = NFSD_IO_BUFFERED;
break;
case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
/*
* Must disable splice_read when enabling
* NFSD_IO_DONTCACHE.
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index ea87b42894dd..bdb60ee1f1a4 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -157,6 +157,7 @@ enum {
/* Any new NFSD_IO enum value must be added at the end */
NFSD_IO_BUFFERED,
NFSD_IO_DONTCACHE,
+ NFSD_IO_DIRECT,
};
extern u64 nfsd_io_cache_read __read_mostly;
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 6e2c8e2aab10..bfd41236aff2 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
DEFINE_NFSD_IO_EVENT(read_start);
DEFINE_NFSD_IO_EVENT(read_splice);
DEFINE_NFSD_IO_EVENT(read_vector);
+DEFINE_NFSD_IO_EVENT(read_direct);
DEFINE_NFSD_IO_EVENT(read_io_done);
DEFINE_NFSD_IO_EVENT(read_done);
DEFINE_NFSD_IO_EVENT(write_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 35880d3f1326..5cd970c1089b 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1074,6 +1074,82 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
+/*
+ * The byte range of the client's READ request is expanded on both
+ * ends until it meets the underlying file system's direct I/O
+ * alignment requirements. After the internal read is complete, the
+ * byte range of the NFS READ payload is reduced to the byte range
+ * that was originally requested.
+ *
+ * Note that a direct read can be done only when the xdr_buf
+ * containing the NFS READ reply does not already have contents in
+ * its .pages array. This is due to potentially restrictive
+ * alignment requirements on the read buffer. When .page_len and
+ * @base are zero, the .pages array is guaranteed to be page-
+ * aligned.
+ */
+static noinline_for_stack __be32
+nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
+ u32 *eof)
+{
+ loff_t dio_start, dio_end;
+ unsigned long v, total;
+ struct iov_iter iter;
+ struct kiocb kiocb;
+ ssize_t host_err;
+ size_t len;
+
+ init_sync_kiocb(&kiocb, nf->nf_file);
+ kiocb.ki_flags |= IOCB_DIRECT;
+
+ /* Read a properly-aligned region of bytes into rq_bvec */
+ dio_start = round_down(offset, nf->nf_dio_read_offset_align);
+ dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
+
+ kiocb.ki_pos = dio_start;
+
+ v = 0;
+ total = *count;
+ while (total && v < rqstp->rq_maxpages &&
+ rqstp->rq_next_page < rqstp->rq_page_end) {
+ len = min_t(size_t, total, PAGE_SIZE);
+ bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
+ len, 0);
+
+ total -= len;
+ ++rqstp->rq_next_page;
+ ++v;
+ }
+
+ trace_nfsd_read_direct(rqstp, fhp, offset, *count - total);
+ iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v,
+ dio_end - dio_start - total);
+
+ host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
+ if (host_err >= 0) {
+ unsigned int pad = offset - dio_start;
+
+ /* The returned payload starts after the pad */
+ rqstp->rq_res.page_base = pad;
+
+ /* Compute the count of bytes to be returned */
+ if (host_err > pad + *count) {
+ host_err = *count;
+ } else if (host_err > pad) {
+ host_err -= pad;
+ } else {
+ host_err = 0;
+ }
+ } else if (unlikely(host_err == -EINVAL)) {
+ pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
+ host_err = -ESERVERFAULT;
+ }
+
+ return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
+ eof, host_err);
+}
+
/**
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
@@ -1106,6 +1182,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
switch (nfsd_io_cache_read) {
case NFSD_IO_BUFFERED:
break;
+ case NFSD_IO_DIRECT:
+ if (nf->nf_dio_read_offset_align && !base)
+ return nfsd_direct_read(rqstp, fhp, nf, offset,
+ count, eof);
+ fallthrough;
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags = IOCB_DONTCACHE;
* Re: [PATCH v2 1/4] NFSD: Add array bounds-checking in nfsd_iter_read()
From: Mike Snitzer @ 2025-09-17 17:51 UTC (permalink / raw)
To: Chuck Lever; +Cc: neil, jlayton, okorniev, dai.ngo, tom, linux-nfs
On Wed, Sep 17, 2025 at 10:31:40AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> The *count parameter does not appear to be explicitly restricted
> to being smaller than rsize, so it might be possible to overrun
> the rq_bvec or rq_pages arrays.
>
> Rather than overrunning these arrays (damage done!) and then WARNING
> once, let's harden the loop so that it terminates before the end of
> the arrays are reached. This should result in a short read, which is
> OK -- clients recover by sending additional READ requests for the
> remaining unread bytes.
>
> Reported-by: NeilBrown <neil@brown.name>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Neil and I both gave our Reviewed-by on this one yesterday (when you
posted it on its own), but for good measure:
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Thanks.
> ---
> fs/nfsd/vfs.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 714777c221ed..2026431500ec 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1115,18 +1115,20 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> v = 0;
> total = *count;
> - while (total) {
> + while (total && v < rqstp->rq_maxpages &&
> + rqstp->rq_next_page < rqstp->rq_page_end) {
> len = min_t(size_t, total, PAGE_SIZE - base);
> - bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> + bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> len, base);
> +
> total -= len;
> + ++rqstp->rq_next_page;
> ++v;
> base = 0;
> }
> - WARN_ON_ONCE(v > rqstp->rq_maxpages);
>
> - trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> - iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> + trace_nfsd_read_vector(rqstp, fhp, offset, *count - total);
> + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count - total);
> host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> }
>
>
>
* Re: [PATCH v2 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
From: NeilBrown @ 2025-09-17 23:29 UTC (permalink / raw)
To: Chuck Lever; +Cc: jlayton, okorniev, dai.ngo, tom, linux-nfs
On Thu, 18 Sep 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Add an experimental option that forces NFS READ operations to use
> direct I/O instead of reading through the NFS server's page cache.
>
> There are already other layers of caching:
> - The page cache on NFS clients
> - The block device underlying the exported file system
>
> The server's page cache, in many cases, is unlikely to provide
> additional benefit. Some benchmarks have demonstrated that the
> server's page cache is actively detrimental for workloads whose
> working set is larger than the server's available physical memory.
>
> For instance, on small NFS servers, cached NFS file content can
> squeeze out local memory consumers. For large sequential workloads,
> an enormous amount of data flows into and out of the page cache
> and is consumed by NFS clients exactly once -- caching that data
> is expensive to do and totally valueless.
>
> For now this is a hidden option that can be enabled on test
> systems for benchmarking. In the longer term, this option might
> be enabled persistently or per-export. When the exported file
> system does not support direct I/O, NFSD falls back to using
> either DONTCACHE or buffered I/O to fulfill NFS READ requests.
>
> Suggested-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/debugfs.c | 2 +
> fs/nfsd/nfsd.h | 1 +
> fs/nfsd/trace.h | 1 +
> fs/nfsd/vfs.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 85 insertions(+)
>
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index ed2b9e066206..00eb1ecef6ac 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> * Contents:
> * %0: NFS READ will use buffered IO
> * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> + * %2: NFS READ will use direct IO
> *
> * This setting takes immediate effect for all NFS versions,
> * all exports, and in all NFSD net namespaces.
> @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> nfsd_io_cache_read = NFSD_IO_BUFFERED;
> break;
> case NFSD_IO_DONTCACHE:
> + case NFSD_IO_DIRECT:
> /*
> * Must disable splice_read when enabling
> * NFSD_IO_DONTCACHE.
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index ea87b42894dd..bdb60ee1f1a4 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -157,6 +157,7 @@ enum {
> /* Any new NFSD_IO enum value must be added at the end */
> NFSD_IO_BUFFERED,
> NFSD_IO_DONTCACHE,
> + NFSD_IO_DIRECT,
> };
>
> extern u64 nfsd_io_cache_read __read_mostly;
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index 6e2c8e2aab10..bfd41236aff2 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
> DEFINE_NFSD_IO_EVENT(read_start);
> DEFINE_NFSD_IO_EVENT(read_splice);
> DEFINE_NFSD_IO_EVENT(read_vector);
> +DEFINE_NFSD_IO_EVENT(read_direct);
> DEFINE_NFSD_IO_EVENT(read_io_done);
> DEFINE_NFSD_IO_EVENT(read_done);
> DEFINE_NFSD_IO_EVENT(write_start);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 35880d3f1326..5cd970c1089b 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1074,6 +1074,82 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> }
>
> +/*
> + * The byte range of the client's READ request is expanded on both
> + * ends until it meets the underlying file system's direct I/O
> + * alignment requirements. After the internal read is complete, the
> + * byte range of the NFS READ payload is reduced to the byte range
> + * that was originally requested.
> + *
> + * Note that a direct read can be done only when the xdr_buf
> + * containing the NFS READ reply does not already have contents in
> + * its .pages array. This is due to potentially restrictive
> + * alignment requirements on the read buffer. When .page_len and
> + * @base are zero, the .pages array is guaranteed to be page-
> + * aligned.
This para is confusing.
It starts talking about the xdr_buf not having any contents. Then it
transitions to a guarantee of page alignment.
If the start of the read request isn't sufficiently aligned then a gap
will be created in the xdr_buf and that can only be handled at the start
(using page_base).
So as you say we need page_len to be zero. But nowhere in the code is
this condition tested.
The closest is "!base" before the call to nfsd_direct_read() but when
called from nfsd4_encode_readv()
base = xdr->buf->page_len & ~PAGE_MASK;
so ->page_len could be non-zero despite base being zero.
NeilBrown
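To make the case Neil describes concrete (an illustration with
hypothetical values, PAGE_SIZE = 4096): suppose an earlier READ in the
same COMPOUND already placed 8192 bytes into xdr->buf->pages. Then:

	base = xdr->buf->page_len & ~PAGE_MASK
	     = 8192 & 4095
	     = 0

so the "!base" test passes even though the .pages array already has
contents -- exactly the situation the comment says must be excluded.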
> + */
> +static noinline_for_stack __be32
> +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + struct nfsd_file *nf, loff_t offset, unsigned long *count,
> + u32 *eof)
> +{
> + loff_t dio_start, dio_end;
> + unsigned long v, total;
> + struct iov_iter iter;
> + struct kiocb kiocb;
> + ssize_t host_err;
> + size_t len;
> +
> + init_sync_kiocb(&kiocb, nf->nf_file);
> + kiocb.ki_flags |= IOCB_DIRECT;
> +
> + /* Read a properly-aligned region of bytes into rq_bvec */
> + dio_start = round_down(offset, nf->nf_dio_read_offset_align);
> + dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
> +
> + kiocb.ki_pos = dio_start;
> +
> + v = 0;
> + total = *count;
> + while (total && v < rqstp->rq_maxpages &&
> + rqstp->rq_next_page < rqstp->rq_page_end) {
> + len = min_t(size_t, total, PAGE_SIZE);
> + bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> + len, 0);
> +
> + total -= len;
> + ++rqstp->rq_next_page;
> + ++v;
> + }
> +
> + trace_nfsd_read_direct(rqstp, fhp, offset, *count - total);
> + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v,
> + dio_end - dio_start - total);
> +
> + host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
> + if (host_err >= 0) {
> + unsigned int pad = offset - dio_start;
> +
> + /* The returned payload starts after the pad */
> + rqstp->rq_res.page_base = pad;
> +
> + /* Compute the count of bytes to be returned */
> + if (host_err > pad + *count) {
> + host_err = *count;
> + } else if (host_err > pad) {
> + host_err -= pad;
> + } else {
> + host_err = 0;
> + }
> + } else if (unlikely(host_err == -EINVAL)) {
> + pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
> + host_err = -ESERVERFAULT;
> + }
> +
> + return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
> + eof, host_err);
> +}
> +
> /**
> * nfsd_iter_read - Perform a VFS read using an iterator
> * @rqstp: RPC transaction context
> @@ -1106,6 +1182,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> switch (nfsd_io_cache_read) {
> case NFSD_IO_BUFFERED:
> break;
> + case NFSD_IO_DIRECT:
> + if (nf->nf_dio_read_offset_align && !base)
> + return nfsd_direct_read(rqstp, fhp, nf, offset,
> + count, eof);
> + fallthrough;
> case NFSD_IO_DONTCACHE:
> if (file->f_op->fop_flags & FOP_DONTCACHE)
> kiocb.ki_flags = IOCB_DONTCACHE;
>
>
>
* Re: [PATCH v2 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
From: Mike Snitzer @ 2025-09-18 14:50 UTC (permalink / raw)
To: NeilBrown; +Cc: Chuck Lever, jlayton, okorniev, dai.ngo, tom, linux-nfs
On Thu, Sep 18, 2025 at 09:29:48AM +1000, NeilBrown wrote:
> On Thu, 18 Sep 2025, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> >
> > Add an experimental option that forces NFS READ operations to use
> > direct I/O instead of reading through the NFS server's page cache.
> >
> > There are already other layers of caching:
> > - The page cache on NFS clients
> > - The block device underlying the exported file system
> >
> > The server's page cache, in many cases, is unlikely to provide
> > additional benefit. Some benchmarks have demonstrated that the
> > server's page cache is actively detrimental for workloads whose
> > working set is larger than the server's available physical memory.
> >
> > For instance, on small NFS servers, cached NFS file content can
> > squeeze out local memory consumers. For large sequential workloads,
> > an enormous amount of data flows into and out of the page cache
> > and is consumed by NFS clients exactly once -- caching that data
> > is expensive to do and totally valueless.
> >
> > For now this is a hidden option that can be enabled on test
> > systems for benchmarking. In the longer term, this option might
> > be enabled persistently or per-export. When the exported file
> > system does not support direct I/O, NFSD falls back to using
> > either DONTCACHE or buffered I/O to fulfill NFS READ requests.
> >
> > Suggested-by: Mike Snitzer <snitzer@kernel.org>
> > Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> > fs/nfsd/debugfs.c | 2 +
> > fs/nfsd/nfsd.h | 1 +
> > fs/nfsd/trace.h | 1 +
> > fs/nfsd/vfs.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 4 files changed, 85 insertions(+)
> >
> > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > index ed2b9e066206..00eb1ecef6ac 100644
> > --- a/fs/nfsd/debugfs.c
> > +++ b/fs/nfsd/debugfs.c
> > @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> > * Contents:
> > * %0: NFS READ will use buffered IO
> > * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> > + * %2: NFS READ will use direct IO
> > *
> > * This setting takes immediate effect for all NFS versions,
> > * all exports, and in all NFSD net namespaces.
> > @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> > nfsd_io_cache_read = NFSD_IO_BUFFERED;
> > break;
> > case NFSD_IO_DONTCACHE:
> > + case NFSD_IO_DIRECT:
> > /*
> > * Must disable splice_read when enabling
> > * NFSD_IO_DONTCACHE.
> > diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> > index ea87b42894dd..bdb60ee1f1a4 100644
> > --- a/fs/nfsd/nfsd.h
> > +++ b/fs/nfsd/nfsd.h
> > @@ -157,6 +157,7 @@ enum {
> > /* Any new NFSD_IO enum value must be added at the end */
> > NFSD_IO_BUFFERED,
> > NFSD_IO_DONTCACHE,
> > + NFSD_IO_DIRECT,
> > };
> >
> > extern u64 nfsd_io_cache_read __read_mostly;
> > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > index 6e2c8e2aab10..bfd41236aff2 100644
> > --- a/fs/nfsd/trace.h
> > +++ b/fs/nfsd/trace.h
> > @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
> > DEFINE_NFSD_IO_EVENT(read_start);
> > DEFINE_NFSD_IO_EVENT(read_splice);
> > DEFINE_NFSD_IO_EVENT(read_vector);
> > +DEFINE_NFSD_IO_EVENT(read_direct);
> > DEFINE_NFSD_IO_EVENT(read_io_done);
> > DEFINE_NFSD_IO_EVENT(read_done);
> > DEFINE_NFSD_IO_EVENT(write_start);
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 35880d3f1326..5cd970c1089b 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -1074,6 +1074,82 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > }
> >
> > +/*
> > + * The byte range of the client's READ request is expanded on both
> > + * ends until it meets the underlying file system's direct I/O
> > + * alignment requirements. After the internal read is complete, the
> > + * byte range of the NFS READ payload is reduced to the byte range
> > + * that was originally requested.
> > + *
> > + * Note that a direct read can be done only when the xdr_buf
> > + * containing the NFS READ reply does not already have contents in
> > + * its .pages array. This is due to potentially restrictive
> > + * alignment requirements on the read buffer. When .page_len and
> > + * @base are zero, the .pages array is guaranteed to be page-
> > + * aligned.
>
> This para is confusing.
> It starts talking about the xdr_buf not having any contents. Then it
> transitions to a guarantee of page alignment.
>
> If the start of the read requests isn't sufficiently aligned then a gap
> will be created in the xdr_buf and that can only be handled at the start
> (using page_base).
>
> So as you say we need page_len to be zero. But nowhere in the code is
> this condition tested.
>
> The closest is "!base" before the call to nfsd_direct_read() but when
> called from nfsd4_encode_readv()
>
> base = xdr->buf->page_len & ~PAGE_MASK;
>
> so ->page_len could be non-zero despite base being zero.
Hi Neil,
If we verify that base is aligned relative to nf->nf_dio_mem_align, this
incremental change should avoid the concern entirely, right?
[I've verified all my tests pass with this change]
[Chuck, please feel free to fold this in if you agree]
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
fs/nfsd/vfs.c | 19 +++++++------------
1 file changed, 7 insertions(+), 12 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 4694a26238c7..d21b4fb565a6 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1088,18 +1088,11 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* alignment requirements. After the internal read is complete, the
* byte range of the NFS READ payload is reduced to the byte range
* that was originally requested.
- *
- * Note that a direct read can be done only when the xdr_buf
- * containing the NFS READ reply does not already have contents in
- * its .pages array. This is due to potentially restrictive
- * alignment requirements on the read buffer. When .page_len and
- * @base are zero, the .pages array is guaranteed to be page-
- * aligned.
*/
static noinline_for_stack __be32
nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct nfsd_file *nf, loff_t offset, unsigned long *count,
- u32 *eof)
+ unsigned int base, u32 *eof)
{
loff_t dio_start, dio_end;
unsigned long v, total;
@@ -1121,13 +1114,14 @@ nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
total = *count;
while (total && v < rqstp->rq_maxpages &&
rqstp->rq_next_page < rqstp->rq_page_end) {
- len = min_t(size_t, total, PAGE_SIZE);
+ len = min_t(size_t, total, PAGE_SIZE - base);
bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
- len, 0);
+ len, base);
total -= len;
++rqstp->rq_next_page;
++v;
+ base = 0;
}
trace_nfsd_read_direct(rqstp, fhp, offset, *count - total);
@@ -1191,9 +1185,10 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
case NFSD_IO_BUFFERED:
break;
case NFSD_IO_DIRECT:
- if (nf->nf_dio_read_offset_align && !base)
+ if (nf->nf_dio_read_offset_align && nf->nf_dio_mem_align &&
+ (base & (nf->nf_dio_mem_align-1)) == 0)
return nfsd_direct_read(rqstp, fhp, nf, offset,
- count, eof);
+ count, base, eof);
fallthrough;
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
* Re: [PATCH v2 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
From: Mike Snitzer @ 2025-09-18 15:20 UTC (permalink / raw)
To: NeilBrown; +Cc: Chuck Lever, jlayton, okorniev, dai.ngo, tom, linux-nfs
On Thu, Sep 18, 2025 at 10:50:57AM -0400, Mike Snitzer wrote:
> On Thu, Sep 18, 2025 at 09:29:48AM +1000, NeilBrown wrote:
> > On Thu, 18 Sep 2025, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > >
> > > Add an experimental option that forces NFS READ operations to use
> > > direct I/O instead of reading through the NFS server's page cache.
> > >
> > > There are already other layers of caching:
> > > - The page cache on NFS clients
> > > - The block device underlying the exported file system
> > >
> > > The server's page cache, in many cases, is unlikely to provide
> > > additional benefit. Some benchmarks have demonstrated that the
> > > server's page cache is actively detrimental for workloads whose
> > > working set is larger than the server's available physical memory.
> > >
> > > For instance, on small NFS servers, cached NFS file content can
> > > squeeze out local memory consumers. For large sequential workloads,
> > > an enormous amount of data flows into and out of the page cache
> > > and is consumed by NFS clients exactly once -- caching that data
> > > is expensive to do and totally valueless.
> > >
> > > For now this is a hidden option that can be enabled on test
> > > systems for benchmarking. In the longer term, this option might
> > > be enabled persistently or per-export. When the exported file
> > > system does not support direct I/O, NFSD falls back to using
> > > either DONTCACHE or buffered I/O to fulfill NFS READ requests.
> > >
> > > Suggested-by: Mike Snitzer <snitzer@kernel.org>
> > > Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> > > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > ---
> > > fs/nfsd/debugfs.c | 2 +
> > > fs/nfsd/nfsd.h | 1 +
> > > fs/nfsd/trace.h | 1 +
> > > fs/nfsd/vfs.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > 4 files changed, 85 insertions(+)
> > >
> > > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > > index ed2b9e066206..00eb1ecef6ac 100644
> > > --- a/fs/nfsd/debugfs.c
> > > +++ b/fs/nfsd/debugfs.c
> > > @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> > > * Contents:
> > > * %0: NFS READ will use buffered IO
> > > * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> > > + * %2: NFS READ will use direct IO
> > > *
> > > * This setting takes immediate effect for all NFS versions,
> > > * all exports, and in all NFSD net namespaces.
> > > @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> > > nfsd_io_cache_read = NFSD_IO_BUFFERED;
> > > break;
> > > case NFSD_IO_DONTCACHE:
> > > + case NFSD_IO_DIRECT:
> > > /*
> > > * Must disable splice_read when enabling
> > > * NFSD_IO_DONTCACHE.
> > > diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> > > index ea87b42894dd..bdb60ee1f1a4 100644
> > > --- a/fs/nfsd/nfsd.h
> > > +++ b/fs/nfsd/nfsd.h
> > > @@ -157,6 +157,7 @@ enum {
> > > /* Any new NFSD_IO enum value must be added at the end */
> > > NFSD_IO_BUFFERED,
> > > NFSD_IO_DONTCACHE,
> > > + NFSD_IO_DIRECT,
> > > };
> > >
> > > extern u64 nfsd_io_cache_read __read_mostly;
> > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > > index 6e2c8e2aab10..bfd41236aff2 100644
> > > --- a/fs/nfsd/trace.h
> > > +++ b/fs/nfsd/trace.h
> > > @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
> > > DEFINE_NFSD_IO_EVENT(read_start);
> > > DEFINE_NFSD_IO_EVENT(read_splice);
> > > DEFINE_NFSD_IO_EVENT(read_vector);
> > > +DEFINE_NFSD_IO_EVENT(read_direct);
> > > DEFINE_NFSD_IO_EVENT(read_io_done);
> > > DEFINE_NFSD_IO_EVENT(read_done);
> > > DEFINE_NFSD_IO_EVENT(write_start);
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index 35880d3f1326..5cd970c1089b 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -1074,6 +1074,82 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > > }
> > >
> > > +/*
> > > + * The byte range of the client's READ request is expanded on both
> > > + * ends until it meets the underlying file system's direct I/O
> > > + * alignment requirements. After the internal read is complete, the
> > > + * byte range of the NFS READ payload is reduced to the byte range
> > > + * that was originally requested.
> > > + *
> > > + * Note that a direct read can be done only when the xdr_buf
> > > + * containing the NFS READ reply does not already have contents in
> > > + * its .pages array. This is due to potentially restrictive
> > > + * alignment requirements on the read buffer. When .page_len and
> > > + * @base are zero, the .pages array is guaranteed to be page-
> > > + * aligned.
> >
> > This para is confusing.
> > It starts talking about the xdr_buf not having any contents. Then it
> > transitions to a guarantee of page alignment.
> >
> > If the start of the read requests isn't sufficiently aligned then a gap
> > will be created in the xdr_buf and that can only be handled at the start
> > (using page_base).
> >
> > So as you say we need page_len to be zero. But nowhere in the code is
> > this condition tested.
> >
> > The closest is "!base" before the call to nfsd_direct_read() but when
> > called from nfsd4_encode_readv()
> >
> > base = xdr->buf->page_len & ~PAGE_MASK;
> >
> > so ->page_len could be non-zero despite base being zero.
>
> Hi Neil,
>
> If we verify base is aligned relative to nf->nf_dio_mem_align; this
> incremental change should avoid the concern entirely right?
>
> [I've verified all my tests pass with this change]
It helps, when testing NFSD, not to have LOCALIO enabled...
please disregard my patch ;)
The patch I provided doesn't work: it allows the iov_iter to contain
misaligned pages, and xfs_file_read_iter->iomap_dio_rw crashes (easily
remedied by checking the iov_iter's alignment). But perhaps it is best
to just refine the check that prevents calling into nfsd_direct_read()
(by explicitly checking page_len)?
Thanks,
Mike
* Re: [PATCH v2 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
From: Mike Snitzer @ 2025-09-18 16:29 UTC (permalink / raw)
To: Chuck Lever; +Cc: neil, jlayton, okorniev, dai.ngo, tom, linux-nfs
On Wed, Sep 17, 2025 at 10:32:00AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Add an experimental option that forces NFS READ operations to use
> direct I/O instead of reading through the NFS server's page cache.
>
> There are already other layers of caching:
> - The page cache on NFS clients
> - The block device underlying the exported file system
>
> The server's page cache, in many cases, is unlikely to provide
> additional benefit. Some benchmarks have demonstrated that the
> server's page cache is actively detrimental for workloads whose
> working set is larger than the server's available physical memory.
>
> For instance, on small NFS servers, cached NFS file content can
> squeeze out local memory consumers. For large sequential workloads,
> an enormous amount of data flows into and out of the page cache
> and is consumed by NFS clients exactly once -- caching that data
> is expensive to do and totally valueless.
>
> For now this is a hidden option that can be enabled on test
> systems for benchmarking. In the longer term, this option might
> be enabled persistently or per-export. When the exported file
> system does not support direct I/O, NFSD falls back to using
> either DONTCACHE or buffered I/O to fulfill NFS READ requests.
>
> Suggested-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/debugfs.c | 2 +
> fs/nfsd/nfsd.h | 1 +
> fs/nfsd/trace.h | 1 +
> fs/nfsd/vfs.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 85 insertions(+)
>
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index ed2b9e066206..00eb1ecef6ac 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> * Contents:
> * %0: NFS READ will use buffered IO
> * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> + * %2: NFS READ will use direct IO
> *
> * This setting takes immediate effect for all NFS versions,
> * all exports, and in all NFSD net namespaces.
> @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> nfsd_io_cache_read = NFSD_IO_BUFFERED;
> break;
> case NFSD_IO_DONTCACHE:
> + case NFSD_IO_DIRECT:
> /*
> * Must disable splice_read when enabling
> * NFSD_IO_DONTCACHE.
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index ea87b42894dd..bdb60ee1f1a4 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -157,6 +157,7 @@ enum {
> /* Any new NFSD_IO enum value must be added at the end */
> NFSD_IO_BUFFERED,
> NFSD_IO_DONTCACHE,
> + NFSD_IO_DIRECT,
> };
>
> extern u64 nfsd_io_cache_read __read_mostly;
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index 6e2c8e2aab10..bfd41236aff2 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
> DEFINE_NFSD_IO_EVENT(read_start);
> DEFINE_NFSD_IO_EVENT(read_splice);
> DEFINE_NFSD_IO_EVENT(read_vector);
> +DEFINE_NFSD_IO_EVENT(read_direct);
> DEFINE_NFSD_IO_EVENT(read_io_done);
> DEFINE_NFSD_IO_EVENT(read_done);
> DEFINE_NFSD_IO_EVENT(write_start);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 35880d3f1326..5cd970c1089b 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1074,6 +1074,82 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> }
>
> +/*
> + * The byte range of the client's READ request is expanded on both
> + * ends until it meets the underlying file system's direct I/O
> + * alignment requirements. After the internal read is complete, the
> + * byte range of the NFS READ payload is reduced to the byte range
> + * that was originally requested.
> + *
> + * Note that a direct read can be done only when the xdr_buf
> + * containing the NFS READ reply does not already have contents in
> + * its .pages array. This is due to potentially restrictive
> + * alignment requirements on the read buffer. When .page_len and
> + * @base are zero, the .pages array is guaranteed to be page-
> + * aligned.
> + */
> +static noinline_for_stack __be32
> +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + struct nfsd_file *nf, loff_t offset, unsigned long *count,
> + u32 *eof)
> +{
> + loff_t dio_start, dio_end;
> + unsigned long v, total;
> + struct iov_iter iter;
> + struct kiocb kiocb;
> + ssize_t host_err;
> + size_t len;
> +
> + init_sync_kiocb(&kiocb, nf->nf_file);
> + kiocb.ki_flags |= IOCB_DIRECT;
> +
> + /* Read a properly-aligned region of bytes into rq_bvec */
> + dio_start = round_down(offset, nf->nf_dio_read_offset_align);
> + dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
> +
> + kiocb.ki_pos = dio_start;
> +
> + v = 0;
> + total = *count;
Hi Chuck,
Looks like you introduced a copy-n-paste bug when updating
nfsd_direct_read's while loop to follow nfsd_iter_read's lead.
Should be:
total = dio_end - dio_start;
[NOTE: this was the reason I saw a crash with my incremental patch
that handled 'base', see:
https://lore.kernel.org/linux-nfs/aMwcUdWdey69k2iK@kernel.org/
]
Thanks,
Mike
> + while (total && v < rqstp->rq_maxpages &&
> + rqstp->rq_next_page < rqstp->rq_page_end) {
> + len = min_t(size_t, total, PAGE_SIZE);
> + bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> + len, 0);
> +
> + total -= len;
> + ++rqstp->rq_next_page;
> + ++v;
> + }
> +
> + trace_nfsd_read_direct(rqstp, fhp, offset, *count - total);
> + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v,
> + dio_end - dio_start - total);
> +
> + host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
> + if (host_err >= 0) {
> + unsigned int pad = offset - dio_start;
> +
> + /* The returned payload starts after the pad */
> + rqstp->rq_res.page_base = pad;
> +
> + /* Compute the count of bytes to be returned */
> + if (host_err > pad + *count) {
> + host_err = *count;
> + } else if (host_err > pad) {
> + host_err -= pad;
> + } else {
> + host_err = 0;
> + }
> + } else if (unlikely(host_err == -EINVAL)) {
> + pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
> + host_err = -ESERVERFAULT;
> + }
> +
> + return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
> + eof, host_err);
> +}
> +
> /**
> * nfsd_iter_read - Perform a VFS read using an iterator
> * @rqstp: RPC transaction context
> @@ -1106,6 +1182,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> switch (nfsd_io_cache_read) {
> case NFSD_IO_BUFFERED:
> break;
> + case NFSD_IO_DIRECT:
> + if (nf->nf_dio_read_offset_align && !base)
> + return nfsd_direct_read(rqstp, fhp, nf, offset,
> + count, eof);
> + fallthrough;
> case NFSD_IO_DONTCACHE:
> if (file->f_op->fop_flags & FOP_DONTCACHE)
> kiocb.ki_flags = IOCB_DONTCACHE;
>
>
>
* Re: [PATCH v2 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
From: Chuck Lever @ 2025-09-18 18:27 UTC (permalink / raw)
To: Mike Snitzer; +Cc: neil, jlayton, okorniev, dai.ngo, tom, linux-nfs
On 9/18/25 9:29 AM, Mike Snitzer wrote:
>> + /* Read a properly-aligned region of bytes into rq_bvec */
>> + dio_start = round_down(offset, nf->nf_dio_read_offset_align);
>> + dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
>> +
>> + kiocb.ki_pos = dio_start;
>> +
>> + v = 0;
>> + total = *count;
> Hi Chuck,
>
> Looks like you introduced a copy-n-paste bug when updating
> nfsd_direct_read's while loop to follow nfsd_iter_read's lead.
>
> Should be:
> total = dio_end - dio_start;
>
> [NOTE: this was the reason I saw a crash with my incremental patch
> that handled 'base', see:
> https://lore.kernel.org/linux-nfs/aMwcUdWdey69k2iK@kernel.org/
> ]
Indeed. Fixed in my tree, I will push it in a moment.
I'm traveling this week, so I don't have an opportunity to do more
than compile-test these before posting/pushing.
--
Chuck Lever
* Re: [PATCH v2 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
From: Chuck Lever @ 2025-09-18 18:42 UTC (permalink / raw)
To: NeilBrown; +Cc: jlayton, okorniev, dai.ngo, tom, linux-nfs
On 9/17/25 4:29 PM, NeilBrown wrote:
>> +/*
>> + * The byte range of the client's READ request is expanded on both
>> + * ends until it meets the underlying file system's direct I/O
>> + * alignment requirements. After the internal read is complete, the
>> + * byte range of the NFS READ payload is reduced to the byte range
>> + * that was originally requested.
>> + *
>> + * Note that a direct read can be done only when the xdr_buf
>> + * containing the NFS READ reply does not already have contents in
>> + * its .pages array. This is due to potentially restrictive
>> + * alignment requirements on the read buffer. When .page_len and
>> + * @base are zero, the .pages array is guaranteed to be page-
>> + * aligned.
> This para is confusing.
> It starts talking about the xdr_buf not having any contents. Then it
> transitions to a guarantee of page alignment.
>
> If the start of the read requests isn't sufficiently aligned then a gap
> will be created in the xdr_buf and that can only be handled at the start
> (using page_base).
>
> So as you say we need page_len to be zero. But nowhere in the code is
> this condition tested.
Despite what the comment claims, I had thought that things would work if
the payload started at a page boundary in xdr_buf.pages. But I can see
that page_base applies only to the first entry in xdr_buf.pages.
So xdr_buf.page_len does need to be zero. That check can be added in
nfsd_iter_read().
I prefer this approach over more elaborate checking against the
dio_mem_alignment parameter because for the overwhelmingly common cases
of both NFSv3 READ and NFSv4 COMPOUND with one READ operation, page_len
will be zero. The extra complication is hard to unit-test and will
almost never be used.
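One possible shape for that check (a sketch only, assuming the reply
xdr_buf is reachable at this point in nfsd_iter_read() as rqstp->rq_res;
not the final fix):

	case NFSD_IO_DIRECT:
		/*
		 * Sketch: also require that the reply's .pages array is
		 * still empty before taking the direct path.
		 */
		if (nf->nf_dio_read_offset_align && !base &&
		    rqstp->rq_res.page_len == 0)
			return nfsd_direct_read(rqstp, fhp, nf, offset,
						count, eof);
		fallthrough;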
> The closest is "!base" before the call to nfsd_direct_read() but when
> called from nfsd4_encode_readv()
>
> base = xdr->buf->page_len & ~PAGE_MASK;
>
> so ->page_len could be non-zero despite base being zero.
--
Chuck Lever
* Re: [PATCH v2 4/4] NFSD: Implement NFSD_IO_DIRECT for NFS READ
From: Mike Snitzer @ 2025-09-18 19:01 UTC (permalink / raw)
To: Chuck Lever; +Cc: NeilBrown, jlayton, okorniev, dai.ngo, tom, linux-nfs
On Thu, Sep 18, 2025 at 11:42:03AM -0700, Chuck Lever wrote:
> On 9/17/25 4:29 PM, NeilBrown wrote:
> >> +/*
> >> + * The byte range of the client's READ request is expanded on both
> >> + * ends until it meets the underlying file system's direct I/O
> >> + * alignment requirements. After the internal read is complete, the
> >> + * byte range of the NFS READ payload is reduced to the byte range
> >> + * that was originally requested.
> >> + *
> >> + * Note that a direct read can be done only when the xdr_buf
> >> + * containing the NFS READ reply does not already have contents in
> >> + * its .pages array. This is due to potentially restrictive
> >> + * alignment requirements on the read buffer. When .page_len and
> >> + * @base are zero, the .pages array is guaranteed to be page-
> >> + * aligned.
> > This para is confusing.
> > It starts talking about the xdr_buf not having any contents. Then it
> > transitions to a guarantee of page alignment.
> >
> > If the start of the read requests isn't sufficiently aligned then a gap
> > will be created in the xdr_buf and that can only be handled at the start
> > (using page_base).
> >
> > So as you say we need page_len to be zero. But nowhere in the code is
> > this condition tested.
>
> Despite what the comment claims, I had thought that things would work if
> the payload started at a page boundary in xdr_buf.pages. But I can see
> that page_offset applies only to the first entry in xdr_buf.pages.
>
> So xdr_buf.page_len does need to be zero. That check can be added in
> nfsd_iter_read().
I had a look at trying to do that; it wasn't obvious, given that
fs/nfsd/vfs.c doesn't have any direct access to (or need for) the
xdr_buf. But I agree that just adding that check is ideal at this point.
> I prefer this approach over more elaborate checking against the
> dio_mem_alignment parameter because for the overwhelmingly common cases
> of both NFSv3 READ and NFSv4 COMPOUND with one READ operation, page_len
> will be zero. The extra complication is hard to unit-test and will
> almost never be used.
Extra checking might be interesting so that nfsd_iov_iter_aligned_bvec()
is used (I've found it's needed for the WRITE support). Maybe an
additional CONFIG_NFSD_DIRECT_VERIFY_DIO_ALIGNMENT option? It might seem
like overkill, but I've found the iov_iter checking has really saved me.
It's one of those things you really only need during development, to
verify you didn't somehow miss something, but for release we don't
need/want it.
Anyway, just a "futures" tangent... nothing actionable for this patch ;)
Mike
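A minimal sketch of the kind of debug-only check being described,
using the existing iov_iter_alignment() helper inside nfsd_direct_read()
just before the direct read is issued; the config option name is
hypothetical, and nfsd_iov_iter_aligned_bvec() from the WRITE series
could take its place:

	/* Hypothetical config option; iov_iter_alignment() exists today. */
	if (IS_ENABLED(CONFIG_NFSD_DIRECT_VERIFY_DIO_ALIGNMENT) &&
	    nf->nf_dio_mem_align &&
	    WARN_ON_ONCE(iov_iter_alignment(&iter) &
			 (nf->nf_dio_mem_align - 1)))
		return nfserr_serverfault;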