* [PATCH v6 0/6] NFSD direct I/O read
@ 2025-10-08 13:52 Chuck Lever
2025-10-08 13:52 ` [PATCH v6 1/6] nfsd: fix refcount leak in nfsd_set_fh_dentry() Chuck Lever
` (5 more replies)
0 siblings, 6 replies; 10+ messages in thread
From: Chuck Lever @ 2025-10-08 13:52 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
The goal is to get the experimental read-side direct I/O
implementation merged sooner. We are still thinking through the
implications of mixing direct and buffered I/O when handling an
NFS WRITE that does not meet the dio alignment requirements.
Changes since v5:
* Rebased on what's now in v6.18-rc1
* getattr failures now properly cleaned up
* Fix patches have been moved to the front of the series
Changes since v4:
* Additional Reviewed-by's; applying v5 to nfsd-testing
* Address Christoph's comments on 4/4
* Suggest a couple of clean-ups for 1/4
Changes since v3:
* Move xdr_reserve_space_vec() call to preserve the page_len value
* Note that "add STATX_DIOALIGN and STATX_DIO_READ_ALIGN ..."
remains exactly the same as it was in the v3 series
Changes since v2:
* "Add array bounds-checking..." has been applied to nfsd-testing
* Add a page_len check before committing to use direct I/O
Changes since v1:
* Harden the loop that constructs the I/O bvec
* Address review comments
Changes from Mike's v9:
* The LOC introduced by the feature has been reduced considerably.
* A new trace point in nfsd_file_getattr reports each file's dio
alignment parameters when it is opened.
* The direct I/O path has been taken out-of-line so that it may
continue to be modified and optimized without perturbing the more
commonly-used I/O paths.
* When an exported file system does not implement direct I/O, more
commonly-used modes are employed instead to avoid returning
EOPNOTSUPP unexpectedly.
* When NFSD_IO_DIRECT is selected, NFS READs of all sizes use direct
I/O to provide better experimental data about small I/O workloads.
Chuck Lever (4):
NFSD: Prevent a NULL pointer dereference in fh_getattr()
NFSD: Recover from vfs_getattr() failure in nfsd_file_get_dio_attrs()
NFSD: Relocate the xdr_reserve_space_vec() call site
NFSD: Implement NFSD_IO_DIRECT for NFS READ
Mike Snitzer (1):
NFSD: pass nfsd_file to nfsd_iter_read()
NeilBrown (1):
nfsd: fix refcount leak in nfsd_set_fh_dentry()
fs/nfsd/debugfs.c | 2 +
fs/nfsd/filecache.c | 5 ++-
fs/nfsd/nfs4xdr.c | 28 ++++++++++----
fs/nfsd/nfsd.h | 1 +
fs/nfsd/nfsfh.c | 9 ++---
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 90 +++++++++++++++++++++++++++++++++++++++++++--
fs/nfsd/vfs.h | 2 +-
8 files changed, 120 insertions(+), 18 deletions(-)
--
2.51.0
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v6 1/6] nfsd: fix refcount leak in nfsd_set_fh_dentry()
2025-10-08 13:52 [PATCH v6 0/6] NFSD direct I/O read Chuck Lever
@ 2025-10-08 13:52 ` Chuck Lever
2025-10-08 14:13 ` Chuck Lever
2025-10-08 13:52 ` [PATCH v6 2/6] NFSD: Prevent a NULL pointer dereference in fh_getattr() Chuck Lever
` (4 subsequent siblings)
5 siblings, 1 reply; 10+ messages in thread
From: Chuck Lever @ 2025-10-08 13:52 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, tianshuo han, stable
From: NeilBrown <neil@brown.name>
nfsd exports a "pseudo root filesystem" which is used by NFSv4 to find
the various exported filesystems using LOOKUP requests from a known root
filehandle. NFSv3 uses the MOUNT protocol to find those exported
filesystems and so is not given access to the pseudo root filesystem.
If a v3 (or v2) client uses a filehandle from that filesystem,
nfsd_set_fh_dentry() will report an error, but still stores the export
in "struct svc_fh" even though it also drops the reference (exp_put()).
This means that when fh_put() is called an extra reference will be dropped
which can lead to use-after-free and possible denial of service.
Normal NFS usage will not provide a pseudo-root filehandle to a v3
client. This bug can only be triggered by the client synthesising an
incorrect filehandle.
To fix this we move the assignments to the svc_fh later, after all
possible error cases have been detected.
Reported-and-tested-by: tianshuo han <hantianshuo233@gmail.com>
Fixes: ef7f6c4904d0 ("nfsd: move V4ROOT version check to nfsd_set_fh_dentry()")
Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfsfh.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 3eb724ec9566..ed85dd43da18 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -269,9 +269,6 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct net *net,
dentry);
}
- fhp->fh_dentry = dentry;
- fhp->fh_export = exp;
-
switch (fhp->fh_maxsize) {
case NFS4_FHSIZE:
if (dentry->d_sb->s_export_op->flags & EXPORT_OP_NOATOMIC_ATTR)
@@ -293,6 +290,9 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct net *net,
goto out;
}
+ fhp->fh_dentry = dentry;
+ fhp->fh_export = exp;
+
return 0;
out:
exp_put(exp);
--
2.51.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v6 2/6] NFSD: Prevent a NULL pointer dereference in fh_getattr()
2025-10-08 13:52 [PATCH v6 0/6] NFSD direct I/O read Chuck Lever
2025-10-08 13:52 ` [PATCH v6 1/6] nfsd: fix refcount leak in nfsd_set_fh_dentry() Chuck Lever
@ 2025-10-08 13:52 ` Chuck Lever
2025-10-08 13:52 ` [PATCH v6 3/6] NFSD: Recover from vfs_getattr() failure in nfsd_file_get_dio_attrs() Chuck Lever
` (3 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Chuck Lever @ 2025-10-08 13:52 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Mike Snitzer, Christoph Hellwig
From: Chuck Lever <chuck.lever@oracle.com>
In general, fh_getattr() can be called after the target dentry has
gone negative. For a negative dentry, d_inode(p.dentry) will return
NULL. S_ISREG() will dereference that pointer.
Avoid this potential regression by using the d_is_reg() helper
instead.
Suggested-by: NeilBrown <neil@brown.name>
Fixes: d11f6cd1bb4a ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfsfh.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index ed85dd43da18..16182936828f 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -696,10 +696,9 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
.mnt = fhp->fh_export->ex_path.mnt,
.dentry = fhp->fh_dentry,
};
- struct inode *inode = d_inode(p.dentry);
u32 request_mask = STATX_BASIC_STATS;
- if (S_ISREG(inode->i_mode))
+ if (d_is_reg(p.dentry))
request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
if (fhp->fh_maxsize == NFS4_FHSIZE)
--
2.51.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v6 3/6] NFSD: Recover from vfs_getattr() failure in nfsd_file_get_dio_attrs()
2025-10-08 13:52 [PATCH v6 0/6] NFSD direct I/O read Chuck Lever
2025-10-08 13:52 ` [PATCH v6 1/6] nfsd: fix refcount leak in nfsd_set_fh_dentry() Chuck Lever
2025-10-08 13:52 ` [PATCH v6 2/6] NFSD: Prevent a NULL pointer dereference in fh_getattr() Chuck Lever
@ 2025-10-08 13:52 ` Chuck Lever
2025-10-08 15:03 ` Jeff Layton
2025-10-08 13:52 ` [PATCH v6 4/6] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
` (2 subsequent siblings)
5 siblings, 1 reply; 10+ messages in thread
From: Chuck Lever @ 2025-10-08 13:52 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
A vfs_getattr() failure is rare but not totally impossible. It
typically means nfsd_do_file_acquire() raced with the file system
being shut down. Ensure the nfsd_file is not leaked in this case.
Fixes: d11f6cd1bb4a ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/filecache.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index a238b6725008..482feb0b55ad 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -1198,8 +1198,11 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
}
status = nfserrno(ret);
trace_nfsd_file_open(nf, status);
- if (status == nfs_ok)
+ if (status == nfs_ok) {
status = nfsd_file_get_dio_attrs(fhp, nf);
+ if (status != nfs_ok)
+ goto construction_err;
+ }
}
} else
status = nfserr_jukebox;
--
2.51.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v6 4/6] NFSD: pass nfsd_file to nfsd_iter_read()
2025-10-08 13:52 [PATCH v6 0/6] NFSD direct I/O read Chuck Lever
` (2 preceding siblings ...)
2025-10-08 13:52 ` [PATCH v6 3/6] NFSD: Recover from vfs_getattr() failure in nfsd_file_get_dio_attrs() Chuck Lever
@ 2025-10-08 13:52 ` Chuck Lever
2025-10-08 13:52 ` [PATCH v6 5/6] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
2025-10-08 13:52 ` [PATCH v6 6/6] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
5 siblings, 0 replies; 10+ messages in thread
From: Chuck Lever @ 2025-10-08 13:52 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer, Christoph Hellwig
From: Mike Snitzer <snitzer@kernel.org>
Prepare for nfsd_iter_read() to use the DIO alignment stored in
nfsd_file by passing the nfsd_file to nfsd_iter_read() rather than
just the file which is associaed with the nfsd_file.
This means nfsd4_encode_readv() now also needs the nfsd_file rather
than the file. Instead of changing the file arg to be the nfsd_file,
we discard the file arg as the nfsd_file (and indeed the file) is
already available via the "read" argument.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfs4xdr.c | 8 ++++----
fs/nfsd/vfs.c | 7 ++++---
fs/nfsd/vfs.h | 2 +-
3 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index c0a3c6a7c8bb..cd3251340b5c 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4465,7 +4465,7 @@ static __be32 nfsd4_encode_splice_read(
static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
struct nfsd4_read *read,
- struct file *file, unsigned long maxcount)
+ unsigned long maxcount)
{
struct xdr_stream *xdr = resp->xdr;
unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
@@ -4476,7 +4476,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
if (xdr_reserve_space_vec(xdr, maxcount) < 0)
return nfserr_resource;
- nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
+ nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
read->rd_offset, &maxcount, base,
&read->rd_eof);
read->rd_length = maxcount;
@@ -4523,7 +4523,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr) {
xdr_truncate_encode(xdr, eof_offset);
return nfserr;
@@ -5419,7 +5419,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr)
return nfserr;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ea9c2de70429..406fe62de219 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1078,7 +1078,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
* @fhp: file handle of file to be read
- * @file: opened struct file of file to be read
+ * @nf: opened struct nfsd_file of file to be read
* @offset: starting byte offset
* @count: IN: requested number of bytes; OUT: number of bytes read
* @base: offset in first page of read buffer
@@ -1091,9 +1091,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* returned.
*/
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset, unsigned long *count,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
unsigned int base, u32 *eof)
{
+ struct file *file = nf->nf_file;
unsigned long v, total;
struct iov_iter iter;
struct kiocb kiocb;
@@ -1336,7 +1337,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
else
- err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
+ err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
nfsd_file_put(nf);
trace_nfsd_read_done(rqstp, fhp, offset, *count);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 0c0292611c6d..fa46f8b5f132 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -121,7 +121,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned long *count,
u32 *eof);
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset,
+ struct nfsd_file *nf, loff_t offset,
unsigned long *count, unsigned int base,
u32 *eof);
bool nfsd_read_splice_ok(struct svc_rqst *rqstp);
--
2.51.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v6 5/6] NFSD: Relocate the xdr_reserve_space_vec() call site
2025-10-08 13:52 [PATCH v6 0/6] NFSD direct I/O read Chuck Lever
` (3 preceding siblings ...)
2025-10-08 13:52 ` [PATCH v6 4/6] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
@ 2025-10-08 13:52 ` Chuck Lever
2025-10-08 13:52 ` [PATCH v6 6/6] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
5 siblings, 0 replies; 10+ messages in thread
From: Chuck Lever @ 2025-10-08 13:52 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
In order to detect when a direct READ is possible, we need the send
buffer's .page_len to be zero when there is nothing in the buffer's
.pages array yet.
However, when xdr_reserve_space_vec() extends the size of the
xdr_stream to accommodate a READ payload, it adds to the send
buffer's .page_len.
It should be safe to reserve the stream space /after/ the VFS read
operation completes. This is, for example, how an NFSv3 READ works:
the VFS read goes into the rq_bvec, and is then added to the send
xdr_stream later by svcxdr_encode_opaque_pages().
Now that xdr_reserve_space_vec() uses the number of bytes actually
read, the xdr_truncate_encode() call is no longer necessary.
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfs4xdr.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index cd3251340b5c..f4a5e102b63a 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4473,18 +4473,30 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
__be32 zero = xdr_zero;
__be32 nfserr;
- if (xdr_reserve_space_vec(xdr, maxcount) < 0)
- return nfserr_resource;
-
nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
read->rd_offset, &maxcount, base,
&read->rd_eof);
read->rd_length = maxcount;
if (nfserr)
return nfserr;
+
+ /*
+ * svcxdr_encode_opaque_pages() is not used here because
+ * we don't want to encode subsequent results in this
+ * COMPOUND into the xdr->buf's tail, but rather those
+ * results should follow the NFS READ payload in the
+ * buf's pages.
+ */
+ if (xdr_reserve_space_vec(xdr, maxcount) < 0)
+ return nfserr_resource;
+
+ /*
+ * Mark the buffer location of the NFS READ payload so that
+ * direct placement-capable transports send only the
+ * payload bytes out-of-band.
+ */
if (svc_encode_result_payload(resp->rqstp, starting_len, maxcount))
return nfserr_io;
- xdr_truncate_encode(xdr, starting_len + xdr_align_size(maxcount));
write_bytes_to_xdr_buf(xdr->buf, starting_len + maxcount, &zero,
xdr_pad_size(maxcount));
--
2.51.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v6 6/6] NFSD: Implement NFSD_IO_DIRECT for NFS READ
2025-10-08 13:52 [PATCH v6 0/6] NFSD direct I/O read Chuck Lever
` (4 preceding siblings ...)
2025-10-08 13:52 ` [PATCH v6 5/6] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
@ 2025-10-08 13:52 ` Chuck Lever
5 siblings, 0 replies; 10+ messages in thread
From: Chuck Lever @ 2025-10-08 13:52 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Mike Snitzer
From: Chuck Lever <chuck.lever@oracle.com>
Add an experimental option that forces NFS READ operations to use
direct I/O instead of reading through the NFS server's page cache.
There is already at least one other layer of read caching: the page
cache on NFS clients.
The server's page cache, in many cases, is unlikely to provide
additional benefit. Some benchmarks have demonstrated that the
server's page cache is actively detrimental for workloads whose
working set is larger than the server's available physical memory.
For instance, on small NFS servers, cached NFS file content can
squeeze out local memory consumers. For large sequential workloads,
an enormous amount of data flows into and out of the page cache
and is consumed by NFS clients exactly once -- caching that data
is expensive to do and totally valueless.
For now this is a hidden option that can be enabled on test
systems for benchmarking. In the longer term, this option might
be enabled persistently or per-export. When the exported file
system does not support direct I/O, NFSD falls back to using
either DONTCACHE or buffered I/O to fulfill NFS READ requests.
Suggested-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/debugfs.c | 2 ++
fs/nfsd/nfsd.h | 1 +
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 87 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index ed2b9e066206..00eb1ecef6ac 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
* Contents:
* %0: NFS READ will use buffered IO
* %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
+ * %2: NFS READ will use direct IO
*
* This setting takes immediate effect for all NFS versions,
* all exports, and in all NFSD net namespaces.
@@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
nfsd_io_cache_read = NFSD_IO_BUFFERED;
break;
case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
/*
* Must disable splice_read when enabling
* NFSD_IO_DONTCACHE.
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index ea87b42894dd..bdb60ee1f1a4 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -157,6 +157,7 @@ enum {
/* Any new NFSD_IO enum value must be added at the end */
NFSD_IO_BUFFERED,
NFSD_IO_DONTCACHE,
+ NFSD_IO_DIRECT,
};
extern u64 nfsd_io_cache_read __read_mostly;
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 6e2c8e2aab10..bfd41236aff2 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
DEFINE_NFSD_IO_EVENT(read_start);
DEFINE_NFSD_IO_EVENT(read_splice);
DEFINE_NFSD_IO_EVENT(read_vector);
+DEFINE_NFSD_IO_EVENT(read_direct);
DEFINE_NFSD_IO_EVENT(read_io_done);
DEFINE_NFSD_IO_EVENT(read_done);
DEFINE_NFSD_IO_EVENT(write_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 406fe62de219..f537a7b4ee01 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1074,6 +1074,83 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
+/*
+ * The byte range of the client's READ request is expanded on both ends
+ * until it meets the underlying file system's direct I/O alignment
+ * requirements. After the internal read is complete, the byte range of
+ * the NFS READ payload is reduced to the byte range that was originally
+ * requested.
+ *
+ * Note that a direct read can be done only when the xdr_buf containing
+ * the NFS READ reply does not already have contents in its .pages array.
+ * This is due to potentially restrictive alignment requirements on the
+ * read buffer. When .page_len and @base are zero, the .pages array is
+ * guaranteed to be page-aligned.
+ */
+static noinline_for_stack __be32
+nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
+ u32 *eof)
+{
+ u64 dio_start, dio_end;
+ unsigned long v, total;
+ struct iov_iter iter;
+ struct kiocb kiocb;
+ ssize_t host_err;
+ size_t len;
+
+ init_sync_kiocb(&kiocb, nf->nf_file);
+ kiocb.ki_flags |= IOCB_DIRECT;
+
+ /* Read a properly-aligned region of bytes into rq_bvec */
+ dio_start = round_down(offset, nf->nf_dio_read_offset_align);
+ dio_end = round_up((u64)offset + *count, nf->nf_dio_read_offset_align);
+
+ kiocb.ki_pos = dio_start;
+
+ v = 0;
+ total = dio_end - dio_start;
+ while (total && v < rqstp->rq_maxpages &&
+ rqstp->rq_next_page < rqstp->rq_page_end) {
+ len = min_t(size_t, total, PAGE_SIZE);
+ bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
+ len, 0);
+
+ total -= len;
+ ++rqstp->rq_next_page;
+ ++v;
+ }
+
+ trace_nfsd_read_direct(rqstp, fhp, offset, *count - total);
+ iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v,
+ dio_end - dio_start - total);
+
+ host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
+ if (host_err >= 0) {
+ unsigned int pad = offset - dio_start;
+
+ /* The returned payload starts after the pad */
+ rqstp->rq_res.page_base = pad;
+
+ /* Compute the count of bytes to be returned */
+ if (host_err > pad + *count)
+ host_err = *count;
+ else if (host_err > pad)
+ host_err -= pad;
+ else
+ host_err = 0;
+ } else if (unlikely(host_err == -EINVAL)) {
+ struct inode *inode = d_inode(fhp->fh_dentry);
+
+ pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
+ inode->i_sb->s_id, inode->i_ino);
+ host_err = -ESERVERFAULT;
+ }
+
+ return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
+ eof, host_err);
+}
+
/**
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
@@ -1106,6 +1183,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
switch (nfsd_io_cache_read) {
case NFSD_IO_BUFFERED:
break;
+ case NFSD_IO_DIRECT:
+ /* When dio_read_offset_align is zero, dio is not supported */
+ if (nf->nf_dio_read_offset_align && !rqstp->rq_res.page_len)
+ return nfsd_direct_read(rqstp, fhp, nf, offset,
+ count, eof);
+ fallthrough;
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags = IOCB_DONTCACHE;
--
2.51.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v6 1/6] nfsd: fix refcount leak in nfsd_set_fh_dentry()
2025-10-08 13:52 ` [PATCH v6 1/6] nfsd: fix refcount leak in nfsd_set_fh_dentry() Chuck Lever
@ 2025-10-08 14:13 ` Chuck Lever
0 siblings, 0 replies; 10+ messages in thread
From: Chuck Lever @ 2025-10-08 14:13 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, tianshuo han, stable
On 10/8/25 9:52 AM, Chuck Lever wrote:
> From: NeilBrown <neil@brown.name>
>
> nfsd exports a "pseudo root filesystem" which is used by NFSv4 to find
> the various exported filesystems using LOOKUP requests from a known root
> filehandle. NFSv3 uses the MOUNT protocol to find those exported
> filesystems and so is not given access to the pseudo root filesystem.
Hrm. This one got included by mistake.
--
Chuck Lever
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v6 3/6] NFSD: Recover from vfs_getattr() failure in nfsd_file_get_dio_attrs()
2025-10-08 13:52 ` [PATCH v6 3/6] NFSD: Recover from vfs_getattr() failure in nfsd_file_get_dio_attrs() Chuck Lever
@ 2025-10-08 15:03 ` Jeff Layton
2025-10-08 15:08 ` Chuck Lever
0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2025-10-08 15:03 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Wed, 2025-10-08 at 09:52 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> A vfs_getattr() failure is rare but not totally impossible. It
> typically means nfsd_do_file_acquire() raced with the file system
> being shut down. Ensure the nfsd_file is not leaked in this case.
>
nit: I'm not sure we can say what typically causes vfs_getattr() to
fail. It really depends on the fs.
> Fixes: d11f6cd1bb4a ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/filecache.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index a238b6725008..482feb0b55ad 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -1198,8 +1198,11 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
> }
> status = nfserrno(ret);
> trace_nfsd_file_open(nf, status);
> - if (status == nfs_ok)
> + if (status == nfs_ok) {
> status = nfsd_file_get_dio_attrs(fhp, nf);
> + if (status != nfs_ok)
> + goto construction_err;
> + }
> }
> } else
> status = nfserr_jukebox;
Is there really a leak here? It looks to me that when status != nfs_ok,
it's going to fall through to construction_err anyway. What this does
do is cause nfsd_file_unhash() to not be called. Do we want to leave
the file hashed in this case?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v6 3/6] NFSD: Recover from vfs_getattr() failure in nfsd_file_get_dio_attrs()
2025-10-08 15:03 ` Jeff Layton
@ 2025-10-08 15:08 ` Chuck Lever
0 siblings, 0 replies; 10+ messages in thread
From: Chuck Lever @ 2025-10-08 15:08 UTC (permalink / raw)
To: Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On 10/8/25 11:03 AM, Jeff Layton wrote:
> On Wed, 2025-10-08 at 09:52 -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> A vfs_getattr() failure is rare but not totally impossible. It
>> typically means nfsd_do_file_acquire() raced with the file system
>> being shut down. Ensure the nfsd_file is not leaked in this case.
>>
>
> nit: I'm not sure we can say what typically causes vfs_getattr() to
> fail. It really depends on the fs.
>
>> Fixes: d11f6cd1bb4a ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> fs/nfsd/filecache.c | 5 ++++-
>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>> index a238b6725008..482feb0b55ad 100644
>> --- a/fs/nfsd/filecache.c
>> +++ b/fs/nfsd/filecache.c
>> @@ -1198,8 +1198,11 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
>> }
>> status = nfserrno(ret);
>> trace_nfsd_file_open(nf, status);
>> - if (status == nfs_ok)
>> + if (status == nfs_ok) {
>> status = nfsd_file_get_dio_attrs(fhp, nf);
>> + if (status != nfs_ok)
>> + goto construction_err;
>> + }
>> }
>> } else
>> status = nfserr_jukebox;
>
> Is there really a leak here? It looks to me that when status != nfs_ok,
> it's going to fall through to construction_err anyway.
I wondered about that. Then there is probably nothing here to fix.
> What this does
> do is cause nfsd_file_unhash() to not be called. Do we want to leave
> the file hashed in this case?
--
Chuck Lever
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-10-08 15:08 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-08 13:52 [PATCH v6 0/6] NFSD direct I/O read Chuck Lever
2025-10-08 13:52 ` [PATCH v6 1/6] nfsd: fix refcount leak in nfsd_set_fh_dentry() Chuck Lever
2025-10-08 14:13 ` Chuck Lever
2025-10-08 13:52 ` [PATCH v6 2/6] NFSD: Prevent a NULL pointer dereference in fh_getattr() Chuck Lever
2025-10-08 13:52 ` [PATCH v6 3/6] NFSD: Recover from vfs_getattr() failure in nfsd_file_get_dio_attrs() Chuck Lever
2025-10-08 15:03 ` Jeff Layton
2025-10-08 15:08 ` Chuck Lever
2025-10-08 13:52 ` [PATCH v6 4/6] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
2025-10-08 13:52 ` [PATCH v6 5/6] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
2025-10-08 13:52 ` [PATCH v6 6/6] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).