* [PATCH v5 0/6] NFSD direct I/O read
@ 2025-09-29 15:56 Chuck Lever
2025-09-29 15:56 ` [PATCH v5 1/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
` (5 more replies)
0 siblings, 6 replies; 19+ messages in thread
From: Chuck Lever @ 2025-09-29 15:56 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
The goal is to get the experimental read-side direct I/O
implementation merged sooner. We are still thinking through the
implications of mixing direct and buffered I/O when handling an
NFS WRITE that does not meet the dio alignment requirements.
Changes since v4:
* Additional Reviewed-by's; applying v5 to nfsd-testing
* Address Christoph's comments on 4/4
* Suggest a couple of clean-ups for 1/4
Changes since v3:
* Move xdr_reserve_space_vec() call to preserve the page_len value
* Note that "add STATX_DIOALIGN and STATX_DIO_READ_ALIGN ..."
remains exactly the same as it was in the v3 series
Changes since v2:
* "Add array bounds-checking..." has been applied to nfsd-testing
* Add a page_len check before committing to use direct I/O
Changes since v1:
* Harden the loop that constructs the I/O bvec
* Address review comments
Changes from Mike's v9:
* The LOC introduced by the feature has been reduced considerably.
* A new trace point in nfsd_file_getattr reports each file's dio
alignment parameters when it is opened.
* The direct I/O path has been taken out-of-line so that it may
continue to be modified and optimized without perturbing the more
commonly-used I/O paths.
* When an exported file system does not implement direct I/O, more
commonly-used modes are employed instead to avoid returning
EOPNOTSUPP unexpectedly.
* When NFSD_IO_DIRECT is selected, NFS READs of all sizes use direct
I/O to provide better experimental data about small I/O workloads.
Chuck Lever (4):
NFSD: Relocate the xdr_reserve_space_vec() call site
NFSD: Implement NFSD_IO_DIRECT for NFS READ
NFSD: Prevent a NULL pointer dereference in fh_getattr()
NFSD: Ignore vfs_getattr() failure in nfsd_file_get_dio_attrs()
Mike Snitzer (2):
NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
NFSD: pass nfsd_file to nfsd_iter_read()
fs/nfsd/debugfs.c | 2 +
fs/nfsd/filecache.c | 33 +++++++++++++++
fs/nfsd/filecache.h | 4 ++
fs/nfsd/nfs4xdr.c | 28 +++++++++----
fs/nfsd/nfsd.h | 1 +
fs/nfsd/nfsfh.c | 3 ++
fs/nfsd/trace.h | 28 +++++++++++++
fs/nfsd/vfs.c | 92 +++++++++++++++++++++++++++++++++++++++--
fs/nfsd/vfs.h | 2 +-
include/trace/misc/fs.h | 22 ++++++++++
10 files changed, 203 insertions(+), 12 deletions(-)
--
2.51.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v5 1/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
2025-09-29 15:56 [PATCH v5 0/6] NFSD direct I/O read Chuck Lever
@ 2025-09-29 15:56 ` Chuck Lever
2025-10-03 7:09 ` Christoph Hellwig
2025-09-29 15:56 ` [PATCH v5 2/6] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
` (4 subsequent siblings)
5 siblings, 1 reply; 19+ messages in thread
From: Chuck Lever @ 2025-09-29 15:56 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer
From: Mike Snitzer <snitzer@kernel.org>
Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get DIO alignment
attributes from the underlying filesystem and store them in the
associated nfsd_file. This is done when the nfsd_file is first
opened for each regular file.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/filecache.c | 34 ++++++++++++++++++++++++++++++++++
fs/nfsd/filecache.h | 4 ++++
fs/nfsd/nfsfh.c | 4 ++++
fs/nfsd/trace.h | 27 +++++++++++++++++++++++++++
include/trace/misc/fs.h | 22 ++++++++++++++++++++++
5 files changed, 91 insertions(+)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 75bc48031c07..78cca0d751ac 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -231,6 +231,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
refcount_set(&nf->nf_ref, 1);
nf->nf_may = need;
nf->nf_mark = NULL;
+ nf->nf_dio_mem_align = 0;
+ nf->nf_dio_offset_align = 0;
+ nf->nf_dio_read_offset_align = 0;
return nf;
}
@@ -1048,6 +1051,35 @@ nfsd_file_is_cached(struct inode *inode)
return ret;
}
+static __be32
+nfsd_file_get_dio_attrs(const struct svc_fh *fhp, struct nfsd_file *nf)
+{
+ struct inode *inode = file_inode(nf->nf_file);
+ struct kstat stat;
+ __be32 status;
+
+ /* Currently only need to get DIO alignment info for regular files */
+ if (!S_ISREG(inode->i_mode))
+ return nfs_ok;
+
+ status = fh_getattr(fhp, &stat);
+ if (status != nfs_ok)
+ return status;
+
+ trace_nfsd_file_get_dio_attrs(inode, &stat);
+
+ if (stat.result_mask & STATX_DIOALIGN) {
+ nf->nf_dio_mem_align = stat.dio_mem_align;
+ nf->nf_dio_offset_align = stat.dio_offset_align;
+ }
+ if (stat.result_mask & STATX_DIO_READ_ALIGN)
+ nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
+ else
+ nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
+
+ return nfs_ok;
+}
+
static __be32
nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
struct svc_cred *cred,
@@ -1166,6 +1198,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
}
status = nfserrno(ret);
trace_nfsd_file_open(nf, status);
+ if (status == nfs_ok)
+ status = nfsd_file_get_dio_attrs(fhp, nf);
}
} else
status = nfserr_jukebox;
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index 24ddf60e8434..e3d6ca2b6030 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -54,6 +54,10 @@ struct nfsd_file {
struct list_head nf_gc;
struct rcu_head nf_rcu;
ktime_t nf_birthtime;
+
+ u32 nf_dio_mem_align;
+ u32 nf_dio_offset_align;
+ u32 nf_dio_read_offset_align;
};
int nfsd_file_cache_init(void);
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index bd9acfdc7b01..ed85dd43da18 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -696,8 +696,12 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
.mnt = fhp->fh_export->ex_path.mnt,
.dentry = fhp->fh_dentry,
};
+ struct inode *inode = d_inode(p.dentry);
u32 request_mask = STATX_BASIC_STATS;
+ if (S_ISREG(inode->i_mode))
+ request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
+
if (fhp->fh_maxsize == NFS4_FHSIZE)
request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index a664fdf1161e..6e2c8e2aab10 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -1133,6 +1133,33 @@ TRACE_EVENT(nfsd_file_alloc,
)
);
+TRACE_EVENT(nfsd_file_get_dio_attrs,
+ TP_PROTO(
+ const struct inode *inode,
+ const struct kstat *stat
+ ),
+ TP_ARGS(inode, stat),
+ TP_STRUCT__entry(
+ __field(const void *, inode)
+ __field(unsigned long, mask)
+ __field(u32, mem_align)
+ __field(u32, offset_align)
+ __field(u32, read_offset_align)
+ ),
+ TP_fast_assign(
+ __entry->inode = inode;
+ __entry->mask = stat->result_mask;
+ __entry->mem_align = stat->dio_mem_align;
+ __entry->offset_align = stat->dio_offset_align;
+ __entry->read_offset_align = stat->dio_read_offset_align;
+ ),
+ TP_printk("inode=%p flags=%s mem_align=%u offset_align=%u read_offset_align=%u",
+ __entry->inode, show_statx_mask(__entry->mask),
+ __entry->mem_align, __entry->offset_align,
+ __entry->read_offset_align
+ )
+);
+
TRACE_EVENT(nfsd_file_acquire,
TP_PROTO(
const struct svc_rqst *rqstp,
diff --git a/include/trace/misc/fs.h b/include/trace/misc/fs.h
index 0406ebe2a80a..7ead1c61f0cb 100644
--- a/include/trace/misc/fs.h
+++ b/include/trace/misc/fs.h
@@ -141,3 +141,25 @@
{ ATTR_TIMES_SET, "TIMES_SET" }, \
{ ATTR_TOUCH, "TOUCH"}, \
{ ATTR_DELEG, "DELEG"})
+
+#define show_statx_mask(flags) \
+ __print_flags(flags, "|", \
+ { STATX_TYPE, "TYPE" }, \
+ { STATX_MODE, "MODE" }, \
+ { STATX_NLINK, "NLINK" }, \
+ { STATX_UID, "UID" }, \
+ { STATX_GID, "GID" }, \
+ { STATX_ATIME, "ATIME" }, \
+ { STATX_MTIME, "MTIME" }, \
+ { STATX_CTIME, "CTIME" }, \
+ { STATX_INO, "INO" }, \
+ { STATX_SIZE, "SIZE" }, \
+ { STATX_BLOCKS, "BLOCKS" }, \
+ { STATX_BASIC_STATS, "BASIC_STATS" }, \
+ { STATX_BTIME, "BTIME" }, \
+ { STATX_MNT_ID, "MNT_ID" }, \
+ { STATX_DIOALIGN, "DIOALIGN" }, \
+ { STATX_MNT_ID_UNIQUE, "MNT_ID_UNIQUE" }, \
+ { STATX_SUBVOL, "SUBVOL" }, \
+ { STATX_WRITE_ATOMIC, "WRITE_ATOMIC" }, \
+ { STATX_DIO_READ_ALIGN, "DIO_READ_ALIGN" })
--
2.51.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH v5 2/6] NFSD: pass nfsd_file to nfsd_iter_read()
2025-09-29 15:56 [PATCH v5 0/6] NFSD direct I/O read Chuck Lever
2025-09-29 15:56 ` [PATCH v5 1/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
@ 2025-09-29 15:56 ` Chuck Lever
2025-10-03 7:09 ` Christoph Hellwig
2025-09-29 15:56 ` [PATCH v5 3/6] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
` (3 subsequent siblings)
5 siblings, 1 reply; 19+ messages in thread
From: Chuck Lever @ 2025-09-29 15:56 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer
From: Mike Snitzer <snitzer@kernel.org>
Prepare for nfsd_iter_read() to use the DIO alignment stored in
nfsd_file by passing the nfsd_file to nfsd_iter_read() rather than
just the file which is associaed with the nfsd_file.
This means nfsd4_encode_readv() now also needs the nfsd_file rather
than the file. Instead of changing the file arg to be the nfsd_file,
we discard the file arg as the nfsd_file (and indeed the file) is
already available via the "read" argument.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfs4xdr.c | 8 ++++----
fs/nfsd/vfs.c | 7 ++++---
fs/nfsd/vfs.h | 2 +-
3 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 97e9e9afa80a..41f0d54d6e1b 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4467,7 +4467,7 @@ static __be32 nfsd4_encode_splice_read(
static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
struct nfsd4_read *read,
- struct file *file, unsigned long maxcount)
+ unsigned long maxcount)
{
struct xdr_stream *xdr = resp->xdr;
unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
@@ -4478,7 +4478,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
if (xdr_reserve_space_vec(xdr, maxcount) < 0)
return nfserr_resource;
- nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
+ nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
read->rd_offset, &maxcount, base,
&read->rd_eof);
read->rd_length = maxcount;
@@ -4525,7 +4525,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr) {
xdr_truncate_encode(xdr, eof_offset);
return nfserr;
@@ -5421,7 +5421,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr)
return nfserr;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 2026431500ec..35880d3f1326 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1078,7 +1078,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
* @fhp: file handle of file to be read
- * @file: opened struct file of file to be read
+ * @nf: opened struct nfsd_file of file to be read
* @offset: starting byte offset
* @count: IN: requested number of bytes; OUT: number of bytes read
* @base: offset in first page of read buffer
@@ -1091,9 +1091,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* returned.
*/
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset, unsigned long *count,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
unsigned int base, u32 *eof)
{
+ struct file *file = nf->nf_file;
unsigned long v, total;
struct iov_iter iter;
struct kiocb kiocb;
@@ -1336,7 +1337,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
else
- err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
+ err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
nfsd_file_put(nf);
trace_nfsd_read_done(rqstp, fhp, offset, *count);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 0c0292611c6d..fa46f8b5f132 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -121,7 +121,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned long *count,
u32 *eof);
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset,
+ struct nfsd_file *nf, loff_t offset,
unsigned long *count, unsigned int base,
u32 *eof);
bool nfsd_read_splice_ok(struct svc_rqst *rqstp);
--
2.51.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH v5 3/6] NFSD: Relocate the xdr_reserve_space_vec() call site
2025-09-29 15:56 [PATCH v5 0/6] NFSD direct I/O read Chuck Lever
2025-09-29 15:56 ` [PATCH v5 1/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
2025-09-29 15:56 ` [PATCH v5 2/6] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
@ 2025-09-29 15:56 ` Chuck Lever
2025-09-29 15:56 ` [PATCH v5 4/6] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
` (2 subsequent siblings)
5 siblings, 0 replies; 19+ messages in thread
From: Chuck Lever @ 2025-09-29 15:56 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
In order to detect when a direct READ is possible, we need the send
buffer's .page_len to be zero when there is nothing in the buffer's
.pages array yet.
However, when xdr_reserve_space_vec() extends the size of the
xdr_stream to accommodate a READ payload, it adds to the send
buffer's .page_len.
It should be safe to reserve the stream space /after/ the VFS read
operation completes. This is, for example, how an NFSv3 READ works:
the VFS read goes into the rq_bvec, and is then added to the send
xdr_stream later by svcxdr_encode_opaque_pages().
Now that xdr_reserve_space_vec() uses the number of bytes actually
read, the xdr_truncate_encode() call is no longer necessary.
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfs4xdr.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 41f0d54d6e1b..3a5c7a0e2db8 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4475,18 +4475,30 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
__be32 zero = xdr_zero;
__be32 nfserr;
- if (xdr_reserve_space_vec(xdr, maxcount) < 0)
- return nfserr_resource;
-
nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
read->rd_offset, &maxcount, base,
&read->rd_eof);
read->rd_length = maxcount;
if (nfserr)
return nfserr;
+
+ /*
+ * svcxdr_encode_opaque_pages() is not used here because
+ * we don't want to encode subsequent results in this
+ * COMPOUND into the xdr->buf's tail, but rather those
+ * results should follow the NFS READ payload in the
+ * buf's pages.
+ */
+ if (xdr_reserve_space_vec(xdr, maxcount) < 0)
+ return nfserr_resource;
+
+ /*
+ * Mark the buffer location of the NFS READ payload so that
+ * direct placement-capable transports send only the
+ * payload bytes out-of-band.
+ */
if (svc_encode_result_payload(resp->rqstp, starting_len, maxcount))
return nfserr_io;
- xdr_truncate_encode(xdr, starting_len + xdr_align_size(maxcount));
write_bytes_to_xdr_buf(xdr->buf, starting_len + maxcount, &zero,
xdr_pad_size(maxcount));
--
2.51.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH v5 4/6] NFSD: Implement NFSD_IO_DIRECT for NFS READ
2025-09-29 15:56 [PATCH v5 0/6] NFSD direct I/O read Chuck Lever
` (2 preceding siblings ...)
2025-09-29 15:56 ` [PATCH v5 3/6] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
@ 2025-09-29 15:56 ` Chuck Lever
2025-10-03 7:19 ` Christoph Hellwig
2025-09-29 15:56 ` [PATCH v5 5/6] NFSD: Prevent a NULL pointer dereference in fh_getattr() Chuck Lever
2025-09-29 15:56 ` [PATCH v5 6/6] NFSD: Ignore vfs_getattr() failure in nfsd_file_get_dio_attrs() Chuck Lever
5 siblings, 1 reply; 19+ messages in thread
From: Chuck Lever @ 2025-09-29 15:56 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Mike Snitzer
From: Chuck Lever <chuck.lever@oracle.com>
Add an experimental option that forces NFS READ operations to use
direct I/O instead of reading through the NFS server's page cache.
There is already at least one other layer of read caching: the page
cache on NFS clients.
The server's page cache, in many cases, is unlikely to provide
additional benefit. Some benchmarks have demonstrated that the
server's page cache is actively detrimental for workloads whose
working set is larger than the server's available physical memory.
For instance, on small NFS servers, cached NFS file content can
squeeze out local memory consumers. For large sequential workloads,
an enormous amount of data flows into and out of the page cache
and is consumed by NFS clients exactly once -- caching that data
is expensive to do and totally valueless.
For now this is a hidden option that can be enabled on test
systems for benchmarking. In the longer term, this option might
be enabled persistently or per-export. When the exported file
system does not support direct I/O, NFSD falls back to using
either DONTCACHE or buffered I/O to fulfill NFS READ requests.
Suggested-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/debugfs.c | 2 ++
fs/nfsd/nfsd.h | 1 +
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 89 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index ed2b9e066206..00eb1ecef6ac 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
* Contents:
* %0: NFS READ will use buffered IO
* %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
+ * %2: NFS READ will use direct IO
*
* This setting takes immediate effect for all NFS versions,
* all exports, and in all NFSD net namespaces.
@@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
nfsd_io_cache_read = NFSD_IO_BUFFERED;
break;
case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
/*
* Must disable splice_read when enabling
* NFSD_IO_DONTCACHE.
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index ea87b42894dd..bdb60ee1f1a4 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -157,6 +157,7 @@ enum {
/* Any new NFSD_IO enum value must be added at the end */
NFSD_IO_BUFFERED,
NFSD_IO_DONTCACHE,
+ NFSD_IO_DIRECT,
};
extern u64 nfsd_io_cache_read __read_mostly;
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 6e2c8e2aab10..bfd41236aff2 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
DEFINE_NFSD_IO_EVENT(read_start);
DEFINE_NFSD_IO_EVENT(read_splice);
DEFINE_NFSD_IO_EVENT(read_vector);
+DEFINE_NFSD_IO_EVENT(read_direct);
DEFINE_NFSD_IO_EVENT(read_io_done);
DEFINE_NFSD_IO_EVENT(read_done);
DEFINE_NFSD_IO_EVENT(write_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 35880d3f1326..8172a063db07 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1074,6 +1074,84 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
+/*
+ * The byte range of the client's READ request is expanded on both
+ * ends until it meets the underlying file system's direct I/O
+ * alignment requirements. After the internal read is complete, the
+ * byte range of the NFS READ payload is reduced to the byte range
+ * that was originally requested.
+ *
+ * Note that a direct read can be done only when the xdr_buf
+ * containing the NFS READ reply does not already have contents in
+ * its .pages array. This is due to potentially restrictive
+ * alignment requirements on the read buffer. When .page_len and
+ * @base are zero, the .pages array is guaranteed to be page-
+ * aligned.
+ */
+static noinline_for_stack __be32
+nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
+ u32 *eof)
+{
+ u64 dio_start, dio_end;
+ unsigned long v, total;
+ struct iov_iter iter;
+ struct kiocb kiocb;
+ ssize_t host_err;
+ size_t len;
+
+ init_sync_kiocb(&kiocb, nf->nf_file);
+ kiocb.ki_flags |= IOCB_DIRECT;
+
+ /* Read a properly-aligned region of bytes into rq_bvec */
+ dio_start = round_down(offset, nf->nf_dio_read_offset_align);
+ dio_end = round_up((u64)offset + *count, nf->nf_dio_read_offset_align);
+
+ kiocb.ki_pos = dio_start;
+
+ v = 0;
+ total = dio_end - dio_start;
+ while (total && v < rqstp->rq_maxpages &&
+ rqstp->rq_next_page < rqstp->rq_page_end) {
+ len = min_t(size_t, total, PAGE_SIZE);
+ bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
+ len, 0);
+
+ total -= len;
+ ++rqstp->rq_next_page;
+ ++v;
+ }
+
+ trace_nfsd_read_direct(rqstp, fhp, offset, *count - total);
+ iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v,
+ dio_end - dio_start - total);
+
+ host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
+ if (host_err >= 0) {
+ unsigned int pad = offset - dio_start;
+
+ /* The returned payload starts after the pad */
+ rqstp->rq_res.page_base = pad;
+
+ /* Compute the count of bytes to be returned */
+ if (host_err > pad + *count)
+ host_err = *count;
+ else if (host_err > pad)
+ host_err -= pad;
+ else
+ host_err = 0;
+ } else if (unlikely(host_err == -EINVAL)) {
+ struct inode *inode = d_inode(fhp->fh_dentry);
+
+ pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
+ inode->i_sb->s_id, inode->i_ino);
+ host_err = -ESERVERFAULT;
+ }
+
+ return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
+ eof, host_err);
+}
+
/**
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
@@ -1106,6 +1184,13 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
switch (nfsd_io_cache_read) {
case NFSD_IO_BUFFERED:
break;
+ case NFSD_IO_DIRECT:
+ /* When dio_read_offset_align is zero, dio is not supported */
+ if (nf->nf_dio_read_offset_align &&
+ !rqstp->rq_res.page_len)
+ return nfsd_direct_read(rqstp, fhp, nf, offset,
+ count, eof);
+ fallthrough;
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags = IOCB_DONTCACHE;
--
2.51.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH v5 5/6] NFSD: Prevent a NULL pointer dereference in fh_getattr()
2025-09-29 15:56 [PATCH v5 0/6] NFSD direct I/O read Chuck Lever
` (3 preceding siblings ...)
2025-09-29 15:56 ` [PATCH v5 4/6] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
@ 2025-09-29 15:56 ` Chuck Lever
2025-09-29 16:13 ` Jeff Layton
` (2 more replies)
2025-09-29 15:56 ` [PATCH v5 6/6] NFSD: Ignore vfs_getattr() failure in nfsd_file_get_dio_attrs() Chuck Lever
5 siblings, 3 replies; 19+ messages in thread
From: Chuck Lever @ 2025-09-29 15:56 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
In general, fh_getattr() can be called after the target dentry has
gone negative. For a negative dentry, d_inode(p.dentry) will return
NULL. S_ISREG() will dereference that pointer.
Avoid this potential regression by using the d_is_reg() helper
instead.
Suggested-by: NeilBrown <neil@brown.name>
Fixes: bc70aaeba7df ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfsfh.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index ed85dd43da18..16182936828f 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -696,10 +696,9 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
.mnt = fhp->fh_export->ex_path.mnt,
.dentry = fhp->fh_dentry,
};
- struct inode *inode = d_inode(p.dentry);
u32 request_mask = STATX_BASIC_STATS;
- if (S_ISREG(inode->i_mode))
+ if (d_is_reg(p.dentry))
request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
if (fhp->fh_maxsize == NFS4_FHSIZE)
--
2.51.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH v5 6/6] NFSD: Ignore vfs_getattr() failure in nfsd_file_get_dio_attrs()
2025-09-29 15:56 [PATCH v5 0/6] NFSD direct I/O read Chuck Lever
` (4 preceding siblings ...)
2025-09-29 15:56 ` [PATCH v5 5/6] NFSD: Prevent a NULL pointer dereference in fh_getattr() Chuck Lever
@ 2025-09-29 15:56 ` Chuck Lever
2025-09-29 16:51 ` Jeff Layton
` (2 more replies)
5 siblings, 3 replies; 19+ messages in thread
From: Chuck Lever @ 2025-09-29 15:56 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
A vfs_getattr() failure is rare but not totally impossible.
There's no recovery logic in that case; nfsd_do_file_acquire()'s
caller will fail but the wonky nfsd_file is left in the file cache.
It doesn't seem necessary for nfsd_file_do_acquire() to fail
outright if it successfully opened the file but some problem
prevented the collection of the dio alignment parameters.
Fixes: bc70aaeba7df ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/filecache.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 78cca0d751ac..b34cc8d2cb5e 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -1051,7 +1051,7 @@ nfsd_file_is_cached(struct inode *inode)
return ret;
}
-static __be32
+static void
nfsd_file_get_dio_attrs(const struct svc_fh *fhp, struct nfsd_file *nf)
{
struct inode *inode = file_inode(nf->nf_file);
@@ -1060,11 +1060,12 @@ nfsd_file_get_dio_attrs(const struct svc_fh *fhp, struct nfsd_file *nf)
/* Currently only need to get DIO alignment info for regular files */
if (!S_ISREG(inode->i_mode))
- return nfs_ok;
+ return;
status = fh_getattr(fhp, &stat);
if (status != nfs_ok)
- return status;
+ /* Use default dio alignment parameters (all zero) */
+ return;
trace_nfsd_file_get_dio_attrs(inode, &stat);
@@ -1076,8 +1077,6 @@ nfsd_file_get_dio_attrs(const struct svc_fh *fhp, struct nfsd_file *nf)
nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
else
nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
-
- return nfs_ok;
}
static __be32
@@ -1199,7 +1198,7 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
status = nfserrno(ret);
trace_nfsd_file_open(nf, status);
if (status == nfs_ok)
- status = nfsd_file_get_dio_attrs(fhp, nf);
+ nfsd_file_get_dio_attrs(fhp, nf);
}
} else
status = nfserr_jukebox;
--
2.51.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v5 5/6] NFSD: Prevent a NULL pointer dereference in fh_getattr()
2025-09-29 15:56 ` [PATCH v5 5/6] NFSD: Prevent a NULL pointer dereference in fh_getattr() Chuck Lever
@ 2025-09-29 16:13 ` Jeff Layton
2025-09-29 17:39 ` Mike Snitzer
2025-10-03 7:11 ` Christoph Hellwig
2 siblings, 0 replies; 19+ messages in thread
From: Jeff Layton @ 2025-09-29 16:13 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Mon, 2025-09-29 at 11:56 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> In general, fh_getattr() can be called after the target dentry has
> gone negative. For a negative dentry, d_inode(p.dentry) will return
> NULL. S_ISREG() will dereference that pointer.
>
> Avoid this potential regression by using the d_is_reg() helper
> instead.
>
> Suggested-by: NeilBrown <neil@brown.name>
> Fixes: bc70aaeba7df ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/nfsfh.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index ed85dd43da18..16182936828f 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -696,10 +696,9 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
> .mnt = fhp->fh_export->ex_path.mnt,
> .dentry = fhp->fh_dentry,
> };
> - struct inode *inode = d_inode(p.dentry);
> u32 request_mask = STATX_BASIC_STATS;
>
> - if (S_ISREG(inode->i_mode))
> + if (d_is_reg(p.dentry))
> request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
>
> if (fhp->fh_maxsize == NFS4_FHSIZE)
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v5 6/6] NFSD: Ignore vfs_getattr() failure in nfsd_file_get_dio_attrs()
2025-09-29 15:56 ` [PATCH v5 6/6] NFSD: Ignore vfs_getattr() failure in nfsd_file_get_dio_attrs() Chuck Lever
@ 2025-09-29 16:51 ` Jeff Layton
2025-09-29 17:39 ` Mike Snitzer
2025-10-03 7:13 ` Christoph Hellwig
2 siblings, 0 replies; 19+ messages in thread
From: Jeff Layton @ 2025-09-29 16:51 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Mon, 2025-09-29 at 11:56 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> A vfs_getattr() failure is rare but not totally impossible.
>
> There's no recovery logic in that case; nfsd_do_file_acquire()'s
> caller will fail but the wonky nfsd_file is left in the file cache.
>
> It doesn't seem necessary for nfsd_file_do_acquire() to fail
> outright if it successfully opened the file but some problem
> prevented the collection of the dio alignment parameters.
>
Agreed
> Fixes: bc70aaeba7df ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/filecache.c | 11 +++++------
> 1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 78cca0d751ac..b34cc8d2cb5e 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -1051,7 +1051,7 @@ nfsd_file_is_cached(struct inode *inode)
> return ret;
> }
>
> -static __be32
> +static void
> nfsd_file_get_dio_attrs(const struct svc_fh *fhp, struct nfsd_file *nf)
> {
> struct inode *inode = file_inode(nf->nf_file);
> @@ -1060,11 +1060,12 @@ nfsd_file_get_dio_attrs(const struct svc_fh *fhp, struct nfsd_file *nf)
>
> /* Currently only need to get DIO alignment info for regular files */
> if (!S_ISREG(inode->i_mode))
> - return nfs_ok;
> + return;
>
> status = fh_getattr(fhp, &stat);
> if (status != nfs_ok)
> - return status;
> + /* Use default dio alignment parameters (all zero) */
> + return;
>
> trace_nfsd_file_get_dio_attrs(inode, &stat);
>
> @@ -1076,8 +1077,6 @@ nfsd_file_get_dio_attrs(const struct svc_fh *fhp, struct nfsd_file *nf)
> nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
> else
> nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
> -
> - return nfs_ok;
> }
>
> static __be32
> @@ -1199,7 +1198,7 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
> status = nfserrno(ret);
> trace_nfsd_file_open(nf, status);
> if (status == nfs_ok)
> - status = nfsd_file_get_dio_attrs(fhp, nf);
> + nfsd_file_get_dio_attrs(fhp, nf);
> }
> } else
> status = nfserr_jukebox;
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v5 5/6] NFSD: Prevent a NULL pointer dereference in fh_getattr()
2025-09-29 15:56 ` [PATCH v5 5/6] NFSD: Prevent a NULL pointer dereference in fh_getattr() Chuck Lever
2025-09-29 16:13 ` Jeff Layton
@ 2025-09-29 17:39 ` Mike Snitzer
2025-10-03 7:11 ` Christoph Hellwig
2 siblings, 0 replies; 19+ messages in thread
From: Mike Snitzer @ 2025-09-29 17:39 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On Mon, Sep 29, 2025 at 11:56:45AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> In general, fh_getattr() can be called after the target dentry has
> gone negative. For a negative dentry, d_inode(p.dentry) will return
> NULL. S_ISREG() will dereference that pointer.
>
> Avoid this potential regression by using the d_is_reg() helper
> instead.
>
> Suggested-by: NeilBrown <neil@brown.name>
> Fixes: bc70aaeba7df ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/nfsfh.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index ed85dd43da18..16182936828f 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -696,10 +696,9 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
> .mnt = fhp->fh_export->ex_path.mnt,
> .dentry = fhp->fh_dentry,
> };
> - struct inode *inode = d_inode(p.dentry);
> u32 request_mask = STATX_BASIC_STATS;
>
> - if (S_ISREG(inode->i_mode))
> + if (d_is_reg(p.dentry))
> request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
>
> if (fhp->fh_maxsize == NFS4_FHSIZE)
> --
> 2.51.0
>
>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v5 6/6] NFSD: Ignore vfs_getattr() failure in nfsd_file_get_dio_attrs()
2025-09-29 15:56 ` [PATCH v5 6/6] NFSD: Ignore vfs_getattr() failure in nfsd_file_get_dio_attrs() Chuck Lever
2025-09-29 16:51 ` Jeff Layton
@ 2025-09-29 17:39 ` Mike Snitzer
2025-10-03 7:13 ` Christoph Hellwig
2 siblings, 0 replies; 19+ messages in thread
From: Mike Snitzer @ 2025-09-29 17:39 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On Mon, Sep 29, 2025 at 11:56:46AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> A vfs_getattr() failure is rare but not totally impossible.
>
> There's no recovery logic in that case; nfsd_do_file_acquire()'s
> caller will fail but the wonky nfsd_file is left in the file cache.
>
> It doesn't seem necessary for nfsd_file_do_acquire() to fail
> outright if it successfully opened the file but some problem
> prevented the collection of the dio alignment parameters.
>
> Fixes: bc70aaeba7df ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v5 1/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
2025-09-29 15:56 ` [PATCH v5 1/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
@ 2025-10-03 7:09 ` Christoph Hellwig
0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2025-10-03 7:09 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Mike Snitzer
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v5 2/6] NFSD: pass nfsd_file to nfsd_iter_read()
2025-09-29 15:56 ` [PATCH v5 2/6] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
@ 2025-10-03 7:09 ` Christoph Hellwig
0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2025-10-03 7:09 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Mike Snitzer
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v5 5/6] NFSD: Prevent a NULL pointer dereference in fh_getattr()
2025-09-29 15:56 ` [PATCH v5 5/6] NFSD: Prevent a NULL pointer dereference in fh_getattr() Chuck Lever
2025-09-29 16:13 ` Jeff Layton
2025-09-29 17:39 ` Mike Snitzer
@ 2025-10-03 7:11 ` Christoph Hellwig
2025-10-03 14:18 ` Chuck Lever
2 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-10-03 7:11 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On Mon, Sep 29, 2025 at 11:56:45AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> In general, fh_getattr() can be called after the target dentry has
> gone negative. For a negative dentry, d_inode(p.dentry) will return
> NULL. S_ISREG() will dereference that pointer.
>
> Avoid this potential regression by using the d_is_reg() helper
> instead.
>
> Suggested-by: NeilBrown <neil@brown.name>
> Fixes: bc70aaeba7df ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
Fixed for existing bugs should usually go first and/or be split into a
separate prep series.
Otherwise:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v5 6/6] NFSD: Ignore vfs_getattr() failure in nfsd_file_get_dio_attrs()
2025-09-29 15:56 ` [PATCH v5 6/6] NFSD: Ignore vfs_getattr() failure in nfsd_file_get_dio_attrs() Chuck Lever
2025-09-29 16:51 ` Jeff Layton
2025-09-29 17:39 ` Mike Snitzer
@ 2025-10-03 7:13 ` Christoph Hellwig
2025-10-03 14:23 ` Chuck Lever
2 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-10-03 7:13 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On Mon, Sep 29, 2025 at 11:56:46AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> A vfs_getattr() failure is rare but not totally impossible.
>
> There's no recovery logic in that case; nfsd_do_file_acquire()'s
> caller will fail but the wonky nfsd_file is left in the file cache.
>
> It doesn't seem necessary for nfsd_file_do_acquire() to fail
> outright if it successfully opened the file but some problem
> prevented the collection of the dio alignment parameters.
The only remotely likely case for this is a file system shutdown.
There's no real way it could fail but I/O later on will work.
I think just failing serves everyone here much better than try to
keep going.
> Fixes: bc70aaeba7df ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
Same comment about fixes being first/separate as for the previous
patch.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v5 4/6] NFSD: Implement NFSD_IO_DIRECT for NFS READ
2025-09-29 15:56 ` [PATCH v5 4/6] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
@ 2025-10-03 7:19 ` Christoph Hellwig
2025-10-03 15:06 ` Chuck Lever
0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-10-03 7:19 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever, Mike Snitzer
> +/*
> + * The byte range of the client's READ request is expanded on both
> + * ends until it meets the underlying file system's direct I/O
> + * alignment requirements. After the internal read is complete, the
> + * byte range of the NFS READ payload is reduced to the byte range
> + * that was originally requested.
> + *
> + * Note that a direct read can be done only when the xdr_buf
> + * containing the NFS READ reply does not already have contents in
> + * its .pages array. This is due to potentially restrictive
> + * alignment requirements on the read buffer. When .page_len and
> + * @base are zero, the .pages array is guaranteed to be page-
> + * aligned.
> + */
Any reason you are not using the full 80 characters available for the
comment?
> + v = 0;
> + total = dio_end - dio_start;
> + while (total && v < rqstp->rq_maxpages &&
> + rqstp->rq_next_page < rqstp->rq_page_end) {
> + len = min_t(size_t, total, PAGE_SIZE);
> + bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> + len, 0);
> +
> + total -= len;
> + ++rqstp->rq_next_page;
> + ++v;
I find this loop style very hard to follow. Why not do something like:
Why not something like:
for (v = 0; v < rqstp->rq_maxpages; v++) {
size_t len = min_t(size_t, total, PAGE_SIZE);
bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page, len, 0);
rqstp->rq_next_page++;
total -= len;
if (!total)
break;
}
?
> + case NFSD_IO_DIRECT:
> + /* When dio_read_offset_align is zero, dio is not supported */
> + if (nf->nf_dio_read_offset_align &&
> + !rqstp->rq_res.page_len)
Nit: the entire if clause fits into a single line.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v5 5/6] NFSD: Prevent a NULL pointer dereference in fh_getattr()
2025-10-03 7:11 ` Christoph Hellwig
@ 2025-10-03 14:18 ` Chuck Lever
0 siblings, 0 replies; 19+ messages in thread
From: Chuck Lever @ 2025-10-03 14:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On 10/3/25 3:11 AM, Christoph Hellwig wrote:
> On Mon, Sep 29, 2025 at 11:56:45AM -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> In general, fh_getattr() can be called after the target dentry has
>> gone negative. For a negative dentry, d_inode(p.dentry) will return
>> NULL. S_ISREG() will dereference that pointer.
>>
>> Avoid this potential regression by using the d_is_reg() helper
>> instead.
>>
>> Suggested-by: NeilBrown <neil@brown.name>
>> Fixes: bc70aaeba7df ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
>
> Fixed for existing bugs should usually go first and/or be split into a
> separate prep series.
>
> Otherwise:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
Additional context: bc70aaeba7df is the first commit in this series.
These fixes are split out because bc70aaeba7df is going through Anna's
tree for v6.18, so it's already "committed". However I can move them
to patch 2 and 3 of this series.
--
Chuck Lever
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v5 6/6] NFSD: Ignore vfs_getattr() failure in nfsd_file_get_dio_attrs()
2025-10-03 7:13 ` Christoph Hellwig
@ 2025-10-03 14:23 ` Chuck Lever
0 siblings, 0 replies; 19+ messages in thread
From: Chuck Lever @ 2025-10-03 14:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On 10/3/25 3:13 AM, Christoph Hellwig wrote:
> On Mon, Sep 29, 2025 at 11:56:46AM -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> A vfs_getattr() failure is rare but not totally impossible.
>>
>> There's no recovery logic in that case; nfsd_do_file_acquire()'s
>> caller will fail but the wonky nfsd_file is left in the file cache.
>>
>> It doesn't seem necessary for nfsd_file_do_acquire() to fail
>> outright if it successfully opened the file but some problem
>> prevented the collection of the dio alignment parameters.
>
> The only remotely likely case for this is a file system shutdown.
> There's no real way it could fail but I/O later on will work.
>
> I think just failing serves everyone here much better than try to
> keep going.
Got it.
IMO then more recovery logic is needed in do_file_acquire. We can't then
just leave the moribund nfsd_file in the filecache unless there is some
guarantee that it will not be leaked.
>> Fixes: bc70aaeba7df ("NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support")
>
> Same comment about fixes being first/separate as for the previous
> patch.
>
--
Chuck Lever
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v5 4/6] NFSD: Implement NFSD_IO_DIRECT for NFS READ
2025-10-03 7:19 ` Christoph Hellwig
@ 2025-10-03 15:06 ` Chuck Lever
0 siblings, 0 replies; 19+ messages in thread
From: Chuck Lever @ 2025-10-03 15:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever, Mike Snitzer
On 10/3/25 3:19 AM, Christoph Hellwig wrote:
>> +/*
>> + * The byte range of the client's READ request is expanded on both
>> + * ends until it meets the underlying file system's direct I/O
>> + * alignment requirements. After the internal read is complete, the
>> + * byte range of the NFS READ payload is reduced to the byte range
>> + * that was originally requested.
>> + *
>> + * Note that a direct read can be done only when the xdr_buf
>> + * containing the NFS READ reply does not already have contents in
>> + * its .pages array. This is due to potentially restrictive
>> + * alignment requirements on the read buffer. When .page_len and
>> + * @base are zero, the .pages array is guaranteed to be page-
>> + * aligned.
>> + */
>
> Any reason you are not using the full 80 characters available for the
> comment?
Yep: I find comment blocks that fill the whole line more difficult
to read because my eyes are gray and bent. :-)
>> + v = 0;
>> + total = dio_end - dio_start;
>> + while (total && v < rqstp->rq_maxpages &&
>> + rqstp->rq_next_page < rqstp->rq_page_end) {
>> + len = min_t(size_t, total, PAGE_SIZE);
>> + bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
>> + len, 0);
>> +
>> + total -= len;
>> + ++rqstp->rq_next_page;
>> + ++v;
>
> I find this loop style very hard to follow. Why not do something like:
>
> Why not something like:
>
> for (v = 0; v < rqstp->rq_maxpages; v++) {
> size_t len = min_t(size_t, total, PAGE_SIZE);
>
> bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page, len, 0);
> rqstp->rq_next_page++;
> total -= len;
> if (!total)
> break;
> }
>
> ?
We copied this loop from nfsd_iter_read(), where we just reworked it
because there are several conditions that need to terminate the loop:
- There are no more bytes to process
- There are no more pages available in rq_pages
- There are no more entries in rq_bvec
Your proposed rewrite drops the array bounds checking that we just
added, IIUC. Neil doesn't like adding termination conditions at the end
of a loop.
I'm open to utilizing for() instead of while() but I didn't find that
writing it as for() improved legibility.
(Yes this is duplicate code, which I'm OK with until we have optimized
direct reads properly and have nailed down exactly when NFSD will employ
them).
>> + case NFSD_IO_DIRECT:
>> + /* When dio_read_offset_align is zero, dio is not supported */
>> + if (nf->nf_dio_read_offset_align &&
>> + !rqstp->rq_res.page_len)
>
> Nit: the entire if clause fits into a single line.
>
I'll fix that; an additional check was removed during the last
revision, and that shortened the second line.
--
Chuck Lever
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2025-10-03 15:06 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-29 15:56 [PATCH v5 0/6] NFSD direct I/O read Chuck Lever
2025-09-29 15:56 ` [PATCH v5 1/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
2025-10-03 7:09 ` Christoph Hellwig
2025-09-29 15:56 ` [PATCH v5 2/6] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
2025-10-03 7:09 ` Christoph Hellwig
2025-09-29 15:56 ` [PATCH v5 3/6] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
2025-09-29 15:56 ` [PATCH v5 4/6] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
2025-10-03 7:19 ` Christoph Hellwig
2025-10-03 15:06 ` Chuck Lever
2025-09-29 15:56 ` [PATCH v5 5/6] NFSD: Prevent a NULL pointer dereference in fh_getattr() Chuck Lever
2025-09-29 16:13 ` Jeff Layton
2025-09-29 17:39 ` Mike Snitzer
2025-10-03 7:11 ` Christoph Hellwig
2025-10-03 14:18 ` Chuck Lever
2025-09-29 15:56 ` [PATCH v5 6/6] NFSD: Ignore vfs_getattr() failure in nfsd_file_get_dio_attrs() Chuck Lever
2025-09-29 16:51 ` Jeff Layton
2025-09-29 17:39 ` Mike Snitzer
2025-10-03 7:13 ` Christoph Hellwig
2025-10-03 14:23 ` Chuck Lever
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox