* [PATCH v8 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes
@ 2025-08-26 18:57 Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
` (6 more replies)
0 siblings, 7 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-26 18:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Hi,
Some workloads benefit from NFSD avoiding the page cache, particularly
those with a working set that is significantly larger than available
system memory. This patchset introduces _optional_ support for using
O_DIRECT or DONTCACHE for NFSD's READ and WRITE paths. NFSD's default
of using the page cache is left unchanged.
The performance win associated with using NFSD DIRECT was previously
summarized here:
https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
This picture offers a nice summary of performance gains:
https://original.art/NFSD_direct_vs_buffered_IO.jpg
This series builds on what has been staged in the nfsd-testing branch.
This code has proven to work well during my testing. Any suggestions
for further refinement are welcome.
Thanks,
Mike
Changes since v7:
- Add Jeff's Reviewed-by to patches 3, 4 and 5.
- Use IOCB_SYNC for all buffered WRITEs if NFSD is using NFSD_IO_DIRECT
- Fix compiler warning in trace.h: %zd must be used for ssize_t in TP_printk
Changes since v6:
- Reinstate use of iov_iter_aligned_bvec(), via the local helper
nfsd_iov_iter_aligned_bvec(); otherwise the underlying filesystem could
be sent misaligned DIO, which it would reject with -EINVAL.
- Add WARN_ON_ONCE if NFSD_IO_DIRECT is enabled and the underlying
filesystem returns -EINVAL (this shouldn't happen, so it's best to be
loud if it does)
Earlier changelog was provided in v6's 0th patch header, see:
https://lore.kernel.org/linux-nfs/20250809050257.27355-1-snitzer@kernel.org/
Mike Snitzer (7):
NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
NFSD: pass nfsd_file to nfsd_iter_read()
NFSD: add io_cache_read controls to debugfs interface
NFSD: add io_cache_write controls to debugfs interface
NFSD: issue READs using O_DIRECT even if IO is misaligned
NFSD: issue WRITEs using O_DIRECT even if IO is misaligned
NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events
fs/nfsd/debugfs.c | 100 +++++++++
fs/nfsd/filecache.c | 32 +++
fs/nfsd/filecache.h | 4 +
fs/nfsd/nfs4xdr.c | 8 +-
fs/nfsd/nfsd.h | 10 +
fs/nfsd/nfsfh.c | 4 +
fs/nfsd/trace.h | 61 ++++++
fs/nfsd/vfs.c | 413 +++++++++++++++++++++++++++++++++++--
fs/nfsd/vfs.h | 2 +-
include/linux/sunrpc/svc.h | 5 +-
10 files changed, 621 insertions(+), 18 deletions(-)
--
2.44.0
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH v8 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
2025-08-26 18:57 [PATCH v8 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
@ 2025-08-26 18:57 ` Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
` (5 subsequent siblings)
6 siblings, 0 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-26 18:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get DIO alignment
attributes from the underlying filesystem and store them in the
associated nfsd_file. This is done when the nfsd_file is first opened
for a regular file.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/filecache.c | 32 ++++++++++++++++++++++++++++++++
fs/nfsd/filecache.h | 4 ++++
fs/nfsd/nfsfh.c | 4 ++++
3 files changed, 40 insertions(+)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 8581c131338b8..5447dba6c5da0 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -231,6 +231,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
refcount_set(&nf->nf_ref, 1);
nf->nf_may = need;
nf->nf_mark = NULL;
+ nf->nf_dio_mem_align = 0;
+ nf->nf_dio_offset_align = 0;
+ nf->nf_dio_read_offset_align = 0;
return nf;
}
@@ -1048,6 +1051,33 @@ nfsd_file_is_cached(struct inode *inode)
return ret;
}
+static __be32
+nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
+{
+ struct inode *inode = file_inode(nf->nf_file);
+ struct kstat stat;
+ __be32 status;
+
+ /* Currently only need to get DIO alignment info for regular files */
+ if (!S_ISREG(inode->i_mode))
+ return nfs_ok;
+
+ status = fh_getattr(fhp, &stat);
+ if (status != nfs_ok)
+ return status;
+
+ if (stat.result_mask & STATX_DIOALIGN) {
+ nf->nf_dio_mem_align = stat.dio_mem_align;
+ nf->nf_dio_offset_align = stat.dio_offset_align;
+ }
+ if (stat.result_mask & STATX_DIO_READ_ALIGN)
+ nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
+ else
+ nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
+
+ return status;
+}
+
static __be32
nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
struct svc_cred *cred,
@@ -1166,6 +1196,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
}
status = nfserrno(ret);
trace_nfsd_file_open(nf, status);
+ if (status == nfs_ok)
+ status = nfsd_file_getattr(fhp, nf);
}
} else
status = nfserr_jukebox;
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index 24ddf60e8434a..e3d6ca2b60308 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -54,6 +54,10 @@ struct nfsd_file {
struct list_head nf_gc;
struct rcu_head nf_rcu;
ktime_t nf_birthtime;
+
+ u32 nf_dio_mem_align;
+ u32 nf_dio_offset_align;
+ u32 nf_dio_read_offset_align;
};
int nfsd_file_cache_init(void);
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index f4a3cc9e31e05..bdba2ba828a6a 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -677,8 +677,12 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
.mnt = fhp->fh_export->ex_path.mnt,
.dentry = fhp->fh_dentry,
};
+ struct inode *inode = d_inode(p.dentry);
u32 request_mask = STATX_BASIC_STATS;
+ if (S_ISREG(inode->i_mode))
+ request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
+
if (fhp->fh_maxsize == NFS4_FHSIZE)
request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
--
2.44.0
* [PATCH v8 2/7] NFSD: pass nfsd_file to nfsd_iter_read()
2025-08-26 18:57 [PATCH v8 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
@ 2025-08-26 18:57 ` Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
` (4 subsequent siblings)
6 siblings, 0 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-26 18:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Prepares for nfsd_iter_read() to use DIO alignment stored in nfsd_file.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4xdr.c | 8 ++++----
fs/nfsd/vfs.c | 7 ++++---
fs/nfsd/vfs.h | 2 +-
3 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7d19925f46e45..d519f4156cfad 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4464,7 +4464,7 @@ static __be32 nfsd4_encode_splice_read(
static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
struct nfsd4_read *read,
- struct file *file, unsigned long maxcount)
+ unsigned long maxcount)
{
struct xdr_stream *xdr = resp->xdr;
unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
@@ -4475,7 +4475,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
if (xdr_reserve_space_vec(xdr, maxcount) < 0)
return nfserr_resource;
- nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
+ nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
read->rd_offset, &maxcount, base,
&read->rd_eof);
read->rd_length = maxcount;
@@ -4522,7 +4522,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr) {
xdr_truncate_encode(xdr, eof_offset);
return nfserr;
@@ -5418,7 +5418,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr)
return nfserr;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0c0f25b2c8e38..79439ad93880a 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1075,7 +1075,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
* @fhp: file handle of file to be read
- * @file: opened struct file of file to be read
+ * @nf: opened struct nfsd_file of file to be read
* @offset: starting byte offset
* @count: IN: requested number of bytes; OUT: number of bytes read
* @base: offset in first page of read buffer
@@ -1088,9 +1088,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* returned.
*/
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset, unsigned long *count,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
unsigned int base, u32 *eof)
{
+ struct file *file = nf->nf_file;
unsigned long v, total;
struct iov_iter iter;
struct kiocb kiocb;
@@ -1312,7 +1313,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
else
- err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
+ err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
nfsd_file_put(nf);
trace_nfsd_read_done(rqstp, fhp, offset, *count);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 0c0292611c6de..fa46f8b5f1320 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -121,7 +121,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned long *count,
u32 *eof);
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset,
+ struct nfsd_file *nf, loff_t offset,
unsigned long *count, unsigned int base,
u32 *eof);
bool nfsd_read_splice_ok(struct svc_rqst *rqstp);
--
2.44.0
* [PATCH v8 3/7] NFSD: add io_cache_read controls to debugfs interface
2025-08-26 18:57 [PATCH v8 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
@ 2025-08-26 18:57 ` Mike Snitzer
2025-09-03 14:38 ` Chuck Lever
2025-08-26 18:57 ` [PATCH v8 4/7] NFSD: add io_cache_write " Mike Snitzer
` (3 subsequent siblings)
6 siblings, 1 reply; 42+ messages in thread
From: Mike Snitzer @ 2025-08-26 18:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Add 'io_cache_read' to NFSD's debugfs interface so that any data read
by NFSD is either:
- cached using the page cache (NFSD_IO_BUFFERED=1)
- cached, but removed from the page cache upon completion
(NFSD_IO_DONTCACHE=2)
- not cached (NFSD_IO_DIRECT=3)
io_cache_read may be set by writing to:
/sys/kernel/debug/nfsd/io_cache_read
If NFSD_IO_DONTCACHE is specified using 2, the underlying filesystem
(e.g. XFS) must advertise FOP_DONTCACHE support; otherwise all IO
flagged with RWF_DONTCACHE will fail with -EOPNOTSUPP.
If NFSD_IO_DIRECT is specified using 3, the IO must be aligned relative
to the underlying block device's logical_block_size, and the memory
buffer used to store the READ payload must be aligned relative to the
underlying block device's dma_alignment.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/debugfs.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfsd.h | 9 ++++++++
fs/nfsd/vfs.c | 18 +++++++++++++++
3 files changed, 84 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 84b0c8b559dc9..3cadd45868b48 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -27,11 +27,65 @@ static int nfsd_dsr_get(void *data, u64 *val)
static int nfsd_dsr_set(void *data, u64 val)
{
nfsd_disable_splice_read = (val > 0) ? true : false;
+ if (!nfsd_disable_splice_read) {
+ /*
+ * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
+ * if splice_read is enabled.
+ */
+ nfsd_io_cache_read = NFSD_IO_BUFFERED;
+ }
return 0;
}
DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
+/*
+ * /sys/kernel/debug/nfsd/io_cache_read
+ *
+ * Contents:
+ * %1: NFS READ will use buffered IO
+ * %2: NFS READ will use dontcache (buffered IO w/ dropbehind)
+ * %3: NFS READ will use direct IO
+ *
+ * The default value of this setting is zero (UNSPECIFIED).
+ * This setting takes immediate effect for all NFS versions,
+ * all exports, and in all NFSD net namespaces.
+ */
+
+static int nfsd_io_cache_read_get(void *data, u64 *val)
+{
+ *val = nfsd_io_cache_read;
+ return 0;
+}
+
+static int nfsd_io_cache_read_set(void *data, u64 val)
+{
+ int ret = 0;
+
+ switch (val) {
+ case NFSD_IO_BUFFERED:
+ nfsd_io_cache_read = NFSD_IO_BUFFERED;
+ break;
+ case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
+ /*
+ * Must disable splice_read when enabling
+ * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
+ */
+ nfsd_disable_splice_read = true;
+ nfsd_io_cache_read = val;
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
+ nfsd_io_cache_read_set, "%llu\n");
+
void nfsd_debugfs_exit(void)
{
debugfs_remove_recursive(nfsd_top_dir);
@@ -44,4 +98,7 @@ void nfsd_debugfs_init(void)
debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
nfsd_top_dir, NULL, &nfsd_dsr_fops);
+
+ debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
+ nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
}
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 1cd0bed57bc2f..6ef799405145f 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -153,6 +153,15 @@ static inline void nfsd_debugfs_exit(void) {}
extern bool nfsd_disable_splice_read __read_mostly;
+enum {
+ NFSD_IO_UNSPECIFIED = 0,
+ NFSD_IO_BUFFERED,
+ NFSD_IO_DONTCACHE,
+ NFSD_IO_DIRECT,
+};
+
+extern u64 nfsd_io_cache_read __read_mostly;
+
extern int nfsd_max_blksize;
static inline int nfsd_v4client(struct svc_rqst *rq)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 79439ad93880a..8ea8b80097195 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -49,6 +49,7 @@
#define NFSDDBG_FACILITY NFSDDBG_FILEOP
bool nfsd_disable_splice_read __read_mostly;
+u64 nfsd_io_cache_read __read_mostly;
/**
* nfserrno - Map Linux errnos to NFS errnos
@@ -1099,6 +1100,23 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
size_t len;
init_sync_kiocb(&kiocb, file);
+
+ switch (nfsd_io_cache_read) {
+ case NFSD_IO_DIRECT:
+ /* Verify ondisk and memory DIO alignment */
+ if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
+ (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0) &&
+ (base & (nf->nf_dio_mem_align - 1)) == 0)
+ kiocb.ki_flags = IOCB_DIRECT;
+ break;
+ case NFSD_IO_DONTCACHE:
+ kiocb.ki_flags = IOCB_DONTCACHE;
+ fallthrough;
+ case NFSD_IO_UNSPECIFIED:
+ case NFSD_IO_BUFFERED:
+ break;
+ }
+
kiocb.ki_pos = offset;
v = 0;
--
2.44.0
* [PATCH v8 4/7] NFSD: add io_cache_write controls to debugfs interface
2025-08-26 18:57 [PATCH v8 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
` (2 preceding siblings ...)
2025-08-26 18:57 ` [PATCH v8 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
@ 2025-08-26 18:57 ` Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
` (2 subsequent siblings)
6 siblings, 0 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-26 18:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Add 'io_cache_write' to NFSD's debugfs interface so that any data
written by NFSD is either:
- cached using the page cache (NFSD_IO_BUFFERED=1)
- cached, but removed from the page cache upon completion
(NFSD_IO_DONTCACHE=2)
- not cached (NFSD_IO_DIRECT=3)
io_cache_write may be set by writing to:
/sys/kernel/debug/nfsd/io_cache_write
If NFSD_IO_DONTCACHE is specified using 2, the underlying filesystem
(e.g. XFS) must advertise FOP_DONTCACHE support; otherwise all IO
flagged with RWF_DONTCACHE will fail with -EOPNOTSUPP.
If NFSD_IO_DIRECT is specified using 3, the IO must be aligned relative
to the underlying block device's logical_block_size, and the memory
buffer used to store the WRITE payload must be aligned relative to the
underlying block device's dma_alignment.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/debugfs.c | 43 +++++++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfsd.h | 1 +
fs/nfsd/vfs.c | 16 ++++++++++++++++
3 files changed, 60 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 3cadd45868b48..8878c3519b30c 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -86,6 +86,46 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
nfsd_io_cache_read_set, "%llu\n");
+/*
+ * /sys/kernel/debug/nfsd/io_cache_write
+ *
+ * Contents:
+ * %1: NFS WRITE will use buffered IO
+ * %2: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
+ * %3: NFS WRITE will use direct IO
+ *
+ * The default value of this setting is zero (UNSPECIFIED).
+ * This setting takes immediate effect for all NFS versions,
+ * all exports, and in all NFSD net namespaces.
+ */
+
+static int nfsd_io_cache_write_get(void *data, u64 *val)
+{
+ *val = nfsd_io_cache_write;
+ return 0;
+}
+
+static int nfsd_io_cache_write_set(void *data, u64 val)
+{
+ int ret = 0;
+
+ switch (val) {
+ case NFSD_IO_BUFFERED:
+ case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
+ nfsd_io_cache_write = val;
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_write_fops, nfsd_io_cache_write_get,
+ nfsd_io_cache_write_set, "%llu\n");
+
void nfsd_debugfs_exit(void)
{
debugfs_remove_recursive(nfsd_top_dir);
@@ -101,4 +141,7 @@ void nfsd_debugfs_init(void)
debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
+
+ debugfs_create_file("io_cache_write", S_IWUSR | S_IRUGO,
+ nfsd_top_dir, NULL, &nfsd_io_cache_write_fops);
}
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 6ef799405145f..fe935b4cda538 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -161,6 +161,7 @@ enum {
};
extern u64 nfsd_io_cache_read __read_mostly;
+extern u64 nfsd_io_cache_write __read_mostly;
extern int nfsd_max_blksize;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 8ea8b80097195..c340708fbab4d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -50,6 +50,7 @@
bool nfsd_disable_splice_read __read_mostly;
u64 nfsd_io_cache_read __read_mostly;
+u64 nfsd_io_cache_write __read_mostly;
/**
* nfserrno - Map Linux errnos to NFS errnos
@@ -1242,6 +1243,21 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
+
+ switch (nfsd_io_cache_write) {
+ case NFSD_IO_DIRECT:
+ /* direct I/O must be aligned to device logical sector size */
+ if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
+ (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0))
+ kiocb.ki_flags |= IOCB_DIRECT;
+ break;
+ case NFSD_IO_DONTCACHE:
+ kiocb.ki_flags |= IOCB_DONTCACHE;
+ fallthrough;
+ case NFSD_IO_UNSPECIFIED:
+ case NFSD_IO_BUFFERED:
+ break;
+ }
host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
if (host_err < 0) {
commit_reset_write_verifier(nn, rqstp, host_err);
--
2.44.0
* [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-26 18:57 [PATCH v8 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
` (3 preceding siblings ...)
2025-08-26 18:57 ` [PATCH v8 4/7] NFSD: add io_cache_write " Mike Snitzer
@ 2025-08-26 18:57 ` Mike Snitzer
2025-08-27 15:34 ` Chuck Lever
2025-08-26 18:57 ` [PATCH v8 6/7] NFSD: issue WRITEs " Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 7/7] NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events Mike Snitzer
6 siblings, 1 reply; 42+ messages in thread
From: Mike Snitzer @ 2025-08-26 18:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
DIO-aligned block (on either end of the READ). The expanded READ is
verified to have a properly aligned offset/len (logical_block_size)
and memory buffer (dma_alignment).
A bounce-buffer page (called 'start_extra_page') must be allocated and
used whenever expanding a misaligned READ requires reading an extra
partial page at the start of the READ so that it is DIO-aligned.
Otherwise that extra data at the start would make its way back to the
NFS client and cause corruption. The corruption was found, and the fix
of using an extra page verified, using the 'dt' utility:
dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
iotype=sequential pattern=iot onerr=abort oncerr=abort
see: https://github.com/RobinTMiller/dt.git
Any misaligned READ that is less than 32K won't be expanded to be
DIO-aligned (this heuristic avoids excess work, like allocating
start_extra_page, for smaller IO that generally already performs well
using buffered IO).
Suggested-by: Jeff Layton <jlayton@kernel.org>
Suggested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/vfs.c | 200 +++++++++++++++++++++++++++++++++++--
include/linux/sunrpc/svc.h | 5 +-
2 files changed, 194 insertions(+), 11 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index c340708fbab4d..64732dc8985d6 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -19,6 +19,7 @@
#include <linux/splice.h>
#include <linux/falloc.h>
#include <linux/fcntl.h>
+#include <linux/math.h>
#include <linux/namei.h>
#include <linux/delay.h>
#include <linux/fsnotify.h>
@@ -1073,6 +1074,153 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
+struct nfsd_read_dio {
+ loff_t start;
+ loff_t end;
+ unsigned long start_extra;
+ unsigned long end_extra;
+ struct page *start_extra_page;
+};
+
+static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
+{
+ memset(read_dio, 0, sizeof(*read_dio));
+ read_dio->start_extra_page = NULL;
+}
+
+#define NFSD_READ_DIO_MIN_KB (32 << 10)
+
+static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, loff_t offset,
+ unsigned long len, unsigned int base,
+ struct nfsd_read_dio *read_dio)
+{
+ const u32 dio_blocksize = nf->nf_dio_read_offset_align;
+ loff_t middle_end, orig_end = offset + len;
+
+ if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
+ "%s: underlying filesystem has not provided DIO alignment info\n",
+ __func__))
+ return false;
+ if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
+ "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
+ __func__, dio_blocksize, PAGE_SIZE))
+ return false;
+
+ /* Return early if IO is irreparably misaligned (len < dio_blocksize,
+ * or base not aligned).
+ * Ondisk alignment is implied by the following code that expands
+ * misaligned IO to have a DIO-aligned offset and len.
+ */
+ if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
+ return false;
+
+ init_nfsd_read_dio(read_dio);
+
+ read_dio->start = round_down(offset, dio_blocksize);
+ read_dio->end = round_up(orig_end, dio_blocksize);
+ read_dio->start_extra = offset - read_dio->start;
+ read_dio->end_extra = read_dio->end - orig_end;
+
+ /*
+ * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
+ * to be DIO-aligned (this heuristic avoids excess work, like allocating
+ * start_extra_page, for smaller IO that can generally already perform
+ * well using buffered IO).
+ */
+ if ((read_dio->start_extra || read_dio->end_extra) &&
+ (len < NFSD_READ_DIO_MIN_KB)) {
+ init_nfsd_read_dio(read_dio);
+ return false;
+ }
+
+ if (read_dio->start_extra) {
+ read_dio->start_extra_page = alloc_page(GFP_KERNEL);
+ if (WARN_ONCE(read_dio->start_extra_page == NULL,
+ "%s: Unable to allocate start_extra_page\n", __func__)) {
+ init_nfsd_read_dio(read_dio);
+ return false;
+ }
+ }
+
+ return true;
+}
+
+static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
+ struct nfsd_read_dio *read_dio,
+ ssize_t bytes_read,
+ unsigned long bytes_expected,
+ loff_t *offset,
+ unsigned long *rq_bvec_numpages)
+{
+ ssize_t host_err = bytes_read;
+ loff_t v;
+
+ if (!read_dio->start_extra && !read_dio->end_extra)
+ return host_err;
+
+ /* If nfsd_analyze_read_dio() allocated a start_extra_page it must
+ * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
+ */
+ if (read_dio->start_extra_page) {
+ __free_page(read_dio->start_extra_page);
+ *rq_bvec_numpages -= 1;
+ v = *rq_bvec_numpages;
+ memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
+ v * sizeof(struct bio_vec));
+ }
+ /* Eliminate any end_extra bytes from the last page */
+ v = *rq_bvec_numpages;
+ rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
+
+ if (host_err < 0) {
+ /* Underlying FS will return -EINVAL if misaligned
+ * DIO is attempted because it shouldn't be.
+ */
+ WARN_ON_ONCE(host_err == -EINVAL);
+ return host_err;
+ }
+
+ /* nfsd_analyze_read_dio() may have expanded the start and end,
+ * if so adjust returned read size to reflect original extent.
+ */
+ *offset += read_dio->start_extra;
+ if (likely(host_err >= read_dio->start_extra)) {
+ host_err -= read_dio->start_extra;
+ if (host_err > bytes_expected)
+ host_err = bytes_expected;
+ } else {
+ /* Short read that didn't read any of requested data */
+ host_err = 0;
+ }
+
+ return host_err;
+}
+
+static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
+ unsigned addr_mask, unsigned len_mask)
+{
+ const struct bio_vec *bvec = i->bvec;
+ unsigned skip = i->iov_offset;
+ size_t size = i->count;
+
+ if (size & len_mask)
+ return false;
+ do {
+ size_t len = bvec->bv_len;
+
+ if (len > size)
+ len = size;
+ if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
+ return false;
+ bvec++;
+ size -= len;
+ skip = 0;
+ } while (size);
+
+ return true;
+}
+
/**
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
@@ -1094,7 +1242,8 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned int base, u32 *eof)
{
struct file *file = nf->nf_file;
- unsigned long v, total;
+ unsigned long v, total, in_count = *count;
+ struct nfsd_read_dio read_dio;
struct iov_iter iter;
struct kiocb kiocb;
ssize_t host_err;
@@ -1102,13 +1251,34 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
init_sync_kiocb(&kiocb, file);
+ v = 0;
+ total = in_count;
+
switch (nfsd_io_cache_read) {
case NFSD_IO_DIRECT:
- /* Verify ondisk and memory DIO alignment */
- if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
- (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0) &&
- (base & (nf->nf_dio_mem_align - 1)) == 0)
- kiocb.ki_flags = IOCB_DIRECT;
+ /*
+ * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
+ * the next DIO-aligned block (on either end of the READ).
+ */
+ if (nfsd_analyze_read_dio(rqstp, fhp, nf, offset,
+ in_count, base, &read_dio)) {
+ /* trace_nfsd_read_vector() will reflect larger
+ * DIO-aligned READ.
+ */
+ offset = read_dio.start;
+ in_count = read_dio.end - offset;
+ total = in_count;
+
+ kiocb.ki_flags |= IOCB_DIRECT;
+ if (read_dio.start_extra) {
+ len = read_dio.start_extra;
+ bvec_set_page(&rqstp->rq_bvec[v],
+ read_dio.start_extra_page,
+ len, PAGE_SIZE - len);
+ total -= len;
+ ++v;
+ }
+ }
break;
case NFSD_IO_DONTCACHE:
kiocb.ki_flags = IOCB_DONTCACHE;
@@ -1120,8 +1290,6 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
kiocb.ki_pos = offset;
- v = 0;
- total = *count;
while (total) {
len = min_t(size_t, total, PAGE_SIZE - base);
bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
@@ -1132,9 +1300,21 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
WARN_ON_ONCE(v > rqstp->rq_maxpages);
- trace_nfsd_read_vector(rqstp, fhp, offset, *count);
- iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
+ trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
+ iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
+
+ if ((kiocb.ki_flags & IOCB_DIRECT) &&
+ !nfsd_iov_iter_aligned_bvec(&iter, nf->nf_dio_mem_align-1,
+ nf->nf_dio_read_offset_align-1))
+ kiocb.ki_flags &= ~IOCB_DIRECT;
+
host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
+
+ if (in_count != *count) {
+ /* misaligned DIO expanded read to be DIO-aligned */
+ host_err = nfsd_complete_misaligned_read_dio(rqstp, &read_dio,
+ host_err, *count, &offset, &v);
+ }
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index e64ab444e0a7f..190c2667500e2 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
* pages, one for the request, and one for the reply.
* nfsd_splice_actor() might need an extra page when a READ payload
* is not page-aligned.
+ * nfsd_iter_read() might need two extra pages when a READ payload
+ * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
+ * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
*/
static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
{
- return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
+ return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
}
/*
--
2.44.0
* [PATCH v8 6/7] NFSD: issue WRITEs using O_DIRECT even if IO is misaligned
2025-08-26 18:57 [PATCH v8 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
` (4 preceding siblings ...)
2025-08-26 18:57 ` [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
@ 2025-08-26 18:57 ` Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 7/7] NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events Mike Snitzer
6 siblings, 0 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-26 18:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
If NFSD_IO_DIRECT is used, split any misaligned WRITE into start,
middle and end extents as needed. The large middle extent is
DIO-aligned, while the start and/or end extents are misaligned.
Buffered IO is used for the misaligned extents and O_DIRECT is used
for the DIO-aligned middle extent.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/vfs.c | 181 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 173 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 64732dc8985d6..967ca67f197fc 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1355,6 +1355,174 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
+struct nfsd_write_dio {
+ loff_t middle_offset; /* Offset for start of DIO-aligned middle */
+ loff_t end_offset; /* Offset for start of DIO-aligned end */
+ ssize_t start_len; /* Length for misaligned first extent */
+ ssize_t middle_len; /* Length for DIO-aligned middle extent */
+ ssize_t end_len; /* Length for misaligned last extent */
+};
+
+static bool
+nfsd_analyze_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, loff_t offset,
+ unsigned long len, struct nfsd_write_dio *write_dio)
+{
+ const u32 dio_blocksize = nf->nf_dio_offset_align;
+ loff_t orig_end, middle_end, start_end, start_offset = offset;
+ ssize_t start_len = len;
+
+ if (WARN_ONCE(!nf->nf_dio_mem_align || !dio_blocksize,
+ "%s: underlying filesystem has not provided DIO alignment info\n",
+ __func__))
+ return false;
+ if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
+ "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
+ __func__, dio_blocksize, PAGE_SIZE))
+ return false;
+ if (unlikely(len < dio_blocksize))
+ return false;
+
+ memset(write_dio, 0, sizeof(*write_dio));
+
+ if (((offset | len) & (dio_blocksize-1)) == 0) {
+ /* already DIO-aligned, no misaligned head or tail */
+ write_dio->middle_offset = offset;
+ write_dio->middle_len = len;
+ /* clear these for the benefit of trace_nfsd_analyze_write_dio */
+ start_offset = 0;
+ start_len = 0;
+ return true;
+ }
+
+ start_end = round_up(offset, dio_blocksize);
+ start_len = start_end - offset;
+ orig_end = offset + len;
+ middle_end = round_down(orig_end, dio_blocksize);
+
+ write_dio->start_len = start_len;
+ write_dio->middle_offset = start_end;
+ write_dio->middle_len = middle_end - start_end;
+ write_dio->end_offset = middle_end;
+ write_dio->end_len = orig_end - middle_end;
+
+ return true;
+}
+
+/*
+ * Set up as many as 3 iov_iters based on the extents described by @write_dio.
+ * @iterp: pointer to pointer to onstack array of 3 iov_iter structs from caller.
+ * @iter_is_dio_aligned: pointer to onstack array of 3 bools from caller.
+ * @rq_bvec: backing bio_vec used to setup all 3 iov_iter permutations.
+ * @nvecs: number of segments in @rq_bvec
+ * @cnt: size of the request in bytes
+ * @write_dio: nfsd_write_dio struct that describes start, middle and end extents.
+ *
+ * Returns the number of iov_iters that were set up.
+ */
+static int
+nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
+ struct bio_vec *rq_bvec, unsigned int nvecs,
+ unsigned long cnt, struct nfsd_write_dio *write_dio)
+{
+ int n_iters = 0;
+ struct iov_iter *iters = *iterp;
+
+ /* Setup misaligned start? */
+ if (write_dio->start_len) {
+ iter_is_dio_aligned[n_iters] = false;
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ iters[n_iters].count = write_dio->start_len;
+ ++n_iters;
+ }
+
+ /* Setup DIO-aligned middle */
+ iter_is_dio_aligned[n_iters] = true;
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ if (write_dio->start_len)
+ iov_iter_advance(&iters[n_iters], write_dio->start_len);
+ iters[n_iters].count -= write_dio->end_len;
+ ++n_iters;
+
+ /* Setup misaligned end? */
+ if (write_dio->end_len) {
+ iter_is_dio_aligned[n_iters] = false;
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ iov_iter_advance(&iters[n_iters],
+ write_dio->start_len + write_dio->middle_len);
+ ++n_iters;
+ }
+
+ return n_iters;
+}
+
+static int
+nfsd_issue_write_buffered(struct svc_rqst *rqstp, struct file *file,
+ unsigned int nvecs, unsigned long *cnt,
+ struct kiocb *kiocb)
+{
+ struct iov_iter iter;
+ int host_err;
+
+ iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+ host_err = vfs_iocb_iter_write(file, kiocb, &iter);
+ if (host_err < 0)
+ return host_err;
+ *cnt = host_err;
+
+ return 0;
+}
+
+static noinline int
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, loff_t offset,
+ unsigned int nvecs, unsigned long *cnt,
+ struct kiocb *kiocb)
+{
+ struct nfsd_write_dio write_dio;
+ struct file *file = nf->nf_file;
+
+ /* Any buffered IO issued here will be misaligned; use
+ * IOCB_SYNC to ensure it has completed before returning.
+ */
+ kiocb->ki_flags |= IOCB_SYNC;
+
+ if (!nfsd_analyze_write_dio(rqstp, fhp, nf, offset, *cnt, &write_dio))
+ return nfsd_issue_write_buffered(rqstp, file, nvecs, cnt, kiocb);
+ else {
+ bool iter_is_dio_aligned[3];
+ struct iov_iter iter_stack[3];
+ struct iov_iter *iter = iter_stack;
+ unsigned int n_iters = 0;
+ int host_err;
+
+ n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
+ rqstp->rq_bvec, nvecs, *cnt, &write_dio);
+ *cnt = 0;
+ for (int i = 0; i < n_iters; i++) {
+ if (iter_is_dio_aligned[i] &&
+ nfsd_iov_iter_aligned_bvec(&iter[i], nf->nf_dio_mem_align-1,
+ nf->nf_dio_offset_align-1))
+ kiocb->ki_flags |= IOCB_DIRECT;
+ else
+ kiocb->ki_flags &= ~IOCB_DIRECT;
+ host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
+ if (host_err < 0) {
+ /* Underlying FS will return -EINVAL if misaligned
+ * DIO is attempted because it shouldn't be.
+ */
+ WARN_ON_ONCE(host_err == -EINVAL);
+ return host_err;
+ }
+ *cnt += host_err;
+ if (host_err < iter[i].count) /* partial write? */
+ return *cnt;
+ }
+ }
+
+ return 0;
+}
+
/**
* nfsd_vfs_write - write data to an already-open file
* @rqstp: RPC execution context
@@ -1382,7 +1550,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct super_block *sb = file_inode(file)->i_sb;
struct kiocb kiocb;
struct svc_export *exp;
- struct iov_iter iter;
errseq_t since;
__be32 nfserr;
int host_err;
@@ -1419,31 +1586,29 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
kiocb.ki_flags |= IOCB_DSYNC;
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
- iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
switch (nfsd_io_cache_write) {
case NFSD_IO_DIRECT:
- /* direct I/O must be aligned to device logical sector size */
- if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
- (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0))
- kiocb.ki_flags |= IOCB_DIRECT;
+ host_err = nfsd_issue_write_dio(rqstp, fhp, nf, offset,
+ nvecs, cnt, &kiocb);
break;
case NFSD_IO_DONTCACHE:
kiocb.ki_flags |= IOCB_DONTCACHE;
fallthrough;
case NFSD_IO_UNSPECIFIED:
case NFSD_IO_BUFFERED:
+ host_err = nfsd_issue_write_buffered(rqstp, file,
+ nvecs, cnt, &kiocb);
break;
}
- host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
if (host_err < 0) {
commit_reset_write_verifier(nn, rqstp, host_err);
goto out_nfserr;
}
- *cnt = host_err;
nfsd_stats_io_write_add(nn, exp, *cnt);
fsnotify_modify(file);
host_err = filemap_check_wb_err(file->f_mapping, since);
--
2.44.0
* [PATCH v8 7/7] NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events
2025-08-26 18:57 [PATCH v8 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
` (5 preceding siblings ...)
2025-08-26 18:57 ` [PATCH v8 6/7] NFSD: issue WRITEs " Mike Snitzer
@ 2025-08-26 18:57 ` Mike Snitzer
6 siblings, 0 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-26 18:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Add EVENT_CLASS nfsd_analyze_dio_class and use it to create
nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events.
The nfsd_analyze_read_dio trace event shows how NFSD expands any
misaligned READ to the next DIO-aligned block (on either end of the
original READ, as needed).
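As a sanity check of that expansion, the arithmetic can be modeled in
userspace C. expand_read() is a hypothetical stand-in for
nfsd_analyze_read_dio()'s bookkeeping, assuming a power-of-two DIO block
size (512 bytes in the traces below):

```c
#include <assert.h>

struct read_dio_ {
	long long start, end;			/* expanded, DIO-aligned range */
	long long start_extra, end_extra;	/* bytes read but withheld from client */
};

/* Hypothetical model of the READ expansion: widen [offset, offset+len)
 * to the enclosing DIO-aligned range; the extra head/tail bytes are
 * trimmed before the reply is sent. */
static struct read_dio_ expand_read(long long offset, long long len, long long bs)
{
	struct read_dio_ rd;

	rd.start = offset & ~(bs - 1);			/* round_down */
	rd.end = (offset + len + bs - 1) & ~(bs - 1);	/* round_up */
	rd.start_extra = offset - rd.start;
	rd.end_extra = rd.end - (offset + len);
	return rd;
}
```

For the second dd READ traced below (offset=47008 len=47008) this yields
start=46592 with 416 extra head bytes and end=94208 with 192 extra tail
bytes, matching the nfsd_analyze_read_dio trace line.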
This combination of trace events is useful for READs:
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
Which for this dd command:
dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct
Results in:
nfsd-23908[010] ..... 10375.141640: nfsd_analyze_read_dio: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+47008 end=47008+96
nfsd-23908[010] ..... 10375.141642: nfsd_read_vector: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47104
nfsd-23908[010] ..... 10375.141643: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0x0 bytecount 0xb800
nfsd-23908[010] ..... 10375.141773: nfsd_read_io_done: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008
nfsd-23908[010] ..... 10375.142063: nfsd_analyze_read_dio: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=46592+416 middle=47008+47008 end=94016+192
nfsd-23908[010] ..... 10375.142064: nfsd_read_vector: xid=0x83c5923b fh_hash=0x857ca4fc offset=46592 len=47616
nfsd-23908[010] ..... 10375.142065: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0xb600 bytecount 0xba00
nfsd-23908[010] ..... 10375.142103: nfsd_read_io_done: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008
The nfsd_analyze_write_dio trace event shows how NFSD splits a given
misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
extent.
This combination of trace events is useful for WRITEs:
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
Which for this dd command:
dd if=/dev/zero of=/mnt/share1/test bs=47008 count=2 oflag=direct
Results in:
nfsd-23908[010] ..... 10374.902333: nfsd_write_opened: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008
nfsd-23908[010] ..... 10374.902335: nfsd_analyze_write_dio: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+46592 end=46592+416
nfsd-23908[010] ..... 10374.902343: xfs_file_direct_write: dev 259:2 ino 0xc00116 disize 0x0 pos 0x0 bytecount 0xb600
nfsd-23908[010] ..... 10374.902697: nfsd_write_io_done: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008
nfsd-23908[010] ..... 10374.902925: nfsd_write_opened: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008
nfsd-23908[010] ..... 10374.902926: nfsd_analyze_write_dio: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=47008+96 middle=47104+46592 end=93696+320
nfsd-23908[010] ..... 10374.903010: xfs_file_direct_write: dev 259:2 ino 0xc00116 disize 0xb800 pos 0xb800 bytecount 0xb600
nfsd-23908[010] ..... 10374.903239: nfsd_write_io_done: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008
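One way to read the interleaving above: the xfs_file_direct_write bytecount
is exactly the DIO-aligned middle extent of each split WRITE. A quick
cross-check of the trace values (dio_middle_len() is a hypothetical helper,
not kernel code):

```c
#include <assert.h>

/* With a 512-byte DIO block size, only the aligned middle of each
 * 47008-byte WRITE reaches xfs_file_direct_write. */
static long long dio_middle_len(long long offset, long long len, long long bs)
{
	long long start_end  = (offset + bs - 1) & ~(bs - 1);	/* round_up */
	long long middle_end = (offset + len) & ~(bs - 1);	/* round_down */
	return middle_end - start_end;
}
```

Both WRITEs produce a 0xb600 (46592) byte direct middle, as seen in the
xfs_file_direct_write trace lines.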
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/trace.h | 61 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfsd/vfs.c | 13 +++++++++--
2 files changed, 72 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index a664fdf1161e9..cd757d2c52c84 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -473,6 +473,67 @@ DEFINE_NFSD_IO_EVENT(write_done);
DEFINE_NFSD_IO_EVENT(commit_start);
DEFINE_NFSD_IO_EVENT(commit_done);
+DECLARE_EVENT_CLASS(nfsd_analyze_dio_class,
+ TP_PROTO(struct svc_rqst *rqstp,
+ struct svc_fh *fhp,
+ u64 offset,
+ u32 len,
+ loff_t start,
+ ssize_t start_len,
+ loff_t middle,
+ ssize_t middle_len,
+ loff_t end,
+ ssize_t end_len),
+ TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len),
+ TP_STRUCT__entry(
+ __field(u32, xid)
+ __field(u32, fh_hash)
+ __field(u64, offset)
+ __field(u32, len)
+ __field(loff_t, start)
+ __field(ssize_t, start_len)
+ __field(loff_t, middle)
+ __field(ssize_t, middle_len)
+ __field(loff_t, end)
+ __field(ssize_t, end_len)
+ ),
+ TP_fast_assign(
+ __entry->xid = be32_to_cpu(rqstp->rq_xid);
+ __entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
+ __entry->offset = offset;
+ __entry->len = len;
+ __entry->start = start;
+ __entry->start_len = start_len;
+ __entry->middle = middle;
+ __entry->middle_len = middle_len;
+ __entry->end = end;
+ __entry->end_len = end_len;
+ ),
+ TP_printk("xid=0x%08x fh_hash=0x%08x offset=%llu len=%u start=%llu+%zd middle=%llu+%zd end=%llu+%zd",
+ __entry->xid, __entry->fh_hash,
+ __entry->offset, __entry->len,
+ __entry->start, __entry->start_len,
+ __entry->middle, __entry->middle_len,
+ __entry->end, __entry->end_len)
+)
+
+#define DEFINE_NFSD_ANALYZE_DIO_EVENT(name) \
+DEFINE_EVENT(nfsd_analyze_dio_class, nfsd_analyze_##name##_dio, \
+ TP_PROTO(struct svc_rqst *rqstp, \
+ struct svc_fh *fhp, \
+ u64 offset, \
+ u32 len, \
+ loff_t start, \
+ ssize_t start_len, \
+ loff_t middle, \
+ ssize_t middle_len, \
+ loff_t end, \
+ ssize_t end_len), \
+ TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len))
+
+DEFINE_NFSD_ANALYZE_DIO_EVENT(read);
+DEFINE_NFSD_ANALYZE_DIO_EVENT(write);
+
DECLARE_EVENT_CLASS(nfsd_err_class,
TP_PROTO(struct svc_rqst *rqstp,
struct svc_fh *fhp,
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 967ca67f197fc..c2b044cb3b76c 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1143,6 +1143,12 @@ static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
}
+ /* Show original offset and count, and how it was expanded for DIO */
+ middle_end = read_dio->end - read_dio->end_extra;
+ trace_nfsd_analyze_read_dio(rqstp, fhp, offset, len,
+ read_dio->start, read_dio->start_extra,
+ offset, (middle_end - offset),
+ middle_end, read_dio->end_extra);
return true;
}
@@ -1392,7 +1398,7 @@ nfsd_analyze_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
/* clear these for the benefit of trace_nfsd_analyze_write_dio */
start_offset = 0;
start_len = 0;
- return true;
+ goto out;
}
start_end = round_up(offset, dio_blocksize);
@@ -1405,7 +1411,10 @@ nfsd_analyze_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
write_dio->middle_len = middle_end - start_end;
write_dio->end_offset = middle_end;
write_dio->end_len = orig_end - middle_end;
-
+out:
+ trace_nfsd_analyze_write_dio(rqstp, fhp, offset, len, start_offset, start_len,
+ write_dio->middle_offset, write_dio->middle_len,
+ write_dio->end_offset, write_dio->end_len);
return true;
}
--
2.44.0
* Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-26 18:57 ` [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
@ 2025-08-27 15:34 ` Chuck Lever
2025-08-27 19:41 ` Mike Snitzer
0 siblings, 1 reply; 42+ messages in thread
From: Chuck Lever @ 2025-08-27 15:34 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 8/26/25 2:57 PM, Mike Snitzer wrote:
> If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
> DIO-aligned block (on either end of the READ). The expanded READ is
> verified to have proper offset/len (logical_block_size) and
> dma_alignment checking.
>
> Must allocate and use a bounce-buffer page (called 'start_extra_page')
> if/when expanding the misaligned READ requires reading an extra partial
> page at the start of the READ so that it is DIO-aligned. Otherwise that
> extra page at the start will make its way back to the NFS client and
> corruption will occur. The problem was found, and the fix of using an
> extra page verified, using the 'dt' utility:
> dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
> iotype=sequential pattern=iot onerr=abort oncerr=abort
> see: https://github.com/RobinTMiller/dt.git
>
> Any misaligned READ that is less than 32K won't be expanded to be
> DIO-aligned (this heuristic just avoids excess work, like allocating
> start_extra_page, for smaller IO that can generally already perform
> well using buffered IO).
>
> Suggested-by: Jeff Layton <jlayton@kernel.org>
> Suggested-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> ---
> fs/nfsd/vfs.c | 200 +++++++++++++++++++++++++++++++++++--
> include/linux/sunrpc/svc.h | 5 +-
> 2 files changed, 194 insertions(+), 11 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index c340708fbab4d..64732dc8985d6 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -19,6 +19,7 @@
> #include <linux/splice.h>
> #include <linux/falloc.h>
> #include <linux/fcntl.h>
> +#include <linux/math.h>
> #include <linux/namei.h>
> #include <linux/delay.h>
> #include <linux/fsnotify.h>
> @@ -1073,6 +1074,153 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> }
>
> +struct nfsd_read_dio {
> + loff_t start;
> + loff_t end;
> + unsigned long start_extra;
> + unsigned long end_extra;
> + struct page *start_extra_page;
> +};
> +
> +static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
> +{
> + memset(read_dio, 0, sizeof(*read_dio));
> + read_dio->start_extra_page = NULL;
> +}
> +
> +#define NFSD_READ_DIO_MIN_KB (32 << 10)
> +
> +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + struct nfsd_file *nf, loff_t offset,
> + unsigned long len, unsigned int base,
> + struct nfsd_read_dio *read_dio)
> +{
> + const u32 dio_blocksize = nf->nf_dio_read_offset_align;
> + loff_t middle_end, orig_end = offset + len;
> +
> + if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
> + "%s: underlying filesystem has not provided DIO alignment info\n",
> + __func__))
> + return false;
> + if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> + "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> + __func__, dio_blocksize, PAGE_SIZE))
> + return false;
IMHO these checks do not warrant a WARN. Perhaps a trace event, instead?
> +
> > + /* Return early if IO is irreparably misaligned (len < dio_blocksize,
> + * or base not aligned).
> + * Ondisk alignment is implied by the following code that expands
> + * misaligned IO to have a DIO-aligned offset and len.
> + */
> + if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
> + return false;
> +
> + init_nfsd_read_dio(read_dio);
> +
> + read_dio->start = round_down(offset, dio_blocksize);
> + read_dio->end = round_up(orig_end, dio_blocksize);
> + read_dio->start_extra = offset - read_dio->start;
> + read_dio->end_extra = read_dio->end - orig_end;
> +
> + /*
> + * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
> + * to be DIO-aligned (this heuristic avoids excess work, like allocating
> + * start_extra_page, for smaller IO that can generally already perform
> + * well using buffered IO).
> + */
> + if ((read_dio->start_extra || read_dio->end_extra) &&
> + (len < NFSD_READ_DIO_MIN_KB)) {
> + init_nfsd_read_dio(read_dio);
> + return false;
> + }
> +
> + if (read_dio->start_extra) {
> + read_dio->start_extra_page = alloc_page(GFP_KERNEL);
This introduces a page allocation where there weren't any before. For
NFSD, I/O pages come from rqstp->rq_pages[] so that memory allocation
like this is not needed on an I/O path.
So I think the answer to this is that I want you to implement reading
an entire aligned range from the file and then forming the NFS READ
response with only the range of bytes that the client requested, as we
discussed before. The use of xdr_buf and bvec should make that quite
straightforward.
This should make the aligned and unaligned cases nearly identical and
much less fraught.
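A rough userspace sketch of that idea (struct bvec_ and trim_bvecs() are
hypothetical stand-ins for bio_vec and the trimming step, not existing
kernel helpers): read the whole aligned range into the normal page array,
then clip the reply vectors to the client-requested subrange.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio_vec (page pointer omitted) */
struct bvec_ { size_t off, len; };

/* Clip an array of vectors covering the expanded aligned read so that
 * only [skip, skip + count) of the data remains: drop whole vectors
 * before the range, bump the first surviving vector's offset, and clamp
 * the tail. Returns the number of vectors still in use. */
static size_t trim_bvecs(struct bvec_ *v, size_t n, size_t skip, size_t count)
{
	size_t out = 0;

	for (size_t i = 0; i < n && count; i++) {
		size_t off = v[i].off, len = v[i].len;

		if (skip >= len) {	/* vector entirely in the head extra */
			skip -= len;
			continue;
		}
		off += skip;		/* partial head trim */
		len -= skip;
		skip = 0;
		if (len > count)
			len = count;	/* tail clamp */
		count -= len;
		v[out].off = off;
		v[out].len = len;
		out++;
	}
	return out;
}
```

The point being that no bounce page would be needed: the extra head bytes
stay in the page array but are simply never referenced by the reply.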
> + if (WARN_ONCE(read_dio->start_extra_page == NULL,
> + "%s: Unable to allocate start_extra_page\n", __func__)) {
> + init_nfsd_read_dio(read_dio);
> + return false;
> + }
> + }
> +
> + return true;
> +}
> +
> +static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
> + struct nfsd_read_dio *read_dio,
> + ssize_t bytes_read,
> + unsigned long bytes_expected,
> + loff_t *offset,
> + unsigned long *rq_bvec_numpages)
> +{
> + ssize_t host_err = bytes_read;
> + loff_t v;
> +
> + if (!read_dio->start_extra && !read_dio->end_extra)
> + return host_err;
> +
> + /* If nfsd_analyze_read_dio() allocated a start_extra_page it must
> + * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
> + */
> + if (read_dio->start_extra_page) {
> + __free_page(read_dio->start_extra_page);
> + *rq_bvec_numpages -= 1;
> + v = *rq_bvec_numpages;
> + memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
> + v * sizeof(struct bio_vec));
> + }
> + /* Eliminate any end_extra bytes from the last page */
> + v = *rq_bvec_numpages;
> + rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
> +
> + if (host_err < 0) {
> + /* Underlying FS will return -EINVAL if misaligned
> + * DIO is attempted because it shouldn't be.
> + */
> + WARN_ON_ONCE(host_err == -EINVAL);
> + return host_err;
> + }
> +
> + /* nfsd_analyze_read_dio() may have expanded the start and end,
> + * if so adjust returned read size to reflect original extent.
> + */
> + *offset += read_dio->start_extra;
> + if (likely(host_err >= read_dio->start_extra)) {
> + host_err -= read_dio->start_extra;
> + if (host_err > bytes_expected)
> + host_err = bytes_expected;
> + } else {
> + /* Short read that didn't read any of requested data */
> + host_err = 0;
> + }
> +
> + return host_err;
> +}
> +
> +static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
> + unsigned addr_mask, unsigned len_mask)
> +{
> + const struct bio_vec *bvec = i->bvec;
> + unsigned skip = i->iov_offset;
> + size_t size = i->count;
checkpatch.pl is complaining about the use of "unsigned" rather than
"unsigned int".
> +
> + if (size & len_mask)
> + return false;
> + do {
> + size_t len = bvec->bv_len;
> +
> + if (len > size)
> + len = size;
> + if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> + return false;
> + bvec++;
> + size -= len;
> + skip = 0;
> + } while (size);
> +
> + return true;
> +}
> +
> /**
> * nfsd_iter_read - Perform a VFS read using an iterator
> * @rqstp: RPC transaction context
> @@ -1094,7 +1242,8 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> unsigned int base, u32 *eof)
> {
> struct file *file = nf->nf_file;
> - unsigned long v, total;
> + unsigned long v, total, in_count = *count;
> + struct nfsd_read_dio read_dio;
> struct iov_iter iter;
> struct kiocb kiocb;
> ssize_t host_err;
> @@ -1102,13 +1251,34 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> init_sync_kiocb(&kiocb, file);
>
> + v = 0;
> + total = in_count;
> +
> switch (nfsd_io_cache_read) {
> case NFSD_IO_DIRECT:
> - /* Verify ondisk and memory DIO alignment */
> - if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
> - (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0) &&
> - (base & (nf->nf_dio_mem_align - 1)) == 0)
> - kiocb.ki_flags = IOCB_DIRECT;
> + /*
> + * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
> + * the next DIO-aligned block (on either end of the READ).
> + */
> + if (nfsd_analyze_read_dio(rqstp, fhp, nf, offset,
> + in_count, base, &read_dio)) {
> + /* trace_nfsd_read_vector() will reflect larger
> + * DIO-aligned READ.
> + */
> + offset = read_dio.start;
> + in_count = read_dio.end - offset;
> + total = in_count;
> +
> + kiocb.ki_flags |= IOCB_DIRECT;
> + if (read_dio.start_extra) {
> + len = read_dio.start_extra;
> + bvec_set_page(&rqstp->rq_bvec[v],
> + read_dio.start_extra_page,
> + len, PAGE_SIZE - len);
> + total -= len;
> + ++v;
> + }
> + }
> break;
> case NFSD_IO_DONTCACHE:
> kiocb.ki_flags = IOCB_DONTCACHE;
> @@ -1120,8 +1290,6 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> kiocb.ki_pos = offset;
>
> - v = 0;
> - total = *count;
> while (total) {
> len = min_t(size_t, total, PAGE_SIZE - base);
> bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> @@ -1132,9 +1300,21 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> }
> WARN_ON_ONCE(v > rqstp->rq_maxpages);
>
> - trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> - iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> + trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
> + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
> +
> + if ((kiocb.ki_flags & IOCB_DIRECT) &&
> + !nfsd_iov_iter_aligned_bvec(&iter, nf->nf_dio_mem_align-1,
> + nf->nf_dio_read_offset_align-1))
> + kiocb.ki_flags &= ~IOCB_DIRECT;
> +
> host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
> +
> + if (in_count != *count) {
> + /* misaligned DIO expanded read to be DIO-aligned */
> + host_err = nfsd_complete_misaligned_read_dio(rqstp, &read_dio,
> + host_err, *count, &offset, &v);
> + }
> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> }
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index e64ab444e0a7f..190c2667500e2 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> * pages, one for the request, and one for the reply.
> * nfsd_splice_actor() might need an extra page when a READ payload
> * is not page-aligned.
> + * nfsd_iter_read() might need two extra pages when a READ payload
> + * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
> + * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
> */
> static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
> {
> - return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
> + return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
> }
>
> /*
To properly evaluate the impact of using direct I/O for reads with
real-world user workloads, we will want to identify (or construct) some
metrics (and this is future work, but near-term future).
Seems like allocating memory becomes difficult only when too many pages
are dirty. I am skeptical that the issue is due to read caching, since
clean pages in the page cache are pretty easy to evict quickly, AIUI. If
that's incorrect, I'd like to understand why.
--
Chuck Lever
* Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-27 15:34 ` Chuck Lever
@ 2025-08-27 19:41 ` Mike Snitzer
2025-08-27 20:56 ` Chuck Lever
2025-08-28 16:22 ` Jeff Layton
0 siblings, 2 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-27 19:41 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Wed, Aug 27, 2025 at 11:34:03AM -0400, Chuck Lever wrote:
> On 8/26/25 2:57 PM, Mike Snitzer wrote:
> > If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
> > DIO-aligned block (on either end of the READ). The expanded READ is
> > verified to have proper offset/len (logical_block_size) and
> > dma_alignment checking.
> >
> > Must allocate and use a bounce-buffer page (called 'start_extra_page')
> > if/when expanding the misaligned READ requires reading an extra partial
> > page at the start of the READ so that it is DIO-aligned. Otherwise that
> > extra page at the start will make its way back to the NFS client and
> > corruption will occur. The problem was found, and the fix of using an
> > extra page verified, using the 'dt' utility:
> > dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
> > iotype=sequential pattern=iot onerr=abort oncerr=abort
> > see: https://github.com/RobinTMiller/dt.git
> >
> > Any misaligned READ that is less than 32K won't be expanded to be
> > DIO-aligned (this heuristic just avoids excess work, like allocating
> > start_extra_page, for smaller IO that can generally already perform
> > well using buffered IO).
> >
> > Suggested-by: Jeff Layton <jlayton@kernel.org>
> > Suggested-by: Chuck Lever <chuck.lever@oracle.com>
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > ---
> > fs/nfsd/vfs.c | 200 +++++++++++++++++++++++++++++++++++--
> > include/linux/sunrpc/svc.h | 5 +-
> > 2 files changed, 194 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index c340708fbab4d..64732dc8985d6 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -19,6 +19,7 @@
> > #include <linux/splice.h>
> > #include <linux/falloc.h>
> > #include <linux/fcntl.h>
> > +#include <linux/math.h>
> > #include <linux/namei.h>
> > #include <linux/delay.h>
> > #include <linux/fsnotify.h>
> > @@ -1073,6 +1074,153 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > }
> >
> > +struct nfsd_read_dio {
> > + loff_t start;
> > + loff_t end;
> > + unsigned long start_extra;
> > + unsigned long end_extra;
> > + struct page *start_extra_page;
> > +};
> > +
> > +static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
> > +{
> > + memset(read_dio, 0, sizeof(*read_dio));
> > + read_dio->start_extra_page = NULL;
> > +}
> > +
> > +#define NFSD_READ_DIO_MIN_KB (32 << 10)
> > +
> > +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > + struct nfsd_file *nf, loff_t offset,
> > + unsigned long len, unsigned int base,
> > + struct nfsd_read_dio *read_dio)
> > +{
> > + const u32 dio_blocksize = nf->nf_dio_read_offset_align;
> > + loff_t middle_end, orig_end = offset + len;
> > +
> > + if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
> > + "%s: underlying filesystem has not provided DIO alignment info\n",
> > + __func__))
> > + return false;
> > + if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> > + "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> > + __func__, dio_blocksize, PAGE_SIZE))
> > + return false;
>
> IMHO these checks do not warrant a WARN. Perhaps a trace event, instead?
I won't die on this hill; I just don't see the risk of these given
they are highly unlikely ("famous last words").
But if they trigger we should surely be made aware immediately, not
only if someone happens to have a trace event enabled (which would
only happen after further support and engineering involvement to chase
"why isn't O_DIRECT being used like NFSD was optionally configured
to!?").
> > + /* Return early if IO is irreparably misaligned (len < dio_blocksize,
> > + * or base not aligned).
> > + * Ondisk alignment is implied by the following code that expands
> > + * misaligned IO to have a DIO-aligned offset and len.
> > + */
> > + if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
> > + return false;
> > +
> > + init_nfsd_read_dio(read_dio);
> > +
> > + read_dio->start = round_down(offset, dio_blocksize);
> > + read_dio->end = round_up(orig_end, dio_blocksize);
> > + read_dio->start_extra = offset - read_dio->start;
> > + read_dio->end_extra = read_dio->end - orig_end;
> > +
> > + /*
> > + * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
> > + * to be DIO-aligned (this heuristic avoids excess work, like allocating
> > + * start_extra_page, for smaller IO that can generally already perform
> > + * well using buffered IO).
> > + */
> > + if ((read_dio->start_extra || read_dio->end_extra) &&
> > + (len < NFSD_READ_DIO_MIN_KB)) {
> > + init_nfsd_read_dio(read_dio);
> > + return false;
> > + }
> > +
> > + if (read_dio->start_extra) {
> > + read_dio->start_extra_page = alloc_page(GFP_KERNEL);
>
> This introduces a page allocation where there weren't any before. For
> NFSD, I/O pages come from rqstp->rq_pages[] so that memory allocation
> like this is not needed on an I/O path.
NFSD never supported DIO before. Yes, with this patch there is
a single page allocation in the misaligned DIO READ path (if it
requires reading extra before the client requested data starts).
I tried to succinctly explain the need for the extra page allocation
for misaligned DIO READ in this patch's header (in 2nd paragraph
of the above header).
I cannot see how to read extra, not requested by the client, into the
head of rq_pages without causing serious problems. So that cannot be
what you're saying is needed.
> So I think the answer to this is that I want you to implement reading
> an entire aligned range from the file and then forming the NFS READ
> response with only the range of bytes that the client requested, as we
> discussed before.
That is what I'm doing. But you're taking issue with my implementation
that uses a single extra page.
> The use of xdr_buf and bvec should make that quite
> straightforward.
Is your suggestion to, rather than allocate a disjoint single page,
borrow the extra page from the end of rq_pages? Just map it into the
bvec instead of my extra page?
> This should make the aligned and unaligned cases nearly identical and
> much less fraught.
Regardless of which memory is used to read the extra data, I don't see
how the care I've taken to read extra bytes but hide that fact from
the client can be avoided. So the pre/post misaligned DIO READ code
won't change a whole lot. But once I understand your suggestion better
(after a clarifying response to this message) hopefully I'll see what
you're saying.
All said, this patchset is very important to me and I don't want it to
miss v6.18. I'm still "in it to win it", but it feels like I do need
your or others' help to pull this off.
And/or is it possible to accept this initial implementation with
mutual understanding that we must revisit your concern about my
allocating an extra page for the misaligned DIO READ path?
> > + if (WARN_ONCE(read_dio->start_extra_page == NULL,
> > + "%s: Unable to allocate start_extra_page\n", __func__)) {
> > + init_nfsd_read_dio(read_dio);
> > + return false;
> > + }
> > + }
> > +
> > + return true;
> > +}
> > +
> > +static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
> > + struct nfsd_read_dio *read_dio,
> > + ssize_t bytes_read,
> > + unsigned long bytes_expected,
> > + loff_t *offset,
> > + unsigned long *rq_bvec_numpages)
> > +{
> > + ssize_t host_err = bytes_read;
> > + loff_t v;
> > +
> > + if (!read_dio->start_extra && !read_dio->end_extra)
> > + return host_err;
> > +
> > + /* If nfsd_analyze_read_dio() allocated a start_extra_page it must
> > + * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
> > + */
> > + if (read_dio->start_extra_page) {
> > + __free_page(read_dio->start_extra_page);
> > + *rq_bvec_numpages -= 1;
> > + v = *rq_bvec_numpages;
> > + memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
> > + v * sizeof(struct bio_vec));
> > + }
> > + /* Eliminate any end_extra bytes from the last page */
> > + v = *rq_bvec_numpages;
> > + rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
> > +
> > + if (host_err < 0) {
> > + /* Underlying FS will return -EINVAL if misaligned
> > + * DIO is attempted because it shouldn't be.
> > + */
> > + WARN_ON_ONCE(host_err == -EINVAL);
> > + return host_err;
> > + }
> > +
> > + /* nfsd_analyze_read_dio() may have expanded the start and end,
> > + * if so adjust returned read size to reflect original extent.
> > + */
> > + *offset += read_dio->start_extra;
> > + if (likely(host_err >= read_dio->start_extra)) {
> > + host_err -= read_dio->start_extra;
> > + if (host_err > bytes_expected)
> > + host_err = bytes_expected;
> > + } else {
> > + /* Short read that didn't read any of requested data */
> > + host_err = 0;
> > + }
> > +
> > + return host_err;
> > +}
> > +
> > +static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
> > + unsigned addr_mask, unsigned len_mask)
> > +{
> > + const struct bio_vec *bvec = i->bvec;
> > + unsigned skip = i->iov_offset;
> > + size_t size = i->count;
>
> checkpatch.pl is complaining about the use of "unsigned" rather than
> "unsigned int".
OK.
> > +
> > + if (size & len_mask)
> > + return false;
> > + do {
> > + size_t len = bvec->bv_len;
> > +
> > + if (len > size)
> > + len = size;
> > + if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> > + return false;
> > + bvec++;
> > + size -= len;
> > + skip = 0;
> > + } while (size);
> > +
> > + return true;
> > +}
> > +
> > /**
> > * nfsd_iter_read - Perform a VFS read using an iterator
> > * @rqstp: RPC transaction context
> > @@ -1094,7 +1242,8 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > unsigned int base, u32 *eof)
> > {
> > struct file *file = nf->nf_file;
> > - unsigned long v, total;
> > + unsigned long v, total, in_count = *count;
> > + struct nfsd_read_dio read_dio;
> > struct iov_iter iter;
> > struct kiocb kiocb;
> > ssize_t host_err;
> > @@ -1102,13 +1251,34 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >
> > init_sync_kiocb(&kiocb, file);
> >
> > + v = 0;
> > + total = in_count;
> > +
> > switch (nfsd_io_cache_read) {
> > case NFSD_IO_DIRECT:
> > - /* Verify ondisk and memory DIO alignment */
> > - if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
> > - (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0) &&
> > - (base & (nf->nf_dio_mem_align - 1)) == 0)
> > - kiocb.ki_flags = IOCB_DIRECT;
> > + /*
> > + * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
> > + * the next DIO-aligned block (on either end of the READ).
> > + */
> > + if (nfsd_analyze_read_dio(rqstp, fhp, nf, offset,
> > + in_count, base, &read_dio)) {
> > + /* trace_nfsd_read_vector() will reflect larger
> > + * DIO-aligned READ.
> > + */
> > + offset = read_dio.start;
> > + in_count = read_dio.end - offset;
> > + total = in_count;
> > +
> > + kiocb.ki_flags |= IOCB_DIRECT;
> > + if (read_dio.start_extra) {
> > + len = read_dio.start_extra;
> > + bvec_set_page(&rqstp->rq_bvec[v],
> > + read_dio.start_extra_page,
> > + len, PAGE_SIZE - len);
> > + total -= len;
> > + ++v;
> > + }
> > + }
> > break;
> > case NFSD_IO_DONTCACHE:
> > kiocb.ki_flags = IOCB_DONTCACHE;
> > @@ -1120,8 +1290,6 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >
> > kiocb.ki_pos = offset;
> >
> > - v = 0;
> > - total = *count;
> > while (total) {
> > len = min_t(size_t, total, PAGE_SIZE - base);
> > bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> > @@ -1132,9 +1300,21 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > }
> > WARN_ON_ONCE(v > rqstp->rq_maxpages);
> >
> > - trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> > - iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> > + trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
> > + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
> > +
> > + if ((kiocb.ki_flags & IOCB_DIRECT) &&
> > + !nfsd_iov_iter_aligned_bvec(&iter, nf->nf_dio_mem_align-1,
> > + nf->nf_dio_read_offset_align-1))
> > + kiocb.ki_flags &= ~IOCB_DIRECT;
> > +
> > host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
> > +
> > + if (in_count != *count) {
> > + /* misaligned DIO expanded read to be DIO-aligned */
> > + host_err = nfsd_complete_misaligned_read_dio(rqstp, &read_dio,
> > + host_err, *count, &offset, &v);
> > + }
> > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > }
> >
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index e64ab444e0a7f..190c2667500e2 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > * pages, one for the request, and one for the reply.
> > * nfsd_splice_actor() might need an extra page when a READ payload
> > * is not page-aligned.
> > + * nfsd_iter_read() might need two extra pages when a READ payload
> > + * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
> > + * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
> > */
> > static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
> > {
> > - return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
> > + return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
> > }
> >
> > /*
>
> To properly evaluate the impact of using direct I/O for reads with real
> world user workloads, we will want to identify (or construct) some
> metrics (and this is future work, but near-term future).
>
> Seems like allocating memory becomes difficult only when too many pages
> are dirty. I am skeptical that the issue is due to read caching, since
> clean pages in the page cache are pretty easy to evict quickly, AIUI. If
> that's incorrect, I'd like to understand why.
The much more problematic case is a heavy WRITE workload with a
working set that far exceeds system memory.
But I agree it doesn't make a whole lot of sense that clean pages in
the page cache would be getting in the way. All I can tell you is
that in my experience MM seems _not_ to evict them quickly (but more
focused read-only testing is warranted to further understand the
dynamics and heuristics in MM and beyond, especially a pivot from
READ-only to a mix of heavy READ and WRITE, or WRITE-only).
NFSD using DIO is optional. I thought the point was to get it in as an
available option so that _others_ could experiment and help categorize
the benefits/pitfalls further?
I cannot be a one-man show on all this. I welcome more help from
anyone interested.
Thanks,
Mike
* Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-27 19:41 ` Mike Snitzer
@ 2025-08-27 20:56 ` Chuck Lever
2025-08-27 23:15 ` Mike Snitzer
2025-08-28 16:22 ` Jeff Layton
1 sibling, 1 reply; 42+ messages in thread
From: Chuck Lever @ 2025-08-27 20:56 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 8/27/25 3:41 PM, Mike Snitzer wrote:
> On Wed, Aug 27, 2025 at 11:34:03AM -0400, Chuck Lever wrote:
>> On 8/26/25 2:57 PM, Mike Snitzer wrote:
>>> + if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
>>> + "%s: underlying filesystem has not provided DIO alignment info\n",
>>> + __func__))
>>> + return false;
>>> + if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
>>> + "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
>>> + __func__, dio_blocksize, PAGE_SIZE))
>>> + return false;
>>
>> IMHO these checks do not warrant a WARN. Perhaps a trace event, instead?
>
> I won't die on this hill, I just don't see the risk of these given
> they are highly unlikely ("famous last words").
>
> But if they trigger we should surely be made aware immediately. Not
> only if someone happens to have a trace event enabled (which would
> only happen with further support and engineering involvement to chase
> "why isn't O_DIRECT being used like NFSD was optionally configured
> to!?").
A. It seems particularly inefficient to make this check for every I/O
rather than once per file system
B. Once the warning has fired for one file, it won't fire again, making
it pretty useless if there are multiple similar mismatches. You still
end up with "No direct I/O even though I flipped the switch, and I
can't tell why."
>>> + /* Return early if IO is irreparably misaligned (len < PAGE_SIZE,
>>> + * or base not aligned).
>>> + * Ondisk alignment is implied by the following code that expands
>>> + * misaligned IO to have a DIO-aligned offset and len.
>>> + */
>>> + if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
>>> + return false;
>>> +
>>> + init_nfsd_read_dio(read_dio);
>>> +
>>> + read_dio->start = round_down(offset, dio_blocksize);
>>> + read_dio->end = round_up(orig_end, dio_blocksize);
>>> + read_dio->start_extra = offset - read_dio->start;
>>> + read_dio->end_extra = read_dio->end - orig_end;
>>> +
>>> + /*
>>> + * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
>>> + * to be DIO-aligned (this heuristic avoids excess work, like allocating
>>> + * start_extra_page, for smaller IO that can generally already perform
>>> + * well using buffered IO).
>>> + */
>>> + if ((read_dio->start_extra || read_dio->end_extra) &&
>>> + (len < NFSD_READ_DIO_MIN_KB)) {
>>> + init_nfsd_read_dio(read_dio);
>>> + return false;
>>> + }
>>> +
>>> + if (read_dio->start_extra) {
>>> + read_dio->start_extra_page = alloc_page(GFP_KERNEL);
>>
>> This introduces a page allocation where there weren't any before. For
>> NFSD, I/O pages come from rqstp->rq_pages[] so that memory allocation
>> like this is not needed on an I/O path.
>
> NFSD never supported DIO before. Yes, with this patch there is
> a single page allocation in the misaligned DIO READ path (if it
> requires reading extra before the client requested data starts).
>
> I tried to succinctly explain the need for the extra page allocation
> for misaligned DIO READ in this patch's header (in 2nd paragraph
> of the above header).
>
> I cannot see how to read extra, not requested by the client, into the
> head of rq_pages without causing serious problems. So that cannot be
> what you're saying needed.
>
>> So I think the answer to this is that I want you to implement reading
>> an entire aligned range from the file and then forming the NFS READ
>> response with only the range of bytes that the client requested, as we
>> discussed before.
>
> That is what I'm doing. But you're taking issue with my implementation
> that uses a single extra page.
>
>> The use of xdr_buf and bvec should make that quite
>> straightforward.
>
> Is your suggestion to, rather than allocate a disjoint single page,
> borrow the extra page from the end of rq_pages? Just map it into the
> bvec instead of my extra page?
Yes, the extra page needs to come from rq_pages. But I don't see why it
should come from the /end/ of rq_pages.
- Extend the start of the byte range back to make it align with the
file's DIO alignment constraint
- Extend the end of the byte range forward to make it align with the
file's DIO alignment constraint
- Fill in the sink buffer's bvec using pages from rq_pages, as usual
- When the I/O is complete, adjust the offset in the first bvec entry
forward by setting a non-zero page offset, and adjust the returned
count downward to match the requested byte count from the client
If the byte range requested by the NFS READ was already aligned, then
the first entry offset value remains zero. As SteveD says, Boom. Done.
>> To properly evaluate the impact of using direct I/O for reads with real
>> world user workloads, we will want to identify (or construct) some
>> metrics (and this is future work, but near-term future).
>>
>> Seems like allocating memory becomes difficult only when too many pages
>> are dirty. I am skeptical that the issue is due to read caching, since
>> clean pages in the page cache are pretty easy to evict quickly, AIUI. If
>> that's incorrect, I'd like to understand why.
>
> The much more problematic case is heavy WRITE workload with a working
> set that far exceeds system memory.
OK, that makes sense. And, there is a parallel writeback effort ongoing
to help address some of that problem, AIUI. It makes sense to keep a
close watch on that to see how NFSD can benefit, while we're working
through the complexities of handling NFS WRITE using direct I/O.
> But I agree it doesn't make a whole lot of sense that clean pages in
> the page cache would be getting in the way. All I can tell you is
> that in my experience MM seems to _not_ evict them quickly (but more
> focused read-only testing is warranted to further understand the
> dynamics and heuristics in MM and beyond -- especially if/when
> READ-only then a pivot to a mix of heavy READ and WRITE or
> WRITE-only).
Starting by examining read-only workloads seems like a nice way to
simplify the problem space to get started.
> NFSD using DIO is optional. I thought the point was to get it as an
> available option so that _others_ could experiment and help categorize
> the benefits/pitfalls further?
Yes, that is the point. But such experiments lose value if there is no
data collection plan to go with them.
> I cannot be a one man show on all this. I welcome more help from
> anyone interested.
I think it's important for you to learn how the NFSD I/O path works
rather than simply handing us a drive-by contribution. It's going to
take some time, so be patient.
If you would rather make this drive-by, then you'll have to realize
that you are requesting more than simple review from us. You'll have
to be content with the pace at which we overloaded maintainers can get
to the work.
It's not the usual situation that a maintainer has to sit down and
do extensive rewrites on a contribution. That really doesn't scale
well. That's why I'm pushing back.
--
Chuck Lever
* Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-27 20:56 ` Chuck Lever
@ 2025-08-27 23:15 ` Mike Snitzer
2025-08-28 1:57 ` Chuck Lever
0 siblings, 1 reply; 42+ messages in thread
From: Mike Snitzer @ 2025-08-27 23:15 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Wed, Aug 27, 2025 at 04:56:08PM -0400, Chuck Lever wrote:
> On 8/27/25 3:41 PM, Mike Snitzer wrote:
> > On Wed, Aug 27, 2025 at 11:34:03AM -0400, Chuck Lever wrote:
> >> On 8/26/25 2:57 PM, Mike Snitzer wrote:
>
> >>> + if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
> >>> + "%s: underlying filesystem has not provided DIO alignment info\n",
> >>> + __func__))
> >>> + return false;
> >>> + if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> >>> + "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> >>> + __func__, dio_blocksize, PAGE_SIZE))
> >>> + return false;
> >>
> >> IMHO these checks do not warrant a WARN. Perhaps a trace event, instead?
> >
> > I won't die on this hill, I just don't see the risk of these given
> > they are highly unlikely ("famous last words").
> >
> > But if they trigger we should surely be made aware immediately. Not
> > only if someone happens to have a trace event enabled (which would
> > only happen with further support and engineering involvement to chase
> > "why isn't O_DIRECT being used like NFSD was optionally configured
> > to!?").
> A. It seems particularly inefficient to make this check for every I/O
> rather than once per file system
>
> B. Once the warning has fired for one file, it won't fire again, making
> it pretty useless if there are multiple similar mismatches. You still
> end up with "No direct I/O even though I flipped the switch, and I
> can't tell why."
I've removed the WARN_ON_ONCEs for read and write. These repeated
per-IO negative checks aren't ideal, but they certainly aren't costly.
> >>> + /* Return early if IO is irreparably misaligned (len < PAGE_SIZE,
> >>> + * or base not aligned).
> >>> + * Ondisk alignment is implied by the following code that expands
> >>> + * misaligned IO to have a DIO-aligned offset and len.
> >>> + */
> >>> + if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
> >>> + return false;
> >>> +
> >>> + init_nfsd_read_dio(read_dio);
> >>> +
> >>> + read_dio->start = round_down(offset, dio_blocksize);
> >>> + read_dio->end = round_up(orig_end, dio_blocksize);
> >>> + read_dio->start_extra = offset - read_dio->start;
> >>> + read_dio->end_extra = read_dio->end - orig_end;
> >>> +
> >>> + /*
> >>> + * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
> >>> + * to be DIO-aligned (this heuristic avoids excess work, like allocating
> >>> + * start_extra_page, for smaller IO that can generally already perform
> >>> + * well using buffered IO).
> >>> + */
> >>> + if ((read_dio->start_extra || read_dio->end_extra) &&
> >>> + (len < NFSD_READ_DIO_MIN_KB)) {
> >>> + init_nfsd_read_dio(read_dio);
> >>> + return false;
> >>> + }
> >>> +
> >>> + if (read_dio->start_extra) {
> >>> + read_dio->start_extra_page = alloc_page(GFP_KERNEL);
> >>
> >> This introduces a page allocation where there weren't any before. For
> >> NFSD, I/O pages come from rqstp->rq_pages[] so that memory allocation
> >> like this is not needed on an I/O path.
> >
> > NFSD never supported DIO before. Yes, with this patch there is
> > a single page allocation in the misaligned DIO READ path (if it
> > requires reading extra before the client requested data starts).
> >
> > I tried to succinctly explain the need for the extra page allocation
> > for misaligned DIO READ in this patch's header (in 2nd paragraph
> > of the above header).
> >
> > I cannot see how to read extra, not requested by the client, into the
> > head of rq_pages without causing serious problems. So that cannot be
> > what you're saying needed.
> >
> >> So I think the answer to this is that I want you to implement reading
> >> an entire aligned range from the file and then forming the NFS READ
> >> response with only the range of bytes that the client requested, as we
> >> discussed before.
> >
> > That is what I'm doing. But you're taking issue with my implementation
> > that uses a single extra page.
> >
> >> The use of xdr_buf and bvec should make that quite
> >> straightforward.
> >
> > Is your suggestion to, rather than allocate a disjoint single page,
> > borrow the extra page from the end of rq_pages? Just map it into the
> > bvec instead of my extra page?
>
> Yes, the extra page needs to come from rq_pages. But I don't see why it
> should come from the /end/ of rq_pages.
>
> - Extend the start of the byte range back to make it align with the
> file's DIO alignment constraint
>
> - Extend the end of the byte range forward to make it align with the
> file's DIO alignment constraint
nfsd_analyze_read_dio() does that (start_extra and end_extra).
> - Fill in the sink buffer's bvec using pages from rq_pages, as usual
>
> - When the I/O is complete, adjust the offset in the first bvec entry
> forward by setting a non-zero page offset, and adjust the returned
> count downward to match the requested byte count from the client
Tried it long ago; such bvec manipulation only works when not using
RDMA. When the memory is remote, twiddling a local bvec isn't going
to ensure the correct pages have the correct data upon return to the
client.
RDMA is why the pages must be used in-place, and RDMA is also why
the extra page needed by this patch (for use as a throwaway front-pad
for the expanded misaligned DIO READ) must either be allocated _or_
hopefully borrowed from rq_pages (after the end of the client-requested
READ payload).
Or am I wrong and simply need to keep learning about NFSD's IO path?
> > NFSD using DIO is optional. I thought the point was to get it as an
> > available option so that _others_ could experiment and help categorize
> > the benefits/pitfalls further?
>
> Yes, that is the point. But such experiments lose value if there is no
> data collection plan to go with them.
Each user runs something they care about performing well and they
measure the result.
Literally the same thing as has been done for anything in Linux since
it all started. Nothing unicorn or bespoke here.
> If you would rather make this drive-by, then you'll have to realize
> that you are requesting more than simple review from us. You'll have
> to be content with the pace at which we overloaded maintainers can get
> to the work.
I think I just experienced the mailing-list equivalent of the Detroit
definition of "drive-by". Good/bad news: you're a terrible shot.
* Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-27 23:15 ` Mike Snitzer
@ 2025-08-28 1:57 ` Chuck Lever
2025-08-28 8:09 ` Mike Snitzer
0 siblings, 1 reply; 42+ messages in thread
From: Chuck Lever @ 2025-08-28 1:57 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 8/27/25 7:15 PM, Mike Snitzer wrote:
> On Wed, Aug 27, 2025 at 04:56:08PM -0400, Chuck Lever wrote:
>> On 8/27/25 3:41 PM, Mike Snitzer wrote:
>>> On Wed, Aug 27, 2025 at 11:34:03AM -0400, Chuck Lever wrote:
>>>> On 8/26/25 2:57 PM, Mike Snitzer wrote:
>>
>>>>> + if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
>>>>> + "%s: underlying filesystem has not provided DIO alignment info\n",
>>>>> + __func__))
>>>>> + return false;
>>>>> + if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
>>>>> + "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
>>>>> + __func__, dio_blocksize, PAGE_SIZE))
>>>>> + return false;
>>>>
>>>> IMHO these checks do not warrant a WARN. Perhaps a trace event, instead?
>>>
>>> I won't die on this hill, I just don't see the risk of these given
>>> they are highly unlikely ("famous last words").
>>>
>>> But if they trigger we should surely be made aware immediately. Not
>>> only if someone happens to have a trace event enabled (which would
>>> only happen with further support and engineering involvement to chase
>>> "why isn't O_DIRECT being used like NFSD was optionally configured
>>> to!?").
>> A. It seems particularly inefficient to make this check for every I/O
>> rather than once per file system
>>
>> B. Once the warning has fired for one file, it won't fire again, making
>> it pretty useless if there are multiple similar mismatches. You still
>> end up with "No direct I/O even though I flipped the switch, and I
>> can't tell why."
>
> I've removed the WARN_ON_ONCEs for read and write. These repeat
> per-IO negative checks aren't ideal but they certainly aren't costly.
>
>>>>> + /* Return early if IO is irreparably misaligned (len < PAGE_SIZE,
>>>>> + * or base not aligned).
>>>>> + * Ondisk alignment is implied by the following code that expands
>>>>> + * misaligned IO to have a DIO-aligned offset and len.
>>>>> + */
>>>>> + if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
>>>>> + return false;
>>>>> +
>>>>> + init_nfsd_read_dio(read_dio);
>>>>> +
>>>>> + read_dio->start = round_down(offset, dio_blocksize);
>>>>> + read_dio->end = round_up(orig_end, dio_blocksize);
>>>>> + read_dio->start_extra = offset - read_dio->start;
>>>>> + read_dio->end_extra = read_dio->end - orig_end;
>>>>> +
>>>>> + /*
>>>>> + * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
>>>>> + * to be DIO-aligned (this heuristic avoids excess work, like allocating
>>>>> + * start_extra_page, for smaller IO that can generally already perform
>>>>> + * well using buffered IO).
>>>>> + */
>>>>> + if ((read_dio->start_extra || read_dio->end_extra) &&
>>>>> + (len < NFSD_READ_DIO_MIN_KB)) {
>>>>> + init_nfsd_read_dio(read_dio);
>>>>> + return false;
>>>>> + }
>>>>> +
>>>>> + if (read_dio->start_extra) {
>>>>> + read_dio->start_extra_page = alloc_page(GFP_KERNEL);
>>>>
>>>> This introduces a page allocation where there weren't any before. For
>>>> NFSD, I/O pages come from rqstp->rq_pages[] so that memory allocation
>>>> like this is not needed on an I/O path.
>>>
>>> NFSD never supported DIO before. Yes, with this patch there is
>>> a single page allocation in the misaligned DIO READ path (if it
>>> requires reading extra before the client requested data starts).
>>>
>>> I tried to succinctly explain the need for the extra page allocation
>>> for misaligned DIO READ in this patch's header (in 2nd paragraph
>>> of the above header).
>>>
>>> I cannot see how to read extra, not requested by the client, into the
>>> head of rq_pages without causing serious problems. So that cannot be
>>> what you're saying needed.
>>>
>>>> So I think the answer to this is that I want you to implement reading
>>>> an entire aligned range from the file and then forming the NFS READ
>>>> response with only the range of bytes that the client requested, as we
>>>> discussed before.
>>>
>>> That is what I'm doing. But you're taking issue with my implementation
>>> that uses a single extra page.
>>>
>>>> The use of xdr_buf and bvec should make that quite
>>>> straightforward.
>>>
>>> Is your suggestion to, rather than allocate a disjoint single page,
>>> borrow the extra page from the end of rq_pages? Just map it into the
>>> bvec instead of my extra page?
>>
>> Yes, the extra page needs to come from rq_pages. But I don't see why it
>> should come from the /end/ of rq_pages.
>>
>> - Extend the start of the byte range back to make it align with the
>> file's DIO alignment constraint
>>
>> - Extend the end of the byte range forward to make it align with the
>> file's DIO alignment constraint
>
> nfsd_analyze_read_dio() does that (start_extra and end_extra).
>
>> - Fill in the sink buffer's bvec using pages from rq_pages, as usual
>>
>> - When the I/O is complete, adjust the offset in the first bvec entry
>> forward by setting a non-zero page offset, and adjust the returned
>> count downward to match the requested byte count from the client
>
> Tried it long ago, such bvec manipulation only works when not using
> RDMA. When the memory is remote, twiddling a local bvec isn't going
> to ensure the correct pages have the correct data upon return to the
> client.
>
> RDMA is why the pages must be used in-place, and RDMA is also why
> the extra page needed by this patch (for use as throwaway front-pad
> for expanded misaligned DIO READ) must either be allocated _or_
> hopefully it can be from rq_pages (after the end of the client
> requested READ payload).
>
> Or am I wrong and simply need to keep learning about NFSD's IO path?
You're wrong, not to put too fine a point on it.
There's nothing I can think of in the RDMA or RPC/RDMA protocols that
mandates that the first page offset must always be zero. Moving data
at one address on the server to an entirely different address and
alignment on the client is exactly what RDMA is supposed to do.
It sounds like an implementation omission because the server's upper
layers have never needed it before now. If TCP already handles it, I'm
guessing it's going to be straightforward to fix.
>>> NFSD using DIO is optional. I thought the point was to get it as an
>>> available option so that _others_ could experiment and help categorize
>>> the benefits/pitfalls further?
>>
>> Yes, that is the point. But such experiments lose value if there is no
>> data collection plan to go with them.
>
> Each user runs something they care about performing well and they
> measure the result.
That assumes the user will continue to use the debug interfaces, and
the particular implementation you've proposed, for the rest of time.
And that's not my plan at all.
If we, in the community, cannot reproduce that result, cannot
understand what has been measured, or find that the measurement misses
part or most of the picture, of what value is that when deciding
whether and how to proceed with promoting the mechanism from debug
feature to something with a long-term support lifetime and a
documented ABI-stable user interface?
> Literally the same thing as has been done for anything in Linux since
> it all started. Nothing unicorn or bespoke here.
So let me ask this another way: What do we need users to measure to give
us good quality information about the page cache behavior and system
thrashing behavior you reported?
If a user asked you today "What should I measure to determine if this
mechanism is beneficial for my workload and healthy for my server" what
would you recommend?
For example: I can enable direct I/O on NFSD, but my workload is mostly
one or two clients doing kernel builds. The latency of NFS READs goes
up, but since a kernel build is not I/O bound and the client page caches
hide most of the increase, there is very little to show a measured
change.
So how should I assess and report the impact of NFSD doing direct I/O?
See -- users are not the only ones who are involved in this experiment;
and they will need guidance because we're not providing any
documentation for this feature.
>> If you would rather make this drive-by, then you'll have to realize
>> that you are requesting more than simple review from us. You'll have
>> to be content with the pace at which we overloaded maintainers can get
>> to the work.
>
> I think I just experienced the mailing-list equivalent of the Detroit
> definition of "drive-by". Good/bad news: you're a terrible shot.
The term "drive-by contribution" has a well-understood meaning in the
kernel community. If you are unfamiliar with it, I invite you to review
the mailing list archives. As always, no-one is shooting at you. If
anything, the drive-by contribution is aimed at me.
This year, Neil is not available and Jeff is working on client issues.
I will try to find some time to look at the svcrdma sendto path.
--
Chuck Lever
* Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-28 1:57 ` Chuck Lever
@ 2025-08-28 8:09 ` Mike Snitzer
2025-08-28 14:53 ` Chuck Lever
2025-08-28 16:36 ` [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Jeff Layton
0 siblings, 2 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-28 8:09 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Wed, Aug 27, 2025 at 09:57:39PM -0400, Chuck Lever wrote:
> On 8/27/25 7:15 PM, Mike Snitzer wrote:
> > On Wed, Aug 27, 2025 at 04:56:08PM -0400, Chuck Lever wrote:
> >> On 8/27/25 3:41 PM, Mike Snitzer wrote:
> >>> Is your suggestion to, rather than allocate a disjoint single page,
> >>> borrow the extra page from the end of rq_pages? Just map it into the
> >>> bvec instead of my extra page?
> >>
> >> Yes, the extra page needs to come from rq_pages. But I don't see why it
> >> should come from the /end/ of rq_pages.
> >>
> >> - Extend the start of the byte range back to make it align with the
> >> file's DIO alignment constraint
> >>
> >> - Extend the end of the byte range forward to make it align with the
> >> file's DIO alignment constraint
> >
> > nfsd_analyze_read_dio() does that (start_extra and end_extra).
> >
> >> - Fill in the sink buffer's bvec using pages from rq_pages, as usual
> >>
> >> - When the I/O is complete, adjust the offset in the first bvec entry
> >> forward by setting a non-zero page offset, and adjust the returned
> >> count downward to match the requested byte count from the client
> >
> > Tried it long ago, such bvec manipulation only works when not using
> > RDMA. When the memory is remote, twiddling a local bvec isn't going
> > to ensure the correct pages have the correct data upon return to the
> > client.
> >
> > RDMA is why the pages must be used in-place, and RDMA is also why
> > the extra page needed by this patch (for use as throwaway front-pad
> > for expanded misaligned DIO READ) must either be allocated _or_
> > hopefully it can be from rq_pages (after the end of the client
> > requested READ payload).
> >
> > Or am I wrong and simply need to keep learning about NFSD's IO path?
>
> You're wrong, not to put a fine point on it.
You didn't even understand me, but firmly believe I'm wrong?
> There's nothing I can think of in the RDMA or RPC/RDMA protocols that
> mandates that the first page offset must always be zero. Moving data
> at one address on the server to an entirely different address and
> alignment on the client is exactly what RDMA is supposed to do.
>
> It sounds like an implementation omission because the server's upper
> layers have never needed it before now. If TCP already handles it, I'm
> guessing it's going to be straightforward to fix.
I never said that the first page offset must be zero. I said that I
already did what you suggested and it didn't work with RDMA. I'm
recalling work from many months ago now, but: the client sees the
correct READ payload _except_, IIRC, it is offset by whatever front-pad
was added to expand the misaligned DIO, no matter whether
rqstp->rq_bvec is updated when the IO completes.
But I'll revisit it again.
> >>> NFSD using DIO is optional. I thought the point was to get it as an
> >>> available option so that _others_ could experiment and help categorize
> >>> the benefits/pitfalls further?
> >>
> >> Yes, that is the point. But such experiments lose value if there is no
> >> data collection plan to go with them.
> >
> > Each user runs something they care about performing well and they
> > measure the result.
>
> That assumes the user will continue to use the debug interfaces, and
> the particular implementation you've proposed, for the rest of time.
> And that's not my plan at all.
>
> If we, in the community, cannot reproduce that result, or cannot
> understand what has been measured, or the measurement misses part or
> most of the picture, of what value is that for us to decide whether and
> how to proceed with promoting the mechanism from debug feature to
> something with a long-term support lifetime and a documented ABI-stable
> user interface?
I'll work to put a finer point on how to reproduce and enumerate the
things to look for (representative flamegraphs showing the issue,
which I already did at last Bakeathon).
But I have repeatedly offered that the pathological worst case is a
client doing sequential write IO to a file that is 3-4x larger than
the NFS server's system memory.
Large-memory systems with 8 or more NVMe devices and fast networks
allow for huge data ingest. These are the platforms that showcase
MM's dirty-writeback limitations when large sequential IO is
initiated from the NFS client and it is able to overrun the NFS server.
In addition, in general DIO requires significantly less memory and
CPU; so platforms that have more limited resources (and may have
historically struggled) could have a new lease on life if they switch
NFSD from buffered to DIO mode.
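For what it's worth, that worst case can be approximated with an fio
job along these lines (a sketch only; the mount path, job name, and
size are assumptions to adapt, with size chosen at 3-4x the NFS
server's RAM):

```ini
; hypothetical fio job: sequential ingest that overruns server memory
[global]
directory=/mnt/nfs      ; an NFS mount of the server under test
rw=write                ; sequential writes
bs=1M
ioengine=psync
direct=0                ; buffered on the client; server-side behavior
                        ; is what the NFSD io_cache_* debugfs knobs select

[seq-overrun]
size=768g               ; pick ~3-4x the NFS server's system memory
```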
> > Literally the same thing as has been done for anything in Linux since
> > it all started. Nothing unicorn or bespoke here.
>
> So let me ask this another way: What do we need users to measure to give
> us good quality information about the page cache behavior and system
> thrashing behavior you reported?
IO throughput, CPU and memory usage should be monitored over time.
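As a concrete starting point, a minimal recipe could sample the
writeback counters and stuck nfsd threads directly from /proc during
the test run. This is a sketch assuming a Linux server, not an
agreed-upon methodology:

```shell
# Sketch: sample page-cache/writeback pressure and stuck nfsd threads.
# Run alongside the workload; in practice loop for the whole test.
for i in 1 2 3; do
    date
    grep -E '^(MemFree|Cached|Dirty|Writeback):' /proc/meminfo
    # count nfsd threads in uninterruptible (D) sleep
    ps -eo stat=,comm= | awk '$2 == "nfsd" && $1 ~ /^D/' | wc -l
    sleep 1
done
```

Pairing that with per-device throughput (e.g. iostat -x 1) and the
client-side fio/dd numbers would cover the throughput, CPU, and memory
axes mentioned above.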
> For example: I can enable direct I/O on NFSD, but my workload is mostly
> one or two clients doing kernel builds. The latency of NFS READs goes
> up, but since a kernel build is not I/O bound and the client page caches
> hide most of the increase, there is very little to show a measured
> change.
>
> So how should I assess and report the impact of NFSD doing direct I/O?
Your underwhelming usage isn't what this patchset is meant to help.
> See -- users are not the only ones who are involved in this experiment;
> and they will need guidance because we're not providing any
> documentation for this feature.
Users are not created equal. Major companies like Oracle and Meta
_should_ be aware of NFSD's problems with buffered IO. They have
internal and external stakeholders that are power users.
Jeff, does Meta ever see NFSD struggle to consistently use NVMe
devices? Lumpy performance? Full-blown IO stalls? Lots of NFSD
threads hung in D state?
> >> If you would rather make this drive-by, then you'll have to realize
> >> that you are requesting more than simple review from us. You'll have
> >> to be content with the pace at which us overloaded maintainers can get
> >> to the work.
> >
> > I think I just experienced the mailing-list equivalent of the Detroit
> > definition of "drive-by". Good/bad news: you're a terrible shot.
>
> The term "drive-by contribution" has a well-understood meaning in the
> kernel community. If you are unfamiliar with it, I invite you to review
> the mailing list archives. As always, no-one is shooting at you. If
> anything, the drive-by contribution is aimed at me.
It is a blatant miscategorization here. That you just doubled down
on it having relevance in this instance is flagrantly wrong.
Whatever compels you to belittle me and my contributions, just know
it is extremely hard to take. Highly unproductive and unprofessional.
Boom, done.
* Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-28 8:09 ` Mike Snitzer
@ 2025-08-28 14:53 ` Chuck Lever
2025-08-28 18:52 ` Mike Snitzer
2025-08-28 16:36 ` [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Jeff Layton
1 sibling, 1 reply; 42+ messages in thread
From: Chuck Lever @ 2025-08-28 14:53 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 8/28/25 4:09 AM, Mike Snitzer wrote:
> On Wed, Aug 27, 2025 at 09:57:39PM -0400, Chuck Lever wrote:
>> On 8/27/25 7:15 PM, Mike Snitzer wrote:
>>> On Wed, Aug 27, 2025 at 04:56:08PM -0400, Chuck Lever wrote:
>>>> On 8/27/25 3:41 PM, Mike Snitzer wrote:
>>>>> Is your suggestion to, rather than allocate a disjoint single page,
>>>>> borrow the extra page from the end of rq_pages? Just map it into the
>>>>> bvec instead of my extra page?
>>>>
>>>> Yes, the extra page needs to come from rq_pages. But I don't see why it
>>>> should come from the /end/ of rq_pages.
>>>>
>>>> - Extend the start of the byte range back to make it align with the
>>>> file's DIO alignment constraint
>>>>
>>>> - Extend the end of the byte range forward to make it align with the
>>>> file's DIO alignment constraint
>>>
>>> nfsd_analyze_read_dio() does that (start_extra and end_extra).
>>>
>>>> - Fill in the sink buffer's bvec using pages from rq_pages, as usual
>>>>
>>>> - When the I/O is complete, adjust the offset in the first bvec entry
>>>> forward by setting a non-zero page offset, and adjust the returned
>>>> count downward to match the requested byte count from the client
>>>
>>> Tried it long ago, such bvec manipulation only works when not using
>>> RDMA. When the memory is remote, twiddling a local bvec isn't going
>>> to ensure the correct pages have the correct data upon return to the
>>> client.
>>>
>>> RDMA is why the pages must be used in-place, and RDMA is also why
>>> the extra page needed by this patch (for use as throwaway front-pad
>>> for expanded misaligned DIO READ) must either be allocated _or_
>>> hopefully it can be from rq_pages (after the end of the client
>>> requested READ payload).
>>>
>>> Or am I wrong and simply need to keep learning about NFSD's IO path?
>>
>> You're wrong, not to put a fine point on it.
>
> You didn't even understand me, but firmly believe I'm wrong?
>
>> There's nothing I can think of in the RDMA or RPC/RDMA protocols that
>> mandates that the first page offset must always be zero. Moving data
>> at one address on the server to an entirely different address and
>> alignment on the client is exactly what RDMA is supposed to do.
>>
>> It sounds like an implementation omission because the server's upper
>> layers have never needed it before now. If TCP already handles it, I'm
>> guessing it's going to be straightforward to fix.
>
> I never said that the first page offset must be zero. I said that I
> already did what you suggested and it didn't work with RDMA. I'm
> recalling work from many months ago now, but: the client sees the
> correct READ payload _except_, IIRC, it is offset by whatever front-pad
> was added to expand the misaligned DIO, no matter whether
> rqstp->rq_bvec is updated when the IO completes.
>
> But I'll revisit it again.
For the record, this email thread is the very first time I've heard that
you tried the simple approach and that it worked with TCP and not with
RDMA. I wish I had known that a while ago.
>>>>> NFSD using DIO is optional. I thought the point was to get it as an
>>>>> available option so that _others_ could experiment and help categorize
>>>>> the benefits/pitfalls further?
>>>>
>>>> Yes, that is the point. But such experiments lose value if there is no
>>>> data collection plan to go with them.
>>>
>>> Each user runs something they care about performing well and they
>>> measure the result.
>>
>> That assumes the user will continue to use the debug interfaces, and
>> the particular implementation you've proposed, for the rest of time.
>> And that's not my plan at all.
>>
>> If we, in the community, cannot reproduce that result, or cannot
>> understand what has been measured, or the measurement misses part or
>> most of the picture, of what value is that for us to decide whether and
>> how to proceed with promoting the mechanism from debug feature to
>> something with a long-term support lifetime and a documented ABI-stable
>> user interface?
>
> I'll work to put a finer point on how to reproduce and enumerate the
> things to look for (representative flamegraphs showing the issue,
> which I already did at last Bakeathon).
>
> But I have repeatedly offered that the pathological worst case is a
> client doing sequential write IO to a file that is 3-4x larger than
> the NFS server's system memory.
>
> Large-memory systems with 8 or more NVMe devices and fast networks
> allow for huge data ingest. These are the platforms that showcase
> MM's dirty-writeback limitations when large sequential IO is
> initiated from the NFS client and it is able to overrun the NFS server.
>
> In addition, in general DIO requires significantly less memory and
> CPU; so platforms that have more limited resources (and may have
> historically struggled) could have a new lease on life if they switch
> NFSD from buffered to DIO mode.
>
>>> Literally the same thing as has been done for anything in Linux since
>>> it all started. Nothing unicorn or bespoke here.
>>
>> So let me ask this another way: What do we need users to measure to give
>> us good quality information about the page cache behavior and system
>> thrashing behavior you reported?
>
> IO throughput, CPU and memory usage should be monitored over time.
My point is users need a recipe for what to monitor, i.e. clear, specific
instructions, and that recipe needs to be applied the same way by every
experimenter (because, as I say below, we too are collecting data about
the user's experience with this feature).
Can you provide a few specific instructions that experimenters can
follow on how to monitor and report these metrics? Simply copy-paste
what you and your test team have been using to observe system behavior.
>> For example: I can enable direct I/O on NFSD, but my workload is mostly
>> one or two clients doing kernel builds. The latency of NFS READs goes
>> up, but since a kernel build is not I/O bound and the client page caches
>> hide most of the increase, there is very little to show a measured
>> change.
>>
>> So how should I assess and report the impact of NFSD doing direct I/O?
>
> Your underwhelming usage isn't what this patchset is meant to help.
Again, my point is, how are users who try the debug option going to
know that their workload is or is not of interest? We're not
providing such documentation. Users are just going to turn this on and
try it.
But consider that one of our options is to make direct I/O the default.
We need to know how smaller workloads might be impacted by this new
default. The maintainers are part of this experiment as much as
potential users are. We need to collect data to understand how to make
this work part of a first-class administrative API.
>> See -- users are not the only ones who are involved in this experiment;
>> and they will need guidance because we're not providing any
>> documentation for this feature.
>
> Users are not created equal. Major companies like Oracle and Meta
> _should_ be aware of NFSD's problems with buffered IO. They have
> internal and external stakeholders that are power users.
>
> Jeff, does Meta ever see NFSD struggle to consistently use NVMe
> devices? Lumpy performance? Full-blown IO stalls? Lots of NFSD
> threads hung in D state?
I have no idea why you think I don't understand those facts, but you are
still missing my point. NFSD maintainers are part of this experimental
set up as much as users are. We need good information to decide how to
shape this feature going forward. Let's build a plan to get that
information.
>>>> If you would rather make this drive-by, then you'll have to realize
>>>> that you are requesting more than simple review from us. You'll have
>>>> to be content with the pace at which us overloaded maintainers can get
>>>> to the work.
>>>
>>> I think I just experienced the mailing-list equivalent of the Detroit
>>> definition of "drive-by". Good/bad news: you're a terrible shot.
>>
>> The term "drive-by contribution" has a well-understood meaning in the
>> kernel community. If you are unfamiliar with it, I invite you to review
>> the mailing list archives. As always, no-one is shooting at you. If
>> anything, the drive-by contribution is aimed at me.
>
> It is a blatant miscategorization here. That you just doubled down
> on it having relevance in this instance is flagrantly wrong.
First of all, "drive-by contribution" is not meant to be pejorative.
It is what it is: someone throws you a patch and then goes away. It is
a fact of life for open source maintainers.
I'm simply trying to get a sense of your commitment to finishing this
work and staying with it going forward. You keep telling me "well I have
other things to do." What do you think that indicates to a busy
maintainer? That you will be around to finish the feature? Or rather
that your employer is going to yank you onto other projects, leaving us
stuck with the work of finishing it?
You have rejected one or more reasonable review comments and maintainer
requests. That doesn't convince me you are willing to collaborate with
us on maintaining the feature.
My biggest concern here is how much more work maintaining NFSD will be
once this change goes in. It's a major feature and a significant change.
It will require a lot of code massage in the short- and medium-term,
just as a start.
Maybe I'm assuming that you, as a kernel maintainer yourself, already
understand that this mindset and this calculus are part of my email replies.
So, I'm being frank. I'm not trying to offend or belittle.
> Whatever compels you to belittle me and my contributions, just know
> it is extremely hard to take. Highly unproductive and unprofessional.
I can't control how you understand my emails. If you choose to be
offended even though I'm trying to have an honest discussion, there's
nothing I can do about it.
I can't control /if/ you understand my emails. I'm repeating myself
quite a bit here because a lot of what I say sails past you. You read
things into what I say that I don't intend, and you miss a lot of what
I did intend. There's nothing I can do about that either.
I can't control your unproductive and unprofessional behavior. You keep
rejecting valid review comments and maintainer requests with comments
like "I don't feel like it" and "your reason for wanting this change is
invalid" and "you are wasting my time". Again, this suggests to me that
maintaining this feature will be a lot more work than it could be, and
that is an ongoing concern.
If there's something that offends you, the professional response on your
part is to point that out to me in the moment rather than trying to spit
back. Because most likely, I wasn't intending to offend at all. If you
keep that in mind while reading my emails, you might have an easier time
of it.
--
Chuck Lever
* Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-27 19:41 ` Mike Snitzer
2025-08-27 20:56 ` Chuck Lever
@ 2025-08-28 16:22 ` Jeff Layton
2025-08-28 16:27 ` Chuck Lever
1 sibling, 1 reply; 42+ messages in thread
From: Jeff Layton @ 2025-08-28 16:22 UTC (permalink / raw)
To: Mike Snitzer, Chuck Lever; +Cc: linux-nfs
On Wed, 2025-08-27 at 15:41 -0400, Mike Snitzer wrote:
> On Wed, Aug 27, 2025 at 11:34:03AM -0400, Chuck Lever wrote:
> > On 8/26/25 2:57 PM, Mike Snitzer wrote:
> > > If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
> > > DIO-aligned block (on either end of the READ). The expanded READ is
> > > verified to have proper offset/len (logical_block_size) and
> > > dma_alignment checking.
> > >
> > > Must allocate and use a bounce-buffer page (called 'start_extra_page')
> > > if/when expanding the misaligned READ requires reading an extra partial
> > > page at the start of the READ so that it is DIO-aligned. Otherwise that
> > > extra page at the start will make its way back to the NFS client and
> > > corruption will occur. The problem was found, and this fix of using an
> > > extra page was verified, using the 'dt' utility:
> > > dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
> > > iotype=sequential pattern=iot onerr=abort oncerr=abort
> > > see: https://github.com/RobinTMiller/dt.git
> > >
> > > Any misaligned READ that is less than 32K won't be expanded to be
> > > DIO-aligned (this heuristic just avoids excess work, like allocating
> > > start_extra_page, for smaller IO that can generally already perform
> > > well using buffered IO).
> > >
> > > Suggested-by: Jeff Layton <jlayton@kernel.org>
> > > Suggested-by: Chuck Lever <chuck.lever@oracle.com>
> > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > > fs/nfsd/vfs.c | 200 +++++++++++++++++++++++++++++++++++--
> > > include/linux/sunrpc/svc.h | 5 +-
> > > 2 files changed, 194 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index c340708fbab4d..64732dc8985d6 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -19,6 +19,7 @@
> > > #include <linux/splice.h>
> > > #include <linux/falloc.h>
> > > #include <linux/fcntl.h>
> > > +#include <linux/math.h>
> > > #include <linux/namei.h>
> > > #include <linux/delay.h>
> > > #include <linux/fsnotify.h>
> > > @@ -1073,6 +1074,153 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > > }
> > >
> > > +struct nfsd_read_dio {
> > > + loff_t start;
> > > + loff_t end;
> > > + unsigned long start_extra;
> > > + unsigned long end_extra;
> > > + struct page *start_extra_page;
> > > +};
> > > +
> > > +static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
> > > +{
> > > + memset(read_dio, 0, sizeof(*read_dio));
> > > + read_dio->start_extra_page = NULL;
> > > +}
> > > +
> > > +#define NFSD_READ_DIO_MIN_KB (32 << 10)
> > > +
> > > +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > + struct nfsd_file *nf, loff_t offset,
> > > + unsigned long len, unsigned int base,
> > > + struct nfsd_read_dio *read_dio)
> > > +{
> > > + const u32 dio_blocksize = nf->nf_dio_read_offset_align;
> > > + loff_t middle_end, orig_end = offset + len;
> > > +
> > > + if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
> > > + "%s: underlying filesystem has not provided DIO alignment info\n",
> > > + __func__))
> > > + return false;
> > > + if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> > > + "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> > > + __func__, dio_blocksize, PAGE_SIZE))
> > > + return false;
> >
> > IMHO these checks do not warrant a WARN. Perhaps a trace event, instead?
>
> I won't die on this hill, I just don't see the risk of these given
> they are highly unlikely ("famous last words").
>
> But if they trigger we should surely be made aware immediately, not
> only when someone happens to have a trace event enabled (which would
> only happen with further support and engineering involvement to chase
> "why isn't O_DIRECT being used like NFSD was optionally configured
> to!?").
>
A kernel log message in this case makes sense to me, since it is a
(minor) administrative issue. WARN_ONCE() is going to throw a big,
scary stack trace, but that won't be terribly useful. We'll get hit
with bug reports from it for years.
Maybe pr_notice_once() for this? Or a pr_notice_once(), but done on a
per-export basis?
> > > + /* Return early if IO is irreparably misaligned (len < PAGE_SIZE,
> > > + * or base not aligned).
> > > + * Ondisk alignment is implied by the following code that expands
> > > + * misaligned IO to have a DIO-aligned offset and len.
> > > + */
> > > + if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
> > > + return false;
> > > +
> > > + init_nfsd_read_dio(read_dio);
> > > +
> > > + read_dio->start = round_down(offset, dio_blocksize);
> > > + read_dio->end = round_up(orig_end, dio_blocksize);
> > > + read_dio->start_extra = offset - read_dio->start;
> > > + read_dio->end_extra = read_dio->end - orig_end;
> > > +
> > > + /*
> > > + * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
> > > + * to be DIO-aligned (this heuristic avoids excess work, like allocating
> > > + * start_extra_page, for smaller IO that can generally already perform
> > > + * well using buffered IO).
> > > + */
> > > + if ((read_dio->start_extra || read_dio->end_extra) &&
> > > + (len < NFSD_READ_DIO_MIN_KB)) {
> > > + init_nfsd_read_dio(read_dio);
> > > + return false;
> > > + }
> > > +
> > > + if (read_dio->start_extra) {
> > > + read_dio->start_extra_page = alloc_page(GFP_KERNEL);
> >
> > This introduces a page allocation where there weren't any before. For
> > NFSD, I/O pages come from rqstp->rq_pages[] so that memory allocation
> > like this is not needed on an I/O path.
>
> NFSD never supported DIO before. Yes, with this patch there is
> a single page allocation in the misaligned DIO READ path (if it
> requires reading extra before the client requested data starts).
>
> I tried to succinctly explain the need for the extra page allocation
> for misaligned DIO READ in this patch's header (in 2nd paragraph
> of the above header).
>
> I cannot see how to read extra, not requested by the client, into the
> head of rq_pages without causing serious problems. So that cannot be
> what you're saying is needed.
>
> > So I think the answer to this is that I want you to implement reading
> > an entire aligned range from the file and then forming the NFS READ
> > response with only the range of bytes that the client requested, as we
> > discussed before.
>
> That is what I'm doing. But you're taking issue with my implementation
> that uses a single extra page.
>
> > The use of xdr_buf and bvec should make that quite
> > straightforward.
>
> Is your suggestion to, rather than allocate a disjoint single page,
> borrow the extra page from the end of rq_pages? Just map it into the
> bvec instead of my extra page?
>
> > This should make the aligned and unaligned cases nearly identical and
> > much less fraught.
>
> Regardless of which memory is used to read the extra data, I don't see
> how the care I've taken to read extra but hide that fact from the
> client can be avoided. So the pre/post misaligned DIO READ code won't
> change a whole lot. But once I understand your suggestion better
> (after a clarifying response to this message) hopefully I'll see what
> you're saying.
>
> All said, this patchset is very important to me, I don't want it to
> miss v6.18 -- I'm still "in it to win it" but it feels like I do need
> your or others' help to pull this off.
>
> And/or is it possible to accept this initial implementation with
> mutual understanding that we must revisit your concern about my
> allocating an extra page for the misaligned DIO READ path?
>
> > > + if (WARN_ONCE(read_dio->start_extra_page == NULL,
> > > + "%s: Unable to allocate start_extra_page\n", __func__)) {
> > > + init_nfsd_read_dio(read_dio);
> > > + return false;
> > > + }
> > > + }
> > > +
> > > + return true;
> > > +}
> > > +
> > > +static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
> > > + struct nfsd_read_dio *read_dio,
> > > + ssize_t bytes_read,
> > > + unsigned long bytes_expected,
> > > + loff_t *offset,
> > > + unsigned long *rq_bvec_numpages)
> > > +{
> > > + ssize_t host_err = bytes_read;
> > > + loff_t v;
> > > +
> > > + if (!read_dio->start_extra && !read_dio->end_extra)
> > > + return host_err;
> > > +
> > > + /* If nfsd_analyze_read_dio() allocated a start_extra_page it must
> > > + * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
> > > + */
> > > + if (read_dio->start_extra_page) {
> > > + __free_page(read_dio->start_extra_page);
> > > + *rq_bvec_numpages -= 1;
> > > + v = *rq_bvec_numpages;
> > > + memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
> > > + v * sizeof(struct bio_vec));
> > > + }
> > > + /* Eliminate any end_extra bytes from the last page */
> > > + v = *rq_bvec_numpages;
> > > + rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
> > > +
> > > + if (host_err < 0) {
> > > + /* Underlying FS will return -EINVAL if misaligned
> > > + * DIO is attempted because it shouldn't be.
> > > + */
> > > + WARN_ON_ONCE(host_err == -EINVAL);
> > > + return host_err;
> > > + }
> > > +
> > > + /* nfsd_analyze_read_dio() may have expanded the start and end,
> > > + * if so adjust returned read size to reflect original extent.
> > > + */
> > > + *offset += read_dio->start_extra;
> > > + if (likely(host_err >= read_dio->start_extra)) {
> > > + host_err -= read_dio->start_extra;
> > > + if (host_err > bytes_expected)
> > > + host_err = bytes_expected;
> > > + } else {
> > > + /* Short read that didn't read any of requested data */
> > > + host_err = 0;
> > > + }
> > > +
> > > + return host_err;
> > > +}
> > > +
> > > +static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
> > > + unsigned addr_mask, unsigned len_mask)
> > > +{
> > > + const struct bio_vec *bvec = i->bvec;
> > > + unsigned skip = i->iov_offset;
> > > + size_t size = i->count;
> >
> > checkpatch.pl is complaining about the use of "unsigned" rather than
> > "unsigned int".
>
> OK.
>
> > > +
> > > + if (size & len_mask)
> > > + return false;
> > > + do {
> > > + size_t len = bvec->bv_len;
> > > +
> > > + if (len > size)
> > > + len = size;
> > > + if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> > > + return false;
> > > + bvec++;
> > > + size -= len;
> > > + skip = 0;
> > > + } while (size);
> > > +
> > > + return true;
> > > +}
> > > +
> > > /**
> > > * nfsd_iter_read - Perform a VFS read using an iterator
> > > * @rqstp: RPC transaction context
> > > @@ -1094,7 +1242,8 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > unsigned int base, u32 *eof)
> > > {
> > > struct file *file = nf->nf_file;
> > > - unsigned long v, total;
> > > + unsigned long v, total, in_count = *count;
> > > + struct nfsd_read_dio read_dio;
> > > struct iov_iter iter;
> > > struct kiocb kiocb;
> > > ssize_t host_err;
> > > @@ -1102,13 +1251,34 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > >
> > > init_sync_kiocb(&kiocb, file);
> > >
> > > + v = 0;
> > > + total = in_count;
> > > +
> > > switch (nfsd_io_cache_read) {
> > > case NFSD_IO_DIRECT:
> > > - /* Verify ondisk and memory DIO alignment */
> > > - if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
> > > - (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0) &&
> > > - (base & (nf->nf_dio_mem_align - 1)) == 0)
> > > - kiocb.ki_flags = IOCB_DIRECT;
> > > + /*
> > > + * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
> > > + * the next DIO-aligned block (on either end of the READ).
> > > + */
> > > + if (nfsd_analyze_read_dio(rqstp, fhp, nf, offset,
> > > + in_count, base, &read_dio)) {
> > > + /* trace_nfsd_read_vector() will reflect larger
> > > + * DIO-aligned READ.
> > > + */
> > > + offset = read_dio.start;
> > > + in_count = read_dio.end - offset;
> > > + total = in_count;
> > > +
> > > + kiocb.ki_flags |= IOCB_DIRECT;
> > > + if (read_dio.start_extra) {
> > > + len = read_dio.start_extra;
> > > + bvec_set_page(&rqstp->rq_bvec[v],
> > > + read_dio.start_extra_page,
> > > + len, PAGE_SIZE - len);
> > > + total -= len;
> > > + ++v;
> > > + }
> > > + }
> > > break;
> > > case NFSD_IO_DONTCACHE:
> > > kiocb.ki_flags = IOCB_DONTCACHE;
> > > @@ -1120,8 +1290,6 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > >
> > > kiocb.ki_pos = offset;
> > >
> > > - v = 0;
> > > - total = *count;
> > > while (total) {
> > > len = min_t(size_t, total, PAGE_SIZE - base);
> > > bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> > > @@ -1132,9 +1300,21 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > }
> > > WARN_ON_ONCE(v > rqstp->rq_maxpages);
> > >
> > > - trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> > > - iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> > > + trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
> > > + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
> > > +
> > > + if ((kiocb.ki_flags & IOCB_DIRECT) &&
> > > + !nfsd_iov_iter_aligned_bvec(&iter, nf->nf_dio_mem_align-1,
> > > + nf->nf_dio_read_offset_align-1))
> > > + kiocb.ki_flags &= ~IOCB_DIRECT;
> > > +
> > > host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
> > > +
> > > + if (in_count != *count) {
> > > + /* misaligned DIO expanded read to be DIO-aligned */
> > > + host_err = nfsd_complete_misaligned_read_dio(rqstp, &read_dio,
> > > + host_err, *count, &offset, &v);
> > > + }
> > > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > > }
> > >
> > > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > > index e64ab444e0a7f..190c2667500e2 100644
> > > --- a/include/linux/sunrpc/svc.h
> > > +++ b/include/linux/sunrpc/svc.h
> > > @@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > > * pages, one for the request, and one for the reply.
> > > * nfsd_splice_actor() might need an extra page when a READ payload
> > > * is not page-aligned.
> > > + * nfsd_iter_read() might need two extra pages when a READ payload
> > > + * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
> > > + * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
> > > */
> > > static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
> > > {
> > > - return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
> > > + return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
> > > }
> > >
> > > /*
> >
> > To properly evaluate the impact of using direct I/O for reads with real
> > world user workloads, we will want to identify (or construct) some
> > metrics (and this is future work, but near-term future).
> >
> > Seems like allocating memory becomes difficult only when too many pages
> > are dirty. I am skeptical that the issue is due to read caching, since
> > clean pages in the page cache are pretty easy to evict quickly, AIUI. If
> > that's incorrect, I'd like to understand why.
>
> The much more problematic case is heavy WRITE workload with a working
> set that far exceeds system memory.
>
> But I agree it doesn't make a whole lot of sense that clean pages in
> the page cache would be getting in the way. All I can tell you is
> that in my experience MM seems to _not_ evict them quickly (but more
> focused read-only testing is warranted to further understand the
> dynamics and heuristics in MM and beyond -- especially if/when
> READ-only then a pivot to a mix of heavy READ and WRITE or
> WRITE-only).
>
> NFSD using DIO is optional. I thought the point was to get it as an
> available option so that _others_ could experiment and help categorize
> the benefits/pitfalls further?
>
> I cannot be a one man show on all this. I welcome more help from
> anyone interested.
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-28 16:22 ` Jeff Layton
@ 2025-08-28 16:27 ` Chuck Lever
0 siblings, 0 replies; 42+ messages in thread
From: Chuck Lever @ 2025-08-28 16:27 UTC (permalink / raw)
To: Jeff Layton, Mike Snitzer; +Cc: linux-nfs
On 8/28/25 12:22 PM, Jeff Layton wrote:
> On Wed, 2025-08-27 at 15:41 -0400, Mike Snitzer wrote:
>> On Wed, Aug 27, 2025 at 11:34:03AM -0400, Chuck Lever wrote:
>>> On 8/26/25 2:57 PM, Mike Snitzer wrote:
>>>> If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
>>>> DIO-aligned block (on either end of the READ). The expanded READ is
>>>> verified to have proper offset/len (logical_block_size) and
>>>> dma_alignment checking.
>>>>
>>>> Must allocate and use a bounce-buffer page (called 'start_extra_page')
>>>> if/when expanding the misaligned READ requires reading an extra partial
>>>> page at the start of the READ so that it is DIO-aligned. Otherwise that
>>>> extra page at the start will make its way back to the NFS client and
>>>> corruption will occur. The problem was found, and the fix of using an
>>>> extra page verified, using the 'dt' utility:
>>>> dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
>>>> iotype=sequential pattern=iot onerr=abort oncerr=abort
>>>> see: https://github.com/RobinTMiller/dt.git
>>>>
>>>> Any misaligned READ that is less than 32K won't be expanded to be
>>>> DIO-aligned (this heuristic just avoids excess work, like allocating
>>>> start_extra_page, for smaller IO that can generally already perform
>>>> well using buffered IO).
>>>>
>>>> Suggested-by: Jeff Layton <jlayton@kernel.org>
>>>> Suggested-by: Chuck Lever <chuck.lever@oracle.com>
>>>> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>>>> Reviewed-by: Jeff Layton <jlayton@kernel.org>
>>>> ---
>>>> fs/nfsd/vfs.c | 200 +++++++++++++++++++++++++++++++++++--
>>>> include/linux/sunrpc/svc.h | 5 +-
>>>> 2 files changed, 194 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>>>> index c340708fbab4d..64732dc8985d6 100644
>>>> --- a/fs/nfsd/vfs.c
>>>> +++ b/fs/nfsd/vfs.c
>>>> @@ -19,6 +19,7 @@
>>>> #include <linux/splice.h>
>>>> #include <linux/falloc.h>
>>>> #include <linux/fcntl.h>
>>>> +#include <linux/math.h>
>>>> #include <linux/namei.h>
>>>> #include <linux/delay.h>
>>>> #include <linux/fsnotify.h>
>>>> @@ -1073,6 +1074,153 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>>> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>>>> }
>>>>
>>>> +struct nfsd_read_dio {
>>>> + loff_t start;
>>>> + loff_t end;
>>>> + unsigned long start_extra;
>>>> + unsigned long end_extra;
>>>> + struct page *start_extra_page;
>>>> +};
>>>> +
>>>> +static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
>>>> +{
>>>> + memset(read_dio, 0, sizeof(*read_dio));
>>>> + read_dio->start_extra_page = NULL;
>>>> +}
>>>> +
>>>> +#define NFSD_READ_DIO_MIN_KB (32 << 10)
>>>> +
>>>> +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>>> + struct nfsd_file *nf, loff_t offset,
>>>> + unsigned long len, unsigned int base,
>>>> + struct nfsd_read_dio *read_dio)
>>>> +{
>>>> + const u32 dio_blocksize = nf->nf_dio_read_offset_align;
>>>> + loff_t middle_end, orig_end = offset + len;
>>>> +
>>>> + if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
>>>> + "%s: underlying filesystem has not provided DIO alignment info\n",
>>>> + __func__))
>>>> + return false;
>>>> + if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
>>>> + "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
>>>> + __func__, dio_blocksize, PAGE_SIZE))
>>>> + return false;
>>>
>>> IMHO these checks do not warrant a WARN. Perhaps a trace event, instead?
>>
>> I won't die on this hill, I just don't see the risk of these given
>> they are highly unlikely ("famous last words").
>>
>> But if they trigger we should surely be made aware immediately. Not
>> only if someone happens to have a trace event enabled (which would
>> only happen with further support and engineering involvement to chase
>> "why isn't O_DIRECT being used like NFSD was optionally configured
>> to!?").
>>
>
>
> A kernel log message in this case makes sense to me, since it is a
> (minor) administrative issue. WARN_ONCE() is going to throw a big,
> scary stack trace, however that won't be terribly useful. We'll get hit
> with bug reports from it for years.
Agreed, the stack trace isn't very useful information.
> Maybe pr_notice_once() for this? Or, maybe a pr_notice_once, but do it
> on a per-export basis?
Right, I think warning once and then turning off the warning for all
subsequent problems is going to cause a lot of missed problems. Warning
once per export sounds like a step in the right direction.
>>>> + /* Return early if IO is irreparably misaligned (len < PAGE_SIZE,
>>>> + * or base not aligned).
>>>> + * Ondisk alignment is implied by the following code that expands
>>>> + * misaligned IO to have a DIO-aligned offset and len.
>>>> + */
>>>> + if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
>>>> + return false;
>>>> +
>>>> + init_nfsd_read_dio(read_dio);
>>>> +
>>>> + read_dio->start = round_down(offset, dio_blocksize);
>>>> + read_dio->end = round_up(orig_end, dio_blocksize);
>>>> + read_dio->start_extra = offset - read_dio->start;
>>>> + read_dio->end_extra = read_dio->end - orig_end;
>>>> +
>>>> + /*
>>>> + * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
>>>> + * to be DIO-aligned (this heuristic avoids excess work, like allocating
>>>> + * start_extra_page, for smaller IO that can generally already perform
>>>> + * well using buffered IO).
>>>> + */
>>>> + if ((read_dio->start_extra || read_dio->end_extra) &&
>>>> + (len < NFSD_READ_DIO_MIN_KB)) {
>>>> + init_nfsd_read_dio(read_dio);
>>>> + return false;
>>>> + }
>>>> +
>>>> + if (read_dio->start_extra) {
>>>> + read_dio->start_extra_page = alloc_page(GFP_KERNEL);
>>>
>>> This introduces a page allocation where there weren't any before. For
>>> NFSD, I/O pages come from rqstp->rq_pages[] so that memory allocation
>>> like this is not needed on an I/O path.
>>
>> NFSD never supported DIO before. Yes, with this patch there is
>> a single page allocation in the misaligned DIO READ path (if it
>> requires reading extra before the client requested data starts).
>>
>> I tried to succinctly explain the need for the extra page allocation
>> for misaligned DIO READ in this patch's header (in 2nd paragraph
>> of the above header).
>>
>> I cannot see how to read extra, not requested by the client, into the
>> head of rq_pages without causing serious problems. So that cannot be
>> what you're saying is needed.
>>
>>> So I think the answer to this is that I want you to implement reading
>>> an entire aligned range from the file and then forming the NFS READ
>>> response with only the range of bytes that the client requested, as we
>>> discussed before.
>>
>> That is what I'm doing. But you're taking issue with my implementation
>> that uses a single extra page.
>>
>>> The use of xdr_buf and bvec should make that quite
>>> straightforward.
>>
>> Is your suggestion to, rather than allocate a disjoint single page,
>> borrow the extra page from the end of rq_pages? Just map it into the
>> bvec instead of my extra page?
>>
>>> This should make the aligned and unaligned cases nearly identical and
>>> much less fraught.
>>
>> Regardless of which memory used to read the extra data, I don't see
>> how the care I've taken to read extra but hide that fact from the
>> client can be avoided. So the pre/post misaligned DIO READ code won't
>> change a whole lot. But once I understand your suggestion better
>> (after a clarifying response to this message) hopefully I'll see what
>> you're saying.
>>
>> All said, this patchset is very important to me, I don't want it to
>> miss v6.18 -- I'm still "in it to win it" but it feels like I do need
>> your or others' help to pull this off.
>>
>> And/or is it possible to accept this initial implementation with
>> mutual understanding that we must revisit your concern about my
>> allocating an extra page for the misaligned DIO READ path?
>>
>>>> + if (WARN_ONCE(read_dio->start_extra_page == NULL,
>>>> + "%s: Unable to allocate start_extra_page\n", __func__)) {
>>>> + init_nfsd_read_dio(read_dio);
>>>> + return false;
>>>> + }
>>>> + }
>>>> +
>>>> + return true;
>>>> +}
>>>> +
>>>> +static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
>>>> + struct nfsd_read_dio *read_dio,
>>>> + ssize_t bytes_read,
>>>> + unsigned long bytes_expected,
>>>> + loff_t *offset,
>>>> + unsigned long *rq_bvec_numpages)
>>>> +{
>>>> + ssize_t host_err = bytes_read;
>>>> + loff_t v;
>>>> +
>>>> + if (!read_dio->start_extra && !read_dio->end_extra)
>>>> + return host_err;
>>>> +
>>>> + /* If nfsd_analyze_read_dio() allocated a start_extra_page it must
>>>> + * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
>>>> + */
>>>> + if (read_dio->start_extra_page) {
>>>> + __free_page(read_dio->start_extra_page);
>>>> + *rq_bvec_numpages -= 1;
>>>> + v = *rq_bvec_numpages;
>>>> + memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
>>>> + v * sizeof(struct bio_vec));
>>>> + }
>>>> + /* Eliminate any end_extra bytes from the last page */
>>>> + v = *rq_bvec_numpages;
>>>> + rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
>>>> +
>>>> + if (host_err < 0) {
>>>> + /* Underlying FS will return -EINVAL if misaligned
>>>> + * DIO is attempted because it shouldn't be.
>>>> + */
>>>> + WARN_ON_ONCE(host_err == -EINVAL);
>>>> + return host_err;
>>>> + }
>>>> +
>>>> + /* nfsd_analyze_read_dio() may have expanded the start and end,
>>>> + * if so adjust returned read size to reflect original extent.
>>>> + */
>>>> + *offset += read_dio->start_extra;
>>>> + if (likely(host_err >= read_dio->start_extra)) {
>>>> + host_err -= read_dio->start_extra;
>>>> + if (host_err > bytes_expected)
>>>> + host_err = bytes_expected;
>>>> + } else {
>>>> + /* Short read that didn't read any of requested data */
>>>> + host_err = 0;
>>>> + }
>>>> +
>>>> + return host_err;
>>>> +}
>>>> +
>>>> +static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
>>>> + unsigned addr_mask, unsigned len_mask)
>>>> +{
>>>> + const struct bio_vec *bvec = i->bvec;
>>>> + unsigned skip = i->iov_offset;
>>>> + size_t size = i->count;
>>>
>>> checkpatch.pl is complaining about the use of "unsigned" rather than
>>> "unsigned int".
>>
>> OK.
>>
>>>> +
>>>> + if (size & len_mask)
>>>> + return false;
>>>> + do {
>>>> + size_t len = bvec->bv_len;
>>>> +
>>>> + if (len > size)
>>>> + len = size;
>>>> + if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
>>>> + return false;
>>>> + bvec++;
>>>> + size -= len;
>>>> + skip = 0;
>>>> + } while (size);
>>>> +
>>>> + return true;
>>>> +}
>>>> +
>>>> /**
>>>> * nfsd_iter_read - Perform a VFS read using an iterator
>>>> * @rqstp: RPC transaction context
>>>> @@ -1094,7 +1242,8 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>>> unsigned int base, u32 *eof)
>>>> {
>>>> struct file *file = nf->nf_file;
>>>> - unsigned long v, total;
>>>> + unsigned long v, total, in_count = *count;
>>>> + struct nfsd_read_dio read_dio;
>>>> struct iov_iter iter;
>>>> struct kiocb kiocb;
>>>> ssize_t host_err;
>>>> @@ -1102,13 +1251,34 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>>>
>>>> init_sync_kiocb(&kiocb, file);
>>>>
>>>> + v = 0;
>>>> + total = in_count;
>>>> +
>>>> switch (nfsd_io_cache_read) {
>>>> case NFSD_IO_DIRECT:
>>>> - /* Verify ondisk and memory DIO alignment */
>>>> - if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
>>>> - (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0) &&
>>>> - (base & (nf->nf_dio_mem_align - 1)) == 0)
>>>> - kiocb.ki_flags = IOCB_DIRECT;
>>>> + /*
>>>> + * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
>>>> + * the next DIO-aligned block (on either end of the READ).
>>>> + */
>>>> + if (nfsd_analyze_read_dio(rqstp, fhp, nf, offset,
>>>> + in_count, base, &read_dio)) {
>>>> + /* trace_nfsd_read_vector() will reflect larger
>>>> + * DIO-aligned READ.
>>>> + */
>>>> + offset = read_dio.start;
>>>> + in_count = read_dio.end - offset;
>>>> + total = in_count;
>>>> +
>>>> + kiocb.ki_flags |= IOCB_DIRECT;
>>>> + if (read_dio.start_extra) {
>>>> + len = read_dio.start_extra;
>>>> + bvec_set_page(&rqstp->rq_bvec[v],
>>>> + read_dio.start_extra_page,
>>>> + len, PAGE_SIZE - len);
>>>> + total -= len;
>>>> + ++v;
>>>> + }
>>>> + }
>>>> break;
>>>> case NFSD_IO_DONTCACHE:
>>>> kiocb.ki_flags = IOCB_DONTCACHE;
>>>> @@ -1120,8 +1290,6 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>>>
>>>> kiocb.ki_pos = offset;
>>>>
>>>> - v = 0;
>>>> - total = *count;
>>>> while (total) {
>>>> len = min_t(size_t, total, PAGE_SIZE - base);
>>>> bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
>>>> @@ -1132,9 +1300,21 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>>> }
>>>> WARN_ON_ONCE(v > rqstp->rq_maxpages);
>>>>
>>>> - trace_nfsd_read_vector(rqstp, fhp, offset, *count);
>>>> - iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
>>>> + trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
>>>> + iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
>>>> +
>>>> + if ((kiocb.ki_flags & IOCB_DIRECT) &&
>>>> + !nfsd_iov_iter_aligned_bvec(&iter, nf->nf_dio_mem_align-1,
>>>> + nf->nf_dio_read_offset_align-1))
>>>> + kiocb.ki_flags &= ~IOCB_DIRECT;
>>>> +
>>>> host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
>>>> +
>>>> + if (in_count != *count) {
>>>> + /* misaligned DIO expanded read to be DIO-aligned */
>>>> + host_err = nfsd_complete_misaligned_read_dio(rqstp, &read_dio,
>>>> + host_err, *count, &offset, &v);
>>>> + }
>>>> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>>>> }
>>>>
>>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>>> index e64ab444e0a7f..190c2667500e2 100644
>>>> --- a/include/linux/sunrpc/svc.h
>>>> +++ b/include/linux/sunrpc/svc.h
>>>> @@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>>>> * pages, one for the request, and one for the reply.
>>>> * nfsd_splice_actor() might need an extra page when a READ payload
>>>> * is not page-aligned.
>>>> + * nfsd_iter_read() might need two extra pages when a READ payload
>>>> + * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
>>>> + * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
>>>> */
>>>> static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
>>>> {
>>>> - return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
>>>> + return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
>>>> }
>>>>
>>>> /*
>>>
>>> To properly evaluate the impact of using direct I/O for reads with real
>>> world user workloads, we will want to identify (or construct) some
>>> metrics (and this is future work, but near-term future).
>>>
>>> Seems like allocating memory becomes difficult only when too many pages
>>> are dirty. I am skeptical that the issue is due to read caching, since
>>> clean pages in the page cache are pretty easy to evict quickly, AIUI. If
>>> that's incorrect, I'd like to understand why.
>>
>> The much more problematic case is heavy WRITE workload with a working
>> set that far exceeds system memory.
>>
>> But I agree it doesn't make a whole lot of sense that clean pages in
>> the page cache would be getting in the way. All I can tell you is
>> that in my experience MM seems to _not_ evict them quickly (but more
>> focused read-only testing is warranted to further understand the
>> dynamics and heuristics in MM and beyond -- especially if/when
>> READ-only then a pivot to a mix of heavy READ and WRITE or
>> WRITE-only).
>>
>> NFSD using DIO is optional. I thought the point was to get it as an
>> available option so that _others_ could experiment and help categorize
>> the benefits/pitfalls further?
>>
>> I cannot be a one man show on all this. I welcome more help from
>> anyone interested.
>
--
Chuck Lever
* Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-28 8:09 ` Mike Snitzer
2025-08-28 14:53 ` Chuck Lever
@ 2025-08-28 16:36 ` Jeff Layton
1 sibling, 0 replies; 42+ messages in thread
From: Jeff Layton @ 2025-08-28 16:36 UTC (permalink / raw)
To: Mike Snitzer, Chuck Lever; +Cc: linux-nfs
On Thu, 2025-08-28 at 04:09 -0400, Mike Snitzer wrote:
> On Wed, Aug 27, 2025 at 09:57:39PM -0400, Chuck Lever wrote:
> > On 8/27/25 7:15 PM, Mike Snitzer wrote:
> > > On Wed, Aug 27, 2025 at 04:56:08PM -0400, Chuck Lever wrote:
> > > > On 8/27/25 3:41 PM, Mike Snitzer wrote:
> > > > > Is your suggestion to, rather than allocate a disjoint single page,
> > > > > borrow the extra page from the end of rq_pages? Just map it into the
> > > > > bvec instead of my extra page?
> > > >
> > > > Yes, the extra page needs to come from rq_pages. But I don't see why it
> > > > should come from the /end/ of rq_pages.
> > > >
> > > > - Extend the start of the byte range back to make it align with the
> > > > file's DIO alignment constraint
> > > >
> > > > - Extend the end of the byte range forward to make it align with the
> > > > file's DIO alignment constraint
> > >
> > > nfsd_analyze_read_dio() does that (start_extra and end_extra).
> > >
> > > > - Fill in the sink buffer's bvec using pages from rq_pages, as usual
> > > >
> > > > - When the I/O is complete, adjust the offset in the first bvec entry
> > > > forward by setting a non-zero page offset, and adjust the returned
> > > > count downward to match the requested byte count from the client
> > >
> > > Tried it long ago, such bvec manipulation only works when not using
> > > RDMA. When the memory is remote, twiddling a local bvec isn't going
> > > to ensure the correct pages have the correct data upon return to the
> > > client.
> > >
> > > RDMA is why the pages must be used in-place, and RDMA is also why
> > > the extra page needed by this patch (for use as throwaway front-pad
> > > for expanded misaligned DIO READ) must either be allocated _or_
> > > hopefully it can be from rq_pages (after the end of the client
> > > requested READ payload).
> > >
> > > Or am I wrong and simply need to keep learning about NFSD's IO path?
> >
> > You're wrong, not to put a fine point on it.
>
> You didn't even understand me... but you firmly believe I'm wrong?
>
> > There's nothing I can think of in the RDMA or RPC/RDMA protocols that
> > mandates that the first page offset must always be zero. Moving data
> > at one address on the server to an entirely different address and
> > alignment on the client is exactly what RDMA is supposed to do.
> >
> > It sounds like an implementation omission because the server's upper
> > layers have never needed it before now. If TCP already handles it, I'm
> > guessing it's going to be straightforward to fix.
>
> I never said that first page offset must be zero. I said that I
> already did what you suggested and it didn't work with RDMA. This is
> recollection from too many months ago now, but: the client will see the
> correct READ payload _except_ that IIRC it is offset by whatever front-pad
> was added to expand the misaligned DIO, no matter whether
> rqstp->rq_bvec is updated when IO completes.
>
> But I'll revisit it again.
>
> > > > > NFSD using DIO is optional. I thought the point was to get it as an
> > > > > available option so that _others_ could experiment and help categorize
> > > > > the benefits/pitfalls further?
> > > >
> > > > Yes, that is the point. But such experiments lose value if there is no
> > > > data collection plan to go with them.
> > >
> > > Each user runs something they care about performing well and they
> > > measure the result.
> >
> > That assumes the user will continue to use the debug interfaces, and
> > the particular implementation you've proposed, for the rest of time.
> > And that's not my plan at all.
> >
> > If we, in the community, cannot reproduce that result, or cannot
> > understand what has been measured, or the measurement misses part or
> > most of the picture, of what value is that for us to decide whether and
> > how to proceed with promoting the mechanism from debug feature to
> > something with a long-term support lifetime and a documented ABI-stable
> > user interface?
>
> I'll work to put a finer point on how to reproduce and enumerate the
> things to look for (representative flamegraphs showing the issue,
> which I already did at last Bakeathon).
>
> But I have repeatedly offered that the pathological worst case is
> client doing sequential write IO of a file that is 3-4x larger than
> the NFS server's system memory.
>
> Large memory systems with 8 or more NVMe devices, fast networks that
> allow for huge data ingest capabilities. These are the platforms that
> showcase MM's dirty writeback limitations when large sequential IO is
> initiated from the NFS client and it's able to overrun the NFS server.
>
> In addition, in general DIO requires significantly less memory and
> CPU; so platforms that have more limited resources (and may have
> historically struggled) could have a new lease on life if they switch
> NFSD from buffered to DIO mode.
>
> > > Literally the same thing as has been done for anything in Linux since
> > > it all started. Nothing unicorn or bespoke here.
> >
> > So let me ask this another way: What do we need users to measure to give
> > us good quality information about the page cache behavior and system
> > thrashing behavior you reported?
>
> IO throughput, CPU and memory usage should be monitored over time.
>
> > For example: I can enable direct I/O on NFSD, but my workload is mostly
> > one or two clients doing kernel builds. The latency of NFS READs goes
> > up, but since a kernel build is not I/O bound and the client page caches
> > hide most of the increase, there is very little to show a measured
> > change.
> >
> > So how should I assess and report the impact of NFSD doing direct I/O?
>
> Your underwhelming usage isn't what this patchset is meant to help.
>
> > See -- users are not the only ones who are involved in this experiment;
> > and they will need guidance because we're not providing any
> > documentation for this feature.
>
> Users are not created equal. Major companies like Oracle and Meta
> _should_ be aware of NFSD's problems with buffered IO. They have
> internal and external stakeholders that are power users.
>
> Jeff, does Meta ever see NFSD struggle to consistently use NVMe
> devices? Lumpy performance? Full-blown IO stalls? Lots of NFSD
> threads hung in D state?
>
Yes. We're particularly interested in this work for that reason. A lot
of the workload is large, streaming writes at the application layer
that are only rarely read, and when reads do happen it is quite a bit
later.
This means that the pagecache is pretty useless. My _guess_ is that DIO
will help that significantly, though I do still have some concerns
about using buffered I/O for the edges of unaligned WRITEs.
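For reference, the "edges" in question are defined by the same
round_down()/round_up() bookkeeping nfsd_analyze_read_dio() does for
READs in this patch. A small user-space sketch of that arithmetic
(struct and function names are made up for illustration; like the
kernel round_up()/round_down() macros, it assumes a power-of-two
DIO block size):

```c
#include <assert.h>

/* Widen [offset, offset+len) to DIO-aligned boundaries, recording how
 * many extra bytes were added at each end -- mirroring the start_extra
 * and end_extra fields computed by nfsd_analyze_read_dio().
 */
struct dio_range {
	long long start, end;		/* DIO-aligned bounds */
	unsigned long start_extra;	/* bytes added before offset */
	unsigned long end_extra;	/* bytes added after offset+len */
};

static struct dio_range expand_to_dio(long long offset, unsigned long len,
				      unsigned int dio_blocksize)
{
	struct dio_range r;
	long long orig_end = offset + len;
	long long mask = ~(long long)(dio_blocksize - 1);

	r.start = offset & mask;			   /* round_down */
	r.end = (orig_end + dio_blocksize - 1) & mask;	   /* round_up   */
	r.start_extra = offset - r.start;
	r.end_extra = r.end - orig_end;
	return r;
}
```

Plugging in the second 47008-byte transfer from the 'dt' reproducer in
the patch header (offset 47008, 4096-byte DIO blocks) shows 1952 bytes
of front-pad and 192 bytes of tail-pad.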
> > > > If you would rather make this drive-by, then you'll have to realize
> > > > that you are requesting more than simple review from us. You'll have
> > > > to be content with the pace at which us overloaded maintainers can get
> > > > to the work.
> > >
> > > I think I just experienced the mailing-list equivalent of the Detroit
> > > definition of "drive-by". Good/bad news: you're a terrible shot.
> >
> > The term "drive-by contribution" has a well-understood meaning in the
> > kernel community. If you are unfamiliar with it, I invite you to review
> > the mailing list archives. As always, no-one is shooting at you. If
> > anything, the drive-by contribution is aimed at me.
>
> It is a blatant miscategorization here. That you just doubled down
> on it having relevance in this instance is flagrantly wrong.
>
> Whatever compels you to belittle me and my contributions, just know
> it is extremely hard to take. Highly unproductive and unprofessional.
>
> Boom, done.
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-28 14:53 ` Chuck Lever
@ 2025-08-28 18:52 ` Mike Snitzer
2025-08-30 17:38 ` [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned] Mike Snitzer
0 siblings, 1 reply; 42+ messages in thread
From: Mike Snitzer @ 2025-08-28 18:52 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Thu, Aug 28, 2025 at 10:53:49AM -0400, Chuck Lever wrote:
> On 8/28/25 4:09 AM, Mike Snitzer wrote:
> > On Wed, Aug 27, 2025 at 09:57:39PM -0400, Chuck Lever wrote:
> >> On 8/27/25 7:15 PM, Mike Snitzer wrote:
> >>> On Wed, Aug 27, 2025 at 04:56:08PM -0400, Chuck Lever wrote:
> >>>> On 8/27/25 3:41 PM, Mike Snitzer wrote:
> >>>>> Is your suggestion to, rather than allocate a disjoint single page,
> >>>>> borrow the extra page from the end of rq_pages? Just map it into the
> >>>>> bvec instead of my extra page?
> >>>>
> >>>> Yes, the extra page needs to come from rq_pages. But I don't see why it
> >>>> should come from the /end/ of rq_pages.
> >>>>
> >>>> - Extend the start of the byte range back to make it align with the
> >>>> file's DIO alignment constraint
> >>>>
> >>>> - Extend the end of the byte range forward to make it align with the
> >>>> file's DIO alignment constraint
> >>>
> >>> nfsd_analyze_read_dio() does that (start_extra and end_extra).
> >>>
> >>>> - Fill in the sink buffer's bvec using pages from rq_pages, as usual
> >>>>
> >>>> - When the I/O is complete, adjust the offset in the first bvec entry
> >>>> forward by setting a non-zero page offset, and adjust the returned
> >>>> count downward to match the requested byte count from the client
> >>>
> >>> Tried it long ago, such bvec manipulation only works when not using
> >>> RDMA. When the memory is remote, twiddling a local bvec isn't going
> >>> to ensure the correct pages have the correct data upon return to the
> >>> client.
> >>>
> >>> RDMA is why the pages must be used in-place, and RDMA is also why
> >>> the extra page needed by this patch (for use as throwaway front-pad
> >>> for expanded misaligned DIO READ) must either be allocated _or_
> >>> hopefully it can be from rq_pages (after the end of the client
> >>> requested READ payload).
> >>>
> >>> Or am I wrong and simply need to keep learning about NFSD's IO path?
> >>
> >> You're wrong, not to put a fine point on it.
> >
> > You didn't even understand me... but you firmly believe I'm wrong?
> >
> >> There's nothing I can think of in the RDMA or RPC/RDMA protocols that
> >> mandates that the first page offset must always be zero. Moving data
> >> at one address on the server to an entirely different address and
> >> alignment on the client is exactly what RDMA is supposed to do.
> >>
> >> It sounds like an implementation omission because the server's upper
> >> layers have never needed it before now. If TCP already handles it, I'm
> >> guessing it's going to be straightforward to fix.
> >
> > I never said that first page offset must be zero. I said that I
> > already did what you suggested and it didn't work with RDMA. This is
> > recollection from too many months ago now, but: the client will see the
> > correct READ payload _except_ that IIRC it is offset by whatever front-pad
> > was added to expand the misaligned DIO, no matter whether
> > rqstp->rq_bvec is updated when IO completes.
> >
> > But I'll revisit it again.
>
> For the record, this email thread is the very first time I've heard that
> you tried the simple approach and that it worked with TCP and not with
> RDMA. I wish I had known that a while ago.
Likewise, but the story is all in the patch header and the code tells
the story too. Hence your finding it with closer review (thanks for
that BTW!). I agree something is off so I'm happy to work it further.
I have iterated on quite a few aspects of this patch 5. Christoph had
a suggestion for using memmove in nfsd_complete_misaligned_read_dio.
You gave feedback that required ensuring the lightest touch with
respect to branching so that buffered IO mode remains as fast as
possible.
Looking forward to tackling this RDMA-specific weirdness now.
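For anyone following along, the memmove step just drops the throwaway
front-pad entry in place. A user-space sketch of the intended cleanup
(with a simplified bio_vec stand-in; names are hypothetical, and note
the last valid entry after the shift is index numpages-1):

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for struct bio_vec */
struct bvec {
	void *page;
	unsigned int len;
	unsigned int offset;
};

/* Mirror of the cleanup in nfsd_complete_misaligned_read_dio(): remove
 * the bounce-buffer entry at index 0 by shifting the remaining entries
 * down one slot, then trim end_extra bytes from the (new) last entry.
 */
static void drop_front_pad(struct bvec *bv, unsigned long *numpages,
			   unsigned int end_extra)
{
	*numpages -= 1;
	memmove(bv, bv + 1, *numpages * sizeof(*bv));
	bv[*numpages - 1].len -= end_extra;
}
```

The caller would free (or return to rq_pages) the front-pad page
separately; only the bvec bookkeeping is shown here.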
> >>>>> NFSD using DIO is optional. I thought the point was to get it as an
> >>>>> available option so that _others_ could experiment and help categorize
> >>>>> the benefits/pitfalls further?
> >>>>
> >>>> Yes, that is the point. But such experiments lose value if there is no
> >>>> data collection plan to go with them.
> >>>
> >>> Each user runs something they care about performing well and they
> >>> measure the result.
> >>
> >> That assumes the user will continue to use the debug interfaces, and
> >> the particular implementation you've proposed, for the rest of time.
> >> And that's not my plan at all.
> >>
> >> If we, in the community, cannot reproduce that result, or cannot
> >> understand what has been measured, or the measurement misses part or
> >> most of the picture, of what value is that for us to decide whether and
> >> how to proceed with promoting the mechanism from debug feature to
> >> something with a long-term support lifetime and a documented ABI-stable
> >> user interface?
> >
> > I'll work to put a finer point on how to reproduce and enumerate the
> > things to look for (representative flamegraphs showing the issue,
> > which I already did at last Bakeathon).
note ^ (referenced below)
> > But I have repeatedly offered that the pathological worst case is
> > client doing sequential write IO of a file that is 3-4x larger than
> > the NFS server's system memory.
> >
> > Large memory systems with 8 or more NVMe devices, fast networks that
> > allow for huge data ingest capabilities. These are the platforms that
> > showcase MM's dirty writeback limitations when large sequential IO is
> > initiated from the NFS client and it's able to overrun the NFS server.
> >
> > In addition, in general DIO requires significantly less memory and
> > CPU; so platforms that have more limited resources (and may have
> > historically struggled) could have a new lease on life if they switch
> > NFSD from buffered to DIO mode.
> >
> >>> Literally the same thing as has been done for anything in Linux since
> >>> it all started. Nothing unicorn or bespoke here.
> >>
> >> So let me ask this another way: What do we need users to measure to give
> >> us good quality information about the page cache behavior and system
> >> thrashing behavior you reported?
> >
> > IO throughput, CPU and memory usage should be monitored over time.
>
> My point is users need a recipe for what to monitor, ie clear specific
> instructions, and that recipe needs to be applied the same way by every
> experimenter (because, as I say below, we too are collecting data about
> the user's experience with this feature).
>
> Can you provide a few specific instructions that experimenters can
> follow on how to monitor and report these metrics? Simply copy-paste
> what you and your test team have been using to observe system behavior.
Yes, I understood and understand. I did commit to preparing and
providing what is needed (see "note ^" above). Not a problem.
I will be working with Hammerspace QA and other engineers as needed to
produce as tight a reproducer as possible.
> >> For example: I can enable direct I/O on NFSD, but my workload is mostly
> >> one or two clients doing kernel builds. The latency of NFS READs goes
> >> up, but since a kernel build is not I/O bound and the client page caches
> >> hide most of the increase, there is very little to show a measured
> >> change.
> >>
> >> So how should I assess and report the impact of NFSD doing direct I/O?
> >
> > Your underwhelming usage isn't what this patchset is meant to help.
>
> Again, my point is, how are users who try the debug option going to
> know that their workload is or is not of interest? We're not
> providing such documentation. Users are just going to turn this on and
> try it.
>
> But consider that one of our options is to make direct I/O the default.
> We need to know how smaller workloads might be impacted by this new
> default. The maintainers are part of this experiment as much as
> potential users are. We need to collect data to understand how to make
> this work part of a first-class administrative API.
I'm not opposed to anyone trying any workload. You're really wanting
to impose a structure and guidance, and I think that is a noble goal
and will serve us well toward avoiding a myriad of issues. If nothing
else it should help others understand the benefits/pitfalls.
To that end I'm happy to work on adding proper Documentation/
> >> See -- users are not the only ones who are involved in this experiment;
> >> and they will need guidance because we're not providing any
> >> documentation for this feature.
> >
> > Users are not created equal. Major companies like Oracle and Meta
> > _should_ be aware of NFSD's problems with buffered IO. They have
> > internal and external stakeholders that are power users.
> >
> > Jeff, does Meta ever see NFSD struggle to consistently use NVMe
> > devices? Lumpy performance? Full-blown IO stalls? Lots of NFSD
> > threads hung in D state?
>
> I have no idea why you think I don't understand those facts, but you are
> still missing my point. NFSD maintainers are part of this experimental
> set up as much as users are. We need good information to decide how to
> shape this feature going forward. Let's build a plan to get that
> information.
If you understand, which I have no doubt, then your repeated basic
questions were simply trying to impress upon me the info others
need. You need it too, but more to help mobilize and manage others'
time and to help build a program around this feature. I mean, that's
my understanding now. It makes sense.
> >>>> If you would rather make this drive-by, then you'll have to realize
> >>>> that you are requesting more than simple review from us. You'll have
> >>>> to be content with the pace at which us overloaded maintainers can get
> >>>> to the work.
> >>>
> >>> I think I just experienced the mailing-list equivalent of the Detroit
> >>> definition of "drive-by". Good/bad news: you're a terrible shot.
> >>
> >> The term "drive-by contribution" has a well-understood meaning in the
> >> kernel community. If you are unfamiliar with it, I invite you to review
> >> the mailing list archives. As always, no-one is shooting at you. If
> >> anything, the drive-by contribution is aimed at me.
> >
> > It is a blatant miscategorization here. That you just doubled down
> > on it having relevance in this instance is flagrantly wrong.
>
> First of all, "drive-by contribution" is not meant to be pejorative.
> It is what it is: someone throws you a patch and then goes away. It is
> a fact of life for open source maintainers.
Please know I know what "drive-by contribution" is. It isn't
applicable here. That I need to say that repeatedly is strange.
I haven't gone anywhere and won't be.
I have worked _hard_ on both NFSD DIRECT and NFS DIRECT. More work is
needed on both.
BREAKING NEWS: NFSD DIRECT is my top priority; I just received
approval from Hammerspace management for it to be my continued primary
focus.
> I'm simply trying to get a sense of your commitment to finishing this
> work and staying with it going forward. You keep telling me "well I have
> other things to do." What do you think that indicates to a busy
> maintainer? That you will be around to finish the feature? Or rather
> that your employer is going to yank you onto other projects, leaving us
> stuck with the work of finishing it?
You have my full attention and I will put my focus on:
1) closing the initial code out strong so you and Jeff are happy to merge
2) pushing forward wherever community feedback and desire takes us in
the future.
I have a deep sense of drive and commitment and this line of
development is important to me (and Hammerspace!).
> You have rejected one or more reasonable review comments and maintainer
> requests. That doesn't convince me you are willing to collaborate with
> us on maintaining the feature.
I am really not fighting or spitting at all in what follows until the
end of this email; I'm trying to preserve what little reputation I
have left, say my piece, _but_ arrest any further strife for us both.
I think if you were to review _all_ the exchanges relative to this
NFSD DIRECT work (v1 through v8 patchset postings and associated
review and my responses to it all) you'd find I have been nothing but
driven to understand and then address all feedback and move the code
forward as rapidly as possible.
It has been collaborative and rewarding, and I appreciate everyone who
has helped move this work forward.
Then the wheels fell off yesterday and now some revisionist history is
creeping in; we cannot have that. It distracts everyone and takes away
from all the good work we've done.
We are both intense. My intensity will be spent on worthwhile efforts,
not fighting needlessly.
> My biggest concern here is how much more work maintaining NFSD will be
> once this change goes in. It's a major feature and a significant change.
> It will require a lot of code massage in the short- and medium-term,
> just as a start.
>
> Maybe I'm assuming that you, as a kernel maintainer yourself, already
> understand that this mind set and this calculus is part of my email replies.
It is nice that you acknowledge I am also a Linux subsystem maintainer
(but thankfully Mikulas is able to do the heavy lifting for DM now).
Yes, I know very well all the variables and concerns that must be
considered and this NFSD DIRECT line of work is just one of many
efforts for you.
> So, I'm being frank. I'm not trying to offend or belittle.
You can be frank without zooming out to 50,000 feet and making general
attacks on my work, character, or aptitude/understanding.
Those attacks happened in response to my asking for help, short of
that I asked if patch 5 could be merged with future work item(s)
attached.
From https://lore.kernel.org/linux-nfs/aK9fZR7pQxrosEfW@kernel.org/
"All said, this patchset is very important to me, I don't want it to
miss v6.18 -- I'm still "in it to win it" but it feels like I do need
your or others' help to pull this off.
And/or is it possible to accept this initial implementation with
mutual understanding that we must revisit your concern about my
allocating an extra page for the misaligned DIO READ path?"
I didn't say those things as some power move to push for NFSD DIRECT
going in before "ready". But this is the second time you have reacted
with real hostility to questions of merge readiness (first LOCALIO and
now this), and that makes it a pattern -- one I have no interest in
seeing repeat...
To that end, I will refrain from raising questions about when code
might be merged or planning around it. I'll maintain focus on doing
the work asked for and let the rest sort itself out.
My genuine apologies that I hit a nerve yesterday. It wasn't intended,
but I now "get it": in particular I can infer you took my "I cannot be
a one man show on all this. I welcome more help from anyone
interested." as me being a hostile contributor. I _had_ competing work
and didn't account for NFSD DIRECT needing quite so much immediate
work, so I shared how exposed I was (albeit without enough care and
context). I certainly didn't intend to minimize yours and others'
helpful review feedback, etc. I regret putting that email's closing
statement like I did.
> > Whatever compels you to belittle me and my contributions, just know
> > it is extremely hard to take. Highly unproductive and unprofessional.
>
> I can't control how you understand my emails. If you choose to be
> offended even though I'm trying to have an honest discussion, there's
> nothing I can do about it.
>
> I can't control /if/ you understand my emails. I'm repeating myself
> quite a bit here because a lot of what I say sails past you. You read
> things into what I say that I don't intend, and you miss a lot of what
> I did intend. There's nothing I can do about that either.
But you are consciously putting into words judgments that would cut
any software engineer to their core.
Stands to reason that some ownership of your condescending and
belittling remarks is needed if you would like to avoid negative
reactions.
If you aren't even aware you are saying things that land in those
buckets... anyway, not my place to say more.
> I can't control your unproductive and unprofessional behavior. You keep
> rejecting valid review comments and maintainer requests with comments
> like "I don't feel like it" and "your reason for wanting this change is
> invalid" and "you are wasting my time". Again, this suggests to me that
> maintaining this feature will be a lot more work than it could be, and
> that is an ongoing concern.
All of what you said in this ^ above section is carry-over from when
LOCALIO was approaching "ready". I thought we buried those hatchets
and "are good", but clearly what they say about first impressions is
undeniable: I had one chance and I left a bad one with you, and all I
can do is rectify it with my future actions.
Fact is, none of your old judgments/findings from LOCALIO are
actually applicable for this NFSD DIRECT case. That you brought them
forward as if they are applicable now genuinely saddens me.
The one thing Jeff asked for, which we all agree would be better, is
for the debugfs interface to use string-based controls rather than
integers. That is quite a lift given it requires engaging gregkh and
others, one that I don't feel I can pull off given other work items
(especially so now with the things covered as required above). But I
didn't reject the good idea; I looked into it and found debugfs
doesn't support strings, and adding that capability cannot be a
priority if it comes at the opportunity cost of getting NFSD DIRECT
polished (a bit of a catch-22, alas). Not sure if that leaves you
wanting; you may think the incremental debugfs improvement basic, but
it is a context switch I cannot afford to take. Jeff is now OK with
the debugfs interfaces as-is; hopefully they're tolerable for you too.
> If there's something that offends you, the professional response on your
> part is to point that out to me in the moment rather than trying to spit
> back. Because most likely, I wasn't intending to offend at all. If you
> keep that in mind while reading my emails, you might have an easier time
> of it.
OK. This was all unfortunate, _but_ I appreciate you taking the time
to work through it with me and provide your insights.
Hopefully we have a lasting understanding now and we can get back to
what we do best.
Mike
ps. sorry for the noise linux-nfs!
* [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned]
2025-08-28 18:52 ` Mike Snitzer
@ 2025-08-30 17:38 ` Mike Snitzer
2025-08-30 17:38 ` [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug? Mike Snitzer
` (2 more replies)
0 siblings, 3 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-30 17:38 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Hi Chuck,
[just including context from thread in cover-letter for these 2 RFC patches]
On Thu, Aug 28, 2025 at 02:52:34PM -0400, Mike Snitzer wrote:
> On Thu, Aug 28, 2025 at 10:53:49AM -0400, Chuck Lever wrote:
> > On 8/28/25 4:09 AM, Mike Snitzer wrote:
> > > On Wed, Aug 27, 2025 at 09:57:39PM -0400, Chuck Lever wrote:
> > >>>
<snip>
> > >>>> - When the I/O is complete, adjust the offset in the first bvec entry
> > >>>> forward by setting a non-zero page offset, and adjust the returned
> > >>>> count downward to match the requested byte count from the client
> > >>>
> > >>> Tried it long ago, such bvec manipulation only works when not using
> > >>> RDMA. When the memory is remote, twiddling a local bvec isn't going
> > >>> to ensure the correct pages have the correct data upon return to the
> > >>> client.
> > >>>
> > >>> RDMA is why the pages must be used in-place, and RDMA is also why
> > >>> the extra page needed by this patch (for use as throwaway front-pad
> > >>> for expanded misaligned DIO READ) must either be allocated _or_
> > >>> hopefully it can be from rq_pages (after the end of the client
> > >>> requested READ payload).
<snip>
> > >> There's nothing I can think of in the RDMA or RPC/RDMA protocols that
> > >> mandates that the first page offset must always be zero. Moving data
> > >> at one address on the server to an entirely different address and
> > >> alignment on the client is exactly what RDMA is supposed to do.
> > >>
> > >> It sounds like an implementation omission because the server's upper
> > >> layers have never needed it before now. If TCP already handles it, I'm
> > >> guessing it's going to be straightforward to fix.
> > >
> > > I never said that first page offset must be zero. I said that I
> > > already did what you suggested and it didn't work with RDMA. This is
> > > recall of too many months ago now, but: the client will see the
> > > correct READ payload _except_ IIRC it is offset by whatever front-pad
> > > was added to expand the misaligned DIO; no matter whether
> > > rqstp->rq_bvec updated when IO completes.
> > >
> > > But I'll revisit it again.
> >
> > For the record, this email thread is the very first time I've heard that
> > you tried the simple approach and that it worked with TCP and not with
> > RDMA. I wish I had known that a while ago.
>
> Likewise, but the story is all in the patch header and the code tells
> the story too. Hence your finding it with closer review (thanks for
> that BTW!). I agree something is off so I'm happy to work it further.
>
> I have iterated on quite a few aspects of this patch 5. Christoph had
> a suggestion to use memmove in nfsd_complete_misaligned_read_dio.
> You had feedback that required ensuring the lightest touch relative
> to branching so that buffered IO mode remains as fast as possible.
>
> Looking forward to tackling this RDMA-specific weirdness now.
Hopefully these 2 patches more clearly demonstrate what I'm finding
is needed when using RDMA with my NFSD misaligned DIO READ patch.
These patches build on top of my v8 patchset. I've included quite a
lot of context in the patch headers for the data mismatch seen by the
NFS client, etc.
If I'm understanding you correctly, the next step is to look closer
at the rpcrdma code so that it skips mapping the throwaway front-pad
page into the start of the RDMA READ payload returned to the NFS
client?
Such adjustment code would need to know that the rq_bvec[] reflecting
the READ payload doesn't include a bvec pointing to the first page of
rqstp->rq_pages (pointed to by rqstp->rq_next_page on entry to
nfsd_iter_read) -- so it must skip past that memory in the READ
payload's RDMA memory returned to the NFS client?
Thanks,
Mike
Mike Snitzer (2):
NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
NFSD: use /end/ of rq_pages for front_pad page, simpler workaround for rpcrdma bug
fs/nfsd/vfs.c | 27 ++++++++-------------------
1 file changed, 8 insertions(+), 19 deletions(-)
--
2.44.0
* [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-08-30 17:38 ` [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned] Mike Snitzer
@ 2025-08-30 17:38 ` Mike Snitzer
2025-09-02 14:04 ` Chuck Lever
2025-09-02 15:56 ` Chuck Lever
2025-08-30 17:38 ` [RFC PATCH 2/2] NFSD: use /end/ of rq_pages for front_pad page, simpler workaround for rpcrdma bug Mike Snitzer
2025-08-30 18:53 ` [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned] Mike Snitzer
2 siblings, 2 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-30 17:38 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
From: Mike Snitzer <snitzer@hammerspace.com>
Chuck Lever advised that allocating a single start_extra_page, to
avoid RDMA corruption on the client, definitely shouldn't be needed:
"There's nothing I can think of in the RDMA or RPC/RDMA protocols that
mandates that the first page offset must always be zero. Moving data
at one address on the server to an entirely different address and
alignment on the client is exactly what RDMA is supposed to do.
It sounds like an implementation omission because the server's upper
layers have never needed it before now. If TCP already handles it, I'm
guessing it's going to be straightforward to fix."
So avoid papering over what seems to be an rpcrdma bug, remove the
allocation and use of an extra start_extra_page.
With this patch applied on top of the v8 patchset [0], I get the
data mismatch errors shown at the end [3] when using the NFS RDMA
client with the reproducer documented in the associated patch header
since v2 [1]:
"Must allocate and use a bounce-buffer page (called 'start_extra_page')
if/when expanding the misaligned READ requires reading an extra partial
page at the start of the READ so that it's DIO-aligned. Otherwise that
extra page at the start will make its way back to the NFS client and
corruption will occur. As found, and then this fix of using an extra
page verified, using the 'dt' utility:
dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
iotype=sequential pattern=iot onerr=abort oncerr=abort
see: https://github.com/RobinTMiller/dt.git "
I really did try to call attention to this misaligned DIO READ
alloc_page hack to make RDMA work, see [2], but I didn't frame it as
RDMA-specific and definitely should've been clearer on that important
detail:
"Also, I think it's worth calling out this
nfsd_complete_misaligned_read_dio function for its remapping/shifting
of the READ payload reflected in rqstp->rq_bvec[]."
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
[0]: https://lore.kernel.org/linux-nfs/20250826185718.5593-1-snitzer@kernel.org/
[1]: https://lore.kernel.org/linux-nfs/20250708160619.64800-9-snitzer@kernel.org/
[2]: https://lore.kernel.org/linux-nfs/aG2MDVyyCbjTpgOv@kernel.org/
[3]: partial output of dt utility that shows NFS client READ data mismatch:
++ COUNT=3
++ IOSIZE=47008
++ dt of=/mnt/hs_test/dt_thisisa.test passes=1 bs=47008 count=3 iotype=sequential pattern=iot onerr=abort oncerr=abort
dt (j:1 t:1):
dt (j:1 t:1): Write Statistics:
dt (j:1 t:1): Job Information Reported: Job 1, Thread 1
dt (j:1 t:1): Last IOT seed value used: 0x01010101
dt (j:1 t:1): Total records processed: 3 @ 47008 bytes/record (45.906 Kbytes)
dt (j:1 t:1): Total bytes transferred: 141024 (137.719 Kbytes, 0.134 Mbytes)
dt (j:1 t:1): Average transfer rates: 1004137 bytes/sec, 980.602 Kbytes/sec, 0.958 Mbytes/sec
dt (j:1 t:1): Number I/O's per second: 21.361
dt (j:1 t:1): Number seconds per I/O: 0.0468 (46.81ms)
dt (j:1 t:1): Total passes completed: 0/1
dt (j:1 t:1): Total errors detected: 0/1
dt (j:1 t:1): Total elapsed time: 00m00.14s
dt (j:1 t:1): Total system time: 00m00.00s
dt (j:1 t:1): Total user time: 00m00.00s
dt (j:1 t:1): Starting time: Sat Aug 30 16:14:08 2025
dt (j:1 t:1): Ending time: Sat Aug 30 16:14:08 2025
dt (j:1 t:1): Warning: The bytes written 141024, is less than the data limit 1880320000 requested!
dt (j:1 t:1): ERROR: Error number 1 occurred on Sat Aug 30 16:14:08 2025
dt (j:1 t:1):
dt (j:1 t:1): Error Number: 1
dt (j:1 t:1): Time of Current Error: Sat Aug 30 16:14:08 2025
dt (j:1 t:1): Read Pass Start Time: Sat Aug 30 16:14:08 2025
dt (j:1 t:1): Write Pass Start Time: Sat Aug 30 16:14:08 2025
dt (j:1 t:1): Pass Number: 1
dt (j:1 t:1): Pass Elapsed Time: 00m00.10s
dt (j:1 t:1): Test Elapsed Time: 00m00.24s
dt (j:1 t:1): File Name: /mnt/hs_test/dt_thisisa.test
dt (j:1 t:1): File Inode: 1199 (0x4af)
dt (j:1 t:1): Directory Inode: 1 (0x1)
dt (j:1 t:1): File Size: 1880320000 (0x70136800)
dt (j:1 t:1): Operation: miscompare
dt (j:1 t:1): Record Number: 2
dt (j:1 t:1): Request Size: 47008 (0xb7a0)
dt (j:1 t:1): Block Length: 91 (0x5b)
dt (j:1 t:1): I/O Mode: read
dt (j:1 t:1): I/O Type: sequential
dt (j:1 t:1): File Type: output
dt (j:1 t:1): Direct I/O: disabled (caching data)
dt (j:1 t:1): Device Size: 512 (0x200)
dt (j:1 t:1): Starting File Offset: 47008 (0xb7a0)
dt (j:1 t:1): Starting LBA: 91 (0x5b)
dt (j:1 t:1): Ending File Offset: 94016 (0x16f40)
dt (j:1 t:1): Ending LBA: 182 (0xb6)
dt (j:1 t:1): Error File Offset: 47008 (0xb7a0)
dt (j:1 t:1): Error Offset Modulos: %8 = 0, %512 = 416, %4096 = 1952
dt (j:1 t:1): Starting Relative Error LBA: 91 (0x5b)
dt (j:1 t:1): Relative 4096 byte Error LBA: 11 (0xb)
dt (j:1 t:1): Corruption Buffer Index: 0 (byte index into read buffer)
dt (j:1 t:1): Corruption Block Index: 0 (byte index in miscompare block)
dt (j:1 t:1):
dt (j:1 t:1):
dt (j:1 t:1): Data compare error at byte 0 in record number 2
dt (j:1 t:1): Relative block number where the error occurred is 91, offset 47008
dt (j:1 t:1): Block expected = 91 (0x0000005b), block found = 1919311731 (0x72665f73), count = 47008
dt (j:1 t:1): The correct data starts at memory address 0x000000003c589000 (marked by asterisk '*')
dt (j:1 t:1): Dumping Pattern Buffer (base = 0x3c589000, mismatch offset = 0, limit = 512 bytes):
dt (j:1 t:1): / Buffer
dt (j:1 t:1): Memory Address / Index
dt (j:1 t:1): 0x000000003c589000/ 0 |*5b 00 00 00 5c 01 01 01 5d 02 02 02 5e 03 03 03 "[ \ ] ^ "
dt (j:1 t:1): 0x000000003c589010/ 16 | 5f 04 04 04 60 05 05 05 61 06 06 06 62 07 07 07 "_ ` a b "
dt (j:1 t:1): 0x000000003c589020/ 32 | 63 08 08 08 64 09 09 09 65 0a 0a 0a 66 0b 0b 0b "c d e f "
dt (j:1 t:1): 0x000000003c589030/ 48 | 67 0c 0c 0c 68 0d 0d 0d 69 0e 0e 0e 6a 0f 0f 0f "g h i j "
dt (j:1 t:1): 0x000000003c589040/ 64 | 6b 10 10 10 6c 11 11 11 6d 12 12 12 6e 13 13 13 "k l m n "
dt (j:1 t:1): 0x000000003c589050/ 80 | 6f 14 14 14 70 15 15 15 71 16 16 16 72 17 17 17 "o p q r "
dt (j:1 t:1): 0x000000003c589060/ 96 | 73 18 18 18 74 19 19 19 75 1a 1a 1a 76 1b 1b 1b "s t u v "
dt (j:1 t:1): 0x000000003c589070/ 112 | 77 1c 1c 1c 78 1d 1d 1d 79 1e 1e 1e 7a 1f 1f 1f "w x y z "
dt (j:1 t:1): 0x000000003c589080/ 128 | 7b 20 20 20 7c 21 21 21 7d 22 22 22 7e 23 23 23 "{ |!!!}"""~###"
dt (j:1 t:1): 0x000000003c589090/ 144 | 7f 24 24 24 80 25 25 25 81 26 26 26 82 27 27 27 " $$$ %%% &&& '''"
dt (j:1 t:1): 0x000000003c5890a0/ 160 | 83 28 28 28 84 29 29 29 85 2a 2a 2a 86 2b 2b 2b " ((( ))) *** +++"
dt (j:1 t:1): 0x000000003c5890b0/ 176 | 87 2c 2c 2c 88 2d 2d 2d 89 2e 2e 2e 8a 2f 2f 2f " ,,, --- ... ///"
dt (j:1 t:1): 0x000000003c5890c0/ 192 | 8b 30 30 30 8c 31 31 31 8d 32 32 32 8e 33 33 33 " 000 111 222 333"
dt (j:1 t:1): 0x000000003c5890d0/ 208 | 8f 34 34 34 90 35 35 35 91 36 36 36 92 37 37 37 " 444 555 666 777"
dt (j:1 t:1): 0x000000003c5890e0/ 224 | 93 38 38 38 94 39 39 39 95 3a 3a 3a 96 3b 3b 3b " 888 999 ::: ;;;"
dt (j:1 t:1): 0x000000003c5890f0/ 240 | 97 3c 3c 3c 98 3d 3d 3d 99 3e 3e 3e 9a 3f 3f 3f " <<< === >>> ???"
dt (j:1 t:1): 0x000000003c589100/ 256 | 9b 40 40 40 9c 41 41 41 9d 42 42 42 9e 43 43 43 " @@@ AAA BBB CCC"
dt (j:1 t:1): 0x000000003c589110/ 272 | 9f 44 44 44 a0 45 45 45 a1 46 46 46 a2 47 47 47 " DDD EEE FFF GGG"
dt (j:1 t:1): 0x000000003c589120/ 288 | a3 48 48 48 a4 49 49 49 a5 4a 4a 4a a6 4b 4b 4b " HHH III JJJ KKK"
dt (j:1 t:1): 0x000000003c589130/ 304 | a7 4c 4c 4c a8 4d 4d 4d a9 4e 4e 4e aa 4f 4f 4f " LLL MMM NNN OOO"
dt (j:1 t:1): 0x000000003c589140/ 320 | ab 50 50 50 ac 51 51 51 ad 52 52 52 ae 53 53 53 " PPP QQQ RRR SSS"
dt (j:1 t:1): 0x000000003c589150/ 336 | af 54 54 54 b0 55 55 55 b1 56 56 56 b2 57 57 57 " TTT UUU VVV WWW"
dt (j:1 t:1): 0x000000003c589160/ 352 | b3 58 58 58 b4 59 59 59 b5 5a 5a 5a b6 5b 5b 5b " XXX YYY ZZZ [[["
dt (j:1 t:1): 0x000000003c589170/ 368 | b7 5c 5c 5c b8 5d 5d 5d b9 5e 5e 5e ba 5f 5f 5f " \\\ ]]] ^^^ ___"
dt (j:1 t:1): 0x000000003c589180/ 384 | bb 60 60 60 bc 61 61 61 bd 62 62 62 be 63 63 63 " ``` aaa bbb ccc"
dt (j:1 t:1): 0x000000003c589190/ 400 | bf 64 64 64 c0 65 65 65 c1 66 66 66 c2 67 67 67 " ddd eee fff ggg"
dt (j:1 t:1): 0x000000003c5891a0/ 416 | c3 68 68 68 c4 69 69 69 c5 6a 6a 6a c6 6b 6b 6b " hhh iii jjj kkk"
dt (j:1 t:1): 0x000000003c5891b0/ 432 | c7 6c 6c 6c c8 6d 6d 6d c9 6e 6e 6e ca 6f 6f 6f " lll mmm nnn ooo"
dt (j:1 t:1): 0x000000003c5891c0/ 448 | cb 70 70 70 cc 71 71 71 cd 72 72 72 ce 73 73 73 " ppp qqq rrr sss"
dt (j:1 t:1): 0x000000003c5891d0/ 464 | cf 74 74 74 d0 75 75 75 d1 76 76 76 d2 77 77 77 " ttt uuu vvv www"
dt (j:1 t:1): 0x000000003c5891e0/ 480 | d3 78 78 78 d4 79 79 79 d5 7a 7a 7a d6 7b 7b 7b " xxx yyy zzz {{{"
dt (j:1 t:1): 0x000000003c5891f0/ 496 | d7 7c 7c 7c d8 7d 7d 7d d9 7e 7e 7e da 7f 7f 7f " ||| }}} ~~~ "
dt (j:1 t:1):
dt (j:1 t:1): The incorrect data starts at memory address 0x000000003c596000 (for Robin's debug! :)
dt (j:1 t:1): The incorrect data starts at file offset 000000000000047008 (marked by asterisk '*')
dt (j:1 t:1): Dumping Data File offsets (base = 47008, mismatch offset = 0, limit = 512 bytes):
dt (j:1 t:1): / Block
dt (j:1 t:1): File Offset / Index
dt (j:1 t:1): 000000000000047008/ 0 |*73 5f 66 72 65 65 5f 63 6f 6d 6d 69 74 5f 61 72 "s_free_commit_ar"
dt (j:1 t:1): 000000000000047024/ 16 | 72 61 79 00 54 43 50 5f 54 49 4d 45 5f 57 41 49 "ray TCP_TIME_WAI"
dt (j:1 t:1): 000000000000047040/ 32 | 54 00 42 50 46 5f 50 52 4f 47 5f 54 59 50 45 5f "T BPF_PROG_TYPE_"
dt (j:1 t:1): 000000000000047056/ 48 | 43 47 52 4f 55 50 5f 53 59 53 43 54 4c 00 4c 41 "CGROUP_SYSCTL LA"
dt (j:1 t:1): 000000000000047072/ 64 | 59 4f 55 54 5f 46 4c 45 58 5f 46 49 4c 45 53 00 "YOUT_FLEX_FILES "
dt (j:1 t:1): 000000000000047088/ 80 | 4e 46 53 45 52 52 5f 4a 55 4b 45 42 4f 58 00 72 "NFSERR_JUKEBOX r"
dt (j:1 t:1): 000000000000047104/ 96 | 78 5f 63 70 75 5f 72 6d 61 70 00 6d 69 67 72 61 "x_cpu_rmap migra"
dt (j:1 t:1): 000000000000047120/ 112 | 74 69 6f 6e 5f 64 69 73 61 62 6c 65 64 00 5f 5f "tion_disabled __"
dt (j:1 t:1): 000000000000047136/ 128 | 64 61 74 61 00 6e 64 6f 5f 64 65 6c 5f 73 6c 61 "data ndo_del_sla"
dt (j:1 t:1): 000000000000047152/ 144 | 76 65 00 6e 66 73 5f 63 6f 6d 6d 69 74 5f 64 61 "ve nfs_commit_da"
dt (j:1 t:1): 000000000000047168/ 160 | 74 61 00 65 78 74 5f 6d 75 74 65 78 00 63 6f 6e "ta ext_mutex con"
dt (j:1 t:1): 000000000000047184/ 176 | 6e 65 63 74 5f 63 6f 6f 6b 69 65 00 54 43 50 5f "nect_cookie TCP_"
dt (j:1 t:1): 000000000000047200/ 192 | 43 4c 4f 53 45 5f 57 41 49 54 00 6d 65 6d 63 6d "CLOSE_WAIT memcm"
dt (j:1 t:1): 000000000000047216/ 208 | 70 00 52 50 4d 5f 52 45 51 5f 53 55 53 50 45 4e "p RPM_REQ_SUSPEN"
dt (j:1 t:1): 000000000000047232/ 224 | 44 00 63 72 6d 61 74 63 68 00 63 61 6e 63 65 6c "D crmatch cancel"
dt (j:1 t:1): 000000000000047248/ 240 | 5f 66 6f 72 6b 00 70 67 70 72 6f 74 5f 74 00 74 "_fork pgprot_t t"
dt (j:1 t:1): 000000000000047264/ 256 | 72 61 63 65 70 6f 69 6e 74 5f 70 74 72 5f 74 00 "racepoint_ptr_t "
dt (j:1 t:1): 000000000000047280/ 272 | 66 6f 72 5f 72 65 63 6c 61 69 6d 00 4e 46 53 45 "for_reclaim NFSE"
dt (j:1 t:1): 000000000000047296/ 288 | 52 52 5f 42 41 44 43 48 41 52 00 5f 73 6b 62 5f "RR_BADCHAR _skb_"
dt (j:1 t:1): 000000000000047312/ 304 | 72 65 66 64 73 74 00 70 68 79 73 69 63 61 6c 5f "refdst physical_"
dt (j:1 t:1): 000000000000047328/ 320 | 6c 6f 63 61 74 69 6f 6e 00 6e 75 6d 5f 72 65 71 "location num_req"
dt (j:1 t:1): 000000000000047344/ 336 | 73 00 5f 5f 53 43 54 5f 5f 74 70 5f 66 75 6e 63 "s __SCT__tp_func"
dt (j:1 t:1): 000000000000047360/ 352 | 5f 70 6e 66 73 5f 6d 64 73 5f 66 61 6c 6c 62 61 "_pnfs_mds_fallba"
dt (j:1 t:1): 000000000000047376/ 368 | 63 6b 5f 77 72 69 74 65 5f 64 6f 6e 65 00 74 61 "ck_write_done ta"
dt (j:1 t:1): 000000000000047392/ 384 | 73 6b 5f 63 6c 65 61 6e 75 70 00 65 78 70 61 6e "sk_cleanup expan"
dt (j:1 t:1): 000000000000047408/ 400 | 64 5f 72 65 61 64 61 68 65 61 64 00 6c 6f 63 6b "d_readahead lock"
dt (j:1 t:1): 000000000000047424/ 416 | 5f 6d 61 6e 61 67 65 72 5f 6f 70 65 72 61 74 69 "_manager_operati"
dt (j:1 t:1): 000000000000047440/ 432 | 6f 6e 73 00 73 72 63 5f 72 65 67 00 63 72 64 65 "ons src_reg crde"
dt (j:1 t:1): 000000000000047456/ 448 | 73 74 72 6f 79 00 63 68 69 6c 64 72 65 6e 5f 6c "stroy children_l"
dt (j:1 t:1): 000000000000047472/ 464 | 6f 77 5f 75 73 61 67 65 00 6e 75 6d 5f 76 66 00 "ow_usage num_vf "
dt (j:1 t:1): 000000000000047488/ 480 | 73 63 72 61 74 63 68 00 50 49 44 54 59 50 45 5f "scratch PIDTYPE_"
dt (j:1 t:1): 000000000000047504/ 496 | 4d 41 58 00 70 72 65 70 61 72 65 5f 77 72 69 74 "MAX prepare_writ"
dt (j:1 t:1):
dt (j:1 t:1):
dt (j:1 t:1): Analyzing IOT Record Data: (Note: Block #'s are relative to start of record!)
dt (j:1 t:1):
dt (j:1 t:1): IOT block size: 512
dt (j:1 t:1): Total number of blocks: 91 (47008 bytes)
dt (j:1 t:1): Current IOT seed value: 0x01010101 (pass 1)
dt (j:1 t:1): Range of corrupted blocks: 0 - 90
dt (j:1 t:1): Length of corrupted blocks: 91 (46592 bytes)
dt (j:1 t:1): Corrupted blocks file offset: 47008 (LBA 91)
dt (j:1 t:1): Number of corrupted blocks: 91
dt (j:1 t:1): Number of good blocks found: 0
dt (j:1 t:1): Number of zero blocks found: 0
dt (j:1 t:1):
dt (j:1 t:1): Record #: 2
dt (j:1 t:1): Starting Record Offset: 47008
dt (j:1 t:1): Transfer Count: 47008 (0xb7a0)
dt (j:1 t:1): Ending Record Offset: 94016
dt (j:1 t:1): Relative Record Block Range: 91 - 182
dt (j:1 t:1): Read Buffer Address: 0x3c596000
dt (j:1 t:1): Pattern Base Address: 0x3c589000
dt (j:1 t:1): Note: Incorrect data is marked with asterisk '*'
dt (j:1 t:1):
dt (j:1 t:1): Record Block: 0 (BAD data)
dt (j:1 t:1): Record Block Offset: 47008 (LBA 91)
dt (j:1 t:1): Record Buffer Index: 0 (0x0)
dt (j:1 t:1): Expected Block Number: 91 (0x0000005b)
dt (j:1 t:1): Received Block Number: 1919311731 (0x72665f73)
dt (j:1 t:1): Received Block Offset: 982687606272
dt (j:1 t:1):
dt (j:1 t:1): Byte Expected: address 0x3c589000 Received: address 0x3c596000
dt (j:1 t:1): 0000 0000005b 0101015c 0202025d 0303035e * 72665f73 635f6565 696d6d6f 72615f74
dt (j:1 t:1): 0010 0404045f 05050560 06060661 07070762 * 00796172 5f504354 454d4954 4941575f
dt (j:1 t:1): 0020 08080863 09090964 0a0a0a65 0b0b0b66 * 50420054 52505f46 545f474f 5f455059
dt (j:1 t:1): 0030 0c0c0c67 0d0d0d68 0e0e0e69 0f0f0f6a * 4f524743 535f5055 54435359 414c004c
dt (j:1 t:1): 0040 1010106b 1111116c 1212126d 1313136e * 54554f59 454c465f 49465f58 0053454c
dt (j:1 t:1): 0050 1414146f 15151570 16161671 17171772 * 4553464e 4a5f5252 42454b55 7200584f
dt (j:1 t:1): 0060 18181873 19191974 1a1a1a75 1b1b1b76 * 70635f78 6d725f75 6d007061 61726769
dt (j:1 t:1): 0070 1c1c1c77 1d1d1d78 1e1e1e79 1f1f1f7a * 6e6f6974 7369645f 656c6261 5f5f0064
dt (j:1 t:1): 0080 2020207b 2121217c 2222227d 2323237e * 61746164 6f646e00 6c65645f 616c735f
dt (j:1 t:1): 0090 2424247f 25252580 26262681 27272782 * 6e006576 635f7366 696d6d6f 61645f74
dt (j:1 t:1): 00a0 28282883 29292984 2a2a2a85 2b2b2b86 * 65006174 6d5f7478 78657475 6e6f6300
dt (j:1 t:1): 00b0 2c2c2c87 2d2d2d88 2e2e2e89 2f2f2f8a * 7463656e 6f6f635f 0065696b 5f504354
dt (j:1 t:1): 00c0 3030308b 3131318c 3232328d 3333338e * 534f4c43 41575f45 6d005449 6d636d65
dt (j:1 t:1): 00d0 3434348f 35353590 36363691 37373792 * 50520070 45525f4d 55535f51 4e455053
dt (j:1 t:1): 00e0 38383893 39393994 3a3a3a95 3b3b3b96 * 72630044 6374616d 61630068 6c65636e
dt (j:1 t:1): 00f0 3c3c3c97 3d3d3d98 3e3e3e99 3f3f3f9a * 726f665f 6770006b 746f7270 7400745f
dt (j:1 t:1): 0100 4040409b 4141419c 4242429d 4343439e * 65636172 6e696f70 74705f74 00745f72
dt (j:1 t:1): 0110 4444449f 454545a0 464646a1 474747a2 * 5f726f66 6c636572 006d6961 4553464e
dt (j:1 t:1): 0120 484848a3 494949a4 4a4a4aa5 4b4b4ba6 * 425f5252 48434441 5f005241 5f626b73
dt (j:1 t:1): 0130 4c4c4ca7 4d4d4da8 4e4e4ea9 4f4f4faa * 64666572 70007473 69737968 5f6c6163
dt (j:1 t:1): 0140 505050ab 515151ac 525252ad 535353ae * 61636f6c 6e6f6974 6d756e00 7165725f
dt (j:1 t:1): 0150 545454af 555555b0 565656b1 575757b2 * 5f5f0073 5f544353 5f70745f 636e7566
dt (j:1 t:1): 0160 585858b3 595959b4 5a5a5ab5 5b5b5bb6 * 666e705f 646d5f73 61665f73 61626c6c
dt (j:1 t:1): 0170 5c5c5cb7 5d5d5db8 5e5e5eb9 5f5f5fba * 775f6b63 65746972 6e6f645f 61740065
dt (j:1 t:1): 0180 606060bb 616161bc 626262bd 636363be * 635f6b73 6e61656c 65007075 6e617078
dt (j:1 t:1): 0190 646464bf 656565c0 666666c1 676767c2 * 65725f64 68616461 00646165 6b636f6c
dt (j:1 t:1): 01a0 686868c3 696969c4 6a6a6ac5 6b6b6bc6 * 6e616d5f 72656761 65706f5f 69746172
dt (j:1 t:1): 01b0 6c6c6cc7 6d6d6dc8 6e6e6ec9 6f6f6fca * 00736e6f 5f637273 00676572 65647263
dt (j:1 t:1): 01c0 707070cb 717171cc 727272cd 737373ce * 6f727473 68630079 72646c69 6c5f6e65
dt (j:1 t:1): 01d0 747474cf 757575d0 767676d1 777777d2 * 755f776f 65676173 6d756e00 0066765f
dt (j:1 t:1): 01e0 787878d3 797979d4 7a7a7ad5 7b7b7bd6 * 61726373 00686374 54444950 5f455059
dt (j:1 t:1): 01f0 7c7c7cd7 7d7d7dd8 7e7e7ed9 7f7f7fda * 0058414d 70657270 5f657261 74697277
...
dt (j:1 t:1): Reread data does NOT match previous data or expected data!
dt (j:1 t:1): Writing reread data to file dt_thisisa.test-REREAD3-j1t1, from buffer 0x7f12bc004000, 47008 bytes...
dt (j:1 t:1): Command line to re-read the corrupted data:
dt (j:1 t:1): # dt if=/mnt/hs_test/dt_thisisa.test bs=47008 count=1 offset=47008 pattern=iot disable=retryDC,savecorrupted,trigdefaults
dt (j:1 t:1):
dt (j:1 t:1): Command line to re-read the data:
dt (j:1 t:1): # dt if=/mnt/hs_test/dt_thisisa.test bs=47008 dsize=512 iotype=sequential iodir=forward limit=94016 records=1 pattern=iot disable=retryDC,savecorrupted,trigdefaults
dt (j:1 t:1):
dt (j:1 t:1):
dt (j:1 t:1): Read Statistics:
dt (j:1 t:1): Job Information Reported: Job 1, Thread 1
dt (j:1 t:1): Last IOT seed value used: 0x01010101
dt (j:1 t:1): Total records processed: 2 @ 47008 bytes/record (45.906 Kbytes)
dt (j:1 t:1): Total bytes transferred: 94016 (91.812 Kbytes, 0.090 Mbytes)
dt (j:1 t:1): Average transfer rates: 656857 bytes/sec, 641.462 Kbytes/sec, 0.626 Mbytes/sec
dt (j:1 t:1): Number I/O's per second: 13.973
dt (j:1 t:1): Number seconds per I/O: 0.0716 (71.56ms)
dt (j:1 t:1): Total passes completed: 1/1
dt (j:1 t:1): Total errors detected: 1/1
dt (j:1 t:1): Total elapsed time: 00m00.15s
dt (j:1 t:1): Total system time: 00m00.00s
dt (j:1 t:1): Total user time: 00m00.00s
dt (j:1 t:1): Starting time: Sat Aug 30 16:14:08 2025
dt (j:1 t:1): Ending time: Sat Aug 30 16:14:08 2025
dt (j:1 t:1):
dt (j:1 t:1): Operating System Information:
dt (j:1 t:1): Host name: plsm121c-06.perf.hammer.space (192.168.1.106)
dt (j:1 t:1): User name: root
dt (j:1 t:1): Process ID: 31703
dt (j:1 t:1): OS information: Linux 6.12.24.17.hs.snitm+ #34 SMP PREEMPT_DYNAMIC Fri Aug 15 22:03:10 UTC 2025 x86_64
dt (j:1 t:1):
dt (j:1 t:1): File System Information:
dt (j:1 t:1): Mounted from device: 192.168.0.105:/hs_test
dt (j:1 t:1): Mounted on directory: /mnt/hs_test
dt (j:1 t:1): Filesystem type: nfs4
dt (j:1 t:1): Filesystem options: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,fatal_neterrors=none,proto=tcp,nconnect=16,port=20491,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.106,local_lock=none,addr=192.168.0.105
dt (j:1 t:1): Filesystem block size: 1048576
dt (j:1 t:1): Filesystem free space: 60990430380032 (58165007.000 Mbytes, 56801.765 Gbytes, 55.470 Tbytes)
dt (j:1 t:1): Filesystem total space: 60992310476800 (58166800.000 Mbytes, 56803.516 Gbytes, 55.472 Tbytes)
dt (j:1 t:1):
dt (j:1 t:1): Total Statistics:
dt (j:1 t:1): Output device/file name: /mnt/hs_test/dt_thisisa.test (device type=regular)
dt (j:1 t:1): Type of I/O's performed: sequential (forward)
dt (j:1 t:1): Job Information Reported: Job 1, Thread 1
dt (j:1 t:1): Data pattern string used: 'IOT Pattern' (blocking is 512 bytes)
dt (j:1 t:1): Last IOT seed value used: 0x01010101
dt (j:1 t:1): Total records read: 2
dt (j:1 t:1): Total bytes read: 94016 (91.812 Kbytes, 0.090 Mbytes, 0.000 Gbytes)
dt (j:1 t:1): Total records written: 3
dt (j:1 t:1): Total bytes written: 141024 (137.719 Kbytes, 0.134 Mbytes, 0.000 Gbytes)
dt (j:1 t:1): Total records processed: 5 @ 47008 bytes/record (45.906 Kbytes)
dt (j:1 t:1): Total bytes transferred: 235040 (229.531 Kbytes, 0.224 Mbytes)
dt (j:1 t:1): Average transfer rates: 828023 bytes/sec, 808.616 Kbytes/sec, 0.790 Mbytes/sec
dt (j:1 t:1): Number I/O's per second: 17.615
dt (j:1 t:1): Number seconds per I/O: 0.0568 (56.77ms)
dt (j:1 t:1): Total passes completed: 1/1
dt (j:1 t:1): Total errors detected: 1/1
dt (j:1 t:1): Total elapsed time: 00m00.29s
dt (j:1 t:1): Total system time: 00m00.00s
dt (j:1 t:1): Total user time: 00m00.00s
dt (j:1 t:1): Starting time: Sat Aug 30 16:14:08 2025
dt (j:1 t:1): Ending time: Sat Aug 30 16:14:08 2025
dt (j:1 t:1):
dt (j:1 t:1): Command line to re-read the data:
dt (j:1 t:1): # dt if=/mnt/hs_test/dt_thisisa.test bs=47008 dsize=512 iotype=sequential iodir=forward limit=141024 records=3 pattern=iot
dt (j:1 t:1):
dt (j:1 t:1): Command Line:
dt (j:1 t:1):
dt (j:1 t:1): # dt of=/mnt/hs_test/dt_thisisa.test passes=1 bs=47008 count=3 iotype=sequential pattern=iot onerr=abort oncerr=abort
dt (j:1 t:1):
dt (j:1 t:1): --> Date: September 21st, 2023, Version: 25.05, Author: Robin T. Miller <--
dt (j:1 t:1):
dt (j:1 t:1): onerr=abort, so stopping all threads for job 1...
dt (j:0 t:0): Job 1 is being stopped (1 thread)
dt (j:0 t:0): Program is exiting with status -1...
---
fs/nfsd/vfs.c | 25 ++++++-------------------
1 file changed, 6 insertions(+), 19 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f8975ee262b5c..762d745b1b15d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1079,13 +1079,11 @@ struct nfsd_read_dio {
loff_t end;
unsigned long start_extra;
unsigned long end_extra;
- struct page *start_extra_page;
};
static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
{
memset(read_dio, 0, sizeof(*read_dio));
- read_dio->start_extra_page = NULL;
}
#define NFSD_READ_DIO_MIN_KB (32 << 10)
@@ -1121,9 +1119,8 @@ static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
/*
* Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
- * to be DIO-aligned (this heuristic avoids excess work, like allocating
- * start_extra_page, for smaller IO that can generally already perform
- * well using buffered IO).
+ * to be DIO-aligned (this heuristic avoids excess work for smaller IO
+ * that can generally already perform well using buffered IO).
*/
if ((read_dio->start_extra || read_dio->end_extra) &&
(len < NFSD_READ_DIO_MIN_KB)) {
@@ -1131,15 +1128,6 @@ static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
return false;
}
- if (read_dio->start_extra) {
- read_dio->start_extra_page = alloc_page(GFP_KERNEL);
- if (WARN_ONCE(read_dio->start_extra_page == NULL,
- "%s: Unable to allocate start_extra_page\n", __func__)) {
- init_nfsd_read_dio(read_dio);
- return false;
- }
- }
-
/* Show original offset and count, and how it was expanded for DIO */
middle_end = read_dio->end - read_dio->end_extra;
trace_nfsd_analyze_read_dio(rqstp, fhp, offset, len,
@@ -1162,11 +1150,10 @@ static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
if (!read_dio->start_extra && !read_dio->end_extra)
return host_err;
- /* If nfsd_analyze_read_dio() allocated a start_extra_page it must
- * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
+ /* If nfsd_analyze_read_dio() determined a start_extra (front-pad) page was
+ * needed, it must be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
*/
- if (read_dio->start_extra_page) {
- __free_page(read_dio->start_extra_page);
+ if (read_dio->start_extra) {
*rq_bvec_numpages -= 1;
v = *rq_bvec_numpages;
memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
@@ -1276,7 +1263,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (read_dio.start_extra) {
len = read_dio.start_extra;
bvec_set_page(&rqstp->rq_bvec[v],
- read_dio.start_extra_page,
+ *(rqstp->rq_next_page++),
len, PAGE_SIZE - len);
total -= len;
++v;
--
2.44.0
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [RFC PATCH 2/2] NFSD: use /end/ of rq_pages for front_pad page, simpler workaround for rpcrdma bug
2025-08-30 17:38 ` [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned] Mike Snitzer
2025-08-30 17:38 ` [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug? Mike Snitzer
@ 2025-08-30 17:38 ` Mike Snitzer
2025-08-30 18:53 ` [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned] Mike Snitzer
2 siblings, 0 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-30 17:38 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
From: Mike Snitzer <snitzer@hammerspace.com>
This patch papers over what seems like an rpcrdma bug; Chuck Lever
clarified that this workaround, too, shouldn't be needed:
"Yes, the extra page needs to come from rq_pages. But I don't see why it
should come from the /end/ of rq_pages."
But this patch at least isolates the same bug further, by showing
that the bounds expressed in rqstp->rq_bvec[] don't control which
READ payload memory is returned to the NFS RDMA client.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/vfs.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 762d745b1b15d..70571a78e7c25 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1263,7 +1263,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (read_dio.start_extra) {
len = read_dio.start_extra;
bvec_set_page(&rqstp->rq_bvec[v],
- *(rqstp->rq_next_page++),
+ NULL, /* adjusted below */
len, PAGE_SIZE - len);
total -= len;
++v;
@@ -1289,6 +1289,8 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
base = 0;
}
WARN_ON_ONCE(v > rqstp->rq_maxpages);
+ if ((kiocb.ki_flags & IOCB_DIRECT) && read_dio.start_extra)
+ rqstp->rq_bvec[0].bv_page = *(rqstp->rq_next_page++);
trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
--
2.44.0
* Re: [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned]
2025-08-30 17:38 ` [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned] Mike Snitzer
2025-08-30 17:38 ` [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug? Mike Snitzer
2025-08-30 17:38 ` [RFC PATCH 2/2] NFSD: use /end/ of rq_pages for front_pad page, simpler workaround for rpcrdma bug Mike Snitzer
@ 2025-08-30 18:53 ` Mike Snitzer
2 siblings, 0 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-08-30 18:53 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
On Sat, Aug 30, 2025 at 01:38:50PM -0400, Mike Snitzer wrote:
> Hi Chuck,
>
> Hopeful these 2 patches more clearly demonstrate what I'm finding
> needed when using RDMA with my NFSD misaligned DIO READ patch.
>
> These patches build ontop of my v8 patchset. I've included quite a lot
> of context for the data mismatch seen by the NFS client, etc in the
> patch headers.
>
> If I'm understanding you correctly, next step is to look closer at the
> rpcrdma code that would skip the throwaway front-pad page from being
> mapped to the start of the RDMA READ payload returned to the NFS
> client?
>
> Such important adjustment code would need to know that the rq_bvec[]
> that reflects the READ payload doesn't include a bvec that points to
> the first page of rqstp->rq_pages (pointed to by rqstp->rq_next_page
> on entry to nfsd_iter_read) -- so it must skip past that memory in
> the READ payload's RDMA memory returned to NFS client?
>
> Thanks,
> Mike
>
> Mike Snitzer (2):
> NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
> NFSD: use /end/ of rq_pages for front_pad page, simpler workaround for rpcrdma bug
>
> fs/nfsd/vfs.c | 27 ++++++++-------------------
> 1 file changed, 8 insertions(+), 19 deletions(-)
>
> --
> 2.44.0
>
Another important detail to mention: I've been working on top of a
6.12.24 stable baseline to which I've backported all NFS and NFSD
changes through 6.17-rc1, plus a (slightly stale) nfsd-testing; please see:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=kernel-6.12.24/nfsd-testing-snitm.18-testing
I'm not sure whether rebasing onto the latest nfsd-testing would show
this issue... I think it would, but cannot say for sure. Needing the
ability to test/develop Hammerspace and pNFS work has prevented me
from simply rebasing, but it should be doable without too much effort.
Mike
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-08-30 17:38 ` [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug? Mike Snitzer
@ 2025-09-02 14:04 ` Chuck Lever
2025-09-02 15:56 ` Chuck Lever
1 sibling, 0 replies; 42+ messages in thread
From: Chuck Lever @ 2025-09-02 14:04 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 8/30/25 1:38 PM, Mike Snitzer wrote:
> From: Mike Snitzer <snitzer@hammerspace.com>
>
> Chuck Lever advised that allocating a single start_extra_page, to
> avoid RDMA corruption on client, definitely shouldn't be needed:
>
> "There's nothing I can think of in the RDMA or RPC/RDMA protocols that
> mandates that the first page offset must always be zero. Moving data
> at one address on the server to an entirely different address and
> alignment on the client is exactly what RDMA is supposed to do.
>
> It sounds like an implementation omission because the server's upper
> layers have never needed it before now. If TCP already handles it, I'm
> guessing it's going to be straightforward to fix."
>
> So avoid papering over what seems to be an rpcrdma bug, remove the
> allocation and use of an extra start_extra_page.
>
> With this patch applied ontop of v8 patchset [0], I get the following
> data mismatch errors at the end [3] when using the NFS RDMA client
> with reproducer documented in associated patch header since v2 [1]:
>
> "Must allocate and use a bounce-buffer page (called 'start_extra_page')
> if/when expanding the misaligned READ requires reading extra partial
> page at the start of the READ so that its DIO-aligned. Otherwise that
> extra page at the start will make its way back to the NFS client and
> corruption will occur. As found, and then this fix of using an extra
> page verified, using the 'dt' utility:
> dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
> iotype=sequential pattern=iot onerr=abort oncerr=abort
> see: https://github.com/RobinTMiller/dt.git "
>
> I really did try to call attention to this misaligned DIO READ
> alloc_page hack to make RDMA work, see [2], but I didn't frame it as
> RDMA specific and definitely should've been clearer on that important
> detail:
>
> "Also, I think its worth calling out this
> nfsd_complete_misaligned_read_dio function for its remapping/shifting
> of the READ payload reflected in rqstp->rq_bvec[]."
>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>
> [0]: https://lore.kernel.org/linux-nfs/20250826185718.5593-1-snitzer@kernel.org/
> [1]: https://lore.kernel.org/linux-nfs/20250708160619.64800-9-snitzer@kernel.org/
> [2]: https://lore.kernel.org/linux-nfs/aG2MDVyyCbjTpgOv@kernel.org/
> [3]: partial output of dt utility that shows NFS client READ data mismatch:
> ++ COUNT=3
> ++ IOSIZE=47008
> ++ dt of=/mnt/hs_test/dt_thisisa.test passes=1 bs=47008 count=3 iotype=sequential pattern=iot onerr=abort oncerr=abort
> dt (j:1 t:1):
> dt (j:1 t:1): Write Statistics:
> dt (j:1 t:1): Job Information Reported: Job 1, Thread 1
> dt (j:1 t:1): Last IOT seed value used: 0x01010101
> dt (j:1 t:1): Total records processed: 3 @ 47008 bytes/record (45.906 Kbytes)
> dt (j:1 t:1): Total bytes transferred: 141024 (137.719 Kbytes, 0.134 Mbytes)
> dt (j:1 t:1): Average transfer rates: 1004137 bytes/sec, 980.602 Kbytes/sec, 0.958 Mbytes/sec
> dt (j:1 t:1): Number I/O's per second: 21.361
> dt (j:1 t:1): Number seconds per I/O: 0.0468 (46.81ms)
> dt (j:1 t:1): Total passes completed: 0/1
> dt (j:1 t:1): Total errors detected: 0/1
> dt (j:1 t:1): Total elapsed time: 00m00.14s
> dt (j:1 t:1): Total system time: 00m00.00s
> dt (j:1 t:1): Total user time: 00m00.00s
> dt (j:1 t:1): Starting time: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1): Ending time: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1): Warning: The bytes written 141024, is less than the data limit 1880320000 requested!
> dt (j:1 t:1): ERROR: Error number 1 occurred on Sat Aug 30 16:14:08 2025
> dt (j:1 t:1):
> dt (j:1 t:1): Error Number: 1
> dt (j:1 t:1): Time of Current Error: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1): Read Pass Start Time: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1): Write Pass Start Time: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1): Pass Number: 1
> dt (j:1 t:1): Pass Elapsed Time: 00m00.10s
> dt (j:1 t:1): Test Elapsed Time: 00m00.24s
> dt (j:1 t:1): File Name: /mnt/hs_test/dt_thisisa.test
> dt (j:1 t:1): File Inode: 1199 (0x4af)
> dt (j:1 t:1): Directory Inode: 1 (0x1)
> dt (j:1 t:1): File Size: 1880320000 (0x70136800)
> dt (j:1 t:1): Operation: miscompare
> dt (j:1 t:1): Record Number: 2
> dt (j:1 t:1): Request Size: 47008 (0xb7a0)
> dt (j:1 t:1): Block Length: 91 (0x5b)
> dt (j:1 t:1): I/O Mode: read
> dt (j:1 t:1): I/O Type: sequential
> dt (j:1 t:1): File Type: output
> dt (j:1 t:1): Direct I/O: disabled (caching data)
> dt (j:1 t:1): Device Size: 512 (0x200)
> dt (j:1 t:1): Starting File Offset: 47008 (0xb7a0)
> dt (j:1 t:1): Starting LBA: 91 (0x5b)
> dt (j:1 t:1): Ending File Offset: 94016 (0x16f40)
> dt (j:1 t:1): Ending LBA: 182 (0xb6)
> dt (j:1 t:1): Error File Offset: 47008 (0xb7a0)
> dt (j:1 t:1): Error Offset Modulos: %8 = 0, %512 = 416, %4096 = 1952
> dt (j:1 t:1): Starting Relative Error LBA: 91 (0x5b)
> dt (j:1 t:1): Relative 4096 byte Error LBA: 11 (0xb)
> dt (j:1 t:1): Corruption Buffer Index: 0 (byte index into read buffer)
> dt (j:1 t:1): Corruption Block Index: 0 (byte index in miscompare block)
> dt (j:1 t:1):
> dt (j:1 t:1):
> dt (j:1 t:1): Data compare error at byte 0 in record number 2
> dt (j:1 t:1): Relative block number where the error occurred is 91, offset 47008
> dt (j:1 t:1): Block expected = 91 (0x0000005b), block found = 1919311731 (0x72665f73), count = 47008
> dt (j:1 t:1): The correct data starts at memory address 0x000000003c589000 (marked by asterisk '*')
> dt (j:1 t:1): Dumping Pattern Buffer (base = 0x3c589000, mismatch offset = 0, limit = 512 bytes):
> dt (j:1 t:1): / Buffer
> dt (j:1 t:1): Memory Address / Index
> dt (j:1 t:1): 0x000000003c589000/ 0 |*5b 00 00 00 5c 01 01 01 5d 02 02 02 5e 03 03 03 "[ \ ] ^ "
> dt (j:1 t:1): 0x000000003c589010/ 16 | 5f 04 04 04 60 05 05 05 61 06 06 06 62 07 07 07 "_ ` a b "
> dt (j:1 t:1): 0x000000003c589020/ 32 | 63 08 08 08 64 09 09 09 65 0a 0a 0a 66 0b 0b 0b "c d e f "
> dt (j:1 t:1): 0x000000003c589030/ 48 | 67 0c 0c 0c 68 0d 0d 0d 69 0e 0e 0e 6a 0f 0f 0f "g h i j "
> dt (j:1 t:1): 0x000000003c589040/ 64 | 6b 10 10 10 6c 11 11 11 6d 12 12 12 6e 13 13 13 "k l m n "
> dt (j:1 t:1): 0x000000003c589050/ 80 | 6f 14 14 14 70 15 15 15 71 16 16 16 72 17 17 17 "o p q r "
> dt (j:1 t:1): 0x000000003c589060/ 96 | 73 18 18 18 74 19 19 19 75 1a 1a 1a 76 1b 1b 1b "s t u v "
> dt (j:1 t:1): 0x000000003c589070/ 112 | 77 1c 1c 1c 78 1d 1d 1d 79 1e 1e 1e 7a 1f 1f 1f "w x y z "
> dt (j:1 t:1): 0x000000003c589080/ 128 | 7b 20 20 20 7c 21 21 21 7d 22 22 22 7e 23 23 23 "{ |!!!}"""~###"
> dt (j:1 t:1): 0x000000003c589090/ 144 | 7f 24 24 24 80 25 25 25 81 26 26 26 82 27 27 27 " $$$ %%% &&& '''"
> dt (j:1 t:1): 0x000000003c5890a0/ 160 | 83 28 28 28 84 29 29 29 85 2a 2a 2a 86 2b 2b 2b " ((( ))) *** +++"
> dt (j:1 t:1): 0x000000003c5890b0/ 176 | 87 2c 2c 2c 88 2d 2d 2d 89 2e 2e 2e 8a 2f 2f 2f " ,,, --- ... ///"
> dt (j:1 t:1): 0x000000003c5890c0/ 192 | 8b 30 30 30 8c 31 31 31 8d 32 32 32 8e 33 33 33 " 000 111 222 333"
> dt (j:1 t:1): 0x000000003c5890d0/ 208 | 8f 34 34 34 90 35 35 35 91 36 36 36 92 37 37 37 " 444 555 666 777"
> dt (j:1 t:1): 0x000000003c5890e0/ 224 | 93 38 38 38 94 39 39 39 95 3a 3a 3a 96 3b 3b 3b " 888 999 ::: ;;;"
> dt (j:1 t:1): 0x000000003c5890f0/ 240 | 97 3c 3c 3c 98 3d 3d 3d 99 3e 3e 3e 9a 3f 3f 3f " <<< === >>> ???"
> dt (j:1 t:1): 0x000000003c589100/ 256 | 9b 40 40 40 9c 41 41 41 9d 42 42 42 9e 43 43 43 " @@@ AAA BBB CCC"
> dt (j:1 t:1): 0x000000003c589110/ 272 | 9f 44 44 44 a0 45 45 45 a1 46 46 46 a2 47 47 47 " DDD EEE FFF GGG"
> dt (j:1 t:1): 0x000000003c589120/ 288 | a3 48 48 48 a4 49 49 49 a5 4a 4a 4a a6 4b 4b 4b " HHH III JJJ KKK"
> dt (j:1 t:1): 0x000000003c589130/ 304 | a7 4c 4c 4c a8 4d 4d 4d a9 4e 4e 4e aa 4f 4f 4f " LLL MMM NNN OOO"
> dt (j:1 t:1): 0x000000003c589140/ 320 | ab 50 50 50 ac 51 51 51 ad 52 52 52 ae 53 53 53 " PPP QQQ RRR SSS"
> dt (j:1 t:1): 0x000000003c589150/ 336 | af 54 54 54 b0 55 55 55 b1 56 56 56 b2 57 57 57 " TTT UUU VVV WWW"
> dt (j:1 t:1): 0x000000003c589160/ 352 | b3 58 58 58 b4 59 59 59 b5 5a 5a 5a b6 5b 5b 5b " XXX YYY ZZZ [[["
> dt (j:1 t:1): 0x000000003c589170/ 368 | b7 5c 5c 5c b8 5d 5d 5d b9 5e 5e 5e ba 5f 5f 5f " \\\ ]]] ^^^ ___"
> dt (j:1 t:1): 0x000000003c589180/ 384 | bb 60 60 60 bc 61 61 61 bd 62 62 62 be 63 63 63 " ``` aaa bbb ccc"
> dt (j:1 t:1): 0x000000003c589190/ 400 | bf 64 64 64 c0 65 65 65 c1 66 66 66 c2 67 67 67 " ddd eee fff ggg"
> dt (j:1 t:1): 0x000000003c5891a0/ 416 | c3 68 68 68 c4 69 69 69 c5 6a 6a 6a c6 6b 6b 6b " hhh iii jjj kkk"
> dt (j:1 t:1): 0x000000003c5891b0/ 432 | c7 6c 6c 6c c8 6d 6d 6d c9 6e 6e 6e ca 6f 6f 6f " lll mmm nnn ooo"
> dt (j:1 t:1): 0x000000003c5891c0/ 448 | cb 70 70 70 cc 71 71 71 cd 72 72 72 ce 73 73 73 " ppp qqq rrr sss"
> dt (j:1 t:1): 0x000000003c5891d0/ 464 | cf 74 74 74 d0 75 75 75 d1 76 76 76 d2 77 77 77 " ttt uuu vvv www"
> dt (j:1 t:1): 0x000000003c5891e0/ 480 | d3 78 78 78 d4 79 79 79 d5 7a 7a 7a d6 7b 7b 7b " xxx yyy zzz {{{"
> dt (j:1 t:1): 0x000000003c5891f0/ 496 | d7 7c 7c 7c d8 7d 7d 7d d9 7e 7e 7e da 7f 7f 7f " ||| }}} ~~~ "
> dt (j:1 t:1):
> dt (j:1 t:1): The incorrect data starts at memory address 0x000000003c596000 (for Robin's debug! :)
> dt (j:1 t:1): The incorrect data starts at file offset 000000000000047008 (marked by asterisk '*')
> dt (j:1 t:1): Dumping Data File offsets (base = 47008, mismatch offset = 0, limit = 512 bytes):
> dt (j:1 t:1): / Block
> dt (j:1 t:1): File Offset / Index
> dt (j:1 t:1): 000000000000047008/ 0 |*73 5f 66 72 65 65 5f 63 6f 6d 6d 69 74 5f 61 72 "s_free_commit_ar"
> dt (j:1 t:1): 000000000000047024/ 16 | 72 61 79 00 54 43 50 5f 54 49 4d 45 5f 57 41 49 "ray TCP_TIME_WAI"
> dt (j:1 t:1): 000000000000047040/ 32 | 54 00 42 50 46 5f 50 52 4f 47 5f 54 59 50 45 5f "T BPF_PROG_TYPE_"
> dt (j:1 t:1): 000000000000047056/ 48 | 43 47 52 4f 55 50 5f 53 59 53 43 54 4c 00 4c 41 "CGROUP_SYSCTL LA"
> dt (j:1 t:1): 000000000000047072/ 64 | 59 4f 55 54 5f 46 4c 45 58 5f 46 49 4c 45 53 00 "YOUT_FLEX_FILES "
> dt (j:1 t:1): 000000000000047088/ 80 | 4e 46 53 45 52 52 5f 4a 55 4b 45 42 4f 58 00 72 "NFSERR_JUKEBOX r"
> dt (j:1 t:1): 000000000000047104/ 96 | 78 5f 63 70 75 5f 72 6d 61 70 00 6d 69 67 72 61 "x_cpu_rmap migra"
> dt (j:1 t:1): 000000000000047120/ 112 | 74 69 6f 6e 5f 64 69 73 61 62 6c 65 64 00 5f 5f "tion_disabled __"
> dt (j:1 t:1): 000000000000047136/ 128 | 64 61 74 61 00 6e 64 6f 5f 64 65 6c 5f 73 6c 61 "data ndo_del_sla"
> dt (j:1 t:1): 000000000000047152/ 144 | 76 65 00 6e 66 73 5f 63 6f 6d 6d 69 74 5f 64 61 "ve nfs_commit_da"
> dt (j:1 t:1): 000000000000047168/ 160 | 74 61 00 65 78 74 5f 6d 75 74 65 78 00 63 6f 6e "ta ext_mutex con"
> dt (j:1 t:1): 000000000000047184/ 176 | 6e 65 63 74 5f 63 6f 6f 6b 69 65 00 54 43 50 5f "nect_cookie TCP_"
> dt (j:1 t:1): 000000000000047200/ 192 | 43 4c 4f 53 45 5f 57 41 49 54 00 6d 65 6d 63 6d "CLOSE_WAIT memcm"
> dt (j:1 t:1): 000000000000047216/ 208 | 70 00 52 50 4d 5f 52 45 51 5f 53 55 53 50 45 4e "p RPM_REQ_SUSPEN"
> dt (j:1 t:1): 000000000000047232/ 224 | 44 00 63 72 6d 61 74 63 68 00 63 61 6e 63 65 6c "D crmatch cancel"
> dt (j:1 t:1): 000000000000047248/ 240 | 5f 66 6f 72 6b 00 70 67 70 72 6f 74 5f 74 00 74 "_fork pgprot_t t"
> dt (j:1 t:1): 000000000000047264/ 256 | 72 61 63 65 70 6f 69 6e 74 5f 70 74 72 5f 74 00 "racepoint_ptr_t "
> dt (j:1 t:1): 000000000000047280/ 272 | 66 6f 72 5f 72 65 63 6c 61 69 6d 00 4e 46 53 45 "for_reclaim NFSE"
> dt (j:1 t:1): 000000000000047296/ 288 | 52 52 5f 42 41 44 43 48 41 52 00 5f 73 6b 62 5f "RR_BADCHAR _skb_"
> dt (j:1 t:1): 000000000000047312/ 304 | 72 65 66 64 73 74 00 70 68 79 73 69 63 61 6c 5f "refdst physical_"
> dt (j:1 t:1): 000000000000047328/ 320 | 6c 6f 63 61 74 69 6f 6e 00 6e 75 6d 5f 72 65 71 "location num_req"
> dt (j:1 t:1): 000000000000047344/ 336 | 73 00 5f 5f 53 43 54 5f 5f 74 70 5f 66 75 6e 63 "s __SCT__tp_func"
> dt (j:1 t:1): 000000000000047360/ 352 | 5f 70 6e 66 73 5f 6d 64 73 5f 66 61 6c 6c 62 61 "_pnfs_mds_fallba"
> dt (j:1 t:1): 000000000000047376/ 368 | 63 6b 5f 77 72 69 74 65 5f 64 6f 6e 65 00 74 61 "ck_write_done ta"
> dt (j:1 t:1): 000000000000047392/ 384 | 73 6b 5f 63 6c 65 61 6e 75 70 00 65 78 70 61 6e "sk_cleanup expan"
> dt (j:1 t:1): 000000000000047408/ 400 | 64 5f 72 65 61 64 61 68 65 61 64 00 6c 6f 63 6b "d_readahead lock"
> dt (j:1 t:1): 000000000000047424/ 416 | 5f 6d 61 6e 61 67 65 72 5f 6f 70 65 72 61 74 69 "_manager_operati"
> dt (j:1 t:1): 000000000000047440/ 432 | 6f 6e 73 00 73 72 63 5f 72 65 67 00 63 72 64 65 "ons src_reg crde"
> dt (j:1 t:1): 000000000000047456/ 448 | 73 74 72 6f 79 00 63 68 69 6c 64 72 65 6e 5f 6c "stroy children_l"
> dt (j:1 t:1): 000000000000047472/ 464 | 6f 77 5f 75 73 61 67 65 00 6e 75 6d 5f 76 66 00 "ow_usage num_vf "
> dt (j:1 t:1): 000000000000047488/ 480 | 73 63 72 61 74 63 68 00 50 49 44 54 59 50 45 5f "scratch PIDTYPE_"
> dt (j:1 t:1): 000000000000047504/ 496 | 4d 41 58 00 70 72 65 70 61 72 65 5f 77 72 69 74 "MAX prepare_writ"
> dt (j:1 t:1):
> dt (j:1 t:1):
> dt (j:1 t:1): Analyzing IOT Record Data: (Note: Block #'s are relative to start of record!)
> dt (j:1 t:1):
> dt (j:1 t:1): IOT block size: 512
> dt (j:1 t:1): Total number of blocks: 91 (47008 bytes)
> dt (j:1 t:1): Current IOT seed value: 0x01010101 (pass 1)
> dt (j:1 t:1): Range of corrupted blocks: 0 - 90
> dt (j:1 t:1): Length of corrupted blocks: 91 (46592 bytes)
> dt (j:1 t:1): Corrupted blocks file offset: 47008 (LBA 91)
> dt (j:1 t:1): Number of corrupted blocks: 91
> dt (j:1 t:1): Number of good blocks found: 0
> dt (j:1 t:1): Number of zero blocks found: 0
> dt (j:1 t:1):
> dt (j:1 t:1): Record #: 2
> dt (j:1 t:1): Starting Record Offset: 47008
> dt (j:1 t:1): Transfer Count: 47008 (0xb7a0)
> dt (j:1 t:1): Ending Record Offset: 94016
> dt (j:1 t:1): Relative Record Block Range: 91 - 182
> dt (j:1 t:1): Read Buffer Address: 0x3c596000
> dt (j:1 t:1): Pattern Base Address: 0x3c589000
> dt (j:1 t:1): Note: Incorrect data is marked with asterisk '*'
> dt (j:1 t:1):
> dt (j:1 t:1): Record Block: 0 (BAD data)
> dt (j:1 t:1): Record Block Offset: 47008 (LBA 91)
> dt (j:1 t:1): Record Buffer Index: 0 (0x0)
> dt (j:1 t:1): Expected Block Number: 91 (0x0000005b)
> dt (j:1 t:1): Received Block Number: 1919311731 (0x72665f73)
> dt (j:1 t:1): Received Block Offset: 982687606272
> dt (j:1 t:1):
> dt (j:1 t:1): Byte Expected: address 0x3c589000 Received: address 0x3c596000
> dt (j:1 t:1): 0000 0000005b 0101015c 0202025d 0303035e * 72665f73 635f6565 696d6d6f 72615f74
> dt (j:1 t:1): 0010 0404045f 05050560 06060661 07070762 * 00796172 5f504354 454d4954 4941575f
> dt (j:1 t:1): 0020 08080863 09090964 0a0a0a65 0b0b0b66 * 50420054 52505f46 545f474f 5f455059
> dt (j:1 t:1): 0030 0c0c0c67 0d0d0d68 0e0e0e69 0f0f0f6a * 4f524743 535f5055 54435359 414c004c
> dt (j:1 t:1): 0040 1010106b 1111116c 1212126d 1313136e * 54554f59 454c465f 49465f58 0053454c
> dt (j:1 t:1): 0050 1414146f 15151570 16161671 17171772 * 4553464e 4a5f5252 42454b55 7200584f
> dt (j:1 t:1): 0060 18181873 19191974 1a1a1a75 1b1b1b76 * 70635f78 6d725f75 6d007061 61726769
> dt (j:1 t:1): 0070 1c1c1c77 1d1d1d78 1e1e1e79 1f1f1f7a * 6e6f6974 7369645f 656c6261 5f5f0064
> dt (j:1 t:1): 0080 2020207b 2121217c 2222227d 2323237e * 61746164 6f646e00 6c65645f 616c735f
> dt (j:1 t:1): 0090 2424247f 25252580 26262681 27272782 * 6e006576 635f7366 696d6d6f 61645f74
> dt (j:1 t:1): 00a0 28282883 29292984 2a2a2a85 2b2b2b86 * 65006174 6d5f7478 78657475 6e6f6300
> dt (j:1 t:1): 00b0 2c2c2c87 2d2d2d88 2e2e2e89 2f2f2f8a * 7463656e 6f6f635f 0065696b 5f504354
> dt (j:1 t:1): 00c0 3030308b 3131318c 3232328d 3333338e * 534f4c43 41575f45 6d005449 6d636d65
> dt (j:1 t:1): 00d0 3434348f 35353590 36363691 37373792 * 50520070 45525f4d 55535f51 4e455053
> dt (j:1 t:1): 00e0 38383893 39393994 3a3a3a95 3b3b3b96 * 72630044 6374616d 61630068 6c65636e
> dt (j:1 t:1): 00f0 3c3c3c97 3d3d3d98 3e3e3e99 3f3f3f9a * 726f665f 6770006b 746f7270 7400745f
> dt (j:1 t:1): 0100 4040409b 4141419c 4242429d 4343439e * 65636172 6e696f70 74705f74 00745f72
> dt (j:1 t:1): 0110 4444449f 454545a0 464646a1 474747a2 * 5f726f66 6c636572 006d6961 4553464e
> dt (j:1 t:1): 0120 484848a3 494949a4 4a4a4aa5 4b4b4ba6 * 425f5252 48434441 5f005241 5f626b73
> dt (j:1 t:1): 0130 4c4c4ca7 4d4d4da8 4e4e4ea9 4f4f4faa * 64666572 70007473 69737968 5f6c6163
> dt (j:1 t:1): 0140 505050ab 515151ac 525252ad 535353ae * 61636f6c 6e6f6974 6d756e00 7165725f
> dt (j:1 t:1): 0150 545454af 555555b0 565656b1 575757b2 * 5f5f0073 5f544353 5f70745f 636e7566
> dt (j:1 t:1): 0160 585858b3 595959b4 5a5a5ab5 5b5b5bb6 * 666e705f 646d5f73 61665f73 61626c6c
> dt (j:1 t:1): 0170 5c5c5cb7 5d5d5db8 5e5e5eb9 5f5f5fba * 775f6b63 65746972 6e6f645f 61740065
> dt (j:1 t:1): 0180 606060bb 616161bc 626262bd 636363be * 635f6b73 6e61656c 65007075 6e617078
> dt (j:1 t:1): 0190 646464bf 656565c0 666666c1 676767c2 * 65725f64 68616461 00646165 6b636f6c
> dt (j:1 t:1): 01a0 686868c3 696969c4 6a6a6ac5 6b6b6bc6 * 6e616d5f 72656761 65706f5f 69746172
> dt (j:1 t:1): 01b0 6c6c6cc7 6d6d6dc8 6e6e6ec9 6f6f6fca * 00736e6f 5f637273 00676572 65647263
> dt (j:1 t:1): 01c0 707070cb 717171cc 727272cd 737373ce * 6f727473 68630079 72646c69 6c5f6e65
> dt (j:1 t:1): 01d0 747474cf 757575d0 767676d1 777777d2 * 755f776f 65676173 6d756e00 0066765f
> dt (j:1 t:1): 01e0 787878d3 797979d4 7a7a7ad5 7b7b7bd6 * 61726373 00686374 54444950 5f455059
> dt (j:1 t:1): 01f0 7c7c7cd7 7d7d7dd8 7e7e7ed9 7f7f7fda * 0058414d 70657270 5f657261 74697277
> ...
> dt (j:1 t:1): Reread data does NOT match previous data or expected data!
> dt (j:1 t:1): Writing reread data to file dt_thisisa.test-REREAD3-j1t1, from buffer 0x7f12bc004000, 47008 bytes...
> dt (j:1 t:1): Command line to re-read the corrupted data:
> dt (j:1 t:1): # dt if=/mnt/hs_test/dt_thisisa.test bs=47008 count=1 offset=47008 pattern=iot disable=retryDC,savecorrupted,trigdefaults
> dt (j:1 t:1):
> dt (j:1 t:1): Command line to re-read the data:
> dt (j:1 t:1): # dt if=/mnt/hs_test/dt_thisisa.test bs=47008 dsize=512 iotype=sequential iodir=forward limit=94016 records=1 pattern=iot disable=retryDC,savecorrupted,trigdefaults
> dt (j:1 t:1):
> dt (j:1 t:1):
> dt (j:1 t:1): Read Statistics:
> dt (j:1 t:1): Job Information Reported: Job 1, Thread 1
> dt (j:1 t:1): Last IOT seed value used: 0x01010101
> dt (j:1 t:1): Total records processed: 2 @ 47008 bytes/record (45.906 Kbytes)
> dt (j:1 t:1): Total bytes transferred: 94016 (91.812 Kbytes, 0.090 Mbytes)
> dt (j:1 t:1): Average transfer rates: 656857 bytes/sec, 641.462 Kbytes/sec, 0.626 Mbytes/sec
> dt (j:1 t:1): Number I/O's per second: 13.973
> dt (j:1 t:1): Number seconds per I/O: 0.0716 (71.56ms)
> dt (j:1 t:1): Total passes completed: 1/1
> dt (j:1 t:1): Total errors detected: 1/1
> dt (j:1 t:1): Total elapsed time: 00m00.15s
> dt (j:1 t:1): Total system time: 00m00.00s
> dt (j:1 t:1): Total user time: 00m00.00s
> dt (j:1 t:1): Starting time: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1): Ending time: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1):
> dt (j:1 t:1): Operating System Information:
> dt (j:1 t:1): Host name: plsm121c-06.perf.hammer.space (192.168.1.106)
> dt (j:1 t:1): User name: root
> dt (j:1 t:1): Process ID: 31703
> dt (j:1 t:1): OS information: Linux 6.12.24.17.hs.snitm+ #34 SMP PREEMPT_DYNAMIC Fri Aug 15 22:03:10 UTC 2025 x86_64
> dt (j:1 t:1):
> dt (j:1 t:1): File System Information:
> dt (j:1 t:1): Mounted from device: 192.168.0.105:/hs_test
> dt (j:1 t:1): Mounted on directory: /mnt/hs_test
> dt (j:1 t:1): Filesystem type: nfs4
> dt (j:1 t:1): Filesystem options: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,fatal_neterrors=none,proto=tcp,nconnect=16,port=20491,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.106,local_lock=none,addr=192.168.0.105
> dt (j:1 t:1): Filesystem block size: 1048576
> dt (j:1 t:1): Filesystem free space: 60990430380032 (58165007.000 Mbytes, 56801.765 Gbytes, 55.470 Tbytes)
> dt (j:1 t:1): Filesystem total space: 60992310476800 (58166800.000 Mbytes, 56803.516 Gbytes, 55.472 Tbytes)
> dt (j:1 t:1):
> dt (j:1 t:1): Total Statistics:
> dt (j:1 t:1): Output device/file name: /mnt/hs_test/dt_thisisa.test (device type=regular)
> dt (j:1 t:1): Type of I/O's performed: sequential (forward)
> dt (j:1 t:1): Job Information Reported: Job 1, Thread 1
> dt (j:1 t:1): Data pattern string used: 'IOT Pattern' (blocking is 512 bytes)
> dt (j:1 t:1): Last IOT seed value used: 0x01010101
> dt (j:1 t:1): Total records read: 2
> dt (j:1 t:1): Total bytes read: 94016 (91.812 Kbytes, 0.090 Mbytes, 0.000 Gbytes)
> dt (j:1 t:1): Total records written: 3
> dt (j:1 t:1): Total bytes written: 141024 (137.719 Kbytes, 0.134 Mbytes, 0.000 Gbytes)
> dt (j:1 t:1): Total records processed: 5 @ 47008 bytes/record (45.906 Kbytes)
> dt (j:1 t:1): Total bytes transferred: 235040 (229.531 Kbytes, 0.224 Mbytes)
> dt (j:1 t:1): Average transfer rates: 828023 bytes/sec, 808.616 Kbytes/sec, 0.790 Mbytes/sec
> dt (j:1 t:1): Number I/O's per second: 17.615
> dt (j:1 t:1): Number seconds per I/O: 0.0568 (56.77ms)
> dt (j:1 t:1): Total passes completed: 1/1
> dt (j:1 t:1): Total errors detected: 1/1
> dt (j:1 t:1): Total elapsed time: 00m00.29s
> dt (j:1 t:1): Total system time: 00m00.00s
> dt (j:1 t:1): Total user time: 00m00.00s
> dt (j:1 t:1): Starting time: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1): Ending time: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1):
> dt (j:1 t:1): Command line to re-read the data:
> dt (j:1 t:1): # dt if=/mnt/hs_test/dt_thisisa.test bs=47008 dsize=512 iotype=sequential iodir=forward limit=141024 records=3 pattern=iot
> dt (j:1 t:1):
> dt (j:1 t:1): Command Line:
> dt (j:1 t:1):
> dt (j:1 t:1): # dt of=/mnt/hs_test/dt_thisisa.test passes=1 bs=47008 count=3 iotype=sequential pattern=iot onerr=abort oncerr=abort
> dt (j:1 t:1):
> dt (j:1 t:1): --> Date: September 21st, 2023, Version: 25.05, Author: Robin T. Miller <--
> dt (j:1 t:1):
> dt (j:1 t:1): onerr=abort, so stopping all threads for job 1...
> dt (j:0 t:0): Job 1 is being stopped (1 thread)
> dt (j:0 t:0): Program is exiting with status -1...
> ---
> fs/nfsd/vfs.c | 25 ++++++-------------------
> 1 file changed, 6 insertions(+), 19 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index f8975ee262b5c..762d745b1b15d 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1079,13 +1079,11 @@ struct nfsd_read_dio {
> loff_t end;
> unsigned long start_extra;
> unsigned long end_extra;
> - struct page *start_extra_page;
> };
>
> static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
> {
> memset(read_dio, 0, sizeof(*read_dio));
> - read_dio->start_extra_page = NULL;
> }
>
> #define NFSD_READ_DIO_MIN_KB (32 << 10)
> @@ -1121,9 +1119,8 @@ static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> /*
> * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
> - * to be DIO-aligned (this heuristic avoids excess work, like allocating
> - * start_extra_page, for smaller IO that can generally already perform
> - * well using buffered IO).
> + * to be DIO-aligned (this heuristic avoids excess work for smaller IO
> + * that can generally already perform well using buffered IO).
> */
> if ((read_dio->start_extra || read_dio->end_extra) &&
> (len < NFSD_READ_DIO_MIN_KB)) {
> @@ -1131,15 +1128,6 @@ static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> return false;
> }
>
> - if (read_dio->start_extra) {
> - read_dio->start_extra_page = alloc_page(GFP_KERNEL);
> - if (WARN_ONCE(read_dio->start_extra_page == NULL,
> - "%s: Unable to allocate start_extra_page\n", __func__)) {
> - init_nfsd_read_dio(read_dio);
> - return false;
> - }
> - }
> -
> /* Show original offset and count, and how it was expanded for DIO */
> middle_end = read_dio->end - read_dio->end_extra;
> trace_nfsd_analyze_read_dio(rqstp, fhp, offset, len,
> @@ -1162,11 +1150,10 @@ static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
> if (!read_dio->start_extra && !read_dio->end_extra)
> return host_err;
>
> - /* If nfsd_analyze_read_dio() allocated a start_extra_page it must
> - * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
> + /* If nfsd_analyze_read_dio() found a start_extra (front-pad) page is needed,
> + * it must be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
> */
> - if (read_dio->start_extra_page) {
> - __free_page(read_dio->start_extra_page);
> + if (read_dio->start_extra) {
> *rq_bvec_numpages -= 1;
> v = *rq_bvec_numpages;
> memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
> @@ -1276,7 +1263,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (read_dio.start_extra) {
> len = read_dio.start_extra;
> bvec_set_page(&rqstp->rq_bvec[v],
> - read_dio.start_extra_page,
> + *(rqstp->rq_next_page++),
> len, PAGE_SIZE - len);
> total -= len;
> ++v;
Thank you, Mike. This will help me reproduce the problem. Saves me a
bunch of time!
--
Chuck Lever
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-08-30 17:38 ` [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug? Mike Snitzer
2025-09-02 14:04 ` Chuck Lever
@ 2025-09-02 15:56 ` Chuck Lever
2025-09-02 17:59 ` Chuck Lever
1 sibling, 1 reply; 42+ messages in thread
From: Chuck Lever @ 2025-09-02 15:56 UTC (permalink / raw)
To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs
On 8/30/25 1:38 PM, Mike Snitzer wrote:
> From: Mike Snitzer <snitzer@hammerspace.com>
>
> Chuck Lever advised that allocating a single start_extra_page, to
> avoid RDMA corruption on client, definitely shouldn't be needed:
>
> "There's nothing I can think of in the RDMA or RPC/RDMA protocols that
> mandates that the first page offset must always be zero. Moving data
> at one address on the server to an entirely different address and
> alignment on the client is exactly what RDMA is supposed to do.
>
> It sounds like an implementation omission because the server's upper
> layers have never needed it before now. If TCP already handles it, I'm
> guessing it's going to be straightforward to fix."
>
> So, to avoid papering over what seems to be an rpcrdma bug, remove the
> allocation and use of the extra start_extra_page.
>
> With this patch applied on top of the v8 patchset [0], I get the
> data-mismatch errors shown at the end [3] when using the NFS RDMA client
> with the reproducer documented in the associated patch header since v2 [1]:
>
> "Must allocate and use a bounce-buffer page (called 'start_extra_page')
> if/when expanding the misaligned READ requires reading extra partial
> page at the start of the READ so that it's DIO-aligned. Otherwise that
> extra page at the start will make its way back to the NFS client and
> corruption will occur. As found, and then this fix of using an extra
> page verified, using the 'dt' utility:
> dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
> iotype=sequential pattern=iot onerr=abort oncerr=abort
> see: https://github.com/RobinTMiller/dt.git "
>
> I really did try to call attention to this misaligned DIO READ
> alloc_page hack to make RDMA work, see [2], but I didn't frame it as
> RDMA-specific and definitely should've been clearer on that important
> detail:
>
> "Also, I think it's worth calling out this
> nfsd_complete_misaligned_read_dio function for its remapping/shifting
> of the READ payload reflected in rqstp->rq_bvec[]."
>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>
> [0]: https://lore.kernel.org/linux-nfs/20250826185718.5593-1-snitzer@kernel.org/
> [1]: https://lore.kernel.org/linux-nfs/20250708160619.64800-9-snitzer@kernel.org/
> [2]: https://lore.kernel.org/linux-nfs/aG2MDVyyCbjTpgOv@kernel.org/
> [3]: partial output of dt utility that shows NFS client READ data mismatch:
> ++ COUNT=3
> ++ IOSIZE=47008
> ++ dt of=/mnt/hs_test/dt_thisisa.test passes=1 bs=47008 count=3 iotype=sequential pattern=iot onerr=abort oncerr=abort
> dt (j:1 t:1):
> dt (j:1 t:1): Write Statistics:
> dt (j:1 t:1): Job Information Reported: Job 1, Thread 1
> dt (j:1 t:1): Last IOT seed value used: 0x01010101
> dt (j:1 t:1): Total records processed: 3 @ 47008 bytes/record (45.906 Kbytes)
> dt (j:1 t:1): Total bytes transferred: 141024 (137.719 Kbytes, 0.134 Mbytes)
> dt (j:1 t:1): Average transfer rates: 1004137 bytes/sec, 980.602 Kbytes/sec, 0.958 Mbytes/sec
> dt (j:1 t:1): Number I/O's per second: 21.361
> dt (j:1 t:1): Number seconds per I/O: 0.0468 (46.81ms)
> dt (j:1 t:1): Total passes completed: 0/1
> dt (j:1 t:1): Total errors detected: 0/1
> dt (j:1 t:1): Total elapsed time: 00m00.14s
> dt (j:1 t:1): Total system time: 00m00.00s
> dt (j:1 t:1): Total user time: 00m00.00s
> dt (j:1 t:1): Starting time: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1): Ending time: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1): Warning: The bytes written 141024, is less than the data limit 1880320000 requested!
> dt (j:1 t:1): ERROR: Error number 1 occurred on Sat Aug 30 16:14:08 2025
> dt (j:1 t:1):
> dt (j:1 t:1): Error Number: 1
> dt (j:1 t:1): Time of Current Error: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1): Read Pass Start Time: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1): Write Pass Start Time: Sat Aug 30 16:14:08 2025
> dt (j:1 t:1): Pass Number: 1
> dt (j:1 t:1): Pass Elapsed Time: 00m00.10s
> dt (j:1 t:1): Test Elapsed Time: 00m00.24s
> dt (j:1 t:1): File Name: /mnt/hs_test/dt_thisisa.test
> dt (j:1 t:1): File Inode: 1199 (0x4af)
> dt (j:1 t:1): Directory Inode: 1 (0x1)
> dt (j:1 t:1): File Size: 1880320000 (0x70136800)
> dt (j:1 t:1): Operation: miscompare
> dt (j:1 t:1): Record Number: 2
> dt (j:1 t:1): Request Size: 47008 (0xb7a0)
> dt (j:1 t:1): Block Length: 91 (0x5b)
> dt (j:1 t:1): I/O Mode: read
> dt (j:1 t:1): I/O Type: sequential
> dt (j:1 t:1): File Type: output
> dt (j:1 t:1): Direct I/O: disabled (caching data)
> dt (j:1 t:1): Device Size: 512 (0x200)
> dt (j:1 t:1): Starting File Offset: 47008 (0xb7a0)
> dt (j:1 t:1): Starting LBA: 91 (0x5b)
> dt (j:1 t:1): Ending File Offset: 94016 (0x16f40)
> dt (j:1 t:1): Ending LBA: 182 (0xb6)
> dt (j:1 t:1): Error File Offset: 47008 (0xb7a0)
> dt (j:1 t:1): Error Offset Modulos: %8 = 0, %512 = 416, %4096 = 1952
> dt (j:1 t:1): Starting Relative Error LBA: 91 (0x5b)
> dt (j:1 t:1): Relative 4096 byte Error LBA: 11 (0xb)
> dt (j:1 t:1): Corruption Buffer Index: 0 (byte index into read buffer)
> dt (j:1 t:1): Corruption Block Index: 0 (byte index in miscompare block)
> dt (j:1 t:1):
> dt (j:1 t:1):
> dt (j:1 t:1): Data compare error at byte 0 in record number 2
> dt (j:1 t:1): Relative block number where the error occurred is 91, offset 47008
> dt (j:1 t:1): Block expected = 91 (0x0000005b), block found = 1919311731 (0x72665f73), count = 47008
> dt (j:1 t:1): The correct data starts at memory address 0x000000003c589000 (marked by asterisk '*')
> dt (j:1 t:1): Dumping Pattern Buffer (base = 0x3c589000, mismatch offset = 0, limit = 512 bytes):
> dt (j:1 t:1): / Buffer
> dt (j:1 t:1): Memory Address / Index
> dt (j:1 t:1): 0x000000003c589000/ 0 |*5b 00 00 00 5c 01 01 01 5d 02 02 02 5e 03 03 03 "[ \ ] ^ "
> dt (j:1 t:1): 0x000000003c589010/ 16 | 5f 04 04 04 60 05 05 05 61 06 06 06 62 07 07 07 "_ ` a b "
> dt (j:1 t:1): 0x000000003c589020/ 32 | 63 08 08 08 64 09 09 09 65 0a 0a 0a 66 0b 0b 0b "c d e f "
> dt (j:1 t:1): 0x000000003c589030/ 48 | 67 0c 0c 0c 68 0d 0d 0d 69 0e 0e 0e 6a 0f 0f 0f "g h i j "
> dt (j:1 t:1): 0x000000003c589040/ 64 | 6b 10 10 10 6c 11 11 11 6d 12 12 12 6e 13 13 13 "k l m n "
> dt (j:1 t:1): 0x000000003c589050/ 80 | 6f 14 14 14 70 15 15 15 71 16 16 16 72 17 17 17 "o p q r "
> dt (j:1 t:1): 0x000000003c589060/ 96 | 73 18 18 18 74 19 19 19 75 1a 1a 1a 76 1b 1b 1b "s t u v "
> dt (j:1 t:1): 0x000000003c589070/ 112 | 77 1c 1c 1c 78 1d 1d 1d 79 1e 1e 1e 7a 1f 1f 1f "w x y z "
> dt (j:1 t:1): 0x000000003c589080/ 128 | 7b 20 20 20 7c 21 21 21 7d 22 22 22 7e 23 23 23 "{ |!!!}"""~###"
> dt (j:1 t:1): 0x000000003c589090/ 144 | 7f 24 24 24 80 25 25 25 81 26 26 26 82 27 27 27 " $$$ %%% &&& '''"
> dt (j:1 t:1): 0x000000003c5890a0/ 160 | 83 28 28 28 84 29 29 29 85 2a 2a 2a 86 2b 2b 2b " ((( ))) *** +++"
> dt (j:1 t:1): 0x000000003c5890b0/ 176 | 87 2c 2c 2c 88 2d 2d 2d 89 2e 2e 2e 8a 2f 2f 2f " ,,, --- ... ///"
> dt (j:1 t:1): 0x000000003c5890c0/ 192 | 8b 30 30 30 8c 31 31 31 8d 32 32 32 8e 33 33 33 " 000 111 222 333"
> dt (j:1 t:1): 0x000000003c5890d0/ 208 | 8f 34 34 34 90 35 35 35 91 36 36 36 92 37 37 37 " 444 555 666 777"
> dt (j:1 t:1): 0x000000003c5890e0/ 224 | 93 38 38 38 94 39 39 39 95 3a 3a 3a 96 3b 3b 3b " 888 999 ::: ;;;"
> dt (j:1 t:1): 0x000000003c5890f0/ 240 | 97 3c 3c 3c 98 3d 3d 3d 99 3e 3e 3e 9a 3f 3f 3f " <<< === >>> ???"
> dt (j:1 t:1): 0x000000003c589100/ 256 | 9b 40 40 40 9c 41 41 41 9d 42 42 42 9e 43 43 43 " @@@ AAA BBB CCC"
> dt (j:1 t:1): 0x000000003c589110/ 272 | 9f 44 44 44 a0 45 45 45 a1 46 46 46 a2 47 47 47 " DDD EEE FFF GGG"
> dt (j:1 t:1): 0x000000003c589120/ 288 | a3 48 48 48 a4 49 49 49 a5 4a 4a 4a a6 4b 4b 4b " HHH III JJJ KKK"
> dt (j:1 t:1): 0x000000003c589130/ 304 | a7 4c 4c 4c a8 4d 4d 4d a9 4e 4e 4e aa 4f 4f 4f " LLL MMM NNN OOO"
> dt (j:1 t:1): 0x000000003c589140/ 320 | ab 50 50 50 ac 51 51 51 ad 52 52 52 ae 53 53 53 " PPP QQQ RRR SSS"
> dt (j:1 t:1): 0x000000003c589150/ 336 | af 54 54 54 b0 55 55 55 b1 56 56 56 b2 57 57 57 " TTT UUU VVV WWW"
> dt (j:1 t:1): 0x000000003c589160/ 352 | b3 58 58 58 b4 59 59 59 b5 5a 5a 5a b6 5b 5b 5b " XXX YYY ZZZ [[["
> dt (j:1 t:1): 0x000000003c589170/ 368 | b7 5c 5c 5c b8 5d 5d 5d b9 5e 5e 5e ba 5f 5f 5f " \\\ ]]] ^^^ ___"
> dt (j:1 t:1): 0x000000003c589180/ 384 | bb 60 60 60 bc 61 61 61 bd 62 62 62 be 63 63 63 " ``` aaa bbb ccc"
> dt (j:1 t:1): 0x000000003c589190/ 400 | bf 64 64 64 c0 65 65 65 c1 66 66 66 c2 67 67 67 " ddd eee fff ggg"
> dt (j:1 t:1): 0x000000003c5891a0/ 416 | c3 68 68 68 c4 69 69 69 c5 6a 6a 6a c6 6b 6b 6b " hhh iii jjj kkk"
> dt (j:1 t:1): 0x000000003c5891b0/ 432 | c7 6c 6c 6c c8 6d 6d 6d c9 6e 6e 6e ca 6f 6f 6f " lll mmm nnn ooo"
> dt (j:1 t:1): 0x000000003c5891c0/ 448 | cb 70 70 70 cc 71 71 71 cd 72 72 72 ce 73 73 73 " ppp qqq rrr sss"
> dt (j:1 t:1): 0x000000003c5891d0/ 464 | cf 74 74 74 d0 75 75 75 d1 76 76 76 d2 77 77 77 " ttt uuu vvv www"
> dt (j:1 t:1): 0x000000003c5891e0/ 480 | d3 78 78 78 d4 79 79 79 d5 7a 7a 7a d6 7b 7b 7b " xxx yyy zzz {{{"
> dt (j:1 t:1): 0x000000003c5891f0/ 496 | d7 7c 7c 7c d8 7d 7d 7d d9 7e 7e 7e da 7f 7f 7f " ||| }}} ~~~ "
> dt (j:1 t:1):
> dt (j:1 t:1): The incorrect data starts at memory address 0x000000003c596000 (for Robin's debug! :)
> dt (j:1 t:1): The incorrect data starts at file offset 000000000000047008 (marked by asterisk '*')
> dt (j:1 t:1): Dumping Data File offsets (base = 47008, mismatch offset = 0, limit = 512 bytes):
> dt (j:1 t:1): / Block
> dt (j:1 t:1): File Offset / Index
> dt (j:1 t:1): 000000000000047008/ 0 |*73 5f 66 72 65 65 5f 63 6f 6d 6d 69 74 5f 61 72 "s_free_commit_ar"
> dt (j:1 t:1): 000000000000047024/ 16 | 72 61 79 00 54 43 50 5f 54 49 4d 45 5f 57 41 49 "ray TCP_TIME_WAI"
> dt (j:1 t:1): 000000000000047040/ 32 | 54 00 42 50 46 5f 50 52 4f 47 5f 54 59 50 45 5f "T BPF_PROG_TYPE_"
> dt (j:1 t:1): 000000000000047056/ 48 | 43 47 52 4f 55 50 5f 53 59 53 43 54 4c 00 4c 41 "CGROUP_SYSCTL LA"
> dt (j:1 t:1): 000000000000047072/ 64 | 59 4f 55 54 5f 46 4c 45 58 5f 46 49 4c 45 53 00 "YOUT_FLEX_FILES "
> dt (j:1 t:1): 000000000000047088/ 80 | 4e 46 53 45 52 52 5f 4a 55 4b 45 42 4f 58 00 72 "NFSERR_JUKEBOX r"
> dt (j:1 t:1): 000000000000047104/ 96 | 78 5f 63 70 75 5f 72 6d 61 70 00 6d 69 67 72 61 "x_cpu_rmap migra"
> dt (j:1 t:1): 000000000000047120/ 112 | 74 69 6f 6e 5f 64 69 73 61 62 6c 65 64 00 5f 5f "tion_disabled __"
> dt (j:1 t:1): 000000000000047136/ 128 | 64 61 74 61 00 6e 64 6f 5f 64 65 6c 5f 73 6c 61 "data ndo_del_sla"
> dt (j:1 t:1): 000000000000047152/ 144 | 76 65 00 6e 66 73 5f 63 6f 6d 6d 69 74 5f 64 61 "ve nfs_commit_da"
> dt (j:1 t:1): 000000000000047168/ 160 | 74 61 00 65 78 74 5f 6d 75 74 65 78 00 63 6f 6e "ta ext_mutex con"
> dt (j:1 t:1): 000000000000047184/ 176 | 6e 65 63 74 5f 63 6f 6f 6b 69 65 00 54 43 50 5f "nect_cookie TCP_"
> dt (j:1 t:1): 000000000000047200/ 192 | 43 4c 4f 53 45 5f 57 41 49 54 00 6d 65 6d 63 6d "CLOSE_WAIT memcm"
> dt (j:1 t:1): 000000000000047216/ 208 | 70 00 52 50 4d 5f 52 45 51 5f 53 55 53 50 45 4e "p RPM_REQ_SUSPEN"
> dt (j:1 t:1): 000000000000047232/ 224 | 44 00 63 72 6d 61 74 63 68 00 63 61 6e 63 65 6c "D crmatch cancel"
> dt (j:1 t:1): 000000000000047248/ 240 | 5f 66 6f 72 6b 00 70 67 70 72 6f 74 5f 74 00 74 "_fork pgprot_t t"
> dt (j:1 t:1): 000000000000047264/ 256 | 72 61 63 65 70 6f 69 6e 74 5f 70 74 72 5f 74 00 "racepoint_ptr_t "
> dt (j:1 t:1): 000000000000047280/ 272 | 66 6f 72 5f 72 65 63 6c 61 69 6d 00 4e 46 53 45 "for_reclaim NFSE"
> dt (j:1 t:1): 000000000000047296/ 288 | 52 52 5f 42 41 44 43 48 41 52 00 5f 73 6b 62 5f "RR_BADCHAR _skb_"
> dt (j:1 t:1): 000000000000047312/ 304 | 72 65 66 64 73 74 00 70 68 79 73 69 63 61 6c 5f "refdst physical_"
> dt (j:1 t:1): 000000000000047328/ 320 | 6c 6f 63 61 74 69 6f 6e 00 6e 75 6d 5f 72 65 71 "location num_req"
> dt (j:1 t:1): 000000000000047344/ 336 | 73 00 5f 5f 53 43 54 5f 5f 74 70 5f 66 75 6e 63 "s __SCT__tp_func"
> dt (j:1 t:1): 000000000000047360/ 352 | 5f 70 6e 66 73 5f 6d 64 73 5f 66 61 6c 6c 62 61 "_pnfs_mds_fallba"
> dt (j:1 t:1): 000000000000047376/ 368 | 63 6b 5f 77 72 69 74 65 5f 64 6f 6e 65 00 74 61 "ck_write_done ta"
> dt (j:1 t:1): 000000000000047392/ 384 | 73 6b 5f 63 6c 65 61 6e 75 70 00 65 78 70 61 6e "sk_cleanup expan"
> dt (j:1 t:1): 000000000000047408/ 400 | 64 5f 72 65 61 64 61 68 65 61 64 00 6c 6f 63 6b "d_readahead lock"
> dt (j:1 t:1): 000000000000047424/ 416 | 5f 6d 61 6e 61 67 65 72 5f 6f 70 65 72 61 74 69 "_manager_operati"
> dt (j:1 t:1): 000000000000047440/ 432 | 6f 6e 73 00 73 72 63 5f 72 65 67 00 63 72 64 65 "ons src_reg crde"
> dt (j:1 t:1): 000000000000047456/ 448 | 73 74 72 6f 79 00 63 68 69 6c 64 72 65 6e 5f 6c "stroy children_l"
> dt (j:1 t:1): 000000000000047472/ 464 | 6f 77 5f 75 73 61 67 65 00 6e 75 6d 5f 76 66 00 "ow_usage num_vf "
> dt (j:1 t:1): 000000000000047488/ 480 | 73 63 72 61 74 63 68 00 50 49 44 54 59 50 45 5f "scratch PIDTYPE_"
> dt (j:1 t:1): 000000000000047504/ 496 | 4d 41 58 00 70 72 65 70 61 72 65 5f 77 72 69 74 "MAX prepare_writ"
> dt (j:1 t:1):
> dt (j:1 t:1):
> dt (j:1 t:1): Analyzing IOT Record Data: (Note: Block #'s are relative to start of record!)
> dt (j:1 t:1):
> dt (j:1 t:1): IOT block size: 512
> dt (j:1 t:1): Total number of blocks: 91 (47008 bytes)
> dt (j:1 t:1): Current IOT seed value: 0x01010101 (pass 1)
> dt (j:1 t:1): Range of corrupted blocks: 0 - 90
> dt (j:1 t:1): Length of corrupted blocks: 91 (46592 bytes)
> dt (j:1 t:1): Corrupted blocks file offset: 47008 (LBA 91)
> dt (j:1 t:1): Number of corrupted blocks: 91
> dt (j:1 t:1): Number of good blocks found: 0
> dt (j:1 t:1): Number of zero blocks found: 0
> dt (j:1 t:1):
> dt (j:1 t:1): Record #: 2
> dt (j:1 t:1): Starting Record Offset: 47008
> dt (j:1 t:1): Transfer Count: 47008 (0xb7a0)
> dt (j:1 t:1): Ending Record Offset: 94016
> dt (j:1 t:1): Relative Record Block Range: 91 - 182
> dt (j:1 t:1): Read Buffer Address: 0x3c596000
> dt (j:1 t:1): Pattern Base Address: 0x3c589000
> dt (j:1 t:1): Note: Incorrect data is marked with asterisk '*'
> dt (j:1 t:1):
> dt (j:1 t:1): Record Block: 0 (BAD data)
> dt (j:1 t:1): Record Block Offset: 47008 (LBA 91)
> dt (j:1 t:1): Record Buffer Index: 0 (0x0)
> dt (j:1 t:1): Expected Block Number: 91 (0x0000005b)
> [... snip: duplicate dt data dump and Read Statistics, identical to the copy quoted earlier in the thread ...]
> dt (j:1 t:1): File System Information:
> dt (j:1 t:1): Mounted from device: 192.168.0.105:/hs_test
> dt (j:1 t:1): Mounted on directory: /mnt/hs_test
> dt (j:1 t:1): Filesystem type: nfs4
> dt (j:1 t:1): Filesystem options: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,fatal_neterrors=none,proto=tcp,nconnect=16,port=20491,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.106,local_lock=none,addr=192.168.0.105
I haven't been able to reproduce a similar failure in my lab with
NFSv4.2 over RDMA on FDR InfiniBand. I've run dt 6-7 times, all
successful. Also, for shits and giggles, I tried the fsx-based subtests
in fstests; no new failures there either. The export is xfs on an NVMe
add-on card; the server uses direct I/O for READ and page cache for WRITE.

Notice the mount options for your test run: "proto=tcp" and
"nconnect=16". Even if your network fabric is RoCE, "proto=tcp" will
not use RDMA at all; it will use bog-standard TCP/IP on your ultra-fast
Ethernet network.

What should I try next? I can apply 2/2, add "nconnect", or move the
testing to my RoCE fabric after lunch and keep poking at it. Or I could
switch to TCP. Suggestions welcome.
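For an actual RDMA run, the client mount would need proto=rdma. The host, export, and mount point below are taken from the dt log above, and 20049 is the IANA-assigned default port for NFS over RDMA; adjust all of these for your setup:

```shell
# Remount using the RDMA transport instead of TCP.
umount /mnt/hs_test
mount -t nfs4 -o vers=4.2,proto=rdma,port=20049 \
	192.168.0.105:/hs_test /mnt/hs_test

# Verify which transport was actually negotiated:
grep hs_test /proc/mounts | tr ',' '\n' | grep '^proto='
```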
> [... snip: remainder of dt output, identical to the copy quoted earlier in the thread ...]
> ---
> fs/nfsd/vfs.c | 25 ++++++-------------------
> 1 file changed, 6 insertions(+), 19 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index f8975ee262b5c..762d745b1b15d 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1079,13 +1079,11 @@ struct nfsd_read_dio {
> loff_t end;
> unsigned long start_extra;
> unsigned long end_extra;
> - struct page *start_extra_page;
> };
>
> static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
> {
> memset(read_dio, 0, sizeof(*read_dio));
> - read_dio->start_extra_page = NULL;
> }
>
> #define NFSD_READ_DIO_MIN_KB (32 << 10)
> @@ -1121,9 +1119,8 @@ static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> /*
> * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
> - * to be DIO-aligned (this heuristic avoids excess work, like allocating
> - * start_extra_page, for smaller IO that can generally already perform
> - * well using buffered IO).
> + * to be DIO-aligned (this heuristic avoids excess work, for smaller IO
> + * that can generally already perform well using buffered IO).
> */
> if ((read_dio->start_extra || read_dio->end_extra) &&
> (len < NFSD_READ_DIO_MIN_KB)) {
> @@ -1131,15 +1128,6 @@ static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> return false;
> }
>
> - if (read_dio->start_extra) {
> - read_dio->start_extra_page = alloc_page(GFP_KERNEL);
> - if (WARN_ONCE(read_dio->start_extra_page == NULL,
> - "%s: Unable to allocate start_extra_page\n", __func__)) {
> - init_nfsd_read_dio(read_dio);
> - return false;
> - }
> - }
> -
> /* Show original offset and count, and how it was expanded for DIO */
> middle_end = read_dio->end - read_dio->end_extra;
> trace_nfsd_analyze_read_dio(rqstp, fhp, offset, len,
> @@ -1162,11 +1150,10 @@ static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
> if (!read_dio->start_extra && !read_dio->end_extra)
> return host_err;
>
> - /* If nfsd_analyze_read_dio() allocated a start_extra_page it must
> - * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
> +	/* If nfsd_analyze_read_dio() found a start_extra (front-pad) page is needed,
> +	 * it must be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
> */
> - if (read_dio->start_extra_page) {
> - __free_page(read_dio->start_extra_page);
> + if (read_dio->start_extra) {
> *rq_bvec_numpages -= 1;
> v = *rq_bvec_numpages;
> memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
> @@ -1276,7 +1263,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (read_dio.start_extra) {
> len = read_dio.start_extra;
> bvec_set_page(&rqstp->rq_bvec[v],
> - read_dio.start_extra_page,
> + *(rqstp->rq_next_page++),
> len, PAGE_SIZE - len);
> total -= len;
> ++v;
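For readers following along, the offset/length expansion that nfsd_analyze_read_dio() performs before issuing a misaligned DIO READ can be sketched in plain userspace C. The names below are illustrative stand-ins for the kernel structures, not the actual implementation:

```c
#include <assert.h>

/* Stand-in for struct nfsd_read_dio: an expanded READ whose start is
 * aligned down and end aligned up, with the pad byte counts recorded. */
struct read_dio_sketch {
	long long start;           /* aligned-down start offset */
	long long end;             /* aligned-up end offset */
	unsigned long start_extra; /* front-pad bytes added before offset */
	unsigned long end_extra;   /* back-pad bytes added after offset+len */
};

/* Expand [offset, offset+len) so both ends fall on 'align' boundaries
 * (align must be a power of two, e.g. the logical_block_size). */
static void expand_for_dio(long long offset, unsigned long len,
			   unsigned long align, struct read_dio_sketch *rd)
{
	rd->start = offset & ~(long long)(align - 1);
	rd->end = (long long)((offset + len + align - 1) &
			      ~(unsigned long long)(align - 1));
	rd->start_extra = (unsigned long)(offset - rd->start);
	rd->end_extra = (unsigned long)(rd->end - (offset + (long long)len));
}
```

An already-aligned READ yields zero pad bytes at both ends and is issued as-is; a misaligned one carries nonzero start_extra and/or end_extra that must be trimmed after the IO completes.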
--
Chuck Lever
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-09-02 15:56 ` Chuck Lever
@ 2025-09-02 17:59 ` Chuck Lever
2025-09-02 21:06 ` Mike Snitzer
0 siblings, 1 reply; 42+ messages in thread
From: Chuck Lever @ 2025-09-02 17:59 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 9/2/25 11:56 AM, Chuck Lever wrote:
> On 8/30/25 1:38 PM, Mike Snitzer wrote:
>> dt (j:1 t:1): File System Information:
>> dt (j:1 t:1): Mounted from device: 192.168.0.105:/hs_test
>> dt (j:1 t:1): Mounted on directory: /mnt/hs_test
>> dt (j:1 t:1): Filesystem type: nfs4
>> dt (j:1 t:1): Filesystem options: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,fatal_neterrors=none,proto=tcp,nconnect=16,port=20491,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.106,local_lock=none,addr=192.168.0.105
>
> I haven't been able to reproduce a similar failure in my lab with
> NFSv4.2 over RDMA with FDR InfiniBand. I've run dt 6-7 times, all
> successful. Also, for shit giggles, I tried the fsx-based subtests in
> fstests, no new failures there either. The export is xfs on an NVMe
> add-on card; server uses direct I/O for READ and page cache for WRITE.
>
> Notice the mount options for your test run: "proto=tcp" and
> "nconnect=16". Even if your network fabric is RoCE, "proto=tcp" will
> not use RDMA at all; it will use bog standard TCP/IP on your ultra
> fast Ethernet network.
>
> What should I try next? I can apply 2/2 or add "nconnect" or move the
> testing to my RoCE fabric after lunch and keep poking at it.
>
> Or, I could switch to TCP. Suggestions welcome.
The client is not sending any READ procedures/operations to the server.
The following is NFSv3 for clarity, but NFSv4.x results are similar:
nfsd-1669 [003] 1466.634816: svc_process:
addr=192.168.2.67 xid=0x7b2a6274 service=nfsd vers=3 proc=NULL
nfsd-1669 [003] 1466.635389: svc_process:
addr=192.168.2.67 xid=0x7d2a6274 service=nfsd vers=3 proc=FSINFO
nfsd-1669 [003] 1466.635420: svc_process:
addr=192.168.2.67 xid=0x7e2a6274 service=nfsd vers=3 proc=PATHCONF
nfsd-1669 [003] 1466.635451: svc_process:
addr=192.168.2.67 xid=0x7f2a6274 service=nfsd vers=3 proc=GETATTR
nfsd-1669 [003] 1466.635486: svc_process:
addr=192.168.2.67 xid=0x802a6274 service=nfsacl vers=3 proc=NULL
nfsd-1669 [003] 1466.635558: svc_process:
addr=192.168.2.67 xid=0x812a6274 service=nfsd vers=3 proc=FSINFO
nfsd-1669 [003] 1466.635585: svc_process:
addr=192.168.2.67 xid=0x822a6274 service=nfsd vers=3 proc=GETATTR
nfsd-1669 [003] 1470.029208: svc_process:
addr=192.168.2.67 xid=0x832a6274 service=nfsd vers=3 proc=ACCESS
nfsd-1669 [003] 1470.029255: svc_process:
addr=192.168.2.67 xid=0x842a6274 service=nfsd vers=3 proc=LOOKUP
nfsd-1669 [003] 1470.029296: svc_process:
addr=192.168.2.67 xid=0x852a6274 service=nfsd vers=3 proc=FSSTAT
nfsd-1669 [003] 1470.039715: svc_process:
addr=192.168.2.67 xid=0x862a6274 service=nfsacl vers=3 proc=GETACL
nfsd-1669 [003] 1470.039758: svc_process:
addr=192.168.2.67 xid=0x872a6274 service=nfsd vers=3 proc=CREATE
nfsd-1669 [003] 1470.040091: svc_process:
addr=192.168.2.67 xid=0x882a6274 service=nfsd vers=3 proc=WRITE
nfsd-1669 [003] 1470.040469: svc_process:
addr=192.168.2.67 xid=0x892a6274 service=nfsd vers=3 proc=GETATTR
nfsd-1669 [003] 1470.040503: svc_process:
addr=192.168.2.67 xid=0x8a2a6274 service=nfsd vers=3 proc=ACCESS
nfsd-1669 [003] 1470.041867: svc_process:
addr=192.168.2.67 xid=0x8b2a6274 service=nfsd vers=3 proc=FSSTAT
nfsd-1669 [003] 1470.042109: svc_process:
addr=192.168.2.67 xid=0x8c2a6274 service=nfsd vers=3 proc=REMOVE
So I'm probably missing some setting on the reproducer/client.
/mnt from klimt.ib.1015granger.net:/export/fast
Flags: rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,
fatal_neterrors=none,proto=rdma,port=20049,timeo=600,retrans=2,
sec=sys,mountaddr=192.168.2.55,mountvers=3,mountproto=tcp,
local_lock=none,addr=192.168.2.55
Linux morisot.1015granger.net 6.15.10-100.fc41.x86_64 #1 SMP
PREEMPT_DYNAMIC Fri Aug 15 14:55:12 UTC 2025 x86_64 GNU/Linux
--
Chuck Lever
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-09-02 17:59 ` Chuck Lever
@ 2025-09-02 21:06 ` Mike Snitzer
2025-09-02 21:16 ` Chuck Lever
0 siblings, 1 reply; 42+ messages in thread
From: Mike Snitzer @ 2025-09-02 21:06 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Tue, Sep 02, 2025 at 01:59:12PM -0400, Chuck Lever wrote:
> On 9/2/25 11:56 AM, Chuck Lever wrote:
> > On 8/30/25 1:38 PM, Mike Snitzer wrote:
>
> >> dt (j:1 t:1): File System Information:
> >> dt (j:1 t:1): Mounted from device: 192.168.0.105:/hs_test
> >> dt (j:1 t:1): Mounted on directory: /mnt/hs_test
> >> dt (j:1 t:1): Filesystem type: nfs4
> >> dt (j:1 t:1): Filesystem options: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,fatal_neterrors=none,proto=tcp,nconnect=16,port=20491,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.106,local_lock=none,addr=192.168.0.105
> >
> > I haven't been able to reproduce a similar failure in my lab with
> > NFSv4.2 over RDMA with FDR InfiniBand. I've run dt 6-7 times, all
> > successful. Also, for shit giggles, I tried the fsx-based subtests in
> > fstests, no new failures there either. The export is xfs on an NVMe
> > add-on card; server uses direct I/O for READ and page cache for WRITE.
> >
> > Notice the mount options for your test run: "proto=tcp" and
> > "nconnect=16". Even if your network fabric is RoCE, "proto=tcp" will
> > not use RDMA at all; it will use bog standard TCP/IP on your ultra
> > fast Ethernet network.
> >
> > What should I try next? I can apply 2/2 or add "nconnect" or move the
> > testing to my RoCE fabric after lunch and keep poking at it.
Hmm, I'll have to check with the Hammerspace performance team to
understand how RDMA is used when the client mount has proto=tcp.
Certainly surprising, thanks for noticing/reporting this aspect.
I also cannot reproduce on a normal tcp mount and testbed. This
frankenbeast of a fast "RDMA" network that is misconfigured to use
proto=tcp is the only testbed where I've seen this dt data mismatch.
> > Or, I could switch to TCP. Suggestions welcome.
>
> The client is not sending any READ procedures/operations to the server.
> The following is NFSv3 for clarity, but NFSv4.x results are similar:
>
> [... svc_process trace and mount details snipped; quoted upthread ...]
If you're using LOCALIO (client on server) that'd explain your not
seeing any READs coming over the wire to NFSD.
I've made sure to disable LOCALIO on my client, with:
echo N > /sys/module/nfs/parameters/localio_enabled
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-09-02 21:06 ` Mike Snitzer
@ 2025-09-02 21:16 ` Chuck Lever
2025-09-02 21:27 ` Mike Snitzer
0 siblings, 1 reply; 42+ messages in thread
From: Chuck Lever @ 2025-09-02 21:16 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 9/2/25 5:06 PM, Mike Snitzer wrote:
> On Tue, Sep 02, 2025 at 01:59:12PM -0400, Chuck Lever wrote:
>> [... quoted context snipped ...]
>
> Hmm, I'll have to check with the Hammerspace performance team to
> understand how RDMA used if the client mount has proto=tcp.
>
> Certainly surprising, thanks for noticing/reporting this aspect.
>
> I also cannot reproduce on a normal tcp mount and testbed. This
> frankenbeast of a fast "RDMA" network that is misconfigured to use
> proto=tcp is the only testbed where I've seen this dt data mismatch.
>
>>> Or, I could switch to TCP. Suggestions welcome.
>>
>> The client is not sending any READ procedures/operations to the server.
>> The following is NFSv3 for clarity, but NFSv4.x results are similar:
>>
>> [... svc_process trace and mount details snipped; quoted upthread ...]
>
> If you're using LOCALIO (client on server) that'd explain your not
> seeing any READs coming over the wire to NFSD.
>
> I've made sure to disable LOCALIO on my client, with:
> echo N > /sys/module/nfs/parameters/localio_enabled
I am testing with a physically separate client and server, so I believe
that LOCALIO is not in play. I do see WRITEs. And other workloads (in
particular "fsx -Z <fname>") show READ traffic and I'm getting the
new trace point to fire quite a bit, and it is showing misaligned
READ requests. So it has something to do with dt.
If I understand your two patches correctly, they are still pulling a
page from the end of rq_pages to do the initial pad page. That, I
think, is a working implementation, not the failing one.
EOD -- will continue tomorrow.
--
Chuck Lever
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-09-02 21:16 ` Chuck Lever
@ 2025-09-02 21:27 ` Mike Snitzer
2025-09-02 22:18 ` Mike Snitzer
2025-09-04 14:42 ` Mike Snitzer
0 siblings, 2 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-09-02 21:27 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Tue, Sep 02, 2025 at 05:16:10PM -0400, Chuck Lever wrote:
> On 9/2/25 5:06 PM, Mike Snitzer wrote:
> > On Tue, Sep 02, 2025 at 01:59:12PM -0400, Chuck Lever wrote:
> >> [... quoted context snipped ...]
> >
> > Hmm, I'll have to check with the Hammerspace performance team to
> > understand how RDMA used if the client mount has proto=tcp.
> >
> > Certainly surprising, thanks for noticing/reporting this aspect.
> >
> > I also cannot reproduce on a normal tcp mount and testbed. This
> > frankenbeast of a fast "RDMA" network that is misconfigured to use
> > proto=tcp is the only testbed where I've seen this dt data mismatch.
> >
> >>> Or, I could switch to TCP. Suggestions welcome.
> >>
> >> The client is not sending any READ procedures/operations to the server.
> >> The following is NFSv3 for clarity, but NFSv4.x results are similar:
> >>
> >> [... svc_process trace and mount details snipped; quoted upthread ...]
> >
> > If you're using LOCALIO (client on server) that'd explain your not
> > seeing any READs coming over the wire to NFSD.
> >
> > I've made sure to disable LOCALIO on my client, with:
> > echo N > /sys/module/nfs/parameters/localio_enabled
>
> I am testing with a physically separate client and server, so I believe
> that LOCALIO is not in play. I do see WRITEs. And other workloads (in
> particular "fsx -Z <fname>") show READ traffic and I'm getting the
> new trace point to fire quite a bit, and it is showing misaligned
> READ requests. So it has something to do with dt.
OK, yeah, I figured you weren't doing a loopback mount; that was the
only thing that came to mind for you not seeing READs as expected. I
haven't had any problems with dt not driving READs to NFSD...
You'll certainly need to see READs in order for NFSD's new misaligned
DIO READ handling to get tested.
> If I understand your two patches correctly, they are still pulling a
> page from the end of rq_pages to do the initial pad page. That, I
> think, is a working implementation, not the failing one.
Patch 1 removes the use of a separate page, instead using the very
first page of rq_pages as the "start_extra" (or "front_pad") page for
the misaligned DIO READ. And with that my dt testing fails with data
mismatch like I shared. So patch 1 is the failing implementation (for
me, on the "RDMA" system I'm testing on).
Patch 2 then switches to using a rq_pages page _after_ the memory that
would normally get used as the READ payload memory to service the
READ. So patch 2 is a working implementation.
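Either way the pad page is sourced, the post-IO cleanup is the same: the front-pad entry is dropped from the head of the bvec array, as the memmove in the quoted nfsd_complete_misaligned_read_dio() hunk does. A simplified stand-alone sketch of that shuffle, with a toy struct in place of struct bio_vec:

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for struct bio_vec; the real code operates on
 * rqstp->rq_bvec[]. */
struct bvec_sketch {
	void *page;
	unsigned int len;
	unsigned int offset;
};

/* Drop the front-pad entry (index 0) and shift the remaining entries
 * down, mirroring the memmove in nfsd_complete_misaligned_read_dio().
 * Returns the new entry count. */
static unsigned long drop_front_pad(struct bvec_sketch *bvec,
				    unsigned long nr)
{
	nr -= 1;
	memmove(bvec, bvec + 1, nr * sizeof(*bvec));
	return nr;
}
```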
> EOD -- will continue tomorrow.
Ack.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-09-02 21:27 ` Mike Snitzer
@ 2025-09-02 22:18 ` Mike Snitzer
2025-09-04 19:07 ` Chuck Lever
2025-09-04 14:42 ` Mike Snitzer
1 sibling, 1 reply; 42+ messages in thread
From: Mike Snitzer @ 2025-09-02 22:18 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Tue, Sep 02, 2025 at 05:27:11PM -0400, Mike Snitzer wrote:
> On Tue, Sep 02, 2025 at 05:16:10PM -0400, Chuck Lever wrote:
> > On 9/2/25 5:06 PM, Mike Snitzer wrote:
> > > On Tue, Sep 02, 2025 at 01:59:12PM -0400, Chuck Lever wrote:
> > > [... quoted context snipped; see upthread ...]
> > >
> > > If you're using LOCALIO (client on server) that'd explain your not
> > > seeing any READs coming over the wire to NFSD.
> > >
> > > I've made sure to disable LOCALIO on my client, with:
> > > echo N > /sys/module/nfs/parameters/localio_enabled
> >
> > I am testing with a physically separate client and server, so I believe
> > that LOCALIO is not in play. I do see WRITEs. And other workloads (in
> > particular "fsx -Z <fname>") show READ traffic and I'm getting the
> > new trace point to fire quite a bit, and it is showing misaligned
> > READ requests. So it has something to do with dt.
>
> OK, yeah I figured you weren't doing loopback mount, only thing that
> came to mind for you not seeing READ like expected. I haven't had any
> problems with dt not driving READs to NFSD...
>
> You'll certainly need to see READs in order for NFSD's new misaligned
> DIO READ handling to get tested.
>
> > If I understand your two patches correctly, they are still pulling a
> > page from the end of rq_pages to do the initial pad page. That, I
> > think, is a working implementation, not the failing one.
>
> Patch 1 removes the use of a separate page, instead using the very
> first page of rq_pages for the "start_extra" (or "front_pad) page for
> the misaligned DIO READ. And with that my dt testing fails with data
> mismatch like I shared. So patch 1 is failing implementation (for me
> on the "RDMA" system I'm testing on).
>
> Patch 2 then switches to using a rq_pages page _after_ the memory that
> would normally get used as the READ payload memory to service the
> READ. So patch 2 is a working implementation.
>
> > EOD -- will continue tomorrow.
>
> Ack.
>
The reason for proto=tcp is that I was mounting the Hammerspace
Anvil (metadata server) via 4.2 using tcp. And it is the layout that
the metadata server hands out that directs my 4.2 flexfiles client to
then access the DS over v3 using RDMA. My particular DS server in the
broader testbed has the following in /etc/nfs.conf:
[general]
[nfsrahead]
[exports]
[exportfs]
[gssd]
use-gss-proxy = 1
[lockd]
[exportd]
[mountd]
[nfsdcld]
[nfsdcltrack]
[nfsd]
rdma = y
rdma-port = 20049
threads = 576
vers4.0 = n
vers4.1 = n
[statd]
[sm-notify]
And if I instead mount with:
mount -o vers=3,proto=rdma,port=20049 192.168.0.106:/mnt/hs_nvme13 /test
And then re-run dt, I don't see any data mismatch:
dt (j:1 t:1): File System Information:
dt (j:1 t:1): Mounted from device: 192.168.0.106:/mnt/hs_nvme13
dt (j:1 t:1): Mounted on directory: /test
dt (j:1 t:1): Filesystem type: nfs
dt (j:1 t:1): Filesystem options: rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,fatal_neterrors=none,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.0.106,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.0.106
dt (j:1 t:1): Filesystem block size: 1048576
dt (j:1 t:1): Filesystem free space: 3812019404800 (3635425.000 Mbytes, 3550.220 Gbytes, 3.467 Tbytes)
dt (j:1 t:1): Filesystem total space: 3838875533312 (3661037.000 Mbytes, 3575.231 Gbytes, 3.491 Tbytes)
So... I think what this means is my "patch 1" _is_ a working
implementation. BUT, for some reason RDMA with pNFS flexfiles is
"unhappy".
It would seem I get to keep both pieces and need to sort out what's up
with pNFS flexfiles on this particular RDMA testbed.
I will post v9 of the NFSD DIRECT patchset with "patch 1" folded into
the misaligned READ patch (5) and some other small fixes/improvements
to the series, probably tomorrow morning.
Thanks,
Mike
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v8 3/7] NFSD: add io_cache_read controls to debugfs interface
2025-08-26 18:57 ` [PATCH v8 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
@ 2025-09-03 14:38 ` Chuck Lever
2025-09-03 15:07 ` Mike Snitzer
0 siblings, 1 reply; 42+ messages in thread
From: Chuck Lever @ 2025-09-03 14:38 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 8/26/25 2:57 PM, Mike Snitzer wrote:
> Add 'io_cache_read' to NFSD's debugfs interface so that any data
> read by NFSD will either be:
> - cached using page cache (NFSD_IO_BUFFERED=1)
> - cached but removed from the page cache upon completion
> (NFSD_IO_DONTCACHE=2).
> - not cached (NFSD_IO_DIRECT=3)
>
> io_cache_read may be set by writing to:
> /sys/kernel/debug/nfsd/io_cache_read
>
> If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
> advertised as supported by the underlying filesystem (e.g. XFS),
> otherwise all IO flagged with RWF_DONTCACHE will fail with
> -EOPNOTSUPP.
>
> If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
> relative to the underlying block device's logical_block_size. Also the
> memory buffer used to store the read must be aligned relative to the
> underlying block device's dma_alignment.
>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> ---
> fs/nfsd/debugfs.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++
> fs/nfsd/nfsd.h | 9 ++++++++
> fs/nfsd/vfs.c | 18 +++++++++++++++
> 3 files changed, 84 insertions(+)
>
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 84b0c8b559dc9..3cadd45868b48 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -27,11 +27,65 @@ static int nfsd_dsr_get(void *data, u64 *val)
> static int nfsd_dsr_set(void *data, u64 val)
> {
> nfsd_disable_splice_read = (val > 0) ? true : false;
> + if (!nfsd_disable_splice_read) {
> + /*
> + * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
> + * if splice_read is enabled.
> + */
> + nfsd_io_cache_read = NFSD_IO_BUFFERED;
> + }
> return 0;
> }
>
> DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
>
> +/*
> + * /sys/kernel/debug/nfsd/io_cache_read
> + *
> + * Contents:
> + * %1: NFS READ will use buffered IO
> + * %2: NFS READ will use dontcache (buffered IO w/ dropbehind)
> + * %3: NFS READ will use direct IO
> + *
> + * The default value of this setting is zero (UNSPECIFIED).
Hi Mike -
I can't remember why we have the UNSPECIFIED setting. IME a debug
file reflects the current setting, so if our default behavior is
"buffered" then the first "cat io_cache_read" should reflect that
rather than "I haven't been changed yet". This doesn't seem like the
usual semantics of a /sys/kernel/debug file.
For example, a user space application may want to read io_cache_read
to find out what the current behavior is. If it gets 0 (UNSPECIFIED)
then it has found out nothing.
However, if there is a good reason to keep UNSPECIFIED, then you
need to add a " %0: NFS READ uses the default behavior" to the
documenting comment for nfsd_io_cache_{read,write}.
My preference is to remove NFSD_IO_UNSPECIFIED from this patch
and her sister (4/7).
> + * This setting takes immediate effect for all NFS versions,
> + * all exports, and in all NFSD net namespaces.
> + */
> +
> +static int nfsd_io_cache_read_get(void *data, u64 *val)
> +{
> + *val = nfsd_io_cache_read;
> + return 0;
> +}
> +
> +static int nfsd_io_cache_read_set(void *data, u64 val)
> +{
> + int ret = 0;
> +
> + switch (val) {
> + case NFSD_IO_BUFFERED:
> + nfsd_io_cache_read = NFSD_IO_BUFFERED;
> + break;
> + case NFSD_IO_DONTCACHE:
> + case NFSD_IO_DIRECT:
> + /*
> + * Must disable splice_read when enabling
> + * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
> + */
> + nfsd_disable_splice_read = true;
> + nfsd_io_cache_read = val;
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> +
> + return ret;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
> + nfsd_io_cache_read_set, "%llu\n");
> +
> void nfsd_debugfs_exit(void)
> {
> debugfs_remove_recursive(nfsd_top_dir);
> @@ -44,4 +98,7 @@ void nfsd_debugfs_init(void)
>
> debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
> nfsd_top_dir, NULL, &nfsd_dsr_fops);
> +
> + debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
> + nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
> }
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 1cd0bed57bc2f..6ef799405145f 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -153,6 +153,15 @@ static inline void nfsd_debugfs_exit(void) {}
>
> extern bool nfsd_disable_splice_read __read_mostly;
>
> +enum {
> + NFSD_IO_UNSPECIFIED = 0,
> + NFSD_IO_BUFFERED,
> + NFSD_IO_DONTCACHE,
> + NFSD_IO_DIRECT,
> +};
> +
> +extern u64 nfsd_io_cache_read __read_mostly;
And then here, initialize nfsd_io_cache_read to reflect the default
behavior. That would be NFSD_IO_BUFFERED for now... then later we might
want to change it to NFSD_IO_DIRECT, for instance.
Same suggestion for 4/7.
> +
> extern int nfsd_max_blksize;
>
> static inline int nfsd_v4client(struct svc_rqst *rq)
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 79439ad93880a..8ea8b80097195 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -49,6 +49,7 @@
> #define NFSDDBG_FACILITY NFSDDBG_FILEOP
>
> bool nfsd_disable_splice_read __read_mostly;
> +u64 nfsd_io_cache_read __read_mostly;
>
> /**
> * nfserrno - Map Linux errnos to NFS errnos
> @@ -1099,6 +1100,23 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> size_t len;
>
> init_sync_kiocb(&kiocb, file);
> +
> + switch (nfsd_io_cache_read) {
> + case NFSD_IO_DIRECT:
> + /* Verify ondisk and memory DIO alignment */
> + if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
> + (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0) &&
> + (base & (nf->nf_dio_mem_align - 1)) == 0)
> + kiocb.ki_flags = IOCB_DIRECT;
> + break;
> + case NFSD_IO_DONTCACHE:
> + kiocb.ki_flags = IOCB_DONTCACHE;
> + fallthrough;
Nit: Make this "break;". This is brittle: if someone adds something to
the NFSD_IO_BUFFERED arm but happens to miss that the DONTCACHE arm
above it is "fallthrough" then we have a latent bug.
Same suggestion for 4/7.
> + case NFSD_IO_UNSPECIFIED:
> + case NFSD_IO_BUFFERED:
> + break;
> + }
> +
> kiocb.ki_pos = offset;
>
> v = 0;
--
Chuck Lever
* Re: [PATCH v8 3/7] NFSD: add io_cache_read controls to debugfs interface
2025-09-03 14:38 ` Chuck Lever
@ 2025-09-03 15:07 ` Mike Snitzer
2025-09-03 16:02 ` Mike Snitzer
0 siblings, 1 reply; 42+ messages in thread
From: Mike Snitzer @ 2025-09-03 15:07 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Wed, Sep 03, 2025 at 10:38:45AM -0400, Chuck Lever wrote:
> On 8/26/25 2:57 PM, Mike Snitzer wrote:
> > Add 'io_cache_read' to NFSD's debugfs interface so that any data
> > read by NFSD will either be:
> > - cached using page cache (NFSD_IO_BUFFERED=1)
> > - cached but removed from the page cache upon completion
> > (NFSD_IO_DONTCACHE=2).
> > - not cached (NFSD_IO_DIRECT=3)
> >
> > io_cache_read may be set by writing to:
> > /sys/kernel/debug/nfsd/io_cache_read
> >
> > If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
> > advertised as supported by the underlying filesystem (e.g. XFS),
> > otherwise all IO flagged with RWF_DONTCACHE will fail with
> > -EOPNOTSUPP.
> >
> > If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
> > relative to the underlying block device's logical_block_size. Also the
> > memory buffer used to store the read must be aligned relative to the
> > underlying block device's dma_alignment.
> >
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > ---
> > fs/nfsd/debugfs.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++
> > fs/nfsd/nfsd.h | 9 ++++++++
> > fs/nfsd/vfs.c | 18 +++++++++++++++
> > 3 files changed, 84 insertions(+)
> >
> > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > index 84b0c8b559dc9..3cadd45868b48 100644
> > --- a/fs/nfsd/debugfs.c
> > +++ b/fs/nfsd/debugfs.c
> > @@ -27,11 +27,65 @@ static int nfsd_dsr_get(void *data, u64 *val)
> > static int nfsd_dsr_set(void *data, u64 val)
> > {
> > nfsd_disable_splice_read = (val > 0) ? true : false;
> > + if (!nfsd_disable_splice_read) {
> > + /*
> > + * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
> > + * if splice_read is enabled.
> > + */
> > + nfsd_io_cache_read = NFSD_IO_BUFFERED;
> > + }
> > return 0;
> > }
> >
> > DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> >
> > +/*
> > + * /sys/kernel/debug/nfsd/io_cache_read
> > + *
> > + * Contents:
> > + * %1: NFS READ will use buffered IO
> > + * %2: NFS READ will use dontcache (buffered IO w/ dropbehind)
> > + * %3: NFS READ will use direct IO
> > + *
> > + * The default value of this setting is zero (UNSPECIFIED).
>
> Hi Mike -
>
> I can't remember why we have the UNSPECIFIED setting. IME a debug
> file reflects the current setting, so if our default behavior is
> "buffered" then the first "cat io_cache_read" should reflect that
> rather than "I haven't been changed yet". This doesn't seem like the
> usual semantics of a /sys/kernel/debug file.
Jeff had convincing justification for his request, from:
https://lore.kernel.org/linux-nfs/e5a0d1e435196c55acbdc491b43b6380cbef5599.camel@kernel.org/
"I think the default case should leave nfsd_io_cache_read alone and
return an error. If we add new values later, and someone tries to use
them on an old kernel, it's better to make that attempt error out.
Ditto for the write side controls."
> For example, a user space application may want to read io_cache_read
> to find out what the current behavior is. If it gets 0 (UNSPECIFIED)
> then it has found out nothing.
Right, that is a negative user experience until/unless the user is
informed. Having a Documentation file may ease that, though?
> However, if there is a good reason to keep UNSPECIFIED, then you
> need to add a " %0: NFS READ uses the default behavior" to the
> documenting comment for nfsd_io_cache_{read,write}.
>
> My preference is to remove NFSD_IO_UNSPECIFIED from this patch
> and her sister (4/7).
I don't have a strong preference, when I first implemented it I had it
how you'd prefer. But Jeff's kernel downgrade scenario still seems
like a prophetic nice catch.
Jeff, what is your thinking at this point?
> > + * This setting takes immediate effect for all NFS versions,
> > + * all exports, and in all NFSD net namespaces.
> > + */
> > +
> > +static int nfsd_io_cache_read_get(void *data, u64 *val)
> > +{
> > + *val = nfsd_io_cache_read;
> > + return 0;
> > +}
> > +
> > +static int nfsd_io_cache_read_set(void *data, u64 val)
> > +{
> > + int ret = 0;
> > +
> > + switch (val) {
> > + case NFSD_IO_BUFFERED:
> > + nfsd_io_cache_read = NFSD_IO_BUFFERED;
> > + break;
> > + case NFSD_IO_DONTCACHE:
> > + case NFSD_IO_DIRECT:
> > + /*
> > + * Must disable splice_read when enabling
> > + * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
> > + */
> > + nfsd_disable_splice_read = true;
> > + nfsd_io_cache_read = val;
> > + break;
> > + default:
> > + ret = -EINVAL;
> > + break;
> > + }
> > +
> > + return ret;
> > +}
> > +
> > +DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
> > + nfsd_io_cache_read_set, "%llu\n");
> > +
> > void nfsd_debugfs_exit(void)
> > {
> > debugfs_remove_recursive(nfsd_top_dir);
> > @@ -44,4 +98,7 @@ void nfsd_debugfs_init(void)
> >
> > debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
> > nfsd_top_dir, NULL, &nfsd_dsr_fops);
> > +
> > + debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
> > + nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
> > }
> > diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> > index 1cd0bed57bc2f..6ef799405145f 100644
> > --- a/fs/nfsd/nfsd.h
> > +++ b/fs/nfsd/nfsd.h
> > @@ -153,6 +153,15 @@ static inline void nfsd_debugfs_exit(void) {}
> >
> > extern bool nfsd_disable_splice_read __read_mostly;
> >
> > +enum {
> > + NFSD_IO_UNSPECIFIED = 0,
> > + NFSD_IO_BUFFERED,
> > + NFSD_IO_DONTCACHE,
> > + NFSD_IO_DIRECT,
> > +};
> > +
> > +extern u64 nfsd_io_cache_read __read_mostly;
>
> And then here, initialize nfsd_io_cache_read to reflect the default
> behavior. That would be NFSD_IO_BUFFERED for now... then later we might
> want to change it to NFSD_IO_DIRECT, for instance.
>
> Same suggestion for 4/7.
Ah ok, I can see the way forward to default to NFSD_IO_BUFFERED but
_not_ default to it when erroring (if the user specified some unknown
value).
I'll run with that (despite just asking Jeff's opinion above, I'm the
one who came up with the awkward UNSPECIFIED state when honoring
Jeff's early feedback).
> > +
> > extern int nfsd_max_blksize;
> >
> > static inline int nfsd_v4client(struct svc_rqst *rq)
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 79439ad93880a..8ea8b80097195 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -49,6 +49,7 @@
> > #define NFSDDBG_FACILITY NFSDDBG_FILEOP
> >
> > bool nfsd_disable_splice_read __read_mostly;
> > +u64 nfsd_io_cache_read __read_mostly;
> >
> > /**
> > * nfserrno - Map Linux errnos to NFS errnos
> > @@ -1099,6 +1100,23 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > size_t len;
> >
> > init_sync_kiocb(&kiocb, file);
> > +
> > + switch (nfsd_io_cache_read) {
> > + case NFSD_IO_DIRECT:
> > + /* Verify ondisk and memory DIO alignment */
> > + if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
> > + (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0) &&
> > + (base & (nf->nf_dio_mem_align - 1)) == 0)
> > + kiocb.ki_flags = IOCB_DIRECT;
> > + break;
> > + case NFSD_IO_DONTCACHE:
> > + kiocb.ki_flags = IOCB_DONTCACHE;
> > + fallthrough;
>
> Nit: Make this "break;". This is brittle: if someone adds something to
> the NFSD_IO_BUFFERED arm but happens to miss that the DONTCACHE arm
> above it is "fallthrough" then we have a latent bug.
>
> Same suggestion for 4/7.
Sure, but just FYI, the misaligned DIO WRITE patch does need
fallthrough from NFSD_IO_DONTCACHE to NFSD_IO_BUFFERED:
case NFSD_IO_DONTCACHE:
kiocb.ki_flags |= IOCB_DONTCACHE;
fallthrough;
case NFSD_IO_UNSPECIFIED:
case NFSD_IO_BUFFERED:
+ host_err = nfsd_issue_write_buffered(rqstp, file,
+ nvecs, cnt, &kiocb);
break;
}
But rather than preemptively lay the foundation for that in 4/7, I'll
just be explicit in the 6/7 patch.
* Re: [PATCH v8 3/7] NFSD: add io_cache_read controls to debugfs interface
2025-09-03 15:07 ` Mike Snitzer
@ 2025-09-03 16:02 ` Mike Snitzer
2025-09-03 16:12 ` Chuck Lever
0 siblings, 1 reply; 42+ messages in thread
From: Mike Snitzer @ 2025-09-03 16:02 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Wed, Sep 03, 2025 at 11:07:29AM -0400, Mike Snitzer wrote:
> On Wed, Sep 03, 2025 at 10:38:45AM -0400, Chuck Lever wrote:
> > > diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> > > index 1cd0bed57bc2f..6ef799405145f 100644
> > > --- a/fs/nfsd/nfsd.h
> > > +++ b/fs/nfsd/nfsd.h
> > > @@ -153,6 +153,15 @@ static inline void nfsd_debugfs_exit(void) {}
> > >
> > > extern bool nfsd_disable_splice_read __read_mostly;
> > >
> > > +enum {
> > > + NFSD_IO_UNSPECIFIED = 0,
> > > + NFSD_IO_BUFFERED,
> > > + NFSD_IO_DONTCACHE,
> > > + NFSD_IO_DIRECT,
> > > +};
> > > +
> > > +extern u64 nfsd_io_cache_read __read_mostly;
> >
> > And then here, initialize nfsd_io_cache_read to reflect the default
> > behavior. That would be NFSD_IO_BUFFERED for now... then later we might
> > want to change it to NFSD_IO_DIRECT, for instance.
> >
> > Same suggestion for 4/7.
>
> Ah ok, I can see the way forward to default to NFSD_IO_BUFFERED but
> _not_ default to it when erroring (if the user specified some unknown
> value).
>
> I'll run with that (despite just asking Jeff's opinion above, I'm the
> one who came up with the awkward UNSPECIFIED state when honoring
> Jeff's early feedback).
Here is the incremental diff (these changes will be folded into
appropriate patches in v9):
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 8878c3519b30c..173032a04cdec 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -43,11 +43,10 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
* /sys/kernel/debug/nfsd/io_cache_read
*
* Contents:
- * %1: NFS READ will use buffered IO
- * %2: NFS READ will use dontcache (buffered IO w/ dropbehind)
- * %3: NFS READ will use direct IO
+ * %0: NFS READ will use buffered IO
+ * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
+ * %2: NFS READ will use direct IO
*
- * The default value of this setting is zero (UNSPECIFIED).
* This setting takes immediate effect for all NFS versions,
* all exports, and in all NFSD net namespaces.
*/
@@ -90,11 +89,10 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
* /sys/kernel/debug/nfsd/io_cache_write
*
* Contents:
- * %1: NFS WRITE will use buffered IO
- * %2: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
- * %3: NFS WRITE will use direct IO
+ * %0: NFS WRITE will use buffered IO
+ * %1: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
+ * %2: NFS WRITE will use direct IO
*
- * The default value of this setting is zero (UNSPECIFIED).
* This setting takes immediate effect for all NFS versions,
* all exports, and in all NFSD net namespaces.
*/
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index fe935b4cda538..412a1e9a2a876 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -154,8 +154,7 @@ static inline void nfsd_debugfs_exit(void) {}
extern bool nfsd_disable_splice_read __read_mostly;
enum {
- NFSD_IO_UNSPECIFIED = 0,
- NFSD_IO_BUFFERED,
+ NFSD_IO_BUFFERED = 0,
NFSD_IO_DONTCACHE,
NFSD_IO_DIRECT,
};
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5e700a0d6b12e..403076443573f 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -50,8 +50,8 @@
#define NFSDDBG_FACILITY NFSDDBG_FILEOP
bool nfsd_disable_splice_read __read_mostly;
-u64 nfsd_io_cache_read __read_mostly;
-u64 nfsd_io_cache_write __read_mostly;
+u64 nfsd_io_cache_read __read_mostly = NFSD_IO_BUFFERED;
+u64 nfsd_io_cache_write __read_mostly = NFSD_IO_BUFFERED;
/**
* nfserrno - Map Linux errnos to NFS errnos
@@ -1272,8 +1272,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
break;
case NFSD_IO_DONTCACHE:
kiocb.ki_flags = IOCB_DONTCACHE;
- fallthrough;
- case NFSD_IO_UNSPECIFIED:
+ break;
case NFSD_IO_BUFFERED:
break;
}
@@ -1605,8 +1604,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
break;
case NFSD_IO_DONTCACHE:
kiocb.ki_flags |= IOCB_DONTCACHE;
- fallthrough;
- case NFSD_IO_UNSPECIFIED:
+ fallthrough; /* must call nfsd_issue_write_buffered */
case NFSD_IO_BUFFERED:
host_err = nfsd_issue_write_buffered(rqstp, file,
nvecs, cnt, &kiocb);
* Re: [PATCH v8 3/7] NFSD: add io_cache_read controls to debugfs interface
2025-09-03 16:02 ` Mike Snitzer
@ 2025-09-03 16:12 ` Chuck Lever
2025-09-03 16:50 ` Mike Snitzer
0 siblings, 1 reply; 42+ messages in thread
From: Chuck Lever @ 2025-09-03 16:12 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 9/3/25 12:02 PM, Mike Snitzer wrote:
> On Wed, Sep 03, 2025 at 11:07:29AM -0400, Mike Snitzer wrote:
>> On Wed, Sep 03, 2025 at 10:38:45AM -0400, Chuck Lever wrote:
>
>>>> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
>>>> index 1cd0bed57bc2f..6ef799405145f 100644
>>>> --- a/fs/nfsd/nfsd.h
>>>> +++ b/fs/nfsd/nfsd.h
>>>> @@ -153,6 +153,15 @@ static inline void nfsd_debugfs_exit(void) {}
>>>>
>>>> extern bool nfsd_disable_splice_read __read_mostly;
>>>>
>>>> +enum {
>>>> + NFSD_IO_UNSPECIFIED = 0,
>>>> + NFSD_IO_BUFFERED,
>>>> + NFSD_IO_DONTCACHE,
>>>> + NFSD_IO_DIRECT,
>>>> +};
>>>> +
>>>> +extern u64 nfsd_io_cache_read __read_mostly;
>>>
>>> And then here, initialize nfsd_io_cache_read to reflect the default
>>> behavior. That would be NFSD_IO_BUFFERED for now... then later we might
>>> want to change it to NFSD_IO_DIRECT, for instance.
>>>
>>> Same suggestion for 4/7.
>>
>> Ah ok, I can see the way forward to default to NFSD_IO_BUFFERED but
>> _not_ default to it when erroring (if the user specified some unknown
>> value).
>>
>> I'll run with that (despite just asking Jeff's opinion above, I'm the
>> one who came up with the awkward UNSPECIFIED state when honoring
>> Jeff's early feedback).
>
> Here is the incremental diff (these changes will be folded into
> appropriate patches in v9):
>
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 8878c3519b30c..173032a04cdec 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -43,11 +43,10 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> * /sys/kernel/debug/nfsd/io_cache_read
> *
> * Contents:
> - * %1: NFS READ will use buffered IO
> - * %2: NFS READ will use dontcache (buffered IO w/ dropbehind)
> - * %3: NFS READ will use direct IO
> + * %0: NFS READ will use buffered IO
> + * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> + * %2: NFS READ will use direct IO
> *
> - * The default value of this setting is zero (UNSPECIFIED).
> * This setting takes immediate effect for all NFS versions,
> * all exports, and in all NFSD net namespaces.
> */
> @@ -90,11 +89,10 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
> * /sys/kernel/debug/nfsd/io_cache_write
> *
> * Contents:
> - * %1: NFS WRITE will use buffered IO
> - * %2: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
> - * %3: NFS WRITE will use direct IO
> + * %0: NFS WRITE will use buffered IO
> + * %1: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
> + * %2: NFS WRITE will use direct IO
> *
> - * The default value of this setting is zero (UNSPECIFIED).
> * This setting takes immediate effect for all NFS versions,
> * all exports, and in all NFSD net namespaces.
> */
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index fe935b4cda538..412a1e9a2a876 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -154,8 +154,7 @@ static inline void nfsd_debugfs_exit(void) {}
> extern bool nfsd_disable_splice_read __read_mostly;
>
> enum {
> - NFSD_IO_UNSPECIFIED = 0,
> - NFSD_IO_BUFFERED,
> + NFSD_IO_BUFFERED = 0,
Thanks, this LGTM. Two additional remarks:
1. I think that the "= 0" is unneeded here because C enumerators always
start at 0.
2. I'm wondering if this enum definition should be moved to a uapi
header. Thoughts? This is experimental, and not a fixed API. So maybe
it needs to stay in fs/nfsd/nfsd.h.
(I'm probably going over ground that has already been covered.)
> NFSD_IO_DONTCACHE,
> NFSD_IO_DIRECT,
> };
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 5e700a0d6b12e..403076443573f 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -50,8 +50,8 @@
> #define NFSDDBG_FACILITY NFSDDBG_FILEOP
>
> bool nfsd_disable_splice_read __read_mostly;
> -u64 nfsd_io_cache_read __read_mostly;
> -u64 nfsd_io_cache_write __read_mostly;
> +u64 nfsd_io_cache_read __read_mostly = NFSD_IO_BUFFERED;
> +u64 nfsd_io_cache_write __read_mostly = NFSD_IO_BUFFERED;
>
> /**
> * nfserrno - Map Linux errnos to NFS errnos
> @@ -1272,8 +1272,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> break;
> case NFSD_IO_DONTCACHE:
> kiocb.ki_flags = IOCB_DONTCACHE;
> - fallthrough;
> - case NFSD_IO_UNSPECIFIED:
> + break;
> case NFSD_IO_BUFFERED:
> break;
> }
> @@ -1605,8 +1604,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> break;
> case NFSD_IO_DONTCACHE:
> kiocb.ki_flags |= IOCB_DONTCACHE;
> - fallthrough;
> - case NFSD_IO_UNSPECIFIED:
> + fallthrough; /* must call nfsd_issue_write_buffered */
Right. In this case, the NFSD_IO_BUFFERED arm is more than just a
"break;", so it's not as brittle as the nfsd_iter_read() switch
statement. The comment is helpful, though; I'm not suggesting a
change, just observing.
> case NFSD_IO_BUFFERED:
> host_err = nfsd_issue_write_buffered(rqstp, file,
> nvecs, cnt, &kiocb);
--
Chuck Lever
* Re: [PATCH v8 3/7] NFSD: add io_cache_read controls to debugfs interface
2025-09-03 16:12 ` Chuck Lever
@ 2025-09-03 16:50 ` Mike Snitzer
0 siblings, 0 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-09-03 16:50 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Wed, Sep 03, 2025 at 12:12:10PM -0400, Chuck Lever wrote:
> On 9/3/25 12:02 PM, Mike Snitzer wrote:
> > On Wed, Sep 03, 2025 at 11:07:29AM -0400, Mike Snitzer wrote:
> >> On Wed, Sep 03, 2025 at 10:38:45AM -0400, Chuck Lever wrote:
> >
> >>>> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> >>>> index 1cd0bed57bc2f..6ef799405145f 100644
> >>>> --- a/fs/nfsd/nfsd.h
> >>>> +++ b/fs/nfsd/nfsd.h
> >>>> @@ -153,6 +153,15 @@ static inline void nfsd_debugfs_exit(void) {}
> >>>>
> >>>> extern bool nfsd_disable_splice_read __read_mostly;
> >>>>
> >>>> +enum {
> >>>> + NFSD_IO_UNSPECIFIED = 0,
> >>>> + NFSD_IO_BUFFERED,
> >>>> + NFSD_IO_DONTCACHE,
> >>>> + NFSD_IO_DIRECT,
> >>>> +};
> >>>> +
> >>>> +extern u64 nfsd_io_cache_read __read_mostly;
> >>>
> >>> And then here, initialize nfsd_io_cache_read to reflect the default
> >>> behavior. That would be NFSD_IO_BUFFERED for now... then later we might
> >>> want to change it to NFSD_IO_DIRECT, for instance.
> >>>
> >>> Same suggestion for 4/7.
> >>
> >> Ah ok, I can see the way forward to default to NFSD_IO_BUFFERED but
> >> _not_ default to it when erroring (if the user specified some unknown
> >> value).
> >>
> >> I'll run with that (despite just asking Jeff's opinion above, I'm the
> >> one who came up with the awkward UNSPECIFIED state when honoring
> >> Jeff's early feedback).
> >
> > Here is the incremental diff (these changes will be folded into
> > appropriate patches in v9):
> >
> > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > index 8878c3519b30c..173032a04cdec 100644
> > --- a/fs/nfsd/debugfs.c
> > +++ b/fs/nfsd/debugfs.c
> > @@ -43,11 +43,10 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> > * /sys/kernel/debug/nfsd/io_cache_read
> > *
> > * Contents:
> > - * %1: NFS READ will use buffered IO
> > - * %2: NFS READ will use dontcache (buffered IO w/ dropbehind)
> > - * %3: NFS READ will use direct IO
> > + * %0: NFS READ will use buffered IO
> > + * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> > + * %2: NFS READ will use direct IO
> > *
> > - * The default value of this setting is zero (UNSPECIFIED).
> > * This setting takes immediate effect for all NFS versions,
> > * all exports, and in all NFSD net namespaces.
> > */
> > @@ -90,11 +89,10 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
> > * /sys/kernel/debug/nfsd/io_cache_write
> > *
> > * Contents:
> > - * %1: NFS WRITE will use buffered IO
> > - * %2: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
> > - * %3: NFS WRITE will use direct IO
> > + * %0: NFS WRITE will use buffered IO
> > + * %1: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
> > + * %2: NFS WRITE will use direct IO
> > *
> > - * The default value of this setting is zero (UNSPECIFIED).
> > * This setting takes immediate effect for all NFS versions,
> > * all exports, and in all NFSD net namespaces.
> > */
> > diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> > index fe935b4cda538..412a1e9a2a876 100644
> > --- a/fs/nfsd/nfsd.h
> > +++ b/fs/nfsd/nfsd.h
> > @@ -154,8 +154,7 @@ static inline void nfsd_debugfs_exit(void) {}
> > extern bool nfsd_disable_splice_read __read_mostly;
> >
> > enum {
> > - NFSD_IO_UNSPECIFIED = 0,
> > - NFSD_IO_BUFFERED,
> > + NFSD_IO_BUFFERED = 0,
>
> Thanks, this LGTM. Two additional remarks:
>
> 1. I think that the "= 0" is unneeded here because C enumerators always
> start at 0.
It does. I flip-flopped on removing the "= 0" and left it (my
thinking was it'd help dissuade others from inserting new enum values
at the beginning). But rather than do that I can just add a comment.
> 2. I'm wondering if this enum definition should be moved to a uapi
> header. Thoughts? This is experimental, and not a fixed API. So maybe
> it needs to stay in fs/nfsd/nfsd.h.
>
> (I'm probably going over ground that has already been covered.)
I don't think this aspect was covered; yes, my thinking was that this
is an experimental interface that isn't appropriate to expose in a
uapi header.
>
> > NFSD_IO_DONTCACHE,
> > NFSD_IO_DIRECT,
> > };
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 5e700a0d6b12e..403076443573f 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -50,8 +50,8 @@
> > #define NFSDDBG_FACILITY NFSDDBG_FILEOP
> >
> > bool nfsd_disable_splice_read __read_mostly;
> > -u64 nfsd_io_cache_read __read_mostly;
> > -u64 nfsd_io_cache_write __read_mostly;
> > +u64 nfsd_io_cache_read __read_mostly = NFSD_IO_BUFFERED;
> > +u64 nfsd_io_cache_write __read_mostly = NFSD_IO_BUFFERED;
> >
> > /**
> > * nfserrno - Map Linux errnos to NFS errnos
> > @@ -1272,8 +1272,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > break;
> > case NFSD_IO_DONTCACHE:
> > kiocb.ki_flags = IOCB_DONTCACHE;
> > - fallthrough;
> > - case NFSD_IO_UNSPECIFIED:
> > + break;
> > case NFSD_IO_BUFFERED:
> > break;
> > }
> > @@ -1605,8 +1604,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > break;
> > case NFSD_IO_DONTCACHE:
> > kiocb.ki_flags |= IOCB_DONTCACHE;
> > - fallthrough;
> > - case NFSD_IO_UNSPECIFIED:
> > + fallthrough; /* must call nfsd_issue_write_buffered */
>
> Right. In this case, the NFSD_IO_BUFFERED arm is more than just a
> "break;" so, not as brittle as the nfsd_iter_read() switch statement.
> The comment is helpful, though; I'm not suggesting a change, just
> observing.
Sure.
Thanks,
Mike
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-09-02 21:27 ` Mike Snitzer
2025-09-02 22:18 ` Mike Snitzer
@ 2025-09-04 14:42 ` Mike Snitzer
2025-09-04 15:12 ` Chuck Lever
2025-09-04 16:10 ` Chuck Lever
1 sibling, 2 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-09-04 14:42 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Tue, Sep 02, 2025 at 05:27:11PM -0400, Mike Snitzer wrote:
> On Tue, Sep 02, 2025 at 05:16:10PM -0400, Chuck Lever wrote:
> >
> > I am testing with a physically separate client and server, so I believe
> > that LOCALIO is not in play. I do see WRITEs. And other workloads (in
> > particular "fsx -Z <fname>") show READ traffic and I'm getting the
> > new trace point to fire quite a bit, and it is showing misaligned
> > READ requests. So it has something to do with dt.
>
> OK, yeah I figured you weren't doing loopback mount, only thing that
> came to mind for you not seeing READ like expected. I haven't had any
> problems with dt not driving READs to NFSD...
>
> You'll certainly need to see READs in order for NFSD's new misaligned
> DIO READ handling to get tested.
I was doing some additional testing of the v9 changes last night and
realized why you weren't seeing any READs come through to NFSD:
"flags=direct" must be added to the dt commandline. Otherwise it'll
use buffered IO at the client and the READ will be serviced by the
client's page cache.
But like I said in another reply: when I just use v3 and RDMA (without
the intermediary of flexfiles at the client) I'm not able to see the
data mismatch with dt...
So while it's unlikely: does adding "flags=direct" cause dt to fail
when NFSD handles the misaligned DIO READ?
Mike
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-09-04 14:42 ` Mike Snitzer
@ 2025-09-04 15:12 ` Chuck Lever
2025-09-04 16:10 ` Chuck Lever
1 sibling, 0 replies; 42+ messages in thread
From: Chuck Lever @ 2025-09-04 15:12 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 9/4/25 10:42 AM, Mike Snitzer wrote:
> On Tue, Sep 02, 2025 at 05:27:11PM -0400, Mike Snitzer wrote:
>> On Tue, Sep 02, 2025 at 05:16:10PM -0400, Chuck Lever wrote:
>>>
>>> I am testing with a physically separate client and server, so I believe
>>> that LOCALIO is not in play. I do see WRITEs. And other workloads (in
>>> particular "fsx -Z <fname>") show READ traffic and I'm getting the
>>> new trace point to fire quite a bit, and it is showing misaligned
>>> READ requests. So it has something to do with dt.
>>
>> OK, yeah I figured you weren't doing loopback mount, only thing that
>> came to mind for you not seeing READ like expected. I haven't had any
>> problems with dt not driving READs to NFSD...
>>
>> You'll certainly need to see READs in order for NFSD's new misaligned
>> DIO READ handling to get tested.
>
> I was doing some additional testing of the v9 changes last night and
> realized why you weren't seeing any READs come through to NFSD:
> "flags=direct" must be added to the dt commandline. Otherwise it'll
> use buffered IO at the client and the READ will be serviced by the
> client's page cache.
>
> But like I said in another reply: when I just use v3 and RDMA (without
> the intermediary of flexfiles at the client) I'm not able to see the
> data mismatch with dt...
>
> So while its unlikely: does adding "flags=direct" cause dt to fail
> when NFSD handles the misaligned DIO READ?
That makes sense. I will give that a shot and let you know.
--
Chuck Lever
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-09-04 14:42 ` Mike Snitzer
2025-09-04 15:12 ` Chuck Lever
@ 2025-09-04 16:10 ` Chuck Lever
2025-09-04 16:33 ` Mike Snitzer
1 sibling, 1 reply; 42+ messages in thread
From: Chuck Lever @ 2025-09-04 16:10 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 9/4/25 10:42 AM, Mike Snitzer wrote:
> On Tue, Sep 02, 2025 at 05:27:11PM -0400, Mike Snitzer wrote:
>> On Tue, Sep 02, 2025 at 05:16:10PM -0400, Chuck Lever wrote:
>>>
>>> I am testing with a physically separate client and server, so I believe
>>> that LOCALIO is not in play. I do see WRITEs. And other workloads (in
>>> particular "fsx -Z <fname>") show READ traffic and I'm getting the
>>> new trace point to fire quite a bit, and it is showing misaligned
>>> READ requests. So it has something to do with dt.
>>
>> OK, yeah I figured you weren't doing loopback mount, only thing that
>> came to mind for you not seeing READ like expected. I haven't had any
>> problems with dt not driving READs to NFSD...
>>
>> You'll certainly need to see READs in order for NFSD's new misaligned
>> DIO READ handling to get tested.
>
> I was doing some additional testing of the v9 changes last night and
> realized why you weren't seeing any READs come through to NFSD:
> "flags=direct" must be added to the dt commandline. Otherwise it'll
> use buffered IO at the client and the READ will be serviced by the
> client's page cache.
>
> But like I said in another reply: when I just use v3 and RDMA (without
> the intermediary of flexfiles at the client) I'm not able to see the
> data mismatch with dt...
>
> So while its unlikely: does adding "flags=direct" cause dt to fail
> when NFSD handles the misaligned DIO READ?
Applied v9.
Multiple successful runs, no failures after adding "flags=direct".
Some excerpts from the last run show the server is seeing NFS
READs now:
Filesystem options:
rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,
fatal_neterrors=none,proto=rdma,port=20049,timeo=600,retrans=2,
sec=sys,mountaddr=192.168.2.55,mountvers=3,mountproto=tcp,
local_lock=none,addr=192.168.2.55
nfsd-1342 [004] 463.832928: nfsd_analyze_read_dio: xid=0x89784d89
fh_hash=0x024204eb offset=0 len=47008 start=0+0 middle=0+47008 end=47008+96
nfsd-1342 [004] 463.833105: nfsd_analyze_read_dio: xid=0x8a784d89
fh_hash=0x024204eb offset=47008 len=47008 start=46592+416
middle=47008+47008 end=94016+192
nfsd-1342 [004] 463.833185: nfsd_analyze_read_dio: xid=0x8b784d89
fh_hash=0x024204eb offset=94016 len=47008 start=93696+320
middle=94016+47008 end=141024+288
--
Chuck Lever
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-09-04 16:10 ` Chuck Lever
@ 2025-09-04 16:33 ` Mike Snitzer
2025-09-04 17:54 ` Chuck Lever
0 siblings, 1 reply; 42+ messages in thread
From: Mike Snitzer @ 2025-09-04 16:33 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Thu, Sep 04, 2025 at 12:10:00PM -0400, Chuck Lever wrote:
> On 9/4/25 10:42 AM, Mike Snitzer wrote:
> > On Tue, Sep 02, 2025 at 05:27:11PM -0400, Mike Snitzer wrote:
> >> On Tue, Sep 02, 2025 at 05:16:10PM -0400, Chuck Lever wrote:
> >>>
> >>> I am testing with a physically separate client and server, so I believe
> >>> that LOCALIO is not in play. I do see WRITEs. And other workloads (in
> >>> particular "fsx -Z <fname>") show READ traffic and I'm getting the
> >>> new trace point to fire quite a bit, and it is showing misaligned
> >>> READ requests. So it has something to do with dt.
> >>
> >> OK, yeah I figured you weren't doing loopback mount, only thing that
> >> came to mind for you not seeing READ like expected. I haven't had any
> >> problems with dt not driving READs to NFSD...
> >>
> >> You'll certainly need to see READs in order for NFSD's new misaligned
> >> DIO READ handling to get tested.
> >
> > I was doing some additional testing of the v9 changes last night and
> > realized why you weren't seeing any READs come through to NFSD:
> > "flags=direct" must be added to the dt commandline. Otherwise it'll
> > use buffered IO at the client and the READ will be serviced by the
> > client's page cache.
> >
> > But like I said in another reply: when I just use v3 and RDMA (without
> > the intermediary of flexfiles at the client) I'm not able to see the
> > data mismatch with dt...
> >
> > So while its unlikely: does adding "flags=direct" cause dt to fail
> > when NFSD handles the misaligned DIO READ?
> Applied v9.
>
> Multiple successful runs, no failures after adding "flags=direct".
> Some excerpts from the last run show the server is seeing NFS
> READs now:
>
> Filesystem options:
> rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,
> fatal_neterrors=none,proto=rdma,port=20049,timeo=600,retrans=2,
> sec=sys,mountaddr=192.168.2.55,mountvers=3,mountproto=tcp,
> local_lock=none,addr=192.168.2.55
>
> nfsd-1342 [004] 463.832928: nfsd_analyze_read_dio: xid=0x89784d89
> fh_hash=0x024204eb offset=0 len=47008 start=0+0 middle=0+47008 end=47008+96
> nfsd-1342 [004] 463.833105: nfsd_analyze_read_dio: xid=0x8a784d89
> fh_hash=0x024204eb offset=47008 len=47008 start=46592+416
> middle=47008+47008 end=94016+192
> nfsd-1342 [004] 463.833185: nfsd_analyze_read_dio: xid=0x8b784d89
> fh_hash=0x024204eb offset=94016 len=47008 start=93696+320
> middle=94016+47008 end=141024+288
OK, thanks for testing!
So yeah, patch 9/9 of v9 does work around the problem relative to
flexfiles+RDMA (though the patch header should really be updated to add
"flags=direct" to the dt command line):
https://lore.kernel.org/linux-nfs/20250903205121.41380-10-snitzer@kernel.org/
Is it a tolerable intermediate workaround you'd be OK with? To be
clear, I'm continuing to work the problem (and will be discussing it
with Trond)... but it's a tricky one for sure.
Mike
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-09-04 16:33 ` Mike Snitzer
@ 2025-09-04 17:54 ` Chuck Lever
0 siblings, 0 replies; 42+ messages in thread
From: Chuck Lever @ 2025-09-04 17:54 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 9/4/25 12:33 PM, Mike Snitzer wrote:
> On Thu, Sep 04, 2025 at 12:10:00PM -0400, Chuck Lever wrote:
>> On 9/4/25 10:42 AM, Mike Snitzer wrote:
>>> On Tue, Sep 02, 2025 at 05:27:11PM -0400, Mike Snitzer wrote:
>>>> On Tue, Sep 02, 2025 at 05:16:10PM -0400, Chuck Lever wrote:
>>>>>
>>>>> I am testing with a physically separate client and server, so I believe
>>>>> that LOCALIO is not in play. I do see WRITEs. And other workloads (in
>>>>> particular "fsx -Z <fname>") show READ traffic and I'm getting the
>>>>> new trace point to fire quite a bit, and it is showing misaligned
>>>>> READ requests. So it has something to do with dt.
>>>>
>>>> OK, yeah I figured you weren't doing loopback mount, only thing that
>>>> came to mind for you not seeing READ like expected. I haven't had any
>>>> problems with dt not driving READs to NFSD...
>>>>
>>>> You'll certainly need to see READs in order for NFSD's new misaligned
>>>> DIO READ handling to get tested.
>>>
>>> I was doing some additional testing of the v9 changes last night and
>>> realized why you weren't seeing any READs come through to NFSD:
>>> "flags=direct" must be added to the dt commandline. Otherwise it'll
>>> use buffered IO at the client and the READ will be serviced by the
>>> client's page cache.
>>>
>>> But like I said in another reply: when I just use v3 and RDMA (without
>>> the intermediary of flexfiles at the client) I'm not able to see the
>>> data mismatch with dt...
>>>
>>> So while its unlikely: does adding "flags=direct" cause dt to fail
>>> when NFSD handles the misaligned DIO READ?
>> Applied v9.
>>
>> Multiple successful runs, no failures after adding "flags=direct".
>> Some excerpts from the last run show the server is seeing NFS
>> READs now:
>>
>> Filesystem options:
>> rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,
>> fatal_neterrors=none,proto=rdma,port=20049,timeo=600,retrans=2,
>> sec=sys,mountaddr=192.168.2.55,mountvers=3,mountproto=tcp,
>> local_lock=none,addr=192.168.2.55
>>
>> nfsd-1342 [004] 463.832928: nfsd_analyze_read_dio: xid=0x89784d89
>> fh_hash=0x024204eb offset=0 len=47008 start=0+0 middle=0+47008 end=47008+96
>> nfsd-1342 [004] 463.833105: nfsd_analyze_read_dio: xid=0x8a784d89
>> fh_hash=0x024204eb offset=47008 len=47008 start=46592+416
>> middle=47008+47008 end=94016+192
>> nfsd-1342 [004] 463.833185: nfsd_analyze_read_dio: xid=0x8b784d89
>> fh_hash=0x024204eb offset=94016 len=47008 start=93696+320
>> middle=94016+47008 end=141024+288
>
> OK, thanks for testing!
>
> So yeah, patch 9/9 of v9 does workaround the problem relative to
> flexfiles+RDMA (though patch header should really be updated to add
> "flags=direct" to the dt command line):
> https://lore.kernel.org/linux-nfs/20250903205121.41380-10-snitzer@kernel.org/
>
> Is it a tolerable intermediate workaround you'd be OK with? To be
> clear, I'm continuing to work the problem (and will be discussing it
> with Trond)... but its a tricky one for sure.
1/9 through 4/9 are merge-ready. Though I'm thinking maybe the DIRECT
support should remain "ENOTSUPP" for the moment -- just add DONTCACHE
and BUFFERED for now.
For 5/9, I would like to continue improving that code. It will be easier
and less risky if we do that before there are non-developer users of
that code (ie, done before it is merged). I will spend some time on it
to give some detailed feedback.
6/9, as we've discussed, is risky until we can gain more confidence that
managing the unaligned ends via a buffered write is not going to result
in corruption. So, not merge-ready.
7/9: I think we need to be smarter about the trace points. There are
some exceptions (like where NFSD_IO_DIRECT is turned off for an I/O)
that need either a trace point or a counter. The code paths are likely
to change anyway as they are polished. So, I don't plan to merge at this
time.
8/9 will need to be rewritten as the code evolves. We can wait to merge
that.
9/9: I would rather wait for thorough root cause analysis. It doesn't
make sense to me that picking the end page rather than the first page
should make any difference at all. I like to have a little more meat on
the rationale bone before merging fixes.
And whatever is found, it needs to be squashed into 5/9.
The "dt" reproducer is very low profile -- less than 20 operations on
the wire for the non-pNFS case. IMO grabbing a network capture (on
RoCE) would be helpful.
--
Chuck Lever
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-09-02 22:18 ` Mike Snitzer
@ 2025-09-04 19:07 ` Chuck Lever
2025-09-04 21:00 ` Mike Snitzer
0 siblings, 1 reply; 42+ messages in thread
From: Chuck Lever @ 2025-09-04 19:07 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, Jeff Layton
On 9/2/25 6:18 PM, Mike Snitzer wrote:
> On Tue, Sep 02, 2025 at 05:27:11PM -0400, Mike Snitzer wrote:
>> On Tue, Sep 02, 2025 at 05:16:10PM -0400, Chuck Lever wrote:
>>> On 9/2/25 5:06 PM, Mike Snitzer wrote:
>>>> On Tue, Sep 02, 2025 at 01:59:12PM -0400, Chuck Lever wrote:
>>>>> On 9/2/25 11:56 AM, Chuck Lever wrote:
>>>>>> On 8/30/25 1:38 PM, Mike Snitzer wrote:
>>>>>
>>>>>>> dt (j:1 t:1): File System Information:
>>>>>>> dt (j:1 t:1): Mounted from device: 192.168.0.105:/hs_test
>>>>>>> dt (j:1 t:1): Mounted on directory: /mnt/hs_test
>>>>>>> dt (j:1 t:1): Filesystem type: nfs4
>>>>>>> dt (j:1 t:1): Filesystem options: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,fatal_neterrors=none,proto=tcp,nconnect=16,port=20491,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.106,local_lock=none,addr=192.168.0.105
>>>>>>
>>>>>> I haven't been able to reproduce a similar failure in my lab with
>>>>>> NFSv4.2 over RDMA with FDR InfiniBand. I've run dt 6-7 times, all
>>>>>> successful. Also, for shit giggles, I tried the fsx-based subtests in
>>>>>> fstests, no new failures there either. The export is xfs on an NVMe
>>>>>> add-on card; server uses direct I/O for READ and page cache for WRITE.
>>>>>>
>>>>>> Notice the mount options for your test run: "proto=tcp" and
>>>>>> "nconnect=16". Even if your network fabric is RoCE, "proto=tcp" will
>>>>>> not use RDMA at all; it will use bog standard TCP/IP on your ultra
>>>>>> fast Ethernet network.
>>>>>>
>>>>>> What should I try next? I can apply 2/2 or add "nconnect" or move the
>>>>>> testing to my RoCE fabric after lunch and keep poking at it.
>>>>
>>>> Hmm, I'll have to check with the Hammerspace performance team to
>>>> understand how RDMA used if the client mount has proto=tcp.
>>>>
>>>> Certainly surprising, thanks for noticing/reporting this aspect.
>>>>
>>>> I also cannot reproduce on a normal tcp mount and testbed. This
>>>> frankenbeast of a fast "RDMA" network that is misconfigured to use
>>>> proto=tcp is the only testbed where I've seen this dt data mismatch.
>>>>
>>>>>> Or, I could switch to TCP. Suggestions welcome.
>>>>>
>>>>> The client is not sending any READ procedures/operations to the server.
>>>>> The following is NFSv3 for clarity, but NFSv4.x results are similar:
>>>>>
>>>>> nfsd-1669 [003] 1466.634816: svc_process:
>>>>> addr=192.168.2.67 xid=0x7b2a6274 service=nfsd vers=3 proc=NULL
>>>>> nfsd-1669 [003] 1466.635389: svc_process:
>>>>> addr=192.168.2.67 xid=0x7d2a6274 service=nfsd vers=3 proc=FSINFO
>>>>> nfsd-1669 [003] 1466.635420: svc_process:
>>>>> addr=192.168.2.67 xid=0x7e2a6274 service=nfsd vers=3 proc=PATHCONF
>>>>> nfsd-1669 [003] 1466.635451: svc_process:
>>>>> addr=192.168.2.67 xid=0x7f2a6274 service=nfsd vers=3 proc=GETATTR
>>>>> nfsd-1669 [003] 1466.635486: svc_process:
>>>>> addr=192.168.2.67 xid=0x802a6274 service=nfsacl vers=3 proc=NULL
>>>>> nfsd-1669 [003] 1466.635558: svc_process:
>>>>> addr=192.168.2.67 xid=0x812a6274 service=nfsd vers=3 proc=FSINFO
>>>>> nfsd-1669 [003] 1466.635585: svc_process:
>>>>> addr=192.168.2.67 xid=0x822a6274 service=nfsd vers=3 proc=GETATTR
>>>>> nfsd-1669 [003] 1470.029208: svc_process:
>>>>> addr=192.168.2.67 xid=0x832a6274 service=nfsd vers=3 proc=ACCESS
>>>>> nfsd-1669 [003] 1470.029255: svc_process:
>>>>> addr=192.168.2.67 xid=0x842a6274 service=nfsd vers=3 proc=LOOKUP
>>>>> nfsd-1669 [003] 1470.029296: svc_process:
>>>>> addr=192.168.2.67 xid=0x852a6274 service=nfsd vers=3 proc=FSSTAT
>>>>> nfsd-1669 [003] 1470.039715: svc_process:
>>>>> addr=192.168.2.67 xid=0x862a6274 service=nfsacl vers=3 proc=GETACL
>>>>> nfsd-1669 [003] 1470.039758: svc_process:
>>>>> addr=192.168.2.67 xid=0x872a6274 service=nfsd vers=3 proc=CREATE
>>>>> nfsd-1669 [003] 1470.040091: svc_process:
>>>>> addr=192.168.2.67 xid=0x882a6274 service=nfsd vers=3 proc=WRITE
>>>>> nfsd-1669 [003] 1470.040469: svc_process:
>>>>> addr=192.168.2.67 xid=0x892a6274 service=nfsd vers=3 proc=GETATTR
>>>>> nfsd-1669 [003] 1470.040503: svc_process:
>>>>> addr=192.168.2.67 xid=0x8a2a6274 service=nfsd vers=3 proc=ACCESS
>>>>> nfsd-1669 [003] 1470.041867: svc_process:
>>>>> addr=192.168.2.67 xid=0x8b2a6274 service=nfsd vers=3 proc=FSSTAT
>>>>> nfsd-1669 [003] 1470.042109: svc_process:
>>>>> addr=192.168.2.67 xid=0x8c2a6274 service=nfsd vers=3 proc=REMOVE
>>>>>
>>>>> So I'm probably missing some setting on the reproducer/client.
>>>>>
>>>>> /mnt from klimt.ib.1015granger.net:/export/fast
>>>>> Flags: rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,
>>>>> fatal_neterrors=none,proto=rdma,port=20049,timeo=600,retrans=2,
>>>>> sec=sys,mountaddr=192.168.2.55,mountvers=3,mountproto=tcp,
>>>>> local_lock=none,addr=192.168.2.55
>>>>>
>>>>> Linux morisot.1015granger.net 6.15.10-100.fc41.x86_64 #1 SMP
>>>>> PREEMPT_DYNAMIC Fri Aug 15 14:55:12 UTC 2025 x86_64 GNU/Linux
>>>>
>>>> If you're using LOCALIO (client on server) that'd explain your not
>>>> seeing any READs coming over the wire to NFSD.
>>>>
>>>> I've made sure to disable LOCALIO on my client, with:
>>>> echo N > /sys/module/nfs/parameters/localio_enabled
>>>
>>> I am testing with a physically separate client and server, so I believe
>>> that LOCALIO is not in play. I do see WRITEs. And other workloads (in
>>> particular "fsx -Z <fname>") show READ traffic and I'm getting the
>>> new trace point to fire quite a bit, and it is showing misaligned
>>> READ requests. So it has something to do with dt.
>>
>> OK, yeah I figured you weren't doing loopback mount, only thing that
>> came to mind for you not seeing READ like expected. I haven't had any
>> problems with dt not driving READs to NFSD...
>>
>> You'll certainly need to see READs in order for NFSD's new misaligned
>> DIO READ handling to get tested.
>>
>>> If I understand your two patches correctly, they are still pulling a
>>> page from the end of rq_pages to do the initial pad page. That, I
>>> think, is a working implementation, not the failing one.
>>
>> Patch 1 removes the use of a separate page, instead using the very
>> first page of rq_pages for the "start_extra" (or "front_pad) page for
>> the misaligned DIO READ. And with that my dt testing fails with data
>> mismatch like I shared. So patch 1 is failing implementation (for me
>> on the "RDMA" system I'm testing on).
>>
>> Patch 2 then switches to using a rq_pages page _after_ the memory that
>> would normally get used as the READ payload memory to service the
>> READ. So patch 2 is a working implementation.
>>
>>> EOD -- will continue tomorrow.
>>
>> Ack.
>>
>
> The reason for proto=tcp is that I was mounting the Hammerspace
> Anvil (metadata server) via 4.2 using tcp. And it is the layout that
> the metadata server hands out that directs my 4.2 flexfiles client to
> then access the DS over v3 using RDMA. My particular DS server in the
> broader testbed has the following in /etc/nfs.conf:
>
> [general]
>
> [nfsrahead]
>
> [exports]
>
> [exportfs]
>
> [gssd]
> use-gss-proxy = 1
>
> [lockd]
>
> [exportd]
>
> [mountd]
>
> [nfsdcld]
>
> [nfsdcltrack]
>
> [nfsd]
> rdma = y
> rdma-port = 20049
> threads = 576
> vers4.0 = n
> vers4.1 = n
>
> [statd]
>
> [sm-notify]
>
> And if I instead mount with:
>
> mount -o vers=3,proto=rdma,port=20049 192.168.0.106:/mnt/hs_nvme13 /test
>
> And then re-run dt, I don't see any data mismatch:
I'm beginning to suspect that NFSv3 isn't the interesting case. For
NFSv3 READs, nfsd_iter_read() is always called with @base == 0.
NFSv4 READs, on the other hand, set @base to whatever is the current
end of the send buffer's .pages array. The checks in
nfsd_analyze_read_dio() might reject the use of direct I/O, or it
might be that the code is setting up the alignment of the read buffer
incorrectly.
> dt (j:1 t:1): File System Information:
> dt (j:1 t:1): Mounted from device: 192.168.0.106:/mnt/hs_nvme13
> dt (j:1 t:1): Mounted on directory: /test
> dt (j:1 t:1): Filesystem type: nfs
> dt (j:1 t:1): Filesystem options: rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,fatal_neterrors=none,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.0.106,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.0.106
> dt (j:1 t:1): Filesystem block size: 1048576
> dt (j:1 t:1): Filesystem free space: 3812019404800 (3635425.000 Mbytes, 3550.220 Gbytes, 3.467 Tbytes)
> dt (j:1 t:1): Filesystem total space: 3838875533312 (3661037.000 Mbytes, 3575.231 Gbytes, 3.491 Tbytes)
>
> So... I think what this means is my "patch 1" _is_ a working
> implementation. BUT, for some reason RDMA with pnfs flexfiles is
> "unhappy".
>
> Would seem I get to keep both pieces and need to sort out what's up
> with pNFS flexfiles on this particular RDMA testbed.
>
> I will post v9 of the NFSD DIRECT patchset with "patch 1" folded in to
> the misaligned READ patch (5) and some other small fixes/improvements
> to the series, probably tomorrow morning.
>
> Thanks,
> Mike
--
Chuck Lever
* Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?
2025-09-04 19:07 ` Chuck Lever
@ 2025-09-04 21:00 ` Mike Snitzer
0 siblings, 0 replies; 42+ messages in thread
From: Mike Snitzer @ 2025-09-04 21:00 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, Jeff Layton
On Thu, Sep 04, 2025 at 03:07:30PM -0400, Chuck Lever wrote:
> On 9/2/25 6:18 PM, Mike Snitzer wrote:
> > On Tue, Sep 02, 2025 at 05:27:11PM -0400, Mike Snitzer wrote:
> >> On Tue, Sep 02, 2025 at 05:16:10PM -0400, Chuck Lever wrote:
> >>> On 9/2/25 5:06 PM, Mike Snitzer wrote:
> >>>> On Tue, Sep 02, 2025 at 01:59:12PM -0400, Chuck Lever wrote:
> >>>>> On 9/2/25 11:56 AM, Chuck Lever wrote:
> >>>>>> On 8/30/25 1:38 PM, Mike Snitzer wrote:
> >>>>>
> >>>>>>> dt (j:1 t:1): File System Information:
> >>>>>>> dt (j:1 t:1): Mounted from device: 192.168.0.105:/hs_test
> >>>>>>> dt (j:1 t:1): Mounted on directory: /mnt/hs_test
> >>>>>>> dt (j:1 t:1): Filesystem type: nfs4
> >>>>>>> dt (j:1 t:1): Filesystem options: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,fatal_neterrors=none,proto=tcp,nconnect=16,port=20491,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.106,local_lock=none,addr=192.168.0.105
> >>>>>>
> >>>>>> I haven't been able to reproduce a similar failure in my lab with
> >>>>>> NFSv4.2 over RDMA with FDR InfiniBand. I've run dt 6-7 times, all
> >>>>>> successful. Also, for shit giggles, I tried the fsx-based subtests in
> >>>>>> fstests, no new failures there either. The export is xfs on an NVMe
> >>>>>> add-on card; server uses direct I/O for READ and page cache for WRITE.
> >>>>>>
> >>>>>> Notice the mount options for your test run: "proto=tcp" and
> >>>>>> "nconnect=16". Even if your network fabric is RoCE, "proto=tcp" will
> >>>>>> not use RDMA at all; it will use bog standard TCP/IP on your ultra
> >>>>>> fast Ethernet network.
> >>>>>>
> >>>>>> What should I try next? I can apply 2/2 or add "nconnect" or move the
> >>>>>> testing to my RoCE fabric after lunch and keep poking at it.
> >>>>
> >>>> Hmm, I'll have to check with the Hammerspace performance team to
> >>>> understand how RDMA used if the client mount has proto=tcp.
> >>>>
> >>>> Certainly surprising, thanks for noticing/reporting this aspect.
> >>>>
> >>>> I also cannot reproduce on a normal tcp mount and testbed. This
> >>>> frankenbeast of a fast "RDMA" network that is misconfigured to use
> >>>> proto=tcp is the only testbed where I've seen this dt data mismatch.
> >>>>
> >>>>>> Or, I could switch to TCP. Suggestions welcome.
> >>>>>
> >>>>> The client is not sending any READ procedures/operations to the server.
> >>>>> The following is NFSv3 for clarity, but NFSv4.x results are similar:
> >>>>>
> >>>>> nfsd-1669 [003] 1466.634816: svc_process:
> >>>>> addr=192.168.2.67 xid=0x7b2a6274 service=nfsd vers=3 proc=NULL
> >>>>> nfsd-1669 [003] 1466.635389: svc_process:
> >>>>> addr=192.168.2.67 xid=0x7d2a6274 service=nfsd vers=3 proc=FSINFO
> >>>>> nfsd-1669 [003] 1466.635420: svc_process:
> >>>>> addr=192.168.2.67 xid=0x7e2a6274 service=nfsd vers=3 proc=PATHCONF
> >>>>> nfsd-1669 [003] 1466.635451: svc_process:
> >>>>> addr=192.168.2.67 xid=0x7f2a6274 service=nfsd vers=3 proc=GETATTR
> >>>>> nfsd-1669 [003] 1466.635486: svc_process:
> >>>>> addr=192.168.2.67 xid=0x802a6274 service=nfsacl vers=3 proc=NULL
> >>>>> nfsd-1669 [003] 1466.635558: svc_process:
> >>>>> addr=192.168.2.67 xid=0x812a6274 service=nfsd vers=3 proc=FSINFO
> >>>>> nfsd-1669 [003] 1466.635585: svc_process:
> >>>>> addr=192.168.2.67 xid=0x822a6274 service=nfsd vers=3 proc=GETATTR
> >>>>> nfsd-1669 [003] 1470.029208: svc_process:
> >>>>> addr=192.168.2.67 xid=0x832a6274 service=nfsd vers=3 proc=ACCESS
> >>>>> nfsd-1669 [003] 1470.029255: svc_process:
> >>>>> addr=192.168.2.67 xid=0x842a6274 service=nfsd vers=3 proc=LOOKUP
> >>>>> nfsd-1669 [003] 1470.029296: svc_process:
> >>>>> addr=192.168.2.67 xid=0x852a6274 service=nfsd vers=3 proc=FSSTAT
> >>>>> nfsd-1669 [003] 1470.039715: svc_process:
> >>>>> addr=192.168.2.67 xid=0x862a6274 service=nfsacl vers=3 proc=GETACL
> >>>>> nfsd-1669 [003] 1470.039758: svc_process:
> >>>>> addr=192.168.2.67 xid=0x872a6274 service=nfsd vers=3 proc=CREATE
> >>>>> nfsd-1669 [003] 1470.040091: svc_process:
> >>>>> addr=192.168.2.67 xid=0x882a6274 service=nfsd vers=3 proc=WRITE
> >>>>> nfsd-1669 [003] 1470.040469: svc_process:
> >>>>> addr=192.168.2.67 xid=0x892a6274 service=nfsd vers=3 proc=GETATTR
> >>>>> nfsd-1669 [003] 1470.040503: svc_process:
> >>>>> addr=192.168.2.67 xid=0x8a2a6274 service=nfsd vers=3 proc=ACCESS
> >>>>> nfsd-1669 [003] 1470.041867: svc_process:
> >>>>> addr=192.168.2.67 xid=0x8b2a6274 service=nfsd vers=3 proc=FSSTAT
> >>>>> nfsd-1669 [003] 1470.042109: svc_process:
> >>>>> addr=192.168.2.67 xid=0x8c2a6274 service=nfsd vers=3 proc=REMOVE
> >>>>>
> >>>>> So I'm probably missing some setting on the reproducer/client.
> >>>>>
> >>>>> /mnt from klimt.ib.1015granger.net:/export/fast
> >>>>> Flags: rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,
> >>>>> fatal_neterrors=none,proto=rdma,port=20049,timeo=600,retrans=2,
> >>>>> sec=sys,mountaddr=192.168.2.55,mountvers=3,mountproto=tcp,
> >>>>> local_lock=none,addr=192.168.2.55
> >>>>>
> >>>>> Linux morisot.1015granger.net 6.15.10-100.fc41.x86_64 #1 SMP
> >>>>> PREEMPT_DYNAMIC Fri Aug 15 14:55:12 UTC 2025 x86_64 GNU/Linux
> >>>>
> >>>> If you're using LOCALIO (client on server) that'd explain your not
> >>>> seeing any READs coming over the wire to NFSD.
> >>>>
> >>>> I've made sure to disable LOCALIO on my client, with:
> >>>> echo N > /sys/module/nfs/parameters/localio_enabled
> >>>
> >>> I am testing with a physically separate client and server, so I believe
> >>> that LOCALIO is not in play. I do see WRITEs. And other workloads (in
> >>> particular "fsx -Z <fname>") show READ traffic and I'm getting the
> >>> new trace point to fire quite a bit, and it is showing misaligned
> >>> READ requests. So it has something to do with dt.
> >>
> >> OK, yeah I figured you weren't doing loopback mount, only thing that
> >> came to mind for you not seeing READ like expected. I haven't had any
> >> problems with dt not driving READs to NFSD...
> >>
> >> You'll certainly need to see READs in order for NFSD's new misaligned
> >> DIO READ handling to get tested.
> >>
> >>> If I understand your two patches correctly, they are still pulling a
> >>> page from the end of rq_pages to do the initial pad page. That, I
> >>> think, is a working implementation, not the failing one.
> >>
> >> Patch 1 removes the use of a separate page, instead using the very
> >> first page of rq_pages for the "start_extra" (or "front_pad) page for
> >> the misaligned DIO READ. And with that my dt testing fails with data
> >> mismatch like I shared. So patch 1 is failing implementation (for me
> >> on the "RDMA" system I'm testing on).
> >>
> >> Patch 2 then switches to using a rq_pages page _after_ the memory that
> >> would normally get used as the READ payload memory to service the
> >> READ. So patch 2 is a working implementation.
> >>
> >>> EOD -- will continue tomorrow.
> >>
> >> Ack.
> >>
> >
> > The reason for proto=tcp is that I was mounting the Hammerspace
> > Anvil (metadata server) via 4.2 using tcp. And it is the layout that
> > the metadata server hands out that directs my 4.2 flexfiles client to
> > then access the DS over v3 using RDMA. My particular DS server in the
> > broader testbed has the following in /etc/nfs.conf:
> >
> > [general]
> >
> > [nfsrahead]
> >
> > [exports]
> >
> > [exportfs]
> >
> > [gssd]
> > use-gss-proxy = 1
> >
> > [lockd]
> >
> > [exportd]
> >
> > [mountd]
> >
> > [nfsdcld]
> >
> > [nfsdcltrack]
> >
> > [nfsd]
> > rdma = y
> > rdma-port = 20049
> > threads = 576
> > vers4.0 = n
> > vers4.1 = n
> >
> > [statd]
> >
> > [sm-notify]
> >
> > And if I instead mount with:
> >
> > mount -o vers=3,proto=rdma,port=20049 192.168.0.106:/mnt/hs_nvme13 /test
> >
> > And then re-run dt, I don't see any data mismatch:
>
> I'm beginning to suspect that NFSv3 isn't the interesting case. For
> NFSv3 READs, nfsd_iter_read() is always called with @base == 0.
>
> NFSv4 READs, on the other hand, set @base to whatever is the current
> end of the send buffer's .pages array. The checks in
> nfsd_analyze_read_dio() might reject the use of direct I/O, or it
> might be that the code is setting up the alignment of the read buffer
> incorrectly.
That's a great point. I think it is the latter.
nfsd_analyze_read_dio() doesn't concern itself with @base. And after
nfsd_iter_read() calls nfsd_analyze_read_dio() the 'start_extra' page
is prepared assuming its base is 0:
if (read_dio.start_extra) {
len = read_dio.start_extra;
bvec_set_page(&rqstp->rq_bvec[v],
NULL, /* set below */
len, PAGE_SIZE - len);
It is the loop in nfsd_iter_read() that accounts for @base:
while (total) {
len = min_t(size_t, total, PAGE_SIZE - base);
bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
len, base);
total -= len;
++v;
base = 0;
}
Which explains why using one past the end "just works" with:
if ((kiocb.ki_flags & IOCB_DIRECT) && read_dio.start_extra)
rqstp->rq_bvec[0].bv_page = *(rqstp->rq_next_page++);
I purposely avoided having the start_extra page need to account for
@base because:
1) I thought it best to not have 2 places dealing with @base, which
requires the start_extra page to stand on its own.
2) otherwise it could cause start_extra's length to bleed into the
next page... which would cascade through all pages, and complicate
the accounting needed in nfsd_complete_misaligned_read_dio() to
unwind having expanded the misaligned DIO READ to be DIO aligned.
This all makes sense to me now... does it for you?
(nice catch on @base being the key... was sitting there in plain sight
the whole time)
Not loving the need for multiple checks for read_dio.start_extra
in nfsd_iter_read() (with both of v9's 8/9 and 9/9 applied). But it
feels better than getting more fiddly elsewhere.
So replacing v9's 9/9 by folding this incremental into 5/9 would be
preferable for me (but I know you want to reassess/rework 5/9 anyway,
so I'll defer to you at this point. No matter what: THANKS!):
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5b3c6072b6f5c..43bbd8f3e39bd 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1263,7 +1263,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	if (read_dio.start_extra) {
 		len = read_dio.start_extra;
 		bvec_set_page(&rqstp->rq_bvec[v],
-			      *(rqstp->rq_next_page++),
+			      NULL, /* set below */
 			      len, PAGE_SIZE - len);
 		total -= len;
 		++v;
@@ -1288,6 +1288,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		base = 0;
 	}
 	WARN_ON_ONCE(v > rqstp->rq_maxpages);
+	/* The start_extra page must come from the end of rq_pages[]
+	 * so that it can stand on its own and be easily dropped
+	 * by nfsd_complete_misaligned_read_dio().
+	 */
+	if ((kiocb.ki_flags & IOCB_DIRECT) && read_dio.start_extra)
+		rqstp->rq_bvec[0].bv_page = *(rqstp->rq_next_page++);
 	trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
 	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
Thread overview: 42+ messages
2025-08-26 18:57 [PATCH v8 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
2025-09-03 14:38 ` Chuck Lever
2025-09-03 15:07 ` Mike Snitzer
2025-09-03 16:02 ` Mike Snitzer
2025-09-03 16:12 ` Chuck Lever
2025-09-03 16:50 ` Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 4/7] NFSD: add io_cache_write " Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-08-27 15:34 ` Chuck Lever
2025-08-27 19:41 ` Mike Snitzer
2025-08-27 20:56 ` Chuck Lever
2025-08-27 23:15 ` Mike Snitzer
2025-08-28 1:57 ` Chuck Lever
2025-08-28 8:09 ` Mike Snitzer
2025-08-28 14:53 ` Chuck Lever
2025-08-28 18:52 ` Mike Snitzer
2025-08-30 17:38 ` [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned] Mike Snitzer
2025-08-30 17:38 ` [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug? Mike Snitzer
2025-09-02 14:04 ` Chuck Lever
2025-09-02 15:56 ` Chuck Lever
2025-09-02 17:59 ` Chuck Lever
2025-09-02 21:06 ` Mike Snitzer
2025-09-02 21:16 ` Chuck Lever
2025-09-02 21:27 ` Mike Snitzer
2025-09-02 22:18 ` Mike Snitzer
2025-09-04 19:07 ` Chuck Lever
2025-09-04 21:00 ` Mike Snitzer
2025-09-04 14:42 ` Mike Snitzer
2025-09-04 15:12 ` Chuck Lever
2025-09-04 16:10 ` Chuck Lever
2025-09-04 16:33 ` Mike Snitzer
2025-09-04 17:54 ` Chuck Lever
2025-08-30 17:38 ` [RFC PATCH 2/2] NFSD: use /end/ of rq_pages for front_pad page, simpler workaround for rpcrdma bug Mike Snitzer
2025-08-30 18:53 ` [RFC PATCH 0/2] some progress on rpcrdma bug [was: Re: [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned] Mike Snitzer
2025-08-28 16:36 ` [PATCH v8 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Jeff Layton
2025-08-28 16:22 ` Jeff Layton
2025-08-28 16:27 ` Chuck Lever
2025-08-26 18:57 ` [PATCH v8 6/7] NFSD: issue WRITEs " Mike Snitzer
2025-08-26 18:57 ` [PATCH v8 7/7] NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events Mike Snitzer