* [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
@ 2025-06-10 20:57 Mike Snitzer
2025-06-10 20:57 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Mike Snitzer
` (8 more replies)
0 siblings, 9 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-10 20:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
Hi,
This series introduces 'enable-dontcache' to NFSD's debugfs interface.
Once enabled, NFSD will selectively make use of O_DIRECT when issuing
read and write IO:
- all READs will use O_DIRECT (both aligned and misaligned)
- all DIO-aligned WRITEs will use O_DIRECT (useful for SUNRPC RDMA)
- misaligned WRITEs currently continue to use normal buffered IO
Q: Why not actually use RWF_DONTCACHE (yet)?
A:
If IO is properly DIO-aligned, or can be made to be, using
O_DIRECT is preferred over DONTCACHE because of its reduced CPU and
memory usage. Relative to NFSD using RWF_DONTCACHE for misaligned
WRITEs, I've briefly discussed with Jens that follow-on dontcache work
is needed to justify falling back to actually using RWF_DONTCACHE.
Specifically, Hammerspace benchmarking has confirmed what Jeff Layton
suggested at the Bakeathon: we need dontcache to be enhanced to not
immediately dropbehind when IO completes -- because it works against
us (due to RMW needing to read without benefit of cache), whereas
buffered IO enables misaligned IO to be more performant. Jens thought
that delayed dropbehind is certainly doable but that he needed to
reason through it further (so timing on availability is TBD). As soon
as it is possible I'll happily switch NFSD's misaligned write IO
fallback from normal buffered IO to actually using RWF_DONTCACHE.
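To make the resulting per-IO policy concrete, here is a minimal sketch
(the helper name and plumbing are illustrative only; see patches 5 and
6 for the actual logic):
    /* Illustrative sketch: how the per-IO flags are chosen when
     * enable-dontcache=1. 'dio_aligned' stands in for the alignment
     * checks added later in this series.
     */
    static rwf_t nfsd_choose_io_flags(bool is_read, bool dio_aligned)
    {
            if (!nfsd_enable_dontcache)
                    return 0;          /* plain buffered IO */
            if (is_read || dio_aligned)
                    return RWF_DIRECT; /* READs are expanded to DIO alignment */
            /* misaligned WRITEs stay buffered (RWF_DONTCACHE later) */
            return 0;
    }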
Continuing with what this patchset provides:
NFSD now uses STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get and store
DIO alignment attributes from the underlying filesystem in the associated
nfsd_file. This is done when the nfsd_file is first opened for a
regular file.
A new RWF_DIRECT flag is added to include/uapi/linux/fs.h to allow
NFSD to use O_DIRECT on a per-IO basis.
If enable-dontcache=1 then RWF_DIRECT will be set for all READ IO
(even if the IO is misaligned, thanks to expanding the read to be
aligned for use with DIO, as suggested by Jeff and Chuck at the NFS
Bakeathon held recently in Ann Arbor).
NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to
DIO alignment (both page and disk alignment). This works quite well
for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it
maps the WRITE payload into aligned pages. But more work is needed to
be able to leverage O_DIRECT when SUNRPC's regular TCP transport is
used. I spent quite a bit of time analyzing the existing xdr_buf code
and NFSD's use of it. Unfortunately, the WRITE payload gets stored in
misaligned pages such that O_DIRECT isn't possible without a copy
(completely defeating the point). I'll reply to this cover letter to
start a subthread to discuss how best to deal with misaligned write
IO (given my association with Hammerspace, I'm most interested in NFSv3).
Performance benefits of using O_DIRECT in NFSD:
Hammerspace's testbed was 10 NFS servers connected via 800Gbit
RDMA networking (mlx5_core), each with 1TB of memory, 48 cores (2 NUMA
nodes) and 8 ScaleFlux NVMe devices (each with two 3.5TB namespaces.
Theoretical max for reads per NVMe device is 14GB/s, or ~7GB/s per
namespace).
And 10 client systems each running 64 IO threads.
The O_DIRECT performance win is pretty fantastic thanks to reduced CPU
and memory use, particularly for workloads with a working set that far
exceeds the available memory of a given server. This patchset's
changes (through patch 5; patch 6 wasn't written until after the
benchmarking was performed) enabled Hammerspace to improve its IO500.org
benchmark result (as submitted for this week's ISC 2025 in Hamburg,
Germany) by 25%.
That 25% improvement on IO500 is owed to NFS servers seeing:
- reduced CPU usage from 100% to ~50%
O_DIRECT:
write: 51% idle, 25% system, 14% IO wait, 2% IRQ
read: 55% idle, 9% system, 32.5% IO wait, 1.5% IRQ
buffered:
write: 17.8% idle, 67.5% system, 8% IO wait, 2% IRQ
read: 3.29% idle, 94.2% system, 2.5% IO wait, 1% IRQ
- reduced memory usage from just under 100% (987GiB for reads, 978GiB
for writes) to only ~244 MB for cache+buffer use (for both reads and
writes).
- buffered would tip-over due to kswapd and kcompactd struggling to
find free memory during reclaim.
- increased NVMe throughput when comparing O_DIRECT vs buffered:
O_DIRECT: 8-10 GB/s for writes, 9-11.8 GB/s for reads
buffered: 8 GB/s for writes, 4-5 GB/s for reads
- ability to support more IO threads per client system (from 48 to 64)
The performance improvement highlight of the numerous individual tests
in the IO500 collection of benchmarks was in the IOR "easy" test:
Write:
O_DIRECT: [RESULT] ior-easy-write 420.351599 GiB/s : time 869.650 seconds
CACHED: [RESULT] ior-easy-write 368.268722 GiB/s : time 413.647 seconds
Read:
O_DIRECT: [RESULT] ior-easy-read 446.790791 GiB/s : time 818.219 seconds
CACHED: [RESULT] ior-easy-read 284.706196 GiB/s : time 534.950 seconds
It is suspected that patch 6 in this patchset will improve IOR "hard"
read results. The "hard" name comes from the fact that it performs all
IO using a misaligned blocksize of 47008 bytes (which happens to be
the IO size I showed ftrace output for in the 6th patch's header).
All review and discussion is welcome, thanks!
Mike
Mike Snitzer (6):
NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
NFSD: pass nfsd_file to nfsd_iter_read()
fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis
NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
NFSD: issue READs using O_DIRECT even if IO is misaligned
fs/nfsd/debugfs.c | 39 +++++++++++++
fs/nfsd/filecache.c | 32 +++++++++++
fs/nfsd/filecache.h | 4 ++
fs/nfsd/nfs4xdr.c | 8 +--
fs/nfsd/nfsd.h | 1 +
fs/nfsd/trace.h | 37 +++++++++++++
fs/nfsd/vfs.c | 111 ++++++++++++++++++++++++++++++++++---
fs/nfsd/vfs.h | 17 +-----
include/linux/fs.h | 2 +-
include/linux/sunrpc/svc.h | 5 +-
include/uapi/linux/fs.h | 5 +-
11 files changed, 231 insertions(+), 30 deletions(-)
--
2.44.0
* [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
@ 2025-06-10 20:57 ` Mike Snitzer
2025-06-11 6:57 ` Christoph Hellwig
2025-06-11 14:31 ` Chuck Lever
2025-06-10 20:57 ` [PATCH 2/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
` (7 subsequent siblings)
8 siblings, 2 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-10 20:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
Add 'enable-dontcache' to NFSD's debugfs interface so that any data
read or written by NFSD will either not be cached (thanks to O_DIRECT)
or will be removed from the page cache upon completion (DONTCACHE).
enable-dontcache is 0 by default. It may be enabled with:
echo 1 > /sys/kernel/debug/nfsd/enable-dontcache
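The setting is standard debugfs fare and can be read back or cleared
the same way (nothing specific to this patch):
cat /sys/kernel/debug/nfsd/enable-dontcache
echo 0 > /sys/kernel/debug/nfsd/enable-dontcache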
FOP_DONTCACHE must be advertised as supported by the underlying
filesystem (e.g. XFS); otherwise, when 'enable-dontcache' is 1,
all IO flagged with RWF_DONTCACHE will fail with -EOPNOTSUPP.
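For reference, the -EOPNOTSUPP comes from the VFS's generic per-IO flag
validation, which does approximately the following (a simplified sketch
of kiocb_set_rw_flags(), not part of this patch):
    if (flags & RWF_DONTCACHE) {
            /* the file system must advertise support */
            if (!(ki->ki_filp->f_op->fop_flags & FOP_DONTCACHE))
                    return -EOPNOTSUPP;
            ki->ki_flags |= IOCB_DONTCACHE;
    }
so NFSD itself does not need any extra enforcement.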
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/debugfs.c | 39 +++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfsd.h | 1 +
fs/nfsd/vfs.c | 12 +++++++++++-
3 files changed, 51 insertions(+), 1 deletion(-)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 84b0c8b559dc..8decdec60a8e 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -32,6 +32,42 @@ static int nfsd_dsr_set(void *data, u64 val)
DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
+/*
+ * /sys/kernel/debug/nfsd/enable-dontcache
+ *
+ * Contents:
+ * %0: NFS READ and WRITE are not allowed to use dontcache
+ * %1: NFS READ and WRITE are allowed to use dontcache
+ *
+ * NFSD's dontcache support reserves the right to use O_DIRECT
+ * if it chooses (instead of dontcache's usual pagecache-based
+ * dropbehind semantics).
+ *
+ * The default value of this setting is zero (dontcache is
+ * disabled). This setting takes immediate effect for all NFS
+ * versions, all exports, and in all NFSD net namespaces.
+ */
+
+static int nfsd_dontcache_get(void *data, u64 *val)
+{
+ *val = nfsd_enable_dontcache ? 1 : 0;
+ return 0;
+}
+
+static int nfsd_dontcache_set(void *data, u64 val)
+{
+ if (val > 0) {
+ /* Must first also disable-splice-read */
+ nfsd_disable_splice_read = true;
+ nfsd_enable_dontcache = true;
+ } else
+ nfsd_enable_dontcache = false;
+ return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dontcache_fops, nfsd_dontcache_get,
+ nfsd_dontcache_set, "%llu\n");
+
void nfsd_debugfs_exit(void)
{
debugfs_remove_recursive(nfsd_top_dir);
@@ -44,4 +80,7 @@ void nfsd_debugfs_init(void)
debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
nfsd_top_dir, NULL, &nfsd_dsr_fops);
+
+ debugfs_create_file("enable-dontcache", S_IWUSR | S_IRUGO,
+ nfsd_top_dir, NULL, &nfsd_dontcache_fops);
}
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 1bfd0b4e9af7..00546547eae6 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -155,6 +155,7 @@ static inline void nfsd_debugfs_exit(void) {}
#endif
extern bool nfsd_disable_splice_read __read_mostly;
+extern bool nfsd_enable_dontcache __read_mostly;
extern int nfsd_max_blksize;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 7d94fae1dee8..bba3e6f4f56b 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -49,6 +49,7 @@
#define NFSDDBG_FACILITY NFSDDBG_FILEOP
bool nfsd_disable_splice_read __read_mostly;
+bool nfsd_enable_dontcache __read_mostly;
/**
* nfserrno - Map Linux errnos to NFS errnos
@@ -1086,6 +1087,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned long v, total;
struct iov_iter iter;
loff_t ppos = offset;
+ rwf_t flags = 0;
ssize_t host_err;
size_t len;
@@ -1103,7 +1105,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
trace_nfsd_read_vector(rqstp, fhp, offset, *count);
iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
- host_err = vfs_iter_read(file, &iter, &ppos, 0);
+
+ if (nfsd_enable_dontcache)
+ flags |= RWF_DONTCACHE;
+
+ host_err = vfs_iter_read(file, &iter, &ppos, flags);
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
@@ -1209,6 +1215,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
+ if (nfsd_enable_dontcache)
+ flags |= RWF_DONTCACHE;
+
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
--
2.44.0
* [PATCH 2/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
2025-06-10 20:57 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Mike Snitzer
@ 2025-06-10 20:57 ` Mike Snitzer
2025-06-10 20:57 ` [PATCH 3/6] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
` (6 subsequent siblings)
8 siblings, 0 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-10 20:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get and store DIO
alignment attributes from the underlying filesystem in the associated
nfsd_file. This is done when the nfsd_file is first opened for
a regular file.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/filecache.c | 32 ++++++++++++++++++++++++++++++++
fs/nfsd/filecache.h | 4 ++++
fs/nfsd/vfs.c | 17 +++++++++++++++++
fs/nfsd/vfs.h | 15 ++-------------
4 files changed, 55 insertions(+), 13 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 0acad9c35b3f..01598e7b0071 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -231,6 +231,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
refcount_set(&nf->nf_ref, 1);
nf->nf_may = need;
nf->nf_mark = NULL;
+ nf->nf_dio_mem_align = 0;
+ nf->nf_dio_offset_align = 0;
+ nf->nf_dio_read_offset_align = 0;
return nf;
}
@@ -1069,6 +1072,33 @@ nfsd_file_is_cached(struct inode *inode)
return ret;
}
+static __be32
+nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
+{
+ struct inode *inode = file_inode(nf->nf_file);
+ struct kstat stat;
+ __be32 status;
+
+ /* Currently only need to get DIO alignment info for regular files */
+ if (!S_ISREG(inode->i_mode))
+ return nfs_ok;
+
+ status = fh_getattr(fhp, &stat);
+ if (status != nfs_ok)
+ return status;
+
+ if (stat.result_mask & STATX_DIOALIGN) {
+ nf->nf_dio_mem_align = stat.dio_mem_align;
+ nf->nf_dio_offset_align = stat.dio_offset_align;
+ }
+ if (stat.result_mask & STATX_DIO_READ_ALIGN)
+ nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
+ else
+ nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
+
+ return status;
+}
+
static __be32
nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
struct svc_cred *cred,
@@ -1187,6 +1217,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
}
status = nfserrno(ret);
trace_nfsd_file_open(nf, status);
+ if (status == nfs_ok)
+ status = nfsd_file_getattr(fhp, nf);
}
} else
status = nfserr_jukebox;
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index 722b26c71e45..237a05c74211 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -54,6 +54,10 @@ struct nfsd_file {
struct list_head nf_gc;
struct rcu_head nf_rcu;
ktime_t nf_birthtime;
+
+ u32 nf_dio_mem_align;
+ u32 nf_dio_offset_align;
+ u32 nf_dio_read_offset_align;
};
int nfsd_file_cache_init(void);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index bba3e6f4f56b..8dccbb4d78f9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -2672,3 +2672,20 @@ nfsd_permission(struct svc_cred *cred, struct svc_export *exp,
return err? nfserrno(err) : 0;
}
+
+__be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
+{
+ u32 request_mask = STATX_BASIC_STATS;
+ struct path p = {.mnt = fh->fh_export->ex_path.mnt,
+ .dentry = fh->fh_dentry};
+ struct inode *inode = d_inode(p.dentry);
+
+ if (nfsd_enable_dontcache && S_ISREG(inode->i_mode))
+ request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
+
+ if (fh->fh_maxsize == NFS4_FHSIZE)
+ request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
+
+ return nfserrno(vfs_getattr(&p, stat, request_mask,
+ AT_STATX_SYNC_AS_STAT));
+}
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index eff04959606f..e3de3a295704 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -160,6 +160,8 @@ __be32 nfsd_permission(struct svc_cred *cred, struct svc_export *exp,
void nfsd_filp_close(struct file *fp);
+__be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat);
+
static inline int fh_want_write(struct svc_fh *fh)
{
int ret;
@@ -180,17 +182,4 @@ static inline void fh_drop_write(struct svc_fh *fh)
}
}
-static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
-{
- u32 request_mask = STATX_BASIC_STATS;
- struct path p = {.mnt = fh->fh_export->ex_path.mnt,
- .dentry = fh->fh_dentry};
-
- if (fh->fh_maxsize == NFS4_FHSIZE)
- request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
-
- return nfserrno(vfs_getattr(&p, stat, request_mask,
- AT_STATX_SYNC_AS_STAT));
-}
-
#endif /* LINUX_NFSD_VFS_H */
--
2.44.0
* [PATCH 3/6] NFSD: pass nfsd_file to nfsd_iter_read()
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
2025-06-10 20:57 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Mike Snitzer
2025-06-10 20:57 ` [PATCH 2/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
@ 2025-06-10 20:57 ` Mike Snitzer
2025-06-10 20:57 ` [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis Mike Snitzer
` (5 subsequent siblings)
8 siblings, 0 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-10 20:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
Prepares for nfsd_iter_read() to use DIO alignment stored in nfsd_file.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/nfs4xdr.c | 8 ++++----
fs/nfsd/vfs.c | 7 ++++---
fs/nfsd/vfs.h | 2 +-
3 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 9ceeb2d10c01..7ec6d951a284 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4479,7 +4479,7 @@ static __be32 nfsd4_encode_splice_read(
static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
struct nfsd4_read *read,
- struct file *file, unsigned long maxcount)
+ unsigned long maxcount)
{
struct xdr_stream *xdr = resp->xdr;
unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
@@ -4490,7 +4490,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
if (xdr_reserve_space_vec(xdr, maxcount) < 0)
return nfserr_resource;
- nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
+ nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
read->rd_offset, &maxcount, base,
&read->rd_eof);
read->rd_length = maxcount;
@@ -4537,7 +4537,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr) {
xdr_truncate_encode(xdr, eof_offset);
return nfserr;
@@ -5433,7 +5433,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr)
return nfserr;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 8dccbb4d78f9..e7cc8c6dfbad 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1068,7 +1068,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
* @fhp: file handle of file to be read
- * @file: opened struct file of file to be read
+ * @nf: opened struct nfsd_file of file to be read
* @offset: starting byte offset
* @count: IN: requested number of bytes; OUT: number of bytes read
* @base: offset in first page of read buffer
@@ -1081,9 +1081,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* returned.
*/
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset, unsigned long *count,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
unsigned int base, u32 *eof)
{
+ struct file *file = nf->nf_file;
unsigned long v, total;
struct iov_iter iter;
loff_t ppos = offset;
@@ -1311,7 +1312,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
else
- err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
+ err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
nfsd_file_put(nf);
trace_nfsd_read_done(rqstp, fhp, offset, *count);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index e3de3a295704..09ed36fe5fd2 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -121,7 +121,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned long *count,
u32 *eof);
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset,
+ struct nfsd_file *nf, loff_t offset,
unsigned long *count, unsigned int base,
u32 *eof);
bool nfsd_read_splice_ok(struct svc_rqst *rqstp);
--
2.44.0
* [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
` (2 preceding siblings ...)
2025-06-10 20:57 ` [PATCH 3/6] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
@ 2025-06-10 20:57 ` Mike Snitzer
2025-06-11 6:58 ` Christoph Hellwig
2025-06-10 20:57 ` [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes Mike Snitzer
` (4 subsequent siblings)
8 siblings, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-10 20:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
Avoids the need to open code do_iter_readv_writev() purely to request
that a sync iocb make use of IOCB_DIRECT.
Care was taken to preserve the long-established value for IOCB_DIRECT
(1 << 17) when introducing RWF_DIRECT.
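Note that adding RWF_DIRECT to RWF_SUPPORTED also exposes it to
userspace via preadv2()/pwritev2(). A hypothetical example, assuming
the buffer, length and offset all satisfy the file's DIO alignment
requirements (4KiB is assumed here for simplicity):
    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <linux/fs.h>   /* RWF_DIRECT, with this patch applied */
    /* Hypothetical: per-IO O_DIRECT read on an fd opened without O_DIRECT */
    static ssize_t read_direct(int fd, off_t offset, size_t len)
    {
            void *buf;
            ssize_t ret;
            if (posix_memalign(&buf, 4096, len))
                    return -1;
            struct iovec iov = { .iov_base = buf, .iov_len = len };
            ret = preadv2(fd, &iov, 1, offset, RWF_DIRECT);
            free(buf);
            return ret;
    }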
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
include/linux/fs.h | 2 +-
include/uapi/linux/fs.h | 5 ++++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ed025959d1bd..9bf5543926f8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -324,7 +324,7 @@ struct readahead_control;
/* non-RWF related bits - start at 16 */
#define IOCB_EVENTFD (1 << 16)
-#define IOCB_DIRECT (1 << 17)
+#define IOCB_DIRECT (__force int) RWF_DIRECT
#define IOCB_WRITE (1 << 18)
/* iocb->ki_waitq is valid */
#define IOCB_WAITQ (1 << 19)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 56a4f93a08f4..e0d00a7c336a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -335,10 +335,13 @@ typedef int __bitwise __kernel_rwf_t;
/* buffered IO that drops the cache after reading or writing data */
#define RWF_DONTCACHE ((__force __kernel_rwf_t)0x00000080)
+/* per-IO O_DIRECT, using (1 << 17) or 0x00020000 for compat with IOCB_DIRECT */
+#define RWF_DIRECT ((__force __kernel_rwf_t)(1 << 17))
+
/* mask of flags supported by the kernel */
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
- RWF_DONTCACHE)
+ RWF_DONTCACHE | RWF_DIRECT)
#define PROCFS_IOCTL_MAGIC 'f'
--
2.44.0
* [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
` (3 preceding siblings ...)
2025-06-10 20:57 ` [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis Mike Snitzer
@ 2025-06-10 20:57 ` Mike Snitzer
2025-06-11 7:00 ` Christoph Hellwig
2025-06-11 14:42 ` Chuck Lever
2025-06-10 20:57 ` [PATCH 6/6] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
` (3 subsequent siblings)
8 siblings, 2 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-10 20:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
IO must be DIO-aligned; otherwise it falls back to using buffered IO.
RWF_DONTCACHE is _not_ currently used for misaligned IO (even when
nfsd/enable-dontcache=1) because it works against us (due to RMW
needing to read without benefit of cache), whereas buffered IO enables
misaligned IO to be more performant.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/vfs.c | 40 ++++++++++++++++++++++++++++++++++++----
1 file changed, 36 insertions(+), 4 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index e7cc8c6dfbad..a942609e3ab9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1064,6 +1064,22 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
+static bool is_dio_aligned(const struct iov_iter *iter, loff_t offset,
+ const u32 blocksize)
+{
+ u32 blocksize_mask;
+
+ if (!blocksize)
+ return false;
+
+ blocksize_mask = blocksize - 1;
+ if ((offset & blocksize_mask) ||
+ (iov_iter_alignment(iter) & blocksize_mask))
+ return false;
+
+ return true;
+}
+
/**
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
@@ -1107,8 +1123,16 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
trace_nfsd_read_vector(rqstp, fhp, offset, *count);
iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
- if (nfsd_enable_dontcache)
- flags |= RWF_DONTCACHE;
+ if (nfsd_enable_dontcache) {
+ if (is_dio_aligned(&iter, offset, nf->nf_dio_read_offset_align))
+ flags |= RWF_DIRECT;
+ /* FIXME: not using RWF_DONTCACHE for misaligned IO because it works
+ * against us (due to RMW needing to read without benefit of cache),
+ * whereas buffered IO enables misaligned IO to be more performant.
+ */
+ //else
+ // flags |= RWF_DONTCACHE;
+ }
host_err = vfs_iter_read(file, &iter, &ppos, flags);
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
@@ -1217,8 +1241,16 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
- if (nfsd_enable_dontcache)
- flags |= RWF_DONTCACHE;
+ if (nfsd_enable_dontcache) {
+ if (is_dio_aligned(&iter, offset, nf->nf_dio_offset_align))
+ flags |= RWF_DIRECT;
+ /* FIXME: not using RWF_DONTCACHE for misaligned IO because it works
+ * against us (due to RMW needing to read without benefit of cache),
+ * whereas buffered IO enables misaligned IO to be more performant.
+ */
+ //else
+ // flags |= RWF_DONTCACHE;
+ }
since = READ_ONCE(file->f_wb_err);
if (verf)
--
2.44.0
* [PATCH 6/6] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
` (4 preceding siblings ...)
2025-06-10 20:57 ` [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes Mike Snitzer
@ 2025-06-10 20:57 ` Mike Snitzer
2025-06-11 12:55 ` [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Jeff Layton
` (2 subsequent siblings)
8 siblings, 0 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-10 20:57 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
If enable-dontcache is used, expand any misaligned READ to the next
DIO-aligned block (on either end of the READ).
Reserve an extra page in svc_serv_maxpages() because nfsd_iter_read()
might need two extra pages when a READ payload is not DIO-aligned --
but nfsd_iter_read() and nfsd_splice_actor() are mutually exclusive
(so reuse the page already reserved for nfsd_splice_actor).
Also add an nfsd_read_vector_dio trace event. This combination of
trace events is useful:
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector_dio/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
Which for this dd command:
dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct
Results in:
nfsd-16580 [001] ..... 5672.403130: nfsd_read_vector_dio: xid=0x5ccf019c fh_hash=0xe4dadb60 offset=0 len=47008 start=0+0 end=47104-96
nfsd-16580 [001] ..... 5672.403131: nfsd_read_vector: xid=0x5ccf019c fh_hash=0xe4dadb60 offset=0 len=47104
nfsd-16580 [001] ..... 5672.403134: xfs_file_direct_read: dev 253:0 ino 0x1c2388c1 disize 0x16f40 pos 0x0 bytecount 0xb800
nfsd-16580 [001] ..... 5672.404380: nfsd_read_io_done: xid=0x5ccf019c fh_hash=0xe4dadb60 offset=0 len=47008
nfsd-16580 [001] ..... 5672.404672: nfsd_read_vector_dio: xid=0x5dcf019c fh_hash=0xe4dadb60 offset=47008 len=47008 start=46592+416 end=94208-192
nfsd-16580 [001] ..... 5672.404672: nfsd_read_vector: xid=0x5dcf019c fh_hash=0xe4dadb60 offset=46592 len=47616
nfsd-16580 [001] ..... 5672.404673: xfs_file_direct_read: dev 253:0 ino 0x1c2388c1 disize 0x16f40 pos 0xb600 bytecount 0xba00
nfsd-16580 [001] ..... 5672.405771: nfsd_read_io_done: xid=0x5dcf019c fh_hash=0xe4dadb60 offset=47008 len=47008
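(To unpack the second READ above: the client asked for 47008 bytes at
offset 47008. With what appears to be a 512-byte DIO offset alignment,
the start is rounded down to 46592 (start_extra = 47008 - 46592 = 416)
and the end 94016 is rounded up to 94208 (end_extra = 192), so 47616
bytes are read via O_DIRECT and the 608 extra bytes are trimmed from
the bvec before the reply is encoded.)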
Suggested-by: Jeff Layton <jlayton@kernel.org>
Suggested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/trace.h | 37 ++++++++++++++++++++++
fs/nfsd/vfs.c | 65 ++++++++++++++++++++++++++++----------
include/linux/sunrpc/svc.h | 5 ++-
3 files changed, 90 insertions(+), 17 deletions(-)
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 3c5505ef5e3a..a46515b953f4 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -473,6 +473,43 @@ DEFINE_NFSD_IO_EVENT(write_done);
DEFINE_NFSD_IO_EVENT(commit_start);
DEFINE_NFSD_IO_EVENT(commit_done);
+TRACE_EVENT(nfsd_read_vector_dio,
+ TP_PROTO(struct svc_rqst *rqstp,
+ struct svc_fh *fhp,
+ u64 offset,
+ u32 len,
+ loff_t start,
+ loff_t start_extra,
+ loff_t end,
+ loff_t end_extra),
+ TP_ARGS(rqstp, fhp, offset, len, start, start_extra, end, end_extra),
+ TP_STRUCT__entry(
+ __field(u32, xid)
+ __field(u32, fh_hash)
+ __field(u64, offset)
+ __field(u32, len)
+ __field(loff_t, start)
+ __field(loff_t, start_extra)
+ __field(loff_t, end)
+ __field(loff_t, end_extra)
+ ),
+ TP_fast_assign(
+ __entry->xid = be32_to_cpu(rqstp->rq_xid);
+ __entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
+ __entry->offset = offset;
+ __entry->len = len;
+ __entry->start = start;
+ __entry->start_extra = start_extra;
+ __entry->end = end;
+ __entry->end_extra = end_extra;
+ ),
+ TP_printk("xid=0x%08x fh_hash=0x%08x offset=%llu len=%u start=%llu+%llu end=%llu-%llu",
+ __entry->xid, __entry->fh_hash,
+ __entry->offset, __entry->len,
+ __entry->start, __entry->start_extra,
+ __entry->end, __entry->end_extra)
+);
+
DECLARE_EVENT_CLASS(nfsd_err_class,
TP_PROTO(struct svc_rqst *rqstp,
struct svc_fh *fhp,
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index a942609e3ab9..be5d025b4680 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -19,6 +19,7 @@
#include <linux/splice.h>
#include <linux/falloc.h>
#include <linux/fcntl.h>
+#include <linux/math.h>
#include <linux/namei.h>
#include <linux/delay.h>
#include <linux/fsnotify.h>
@@ -1101,15 +1102,41 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned int base, u32 *eof)
{
struct file *file = nf->nf_file;
- unsigned long v, total;
+ unsigned long v, total, in_count = *count;
+ loff_t start_extra = 0, end_extra = 0;
struct iov_iter iter;
- loff_t ppos = offset;
+ loff_t ppos;
rwf_t flags = 0;
ssize_t host_err;
size_t len;
+ /*
+ * If dontcache enabled, expand any misaligned READ to
+ * the next DIO-aligned block (on either end of the READ).
+ */
+ if (nfsd_enable_dontcache && nf->nf_dio_mem_align &&
+ (base & (nf->nf_dio_mem_align-1)) == 0) {
+ const u32 dio_blocksize = nf->nf_dio_read_offset_align;
+ loff_t orig_end = offset + *count;
+ loff_t start = round_down(offset, dio_blocksize);
+ loff_t end = round_up(orig_end, dio_blocksize);
+
+ WARN_ON_ONCE(dio_blocksize > PAGE_SIZE);
+ start_extra = offset - start;
+ end_extra = end - orig_end;
+
+ /* Show original offset and count, and how it was expanded for DIO */
+ trace_nfsd_read_vector_dio(rqstp, fhp, offset, *count,
+ start, start_extra, end, end_extra);
+
+ /* trace_nfsd_read_vector() will reflect larger DIO-aligned READ */
+ offset = start;
+ in_count = end - start;
+ flags |= RWF_DIRECT;
+ }
+
v = 0;
- total = *count;
+ total = in_count;
while (total) {
len = min_t(size_t, total, PAGE_SIZE - base);
bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
@@ -1120,21 +1147,27 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
WARN_ON_ONCE(v > rqstp->rq_maxpages);
- trace_nfsd_read_vector(rqstp, fhp, offset, *count);
- iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
-
- if (nfsd_enable_dontcache) {
- if (is_dio_aligned(&iter, offset, nf->nf_dio_read_offset_align))
- flags |= RWF_DIRECT;
- /* FIXME: not using RWF_DONTCACHE for misaligned IO because it works
- * against us (due to RMW needing to read without benefit of cache),
- * whereas buffered IO enables misaligned IO to be more performant.
- */
- //else
- // flags |= RWF_DONTCACHE;
- }
+ trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
+ iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
+ ppos = offset;
host_err = vfs_iter_read(file, &iter, &ppos, flags);
+
+ if ((start_extra || end_extra) && host_err >= 0) {
+ rqstp->rq_bvec[0].bv_offset += start_extra;
+ rqstp->rq_bvec[0].bv_len -= start_extra;
+ rqstp->rq_bvec[v].bv_len -= end_extra;
+ /* Must adjust returned read size to reflect original extent */
+ offset += start_extra;
+ if (likely(host_err >= start_extra)) {
+ host_err -= start_extra;
+ if (host_err > *count)
+ host_err = *count;
+ } else {
+ /* Short read that didn't read any of requested data */
+ host_err = 0;
+ }
+ }
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 46f7991cea58..52f5c9ec35aa 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
* pages, one for the request, and one for the reply.
* nfsd_splice_actor() might need an extra page when a READ payload
* is not page-aligned.
+ * nfsd_iter_read() might need two extra pages when a READ payload
+ * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
+ * are mutually exclusive.
*/
static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
{
- return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
+ return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
}
/*
--
2.44.0
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-10 20:57 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Mike Snitzer
@ 2025-06-11 6:57 ` Christoph Hellwig
2025-06-11 10:44 ` Mike Snitzer
2025-06-11 13:56 ` Chuck Lever
2025-06-11 14:31 ` Chuck Lever
1 sibling, 2 replies; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-11 6:57 UTC (permalink / raw)
To: Mike Snitzer
Cc: Chuck Lever, Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On Tue, Jun 10, 2025 at 04:57:32PM -0400, Mike Snitzer wrote:
> Add 'enable-dontcache' to NFSD's debugfs interface so that any data
> read or written by NFSD will either not be cached (thanks to O_DIRECT)
> or will be removed from the page cache upon completion (DONTCACHE).
>
> enable-dontcache is 0 by default. It may be enabled with:
> echo 1 > /sys/kernel/debug/nfsd/enable-dontcache
Having this as a global debug-only interface feels a bit odd.
* Re: [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis
2025-06-10 20:57 ` [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis Mike Snitzer
@ 2025-06-11 6:58 ` Christoph Hellwig
2025-06-11 10:51 ` Mike Snitzer
2025-06-11 14:17 ` Chuck Lever
0 siblings, 2 replies; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-11 6:58 UTC (permalink / raw)
To: Mike Snitzer
Cc: Chuck Lever, Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On Tue, Jun 10, 2025 at 04:57:35PM -0400, Mike Snitzer wrote:
> Avoids the need to open code do_iter_readv_writev() purely to request
> that a sync iocb make use of IOCB_DIRECT.
>
> Care was taken to preserve the long-established value for IOCB_DIRECT
> (1 << 17) when introducing RWF_DIRECT.
What is the problem with using vfs_iocb_iter_read instead of
vfs_iter_read and passing the iocb directly?
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-10 20:57 ` [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes Mike Snitzer
@ 2025-06-11 7:00 ` Christoph Hellwig
2025-06-11 12:23 ` Mike Snitzer
2025-06-11 14:42 ` Chuck Lever
1 sibling, 1 reply; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-11 7:00 UTC (permalink / raw)
To: Mike Snitzer
Cc: Chuck Lever, Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On Tue, Jun 10, 2025 at 04:57:36PM -0400, Mike Snitzer wrote:
> IO must be aligned, otherwise it falls back to using buffered IO.
>
> RWF_DONTCACHE is _not_ currently used for misaligned IO (even when
> nfsd/enable-dontcache=1) because it works against us (due to RMW
> needing to read without benefit of cache), whereas buffered IO enables
> misaligned IO to be more performant.
This seems to "randomly" mix direct I/O and buffered I/O on a file.
That's basically asking for data corruption due to invalidation races.
But maybe also explain what this is trying to address to start with?
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-11 6:57 ` Christoph Hellwig
@ 2025-06-11 10:44 ` Mike Snitzer
2025-06-11 13:04 ` Jeff Layton
2025-06-11 13:56 ` Chuck Lever
1 sibling, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-11 10:44 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chuck Lever, Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On Tue, Jun 10, 2025 at 11:57:33PM -0700, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 04:57:32PM -0400, Mike Snitzer wrote:
> > Add 'enable-dontcache' to NFSD's debugfs interface so that any data
> > read or written by NFSD will either not be cached (thanks to O_DIRECT)
> > or will be removed from the page cache upon completion (DONTCACHE).
> >
> > enable-dontcache is 0 by default. It may be enabled with:
> > echo 1 > /sys/kernel/debug/nfsd/enable-dontcache
>
> Having this as a global debug-only interface feels a bit odd.
>
I generally agree; I originally proposed an nfsd.nfsd_dontcache=Y
modparam:
https://lore.kernel.org/linux-nfs/20250220171205.12092-1-snitzer@kernel.org/
(and even implemented formal NFSD per-export "dontcache" control,
which Trond and I both think is probably needed).
But (ab)using debugfs is the approach Chuck and Jeff would like to
take for experimental NFSD changes so that we can kick the tires
without having to support an interface until the end of time. See
commit 9fe5ea760e64 ("NFSD: Add /sys/kernel/debug/nfsd") for more on
the general thinking. First consumer was commit c9dcd1de7977 ("NFSD:
Add experimental setting to disable the use of splice read").
I'm fine with using debugfs; it's a means to an end with no strings attached.
Once we have confidence in what is needed we can pivot back to
a modparam or per-export controls or whatever.
Mike
* Re: [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis
2025-06-11 6:58 ` Christoph Hellwig
@ 2025-06-11 10:51 ` Mike Snitzer
2025-06-11 14:17 ` Chuck Lever
1 sibling, 0 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-11 10:51 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chuck Lever, Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On Tue, Jun 10, 2025 at 11:58:25PM -0700, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 04:57:35PM -0400, Mike Snitzer wrote:
> > Avoids the need to open code do_iter_readv_writev() purely to request
> > that a sync iocb make use of IOCB_DIRECT.
> >
> > Care was taken to preserve the long-established value for IOCB_DIRECT
> > (1 << 17) when introducing RWF_DIRECT.
>
> What is the problem with using vfs_iocb_iter_read instead of
> vfs_iter_read and passing the iocb directly?
Open to whatever. I just didn't want to open code
do_iter_readv_writev(), as my patch header says.
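(For reference, that alternative would look roughly like the following
in nfsd_iter_read() -- a sketch only, untested:
    struct kiocb kiocb;
    init_sync_kiocb(&kiocb, file);
    kiocb.ki_pos = offset;
    kiocb.ki_flags |= IOCB_DIRECT;
    host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
with vfs_iocb_iter_write() on the write side, instead of introducing a
new RWF flag.)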
Mike
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-11 7:00 ` Christoph Hellwig
@ 2025-06-11 12:23 ` Mike Snitzer
2025-06-11 13:30 ` Jeff Layton
2025-06-12 7:23 ` Christoph Hellwig
0 siblings, 2 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-11 12:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chuck Lever, Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe,
Dave Chinner
On Wed, Jun 11, 2025 at 12:00:02AM -0700, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 04:57:36PM -0400, Mike Snitzer wrote:
> > IO must be aligned, otherwise it falls back to using buffered IO.
> >
> > RWF_DONTCACHE is _not_ currently used for misaligned IO (even when
> > nfsd/enable-dontcache=1) because it works against us (due to RMW
> > needing to read without benefit of cache), whereas buffered IO enables
> > misaligned IO to be more performant.
>
> This seems to "randomly" mix direct I/O and buffered I/O on a file.
It isn't random: if the IO is DIO-aligned, it uses direct I/O.
> That's basically asking for data corruption due to invalidation races.
I've seen you speak of said dragons in other threads and even commit
headers, etc. Could be they are lurking, but I took the approach of
"implement it [this patchset] and see what breaks". It hasn't broken
yet, despite my having thrown a large battery of testing at it (which
includes all of Hammerspace's automated sanity testing that uses
many testsuites, e.g. xfstests, mdtest, etc.).
But the IOR "hard" workload, which checks for corruption and uses
47008 blocksize to force excessive RMW, hasn't yet been run with my
"[PATCH 6/6] NFSD: issue READs using O_DIRECT even if IO is
misaligned" [0]. That IOR "hard" testing will likely happen today.
> But maybe also explain what this is trying to address to start with?
Ha, I suspect you saw my too-many-words 0th patch header [1] and
ignored it? Solid feedback; I need to be more succinct, and I'm
probably too close to this work to see the gaps in introduction and
justification, but I will refine, starting now:
Overview: NFSD currently only uses buffered IO and it routinely falls
over due to the problems RWF_DONTCACHE was developed to work around.
But RWF_DONTCACHE also uses buffered IO and the page cache, and so it
suffers from inefficiencies that direct IO doesn't. Buffered IO's CPU
and memory consumption is particularly unwanted on
resource-constrained systems.
Maybe some pictures are worth 1000+ words.
Here is a flamegraph showing buffered IO causing reclaim to bring the
system to a halt (when workload's working set far exceeds available
memory): https://original.art/buffered_read.svg
Here is flamegraph for the same type of workload but using DONTCACHE
instead of normal buffered IO: https://original.art/dontcache_read.svg
Dave Chinner provided his analysis of why DONTCACHE was struggling
[2]. And I gave further context to others and forecast that I'd be
working on implementing NFSD support for using O_DIRECT [3]. Then I
discussed how to approach the implementation with Chuck, Jeff and
others at the recent NFS Bakeathon. This series implements my take on
what was discussed.
This graph shows O_DIRECT vs buffered IO for the IOR "easy" workload
("easy" because it uses aligned 1 MiB IOs rather than 47008 bytes like
IOR "hard"): https://original.art/NFSD_direct_vs_buffered_IO.jpg
Buffered IO is generally worse across the board. DONTCACHE provides
welcome reclaim storm relief without the alignment requirements of
O_DIRECT but there really is no substitute for O_DIRECT if we're able
to use it. My patchset shows NFSD can and that it is much more
deterministic and less resource hungry.
Direct I/O is definitely the direction we need to go, with DONTCACHE
fallback for misaligned write IO (once it is able to delay its
dropbehind to work better with misaligned IO).
Mike
[0]: https://lore.kernel.org/linux-nfs/20250610205737.63343-7-snitzer@kernel.org/
[1]: https://lore.kernel.org/linux-nfs/20250610205737.63343-1-snitzer@kernel.org/
[2]: https://lore.kernel.org/linux-nfs/aBrKbOoj4dgUvz8f@dread.disaster.area/
[3]: https://lore.kernel.org/linux-nfs/aBvVltbDKdHXMtLL@kernel.org/
* Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
` (5 preceding siblings ...)
2025-06-10 20:57 ` [PATCH 6/6] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
@ 2025-06-11 12:55 ` Jeff Layton
2025-06-12 7:39 ` Christoph Hellwig
2025-06-11 14:16 ` Chuck Lever
2025-06-12 13:46 ` Chuck Lever
8 siblings, 1 reply; 75+ messages in thread
From: Jeff Layton @ 2025-06-11 12:55 UTC (permalink / raw)
To: Mike Snitzer, Chuck Lever; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
On Tue, 2025-06-10 at 16:57 -0400, Mike Snitzer wrote:
> Hi,
>
> This series introduces 'enable-dontcache' to NFSD's debugfs interface,
> once enabled NFSD will selectively make use of O_DIRECT when issuing
> read and write IO:
> - all READs will use O_DIRECT (both aligned and misaligned)
> - all DIO-aligned WRITEs will use O_DIRECT (useful for SUNRPC RDMA)
> - misaligned WRITEs currently continue to use normal buffered IO
>
> Q: Why not actually use RWF_DONTCACHE (yet)?
> A:
> If IO is properly DIO-aligned, or can be made to be, using
> O_DIRECT is preferred over DONTCACHE because of its reduced CPU and
> memory usage. Relative to NFSD using RWF_DONTCACHE for misaligned
> WRITEs, I've briefly discussed with Jens that follow-on dontcache work
> is needed to justify falling back to actually using RWF_DONTCACHE.
> Specifically, Hammerspace benchmarking has confirmed what Jeff Layton
> suggested at the Bakeathon: we need dontcache to be enhanced to not
> immediately dropbehind when IO completes -- because it works against
> us (due to RMW needing to read without benefit of cache), whereas
> buffered IO enables misaligned IO to be more performant. Jens thought
> that delayed dropbehind is certainly doable but that he needed to
> reason through it further (so timing on availability is TBD). As soon
> as it is possible I'll happily switch NFSD's misaligned write IO
> fallback from normal buffered IO to actually using RWF_DONTCACHE.
>
To be clear, my concern with *_DONTCACHE is this bit in
generic_write_sync():
} else if (iocb->ki_flags & IOCB_DONTCACHE) {
struct address_space *mapping = iocb->ki_filp->f_mapping;
filemap_fdatawrite_range_kick(mapping, iocb->ki_pos - count,
iocb->ki_pos - 1);
}
I understand why it was done, but it means that we're kicking off
writeback for small ranges after every write. I think we'd be better
served by allowing for a little batching and just kicking off writeback
(maybe even for the whole inode) after a short delay. IOW, I agree with
Dave Chinner that we need some sort of writebehind window.
The dropbehind part (where we drop it from the pagecache after
writeback completes) is fine, IMO.
> Continuing with what this patchset provides:
>
> NFSD now uses STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get and store
> DIO alignment attributes from the underlying filesystem in the associated
> nfsd_file. This is done when the nfsd_file is first opened for a
> regular file.
>
> A new RWF_DIRECT flag is added to include/uapi/linux/fs.h to allow
> NFSD to use O_DIRECT on a per-IO basis.
>
> If enable-dontcache=1 then RWF_DIRECT will be set for all READ IO
> (even if the IO is misaligned, thanks to expanding the read to be
> aligned for use with DIO, as suggested by Jeff and Chuck at the NFS
> Bakeathon held recently in Ann Arbor).
>
> NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to
> DIO alignment (both page and disk alignment). This works quite well
> for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it
> maps the WRITE payload into aligned pages. But more work is needed to
> be able to leverage O_DIRECT when SUNRPC's regular TCP transport is
> used. I spent quite a bit of time analyzing the existing xdr_buf code
> and NFSD's use of it. Unfortunately, the WRITE payload gets stored in
> misaligned pages such that O_DIRECT isn't possible without a copy
> (completely defeating the point). I'll reply to this cover letter to
> start a subthread to discuss how best to deal with misaligned write
> IO (by association with Hammerspace, I'm most interested in NFS v3).
>
Tricky problem. svc_tcp_recvfrom() just slurps the whole RPC into the
rq_pages array. To get alignment right, you'd probably have to do the
receive in a much more piecemeal way.
Basically, you'd need to decode as you receive chunks of the message,
and look out for WRITEs, and then set it up so that their payloads are
received with proper alignment.
Anyway, separate thread to discuss that sounds good.
> Performance benefits of using O_DIRECT in NFSD:
>
> Hammerspace's testbed was 10 NFS servers connected via 800Gbit
> RDMA networking (mlx5_core), each with 1TB of memory, 48 cores (2 NUMA
> nodes) and 8 ScaleFlux NVMe devices (each with two 3.5TB namespaces.
> Theoretical max for reads per NVMe device is 14GB/s, or ~7GB/s per
> namespace).
>
> And 10 client systems each running 64 IO threads.
>
> The O_DIRECT performance win is pretty fantastic thanks to reduced CPU
> and memory use, particularly for workloads with a working set that far
> exceeds the available memory of a given server. This patchset's
> changes (through patch 5; patch 6 wasn't written until after the
> benchmarking was performed) enabled Hammerspace to improve its IO500.org
> benchmark result (as submitted for this week's ISC 2025 in Hamburg,
> Germany) by 25%.
>
> That 25% improvement on IO500 is owed to NFS servers seeing:
> - reduced CPU usage from 100% to ~50%
> O_DIRECT:
> write: 51% idle, 25% system, 14% IO wait, 2% IRQ
> read: 55% idle, 9% system, 32.5% IO wait, 1.5% IRQ
> buffered:
> write: 17.8% idle, 67.5% system, 8% IO wait, 2% IRQ
> read: 3.29% idle, 94.2% system, 2.5% IO wait, 1% IRQ
>
> - reduced memory usage from just under 100% (987GiB for reads, 978GiB
> for writes) to only ~244 MB for cache+buffer use (for both reads and
> writes).
> - buffered would tip-over due to kswapd and kcompactd struggling to
> find free memory during reclaim.
>
> - increased NVMe throughput when comparing O_DIRECT vs buffered:
> O_DIRECT: 8-10 GB/s for writes, 9-11.8 GB/s for reads
> buffered: 8 GB/s for writes, 4-5 GB/s for reads
>
> - ability to support more IO threads per client system (from 48 to 64)
>
> The performance improvement highlight of the numerous individual tests
> in the IO500 collection of benchmarks was in the IOR "easy" test:
>
> Write:
> O_DIRECT: [RESULT] ior-easy-write 420.351599 GiB/s : time 869.650 seconds
> CACHED: [RESULT] ior-easy-write 368.268722 GiB/s : time 413.647 seconds
>
> Read:
> O_DIRECT: [RESULT] ior-easy-read 446.790791 GiB/s : time 818.219 seconds
> CACHED: [RESULT] ior-easy-read 284.706196 GiB/s : time 534.950 seconds
>
Wow!
> It is suspected that patch 6 in this patchset will improve IOR "hard"
> read results. The "hard" name comes from the fact that it performs all
> IO using a misaligned blocksize of 47008 bytes (which happens to be
> the IO size I showed ftrace output for in the 6th patch's header).
>
> All review and discussion is welcome, thanks!
> Mike
>
> Mike Snitzer (6):
> NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
> NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
> NFSD: pass nfsd_file to nfsd_iter_read()
> fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis
> NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
> NFSD: issue READs using O_DIRECT even if IO is misaligned
>
> fs/nfsd/debugfs.c | 39 +++++++++++++
> fs/nfsd/filecache.c | 32 +++++++++++
> fs/nfsd/filecache.h | 4 ++
> fs/nfsd/nfs4xdr.c | 8 +--
> fs/nfsd/nfsd.h | 1 +
> fs/nfsd/trace.h | 37 +++++++++++++
> fs/nfsd/vfs.c | 111 ++++++++++++++++++++++++++++++++++---
> fs/nfsd/vfs.h | 17 +-----
> include/linux/fs.h | 2 +-
> include/linux/sunrpc/svc.h | 5 +-
> include/uapi/linux/fs.h | 5 +-
> 11 files changed, 231 insertions(+), 30 deletions(-)
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-11 10:44 ` Mike Snitzer
@ 2025-06-11 13:04 ` Jeff Layton
0 siblings, 0 replies; 75+ messages in thread
From: Jeff Layton @ 2025-06-11 13:04 UTC (permalink / raw)
To: Mike Snitzer, Christoph Hellwig
Cc: Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe
On Wed, 2025-06-11 at 06:44 -0400, Mike Snitzer wrote:
> On Tue, Jun 10, 2025 at 11:57:33PM -0700, Christoph Hellwig wrote:
> > On Tue, Jun 10, 2025 at 04:57:32PM -0400, Mike Snitzer wrote:
> > > Add 'enable-dontcache' to NFSD's debugfs interface so that any data
> > > read or written by NFSD will either not be cached (thanks to O_DIRECT)
> > > or will be removed from the page cache upon completion (DONTCACHE).
> > >
> > > enable-dontcache is 0 by default. It may be enabled with:
> > > echo 1 > /sys/kernel/debug/nfsd/enable-dontcache
> >
> > Having this as a global debug-only interface feels a bit odd.
> >
>
> I generally agree; I originally proposed an nfsd.nfsd_dontcache=Y
> modparam:
> https://lore.kernel.org/linux-nfs/20250220171205.12092-1-snitzer@kernel.org/
>
> (and even implemented formal NFSD per-export "dontcache" control,
> which Trond and I both think is probably needed).
>
> But (ab)using debugfs is the approach Chuck and Jeff would like to
> take for experimental NFSD changes so that we can kick the tires
> without having to support an interface until the end of time. See
> commit 9fe5ea760e64 ("NFSD: Add /sys/kernel/debug/nfsd") for more on
> the general thinking. First consumer was commit c9dcd1de7977 ("NFSD:
> Add experimental setting to disable the use of splice read").
>
> I'm fine with using debugfs; it's a means to an end with no strings attached.
>
> Once we have confidence in what is needed we can pivot back to
> a modparam or per-export controls or whatever.
>
Yeah. I think this will probably end up being a per-export setting.
We're just hesitant to commit to an interface until we have a bit more
experience with this.
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-11 12:23 ` Mike Snitzer
@ 2025-06-11 13:30 ` Jeff Layton
2025-06-12 7:22 ` Christoph Hellwig
2025-06-12 7:23 ` Christoph Hellwig
1 sibling, 1 reply; 75+ messages in thread
From: Jeff Layton @ 2025-06-11 13:30 UTC (permalink / raw)
To: Mike Snitzer, Christoph Hellwig
Cc: Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe, Dave Chinner
On Wed, 2025-06-11 at 08:23 -0400, Mike Snitzer wrote:
> On Wed, Jun 11, 2025 at 12:00:02AM -0700, Christoph Hellwig wrote:
> > On Tue, Jun 10, 2025 at 04:57:36PM -0400, Mike Snitzer wrote:
> > > IO must be aligned, otherwise it falls back to using buffered IO.
> > >
> > > RWF_DONTCACHE is _not_ currently used for misaligned IO (even when
> > > nfsd/enable-dontcache=1) because it works against us (due to RMW
> > > needing to read without benefit of cache), whereas buffered IO enables
> > > misaligned IO to be more performant.
> >
> > This seems to "randomly" mix direct I/O and buffered I/O on a file.
>
> It isn't random: if the IO is DIO-aligned, it uses direct I/O.
>
> > That's basically asking for data corruption due to invalidation races.
>
> I've seen you speak of said dragons in other threads and even commit
> headers, etc. Could be they are lurking, but I took the approach of
> "implement it [this patchset] and see what breaks". It hasn't broken
> yet, despite my having thrown a large battery of testing at it (which
> includes all of Hammerspace's automated sanity testing that uses
> many testsuites, e.g. xfstests, mdtest, etc.).
>
I'm concerned here too. Invalidation races can mean silent data
corruption. We'll need to ensure that this is safe.
Incidentally, is there a good testcase for this? Something that does
buffered and direct I/O from different tasks and looks for
inconsistencies?
> But the IOR "hard" workload, which checks for corruption and uses
> 47008 blocksize to force excessive RMW, hasn't yet been run with my
> "[PATCH 6/6] NFSD: issue READs using O_DIRECT even if IO is
> misaligned" [0]. That IOR "hard" testing will likely happen today.
>
> > But maybe also explain what this is trying to address to start with?
>
> Ha, I suspect you saw my too-many-words 0th patch header [1] and
> ignored it? Solid feedback, I need to be more succinct and I'm
> probably too close to this work to see the gaps in introduction and
> justification but will refine, starting now:
>
> Overview: NFSD currently only uses buffered IO, and it routinely falls
> over due to the problems RWF_DONTCACHE was developed to work around.
> But RWF_DONTCACHE also uses buffered IO and the page cache, so it too
> suffers from inefficiencies that direct IO doesn't. Buffered IO's CPU
> and memory consumption is particularly unwanted on
> resource-constrained systems.
>
> Maybe some pictures are worth 1000+ words.
>
> Here is a flamegraph showing buffered IO causing reclaim to bring the
> system to a halt (when workload's working set far exceeds available
> memory): https://original.art/buffered_read.svg
>
> Here is flamegraph for the same type of workload but using DONTCACHE
> instead of normal buffered IO: https://original.art/dontcache_read.svg
>
> Dave Chinner provided his analysis of why DONTCACHE was struggling
> [2]. And I gave further context to others and forecast that I'd be
> working on implementing NFSD support for using O_DIRECT [3]. Then I
> discussed how to approach the implementation with Chuck, Jeff and
> others at the recent NFS Bakeathon. This series implements my take on
> what was discussed.
>
> This graph shows O_DIRECT vs buffered IO for the IOR "easy" workload
> ("easy" because it uses aligned 1 MiB IOs rather than 47008 bytes like
> IOR "hard"): https://original.art/NFSD_direct_vs_buffered_IO.jpg
>
> Buffered IO is generally worse across the board. DONTCACHE provides
> welcome reclaim storm relief without the alignment requirements of
> O_DIRECT but there really is no substitute for O_DIRECT if we're able
> to use it. My patchset shows NFSD can, and that doing so is much more
> deterministic and less resource-hungry.
>
> Direct I/O is definitely the direction we need to go, with DONTCACHE
> fallback for misaligned write IO (once it is able to delay its
> dropbehind to work better with misaligned IO).
>
> Mike
>
> [0]: https://lore.kernel.org/linux-nfs/20250610205737.63343-7-snitzer@kernel.org/
> [1]: https://lore.kernel.org/linux-nfs/20250610205737.63343-1-snitzer@kernel.org/
> [2]: https://lore.kernel.org/linux-nfs/aBrKbOoj4dgUvz8f@dread.disaster.area/
> [3]: https://lore.kernel.org/linux-nfs/aBvVltbDKdHXMtLL@kernel.org/
To summarize: the basic problem is that the pagecache is pretty useless
for satisfying READs from nfsd. Most NFS workloads don't involve I/O to
the same files from multiple clients. The client ends up having most of
the data in its cache already and only very rarely do we need to
revisit the data on the server.
At the same time, it's really easy to overwhelm the storage with
pagecache writeback with modern memory sizes. Having nfsd bypass the
pagecache altogether is potentially a huge performance win, if it can
be made to work safely.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-11 6:57 ` Christoph Hellwig
2025-06-11 10:44 ` Mike Snitzer
@ 2025-06-11 13:56 ` Chuck Lever
1 sibling, 0 replies; 75+ messages in thread
From: Chuck Lever @ 2025-06-11 13:56 UTC (permalink / raw)
To: Christoph Hellwig, Mike Snitzer
Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On 6/11/25 2:57 AM, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 04:57:32PM -0400, Mike Snitzer wrote:
>> Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
>> read or written by NFSD will either not be cached (thanks to O_DIRECT)
>> or will be removed from the page cache upon completion (DONTCACHE).
>>
>> enable-dontcache is 0 by default. It may be enabled with:
>> echo 1 > /sys/kernel/debug/nfsd/enable-dontcache
>
> Having this as a global debug-only interface feels a bit odd.
This interface is temporary. We expect to implement export options
to control it if that turns out to be necessary.
I don't feel we understand the full performance implications for
most workloads yet, so it seems premature to design and introduce a
permanent administrative interface.
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
` (6 preceding siblings ...)
2025-06-11 12:55 ` [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Jeff Layton
@ 2025-06-11 14:16 ` Chuck Lever
2025-06-11 18:02 ` Mike Snitzer
2025-06-12 13:46 ` Chuck Lever
8 siblings, 1 reply; 75+ messages in thread
From: Chuck Lever @ 2025-06-11 14:16 UTC (permalink / raw)
To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
On 6/10/25 4:57 PM, Mike Snitzer wrote:
> Hi,
>
> This series introduces 'enable-dontcache' to NFSD's debugfs interface,
> once enabled NFSD will selectively make use of O_DIRECT when issuing
> read and write IO:
> - all READs will use O_DIRECT (both aligned and misaligned)
> - all DIO-aligned WRITEs will use O_DIRECT (useful for SUNRPC RDMA)
> - misaligned WRITEs currently continue to use normal buffered IO
>
> Q: Why not actually use RWF_DONTCACHE (yet)?
> A:
> > If IO is properly DIO-aligned, or can be made to be, using
> O_DIRECT is preferred over DONTCACHE because of its reduced CPU and
> memory usage. Relative to NFSD using RWF_DONTCACHE for misaligned
> WRITEs, I've briefly discussed with Jens that follow-on dontcache work
> is needed to justify falling back to actually using RWF_DONTCACHE.
> Specifically, Hammerspace benchmarking has confirmed as Jeff Layton
> suggested at Bakeathon, we need dontcache to be enhanced to not
> immediately dropbehind when IO completes -- because it works against
> us (due to RMW needing to read without benefit of cache), whereas
> buffered IO enables misaligned IO to be more performant. Jens thought
> that delayed dropbehind is certainly doable but that he needed to
> reason through it further (so timing on availability is TBD). As soon
> as it is possible I'll happily switch NFSD's misaligned write IO
> fallback from normal buffered IO to actually using RWF_DONTCACHE.
>
> Continuing with what this patchset provides:
>
> NFSD now uses STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get and store
> DIO alignment attributes from underlying filesystem in associated
> nfsd_file. This is done when the nfsd_file is first opened for a
> regular file.
>
> A new RWF_DIRECT flag is added to include/uapi/linux/fs.h to allow
> NFSD to use O_DIRECT on a per-IO basis.
>
> If enable-dontcache=1 then RWF_DIRECT will be set for all READ IO
> (even if the IO is misaligned, thanks to expanding the read to be
> aligned for use with DIO, as suggested by Jeff and Chuck at the NFS
> Bakeathon held recently in Ann Arbor).
>
> NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to
> DIO alignment (both page and disk alignment). This works quite well
> for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it
> maps the WRITE payload into aligned pages. But more work is needed to
> be able to leverage O_DIRECT when SUNRPC's regular TCP transport is
> used. I spent quite a bit of time analyzing the existing xdr_buf code
> and NFSD's use of it. Unfortunately, the WRITE payload gets stored in
> misaligned pages such that O_DIRECT isn't possible without a copy
> (completely defeating the point). I'll reply to this cover letter to
> start a subthread to discuss how best to deal with misaligned write
> IO (by association with Hammerspace, I'm most interested in NFS v3).
>
> Performance benefits of using O_DIRECT in NFSD:
>
> Hammerspace's testbed was 10 NFS servers connected via 800Gbit
> RDMA networking (mlx5_core), each with 1TB of memory, 48 cores (2 NUMA
> nodes) and 8 ScaleFlux NVMe devices (each with two 3.5TB namespaces.
> Theoretical max for reads per NVMe device is 14GB/s, or ~7GB/s per
> namespace).
>
> And 10 client systems each running 64 IO threads.
>
> The O_DIRECT performance win is pretty fantastic thanks to reduced CPU
> and memory use, particularly for workloads with a working set that far
> exceeds the available memory of a given server. This patchset's
> > changes (through patch 5; patch 6 wasn't written until after the
> > benchmarking was performed) enabled Hammerspace to improve its IO500.org
> benchmark result (as submitted for this week's ISC 2025 in Hamburg,
> Germany) by 25%.
>
> That 25% improvement on IO500 is owed to NFS servers seeing:
> - reduced CPU usage from 100% to ~50%
> O_DIRECT:
> write: 51% idle, 25% system, 14% IO wait, 2% IRQ
> read: 55% idle, 9% system, 32.5% IO wait, 1.5% IRQ
> buffered:
> write: 17.8% idle, 67.5% system, 8% IO wait, 2% IRQ
> read: 3.29% idle, 94.2% system, 2.5% IO wait, 1% IRQ
>
> - reduced memory usage from just under 100% (987GiB for reads, 978GiB
> for writes) to only ~244 MB for cache+buffer use (for both reads and
> writes).
> - buffered would tip-over due to kswapd and kcompactd struggling to
> find free memory during reclaim.
>
> > - increased NVMe throughput when comparing O_DIRECT vs buffered:
> O_DIRECT: 8-10 GB/s for writes, 9-11.8 GB/s for reads
> buffered: 8 GB/s for writes, 4-5 GB/s for reads
>
> > - ability to support more IO threads per client system (from 48 to 64)
>
> The performance improvement highlight of the numerous individual tests
> > in the IO500 collection of benchmarks was in the IOR "easy" test:
>
> Write:
> O_DIRECT: [RESULT] ior-easy-write 420.351599 GiB/s : time 869.650 seconds
> CACHED: [RESULT] ior-easy-write 368.268722 GiB/s : time 413.647 seconds
>
> Read:
> O_DIRECT: [RESULT] ior-easy-read 446.790791 GiB/s : time 818.219 seconds
> CACHED: [RESULT] ior-easy-read 284.706196 GiB/s : time 534.950 seconds
>
> It is suspected that patch 6 in this patchset will improve IOR "hard"
> read results. The "hard" name comes from the fact that it performs all
> > IO using a misaligned blocksize of 47008 bytes (which happens to be
> the IO size I showed ftrace output for in the 6th patch's header).
>
> All review and discussion is welcome, thanks!
> Mike
>
> Mike Snitzer (6):
> NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
> NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
> NFSD: pass nfsd_file to nfsd_iter_read()
> fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis
> NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
> NFSD: issue READs using O_DIRECT even if IO is misaligned
>
> fs/nfsd/debugfs.c | 39 +++++++++++++
> fs/nfsd/filecache.c | 32 +++++++++++
> fs/nfsd/filecache.h | 4 ++
> fs/nfsd/nfs4xdr.c | 8 +--
> fs/nfsd/nfsd.h | 1 +
> fs/nfsd/trace.h | 37 +++++++++++++
> fs/nfsd/vfs.c | 111 ++++++++++++++++++++++++++++++++++---
> fs/nfsd/vfs.h | 17 +-----
> include/linux/fs.h | 2 +-
> include/linux/sunrpc/svc.h | 5 +-
> include/uapi/linux/fs.h | 5 +-
> 11 files changed, 231 insertions(+), 30 deletions(-)
>
Hey Mike!
There's a lot to digest here! A few general comments:
- Since this isn't a series that you intend I should apply immediately
to nfsd-next, let's mark subsequent postings with "RFC".
- Before diving into the history and design, your cover letter should
start with a clear problem statement. What are you trying to fix? I
think that might be what Christoph is missing in his comment on 5/6.
Maybe it's in the cover letter now, but it reads to me like the lede is
buried.
- In addition to the big iron results, I'd like to see benchmark results
for small I/O workloads, and workloads with slower persistent storage,
and workloads on slower network fabrics (ie, TCP).
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis
2025-06-11 6:58 ` Christoph Hellwig
2025-06-11 10:51 ` Mike Snitzer
@ 2025-06-11 14:17 ` Chuck Lever
2025-06-12 7:15 ` Christoph Hellwig
1 sibling, 1 reply; 75+ messages in thread
From: Chuck Lever @ 2025-06-11 14:17 UTC (permalink / raw)
To: Christoph Hellwig, Mike Snitzer
Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On 6/11/25 2:58 AM, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 04:57:35PM -0400, Mike Snitzer wrote:
>> Avoids the need to open code do_iter_readv_writev() purely to request
>> that a sync iocb make use of IOCB_DIRECT.
>>
>> Care was taken to preserve the long-established value for IOCB_DIRECT
>> (1 << 17) when introducing RWF_DIRECT.
>
> What is the problem with using vfs_iocb_iter_read instead of
> vfs_iter_read and passing the iocb directly?
Christoph, are you suggesting that nfsd_iter_read() should always
call vfs_iocb_iter_read() instead of vfs_iter_read()? That might be
a nice clean up in general.
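If so, something like the following is what I imagine (a sketch only,
not tested; "use_direct" stands in for whatever per-IO alignment
decision NFSD ends up making):

#include <linux/fs.h>

/*
 * Hypothetical variant of nfsd's read path that always builds a kiocb and
 * calls vfs_iocb_iter_read(), setting IOCB_DIRECT per-IO rather than
 * introducing a new RWF_* flag.
 */
static ssize_t nfsd_issue_iter_read(struct file *file, struct iov_iter *iter,
                                    loff_t offset, bool use_direct)
{
        struct kiocb kiocb;

        init_sync_kiocb(&kiocb, file);
        kiocb.ki_pos = offset;
        if (use_direct)
                kiocb.ki_flags |= IOCB_DIRECT;

        return vfs_iocb_iter_read(file, &kiocb, iter);
}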
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-10 20:57 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Mike Snitzer
2025-06-11 6:57 ` Christoph Hellwig
@ 2025-06-11 14:31 ` Chuck Lever
2025-06-11 19:18 ` Mike Snitzer
1 sibling, 1 reply; 75+ messages in thread
From: Chuck Lever @ 2025-06-11 14:31 UTC (permalink / raw)
To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
On 6/10/25 4:57 PM, Mike Snitzer wrote:
> Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
> read or written by NFSD will either not be cached (thanks to O_DIRECT)
> or will be removed from the page cache upon completion (DONTCACHE).
I thought we were going to do two switches: One for reads and one for
writes? I could be misremembering.
After all, you are describing two different facilities here: a form of
direct I/O for READs, and RWF_DONTCACHE for WRITEs (I think?).
> enable-dontcache is 0 by default. It may be enabled with:
> echo 1 > /sys/kernel/debug/nfsd/enable-dontcache
>
> FOP_DONTCACHE must be advertised as supported by the underlying
> filesystem (e.g. XFS), otherwise if/when 'enable-dontcache' is 1
> all IO flagged with RWF_DONTCACHE will fail with -EOPNOTSUPP.
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
> fs/nfsd/debugfs.c | 39 +++++++++++++++++++++++++++++++++++++++
> fs/nfsd/nfsd.h | 1 +
> fs/nfsd/vfs.c | 12 +++++++++++-
> 3 files changed, 51 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 84b0c8b559dc..8decdec60a8e 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -32,6 +32,42 @@ static int nfsd_dsr_set(void *data, u64 val)
>
> DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
>
> +/*
> + * /sys/kernel/debug/nfsd/enable-dontcache
> + *
> + * Contents:
> + * %0: NFS READ and WRITE are not allowed to use dontcache
> + * %1: NFS READ and WRITE are allowed to use dontcache
> + *
> + * NFSD's dontcache support reserves the right to use O_DIRECT
> + * if it chooses (instead of dontcache's usual pagecache-based
> + * dropbehind semantics).
> + *
> + * The default value of this setting is zero (dontcache is
> + * disabled). This setting takes immediate effect for all NFS
> + * versions, all exports, and in all NFSD net namespaces.
> + */
> +
> +static int nfsd_dontcache_get(void *data, u64 *val)
> +{
> + *val = nfsd_enable_dontcache ? 1 : 0;
> + return 0;
> +}
> +
> +static int nfsd_dontcache_set(void *data, u64 val)
> +{
> + if (val > 0) {
> + /* Must first also disable-splice-read */
> + nfsd_disable_splice_read = true;
> + nfsd_enable_dontcache = true;
> + } else
> + nfsd_enable_dontcache = false;
> + return 0;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dontcache_fops, nfsd_dontcache_get,
> + nfsd_dontcache_set, "%llu\n");
> +
> void nfsd_debugfs_exit(void)
> {
> debugfs_remove_recursive(nfsd_top_dir);
> @@ -44,4 +80,7 @@ void nfsd_debugfs_init(void)
>
> debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
> nfsd_top_dir, NULL, &nfsd_dsr_fops);
> +
> + debugfs_create_file("enable-dontcache", S_IWUSR | S_IRUGO,
> + nfsd_top_dir, NULL, &nfsd_dontcache_fops);
> }
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 1bfd0b4e9af7..00546547eae6 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -155,6 +155,7 @@ static inline void nfsd_debugfs_exit(void) {}
> #endif
>
> extern bool nfsd_disable_splice_read __read_mostly;
> +extern bool nfsd_enable_dontcache __read_mostly;
>
> extern int nfsd_max_blksize;
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 7d94fae1dee8..bba3e6f4f56b 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -49,6 +49,7 @@
> #define NFSDDBG_FACILITY NFSDDBG_FILEOP
>
> bool nfsd_disable_splice_read __read_mostly;
> +bool nfsd_enable_dontcache __read_mostly;
>
> /**
> * nfserrno - Map Linux errnos to NFS errnos
> @@ -1086,6 +1087,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> unsigned long v, total;
> struct iov_iter iter;
> loff_t ppos = offset;
> + rwf_t flags = 0;
> ssize_t host_err;
> size_t len;
>
> @@ -1103,7 +1105,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> - host_err = vfs_iter_read(file, &iter, &ppos, 0);
> +
> + if (nfsd_enable_dontcache)
> + flags |= RWF_DONTCACHE;
Two things:
- Maybe NFSD should record whether the file system is DONTCACHE-enabled
in @fhp or in the export it is associated with, and then check that
setting here before asserting RWF_DONTCACHE
- I thought we were going with O_DIRECT for READs.
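For the first point, something along these lines is what I had in mind
(nf_can_dontcache is a made-up field name, purely illustrative):

#include <linux/fs.h>

/*
 * Hypothetical: decide once, when the nfsd_file is opened, whether the
 * underlying filesystem advertises FOP_DONTCACHE, so the IO paths never
 * trip over -EOPNOTSUPP.
 */
static void nfsd_file_note_dontcache(struct nfsd_file *nf, struct file *file)
{
        nf->nf_can_dontcache = !!(file->f_op->fop_flags & FOP_DONTCACHE);
}

The read and write paths would then check nf->nf_can_dontcache alongside
the debugfs setting before asserting RWF_DONTCACHE.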
> +
> + host_err = vfs_iter_read(file, &iter, &ppos, flags);
> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> }
>
> @@ -1209,6 +1215,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +
> + if (nfsd_enable_dontcache)
> + flags |= RWF_DONTCACHE;
> +
> since = READ_ONCE(file->f_wb_err);
> if (verf)
> nfsd_copy_write_verifier(verf, nn);
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-10 20:57 ` [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes Mike Snitzer
2025-06-11 7:00 ` Christoph Hellwig
@ 2025-06-11 14:42 ` Chuck Lever
2025-06-11 15:07 ` Jeff Layton
1 sibling, 1 reply; 75+ messages in thread
From: Chuck Lever @ 2025-06-11 14:42 UTC (permalink / raw)
To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
On 6/10/25 4:57 PM, Mike Snitzer wrote:
> IO must be aligned, otherwise it falls back to using buffered IO.
>
> RWF_DONTCACHE is _not_ currently used for misaligned IO (even when
> nfsd/enable-dontcache=1) because it works against us (due to RMW
> needing to read without benefit of cache), whereas buffered IO enables
> misaligned IO to be more performant.
>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
> fs/nfsd/vfs.c | 40 ++++++++++++++++++++++++++++++++++++----
> 1 file changed, 36 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index e7cc8c6dfbad..a942609e3ab9 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1064,6 +1064,22 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> }
>
> +static bool is_dio_aligned(const struct iov_iter *iter, loff_t offset,
> + const u32 blocksize)
> +{
> + u32 blocksize_mask;
> +
> + if (!blocksize)
> + return false;
> +
> + blocksize_mask = blocksize - 1;
> + if ((offset & blocksize_mask) ||
> + (iov_iter_alignment(iter) & blocksize_mask))
> + return false;
> +
> + return true;
> +}
> +
> /**
> * nfsd_iter_read - Perform a VFS read using an iterator
> * @rqstp: RPC transaction context
> @@ -1107,8 +1123,16 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
>
> - if (nfsd_enable_dontcache)
> - flags |= RWF_DONTCACHE;
> + if (nfsd_enable_dontcache) {
> + if (is_dio_aligned(&iter, offset, nf->nf_dio_read_offset_align))
> + flags |= RWF_DIRECT;
> + /* FIXME: not using RWF_DONTCACHE for misaligned IO because it works
> + * against us (due to RMW needing to read without benefit of cache),
> + * whereas buffered IO enables misaligned IO to be more performant.
> + */
> + //else
> + // flags |= RWF_DONTCACHE;
> + }
>
> host_err = vfs_iter_read(file, &iter, &ppos, flags);
> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> @@ -1217,8 +1241,16 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
>
> - if (nfsd_enable_dontcache)
> - flags |= RWF_DONTCACHE;
> + if (nfsd_enable_dontcache) {
> + if (is_dio_aligned(&iter, offset, nf->nf_dio_offset_align))
> + flags |= RWF_DIRECT;
> + /* FIXME: not using RWF_DONTCACHE for misaligned IO because it works
> + * against us (due to RMW needing to read without benefit of cache),
> + * whereas buffered IO enables misaligned IO to be more performant.
> + */
> + //else
> + // flags |= RWF_DONTCACHE;
> + }
IMO adding RWF_DONTCACHE first then replacing it later in the series
with a form of O_DIRECT is confusing. Also, why add RWF_DONTCACHE here
and then take it away "because it doesn't work"?
But OK, your series is really a proof-of-concept. Something to work out
before it is merge-ready, I guess.
It is much more likely for NFS READ requests to be properly aligned.
Clients are generally good about that. NFS WRITE request alignment
is going to be arbitrary. Fwiw.
However, one thing we discussed at bake-a-thon was what to do about
unstable WRITEs. For unstable WRITEs, the server has to cache the
write data at least until the client sends a COMMIT. Otherwise the
server will have to convert all UNSTABLE writes to FILE_SYNC writes,
and that can have performance implications.
One thing you might consider is to continue using the page cache for
unstable WRITEs, and then use fadvise DONTNEED after a successful
COMMIT operation to reduce page cache footprint. Unstable writes to
the same range of the file might be a problem, however.
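Roughly like this (the helper and hook point are illustrative; it
assumes vfs_fadvise() is usable from NFSD):

#include <linux/fs.h>
#include <linux/fadvise.h>

/*
 * Sketch: once a COMMIT has made the range durable (e.g. after a
 * successful vfs_fsync_range()), ask the VM to drop the now-clean
 * pagecache pages for that range.
 */
static void nfsd_dropbehind_after_commit(struct file *file,
                                         loff_t offset, loff_t count)
{
        vfs_fadvise(file, offset, count, POSIX_FADV_DONTNEED);
}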
> since = READ_ONCE(file->f_wb_err);
> if (verf)
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-11 14:42 ` Chuck Lever
@ 2025-06-11 15:07 ` Jeff Layton
2025-06-11 15:11 ` Chuck Lever
2025-06-12 7:25 ` Christoph Hellwig
0 siblings, 2 replies; 75+ messages in thread
From: Jeff Layton @ 2025-06-11 15:07 UTC (permalink / raw)
To: Chuck Lever, Mike Snitzer; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
On Wed, 2025-06-11 at 10:42 -0400, Chuck Lever wrote:
> On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > IO must be aligned, otherwise it falls back to using buffered IO.
> >
> > RWF_DONTCACHE is _not_ currently used for misaligned IO (even when
> > nfsd/enable-dontcache=1) because it works against us (due to RMW
> > needing to read without benefit of cache), whereas buffered IO enables
> > misaligned IO to be more performant.
> >
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> > [ ... diffstat and patch hunks snipped; quoted in full earlier in the thread ... ]
>
> IMO adding RWF_DONTCACHE first then replacing it later in the series
> with a form of O_DIRECT is confusing. Also, why add RWF_DONTCACHE here
> and then take it away "because it doesn't work"?
>
> But OK, your series is really a proof-of-concept. Something to work out
> before it is merge-ready, I guess.
>
> It is much more likely for NFS READ requests to be properly aligned.
> Clients are generally good about that. NFS WRITE request alignment
> is going to be arbitrary. Fwiw.
>
> However, one thing we discussed at bake-a-thon was what to do about
> unstable WRITEs. For unstable WRITEs, the server has to cache the
> write data at least until the client sends a COMMIT. Otherwise the
> server will have to convert all UNSTABLE writes to FILE_SYNC writes,
> and that can have performance implications.
>
If we're doing synchronous, direct I/O writes then why not just respond
with FILE_SYNC? The write should be on the platter by the time it
returns.
> One thing you might consider is to continue using the page cache for
> unstable WRITEs, and then use fadvise DONTNEED after a successful
> COMMIT operation to reduce page cache footprint. Unstable writes to
> the same range of the file might be a problem, however.
Since the client sends almost everything UNSTABLE, that would probably
erase most of the performance win. The only reason I can see to use
buffered I/O in this mode would be because we had to deal with an
unaligned write and need to do a RMW cycle on a block.
The big question is whether mixing buffered and direct I/O writes like
this is safe across all exportable filesystems. I'm not yet convinced
of that.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-11 15:07 ` Jeff Layton
@ 2025-06-11 15:11 ` Chuck Lever
2025-06-11 15:44 ` Jeff Layton
2025-06-12 7:28 ` Christoph Hellwig
2025-06-12 7:25 ` Christoph Hellwig
1 sibling, 2 replies; 75+ messages in thread
From: Chuck Lever @ 2025-06-11 15:11 UTC (permalink / raw)
To: Jeff Layton, Mike Snitzer; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
On 6/11/25 11:07 AM, Jeff Layton wrote:
> On Wed, 2025-06-11 at 10:42 -0400, Chuck Lever wrote:
>> On 6/10/25 4:57 PM, Mike Snitzer wrote:
>>> IO must be aligned, otherwise it falls back to using buffered IO.
>>>
>>> RWF_DONTCACHE is _not_ currently used for misaligned IO (even when
>>> nfsd/enable-dontcache=1) because it works against us (due to RMW
>>> needing to read without benefit of cache), whereas buffered IO enables
>>> misaligned IO to be more performant.
>>>
>>> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>>> ---
>>> [ ... diffstat and patch hunks snipped; quoted in full earlier in the thread ... ]
>>
>> IMO adding RWF_DONTCACHE first then replacing it later in the series
>> with a form of O_DIRECT is confusing. Also, why add RWF_DONTCACHE here
>> and then take it away "because it doesn't work"?
>>
>> But OK, your series is really a proof-of-concept. Something to work out
>> before it is merge-ready, I guess.
>>
>> It is much more likely for NFS READ requests to be properly aligned.
>> Clients are generally good about that. NFS WRITE request alignment
>> is going to be arbitrary. Fwiw.
>>
>> However, one thing we discussed at bake-a-thon was what to do about
>> unstable WRITEs. For unstable WRITEs, the server has to cache the
>> write data at least until the client sends a COMMIT. Otherwise the
>> server will have to convert all UNSTABLE writes to FILE_SYNC writes,
>> and that can have performance implications.
>>
>
> If we're doing synchronous, direct I/O writes then why not just respond
> with FILE_SYNC? The write should be on the platter by the time it
> returns.
Because "platter". On some devices, writes are slow.
For some workloads, unstable is faster. I have an experimental series
that makes NFSD convert all NFS WRITEs to FILE_SYNC. It was not an
across the board win, even with an NVMe-backed file system.
>> One thing you might consider is to continue using the page cache for
>> unstable WRITEs, and then use fadvise DONTNEED after a successful
>> COMMIT operation to reduce page cache footprint. Unstable writes to
>> the same range of the file might be a problem, however.
>
> Since the client sends almost everything UNSTABLE, that would probably
> erase most of the performance win. The only reason I can see to use
> buffered I/O in this mode would be because we had to deal with an
> unaligned write and need to do a RMW cycle on a block.
>
> The big question is whether mixing buffered and direct I/O writes like
> this is safe across all exportable filesystems. I'm not yet convinced
> of that.
Agreed, that deserves careful scrutiny.
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-11 15:11 ` Chuck Lever
@ 2025-06-11 15:44 ` Jeff Layton
2025-06-11 20:51 ` Mike Snitzer
2025-06-12 7:32 ` Christoph Hellwig
2025-06-12 7:28 ` Christoph Hellwig
1 sibling, 2 replies; 75+ messages in thread
From: Jeff Layton @ 2025-06-11 15:44 UTC (permalink / raw)
To: Chuck Lever, Mike Snitzer; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
On Wed, 2025-06-11 at 11:11 -0400, Chuck Lever wrote:
> On 6/11/25 11:07 AM, Jeff Layton wrote:
> > On Wed, 2025-06-11 at 10:42 -0400, Chuck Lever wrote:
> > > On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > > > IO must be aligned, otherwise it falls back to using buffered IO.
> > > >
> > > > RWF_DONTCACHE is _not_ currently used for misaligned IO (even when
> > > > nfsd/enable-dontcache=1) because it works against us (due to RMW
> > > > needing to read without benefit of cache), whereas buffered IO enables
> > > > misaligned IO to be more performant.
> > > >
> > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > ---
> > > > [ ... diffstat and patch hunks snipped; quoted in full earlier in the thread ... ]
> > >
> > > IMO adding RWF_DONTCACHE first then replacing it later in the series
> > > with a form of O_DIRECT is confusing. Also, why add RWF_DONTCACHE here
> > > and then take it away "because it doesn't work"?
> > >
> > > But OK, your series is really a proof-of-concept. Something to work out
> > > before it is merge-ready, I guess.
> > >
> > > It is much more likely for NFS READ requests to be properly aligned.
> > > Clients are generally good about that. NFS WRITE request alignment
> > > is going to be arbitrary. Fwiw.
> > >
> > > However, one thing we discussed at bake-a-thon was what to do about
> > > unstable WRITEs. For unstable WRITEs, the server has to cache the
> > > write data at least until the client sends a COMMIT. Otherwise the
> > > server will have to convert all UNSTABLE writes to FILE_SYNC writes,
> > > and that can have performance implications.
> > >
> >
> > If we're doing synchronous, direct I/O writes then why not just respond
> > with FILE_SYNC? The write should be on the platter by the time it
> > returns.
>
> Because "platter". On some devices, writes are slow.
>
> For some workloads, unstable is faster. I have an experimental series
> that makes NFSD convert all NFS WRITEs to FILE_SYNC. It was not an
> across the board win, even with an NVMe-backed file system.
>
Presumably, those devices wouldn't be exported in this mode. That's
probably a good argument for making this settable on a per-export
basis.
>
> > > One thing you might consider is to continue using the page cache for
> > > unstable WRITEs, and then use fadvise DONTNEED after a successful
> > > COMMIT operation to reduce page cache footprint. Unstable writes to
> > > the same range of the file might be a problem, however.
> >
> > Since the client sends almost everything UNSTABLE, that would probably
> > erase most of the performance win. The only reason I can see to use
> > buffered I/O in this mode would be because we had to deal with an
> > unaligned write and need to do a RMW cycle on a block.
> >
> > The big question is whether mixing buffered and direct I/O writes like
> > this is safe across all exportable filesystems. I'm not yet convinced
> > of that.
>
> Agreed, that deserves careful scrutiny.
>
Like Mike is asking though, I need a better understanding of the
potential races here:
XFS, for instance, takes the i_rwsem shared around dio writes and
exclusive around buffered, so they should exclude each other. If we did
all the buffered writes as RWF_SYNC, would that prevent corruption?
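To illustrate the locking pattern I'm referring to (a simplified sketch
of the idea, not actual XFS code):

#include <linux/fs.h>

static ssize_t sketch_buffered_write(struct kiocb *iocb, struct iov_iter *from)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        ssize_t ret;

        inode_lock(inode);              /* exclusive: excludes DIO writers */
        ret = generic_perform_write(iocb, from);
        inode_unlock(inode);
        return ret;
}

static ssize_t sketch_dio_write_aligned(struct kiocb *iocb, struct iov_iter *from)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        ssize_t ret;

        inode_lock_shared(inode);       /* shared: concurrent DIO writers are OK */
        ret = generic_file_direct_write(iocb, from);
        inode_unlock_shared(inode);
        return ret;
}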
In any case, for now at least, unless you're using RDMA, it's going to
end up falling back to buffered writes everywhere. The data is almost
never going to be properly aligned coming in off the wire. That might
be fixable though.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
2025-06-11 14:16 ` Chuck Lever
@ 2025-06-11 18:02 ` Mike Snitzer
2025-06-11 19:06 ` Chuck Lever
0 siblings, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-11 18:02 UTC (permalink / raw)
To: Chuck Lever; +Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On Wed, Jun 11, 2025 at 10:16:39AM -0400, Chuck Lever wrote:
> On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > [ ... full cover letter snipped; quoted in full earlier in the thread ... ]
>
>
> Hey Mike!
>
> There's a lot to digest here!
For sure. Thanks for working through it. My hope is it resonates and
is meaningful to digest *after* reading through my lede-burying cover
letter *and* the patches themselves. Let it wash over you.
It only adds 200 lines of change; folding patches might reduce
weirdness. How best to sequence and fold changes will be useful
feedback.
> A few general comments:
>
> - Since this isn't a series that you intend I should apply immediately
> to nfsd-next, let's mark subsequent postings with "RFC".
Yeah, my first posting should've been RFC.
But I'd be in favor of working with urgency so that by v6.16-rc4/5 you
and Jeff are fine with it for the 6.17 merge window.
> - Before diving into the history and design, your cover letter should
> start with a clear problem statement. What are you trying to fix? I
> think that might be what Christoph is missing in his comment on 5/6.
> Maybe it's in the cover letter now, but it reads to me like the lede is
> buried.
Yeah, I struggled/struggle to distill sweeping work with various
talking points down into a concise and natural flow. Probably should have
taken my cover letter and fed it to some AI. ;)
> - In addition to the big iron results, I'd like to see benchmark results
> for small I/O workloads, and workloads with slower persistent storage,
> and workloads on slower network fabrics (ie, TCP).
It is opt-in, so thankfully not every class of use case needs to be
something I've covered personally. I'd welcome others to discover how
this work impacts their workload of choice.
But yeah, TCP with virt systems in the lab is what I developed against
for quite a while. It exposed that write IO is never aligned, so I
eventually put that to one side because the initial target use was on
a cluster with a more capable RDMA network.
Thanks,
Mike
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
2025-06-11 18:02 ` Mike Snitzer
@ 2025-06-11 19:06 ` Chuck Lever
2025-06-11 19:58 ` Mike Snitzer
0 siblings, 1 reply; 75+ messages in thread
From: Chuck Lever @ 2025-06-11 19:06 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On 6/11/25 2:02 PM, Mike Snitzer wrote:
> On Wed, Jun 11, 2025 at 10:16:39AM -0400, Chuck Lever wrote:
>> A few general comments:
>>
>> - Since this isn't a series that you intend I should apply immediately
>> to nfsd-next, let's mark subsequent postings with "RFC".
>
> Yeah, my first posting should've been RFC.
>
> But I'd be in favor of working with urgency so that by v6.16-rc4/5 you
> and Jeff are fine with it for the 6.17 merge window.
Since this series doesn't fix a crasher or security bug, and since I
have plenty of other swords in the forge, I can't commit to a
particular landing spot yet.
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-11 14:31 ` Chuck Lever
@ 2025-06-11 19:18 ` Mike Snitzer
2025-06-11 20:29 ` Jeff Layton
2025-06-12 13:21 ` Chuck Lever
0 siblings, 2 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-11 19:18 UTC (permalink / raw)
To: Chuck Lever; +Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
> On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
> > read or written by NFSD will either not be cached (thanks to O_DIRECT)
> > or will be removed from the page cache upon completion (DONTCACHE).
>
> I thought we were going to do two switches: One for reads and one for
> writes? I could be misremembering.
We did discuss the possibility of doing that. Still can-do if that's
what you'd prefer.
> After all, you are describing two different facilities here: a form of
> direct I/O for READs, and RWF_DONTCACHE for WRITEs (I think?).
My thinking was NFSD doesn't need to provide faithful pure
RWF_DONTCACHE if it really doesn't make sense. But the "dontcache"
name can be (ab)used by NFSD to define it how it sees fit (O_DIRECT
doesn't cache so it seems fair). What I arrived at with this patchset
is how I described in my cover letter:
When 'enable-dontcache' is used:
- all READs will use O_DIRECT (both DIO-aligned and misaligned)
- all DIO-aligned WRITEs will use O_DIRECT (useful for SUNRPC RDMA)
- misaligned WRITEs currently continue to use normal buffered IO
But we reserve the right to iterate on the implementation details as
we see fit. Still using the umbrella of 'dontcache'.
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 7d94fae1dee8..bba3e6f4f56b 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -49,6 +49,7 @@
> > #define NFSDDBG_FACILITY NFSDDBG_FILEOP
> >
> > bool nfsd_disable_splice_read __read_mostly;
> > +bool nfsd_enable_dontcache __read_mostly;
> >
> > /**
> > * nfserrno - Map Linux errnos to NFS errnos
> > @@ -1086,6 +1087,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > unsigned long v, total;
> > struct iov_iter iter;
> > loff_t ppos = offset;
> > + rwf_t flags = 0;
> > ssize_t host_err;
> > size_t len;
> >
> > @@ -1103,7 +1105,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >
> > trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> > iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> > - host_err = vfs_iter_read(file, &iter, &ppos, 0);
> > +
> > + if (nfsd_enable_dontcache)
> > + flags |= RWF_DONTCACHE;
>
> Two things:
>
> - Maybe NFSD should record whether the file system is DONTCACHE-enabled
> in @fhp or in the export it is associated with, and then check that
> setting here before asserting RWF_DONTCACHE
Sure, that'd be safer than allowing RWF_DONTCACHE to be tried only to
get EOPNOTSUPP because the underlying filesystem doesn't enable
support.
We could follow what I did with nfsd_file, only storing the dio_*
alignment data retrieved from statx IFF 'enable-dontcache' was enabled
at the time the nfsd_file was opened.
By adding a check for FOP_DONTCACHE being set in the underlying
filesystem. But as-is, we're not actually using RWF_DONTCACHE in the
final form of what I've provided in this patchset, so we can easily
circle back to
adding this if/when we do decide to use RWF_DONTCACHE.
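For reference, the open-time capture looks roughly like this (a sketch;
nf_dio_mem_align and the helper name are illustrative, and error
handling is trimmed):

#include <linux/fcntl.h>
#include <linux/fs.h>
#include <linux/stat.h>

/* Query the DIO alignment limits once and stash them in the nfsd_file
 * for the per-IO alignment checks. */
static void nfsd_file_get_dio_attrs(struct nfsd_file *nf, struct file *file)
{
        struct kstat stat;

        if (vfs_getattr(&file->f_path, &stat,
                        STATX_DIOALIGN | STATX_DIO_READ_ALIGN,
                        AT_STATX_SYNC_AS_STAT))
                return;

        if (stat.result_mask & STATX_DIOALIGN) {
                nf->nf_dio_mem_align = stat.dio_mem_align;
                nf->nf_dio_offset_align = stat.dio_offset_align;
        }
        if (stat.result_mask & STATX_DIO_READ_ALIGN)
                nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
}

The same spot would be a natural place to note FOP_DONTCACHE support
if/when we do start using RWF_DONTCACHE.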
> - I thought we were going with O_DIRECT for READs.
Yes, this is just an intermediate patch that goes away in later
patches. I was more focused on a minimal patch to get the
'enable-dontcache' debugfs interface in place, and then tweaking it to
its ultimate form in a later patch.
I put in place a more general framework that can evolve... it being
more free-form (e.g. "don't worry your pretty head about the
implementation details, we'll worry for you").
Causes some reviewer angst I suppose, so I can just fold patches to do
away with unused intermediate state.
Mike
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
2025-06-11 19:06 ` Chuck Lever
@ 2025-06-11 19:58 ` Mike Snitzer
0 siblings, 0 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-11 19:58 UTC (permalink / raw)
To: Chuck Lever; +Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On Wed, Jun 11, 2025 at 03:06:34PM -0400, Chuck Lever wrote:
> On 6/11/25 2:02 PM, Mike Snitzer wrote:
> > On Wed, Jun 11, 2025 at 10:16:39AM -0400, Chuck Lever wrote:
> >> A few general comments:
> >>
> >> - Since this isn't a series that you intend I should apply immediately
> >> to nfsd-next, let's mark subsequent postings with "RFC".
> >
> > Yeah, my first posting should've been RFC.
> >
> > But I'd be in favor of working with urgency so that by v6.16-rc4/5 you
> > and Jeff are fine with it for the 6.17 merge window.
>
> Since this series doesn't fix a crasher or security bug, and since I
> have plenty of other swords in the forge, I can't commit to a
> particular landing spot yet.
Completely understood; you asked my intention, so I spoke to it in my
reply. Obviously we just take it as it comes and see how things go.
Thanks,
Mike
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-11 19:18 ` Mike Snitzer
@ 2025-06-11 20:29 ` Jeff Layton
2025-06-11 21:36 ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] Mike Snitzer
2025-06-12 7:13 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Christoph Hellwig
2025-06-12 13:21 ` Chuck Lever
1 sibling, 2 replies; 75+ messages in thread
From: Jeff Layton @ 2025-06-11 20:29 UTC (permalink / raw)
To: Mike Snitzer, Chuck Lever; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
On Wed, 2025-06-11 at 15:18 -0400, Mike Snitzer wrote:
> On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
> > On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > > Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
> > > read or written by NFSD will either not be cached (thanks to O_DIRECT)
> > > or will be removed from the page cache upon completion (DONTCACHE).
> >
> > I thought we were going to do two switches: One for reads and one for
> > writes? I could be misremembering.
>
> We did discuss the possibility of doing that. Still can-do if that's
> what you'd prefer.
>
Having them as separate controls in debugfs is fine for
experimentation's sake, but I imagine we'll need to be all-in one way
or the other with a real interface.
I think if we can crack the problem of receiving WRITE payloads into an
already-aligned buffer, then that becomes much more feasible. I think
that's a solvable problem.
> > After all, you are describing two different facilities here: a form of
> > direct I/O for READs, and RWF_DONTCACHE for WRITEs (I think?).
>
> My thinking was NFSD doesn't need to provide faithful pure
> RWF_DONTCACHE if it really doesn't make sense. But the "dontcache"
> name can be (ab)used by NFSD to define it how it sees fit (O_DIRECT
> doesn't cache so it seems fair). What I arrived at with this patchset
> is how I described in my cover letter:
>
> When 'enable-dontcache' is used:
> - all READs will use O_DIRECT (both DIO-aligned and misaligned)
> - all DIO-aligned WRITEs will use O_DIRECT (useful for SUNRPC RDMA)
> - misaligned WRITEs currently continue to use normal buffered IO
>
> But we reserve the right to iterate on the implementation details as
> we see fit. Still using the umbrella of 'dontcache'.
>
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index 7d94fae1dee8..bba3e6f4f56b 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -49,6 +49,7 @@
> > > #define NFSDDBG_FACILITY NFSDDBG_FILEOP
> > >
> > > bool nfsd_disable_splice_read __read_mostly;
> > > +bool nfsd_enable_dontcache __read_mostly;
> > >
> > > /**
> > > * nfserrno - Map Linux errnos to NFS errnos
> > > @@ -1086,6 +1087,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > unsigned long v, total;
> > > struct iov_iter iter;
> > > loff_t ppos = offset;
> > > + rwf_t flags = 0;
> > > ssize_t host_err;
> > > size_t len;
> > >
> > > @@ -1103,7 +1105,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > >
> > > trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> > > iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> > > - host_err = vfs_iter_read(file, &iter, &ppos, 0);
> > > +
> > > + if (nfsd_enable_dontcache)
> > > + flags |= RWF_DONTCACHE;
> >
> > Two things:
> >
> > - Maybe NFSD should record whether the file system is DONTCACHE-enabled
> > in @fhp or in the export it is associated with, and then check that
> > setting here before asserting RWF_DONTCACHE
>
> Sure, that'd be safer than allowing RWF_DONTCACHE to be tried only to
> get EOPNOTSUPP because the underlying filesystem doesn't enable
> support.
>
> I could follow what I did with nfsd_file, which only stores the dio_*
> alignment data retrieved from statx IFF 'enable-dontcache' was enabled
> at the time the nfsd_file was opened, by adding a check for
> FOP_DONTCACHE being set in the underlying filesystem.
>
> But as-is, we're not actually using RWF_DONTCACHE in the final form of
> what this patchset provides, so we can easily circle back to adding
> this if/when we do decide to use RWF_DONTCACHE.
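For concreteness, such a check would be minimal -- a sketch only, assuming
the FOP_DONTCACHE bit in file_operations::fop_flags (the helper name below
is made up, not something from the patchset):

	static inline bool nfsd_file_may_dontcache(struct nfsd_file *nf)
	{
		/* Only assert RWF_DONTCACHE when the underlying filesystem
		 * advertises support, avoiding a needless -EOPNOTSUPP. */
		return nf->nf_file->f_op->fop_flags & FOP_DONTCACHE;
	}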
>
> > - I thought we were going with O_DIRECT for READs.
>
> Yes, this is just an intermediate patch that goes away in later
> patches. I was more focused on a minimal patch to get the
> 'enable-dontcache' debugfs interface in place, then tweaking it to its
> ultimate form in a later patch.
>
> I put in place a more general framework that can evolve... it being
> more free-form (e.g. "don't worry your pretty head about the
> implementation details, we'll worry for you").
>
> Causes some reviewer angst I suppose, so I can just fold patches to do
> away with unused intermediate state.
>
> Mike
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-11 15:44 ` Jeff Layton
@ 2025-06-11 20:51 ` Mike Snitzer
2025-06-12 7:32 ` Christoph Hellwig
1 sibling, 0 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-11 20:51 UTC (permalink / raw)
To: Jeff Layton; +Cc: Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe, bcodding
On Wed, Jun 11, 2025 at 11:44:29AM -0400, Jeff Layton wrote:
> On Wed, 2025-06-11 at 11:11 -0400, Chuck Lever wrote:
> > On 6/11/25 11:07 AM, Jeff Layton wrote:
> > > On Wed, 2025-06-11 at 10:42 -0400, Chuck Lever wrote:
> > > > On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > > > > IO must be aligned, otherwise it falls back to using buffered IO.
> > > > >
> > > > > RWF_DONTCACHE is _not_ currently used for misaligned IO (even when
> > > > > nfsd/enable-dontcache=1) because it works against us (due to RMW
> > > > > needing to read without benefit of cache), whereas buffered IO enables
> > > > > misaligned IO to be more performant.
> > > > >
> > > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > > ---
> > > > > fs/nfsd/vfs.c | 40 ++++++++++++++++++++++++++++++++++++----
> > > > > 1 file changed, 36 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > > > index e7cc8c6dfbad..a942609e3ab9 100644
> > > > > --- a/fs/nfsd/vfs.c
> > > > > +++ b/fs/nfsd/vfs.c
> > > > > @@ -1064,6 +1064,22 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > > > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > > > > }
> > > > >
> > > > > +static bool is_dio_aligned(const struct iov_iter *iter, loff_t offset,
> > > > > + const u32 blocksize)
> > > > > +{
> > > > > + u32 blocksize_mask;
> > > > > +
> > > > > + if (!blocksize)
> > > > > + return false;
> > > > > +
> > > > > + blocksize_mask = blocksize - 1;
> > > > > + if ((offset & blocksize_mask) ||
> > > > > + (iov_iter_alignment(iter) & blocksize_mask))
> > > > > + return false;
> > > > > +
> > > > > + return true;
> > > > > +}
> > > > > +
> > > > > /**
> > > > > * nfsd_iter_read - Perform a VFS read using an iterator
> > > > > * @rqstp: RPC transaction context
> > > > > @@ -1107,8 +1123,16 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > > > trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> > > > > iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> > > > >
> > > > > - if (nfsd_enable_dontcache)
> > > > > - flags |= RWF_DONTCACHE;
> > > > > + if (nfsd_enable_dontcache) {
> > > > > + if (is_dio_aligned(&iter, offset, nf->nf_dio_read_offset_align))
> > > > > + flags |= RWF_DIRECT;
> > > > > + /* FIXME: not using RWF_DONTCACHE for misaligned IO because it works
> > > > > + * against us (due to RMW needing to read without benefit of cache),
> > > > > + * whereas buffered IO enables misaligned IO to be more performant.
> > > > > + */
> > > > > + //else
> > > > > + // flags |= RWF_DONTCACHE;
> > > > > + }
> > > > >
> > > > > host_err = vfs_iter_read(file, &iter, &ppos, flags);
> > > > > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > > > > @@ -1217,8 +1241,16 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > > > nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> > > > > iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> > > > >
> > > > > - if (nfsd_enable_dontcache)
> > > > > - flags |= RWF_DONTCACHE;
> > > > > + if (nfsd_enable_dontcache) {
> > > > > + if (is_dio_aligned(&iter, offset, nf->nf_dio_offset_align))
> > > > > + flags |= RWF_DIRECT;
> > > > > + /* FIXME: not using RWF_DONTCACHE for misaligned IO because it works
> > > > > + * against us (due to RMW needing to read without benefit of cache),
> > > > > + * whereas buffered IO enables misaligned IO to be more performant.
> > > > > + */
> > > > > + //else
> > > > > + // flags |= RWF_DONTCACHE;
> > > > > + }
> > > >
> > > > IMO adding RWF_DONTCACHE first then replacing it later in the series
> > > > with a form of O_DIRECT is confusing. Also, why add RWF_DONTCACHE here
> > > > and then take it away "because it doesn't work"?
I spoke to this in a previous reply. I can fold patches to eliminate
this distraction in v2.
> > > > But OK, your series is really a proof-of-concept. Something to work out
> > > > before it is merge-ready, I guess.
> > > >
> > > > It is much more likely for NFS READ requests to be properly aligned.
> > > > Clients are generally good about that. NFS WRITE request alignment
> > > > is going to be arbitrary. Fwiw.
Correct, thankfully TCP reads don't misalign their payload the way TCP
writes do. As you know, the value of patch 6 is that application IO
that generates misaligned reads (as a side effect of a misaligned read
blocksize, e.g. IOR hard's 47008-byte blocksize) can still be issued
using O_DIRECT.
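For reference, the read expansion itself is just round-down/round-up
arithmetic; a minimal sketch (not the actual patch 6 code; offset, count
and dio_align are stand-ins, and dio_align is assumed to be a power of two):

	loff_t aligned_start = round_down(offset, dio_align);
	loff_t aligned_end   = round_up(offset + count, dio_align);
	size_t aligned_count = aligned_end - aligned_start;

	/* Issue the O_DIRECT read for [aligned_start, aligned_end) and
	 * then return only the originally requested bytes to the client. */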
> > > > However, one thing we discussed at bake-a-thon was what to do about
> > > > unstable WRITEs. For unstable WRITEs, the server has to cache the
> > > > write data at least until the client sends a COMMIT. Otherwise the
> > > > server will have to convert all UNSTABLE writes to FILE_SYNC writes,
> > > > and that can have performance implications.
> > > >
> > >
> > > If we're doing synchronous, direct I/O writes then why not just respond
> > > with FILE_SYNC? The write should be on the platter by the time it
> > > returns.
For v2 I'll look to formalize responding with FILE_SYNC when
'enable-dontcache' is set.
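Roughly speaking (hand-waving, since the exact plumbing depends on where
the O_DIRECT decision is made), the idea is just to upgrade the reply's
stability level when the write has already been made durable:

	/* Sketch: if the WRITE went out via O_DIRECT plus a sync, report
	 * it as already stable so the client can skip the COMMIT.
	 * 'stable' is a stand-in for however the reply level is plumbed. */
	if (flags & RWF_DIRECT)
		stable = NFS_FILE_SYNC;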
> > Because "platter". On some devices, writes are slow.
> >
> > For some workloads, unstable is faster. I have an experimental series
> > that makes NFSD convert all NFS WRITEs to FILE_SYNC. It was not an
> > across the board win, even with an NVMe-backed file system.
> >
>
> Presumably, those devices wouldn't be exported in this mode. That's
> probably a good argument for making this settable on a per-export
> basis.
Correct. This shouldn't be used by default. But if/when it makes
sense, it *really* sings.
> > > > One thing you might consider is to continue using the page cache for
> > > > unstable WRITEs, and then use fadvise DONTNEED after a successful
> > > > COMMIT operation to reduce page cache footprint. Unstable writes to
> > > > the same range of the file might be a problem, however.
> > >
> > > Since the client sends almost everything UNSTABLE, that would probably
> > > erase most of the performance win. The only reason I can see to use
> > > buffered I/O in this mode would be because we had to deal with an
> > > unaligned write and need to do a RMW cycle on a block.
> > >
> > > The big question is whether mixing buffered and direct I/O writes like
> > > this is safe across all exportable filesystems. I'm not yet convinced
> > > of that.
> >
> > Agreed, that deserves careful scrutiny.
> >
>
> Like Mike is asking though, I need a better understanding of the
> potential races here:
>
> XFS, for instance, takes the i_rwsem shared around dio writes and
> exclusive around buffered, so they should exclude each other.
> If we did all the buffered writes as RWF_SYNC, would that prevent
> corruption?
I welcome any help pinning down what must be done to ensure this
is safe ("this" being: arbitrary switching between buffered and direct
IO and associated page cache invalidation). But to be 100% clear:
NFSD exporting XFS with enable-dontcache=1 has worked very well.
Do we need to go to the extreme of each filesystem exporting support
with a new flag like FOP_INVALIDATES_BUFFERED_VS_DIRECT? And if set,
any evidence to the contrary is a bug?
And does the VFS have a role in ensuring it's safe or can we assume
vfs/mm/etc are intended to be safe and any core common code that
proves otherwise is a bug?
> In any case, for now at least, unless you're using RDMA, it's going to
> end up falling back to buffered writes everywhere. The data is almost
> never going to be properly aligned coming in off the wire. That might
> be fixable though.
Ben Coddington mentioned to me that soft-iwarp would allow use of RDMA
over TCP to work around SUNRPC TCP's XDR handling always storing the
write payload in misaligned pages. But that's purely a stop-gap
workaround, which needs testing (to see if soft-iwarp negates the win
of using O_DIRECT, etc).
But a better long-term fix is absolutely needed; to be continued (in
the subthread I need to get going)...
Mike
^ permalink raw reply [flat|nested] 75+ messages in thread
* need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-11 20:29 ` Jeff Layton
@ 2025-06-11 21:36 ` Mike Snitzer
2025-06-12 10:28 ` Jeff Layton
2025-06-12 7:13 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Christoph Hellwig
1 sibling, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-11 21:36 UTC (permalink / raw)
To: Jeff Layton; +Cc: Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe
On Wed, Jun 11, 2025 at 04:29:58PM -0400, Jeff Layton wrote:
> On Wed, 2025-06-11 at 15:18 -0400, Mike Snitzer wrote:
> > On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
> > > On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > > > Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
> > > > read or written by NFSD will either not be cached (thanks to O_DIRECT)
> > > > or will be removed from the page cache upon completion (DONTCACHE).
> > >
> > > I thought we were going to do two switches: One for reads and one for
> > > writes? I could be misremembering.
> >
> > We did discuss the possibility of doing that. Still can-do if that's
> > what you'd prefer.
> >
>
> Having them as separate controls in debugfs is fine for
> experimentation's sake, but I imagine we'll need to be all-in one way
> or the other with a real interface.
>
> I think if we can crack the problem of receiving WRITE payloads into an
> already-aligned buffer, then that becomes much more feasible. I think
> that's a solveable problem.
You'd immediately be my hero! Let's get into it:
In a previous reply to this thread you aptly detailed what I found
out the hard way (with too much xdr_buf code review and tracing):
On Wed, Jun 11, 2025 at 08:55:20AM -0400, Jeff Layton wrote:
> >
> > NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to
> > DIO alignment (both page and disk alignment). This works quite well
> > for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it
> > maps the WRITE payload into aligned pages. But more work is needed to
> > be able to leverage O_DIRECT when SUNRPC's regular TCP transport is
> > used. I spent quite a bit of time analyzing the existing xdr_buf code
> > and NFSD's use of it. Unfortunately, the WRITE payload gets stored in
> > misaligned pages such that O_DIRECT isn't possible without a copy
> > (completely defeating the point). I'll reply to this cover letter to
> > start a subthread to discuss how best to deal with misaligned write
> > IO (by association with Hammerspace, I'm most interested in NFS v3).
> >
>
> Tricky problem. svc_tcp_recvfrom() just slurps the whole RPC into the
> rq_pages array. To get alignment right, you'd probably have to do the
> receive in a much more piecemeal way.
>
> Basically, you'd need to decode as you receive chunks of the message,
> and look out for WRITEs, and then set it up so that their payloads are
> received with proper alignment.
1)
Yes, and while I arrived at the exact same conclusion, I was left with
dread about the potential for "breaking too many eggs to make that
tasty omelette".
If you (or others) see a way forward to have SUNRPC TCP's XDR receive
path decode "inline" (rather than the 2-stage process you covered
above), that'd be fantastic. It seems like really old tech debt in
SUNRPC, from a time when caring about the alignment of WRITE payload
pages was completely off engineers' collective radar (owed to NFSD only
ever using buffered IO, I assume?).
2)
One hack that I verified to work for READ and WRITE IO on my
particular TCP testbed was to front-pad the first "head" page of the
xdr_buf such that the WRITE payload started at the 2nd page of
rq_pages. So that looked like this hack for my usage:
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 8fc5b2b2d806..cf082a265261 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -676,7 +676,9 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
/* Make arg->head point to first page and arg->pages point to rest */
arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
- arg->head[0].iov_len = PAGE_SIZE;
+ // FIXME: front-pad optimized to align TCP's WRITE payload
+ // but may not be enough for other operations?
+ arg->head[0].iov_len = 148;
arg->pages = rqstp->rq_pages + 1;
arg->page_base = 0;
/* save at least one page for response */
That gut "but may not be enough for other operations?" comment proved
to be prophetic.
Sadly it went on to fail spectacularly for other ops (specifically
READDIR and READDIRPLUS, probably others would too) because
xdr_inline_decode() _really_ doesn't like going beyond the end of the
xdr_buf's inline "head" page. It could be that even if
xdr_inline_decode() et al was "fixed" (which isn't for the faint of
heart given xdr_buf's more complex nature) there will likely be other
mole(s) that pop up. And in addition, we'd be wasting space in the
xdr_buf's head page (PAGE_SIZE-frontpad). So I moved on from trying
to see this frontpad hack through to completion.
3)
Lastly, for completeness, I also mentioned briefly in a previous
recent reply:
On Wed, Jun 11, 2025 at 04:51:03PM -0400, Mike Snitzer wrote:
> On Wed, Jun 11, 2025 at 11:44:29AM -0400, Jeff Layton wrote:
>
> > In any case, for now at least, unless you're using RDMA, it's going to
> > end up falling back to buffered writes everywhere. The data is almost
> > never going to be properly aligned coming in off the wire. That might
> > be fixable though.
>
> Ben Coddington mentioned to me that soft-iwarp would allow use of RDMA
> over TCP to workaround SUNRPC TCP's XDR handling always storing the
> write payload in misaligned IO. But that's purely a stop-gap
> workaround, which needs testing (to see if soft-iwap negates the win
> of using O_DIRECT, etc).
(Ab)using soft-iwarp as the basis for easily getting page aligned TCP
WRITE payloads seems pretty gross given we are chasing utmost
performance, etc.
All said, I welcome your sage advice and help on this effort to
DIO-align SUNRPC TCP's WRITE payload pages.
Thanks,
Mike
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-11 20:29 ` Jeff Layton
2025-06-11 21:36 ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] Mike Snitzer
@ 2025-06-12 7:13 ` Christoph Hellwig
2025-06-12 13:15 ` Chuck Lever
1 sibling, 1 reply; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-12 7:13 UTC (permalink / raw)
To: Jeff Layton
Cc: Mike Snitzer, Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe
On Wed, Jun 11, 2025 at 04:29:58PM -0400, Jeff Layton wrote:
> I think if we can crack the problem of receiving WRITE payloads into an
> already-aligned buffer, then that becomes much more feasible. I think
> that's a solveable problem.
It's called RDMA :)
To place write payloads into a page-aligned buffer, the NIC needs to split
the various headers from the payload. The data placement part of RDMA
naturally takes care of that. If you want to do it over plain TCP, you need
hardware that is aware of the protocol headers up to the XDR level. I
know that back in the days when NFS was a big thing there were NICs that
could do this offload with the right firmware, and I wouldn't be surprised
if that's still the case.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis
2025-06-11 14:17 ` Chuck Lever
@ 2025-06-12 7:15 ` Christoph Hellwig
0 siblings, 0 replies; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-12 7:15 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, Mike Snitzer, Jeff Layton, linux-nfs,
linux-fsdevel, Jens Axboe
On Wed, Jun 11, 2025 at 10:17:56AM -0400, Chuck Lever wrote:
> On 6/11/25 2:58 AM, Christoph Hellwig wrote:
> > On Tue, Jun 10, 2025 at 04:57:35PM -0400, Mike Snitzer wrote:
> >> Avoids the need to open code do_iter_readv_writev() purely to request
> >> that a sync iocb make use of IOCB_DIRECT.
> >>
> >> Care was taken to preserve the long-established value for IOCB_DIRECT
> >> (1 << 17) when introducing RWF_DIRECT.
> >
> > What is the problem with using vfs_iocb_iter_read instead of
> > vfs_iter_read and passing the iocb directly?
>
> Christoph, are you suggesting that nfsd_iter_read() should always
> call vfs_iocb_iter_read() instead of vfs_iter_read()? That might be
> a nice clean up in general.
Yes. I don't think it's such a big cleanup because the helper is a
bit lower level. But IFF we are going down the route of using
direct I/O, this would also allow doing asynchronous I/O instead of
blocking the server threads.
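For illustration, the synchronous variant of that would look something
like this in nfsd_iter_read() (sketch only; 'use_direct' stands in for
whatever policy decides to go direct):

	struct kiocb kiocb;

	init_sync_kiocb(&kiocb, file);
	kiocb.ki_pos = offset;
	if (use_direct)
		kiocb.ki_flags |= IOCB_DIRECT;

	host_err = vfs_iocb_iter_read(file, &kiocb, &iter);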
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-11 13:30 ` Jeff Layton
@ 2025-06-12 7:22 ` Christoph Hellwig
0 siblings, 0 replies; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-12 7:22 UTC (permalink / raw)
To: Jeff Layton
Cc: Mike Snitzer, Christoph Hellwig, Chuck Lever, linux-nfs,
linux-fsdevel, Jens Axboe, Dave Chinner
On Wed, Jun 11, 2025 at 09:30:54AM -0400, Jeff Layton wrote:
>
> I'm concerned here too. Invalidation races can mean silent data
> corruption. We'll need to ensure that this is safe.
>
> Incidentally, is there a good testcase for this? Something that does
> buffered and direct I/O from different tasks and looks for
> inconsistencies?
We have a few xfstests that race different kinds of I/O and trigger the
failing-invalidation warnings. I'm not sure anything combines this
with checks for data integrity.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-11 12:23 ` Mike Snitzer
2025-06-11 13:30 ` Jeff Layton
@ 2025-06-12 7:23 ` Christoph Hellwig
1 sibling, 0 replies; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-12 7:23 UTC (permalink / raw)
To: Mike Snitzer
Cc: Christoph Hellwig, Chuck Lever, Jeff Layton, linux-nfs,
linux-fsdevel, Jens Axboe, Dave Chinner
On Wed, Jun 11, 2025 at 08:23:34AM -0400, Mike Snitzer wrote:
> On Wed, Jun 11, 2025 at 12:00:02AM -0700, Christoph Hellwig wrote:
> > On Tue, Jun 10, 2025 at 04:57:36PM -0400, Mike Snitzer wrote:
> > > IO must be aligned, otherwise it falls back to using buffered IO.
> > >
> > > RWF_DONTCACHE is _not_ currently used for misaligned IO (even when
> > > nfsd/enable-dontcache=1) because it works against us (due to RMW
> > > needing to read without benefit of cache), whereas buffered IO enables
> > > misaligned IO to be more performant.
> >
> > This seems to "randomly" mix direct I/O and buffered I/O on a file.
>
> It isn't random, if the IO is DIO-aligned it uses direct I/O.
Which as an I/O pattern does look pretty random :)
> > But maybe also explain what this is trying to address to start with?
>
> Ha, I suspect you saw my too-many-words 0th patch header [1] and
> ignored it? Solid feedback; I need to be more succinct. I'm probably
> too close to this work to see the gaps in the introduction and
> justification, but I will refine, starting now:
Well, I was mostly asking about the description for this patch in
particular. Given that all the naming and the previous patches seemed
to be about dontcache I/O, having optional direct I/O in here looked
really confusing.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-11 15:07 ` Jeff Layton
2025-06-11 15:11 ` Chuck Lever
@ 2025-06-12 7:25 ` Christoph Hellwig
1 sibling, 0 replies; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-12 7:25 UTC (permalink / raw)
To: Jeff Layton
Cc: Chuck Lever, Mike Snitzer, linux-nfs, linux-fsdevel, Jens Axboe
On Wed, Jun 11, 2025 at 11:07:20AM -0400, Jeff Layton wrote:
> > write data at least until the client sends a COMMIT. Otherwise the
> > server will have to convert all UNSTABLE writes to FILE_SYNC writes,
> > and that can have performance implications.
> >
>
> If we're doing synchronous, direct I/O writes then why not just respond
> with FILE_SYNC? The write should be on the platter by the time it
> returns.
Only if you do O_DSYNC writes, which are painfully slow for lots of
configurations. Otherwise you still need to issue an fdatasync.
But can you help refresh my memory on why UNSTABLE semantics require having
the data in the page cache?
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-11 15:11 ` Chuck Lever
2025-06-11 15:44 ` Jeff Layton
@ 2025-06-12 7:28 ` Christoph Hellwig
1 sibling, 0 replies; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-12 7:28 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Mike Snitzer, linux-nfs, linux-fsdevel, Jens Axboe
On Wed, Jun 11, 2025 at 11:11:28AM -0400, Chuck Lever wrote:
> > If we're doing synchronous, direct I/O writes then why not just respond
> > with FILE_SYNC? The write should be on the platter by the time it
> > returns.
>
> Because "platter". On some devices, writes are slow.
>
> For some workloads, unstable is faster. I have an experimental series
> that makes NFSD convert all NFS WRITEs to FILE_SYNC. It was not an
> across the board win, even with an NVMe-backed file system.
Unless it is a pure overwrite, on a file system that passes writes
straight through, to a device without a volatile write cache, FILE_SYNC
currently is slower. That might change for file systems logging
synchronous writes (I'm playing with that for XFS a bit again), but even
then you only want to do that selectively, when the client specifically
requests O_DSYNC semantics, as you'd easily overwhelm the log and
introduce log write amplification.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
2025-06-11 15:44 ` Jeff Layton
2025-06-11 20:51 ` Mike Snitzer
@ 2025-06-12 7:32 ` Christoph Hellwig
1 sibling, 0 replies; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-12 7:32 UTC (permalink / raw)
To: Jeff Layton
Cc: Chuck Lever, Mike Snitzer, linux-nfs, linux-fsdevel, Jens Axboe
On Wed, Jun 11, 2025 at 11:44:29AM -0400, Jeff Layton wrote:
> XFS, for instance, takes the i_rwsem shared around dio writes and
> exclusive around buffered, so they should exclude each other. If we did
> all the buffered writes as RWF_SYNC, would that prevent corruption?
The big issue is memory mapped I/O, which doesn't take any locks.
I guess you could declare using an NFS-exported file locally as a bad
idea, but I know plenty of setups doing it.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
2025-06-11 12:55 ` [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Jeff Layton
@ 2025-06-12 7:39 ` Christoph Hellwig
2025-06-12 20:37 ` Mike Snitzer
0 siblings, 1 reply; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-12 7:39 UTC (permalink / raw)
To: Jeff Layton
Cc: Mike Snitzer, Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe,
Dave Chinner
On Wed, Jun 11, 2025 at 08:55:20AM -0400, Jeff Layton wrote:
> To be clear, my concern with *_DONTCACHE is this bit in
> generic_write_sync():
> I understand why it was done, but it means that we're kicking off
> writeback for small ranges after every write. I think we'd be better
> served by allowing for a little batching, and just kick off writeback
> (maybe even for the whole inode) after a short delay. IOW, I agree with
> Dave Chinner that we need some sort of writebehind window.
Agreed. Not offloading to the worker threads also hurts the I/O
pattern. I guess Jens did that to avoid overwhelming the single-threaded
writeback worker, but that might be solved with the pending series for
multiple writeback workers.
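For anyone following along, the generic_write_sync() bit being referred
to is roughly the following (paraphrased from memory of the RWF_DONTCACHE
series, so treat it as a sketch rather than an exact quote):

	} else if (iocb->ki_flags & IOCB_DONTCACHE) {
		struct address_space *mapping = iocb->ki_filp->f_mapping;

		/* kick off writeback for the just-written range right away */
		filemap_fdatawrite_range_kick(mapping, iocb->ki_pos - count,
					      iocb->ki_pos - 1);
	}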
Another thing is that using the page cache for reads is probably
rather pointless. I've been wondering if we should just change
the direct I/O read code to read from the page cache if there are
cached pages and otherwise go direct to the device. That would make
a setup using buffered writes (with or without the dontcache
flag) and direct I/O reads safe.
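A very rough sketch of that idea (pseudo-code, nothing that exists today;
the real thing would need to handle partially cached ranges, invalidation
and locking):

	/* Serve a "direct" read from the page cache when any of the
	 * range is cached, otherwise go straight to the device. */
	if (filemap_range_has_page(mapping, pos, pos + len - 1))
		return filemap_read(iocb, iter, 0);	/* buffered path */
	return do_dio_read(iocb, iter);		/* placeholder for the DIO path */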
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-11 21:36 ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] Mike Snitzer
@ 2025-06-12 10:28 ` Jeff Layton
2025-06-12 11:28 ` Jeff Layton
` (2 more replies)
0 siblings, 3 replies; 75+ messages in thread
From: Jeff Layton @ 2025-06-12 10:28 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe
On Wed, 2025-06-11 at 17:36 -0400, Mike Snitzer wrote:
> On Wed, Jun 11, 2025 at 04:29:58PM -0400, Jeff Layton wrote:
> > On Wed, 2025-06-11 at 15:18 -0400, Mike Snitzer wrote:
> > > On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
> > > > On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > > > > Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
> > > > > read or written by NFSD will either not be cached (thanks to O_DIRECT)
> > > > > or will be removed from the page cache upon completion (DONTCACHE).
> > > >
> > > > I thought we were going to do two switches: One for reads and one for
> > > > writes? I could be misremembering.
> > >
> > > We did discuss the possibility of doing that. Still can-do if that's
> > > what you'd prefer.
> > >
> >
> > Having them as separate controls in debugfs is fine for
> > experimentation's sake, but I imagine we'll need to be all-in one way
> > or the other with a real interface.
> >
> > I think if we can crack the problem of receiving WRITE payloads into an
> > already-aligned buffer, then that becomes much more feasible. I think
> > that's a solveable problem.
>
> You'd immediately be my hero! Let's get into it:
>
> In a previously reply to this thread you aptly detailed what I found
> out the hard way (with too much xdr_buf code review and tracing):
>
> On Wed, Jun 11, 2025 at 08:55:20AM -0400, Jeff Layton wrote:
> > >
> > > NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to
> > > DIO alignment (both page and disk alignment). This works quite well
> > > for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it
> > > maps the WRITE payload into aligned pages. But more work is needed to
> > > be able to leverage O_DIRECT when SUNRPC's regular TCP transport is
> > > used. I spent quite a bit of time analyzing the existing xdr_buf code
> > > and NFSD's use of it. Unfortunately, the WRITE payload gets stored in
> > > misaligned pages such that O_DIRECT isn't possible without a copy
> > > (completely defeating the point). I'll reply to this cover letter to
> > > start a subthread to discuss how best to deal with misaligned write
> > > IO (by association with Hammerspace, I'm most interested in NFS v3).
> > >
> >
> > Tricky problem. svc_tcp_recvfrom() just slurps the whole RPC into the
> > rq_pages array. To get alignment right, you'd probably have to do the
> > receive in a much more piecemeal way.
> >
> > Basically, you'd need to decode as you receive chunks of the message,
> > and look out for WRITEs, and then set it up so that their payloads are
> > received with proper alignment.
>
> 1)
> Yes, and while I arrived at the same exact conclusion I was left with
> dread about the potential for "breaking too many eggs to make that
> tasty omelette".
>
> If you (or others) see a way forward to have SUNRPC TCP's XDR receive
> "inline" decode (rather than have the 2 stage process you covered
> above) that'd be fantastic. Seems like really old tech-debt in SUNRPC
> from a time when such care about alignment of WRITE payload pages was
> completely off engineers' collective radar (owed to NFSD only using
> buffered IO I assume?).
>
> 2)
> One hack that I verified to work for READ and WRITE IO on my
> particular TCP testbed was to front-pad the first "head" page of the
> xdr_buf such that the WRITE payload started at the 2nd page of
> rq_pages. So that looked like this hack for my usage:
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 8fc5b2b2d806..cf082a265261 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -676,7 +676,9 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
>
> /* Make arg->head point to first page and arg->pages point to rest */
> arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
> - arg->head[0].iov_len = PAGE_SIZE;
> + // FIXME: front-pad optimized to align TCP's WRITE payload
> + // but may not be enough for other operations?
> + arg->head[0].iov_len = 148;
> arg->pages = rqstp->rq_pages + 1;
> arg->page_base = 0;
> /* save at least one page for response */
>
> That gut "but may not be enough for other operations?" comment proved
> to be prophetic.
>
> Sadly it went on to fail spectacularly for other ops (specifically
> READDIR and READDIRPLUS, probably others would too) because
> xdr_inline_decode() _really_ doesn't like going beyond the end of the
> xdr_buf's inline "head" page. It could be that even if
> xdr_inline_decode() et al was "fixed" (which isn't for the faint of
> heart given xdr_buf's more complex nature) there will likely be other
> mole(s) that pop up. And in addition, we'd be wasting space in the
> xdr_buf's head page (PAGE_SIZE-frontpad). So I moved on from trying
> to see this frontpad hack through to completion.
>
> 3)
> Lastly, for completeness, I also mentioned briefly in a previous
> recent reply:
>
> On Wed, Jun 11, 2025 at 04:51:03PM -0400, Mike Snitzer wrote:
> > On Wed, Jun 11, 2025 at 11:44:29AM -0400, Jeff Layton wrote:
> >
> > > In any case, for now at least, unless you're using RDMA, it's going to
> > > end up falling back to buffered writes everywhere. The data is almost
> > > never going to be properly aligned coming in off the wire. That might
> > > be fixable though.
> >
> > Ben Coddington mentioned to me that soft-iwarp would allow use of RDMA
> > over TCP to workaround SUNRPC TCP's XDR handling always storing the
> > write payload in misaligned IO. But that's purely a stop-gap
> > workaround, which needs testing (to see if soft-iwap negates the win
> > of using O_DIRECT, etc).
>
> (Ab)using soft-iwarp as the basis for easily getting page aligned TCP
> WRITE payloads seems pretty gross given we are chasing utmost
> performance, etc.
>
> All said, I welcome your sage advice and help on this effort to
> DIO-align SUNRPC TCP's WRITE payload pages.
>
> Thanks,
> Mike
(Sent this to Mike only by accident yesterday -- resending to the full
list now)
I've been looking over the code today. Basically, I think we need to
have svc_tcp_recvfrom() receive in phases. At a high level:
1/ receive the record marker (just like it does today)
2/ receive enough for the RPC header and then decode it.
3/ Use the rpc program and version from the decoded header to look up
the svc_program. Add an optional pg_tcp_recvfrom callback to that
structure that will receive the rest of the data into the buffer. If
pg_tcp_recvfrom isn't set, then just call svc_tcp_read_msg() like we do
today.
For NFSv3, pg_tcp_recvfrom can just look at the procedure. If it's
anything but a WRITE we'll just do what we do today
(svc_tcp_read_msg()).
For a WRITE, we'll receive the first part of the WRITE3args (everything
but the data) into rq_pages, and decode it. We can then use that info
to figure out the alignment. Advance to the next page in rq_pages, and
then to the point where the data is properly aligned. Do the receive
into that spot.
Then we just add a RQ_ALIGNED_DATA to rqstp->rq_flags, and teach
nfsd3_proc_write how to find the data and do a DIO write when it's set.
Unaligned writes are still a problem though. If two WRITE RPCs come in
for different parts of the same block at the same time, then you could
end up losing the result of the first write. I don't see a way to make
that non-racy.
NFSv4 will also be a bit of a challenge. We'll need to receive the
whole compound one operation at a time. If we hit a WRITE, then we can
just do the same thing that we do for v3 to align the data.
I'd probably aim to start with an implementation for v3, and then add
v4 support in a second phase.
I'm interested in working on this. It'll be a fair bit of work though.
I'll need to think about how to break this up into manageable pieces.
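To make the WRITE payload placement concrete, the per-request math is
something like this (sketch only; svc_tcp_recv_into() is a hypothetical
helper, and the real code would live behind the pg_tcp_recvfrom callback
described above):

	/* After decoding WRITE3args up to (but not including) the data:
	 * place the payload so its offset within the page matches the
	 * file offset's offset within a page.  A block-aligned WRITE then
	 * lands page-aligned in memory, keeping the bvec DIO-friendly. */
	u32 pad = (u32)(write_offset & (PAGE_SIZE - 1));
	void *dst = page_address(rqstp->rq_pages[next_page]) + pad;

	err = svc_tcp_recv_into(rqstp, dst, payload_len);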
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-12 10:28 ` Jeff Layton
@ 2025-06-12 11:28 ` Jeff Layton
2025-06-12 13:28 ` Chuck Lever
2025-07-03 0:12 ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] NeilBrown
2 siblings, 0 replies; 75+ messages in thread
From: Jeff Layton @ 2025-06-12 11:28 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe
On Thu, 2025-06-12 at 06:28 -0400, Jeff Layton wrote:
> [ ... full quote trimmed ... ]
>
> Unaligned writes are still a problem though. If two WRITE RPCs come in
> for different parts of the same block at the same time, then you could
> end up losing the result of the first write. I don't see a way to make
> that non-racy.
>
> NFSv4 will also be a bit of a challenge. We'll need to receive the
> whole compound one operation at a time. If we hit a WRITE, then we can
> just do the same thing that we do for v3 to align the data.
>
> I'd probably aim to start with an implementation for v3, and then add
> v4 support in a second phase.
>
> I'm interested in working on this. It'll be a fair bit of work though.
> I'll need to think about how to break this up into manageable pieces.
Mike asked me to detail the race that I see between unaligned writes:
Since we'd have to fill a block before writing, the only way I can see
to do this with DIO would be to pre-populate the incomplete blocks at
the ends of the range before receiving the data into the buffer.
Most filesystems allow concurrent DIO writes to the same file. XFS, for
instance, only takes the inode->i_rwsem for read when doing a DIO write.
Suppose we have two adjacent 1.5k WRITEs going to a filesystem that has
1k blocks. Both writes end up doing DIO reads to fill in the unwritten
part of the same block, and then receive the data into their buffers.
Then they both issue their writes to the fs (2k each). The second writer
will end up clobbering the data that the first wrote in the shared block.
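Laid out with concrete offsets (same numbers as above: 1k blocks, two
1.5k writes at offsets 0 and 1536):

	blocks:   | blk0: 0-1023 | blk1: 1024-2047 | blk2: 2048-3071 |
	write A:  data 0-1535     -> RMW reads 1536-2047 to fill out blk1
	write B:  data 1536-3071  -> RMW reads 1024-1535 to fill out blk1

Both RMW reads of blk1 can complete before either 2k write lands, so
whichever write is issued second overwrites the other's half of blk1.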
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-12 7:13 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Christoph Hellwig
@ 2025-06-12 13:15 ` Chuck Lever
0 siblings, 0 replies; 75+ messages in thread
From: Chuck Lever @ 2025-06-12 13:15 UTC (permalink / raw)
To: Christoph Hellwig, Jeff Layton
Cc: Mike Snitzer, linux-nfs, linux-fsdevel, Jens Axboe
On 6/12/25 3:13 AM, Christoph Hellwig wrote:
> On Wed, Jun 11, 2025 at 04:29:58PM -0400, Jeff Layton wrote:
>> I think if we can crack the problem of receiving WRITE payloads into an
>> already-aligned buffer, then that becomes much more feasible. I think
>> that's a solveable problem.
>
> It's called RDMA :)
>
> To place write payloads into page aligned buffer, the NIC needs to split
> the various headers from the payload. The data placement part of RDMA
> naturally takes care of that. If you want to do it without TCP, you need
> hardware that is aware of the protocol headers up to the XDR level. I
> know and the days where NFS was a big thing there were NICs that could do
> this offload with the right firmware, and I wouldn't be surprised if
> that's still the case.
>
>
Agreed: RDMA is the long-standing solution to this problem.
For TCP:
- For low workload intensity, handling unaligned payloads is adequate.
- For moderate intensity workloads, software RXE, or better, software
iWARP is the right answer. It's just a matter of making those drivers
work efficiently.
- For high intensity workloads, hardware RDMA is the right answer.
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-11 19:18 ` Mike Snitzer
2025-06-11 20:29 ` Jeff Layton
@ 2025-06-12 13:21 ` Chuck Lever
2025-06-12 16:00 ` Mike Snitzer
1 sibling, 1 reply; 75+ messages in thread
From: Chuck Lever @ 2025-06-12 13:21 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On 6/11/25 3:18 PM, Mike Snitzer wrote:
> On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
>> On 6/10/25 4:57 PM, Mike Snitzer wrote:
>>> Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
>>> read or written by NFSD will either not be cached (thanks to O_DIRECT)
>>> or will be removed from the page cache upon completion (DONTCACHE).
>>
>> I thought we were going to do two switches: One for reads and one for
>> writes? I could be misremembering.
>
> We did discuss the possibility of doing that. Still can-do if that's
> what you'd prefer.
For our experimental interface, I think having read and write enablement
as separate settings is wise, so please do that.
One quibble, though: The name "enable_dontcache" might be directly
meaningful to you, but I think others might find "enable_dont" to be
oxymoronic. And, it ties the setting to a specific kernel technology:
RWF_DONTCACHE.
So: Can we call these settings "io_cache_read" and "io_cache_write" ?
They could each carry multiple settings:
0: Use page cache
1: Use RWF_DONTCACHE
2: Use O_DIRECT
You can choose to implement any or all of the above three mechanisms.
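A sketch of how that might look internally, as a per-direction mode
(names follow the suggestion above; values and types are illustrative):

	enum nfsd_io_cache_mode {
		NFSD_IO_BUFFERED  = 0,	/* use the page cache */
		NFSD_IO_DONTCACHE = 1,	/* RWF_DONTCACHE */
		NFSD_IO_DIRECT    = 2,	/* O_DIRECT */
	};

	/* hypothetical debugfs-backed settings */
	extern enum nfsd_io_cache_mode nfsd_io_cache_read;
	extern enum nfsd_io_cache_mode nfsd_io_cache_write;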
>> After all, you are describing two different facilities here: a form of
>> direct I/O for READs, and RWF_DONTCACHE for WRITEs (I think?).
>
> My thinking was NFSD doesn't need to provide faithful pure
> RWF_DONTCACHE if it really doesn't make sense. But the "dontcache"
> name can be (ab)used by NFSD to define it how it sees fit (O_DIRECT
> doesn't cache so it seems fair). What I arrived at with this patchset
> is how I described in my cover letter:
>
> When 'enable-dontcache' is used:
> - all READs will use O_DIRECT (both DIO-aligned and misaligned)
> - all DIO-aligned WRITEs will use O_DIRECT (useful for SUNRPC RDMA)
> - misaligned WRITEs currently continue to use normal buffered IO
>
> But we reserve the right to iterate on the implementation details as
> we see fit. Still using the umbrella of 'dontcache'.
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-12 10:28 ` Jeff Layton
2025-06-12 11:28 ` Jeff Layton
@ 2025-06-12 13:28 ` Chuck Lever
2025-06-12 14:17 ` Benjamin Coddington
2025-06-12 16:22 ` Jeff Layton
2025-07-03 0:12 ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] NeilBrown
2 siblings, 2 replies; 75+ messages in thread
From: Chuck Lever @ 2025-06-12 13:28 UTC (permalink / raw)
To: Jeff Layton, Mike Snitzer; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
On 6/12/25 6:28 AM, Jeff Layton wrote:
> [ ... full quote trimmed ... ]
>
> I've been looking over the code today. Basically, I think we need to
> have svc_tcp_recvfrom() receive in phases. At a high level:
>
> 1/ receive the record marker (just like it does today)
>
> 2/ receive enough for the RPC header and then decode it.
>
> 3/ Use the rpc program and version from the decoded header to look up
> the svc_program. Add an optional pg_tcp_recvfrom callback to that
> structure that will receive the rest of the data into the buffer. If
> pg_tcp_recvfrom isn't set, then just call svc_tcp_read_msg() like we do
> today.
The layering violations here are mind-blowing.
> For NFSv3, pc_tcp_recvfrom can just look at the procedure. If it's
> anything but a WRITE we'll just do what we do today
> (svc_tcp_read_msg()).
>
> For a WRITE, we'll receive the first part of the WRITE3args (everything
> but the data) into rq_pages, and decode it. We can then use that info
> to figure out the alignment. Advance to the next page in rq_pages, and
> then to the point where the data is properly aligned. Do the receive
> into that spot.
>
> Then we just add a RQ_ALIGNED_DATA to rqstp->rq_flags, and teach
> nfsd3_proc_write how to find the data and do a DIO write when it's set.
>
> Unaligned writes are still a problem though. If two WRITE RPCs come in
> for different parts of the same block at the same time, then you could
> end up losing the result of the first write. I don't see a way to make
> that non-racy.
>
> NFSv4 will also be a bit of a challenge. We'll need to receive the
> whole compound one operation at a time. If we hit a WRITE, then we can
> just do the same thing that we do for v3 to align the data.
>
> I'd probably aim to start with an implementation for v3, and then add
> v4 support in a second phase.
>
> I'm interested in working on this. It'll be a fair bit of work though.
> I'll need to think about how to break this up into manageable pieces.
Bruce has been thinking about payload alignment schemes for at least
ten years. My opinion has always been:
- We have this already via RDMA, even over TCP
- Any scheme like this will still not perform as well as RDMA
- NFS/TCP is kind of a "works everywhere" technology that I prefer to
not screw around with
- The corner cases will be troubling us for many years
- Only a handful of users will truly benefit from it
- There are plenty of higher priority items on our to-do list.
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
` (7 preceding siblings ...)
2025-06-11 14:16 ` Chuck Lever
@ 2025-06-12 13:46 ` Chuck Lever
2025-06-12 19:08 ` Mike Snitzer
8 siblings, 1 reply; 75+ messages in thread
From: Chuck Lever @ 2025-06-12 13:46 UTC (permalink / raw)
To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
On 6/10/25 4:57 PM, Mike Snitzer wrote:
> The O_DIRECT performance win is pretty fantastic thanks to reduced CPU
> and memory use, particularly for workloads with a working set that far
> exceeds the available memory of a given server. This patchset's
> changes (through patch 5; patch 6 wasn't written until after the
> benchmarking was performed) enabled Hammerspace to improve its IO500.org
> benchmark result (as submitted for this week's ISC 2025 in Hamburg,
> Germany) by 25%.
>
> That 25% improvement on IO500 is owed to NFS servers seeing:
> - reduced CPU usage from 100% to ~50%
> O_DIRECT:
> write: 51% idle, 25% system, 14% IO wait, 2% IRQ
> read: 55% idle, 9% system, 32.5% IO wait, 1.5% IRQ
> buffered:
> write: 17.8% idle, 67.5% system, 8% IO wait, 2% IRQ
> read: 3.29% idle, 94.2% system, 2.5% IO wait, 1% IRQ
The IO wait and IRQ numbers for the buffered results appear to be
significantly better than for O_DIRECT. Can you help us understand
that? Is device utilization better or worse with O_DIRECT?
> - reduced memory usage from just under 100% (987GiB for reads, 978GiB
> for writes) to only ~244 MB for cache+buffer use (for both reads and
> writes).
> - buffered would tip-over due to kswapd and kcompactd struggling to
> find free memory during reclaim.
>
> - increased NVMe throughput when comparing O_DIRECT vs buffered:
> O_DIRECT: 8-10 GB/s for writes, 9-11.8 GB/s for reads
> buffered: 8 GB/s for writes, 4-5 GB/s for reads
>
> - ability to support more IO threads per client system (from 48 to 64)
This last item: how do you measure the "ability to support more
threads"? Is there a latency curve that is flatter? Do you see changes
in the latency distribution and the number of latency outliers?
My general comment here is kind of in the "related or future work"
category. This is not an objection, just thinking out loud.
But, can we get more insight into specifically where the CPU
utilization reduction comes from? Is it lock contention? Is it
inefficient data structure traversal? Any improvement here benefits
everyone, so that should be a focus of some study.
If the memory utilization is a problem, that sounds like an issue with
kernel systems outside of NFSD, or perhaps some system tuning can be
done to improve matters. Again, drilling into this and trying to improve
it will benefit everyone.
These results do point to some problems, clearly. Whether NFSD using
direct I/O is the best solution is not obvious to me yet.
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-12 13:28 ` Chuck Lever
@ 2025-06-12 14:17 ` Benjamin Coddington
2025-06-12 15:56 ` Mike Snitzer
2025-06-12 16:22 ` Jeff Layton
1 sibling, 1 reply; 75+ messages in thread
From: Benjamin Coddington @ 2025-06-12 14:17 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Mike Snitzer, linux-nfs, linux-fsdevel, Jens Axboe
On 12 Jun 2025, at 9:28, Chuck Lever wrote:
> On 6/12/25 6:28 AM, Jeff Layton wrote:
>> On Wed, 2025-06-11 at 17:36 -0400, Mike Snitzer wrote:
>>> On Wed, Jun 11, 2025 at 04:29:58PM -0400, Jeff Layton wrote:
>>>> On Wed, 2025-06-11 at 15:18 -0400, Mike Snitzer wrote:
>>>>> On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
>>>>>> On 6/10/25 4:57 PM, Mike Snitzer wrote:
>>>>>>> Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
>>>>>>> read or written by NFSD will either not be cached (thanks to O_DIRECT)
>>>>>>> or will be removed from the page cache upon completion (DONTCACHE).
>>>>>>
>>>>>> I thought we were going to do two switches: One for reads and one for
>>>>>> writes? I could be misremembering.
>>>>>
>>>>> We did discuss the possibility of doing that. Still can-do if that's
>>>>> what you'd prefer.
>>>>>
>>>>
>>>> Having them as separate controls in debugfs is fine for
>>>> experimentation's sake, but I imagine we'll need to be all-in one way
>>>> or the other with a real interface.
>>>>
>>>> I think if we can crack the problem of receiving WRITE payloads into an
>>>> already-aligned buffer, then that becomes much more feasible. I think
>>>> that's a solvable problem.
>>>
>>> You'd immediately be my hero! Let's get into it:
>>>
>>> In a previous reply to this thread you aptly detailed what I found
>>> out the hard way (with too much xdr_buf code review and tracing):
>>>
>>> On Wed, Jun 11, 2025 at 08:55:20AM -0400, Jeff Layton wrote:
>>>>>
>>>>> NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to
>>>>> DIO alignment (both page and disk alignment). This works quite well
>>>>> for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it
>>>>> maps the WRITE payload into aligned pages. But more work is needed to
>>>>> be able to leverage O_DIRECT when SUNRPC's regular TCP transport is
>>>>> used. I spent quite a bit of time analyzing the existing xdr_buf code
>>>>> and NFSD's use of it. Unfortunately, the WRITE payload gets stored in
>>>>> misaligned pages such that O_DIRECT isn't possible without a copy
>>>>> (completely defeating the point). I'll reply to this cover letter to
>>>>> start a subthread to discuss how best to deal with misaligned write
>>>>> IO (by association with Hammerspace, I'm most interested in NFS v3).
>>>>>
>>>>
>>>> Tricky problem. svc_tcp_recvfrom() just slurps the whole RPC into the
>>>> rq_pages array. To get alignment right, you'd probably have to do the
>>>> receive in a much more piecemeal way.
>>>>
>>>> Basically, you'd need to decode as you receive chunks of the message,
>>>> and look out for WRITEs, and then set it up so that their payloads are
>>>> received with proper alignment.
>>>
>>> 1)
>>> Yes, and while I arrived at the same exact conclusion I was left with
>>> dread about the potential for "breaking too many eggs to make that
>>> tasty omelette".
>>>
>>> If you (or others) see a way forward to have SUNRPC TCP's XDR receive
>>> "inline" decode (rather than have the 2 stage process you covered
>>> above) that'd be fantastic. Seems like really old tech-debt in SUNRPC
>>> from a time when such care about alignment of WRITE payload pages was
>>> completely off engineers' collective radar (owed to NFSD only using
>>> buffered IO I assume?).
>>>
>>> 2)
>>> One hack that I verified to work for READ and WRITE IO on my
>>> particular TCP testbed was to front-pad the first "head" page of the
>>> xdr_buf such that the WRITE payload started at the 2nd page of
>>> rq_pages. So that looked like this hack for my usage:
>>>
>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>> index 8fc5b2b2d806..cf082a265261 100644
>>> --- a/net/sunrpc/svc_xprt.c
>>> +++ b/net/sunrpc/svc_xprt.c
>>> @@ -676,7 +676,9 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
>>>
>>> /* Make arg->head point to first page and arg->pages point to rest */
>>> arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
>>> - arg->head[0].iov_len = PAGE_SIZE;
>>> + // FIXME: front-pad optimized to align TCP's WRITE payload
>>> + // but may not be enough for other operations?
>>> + arg->head[0].iov_len = 148;
>>> arg->pages = rqstp->rq_pages + 1;
>>> arg->page_base = 0;
>>> /* save at least one page for response */
>>>
>>> That gut-feeling "but may not be enough for other operations?" comment
>>> proved to be prophetic.
>>>
>>> Sadly it went on to fail spectacularly for other ops (specifically
>>> READDIR and READDIRPLUS, probably others would too) because
>>> xdr_inline_decode() _really_ doesn't like going beyond the end of the
>>> xdr_buf's inline "head" page. It could be that even if
>>> xdr_inline_decode() et al was "fixed" (which isn't for the faint of
>>> heart given xdr_buf's more complex nature) there will likely be other
>>> mole(s) that pop up. And in addition, we'd be wasting space in the
>>> xdr_buf's head page (PAGE_SIZE-frontpad). So I moved on from trying
>>> to see this frontpad hack through to completion.
>>>
>>> 3)
>>> Lastly, for completeness, I also mentioned briefly in a previous
>>> recent reply:
>>>
>>> On Wed, Jun 11, 2025 at 04:51:03PM -0400, Mike Snitzer wrote:
>>>> On Wed, Jun 11, 2025 at 11:44:29AM -0400, Jeff Layton wrote:
>>>>
>>>>> In any case, for now at least, unless you're using RDMA, it's going to
>>>>> end up falling back to buffered writes everywhere. The data is almost
>>>>> never going to be properly aligned coming in off the wire. That might
>>>>> be fixable though.
>>>>
>>>> Ben Coddington mentioned to me that soft-iwarp would allow use of RDMA
>>>> over TCP to workaround SUNRPC TCP's XDR handling always storing the
>>>> write payload in misaligned IO. But that's purely a stop-gap
>>>> workaround, which needs testing (to see if soft-iwarp negates the win
>>>> of using O_DIRECT, etc).
>>>
>>> (Ab)using soft-iwarp as the basis for easily getting page aligned TCP
>>> WRITE payloads seems pretty gross given we are chasing utmost
>>> performance, etc.
>>>
>>> All said, I welcome your sage advice and help on this effort to
>>> DIO-align SUNRPC TCP's WRITE payload pages.
>>>
>>> Thanks,
>>> Mike
>>
>> (Sent this to Mike only by accident yesterday -- resending to the full
>> list now)
>>
>> I've been looking over the code today. Basically, I think we need to
>> have svc_tcp_recvfrom() receive in phases. At a high level:
>>
>> 1/ receive the record marker (just like it does today)
>>
>> 2/ receive enough for the RPC header and then decode it.
>>
>> 3/ Use the rpc program and version from the decoded header to look up
>> the svc_program. Add an optional pg_tcp_recvfrom callback to that
>> structure that will receive the rest of the data into the buffer. If
>> pg_tcp_recvfrom isn't set, then just call svc_tcp_read_msg() like we do
>> today.
>
> The layering violations here are mind-blowing.
What's already been mentioned elsewhere, but not yet here:
The transmitter could always just tell the receiver where the data is, we'd
need an NFS v3.1 and an extension for v4.2?
Pot Stirred,
Ben
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-12 14:17 ` Benjamin Coddington
@ 2025-06-12 15:56 ` Mike Snitzer
2025-06-12 15:58 ` Chuck Lever
0 siblings, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-12 15:56 UTC (permalink / raw)
To: Benjamin Coddington
Cc: Chuck Lever, Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On Thu, Jun 12, 2025 at 10:17:22AM -0400, Benjamin Coddington wrote:
>
> What's already been mentioned elsewhere, but not yet here:
>
> The transmitter could always just tell the receiver where the data is, we'd
> need an NFS v3.1 and an extension for v4.2?
>
> Pot Stirred,
> Ben
Yeah, forgot to mention giving serious consideration to extending
specs to make this happen easier. Pros and cons to doing so.
Thanks for raising it.
Mike
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-12 15:56 ` Mike Snitzer
@ 2025-06-12 15:58 ` Chuck Lever
2025-06-12 16:12 ` Mike Snitzer
2025-06-13 5:39 ` Christoph Hellwig
0 siblings, 2 replies; 75+ messages in thread
From: Chuck Lever @ 2025-06-12 15:58 UTC (permalink / raw)
To: Mike Snitzer, Benjamin Coddington
Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On 6/12/25 11:56 AM, Mike Snitzer wrote:
> On Thu, Jun 12, 2025 at 10:17:22AM -0400, Benjamin Coddington wrote:
>>
>> What's already been mentioned elsewhere, but not yet here:
>>
>> The transmitter could always just tell the receiver where the data is, we'd
>> need an NFS v3.1 and an extension for v4.2?
>>
>> Pot Stirred,
>> Ben
>
> Yeah, forgot to mention giving serious consideration to extending
> specs to make this happen easier. Pros and cons to doing so.
>
> Thanks for raising it.
>
> Mike
NFS/RDMA does this already. Let's not re-invent the wheel.
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-12 13:21 ` Chuck Lever
@ 2025-06-12 16:00 ` Mike Snitzer
2025-06-16 13:32 ` Chuck Lever
0 siblings, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-12 16:00 UTC (permalink / raw)
To: Chuck Lever; +Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On Thu, Jun 12, 2025 at 09:21:35AM -0400, Chuck Lever wrote:
> On 6/11/25 3:18 PM, Mike Snitzer wrote:
> > On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
> >> On 6/10/25 4:57 PM, Mike Snitzer wrote:
> >>> Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
> >>> read or written by NFSD will either not be cached (thanks to O_DIRECT)
> >>> or will be removed from the page cache upon completion (DONTCACHE).
> >>
> >> I thought we were going to do two switches: One for reads and one for
> >> writes? I could be misremembering.
> >
> > We did discuss the possibility of doing that. Still can-do if that's
> > what you'd prefer.
>
> For our experimental interface, I think having read and write enablement
> as separate settings is wise, so please do that.
>
> One quibble, though: The name "enable_dontcache" might be directly
> meaningful to you, but I think others might find "enable_dont" to be
> oxymoronic. And, it ties the setting to a specific kernel technology:
> RWF_DONTCACHE.
>
> So: Can we call these settings "io_cache_read" and "io_cache_write" ?
>
> They could each carry multiple settings:
>
> 0: Use page cache
> 1: Use RWF_DONTCACHE
> 2: Use O_DIRECT
>
> You can choose to implement any or all of the above three mechanisms.
I like it, will do for v2. But will have O_DIRECT=1 and RWF_DONTCACHE=2.
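As a sketch of what that tri-state could look like (the enum and symbol
names here are hypothetical; only the 0/1/2 mapping is what's being agreed
on above):

        enum nfsd_io_cache_mode {
                NFSD_IO_BUFFERED  = 0,  /* use the page cache */
                NFSD_IO_DIRECT    = 1,  /* use O_DIRECT */
                NFSD_IO_DONTCACHE = 2,  /* use RWF_DONTCACHE */
        };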
Thanks,
Mike
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-12 15:58 ` Chuck Lever
@ 2025-06-12 16:12 ` Mike Snitzer
2025-06-12 16:32 ` Chuck Lever
2025-06-13 5:39 ` Christoph Hellwig
1 sibling, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-12 16:12 UTC (permalink / raw)
To: Chuck Lever
Cc: Benjamin Coddington, Jeff Layton, linux-nfs, linux-fsdevel,
Jens Axboe
On Thu, Jun 12, 2025 at 11:58:27AM -0400, Chuck Lever wrote:
> On 6/12/25 11:56 AM, Mike Snitzer wrote:
> > On Thu, Jun 12, 2025 at 10:17:22AM -0400, Benjamin Coddington wrote:
> >>
> >> What's already been mentioned elsewhere, but not yet here:
> >>
> >> The transmitter could always just tell the receiver where the data is, we'd
> >> need an NFS v3.1 and an extension for v4.2?
> >>
> >> Pot Stirred,
> >> Ben
> >
> > Yeah, forgot to mention giving serious consideration to extending
> > specs to make this happen easier. Pros and cons to doing so.
> >
> > Thanks for raising it.
> >
> > Mike
>
> NFS/RDMA does this already. Let's not re-invent the wheel.
TCP is ubiquitous. I know you'd really rather we not seriously
pursue fixing/improving things to allow the WRITE payload to be
stored in an aligned buffer that allows zero-copy, but the value of
not requiring any RDMA hardware or client changes is too compelling to
ignore.
RDMA either requires specialized hardware or software (soft-iwarp or
soft-roce).
Imposing those as requirements isn't going to be viable for a large
portion of existing deployments.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-12 13:28 ` Chuck Lever
2025-06-12 14:17 ` Benjamin Coddington
@ 2025-06-12 16:22 ` Jeff Layton
2025-06-13 5:46 ` Christoph Hellwig
1 sibling, 1 reply; 75+ messages in thread
From: Jeff Layton @ 2025-06-12 16:22 UTC (permalink / raw)
To: Chuck Lever, Mike Snitzer; +Cc: linux-nfs, linux-fsdevel, Jens Axboe
On Thu, 2025-06-12 at 09:28 -0400, Chuck Lever wrote:
> On 6/12/25 6:28 AM, Jeff Layton wrote:
> > On Wed, 2025-06-11 at 17:36 -0400, Mike Snitzer wrote:
> > > On Wed, Jun 11, 2025 at 04:29:58PM -0400, Jeff Layton wrote:
> > > > On Wed, 2025-06-11 at 15:18 -0400, Mike Snitzer wrote:
> > > > > On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
> > > > > > On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > > > > > > Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
> > > > > > > read or written by NFSD will either not be cached (thanks to O_DIRECT)
> > > > > > > or will be removed from the page cache upon completion (DONTCACHE).
> > > > > >
> > > > > > I thought we were going to do two switches: One for reads and one for
> > > > > > writes? I could be misremembering.
> > > > >
> > > > > We did discuss the possibility of doing that. Still can-do if that's
> > > > > what you'd prefer.
> > > > >
> > > >
> > > > Having them as separate controls in debugfs is fine for
> > > > experimentation's sake, but I imagine we'll need to be all-in one way
> > > > or the other with a real interface.
> > > >
> > > > I think if we can crack the problem of receiving WRITE payloads into an
> > > > already-aligned buffer, then that becomes much more feasible. I think
> > > > that's a solvable problem.
> > >
> > > You'd immediately be my hero! Let's get into it:
> > >
> > > In a previous reply to this thread you aptly detailed what I found
> > > out the hard way (with too much xdr_buf code review and tracing):
> > >
> > > On Wed, Jun 11, 2025 at 08:55:20AM -0400, Jeff Layton wrote:
> > > > >
> > > > > NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to
> > > > > DIO alignment (both page and disk alignment). This works quite well
> > > > > for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it
> > > > > maps the WRITE payload into aligned pages. But more work is needed to
> > > > > be able to leverage O_DIRECT when SUNRPC's regular TCP transport is
> > > > > used. I spent quite a bit of time analyzing the existing xdr_buf code
> > > > > and NFSD's use of it. Unfortunately, the WRITE payload gets stored in
> > > > > misaligned pages such that O_DIRECT isn't possible without a copy
> > > > > (completely defeating the point). I'll reply to this cover letter to
> > > > > start a subthread to discuss how best to deal with misaligned write
> > > > > IO (by association with Hammerspace, I'm most interested in NFS v3).
> > > > >
> > > >
> > > > Tricky problem. svc_tcp_recvfrom() just slurps the whole RPC into the
> > > > rq_pages array. To get alignment right, you'd probably have to do the
> > > > receive in a much more piecemeal way.
> > > >
> > > > Basically, you'd need to decode as you receive chunks of the message,
> > > > and look out for WRITEs, and then set it up so that their payloads are
> > > > received with proper alignment.
> > >
> > > 1)
> > > Yes, and while I arrived at the same exact conclusion I was left with
> > > dread about the potential for "breaking too many eggs to make that
> > > tasty omelette".
> > >
> > > If you (or others) see a way forward to have SUNRPC TCP's XDR receive
> > > "inline" decode (rather than have the 2 stage process you covered
> > > above) that'd be fantastic. Seems like really old tech-debt in SUNRPC
> > > from a time when such care about alignment of WRITE payload pages was
> > > completely off engineers' collective radar (owed to NFSD only using
> > > buffered IO I assume?).
> > >
> > > 2)
> > > One hack that I verified to work for READ and WRITE IO on my
> > > particular TCP testbed was to front-pad the first "head" page of the
> > > xdr_buf such that the WRITE payload started at the 2nd page of
> > > rq_pages. So that looked like this hack for my usage:
> > >
> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > index 8fc5b2b2d806..cf082a265261 100644
> > > --- a/net/sunrpc/svc_xprt.c
> > > +++ b/net/sunrpc/svc_xprt.c
> > > @@ -676,7 +676,9 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
> > >
> > > /* Make arg->head point to first page and arg->pages point to rest */
> > > arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
> > > - arg->head[0].iov_len = PAGE_SIZE;
> > > + // FIXME: front-pad optimized to align TCP's WRITE payload
> > > + // but may not be enough for other operations?
> > > + arg->head[0].iov_len = 148;
> > > arg->pages = rqstp->rq_pages + 1;
> > > arg->page_base = 0;
> > > /* save at least one page for response */
> > >
> > > That gut-feeling "but may not be enough for other operations?" comment
> > > proved to be prophetic.
> > >
> > > Sadly it went on to fail spectacularly for other ops (specifically
> > > READDIR and READDIRPLUS, probably others would too) because
> > > xdr_inline_decode() _really_ doesn't like going beyond the end of the
> > > xdr_buf's inline "head" page. It could be that even if
> > > xdr_inline_decode() et al was "fixed" (which isn't for the faint of
> > > heart given xdr_buf's more complex nature) there will likely be other
> > > mole(s) that pop up. And in addition, we'd be wasting space in the
> > > xdr_buf's head page (PAGE_SIZE-frontpad). So I moved on from trying
> > > to see this frontpad hack through to completion.
> > >
> > > 3)
> > > Lastly, for completeness, I also mentioned briefly in a previous
> > > recent reply:
> > >
> > > On Wed, Jun 11, 2025 at 04:51:03PM -0400, Mike Snitzer wrote:
> > > > On Wed, Jun 11, 2025 at 11:44:29AM -0400, Jeff Layton wrote:
> > > >
> > > > > In any case, for now at least, unless you're using RDMA, it's going to
> > > > > end up falling back to buffered writes everywhere. The data is almost
> > > > > never going to be properly aligned coming in off the wire. That might
> > > > > be fixable though.
> > > >
> > > > Ben Coddington mentioned to me that soft-iwarp would allow use of RDMA
> > > > over TCP to workaround SUNRPC TCP's XDR handling always storing the
> > > > write payload in misaligned IO. But that's purely a stop-gap
> > > > workaround, which needs testing (to see if soft-iwarp negates the win
> > > > of using O_DIRECT, etc).
> > >
> > > (Ab)using soft-iwarp as the basis for easily getting page aligned TCP
> > > WRITE payloads seems pretty gross given we are chasing utmost
> > > performance, etc.
> > >
> > > All said, I welcome your sage advice and help on this effort to
> > > DIO-align SUNRPC TCP's WRITE payload pages.
> > >
> > > Thanks,
> > > Mike
> >
> > (Sent this to Mike only by accident yesterday -- resending to the full
> > list now)
> >
> > I've been looking over the code today. Basically, I think we need to
> > have svc_tcp_recvfrom() receive in phases. At a high level:
> >
> > 1/ receive the record marker (just like it does today)
> >
> > 2/ receive enough for the RPC header and then decode it.
> >
> > 3/ Use the rpc program and version from the decoded header to look up
> > the svc_program. Add an optional pg_tcp_recvfrom callback to that
> > structure that will receive the rest of the data into the buffer. If
> > pg_tcp_recvfrom isn't set, then just call svc_tcp_read_msg() like we do
> > today.
>
> The layering violations here are mind-blowing.
>
Aww. I don't think it's too bad.
>
> > For NFSv3, pg_tcp_recvfrom can just look at the procedure. If it's
> > anything but a WRITE we'll just do what we do today
> > (svc_tcp_read_msg()).
> >
> > For a WRITE, we'll receive the first part of the WRITE3args (everything
> > but the data) into rq_pages, and decode it. We can then use that info
> > to figure out the alignment. Advance to the next page in rq_pages, and
> > then to the point where the data is properly aligned. Do the receive
> > into that spot.
> >
> > Then we just add a RQ_ALIGNED_DATA to rqstp->rq_flags, and teach
> > nfsd3_proc_write how to find the data and do a DIO write when it's set.
> >
> > Unaligned writes are still a problem though. If two WRITE RPCs come in
> > for different parts of the same block at the same time, then you could
> > end up losing the result of the first write. I don't see a way to make
> > that non-racy.
> >
> > NFSv4 will also be a bit of a challenge. We'll need to receive the
> > whole compound one operation at a time. If we hit a WRITE, then we can
> > just do the same thing that we do for v3 to align the data.
> >
> > I'd probably aim to start with an implementation for v3, and then add
> > v4 support in a second phase.
> >
> > I'm interested in working on this. It'll be a fair bit of work though.
> > I'll need to think about how to break this up into manageable pieces.
>
> Bruce has been thinking about payload alignment schemes for at least
> ten years. My opinion has always been:
>
> - We have this already via RDMA, even over TCP
> - Any scheme like this will still not perform as well as RDMA
> - NFS/TCP is kind of a "works everywhere" technology that I prefer to
> not screw around with
> - The corner cases will be troubling us for many years
> - Only a handful of users will truly benefit from it
> - There are plenty of higher priority items on our to-do list.
>
If you're against the idea, I won't waste my time.
It would require some fairly hefty rejiggering of the receive code. The
v4 part would be pretty nightmarish to work out too since you'd have to
decode the compound as you receive to tell where the next op starts.
The potential for corruption with unaligned writes is also pretty
nasty.
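For what it's worth, the payload-placement piece by itself is small; a
minimal sketch under the above scheme (svc_tcp_payload_start() is a
hypothetical name, not code from the posted series, and dio_align is
assumed to be a nonzero power of two):

        #include <linux/mm.h>           /* page_address() */
        #include <linux/sunrpc/svc.h>   /* struct svc_rqst */

        /* Once the WRITE3args header has been decoded out of rq_pages[0],
         * receive the payload into the following page at an offset that
         * mirrors the file offset's sub-block remainder, so a block-aligned
         * WRITE lands page-aligned in memory. */
        static void *svc_tcp_payload_start(struct svc_rqst *rqstp,
                                           loff_t file_offset, u32 dio_align)
        {
                unsigned long skip = file_offset & (dio_align - 1);

                return page_address(rqstp->rq_pages[1]) + skip;
        }

The hard part remains stopping svc_tcp_recvfrom() after the header so this
placement decision can be made before the bulk receive.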
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-12 16:12 ` Mike Snitzer
@ 2025-06-12 16:32 ` Chuck Lever
0 siblings, 0 replies; 75+ messages in thread
From: Chuck Lever @ 2025-06-12 16:32 UTC (permalink / raw)
To: Mike Snitzer
Cc: Benjamin Coddington, Jeff Layton, linux-nfs, linux-fsdevel,
Jens Axboe
On 6/12/25 12:12 PM, Mike Snitzer wrote:
> the value of
> not requiring any RDMA hardware or client changes is too compelling to
> ignore.
I think you are vastly inflating that value.
The numbers you have presented are for big systems that are already on
a fast RDMA-capable network.
You haven't demonstrated much of an issue for low-intensity workloads
like small home directory servers. For those, unaligned WRITEs are
totally adequate and would see no real improvement if the server could
handle unaligned payloads slightly more efficiently.
I also haven't seen specific data that showed it is only the buffer
alignment issue that is slowing down NFS WRITE. IME it is actually the
per-inode i_rwsem that is the major bottleneck.
> RDMA either requires specialized hardware or software (soft-iwarp or
> soft-roce).
> Imposing those as requirements isn't going to be viable for a large
> portion of existing deployments.
I don't see clear evidence that most deployments have a buffer alignment
problem. It's easy to pick on and explain, but that doesn't mean it is
pervasive.
Some deployments have intensive performance and scalability
requirements. Those are the ones where RDMA is appropriate and feasible.
Thus IMO you're trying to solve a problem that a) is already solved and
b) does not exist for most NFS users on TCP fabrics.
There is so much low-hanging fruit here. I really don't believe it is
valuable to pursue protocol changes that will take geological amounts
of time and energy to accomplish, especially because we have a solution
now that is effective where it needs to be.
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
2025-06-12 13:46 ` Chuck Lever
@ 2025-06-12 19:08 ` Mike Snitzer
2025-06-12 20:17 ` Chuck Lever
0 siblings, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-12 19:08 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe, Dave Chinner,
willy, jonathan.flynn, keith.mannthey
On Thu, Jun 12, 2025 at 09:46:12AM -0400, Chuck Lever wrote:
> On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > The O_DIRECT performance win is pretty fantastic thanks to reduced CPU
> > and memory use, particularly for workloads with a working set that far
> > exceeds the available memory of a given server. This patchset's
> > changes (through patch 5; patch 6 wasn't written until after the
> > benchmarking was performed) enabled Hammerspace to improve its IO500.org
> > benchmark result (as submitted for this week's ISC 2025 in Hamburg,
> > Germany) by 25%.
> >
> > That 25% improvement on IO500 is owed to NFS servers seeing:
> > - reduced CPU usage from 100% to ~50%
Apples: 10 servers, 10 clients, 64 PPN (Processes Per Node):
> > O_DIRECT:
> > write: 51% idle, 25% system, 14% IO wait, 2% IRQ
> > read: 55% idle, 9% system, 32.5% IO wait, 1.5% IRQ
Oranges: 6 servers, 6 clients, 128 PPN:
> > buffered:
> > write: 17.8% idle, 67.5% system, 8% IO wait, 2% IRQ
> > read: 3.29% idle, 94.2% system, 2.5% IO wait, 1% IRQ
>
> The IO wait and IRQ numbers for the buffered results appear to be
> significantly better than for O_DIRECT. Can you help us understand
> that? Is device utilization better or worse with O_DIRECT?
It was a post-mortem data analysis fail: when I worked with others
(Jon and Keith, cc'd) to collect performance data for use in my 0th
header (above) we didn't have buffered IO performance data from the
full IO500-scale benchmark (10 servers, 10 clients). I missed that
the above married apples and oranges until you noticed something
off...
Sorry about that. We don't currently have the full 10 nodes
available, so Keith re-ran IOR "easy" testing with 6 nodes to collect
new data.
NOTE: his run used a much larger PPN (128 instead of 64) coupled with a
reduction in both the number of client and server nodes (from 10 to 6).
Here is CPU usage for one of the server nodes while running IOR "easy"
with 128 PPN on each of 6 clients, against 6 servers:
- reduced CPU usage from 100% to ~56% for read and ~71.6% for write
O_DIRECT:
write: 28.4% idle, 50% system, 13% IO wait, 2% IRQ
read: 44% idle, 11% system, 39% IO wait, 1.8% IRQ
buffered:
write: 17% idle, 68.4% system, 8.5% IO wait, 2% IRQ
read: 3.51% idle, 94.5% system, 0% IO wait, 0.6% IRQ
And associated NVMe performance:
- increased NVMe throughput when comparing O_DIRECT vs buffered:
O_DIRECT: 10.5 GB/s for writes, 11.6 GB/s for reads
buffered: 7.75-8 GB/s for writes; reads 4 GB/s before tip-over but 800 MB/s after
("tipover" is when reclaim starts to dominate due to inability to
efficiently find free pages so kswapd and kcompactd burn a lot of
resources).
And again here is the associated 6 node IOR easy NVMe performance in
graph form: https://original.art/NFSD_direct_vs_buffered_IO.jpg
> > - reduced memory usage from just under 100% (987GiB for reads, 978GiB
> > for writes) to only ~244 MB for cache+buffer use (for both reads and
> > writes).
> > - buffered would tip-over due to kswapd and kcompactd struggling to
> > find free memory during reclaim.
This memory usage data is still the case with the 6 server testbed.
> > - increased NVMe throughput when comparing O_DIRECT vs buffered:
> > O_DIRECT: 8-10 GB/s for writes, 9-11.8 GB/s for reads
> > buffered: 8 GB/s for writes, 4-5 GB/s for reads
And again, here is the end result for the IOR easy benchmark:
From Hammerspace's 10 node IO500 reported summary of IOR easy result:
Write:
O_DIRECT: [RESULT] ior-easy-write 420.351599 GiB/s : time 869.650 seconds
CACHED: [RESULT] ior-easy-write 368.268722 GiB/s : time 413.647 seconds
Read:
O_DIRECT: [RESULT] ior-easy-read 446.790791 GiB/s : time 818.219 seconds
CACHED: [RESULT] ior-easy-read 284.706196 GiB/s : time 534.950 seconds
From Hammerspace's 6 node run, IOR's summary output format (as opposed
to IO500 reported summary from above 10 node result):
Write:
IO mode: access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
-------- ------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ----
O_DIRECT write 348132 348133 0.002035 278921216 1024.00 0.040978 600.89 384.04 600.90 0
CACHED write 295579 295579 0.002416 278921216 1024.00 0.051602 707.73 355.27 707.73 0
IO mode: access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
-------- ------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ----
O_DIRECT read 347971 347973 0.001928 278921216 1024.00 0.017612 601.17 421.30 601.17 0
CACHED read 60653 60653 0.006894 278921216 1024.00 0.017279 3448.99 2975.23 3448.99 0
> > - ability to support more IO threads per client system (from 48 to 64)
>
> This last item: how do you measure the "ability to support more
> threads"? Is there a latency curve that is flatter? Do you see changes
> in the latency distribution and the number of latency outliers?
Mainly in the context of the IOR benchmark's result: we can see the
point where increasing PPN becomes detrimental because the score either
stops improving or gets worse.
> My general comment here is kind of in the "related or future work"
> category. This is not an objection, just thinking out loud.
>
> But, can we get more insight into specifically where the CPU
> utilization reduction comes from? Is it lock contention? Is it
> inefficient data structure traversal? Any improvement here benefits
> everyone, so that should be a focus of some study.
Buffered IO just commands more resources than O_DIRECT for workloads
with a working set that exceeds system memory.
Each of the 6 servers has 1TiB of memory.
So for the above 6-client, 128-PPN IOR "easy" run, each client thread
is writing and then reading 266 GiB. That creates an aggregate working
set of 6 clients x 128 threads x 266 GiB, or roughly 199.50 TiB.
The 199.50 TiB working set dwarfs the servers' aggregate 6 TiB of
memory. Being able to drive each of the 8 NVMe in each server as
efficiently as possible is critical.
As you can see from the NVMe performance numbers above, O_DIRECT is best.
> If the memory utilization is a problem, that sounds like an issue with
> kernel systems outside of NFSD, or perhaps some system tuning can be
> done to improve matters. Again, drilling into this and trying to improve
> it will benefit everyone.
Yeah, there is an extensive, iceberg-level issue with buffered IO and
MM (reclaim's use of kswapd and kcompactd to find free pages) that
underpins the justification for RWF_DONTCACHE being developed and
merged. I'm not the best person to speak to all the long-standing
challenges (Willy, Dave, Jens, others would be better).
> These results do point to some problems, clearly. Whether NFSD using
> direct I/O is the best solution is not obvious to me yet.
All solutions are on the table. O_DIRECT just happens to be the most
straight-forward to work through at this point.
Dave Chinner's feeling that O_DIRECT is a much better solution than
RWF_DONTCACHE for NFSD certainly helped narrow my focus too, from:
https://lore.kernel.org/linux-nfs/aBrKbOoj4dgUvz8f@dread.disaster.area/
"The nfs client largely aligns all of the page caceh based IO, so I'd
think that O_DIRECT on the server side would be much more performant
than RWF_DONTCACHE. Especially as XFS will do concurrent O_DIRECT
writes all the way down to the storage....."
(Dave would be correct about NFSD's page alignment if RDMA is used, but
that's obviously not the case if TCP is used, due to SUNRPC TCP's WRITE
payload being received into misaligned pages).
Thanks,
Mike
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
2025-06-12 19:08 ` Mike Snitzer
@ 2025-06-12 20:17 ` Chuck Lever
0 siblings, 0 replies; 75+ messages in thread
From: Chuck Lever @ 2025-06-12 20:17 UTC (permalink / raw)
To: Mike Snitzer
Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe, Dave Chinner,
willy, jonathan.flynn, keith.mannthey
On 6/12/25 3:08 PM, Mike Snitzer wrote:
> On Thu, Jun 12, 2025 at 09:46:12AM -0400, Chuck Lever wrote:
>> But, can we get more insight into specifically where the CPU
>> utilization reduction comes from? Is it lock contention? Is it
>> inefficient data structure traversal? Any improvement here benefits
>> everyone, so that should be a focus of some study.
>
> Buffered IO just commands more resources than O_DIRECT for workloads
> with a working set that exceeds system memory.
No doubt. However, using direct I/O has some consequences that we might
be able to avoid if we understand better how to manage the server's
cache rather than not caching at all.
> Each of the 6 servers has 1TiB of memory.
>
> So for the above 6-client, 128-PPN IOR "easy" run, each client thread
> is writing and then reading 266 GiB. That creates an aggregate
> working set of 199.50 TiB
>
> The 199.50 TiB working set dwarfs the servers' aggregate 6 TiB of
> memory. Being able to drive each of the 8 NVMe in each server as
> efficiently as possible is critical.
>
> As you can see from the NVMe performance numbers above, O_DIRECT is best.
Well, I see that it is the better choice between full caching v. direct
I/O when the backing storage is nearly as fast as memory. The sticking
point for me there is what will happen with slower backing storage.
> "The nfs client largely aligns all of the page caceh based IO, so I'd
> think that O_DIRECT on the server side would be much more performant
> than RWF_DONTCACHE. Especially as XFS will do concurrent O_DIRECT
> writes all the way down to the storage....."
>
> (Dave would be correct about NFSD's page alignment if RDMA used, but
> obviously not the case if TCP used due to SUNRPC TCP's WRITE payload
> being received into misaligned pages).
RDMA gives us the opportunity to align the sink buffer pages on the NFS
server, yes. However I'm not sure if NFSD currently goes to the trouble
of actually doing that alignment before starting RDMA Reads. There
always seems to be one or more data copies needed when going through
nfsd_vfs_write().
If the application has aligned the WRITE payload already, we might not
notice that deficiency for many common workloads. For example, if most
unaligned writes come from small payloads, server-side re-alignment
might not matter -- there could be intrinsic RMW cycles that erase the
benefits of buffer alignment. Big payloads are usually aligned to
memory and file pages already.
Something to look into.
--
Chuck Lever
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
2025-06-12 7:39 ` Christoph Hellwig
@ 2025-06-12 20:37 ` Mike Snitzer
2025-06-13 5:31 ` Christoph Hellwig
0 siblings, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-12 20:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jeff Layton, Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe,
Dave Chinner
On Thu, Jun 12, 2025 at 12:39:44AM -0700, Christoph Hellwig wrote:
>
> Another thing is that using the page cache for reads is probably
> rather pointless. I've been wondering if we should just change
> the direct I/O read code to read from the page cache if there are
> cached pages and otherwise go direct to the device. That would make
> a setup using buffered writes (with or without the dontcache
> flag) and direct I/O reads safe.
Yes, that sounds like a good idea. Just an idea at this point or have
you tried to implement it?
I'll start looking at associated code, but may slip until next week.
FYI, I mentioned this earlier at one point in this thread but I was
thinking the IOR "hard" benchmark would offer a solid test for
invalidating page cache vs O_DIRECT reads when run against NFS/NFSD
with this NFSD O_DIRECT series applied, which causes NFSD's misaligned
IO to use buffered IO for writes and O_DIRECT for reads. NFSD issuing
the misaligned write to XFS will force RMW when writing, creating
pages that must be invalidated for any subsequent NFSD read.
Turns out IOR "hard" does in fact fail spectacularly with:
WARNING: Incorrect data on read (6640830 errors found).
It doesn't fail if the 6th patch in this series isn't used:
https://lore.kernel.org/linux-nfs/20250610205737.63343-7-snitzer@kernel.org/
Could be a bug in that patch but I think it more likely IOR-hard is
teasing out the invalidation race you mentioned at the start.
We're retesting with RWF_SYNC set for the buffered write IO (which
Jeff suggested earlier in this thread).
But your idea seems important to pursue.
I won't be posting a v2 for this series until I can find/fix this
IOR-hard testcase.
Mike
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
2025-06-12 20:37 ` Mike Snitzer
@ 2025-06-13 5:31 ` Christoph Hellwig
0 siblings, 0 replies; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-13 5:31 UTC (permalink / raw)
To: Mike Snitzer
Cc: Christoph Hellwig, Jeff Layton, Chuck Lever, linux-nfs,
linux-fsdevel, Jens Axboe, Dave Chinner
On Thu, Jun 12, 2025 at 04:37:22PM -0400, Mike Snitzer wrote:
> On Thu, Jun 12, 2025 at 12:39:44AM -0700, Christoph Hellwig wrote:
> >
> > Another thing is that using the page cache for reads is probably
> > rather pointless. I've been wondering if we should just change
> > the direct I/O read code to read from the page cache if there are
> > cached pages and otherwise go direct to the device. That would make
> > a setup using buffered writes (with or without the dontcache
> > flag) and direct I/O reads safe.
>
> Yes, that sounds like a good idea. Just an idea at this point or have
> you tried to implement it?
Just an idea.
> FYI, I mentioned this earlier at one point in this thread but I was
> thinking the IOR "hard"
Sorry, but what is IOR?
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-12 15:58 ` Chuck Lever
2025-06-12 16:12 ` Mike Snitzer
@ 2025-06-13 5:39 ` Christoph Hellwig
1 sibling, 0 replies; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-13 5:39 UTC (permalink / raw)
To: Chuck Lever
Cc: Mike Snitzer, Benjamin Coddington, Jeff Layton, linux-nfs,
linux-fsdevel, Jens Axboe
On Thu, Jun 12, 2025 at 11:58:27AM -0400, Chuck Lever wrote:
> NFS/RDMA does this already. Let's not re-invent the wheel.
The other thing that fixes the problem (but also creates various others)
is the block/scsi/nvme layouts, which guarantee that all the data
transfers to the data device use block protocols that get this right.
Well, unless you run them over TCP and still get the whole receive
side copy issue in the drivers, but at least the copied payload
is always aligned.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-12 16:22 ` Jeff Layton
@ 2025-06-13 5:46 ` Christoph Hellwig
2025-06-13 9:23 ` Mike Snitzer
0 siblings, 1 reply; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-13 5:46 UTC (permalink / raw)
To: Jeff Layton
Cc: Chuck Lever, Mike Snitzer, linux-nfs, linux-fsdevel, Jens Axboe
On Thu, Jun 12, 2025 at 12:22:42PM -0400, Jeff Layton wrote:
> If you're against the idea, I won't waste my time.
>
> It would require some fairly hefty rejiggering of the receive code. The
> v4 part would be pretty nightmarish to work out too since you'd have to
> decode the compound as you receive to tell where the next op starts.
>
> The potential for corruption with unaligned writes is also pretty
> nasty.
Maybe I'm missing an improvement to the receive buffer handling in modern
network hardware, but AFAIK this still would only help you to align the
sunrpc data buffer to page boundaries, not avoid the data copy from the
hardware receive buffer to the sunrpc data buffer, as you still don't have
hardware header splitting.
And I don't even know what this is supposed to buy the nfs server.
Direct I/O writes need to have the proper file offset alignment, but as
far as Linux is concerned we don't require any memory alignment. Most
storage hardware has requirements for the memory alignment that we pass
on, but typically that's just a dword (4-byte) alignment, which matches
the alignment sunrpc wants for most XDR data structures anyway. So what
additional alignment is actually needed to support direct I/O writes
assuming that is the goal? (I might also simply misunderstand the
problem).
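For reference, those two distinct requirements (memory alignment vs file
offset/length alignment) are exactly what STATX_DIOALIGN reports; a minimal
userspace sketch (not part of the series; needs a 6.1+ kernel and headers
that define STATX_DIOALIGN):

        #define _GNU_SOURCE
        #include <fcntl.h>      /* AT_FDCWD */
        #include <stdio.h>
        #include <sys/stat.h>   /* statx(), STATX_DIOALIGN */

        int main(int argc, char **argv)
        {
                struct statx stx;

                if (argc < 2) {
                        fprintf(stderr, "usage: %s <file>\n", argv[0]);
                        return 1;
                }
                if (statx(AT_FDCWD, argv[1], 0, STATX_DIOALIGN, &stx)) {
                        perror("statx");
                        return 1;
                }
                if (!(stx.stx_mask & STATX_DIOALIGN)) {
                        fprintf(stderr, "%s: no DIO alignment info\n", argv[1]);
                        return 1;
                }
                /* memory alignment is typically tiny (dword-ish); the
                 * offset/length alignment is the sector or fs block size */
                printf("%s: dio mem align %u, dio offset align %u\n",
                       argv[1], stx.stx_dio_mem_align, stx.stx_dio_offset_align);
                return 0;
        }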
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-13 5:46 ` Christoph Hellwig
@ 2025-06-13 9:23 ` Mike Snitzer
2025-06-13 13:02 ` Jeff Layton
2025-06-16 12:29 ` Christoph Hellwig
0 siblings, 2 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-13 9:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jeff Layton, Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe,
david.flynn
On Thu, Jun 12, 2025 at 10:46:01PM -0700, Christoph Hellwig wrote:
> On Thu, Jun 12, 2025 at 12:22:42PM -0400, Jeff Layton wrote:
> > If you're against the idea, I won't waste my time.
> >
> > It would require some fairly hefty rejiggering of the receive code. The
> > v4 part would be pretty nightmarish to work out too since you'd have to
> > decode the compound as you receive to tell where the next op starts.
> >
> > The potential for corruption with unaligned writes is also pretty
> > nasty.
>
> Maybe I'm missing an improvement to the receive buffer handling in modern
> network hardware, but AFAIK this still would only help you to align the
> sunrpc data buffer to page boundaries, not avoid the data copy from the
> hardware receive buffer to the sunrpc data buffer, as you still don't have
> hardware header splitting.
Correct, everything that Jeff detailed is about ensuring the WRITE
payload is received at a page-aligned boundary.
Which in practice has proven a hard requirement for O_DIRECT in my
testing -- but I could be hitting some bizarre driver bug in my TCP
testbed (which sadly sits on top of older VMware guests/drivers).
But if you look at patch 5 in this series:
https://lore.kernel.org/linux-nfs/20250610205737.63343-6-snitzer@kernel.org/
I added fs/nfsd/vfs.c:is_dio_aligned(), which is basically a tweaked
ditto of fs/btrfs/direct-io.c:check_direct_IO():
static bool is_dio_aligned(const struct iov_iter *iter, loff_t offset,
                           const u32 blocksize)
{
        u32 blocksize_mask;

        if (!blocksize)
                return false;

        blocksize_mask = blocksize - 1;
        if ((offset & blocksize_mask) ||
            (iov_iter_alignment(iter) & blocksize_mask))
                return false;

        return true;
}
And fs/nfsd/vfs.c:nfsd_vfs_write() has (after my patch 5):
        nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
        iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);

        if (nfsd_enable_dontcache) {
                if (is_dio_aligned(&iter, offset, nf->nf_dio_offset_align))
                        flags |= RWF_DIRECT;
What I found is that unless SUNRPC TCP stored the WRITE payload at a
page-aligned boundary then iov_iter_alignment() would fail.
The @payload arg above, with my SUNRPC TCP testing, was always offset
148 bytes into the first page of the pages allocated for xdr_buf's
use, which is rqstp->rq_pages, which is allocated by
net/sunrpc/svc_xprt.c:svc_alloc_arg().
> And I don't even know what this is supposed to buy the nfs server.
> Direct I/O writes need to have the proper file offset alignment, but as
> far as Linux is concerned we don't require any memory alignment. Most
> storage hardware has requirements for the memory alignment that we pass
> on, but typically that's just a dword (4-byte) alignment, which matches
> the alignment sunrpc wants for most XDR data structures anyway. So what
> additional alignment is actually needed for support direct I/O writes
> assuming that is the goal? (I might also simply misunderstand the
> problem).
THIS... this is the very precise question/detail I discussed with
Hammerspace's CEO David Flynn when discussing Linux's O_DIRECT
support. David shares your understanding and confusion. And all I
could tell him is that in practice I always page-aligned my data
buffers used to issue O_DIRECT. And in this instance, if I don't
page-align, O_DIRECT doesn't work (seen when I commented out the
iov_iter_alignment check in is_dio_aligned above).
But is that simply due to xdr_buf_to_bvec()'s use of bvec_set_virt()
for xdr_buf "head" page (first page of rqstp->rg_pages)? Whereas you
can see xdr_buf_to_bvec() uses bvec_set_page() to add each of the
other pages that immediately follow the first "head" page.
All said, if Linux can/should happily allow non-page-aligned DIO (and
we only need to worry about the on-disk DIO alignment requirements)
that'd be wonderful.
Then it's just a matter of finding where that is broken...
Happy to dig into this further if you might nudge me in the right
direction.
Thanks,
Mike
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-13 9:23 ` Mike Snitzer
@ 2025-06-13 13:02 ` Jeff Layton
2025-06-16 12:35 ` Christoph Hellwig
2025-06-16 12:29 ` Christoph Hellwig
1 sibling, 1 reply; 75+ messages in thread
From: Jeff Layton @ 2025-06-13 13:02 UTC (permalink / raw)
To: Mike Snitzer, Christoph Hellwig
Cc: Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe, david.flynn
On Fri, 2025-06-13 at 05:23 -0400, Mike Snitzer wrote:
> On Thu, Jun 12, 2025 at 10:46:01PM -0700, Christoph Hellwig wrote:
> > On Thu, Jun 12, 2025 at 12:22:42PM -0400, Jeff Layton wrote:
> > > If you're against the idea, I won't waste my time.
> > >
> > > It would require some fairly hefty rejiggering of the receive code. The
> > > v4 part would be pretty nightmarish to work out too since you'd have to
> > > decode the compound as you receive to tell where the next op starts.
> > >
> > > The potential for corruption with unaligned writes is also pretty
> > > nasty.
> >
> > Maybe I'm missing an improvement to the receive buffer handling in modern
> > network hardware, but AFAIK this still would only help you to align the
> > sunrpc data buffer to page boundaries, not avoid the data copy from the
> > hardware receive buffer to the sunrpc data buffer, as you still don't have
> > hardware header splitting.
>
> Correct, everything that Jeff detailed is about ensuring the WRITE
> payload is received at a page-aligned boundary.
>
> Which in practice has proven a hard requirement for O_DIRECT in my
> testing -- but I could be hitting some bizarre driver bug in my TCP
> testbed (which sadly sits on top of older VMware guests/drivers).
>
> But if you look at patch 5 in this series:
> https://lore.kernel.org/linux-nfs/20250610205737.63343-6-snitzer@kernel.org/
>
> I added fs/nfsd/vfs.c:is_dio_aligned(), which is basically a tweaked
> ditto of fs/btrfs/direct-io.c:check_direct_IO():
>
> static bool is_dio_aligned(const struct iov_iter *iter, loff_t offset,
>                            const u32 blocksize)
> {
>         u32 blocksize_mask;
>
>         if (!blocksize)
>                 return false;
>
>         blocksize_mask = blocksize - 1;
>         if ((offset & blocksize_mask) ||
>             (iov_iter_alignment(iter) & blocksize_mask))
>                 return false;
>
>         return true;
> }
>
> And fs/nfsd/vfs.c:nfsd_vfs_write() has (after my patch 5):
>
>         nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
>         iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
>
>         if (nfsd_enable_dontcache) {
>                 if (is_dio_aligned(&iter, offset, nf->nf_dio_offset_align))
>                         flags |= RWF_DIRECT;
>
> What I found is that unless SUNRPC TCP stored the WRITE payload at a
> page-aligned boundary then iov_iter_alignment() would fail.
>
> The @payload arg above, with my SUNRPC TCP testing, was always offset
> 148 bytes into the first page of the pages allocated for xdr_buf's
> use, which is rqstp->rq_pages, which is allocated by
> net/sunrpc/svc_xprt.c:svc_alloc_arg().
>
> > And I don't even know what this is supposed to buy the nfs server.
> > Direct I/O writes need to have the proper file offset alignment, but as
> > far as Linux is concerned we don't require any memory alignment. Most
> > storage hardware has requirements for the memory alignment that we pass
> > on, but typically that's just a dword (4-byte) alignment, which matches
> > the alignment sunrpc wants for most XDR data structures anyway. So what
> > additional alignment is actually needed to support direct I/O writes
> > assuming that is the goal? (I might also simply misunderstand the
> > problem).
>
> THIS... this is the very precise question/detail I discussed with
> Hammerspace's CEO David Flynn when discussing Linux's O_DIRECT
> support. David shares your understanding and confusion. And all I
> could tell him is that in practice I always page-aligned my data
> buffers used to issue O_DIRECT. And that in this instance if I don't
> then O_DIRECT doesn't work (if I commented out the iov_iter_alignment
> check in is_dio_aligned above).
>
> But is that simply due to xdr_buf_to_bvec()'s use of bvec_set_virt()
> for xdr_buf "head" page (first page of rqstp->rg_pages)? Whereas you
> can see xdr_buf_to_bvec() uses bvec_set_page() to add each of the
> other pages that immediately follow the first "head" page.
>
> All said, if Linux can/should happily allow non-page-aligned DIO (and
> we only need to worry about the on-disk DIO alignment requirements)
> that'd be wonderful.
>
> Then it's just a matter of finding where that is broken...
>
> Happy to dig into this further if you might nudge me in the right
> direction.
>
This is an excellent point. If the memory alignment doesn't matter,
then maybe it's enough to just receive the same way we do today and
just pad out to the correct blocksize in the bvec array if the data is
unaligned vs. the blocksize.
We still have the problem of how to do a proper RMW though to deal with
unaligned writes. A couple of possibilities come to mind:
1. nfsd could just return nfserr_inval when a write is unaligned and
the export is set up for DIO writes. IOW, just project the requirement
about alignment to the client. This might be the safest option, at
least initially. Unaligned writes are pretty uncommon. Most clients
will probably never hit the error.
2. What if we added a new "rmw_iter" operation to file_operations that
could be used for unaligned writes? XFS (for instance) could take the
i_rwsem exclusive, do DIO reads of the end blocks into bounce pages,
copy in the unaligned bits at the ends of the iter, do a DIO write and
release the lock (see the sketch below). It'll be slow as hell, but it wouldn't be racy.
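A hedged sketch of just the span bookkeeping such an rmw_iter-style helper
would need (names are hypothetical, locking and the actual bounce-buffer
reads/writes are omitted, and bsize is assumed to be a power of two):

        #include <linux/math.h>         /* round_up(), round_down() */
        #include <linux/types.h>

        /* Which block-aligned span covers an unaligned write, and whether
         * the head/tail blocks must first be read into bounce pages before
         * the new bytes are copied in and the span is written back via DIO. */
        struct rmw_span {
                loff_t  start;          /* rounded down to bsize */
                loff_t  end;            /* rounded up to bsize (exclusive) */
                bool    read_head;
                bool    read_tail;
        };

        static struct rmw_span rmw_span_for(loff_t pos, size_t len, u32 bsize)
        {
                struct rmw_span s;

                s.start     = round_down(pos, bsize);
                s.end       = round_up(pos + len, bsize);
                s.read_head = (pos != s.start);
                s.read_tail = (pos + len != s.end);
                return s;
        }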
Mike, would you be amenable to option #1, at least initially? If we can
come up with a way to do unaligned writes safely, we could relax the
restriction later.
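Roughly what I have in mind for #1 (just a sketch -- the
nfsd_io_cache_write setting and its values are made-up names here; only
is_dio_aligned() and nf_dio_offset_align come from Mike's series):

	/*
	 * Sketch of option #1: if the server has been opted into
	 * O_DIRECT writes, reject a WRITE that is not DIO-aligned
	 * instead of silently falling back to buffered IO.
	 */
	if (nfsd_io_cache_write == NFSD_IO_DIRECT &&
	    !is_dio_aligned(&iter, offset, nf->nf_dio_offset_align))
		return nfserr_inval;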
I'm only half serious about rmw_iter, but it does seem like that could
work.
--
Jeff Layton <jlayton@kernel.org>
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-13 9:23 ` Mike Snitzer
2025-06-13 13:02 ` Jeff Layton
@ 2025-06-16 12:29 ` Christoph Hellwig
2025-06-16 16:07 ` Mike Snitzer
1 sibling, 1 reply; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-16 12:29 UTC (permalink / raw)
To: Mike Snitzer
Cc: Christoph Hellwig, Jeff Layton, Chuck Lever, linux-nfs,
linux-fsdevel, Jens Axboe, david.flynn
On Fri, Jun 13, 2025 at 05:23:48AM -0400, Mike Snitzer wrote:
> Which in practice has proven a hard requirement for O_DIRECT in my
> testing
What fails if you don't page align the memory?
> But if you look at patch 5 in this series:
> https://lore.kernel.org/linux-nfs/20250610205737.63343-6-snitzer@kernel.org/
>
> I added fs/nfsd/vfs.c:is_dio_aligned(), which is basically a tweaked
> ditto of fs/btrfs/direct-io.c:check_direct_IO():
No idea why btrfs still has this, but it's not a general requirement
from the block layer or other file systems. You just need to be
aligned to the dma alignment in the queue limits, which for most NVMe,
SCSI or ATA devices reports a dword alignment. Some of the more
obscure drivers might require more alignment, or just report it due to
copy and paste.
> What I found is that unless SUNRPC TCP stored the WRITE payload on a
> page-aligned boundary then iov_iter_alignment() would fail.
iov_iter_alignment would fail, or your check based on it? The latter
will fail, but it doesn't check anything that matters :)
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-13 13:02 ` Jeff Layton
@ 2025-06-16 12:35 ` Christoph Hellwig
0 siblings, 0 replies; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-16 12:35 UTC (permalink / raw)
To: Jeff Layton
Cc: Mike Snitzer, Christoph Hellwig, Chuck Lever, linux-nfs,
linux-fsdevel, Jens Axboe, david.flynn
On Fri, Jun 13, 2025 at 09:02:23AM -0400, Jeff Layton wrote:
> This is an excellent point. If the memory alignment doesn't matter,
> then maybe it's enough to just receive the same way we do today and
> just pad out to the correct blocksize in the bvec array if the data is
> unaligned vs. the blocksize.
Note that the size and the logical offset of direct I/O writes need to
be aligned to at least the sector size of the device, and for out of
place writing file systems the block size of the file system. It's just
the memory address that (usually) only has minimal alignment
requirements. So no need to pad anything IFF your writes have the
right logical offset alignment and size, and if they don't, no padding
is going to help you.
> 1. nfsd could just return nfserr_inval when a write is unaligned and
> the export is set up for DIO writes. IOW, just project the requirement
> about alignment to the client. This might be the safest option, at
> least initially. Unaligned writes are pretty uncommon. Most clients
> will probably never hit the error.
Eww. While requiring aligned writes sounds fine (the block layout
family requires it), you'll really want to come up with a draft that
allows clients to opt into it. Something like the server offering an
attribute that the client then opts into, or so.
> 2. What if we added a new "rmw_iter" operation to file_operations that
> could be used for unaligned writes? XFS (for instance) could take the
> i_rwsem exclusive, do DIO reads of the end blocks into bounce pages,
> copy in the unaligned bits at the ends of the iter, do a DIO write and
> release the lock. It'll be slow as hell, but it wouldn't be racy.
That does sound doable. But it's a fair amount of extra VFS and file
system code, so it needs very solid numbers to back it up.
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-12 16:00 ` Mike Snitzer
@ 2025-06-16 13:32 ` Chuck Lever
2025-06-16 16:10 ` Mike Snitzer
0 siblings, 1 reply; 75+ messages in thread
From: Chuck Lever @ 2025-06-16 13:32 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On 6/12/25 12:00 PM, Mike Snitzer wrote:
> On Thu, Jun 12, 2025 at 09:21:35AM -0400, Chuck Lever wrote:
>> On 6/11/25 3:18 PM, Mike Snitzer wrote:
>>> On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
>>>> On 6/10/25 4:57 PM, Mike Snitzer wrote:
>>>>> Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
>>>>> read or written by NFSD will either not be cached (thanks to O_DIRECT)
>>>>> or will be removed from the page cache upon completion (DONTCACHE).
>>>>
>>>> I thought we were going to do two switches: One for reads and one for
>>>> writes? I could be misremembering.
>>>
>>> We did discuss the possibility of doing that. Still can-do if that's
>>> what you'd prefer.
>>
>> For our experimental interface, I think having read and write enablement
>> as separate settings is wise, so please do that.
>>
>> One quibble, though: The name "enable_dontcache" might be directly
>> meaningful to you, but I think others might find "enable_dont" to be
>> oxymoronic. And, it ties the setting to a specific kernel technology:
>> RWF_DONTCACHE.
>>
>> So: Can we call these settings "io_cache_read" and "io_cache_write" ?
>>
>> They could each carry multiple settings:
>>
>> 0: Use page cache
>> 1: Use RWF_DONTCACHE
>> 2: Use O_DIRECT
>>
>> You can choose to implement any or all of the above three mechanisms.
>
> I like it, will do for v2. But will have O_DIRECT=1 and RWF_DONTCACHE=2.
For io_cache_read, either settings 1 and 2 need to set
disable_splice_read, or the io_cache_read setting has to be considered
by nfsd_read_splice_ok() when deciding to use nfsd_iter_read() or
splice read.
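The latter could be as small as something like this (sketch only, using
the io_cache_read name proposed above):

	/* in nfsd_read_splice_ok(), ahead of the existing checks:
	 * any non-default io_cache_read setting forces the
	 * nfsd_iter_read() path */
	if (nfsd_io_cache_read != 0)
		return false;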
However, it would be slightly nicer if we could decide whether splice
read can be removed /before/ this series is merged. Can you get NFSD
tested with IOR with disable_splice_read both enabled and disabled (no
direct I/O)? Then we can compare the results to ensure that there is no
negative performance impact for removing the splice read code.
--
Chuck Lever
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-16 12:29 ` Christoph Hellwig
@ 2025-06-16 16:07 ` Mike Snitzer
2025-06-17 4:37 ` Christoph Hellwig
0 siblings, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-16 16:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jeff Layton, Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe,
david.flynn
On Mon, Jun 16, 2025 at 05:29:32AM -0700, Christoph Hellwig wrote:
> On Fri, Jun 13, 2025 at 05:23:48AM -0400, Mike Snitzer wrote:
> > Which in practice has proven a hard requirement for O_DIRECT in my
> > testing
>
> What fails if you don't page align the memory?
>
> > But if you look at patch 5 in this series:
> > https://lore.kernel.org/linux-nfs/20250610205737.63343-6-snitzer@kernel.org/
> >
> > I added fs/nfsd/vfs.c:is_dio_aligned(), which is basically a tweaked
> > ditto of fs/btrfs/direct-io.c:check_direct_IO():
>
> No idea why btrfs still has this, but it's not a general requirement
> from the block layer or other file systems. You just need to be
> aligned to the dma alignment in the queue limits, which for most NVMe,
> SCSI or ATA devices reports a dword alignment. Some of the more
> obscure drivers might require more alignment, or just report it due to
> copy and paste.
Yeah, should probably be fixed and the rest of filesystems audited.
> > What I found is that unless SUNRPC TCP stored the WRITE payload on a
> > page-aligned boundary then iov_iter_alignment() would fail.
>
> iov_iter_alignment would fail, or your check based on it? The latter
> will fail, but it doesn't check anything that matters :)
>
The latter: the check based on iov_iter_alignment() failed. I
understand your point.
Thankfully I can confirm that dword alignment is all that is needed on
modern hardware, just showing my work:
I retested: a 512K write payload that is aligned to the XFS bdev's
logical_block_size (512b) still fails when I skip my higher-level
iov_iter_alignment() check, because it then fails in
fs/iomap/direct-io.c:iomap_dio_bio_iter() at this check:
if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
!bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
return -EINVAL;
where bdev_iter_is_aligned() is:
static inline bool bdev_iter_is_aligned(struct block_device *bdev,
struct iov_iter *iter)
{
return iov_iter_is_aligned(iter, bdev_dma_alignment(bdev),
bdev_logical_block_size(bdev) - 1);
}
and bdev_dma_alignment for my particular test bdev is 511 :(
But that's OK... my test bdev is a bad example (archaic VMware vSphere
provided SCSI device): it doesn't reflect expected modern hardware.
But I just slapped together a test pmem blockdevice (memory backed,
using memmap=6G!18G) and it too has dma_alignment=511
I do have access to a KVM guest with a virtio_scsi root bdev that has
dma_alignment=3
I also just confirmed that modern NVMe devices on another testbed also
have dma_alignment=3, whew...
I'd like NFSD to be able to know if its bvec is dma-aligned, before
issuing DIO writes to underlying XFS. AFAIK I can do that simply by
checking the STATX_DIOALIGN provided dio_mem_align...
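i.e. roughly this when the nfsd_file is first opened for a regular file
(sketch; the nf_dio_* field names follow the series' nf_dio_offset_align,
the rest is the stock vfs_getattr() interface):

	struct kstat stat;

	if (!vfs_getattr(&nf->nf_file->f_path, &stat,
			 STATX_DIOALIGN, AT_STATX_SYNC_AS_STAT) &&
	    (stat.result_mask & STATX_DIOALIGN)) {
		/* cache for later per-IO alignment checks */
		nf->nf_dio_mem_align = stat.dio_mem_align;
		nf->nf_dio_offset_align = stat.dio_offset_align;
	}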
Thanks,
Mike
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-16 13:32 ` Chuck Lever
@ 2025-06-16 16:10 ` Mike Snitzer
2025-06-17 17:22 ` Mike Snitzer
0 siblings, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-16 16:10 UTC (permalink / raw)
To: Chuck Lever; +Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe
On Mon, Jun 16, 2025 at 09:32:16AM -0400, Chuck Lever wrote:
> On 6/12/25 12:00 PM, Mike Snitzer wrote:
> > On Thu, Jun 12, 2025 at 09:21:35AM -0400, Chuck Lever wrote:
> >> On 6/11/25 3:18 PM, Mike Snitzer wrote:
> >>> On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
> >>>> On 6/10/25 4:57 PM, Mike Snitzer wrote:
> >>>>> Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
> >>>>> read or written by NFSD will either not be cached (thanks to O_DIRECT)
> >>>>> or will be removed from the page cache upon completion (DONTCACHE).
> >>>>
> >>>> I thought we were going to do two switches: One for reads and one for
> >>>> writes? I could be misremembering.
> >>>
> >>> We did discuss the possibility of doing that. Still can-do if that's
> >>> what you'd prefer.
> >>
> >> For our experimental interface, I think having read and write enablement
> >> as separate settings is wise, so please do that.
> >>
> >> One quibble, though: The name "enable_dontcache" might be directly
> >> meaningful to you, but I think others might find "enable_dont" to be
> >> oxymoronic. And, it ties the setting to a specific kernel technology:
> >> RWF_DONTCACHE.
> >>
> >> So: Can we call these settings "io_cache_read" and "io_cache_write" ?
> >>
> >> They could each carry multiple settings:
> >>
> >> 0: Use page cache
> >> 1: Use RWF_DONTCACHE
> >> 2: Use O_DIRECT
> >>
> >> You can choose to implement any or all of the above three mechanisms.
> >
> > I like it, will do for v2. But will have O_DIRECT=1 and RWF_DONTCACHE=2.
>
> For io_cache_read, either settings 1 and 2 need to set
> disable_splice_read, or the io_cache_read setting has to be considered
> by nfsd_read_splice_ok() when deciding to use nfsd_iter_read() or
> splice read.
Yes, I understand.
> However, it would be slightly nicer if we could decide whether splice
> read can be removed /before/ this series is merged. Can you get NFSD
> tested with IOR with disable_splice_read both enabled and disabled (no
> direct I/O)? Then we can compare the results to ensure that there is no
> negative performance impact for removing the splice read code.
I can ask if we have a small window of opportunity to get this tested,
will let you know if so.
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-16 16:07 ` Mike Snitzer
@ 2025-06-17 4:37 ` Christoph Hellwig
2025-06-17 20:26 ` Mike Snitzer
0 siblings, 1 reply; 75+ messages in thread
From: Christoph Hellwig @ 2025-06-17 4:37 UTC (permalink / raw)
To: Mike Snitzer
Cc: Christoph Hellwig, Jeff Layton, Chuck Lever, linux-nfs,
linux-fsdevel, Jens Axboe, david.flynn
On Mon, Jun 16, 2025 at 12:07:42PM -0400, Mike Snitzer wrote:
> But that's OK... my test bdev is a bad example (archaic VMware vSphere
> provided SCSI device): it doesn't reflect expected modern hardware.
>
> But I just slapped together a test pmem blockdevice (memory backed,
> using memmap=6G!18G) and it too has dma_alignment=511
That's the block layer default when not overridden by the driver, I guess
pmem folks didn't care enough. I suspect it should not have any
alignment requirements at all.
> I'd like NFSD to be able to know if its bvec is dma-aligned, before
> issuing DIO writes to underlying XFS. AFAIK I can do that simply by
> checking the STATX_DIOALIGN provided dio_mem_align...
Exactly.
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-16 16:10 ` Mike Snitzer
@ 2025-06-17 17:22 ` Mike Snitzer
2025-06-17 17:31 ` Chuck Lever
0 siblings, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-17 17:22 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe, keith.mannthey
On Mon, Jun 16, 2025 at 12:10:38PM -0400, Mike Snitzer wrote:
> On Mon, Jun 16, 2025 at 09:32:16AM -0400, Chuck Lever wrote:
> > On 6/12/25 12:00 PM, Mike Snitzer wrote:
> > > On Thu, Jun 12, 2025 at 09:21:35AM -0400, Chuck Lever wrote:
> > >> On 6/11/25 3:18 PM, Mike Snitzer wrote:
> > >>> On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
> > >>>> On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > >>>>> Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
> > >>>>> read or written by NFSD will either not be cached (thanks to O_DIRECT)
> > >>>>> or will be removed from the page cache upon completion (DONTCACHE).
> > >>>>
> > >>>> I thought we were going to do two switches: One for reads and one for
> > >>>> writes? I could be misremembering.
> > >>>
> > >>> We did discuss the possibility of doing that. Still can-do if that's
> > >>> what you'd prefer.
> > >>
> > >> For our experimental interface, I think having read and write enablement
> > >> as separate settings is wise, so please do that.
> > >>
> > >> One quibble, though: The name "enable_dontcache" might be directly
> > >> meaningful to you, but I think others might find "enable_dont" to be
> > >> oxymoronic. And, it ties the setting to a specific kernel technology:
> > >> RWF_DONTCACHE.
> > >>
> > >> So: Can we call these settings "io_cache_read" and "io_cache_write" ?
> > >>
> > >> They could each carry multiple settings:
> > >>
> > >> 0: Use page cache
> > >> 1: Use RWF_DONTCACHE
> > >> 2: Use O_DIRECT
> > >>
> > >> You can choose to implement any or all of the above three mechanisms.
> > >
> > > I like it, will do for v2. But will have O_DIRECT=1 and RWF_DONTCACHE=2.
> >
> > For io_cache_read, either settings 1 and 2 need to set
> > disable_splice_read, or the io_cache_read setting has to be considered
> > by nfsd_read_splice_ok() when deciding to use nfsd_iter_read() or
> > splice read.
>
> Yes, I understand.
>
> > However, it would be slightly nicer if we could decide whether splice
> > read can be removed /before/ this series is merged. Can you get NFSD
> > tested with IOR with disable_splice_read both enabled and disabled (no
> > direct I/O)? Then we can compare the results to ensure that there is no
> > negative performance impact for removing the splice read code.
>
> I can ask if we have a small window of opportunity to get this tested,
> will let you know if so.
>
I was able to enlist the help of Keith (cc'd) to get some runs in to
compare splice_read vs vectored read. A picture is worth 1000 words:
https://original.art/NFSD_splice_vs_buffered_read_IOR_EASY.jpg
Left side is with splice_read running IOR_EASY with 48, 64, 96 PPN
(Processes Per Node on each client) respectively. Then the same
IOR_EASY workload progression for buffered IO on the right side.
6x servers with 1TB memory and 48 cpus, each configured with 32 NFSD
threads, with CPU pinning and 4M Read Ahead. 6x clients running IOR_EASY.
This was Keith's take on splice_read's benefits:
- Is overall faster than buffered at any PPN.
- Is able to scale higher with PPN (whereas buffered is flat).
- Safe to say splice_read allows NFSD to do more IO than standard
buffered.
(These results came _after_ I did the patch to remove all the
splice_read related code from NFSD and SUNRPC.. while cathartic, alas
it seems it isn't meant to be at this point. I'll let you do the
honors in the future if/when you deem splice_read worthy of removal.)
Mike
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-17 17:22 ` Mike Snitzer
@ 2025-06-17 17:31 ` Chuck Lever
2025-06-19 20:19 ` Mike Snitzer
0 siblings, 1 reply; 75+ messages in thread
From: Chuck Lever @ 2025-06-17 17:31 UTC (permalink / raw)
To: Mike Snitzer
Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe, keith.mannthey
On 6/17/25 1:22 PM, Mike Snitzer wrote:
> On Mon, Jun 16, 2025 at 12:10:38PM -0400, Mike Snitzer wrote:
>> On Mon, Jun 16, 2025 at 09:32:16AM -0400, Chuck Lever wrote:
>>> On 6/12/25 12:00 PM, Mike Snitzer wrote:
>>>> On Thu, Jun 12, 2025 at 09:21:35AM -0400, Chuck Lever wrote:
>>>>> On 6/11/25 3:18 PM, Mike Snitzer wrote:
>>>>>> On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
>>>>>>> On 6/10/25 4:57 PM, Mike Snitzer wrote:
>>>>>>>> Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
>>>>>>>> read or written by NFSD will either not be cached (thanks to O_DIRECT)
>>>>>>>> or will be removed from the page cache upon completion (DONTCACHE).
>>>>>>>
>>>>>>> I thought we were going to do two switches: One for reads and one for
>>>>>>> writes? I could be misremembering.
>>>>>>
>>>>>> We did discuss the possibility of doing that. Still can-do if that's
>>>>>> what you'd prefer.
>>>>>
>>>>> For our experimental interface, I think having read and write enablement
>>>>> as separate settings is wise, so please do that.
>>>>>
>>>>> One quibble, though: The name "enable_dontcache" might be directly
>>>>> meaningful to you, but I think others might find "enable_dont" to be
>>>>> oxymoronic. And, it ties the setting to a specific kernel technology:
>>>>> RWF_DONTCACHE.
>>>>>
>>>>> So: Can we call these settings "io_cache_read" and "io_cache_write" ?
>>>>>
>>>>> They could each carry multiple settings:
>>>>>
>>>>> 0: Use page cache
>>>>> 1: Use RWF_DONTCACHE
>>>>> 2: Use O_DIRECT
>>>>>
>>>>> You can choose to implement any or all of the above three mechanisms.
>>>>
>>>> I like it, will do for v2. But will have O_DIRECT=1 and RWF_DONTCACHE=2.
>>>
>>> For io_cache_read, either settings 1 and 2 need to set
>>> disable_splice_read, or the io_cache_read setting has to be considered
>>> by nfsd_read_splice_ok() when deciding to use nfsd_iter_read() or
>>> splice read.
>>
>> Yes, I understand.
>>
>>> However, it would be slightly nicer if we could decide whether splice
>>> read can be removed /before/ this series is merged. Can you get NFSD
>>> tested with IOR with disable_splice_read both enabled and disabled (no
>>> direct I/O)? Then we can compare the results to ensure that there is no
>>> negative performance impact for removing the splice read code.
>>
>> I can ask if we have a small window of opportunity to get this tested,
>> will let you know if so.
>>
>
> I was able to enlist the help of Keith (cc'd) to get some runs in to
> compare splice_read vs vectored read. A picture is worth 1000 words:
> https://original.art/NFSD_splice_vs_buffered_read_IOR_EASY.jpg
>
> Left side is with splice_read running IOR_EASY with 48, 64, 96 PPN
> (Processes Per Node on each client) respectively. Then the same
> IOR_EASY workload progression for buffered IO on the right side.
>
> 6x servers with 1TB memory and 48 cpus, each configured with 32 NFSD
> threads, with CPU pinning and 4M Read Ahead. 6x clients running IOR_EASY.
>
> This was Keith's take on splice_read's benefits:
> - Is overall faster than buffered at any PPN.
> - Is able to scale higher with PPN (whereas buffered is flat).
> > - Safe to say splice_read allows NFSD to do more IO than standard
> buffered.
I thank you and Keith for the data!
> (These results came _after_ I did the patch to remove all the
> splice_read related code from NFSD and SUNRPC.. while cathartic, alas
> it seems it isn't meant to be at this point. I'll let you do the
> honors in the future if/when you deem splice_read worthy of removal.)
If we were to make all NFS READ operations use O_DIRECT, then of course
NFSD's splice read should be removed at that point.
--
Chuck Lever
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-17 4:37 ` Christoph Hellwig
@ 2025-06-17 20:26 ` Mike Snitzer
2025-06-17 22:23 ` [RFC PATCH] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec [was: Re: need SUNRPC TCP to receive into aligned pages] Mike Snitzer
0 siblings, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-17 20:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jeff Layton, Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe,
david.flynn
On Mon, Jun 16, 2025 at 09:37:01PM -0700, Christoph Hellwig wrote:
> On Mon, Jun 16, 2025 at 12:07:42PM -0400, Mike Snitzer wrote:
> > But that's OK... my test bdev is a bad example (archaic VMware vSphere
> > provided SCSI device): it doesn't reflect expected modern hardware.
> >
> > But I just slapped together a test pmem blockdevice (memory backed,
> > using memmap=6G!18G) and it too has dma_alignment=511
>
> That's the block layer default when not overridden by the driver, I guess
> pmem folks didn't care enough. I suspect it should not have any
> alignment requirements at all.
Yeah, I hacked it with this just to quickly simulate NVMe's dma_alignment:
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 210fb77f51ba..0ab2826073f9 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -457,6 +457,7 @@ static int pmem_attach_disk(struct device *dev,
.max_hw_sectors = UINT_MAX,
.features = BLK_FEAT_WRITE_CACHE |
BLK_FEAT_SYNCHRONOUS,
+ .dma_alignment = 3,
};
int nid = dev_to_node(dev), fua;
struct resource *res = &nsio->res;
> > I'd like NFSD to be able to know if its bvec is dma-aligned, before
> > issuing DIO writes to underlying XFS. AFAIK I can do that simply by
> > checking the STATX_DIOALIGN provided dio_mem_align...
>
> Exactly.
I'm finding that even with dma_alignment=3, the bvec that
nfsd_vfs_write()'s call to xdr_buf_to_bvec() produces from the NFS WRITE
payload still causes iov_iter_aligned_bvec() to return false.
The reason is that iov_iter_aligned_bvec() inspects each member of the
bio_vec in isolation (in its while() loop). So even though NFS WRITE
payload's overall size is aligned on-disk (e.g. offset=0 len=512K) its
first and last bvec members are _not_ aligned (due to 512K NFS WRITE
payload being offset 148 bytes into the first page of the pages
allocated for it by SUNRPC). So iov_iter_aligned_bvec() fails at this
check:
if (len & len_mask)
return false;
with tracing I added:
nfsd-14027 [001] ..... 3734.668780: nfsd_vfs_write: iov_iter_aligned_bvec: addr_mask=3 len_mask=511
nfsd-14027 [001] ..... 3734.668781: nfsd_vfs_write: iov_iter_aligned_bvec: len=3948 & len_mask=511 failed
Is this another case of the checks being too strict? The bvec does
describe a contiguous 512K extent of on-disk LBA, just not if
inspected piece-wise.
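To make that concrete, the bvec for that 512K payload (offset 148 bytes
into the first page) looks roughly like this:

	bvec[0]:      bv_offset=148  bv_len=3948   <- 3948 & 511 != 0
	bvec[1..127]: bv_offset=0    bv_len=4096
	bvec[128]:    bv_offset=0    bv_len=148    <- 148 & 511 != 0

	total = 3948 + 127 * 4096 + 148 = 524288 (512K), i.e. perfectly
	aligned to the 512b logical_block_size in aggregate.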
BTW, XFS's directio code _will_ also check with
iov_iter_aligned_bvec() via iov_iter_is_aligned().
Mike
* [RFC PATCH] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec [was: Re: need SUNRPC TCP to receive into aligned pages]
2025-06-17 20:26 ` Mike Snitzer
@ 2025-06-17 22:23 ` Mike Snitzer
0 siblings, 0 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-17 22:23 UTC (permalink / raw)
To: Christoph Hellwig, Alexander Viro, Andrew Morton
Cc: Jeff Layton, Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe,
david.flynn
[Cc'ing Al and Andrew]
On Tue, Jun 17, 2025 at 04:26:42PM -0400, Mike Snitzer wrote:
> On Mon, Jun 16, 2025 at 09:37:01PM -0700, Christoph Hellwig wrote:
> > On Mon, Jun 16, 2025 at 12:07:42PM -0400, Mike Snitzer wrote:
> > > But that's OK... my test bdev is a bad example (archaic VMware vSphere
> > > provided SCSI device): it doesn't reflect expected modern hardware.
> > >
> > > But I just slapped together a test pmem blockdevice (memory backed,
> > > using memmap=6G!18G) and it too has dma_alignment=511
> >
> > That's the block layer default when not overridden by the driver, I guess
> > pmem folks didn't care enough. I suspect it should not have any
> > alignment requirements at all.
>
> Yeah, I hacked it with this just to quickly simulate NVMe's dma_alignment:
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 210fb77f51ba..0ab2826073f9 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -457,6 +457,7 @@ static int pmem_attach_disk(struct device *dev,
> .max_hw_sectors = UINT_MAX,
> .features = BLK_FEAT_WRITE_CACHE |
> BLK_FEAT_SYNCHRONOUS,
> + .dma_alignment = 3,
> };
> int nid = dev_to_node(dev), fua;
> struct resource *res = &nsio->res;
>
> > > I'd like NFSD to be able to know if its bvec is dma-aligned, before
> > > issuing DIO writes to underlying XFS. AFAIK I can do that simply by
> > > checking the STATX_DIOALIGN provided dio_mem_align...
> >
> > Exactly.
>
> I'm finding that even with dma_alignment=3, the bvec that
> nfsd_vfs_write()'s call to xdr_buf_to_bvec() produces from the NFS WRITE
> payload still causes iov_iter_aligned_bvec() to return false.
>
> The reason is that iov_iter_aligned_bvec() inspects each member of the
> bio_vec in isolation (in its while() loop). So even though NFS WRITE
> payload's overall size is aligned on-disk (e.g. offset=0 len=512K) its
> first and last bvec members are _not_ aligned (due to 512K NFS WRITE
> payload being offset 148 bytes into the first page of the pages
> allocated for it by SUNRPC). So iov_iter_aligned_bvec() fails at this
> check:
>
> if (len & len_mask)
> return false;
>
> with tracing I added:
>
> nfsd-14027 [001] ..... 3734.668780: nfsd_vfs_write: iov_iter_aligned_bvec: addr_mask=3 len_mask=511
> nfsd-14027 [001] ..... 3734.668781: nfsd_vfs_write: iov_iter_aligned_bvec: len=3948 & len_mask=511 failed
>
> Is this another case of the checks being too strict? The bvec does
> describe a contiguous 512K extent of on-disk LBA, just not if
> inspected piece-wise.
>
> BTW, XFS's directio code _will_ also check with
> iov_iter_aligned_bvec() via iov_iter_is_aligned().
This works, I just don't know what (if any) breakage it exposes us to:
Author: Mike Snitzer <snitzer@kernel.org>
Date: Tue Jun 17 22:04:44 2025 +0000
Subject: lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
iov_iter_aligned_bvec() is strictly checking alignment of each element
of the bvec to arrive at whether the bvec is aligned relative to
dma_alignment and on-disk alignment. Checking each element
individually results in disallowing a bvec that in aggregate is
perfectly aligned relative to the provided @len_mask.
Relax the on-disk alignment checking such that it is done on the full
extent described by the bvec but still do piecewise checking of the
dma_alignment for each bvec's bv_offset.
This allows for NFS's WRITE payload to be issued using O_DIRECT as
long as the bvec created with xdr_buf_to_bvec() is composed of pages
that respect the underlying device's dma_alignment (@addr_mask) and
the overall contiguous on-disk extent is aligned relative to the
logical_block_size (@len_mask).
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index bdb37d572e97..b2ae482b8a1d 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
unsigned skip = i->iov_offset;
size_t size = i->count;
+ if (size & len_mask)
+ return false;
+
do {
size_t len = bvec->bv_len;
if (len > size)
len = size;
- if (len & len_mask)
- return false;
if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
return false;
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-17 17:31 ` Chuck Lever
@ 2025-06-19 20:19 ` Mike Snitzer
2025-06-30 14:50 ` Chuck Lever
0 siblings, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-06-19 20:19 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe, keith.mannthey
On Tue, Jun 17, 2025 at 01:31:23PM -0400, Chuck Lever wrote:
> On 6/17/25 1:22 PM, Mike Snitzer wrote:
> > On Mon, Jun 16, 2025 at 12:10:38PM -0400, Mike Snitzer wrote:
> >> On Mon, Jun 16, 2025 at 09:32:16AM -0400, Chuck Lever wrote:
> >>> On 6/12/25 12:00 PM, Mike Snitzer wrote:
> >>>> On Thu, Jun 12, 2025 at 09:21:35AM -0400, Chuck Lever wrote:
> >>>>> On 6/11/25 3:18 PM, Mike Snitzer wrote:
> >>>>>> On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
> >>>>>>> On 6/10/25 4:57 PM, Mike Snitzer wrote:
> >>>>>>>> Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
> >>>>>>>> read or written by NFSD will either not be cached (thanks to O_DIRECT)
> >>>>>>>> or will be removed from the page cache upon completion (DONTCACHE).
> >>>>>>>
> >>>>>>> I thought we were going to do two switches: One for reads and one for
> >>>>>>> writes? I could be misremembering.
> >>>>>>
> >>>>>> We did discuss the possibility of doing that. Still can-do if that's
> >>>>>> what you'd prefer.
> >>>>>
> >>>>> For our experimental interface, I think having read and write enablement
> >>>>> as separate settings is wise, so please do that.
> >>>>>
> >>>>> One quibble, though: The name "enable_dontcache" might be directly
> >>>>> meaningful to you, but I think others might find "enable_dont" to be
> >>>>> oxymoronic. And, it ties the setting to a specific kernel technology:
> >>>>> RWF_DONTCACHE.
> >>>>>
> >>>>> So: Can we call these settings "io_cache_read" and "io_cache_write" ?
> >>>>>
> >>>>> They could each carry multiple settings:
> >>>>>
> >>>>> 0: Use page cache
> >>>>> 1: Use RWF_DONTCACHE
> >>>>> 2: Use O_DIRECT
> >>>>>
> >>>>> You can choose to implement any or all of the above three mechanisms.
> >>>>
> >>>> I like it, will do for v2. But will have O_DIRECT=1 and RWF_DONTCACHE=2.
> >>>
> >>> For io_cache_read, either settings 1 and 2 need to set
> >>> disable_splice_read, or the io_cache_read setting has to be considered
> >>> by nfsd_read_splice_ok() when deciding to use nfsd_iter_read() or
> >>> splice read.
> >>
> >> Yes, I understand.
> >>
> >>> However, it would be slightly nicer if we could decide whether splice
> >>> read can be removed /before/ this series is merged. Can you get NFSD
> >>> tested with IOR with disable_splice_read both enabled and disabled (no
> >>> direct I/O)? Then we can compare the results to ensure that there is no
> >>> negative performance impact for removing the splice read code.
> >>
> >> I can ask if we have a small window of opportunity to get this tested,
> >> will let you know if so.
> >>
> >
> > I was able to enlist the help of Keith (cc'd) to get some runs in to
> > compare splice_read vs vectored read. A picture is worth 1000 words:
> > https://original.art/NFSD_splice_vs_buffered_read_IOR_EASY.jpg
> >
> > Left side is with splice_read running IOR_EASY with 48, 64, 96 PPN
> > (Processes Per Node on each client) respectively. Then the same
> > IOR_EASY workload progression for buffered IO on the right side.
> >
> > 6x servers with 1TB memory and 48 cpus, each configured with 32 NFSD
> > threads, with CPU pinning and 4M Read Ahead. 6x clients running IOR_EASY.
> >
> > This was Keith's take on splice_read's benefits:
> > - Is overall faster than buffered at any PPN.
> > - Is able to scale higher with PPN (whereas buffered is flat).
> > > - Safe to say splice_read allows NFSD to do more IO than standard
> > buffered.
>
> I thank you and Keith for the data!
You're welcome.
> > (These results came _after_ I did the patch to remove all the
> > splice_read related code from NFSD and SUNRPC.. while cathartic, alas
> > it seems it isn't meant to be at this point. I'll let you do the
> > honors in the future if/when you deem splice_read worthy of removal.)
>
> If we were to make all NFS READ operations use O_DIRECT, then of course
> NFSD's splice read should be removed at that point.
Yes, that makes sense. I still need to try Christoph's idea (hope to
do so over next 24hrs):
https://lore.kernel.org/linux-nfs/aEu3o9imaQQF9vyg@infradead.org/
But for now, here is my latest NFSD O_DIRECT/DONTCACHE work, think of
the top 6 commits as a preview of what'll be v2 of this series:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=kernel-6.12.24/nfsd-testing
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-19 20:19 ` Mike Snitzer
@ 2025-06-30 14:50 ` Chuck Lever
2025-07-04 19:46 ` Mike Snitzer
0 siblings, 1 reply; 75+ messages in thread
From: Chuck Lever @ 2025-06-30 14:50 UTC (permalink / raw)
To: Mike Snitzer
Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe, keith.mannthey
On 6/19/25 4:19 PM, Mike Snitzer wrote:
> On Tue, Jun 17, 2025 at 01:31:23PM -0400, Chuck Lever wrote:
>> On 6/17/25 1:22 PM, Mike Snitzer wrote:
>>> On Mon, Jun 16, 2025 at 12:10:38PM -0400, Mike Snitzer wrote:
>>>> On Mon, Jun 16, 2025 at 09:32:16AM -0400, Chuck Lever wrote:
>>>>> On 6/12/25 12:00 PM, Mike Snitzer wrote:
>>>>>> On Thu, Jun 12, 2025 at 09:21:35AM -0400, Chuck Lever wrote:
>>>>>>> On 6/11/25 3:18 PM, Mike Snitzer wrote:
>>>>>>>> On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
>>>>>>>>> On 6/10/25 4:57 PM, Mike Snitzer wrote:
>>>>>>>>>> Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
>>>>>>>>>> read or written by NFSD will either not be cached (thanks to O_DIRECT)
>>>>>>>>>> or will be removed from the page cache upon completion (DONTCACHE).
>>>>>>>>>
>>>>>>>>> I thought we were going to do two switches: One for reads and one for
>>>>>>>>> writes? I could be misremembering.
>>>>>>>>
>>>>>>>> We did discuss the possibility of doing that. Still can-do if that's
>>>>>>>> what you'd prefer.
>>>>>>>
>>>>>>> For our experimental interface, I think having read and write enablement
>>>>>>> as separate settings is wise, so please do that.
>>>>>>>
>>>>>>> One quibble, though: The name "enable_dontcache" might be directly
>>>>>>> meaningful to you, but I think others might find "enable_dont" to be
>>>>>>> oxymoronic. And, it ties the setting to a specific kernel technology:
>>>>>>> RWF_DONTCACHE.
>>>>>>>
>>>>>>> So: Can we call these settings "io_cache_read" and "io_cache_write" ?
>>>>>>>
>>>>>>> They could each carry multiple settings:
>>>>>>>
>>>>>>> 0: Use page cache
>>>>>>> 1: Use RWF_DONTCACHE
>>>>>>> 2: Use O_DIRECT
>>>>>>>
>>>>>>> You can choose to implement any or all of the above three mechanisms.
>>>>>>
>>>>>> I like it, will do for v2. But will have O_DIRECT=1 and RWF_DONTCACHE=2.
>>>>>
>>>>> For io_cache_read, either settings 1 and 2 need to set
>>>>> disable_splice_read, or the io_cache_read setting has to be considered
>>>>> by nfsd_read_splice_ok() when deciding to use nfsd_iter_read() or
>>>>> splice read.
>>>>
>>>> Yes, I understand.
>>>>
>>>>> However, it would be slightly nicer if we could decide whether splice
>>>>> read can be removed /before/ this series is merged. Can you get NFSD
>>>>> tested with IOR with disable_splice_read both enabled and disabled (no
>>>>> direct I/O)? Then we can compare the results to ensure that there is no
>>>>> negative performance impact for removing the splice read code.
>>>>
>>>> I can ask if we have a small window of opportunity to get this tested,
>>>> will let you know if so.
>>>>
>>>
>>> I was able to enlist the help of Keith (cc'd) to get some runs in to
>>> compare splice_read vs vectored read. A picture is worth 1000 words:
>>> https://original.art/NFSD_splice_vs_buffered_read_IOR_EASY.jpg
>>>
>>> Left side is with splice_read running IOR_EASY with 48, 64, 96 PPN
>>> (Processes Per Node on each client) respectively. Then the same
>>> IOR_EASY workload progression for buffered IO on the right side.
>>>
>>> 6x servers with 1TB memory and 48 cpus, each configured with 32 NFSD
>>> threads, with CPU pinning and 4M Read Ahead. 6x clients running IOR_EASY.
>>>
>>> This was Keith's take on splice_read's benefits:
>>> - Is overall faster than buffered at any PPN.
>>> - Is able to scale higher with PPN (whereas buffered is flat).
> >>> - Safe to say splice_read allows NFSD to do more IO than standard
>>> buffered.
>>
>> I thank you and Keith for the data!
>
> You're welcome.
>
>>> (These results came _after_ I did the patch to remove all the
>>> splice_read related code from NFSD and SUNRPC.. while cathartic, alas
>>> it seems it isn't meant to be at this point. I'll let you do the
>>> honors in the future if/when you deem splice_read worthy of removal.)
>>
>> If we were to make all NFS READ operations use O_DIRECT, then of course
>> NFSD's splice read should be removed at that point.
>
> Yes, that makes sense. I still need to try Christoph's idea (hope to
> do so over next 24hrs):
> https://lore.kernel.org/linux-nfs/aEu3o9imaQQF9vyg@infradead.org/
>
> But for now, here is my latest NFSD O_DIRECT/DONTCACHE work, think of
> the top 6 commits as a preview of what'll be v2 of this series:
> https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=kernel-6.12.24/nfsd-testing
I was waiting for a series repost, but in the meantime...
The one thing that caught my eye was the relocation of fh_getattr().
- If fh_getattr() is to be moved to fs/nfsd/vfs.c, then it should be
renamed nfsd_getattr() (or similar) to match the API naming
convention in that file.
- If fh_getattr() is to keep its current name, then it should be
moved to where the other fh_yada() functions reside, in
fs/nfsd/nfsfh.c
In a private tree, I constructed a patch to do the latter. I can
post that for comment.
--
Chuck Lever
* Re: need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO]
2025-06-12 10:28 ` Jeff Layton
2025-06-12 11:28 ` Jeff Layton
2025-06-12 13:28 ` Chuck Lever
@ 2025-07-03 0:12 ` NeilBrown
2 siblings, 0 replies; 75+ messages in thread
From: NeilBrown @ 2025-07-03 0:12 UTC (permalink / raw)
To: Jeff Layton
Cc: Mike Snitzer, Chuck Lever, linux-nfs, linux-fsdevel, Jens Axboe
On Thu, 12 Jun 2025, Jeff Layton wrote:
>
> I've been looking over the code today. Basically, I think we need to
> have svc_tcp_recvfrom() receive in phases. At a high level:
>
> 1/ receive the record marker (just like it does today)
Long ago (IETF 47??) I heard someone talking about a "smart" network
card that would detect UDP packets to port 2049, split the data into the
largest power-of-2 as a final component and the remainder as a header,
and DMA them into memory that way. This would very often put the data
in page-aligned memory.
We could do the same thing here.
Currently we copy as much as will fit into the "header" and the rest
into the "pages". We could instead use power-of-2 maths to put some in
the header and a whole number of pages into the "pages".
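In rough pseudo-C (just a sketch of the idea, names made up):

	/* receive the non-power-of-2 remainder into the head buffer and
	 * the largest power-of-2 sized tail into whole pages, so the
	 * bulk of the payload starts on a page boundary */
	size_t tail = rounddown_pow_of_two(record_len);
	size_t head = record_len - tail;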
This would probably work well for NFSv3 and shouldn't make NFSv4 worse.
It wouldn't provide a guarantee, but could provide a useful
optimisation.
NeilBrown
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-06-30 14:50 ` Chuck Lever
@ 2025-07-04 19:46 ` Mike Snitzer
2025-07-04 19:49 ` Chuck Lever
0 siblings, 1 reply; 75+ messages in thread
From: Mike Snitzer @ 2025-07-04 19:46 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe, keith.mannthey
On Mon, Jun 30, 2025 at 10:50:42AM -0400, Chuck Lever wrote:
> On 6/19/25 4:19 PM, Mike Snitzer wrote:
> > On Tue, Jun 17, 2025 at 01:31:23PM -0400, Chuck Lever wrote:
> >>
> >> If we were to make all NFS READ operations use O_DIRECT, then of course
> >> NFSD's splice read should be removed at that point.
> >
> > Yes, that makes sense. I still need to try Christoph's idea (hope to
> > do so over next 24hrs):
> > https://lore.kernel.org/linux-nfs/aEu3o9imaQQF9vyg@infradead.org/
> >
> > But for now, here is my latest NFSD O_DIRECT/DONTCACHE work, think of
> > the top 6 commits as a preview of what'll be v2 of this series:
> > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=kernel-6.12.24/nfsd-testing
>
> I was waiting for a series repost, but in the meantime...
>
> The one thing that caught my eye was the relocation of fh_getattr().
>
> - If fh_getattr() is to be moved to fs/nfsd/vfs.c, then it should be
> renamed nfsd_getattr() (or similar) to match the API naming
> convention in that file.
>
> - If fh_getattr() is to keep its current name, then it should be
> moved to where the other fh_yada() functions reside, in
> fs/nfsd/nfsfh.c
>
> In a private tree, I constructed a patch to do the latter. I can
> post that for comment.
Hi,
Sure, I can clean it up to take your patch into account. Please share
your patch (either pointer to commit in a branch or via email).
Tangent to explain why I've fallen off the face of the earth:
I have just been focused on trying to get client-side misaligned
O_DIRECT READ IO to be expanded to be DIO-aligned like I did with
NFSD. Turns out it is quite involved (it took a week of focused
development to arrive at the fact that the NFS client's nfs_page and
pagelist code's use of memory as an array is entirely incompatible with
that approach). I discussed it with Trond and the way forward would
require having the NFS client fill in xdr_buf's bvec and manage it
manually.. but that's a serious hack. The better long term goal is to
convert xdr_buf over to using bio_vec like NFSD is doing.
So rather than do any of that _now_, I just today implemented an NFS
LOCALIO fallback that issues the misaligned DIO READ using a remote call
to NFSD (able to do so on a per-IO basis if the READ is misaligned).
Seems to work really well, but it does force LOCALIO to go remote (over
the loopback network) just so it can leverage our new NFSD mode to use
O_DIRECT and expand misaligned READs, which is enabled with:
echo 2 > /sys/kernel/debug/nfsd/io_cache_read
All said, I'll get everything cleaned up and send out v2 of this
patchset on Monday. (If you share your patch I can rebase on top of it
and hopefully still get v2 out on Monday)
Thanks, and Happy 4th of July!
Mike
* Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
2025-07-04 19:46 ` Mike Snitzer
@ 2025-07-04 19:49 ` Chuck Lever
0 siblings, 0 replies; 75+ messages in thread
From: Chuck Lever @ 2025-07-04 19:49 UTC (permalink / raw)
To: Mike Snitzer
Cc: Jeff Layton, linux-nfs, linux-fsdevel, Jens Axboe, keith.mannthey
On 7/4/25 3:46 PM, Mike Snitzer wrote:
> On Mon, Jun 30, 2025 at 10:50:42AM -0400, Chuck Lever wrote:
>> On 6/19/25 4:19 PM, Mike Snitzer wrote:
>>> On Tue, Jun 17, 2025 at 01:31:23PM -0400, Chuck Lever wrote:
>>>>
>>>> If we were to make all NFS READ operations use O_DIRECT, then of course
>>>> NFSD's splice read should be removed at that point.
>>>
>>> Yes, that makes sense. I still need to try Christoph's idea (hope to
>>> do so over next 24hrs):
>>> https://lore.kernel.org/linux-nfs/aEu3o9imaQQF9vyg@infradead.org/
>>>
>>> But for now, here is my latest NFSD O_DIRECT/DONTCACHE work, think of
>>> the top 6 commits as a preview of what'll be v2 of this series:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=kernel-6.12.24/nfsd-testing
>>
>> I was waiting for a series repost, but in the meantime...
>>
>> The one thing that caught my eye was the relocation of fh_getattr().
>>
>> - If fh_getattr() is to be moved to fs/nfsd/vfs.c, then it should be
>> renamed nfsd_getattr() (or similar) to match the API naming
>> convention in that file.
>>
>> - If fh_getattr() is to keep its current name, then it should be
>> moved to where the other fh_yada() functions reside, in
>> fs/nfsd/nfsfh.c
>>
>> In a private tree, I constructed a patch to do the latter. I can
>> post that for comment.
>
> Hi,
>
> Sure, I can clean it up to take your patch into account. Please share
> your patch (either pointer to commit in a branch or via email).
>
> Tangent to explain why I've fallen off the face of the earth:
> I have just been focused on trying to get client-side misaligned
> O_DIRECT READ IO to be expanded to be DIO-aligned like I did with
> NFSD. Turns out it is quite involved (took a week of focused
> development to arrive at the fact that NFS client's nfs_page and
> pagelist code's use of memory as an array is entirely incompatible.
> Discussed with Trond and the way forward would require having NFS
> client fill in xdr_buf's bvec and manage manually.. but that's a
> serious hack. Better long term goal is to convert xdr_buf over to
> using bio_vec like NFSD is using.
>
> So rather than do any of that _now_, I just today implemented an NFS
> LOCALIO fallback to issuing the misaligned DIO READ using remote call
> to NFSD (able to do so on a per-IO basis if READ is misaligned).
> Seems to work really well, but does force LOCALIO to go remote (over
> loopback network) just so it can leverage our new NFSD mode to use
> O_DIRECT and expand misaligned READs, which is enabled with:
> echo 2 > /sys/kernel/debug/nfsd/io_cache_read
>
> All said, I'll get everything cleaned up and send out v2 of this
> patchset on Monday. (If you share your patch I can rebase ontop of it
> and hopefully still get v2 out on Monday)
https://lore.kernel.org/linux-nfs/20250702233345.1128154-1-cel@kernel.org/T/#t
But no-one has yet offered an opinion about whether to rename fh_getattr
or move it to fs/nfsd/nfsfh.c. Things might change.
--
Chuck Lever
Thread overview: 75+ messages
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
2025-06-10 20:57 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Mike Snitzer
2025-06-11 6:57 ` Christoph Hellwig
2025-06-11 10:44 ` Mike Snitzer
2025-06-11 13:04 ` Jeff Layton
2025-06-11 13:56 ` Chuck Lever
2025-06-11 14:31 ` Chuck Lever
2025-06-11 19:18 ` Mike Snitzer
2025-06-11 20:29 ` Jeff Layton
2025-06-11 21:36 ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] Mike Snitzer
2025-06-12 10:28 ` Jeff Layton
2025-06-12 11:28 ` Jeff Layton
2025-06-12 13:28 ` Chuck Lever
2025-06-12 14:17 ` Benjamin Coddington
2025-06-12 15:56 ` Mike Snitzer
2025-06-12 15:58 ` Chuck Lever
2025-06-12 16:12 ` Mike Snitzer
2025-06-12 16:32 ` Chuck Lever
2025-06-13 5:39 ` Christoph Hellwig
2025-06-12 16:22 ` Jeff Layton
2025-06-13 5:46 ` Christoph Hellwig
2025-06-13 9:23 ` Mike Snitzer
2025-06-13 13:02 ` Jeff Layton
2025-06-16 12:35 ` Christoph Hellwig
2025-06-16 12:29 ` Christoph Hellwig
2025-06-16 16:07 ` Mike Snitzer
2025-06-17 4:37 ` Christoph Hellwig
2025-06-17 20:26 ` Mike Snitzer
2025-06-17 22:23 ` [RFC PATCH] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec [was: Re: need SUNRPC TCP to receive into aligned pages] Mike Snitzer
2025-07-03 0:12 ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] NeilBrown
2025-06-12 7:13 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Christoph Hellwig
2025-06-12 13:15 ` Chuck Lever
2025-06-12 13:21 ` Chuck Lever
2025-06-12 16:00 ` Mike Snitzer
2025-06-16 13:32 ` Chuck Lever
2025-06-16 16:10 ` Mike Snitzer
2025-06-17 17:22 ` Mike Snitzer
2025-06-17 17:31 ` Chuck Lever
2025-06-19 20:19 ` Mike Snitzer
2025-06-30 14:50 ` Chuck Lever
2025-07-04 19:46 ` Mike Snitzer
2025-07-04 19:49 ` Chuck Lever
2025-06-10 20:57 ` [PATCH 2/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-06-10 20:57 ` [PATCH 3/6] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-06-10 20:57 ` [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis Mike Snitzer
2025-06-11 6:58 ` Christoph Hellwig
2025-06-11 10:51 ` Mike Snitzer
2025-06-11 14:17 ` Chuck Lever
2025-06-12 7:15 ` Christoph Hellwig
2025-06-10 20:57 ` [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes Mike Snitzer
2025-06-11 7:00 ` Christoph Hellwig
2025-06-11 12:23 ` Mike Snitzer
2025-06-11 13:30 ` Jeff Layton
2025-06-12 7:22 ` Christoph Hellwig
2025-06-12 7:23 ` Christoph Hellwig
2025-06-11 14:42 ` Chuck Lever
2025-06-11 15:07 ` Jeff Layton
2025-06-11 15:11 ` Chuck Lever
2025-06-11 15:44 ` Jeff Layton
2025-06-11 20:51 ` Mike Snitzer
2025-06-12 7:32 ` Christoph Hellwig
2025-06-12 7:28 ` Christoph Hellwig
2025-06-12 7:25 ` Christoph Hellwig
2025-06-10 20:57 ` [PATCH 6/6] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-06-11 12:55 ` [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Jeff Layton
2025-06-12 7:39 ` Christoph Hellwig
2025-06-12 20:37 ` Mike Snitzer
2025-06-13 5:31 ` Christoph Hellwig
2025-06-11 14:16 ` Chuck Lever
2025-06-11 18:02 ` Mike Snitzer
2025-06-11 19:06 ` Chuck Lever
2025-06-11 19:58 ` Mike Snitzer
2025-06-12 13:46 ` Chuck Lever
2025-06-12 19:08 ` Mike Snitzer
2025-06-12 20:17 ` Chuck Lever