* [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes
@ 2025-08-09 5:02 Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
` (7 more replies)
0 siblings, 8 replies; 9+ messages in thread
From: Mike Snitzer @ 2025-08-09 5:02 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Hi,
Some workloads benefit from NFSD avoiding the page cache, particularly
those with a working set that is significantly larger than available
system memory. This patchset introduces _optional_ support to
configure the use of O_DIRECT or DONTCACHE for NFSD's READ and WRITE
support. The NFSD default to use page cache is left unchanged.
The performance win associated with using NFSD DIRECT was previously
summarized here:
https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
This picture offers a nice summary of performance gains:
https://original.art/NFSD_direct_vs_buffered_IO.jpg
This series builds on what has been staged in the nfsd-testing branch.
This code has proven to work well during my testing. Any suggestions
for further refinement are welcome.
Thanks,
Mike
Changes since v5:
- #define NFSD_READ_DIO_MIN_KB (32 << 10)
- switch to using pre-increment from post-increment.
- always get DIO alignment when opening regular nfsd_file (dropped
patch that only did if NFSD_IO_DIRECT).
- fixed nfsd_io_cache_{read,write}_set to not set NFSD_IO_UNSPECIFIED
if returning -EINVAL due to unrecognized value.
- use a switch statement in nfsd_iter_read like nfsd_iter_write
- Optimize nfsd_iter_read for default being buffered IO, like was done
for nfsd_iter_write in v5.
Changes since v4:
- removed use of iov_iter_is_aligned() in earlier patches, we don't
want any conflict with Keith's patchset that ultimately removes the
iov_iter_is_aligned() interface.
- refactored the final "NFSD: issue WRITEs using O_DIRECT even if IO
is misaligned" patch to have the lightest touch possible on
nfsd_vfs_write() for the default buffered IO case.
- updated patch headers where needed.
- all patches have changed some, so removed all Reviewed-by from Jeff
and Signed-off-by from Chuck.
- Series checked with checkpatch.pl, sparse and verified bisect safe.
Changes since v3:
- fixed nfsd_vfs_write() so work that only needs to happen once, after
IO is submitted, isn't performed each iteration of the for loop,
thanks for catching this Chuck.
- updated NFSD's misaligned READ and WRITE handling to not use
iov_iter_is_aligned() because it will soon be removed upstream.
- Chuck, provided both you and Jeff agree with patch 1's incremental
changes, patch 1 should be folded into what you already have in your
nfsd-testing branch (more justification in patch 1's header)
- dropped the NFSD misaligned DIO NFS reexport patch in favor of
adding STATX_DIOALIGN support to NFS (the patch to add support will
be provided in the next NFS DIRECT v7 patchset that I'll post soon).
Changes since v2:
- fixed patch 2 by moving redundant work out of nfsd_vfs_write()'s for
loop, thanks to Jeff's review.
- added Jeff's Reviewed-by to patches 1-3.
- Left patch 4 in the series because it is pragmatic, but feel free to
drop it if you'd prefer to see this cat skinned a different way.
Changes since v1:
- switched to using an EVENT_CLASS to create nfsd_analyze_{read,write}_dio
- added 4th patch, if user configured use of NFSD_IO_DIRECT then NFS
reexports should use it too (in future, with per-export controls
we'll have the benefit of finer-grained control; but until then we'd
do well to offer comprehensive use of NFSD_IO_DIRECT if it enabled).
Thanks,
Mike
Mike Snitzer (7):
NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
NFSD: pass nfsd_file to nfsd_iter_read()
NFSD: add io_cache_read controls to debugfs interface
NFSD: add io_cache_write controls to debugfs interface
NFSD: issue READs using O_DIRECT even if IO is misaligned
NFSD: issue WRITEs using O_DIRECT even if IO is misaligned
NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events
fs/nfsd/debugfs.c | 100 ++++++++++
fs/nfsd/filecache.c | 32 ++++
fs/nfsd/filecache.h | 4 +
fs/nfsd/nfs4xdr.c | 8 +-
fs/nfsd/nfsd.h | 10 +
fs/nfsd/nfsfh.c | 4 +
fs/nfsd/trace.h | 61 +++++++
fs/nfsd/vfs.c | 366 +++++++++++++++++++++++++++++++++++--
fs/nfsd/vfs.h | 2 +-
include/linux/sunrpc/svc.h | 5 +-
10 files changed, 574 insertions(+), 18 deletions(-)
--
2.44.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v6 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
2025-08-09 5:02 [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
@ 2025-08-09 5:02 ` Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Mike Snitzer @ 2025-08-09 5:02 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get and store DIO
alignment attributes from underlying filesystem in associated
nfsd_file. This is done when the nfsd_file is first opened for
a regular file.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/filecache.c | 32 ++++++++++++++++++++++++++++++++
fs/nfsd/filecache.h | 4 ++++
fs/nfsd/nfsfh.c | 4 ++++
3 files changed, 40 insertions(+)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 8581c131338b8..5447dba6c5da0 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -231,6 +231,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
refcount_set(&nf->nf_ref, 1);
nf->nf_may = need;
nf->nf_mark = NULL;
+ nf->nf_dio_mem_align = 0;
+ nf->nf_dio_offset_align = 0;
+ nf->nf_dio_read_offset_align = 0;
return nf;
}
@@ -1048,6 +1051,33 @@ nfsd_file_is_cached(struct inode *inode)
return ret;
}
+static __be32
+nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
+{
+ struct inode *inode = file_inode(nf->nf_file);
+ struct kstat stat;
+ __be32 status;
+
+ /* Currently only need to get DIO alignment info for regular files */
+ if (!S_ISREG(inode->i_mode))
+ return nfs_ok;
+
+ status = fh_getattr(fhp, &stat);
+ if (status != nfs_ok)
+ return status;
+
+ if (stat.result_mask & STATX_DIOALIGN) {
+ nf->nf_dio_mem_align = stat.dio_mem_align;
+ nf->nf_dio_offset_align = stat.dio_offset_align;
+ }
+ if (stat.result_mask & STATX_DIO_READ_ALIGN)
+ nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
+ else
+ nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
+
+ return status;
+}
+
static __be32
nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
struct svc_cred *cred,
@@ -1166,6 +1196,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
}
status = nfserrno(ret);
trace_nfsd_file_open(nf, status);
+ if (status == nfs_ok)
+ status = nfsd_file_getattr(fhp, nf);
}
} else
status = nfserr_jukebox;
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index 24ddf60e8434a..e3d6ca2b60308 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -54,6 +54,10 @@ struct nfsd_file {
struct list_head nf_gc;
struct rcu_head nf_rcu;
ktime_t nf_birthtime;
+
+ u32 nf_dio_mem_align;
+ u32 nf_dio_offset_align;
+ u32 nf_dio_read_offset_align;
};
int nfsd_file_cache_init(void);
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index f4a3cc9e31e05..bdba2ba828a6a 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -677,8 +677,12 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
.mnt = fhp->fh_export->ex_path.mnt,
.dentry = fhp->fh_dentry,
};
+ struct inode *inode = d_inode(p.dentry);
u32 request_mask = STATX_BASIC_STATS;
+ if (S_ISREG(inode->i_mode))
+ request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
+
if (fhp->fh_maxsize == NFS4_FHSIZE)
request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
--
2.44.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v6 2/7] NFSD: pass nfsd_file to nfsd_iter_read()
2025-08-09 5:02 [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
@ 2025-08-09 5:02 ` Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Mike Snitzer @ 2025-08-09 5:02 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Prepares for nfsd_iter_read() to use DIO alignment stored in nfsd_file.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4xdr.c | 8 ++++----
fs/nfsd/vfs.c | 7 ++++---
fs/nfsd/vfs.h | 2 +-
3 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7d19925f46e45..d519f4156cfad 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4464,7 +4464,7 @@ static __be32 nfsd4_encode_splice_read(
static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
struct nfsd4_read *read,
- struct file *file, unsigned long maxcount)
+ unsigned long maxcount)
{
struct xdr_stream *xdr = resp->xdr;
unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
@@ -4475,7 +4475,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
if (xdr_reserve_space_vec(xdr, maxcount) < 0)
return nfserr_resource;
- nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
+ nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
read->rd_offset, &maxcount, base,
&read->rd_eof);
read->rd_length = maxcount;
@@ -4522,7 +4522,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr) {
xdr_truncate_encode(xdr, eof_offset);
return nfserr;
@@ -5418,7 +5418,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr)
return nfserr;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0c0f25b2c8e38..79439ad93880a 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1075,7 +1075,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
* @fhp: file handle of file to be read
- * @file: opened struct file of file to be read
+ * @nf: opened struct nfsd_file of file to be read
* @offset: starting byte offset
* @count: IN: requested number of bytes; OUT: number of bytes read
* @base: offset in first page of read buffer
@@ -1088,9 +1088,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* returned.
*/
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset, unsigned long *count,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
unsigned int base, u32 *eof)
{
+ struct file *file = nf->nf_file;
unsigned long v, total;
struct iov_iter iter;
struct kiocb kiocb;
@@ -1312,7 +1313,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
else
- err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
+ err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
nfsd_file_put(nf);
trace_nfsd_read_done(rqstp, fhp, offset, *count);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 0c0292611c6de..fa46f8b5f1320 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -121,7 +121,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned long *count,
u32 *eof);
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset,
+ struct nfsd_file *nf, loff_t offset,
unsigned long *count, unsigned int base,
u32 *eof);
bool nfsd_read_splice_ok(struct svc_rqst *rqstp);
--
2.44.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v6 3/7] NFSD: add io_cache_read controls to debugfs interface
2025-08-09 5:02 [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
@ 2025-08-09 5:02 ` Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 4/7] NFSD: add io_cache_write " Mike Snitzer
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Mike Snitzer @ 2025-08-09 5:02 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Add 'io_cache_read' to NFSD's debugfs interface so that: Any data
read by NFSD will either be:
- cached using page cache (NFSD_IO_BUFFERED=1)
- cached but removed from the page cache upon completion
(NFSD_IO_DONTCACHE=2).
- not cached (NFSD_IO_DIRECT=3)
io_cache_read may be set by writing to:
/sys/kernel/debug/nfsd/io_cache_read
If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
advertised as supported by the underlying filesystem (e.g. XFS),
otherwise all IO flagged with RWF_DONTCACHE will fail with
-EOPNOTSUPP.
If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
relative to the underlying block device's logical_block_size. Also the
memory buffer used to store the read must be aligned relative to the
underlying block device's dma_alignment.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/debugfs.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfsd.h | 9 ++++++++
fs/nfsd/vfs.c | 18 +++++++++++++++
3 files changed, 84 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 84b0c8b559dc9..3cadd45868b48 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -27,11 +27,65 @@ static int nfsd_dsr_get(void *data, u64 *val)
static int nfsd_dsr_set(void *data, u64 val)
{
nfsd_disable_splice_read = (val > 0) ? true : false;
+ if (!nfsd_disable_splice_read) {
+ /*
+ * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
+ * if splice_read is enabled.
+ */
+ nfsd_io_cache_read = NFSD_IO_BUFFERED;
+ }
return 0;
}
DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
+/*
+ * /sys/kernel/debug/nfsd/io_cache_read
+ *
+ * Contents:
+ * %1: NFS READ will use buffered IO
+ * %2: NFS READ will use dontcache (buffered IO w/ dropbehind)
+ * %3: NFS READ will use direct IO
+ *
+ * The default value of this setting is zero (UNSPECIFIED).
+ * This setting takes immediate effect for all NFS versions,
+ * all exports, and in all NFSD net namespaces.
+ */
+
+static int nfsd_io_cache_read_get(void *data, u64 *val)
+{
+ *val = nfsd_io_cache_read;
+ return 0;
+}
+
+static int nfsd_io_cache_read_set(void *data, u64 val)
+{
+ int ret = 0;
+
+ switch (val) {
+ case NFSD_IO_BUFFERED:
+ nfsd_io_cache_read = NFSD_IO_BUFFERED;
+ break;
+ case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
+ /*
+ * Must disable splice_read when enabling
+ * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
+ */
+ nfsd_disable_splice_read = true;
+ nfsd_io_cache_read = val;
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
+ nfsd_io_cache_read_set, "%llu\n");
+
void nfsd_debugfs_exit(void)
{
debugfs_remove_recursive(nfsd_top_dir);
@@ -44,4 +98,7 @@ void nfsd_debugfs_init(void)
debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
nfsd_top_dir, NULL, &nfsd_dsr_fops);
+
+ debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
+ nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
}
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 1cd0bed57bc2f..6ef799405145f 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -153,6 +153,15 @@ static inline void nfsd_debugfs_exit(void) {}
extern bool nfsd_disable_splice_read __read_mostly;
+enum {
+ NFSD_IO_UNSPECIFIED = 0,
+ NFSD_IO_BUFFERED,
+ NFSD_IO_DONTCACHE,
+ NFSD_IO_DIRECT,
+};
+
+extern u64 nfsd_io_cache_read __read_mostly;
+
extern int nfsd_max_blksize;
static inline int nfsd_v4client(struct svc_rqst *rq)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 79439ad93880a..8ea8b80097195 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -49,6 +49,7 @@
#define NFSDDBG_FACILITY NFSDDBG_FILEOP
bool nfsd_disable_splice_read __read_mostly;
+u64 nfsd_io_cache_read __read_mostly;
/**
* nfserrno - Map Linux errnos to NFS errnos
@@ -1099,6 +1100,23 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
size_t len;
init_sync_kiocb(&kiocb, file);
+
+ switch (nfsd_io_cache_read) {
+ case NFSD_IO_DIRECT:
+ /* Verify ondisk and memory DIO alignment */
+ if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
+ (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0) &&
+ (base & (nf->nf_dio_mem_align - 1)) == 0)
+ kiocb.ki_flags = IOCB_DIRECT;
+ break;
+ case NFSD_IO_DONTCACHE:
+ kiocb.ki_flags = IOCB_DONTCACHE;
+ fallthrough;
+ case NFSD_IO_UNSPECIFIED:
+ case NFSD_IO_BUFFERED:
+ break;
+ }
+
kiocb.ki_pos = offset;
v = 0;
--
2.44.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v6 4/7] NFSD: add io_cache_write controls to debugfs interface
2025-08-09 5:02 [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
` (2 preceding siblings ...)
2025-08-09 5:02 ` [PATCH v6 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
@ 2025-08-09 5:02 ` Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Mike Snitzer @ 2025-08-09 5:02 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Add 'io_cache_write' to NFSD's debugfs interface so that: Any data
written by NFSD will either be:
- cached using page cache (NFSD_IO_BUFFERED=1)
- cached but removed from the page cache upon completion
(NFSD_IO_DONTCACHE=2).
- not cached (NFSD_IO_DIRECT=3)
io_cache_write may be set by writing to:
/sys/kernel/debug/nfsd/io_cache_write
If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
advertised as supported by the underlying filesystem (e.g. XFS),
otherwise all IO flagged with RWF_DONTCACHE will fail with
-EOPNOTSUPP.
If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
relative to the underlying block device's logical_block_size. Also the
memory buffer used to store the WRITE payload must be aligned relative
to the underlying block device's dma_alignment.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/debugfs.c | 43 +++++++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfsd.h | 1 +
fs/nfsd/vfs.c | 16 ++++++++++++++++
3 files changed, 60 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 3cadd45868b48..8878c3519b30c 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -86,6 +86,46 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
nfsd_io_cache_read_set, "%llu\n");
+/*
+ * /sys/kernel/debug/nfsd/io_cache_write
+ *
+ * Contents:
+ * %1: NFS WRITE will use buffered IO
+ * %2: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
+ * %3: NFS WRITE will use direct IO
+ *
+ * The default value of this setting is zero (UNSPECIFIED).
+ * This setting takes immediate effect for all NFS versions,
+ * all exports, and in all NFSD net namespaces.
+ */
+
+static int nfsd_io_cache_write_get(void *data, u64 *val)
+{
+ *val = nfsd_io_cache_write;
+ return 0;
+}
+
+static int nfsd_io_cache_write_set(void *data, u64 val)
+{
+ int ret = 0;
+
+ switch (val) {
+ case NFSD_IO_BUFFERED:
+ case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
+ nfsd_io_cache_write = val;
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_write_fops, nfsd_io_cache_write_get,
+ nfsd_io_cache_write_set, "%llu\n");
+
void nfsd_debugfs_exit(void)
{
debugfs_remove_recursive(nfsd_top_dir);
@@ -101,4 +141,7 @@ void nfsd_debugfs_init(void)
debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
+
+ debugfs_create_file("io_cache_write", S_IWUSR | S_IRUGO,
+ nfsd_top_dir, NULL, &nfsd_io_cache_write_fops);
}
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 6ef799405145f..fe935b4cda538 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -161,6 +161,7 @@ enum {
};
extern u64 nfsd_io_cache_read __read_mostly;
+extern u64 nfsd_io_cache_write __read_mostly;
extern int nfsd_max_blksize;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 8ea8b80097195..c340708fbab4d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -50,6 +50,7 @@
bool nfsd_disable_splice_read __read_mostly;
u64 nfsd_io_cache_read __read_mostly;
+u64 nfsd_io_cache_write __read_mostly;
/**
* nfserrno - Map Linux errnos to NFS errnos
@@ -1242,6 +1243,21 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
+
+ switch (nfsd_io_cache_write) {
+ case NFSD_IO_DIRECT:
+ /* direct I/O must be aligned to device logical sector size */
+ if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
+ (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0))
+ kiocb.ki_flags |= IOCB_DIRECT;
+ break;
+ case NFSD_IO_DONTCACHE:
+ kiocb.ki_flags |= IOCB_DONTCACHE;
+ fallthrough;
+ case NFSD_IO_UNSPECIFIED:
+ case NFSD_IO_BUFFERED:
+ break;
+ }
host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
if (host_err < 0) {
commit_reset_write_verifier(nn, rqstp, host_err);
--
2.44.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v6 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-08-09 5:02 [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
` (3 preceding siblings ...)
2025-08-09 5:02 ` [PATCH v6 4/7] NFSD: add io_cache_write " Mike Snitzer
@ 2025-08-09 5:02 ` Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 6/7] NFSD: issue WRITEs " Mike Snitzer
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Mike Snitzer @ 2025-08-09 5:02 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
DIO-aligned block (on either end of the READ). The expanded READ is
verified to have proper offset/len (logical_block_size) and
dma_alignment checking.
Must allocate and use a bounce-buffer page (called 'start_extra_page')
if/when expanding the misaligned READ requires reading extra partial
page at the start of the READ so that its DIO-aligned. Otherwise that
extra page at the start will make its way back to the NFS client and
corruption will occur. As found, and then this fix of using an extra
page verified, using the 'dt' utility:
dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
iotype=sequential pattern=iot onerr=abort oncerr=abort
see: https://github.com/RobinTMiller/dt.git
Any misaligned READ that is less than 32K won't be expanded to be
DIO-aligned (this heuristic just avoids excess work, like allocating
start_extra_page, for smaller IO that can generally already perform
well using buffered IO).
Suggested-by: Jeff Layton <jlayton@kernel.org>
Suggested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/vfs.c | 163 ++++++++++++++++++++++++++++++++++---
include/linux/sunrpc/svc.h | 5 +-
2 files changed, 157 insertions(+), 11 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index c340708fbab4d..eaffa67d09bf0 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -19,6 +19,7 @@
#include <linux/splice.h>
#include <linux/falloc.h>
#include <linux/fcntl.h>
+#include <linux/math.h>
#include <linux/namei.h>
#include <linux/delay.h>
#include <linux/fsnotify.h>
@@ -1073,6 +1074,124 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
+struct nfsd_read_dio {
+ loff_t start;
+ loff_t end;
+ unsigned long start_extra;
+ unsigned long end_extra;
+ struct page *start_extra_page;
+};
+
+static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
+{
+ memset(read_dio, 0, sizeof(*read_dio));
+ read_dio->start_extra_page = NULL;
+}
+
+#define NFSD_READ_DIO_MIN_KB (32 << 10)
+
+static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, loff_t offset,
+ unsigned long len, unsigned int base,
+ struct nfsd_read_dio *read_dio)
+{
+ const u32 dio_blocksize = nf->nf_dio_read_offset_align;
+ loff_t middle_end, orig_end = offset + len;
+
+ if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
+ "%s: underlying filesystem has not provided DIO alignment info\n",
+ __func__))
+ return false;
+ if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
+ "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
+ __func__, dio_blocksize, PAGE_SIZE))
+ return false;
+
+ /* Return early if IO is irreparably misaligned (len < PAGE_SIZE,
+ * or base not aligned).
+ * Ondisk alignment is implied by the following code that expands
+ * misaligned IO to have a DIO-aligned offset and len.
+ */
+ if (unlikely(len < dio_blocksize) || ((base & (nf->nf_dio_mem_align-1)) != 0))
+ return false;
+
+ init_nfsd_read_dio(read_dio);
+
+ read_dio->start = round_down(offset, dio_blocksize);
+ read_dio->end = round_up(orig_end, dio_blocksize);
+ read_dio->start_extra = offset - read_dio->start;
+ read_dio->end_extra = read_dio->end - orig_end;
+
+ /*
+ * Any misaligned READ less than NFSD_READ_DIO_MIN_KB won't be expanded
+ * to be DIO-aligned (this heuristic avoids excess work, like allocating
+ * start_extra_page, for smaller IO that can generally already perform
+ * well using buffered IO).
+ */
+ if ((read_dio->start_extra || read_dio->end_extra) &&
+ (len < NFSD_READ_DIO_MIN_KB)) {
+ init_nfsd_read_dio(read_dio);
+ return false;
+ }
+
+ if (read_dio->start_extra) {
+ read_dio->start_extra_page = alloc_page(GFP_KERNEL);
+ if (WARN_ONCE(read_dio->start_extra_page == NULL,
+ "%s: Unable to allocate start_extra_page\n", __func__)) {
+ init_nfsd_read_dio(read_dio);
+ return false;
+ }
+ }
+
+ return true;
+}
+
+static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
+ struct nfsd_read_dio *read_dio,
+ ssize_t bytes_read,
+ unsigned long bytes_expected,
+ loff_t *offset,
+ unsigned long *rq_bvec_numpages)
+{
+ ssize_t host_err = bytes_read;
+ loff_t v;
+
+ if (!read_dio->start_extra && !read_dio->end_extra)
+ return host_err;
+
+ /* If nfsd_analyze_read_dio() allocated a start_extra_page it must
+ * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
+ */
+ if (read_dio->start_extra_page) {
+ __free_page(read_dio->start_extra_page);
+ *rq_bvec_numpages -= 1;
+ v = *rq_bvec_numpages;
+ memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
+ v * sizeof(struct bio_vec));
+ }
+ /* Eliminate any end_extra bytes from the last page */
+ v = *rq_bvec_numpages;
+ rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
+
+ if (host_err < 0)
+ return host_err;
+
+ /* nfsd_analyze_read_dio() may have expanded the start and end,
+ * if so adjust returned read size to reflect original extent.
+ */
+ *offset += read_dio->start_extra;
+ if (likely(host_err >= read_dio->start_extra)) {
+ host_err -= read_dio->start_extra;
+ if (host_err > bytes_expected)
+ host_err = bytes_expected;
+ } else {
+ /* Short read that didn't read any of requested data */
+ host_err = 0;
+ }
+
+ return host_err;
+}
+
/**
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
@@ -1094,7 +1213,8 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned int base, u32 *eof)
{
struct file *file = nf->nf_file;
- unsigned long v, total;
+ unsigned long v, total, in_count = *count;
+ struct nfsd_read_dio read_dio;
struct iov_iter iter;
struct kiocb kiocb;
ssize_t host_err;
@@ -1102,13 +1222,34 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
init_sync_kiocb(&kiocb, file);
+ v = 0;
+ total = in_count;
+
switch (nfsd_io_cache_read) {
case NFSD_IO_DIRECT:
- /* Verify ondisk and memory DIO alignment */
- if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
- (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0) &&
- (base & (nf->nf_dio_mem_align - 1)) == 0)
- kiocb.ki_flags = IOCB_DIRECT;
+ /*
+ * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
+ * the next DIO-aligned block (on either end of the READ).
+ */
+ if (nfsd_analyze_read_dio(rqstp, fhp, nf, offset,
+ in_count, base, &read_dio)) {
+ /* trace_nfsd_read_vector() will reflect larger
+ * DIO-aligned READ.
+ */
+ offset = read_dio.start;
+ in_count = read_dio.end - offset;
+ total = in_count;
+
+ kiocb.ki_flags |= IOCB_DIRECT;
+ if (read_dio.start_extra) {
+ len = read_dio.start_extra;
+ bvec_set_page(&rqstp->rq_bvec[v],
+ read_dio.start_extra_page,
+ len, PAGE_SIZE - len);
+ total -= len;
+ ++v;
+ }
+ }
break;
case NFSD_IO_DONTCACHE:
kiocb.ki_flags = IOCB_DONTCACHE;
@@ -1120,8 +1261,6 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
kiocb.ki_pos = offset;
- v = 0;
- total = *count;
while (total) {
len = min_t(size_t, total, PAGE_SIZE - base);
bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
@@ -1132,9 +1271,13 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
WARN_ON_ONCE(v > rqstp->rq_maxpages);
- trace_nfsd_read_vector(rqstp, fhp, offset, *count);
- iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
+ trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
+ iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
+
+ if (kiocb.ki_flags & IOCB_DIRECT)
+ host_err = nfsd_complete_misaligned_read_dio(rqstp, &read_dio,
+ host_err, *count, &offset, &v);
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index e64ab444e0a7f..190c2667500e2 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
* pages, one for the request, and one for the reply.
* nfsd_splice_actor() might need an extra page when a READ payload
* is not page-aligned.
+ * nfsd_iter_read() might need two extra pages when a READ payload
+ * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
+ * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
*/
static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
{
- return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
+ return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
}
/*
--
2.44.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v6 6/7] NFSD: issue WRITEs using O_DIRECT even if IO is misaligned
2025-08-09 5:02 [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
` (4 preceding siblings ...)
2025-08-09 5:02 ` [PATCH v6 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
@ 2025-08-09 5:02 ` Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 7/7] NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events Mike Snitzer
2025-08-13 22:41 ` [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
7 siblings, 0 replies; 9+ messages in thread
From: Mike Snitzer @ 2025-08-09 5:02 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
middle and end as needed. The large middle extent is DIO-aligned and
the start and/or end are misaligned. Buffered IO is used for the
misaligned extents and O_DIRECT is used for the middle DIO-aligned
extent.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/vfs.c | 171 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 163 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index eaffa67d09bf0..a11c59dcb537b 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1318,6 +1318,164 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
+struct nfsd_write_dio {
+ loff_t middle_offset; /* Offset for start of DIO-aligned middle */
+ loff_t end_offset; /* Offset for start of DIO-aligned end */
+ ssize_t start_len; /* Length for misaligned first extent */
+ ssize_t middle_len; /* Length for DIO-aligned middle extent */
+ ssize_t end_len; /* Length for misaligned last extent */
+};
+
+static bool
+nfsd_analyze_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, loff_t offset,
+ unsigned long len, struct nfsd_write_dio *write_dio)
+{
+ const u32 dio_blocksize = nf->nf_dio_offset_align;
+ loff_t orig_end, middle_end, start_end, start_offset = offset;
+ ssize_t start_len = len;
+
+ if (WARN_ONCE(!nf->nf_dio_mem_align || !dio_blocksize,
+ "%s: underlying filesystem has not provided DIO alignment info\n",
+ __func__))
+ return false;
+ if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
+ "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
+ __func__, dio_blocksize, PAGE_SIZE))
+ return false;
+ if (unlikely(len < dio_blocksize))
+ return false;
+
+ memset(write_dio, 0, sizeof(*write_dio));
+
+ if (((offset | len) & (dio_blocksize-1)) == 0) {
+ /* already DIO-aligned, no misaligned head or tail */
+ write_dio->middle_offset = offset;
+ write_dio->middle_len = len;
+ /* clear these for the benefit of trace_nfsd_analyze_write_dio */
+ start_offset = 0;
+ start_len = 0;
+ return true;
+ }
+
+ start_end = round_up(offset, dio_blocksize);
+ start_len = start_end - offset;
+ orig_end = offset + len;
+ middle_end = round_down(orig_end, dio_blocksize);
+
+ write_dio->start_len = start_len;
+ write_dio->middle_offset = start_end;
+ write_dio->middle_len = middle_end - start_end;
+ write_dio->end_offset = middle_end;
+ write_dio->end_len = orig_end - middle_end;
+
+ return true;
+}
+
+/*
+ * Setup as many as 3 iov_iter based on extents described by @write_dio.
+ * @iterp: pointer to pointer to onstack array of 3 iov_iter structs from caller.
+ * @iter_is_dio_aligned: pointer to onstack array of 3 bools from caller.
+ * @rq_bvec: backing bio_vec used to setup all 3 iov_iter permutations.
+ * @nvecs: number of segments in @rq_bvec
+ * @cnt: size of the request in bytes
+ * @write_dio: nfsd_write_dio struct that describes start, middle and end extents.
+ *
+ * Returns the number of iov_iter that were setup.
+ */
+static int
+nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
+ struct bio_vec *rq_bvec, unsigned int nvecs,
+ unsigned long cnt, struct nfsd_write_dio *write_dio)
+{
+ int n_iters = 0;
+ struct iov_iter *iters = *iterp;
+
+ /* Setup misaligned start? */
+ if (write_dio->start_len) {
+ iter_is_dio_aligned[n_iters] = false;
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ iters[n_iters].count = write_dio->start_len;
+ ++n_iters;
+ }
+
+ /* Setup DIO-aligned middle */
+ iter_is_dio_aligned[n_iters] = true;
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ if (write_dio->start_len)
+ iov_iter_advance(&iters[n_iters], write_dio->start_len);
+ iters[n_iters].count -= write_dio->end_len;
+ ++n_iters;
+
+ /* Setup misaligned end? */
+ if (write_dio->end_len) {
+ iter_is_dio_aligned[n_iters] = false;
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ iov_iter_advance(&iters[n_iters],
+ write_dio->start_len + write_dio->middle_len);
+ ++n_iters;
+ }
+
+ return n_iters;
+}
+
+static int
+nfsd_issue_write_buffered(struct svc_rqst *rqstp, struct file *file,
+ unsigned int nvecs, unsigned long *cnt,
+ struct kiocb *kiocb)
+{
+ struct iov_iter iter;
+ int host_err;
+
+ iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+ host_err = vfs_iocb_iter_write(file, kiocb, &iter);
+ if (host_err < 0)
+ return host_err;
+ *cnt = host_err;
+
+ return 0;
+}
+
+static noinline int
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, loff_t offset,
+ unsigned int nvecs, unsigned long *cnt,
+ struct kiocb *kiocb)
+{
+ struct nfsd_write_dio write_dio;
+ struct file *file = nf->nf_file;
+
+ if (!nfsd_analyze_write_dio(rqstp, fhp, nf, offset,
+ *cnt, &write_dio)) {
+ return nfsd_issue_write_buffered(rqstp, file,
+ nvecs, cnt, kiocb);
+ } else {
+ bool iter_is_dio_aligned[3];
+ struct iov_iter iter_stack[3];
+ struct iov_iter *iter = iter_stack;
+ unsigned int n_iters = 0;
+ int host_err;
+
+ n_iters = nfsd_setup_write_dio_iters(&iter,
+ iter_is_dio_aligned, rqstp->rq_bvec,
+ nvecs, *cnt, &write_dio);
+ *cnt = 0;
+ for (int i = 0; i < n_iters; i++) {
+ if (iter_is_dio_aligned[i])
+ kiocb->ki_flags |= IOCB_DIRECT;
+ else
+ kiocb->ki_flags &= ~IOCB_DIRECT;
+ host_err = vfs_iocb_iter_write(file, kiocb,
+ &iter[i]);
+ if (host_err < 0)
+ return host_err;
+ *cnt += host_err;
+ }
+ }
+
+ return 0;
+}
+
/**
* nfsd_vfs_write - write data to an already-open file
* @rqstp: RPC execution context
@@ -1345,7 +1503,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct super_block *sb = file_inode(file)->i_sb;
struct kiocb kiocb;
struct svc_export *exp;
- struct iov_iter iter;
errseq_t since;
__be32 nfserr;
int host_err;
@@ -1382,31 +1539,29 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
kiocb.ki_flags |= IOCB_DSYNC;
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
- iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
switch (nfsd_io_cache_write) {
case NFSD_IO_DIRECT:
- /* direct I/O must be aligned to device logical sector size */
- if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
- (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0))
- kiocb.ki_flags |= IOCB_DIRECT;
+ host_err = nfsd_issue_write_dio(rqstp, fhp, nf, offset,
+ nvecs, cnt, &kiocb);
break;
case NFSD_IO_DONTCACHE:
kiocb.ki_flags |= IOCB_DONTCACHE;
fallthrough;
case NFSD_IO_UNSPECIFIED:
case NFSD_IO_BUFFERED:
+ host_err = nfsd_issue_write_buffered(rqstp, file,
+ nvecs, cnt, &kiocb);
break;
}
- host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
if (host_err < 0) {
commit_reset_write_verifier(nn, rqstp, host_err);
goto out_nfserr;
}
- *cnt = host_err;
nfsd_stats_io_write_add(nn, exp, *cnt);
fsnotify_modify(file);
host_err = filemap_check_wb_err(file->f_mapping, since);
--
2.44.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v6 7/7] NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events
2025-08-09 5:02 [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
` (5 preceding siblings ...)
2025-08-09 5:02 ` [PATCH v6 6/7] NFSD: issue WRITEs " Mike Snitzer
@ 2025-08-09 5:02 ` Mike Snitzer
2025-08-13 22:41 ` [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
7 siblings, 0 replies; 9+ messages in thread
From: Mike Snitzer @ 2025-08-09 5:02 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs
Add EVENT_CLASS nfsd_analyze_dio_class and use it to create
nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events.
The nfsd_analyze_read_dio trace event shows how NFSD expands any
misaligned READ to the next DIO-aligned block (on either end of the
original READ, as needed).
This combination of trace events is useful for READs:
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
Which for this dd command:
dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct
Results in:
nfsd-23908[010] ..... 10375.141640: nfsd_analyze_read_dio: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+47008 end=47008+96
nfsd-23908[010] ..... 10375.141642: nfsd_read_vector: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47104
nfsd-23908[010] ..... 10375.141643: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0x0 bytecount 0xb800
nfsd-23908[010] ..... 10375.141773: nfsd_read_io_done: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008
nfsd-23908[010] ..... 10375.142063: nfsd_analyze_read_dio: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=46592+416 middle=47008+47008 end=94016+192
nfsd-23908[010] ..... 10375.142064: nfsd_read_vector: xid=0x83c5923b fh_hash=0x857ca4fc offset=46592 len=47616
nfsd-23908[010] ..... 10375.142065: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0xb600 bytecount 0xba00
nfsd-23908[010] ..... 10375.142103: nfsd_read_io_done: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008
The nfsd_analyze_write_dio trace event shows how NFSD splits a given
misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
extent.
This combination of trace events is useful for WRITEs:
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
Which for this dd command:
dd if=/dev/zero of=/mnt/share1/test bs=47008 count=2 oflag=direct
Results in:
nfsd-23908[010] ..... 10374.902333: nfsd_write_opened: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008
nfsd-23908[010] ..... 10374.902335: nfsd_analyze_write_dio: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+46592 end=46592+416
nfsd-23908[010] ..... 10374.902343: xfs_file_direct_write: dev 259:2 ino 0xc00116 disize 0x0 pos 0x0 bytecount 0xb600
nfsd-23908[010] ..... 10374.902697: nfsd_write_io_done: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008
nfsd-23908[010] ..... 10374.902925: nfsd_write_opened: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008
nfsd-23908[010] ..... 10374.902926: nfsd_analyze_write_dio: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=47008+96 middle=47104+46592 end=93696+320
nfsd-23908[010] ..... 10374.903010: xfs_file_direct_write: dev 259:2 ino 0xc00116 disize 0xb800 pos 0xb800 bytecount 0xb600
nfsd-23908[010] ..... 10374.903239: nfsd_write_io_done: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/trace.h | 61 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfsd/vfs.c | 13 +++++++++--
2 files changed, 72 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index a664fdf1161e9..4173bd9344b6b 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -473,6 +473,67 @@ DEFINE_NFSD_IO_EVENT(write_done);
DEFINE_NFSD_IO_EVENT(commit_start);
DEFINE_NFSD_IO_EVENT(commit_done);
+DECLARE_EVENT_CLASS(nfsd_analyze_dio_class,
+ TP_PROTO(struct svc_rqst *rqstp,
+ struct svc_fh *fhp,
+ u64 offset,
+ u32 len,
+ loff_t start,
+ ssize_t start_len,
+ loff_t middle,
+ ssize_t middle_len,
+ loff_t end,
+ ssize_t end_len),
+ TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len),
+ TP_STRUCT__entry(
+ __field(u32, xid)
+ __field(u32, fh_hash)
+ __field(u64, offset)
+ __field(u32, len)
+ __field(loff_t, start)
+ __field(ssize_t, start_len)
+ __field(loff_t, middle)
+ __field(ssize_t, middle_len)
+ __field(loff_t, end)
+ __field(ssize_t, end_len)
+ ),
+ TP_fast_assign(
+ __entry->xid = be32_to_cpu(rqstp->rq_xid);
+ __entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
+ __entry->offset = offset;
+ __entry->len = len;
+ __entry->start = start;
+ __entry->start_len = start_len;
+ __entry->middle = middle;
+ __entry->middle_len = middle_len;
+ __entry->end = end;
+ __entry->end_len = end_len;
+ ),
+ TP_printk("xid=0x%08x fh_hash=0x%08x offset=%llu len=%u start=%llu+%lu middle=%llu+%lu end=%llu+%lu",
+ __entry->xid, __entry->fh_hash,
+ __entry->offset, __entry->len,
+ __entry->start, __entry->start_len,
+ __entry->middle, __entry->middle_len,
+ __entry->end, __entry->end_len)
+)
+
+#define DEFINE_NFSD_ANALYZE_DIO_EVENT(name) \
+DEFINE_EVENT(nfsd_analyze_dio_class, nfsd_analyze_##name##_dio, \
+ TP_PROTO(struct svc_rqst *rqstp, \
+ struct svc_fh *fhp, \
+ u64 offset, \
+ u32 len, \
+ loff_t start, \
+ ssize_t start_len, \
+ loff_t middle, \
+ ssize_t middle_len, \
+ loff_t end, \
+ ssize_t end_len), \
+ TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len))
+
+DEFINE_NFSD_ANALYZE_DIO_EVENT(read);
+DEFINE_NFSD_ANALYZE_DIO_EVENT(write);
+
DECLARE_EVENT_CLASS(nfsd_err_class,
TP_PROTO(struct svc_rqst *rqstp,
struct svc_fh *fhp,
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index a11c59dcb537b..8b414ed7531c7 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1143,6 +1143,12 @@ static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
}
+ /* Show original offset and count, and how it was expanded for DIO */
+ middle_end = read_dio->end - read_dio->end_extra;
+ trace_nfsd_analyze_read_dio(rqstp, fhp, offset, len,
+ read_dio->start, read_dio->start_extra,
+ offset, (middle_end - offset),
+ middle_end, read_dio->end_extra);
return true;
}
@@ -1355,7 +1361,7 @@ nfsd_analyze_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
/* clear these for the benefit of trace_nfsd_analyze_write_dio */
start_offset = 0;
start_len = 0;
- return true;
+ goto out;
}
start_end = round_up(offset, dio_blocksize);
@@ -1368,7 +1374,10 @@ nfsd_analyze_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
write_dio->middle_len = middle_end - start_end;
write_dio->end_offset = middle_end;
write_dio->end_len = orig_end - middle_end;
-
+out:
+ trace_nfsd_analyze_write_dio(rqstp, fhp, offset, len, start_offset, start_len,
+ write_dio->middle_offset, write_dio->middle_len,
+ write_dio->end_offset, write_dio->end_len);
return true;
}
--
2.44.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes
2025-08-09 5:02 [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
` (6 preceding siblings ...)
2025-08-09 5:02 ` [PATCH v6 7/7] NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events Mike Snitzer
@ 2025-08-13 22:41 ` Mike Snitzer
7 siblings, 0 replies; 9+ messages in thread
From: Mike Snitzer @ 2025-08-13 22:41 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton; +Cc: linux-nfs, kbusch
On Sat, Aug 09, 2025 at 01:02:50AM -0400, Mike Snitzer wrote:
> Hi,
>
> Some workloads benefit from NFSD avoiding the page cache, particularly
> those with a working set that is significantly larger than available
> system memory. This patchset introduces _optional_ support to
> configure the use of O_DIRECT or DONTCACHE for NFSD's READ and WRITE
> support. The NFSD default to use page cache is left unchanged.
>
> The performance win associated with using NFSD DIRECT was previously
> summarized here:
> https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
> This picture offers a nice summary of performance gains:
> https://original.art/NFSD_direct_vs_buffered_IO.jpg
>
> This series builds on what has been staged in the nfsd-testing branch.
>
> This code has proven to work well during my testing. Any suggestions
> for further refinement are welcome.
>
> Thanks,
> Mike
>
> Changes since v5:
> - #define NFSD_READ_DIO_MIN_KB (32 << 10)
> - switch to using pre-increment from post-increment.
> - always get DIO alignment when opening regular nfsd_file (dropped
> patch that only did if NFSD_IO_DIRECT).
> - fixed nfsd_io_cache_{read,write}_set to not set NFSD_IO_UNSPECIFIED
> if returning -EINVAL due to unrecognized value.
> - use a switch statement in nfsd_iter_read like nfsd_iter_write
> - Optimize nfsd_iter_read for default being buffered IO, like was done
> for nfsd_iter_write in v5.
FYI, I've found that in general we _do_ need to verify DIO-alignment
more comprehensively. Which means I've needed to reinstate the use of
a modified iov_iter_is_aligned.
Otherwise the piece-wise checking that Keith suggested, by adding
checks to existing loops and such, results in the underlying
filesystem (XFS) returning -EINVAL due to alignment issues.
So I'm now focusing on correctness and will hopefully be posting v7 of
this patchset by the end of the week.
Mike
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2025-08-13 22:41 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-09 5:02 [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 4/7] NFSD: add io_cache_write " Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 5/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 6/7] NFSD: issue WRITEs " Mike Snitzer
2025-08-09 5:02 ` [PATCH v6 7/7] NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events Mike Snitzer
2025-08-13 22:41 ` [PATCH v6 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).