* [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT
@ 2025-07-24 19:30 Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 01/13] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
` (14 more replies)
0 siblings, 15 replies; 22+ messages in thread
From: Mike Snitzer @ 2025-07-24 19:30 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
Hi,
Some workloads benefit from NFSD avoiding the page cache, particularly
those with a working set that is significantly larger than available
system memory. This patchset introduces _optional_ support to
configure the use of O_DIRECT or DONTCACHE for NFSD's READ and WRITE
support. The NFSD default to use page cache is left unchanged.
The performance win associated with using NFSD DIRECT was previously
summarized here:
https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
This picture offers a nice summary of performance gains:
https://original.art/NFSD_direct_vs_buffered_IO.jpg
Similarly, NFS and LOCALIO in particular also benefit from avoiding
the page cache for workloads that have a working set that is
significantly larger than available system memory. Enter: NFS DIRECT,
which makes it possible to always enable LOCALIO to use O_DIRECT even
if the IO is not DIO-aligned.
For this v5 I've combined the NFSD and NFS patchsets because the NFS
changes do depend on the NFSD changes. In addition, I think it
makes sense to review/test these changes together.
I'm sharing these again now, soon after posting the NFSD and NFS
updates, to hopefully make it clear where the code stands. Thanks to
Chuck's feedback I have kept the patch "NFSD: issue READs using
O_DIRECT even if IO is misaligned". I will now finish NFSD's
misaligned WRITE handling (splitting IO into a misaligned head and/or
tail and a DIO-aligned middle) and include it in the next version of
this patchset, probably mid next week.
New changes in this v5:
- Combine NFSD DIRECT and NFS DIRECT patches into single patchset.
- Fix a "nsfd" typo in a variable of the NFSD io_cache_read patch that
was masked because the later "NFSD: issue READs using O_DIRECT even
if IO is misaligned" patch fixed it.
- Properly include the "NFSD: filecache: only get DIO alignment
attrs if NFSD_IO_DIRECT enabled" patch in the series.
- Optimize NFS DIRECT's misaligned READ and WRITE support to return
early if the IO is irreparably misaligned or already DIO-aligned.
Thanks,
Mike
Mike Snitzer (13):
NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
NFSD: pass nfsd_file to nfsd_iter_read()
NFSD: add io_cache_read controls to debugfs interface
NFSD: add io_cache_write controls to debugfs interface
NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled
NFSD: issue READs using O_DIRECT even if IO is misaligned
nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local()
nfs/localio: make trace_nfs_local_open_fh more useful
nfs/localio: add nfsd_file_dio_alignment
nfs/localio: refactor iocb initialization
nfs/localio: fallback to NFSD for misaligned O_DIRECT READs
nfs/direct: add misaligned READ handling
nfs/direct: add misaligned WRITE handling
fs/nfs/direct.c | 262 +++++++++++++++++++++++--
fs/nfs/flexfilelayout/flexfilelayout.c | 1 +
fs/nfs/internal.h | 17 +-
fs/nfs/localio.c | 231 ++++++++++++++--------
fs/nfs/nfstrace.h | 47 ++++-
fs/nfs/pagelist.c | 22 ++-
fs/nfsd/debugfs.c | 102 ++++++++++
fs/nfsd/filecache.c | 36 ++++
fs/nfsd/filecache.h | 4 +
fs/nfsd/localio.c | 11 ++
fs/nfsd/nfs4xdr.c | 8 +-
fs/nfsd/nfsd.h | 10 +
fs/nfsd/nfsfh.c | 4 +
fs/nfsd/trace.h | 37 ++++
fs/nfsd/vfs.c | 200 +++++++++++++++++--
fs/nfsd/vfs.h | 2 +-
include/linux/nfs_page.h | 1 +
include/linux/nfslocalio.h | 2 +
include/linux/sunrpc/svc.h | 5 +-
19 files changed, 875 insertions(+), 127 deletions(-)
--
2.44.0
* [PATCH v5 01/13] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
From: Mike Snitzer @ 2025-07-24 19:30 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
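Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get the DIO alignment
attributes from the underlying filesystem and store them in the
associated nfsd_file. This is done when the nfsd_file is first opened
for a regular file. If the filesystem does not report
STATX_DIO_READ_ALIGN, the read offset alignment falls back to the
general DIO offset alignment.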
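As a rough illustration only (this is not code from the patch), the way
the alignment attributes are selected can be modelled in standalone C.
The names sketch_stat, sketch_file, SKETCH_DIOALIGN,
SKETCH_DIO_READ_ALIGN and sketch_store_dio_align() are hypothetical
stand-ins for struct kstat, struct nfsd_file, the STATX_* result bits
and nfsd_file_getattr(); the only behaviour taken from the patch is
that values are stored when the matching result bit is set, and that
the read offset alignment falls back to the general offset alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the STATX_DIOALIGN / STATX_DIO_READ_ALIGN bits. */
#define SKETCH_DIOALIGN       (1u << 0)
#define SKETCH_DIO_READ_ALIGN (1u << 1)

/* Hypothetical stand-in for the DIO fields of struct kstat. */
struct sketch_stat {
	uint32_t result_mask;
	uint32_t dio_mem_align;
	uint32_t dio_offset_align;
	uint32_t dio_read_offset_align;
};

/* Hypothetical stand-in for the new fields in struct nfsd_file. */
struct sketch_file {
	uint32_t dio_mem_align;
	uint32_t dio_offset_align;
	uint32_t dio_read_offset_align;
};

static void sketch_store_dio_align(struct sketch_file *f,
				   const struct sketch_stat *st)
{
	assert(f && st);

	/* Zero means "no DIO alignment known", as at allocation time. */
	f->dio_mem_align = 0;
	f->dio_offset_align = 0;
	f->dio_read_offset_align = 0;

	if (st->result_mask & SKETCH_DIOALIGN) {
		f->dio_mem_align = st->dio_mem_align;
		f->dio_offset_align = st->dio_offset_align;
	}

	/* Prefer a dedicated read alignment, else reuse the general one. */
	if (st->result_mask & SKETCH_DIO_READ_ALIGN)
		f->dio_read_offset_align = st->dio_read_offset_align;
	else
		f->dio_read_offset_align = f->dio_offset_align;
}
```

A filesystem that reports neither bit leaves all three values at zero,
which later patches in this series treat as "DIO alignment unknown".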
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/filecache.c | 32 ++++++++++++++++++++++++++++++++
fs/nfsd/filecache.h | 4 ++++
fs/nfsd/nfsfh.c | 4 ++++
3 files changed, 40 insertions(+)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 75bc48031c07..aad5f924d101 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -231,6 +231,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
refcount_set(&nf->nf_ref, 1);
nf->nf_may = need;
nf->nf_mark = NULL;
+ nf->nf_dio_mem_align = 0;
+ nf->nf_dio_offset_align = 0;
+ nf->nf_dio_read_offset_align = 0;
return nf;
}
@@ -1048,6 +1051,33 @@ nfsd_file_is_cached(struct inode *inode)
return ret;
}
+static __be32
+nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
+{
+ struct inode *inode = file_inode(nf->nf_file);
+ struct kstat stat;
+ __be32 status;
+
+ /* Currently only need to get DIO alignment info for regular files */
+ if (!S_ISREG(inode->i_mode))
+ return nfs_ok;
+
+ status = fh_getattr(fhp, &stat);
+ if (status != nfs_ok)
+ return status;
+
+ if (stat.result_mask & STATX_DIOALIGN) {
+ nf->nf_dio_mem_align = stat.dio_mem_align;
+ nf->nf_dio_offset_align = stat.dio_offset_align;
+ }
+ if (stat.result_mask & STATX_DIO_READ_ALIGN)
+ nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
+ else
+ nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
+
+ return status;
+}
+
static __be32
nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
struct svc_cred *cred,
@@ -1166,6 +1196,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
}
status = nfserrno(ret);
trace_nfsd_file_open(nf, status);
+ if (status == nfs_ok)
+ status = nfsd_file_getattr(fhp, nf);
}
} else
status = nfserr_jukebox;
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index 24ddf60e8434..e3d6ca2b6030 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -54,6 +54,10 @@ struct nfsd_file {
struct list_head nf_gc;
struct rcu_head nf_rcu;
ktime_t nf_birthtime;
+
+ u32 nf_dio_mem_align;
+ u32 nf_dio_offset_align;
+ u32 nf_dio_read_offset_align;
};
int nfsd_file_cache_init(void);
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index f4c2fb3dd5d0..ff634b47c4df 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -676,8 +676,12 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
.mnt = fhp->fh_export->ex_path.mnt,
.dentry = fhp->fh_dentry,
};
+ struct inode *inode = d_inode(p.dentry);
u32 request_mask = STATX_BASIC_STATS;
+ if (S_ISREG(inode->i_mode))
+ request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
+
if (fhp->fh_maxsize == NFS4_FHSIZE)
request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
--
2.44.0
* [PATCH v5 02/13] NFSD: pass nfsd_file to nfsd_iter_read()
From: Mike Snitzer @ 2025-07-24 19:30 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
Prepares for nfsd_iter_read() to use DIO alignment stored in nfsd_file.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/nfs4xdr.c | 8 ++++----
fs/nfsd/vfs.c | 7 ++++---
fs/nfsd/vfs.h | 2 +-
3 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 8b68f74a8cf0..c5a9b5680005 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4464,7 +4464,7 @@ static __be32 nfsd4_encode_splice_read(
static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
struct nfsd4_read *read,
- struct file *file, unsigned long maxcount)
+ unsigned long maxcount)
{
struct xdr_stream *xdr = resp->xdr;
unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
@@ -4475,7 +4475,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
if (xdr_reserve_space_vec(xdr, maxcount) < 0)
return nfserr_resource;
- nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
+ nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
read->rd_offset, &maxcount, base,
&read->rd_eof);
read->rd_length = maxcount;
@@ -4522,7 +4522,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr) {
xdr_truncate_encode(xdr, eof_offset);
return nfserr;
@@ -5418,7 +5418,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr)
return nfserr;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index eaf04751d07f..9bbc97aebbea 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1076,7 +1076,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
* @fhp: file handle of file to be read
- * @file: opened struct file of file to be read
+ * @nf: opened struct nfsd_file of file to be read
* @offset: starting byte offset
* @count: IN: requested number of bytes; OUT: number of bytes read
* @base: offset in first page of read buffer
@@ -1089,9 +1089,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* returned.
*/
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset, unsigned long *count,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
unsigned int base, u32 *eof)
{
+ struct file *file = nf->nf_file;
unsigned long v, total;
struct iov_iter iter;
struct kiocb kiocb;
@@ -1313,7 +1314,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
else
- err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
+ err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
nfsd_file_put(nf);
trace_nfsd_read_done(rqstp, fhp, offset, *count);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 0c0292611c6d..fa46f8b5f132 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -121,7 +121,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned long *count,
u32 *eof);
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset,
+ struct nfsd_file *nf, loff_t offset,
unsigned long *count, unsigned int base,
u32 *eof);
bool nfsd_read_splice_ok(struct svc_rqst *rqstp);
--
2.44.0
* [PATCH v5 03/13] NFSD: add io_cache_read controls to debugfs interface
From: Mike Snitzer @ 2025-07-24 19:30 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
Add 'io_cache_read' to NFSD's debugfs interface so that any data
read by NFSD will be one of:
- cached using page cache (NFSD_IO_BUFFERED=1)
- cached but removed from the page cache upon completion
(NFSD_IO_DONTCACHE=2)
- not cached (NFSD_IO_DIRECT=3)
io_cache_read may be set by writing to:
/sys/kernel/debug/nfsd/io_cache_read
If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
advertised as supported by the underlying filesystem (e.g. XFS),
otherwise all IO flagged with RWF_DONTCACHE will fail with
-EOPNOTSUPP.
If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
relative to the underlying block device's logical_block_size. Also the
memory buffer used to store the read must be aligned relative to the
underlying block device's dma_alignment.
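To make the two alignment requirements concrete, here is a minimal
standalone sketch, not taken from the patch. It assumes a single
contiguous buffer, whereas the kernel's iov_iter_is_aligned() checks
every segment of the iterator. The helper names sketch_is_pow2() and
sketch_dio_aligned() are hypothetical; offset_align plays the role of
the logical_block_size-derived alignment and mem_align the role of the
dma_alignment-derived one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if x is a non-zero power of two. */
static bool sketch_is_pow2(uint64_t x)
{
	return x && !(x & (x - 1));
}

/*
 * Hypothetical helper: could this IO be issued with O_DIRECT as-is?
 * Unknown (zero) alignments are treated as "cannot use DIO".
 */
static bool sketch_dio_aligned(uint64_t offset, uint64_t len,
			       uintptr_t buf_addr,
			       uint32_t offset_align, uint32_t mem_align)
{
	if (!sketch_is_pow2(offset_align) || !sketch_is_pow2(mem_align))
		return false;

	/* File offset and length must both be multiples of offset_align. */
	if ((offset | len) & (uint64_t)(offset_align - 1))
		return false;

	/* The memory buffer must start on a mem_align boundary. */
	if (buf_addr & (uintptr_t)(mem_align - 1))
		return false;

	return true;
}
```

With offset_align=512 and mem_align=4, a 47008-byte READ at offset
47008 fails the first check, which is the kind of IO that stays
buffered with this patch alone.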
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/debugfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfsd.h | 9 ++++++++
fs/nfsd/vfs.c | 16 +++++++++++++
3 files changed, 83 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 84b0c8b559dc..c07f71d4e84f 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -27,11 +27,66 @@ static int nfsd_dsr_get(void *data, u64 *val)
static int nfsd_dsr_set(void *data, u64 val)
{
nfsd_disable_splice_read = (val > 0) ? true : false;
+ if (!nfsd_disable_splice_read) {
+ /*
+ * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
+ * if splice_read is enabled.
+ */
+ nfsd_io_cache_read = NFSD_IO_BUFFERED;
+ }
return 0;
}
DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
+/*
+ * /sys/kernel/debug/nfsd/io_cache_read
+ *
+ * Contents:
+ * %1: NFS READ will use buffered IO
+ * %2: NFS READ will use dontcache (buffered IO w/ dropbehind)
+ * %3: NFS READ will use direct IO
+ *
+ * The default value of this setting is zero (UNSPECIFIED).
+ * This setting takes immediate effect for all NFS versions,
+ * all exports, and in all NFSD net namespaces.
+ */
+
+static int nfsd_io_cache_read_get(void *data, u64 *val)
+{
+ *val = nfsd_io_cache_read;
+ return 0;
+}
+
+static int nfsd_io_cache_read_set(void *data, u64 val)
+{
+ int ret = 0;
+
+ switch (val) {
+ case NFSD_IO_BUFFERED:
+ nfsd_io_cache_read = NFSD_IO_BUFFERED;
+ break;
+ case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
+ /*
+ * Must disable splice_read when enabling
+ * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
+ */
+ nfsd_disable_splice_read = true;
+ nfsd_io_cache_read = val;
+ break;
+ default:
+ nfsd_io_cache_read = NFSD_IO_UNSPECIFIED;
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
+ nfsd_io_cache_read_set, "%llu\n");
+
void nfsd_debugfs_exit(void)
{
debugfs_remove_recursive(nfsd_top_dir);
@@ -44,4 +99,7 @@ void nfsd_debugfs_init(void)
debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
nfsd_top_dir, NULL, &nfsd_dsr_fops);
+
+ debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
+ nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
}
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 1cd0bed57bc2..6ef799405145 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -153,6 +153,15 @@ static inline void nfsd_debugfs_exit(void) {}
extern bool nfsd_disable_splice_read __read_mostly;
+enum {
+ NFSD_IO_UNSPECIFIED = 0,
+ NFSD_IO_BUFFERED,
+ NFSD_IO_DONTCACHE,
+ NFSD_IO_DIRECT,
+};
+
+extern u64 nfsd_io_cache_read __read_mostly;
+
extern int nfsd_max_blksize;
static inline int nfsd_v4client(struct svc_rqst *rq)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 9bbc97aebbea..145f6d635ac7 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -49,6 +49,7 @@
#define NFSDDBG_FACILITY NFSDDBG_FILEOP
bool nfsd_disable_splice_read __read_mostly;
+u64 nfsd_io_cache_read __read_mostly;
/**
* nfserrno - Map Linux errnos to NFS errnos
@@ -1116,6 +1117,21 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
trace_nfsd_read_vector(rqstp, fhp, offset, *count);
iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
+
+ switch (nfsd_io_cache_read) {
+ case NFSD_IO_DIRECT:
+ if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
+ iov_iter_is_aligned(&iter, nf->nf_dio_mem_align - 1,
+ nf->nf_dio_read_offset_align - 1))
+ kiocb.ki_flags = IOCB_DIRECT;
+ break;
+ case NFSD_IO_DONTCACHE:
+ kiocb.ki_flags = IOCB_DONTCACHE;
+ break;
+ case NFSD_IO_BUFFERED:
+ break;
+ }
+
host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
--
2.44.0
* [PATCH v5 04/13] NFSD: add io_cache_write controls to debugfs interface
From: Mike Snitzer @ 2025-07-24 19:30 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
Add 'io_cache_write' to NFSD's debugfs interface so that any data
written by NFSD will be one of:
- cached using page cache (NFSD_IO_BUFFERED=1)
- cached but removed from the page cache upon completion
(NFSD_IO_DONTCACHE=2)
- not cached (NFSD_IO_DIRECT=3)
io_cache_write may be set by writing to:
/sys/kernel/debug/nfsd/io_cache_write
If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
advertised as supported by the underlying filesystem (e.g. XFS),
otherwise all IO flagged with RWF_DONTCACHE will fail with
-EOPNOTSUPP.
If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
relative to the underlying block device's logical_block_size. Also the
memory buffer used to store the WRITE payload must be aligned relative
to the underlying block device's dma_alignment.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/debugfs.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfsd.h | 1 +
fs/nfsd/vfs.c | 18 ++++++++++++++++++
3 files changed, 63 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index c07f71d4e84f..872de65f0e9a 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -87,6 +87,47 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
nfsd_io_cache_read_set, "%llu\n");
+/*
+ * /sys/kernel/debug/nfsd/io_cache_write
+ *
+ * Contents:
+ * %1: NFS WRITE will use buffered IO
+ * %2: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
+ * %3: NFS WRITE will use direct IO
+ *
+ * The default value of this setting is zero (UNSPECIFIED).
+ * This setting takes immediate effect for all NFS versions,
+ * all exports, and in all NFSD net namespaces.
+ */
+
+static int nfsd_io_cache_write_get(void *data, u64 *val)
+{
+ *val = nfsd_io_cache_write;
+ return 0;
+}
+
+static int nfsd_io_cache_write_set(void *data, u64 val)
+{
+ int ret = 0;
+
+ switch (val) {
+ case NFSD_IO_BUFFERED:
+ case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
+ nfsd_io_cache_write = val;
+ break;
+ default:
+ nfsd_io_cache_write = NFSD_IO_UNSPECIFIED;
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_write_fops, nfsd_io_cache_write_get,
+ nfsd_io_cache_write_set, "%llu\n");
+
void nfsd_debugfs_exit(void)
{
debugfs_remove_recursive(nfsd_top_dir);
@@ -102,4 +143,7 @@ void nfsd_debugfs_init(void)
debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
+
+ debugfs_create_file("io_cache_write", S_IWUSR | S_IRUGO,
+ nfsd_top_dir, NULL, &nfsd_io_cache_write_fops);
}
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 6ef799405145..fe935b4cda53 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -161,6 +161,7 @@ enum {
};
extern u64 nfsd_io_cache_read __read_mostly;
+extern u64 nfsd_io_cache_write __read_mostly;
extern int nfsd_max_blksize;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 145f6d635ac7..a7a587736a22 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -50,6 +50,7 @@
bool nfsd_disable_splice_read __read_mostly;
u64 nfsd_io_cache_read __read_mostly;
+u64 nfsd_io_cache_write __read_mostly;
/**
* nfserrno - Map Linux errnos to NFS errnos
@@ -1238,6 +1239,23 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
+ switch (nfsd_io_cache_write) {
+ case NFSD_IO_DIRECT:
+ /* direct I/O must be aligned to device logical sector size */
+ if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
+ (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0) &&
+ iov_iter_is_aligned(&iter, nf->nf_dio_mem_align - 1,
+ nf->nf_dio_offset_align - 1))
+ kiocb.ki_flags = IOCB_DIRECT;
+ break;
+ case NFSD_IO_DONTCACHE:
+ kiocb.ki_flags = IOCB_DONTCACHE;
+ break;
+ case NFSD_IO_BUFFERED:
+ break;
+ }
+
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
--
2.44.0
* [PATCH v5 05/13] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled
From: Mike Snitzer @ 2025-07-24 19:30 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
From: Mike Snitzer <snitzer@hammerspace.com>
Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/filecache.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index aad5f924d101..01db2aed82d6 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -1058,8 +1058,12 @@ nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
struct kstat stat;
__be32 status;
- /* Currently only need to get DIO alignment info for regular files */
- if (!S_ISREG(inode->i_mode))
+ /* Currently only need to get DIO alignment info for regular files
+ * IFF NFSD_IO_DIRECT is enabled for nfsd_io_cache_{read,write}.
+ */
+ if (!S_ISREG(inode->i_mode) ||
+ (nfsd_io_cache_read != NFSD_IO_DIRECT &&
+ nfsd_io_cache_write != NFSD_IO_DIRECT))
return nfs_ok;
status = fh_getattr(fhp, &stat);
--
2.44.0
* [PATCH v5 06/13] NFSD: issue READs using O_DIRECT even if IO is misaligned
From: Mike Snitzer @ 2025-07-24 19:30 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
DIO-aligned block (on either end of the READ). The expanded READ is
then checked for proper offset/len (logical_block_size) and memory
(dma_alignment) alignment.
A bounce-buffer page (called 'start_extra_page') must be allocated and
used if/when expanding the misaligned READ requires reading an extra
partial page at the start of the READ so that it's DIO-aligned.
Otherwise that extra page at the start will make its way back to the
NFS client and corruption will occur. This corruption was found, and
the fix of using an extra page then verified, using the 'dt' utility:
dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
iotype=sequential pattern=iot onerr=abort oncerr=abort
see: https://github.com/RobinTMiller/dt.git
Any misaligned READ that is less than 32K won't be expanded to be
DIO-aligned (this heuristic just avoids excess work, like allocating
start_extra_page, for smaller IO that can generally already perform
well using buffered IO).
Also add nfsd_read_vector_dio trace event. This combination of
trace events is useful:
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector_dio/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
Which for this dd command:
dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct
Results in:
nfsd-16580 [001] ..... 5672.403130: nfsd_read_vector_dio: xid=0x5ccf019c fh_hash=0xe4dadb60 offset=0 len=47008 start=0+0 end=47104-96
nfsd-16580 [001] ..... 5672.403131: nfsd_read_vector: xid=0x5ccf019c fh_hash=0xe4dadb60 offset=0 len=47104
nfsd-16580 [001] ..... 5672.403134: xfs_file_direct_read: dev 253:0 ino 0x1c2388c1 disize 0x16f40 pos 0x0 bytecount 0xb800
nfsd-16580 [001] ..... 5672.404380: nfsd_read_io_done: xid=0x5ccf019c fh_hash=0xe4dadb60 offset=0 len=47008
nfsd-16580 [001] ..... 5672.404672: nfsd_read_vector_dio: xid=0x5dcf019c fh_hash=0xe4dadb60 offset=47008 len=47008 start=46592+416 end=94208-192
nfsd-16580 [001] ..... 5672.404672: nfsd_read_vector: xid=0x5dcf019c fh_hash=0xe4dadb60 offset=46592 len=47616
nfsd-16580 [001] ..... 5672.404673: xfs_file_direct_read: dev 253:0 ino 0x1c2388c1 disize 0x16f40 pos 0xb600 bytecount 0xba00
nfsd-16580 [001] ..... 5672.405771: nfsd_read_io_done: xid=0x5dcf019c fh_hash=0xe4dadb60 offset=47008 len=47008
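The start=/end= values in the trace above follow from simple rounding,
which can be sketched in standalone C. This is a simplified model and
not the patch code: it assumes a 512-byte DIO block size (consistent
with the trace), omits the memory-alignment check and the
start_extra_page allocation, and uses hypothetical names
(sketch_read_dio, sketch_expand_read(), sketch_trim_read(),
SKETCH_MIN_EXPAND_LEN).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MIN_EXPAND_LEN (32u << 10)	/* the 32K heuristic */

struct sketch_read_dio {
	uint64_t start, end;		/* expanded, DIO-aligned extent */
	uint64_t start_extra;		/* bytes added before the READ */
	uint64_t end_extra;		/* bytes added after the READ */
};

/* Returns true if the READ should be issued as DIO over [start, end). */
static bool sketch_expand_read(uint64_t offset, uint64_t len,
			       uint32_t blocksize,
			       struct sketch_read_dio *rd)
{
	uint64_t mask = (uint64_t)blocksize - 1;
	uint64_t orig_end = offset + len;

	assert(blocksize && !(blocksize & (blocksize - 1)));
	*rd = (struct sketch_read_dio){ 0 };

	if (len < blocksize)
		return false;

	rd->start = offset & ~mask;			/* round down */
	rd->end = (orig_end + mask) & ~mask;		/* round up */
	rd->start_extra = offset - rd->start;
	rd->end_extra = rd->end - orig_end;

	/* Misaligned IO smaller than 32K is left to buffered IO. */
	if ((rd->start_extra || rd->end_extra) &&
	    len < SKETCH_MIN_EXPAND_LEN) {
		*rd = (struct sketch_read_dio){ 0 };
		return false;
	}
	return true;
}

/* Map the expanded read's byte count back to the original request. */
static int64_t sketch_trim_read(const struct sketch_read_dio *rd,
				int64_t bytes_read, uint64_t bytes_expected)
{
	if (bytes_read < 0)
		return bytes_read;		/* pass errors through */
	if ((uint64_t)bytes_read < rd->start_extra)
		return 0;			/* nothing requested was read */
	bytes_read -= (int64_t)rd->start_extra;
	if ((uint64_t)bytes_read > bytes_expected)
		bytes_read = (int64_t)bytes_expected;
	return bytes_read;
}
```

For the second READ in the trace (offset=47008 len=47008) this yields
start=46592 with 416 extra bytes and end=94208 with 192 extra bytes,
and a full 47616-byte direct read is trimmed back to 47008 bytes.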
Suggested-by: Jeff Layton <jlayton@kernel.org>
Suggested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/trace.h | 37 ++++++++
fs/nfsd/vfs.c | 183 ++++++++++++++++++++++++++++++++-----
include/linux/sunrpc/svc.h | 5 +-
3 files changed, 203 insertions(+), 22 deletions(-)
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index a664fdf1161e..55055482f8a8 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -473,6 +473,43 @@ DEFINE_NFSD_IO_EVENT(write_done);
DEFINE_NFSD_IO_EVENT(commit_start);
DEFINE_NFSD_IO_EVENT(commit_done);
+TRACE_EVENT(nfsd_read_vector_dio,
+ TP_PROTO(struct svc_rqst *rqstp,
+ struct svc_fh *fhp,
+ u64 offset,
+ u32 len,
+ loff_t start,
+ loff_t start_extra,
+ loff_t end,
+ loff_t end_extra),
+ TP_ARGS(rqstp, fhp, offset, len, start, start_extra, end, end_extra),
+ TP_STRUCT__entry(
+ __field(u32, xid)
+ __field(u32, fh_hash)
+ __field(u64, offset)
+ __field(u32, len)
+ __field(loff_t, start)
+ __field(loff_t, start_extra)
+ __field(loff_t, end)
+ __field(loff_t, end_extra)
+ ),
+ TP_fast_assign(
+ __entry->xid = be32_to_cpu(rqstp->rq_xid);
+ __entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
+ __entry->offset = offset;
+ __entry->len = len;
+ __entry->start = start;
+ __entry->start_extra = start_extra;
+ __entry->end = end;
+ __entry->end_extra = end_extra;
+ ),
+ TP_printk("xid=0x%08x fh_hash=0x%08x offset=%llu len=%u start=%llu+%llu end=%llu-%llu",
+ __entry->xid, __entry->fh_hash,
+ __entry->offset, __entry->len,
+ __entry->start, __entry->start_extra,
+ __entry->end, __entry->end_extra)
+);
+
DECLARE_EVENT_CLASS(nfsd_err_class,
TP_PROTO(struct svc_rqst *rqstp,
struct svc_fh *fhp,
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index a7a587736a22..eee39d2d5e0f 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -19,6 +19,7 @@
#include <linux/splice.h>
#include <linux/falloc.h>
#include <linux/fcntl.h>
+#include <linux/math.h>
#include <linux/namei.h>
#include <linux/delay.h>
#include <linux/fsnotify.h>
@@ -1074,6 +1075,116 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
+struct nfsd_read_dio
+{
+ loff_t start;
+ loff_t end;
+ unsigned long start_extra;
+ unsigned long end_extra;
+ struct page *start_extra_page;
+};
+
+static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
+{
+ memset(read_dio, 0, sizeof(*read_dio));
+ read_dio->start_extra_page = NULL;
+}
+
+static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, loff_t offset,
+ unsigned long len, unsigned int base,
+ struct nfsd_read_dio *read_dio)
+{
+ const u32 dio_blocksize = nf->nf_dio_read_offset_align;
+ loff_t orig_end = offset + len;
+
+ if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
+ "%s: underlying filesystem has not provided DIO alignment info\n",
+ __func__))
+ return false;
+ if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
+ "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
+ __func__, dio_blocksize, PAGE_SIZE))
+ return false;
+
+ /* Return early if IO is irreparably misaligned
+ * (len < dio_blocksize, or base not aligned).
+ */
+ if (unlikely(len < dio_blocksize) ||
+ ((base & (nf->nf_dio_mem_align-1)) != 0))
+ return false;
+
+ read_dio->start = round_down(offset, dio_blocksize);
+ read_dio->end = round_up(orig_end, dio_blocksize);
+ read_dio->start_extra = offset - read_dio->start;
+ read_dio->end_extra = read_dio->end - orig_end;
+
+ /* don't expand READ for IO less than 32K */
+ if ((read_dio->start_extra || read_dio->end_extra) && (len < (32 << 10))) {
+ init_nfsd_read_dio(read_dio);
+ return false;
+ }
+
+ if (read_dio->start_extra) {
+ read_dio->start_extra_page = alloc_page(GFP_KERNEL);
+ if (WARN_ONCE(read_dio->start_extra_page == NULL,
+ "%s: Unable to allocate start_extra_page\n", __func__)) {
+ init_nfsd_read_dio(read_dio);
+ return false;
+ }
+ }
+
+ /* Show original offset and count, and how it was expanded for DIO */
+ trace_nfsd_read_vector_dio(rqstp, fhp, offset, len,
+ read_dio->start, read_dio->start_extra,
+ read_dio->end, read_dio->end_extra);
+
+ return true;
+}
+
+static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
+ struct nfsd_read_dio *read_dio,
+ ssize_t bytes_read,
+ unsigned long bytes_expected,
+ loff_t *offset,
+ unsigned long *rq_bvec_numpages)
+{
+ ssize_t host_err = bytes_read;
+ loff_t v;
+
+ /* If nfsd_analyze_read_dio() allocated a start_extra_page it must
+ * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
+ */
+ if (read_dio->start_extra_page) {
+ __free_page(read_dio->start_extra_page);
+ *rq_bvec_numpages -= 1;
+ v = *rq_bvec_numpages;
+ memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
+ v * sizeof(struct bio_vec));
+ }
+ /* Eliminate any end_extra bytes from the last page */
+ v = *rq_bvec_numpages;
+ rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
+
+ if (host_err < 0)
+ return host_err;
+
+ /* nfsd_analyze_read_dio() may have expanded the start and end,
+ * if so adjust returned read size to reflect original extent.
+ */
+ *offset += read_dio->start_extra;
+ if (likely(host_err >= read_dio->start_extra)) {
+ host_err -= read_dio->start_extra;
+ if (host_err > bytes_expected)
+ host_err = bytes_expected;
+ } else {
+ /* Short read that didn't read any of requested data */
+ host_err = 0;
+ }
+
+ return host_err;
+}
+
/**
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
@@ -1095,45 +1206,75 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned int base, u32 *eof)
{
struct file *file = nf->nf_file;
- unsigned long v, total;
+ unsigned long v, total, in_count = *count;
+ struct nfsd_read_dio read_dio;
struct iov_iter iter;
struct kiocb kiocb;
- ssize_t host_err;
+ ssize_t host_err = 0;
size_t len;
+ init_nfsd_read_dio(&read_dio);
init_sync_kiocb(&kiocb, file);
+
+ /*
+ * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
+ * the next DIO-aligned block (on either end of the READ).
+ */
+ if (nfsd_io_cache_read == NFSD_IO_DIRECT) {
+ if (nfsd_analyze_read_dio(rqstp, fhp, nf, offset,
+ in_count, base, &read_dio)) {
+ /* trace_nfsd_read_vector() will reflect larger
+ * DIO-aligned READ.
+ */
+ offset = read_dio.start;
+ in_count = read_dio.end - offset;
+ kiocb.ki_flags = IOCB_DIRECT;
+ }
+ } else if (nfsd_io_cache_read == NFSD_IO_DONTCACHE)
+ kiocb.ki_flags = IOCB_DONTCACHE;
+
kiocb.ki_pos = offset;
v = 0;
- total = *count;
+ total = in_count;
+ if (read_dio.start_extra) {
+ bvec_set_page(&rqstp->rq_bvec[v++], read_dio.start_extra_page,
+ read_dio.start_extra, PAGE_SIZE - read_dio.start_extra);
+ total -= read_dio.start_extra;
+ }
while (total) {
len = min_t(size_t, total, PAGE_SIZE - base);
- bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
+ bvec_set_page(&rqstp->rq_bvec[v++], *(rqstp->rq_next_page++),
len, base);
total -= len;
- ++v;
base = 0;
}
- WARN_ON_ONCE(v > rqstp->rq_maxpages);
+ if (WARN_ONCE(v > rqstp->rq_maxpages,
+ "%s: v=%lu exceeds rqstp->rq_maxpages=%lu\n", __func__,
+ v, rqstp->rq_maxpages)) {
+ host_err = -EINVAL;
+ }
+
+ if (!host_err) {
+ trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
+ iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
- trace_nfsd_read_vector(rqstp, fhp, offset, *count);
- iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
+ /* Double check nfsd_analyze_read_dio's DIO-aligned result */
+ if (unlikely((kiocb.ki_flags & IOCB_DIRECT) &&
+ !iov_iter_is_aligned(&iter,
+ nf->nf_dio_mem_align - 1,
+ nf->nf_dio_read_offset_align - 1))) {
+ /* Fallback to buffered IO */
+ kiocb.ki_flags &= ~IOCB_DIRECT;
+ }
- switch (nfsd_io_cache_read) {
- case NFSD_IO_DIRECT:
- if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
- iov_iter_is_aligned(&iter, nf->nf_dio_mem_align - 1,
- nf->nf_dio_read_offset_align - 1))
- kiocb.ki_flags = IOCB_DIRECT;
- break;
- case NFSD_IO_DONTCACHE:
- kiocb.ki_flags = IOCB_DONTCACHE;
- break;
- case NFSD_IO_BUFFERED:
- break;
+ host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
}
- host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
+ if (read_dio.start_extra || read_dio.end_extra) {
+ host_err = nfsd_complete_misaligned_read_dio(rqstp, &read_dio,
+ host_err, *count, &offset, &v);
+ }
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 40cbe81360ed..ed5c1ce55d5c 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
* pages, one for the request, and one for the reply.
* nfsd_splice_actor() might need an extra page when a READ payload
* is not page-aligned.
+ * nfsd_iter_read() might need two extra pages when a READ payload
+ * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
+ * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
*/
static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
{
- return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
+ return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
}
/*
--
2.44.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
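For readers following patch 06 above, here is a minimal userspace sketch of the arithmetic behind the misaligned READ handling. It is not the kernel code: struct read_dio_sketch, expand_read() and trim_read_result() are hypothetical names, and the expansion rule (round both ends outward to a power-of-two alignment) is an assumption based on the "expand any misaligned READ to the next DIO-aligned block" comment. The trimming mirrors the result adjustment visible in nfsd_complete_misaligned_read_dio().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the fields of struct nfsd_read_dio. */
struct read_dio_sketch {
	uint64_t start;       /* READ start, rounded down to the alignment */
	uint64_t end;         /* READ end, rounded up to the alignment */
	uint32_t start_extra; /* bytes added in front of the requested offset */
	uint32_t end_extra;   /* bytes added behind the requested end */
};

/* Assumed expansion rule: round both ends outward to 'align' (a power of 2). */
static void expand_read(uint64_t offset, uint32_t len, uint32_t align,
			struct read_dio_sketch *r)
{
	uint64_t mask = (uint64_t)align - 1;

	assert(align != 0 && (align & (align - 1)) == 0);
	r->start = offset & ~mask;
	r->end = (offset + len + mask) & ~mask;
	r->start_extra = (uint32_t)(offset - r->start);
	r->end_extra = (uint32_t)(r->end - (offset + len));
}

/* Map the byte count of the expanded READ back to the original extent. */
static int64_t trim_read_result(const struct read_dio_sketch *r,
				int64_t bytes_read, uint32_t bytes_expected)
{
	if (bytes_read < 0)
		return bytes_read;	/* errors pass through unchanged */
	if (bytes_read < (int64_t)r->start_extra)
		return 0;		/* short read, none of the requested data */
	bytes_read -= r->start_extra;
	if (bytes_read > (int64_t)bytes_expected)
		bytes_read = bytes_expected;
	return bytes_read;
}
```

With a 4096-byte alignment, a 47008-byte READ at offset 47008 expands to the range 45056..94208, with 1952 extra bytes in front and 192 behind. A full read of the expanded range is then reported back as 47008 bytes.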
* [PATCH v5 07/13] nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local()
2025-07-24 19:30 [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Mike Snitzer
` (5 preceding siblings ...)
2025-07-24 19:30 ` [PATCH v5 06/13] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
@ 2025-07-24 19:30 ` Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 08/13] nfs/localio: make trace_nfs_local_open_fh more useful Mike Snitzer
` (7 subsequent siblings)
14 siblings, 0 replies; 22+ messages in thread
From: Mike Snitzer @ 2025-07-24 19:30 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
From: Mike Snitzer <snitzer@hammerspace.com>
Previously nfs_local_probe() was made to disable and then attempt to
re-enable LOCALIO (via LOCALIO protocol handshake) if/when it was
called and LOCALIO was already enabled.
My vague memory of _why_ this was the case is that it was useful
if/when a local NFS server was restarted with a local NFS client
connected to it.
But as it happens this causes an absurd amount of LOCALIO flapping
which has a side-effect of too much IO being needlessly sent to NFSD
(using RPC over the loopback network interface). This is the
definition of "serious performance loss" (that negates the point of
having LOCALIO).
So remove this mis-optimization for re-enabling LOCALIO if/when an NFS
server is restarted (which is an extremely rare thing to do). Will
revisit testing that scenario again but in the meantime this patch
restores the full benefit of LOCALIO.
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
---
fs/nfs/localio.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 510d0a16cfe9..ecfe22a105ea 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -180,10 +180,8 @@ static void nfs_local_probe(struct nfs_client *clp)
return;
}
- if (nfs_client_is_local(clp)) {
- /* If already enabled, disable and re-enable */
- nfs_localio_disable_client(clp);
- }
+ if (nfs_client_is_local(clp))
+ return;
if (!nfs_uuid_begin(&clp->cl_uuid))
return;
@@ -244,7 +242,8 @@ __nfs_local_open_fh(struct nfs_client *clp, const struct cred *cred,
case -ENOMEM:
case -ENXIO:
case -ENOENT:
- /* Revalidate localio, will disable if unsupported */
+ /* Revalidate localio */
+ nfs_localio_disable_client(clp);
nfs_local_probe(clp);
}
}
--
2.44.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 08/13] nfs/localio: make trace_nfs_local_open_fh more useful
2025-07-24 19:30 [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Mike Snitzer
` (6 preceding siblings ...)
2025-07-24 19:30 ` [PATCH v5 07/13] nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local() Mike Snitzer
@ 2025-07-24 19:30 ` Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 09/13] nfs/localio: add nfsd_file_dio_alignment Mike Snitzer
` (6 subsequent siblings)
14 siblings, 0 replies; 22+ messages in thread
From: Mike Snitzer @ 2025-07-24 19:30 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
From: Mike Snitzer <snitzer@hammerspace.com>
Always trigger trace event when LOCALIO opens a file.
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
---
fs/nfs/localio.c | 5 +++--
fs/nfs/nfstrace.h | 6 +++---
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index ecfe22a105ea..0b54f01299d2 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -231,13 +231,13 @@ __nfs_local_open_fh(struct nfs_client *clp, const struct cred *cred,
struct nfsd_file __rcu **pnf,
const fmode_t mode)
{
+ int status = 0;
struct nfsd_file *localio;
localio = nfs_open_local_fh(&clp->cl_uuid, clp->cl_rpcclient,
cred, fh, nfl, pnf, mode);
if (IS_ERR(localio)) {
- int status = PTR_ERR(localio);
- trace_nfs_local_open_fh(fh, mode, status);
+ status = PTR_ERR(localio);
switch (status) {
case -ENOMEM:
case -ENXIO:
@@ -247,6 +247,7 @@ __nfs_local_open_fh(struct nfs_client *clp, const struct cred *cred,
nfs_local_probe(clp);
}
}
+ trace_nfs_local_open_fh(fh, mode, status);
return localio;
}
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 7a058bd8c566..334e65d6bc72 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -1707,10 +1707,10 @@ TRACE_EVENT(nfs_local_open_fh,
),
TP_printk(
- "error=%d fhandle=0x%08x mode=%s",
- __entry->error,
+ "fhandle=0x%08x mode=%s result=%d",
__entry->fhandle,
- show_fs_fmode_flags(__entry->fmode)
+ show_fs_fmode_flags(__entry->fmode),
+ __entry->error
)
);
--
2.44.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 09/13] nfs/localio: add nfsd_file_dio_alignment
2025-07-24 19:30 [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Mike Snitzer
` (7 preceding siblings ...)
2025-07-24 19:30 ` [PATCH v5 08/13] nfs/localio: make trace_nfs_local_open_fh more useful Mike Snitzer
@ 2025-07-24 19:30 ` Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 10/13] nfs/localio: refactor iocb initialization Mike Snitzer
` (5 subsequent siblings)
14 siblings, 0 replies; 22+ messages in thread
From: Mike Snitzer @ 2025-07-24 19:30 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
From: Mike Snitzer <snitzer@hammerspace.com>
And use it to avoid issuing misaligned IO using O_DIRECT.
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
---
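As an illustration only, the offset/length part of the DIO alignment check added to nfs_local_iter_init() can be sketched in userspace C as below. The helper name dio_extent_is_aligned() is hypothetical. The memory-alignment half of the check, which the patch does with iov_iter_is_aligned(), is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if both the file offset and the byte count are multiples
 * of offset_align. A zero alignment is treated as "DIO alignment not
 * known", which never qualifies for direct I/O.
 */
static bool dio_extent_is_aligned(uint64_t offset, uint32_t count,
				  uint32_t offset_align)
{
	if (offset_align == 0)
		return false;
	/* The mask trick below only works for power-of-two alignments. */
	assert((offset_align & (offset_align - 1)) == 0);
	return ((offset | count) & (offset_align - 1)) == 0;
}
```

OR-ing offset and count lets a single mask test cover both values: any low bit set in either one makes the result non-zero.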
fs/nfs/localio.c | 26 ++++++++++++++++++++++----
fs/nfsd/localio.c | 11 +++++++++++
include/linux/nfslocalio.h | 2 ++
3 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 0b54f01299d2..0c48db38f74f 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -322,12 +322,10 @@ nfs_local_iocb_alloc(struct nfs_pgio_header *hdr,
return NULL;
}
+ init_sync_kiocb(&iocb->kiocb, file);
if (localio_O_DIRECT_semantics &&
- test_bit(NFS_IOHDR_ODIRECT, &hdr->flags)) {
- iocb->kiocb.ki_filp = file;
+ test_bit(NFS_IOHDR_ODIRECT, &hdr->flags))
iocb->kiocb.ki_flags = IOCB_DIRECT;
- } else
- init_sync_kiocb(&iocb->kiocb, file);
iocb->kiocb.ki_pos = hdr->args.offset;
iocb->hdr = hdr;
@@ -346,6 +344,26 @@ nfs_local_iter_init(struct iov_iter *i, struct nfs_local_kiocb *iocb, int dir)
hdr->args.count + hdr->args.pgbase);
if (hdr->args.pgbase != 0)
iov_iter_advance(i, hdr->args.pgbase);
+
+ if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
+ u32 nf_dio_mem_align, nf_dio_offset_align, nf_dio_read_offset_align;
+ /* Verify the IO is DIO-aligned as required */
+ nfs_to->nfsd_file_dio_alignment(iocb->localio, &nf_dio_mem_align,
+ &nf_dio_offset_align,
+ &nf_dio_read_offset_align);
+ if (dir == READ)
+ nf_dio_offset_align = nf_dio_read_offset_align;
+ /* direct I/O must be aligned to device logical sector size */
+ if (nf_dio_mem_align && nf_dio_offset_align &&
+ (((hdr->args.offset | hdr->args.count) & (nf_dio_offset_align-1)) == 0) &&
+ iov_iter_is_aligned(i, nf_dio_mem_align - 1,
+ nf_dio_offset_align - 1))
+ return 0;
+
+ /* Fallback to using buffered for this misaligned IO */
+ iocb->kiocb.ki_flags &= ~IOCB_DIRECT;
+ iocb->kiocb.ki_filp->f_flags &= ~O_DIRECT;
+ }
}
static void
diff --git a/fs/nfsd/localio.c b/fs/nfsd/localio.c
index 269fa9391dc4..be710d809a3b 100644
--- a/fs/nfsd/localio.c
+++ b/fs/nfsd/localio.c
@@ -117,12 +117,23 @@ nfsd_open_local_fh(struct net *net, struct auth_domain *dom,
return localio;
}
+static void nfsd_file_dio_alignment(struct nfsd_file *nf,
+ u32 *nf_dio_mem_align,
+ u32 *nf_dio_offset_align,
+ u32 *nf_dio_read_offset_align)
+{
+ *nf_dio_mem_align = nf->nf_dio_mem_align;
+ *nf_dio_offset_align = nf->nf_dio_offset_align;
+ *nf_dio_read_offset_align = nf->nf_dio_read_offset_align;
+}
+
static const struct nfsd_localio_operations nfsd_localio_ops = {
.nfsd_net_try_get = nfsd_net_try_get,
.nfsd_net_put = nfsd_net_put,
.nfsd_open_local_fh = nfsd_open_local_fh,
.nfsd_file_put_local = nfsd_file_put_local,
.nfsd_file_file = nfsd_file_file,
+ .nfsd_file_dio_alignment = nfsd_file_dio_alignment,
};
void nfsd_localio_ops_init(void)
diff --git a/include/linux/nfslocalio.h b/include/linux/nfslocalio.h
index 59ea90bd136b..3d91043254e6 100644
--- a/include/linux/nfslocalio.h
+++ b/include/linux/nfslocalio.h
@@ -64,6 +64,8 @@ struct nfsd_localio_operations {
const fmode_t);
struct net *(*nfsd_file_put_local)(struct nfsd_file __rcu **);
struct file *(*nfsd_file_file)(struct nfsd_file *);
+ void (*nfsd_file_dio_alignment)(struct nfsd_file *,
+ u32 *, u32 *, u32 *);
} ____cacheline_aligned;
extern void nfsd_localio_ops_init(void);
--
2.44.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 10/13] nfs/localio: refactor iocb initialization
2025-07-24 19:30 [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Mike Snitzer
` (8 preceding siblings ...)
2025-07-24 19:30 ` [PATCH v5 09/13] nfs/localio: add nfsd_file_dio_alignment Mike Snitzer
@ 2025-07-24 19:30 ` Mike Snitzer
2025-07-24 19:31 ` [PATCH v5 11/13] nfs/localio: fallback to NFSD for misaligned O_DIRECT READs Mike Snitzer
` (4 subsequent siblings)
14 siblings, 0 replies; 22+ messages in thread
From: Mike Snitzer @ 2025-07-24 19:30 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
From: Mike Snitzer <snitzer@hammerspace.com>
No functional change, but slightly cleaner.
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
---
fs/nfs/localio.c | 56 ++++++++++++++++++++++--------------------------
1 file changed, 26 insertions(+), 30 deletions(-)
diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 0c48db38f74f..9ce242454c66 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -282,23 +282,6 @@ nfs_local_open_fh(struct nfs_client *clp, const struct cred *cred,
}
EXPORT_SYMBOL_GPL(nfs_local_open_fh);
-static struct bio_vec *
-nfs_bvec_alloc_and_import_pagevec(struct page **pagevec,
- unsigned int npages, gfp_t flags)
-{
- struct bio_vec *bvec, *p;
-
- bvec = kmalloc_array(npages, sizeof(*bvec), flags);
- if (bvec != NULL) {
- for (p = bvec; npages > 0; p++, pagevec++, npages--) {
- p->bv_page = *pagevec;
- p->bv_len = PAGE_SIZE;
- p->bv_offset = 0;
- }
- }
- return bvec;
-}
-
static void
nfs_local_iocb_free(struct nfs_local_kiocb *iocb)
{
@@ -315,8 +298,9 @@ nfs_local_iocb_alloc(struct nfs_pgio_header *hdr,
iocb = kmalloc(sizeof(*iocb), flags);
if (iocb == NULL)
return NULL;
- iocb->bvec = nfs_bvec_alloc_and_import_pagevec(hdr->page_array.pagevec,
- hdr->page_array.npages, flags);
+
+ iocb->bvec = kmalloc_array(hdr->page_array.npages,
+ sizeof(struct bio_vec), flags);
if (iocb->bvec == NULL) {
kfree(iocb);
return NULL;
@@ -339,8 +323,22 @@ static void
nfs_local_iter_init(struct iov_iter *i, struct nfs_local_kiocb *iocb, int dir)
{
struct nfs_pgio_header *hdr = iocb->hdr;
+ struct page **pagevec = hdr->page_array.pagevec;
+ unsigned long v, total;
+ size_t len;
- iov_iter_bvec(i, dir, iocb->bvec, hdr->page_array.npages,
+ v = 0;
+ total = hdr->args.count + hdr->args.pgbase;
+ while (total) {
+ len = min_t(size_t, total, PAGE_SIZE);
+ bvec_set_page(&iocb->bvec[v], *(pagevec++),
+ len, 0);
+ total -= len;
+ ++v;
+ }
+ WARN_ON_ONCE(v != hdr->page_array.npages);
+
+ iov_iter_bvec(i, dir, iocb->bvec, v,
hdr->args.count + hdr->args.pgbase);
if (hdr->args.pgbase != 0)
iov_iter_advance(i, hdr->args.pgbase);
@@ -469,6 +467,10 @@ static void nfs_local_call_read(struct work_struct *work)
save_cred = override_creds(filp->f_cred);
nfs_local_iter_init(&iter, iocb, READ);
+ if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
+ iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
+ iocb->aio_complete_work = nfs_local_read_aio_complete_work;
+ }
status = filp->f_op->read_iter(&iocb->kiocb, &iter);
if (status != -EIOCBQUEUED) {
@@ -502,11 +504,6 @@ nfs_do_local_read(struct nfs_pgio_header *hdr,
nfs_local_pgio_init(hdr, call_ops);
hdr->res.eof = false;
- if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
- iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
- iocb->aio_complete_work = nfs_local_read_aio_complete_work;
- }
-
INIT_WORK(&iocb->work, nfs_local_call_read);
queue_work(nfslocaliod_workqueue, &iocb->work);
@@ -663,6 +660,10 @@ static void nfs_local_call_write(struct work_struct *work)
save_cred = override_creds(filp->f_cred);
nfs_local_iter_init(&iter, iocb, WRITE);
+ if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
+ iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
+ iocb->aio_complete_work = nfs_local_write_aio_complete_work;
+ }
file_start_write(filp);
status = filp->f_op->write_iter(&iocb->kiocb, &iter);
@@ -712,11 +713,6 @@ nfs_do_local_write(struct nfs_pgio_header *hdr,
nfs_set_local_verifier(hdr->inode, hdr->res.verf, hdr->args.stable);
- if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
- iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
- iocb->aio_complete_work = nfs_local_write_aio_complete_work;
- }
-
INIT_WORK(&iocb->work, nfs_local_call_write);
queue_work(nfslocaliod_workqueue, &iocb->work);
--
2.44.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 11/13] nfs/localio: fallback to NFSD for misaligned O_DIRECT READs
2025-07-24 19:30 [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Mike Snitzer
` (9 preceding siblings ...)
2025-07-24 19:30 ` [PATCH v5 10/13] nfs/localio: refactor iocb initialization Mike Snitzer
@ 2025-07-24 19:31 ` Mike Snitzer
2025-07-24 19:31 ` [PATCH v5 12/13] nfs/direct: add misaligned READ handling Mike Snitzer
` (3 subsequent siblings)
14 siblings, 0 replies; 22+ messages in thread
From: Mike Snitzer @ 2025-07-24 19:31 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
LOCALIO is left enabled; it just isn't used for any misaligned
O_DIRECT READ of 32K or larger.
If LOCALIO determines that an O_DIRECT READ is misaligned then it
makes sense to immediately issue the READ remotely via NFSD (which has
the ability to expand a misaligned O_DIRECT READ to be DIO-aligned)
if/when NFSD is configured to use O_DIRECT for READ IO with:
echo 3 > /sys/kernel/debug/nfsd/io_cache_read
This change in behavior for LOCALIO's O_DIRECT support really should
be dependent on NFSD running with io_cache_read=3 but there isn't an
interface to check for that (currently). This fallback is sub-optimal
due to resorting to using RPC and will only serve as a last resort
if/when NFS client's O_DIRECT support isn't able to align misaligned
IO (support will be added in subsequent patches).
Add 'localio_O_DIRECT_align_misaligned_IO' modparm, which depends on
localio_O_DIRECT_semantics=Y, to control if LOCALIO will make best
effort to transform misaligned IO to DIO-aligned (e.g. expanding
misaligned READ to DIO-aligned).
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
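A rough userspace sketch of the resulting decision for an IO that was submitted with O_DIRECT semantics. The enum, the constant name and choose_path() are hypothetical and only summarize the branches in nfs_local_iter_init(). In the patch itself the "remote" outcome is signalled by returning -ENOSYS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the three ways the IO can be serviced. */
enum localio_path_sketch {
	PATH_LOCALIO_DIRECT,	/* DIO-aligned: LOCALIO issues it with O_DIRECT */
	PATH_LOCALIO_BUFFERED,	/* misaligned: LOCALIO falls back to buffered IO */
	PATH_REMOTE_NFSD,	/* misaligned READ sent to NFSD over RPC */
};

/* Same threshold as the (32 << 10) test in the patch. */
#define MISALIGNED_READ_REMOTE_MIN (32u << 10)

static enum localio_path_sketch
choose_path(bool is_read, bool dio_aligned, bool align_misaligned_io,
	    uint32_t count)
{
	/* nfs_local_doio() returns early for zero-length IO. */
	assert(count != 0);

	if (dio_aligned)
		return PATH_LOCALIO_DIRECT;
	if (align_misaligned_io && is_read &&
	    count >= MISALIGNED_READ_REMOTE_MIN)
		return PATH_REMOTE_NFSD;
	return PATH_LOCALIO_BUFFERED;
}
```

Only READs take the remote path. A misaligned WRITE, or a misaligned READ below the threshold, still falls back to buffered IO through LOCALIO.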
fs/nfs/flexfilelayout/flexfilelayout.c | 1 +
fs/nfs/internal.h | 9 +-
fs/nfs/localio.c | 139 +++++++++++++++++--------
fs/nfs/pagelist.c | 15 ++-
4 files changed, 113 insertions(+), 51 deletions(-)
diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c
index 4bea008dbebd..fbcf7c5ac118 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.c
+++ b/fs/nfs/flexfilelayout/flexfilelayout.c
@@ -1911,6 +1911,7 @@ ff_layout_read_pagelist(struct nfs_pgio_header *hdr)
localio = ff_local_open_fh(lseg, idx, ds->ds_clp, ds_cred, fh, FMODE_READ);
if (localio) {
hdr->task.tk_start = ktime_get();
+ // FIXME: if fallback occurs is this stats start bogus?
ff_layout_read_record_layoutstats_start(&hdr->task, hdr);
}
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 69c2c10ee658..f54030684c97 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -463,13 +463,14 @@ extern struct nfsd_file *nfs_local_open_fh(struct nfs_client *,
struct nfs_file_localio *,
const fmode_t);
extern int nfs_local_doio(struct nfs_client *,
- struct nfsd_file *,
+ struct nfsd_file **,
struct nfs_pgio_header *,
const struct rpc_call_ops *);
extern int nfs_local_commit(struct nfsd_file *,
struct nfs_commit_data *,
const struct rpc_call_ops *, int);
extern bool nfs_server_is_local(const struct nfs_client *clp);
+extern bool nfs_localio_O_DIRECT_align_misaligned_IO(void);
#else /* CONFIG_NFS_LOCALIO */
static inline void nfs_local_probe(struct nfs_client *clp) {}
@@ -482,7 +483,7 @@ nfs_local_open_fh(struct nfs_client *clp, const struct cred *cred,
return NULL;
}
static inline int nfs_local_doio(struct nfs_client *clp,
- struct nfsd_file *localio,
+ struct nfsd_file **localio,
struct nfs_pgio_header *hdr,
const struct rpc_call_ops *call_ops)
{
@@ -498,6 +499,10 @@ static inline bool nfs_server_is_local(const struct nfs_client *clp)
{
return false;
}
+static inline bool nfs_localio_O_DIRECT_align_misaligned_IO(void)
+{
+ return false;
+}
#endif /* CONFIG_NFS_LOCALIO */
/* super.c */
diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 9ce242454c66..f61cfe42d745 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -36,6 +36,7 @@ struct nfs_local_kiocb {
struct nfs_pgio_header *hdr;
struct work_struct work;
void (*aio_complete_work)(struct work_struct *);
+ struct iov_iter iter ____cacheline_aligned;
struct nfsd_file *localio;
};
@@ -54,6 +55,12 @@ module_param(localio_O_DIRECT_semantics, bool, 0644);
MODULE_PARM_DESC(localio_O_DIRECT_semantics,
"LOCALIO will use O_DIRECT semantics to filesystem.");
+static bool localio_O_DIRECT_align_misaligned_IO __read_mostly = true;
+module_param(localio_O_DIRECT_align_misaligned_IO, bool, 0644);
+/* This feature also depends on: echo 2 > /sys/kernel/debug/nfsd/io_cache_read */
+MODULE_PARM_DESC(localio_O_DIRECT_align_misaligned_IO,
+ "If LOCALIO_O_DIRECT_semantics=Y make best effort to transform misaligned IO to DIO-aligned.");
+
static inline bool nfs_client_is_local(const struct nfs_client *clp)
{
return !!rcu_access_pointer(clp->cl_uuid.net);
@@ -65,6 +72,12 @@ bool nfs_server_is_local(const struct nfs_client *clp)
}
EXPORT_SYMBOL_GPL(nfs_server_is_local);
+bool nfs_localio_O_DIRECT_align_misaligned_IO(void)
+{
+ return localio_O_DIRECT_align_misaligned_IO;
+}
+EXPORT_SYMBOL_GPL(nfs_localio_O_DIRECT_align_misaligned_IO);
+
/*
* UUID_IS_LOCAL XDR functions
*/
@@ -319,8 +332,8 @@ nfs_local_iocb_alloc(struct nfs_pgio_header *hdr,
return iocb;
}
-static void
-nfs_local_iter_init(struct iov_iter *i, struct nfs_local_kiocb *iocb, int dir)
+static int
+nfs_local_iter_init(struct iov_iter *i, struct nfs_local_kiocb *iocb, int rw)
{
struct nfs_pgio_header *hdr = iocb->hdr;
struct page **pagevec = hdr->page_array.pagevec;
@@ -338,7 +351,7 @@ nfs_local_iter_init(struct iov_iter *i, struct nfs_local_kiocb *iocb, int dir)
}
WARN_ON_ONCE(v != hdr->page_array.npages);
- iov_iter_bvec(i, dir, iocb->bvec, v,
+ iov_iter_bvec(i, rw, iocb->bvec, v,
hdr->args.count + hdr->args.pgbase);
if (hdr->args.pgbase != 0)
iov_iter_advance(i, hdr->args.pgbase);
@@ -349,7 +362,7 @@ nfs_local_iter_init(struct iov_iter *i, struct nfs_local_kiocb *iocb, int dir)
nfs_to->nfsd_file_dio_alignment(iocb->localio, &nf_dio_mem_align,
&nf_dio_offset_align,
&nf_dio_read_offset_align);
- if (dir == READ)
+ if (rw == ITER_DEST)
nf_dio_offset_align = nf_dio_read_offset_align;
/* direct I/O must be aligned to device logical sector size */
if (nf_dio_mem_align && nf_dio_offset_align &&
@@ -358,10 +371,21 @@ nfs_local_iter_init(struct iov_iter *i, struct nfs_local_kiocb *iocb, int dir)
nf_dio_offset_align - 1))
return 0;
+ /* Only send misaligned READ to NFSD if 32K or larger */
+ if (localio_O_DIRECT_align_misaligned_IO &&
+ (rw == ITER_DEST) && (hdr->args.count >= (32 << 10))) {
+ /*
+ * Fallback to sending this READ to NFSD since it
+ * can expand misaligned READ IO to be DIO-aligned.
+ */
+ return -ENOSYS;
+ }
/* Fallback to using buffered for this misaligned IO */
iocb->kiocb.ki_flags &= ~IOCB_DIRECT;
iocb->kiocb.ki_filp->f_flags &= ~O_DIRECT;
}
+
+ return 0;
}
static void
@@ -394,13 +418,18 @@ nfs_local_pgio_done(struct nfs_pgio_header *hdr, long status)
}
}
-static void
-nfs_local_pgio_release(struct nfs_local_kiocb *iocb)
+static void nfs_local_iocb_release(struct nfs_local_kiocb *iocb)
{
- struct nfs_pgio_header *hdr = iocb->hdr;
-
nfs_local_file_put(iocb->localio);
nfs_local_iocb_free(iocb);
+}
+
+static void
+nfs_local_pgio_release(struct nfs_local_kiocb *iocb)
+{
+ struct nfs_pgio_header *hdr = iocb->hdr;
+
+ nfs_local_iocb_release(iocb);
nfs_local_hdr_release(hdr, hdr->task.tk_ops);
}
@@ -461,18 +490,16 @@ static void nfs_local_call_read(struct work_struct *work)
container_of(work, struct nfs_local_kiocb, work);
struct file *filp = iocb->kiocb.ki_filp;
const struct cred *save_cred;
- struct iov_iter iter;
ssize_t status;
save_cred = override_creds(filp->f_cred);
- nfs_local_iter_init(&iter, iocb, READ);
if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
iocb->aio_complete_work = nfs_local_read_aio_complete_work;
}
- status = filp->f_op->read_iter(&iocb->kiocb, &iter);
+ status = filp->f_op->read_iter(&iocb->kiocb, &iocb->iter);
if (status != -EIOCBQUEUED) {
nfs_local_read_done(iocb, status);
nfs_local_pgio_release(iocb);
@@ -482,25 +509,14 @@ static void nfs_local_call_read(struct work_struct *work)
}
static int
-nfs_do_local_read(struct nfs_pgio_header *hdr,
- struct nfsd_file *localio,
+nfs_local_do_read(struct nfs_local_kiocb *iocb,
const struct rpc_call_ops *call_ops)
{
- struct nfs_local_kiocb *iocb;
- struct file *file = nfs_to->nfsd_file_file(localio);
-
- /* Don't support filesystems without read_iter */
- if (!file->f_op->read_iter)
- return -EAGAIN;
+ struct nfs_pgio_header *hdr = iocb->hdr;
dprintk("%s: vfs_read count=%u pos=%llu\n",
__func__, hdr->args.count, hdr->args.offset);
- iocb = nfs_local_iocb_alloc(hdr, file, GFP_KERNEL);
- if (iocb == NULL)
- return -ENOMEM;
- iocb->localio = localio;
-
nfs_local_pgio_init(hdr, call_ops);
hdr->res.eof = false;
@@ -653,20 +669,18 @@ static void nfs_local_call_write(struct work_struct *work)
struct file *filp = iocb->kiocb.ki_filp;
unsigned long old_flags = current->flags;
const struct cred *save_cred;
- struct iov_iter iter;
ssize_t status;
current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
save_cred = override_creds(filp->f_cred);
- nfs_local_iter_init(&iter, iocb, WRITE);
if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
iocb->aio_complete_work = nfs_local_write_aio_complete_work;
}
file_start_write(filp);
- status = filp->f_op->write_iter(&iocb->kiocb, &iter);
+ status = filp->f_op->write_iter(&iocb->kiocb, &iocb->iter);
file_end_write(filp);
if (status != -EIOCBQUEUED) {
nfs_local_write_done(iocb, status);
@@ -679,26 +693,15 @@ static void nfs_local_call_write(struct work_struct *work)
}
static int
-nfs_do_local_write(struct nfs_pgio_header *hdr,
- struct nfsd_file *localio,
+nfs_local_do_write(struct nfs_local_kiocb *iocb,
const struct rpc_call_ops *call_ops)
{
- struct nfs_local_kiocb *iocb;
- struct file *file = nfs_to->nfsd_file_file(localio);
-
- /* Don't support filesystems without write_iter */
- if (!file->f_op->write_iter)
- return -EAGAIN;
+ struct nfs_pgio_header *hdr = iocb->hdr;
dprintk("%s: vfs_write count=%u pos=%llu %s\n",
__func__, hdr->args.count, hdr->args.offset,
(hdr->args.stable == NFS_UNSTABLE) ? "unstable" : "stable");
- iocb = nfs_local_iocb_alloc(hdr, file, GFP_NOIO);
- if (iocb == NULL)
- return -ENOMEM;
- iocb->localio = localio;
-
switch (hdr->args.stable) {
default:
break;
@@ -719,32 +722,78 @@ nfs_do_local_write(struct nfs_pgio_header *hdr,
return 0;
}
-int nfs_local_doio(struct nfs_client *clp, struct nfsd_file *localio,
+static struct nfs_local_kiocb *
+nfs_local_iocb_init(struct nfs_pgio_header *hdr, struct nfsd_file **localio)
+{
+ struct file *file = nfs_to->nfsd_file_file(*localio);
+ struct nfs_local_kiocb *iocb;
+ gfp_t gfp_mask;
+ int rw, status;
+
+ if (hdr->rw_mode & FMODE_READ) {
+ if (!file->f_op->read_iter)
+ return ERR_PTR(-EOPNOTSUPP);
+ gfp_mask = GFP_KERNEL;
+ rw = ITER_DEST;
+ } else {
+ if (!file->f_op->write_iter)
+ return ERR_PTR(-EOPNOTSUPP);
+ gfp_mask = GFP_NOIO;
+ rw = ITER_SOURCE;
+ }
+
+ iocb = nfs_local_iocb_alloc(hdr, file, gfp_mask);
+ if (iocb == NULL)
+ return ERR_PTR(-ENOMEM);
+ iocb->hdr = hdr;
+ iocb->localio = *localio;
+
+ status = nfs_local_iter_init(&iocb->iter, iocb, rw);
+ if (status == -ENOSYS) {
+ /* close nfsd_file and clear localio,
+ * this informs callers that IO should
+ * be serviced remotely.
+ */
+ nfs_local_iocb_release(iocb);
+ *localio = NULL;
+ return ERR_PTR(status);
+ }
+ WARN_ON_ONCE(status != 0);
+
+ return iocb;
+}
+
+int nfs_local_doio(struct nfs_client *clp, struct nfsd_file **localio,
struct nfs_pgio_header *hdr,
const struct rpc_call_ops *call_ops)
{
+ struct nfs_local_kiocb *iocb;
int status = 0;
if (!hdr->args.count)
return 0;
+ iocb = nfs_local_iocb_init(hdr, localio);
+ if (IS_ERR(iocb))
+ return PTR_ERR(iocb);
+
switch (hdr->rw_mode) {
case FMODE_READ:
- status = nfs_do_local_read(hdr, localio, call_ops);
+ status = nfs_local_do_read(iocb, call_ops);
break;
case FMODE_WRITE:
- status = nfs_do_local_write(hdr, localio, call_ops);
+ status = nfs_local_do_write(iocb, call_ops);
break;
default:
dprintk("%s: invalid mode: %d\n", __func__,
hdr->rw_mode);
- status = -EINVAL;
+ status = -EOPNOTSUPP;
}
if (status != 0) {
if (status == -EAGAIN)
nfs_localio_disable_client(clp);
- nfs_local_file_put(localio);
+ nfs_local_iocb_release(iocb);
hdr->task.tk_status = status;
nfs_local_hdr_release(hdr, call_ops);
}
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 11968dcb7243..9ddff27e96e9 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -762,9 +762,17 @@ int nfs_initiate_pgio(struct rpc_clnt *clnt, struct nfs_pgio_header *hdr,
hdr->args.count,
(unsigned long long)hdr->args.offset);
- if (localio)
- return nfs_local_doio(NFS_SERVER(hdr->inode)->nfs_client,
- localio, hdr, call_ops);
+ if (localio) {
+ int status = nfs_local_doio(NFS_SERVER(hdr->inode)->nfs_client,
+ &localio, hdr, call_ops);
+ /* nfs_local_doio() will clear localio and return -ENOSYS if
+ * it is prudent to immediately service this IO remotely.
+ */
+ if (status != -ENOSYS)
+ return status;
+ WARN_ON_ONCE(localio != NULL);
+ /* fallthrough */
+ }
task = rpc_run_task(&task_setup_data);
if (IS_ERR(task))
@@ -959,7 +967,6 @@ static int nfs_generic_pg_pgios(struct nfs_pageio_descriptor *desc)
ret = nfs_generic_pgio(desc, hdr);
if (ret == 0) {
struct nfs_client *clp = NFS_SERVER(hdr->inode)->nfs_client;
-
struct nfsd_file *localio =
nfs_local_open_fh(clp, hdr->cred, hdr->args.fh,
&hdr->args.context->nfl,
--
2.44.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 12/13] nfs/direct: add misaligned READ handling
2025-07-24 19:30 [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Mike Snitzer
` (10 preceding siblings ...)
2025-07-24 19:31 ` [PATCH v5 11/13] nfs/localio: fallback to NFSD for misaligned O_DIRECT READs Mike Snitzer
@ 2025-07-24 19:31 ` Mike Snitzer
2025-07-24 19:31 ` [PATCH v5 13/13] nfs/direct: add misaligned WRITE handling Mike Snitzer
` (2 subsequent siblings)
14 siblings, 0 replies; 22+ messages in thread
From: Mike Snitzer @ 2025-07-24 19:31 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
Because the NFS client will already happily handle misaligned O_DIRECT
IO (by sending it out to NFSD via RPC) this commit's new capabilities
are for the benefit of LOCALIO and require the nfs modparam:
localio_O_DIRECT_align_misaligned_IO=Y
When enabled, misaligned READ IO is expanded to consist of a
DIO-aligned extent followed by a single misaligned tail page (due to
it being a partial page).
Also add an nfs_analyze_dio trace event that shows how the NFS client
split a given misaligned IO into a mix of misaligned page(s) and a
DIO-aligned extent.
This combination of trace events is useful for LOCALIO READs:
echo 1 > /sys/kernel/tracing/events/nfs/nfs_analyze_dio/enable
echo 1 > /sys/kernel/tracing/events/nfs/nfs_initiate_read/enable
echo 1 > /sys/kernel/tracing/events/nfs/nfs_readpage_done/enable
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
Which for this dd command:
dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct
Results in:
dd-63258 [002] ..... 83742.428577: nfs_analyze_dio: READ offset=0 len=47008 start=0+0 middle=0+45056 end=45056+1952
dd-63258 [002] ..... 83742.428591: nfs_initiate_read: fileid=00:2e:219750 fhandle=0xf6927a01 offset=0 count=45056
kworker/u193:3-62985 [011] ..... 83742.428594: xfs_file_direct_read: dev 259:22 ino 0x5e0000a3 disize 0x16f40 pos 0x0 bytecount 0xb000
dd-63258 [002] ..... 83742.428595: nfs_initiate_read: fileid=00:2e:219750 fhandle=0xf6927a01 offset=45056 count=1952
kworker/u193:4-63221 [004] ..... 83742.428598: nfs_readpage_done: error=0 fileid=00:2e:219750 fhandle=0xf6927a01 offset=45056 count=1952 res=1952
kworker/u193:4-63221 [004] ..... 83742.428613: nfs_readpage_done: error=0 fileid=00:2e:219750 fhandle=0xf6927a01 offset=0 count=45056 res=45056
dd-63258 [002] ..... 83742.428619: nfs_analyze_dio: READ offset=47008 len=47008 start=45056+1952 middle=47008+43104 end=90112+3904
dd-63258 [002] ..... 83742.428622: nfs_initiate_read: fileid=00:2e:219750 fhandle=0xf6927a01 offset=45056 count=45056
dd-63258 [002] ..... 83742.428624: nfs_initiate_read: fileid=00:2e:219750 fhandle=0xf6927a01 offset=90112 count=3904
kworker/u193:4-63221 [004] ..... 83742.428624: xfs_file_direct_read: dev 259:22 ino 0x5e0000a3 disize 0x16f40 pos 0xb000 bytecount 0xb000
kworker/u193:3-62985 [011] ..... 83742.428628: nfs_readpage_done: error=0 fileid=00:2e:219750 fhandle=0xf6927a01 offset=90112 count=3904 res=3904 eof
kworker/u193:3-62985 [011] ..... 83742.428642: nfs_readpage_done: error=0 fileid=00:2e:219750 fhandle=0xf6927a01 offset=45056 count=45056 res=45056
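The start/middle/end values in the nfs_analyze_dio lines above can be
reproduced with a minimal user-space sketch of the same arithmetic. It
assumes a 4096-byte DIO block size (the patch hardcodes PAGE_SIZE); the
struct read_split and analyze_read() names are illustrative only and do
not exist in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the values the patch computes. */
struct read_split {
    uint64_t start;      /* DIO-aligned start of the expanded READ */
    uint64_t front_pad;  /* extra bytes read ahead of the caller's offset */
    uint64_t middle_off; /* caller's original offset */
    uint64_t middle_len; /* bytes from offset up to the last aligned boundary */
    uint64_t end_off;    /* last aligned boundary at or before the end */
    uint64_t end_len;    /* misaligned tail (partial page) */
};

/* Returns false when no expansion applies: the IO is shorter than one
 * block, or offset and len are both already block-aligned.
 */
static bool analyze_read(uint64_t offset, uint32_t len, uint32_t blocksize,
                         struct read_split *s)
{
    uint64_t mask = (uint64_t)blocksize - 1;
    uint64_t orig_end = offset + len;

    /* The masking below only works for power-of-two block sizes. */
    assert(blocksize != 0 && (blocksize & mask) == 0);

    if (len < blocksize || ((offset | len) & mask) == 0)
        return false;

    s->start = offset & ~mask;          /* round_down(offset) */
    s->front_pad = offset - s->start;
    s->middle_off = offset;
    s->end_off = orig_end & ~mask;      /* round_down(offset + len) */
    s->middle_len = s->end_off - offset;
    s->end_len = orig_end - s->end_off;
    return true;
}
```

For the second dd READ (offset=47008 len=47008) this gives
start=45056+1952 middle=47008+43104 end=90112+3904. The front pad plus
the middle is 45056 bytes, which matches the nfs_initiate_read at
offset=45056 count=45056, followed by the 3904-byte tail at 90112.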
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfs/direct.c | 178 ++++++++++++++++++++++++++++++++++++---
fs/nfs/internal.h | 7 ++
fs/nfs/nfstrace.h | 41 +++++++++
fs/nfs/pagelist.c | 7 ++
include/linux/nfs_page.h | 1 +
5 files changed, 223 insertions(+), 11 deletions(-)
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 48d89716193a..4e1e668eaa1f 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -210,6 +210,13 @@ static void nfs_direct_req_free(struct kref *kref)
nfs_put_lock_context(dreq->l_ctx);
if (dreq->ctx != NULL)
put_nfs_open_context(dreq->ctx);
+
+ if (dreq->start_extra_bvec != NULL) {
+ if (dreq->start_extra_bvec->bv_page != NULL)
+ __free_page(dreq->start_extra_bvec->bv_page);
+ kfree(dreq->start_extra_bvec);
+ }
+
kmem_cache_free(nfs_direct_cachep, dreq);
}
@@ -264,6 +271,10 @@ static void nfs_direct_complete(struct nfs_direct_req *dreq)
if (dreq->count != 0) {
res = (long) dreq->count;
WARN_ON_ONCE(dreq->count < 0);
+ /* Reduce res by front_pad */
+ if ((dreq->start_extra_bvec != NULL) &&
+ res >= dreq->start_extra_bvec->bv_len)
+ res -= dreq->start_extra_bvec->bv_len;
}
dreq->iocb->ki_complete(dreq->iocb, res);
}
@@ -285,6 +296,15 @@ static void nfs_direct_read_completion(struct nfs_pgio_header *hdr)
}
nfs_direct_count_bytes(dreq, hdr);
+
+ if (dreq->start_extra_bvec != NULL && (dreq->count == dreq->max_count)) {
+ unsigned front_pad = dreq->start_extra_bvec->bv_len;
+
+ hdr->res.count -= front_pad;
+ hdr->good_bytes -= front_pad;
+ hdr->args.count -= front_pad;
+ hdr->args.offset += front_pad;
+ }
spin_unlock(&dreq->lock);
nfs_update_delegated_atime(dreq->inode);
@@ -353,6 +373,30 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
desc.pg_dreq = dreq;
inode_dio_begin(inode);
+ if (dreq->start_extra_bvec != NULL) {
+ struct nfs_page *req;
+ size_t pgbase = dreq->start_extra_bvec->bv_offset;
+ unsigned int front_pad = dreq->start_extra_bvec->bv_len;
+
+ /* Must force start pos to DIO-aligned start */
+ WARN_ON(pos != dreq->io_start);
+ req = nfs_page_create_from_page(dreq->ctx,
+ dreq->start_extra_bvec->bv_page,
+ pgbase, pos, front_pad);
+ if (IS_ERR(req)) {
+ result = PTR_ERR(req);
+ goto out;
+ }
+ if (!nfs_pageio_add_request(&desc, req)) {
+ result = desc.pg_error;
+ nfs_release_request(req);
+ goto out;
+ }
+
+ requested_bytes += front_pad;
+ pos += front_pad;
+ }
+
while (iov_iter_count(iter)) {
struct page **pagevec;
size_t bytes;
@@ -363,12 +407,19 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
rsize, &pgbase);
if (result < 0)
break;
-
- bytes = result;
- npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
+
+ /* Limit the first batch of pages to DIO-aligned boundary? */
+ if (pos < dreq->end_offset && dreq->middle_len)
+ bytes = min_t(size_t, dreq->middle_len, result);
+ else
+ bytes = result;
+ npages = (bytes + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
+
for (i = 0; i < npages; i++) {
struct nfs_page *req;
unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
+ bool issue_dio_now = false;
+
/* XXX do we need to do the eof zeroing found in async_filler? */
req = nfs_page_create_from_page(dreq->ctx, pagevec[i],
pgbase, pos, req_len);
@@ -376,15 +427,33 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
result = PTR_ERR(req);
break;
}
+
+ pgbase = 0;
+ result -= req_len;
+ bytes -= req_len;
+ requested_bytes += req_len;
+ pos += req_len;
+
+ /* Looking ahead, is this req the end of the DIO-aligned middle? */
+ if (bytes == 0 && dreq->end_len &&
+ pos == dreq->end_offset && result == dreq->end_len) {
+ desc.pg_doio_now = 1;
+ issue_dio_now = true;
+ /* Reset iter to the last page (known misaligned),
+ * issue previous DIO-aligned page and then handle
+ * the last partial page stored in iter
+ */
+ iov_iter_revert(iter, result);
+ }
+
if (!nfs_pageio_add_request(&desc, req)) {
result = desc.pg_error;
nfs_release_request(req);
break;
}
- pgbase = 0;
- bytes -= req_len;
- requested_bytes += req_len;
- pos += req_len;
+
+ if (issue_dio_now)
+ break;
}
nfs_direct_release_pages(pagevec, npages);
kvfree(pagevec);
@@ -398,6 +467,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
* If no bytes were started, return the error, and let the
* generic layer handle the completion.
*/
+out:
if (requested_bytes == 0) {
inode_dio_end(inode);
nfs_direct_req_release(dreq);
@@ -409,6 +479,70 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
return requested_bytes;
}
+/*
+ * If localio_O_DIRECT_align_misaligned_IO is enabled, expand any
+ * misaligned READ to include the previous DIO-aligned block.
+ * - FIXME: expanding the end to also be DIO-aligned requires a
+ * bounce page that must be copied to original partial end page.
+ */
+static bool nfs_analyze_read_dio(loff_t offset, __u32 len,
+ struct nfs_direct_req *dreq)
+{
+#if IS_ENABLED(CONFIG_NFS_LOCALIO)
+ /* Hardcoded to PAGE_SIZE (since don't have LOCALIO nfsd_file's
+ * dio_alignment), works for smaller alignment too (e.g. 512b).
+ */
+ u32 dio_blocksize = PAGE_SIZE;
+ loff_t start, front_pad, orig_end, middle_end;
+
+ /* Return early if feature disabled, if IO is irreparably
+ * misaligned (len < PAGE_SIZE) or if IO is already DIO-aligned.
+ */
+ if (!nfs_localio_O_DIRECT_align_misaligned_IO() ||
+ unlikely(len < dio_blocksize) ||
+ (((offset | len) & (dio_blocksize-1)) == 0))
+ return false;
+
+ start = round_down(offset, dio_blocksize);
+ front_pad = offset - start;
+ orig_end = offset + len;
+ middle_end = round_down(orig_end, dio_blocksize);
+
+ if (front_pad) {
+ gfp_t gfp_mask = nfs_io_gfp_mask();
+
+ dreq->start_extra_bvec = kmalloc(sizeof(struct bio_vec), gfp_mask);
+ if (dreq->start_extra_bvec == NULL)
+ return false;
+ dreq->start_extra_bvec->bv_page = alloc_page(gfp_mask);
+ if (dreq->start_extra_bvec->bv_page == NULL) {
+ kfree(dreq->start_extra_bvec);
+ dreq->start_extra_bvec = NULL;
+ return false;
+ }
+
+ bvec_set_page(dreq->start_extra_bvec,
+ dreq->start_extra_bvec->bv_page,
+ front_pad, PAGE_SIZE - front_pad);
+ }
+
+ dreq->middle_offset = offset;
+ dreq->middle_len = middle_end - offset;
+ dreq->end_offset = middle_end;
+ dreq->end_len = orig_end - middle_end;
+
+ dreq->io_start = start;
+ dreq->max_count = orig_end - start;
+
+ trace_nfs_analyze_dio(READ, offset, len, start, front_pad,
+ dreq->middle_offset, dreq->middle_len,
+ dreq->end_offset, dreq->end_len);
+ return true;
+#else
+ return false;
+#endif
+}
+
/**
* nfs_file_direct_read - file direct read operation for NFS files
* @iocb: target I/O control block
@@ -439,6 +573,9 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
struct nfs_lock_context *l_ctx;
ssize_t result, requested;
size_t count = iov_iter_count(iter);
+ size_t in_count = count;
+ unsigned int front_pad = 0;
+
nfs_add_stats(mapping->host, NFSIOS_DIRECTREADBYTES, count);
dfprintk(FILE, "NFS: direct read(%pD2, %zd@%Ld)\n",
@@ -455,9 +592,20 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
if (dreq == NULL)
goto out;
+ if (!swap && nfs_analyze_read_dio(iocb->ki_pos, count, dreq)) {
+ /* note that dreq values do include front_pad
+ * (dreq->io_start -> dreq->start_extra_bvec->bv_offset)
+ */
+ iocb->ki_pos = dreq->io_start;
+ count = dreq->max_count;
+ if (dreq->start_extra_bvec)
+ front_pad = dreq->start_extra_bvec->bv_len;
+ } else {
+ dreq->io_start = iocb->ki_pos;
+ dreq->max_count = count;
+ }
+
dreq->inode = inode;
- dreq->max_count = count;
- dreq->io_start = iocb->ki_pos;
dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
l_ctx = nfs_get_lock_context(dreq->ctx);
if (IS_ERR(l_ctx)) {
@@ -483,16 +631,24 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
}
}
- NFS_I(inode)->read_io += count;
+ NFS_I(inode)->read_io += in_count;
requested = nfs_direct_read_schedule_iovec(dreq, iter, iocb->ki_pos);
if (!swap)
nfs_end_io_direct(inode);
if (requested > 0) {
+ if (front_pad) {
+ /* given the iov_iter_revert below, must exclude the
+ * front_pad (dreq->start_extra_bvec) from requested.
+ */
+ requested -= front_pad;
+ }
+
result = nfs_direct_wait(dreq);
if (result > 0) {
- requested -= result;
+ if (front_pad && result >= front_pad)
+ result -= front_pad;
iocb->ki_pos += result;
}
iov_iter_revert(iter, requested);
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index f54030684c97..06a15bf08357 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -990,4 +990,11 @@ struct nfs_direct_req {
/* for read */
#define NFS_ODIRECT_SHOULD_DIRTY (3) /* dirty user-space page after read */
#define NFS_ODIRECT_DONE INT_MAX /* write verification failed */
+
+ /* State for expanding misaligned IO to be DIO-aligned (for LOCALIO) */
+ struct bio_vec * start_extra_bvec;
+ loff_t middle_offset; /* Offset for start of DIO-aligned middle */
+ loff_t end_offset; /* Offset for start of DIO-aligned end */
+ ssize_t middle_len; /* Length for DIO-aligned middle */
+ ssize_t end_len; /* Length for misaligned last page */
};
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 334e65d6bc72..a0b9af10a744 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -1593,6 +1593,47 @@ DEFINE_NFS_DIRECT_REQ_EVENT(nfs_direct_write_completion);
DEFINE_NFS_DIRECT_REQ_EVENT(nfs_direct_write_schedule_iovec);
DEFINE_NFS_DIRECT_REQ_EVENT(nfs_direct_write_reschedule_io);
+TRACE_EVENT(nfs_analyze_dio,
+ TP_PROTO(u32 rw,
+ u64 offset,
+ u32 len,
+ loff_t start,
+ loff_t start_extra,
+ loff_t middle,
+ loff_t middle_len,
+ loff_t end,
+ loff_t end_len),
+ TP_ARGS(rw, offset, len, start, start_extra, middle, middle_len, end, end_len),
+ TP_STRUCT__entry(
+ __field(u32, rw)
+ __field(u64, offset)
+ __field(u32, len)
+ __field(loff_t, start)
+ __field(loff_t, start_extra)
+ __field(loff_t, middle)
+ __field(loff_t, middle_len)
+ __field(loff_t, end)
+ __field(loff_t, end_len)
+ ),
+ TP_fast_assign(
+ __entry->rw = rw;
+ __entry->offset = offset;
+ __entry->len = len;
+ __entry->start = start;
+ __entry->start_extra = start_extra;
+ __entry->middle = middle;
+ __entry->middle_len = middle_len;
+ __entry->end = end;
+ __entry->end_len = end_len;
+ ),
+ TP_printk("%s offset=%llu len=%u start=%llu+%llu middle=%llu+%llu end=%llu+%llu",
+ __entry->rw ? "WRITE" : "READ",
+ __entry->offset, __entry->len,
+ __entry->start, __entry->start_extra,
+ __entry->middle, __entry->middle_len,
+ __entry->end, __entry->end_len)
+);
+
TRACE_EVENT(nfs_fh_to_dentry,
TP_PROTO(
const struct super_block *sb,
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 9ddff27e96e9..8d877360042d 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -832,6 +832,7 @@ void nfs_pageio_init(struct nfs_pageio_descriptor *desc,
int io_flags)
{
desc->pg_moreio = 0;
+ desc->pg_doio_now = 0;
desc->pg_inode = inode;
desc->pg_ops = pg_ops;
desc->pg_completion_ops = compl_ops;
@@ -1141,6 +1142,8 @@ nfs_pageio_do_add_request(struct nfs_pageio_descriptor *desc,
return size;
nfs_list_move_request(req, &mirror->pg_list);
mirror->pg_count += req->wb_bytes;
+ if (desc->pg_doio_now)
+ return 0; /* trigger nfs_pageio_doio() in caller */
return req->wb_bytes;
}
@@ -1220,6 +1223,10 @@ static int __nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
nfs_pageio_doio(desc);
if (desc->pg_error < 0 || mirror->pg_recoalesce)
return 0;
+ if (desc->pg_doio_now) {
+ desc->pg_doio_now = 0;
+ return 1;
+ }
/* retry add_request for this subreq */
nfs_page_group_lock(req);
continue;
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 169b4ae30ff4..2e88dc2ff3fe 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -117,6 +117,7 @@ struct nfs_pageio_descriptor {
u32 pg_mirror_idx; /* current mirror */
unsigned short pg_maxretrans;
unsigned char pg_moreio : 1;
+ unsigned char pg_doio_now : 1;
};
/* arbitrarily selected limit to number of mirrors */
--
2.44.0
* [PATCH v5 13/13] nfs/direct: add misaligned WRITE handling
2025-07-24 19:30 [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Mike Snitzer
` (11 preceding siblings ...)
2025-07-24 19:31 ` [PATCH v5 12/13] nfs/direct: add misaligned READ handling Mike Snitzer
@ 2025-07-24 19:31 ` Mike Snitzer
2025-07-27 15:39 ` [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Chuck Lever
2025-07-27 16:16 ` (subset) " Chuck Lever
14 siblings, 0 replies; 22+ messages in thread
From: Mike Snitzer @ 2025-07-24 19:31 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
Because the NFS client will already happily handle misaligned O_DIRECT
IO (by sending it out to NFSD via RPC) this commit's new capabilities
are for the benefit of LOCALIO and require the nfs modparam:
localio_O_DIRECT_align_misaligned_IO=Y
When enabled, misaligned WRITE IO is split into a start, middle and
end as needed. The large middle extent is DIO-aligned and the start
and/or end are misaligned (due to each being a partial page).
Like the READ support that came before this WRITE support, the
nfs_analyze_dio trace event shows how the NFS client split a given
misaligned IO into a mix of misaligned page(s) and a DIO-aligned
extent.
This combination of trace events is useful for LOCALIO WRITEs:
echo 1 > /sys/kernel/tracing/events/nfs/nfs_analyze_dio/enable
echo 1 > /sys/kernel/tracing/events/nfs/nfs_initiate_write/enable
echo 1 > /sys/kernel/tracing/events/nfs/nfs_writeback_done/enable
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
Which for this dd command:
dd if=/dev/zero of=/mnt/share1/test bs=47008 count=2 oflag=direct
Results in:
dd-63257 [001] ..... 83742.427650: nfs_analyze_dio: WRITE offset=0 len=47008 start=0+0 middle=0+45056 end=45056+1952
dd-63257 [001] ..... 83742.427659: nfs_initiate_write: fileid=00:2e:219750 fhandle=0xf6927a01 offset=0 count=45056 stable=UNSTABLE
dd-63257 [001] ..... 83742.427662: nfs_initiate_write: fileid=00:2e:219750 fhandle=0xf6927a01 offset=45056 count=1952 stable=UNSTABLE
kworker/u193:3-62985 [011] ..... 83742.427664: xfs_file_direct_write: dev 259:22 ino 0x5e0000a3 disize 0x0 pos 0x0 bytecount 0xb000
kworker/u193:3-62985 [011] ..... 83742.427695: nfs_writeback_done: error=0 fileid=00:2e:219750 fhandle=0xf6927a01 offset=0 count=45056 res=45056 stable=UNSTABLE verifier=a8b37e6803d1eb1e
kworker/u193:4-63221 [004] ..... 83742.427699: nfs_writeback_done: error=0 fileid=00:2e:219750 fhandle=0xf6927a01 offset=45056 count=1952 res=1952 stable=UNSTABLE verifier=a8b37e6803d1eb1e
dd-63257 [001] ..... 83742.427755: nfs_analyze_dio: WRITE offset=47008 len=47008 start=47008+2144 middle=49152+40960 end=90112+3904
dd-63257 [001] ..... 83742.427758: nfs_initiate_write: fileid=00:2e:219750 fhandle=0xf6927a01 offset=47008 count=2144 stable=UNSTABLE
dd-63257 [001] ..... 83742.427760: nfs_initiate_write: fileid=00:2e:219750 fhandle=0xf6927a01 offset=49152 count=40960 stable=UNSTABLE
kworker/u193:4-63221 [004] ..... 83742.427761: nfs_writeback_done: error=0 fileid=00:2e:219750 fhandle=0xf6927a01 offset=47008 count=2144 res=2144 stable=UNSTABLE verifier=a8b37e6803d1eb1e
dd-63257 [001] ..... 83742.427763: nfs_initiate_write: fileid=00:2e:219750 fhandle=0xf6927a01 offset=90112 count=3904 stable=UNSTABLE
kworker/u193:4-63221 [004] ..... 83742.427763: xfs_file_direct_write: dev 259:22 ino 0x5e0000a3 disize 0xb7a0 pos 0xc000 bytecount 0xa000
kworker/u193:4-63221 [004] ..... 83742.427783: nfs_writeback_done: error=0 fileid=00:2e:219750 fhandle=0xf6927a01 offset=49152 count=40960 res=40960 stable=UNSTABLE verifier=a8b37e6803d1eb1e
kworker/u193:3-62985 [011] ..... 83742.427788: nfs_writeback_done: error=0 fileid=00:2e:219750 fhandle=0xf6927a01 offset=90112 count=3904 res=3904 stable=UNSTABLE verifier=a8b37e6803d1eb1e
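The head/middle/tail split shown in the WRITE traces above can be
checked with a minimal user-space sketch, again assuming a 4096-byte
DIO block size. The struct write_split and analyze_write() names are
illustrative only and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the split the patch computes. */
struct write_split {
    uint64_t start_len;  /* misaligned head, up to the first boundary */
    uint64_t middle_off; /* first aligned boundary at or after offset */
    uint64_t middle_len; /* DIO-aligned middle (may be zero) */
    uint64_t end_off;    /* last aligned boundary at or before the end */
    uint64_t end_len;    /* misaligned tail */
};

/* Returns false when no split applies: the IO is shorter than one
 * block, or offset and len are both already block-aligned.
 */
static bool analyze_write(uint64_t offset, uint32_t len, uint32_t blocksize,
                          struct write_split *s)
{
    uint64_t mask = (uint64_t)blocksize - 1;
    uint64_t orig_end = offset + len;
    uint64_t start_end, middle_end;

    /* The masking below only works for power-of-two block sizes. */
    assert(blocksize != 0 && (blocksize & mask) == 0);

    if (len < blocksize || ((offset | len) & mask) == 0)
        return false;

    start_end = (offset + mask) & ~mask; /* round_up(offset) */
    middle_end = orig_end & ~mask;       /* round_down(offset + len) */

    s->start_len = start_end - offset;
    s->middle_off = start_end;
    s->middle_len = middle_end - start_end;
    s->end_off = middle_end;
    s->end_len = orig_end - middle_end;
    return true;
}
```

For the second dd WRITE (offset=47008 len=47008) this gives a 2144-byte
head, a 40960-byte middle at 49152 and a 3904-byte tail at 90112, which
lines up with the three nfs_initiate_write events. Unlike the READ
case, nothing is expanded: the three pieces always sum to len.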
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfs/direct.c | 84 ++++++++++++++++++++++++++++++++++++++++++++---
fs/nfs/internal.h | 1 +
2 files changed, 80 insertions(+), 5 deletions(-)
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 4e1e668eaa1f..80c2ca37cf28 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -1048,11 +1048,19 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
if (result < 0)
break;
- bytes = result;
- npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
+ /* Limit the amount of bytes serviced each iteration to aligned batches */
+ if (pos < dreq->middle_offset && dreq->start_len)
+ bytes = min_t(size_t, dreq->start_len, result);
+ else if (pos < dreq->end_offset && dreq->middle_len)
+ bytes = min_t(size_t, dreq->middle_len, result);
+ else
+ bytes = result;
+ npages = (bytes + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
+
for (i = 0; i < npages; i++) {
struct nfs_page *req;
unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
+ bool issue_dio_now = false;
req = nfs_page_create_from_page(dreq->ctx, pagevec[i],
pgbase, pos, req_len);
@@ -1068,6 +1076,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
}
pgbase = 0;
+ result -= req_len;
bytes -= req_len;
requested_bytes += req_len;
pos += req_len;
@@ -1077,9 +1086,27 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
continue;
}
+ /* Looking ahead, is this req the end of the start or middle? */
+ if (bytes == 0) {
+ if ((dreq->start_len &&
+ pos == dreq->middle_offset && result >= dreq->middle_len) ||
+ (dreq->end_len &&
+ pos == dreq->end_offset && result == dreq->end_len)) {
+ desc.pg_doio_now = 1;
+ issue_dio_now = true;
+ /* Reset iter to the last boundary, issue the current
+ * req and then handle iter to next boundary or end.
+ */
+ iov_iter_revert(iter, result);
+ }
+ }
+
nfs_lock_request(req);
- if (nfs_pageio_add_request(&desc, req))
+ if (nfs_pageio_add_request(&desc, req)) {
+ if (issue_dio_now)
+ break;
continue;
+ }
/* Exit on hard errors */
if (desc.pg_error < 0 && desc.pg_error != -EAGAIN) {
@@ -1120,6 +1147,50 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
return requested_bytes;
}
+/*
+ * If localio_O_DIRECT_align_misaligned_IO is enabled, split misaligned
+ * WRITE to a DIO-aligned middle and misaligned head and/or tail.
+ */
+static bool nfs_analyze_write_dio(loff_t offset, __u32 len,
+ struct nfs_direct_req *dreq)
+{
+#if IS_ENABLED(CONFIG_NFS_LOCALIO)
+ /* Hardcoded to PAGE_SIZE (since don't have LOCALIO nfsd_file's
+ * dio_alignment), works for smaller alignment too (e.g. 512b).
+ */
+ u32 dio_blocksize = PAGE_SIZE;
+ loff_t start_end, orig_end, middle_end;
+
+ /* Return early if feature disabled, if IO is irreparably
+ * misaligned (len < PAGE_SIZE) or if IO is already DIO-aligned.
+ */
+ if (!nfs_localio_O_DIRECT_align_misaligned_IO() ||
+ unlikely(len < dio_blocksize) ||
+ (((offset | len) & (dio_blocksize-1)) == 0))
+ return false;
+
+ start_end = round_up(offset, dio_blocksize);
+ orig_end = offset + len;
+ middle_end = round_down(orig_end, dio_blocksize);
+
+ dreq->io_start = offset;
+ dreq->max_count = orig_end - offset;
+
+ dreq->start_len = start_end - offset;
+ dreq->middle_offset = start_end;
+ dreq->middle_len = middle_end - start_end;
+ dreq->end_offset = middle_end;
+ dreq->end_len = orig_end - middle_end;
+
+ trace_nfs_analyze_dio(WRITE, offset, len, offset, dreq->start_len,
+ dreq->middle_offset, dreq->middle_len,
+ dreq->end_offset, dreq->end_len);
+ return true;
+#else
+ return false;
+#endif
+}
+
/**
* nfs_file_direct_write - file direct write operation for NFS files
* @iocb: target I/O control block
@@ -1176,9 +1247,12 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
if (!dreq)
goto out;
+ if (swap || !nfs_analyze_write_dio(pos, count, dreq)) {
+ dreq->max_count = count;
+ dreq->io_start = pos;
+ }
+
dreq->inode = inode;
- dreq->max_count = count;
- dreq->io_start = pos;
dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
l_ctx = nfs_get_lock_context(dreq->ctx);
if (IS_ERR(l_ctx)) {
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 06a15bf08357..8daed5b1aa50 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -995,6 +995,7 @@ struct nfs_direct_req {
struct bio_vec * start_extra_bvec;
loff_t middle_offset; /* Offset for start of DIO-aligned middle */
loff_t end_offset; /* Offset for start of DIO-aligned end */
+ ssize_t start_len; /* Length for misaligned first page */
ssize_t middle_len; /* Length for DIO-aligned middle */
ssize_t end_len; /* Length for misaligned last page */
};
--
2.44.0
* Re: [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT
2025-07-24 19:30 [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Mike Snitzer
` (12 preceding siblings ...)
2025-07-24 19:31 ` [PATCH v5 13/13] nfs/direct: add misaligned WRITE handling Mike Snitzer
@ 2025-07-27 15:39 ` Chuck Lever
2025-07-28 13:44 ` Mike Snitzer
2025-07-27 16:16 ` (subset) " Chuck Lever
14 siblings, 1 reply; 22+ messages in thread
From: Chuck Lever @ 2025-07-27 15:39 UTC (permalink / raw)
To: Mike Snitzer, Jeff Layton, Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
On 7/24/25 3:30 PM, Mike Snitzer wrote:
> Hi,
>
> Some workloads benefit from NFSD avoiding the page cache, particularly
> those with a working set that is significantly larger than available
> system memory. This patchset introduces _optional_ support to
> configure the use of O_DIRECT or DONTCACHE for NFSD's READ and WRITE
> support. The NFSD default to use page cache is left unchanged.
>
> The performance win associated with using NFSD DIRECT was previously
> summarized here:
> https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
> This picture offers a nice summary of performance gains:
> https://original.art/NFSD_direct_vs_buffered_IO.jpg
>
> Similarly, NFS and LOCALIO in particular also benefit from avoiding
> the page cache for workloads that have a working set that is
> significantly larger than available system memory. Enter: NFS DIRECT,
> which makes it possible to always enable LOCALIO to use O_DIRECT even
> if the IO is not DIO-aligned.
>
> For this v5 I've combined the NFSD and NFS patchsets because the NFS
> changes do depend on the NFSD changes. In addition, I think it
> makes sense to review/test these changes together.
I'm ready to pull the six NFSD patches in this series into nfsd-testing.
IMO we want regression and performance testing of NFSD, outside of the
LOCALIO paths, before claiming merge readiness.
> I'm sharing these again now, soon after posting the NFSD and NFS
> updates, to hopefully make it clear where the code stands. Thanks to
> Chuck's feedback I have kept the patch "NFSD: issue READs using
> O_DIRECT even if IO is misaligned" (and will now finish NFSD's
> misaligned WRITE handling, splitting IO to misaligned head and/or tail
> and DIO-aligned middle, and will include in the next version of this
> patchset -- probably mid next week).
>
> New changes in this v5:
> - Combine NFSD DIRECT and NFS DIRECT patches into single patchset.
> - Fix a "nsfd" typo in a variable of the NFSD io_cache_read patch that
> was masked because the later " NFSD: issue READs using O_DIRECT even
> if IO is misaligned" patch fixed it.
> - Properly include the "NFSD: filecache: only get DIO alignment
> attrs if NFSD_IO_DIRECT enabled" in the patch series.
> - Optimize NFS DIRECT's misaligned READ and WRITE support to return
> early if IO irreparably misaligned or already DIO-aligned.
>
> Thanks,
> Mike
>
> Mike Snitzer (13):
> NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
> NFSD: pass nfsd_file to nfsd_iter_read()
> NFSD: add io_cache_read controls to debugfs interface
> NFSD: add io_cache_write controls to debugfs interface
> NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled
> NFSD: issue READs using O_DIRECT even if IO is misaligned
> nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local()
> nfs/localio: make trace_nfs_local_open_fh more useful
> nfs/localio: add nfsd_file_dio_alignment
> nfs/localio: refactor iocb initialization
> nfs/localio: fallback to NFSD for misaligned O_DIRECT READs
> nfs/direct: add misaligned READ handling
> nfs/direct: add misaligned WRITE handling
>
> fs/nfs/direct.c | 262 +++++++++++++++++++++++--
> fs/nfs/flexfilelayout/flexfilelayout.c | 1 +
> fs/nfs/internal.h | 17 +-
> fs/nfs/localio.c | 231 ++++++++++++++--------
> fs/nfs/nfstrace.h | 47 ++++-
> fs/nfs/pagelist.c | 22 ++-
> fs/nfsd/debugfs.c | 102 ++++++++++
> fs/nfsd/filecache.c | 36 ++++
> fs/nfsd/filecache.h | 4 +
> fs/nfsd/localio.c | 11 ++
> fs/nfsd/nfs4xdr.c | 8 +-
> fs/nfsd/nfsd.h | 10 +
> fs/nfsd/nfsfh.c | 4 +
> fs/nfsd/trace.h | 37 ++++
> fs/nfsd/vfs.c | 200 +++++++++++++++++--
> fs/nfsd/vfs.h | 2 +-
> include/linux/nfs_page.h | 1 +
> include/linux/nfslocalio.h | 2 +
> include/linux/sunrpc/svc.h | 5 +-
> 19 files changed, 875 insertions(+), 127 deletions(-)
>
--
Chuck Lever
* Re: (subset) [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT
2025-07-24 19:30 [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Mike Snitzer
` (13 preceding siblings ...)
2025-07-27 15:39 ` [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Chuck Lever
@ 2025-07-27 16:16 ` Chuck Lever
2025-07-28 13:51 ` Mike Snitzer
14 siblings, 1 reply; 22+ messages in thread
From: Chuck Lever @ 2025-07-27 16:16 UTC (permalink / raw)
To: Jeff Layton, Trond Myklebust, Anna Schumaker, Mike Snitzer
Cc: Chuck Lever, linux-nfs
From: Chuck Lever <chuck.lever@oracle.com>
On Thu, 24 Jul 2025 15:30:49 -0400, Mike Snitzer wrote:
> Some workloads benefit from NFSD avoiding the page cache, particularly
> those with a working set that is significantly larger than available
> system memory. This patchset introduces _optional_ support to
> configure the use of O_DIRECT or DONTCACHE for NFSD's READ and WRITE
> support. The NFSD default to use page cache is left unchanged.
>
> The performance win associated with using NFSD DIRECT was previously
> summarized here:
> https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
> This picture offers a nice summary of performance gains:
> https://original.art/NFSD_direct_vs_buffered_IO.jpg
>
> [...]
Applied to nfsd-testing, thanks!
[01/13] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
commit: af157e09634a113da83d8ac5fff541f9e06ad653
[02/13] NFSD: pass nfsd_file to nfsd_iter_read()
commit: 63a534c8b18642dc27318e08b77952c4d7f55628
[03/13] NFSD: add io_cache_read controls to debugfs interface
commit: f76b72e4908c556021d94bdeca86fffce430c791
[04/13] NFSD: add io_cache_write controls to debugfs interface
commit: a45da44bb6bade1dfef569c792ae2ee6507f4724
[05/13] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled
commit: af157e09634a113da83d8ac5fff541f9e06ad653
[06/13] NFSD: issue READs using O_DIRECT even if IO is misaligned
commit: 6d80efb3cb6f9817bedfa460e9ddf56a916caf2f
--
Chuck Lever
* Re: [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT
2025-07-27 15:39 ` [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Chuck Lever
@ 2025-07-28 13:44 ` Mike Snitzer
2025-07-28 13:48 ` Chuck Lever
0 siblings, 1 reply; 22+ messages in thread
From: Mike Snitzer @ 2025-07-28 13:44 UTC (permalink / raw)
To: Chuck Lever; +Cc: Jeff Layton, Trond Myklebust, Anna Schumaker, linux-nfs
On Sun, Jul 27, 2025 at 11:39:18AM -0400, Chuck Lever wrote:
> On 7/24/25 3:30 PM, Mike Snitzer wrote:
> > Hi,
> >
> > Some workloads benefit from NFSD avoiding the page cache, particularly
> > those with a working set that is significantly larger than available
> > system memory. This patchset introduces _optional_ support to
> > configure the use of O_DIRECT or DONTCACHE for NFSD's READ and WRITE
> > support. The NFSD default to use page cache is left unchanged.
> >
> > The performance win associated with using NFSD DIRECT was previously
> > summarized here:
> > https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
> > This picture offers a nice summary of performance gains:
> > https://original.art/NFSD_direct_vs_buffered_IO.jpg
> >
> > Similarly, NFS and LOCALIO in particular also benefit from avoiding
> > the page cache for workloads that have a working set that is
> > significantly larger than available system memory. Enter: NFS DIRECT,
> > which makes it possible to always enable LOCALIO to use O_DIRECT even
> > if the IO is not DIO-aligned.
> >
> > For this v5 I've combined the NFSD and NFS patchsets because the NFS
> > changes do depend on the NFSD changes. In addition, I think it
> > makes sense to review/test these changes together.
>
> I'm ready to pull the six NFSD patches in this series into nfsd-testing.
> IMO we want regression and performance testing of NFSD, outside of the
> LOCALIO paths, before claiming merge readiness.
Makes sense, the NFSD changes are independent. LOCALIO's access to
the dio alignment attrs in nfsd_file is a convenience.
Mike
* Re: [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT
2025-07-28 13:44 ` Mike Snitzer
@ 2025-07-28 13:48 ` Chuck Lever
2025-07-28 14:08 ` Mike Snitzer
0 siblings, 1 reply; 22+ messages in thread
From: Chuck Lever @ 2025-07-28 13:48 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Jeff Layton, Trond Myklebust, Anna Schumaker, linux-nfs
On 7/28/25 9:44 AM, Mike Snitzer wrote:
> On Sun, Jul 27, 2025 at 11:39:18AM -0400, Chuck Lever wrote:
>> On 7/24/25 3:30 PM, Mike Snitzer wrote:
>>> Hi,
>>>
>>> Some workloads benefit from NFSD avoiding the page cache, particularly
>>> those with a working set that is significantly larger than available
>>> system memory. This patchset introduces _optional_ support to
>>> configure the use of O_DIRECT or DONTCACHE for NFSD's READ and WRITE
>>> support. The NFSD default to use page cache is left unchanged.
>>>
>>> The performance win associated with using NFSD DIRECT was previously
>>> summarized here:
>>> https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
>>> This picture offers a nice summary of performance gains:
>>> https://original.art/NFSD_direct_vs_buffered_IO.jpg
>>>
>>> Similarly, NFS and LOCALIO in particular also benefit from avoiding
>>> the page cache for workloads that have a working set that is
>>> significantly larger than available system memory. Enter: NFS DIRECT,
>>> which makes it possible to always enable LOCALIO to use O_DIRECT even
>>> if the IO is not DIO-aligned.
>>>
>>> For this v5 I've combined the NFSD and NFS patchsets because the NFS
>>> changes do depend on the NFSD changes. In addition, I think it
>>> makes sense to review/test these changes together.
>>
>> I'm ready to pull the six NFSD patches in this series into nfsd-testing.
>> IMO we want regression and performance testing of NFSD, outside of the
>> LOCALIO paths, before claiming merge readiness.
>
> Makes sense, the NFSD changes are independent. LOCALIO's access to
> the dio alignment attrs in nfsd_file is a convenience.
As I was drifting off to sleep last night, my mind hallucinated the
idea that maybe all (three) caching modes should align the READ
payload. Would that make sense / simplify 06/13 ?
--
Chuck Lever
* Re: (subset) [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT
2025-07-27 16:16 ` (subset) " Chuck Lever
@ 2025-07-28 13:51 ` Mike Snitzer
2025-07-28 13:53 ` Chuck Lever
0 siblings, 1 reply; 22+ messages in thread
From: Mike Snitzer @ 2025-07-28 13:51 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Trond Myklebust, Anna Schumaker, Chuck Lever,
linux-nfs
On Sun, Jul 27, 2025 at 12:16:04PM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> On Thu, 24 Jul 2025 15:30:49 -0400, Mike Snitzer wrote:
> > Some workloads benefit from NFSD avoiding the page cache, particularly
> > those with a working set that is significantly larger than available
> > system memory. This patchset introduces _optional_ support to
> > configure the use of O_DIRECT or DONTCACHE for NFSD's READ and WRITE
> > support. The NFSD default to use page cache is left unchanged.
> >
> > The performance win associated with using NFSD DIRECT was previously
> > summarized here:
> > https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
> > This picture offers a nice summary of performance gains:
> > https://original.art/NFSD_direct_vs_buffered_IO.jpg
> >
> > [...]
>
> Applied to nfsd-testing, thanks!
>
> [01/13] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
> commit: af157e09634a113da83d8ac5fff541f9e06ad653
> [05/13] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled
> commit: af157e09634a113da83d8ac5fff541f9e06ad653

I noticed you folded these, unfortunately that isn't bisect safe
unless you pull these fs/nfsd/nfsd.h changes to the front too:

git diff f76b72e4908c556021d94bdeca86fffce430c791^..a45da44bb6bade1dfef569c792ae2ee6507f4724 -- fs/nfsd/nfsd.h

diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 1cd0bed57bc2..fe935b4cda53 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -153,6 +153,16 @@ static inline void nfsd_debugfs_exit(void) {}

 extern bool nfsd_disable_splice_read __read_mostly;

+enum {
+	NFSD_IO_UNSPECIFIED = 0,
+	NFSD_IO_BUFFERED,
+	NFSD_IO_DONTCACHE,
+	NFSD_IO_DIRECT,
+};
+
+extern u64 nfsd_io_cache_read __read_mostly;
+extern u64 nfsd_io_cache_write __read_mostly;
+
 extern int nfsd_max_blksize;

 static inline int nfsd_v4client(struct svc_rqst *rq)

> [02/13] NFSD: pass nfsd_file to nfsd_iter_read()
> commit: 63a534c8b18642dc27318e08b77952c4d7f55628
> [03/13] NFSD: add io_cache_read controls to debugfs interface
> commit: f76b72e4908c556021d94bdeca86fffce430c791
> [04/13] NFSD: add io_cache_write controls to debugfs interface
> commit: a45da44bb6bade1dfef569c792ae2ee6507f4724
> [06/13] NFSD: issue READs using O_DIRECT even if IO is misaligned
> commit: 6d80efb3cb6f9817bedfa460e9ddf56a916caf2f
Thanks!
Mike
* Re: (subset) [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT
2025-07-28 13:51 ` Mike Snitzer
@ 2025-07-28 13:53 ` Chuck Lever
2025-07-28 13:58 ` Mike Snitzer
0 siblings, 1 reply; 22+ messages in thread
From: Chuck Lever @ 2025-07-28 13:53 UTC (permalink / raw)
To: Mike Snitzer
Cc: Jeff Layton, Trond Myklebust, Anna Schumaker, Chuck Lever,
linux-nfs
On 7/28/25 9:51 AM, Mike Snitzer wrote:
> On Sun, Jul 27, 2025 at 12:16:04PM -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> On Thu, 24 Jul 2025 15:30:49 -0400, Mike Snitzer wrote:
>>> Some workloads benefit from NFSD avoiding the page cache, particularly
>>> those with a working set that is significantly larger than available
>>> system memory. This patchset introduces _optional_ support to
>>> configure the use of O_DIRECT or DONTCACHE for NFSD's READ and WRITE
>>> support. The NFSD default to use page cache is left unchanged.
>>>
>>> The performance win associated with using NFSD DIRECT was previously
>>> summarized here:
>>> https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
>>> This picture offers a nice summary of performance gains:
>>> https://original.art/NFSD_direct_vs_buffered_IO.jpg
>>>
>>> [...]
>>
>> Applied to nfsd-testing, thanks!
>>
>> [01/13] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
>> commit: af157e09634a113da83d8ac5fff541f9e06ad653
>
>> [05/13] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled
>> commit: af157e09634a113da83d8ac5fff541f9e06ad653
>
> I noticed you folded these, unfortunately that isn't bisect safe
> unless you pull these fs/nfsd/nfsd.h changes to the front too:
>
> git diff f76b72e4908c556021d94bdeca86fffce430c791^..a45da44bb6bade1dfef569c792ae2ee6507f4724 -- fs/nfsd/nfsd.h
>
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 1cd0bed57bc2..fe935b4cda53 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -153,6 +153,16 @@ static inline void nfsd_debugfs_exit(void) {}
>
> extern bool nfsd_disable_splice_read __read_mostly;
>
> +enum {
> + NFSD_IO_UNSPECIFIED = 0,
> + NFSD_IO_BUFFERED,
> + NFSD_IO_DONTCACHE,
> + NFSD_IO_DIRECT,
> +};
> +
> +extern u64 nfsd_io_cache_read __read_mostly;
> +extern u64 nfsd_io_cache_write __read_mostly;
> +
> extern int nfsd_max_blksize;
>
> static inline int nfsd_v4client(struct svc_rqst *rq)
>
>> [02/13] NFSD: pass nfsd_file to nfsd_iter_read()
>> commit: 63a534c8b18642dc27318e08b77952c4d7f55628
>> [03/13] NFSD: add io_cache_read controls to debugfs interface
>> commit: f76b72e4908c556021d94bdeca86fffce430c791
>> [04/13] NFSD: add io_cache_write controls to debugfs interface
>> commit: a45da44bb6bade1dfef569c792ae2ee6507f4724
>
>> [06/13] NFSD: issue READs using O_DIRECT even if IO is misaligned
>> commit: 6d80efb3cb6f9817bedfa460e9ddf56a916caf2f
>
> Thanks!
> Mike
That's what I get for compile-testing first before squashing.
--
Chuck Lever
* Re: (subset) [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT
2025-07-28 13:53 ` Chuck Lever
@ 2025-07-28 13:58 ` Mike Snitzer
0 siblings, 0 replies; 22+ messages in thread
From: Mike Snitzer @ 2025-07-28 13:58 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Trond Myklebust, Anna Schumaker, Chuck Lever,
linux-nfs
On Mon, Jul 28, 2025 at 09:53:45AM -0400, Chuck Lever wrote:
> On 7/28/25 9:51 AM, Mike Snitzer wrote:
> > On Sun, Jul 27, 2025 at 12:16:04PM -0400, Chuck Lever wrote:
> >> From: Chuck Lever <chuck.lever@oracle.com>
> >>
> >> On Thu, 24 Jul 2025 15:30:49 -0400, Mike Snitzer wrote:
> >>> Some workloads benefit from NFSD avoiding the page cache, particularly
> >>> those with a working set that is significantly larger than available
> >>> system memory. This patchset introduces _optional_ support to
> >>> configure the use of O_DIRECT or DONTCACHE for NFSD's READ and WRITE
> >>> support. The NFSD default to use page cache is left unchanged.
> >>>
> >>> The performance win associated with using NFSD DIRECT was previously
> >>> summarized here:
> >>> https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
> >>> This picture offers a nice summary of performance gains:
> >>> https://original.art/NFSD_direct_vs_buffered_IO.jpg
> >>>
> >>> [...]
> >>
> >> Applied to nfsd-testing, thanks!
> >>
> >> [01/13] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
> >> commit: af157e09634a113da83d8ac5fff541f9e06ad653
> >
> >> [05/13] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled
> >> commit: af157e09634a113da83d8ac5fff541f9e06ad653
> >
> > I noticed you folded these, unfortunately that isn't bisect safe
> > unless you pull these fs/nfsd/nfsd.h changes to the front too:
> >
> > git diff f76b72e4908c556021d94bdeca86fffce430c791^..a45da44bb6bade1dfef569c792ae2ee6507f4724 -- fs/nfsd/nfsd.h
> >
> > diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> > index 1cd0bed57bc2..fe935b4cda53 100644
> > --- a/fs/nfsd/nfsd.h
> > +++ b/fs/nfsd/nfsd.h
> > @@ -153,6 +153,16 @@ static inline void nfsd_debugfs_exit(void) {}
> >
> > extern bool nfsd_disable_splice_read __read_mostly;
> >
> > +enum {
> > + NFSD_IO_UNSPECIFIED = 0,
> > + NFSD_IO_BUFFERED,
> > + NFSD_IO_DONTCACHE,
> > + NFSD_IO_DIRECT,
> > +};
> > +
> > +extern u64 nfsd_io_cache_read __read_mostly;
> > +extern u64 nfsd_io_cache_write __read_mostly;
> > +
> > extern int nfsd_max_blksize;
> >
> > static inline int nfsd_v4client(struct svc_rqst *rq)
> >
> >> [02/13] NFSD: pass nfsd_file to nfsd_iter_read()
> >> commit: 63a534c8b18642dc27318e08b77952c4d7f55628
> >> [03/13] NFSD: add io_cache_read controls to debugfs interface
> >> commit: f76b72e4908c556021d94bdeca86fffce430c791
> >> [04/13] NFSD: add io_cache_write controls to debugfs interface
> >> commit: a45da44bb6bade1dfef569c792ae2ee6507f4724
> >
> >> [06/13] NFSD: issue READs using O_DIRECT even if IO is misaligned
> >> commit: 6d80efb3cb6f9817bedfa460e9ddf56a916caf2f
> >
> > Thanks!
> > Mike
>
> That's what I get for compile-testing first before squashing.

It happens, you also need this from vfs.c:

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 9bbc97aebbea..a7a587736a22 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -49,6 +49,8 @@
 #define NFSDDBG_FACILITY NFSDDBG_FILEOP

 bool nfsd_disable_splice_read __read_mostly;
+u64 nfsd_io_cache_read __read_mostly;
+u64 nfsd_io_cache_write __read_mostly;

 /**
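
For readers following the bisect-safety point: a commit that tests whether NFSD_IO_DIRECT is enabled can only build once the enum from fs/nfsd/nfsd.h and the two globals from fs/nfsd/vfs.c exist, which is why both hunks have to move in front of the folded 01/13 + 05/13 commit. The user-space sketch below illustrates that dependency only. `want_dio_alignment()` is a hypothetical helper, not a function from the patches, the assumption that both the read and write settings are consulted is mine, and `uint64_t` stands in for the kernel's `u64`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the enum added to fs/nfsd/nfsd.h in the diff above. */
enum {
	NFSD_IO_UNSPECIFIED = 0,
	NFSD_IO_BUFFERED,
	NFSD_IO_DONTCACHE,
	NFSD_IO_DIRECT,
};

/*
 * Definitions corresponding to the fs/nfsd/vfs.c hunk.
 * uint64_t stands in for the kernel's u64; __read_mostly is omitted
 * because it is a kernel-only annotation.
 */
uint64_t nfsd_io_cache_read;
uint64_t nfsd_io_cache_write;

/*
 * Hypothetical helper (illustration only, not from the patch series):
 * report whether DIO alignment attributes are worth fetching.
 * Without the enum and the globals above, this would not compile,
 * which is the bisect breakage being described.
 */
bool want_dio_alignment(void)
{
	return nfsd_io_cache_read == NFSD_IO_DIRECT ||
	       nfsd_io_cache_write == NFSD_IO_DIRECT;
}
```

With both globals left at zero (NFSD_IO_UNSPECIFIED) the helper returns false; setting either one to NFSD_IO_DIRECT makes it return true.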
* Re: [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT
2025-07-28 13:48 ` Chuck Lever
@ 2025-07-28 14:08 ` Mike Snitzer
0 siblings, 0 replies; 22+ messages in thread
From: Mike Snitzer @ 2025-07-28 14:08 UTC (permalink / raw)
To: Chuck Lever; +Cc: Jeff Layton, Trond Myklebust, Anna Schumaker, linux-nfs
On Mon, Jul 28, 2025 at 09:48:48AM -0400, Chuck Lever wrote:
> On 7/28/25 9:44 AM, Mike Snitzer wrote:
> > On Sun, Jul 27, 2025 at 11:39:18AM -0400, Chuck Lever wrote:
> >> On 7/24/25 3:30 PM, Mike Snitzer wrote:
> >>> Hi,
> >>>
> >>> Some workloads benefit from NFSD avoiding the page cache, particularly
> >>> those with a working set that is significantly larger than available
> >>> system memory. This patchset introduces _optional_ support to
> >>> configure the use of O_DIRECT or DONTCACHE for NFSD's READ and WRITE
> >>> support. The NFSD default to use page cache is left unchanged.
> >>>
> >>> The performance win associated with using NFSD DIRECT was previously
> >>> summarized here:
> >>> https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
> >>> This picture offers a nice summary of performance gains:
> >>> https://original.art/NFSD_direct_vs_buffered_IO.jpg
> >>>
> >>> Similarly, NFS and LOCALIO in particular also benefit from avoiding
> >>> the page cache for workloads that have a working set that is
> >>> significantly larger than available system memory. Enter: NFS DIRECT,
> >>> which makes it possible to always enable LOCALIO to use O_DIRECT even
> >>> if the IO is not DIO-aligned.
> >>>
> >>> For this v5 I've combined the NFSD and NFS patchsets because the NFS
> >>> changes do depend on the NFSD changes. In addition, I think it
> >>> makes sense to review/test these changes together.
> >>
> >> I'm ready to pull the six NFSD patches in this series into nfsd-testing.
> >> IMO we want regression and performance testing of NFSD, outside of the
> >> LOCALIO paths, before claiming merge readiness.
> >
> > Makes sense, the NFSD changes are independent. LOCALIO's access to
> > the dio alignment attrs in nfsd_file is a convenience.
>
> As I was drifting off to sleep last night, my mind hallucinated the
> idea that maybe all (three) caching modes should align the READ
> payload. Would that make sense / simplify 06/13 ?
As in nfsd_iter_read() no longer being passed @base? Sure it'd
simplify things a bit, but not so much that it needs to be done as a
prereq.
Bigger bonus is that it reduces cause for needless inability to use
DIO if/when configured to do so. So in that respect, definitely a good
incremental improvement.
Mike
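
To make "misaligned" concrete: O_DIRECT generally needs the file offset and length to be multiples of the offset alignment that STATX_DIOALIGN reports. One way to serve a misaligned READ with DIO is to widen the request to the enclosing aligned boundaries and then skip the extra leading bytes. The user-space sketch below shows only that arithmetic. It is not the code from 06/13; `struct dio_range` and `dio_expand_read()` are hypothetical names, and the alignment is assumed to be a nonzero power of two.

```c
#include <stdint.h>

/* Hypothetical result type (illustration only). */
struct dio_range {
	uint64_t start;	/* aligned file offset to read from */
	uint64_t len;	/* aligned number of bytes to read */
	uint64_t skip;	/* leading bytes to discard before the payload */
};

/*
 * Hypothetical helper: widen [offset, offset + len) to the enclosing
 * range whose start and end are multiples of 'align'.
 * Assumes 'align' is a nonzero power of two.
 */
struct dio_range dio_expand_read(uint64_t offset, uint64_t len, uint32_t align)
{
	uint64_t mask = (uint64_t)align - 1;
	uint64_t start = offset & ~mask;		/* round down */
	uint64_t end = (offset + len + mask) & ~mask;	/* round up */
	struct dio_range r = { start, end - start, offset - start };

	return r;
}
```

For example, a 3000-byte READ at offset 5000 with 4096-byte alignment becomes a 4096-byte read at offset 4096, with the first 904 bytes skipped. A request that is already aligned comes back unchanged with a skip of 0.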
Thread overview: 22+ messages
2025-07-24 19:30 [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 01/13] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 02/13] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 03/13] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 04/13] NFSD: add io_cache_write " Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 05/13] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 06/13] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 07/13] nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local() Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 08/13] nfs/localio: make trace_nfs_local_open_fh more useful Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 09/13] nfs/localio: add nfsd_file_dio_alignment Mike Snitzer
2025-07-24 19:30 ` [PATCH v5 10/13] nfs/localio: refactor iocb initialization Mike Snitzer
2025-07-24 19:31 ` [PATCH v5 11/13] nfs/localio: fallback to NFSD for misaligned O_DIRECT READs Mike Snitzer
2025-07-24 19:31 ` [PATCH v5 12/13] nfs/direct: add misaligned READ handling Mike Snitzer
2025-07-24 19:31 ` [PATCH v5 13/13] nfs/direct: add misaligned WRITE handling Mike Snitzer
2025-07-27 15:39 ` [PATCH v5 00/13] NFSD DIRECT and NFS DIRECT Chuck Lever
2025-07-28 13:44 ` Mike Snitzer
2025-07-28 13:48 ` Chuck Lever
2025-07-28 14:08 ` Mike Snitzer
2025-07-27 16:16 ` (subset) " Chuck Lever
2025-07-28 13:51 ` Mike Snitzer
2025-07-28 13:53 ` Chuck Lever
2025-07-28 13:58 ` Mike Snitzer