linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes
@ 2025-08-07 16:25 Mike Snitzer
  2025-08-07 16:25 ` [PATCH v5 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Mike Snitzer @ 2025-08-07 16:25 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Hi,

Some workloads benefit from NFSD avoiding the page cache, particularly
those with a working set that is significantly larger than available
system memory.  This patchset introduces _optional_ support to
configure the use of O_DIRECT or DONTCACHE for NFSD's READ and WRITE
support.  The NFSD default to use page cache is left unchanged.

The performance win associated with using NFSD DIRECT was previously
summarized here:
https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
This picture offers a nice summary of performance gains:
https://original.art/NFSD_direct_vs_buffered_IO.jpg

This series builds on what has been staged in the nfsd-testing branch.

This code has proven to work well during my testing.  Any suggestions
for further refinement are welcome.

Thanks,
Mike

Changes since v4:
- removed use of iov_iter_is_aligned() in earlier patches, we don't
  want any conflict with Keith's patchset that ultimately removes the
  iov_iter_is_aligned() interface.
- refactored the final "NFSD: issue WRITEs using O_DIRECT even if IO
  is misaligned" patch to have the lightest touch possible on
  nfsd_vfs_write() for the default buffered IO case.
- updated patch headers where needed.
- all patches have changed some, so removed all Reviewed-by from Jeff
  and Signed-off-by from Chuck.
- Series checked with checkpatch.pl, sparse and verified bisect safe.

Changes since v3:
- fixed nfsd_vfs_write() so work that only needs to happen once, after
  IO is submitted, isn't performed each iteration of the for loop,
  thanks for catching this Chuck.
- updated NFSD's misaligned READ and WRITE handling to not use
  iov_iter_is_aligned() because it will soon be removed upstream.
  - Chuck, provided both you and Jeff agree with patch 1's incremental
    changes, patch 1 should be folded into what you already have in your
    nfsd-testing branch (more justification in patch 1's header)
- dropped the NFSD misaligned DIO NFS reexport patch in favor of
  adding STATX_DIOALIGN support to NFS (the patch to add support will
  be provided in the next NFS DIRECT v7 patchset that I'll post soon).

Changes since v2:
- fixed patch 2 by moving redundant work out of nfsd_vfs_write()'s for
  loop, thanks to Jeff's review.
- added Jeff's Reviewed-by to patches 1-3.
- Left patch 4 in the series because it is pragmatic, but feel free to
  drop it if you'd prefer to see this cat skinned a different way.

Changes since v1:
- switched to using an EVENT_CLASS to create nfsd_analyze_{read,write}_dio
- added 4th patch, if user configured use of NFSD_IO_DIRECT then NFS
  reexports should use it too (in future, with per-export controls
  we'll have the benefit of finer-grained control; but until then we'd
  do well to offer comprehensive use of NFSD_IO_DIRECT if it enabled).

Thanks,
Mike

Mike Snitzer (7):
  NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  NFSD: pass nfsd_file to nfsd_iter_read()
  NFSD: add io_cache_read controls to debugfs interface
  NFSD: add io_cache_write controls to debugfs interface
  NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled
  NFSD: issue READs using O_DIRECT even if IO is misaligned
  NFSD: issue WRITEs using O_DIRECT even if IO is misaligned

 fs/nfsd/debugfs.c          | 102 +++++++++++
 fs/nfsd/filecache.c        |  36 ++++
 fs/nfsd/filecache.h        |   4 +
 fs/nfsd/nfs4xdr.c          |   8 +-
 fs/nfsd/nfsd.h             |  10 +
 fs/nfsd/nfsfh.c            |   4 +
 fs/nfsd/trace.h            |  61 +++++++
 fs/nfsd/vfs.c              | 366 +++++++++++++++++++++++++++++++++++--
 fs/nfsd/vfs.h              |   2 +-
 include/linux/sunrpc/svc.h |   5 +-
 10 files changed, 575 insertions(+), 23 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v5 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  2025-08-07 16:25 [PATCH v5 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
@ 2025-08-07 16:25 ` Mike Snitzer
  2025-08-08 11:49   ` Jeff Layton
  2025-08-07 16:25 ` [PATCH v5 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2025-08-07 16:25 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get and store DIO
alignment attributes from underlying filesystem in associated
nfsd_file.  This is done when the nfsd_file is first opened for
a regular file.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/filecache.c | 32 ++++++++++++++++++++++++++++++++
 fs/nfsd/filecache.h |  4 ++++
 fs/nfsd/nfsfh.c     |  4 ++++
 3 files changed, 40 insertions(+)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 8581c131338b8..5447dba6c5da0 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -231,6 +231,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
 	refcount_set(&nf->nf_ref, 1);
 	nf->nf_may = need;
 	nf->nf_mark = NULL;
+	nf->nf_dio_mem_align = 0;
+	nf->nf_dio_offset_align = 0;
+	nf->nf_dio_read_offset_align = 0;
 	return nf;
 }
 
@@ -1048,6 +1051,33 @@ nfsd_file_is_cached(struct inode *inode)
 	return ret;
 }
 
+static __be32
+nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
+{
+	struct inode *inode = file_inode(nf->nf_file);
+	struct kstat stat;
+	__be32 status;
+
+	/* Currently only need to get DIO alignment info for regular files */
+	if (!S_ISREG(inode->i_mode))
+		return nfs_ok;
+
+	status = fh_getattr(fhp, &stat);
+	if (status != nfs_ok)
+		return status;
+
+	if (stat.result_mask & STATX_DIOALIGN) {
+		nf->nf_dio_mem_align = stat.dio_mem_align;
+		nf->nf_dio_offset_align = stat.dio_offset_align;
+	}
+	if (stat.result_mask & STATX_DIO_READ_ALIGN)
+		nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
+	else
+		nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
+
+	return status;
+}
+
 static __be32
 nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
 		     struct svc_cred *cred,
@@ -1166,6 +1196,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
 			}
 			status = nfserrno(ret);
 			trace_nfsd_file_open(nf, status);
+			if (status == nfs_ok)
+				status = nfsd_file_getattr(fhp, nf);
 		}
 	} else
 		status = nfserr_jukebox;
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index 24ddf60e8434a..e3d6ca2b60308 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -54,6 +54,10 @@ struct nfsd_file {
 	struct list_head	nf_gc;
 	struct rcu_head		nf_rcu;
 	ktime_t			nf_birthtime;
+
+	u32			nf_dio_mem_align;
+	u32			nf_dio_offset_align;
+	u32			nf_dio_read_offset_align;
 };
 
 int nfsd_file_cache_init(void);
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index f4a3cc9e31e05..bdba2ba828a6a 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -677,8 +677,12 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
 		.mnt		= fhp->fh_export->ex_path.mnt,
 		.dentry		= fhp->fh_dentry,
 	};
+	struct inode *inode = d_inode(p.dentry);
 	u32 request_mask = STATX_BASIC_STATS;
 
+	if (S_ISREG(inode->i_mode))
+		request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
+
 	if (fhp->fh_maxsize == NFS4_FHSIZE)
 		request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 2/7] NFSD: pass nfsd_file to nfsd_iter_read()
  2025-08-07 16:25 [PATCH v5 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
  2025-08-07 16:25 ` [PATCH v5 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
@ 2025-08-07 16:25 ` Mike Snitzer
  2025-08-08 11:51   ` Jeff Layton
  2025-08-07 16:25 ` [PATCH v5 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2025-08-07 16:25 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Prepares for nfsd_iter_read() to use DIO alignment stored in nfsd_file.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/nfs4xdr.c | 8 ++++----
 fs/nfsd/vfs.c     | 7 ++++---
 fs/nfsd/vfs.h     | 2 +-
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7d19925f46e45..d519f4156cfad 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4464,7 +4464,7 @@ static __be32 nfsd4_encode_splice_read(
 
 static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
 				 struct nfsd4_read *read,
-				 struct file *file, unsigned long maxcount)
+				 unsigned long maxcount)
 {
 	struct xdr_stream *xdr = resp->xdr;
 	unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
@@ -4475,7 +4475,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
 	if (xdr_reserve_space_vec(xdr, maxcount) < 0)
 		return nfserr_resource;
 
-	nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
+	nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
 				read->rd_offset, &maxcount, base,
 				&read->rd_eof);
 	read->rd_length = maxcount;
@@ -4522,7 +4522,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
 	if (file->f_op->splice_read && splice_ok)
 		nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
 	else
-		nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+		nfserr = nfsd4_encode_readv(resp, read, maxcount);
 	if (nfserr) {
 		xdr_truncate_encode(xdr, eof_offset);
 		return nfserr;
@@ -5418,7 +5418,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
 	if (file->f_op->splice_read && splice_ok)
 		nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
 	else
-		nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+		nfserr = nfsd4_encode_readv(resp, read, maxcount);
 	if (nfserr)
 		return nfserr;
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0c0f25b2c8e38..79439ad93880a 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1075,7 +1075,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  * nfsd_iter_read - Perform a VFS read using an iterator
  * @rqstp: RPC transaction context
  * @fhp: file handle of file to be read
- * @file: opened struct file of file to be read
+ * @nf: opened struct nfsd_file of file to be read
  * @offset: starting byte offset
  * @count: IN: requested number of bytes; OUT: number of bytes read
  * @base: offset in first page of read buffer
@@ -1088,9 +1088,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  * returned.
  */
 __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
-		      struct file *file, loff_t offset, unsigned long *count,
+		      struct nfsd_file *nf, loff_t offset, unsigned long *count,
 		      unsigned int base, u32 *eof)
 {
+	struct file *file = nf->nf_file;
 	unsigned long v, total;
 	struct iov_iter iter;
 	struct kiocb kiocb;
@@ -1312,7 +1313,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
 		err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
 	else
-		err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
+		err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
 
 	nfsd_file_put(nf);
 	trace_nfsd_read_done(rqstp, fhp, offset, *count);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 0c0292611c6de..fa46f8b5f1320 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -121,7 +121,7 @@ __be32		nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				unsigned long *count,
 				u32 *eof);
 __be32		nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
-				struct file *file, loff_t offset,
+				struct nfsd_file *nf, loff_t offset,
 				unsigned long *count, unsigned int base,
 				u32 *eof);
 bool		nfsd_read_splice_ok(struct svc_rqst *rqstp);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 3/7] NFSD: add io_cache_read controls to debugfs interface
  2025-08-07 16:25 [PATCH v5 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
  2025-08-07 16:25 ` [PATCH v5 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
  2025-08-07 16:25 ` [PATCH v5 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
@ 2025-08-07 16:25 ` Mike Snitzer
  2025-08-08 12:05   ` Jeff Layton
                     ` (2 more replies)
  2025-08-07 16:25 ` [PATCH v5 4/7] NFSD: add io_cache_write " Mike Snitzer
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 25+ messages in thread
From: Mike Snitzer @ 2025-08-07 16:25 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Add 'io_cache_read' to NFSD's debugfs interface so that: Any data
read by NFSD will either be:
- cached using page cache (NFSD_IO_BUFFERED=1)
- cached but removed from the page cache upon completion
  (NFSD_IO_DONTCACHE=2).
- not cached (NFSD_IO_DIRECT=3)

io_cache_read may be set by writing to:
  /sys/kernel/debug/nfsd/io_cache_read

If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
advertised as supported by the underlying filesystem (e.g. XFS),
otherwise all IO flagged with RWF_DONTCACHE will fail with
-EOPNOTSUPP.

If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
relative to the underlying block device's logical_block_size. Also the
memory buffer used to store the read must be aligned relative to the
underlying block device's dma_alignment.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/debugfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/nfsd/nfsd.h    |  9 ++++++++
 fs/nfsd/vfs.c     | 19 +++++++++++++---
 3 files changed, 83 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 84b0c8b559dc9..c07f71d4e84f4 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -27,11 +27,66 @@ static int nfsd_dsr_get(void *data, u64 *val)
 static int nfsd_dsr_set(void *data, u64 val)
 {
 	nfsd_disable_splice_read = (val > 0) ? true : false;
+	if (!nfsd_disable_splice_read) {
+		/*
+		 * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
+		 * if splice_read is enabled.
+		 */
+		nfsd_io_cache_read = NFSD_IO_BUFFERED;
+	}
 	return 0;
 }
 
 DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
 
+/*
+ * /sys/kernel/debug/nfsd/io_cache_read
+ *
+ * Contents:
+ *   %1: NFS READ will use buffered IO
+ *   %2: NFS READ will use dontcache (buffered IO w/ dropbehind)
+ *   %3: NFS READ will use direct IO
+ *
+ * The default value of this setting is zero (UNSPECIFIED).
+ * This setting takes immediate effect for all NFS versions,
+ * all exports, and in all NFSD net namespaces.
+ */
+
+static int nfsd_io_cache_read_get(void *data, u64 *val)
+{
+	*val = nfsd_io_cache_read;
+	return 0;
+}
+
+static int nfsd_io_cache_read_set(void *data, u64 val)
+{
+	int ret = 0;
+
+	switch (val) {
+	case NFSD_IO_BUFFERED:
+		nfsd_io_cache_read = NFSD_IO_BUFFERED;
+		break;
+	case NFSD_IO_DONTCACHE:
+	case NFSD_IO_DIRECT:
+		/*
+		 * Must disable splice_read when enabling
+		 * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
+		 */
+		nfsd_disable_splice_read = true;
+		nfsd_io_cache_read = val;
+		break;
+	default:
+		nfsd_io_cache_read = NFSD_IO_UNSPECIFIED;
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
+			 nfsd_io_cache_read_set, "%llu\n");
+
 void nfsd_debugfs_exit(void)
 {
 	debugfs_remove_recursive(nfsd_top_dir);
@@ -44,4 +99,7 @@ void nfsd_debugfs_init(void)
 
 	debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
 			    nfsd_top_dir, NULL, &nfsd_dsr_fops);
+
+	debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
+			    nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
 }
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 1cd0bed57bc2f..6ef799405145f 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -153,6 +153,15 @@ static inline void nfsd_debugfs_exit(void) {}
 
 extern bool nfsd_disable_splice_read __read_mostly;
 
+enum {
+	NFSD_IO_UNSPECIFIED = 0,
+	NFSD_IO_BUFFERED,
+	NFSD_IO_DONTCACHE,
+	NFSD_IO_DIRECT,
+};
+
+extern u64 nfsd_io_cache_read __read_mostly;
+
 extern int nfsd_max_blksize;
 
 static inline int nfsd_v4client(struct svc_rqst *rq)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 79439ad93880a..26b6d96258711 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -49,6 +49,7 @@
 #define NFSDDBG_FACILITY		NFSDDBG_FILEOP
 
 bool nfsd_disable_splice_read __read_mostly;
+u64 nfsd_io_cache_read __read_mostly;
 
 /**
  * nfserrno - Map Linux errnos to NFS errnos
@@ -1099,17 +1100,29 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	size_t len;
 
 	init_sync_kiocb(&kiocb, file);
+
+	if (nfsd_io_cache_read == NFSD_IO_DIRECT) {
+		/* Verify ondisk DIO alignment, memory addrs checked below */
+		if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
+		    (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0))
+			kiocb.ki_flags = IOCB_DIRECT;
+	} else if (nfsd_io_cache_read == NFSD_IO_DONTCACHE)
+		kiocb.ki_flags = IOCB_DONTCACHE;
+
 	kiocb.ki_pos = offset;
 
 	v = 0;
 	total = *count;
 	while (total) {
 		len = min_t(size_t, total, PAGE_SIZE - base);
-		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
-			      len, base);
+		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++), len, base);
+		/* No need to verify memory is DIO-aligned since bv_offset is 0 */
+		if (unlikely((kiocb.ki_flags & IOCB_DIRECT) && base &&
+			     (base & (nf->nf_dio_mem_align - 1))))
+			kiocb.ki_flags &= ~IOCB_DIRECT;
 		total -= len;
-		++v;
 		base = 0;
+		v++;
 	}
 	WARN_ON_ONCE(v > rqstp->rq_maxpages);
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 4/7] NFSD: add io_cache_write controls to debugfs interface
  2025-08-07 16:25 [PATCH v5 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
                   ` (2 preceding siblings ...)
  2025-08-07 16:25 ` [PATCH v5 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
@ 2025-08-07 16:25 ` Mike Snitzer
  2025-08-08 12:03   ` Jeff Layton
  2025-08-08 17:58   ` Chuck Lever
  2025-08-07 16:25 ` [PATCH v5 5/7] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled Mike Snitzer
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 25+ messages in thread
From: Mike Snitzer @ 2025-08-07 16:25 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Add 'io_cache_write' to NFSD's debugfs interface so that: Any data
written by NFSD will either be:
- cached using page cache (NFSD_IO_BUFFERED=1)
- cached but removed from the page cache upon completion
  (NFSD_IO_DONTCACHE=2).
- not cached (NFSD_IO_DIRECT=3)

io_cache_write may be set by writing to:
  /sys/kernel/debug/nfsd/io_cache_write

If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
advertised as supported by the underlying filesystem (e.g. XFS),
otherwise all IO flagged with RWF_DONTCACHE will fail with
-EOPNOTSUPP.

If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
relative to the underlying block device's logical_block_size. Also the
memory buffer used to store the WRITE payload must be aligned relative
to the underlying block device's dma_alignment.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/debugfs.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 fs/nfsd/nfsd.h    |  1 +
 fs/nfsd/vfs.c     | 16 ++++++++++++++++
 3 files changed, 61 insertions(+)

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index c07f71d4e84f4..872de65f0e9ac 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -87,6 +87,47 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
 DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
 			 nfsd_io_cache_read_set, "%llu\n");
 
+/*
+ * /sys/kernel/debug/nfsd/io_cache_write
+ *
+ * Contents:
+ *   %1: NFS WRITE will use buffered IO
+ *   %2: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
+ *   %3: NFS WRITE will use direct IO
+ *
+ * The default value of this setting is zero (UNSPECIFIED).
+ * This setting takes immediate effect for all NFS versions,
+ * all exports, and in all NFSD net namespaces.
+ */
+
+static int nfsd_io_cache_write_get(void *data, u64 *val)
+{
+	*val = nfsd_io_cache_write;
+	return 0;
+}
+
+static int nfsd_io_cache_write_set(void *data, u64 val)
+{
+	int ret = 0;
+
+	switch (val) {
+	case NFSD_IO_BUFFERED:
+	case NFSD_IO_DONTCACHE:
+	case NFSD_IO_DIRECT:
+		nfsd_io_cache_write = val;
+		break;
+	default:
+		nfsd_io_cache_write = NFSD_IO_UNSPECIFIED;
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_write_fops, nfsd_io_cache_write_get,
+			 nfsd_io_cache_write_set, "%llu\n");
+
 void nfsd_debugfs_exit(void)
 {
 	debugfs_remove_recursive(nfsd_top_dir);
@@ -102,4 +143,7 @@ void nfsd_debugfs_init(void)
 
 	debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
 			    nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
+
+	debugfs_create_file("io_cache_write", S_IWUSR | S_IRUGO,
+			    nfsd_top_dir, NULL, &nfsd_io_cache_write_fops);
 }
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 6ef799405145f..fe935b4cda538 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -161,6 +161,7 @@ enum {
 };
 
 extern u64 nfsd_io_cache_read __read_mostly;
+extern u64 nfsd_io_cache_write __read_mostly;
 
 extern int nfsd_max_blksize;
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 26b6d96258711..5768244c7a3c3 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -50,6 +50,7 @@
 
 bool nfsd_disable_splice_read __read_mostly;
 u64 nfsd_io_cache_read __read_mostly;
+u64 nfsd_io_cache_write __read_mostly;
 
 /**
  * nfserrno - Map Linux errnos to NFS errnos
@@ -1234,6 +1235,21 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
 	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
+	switch (nfsd_io_cache_write) {
+	case NFSD_IO_DIRECT:
+		/* direct I/O must be aligned to device logical sector size */
+		if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
+		    (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0))
+			kiocb.ki_flags |= IOCB_DIRECT;
+		break;
+	case NFSD_IO_DONTCACHE:
+		kiocb.ki_flags |= IOCB_DONTCACHE;
+		break;
+	case NFSD_IO_BUFFERED:
+		break;
+	}
+
 	since = READ_ONCE(file->f_wb_err);
 	if (verf)
 		nfsd_copy_write_verifier(verf, nn);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 5/7] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled
  2025-08-07 16:25 [PATCH v5 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
                   ` (3 preceding siblings ...)
  2025-08-07 16:25 ` [PATCH v5 4/7] NFSD: add io_cache_write " Mike Snitzer
@ 2025-08-07 16:25 ` Mike Snitzer
  2025-08-08 12:05   ` Jeff Layton
  2025-08-08 17:59   ` Chuck Lever
  2025-08-07 16:25 ` [PATCH v5 6/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
  2025-08-07 16:25 ` [PATCH v5 7/7] NFSD: issue WRITEs " Mike Snitzer
  6 siblings, 2 replies; 25+ messages in thread
From: Mike Snitzer @ 2025-08-07 16:25 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/filecache.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 5447dba6c5da0..5601e839a72da 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -1058,8 +1058,12 @@ nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
 	struct kstat stat;
 	__be32 status;
 
-	/* Currently only need to get DIO alignment info for regular files */
-	if (!S_ISREG(inode->i_mode))
+	/* Currently only need to get DIO alignment info for regular files
+	 * IFF NFSD_IO_DIRECT is enabled for nfsd_io_cache_{read,write}.
+	 */
+	if (!S_ISREG(inode->i_mode) ||
+	    (nfsd_io_cache_read != NFSD_IO_DIRECT &&
+	     nfsd_io_cache_write != NFSD_IO_DIRECT))
 		return nfs_ok;
 
 	status = fh_getattr(fhp, &stat);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 6/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
  2025-08-07 16:25 [PATCH v5 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
                   ` (4 preceding siblings ...)
  2025-08-07 16:25 ` [PATCH v5 5/7] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled Mike Snitzer
@ 2025-08-07 16:25 ` Mike Snitzer
  2025-08-08 12:16   ` Jeff Layton
  2025-08-08 17:59   ` Chuck Lever
  2025-08-07 16:25 ` [PATCH v5 7/7] NFSD: issue WRITEs " Mike Snitzer
  6 siblings, 2 replies; 25+ messages in thread
From: Mike Snitzer @ 2025-08-07 16:25 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
DIO-aligned block (on either end of the READ). The expanded READ is
verified to have proper offset/len (logical_block_size) and
dma_alignment checking.

Must allocate and use a bounce-buffer page (called 'start_extra_page')
if/when expanding the misaligned READ requires reading extra partial
page at the start of the READ so that its DIO-aligned. Otherwise that
extra page at the start will make its way back to the NFS client and
corruption will occur. As found, and then this fix of using an extra
page verified, using the 'dt' utility:
  dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
     iotype=sequential pattern=iot onerr=abort oncerr=abort
see: https://github.com/RobinTMiller/dt.git

Any misaligned READ that is less than 32K won't be expanded to be
DIO-aligned (this heuristic just avoids excess work, like allocating
start_extra_page, for smaller IO that can generally already perform
well using buffered IO).

Also add EVENT_CLASS nfsd_analyze_dio_class and use it to create
nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events. This
prepares for nfsd_vfs_write() to also make use of it when handling
misaligned WRITEs.

This combination of trace events is useful:

 echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
 echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
 echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
 echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable

Which for this dd command:

 dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct

Results in:

  nfsd-23908   [010] ..... 10375.141640: nfsd_analyze_read_dio: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+47008 end=47008+96
  nfsd-23908   [010] ..... 10375.141642: nfsd_read_vector: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47104
  nfsd-23908   [010] ..... 10375.141643: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0x0 bytecount 0xb800
  nfsd-23908   [010] ..... 10375.141773: nfsd_read_io_done: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008

  nfsd-23908   [010] ..... 10375.142063: nfsd_analyze_read_dio: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=46592+416 middle=47008+47008 end=94016+192
  nfsd-23908   [010] ..... 10375.142064: nfsd_read_vector: xid=0x83c5923b fh_hash=0x857ca4fc offset=46592 len=47616
  nfsd-23908   [010] ..... 10375.142065: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0xb600 bytecount 0xba00
  nfsd-23908   [010] ..... 10375.142103: nfsd_read_io_done: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008

Suggested-by: Jeff Layton <jlayton@kernel.org>
Suggested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/trace.h            |  61 ++++++++++++++
 fs/nfsd/vfs.c              | 167 ++++++++++++++++++++++++++++++++++---
 include/linux/sunrpc/svc.h |   5 +-
 3 files changed, 221 insertions(+), 12 deletions(-)

diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index a664fdf1161e9..4173bd9344b6b 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -473,6 +473,67 @@ DEFINE_NFSD_IO_EVENT(write_done);
 DEFINE_NFSD_IO_EVENT(commit_start);
 DEFINE_NFSD_IO_EVENT(commit_done);
 
+DECLARE_EVENT_CLASS(nfsd_analyze_dio_class,
+	TP_PROTO(struct svc_rqst *rqstp,
+		 struct svc_fh	*fhp,
+		 u64		offset,
+		 u32		len,
+		 loff_t		start,
+		 ssize_t	start_len,
+		 loff_t		middle,
+		 ssize_t	middle_len,
+		 loff_t		end,
+		 ssize_t	end_len),
+	TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len),
+	TP_STRUCT__entry(
+		__field(u32, xid)
+		__field(u32, fh_hash)
+		__field(u64, offset)
+		__field(u32, len)
+		__field(loff_t, start)
+		__field(ssize_t, start_len)
+		__field(loff_t, middle)
+		__field(ssize_t, middle_len)
+		__field(loff_t, end)
+		__field(ssize_t, end_len)
+	),
+	TP_fast_assign(
+		__entry->xid = be32_to_cpu(rqstp->rq_xid);
+		__entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
+		__entry->offset = offset;
+		__entry->len = len;
+		__entry->start = start;
+		__entry->start_len = start_len;
+		__entry->middle = middle;
+		__entry->middle_len = middle_len;
+		__entry->end = end;
+		__entry->end_len = end_len;
+	),
+	TP_printk("xid=0x%08x fh_hash=0x%08x offset=%llu len=%u start=%llu+%lu middle=%llu+%lu end=%llu+%lu",
+		  __entry->xid, __entry->fh_hash,
+		  __entry->offset, __entry->len,
+		  __entry->start, __entry->start_len,
+		  __entry->middle, __entry->middle_len,
+		  __entry->end, __entry->end_len)
+)
+
+#define DEFINE_NFSD_ANALYZE_DIO_EVENT(name)			\
+DEFINE_EVENT(nfsd_analyze_dio_class, nfsd_analyze_##name##_dio,	\
+	TP_PROTO(struct svc_rqst *rqstp,			\
+		 struct svc_fh	*fhp,				\
+		 u64		offset,				\
+		 u32		len,				\
+		 loff_t		start,				\
+		 ssize_t	start_len,			\
+		 loff_t		middle,				\
+		 ssize_t	middle_len,			\
+		 loff_t		end,				\
+		 ssize_t	end_len),			\
+	TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len))
+
+DEFINE_NFSD_ANALYZE_DIO_EVENT(read);
+DEFINE_NFSD_ANALYZE_DIO_EVENT(write);
+
 DECLARE_EVENT_CLASS(nfsd_err_class,
 	TP_PROTO(struct svc_rqst *rqstp,
 		 struct svc_fh	*fhp,
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5768244c7a3c3..be083a8812717 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -19,6 +19,7 @@
 #include <linux/splice.h>
 #include <linux/falloc.h>
 #include <linux/fcntl.h>
+#include <linux/math.h>
 #include <linux/namei.h>
 #include <linux/delay.h>
 #include <linux/fsnotify.h>
@@ -1073,6 +1074,116 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
 }
 
+struct nfsd_read_dio {
+	loff_t start;
+	loff_t end;
+	unsigned long start_extra;
+	unsigned long end_extra;
+	struct page *start_extra_page;
+};
+
+static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
+{
+	memset(read_dio, 0, sizeof(*read_dio));
+	read_dio->start_extra_page = NULL;
+}
+
+static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+				  struct nfsd_file *nf, loff_t offset,
+				  unsigned long len, unsigned int base,
+				  struct nfsd_read_dio *read_dio)
+{
+	const u32 dio_blocksize = nf->nf_dio_read_offset_align;
+	loff_t middle_end, orig_end = offset + len;
+
+	if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
+		      "%s: underlying filesystem has not provided DIO alignment info\n",
+		      __func__))
+		return false;
+	if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
+		      "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
+		      __func__, dio_blocksize, PAGE_SIZE))
+		return false;
+
+	/* Return early if IO is irreparably misaligned
+	 * (len < PAGE_SIZE, or base not aligned).
+	 */
+	if (unlikely(len < dio_blocksize) ||
+	    ((base & (nf->nf_dio_mem_align-1)) != 0))
+		return false;
+
+	read_dio->start = round_down(offset, dio_blocksize);
+	read_dio->end = round_up(orig_end, dio_blocksize);
+	read_dio->start_extra = offset - read_dio->start;
+	read_dio->end_extra = read_dio->end - orig_end;
+
+	/* don't expand READ for IO less than 32K */
+	if ((read_dio->start_extra || read_dio->end_extra) && (len < (32 << 10))) {
+		init_nfsd_read_dio(read_dio);
+		return false;
+	}
+
+	if (read_dio->start_extra) {
+		read_dio->start_extra_page = alloc_page(GFP_KERNEL);
+		if (WARN_ONCE(read_dio->start_extra_page == NULL,
+			      "%s: Unable to allocate start_extra_page\n", __func__)) {
+			init_nfsd_read_dio(read_dio);
+			return false;
+		}
+	}
+
+	/* Show original offset and count, and how it was expanded for DIO */
+	middle_end = read_dio->end - read_dio->end_extra;
+	trace_nfsd_analyze_read_dio(rqstp, fhp, offset, len,
+				    read_dio->start, read_dio->start_extra,
+				    offset, (middle_end - offset),
+				    middle_end, read_dio->end_extra);
+	return true;
+}
+
+static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
+						 struct nfsd_read_dio *read_dio,
+						 ssize_t bytes_read,
+						 unsigned long bytes_expected,
+						 loff_t *offset,
+						 unsigned long *rq_bvec_numpages)
+{
+	ssize_t host_err = bytes_read;
+	loff_t v;
+
+	/* If nfsd_analyze_read_dio() allocated a start_extra_page it must
+	 * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
+	 */
+	if (read_dio->start_extra_page) {
+		__free_page(read_dio->start_extra_page);
+		*rq_bvec_numpages -= 1;
+		v = *rq_bvec_numpages;
+		memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
+			v * sizeof(struct bio_vec));
+	}
+	/* Eliminate any end_extra bytes from the last page */
+	v = *rq_bvec_numpages;
+	rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
+
+	if (host_err < 0)
+		return host_err;
+
+	/* nfsd_analyze_read_dio() may have expanded the start and end,
+	 * if so adjust returned read size to reflect original extent.
+	 */
+	*offset += read_dio->start_extra;
+	if (likely(host_err >= read_dio->start_extra)) {
+		host_err -= read_dio->start_extra;
+		if (host_err > bytes_expected)
+			host_err = bytes_expected;
+	} else {
+		/* Short read that didn't read any of requested data */
+		host_err = 0;
+	}
+
+	return host_err;
+}
+
 /**
  * nfsd_iter_read - Perform a VFS read using an iterator
  * @rqstp: RPC transaction context
@@ -1094,26 +1205,49 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		      unsigned int base, u32 *eof)
 {
 	struct file *file = nf->nf_file;
-	unsigned long v, total;
+	unsigned long v, total, in_count = *count;
+	struct nfsd_read_dio read_dio;
 	struct iov_iter iter;
 	struct kiocb kiocb;
-	ssize_t host_err;
+	ssize_t host_err = 0;
 	size_t len;
 
+	init_nfsd_read_dio(&read_dio);
 	init_sync_kiocb(&kiocb, file);
 
+	/*
+	 * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
+	 * the next DIO-aligned block (on either end of the READ).
+	 */
 	if (nfsd_io_cache_read == NFSD_IO_DIRECT) {
-		/* Verify ondisk DIO alignment, memory addrs checked below */
-		if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
-		    (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0))
-			kiocb.ki_flags = IOCB_DIRECT;
+		if (nfsd_analyze_read_dio(rqstp, fhp, nf, offset,
+					  in_count, base, &read_dio)) {
+			/* trace_nfsd_read_vector() will reflect larger
+			 * DIO-aligned READ.
+			 */
+			offset = read_dio.start;
+			in_count = read_dio.end - offset;
+			/* Verify ondisk DIO alignment, memory addrs checked below */
+			if (likely(((offset | in_count) &
+				    (nf->nf_dio_read_offset_align - 1)) == 0))
+				kiocb.ki_flags = IOCB_DIRECT;
+		}
 	} else if (nfsd_io_cache_read == NFSD_IO_DONTCACHE)
 		kiocb.ki_flags = IOCB_DONTCACHE;
 
 	kiocb.ki_pos = offset;
 
 	v = 0;
-	total = *count;
+	total = in_count;
+	if (read_dio.start_extra) {
+		bvec_set_page(&rqstp->rq_bvec[v], read_dio.start_extra_page,
+			      read_dio.start_extra, PAGE_SIZE - read_dio.start_extra);
+		if (unlikely((kiocb.ki_flags & IOCB_DIRECT) &&
+			     rqstp->rq_bvec[v].bv_offset & (nf->nf_dio_mem_align - 1)))
+			kiocb.ki_flags &= ~IOCB_DIRECT;
+		total -= read_dio.start_extra;
+		v++;
+	}
 	while (total) {
 		len = min_t(size_t, total, PAGE_SIZE - base);
 		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++), len, base);
@@ -1125,11 +1259,22 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		base = 0;
 		v++;
 	}
-	WARN_ON_ONCE(v > rqstp->rq_maxpages);
+	if (WARN_ONCE(v > rqstp->rq_maxpages,
+		      "%s: v=%lu exceeds rqstp->rq_maxpages=%lu\n", __func__,
+		      v, rqstp->rq_maxpages)) {
+		host_err = -EINVAL;
+	}
 
-	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
-	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
-	host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
+	if (!host_err) {
+		trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
+		iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
+		host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
+	}
+
+	if (read_dio.start_extra || read_dio.end_extra) {
+		host_err = nfsd_complete_misaligned_read_dio(rqstp, &read_dio,
+					host_err, *count, &offset, &v);
+	}
 	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
 }
 
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index e64ab444e0a7f..190c2667500e2 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
  * pages, one for the request, and one for the reply.
  * nfsd_splice_actor() might need an extra page when a READ payload
  * is not page-aligned.
+ * nfsd_iter_read() might need two extra pages when a READ payload
+ * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
+ * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
  */
 static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
 {
-	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
+	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
 }
 
 /*
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 7/7] NFSD: issue WRITEs using O_DIRECT even if IO is misaligned
  2025-08-07 16:25 [PATCH v5 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
                   ` (5 preceding siblings ...)
  2025-08-07 16:25 ` [PATCH v5 6/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
@ 2025-08-07 16:25 ` Mike Snitzer
  2025-08-08  2:30   ` Mike Snitzer
  2025-08-08 12:20   ` Jeff Layton
  6 siblings, 2 replies; 25+ messages in thread
From: Mike Snitzer @ 2025-08-07 16:25 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
middle and end as needed. The large middle extent is DIO-aligned and
the start and/or end are misaligned. Buffered IO is used for the
misaligned extents and O_DIRECT is used for the middle DIO-aligned
extent.

The nfsd_analyze_write_dio trace event shows how NFSD splits a given
misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
extent.

This combination of trace events is useful:

  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
  echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable

Which for this dd command:

  dd if=/dev/zero of=/mnt/share1/test bs=47008 count=2 oflag=direct

Results in:

  nfsd-23908   [010] ..... 10374.902333: nfsd_write_opened: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008
  nfsd-23908   [010] ..... 10374.902335: nfsd_analyze_write_dio: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+46592 end=46592+416
  nfsd-23908   [010] ..... 10374.902343: xfs_file_direct_write: dev 259:2 ino 0xc00116 disize 0x0 pos 0x0 bytecount 0xb600
  nfsd-23908   [010] ..... 10374.902697: nfsd_write_io_done: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008

  nfsd-23908   [010] ..... 10374.902925: nfsd_write_opened: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008
  nfsd-23908   [010] ..... 10374.902926: nfsd_analyze_write_dio: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=47008+96 middle=47104+46592 end=93696+320
  nfsd-23908   [010] ..... 10374.903010: xfs_file_direct_write: dev 259:2 ino 0xc00116 disize 0xb800 pos 0xb800 bytecount 0xb600
  nfsd-23908   [010] ..... 10374.903239: nfsd_write_io_done: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/vfs.c | 183 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 170 insertions(+), 13 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index be083a8812717..1b5aa3e6e6623 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1315,6 +1315,167 @@ static int wait_for_concurrent_writes(struct file *file)
 	return err;
 }
 
+struct nfsd_write_dio {
+	loff_t middle_offset;	/* Offset for start of DIO-aligned middle */
+	loff_t end_offset;	/* Offset for start of DIO-aligned end */
+	ssize_t	start_len;	/* Length for misaligned first extent */
+	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
+	ssize_t	end_len;	/* Length for misaligned last extent */
+};
+
+static bool
+nfsd_analyze_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		       struct nfsd_file *nf, loff_t offset,
+		       unsigned long len, struct nfsd_write_dio *write_dio)
+{
+	const u32 dio_blocksize = nf->nf_dio_offset_align;
+	loff_t orig_end, middle_end, start_end, start_offset = offset;
+	ssize_t start_len = len;
+
+	if (WARN_ONCE(!nf->nf_dio_mem_align || !dio_blocksize,
+		      "%s: underlying filesystem has not provided DIO alignment info\n",
+		      __func__))
+		return false;
+	if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
+		      "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
+		      __func__, dio_blocksize, PAGE_SIZE))
+		return false;
+	if (unlikely(len < dio_blocksize))
+		return false;
+
+	memset(write_dio, 0, sizeof(*write_dio));
+
+	if (((offset | len) & (dio_blocksize-1)) == 0) {
+		/* already DIO-aligned, no misaligned head or tail */
+		write_dio->middle_offset = offset;
+		write_dio->middle_len = len;
+		/* clear these for the benefit of trace_nfsd_analyze_write_dio */
+		start_offset = 0;
+		start_len = 0;
+		goto out;
+	}
+
+	start_end = round_up(offset, dio_blocksize);
+	start_len = start_end - offset;
+	orig_end = offset + len;
+	middle_end = round_down(orig_end, dio_blocksize);
+
+	write_dio->start_len = start_len;
+	write_dio->middle_offset = start_end;
+	write_dio->middle_len = middle_end - start_end;
+	write_dio->end_offset = middle_end;
+	write_dio->end_len = orig_end - middle_end;
+out:
+	trace_nfsd_analyze_write_dio(rqstp, fhp, offset, len, start_offset, start_len,
+				     write_dio->middle_offset, write_dio->middle_len,
+				     write_dio->end_offset, write_dio->end_len);
+	return true;
+}
+
+/*
+ * Setup as many as 3 iov_iter based on extents described by @write_dio.
+ * @iterp: pointer to pointer to onstack array of 3 iov_iter structs from caller.
+ * @iter_is_dio_aligned: pointer to onstack array of 3 bools from caller.
+ * @rq_bvec: backing bio_vec used to setup all 3 iov_iter permutations.
+ * @nvecs: number of segments in @rq_bvec
+ * @cnt: size of the request in bytes
+ * @write_dio: nfsd_write_dio struct that describes start, middle and end extents.
+ *
+ * Returns the number of iov_iter that were setup.
+ */
+static int
+nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
+			   struct bio_vec *rq_bvec, unsigned int nvecs,
+			   unsigned long cnt, struct nfsd_write_dio *write_dio)
+{
+	int n_iters = 0;
+	struct iov_iter *iters = *iterp;
+
+	/* Setup misaligned start? */
+	if (write_dio->start_len) {
+		iter_is_dio_aligned[n_iters] = false;
+		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+		iters[n_iters].count = write_dio->start_len;
+		n_iters++;
+	}
+
+	/* Setup DIO-aligned middle */
+	iter_is_dio_aligned[n_iters] = true;
+	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+	if (write_dio->start_len)
+		iov_iter_advance(&iters[n_iters], write_dio->start_len);
+	iters[n_iters].count -= write_dio->end_len;
+	n_iters++;
+
+	/* Setup misaligned end? */
+	if (write_dio->end_len) {
+		iter_is_dio_aligned[n_iters] = false;
+		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+		iov_iter_advance(&iters[n_iters],
+				 write_dio->start_len + write_dio->middle_len);
+		n_iters++;
+	}
+
+	return n_iters;
+}
+
+static int
+nfsd_issue_write_buffered(struct svc_rqst *rqstp, struct file *file,
+			  unsigned int nvecs, unsigned long *cnt,
+			  struct kiocb *kiocb)
+{
+	struct iov_iter iter;
+	int host_err;
+
+	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
+	if (host_err < 0)
+		return host_err;
+	*cnt = host_err;
+
+	return 0;
+}
+
+static noinline int
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		     struct nfsd_file *nf, loff_t offset,
+		     unsigned int nvecs, unsigned long *cnt,
+		     struct kiocb *kiocb)
+{
+	struct nfsd_write_dio write_dio;
+	struct file *file = nf->nf_file;
+
+	if (!nfsd_analyze_write_dio(rqstp, fhp, nf, offset,
+				    *cnt, &write_dio)) {
+		return nfsd_issue_write_buffered(rqstp, file,
+					nvecs, cnt, kiocb);
+	} else {
+		bool iter_is_dio_aligned[3];
+		struct iov_iter iter_stack[3];
+		struct iov_iter *iter = iter_stack;
+		unsigned int n_iters = 0;
+		int host_err;
+
+		n_iters = nfsd_setup_write_dio_iters(&iter,
+				iter_is_dio_aligned, rqstp->rq_bvec,
+				nvecs, *cnt, &write_dio);
+		*cnt = 0;
+		for (int i = 0; i < n_iters; i++) {
+			if (iter_is_dio_aligned[i])
+				kiocb->ki_flags |= IOCB_DIRECT;
+			else
+				kiocb->ki_flags &= ~IOCB_DIRECT;
+			host_err = vfs_iocb_iter_write(file, kiocb,
+						       &iter[i]);
+			if (host_err < 0)
+				return host_err;
+			*cnt += host_err;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * nfsd_vfs_write - write data to an already-open file
  * @rqstp: RPC execution context
@@ -1342,7 +1503,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	struct super_block	*sb = file_inode(file)->i_sb;
 	struct kiocb		kiocb;
 	struct svc_export	*exp;
-	struct iov_iter		iter;
 	errseq_t		since;
 	__be32			nfserr;
 	int			host_err;
@@ -1379,31 +1539,28 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		kiocb.ki_flags |= IOCB_DSYNC;
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
-	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
+	since = READ_ONCE(file->f_wb_err);
+	if (verf)
+		nfsd_copy_write_verifier(verf, nn);
 
 	switch (nfsd_io_cache_write) {
 	case NFSD_IO_DIRECT:
-		/* direct I/O must be aligned to device logical sector size */
-		if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
-		    (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0))
-			kiocb.ki_flags |= IOCB_DIRECT;
+		host_err = nfsd_issue_write_dio(rqstp, fhp, nf, offset,
+						nvecs, cnt, &kiocb);
 		break;
 	case NFSD_IO_DONTCACHE:
 		kiocb.ki_flags |= IOCB_DONTCACHE;
-		break;
+		fallthrough;
 	case NFSD_IO_BUFFERED:
+		host_err = nfsd_issue_write_buffered(rqstp, file,
+						nvecs, cnt, &kiocb);
 		break;
 	}
-
-	since = READ_ONCE(file->f_wb_err);
-	if (verf)
-		nfsd_copy_write_verifier(verf, nn);
-	host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
 	if (host_err < 0) {
 		commit_reset_write_verifier(nn, rqstp, host_err);
 		goto out_nfserr;
 	}
-	*cnt = host_err;
 	nfsd_stats_io_write_add(nn, exp, *cnt);
 	fsnotify_modify(file);
 	host_err = filemap_check_wb_err(file->f_mapping, since);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 7/7] NFSD: issue WRITEs using O_DIRECT even if IO is misaligned
  2025-08-07 16:25 ` [PATCH v5 7/7] NFSD: issue WRITEs " Mike Snitzer
@ 2025-08-08  2:30   ` Mike Snitzer
  2025-08-08 14:21     ` Chuck Lever
  2025-08-08 12:20   ` Jeff Layton
  1 sibling, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2025-08-08  2:30 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

On Thu, Aug 07, 2025 at 12:25:44PM -0400, Mike Snitzer wrote:
> If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> middle and end as needed. The large middle extent is DIO-aligned and
> the start and/or end are misaligned. Buffered IO is used for the
> misaligned extents and O_DIRECT is used for the middle DIO-aligned
> extent.
> 
> The nfsd_analyze_write_dio trace event shows how NFSD splits a given
> misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
> extent.
> 
> This combination of trace events is useful:
> 
>   echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
>   echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
>   echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
>   echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
> 
> Which for this dd command:
> 
>   dd if=/dev/zero of=/mnt/share1/test bs=47008 count=2 oflag=direct
> 
> Results in:
> 
>   nfsd-23908   [010] ..... 10374.902333: nfsd_write_opened: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008
>   nfsd-23908   [010] ..... 10374.902335: nfsd_analyze_write_dio: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+46592 end=46592+416
>   nfsd-23908   [010] ..... 10374.902343: xfs_file_direct_write: dev 259:2 ino 0xc00116 disize 0x0 pos 0x0 bytecount 0xb600
>   nfsd-23908   [010] ..... 10374.902697: nfsd_write_io_done: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008
> 
>   nfsd-23908   [010] ..... 10374.902925: nfsd_write_opened: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008
>   nfsd-23908   [010] ..... 10374.902926: nfsd_analyze_write_dio: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=47008+96 middle=47104+46592 end=93696+320
>   nfsd-23908   [010] ..... 10374.903010: xfs_file_direct_write: dev 259:2 ino 0xc00116 disize 0xb800 pos 0xb800 bytecount 0xb600
>   nfsd-23908   [010] ..... 10374.903239: nfsd_write_io_done: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/vfs.c | 183 ++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 170 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index be083a8812717..1b5aa3e6e6623 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1315,6 +1315,167 @@ static int wait_for_concurrent_writes(struct file *file)
>  	return err;
>  }
>  
> +struct nfsd_write_dio {
> +	loff_t middle_offset;	/* Offset for start of DIO-aligned middle */
> +	loff_t end_offset;	/* Offset for start of DIO-aligned end */
> +	ssize_t	start_len;	/* Length for misaligned first extent */
> +	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
> +	ssize_t	end_len;	/* Length for misaligned last extent */
> +};
> +
> +static bool
> +nfsd_analyze_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		       struct nfsd_file *nf, loff_t offset,
> +		       unsigned long len, struct nfsd_write_dio *write_dio)
> +{
> +	const u32 dio_blocksize = nf->nf_dio_offset_align;
> +	loff_t orig_end, middle_end, start_end, start_offset = offset;
> +	ssize_t start_len = len;
> +
> +	if (WARN_ONCE(!nf->nf_dio_mem_align || !dio_blocksize,
> +		      "%s: underlying filesystem has not provided DIO alignment info\n",
> +		      __func__))
> +		return false;
> +	if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> +		      "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> +		      __func__, dio_blocksize, PAGE_SIZE))
> +		return false;
> +	if (unlikely(len < dio_blocksize))
> +		return false;
> +
> +	memset(write_dio, 0, sizeof(*write_dio));
> +
> +	if (((offset | len) & (dio_blocksize-1)) == 0) {
> +		/* already DIO-aligned, no misaligned head or tail */
> +		write_dio->middle_offset = offset;
> +		write_dio->middle_len = len;
> +		/* clear these for the benefit of trace_nfsd_analyze_write_dio */
> +		start_offset = 0;
> +		start_len = 0;
> +		goto out;
> +	}
> +
> +	start_end = round_up(offset, dio_blocksize);
> +	start_len = start_end - offset;
> +	orig_end = offset + len;
> +	middle_end = round_down(orig_end, dio_blocksize);
> +
> +	write_dio->start_len = start_len;
> +	write_dio->middle_offset = start_end;
> +	write_dio->middle_len = middle_end - start_end;
> +	write_dio->end_offset = middle_end;
> +	write_dio->end_len = orig_end - middle_end;
> +out:
> +	trace_nfsd_analyze_write_dio(rqstp, fhp, offset, len, start_offset, start_len,
> +				     write_dio->middle_offset, write_dio->middle_len,
> +				     write_dio->end_offset, write_dio->end_len);
> +	return true;
> +}
> +
> +/*
> + * Setup as many as 3 iov_iter based on extents described by @write_dio.
> + * @iterp: pointer to pointer to onstack array of 3 iov_iter structs from caller.
> + * @iter_is_dio_aligned: pointer to onstack array of 3 bools from caller.
> + * @rq_bvec: backing bio_vec used to setup all 3 iov_iter permutations.
> + * @nvecs: number of segments in @rq_bvec
> + * @cnt: size of the request in bytes
> + * @write_dio: nfsd_write_dio struct that describes start, middle and end extents.
> + *
> + * Returns the number of iov_iter that were setup.
> + */
> +static int
> +nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
> +			   struct bio_vec *rq_bvec, unsigned int nvecs,
> +			   unsigned long cnt, struct nfsd_write_dio *write_dio)
> +{
> +	int n_iters = 0;
> +	struct iov_iter *iters = *iterp;
> +
> +	/* Setup misaligned start? */
> +	if (write_dio->start_len) {
> +		iter_is_dio_aligned[n_iters] = false;
> +		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> +		iters[n_iters].count = write_dio->start_len;
> +		n_iters++;
> +	}
> +
> +	/* Setup DIO-aligned middle */
> +	iter_is_dio_aligned[n_iters] = true;
> +	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> +	if (write_dio->start_len)
> +		iov_iter_advance(&iters[n_iters], write_dio->start_len);
> +	iters[n_iters].count -= write_dio->end_len;
> +	n_iters++;
> +
> +	/* Setup misaligned end? */
> +	if (write_dio->end_len) {
> +		iter_is_dio_aligned[n_iters] = false;
> +		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> +		iov_iter_advance(&iters[n_iters],
> +				 write_dio->start_len + write_dio->middle_len);
> +		n_iters++;
> +	}
> +
> +	return n_iters;
> +}
> +
> +static int
> +nfsd_issue_write_buffered(struct svc_rqst *rqstp, struct file *file,
> +			  unsigned int nvecs, unsigned long *cnt,
> +			  struct kiocb *kiocb)
> +{
> +	struct iov_iter iter;
> +	int host_err;
> +
> +	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> +	if (host_err < 0)
> +		return host_err;
> +	*cnt = host_err;
> +
> +	return 0;
> +}
> +
> +static noinline int
> +nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		     struct nfsd_file *nf, loff_t offset,
> +		     unsigned int nvecs, unsigned long *cnt,
> +		     struct kiocb *kiocb)
> +{
> +	struct nfsd_write_dio write_dio;
> +	struct file *file = nf->nf_file;
> +
> +	if (!nfsd_analyze_write_dio(rqstp, fhp, nf, offset,
> +				    *cnt, &write_dio)) {
> +		return nfsd_issue_write_buffered(rqstp, file,
> +					nvecs, cnt, kiocb);
> +	} else {
> +		bool iter_is_dio_aligned[3];
> +		struct iov_iter iter_stack[3];
> +		struct iov_iter *iter = iter_stack;
> +		unsigned int n_iters = 0;
> +		int host_err;
> +
> +		n_iters = nfsd_setup_write_dio_iters(&iter,
> +				iter_is_dio_aligned, rqstp->rq_bvec,
> +				nvecs, *cnt, &write_dio);
> +		*cnt = 0;
> +		for (int i = 0; i < n_iters; i++) {
> +			if (iter_is_dio_aligned[i])
> +				kiocb->ki_flags |= IOCB_DIRECT;
> +			else
> +				kiocb->ki_flags &= ~IOCB_DIRECT;
> +			host_err = vfs_iocb_iter_write(file, kiocb,
> +						       &iter[i]);
> +			if (host_err < 0)
> +				return host_err;
> +			*cnt += host_err;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  /**
>   * nfsd_vfs_write - write data to an already-open file
>   * @rqstp: RPC execution context
> @@ -1342,7 +1503,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	struct super_block	*sb = file_inode(file)->i_sb;
>  	struct kiocb		kiocb;
>  	struct svc_export	*exp;
> -	struct iov_iter		iter;
>  	errseq_t		since;
>  	__be32			nfserr;
>  	int			host_err;
> @@ -1379,31 +1539,28 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		kiocb.ki_flags |= IOCB_DSYNC;
>  
>  	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> -	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +
> +	since = READ_ONCE(file->f_wb_err);
> +	if (verf)
> +		nfsd_copy_write_verifier(verf, nn);
>  
>  	switch (nfsd_io_cache_write) {
>  	case NFSD_IO_DIRECT:
> -		/* direct I/O must be aligned to device logical sector size */
> -		if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
> -		    (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0))
> -			kiocb.ki_flags |= IOCB_DIRECT;
> +		host_err = nfsd_issue_write_dio(rqstp, fhp, nf, offset,
> +						nvecs, cnt, &kiocb);
>  		break;
>  	case NFSD_IO_DONTCACHE:
>  		kiocb.ki_flags |= IOCB_DONTCACHE;
> -		break;
> +		fallthrough;
>  	case NFSD_IO_BUFFERED:
> +		host_err = nfsd_issue_write_buffered(rqstp, file,
> +						nvecs, cnt, &kiocb);
>  		break;
>  	}
> -
> -	since = READ_ONCE(file->f_wb_err);
> -	if (verf)
> -		nfsd_copy_write_verifier(verf, nn);
> -	host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
>  	if (host_err < 0) {
>  		commit_reset_write_verifier(nn, rqstp, host_err);
>  		goto out_nfserr;
>  	}
> -	*cnt = host_err;
>  	nfsd_stats_io_write_add(nn, exp, *cnt);
>  	fsnotify_modify(file);
>  	host_err = filemap_check_wb_err(file->f_mapping, since);
> -- 
> 2.44.0
> 
> 

Embarrassingly, turns out I only tested the NFSD_IO_DIRECT case prior
to submitting this v5, if the 'nfsd_io_cache_write' debugfs file is
never written it defaults to NFSD_IO_UNSPECIFIED.  But even if that's
the case we need to treat NFSD_IO_UNSPECIFIED like NFSD_IO_BUFFERED.
(that is how nfsd_vfs_read behaves, but I missed this in
nfsd_vfs_write when I refactored the code for v5).

This incremental patch fixes this oversight, Chuck should I submit a
v8 or you're OK with folding this fixup (if the rest of the code is
OK)?

Thanks,
Mike

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1b5aa3e6e6623..b529754a20bd5 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1505,7 +1505,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	struct svc_export	*exp;
 	errseq_t		since;
 	__be32			nfserr;
-	int			host_err;
+	int			host_err = 0;
 	unsigned long		exp_op_flags = 0;
 	unsigned int		pflags = current->flags;
 	bool			restore_flags = false;
@@ -1552,6 +1552,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	case NFSD_IO_DONTCACHE:
 		kiocb.ki_flags |= IOCB_DONTCACHE;
 		fallthrough;
+	case NFSD_IO_UNSPECIFIED:
 	case NFSD_IO_BUFFERED:
 		host_err = nfsd_issue_write_buffered(rqstp, file,
 						nvecs, cnt, &kiocb);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  2025-08-07 16:25 ` [PATCH v5 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
@ 2025-08-08 11:49   ` Jeff Layton
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2025-08-08 11:49 UTC (permalink / raw)
  To: Mike Snitzer, Chuck Lever; +Cc: linux-nfs

On Thu, 2025-08-07 at 12:25 -0400, Mike Snitzer wrote:
> Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get and store DIO
> alignment attributes from underlying filesystem in associated
> nfsd_file.  This is done when the nfsd_file is first opened for
> a regular file.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/filecache.c | 32 ++++++++++++++++++++++++++++++++
>  fs/nfsd/filecache.h |  4 ++++
>  fs/nfsd/nfsfh.c     |  4 ++++
>  3 files changed, 40 insertions(+)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 8581c131338b8..5447dba6c5da0 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -231,6 +231,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
>  	refcount_set(&nf->nf_ref, 1);
>  	nf->nf_may = need;
>  	nf->nf_mark = NULL;
> +	nf->nf_dio_mem_align = 0;
> +	nf->nf_dio_offset_align = 0;
> +	nf->nf_dio_read_offset_align = 0;
>  	return nf;
>  }
>  
> @@ -1048,6 +1051,33 @@ nfsd_file_is_cached(struct inode *inode)
>  	return ret;
>  }
>  
> +static __be32
> +nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
> +{
> +	struct inode *inode = file_inode(nf->nf_file);
> +	struct kstat stat;
> +	__be32 status;
> +
> +	/* Currently only need to get DIO alignment info for regular files */
> +	if (!S_ISREG(inode->i_mode))
> +		return nfs_ok;
> +
> +	status = fh_getattr(fhp, &stat);
> +	if (status != nfs_ok)
> +		return status;
> +
> +	if (stat.result_mask & STATX_DIOALIGN) {
> +		nf->nf_dio_mem_align = stat.dio_mem_align;
> +		nf->nf_dio_offset_align = stat.dio_offset_align;
> +	}
> +	if (stat.result_mask & STATX_DIO_READ_ALIGN)
> +		nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
> +	else
> +		nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
> +
> +	return status;
> +}
> +
>  static __be32
>  nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
>  		     struct svc_cred *cred,
> @@ -1166,6 +1196,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
>  			}
>  			status = nfserrno(ret);
>  			trace_nfsd_file_open(nf, status);
> +			if (status == nfs_ok)
> +				status = nfsd_file_getattr(fhp, nf);
>  		}
>  	} else
>  		status = nfserr_jukebox;
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index 24ddf60e8434a..e3d6ca2b60308 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -54,6 +54,10 @@ struct nfsd_file {
>  	struct list_head	nf_gc;
>  	struct rcu_head		nf_rcu;
>  	ktime_t			nf_birthtime;
> +
> +	u32			nf_dio_mem_align;
> +	u32			nf_dio_offset_align;
> +	u32			nf_dio_read_offset_align;
>  };
>  
>  int nfsd_file_cache_init(void);
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index f4a3cc9e31e05..bdba2ba828a6a 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -677,8 +677,12 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
>  		.mnt		= fhp->fh_export->ex_path.mnt,
>  		.dentry		= fhp->fh_dentry,
>  	};
> +	struct inode *inode = d_inode(p.dentry);
>  	u32 request_mask = STATX_BASIC_STATS;
>  
> +	if (S_ISREG(inode->i_mode))
> +		request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
> +
>  	if (fhp->fh_maxsize == NFS4_FHSIZE)
>  		request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
>  

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 2/7] NFSD: pass nfsd_file to nfsd_iter_read()
  2025-08-07 16:25 ` [PATCH v5 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
@ 2025-08-08 11:51   ` Jeff Layton
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2025-08-08 11:51 UTC (permalink / raw)
  To: Mike Snitzer, Chuck Lever; +Cc: linux-nfs

On Thu, 2025-08-07 at 12:25 -0400, Mike Snitzer wrote:
> Prepares for nfsd_iter_read() to use DIO alignment stored in nfsd_file.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/nfs4xdr.c | 8 ++++----
>  fs/nfsd/vfs.c     | 7 ++++---
>  fs/nfsd/vfs.h     | 2 +-
>  3 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 7d19925f46e45..d519f4156cfad 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -4464,7 +4464,7 @@ static __be32 nfsd4_encode_splice_read(
>  
>  static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
>  				 struct nfsd4_read *read,
> -				 struct file *file, unsigned long maxcount)
> +				 unsigned long maxcount)
>  {
>  	struct xdr_stream *xdr = resp->xdr;
>  	unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
> @@ -4475,7 +4475,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
>  	if (xdr_reserve_space_vec(xdr, maxcount) < 0)
>  		return nfserr_resource;
>  
> -	nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
> +	nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
>  				read->rd_offset, &maxcount, base,
>  				&read->rd_eof);
>  	read->rd_length = maxcount;
> @@ -4522,7 +4522,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
>  	if (file->f_op->splice_read && splice_ok)
>  		nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
>  	else
> -		nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
> +		nfserr = nfsd4_encode_readv(resp, read, maxcount);
>  	if (nfserr) {
>  		xdr_truncate_encode(xdr, eof_offset);
>  		return nfserr;
> @@ -5418,7 +5418,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
>  	if (file->f_op->splice_read && splice_ok)
>  		nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
>  	else
> -		nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
> +		nfserr = nfsd4_encode_readv(resp, read, maxcount);
>  	if (nfserr)
>  		return nfserr;
>  
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 0c0f25b2c8e38..79439ad93880a 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1075,7 +1075,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>   * nfsd_iter_read - Perform a VFS read using an iterator
>   * @rqstp: RPC transaction context
>   * @fhp: file handle of file to be read
> - * @file: opened struct file of file to be read
> + * @nf: opened struct nfsd_file of file to be read
>   * @offset: starting byte offset
>   * @count: IN: requested number of bytes; OUT: number of bytes read
>   * @base: offset in first page of read buffer
> @@ -1088,9 +1088,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>   * returned.
>   */
>  __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> -		      struct file *file, loff_t offset, unsigned long *count,
> +		      struct nfsd_file *nf, loff_t offset, unsigned long *count,
>  		      unsigned int base, u32 *eof)
>  {
> +	struct file *file = nf->nf_file;
>  	unsigned long v, total;
>  	struct iov_iter iter;
>  	struct kiocb kiocb;
> @@ -1312,7 +1313,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
>  		err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
>  	else
> -		err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
> +		err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
>  
>  	nfsd_file_put(nf);
>  	trace_nfsd_read_done(rqstp, fhp, offset, *count);
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index 0c0292611c6de..fa46f8b5f1320 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -121,7 +121,7 @@ __be32		nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  				unsigned long *count,
>  				u32 *eof);
>  __be32		nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> -				struct file *file, loff_t offset,
> +				struct nfsd_file *nf, loff_t offset,
>  				unsigned long *count, unsigned int base,
>  				u32 *eof);
>  bool		nfsd_read_splice_ok(struct svc_rqst *rqstp);

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 4/7] NFSD: add io_cache_write controls to debugfs interface
  2025-08-07 16:25 ` [PATCH v5 4/7] NFSD: add io_cache_write " Mike Snitzer
@ 2025-08-08 12:03   ` Jeff Layton
  2025-08-08 17:58   ` Chuck Lever
  1 sibling, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2025-08-08 12:03 UTC (permalink / raw)
  To: Mike Snitzer, Chuck Lever; +Cc: linux-nfs

On Thu, 2025-08-07 at 12:25 -0400, Mike Snitzer wrote:
> Add 'io_cache_write' to NFSD's debugfs interface so that: Any data
> written by NFSD will either be:
> - cached using page cache (NFSD_IO_BUFFERED=1)
> - cached but removed from the page cache upon completion
>   (NFSD_IO_DONTCACHE=2).
> - not cached (NFSD_IO_DIRECT=3)
> 
> io_cache_write may be set by writing to:
>   /sys/kernel/debug/nfsd/io_cache_write
> 
> If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
> advertised as supported by the underlying filesystem (e.g. XFS),
> otherwise all IO flagged with RWF_DONTCACHE will fail with
> -EOPNOTSUPP.
> 
> If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
> relative to the underlying block device's logical_block_size. Also the
> memory buffer used to store the WRITE payload must be aligned relative
> to the underlying block device's dma_alignment.

And if it isn't, it looks it falls back on doing regular buffered I/O
(at least until patch #7)?


> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/debugfs.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/nfsd/nfsd.h    |  1 +
>  fs/nfsd/vfs.c     | 16 ++++++++++++++++
>  3 files changed, 61 insertions(+)
> 
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index c07f71d4e84f4..872de65f0e9ac 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -87,6 +87,47 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
>  DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
>  			 nfsd_io_cache_read_set, "%llu\n");
>  
> +/*
> + * /sys/kernel/debug/nfsd/io_cache_write
> + *
> + * Contents:
> + *   %1: NFS WRITE will use buffered IO
> + *   %2: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
> + *   %3: NFS WRITE will use direct IO
> + *
> + * The default value of this setting is zero (UNSPECIFIED).
> + * This setting takes immediate effect for all NFS versions,
> + * all exports, and in all NFSD net namespaces.
> + */
> +
> +static int nfsd_io_cache_write_get(void *data, u64 *val)
> +{
> +	*val = nfsd_io_cache_write;
> +	return 0;
> +}
> +
> +static int nfsd_io_cache_write_set(void *data, u64 val)
> +{
> +	int ret = 0;
> +
> +	switch (val) {
> +	case NFSD_IO_BUFFERED:
> +	case NFSD_IO_DONTCACHE:
> +	case NFSD_IO_DIRECT:
> +		nfsd_io_cache_write = val;
> +		break;
> +	default:
> +		nfsd_io_cache_write = NFSD_IO_UNSPECIFIED;
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_write_fops, nfsd_io_cache_write_get,
> +			 nfsd_io_cache_write_set, "%llu\n");
> +
>  void nfsd_debugfs_exit(void)
>  {
>  	debugfs_remove_recursive(nfsd_top_dir);
> @@ -102,4 +143,7 @@ void nfsd_debugfs_init(void)
>  
>  	debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
>  			    nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
> +
> +	debugfs_create_file("io_cache_write", S_IWUSR | S_IRUGO,
> +			    nfsd_top_dir, NULL, &nfsd_io_cache_write_fops);
>  }
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 6ef799405145f..fe935b4cda538 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -161,6 +161,7 @@ enum {
>  };
>  
>  extern u64 nfsd_io_cache_read __read_mostly;
> +extern u64 nfsd_io_cache_write __read_mostly;
>  
>  extern int nfsd_max_blksize;
>  
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 26b6d96258711..5768244c7a3c3 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -50,6 +50,7 @@
>  
>  bool nfsd_disable_splice_read __read_mostly;
>  u64 nfsd_io_cache_read __read_mostly;
> +u64 nfsd_io_cache_write __read_mostly;
>  
>  /**
>   * nfserrno - Map Linux errnos to NFS errnos
> @@ -1234,6 +1235,21 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  
>  	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
>  	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +
> +	switch (nfsd_io_cache_write) {
> +	case NFSD_IO_DIRECT:
> +		/* direct I/O must be aligned to device logical sector size */
> +		if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
> +		    (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0))
> +			kiocb.ki_flags |= IOCB_DIRECT;
> +		break;
> +	case NFSD_IO_DONTCACHE:
> +		kiocb.ki_flags |= IOCB_DONTCACHE;
> +		break;
> +	case NFSD_IO_BUFFERED:
> +		break;
> +	}
> +
>  	since = READ_ONCE(file->f_wb_err);
>  	if (verf)
>  		nfsd_copy_write_verifier(verf, nn);

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 3/7] NFSD: add io_cache_read controls to debugfs interface
  2025-08-07 16:25 ` [PATCH v5 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
@ 2025-08-08 12:05   ` Jeff Layton
  2025-08-08 12:05   ` Jeff Layton
  2025-08-08 17:58   ` Chuck Lever
  2 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2025-08-08 12:05 UTC (permalink / raw)
  To: Mike Snitzer, Chuck Lever; +Cc: linux-nfs

On Thu, 2025-08-07 at 12:25 -0400, Mike Snitzer wrote:
> Add 'io_cache_read' to NFSD's debugfs interface so that: Any data
> read by NFSD will either be:
> - cached using page cache (NFSD_IO_BUFFERED=1)
> - cached but removed from the page cache upon completion
>   (NFSD_IO_DONTCACHE=2).
> - not cached (NFSD_IO_DIRECT=3)
> 
> io_cache_read may be set by writing to:
>   /sys/kernel/debug/nfsd/io_cache_read
> 
> If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
> advertised as supported by the underlying filesystem (e.g. XFS),
> otherwise all IO flagged with RWF_DONTCACHE will fail with
> -EOPNOTSUPP.
> 
> If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
> relative to the underlying block device's logical_block_size. Also the
> memory buffer used to store the read must be aligned relative to the
> underlying block device's dma_alignment.
> 

Ditto here. Looks like this falls back to buffered I/O until patch #6?

> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/debugfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nfsd/nfsd.h    |  9 ++++++++
>  fs/nfsd/vfs.c     | 19 +++++++++++++---
>  3 files changed, 83 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 84b0c8b559dc9..c07f71d4e84f4 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -27,11 +27,66 @@ static int nfsd_dsr_get(void *data, u64 *val)
>  static int nfsd_dsr_set(void *data, u64 val)
>  {
>  	nfsd_disable_splice_read = (val > 0) ? true : false;
> +	if (!nfsd_disable_splice_read) {
> +		/*
> +		 * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
> +		 * if splice_read is enabled.
> +		 */
> +		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> +	}
>  	return 0;
>  }
>  
>  DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
>  
> +/*
> + * /sys/kernel/debug/nfsd/io_cache_read
> + *
> + * Contents:
> + *   %1: NFS READ will use buffered IO
> + *   %2: NFS READ will use dontcache (buffered IO w/ dropbehind)
> + *   %3: NFS READ will use direct IO
> + *
> + * The default value of this setting is zero (UNSPECIFIED).
> + * This setting takes immediate effect for all NFS versions,
> + * all exports, and in all NFSD net namespaces.
> + */
> +
> +static int nfsd_io_cache_read_get(void *data, u64 *val)
> +{
> +	*val = nfsd_io_cache_read;
> +	return 0;
> +}
> +
> +static int nfsd_io_cache_read_set(void *data, u64 val)
> +{
> +	int ret = 0;
> +
> +	switch (val) {
> +	case NFSD_IO_BUFFERED:
> +		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> +		break;
> +	case NFSD_IO_DONTCACHE:
> +	case NFSD_IO_DIRECT:
> +		/*
> +		 * Must disable splice_read when enabling
> +		 * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
> +		 */
> +		nfsd_disable_splice_read = true;
> +		nfsd_io_cache_read = val;
> +		break;
> +	default:
> +		nfsd_io_cache_read = NFSD_IO_UNSPECIFIED;
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
> +			 nfsd_io_cache_read_set, "%llu\n");
> +
>  void nfsd_debugfs_exit(void)
>  {
>  	debugfs_remove_recursive(nfsd_top_dir);
> @@ -44,4 +99,7 @@ void nfsd_debugfs_init(void)
>  
>  	debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
>  			    nfsd_top_dir, NULL, &nfsd_dsr_fops);
> +
> +	debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
> +			    nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
>  }
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 1cd0bed57bc2f..6ef799405145f 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -153,6 +153,15 @@ static inline void nfsd_debugfs_exit(void) {}
>  
>  extern bool nfsd_disable_splice_read __read_mostly;
>  
> +enum {
> +	NFSD_IO_UNSPECIFIED = 0,
> +	NFSD_IO_BUFFERED,
> +	NFSD_IO_DONTCACHE,
> +	NFSD_IO_DIRECT,
> +};
> +
> +extern u64 nfsd_io_cache_read __read_mostly;
> +
>  extern int nfsd_max_blksize;
>  
>  static inline int nfsd_v4client(struct svc_rqst *rq)
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 79439ad93880a..26b6d96258711 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -49,6 +49,7 @@
>  #define NFSDDBG_FACILITY		NFSDDBG_FILEOP
>  
>  bool nfsd_disable_splice_read __read_mostly;
> +u64 nfsd_io_cache_read __read_mostly;
>  
>  /**
>   * nfserrno - Map Linux errnos to NFS errnos
> @@ -1099,17 +1100,29 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	size_t len;
>  
>  	init_sync_kiocb(&kiocb, file);
> +
> +	if (nfsd_io_cache_read == NFSD_IO_DIRECT) {
> +		/* Verify ondisk DIO alignment, memory addrs checked below */
> +		if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
> +		    (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0))
> +			kiocb.ki_flags = IOCB_DIRECT;
> +	} else if (nfsd_io_cache_read == NFSD_IO_DONTCACHE)
> +		kiocb.ki_flags = IOCB_DONTCACHE;
> +
>  	kiocb.ki_pos = offset;
>  
>  	v = 0;
>  	total = *count;
>  	while (total) {
>  		len = min_t(size_t, total, PAGE_SIZE - base);
> -		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> -			      len, base);
> +		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++), len, base);
> +		/* No need to verify memory is DIO-aligned since bv_offset is 0 */
> +		if (unlikely((kiocb.ki_flags & IOCB_DIRECT) && base &&
> +			     (base & (nf->nf_dio_mem_align - 1))))
> +			kiocb.ki_flags &= ~IOCB_DIRECT;
>  		total -= len;
> -		++v;
>  		base = 0;
> +		v++;
>  	}
>  	WARN_ON_ONCE(v > rqstp->rq_maxpages);
>  

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 3/7] NFSD: add io_cache_read controls to debugfs interface
  2025-08-07 16:25 ` [PATCH v5 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
  2025-08-08 12:05   ` Jeff Layton
@ 2025-08-08 12:05   ` Jeff Layton
  2025-08-08 17:58   ` Chuck Lever
  2 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2025-08-08 12:05 UTC (permalink / raw)
  To: Mike Snitzer, Chuck Lever; +Cc: linux-nfs

On Thu, 2025-08-07 at 12:25 -0400, Mike Snitzer wrote:
> Add 'io_cache_read' to NFSD's debugfs interface so that: Any data
> read by NFSD will either be:
> - cached using page cache (NFSD_IO_BUFFERED=1)
> - cached but removed from the page cache upon completion
>   (NFSD_IO_DONTCACHE=2).
> - not cached (NFSD_IO_DIRECT=3)
> 
> io_cache_read may be set by writing to:
>   /sys/kernel/debug/nfsd/io_cache_read
> 
> If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
> advertised as supported by the underlying filesystem (e.g. XFS),
> otherwise all IO flagged with RWF_DONTCACHE will fail with
> -EOPNOTSUPP.
> 
> If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
> relative to the underlying block device's logical_block_size. Also the
> memory buffer used to store the read must be aligned relative to the
> underlying block device's dma_alignment.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/debugfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nfsd/nfsd.h    |  9 ++++++++
>  fs/nfsd/vfs.c     | 19 +++++++++++++---
>  3 files changed, 83 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 84b0c8b559dc9..c07f71d4e84f4 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -27,11 +27,66 @@ static int nfsd_dsr_get(void *data, u64 *val)
>  static int nfsd_dsr_set(void *data, u64 val)
>  {
>  	nfsd_disable_splice_read = (val > 0) ? true : false;
> +	if (!nfsd_disable_splice_read) {
> +		/*
> +		 * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
> +		 * if splice_read is enabled.
> +		 */
> +		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> +	}
>  	return 0;
>  }
>  
>  DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
>  
> +/*
> + * /sys/kernel/debug/nfsd/io_cache_read
> + *
> + * Contents:
> + *   %1: NFS READ will use buffered IO
> + *   %2: NFS READ will use dontcache (buffered IO w/ dropbehind)
> + *   %3: NFS READ will use direct IO
> + *
> + * The default value of this setting is zero (UNSPECIFIED).
> + * This setting takes immediate effect for all NFS versions,
> + * all exports, and in all NFSD net namespaces.
> + */
> +
> +static int nfsd_io_cache_read_get(void *data, u64 *val)
> +{
> +	*val = nfsd_io_cache_read;
> +	return 0;
> +}
> +
> +static int nfsd_io_cache_read_set(void *data, u64 val)
> +{
> +	int ret = 0;
> +
> +	switch (val) {
> +	case NFSD_IO_BUFFERED:
> +		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> +		break;
> +	case NFSD_IO_DONTCACHE:
> +	case NFSD_IO_DIRECT:
> +		/*
> +		 * Must disable splice_read when enabling
> +		 * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
> +		 */
> +		nfsd_disable_splice_read = true;
> +		nfsd_io_cache_read = val;
> +		break;
> +	default:
> +		nfsd_io_cache_read = NFSD_IO_UNSPECIFIED;
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
> +			 nfsd_io_cache_read_set, "%llu\n");
> +
>  void nfsd_debugfs_exit(void)
>  {
>  	debugfs_remove_recursive(nfsd_top_dir);
> @@ -44,4 +99,7 @@ void nfsd_debugfs_init(void)
>  
>  	debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
>  			    nfsd_top_dir, NULL, &nfsd_dsr_fops);
> +
> +	debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
> +			    nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
>  }
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 1cd0bed57bc2f..6ef799405145f 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -153,6 +153,15 @@ static inline void nfsd_debugfs_exit(void) {}
>  
>  extern bool nfsd_disable_splice_read __read_mostly;
>  
> +enum {
> +	NFSD_IO_UNSPECIFIED = 0,
> +	NFSD_IO_BUFFERED,
> +	NFSD_IO_DONTCACHE,
> +	NFSD_IO_DIRECT,
> +};
> +
> +extern u64 nfsd_io_cache_read __read_mostly;
> +
>  extern int nfsd_max_blksize;
>  
>  static inline int nfsd_v4client(struct svc_rqst *rq)
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 79439ad93880a..26b6d96258711 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -49,6 +49,7 @@
>  #define NFSDDBG_FACILITY		NFSDDBG_FILEOP
>  
>  bool nfsd_disable_splice_read __read_mostly;
> +u64 nfsd_io_cache_read __read_mostly;
>  
>  /**
>   * nfserrno - Map Linux errnos to NFS errnos
> @@ -1099,17 +1100,29 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	size_t len;
>  
>  	init_sync_kiocb(&kiocb, file);
> +
> +	if (nfsd_io_cache_read == NFSD_IO_DIRECT) {
> +		/* Verify ondisk DIO alignment, memory addrs checked below */
> +		if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
> +		    (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0))
> +			kiocb.ki_flags = IOCB_DIRECT;
> +	} else if (nfsd_io_cache_read == NFSD_IO_DONTCACHE)
> +		kiocb.ki_flags = IOCB_DONTCACHE;
> +
>  	kiocb.ki_pos = offset;
>  
>  	v = 0;
>  	total = *count;
>  	while (total) {
>  		len = min_t(size_t, total, PAGE_SIZE - base);
> -		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> -			      len, base);
> +		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++), len, base);
> +		/* No need to verify memory is DIO-aligned since bv_offset is 0 */
> +		if (unlikely((kiocb.ki_flags & IOCB_DIRECT) && base &&
> +			     (base & (nf->nf_dio_mem_align - 1))))
> +			kiocb.ki_flags &= ~IOCB_DIRECT;
>  		total -= len;
> -		++v;
>  		base = 0;
> +		v++;
>  	}
>  	WARN_ON_ONCE(v > rqstp->rq_maxpages);
>  

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 5/7] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled
  2025-08-07 16:25 ` [PATCH v5 5/7] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled Mike Snitzer
@ 2025-08-08 12:05   ` Jeff Layton
  2025-08-08 17:59   ` Chuck Lever
  1 sibling, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2025-08-08 12:05 UTC (permalink / raw)
  To: Mike Snitzer, Chuck Lever; +Cc: linux-nfs

On Thu, 2025-08-07 at 12:25 -0400, Mike Snitzer wrote:
> Suggested-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/filecache.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 5447dba6c5da0..5601e839a72da 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -1058,8 +1058,12 @@ nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
>  	struct kstat stat;
>  	__be32 status;
>  
> -	/* Currently only need to get DIO alignment info for regular files */
> -	if (!S_ISREG(inode->i_mode))
> +	/* Currently only need to get DIO alignment info for regular files
> +	 * IFF NFSD_IO_DIRECT is enabled for nfsd_io_cache_{read,write}.
> +	 */
> +	if (!S_ISREG(inode->i_mode) ||
> +	    (nfsd_io_cache_read != NFSD_IO_DIRECT &&
> +	     nfsd_io_cache_write != NFSD_IO_DIRECT))
>  		return nfs_ok;
>  
>  	status = fh_getattr(fhp, &stat);

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 6/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
  2025-08-07 16:25 ` [PATCH v5 6/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
@ 2025-08-08 12:16   ` Jeff Layton
  2025-08-08 17:59   ` Chuck Lever
  1 sibling, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2025-08-08 12:16 UTC (permalink / raw)
  To: Mike Snitzer, Chuck Lever; +Cc: linux-nfs

On Thu, 2025-08-07 at 12:25 -0400, Mike Snitzer wrote:
> If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
> DIO-aligned block (on either end of the READ). The expanded READ is
> verified to have proper offset/len (logical_block_size) and
> dma_alignment checking.
> 
> Must allocate and use a bounce-buffer page (called 'start_extra_page')
> if/when expanding the misaligned READ requires reading extra partial
> page at the start of the READ so that its DIO-aligned. Otherwise that
> extra page at the start will make its way back to the NFS client and
> corruption will occur. As found, and then this fix of using an extra
> page verified, using the 'dt' utility:
>   dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
>      iotype=sequential pattern=iot onerr=abort oncerr=abort
> see: https://github.com/RobinTMiller/dt.git
> 
> Any misaligned READ that is less than 32K won't be expanded to be
> DIO-aligned (this heuristic just avoids excess work, like allocating
> start_extra_page, for smaller IO that can generally already perform
> well using buffered IO).
> 
> Also add EVENT_CLASS nfsd_analyze_dio_class and use it to create
> nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events. This
> prepares for nfsd_vfs_write() to also make use of it when handling
> misaligned WRITEs.
> 
> This combination of trace events is useful:
> 
>  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
>  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
>  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
>  echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
> 
> Which for this dd command:
> 
>  dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct
> 
> Results in:
> 
>   nfsd-23908   [010] ..... 10375.141640: nfsd_analyze_read_dio: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+47008 end=47008+96
>   nfsd-23908   [010] ..... 10375.141642: nfsd_read_vector: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47104
>   nfsd-23908   [010] ..... 10375.141643: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0x0 bytecount 0xb800
>   nfsd-23908   [010] ..... 10375.141773: nfsd_read_io_done: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008
> 
>   nfsd-23908   [010] ..... 10375.142063: nfsd_analyze_read_dio: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=46592+416 middle=47008+47008 end=94016+192
>   nfsd-23908   [010] ..... 10375.142064: nfsd_read_vector: xid=0x83c5923b fh_hash=0x857ca4fc offset=46592 len=47616
>   nfsd-23908   [010] ..... 10375.142065: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0xb600 bytecount 0xba00
>   nfsd-23908   [010] ..... 10375.142103: nfsd_read_io_done: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008
> 
> Suggested-by: Jeff Layton <jlayton@kernel.org>
> Suggested-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/trace.h            |  61 ++++++++++++++
>  fs/nfsd/vfs.c              | 167 ++++++++++++++++++++++++++++++++++---
>  include/linux/sunrpc/svc.h |   5 +-
>  3 files changed, 221 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index a664fdf1161e9..4173bd9344b6b 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -473,6 +473,67 @@ DEFINE_NFSD_IO_EVENT(write_done);
>  DEFINE_NFSD_IO_EVENT(commit_start);
>  DEFINE_NFSD_IO_EVENT(commit_done);
>  
> +DECLARE_EVENT_CLASS(nfsd_analyze_dio_class,
> +	TP_PROTO(struct svc_rqst *rqstp,
> +		 struct svc_fh	*fhp,
> +		 u64		offset,
> +		 u32		len,
> +		 loff_t		start,
> +		 ssize_t	start_len,
> +		 loff_t		middle,
> +		 ssize_t	middle_len,
> +		 loff_t		end,
> +		 ssize_t	end_len),
> +	TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len),
> +	TP_STRUCT__entry(
> +		__field(u32, xid)
> +		__field(u32, fh_hash)
> +		__field(u64, offset)
> +		__field(u32, len)
> +		__field(loff_t, start)
> +		__field(ssize_t, start_len)
> +		__field(loff_t, middle)
> +		__field(ssize_t, middle_len)
> +		__field(loff_t, end)
> +		__field(ssize_t, end_len)
> +	),
> +	TP_fast_assign(
> +		__entry->xid = be32_to_cpu(rqstp->rq_xid);
> +		__entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
> +		__entry->offset = offset;
> +		__entry->len = len;
> +		__entry->start = start;
> +		__entry->start_len = start_len;
> +		__entry->middle = middle;
> +		__entry->middle_len = middle_len;
> +		__entry->end = end;
> +		__entry->end_len = end_len;
> +	),
> +	TP_printk("xid=0x%08x fh_hash=0x%08x offset=%llu len=%u start=%llu+%lu middle=%llu+%lu end=%llu+%lu",
> +		  __entry->xid, __entry->fh_hash,
> +		  __entry->offset, __entry->len,
> +		  __entry->start, __entry->start_len,
> +		  __entry->middle, __entry->middle_len,
> +		  __entry->end, __entry->end_len)
> +)
> +
> +#define DEFINE_NFSD_ANALYZE_DIO_EVENT(name)			\
> +DEFINE_EVENT(nfsd_analyze_dio_class, nfsd_analyze_##name##_dio,	\
> +	TP_PROTO(struct svc_rqst *rqstp,			\
> +		 struct svc_fh	*fhp,				\
> +		 u64		offset,				\
> +		 u32		len,				\
> +		 loff_t		start,				\
> +		 ssize_t	start_len,			\
> +		 loff_t		middle,				\
> +		 ssize_t	middle_len,			\
> +		 loff_t		end,				\
> +		 ssize_t	end_len),			\
> +	TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len))
> +
> +DEFINE_NFSD_ANALYZE_DIO_EVENT(read);
> +DEFINE_NFSD_ANALYZE_DIO_EVENT(write);
> +
>  DECLARE_EVENT_CLASS(nfsd_err_class,
>  	TP_PROTO(struct svc_rqst *rqstp,
>  		 struct svc_fh	*fhp,
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 5768244c7a3c3..be083a8812717 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -19,6 +19,7 @@
>  #include <linux/splice.h>
>  #include <linux/falloc.h>
>  #include <linux/fcntl.h>
> +#include <linux/math.h>
>  #include <linux/namei.h>
>  #include <linux/delay.h>
>  #include <linux/fsnotify.h>
> @@ -1073,6 +1074,116 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>  }
>  
> +struct nfsd_read_dio {
> +	loff_t start;
> +	loff_t end;
> +	unsigned long start_extra;
> +	unsigned long end_extra;
> +	struct page *start_extra_page;
> +};
> +
> +static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
> +{
> +	memset(read_dio, 0, sizeof(*read_dio));
> +	read_dio->start_extra_page = NULL;
> +}
> +
> +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +				  struct nfsd_file *nf, loff_t offset,
> +				  unsigned long len, unsigned int base,
> +				  struct nfsd_read_dio *read_dio)
> +{
> +	const u32 dio_blocksize = nf->nf_dio_read_offset_align;
> +	loff_t middle_end, orig_end = offset + len;
> +
> +	if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
> +		      "%s: underlying filesystem has not provided DIO alignment info\n",
> +		      __func__))
> +		return false;
> +	if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> +		      "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> +		      __func__, dio_blocksize, PAGE_SIZE))
> +		return false;
> +
> +	/* Return early if IO is irreparably misaligned
> +	 * (len < PAGE_SIZE, or base not aligned).
> +	 */
> +	if (unlikely(len < dio_blocksize) ||
> +	    ((base & (nf->nf_dio_mem_align-1)) != 0))
> +		return false;
> +
> +	read_dio->start = round_down(offset, dio_blocksize);
> +	read_dio->end = round_up(orig_end, dio_blocksize);
> +	read_dio->start_extra = offset - read_dio->start;
> +	read_dio->end_extra = read_dio->end - orig_end;
> +
> +	/* don't expand READ for IO less than 32K */
> +	if ((read_dio->start_extra || read_dio->end_extra) && (len < (32 << 10))) {
> +		init_nfsd_read_dio(read_dio);
> +		return false;
> +	}
> +
> +	if (read_dio->start_extra) {
> +		read_dio->start_extra_page = alloc_page(GFP_KERNEL);
> +		if (WARN_ONCE(read_dio->start_extra_page == NULL,
> +			      "%s: Unable to allocate start_extra_page\n", __func__)) {
> +			init_nfsd_read_dio(read_dio);
> +			return false;
> +		}
> +	}
> +
> +	/* Show original offset and count, and how it was expanded for DIO */
> +	middle_end = read_dio->end - read_dio->end_extra;
> +	trace_nfsd_analyze_read_dio(rqstp, fhp, offset, len,
> +				    read_dio->start, read_dio->start_extra,
> +				    offset, (middle_end - offset),
> +				    middle_end, read_dio->end_extra);
> +	return true;
> +}
> +
> +static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
> +						 struct nfsd_read_dio *read_dio,
> +						 ssize_t bytes_read,
> +						 unsigned long bytes_expected,
> +						 loff_t *offset,
> +						 unsigned long *rq_bvec_numpages)
> +{
> +	ssize_t host_err = bytes_read;
> +	loff_t v;
> +
> +	/* If nfsd_analyze_read_dio() allocated a start_extra_page it must
> +	 * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
> +	 */
> +	if (read_dio->start_extra_page) {
> +		__free_page(read_dio->start_extra_page);
> +		*rq_bvec_numpages -= 1;
> +		v = *rq_bvec_numpages;
> +		memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
> +			v * sizeof(struct bio_vec));
> +	}
> +	/* Eliminate any end_extra bytes from the last page */
> +	v = *rq_bvec_numpages;
> +	rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
> +
> +	if (host_err < 0)
> +		return host_err;
> +
> +	/* nfsd_analyze_read_dio() may have expanded the start and end,
> +	 * if so adjust returned read size to reflect original extent.
> +	 */
> +	*offset += read_dio->start_extra;
> +	if (likely(host_err >= read_dio->start_extra)) {
> +		host_err -= read_dio->start_extra;
> +		if (host_err > bytes_expected)
> +			host_err = bytes_expected;
> +	} else {
> +		/* Short read that didn't read any of requested data */
> +		host_err = 0;
> +	}
> +
> +	return host_err;
> +}
> +
>  /**
>   * nfsd_iter_read - Perform a VFS read using an iterator
>   * @rqstp: RPC transaction context
> @@ -1094,26 +1205,49 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		      unsigned int base, u32 *eof)
>  {
>  	struct file *file = nf->nf_file;
> -	unsigned long v, total;
> +	unsigned long v, total, in_count = *count;
> +	struct nfsd_read_dio read_dio;
>  	struct iov_iter iter;
>  	struct kiocb kiocb;
> -	ssize_t host_err;
> +	ssize_t host_err = 0;
>  	size_t len;
>  
> +	init_nfsd_read_dio(&read_dio);
>  	init_sync_kiocb(&kiocb, file);
>  
> +	/*
> +	 * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
> +	 * the next DIO-aligned block (on either end of the READ).
> +	 */
>  	if (nfsd_io_cache_read == NFSD_IO_DIRECT) {
> -		/* Verify ondisk DIO alignment, memory addrs checked below */
> -		if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
> -		    (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0))
> -			kiocb.ki_flags = IOCB_DIRECT;
> +		if (nfsd_analyze_read_dio(rqstp, fhp, nf, offset,
> +					  in_count, base, &read_dio)) {
> +			/* trace_nfsd_read_vector() will reflect larger
> +			 * DIO-aligned READ.
> +			 */
> +			offset = read_dio.start;
> +			in_count = read_dio.end - offset;
> +			/* Verify ondisk DIO alignment, memory addrs checked below */
> +			if (likely(((offset | in_count) &
> +				    (nf->nf_dio_read_offset_align - 1)) == 0))
> +				kiocb.ki_flags = IOCB_DIRECT;
> +		}
>  	} else if (nfsd_io_cache_read == NFSD_IO_DONTCACHE)
>  		kiocb.ki_flags = IOCB_DONTCACHE;
>  
>  	kiocb.ki_pos = offset;
>  
>  	v = 0;
> -	total = *count;
> +	total = in_count;
> +	if (read_dio.start_extra) {
> +		bvec_set_page(&rqstp->rq_bvec[v], read_dio.start_extra_page,
> +			      read_dio.start_extra, PAGE_SIZE - read_dio.start_extra);
> +		if (unlikely((kiocb.ki_flags & IOCB_DIRECT) &&
> +			     rqstp->rq_bvec[v].bv_offset & (nf->nf_dio_mem_align - 1)))
> +			kiocb.ki_flags &= ~IOCB_DIRECT;
> +		total -= read_dio.start_extra;
> +		v++;
> +	}
>  	while (total) {
>  		len = min_t(size_t, total, PAGE_SIZE - base);
>  		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++), len, base);
> @@ -1125,11 +1259,22 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		base = 0;
>  		v++;
>  	}
> -	WARN_ON_ONCE(v > rqstp->rq_maxpages);
> +	if (WARN_ONCE(v > rqstp->rq_maxpages,
> +		      "%s: v=%lu exceeds rqstp->rq_maxpages=%lu\n", __func__,
> +		      v, rqstp->rq_maxpages)) {
> +		host_err = -EINVAL;
> +	}
>  
> -	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> -	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> -	host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
> +	if (!host_err) {
> +		trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
> +		iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
> +		host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
> +	}
> +
> +	if (read_dio.start_extra || read_dio.end_extra) {
> +		host_err = nfsd_complete_misaligned_read_dio(rqstp, &read_dio,
> +					host_err, *count, &offset, &v);
> +	}
>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>  }
>  
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index e64ab444e0a7f..190c2667500e2 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>   * pages, one for the request, and one for the reply.
>   * nfsd_splice_actor() might need an extra page when a READ payload
>   * is not page-aligned.
> + * nfsd_iter_read() might need two extra pages when a READ payload
> + * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
> + * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
>   */
>  static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
>  {
> -	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
> +	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
>  }
>  
>  /*

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 7/7] NFSD: issue WRITEs using O_DIRECT even if IO is misaligned
  2025-08-07 16:25 ` [PATCH v5 7/7] NFSD: issue WRITEs " Mike Snitzer
  2025-08-08  2:30   ` Mike Snitzer
@ 2025-08-08 12:20   ` Jeff Layton
  1 sibling, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2025-08-08 12:20 UTC (permalink / raw)
  To: Mike Snitzer, Chuck Lever; +Cc: linux-nfs

On Thu, 2025-08-07 at 12:25 -0400, Mike Snitzer wrote:
> If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> middle and end as needed. The large middle extent is DIO-aligned and
> the start and/or end are misaligned. Buffered IO is used for the
> misaligned extents and O_DIRECT is used for the middle DIO-aligned
> extent.
> 
> The nfsd_analyze_write_dio trace event shows how NFSD splits a given
> misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
> extent.
> 
> This combination of trace events is useful:
> 
>   echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
>   echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
>   echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
>   echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
> 
> Which for this dd command:
> 
>   dd if=/dev/zero of=/mnt/share1/test bs=47008 count=2 oflag=direct
> 
> Results in:
> 
>   nfsd-23908   [010] ..... 10374.902333: nfsd_write_opened: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008
>   nfsd-23908   [010] ..... 10374.902335: nfsd_analyze_write_dio: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+46592 end=46592+416
>   nfsd-23908   [010] ..... 10374.902343: xfs_file_direct_write: dev 259:2 ino 0xc00116 disize 0x0 pos 0x0 bytecount 0xb600
>   nfsd-23908   [010] ..... 10374.902697: nfsd_write_io_done: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008
> 
>   nfsd-23908   [010] ..... 10374.902925: nfsd_write_opened: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008
>   nfsd-23908   [010] ..... 10374.902926: nfsd_analyze_write_dio: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=47008+96 middle=47104+46592 end=93696+320
>   nfsd-23908   [010] ..... 10374.903010: xfs_file_direct_write: dev 259:2 ino 0xc00116 disize 0xb800 pos 0xb800 bytecount 0xb600
>   nfsd-23908   [010] ..... 10374.903239: nfsd_write_io_done: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/vfs.c | 183 ++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 170 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index be083a8812717..1b5aa3e6e6623 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1315,6 +1315,167 @@ static int wait_for_concurrent_writes(struct file *file)
>  	return err;
>  }
>  
> +struct nfsd_write_dio {
> +	loff_t middle_offset;	/* Offset for start of DIO-aligned middle */
> +	loff_t end_offset;	/* Offset for start of DIO-aligned end */
> +	ssize_t	start_len;	/* Length for misaligned first extent */
> +	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
> +	ssize_t	end_len;	/* Length for misaligned last extent */
> +};
> +
> +static bool
> +nfsd_analyze_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		       struct nfsd_file *nf, loff_t offset,
> +		       unsigned long len, struct nfsd_write_dio *write_dio)
> +{
> +	const u32 dio_blocksize = nf->nf_dio_offset_align;
> +	loff_t orig_end, middle_end, start_end, start_offset = offset;
> +	ssize_t start_len = len;
> +
> +	if (WARN_ONCE(!nf->nf_dio_mem_align || !dio_blocksize,
> +		      "%s: underlying filesystem has not provided DIO alignment info\n",
> +		      __func__))
> +		return false;
> +	if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> +		      "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> +		      __func__, dio_blocksize, PAGE_SIZE))
> +		return false;
> +	if (unlikely(len < dio_blocksize))
> +		return false;
> +
> +	memset(write_dio, 0, sizeof(*write_dio));
> +
> +	if (((offset | len) & (dio_blocksize-1)) == 0) {
> +		/* already DIO-aligned, no misaligned head or tail */
> +		write_dio->middle_offset = offset;
> +		write_dio->middle_len = len;
> +		/* clear these for the benefit of trace_nfsd_analyze_write_dio */
> +		start_offset = 0;
> +		start_len = 0;
> +		goto out;
> +	}
> +
> +	start_end = round_up(offset, dio_blocksize);
> +	start_len = start_end - offset;
> +	orig_end = offset + len;
> +	middle_end = round_down(orig_end, dio_blocksize);
> +
> +	write_dio->start_len = start_len;
> +	write_dio->middle_offset = start_end;
> +	write_dio->middle_len = middle_end - start_end;
> +	write_dio->end_offset = middle_end;
> +	write_dio->end_len = orig_end - middle_end;
> +out:
> +	trace_nfsd_analyze_write_dio(rqstp, fhp, offset, len, start_offset, start_len,
> +				     write_dio->middle_offset, write_dio->middle_len,
> +				     write_dio->end_offset, write_dio->end_len);
> +	return true;
> +}
> +
> +/*
> + * Setup as many as 3 iov_iter based on extents described by @write_dio.
> + * @iterp: pointer to pointer to onstack array of 3 iov_iter structs from caller.
> + * @iter_is_dio_aligned: pointer to onstack array of 3 bools from caller.
> + * @rq_bvec: backing bio_vec used to setup all 3 iov_iter permutations.
> + * @nvecs: number of segments in @rq_bvec
> + * @cnt: size of the request in bytes
> + * @write_dio: nfsd_write_dio struct that describes start, middle and end extents.
> + *
> + * Returns the number of iov_iter that were setup.
> + */
> +static int
> +nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
> +			   struct bio_vec *rq_bvec, unsigned int nvecs,
> +			   unsigned long cnt, struct nfsd_write_dio *write_dio)
> +{
> +	int n_iters = 0;
> +	struct iov_iter *iters = *iterp;
> +
> +	/* Setup misaligned start? */
> +	if (write_dio->start_len) {
> +		iter_is_dio_aligned[n_iters] = false;
> +		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> +		iters[n_iters].count = write_dio->start_len;
> +		n_iters++;
> +	}
> +
> +	/* Setup DIO-aligned middle */
> +	iter_is_dio_aligned[n_iters] = true;
> +	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> +	if (write_dio->start_len)
> +		iov_iter_advance(&iters[n_iters], write_dio->start_len);
> +	iters[n_iters].count -= write_dio->end_len;
> +	n_iters++;
> +
> +	/* Setup misaligned end? */
> +	if (write_dio->end_len) {
> +		iter_is_dio_aligned[n_iters] = false;
> +		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> +		iov_iter_advance(&iters[n_iters],
> +				 write_dio->start_len + write_dio->middle_len);
> +		n_iters++;
> +	}
> +
> +	return n_iters;
> +}
> +
> +static int
> +nfsd_issue_write_buffered(struct svc_rqst *rqstp, struct file *file,
> +			  unsigned int nvecs, unsigned long *cnt,
> +			  struct kiocb *kiocb)
> +{
> +	struct iov_iter iter;
> +	int host_err;
> +
> +	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> +	if (host_err < 0)
> +		return host_err;
> +	*cnt = host_err;
> +
> +	return 0;
> +}
> +
> +static noinline int
> +nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		     struct nfsd_file *nf, loff_t offset,
> +		     unsigned int nvecs, unsigned long *cnt,
> +		     struct kiocb *kiocb)
> +{
> +	struct nfsd_write_dio write_dio;
> +	struct file *file = nf->nf_file;
> +
> +	if (!nfsd_analyze_write_dio(rqstp, fhp, nf, offset,
> +				    *cnt, &write_dio)) {
> +		return nfsd_issue_write_buffered(rqstp, file,
> +					nvecs, cnt, kiocb);
> +	} else {
> +		bool iter_is_dio_aligned[3];
> +		struct iov_iter iter_stack[3];
> +		struct iov_iter *iter = iter_stack;
> +		unsigned int n_iters = 0;
> +		int host_err;
> +
> +		n_iters = nfsd_setup_write_dio_iters(&iter,
> +				iter_is_dio_aligned, rqstp->rq_bvec,
> +				nvecs, *cnt, &write_dio);
> +		*cnt = 0;
> +		for (int i = 0; i < n_iters; i++) {
> +			if (iter_is_dio_aligned[i])
> +				kiocb->ki_flags |= IOCB_DIRECT;
> +			else
> +				kiocb->ki_flags &= ~IOCB_DIRECT;
> +			host_err = vfs_iocb_iter_write(file, kiocb,
> +						       &iter[i]);
> +			if (host_err < 0)
> +				return host_err;
> +			*cnt += host_err;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  /**
>   * nfsd_vfs_write - write data to an already-open file
>   * @rqstp: RPC execution context
> @@ -1342,7 +1503,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	struct super_block	*sb = file_inode(file)->i_sb;
>  	struct kiocb		kiocb;
>  	struct svc_export	*exp;
> -	struct iov_iter		iter;
>  	errseq_t		since;
>  	__be32			nfserr;
>  	int			host_err;
> @@ -1379,31 +1539,28 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		kiocb.ki_flags |= IOCB_DSYNC;
>  
>  	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> -	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +
> +	since = READ_ONCE(file->f_wb_err);
> +	if (verf)
> +		nfsd_copy_write_verifier(verf, nn);
>  
>  	switch (nfsd_io_cache_write) {
>  	case NFSD_IO_DIRECT:
> -		/* direct I/O must be aligned to device logical sector size */
> -		if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
> -		    (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0))
> -			kiocb.ki_flags |= IOCB_DIRECT;
> +		host_err = nfsd_issue_write_dio(rqstp, fhp, nf, offset,
> +						nvecs, cnt, &kiocb);
>  		break;
>  	case NFSD_IO_DONTCACHE:
>  		kiocb.ki_flags |= IOCB_DONTCACHE;
> -		break;
> +		fallthrough;
>  	case NFSD_IO_BUFFERED:
> +		host_err = nfsd_issue_write_buffered(rqstp, file,
> +						nvecs, cnt, &kiocb);
>  		break;
>  	}
> -
> -	since = READ_ONCE(file->f_wb_err);
> -	if (verf)
> -		nfsd_copy_write_verifier(verf, nn);
> -	host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
>  	if (host_err < 0) {
>  		commit_reset_write_verifier(nn, rqstp, host_err);
>  		goto out_nfserr;
>  	}
> -	*cnt = host_err;
>  	nfsd_stats_io_write_add(nn, exp, *cnt);
>  	fsnotify_modify(file);
>  	host_err = filemap_check_wb_err(file->f_mapping, since);


Assuming the follow-on patch gets folded in:

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 7/7] NFSD: issue WRITEs using O_DIRECT even if IO is misaligned
  2025-08-08  2:30   ` Mike Snitzer
@ 2025-08-08 14:21     ` Chuck Lever
  0 siblings, 0 replies; 25+ messages in thread
From: Chuck Lever @ 2025-08-08 14:21 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs

On 8/7/25 10:30 PM, Mike Snitzer wrote:
> On Thu, Aug 07, 2025 at 12:25:44PM -0400, Mike Snitzer wrote:
>> If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
>> middle and end as needed. The large middle extent is DIO-aligned and
>> the start and/or end are misaligned. Buffered IO is used for the
>> misaligned extents and O_DIRECT is used for the middle DIO-aligned
>> extent.
>>
>> The nfsd_analyze_write_dio trace event shows how NFSD splits a given
>> misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
>> extent.
>>
>> This combination of trace events is useful:
>>
>>   echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
>>   echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
>>   echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
>>   echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
>>
>> Which for this dd command:
>>
>>   dd if=/dev/zero of=/mnt/share1/test bs=47008 count=2 oflag=direct
>>
>> Results in:
>>
>>   nfsd-23908   [010] ..... 10374.902333: nfsd_write_opened: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008
>>   nfsd-23908   [010] ..... 10374.902335: nfsd_analyze_write_dio: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+46592 end=46592+416
>>   nfsd-23908   [010] ..... 10374.902343: xfs_file_direct_write: dev 259:2 ino 0xc00116 disize 0x0 pos 0x0 bytecount 0xb600
>>   nfsd-23908   [010] ..... 10374.902697: nfsd_write_io_done: xid=0x7fc5923b fh_hash=0x857ca4fc offset=0 len=47008
>>
>>   nfsd-23908   [010] ..... 10374.902925: nfsd_write_opened: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008
>>   nfsd-23908   [010] ..... 10374.902926: nfsd_analyze_write_dio: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=47008+96 middle=47104+46592 end=93696+320
>>   nfsd-23908   [010] ..... 10374.903010: xfs_file_direct_write: dev 259:2 ino 0xc00116 disize 0xb800 pos 0xb800 bytecount 0xb600
>>   nfsd-23908   [010] ..... 10374.903239: nfsd_write_io_done: xid=0x80c5923b fh_hash=0x857ca4fc offset=47008 len=47008
>>
>> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>> ---
>>  fs/nfsd/vfs.c | 183 ++++++++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 170 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index be083a8812717..1b5aa3e6e6623 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -1315,6 +1315,167 @@ static int wait_for_concurrent_writes(struct file *file)
>>  	return err;
>>  }
>>  
>> +struct nfsd_write_dio {
>> +	loff_t middle_offset;	/* Offset for start of DIO-aligned middle */
>> +	loff_t end_offset;	/* Offset for start of DIO-aligned end */
>> +	ssize_t	start_len;	/* Length for misaligned first extent */
>> +	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
>> +	ssize_t	end_len;	/* Length for misaligned last extent */
>> +};
>> +
>> +static bool
>> +nfsd_analyze_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> +		       struct nfsd_file *nf, loff_t offset,
>> +		       unsigned long len, struct nfsd_write_dio *write_dio)
>> +{
>> +	const u32 dio_blocksize = nf->nf_dio_offset_align;
>> +	loff_t orig_end, middle_end, start_end, start_offset = offset;
>> +	ssize_t start_len = len;
>> +
>> +	if (WARN_ONCE(!nf->nf_dio_mem_align || !dio_blocksize,
>> +		      "%s: underlying filesystem has not provided DIO alignment info\n",
>> +		      __func__))
>> +		return false;
>> +	if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
>> +		      "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
>> +		      __func__, dio_blocksize, PAGE_SIZE))
>> +		return false;
>> +	if (unlikely(len < dio_blocksize))
>> +		return false;
>> +
>> +	memset(write_dio, 0, sizeof(*write_dio));
>> +
>> +	if (((offset | len) & (dio_blocksize-1)) == 0) {
>> +		/* already DIO-aligned, no misaligned head or tail */
>> +		write_dio->middle_offset = offset;
>> +		write_dio->middle_len = len;
>> +		/* clear these for the benefit of trace_nfsd_analyze_write_dio */
>> +		start_offset = 0;
>> +		start_len = 0;
>> +		goto out;
>> +	}
>> +
>> +	start_end = round_up(offset, dio_blocksize);
>> +	start_len = start_end - offset;
>> +	orig_end = offset + len;
>> +	middle_end = round_down(orig_end, dio_blocksize);
>> +
>> +	write_dio->start_len = start_len;
>> +	write_dio->middle_offset = start_end;
>> +	write_dio->middle_len = middle_end - start_end;
>> +	write_dio->end_offset = middle_end;
>> +	write_dio->end_len = orig_end - middle_end;
>> +out:
>> +	trace_nfsd_analyze_write_dio(rqstp, fhp, offset, len, start_offset, start_len,
>> +				     write_dio->middle_offset, write_dio->middle_len,
>> +				     write_dio->end_offset, write_dio->end_len);
>> +	return true;
>> +}
>> +
>> +/*
>> + * Setup as many as 3 iov_iter based on extents described by @write_dio.
>> + * @iterp: pointer to pointer to onstack array of 3 iov_iter structs from caller.
>> + * @iter_is_dio_aligned: pointer to onstack array of 3 bools from caller.
>> + * @rq_bvec: backing bio_vec used to setup all 3 iov_iter permutations.
>> + * @nvecs: number of segments in @rq_bvec
>> + * @cnt: size of the request in bytes
>> + * @write_dio: nfsd_write_dio struct that describes start, middle and end extents.
>> + *
>> + * Returns the number of iov_iter that were setup.
>> + */
>> +static int
>> +nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
>> +			   struct bio_vec *rq_bvec, unsigned int nvecs,
>> +			   unsigned long cnt, struct nfsd_write_dio *write_dio)
>> +{
>> +	int n_iters = 0;
>> +	struct iov_iter *iters = *iterp;
>> +
>> +	/* Setup misaligned start? */
>> +	if (write_dio->start_len) {
>> +		iter_is_dio_aligned[n_iters] = false;
>> +		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
>> +		iters[n_iters].count = write_dio->start_len;
>> +		n_iters++;
>> +	}
>> +
>> +	/* Setup DIO-aligned middle */
>> +	iter_is_dio_aligned[n_iters] = true;
>> +	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
>> +	if (write_dio->start_len)
>> +		iov_iter_advance(&iters[n_iters], write_dio->start_len);
>> +	iters[n_iters].count -= write_dio->end_len;
>> +	n_iters++;
>> +
>> +	/* Setup misaligned end? */
>> +	if (write_dio->end_len) {
>> +		iter_is_dio_aligned[n_iters] = false;
>> +		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
>> +		iov_iter_advance(&iters[n_iters],
>> +				 write_dio->start_len + write_dio->middle_len);
>> +		n_iters++;
>> +	}
>> +
>> +	return n_iters;
>> +}
>> +
>> +static int
>> +nfsd_issue_write_buffered(struct svc_rqst *rqstp, struct file *file,
>> +			  unsigned int nvecs, unsigned long *cnt,
>> +			  struct kiocb *kiocb)
>> +{
>> +	struct iov_iter iter;
>> +	int host_err;
>> +
>> +	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
>> +	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
>> +	if (host_err < 0)
>> +		return host_err;
>> +	*cnt = host_err;
>> +
>> +	return 0;
>> +}
>> +
>> +static noinline int
>> +nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> +		     struct nfsd_file *nf, loff_t offset,
>> +		     unsigned int nvecs, unsigned long *cnt,
>> +		     struct kiocb *kiocb)
>> +{
>> +	struct nfsd_write_dio write_dio;
>> +	struct file *file = nf->nf_file;
>> +
>> +	if (!nfsd_analyze_write_dio(rqstp, fhp, nf, offset,
>> +				    *cnt, &write_dio)) {
>> +		return nfsd_issue_write_buffered(rqstp, file,
>> +					nvecs, cnt, kiocb);
>> +	} else {
>> +		bool iter_is_dio_aligned[3];
>> +		struct iov_iter iter_stack[3];
>> +		struct iov_iter *iter = iter_stack;
>> +		unsigned int n_iters = 0;
>> +		int host_err;
>> +
>> +		n_iters = nfsd_setup_write_dio_iters(&iter,
>> +				iter_is_dio_aligned, rqstp->rq_bvec,
>> +				nvecs, *cnt, &write_dio);
>> +		*cnt = 0;
>> +		for (int i = 0; i < n_iters; i++) {
>> +			if (iter_is_dio_aligned[i])
>> +				kiocb->ki_flags |= IOCB_DIRECT;
>> +			else
>> +				kiocb->ki_flags &= ~IOCB_DIRECT;
>> +			host_err = vfs_iocb_iter_write(file, kiocb,
>> +						       &iter[i]);
>> +			if (host_err < 0)
>> +				return host_err;
>> +			*cnt += host_err;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  /**
>>   * nfsd_vfs_write - write data to an already-open file
>>   * @rqstp: RPC execution context
>> @@ -1342,7 +1503,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>  	struct super_block	*sb = file_inode(file)->i_sb;
>>  	struct kiocb		kiocb;
>>  	struct svc_export	*exp;
>> -	struct iov_iter		iter;
>>  	errseq_t		since;
>>  	__be32			nfserr;
>>  	int			host_err;
>> @@ -1379,31 +1539,28 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>  		kiocb.ki_flags |= IOCB_DSYNC;
>>  
>>  	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
>> -	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
>> +
>> +	since = READ_ONCE(file->f_wb_err);
>> +	if (verf)
>> +		nfsd_copy_write_verifier(verf, nn);
>>  
>>  	switch (nfsd_io_cache_write) {
>>  	case NFSD_IO_DIRECT:
>> -		/* direct I/O must be aligned to device logical sector size */
>> -		if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
>> -		    (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0))
>> -			kiocb.ki_flags |= IOCB_DIRECT;
>> +		host_err = nfsd_issue_write_dio(rqstp, fhp, nf, offset,
>> +						nvecs, cnt, &kiocb);
>>  		break;
>>  	case NFSD_IO_DONTCACHE:
>>  		kiocb.ki_flags |= IOCB_DONTCACHE;
>> -		break;
>> +		fallthrough;
>>  	case NFSD_IO_BUFFERED:
>> +		host_err = nfsd_issue_write_buffered(rqstp, file,
>> +						nvecs, cnt, &kiocb);
>>  		break;
>>  	}
>> -
>> -	since = READ_ONCE(file->f_wb_err);
>> -	if (verf)
>> -		nfsd_copy_write_verifier(verf, nn);
>> -	host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
>>  	if (host_err < 0) {
>>  		commit_reset_write_verifier(nn, rqstp, host_err);
>>  		goto out_nfserr;
>>  	}
>> -	*cnt = host_err;
>>  	nfsd_stats_io_write_add(nn, exp, *cnt);
>>  	fsnotify_modify(file);
>>  	host_err = filemap_check_wb_err(file->f_mapping, since);
>> -- 
>> 2.44.0
>>
>>
> 
> Embarrassingly, turns out I only tested the NFSD_IO_DIRECT case prior
> to submitting this v5, if the 'nfsd_io_cache_write' debugfs file is
> never written it defaults to NFSD_IO_UNSPECIFIED.  But even if that's
> the case we need to treat NFSD_IO_UNSPECIFIED like NFSD_IO_BUFFERED.
> (that is how nfsd_vfs_read behaves, but I missed this in
> nfsd_vfs_write when I refactored the code for v5).
> 
> This incremental patch fixes this oversight, Chuck should I submit a
> v8 or you're OK with folding this fixup (if the rest of the code is
> OK)?

Hold off on v6 for now. Let's wait for more review comments.


> Thanks,
> Mike
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 1b5aa3e6e6623..b529754a20bd5 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1505,7 +1505,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	struct svc_export	*exp;
>  	errseq_t		since;
>  	__be32			nfserr;
> -	int			host_err;
> +	int			host_err = 0;
>  	unsigned long		exp_op_flags = 0;
>  	unsigned int		pflags = current->flags;
>  	bool			restore_flags = false;
> @@ -1552,6 +1552,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	case NFSD_IO_DONTCACHE:
>  		kiocb.ki_flags |= IOCB_DONTCACHE;
>  		fallthrough;
> +	case NFSD_IO_UNSPECIFIED:
>  	case NFSD_IO_BUFFERED:
>  		host_err = nfsd_issue_write_buffered(rqstp, file,
>  						nvecs, cnt, &kiocb);


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 3/7] NFSD: add io_cache_read controls to debugfs interface
  2025-08-07 16:25 ` [PATCH v5 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
  2025-08-08 12:05   ` Jeff Layton
  2025-08-08 12:05   ` Jeff Layton
@ 2025-08-08 17:58   ` Chuck Lever
  2 siblings, 0 replies; 25+ messages in thread
From: Chuck Lever @ 2025-08-08 17:58 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs

On 8/7/25 12:25 PM, Mike Snitzer wrote:
> Add 'io_cache_read' to NFSD's debugfs interface so that: Any data
> read by NFSD will either be:
> - cached using page cache (NFSD_IO_BUFFERED=1)
> - cached but removed from the page cache upon completion
>   (NFSD_IO_DONTCACHE=2).
> - not cached (NFSD_IO_DIRECT=3)
> 
> io_cache_read may be set by writing to:
>   /sys/kernel/debug/nfsd/io_cache_read
> 
> If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
> advertised as supported by the underlying filesystem (e.g. XFS),
> otherwise all IO flagged with RWF_DONTCACHE will fail with
> -EOPNOTSUPP.
> 
> If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
> relative to the underlying block device's logical_block_size. Also the
> memory buffer used to store the read must be aligned relative to the
> underlying block device's dma_alignment.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/debugfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nfsd/nfsd.h    |  9 ++++++++
>  fs/nfsd/vfs.c     | 19 +++++++++++++---
>  3 files changed, 83 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 84b0c8b559dc9..c07f71d4e84f4 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -27,11 +27,66 @@ static int nfsd_dsr_get(void *data, u64 *val)
>  static int nfsd_dsr_set(void *data, u64 val)
>  {
>  	nfsd_disable_splice_read = (val > 0) ? true : false;
> +	if (!nfsd_disable_splice_read) {
> +		/*
> +		 * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
> +		 * if splice_read is enabled.
> +		 */
> +		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> +	}
>  	return 0;
>  }
>  
>  DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
>  
> +/*
> + * /sys/kernel/debug/nfsd/io_cache_read
> + *
> + * Contents:
> + *   %1: NFS READ will use buffered IO
> + *   %2: NFS READ will use dontcache (buffered IO w/ dropbehind)
> + *   %3: NFS READ will use direct IO
> + *
> + * The default value of this setting is zero (UNSPECIFIED).
> + * This setting takes immediate effect for all NFS versions,
> + * all exports, and in all NFSD net namespaces.
> + */
> +
> +static int nfsd_io_cache_read_get(void *data, u64 *val)
> +{
> +	*val = nfsd_io_cache_read;
> +	return 0;
> +}
> +
> +static int nfsd_io_cache_read_set(void *data, u64 val)
> +{
> +	int ret = 0;
> +
> +	switch (val) {
> +	case NFSD_IO_BUFFERED:
> +		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> +		break;
> +	case NFSD_IO_DONTCACHE:
> +	case NFSD_IO_DIRECT:
> +		/*
> +		 * Must disable splice_read when enabling
> +		 * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
> +		 */
> +		nfsd_disable_splice_read = true;
> +		nfsd_io_cache_read = val;
> +		break;
> +	default:
> +		nfsd_io_cache_read = NFSD_IO_UNSPECIFIED;
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
> +			 nfsd_io_cache_read_set, "%llu\n");
> +
>  void nfsd_debugfs_exit(void)
>  {
>  	debugfs_remove_recursive(nfsd_top_dir);
> @@ -44,4 +99,7 @@ void nfsd_debugfs_init(void)
>  
>  	debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
>  			    nfsd_top_dir, NULL, &nfsd_dsr_fops);
> +
> +	debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
> +			    nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
>  }
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 1cd0bed57bc2f..6ef799405145f 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -153,6 +153,15 @@ static inline void nfsd_debugfs_exit(void) {}
>  
>  extern bool nfsd_disable_splice_read __read_mostly;
>  
> +enum {
> +	NFSD_IO_UNSPECIFIED = 0,
> +	NFSD_IO_BUFFERED,
> +	NFSD_IO_DONTCACHE,
> +	NFSD_IO_DIRECT,
> +};
> +
> +extern u64 nfsd_io_cache_read __read_mostly;
> +
>  extern int nfsd_max_blksize;
>  
>  static inline int nfsd_v4client(struct svc_rqst *rq)
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 79439ad93880a..26b6d96258711 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -49,6 +49,7 @@
>  #define NFSDDBG_FACILITY		NFSDDBG_FILEOP
>  
>  bool nfsd_disable_splice_read __read_mostly;
> +u64 nfsd_io_cache_read __read_mostly;
>  
>  /**
>   * nfserrno - Map Linux errnos to NFS errnos
> @@ -1099,17 +1100,29 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	size_t len;
>  
>  	init_sync_kiocb(&kiocb, file);
> +
> +	if (nfsd_io_cache_read == NFSD_IO_DIRECT) {
> +		/* Verify ondisk DIO alignment, memory addrs checked below */
> +		if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
> +		    (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0))
> +			kiocb.ki_flags = IOCB_DIRECT;
> +	} else if (nfsd_io_cache_read == NFSD_IO_DONTCACHE)
> +		kiocb.ki_flags = IOCB_DONTCACHE;
> +

Personal style: let's make this a switch statement like it will be for
the write path.


>  	kiocb.ki_pos = offset;
>  
>  	v = 0;
>  	total = *count;
>  	while (total) {
>  		len = min_t(size_t, total, PAGE_SIZE - base);
> -		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> -			      len, base);
> +		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++), len, base);

Nit: changing this line does not appear to be necessary.


> +		/* No need to verify memory is DIO-aligned since bv_offset is 0 */
> +		if (unlikely((kiocb.ki_flags & IOCB_DIRECT) && base &&
> +			     (base & (nf->nf_dio_mem_align - 1))))
> +			kiocb.ki_flags &= ~IOCB_DIRECT;
>  		total -= len;
> -		++v;
>  		base = 0;
> +		v++;

Nit: I've actually measured pre-incrementing to be slightly faster than
post-incrementing, so I'd like to keep "++v;" here.


>  	}
>  	WARN_ON_ONCE(v > rqstp->rq_maxpages);
>  


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 4/7] NFSD: add io_cache_write controls to debugfs interface
  2025-08-07 16:25 ` [PATCH v5 4/7] NFSD: add io_cache_write " Mike Snitzer
  2025-08-08 12:03   ` Jeff Layton
@ 2025-08-08 17:58   ` Chuck Lever
  2025-08-08 18:10     ` Mike Snitzer
  1 sibling, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2025-08-08 17:58 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs

On 8/7/25 12:25 PM, Mike Snitzer wrote:
> Add 'io_cache_write' to NFSD's debugfs interface so that: Any data
> written by NFSD will either be:
> - cached using page cache (NFSD_IO_BUFFERED=1)
> - cached but removed from the page cache upon completion
>   (NFSD_IO_DONTCACHE=2).
> - not cached (NFSD_IO_DIRECT=3)
> 
> io_cache_write may be set by writing to:
>   /sys/kernel/debug/nfsd/io_cache_write
> 
> If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
> advertised as supported by the underlying filesystem (e.g. XFS),
> otherwise all IO flagged with RWF_DONTCACHE will fail with
> -EOPNOTSUPP.
> 
> If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
> relative to the underlying block device's logical_block_size. Also the
> memory buffer used to store the WRITE payload must be aligned relative
> to the underlying block device's dma_alignment.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/debugfs.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/nfsd/nfsd.h    |  1 +
>  fs/nfsd/vfs.c     | 16 ++++++++++++++++
>  3 files changed, 61 insertions(+)
> 
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index c07f71d4e84f4..872de65f0e9ac 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -87,6 +87,47 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
>  DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
>  			 nfsd_io_cache_read_set, "%llu\n");
>  
> +/*
> + * /sys/kernel/debug/nfsd/io_cache_write
> + *
> + * Contents:
> + *   %1: NFS WRITE will use buffered IO
> + *   %2: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
> + *   %3: NFS WRITE will use direct IO
> + *
> + * The default value of this setting is zero (UNSPECIFIED).
> + * This setting takes immediate effect for all NFS versions,
> + * all exports, and in all NFSD net namespaces.
> + */
> +
> +static int nfsd_io_cache_write_get(void *data, u64 *val)
> +{
> +	*val = nfsd_io_cache_write;
> +	return 0;
> +}
> +
> +static int nfsd_io_cache_write_set(void *data, u64 val)
> +{
> +	int ret = 0;
> +
> +	switch (val) {
> +	case NFSD_IO_BUFFERED:
> +	case NFSD_IO_DONTCACHE:
> +	case NFSD_IO_DIRECT:
> +		nfsd_io_cache_write = val;
> +		break;
> +	default:
> +		nfsd_io_cache_write = NFSD_IO_UNSPECIFIED;
> +		ret = -EINVAL;

I might be wrong, but an error return should leave the setting
untouched, IMO. Likewise for the read setting.


> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_write_fops, nfsd_io_cache_write_get,
> +			 nfsd_io_cache_write_set, "%llu\n");
> +
>  void nfsd_debugfs_exit(void)
>  {
>  	debugfs_remove_recursive(nfsd_top_dir);
> @@ -102,4 +143,7 @@ void nfsd_debugfs_init(void)
>  
>  	debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
>  			    nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
> +
> +	debugfs_create_file("io_cache_write", S_IWUSR | S_IRUGO,
> +			    nfsd_top_dir, NULL, &nfsd_io_cache_write_fops);
>  }
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 6ef799405145f..fe935b4cda538 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -161,6 +161,7 @@ enum {
>  };
>  
>  extern u64 nfsd_io_cache_read __read_mostly;
> +extern u64 nfsd_io_cache_write __read_mostly;
>  
>  extern int nfsd_max_blksize;
>  
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 26b6d96258711..5768244c7a3c3 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -50,6 +50,7 @@
>  
>  bool nfsd_disable_splice_read __read_mostly;
>  u64 nfsd_io_cache_read __read_mostly;
> +u64 nfsd_io_cache_write __read_mostly;
>  
>  /**
>   * nfserrno - Map Linux errnos to NFS errnos
> @@ -1234,6 +1235,21 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  
>  	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
>  	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +
> +	switch (nfsd_io_cache_write) {
> +	case NFSD_IO_DIRECT:
> +		/* direct I/O must be aligned to device logical sector size */
> +		if (nf->nf_dio_mem_align && nf->nf_dio_offset_align &&
> +		    (((offset | *cnt) & (nf->nf_dio_offset_align-1)) == 0))
> +			kiocb.ki_flags |= IOCB_DIRECT;
> +		break;
> +	case NFSD_IO_DONTCACHE:
> +		kiocb.ki_flags |= IOCB_DONTCACHE;
> +		break;
> +	case NFSD_IO_BUFFERED:
> +		break;
> +	}
> +
>  	since = READ_ONCE(file->f_wb_err);
>  	if (verf)
>  		nfsd_copy_write_verifier(verf, nn);


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 5/7] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled
  2025-08-07 16:25 ` [PATCH v5 5/7] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled Mike Snitzer
  2025-08-08 12:05   ` Jeff Layton
@ 2025-08-08 17:59   ` Chuck Lever
  2025-08-08 18:12     ` Mike Snitzer
  1 sibling, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2025-08-08 17:59 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs

On 8/7/25 12:25 PM, Mike Snitzer wrote:
> Suggested-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/filecache.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 5447dba6c5da0..5601e839a72da 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -1058,8 +1058,12 @@ nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
>  	struct kstat stat;
>  	__be32 status;
>  
> -	/* Currently only need to get DIO alignment info for regular files */
> -	if (!S_ISREG(inode->i_mode))
> +	/* Currently only need to get DIO alignment info for regular files
> +	 * IFF NFSD_IO_DIRECT is enabled for nfsd_io_cache_{read,write}.

What happens if the I/O mode is changed from buffered to direct I/O
while there are open files? I think you will need to collect the
alignment parameters on all opens regardless of the I/O mode setting.


> +	 */
> +	if (!S_ISREG(inode->i_mode) ||
> +	    (nfsd_io_cache_read != NFSD_IO_DIRECT &&
> +	     nfsd_io_cache_write != NFSD_IO_DIRECT))
>  		return nfs_ok;
>  
>  	status = fh_getattr(fhp, &stat);


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 6/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
  2025-08-07 16:25 ` [PATCH v5 6/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
  2025-08-08 12:16   ` Jeff Layton
@ 2025-08-08 17:59   ` Chuck Lever
  2025-08-08 18:19     ` Mike Snitzer
  1 sibling, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2025-08-08 17:59 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs

On 8/7/25 12:25 PM, Mike Snitzer wrote:
> If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
> DIO-aligned block (on either end of the READ). The expanded READ is
> verified to have proper offset/len (logical_block_size) and
> dma_alignment checking.
> 
> Must allocate and use a bounce-buffer page (called 'start_extra_page')
> if/when expanding the misaligned READ requires reading extra partial
> page at the start of the READ so that its DIO-aligned. Otherwise that
> extra page at the start will make its way back to the NFS client and
> corruption will occur. As found, and then this fix of using an extra
> page verified, using the 'dt' utility:
>   dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
>      iotype=sequential pattern=iot onerr=abort oncerr=abort
> see: https://github.com/RobinTMiller/dt.git
> 
> Any misaligned READ that is less than 32K won't be expanded to be
> DIO-aligned (this heuristic just avoids excess work, like allocating
> start_extra_page, for smaller IO that can generally already perform
> well using buffered IO).
> 
> Also add EVENT_CLASS nfsd_analyze_dio_class and use it to create
> nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events. This
> prepares for nfsd_vfs_write() to also make use of it when handling
> misaligned WRITEs.
> 
> This combination of trace events is useful:
> 
>  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
>  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
>  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
>  echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
> 
> Which for this dd command:
> 
>  dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct
> 
> Results in:
> 
>   nfsd-23908   [010] ..... 10375.141640: nfsd_analyze_read_dio: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+47008 end=47008+96
>   nfsd-23908   [010] ..... 10375.141642: nfsd_read_vector: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47104
>   nfsd-23908   [010] ..... 10375.141643: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0x0 bytecount 0xb800
>   nfsd-23908   [010] ..... 10375.141773: nfsd_read_io_done: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008
> 
>   nfsd-23908   [010] ..... 10375.142063: nfsd_analyze_read_dio: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=46592+416 middle=47008+47008 end=94016+192
>   nfsd-23908   [010] ..... 10375.142064: nfsd_read_vector: xid=0x83c5923b fh_hash=0x857ca4fc offset=46592 len=47616
>   nfsd-23908   [010] ..... 10375.142065: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0xb600 bytecount 0xba00
>   nfsd-23908   [010] ..... 10375.142103: nfsd_read_io_done: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008
> 
> Suggested-by: Jeff Layton <jlayton@kernel.org>
> Suggested-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/trace.h            |  61 ++++++++++++++
>  fs/nfsd/vfs.c              | 167 ++++++++++++++++++++++++++++++++++---
>  include/linux/sunrpc/svc.h |   5 +-
>  3 files changed, 221 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index a664fdf1161e9..4173bd9344b6b 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -473,6 +473,67 @@ DEFINE_NFSD_IO_EVENT(write_done);
>  DEFINE_NFSD_IO_EVENT(commit_start);
>  DEFINE_NFSD_IO_EVENT(commit_done);
>  
> +DECLARE_EVENT_CLASS(nfsd_analyze_dio_class,
> +	TP_PROTO(struct svc_rqst *rqstp,
> +		 struct svc_fh	*fhp,
> +		 u64		offset,
> +		 u32		len,
> +		 loff_t		start,
> +		 ssize_t	start_len,
> +		 loff_t		middle,
> +		 ssize_t	middle_len,
> +		 loff_t		end,
> +		 ssize_t	end_len),
> +	TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len),
> +	TP_STRUCT__entry(
> +		__field(u32, xid)
> +		__field(u32, fh_hash)
> +		__field(u64, offset)
> +		__field(u32, len)
> +		__field(loff_t, start)
> +		__field(ssize_t, start_len)
> +		__field(loff_t, middle)
> +		__field(ssize_t, middle_len)
> +		__field(loff_t, end)
> +		__field(ssize_t, end_len)
> +	),
> +	TP_fast_assign(
> +		__entry->xid = be32_to_cpu(rqstp->rq_xid);
> +		__entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
> +		__entry->offset = offset;
> +		__entry->len = len;
> +		__entry->start = start;
> +		__entry->start_len = start_len;
> +		__entry->middle = middle;
> +		__entry->middle_len = middle_len;
> +		__entry->end = end;
> +		__entry->end_len = end_len;
> +	),
> +	TP_printk("xid=0x%08x fh_hash=0x%08x offset=%llu len=%u start=%llu+%lu middle=%llu+%lu end=%llu+%lu",
> +		  __entry->xid, __entry->fh_hash,
> +		  __entry->offset, __entry->len,
> +		  __entry->start, __entry->start_len,
> +		  __entry->middle, __entry->middle_len,
> +		  __entry->end, __entry->end_len)
> +)
> +
> +#define DEFINE_NFSD_ANALYZE_DIO_EVENT(name)			\
> +DEFINE_EVENT(nfsd_analyze_dio_class, nfsd_analyze_##name##_dio,	\
> +	TP_PROTO(struct svc_rqst *rqstp,			\
> +		 struct svc_fh	*fhp,				\
> +		 u64		offset,				\
> +		 u32		len,				\
> +		 loff_t		start,				\
> +		 ssize_t	start_len,			\
> +		 loff_t		middle,				\
> +		 ssize_t	middle_len,			\
> +		 loff_t		end,				\
> +		 ssize_t	end_len),			\
> +	TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len))
> +
> +DEFINE_NFSD_ANALYZE_DIO_EVENT(read);
> +DEFINE_NFSD_ANALYZE_DIO_EVENT(write);
> +

Just a random thought: usually I add new trace points at the end of
the series to keep the deeper patches smaller.


>  DECLARE_EVENT_CLASS(nfsd_err_class,
>  	TP_PROTO(struct svc_rqst *rqstp,
>  		 struct svc_fh	*fhp,
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 5768244c7a3c3..be083a8812717 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -19,6 +19,7 @@
>  #include <linux/splice.h>
>  #include <linux/falloc.h>
>  #include <linux/fcntl.h>
> +#include <linux/math.h>
>  #include <linux/namei.h>
>  #include <linux/delay.h>
>  #include <linux/fsnotify.h>
> @@ -1073,6 +1074,116 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>  }
>  
> +struct nfsd_read_dio {
> +	loff_t start;
> +	loff_t end;
> +	unsigned long start_extra;
> +	unsigned long end_extra;
> +	struct page *start_extra_page;
> +};
> +
> +static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
> +{
> +	memset(read_dio, 0, sizeof(*read_dio));
> +	read_dio->start_extra_page = NULL;
> +}
> +
> +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +				  struct nfsd_file *nf, loff_t offset,
> +				  unsigned long len, unsigned int base,
> +				  struct nfsd_read_dio *read_dio)
> +{
> +	const u32 dio_blocksize = nf->nf_dio_read_offset_align;
> +	loff_t middle_end, orig_end = offset + len;
> +
> +	if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
> +		      "%s: underlying filesystem has not provided DIO alignment info\n",
> +		      __func__))
> +		return false;
> +	if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> +		      "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> +		      __func__, dio_blocksize, PAGE_SIZE))
> +		return false;
> +
> +	/* Return early if IO is irreparably misaligned
> +	 * (len < PAGE_SIZE, or base not aligned).
> +	 */
> +	if (unlikely(len < dio_blocksize) ||
> +	    ((base & (nf->nf_dio_mem_align-1)) != 0))
> +		return false;
> +
> +	read_dio->start = round_down(offset, dio_blocksize);
> +	read_dio->end = round_up(orig_end, dio_blocksize);
> +	read_dio->start_extra = offset - read_dio->start;
> +	read_dio->end_extra = read_dio->end - orig_end;
> +
> +	/* don't expand READ for IO less than 32K */

The code already says "don't expand READ for IO less than 32K". The
comment needs to explain why. Move the rationale from the patch
description to this comment, maybe?


> +	if ((read_dio->start_extra || read_dio->end_extra) && (len < (32 << 10))) {
> +		init_nfsd_read_dio(read_dio);
> +		return false;
> +	}

Nit: Let's replace the raw integer with a symbolic constant. But let's
resist the urge to expose this as a tunable for now ;-)


> +
> +	if (read_dio->start_extra) {
> +		read_dio->start_extra_page = alloc_page(GFP_KERNEL);
> +		if (WARN_ONCE(read_dio->start_extra_page == NULL,
> +			      "%s: Unable to allocate start_extra_page\n", __func__)) {
> +			init_nfsd_read_dio(read_dio);
> +			return false;
> +		}
> +	}
> +
> +	/* Show original offset and count, and how it was expanded for DIO */
> +	middle_end = read_dio->end - read_dio->end_extra;
> +	trace_nfsd_analyze_read_dio(rqstp, fhp, offset, len,
> +				    read_dio->start, read_dio->start_extra,
> +				    offset, (middle_end - offset),
> +				    middle_end, read_dio->end_extra);
> +	return true;
> +}
> +
> +static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
> +						 struct nfsd_read_dio *read_dio,
> +						 ssize_t bytes_read,
> +						 unsigned long bytes_expected,
> +						 loff_t *offset,
> +						 unsigned long *rq_bvec_numpages)
> +{
> +	ssize_t host_err = bytes_read;
> +	loff_t v;
> +
> +	/* If nfsd_analyze_read_dio() allocated a start_extra_page it must
> +	 * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
> +	 */
> +	if (read_dio->start_extra_page) {
> +		__free_page(read_dio->start_extra_page);
> +		*rq_bvec_numpages -= 1;
> +		v = *rq_bvec_numpages;
> +		memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
> +			v * sizeof(struct bio_vec));
> +	}
> +	/* Eliminate any end_extra bytes from the last page */
> +	v = *rq_bvec_numpages;
> +	rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
> +
> +	if (host_err < 0)
> +		return host_err;
> +
> +	/* nfsd_analyze_read_dio() may have expanded the start and end,
> +	 * if so adjust returned read size to reflect original extent.
> +	 */
> +	*offset += read_dio->start_extra;
> +	if (likely(host_err >= read_dio->start_extra)) {
> +		host_err -= read_dio->start_extra;
> +		if (host_err > bytes_expected)
> +			host_err = bytes_expected;
> +	} else {
> +		/* Short read that didn't read any of requested data */
> +		host_err = 0;
> +	}
> +
> +	return host_err;
> +}
> +
>  /**
>   * nfsd_iter_read - Perform a VFS read using an iterator
>   * @rqstp: RPC transaction context
> @@ -1094,26 +1205,49 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		      unsigned int base, u32 *eof)
>  {
>  	struct file *file = nf->nf_file;
> -	unsigned long v, total;
> +	unsigned long v, total, in_count = *count;
> +	struct nfsd_read_dio read_dio;
>  	struct iov_iter iter;
>  	struct kiocb kiocb;
> -	ssize_t host_err;
> +	ssize_t host_err = 0;
>  	size_t len;
>  
> +	init_nfsd_read_dio(&read_dio);

Initialize only if direct I/O is in use. I think all this new read code
needs the same treatment as the write path: move the handling of the
esoteric I/O types out of the hot (buffered) path.


>  	init_sync_kiocb(&kiocb, file);
>  
> +	/*
> +	 * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
> +	 * the next DIO-aligned block (on either end of the READ).
> +	 */
>  	if (nfsd_io_cache_read == NFSD_IO_DIRECT) {
> -		/* Verify ondisk DIO alignment, memory addrs checked below */
> -		if (nf->nf_dio_mem_align && nf->nf_dio_read_offset_align &&
> -		    (((offset | *count) & (nf->nf_dio_read_offset_align - 1)) == 0))
> -			kiocb.ki_flags = IOCB_DIRECT;
> +		if (nfsd_analyze_read_dio(rqstp, fhp, nf, offset,
> +					  in_count, base, &read_dio)) {
> +			/* trace_nfsd_read_vector() will reflect larger
> +			 * DIO-aligned READ.
> +			 */
> +			offset = read_dio.start;
> +			in_count = read_dio.end - offset;
> +			/* Verify ondisk DIO alignment, memory addrs checked below */
> +			if (likely(((offset | in_count) &
> +				    (nf->nf_dio_read_offset_align - 1)) == 0))
> +				kiocb.ki_flags = IOCB_DIRECT;
> +		}
>  	} else if (nfsd_io_cache_read == NFSD_IO_DONTCACHE)
>  		kiocb.ki_flags = IOCB_DONTCACHE;
>  
>  	kiocb.ki_pos = offset;
>  
>  	v = 0;
> -	total = *count;
> +	total = in_count;
> +	if (read_dio.start_extra) {
> +		bvec_set_page(&rqstp->rq_bvec[v], read_dio.start_extra_page,
> +			      read_dio.start_extra, PAGE_SIZE - read_dio.start_extra);
> +		if (unlikely((kiocb.ki_flags & IOCB_DIRECT) &&
> +			     rqstp->rq_bvec[v].bv_offset & (nf->nf_dio_mem_align - 1)))
> +			kiocb.ki_flags &= ~IOCB_DIRECT;
> +		total -= read_dio.start_extra;
> +		v++;
> +	}
>  	while (total) {
>  		len = min_t(size_t, total, PAGE_SIZE - base);
>  		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++), len, base);
> @@ -1125,11 +1259,22 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		base = 0;
>  		v++;
>  	}
> -	WARN_ON_ONCE(v > rqstp->rq_maxpages);
> +	if (WARN_ONCE(v > rqstp->rq_maxpages,
> +		      "%s: v=%lu exceeds rqstp->rq_maxpages=%lu\n", __func__,
> +		      v, rqstp->rq_maxpages)) {
> +		host_err = -EINVAL;
> +	}
>  
> -	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> -	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> -	host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
> +	if (!host_err) {
> +		trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
> +		iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
> +		host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
> +	}
> +
> +	if (read_dio.start_extra || read_dio.end_extra) {
> +		host_err = nfsd_complete_misaligned_read_dio(rqstp, &read_dio,
> +					host_err, *count, &offset, &v);
> +	}
>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>  }
>  
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index e64ab444e0a7f..190c2667500e2 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>   * pages, one for the request, and one for the reply.
>   * nfsd_splice_actor() might need an extra page when a READ payload
>   * is not page-aligned.
> + * nfsd_iter_read() might need two extra pages when a READ payload
> + * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
> + * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
>   */
>  static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
>  {
> -	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
> +	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
>  }
>  
>  /*


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 4/7] NFSD: add io_cache_write controls to debugfs interface
  2025-08-08 17:58   ` Chuck Lever
@ 2025-08-08 18:10     ` Mike Snitzer
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Snitzer @ 2025-08-08 18:10 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, linux-nfs

On Fri, Aug 08, 2025 at 01:58:54PM -0400, Chuck Lever wrote:
> On 8/7/25 12:25 PM, Mike Snitzer wrote:
> > Add 'io_cache_write' to NFSD's debugfs interface so that: Any data
> > written by NFSD will either be:
> > - cached using page cache (NFSD_IO_BUFFERED=1)
> > - cached but removed from the page cache upon completion
> >   (NFSD_IO_DONTCACHE=2).
> > - not cached (NFSD_IO_DIRECT=3)
> > 
> > io_cache_write may be set by writing to:
> >   /sys/kernel/debug/nfsd/io_cache_write
> > 
> > If NFSD_IO_DONTCACHE is specified using 2, FOP_DONTCACHE must be
> > advertised as supported by the underlying filesystem (e.g. XFS),
> > otherwise all IO flagged with RWF_DONTCACHE will fail with
> > -EOPNOTSUPP.
> > 
> > If NFSD_IO_DIRECT is specified using 3, the IO must be aligned
> > relative to the underlying block device's logical_block_size. Also the
> > memory buffer used to store the WRITE payload must be aligned relative
> > to the underlying block device's dma_alignment.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> >  fs/nfsd/debugfs.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> >  fs/nfsd/nfsd.h    |  1 +
> >  fs/nfsd/vfs.c     | 16 ++++++++++++++++
> >  3 files changed, 61 insertions(+)
> > 
> > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > index c07f71d4e84f4..872de65f0e9ac 100644
> > --- a/fs/nfsd/debugfs.c
> > +++ b/fs/nfsd/debugfs.c
> > @@ -87,6 +87,47 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> >  DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
> >  			 nfsd_io_cache_read_set, "%llu\n");
> >  
> > +/*
> > + * /sys/kernel/debug/nfsd/io_cache_write
> > + *
> > + * Contents:
> > + *   %1: NFS WRITE will use buffered IO
> > + *   %2: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
> > + *   %3: NFS WRITE will use direct IO
> > + *
> > + * The default value of this setting is zero (UNSPECIFIED).
> > + * This setting takes immediate effect for all NFS versions,
> > + * all exports, and in all NFSD net namespaces.
> > + */
> > +
> > +static int nfsd_io_cache_write_get(void *data, u64 *val)
> > +{
> > +	*val = nfsd_io_cache_write;
> > +	return 0;
> > +}
> > +
> > +static int nfsd_io_cache_write_set(void *data, u64 val)
> > +{
> > +	int ret = 0;
> > +
> > +	switch (val) {
> > +	case NFSD_IO_BUFFERED:
> > +	case NFSD_IO_DONTCACHE:
> > +	case NFSD_IO_DIRECT:
> > +		nfsd_io_cache_write = val;
> > +		break;
> > +	default:
> > +		nfsd_io_cache_write = NFSD_IO_UNSPECIFIED;
> > +		ret = -EINVAL;
> 
> I might be wrong, but an error return should leave the setting
> untouched, IMO. Likewise for the read setting.

OK, we should get the NFSD_IO_UNSPECIFIED by default as a side-effect
of it being 0.

So _not_ explicitly setting NFSD_IO_UNSPECIFIED as
catch-all/default/error makes sense.

Will fix, thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 5/7] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled
  2025-08-08 17:59   ` Chuck Lever
@ 2025-08-08 18:12     ` Mike Snitzer
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Snitzer @ 2025-08-08 18:12 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, linux-nfs

On Fri, Aug 08, 2025 at 01:59:07PM -0400, Chuck Lever wrote:
> On 8/7/25 12:25 PM, Mike Snitzer wrote:
> > Suggested-by: Jeff Layton <jlayton@kernel.org>
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> >  fs/nfsd/filecache.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index 5447dba6c5da0..5601e839a72da 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -1058,8 +1058,12 @@ nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
> >  	struct kstat stat;
> >  	__be32 status;
> >  
> > -	/* Currently only need to get DIO alignment info for regular files */
> > -	if (!S_ISREG(inode->i_mode))
> > +	/* Currently only need to get DIO alignment info for regular files
> > +	 * IFF NFSD_IO_DIRECT is enabled for nfsd_io_cache_{read,write}.
> 
> What happens if the I/O mode is changed from buffered to direct I/O
> while there are open files? I think you will need to collect the
> alignment parameters on all opens regardless of the I/O mode setting.

Good point... so will just drop this patch, thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 6/7] NFSD: issue READs using O_DIRECT even if IO is misaligned
  2025-08-08 17:59   ` Chuck Lever
@ 2025-08-08 18:19     ` Mike Snitzer
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Snitzer @ 2025-08-08 18:19 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, linux-nfs

On Fri, Aug 08, 2025 at 01:59:47PM -0400, Chuck Lever wrote:
> On 8/7/25 12:25 PM, Mike Snitzer wrote:
> > If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
> > DIO-aligned block (on either end of the READ). The expanded READ is
> > verified to have proper offset/len (logical_block_size) and
> > dma_alignment checking.
> > 
> > Must allocate and use a bounce-buffer page (called 'start_extra_page')
> > if/when expanding the misaligned READ requires reading extra partial
> > page at the start of the READ so that its DIO-aligned. Otherwise that
> > extra page at the start will make its way back to the NFS client and
> > corruption will occur. As found, and then this fix of using an extra
> > page verified, using the 'dt' utility:
> >   dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
> >      iotype=sequential pattern=iot onerr=abort oncerr=abort
> > see: https://github.com/RobinTMiller/dt.git
> > 
> > Any misaligned READ that is less than 32K won't be expanded to be
> > DIO-aligned (this heuristic just avoids excess work, like allocating
> > start_extra_page, for smaller IO that can generally already perform
> > well using buffered IO).
> > 
> > Also add EVENT_CLASS nfsd_analyze_dio_class and use it to create
> > nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events. This
> > prepares for nfsd_vfs_write() to also make use of it when handling
> > misaligned WRITEs.
> > 
> > This combination of trace events is useful:
> > 
> >  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
> >  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
> >  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
> >  echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
> > 
> > Which for this dd command:
> > 
> >  dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct
> > 
> > Results in:
> > 
> >   nfsd-23908   [010] ..... 10375.141640: nfsd_analyze_read_dio: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008 start=0+0 middle=0+47008 end=47008+96
> >   nfsd-23908   [010] ..... 10375.141642: nfsd_read_vector: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47104
> >   nfsd-23908   [010] ..... 10375.141643: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0x0 bytecount 0xb800
> >   nfsd-23908   [010] ..... 10375.141773: nfsd_read_io_done: xid=0x82c5923b fh_hash=0x857ca4fc offset=0 len=47008
> > 
> >   nfsd-23908   [010] ..... 10375.142063: nfsd_analyze_read_dio: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008 start=46592+416 middle=47008+47008 end=94016+192
> >   nfsd-23908   [010] ..... 10375.142064: nfsd_read_vector: xid=0x83c5923b fh_hash=0x857ca4fc offset=46592 len=47616
> >   nfsd-23908   [010] ..... 10375.142065: xfs_file_direct_read: dev 259:2 ino 0xc00116 disize 0x226e0 pos 0xb600 bytecount 0xba00
> >   nfsd-23908   [010] ..... 10375.142103: nfsd_read_io_done: xid=0x83c5923b fh_hash=0x857ca4fc offset=47008 len=47008
> > 
> > Suggested-by: Jeff Layton <jlayton@kernel.org>
> > Suggested-by: Chuck Lever <chuck.lever@oracle.com>
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> >  fs/nfsd/trace.h            |  61 ++++++++++++++
> >  fs/nfsd/vfs.c              | 167 ++++++++++++++++++++++++++++++++++---
> >  include/linux/sunrpc/svc.h |   5 +-
> >  3 files changed, 221 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > index a664fdf1161e9..4173bd9344b6b 100644
> > --- a/fs/nfsd/trace.h
> > +++ b/fs/nfsd/trace.h
> > @@ -473,6 +473,67 @@ DEFINE_NFSD_IO_EVENT(write_done);
> >  DEFINE_NFSD_IO_EVENT(commit_start);
> >  DEFINE_NFSD_IO_EVENT(commit_done);
> >  
> > +DECLARE_EVENT_CLASS(nfsd_analyze_dio_class,
> > +	TP_PROTO(struct svc_rqst *rqstp,
> > +		 struct svc_fh	*fhp,
> > +		 u64		offset,
> > +		 u32		len,
> > +		 loff_t		start,
> > +		 ssize_t	start_len,
> > +		 loff_t		middle,
> > +		 ssize_t	middle_len,
> > +		 loff_t		end,
> > +		 ssize_t	end_len),
> > +	TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len),
> > +	TP_STRUCT__entry(
> > +		__field(u32, xid)
> > +		__field(u32, fh_hash)
> > +		__field(u64, offset)
> > +		__field(u32, len)
> > +		__field(loff_t, start)
> > +		__field(ssize_t, start_len)
> > +		__field(loff_t, middle)
> > +		__field(ssize_t, middle_len)
> > +		__field(loff_t, end)
> > +		__field(ssize_t, end_len)
> > +	),
> > +	TP_fast_assign(
> > +		__entry->xid = be32_to_cpu(rqstp->rq_xid);
> > +		__entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
> > +		__entry->offset = offset;
> > +		__entry->len = len;
> > +		__entry->start = start;
> > +		__entry->start_len = start_len;
> > +		__entry->middle = middle;
> > +		__entry->middle_len = middle_len;
> > +		__entry->end = end;
> > +		__entry->end_len = end_len;
> > +	),
> > +	TP_printk("xid=0x%08x fh_hash=0x%08x offset=%llu len=%u start=%llu+%lu middle=%llu+%lu end=%llu+%lu",
> > +		  __entry->xid, __entry->fh_hash,
> > +		  __entry->offset, __entry->len,
> > +		  __entry->start, __entry->start_len,
> > +		  __entry->middle, __entry->middle_len,
> > +		  __entry->end, __entry->end_len)
> > +)
> > +
> > +#define DEFINE_NFSD_ANALYZE_DIO_EVENT(name)			\
> > +DEFINE_EVENT(nfsd_analyze_dio_class, nfsd_analyze_##name##_dio,	\
> > +	TP_PROTO(struct svc_rqst *rqstp,			\
> > +		 struct svc_fh	*fhp,				\
> > +		 u64		offset,				\
> > +		 u32		len,				\
> > +		 loff_t		start,				\
> > +		 ssize_t	start_len,			\
> > +		 loff_t		middle,				\
> > +		 ssize_t	middle_len,			\
> > +		 loff_t		end,				\
> > +		 ssize_t	end_len),			\
> > +	TP_ARGS(rqstp, fhp, offset, len, start, start_len, middle, middle_len, end, end_len))
> > +
> > +DEFINE_NFSD_ANALYZE_DIO_EVENT(read);
> > +DEFINE_NFSD_ANALYZE_DIO_EVENT(write);
> > +
> 
> Just a random thought: usually I add new trace points at the end of
> the series to keep the deeper patches smaller.

OK, will do.

> >  DECLARE_EVENT_CLASS(nfsd_err_class,
> >  	TP_PROTO(struct svc_rqst *rqstp,
> >  		 struct svc_fh	*fhp,
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 5768244c7a3c3..be083a8812717 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/splice.h>
> >  #include <linux/falloc.h>
> >  #include <linux/fcntl.h>
> > +#include <linux/math.h>
> >  #include <linux/namei.h>
> >  #include <linux/delay.h>
> >  #include <linux/fsnotify.h>
> > @@ -1073,6 +1074,116 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> >  }
> >  
> > +struct nfsd_read_dio {
> > +	loff_t start;
> > +	loff_t end;
> > +	unsigned long start_extra;
> > +	unsigned long end_extra;
> > +	struct page *start_extra_page;
> > +};
> > +
> > +static void init_nfsd_read_dio(struct nfsd_read_dio *read_dio)
> > +{
> > +	memset(read_dio, 0, sizeof(*read_dio));
> > +	read_dio->start_extra_page = NULL;
> > +}
> > +
> > +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > +				  struct nfsd_file *nf, loff_t offset,
> > +				  unsigned long len, unsigned int base,
> > +				  struct nfsd_read_dio *read_dio)
> > +{
> > +	const u32 dio_blocksize = nf->nf_dio_read_offset_align;
> > +	loff_t middle_end, orig_end = offset + len;
> > +
> > +	if (WARN_ONCE(!nf->nf_dio_mem_align || !nf->nf_dio_read_offset_align,
> > +		      "%s: underlying filesystem has not provided DIO alignment info\n",
> > +		      __func__))
> > +		return false;
> > +	if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> > +		      "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> > +		      __func__, dio_blocksize, PAGE_SIZE))
> > +		return false;
> > +
> > +	/* Return early if IO is irreparably misaligned
> > +	 * (len < PAGE_SIZE, or base not aligned).
> > +	 */
> > +	if (unlikely(len < dio_blocksize) ||
> > +	    ((base & (nf->nf_dio_mem_align-1)) != 0))
> > +		return false;
> > +
> > +	read_dio->start = round_down(offset, dio_blocksize);
> > +	read_dio->end = round_up(orig_end, dio_blocksize);
> > +	read_dio->start_extra = offset - read_dio->start;
> > +	read_dio->end_extra = read_dio->end - orig_end;
> > +
> > +	/* don't expand READ for IO less than 32K */
> 
> The code already says "don't expand READ for IO less than 32K". The
> comment needs to explain why. Move the rationale from the patch
> description to this comment, maybe?
> 
> 
> > +	if ((read_dio->start_extra || read_dio->end_extra) && (len < (32 << 10))) {
> > +		init_nfsd_read_dio(read_dio);
> > +		return false;
> > +	}
> 
> Nit: Let's replace the raw integer with a symbolic constant. But let's
> resist the urge to expose this as a tunable for now ;-)

Ack to both.

> > +
> > +	if (read_dio->start_extra) {
> > +		read_dio->start_extra_page = alloc_page(GFP_KERNEL);
> > +		if (WARN_ONCE(read_dio->start_extra_page == NULL,
> > +			      "%s: Unable to allocate start_extra_page\n", __func__)) {
> > +			init_nfsd_read_dio(read_dio);
> > +			return false;
> > +		}
> > +	}
> > +
> > +	/* Show original offset and count, and how it was expanded for DIO */
> > +	middle_end = read_dio->end - read_dio->end_extra;
> > +	trace_nfsd_analyze_read_dio(rqstp, fhp, offset, len,
> > +				    read_dio->start, read_dio->start_extra,
> > +				    offset, (middle_end - offset),
> > +				    middle_end, read_dio->end_extra);
> > +	return true;
> > +}
> > +
> > +static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
> > +						 struct nfsd_read_dio *read_dio,
> > +						 ssize_t bytes_read,
> > +						 unsigned long bytes_expected,
> > +						 loff_t *offset,
> > +						 unsigned long *rq_bvec_numpages)
> > +{
> > +	ssize_t host_err = bytes_read;
> > +	loff_t v;
> > +
> > +	/* If nfsd_analyze_read_dio() allocated a start_extra_page it must
> > +	 * be removed from rqstp->rq_bvec[] to avoid returning unwanted data.
> > +	 */
> > +	if (read_dio->start_extra_page) {
> > +		__free_page(read_dio->start_extra_page);
> > +		*rq_bvec_numpages -= 1;
> > +		v = *rq_bvec_numpages;
> > +		memmove(rqstp->rq_bvec, rqstp->rq_bvec + 1,
> > +			v * sizeof(struct bio_vec));
> > +	}
> > +	/* Eliminate any end_extra bytes from the last page */
> > +	v = *rq_bvec_numpages;
> > +	rqstp->rq_bvec[v].bv_len -= read_dio->end_extra;
> > +
> > +	if (host_err < 0)
> > +		return host_err;
> > +
> > +	/* nfsd_analyze_read_dio() may have expanded the start and end,
> > +	 * if so adjust returned read size to reflect original extent.
> > +	 */
> > +	*offset += read_dio->start_extra;
> > +	if (likely(host_err >= read_dio->start_extra)) {
> > +		host_err -= read_dio->start_extra;
> > +		if (host_err > bytes_expected)
> > +			host_err = bytes_expected;
> > +	} else {
> > +		/* Short read that didn't read any of requested data */
> > +		host_err = 0;
> > +	}
> > +
> > +	return host_err;
> > +}
> > +
> >  /**
> >   * nfsd_iter_read - Perform a VFS read using an iterator
> >   * @rqstp: RPC transaction context
> > @@ -1094,26 +1205,49 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  		      unsigned int base, u32 *eof)
> >  {
> >  	struct file *file = nf->nf_file;
> > -	unsigned long v, total;
> > +	unsigned long v, total, in_count = *count;
> > +	struct nfsd_read_dio read_dio;
> >  	struct iov_iter iter;
> >  	struct kiocb kiocb;
> > -	ssize_t host_err;
> > +	ssize_t host_err = 0;
> >  	size_t len;
> >  
> > +	init_nfsd_read_dio(&read_dio);
> 
> Initialize only if direct I/O is in use. I think all this new read code
> needs the same treatment as the write path: move the handling of the
> esoteric I/O types out of the hot (buffered) path.

Will try to pull that off, but the read path needs a bit more
branching, etc.  As you mentioned, splice is already the default so
nfsd_iter_read() theoretically afforded some additional lattitude
_but_ I don't disagree with the ideal of being as light as possible.

Took me a solid day to refactor the WRITE side, so this will slip
until next week.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-08-08 18:19 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-07 16:25 [PATCH v5 0/7] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
2025-08-07 16:25 ` [PATCH v5 1/7] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-08-08 11:49   ` Jeff Layton
2025-08-07 16:25 ` [PATCH v5 2/7] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-08-08 11:51   ` Jeff Layton
2025-08-07 16:25 ` [PATCH v5 3/7] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
2025-08-08 12:05   ` Jeff Layton
2025-08-08 12:05   ` Jeff Layton
2025-08-08 17:58   ` Chuck Lever
2025-08-07 16:25 ` [PATCH v5 4/7] NFSD: add io_cache_write " Mike Snitzer
2025-08-08 12:03   ` Jeff Layton
2025-08-08 17:58   ` Chuck Lever
2025-08-08 18:10     ` Mike Snitzer
2025-08-07 16:25 ` [PATCH v5 5/7] NFSD: filecache: only get DIO alignment attrs if NFSD_IO_DIRECT enabled Mike Snitzer
2025-08-08 12:05   ` Jeff Layton
2025-08-08 17:59   ` Chuck Lever
2025-08-08 18:12     ` Mike Snitzer
2025-08-07 16:25 ` [PATCH v5 6/7] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-08-08 12:16   ` Jeff Layton
2025-08-08 17:59   ` Chuck Lever
2025-08-08 18:19     ` Mike Snitzer
2025-08-07 16:25 ` [PATCH v5 7/7] NFSD: issue WRITEs " Mike Snitzer
2025-08-08  2:30   ` Mike Snitzer
2025-08-08 14:21     ` Chuck Lever
2025-08-08 12:20   ` Jeff Layton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).