Linux NFS development
* [PATCH v1 0/3] NFSD direct I/O read
@ 2025-09-09 19:05 Chuck Lever
  2025-09-09 19:05 ` [PATCH v1 1/3] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Chuck Lever @ 2025-09-09 19:05 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

This series replaces patches 1, 2, and 5 in Mike's v9 series:

https://lore.kernel.org/linux-nfs/20250903205121.41380-1-snitzer@kernel.org/T/#t

Changes from Mike's v9:
* The LOC introduced by the feature has been reduced considerably.
* A new trace point in nfsd_file_getattr reports each file's dio
  alignment parameters when the file is opened.
* The direct I/O path has been taken out-of-line so that it may
  continue to be modified and optimized without perturbing the more
  commonly used I/O paths.
* When an exported file system does not implement direct I/O, more
  commonly used modes are employed instead to avoid returning
  EOPNOTSUPP unexpectedly.
* When NFSD_IO_DIRECT is selected, NFS READs of all sizes use direct
  I/O to provide better experimental data about small I/O workloads.

I haven't found any issues with NFSv3, NFSv4.0, or NFSv4.1, tested
over TCP and RDMA with /sys/kernel/debug/nfsd/io_cache_read set to 2.
Trace points confirm that NFSD is using direct I/O.

The goal is to get the experimental read-side direct I/O
implementation merged sooner, as the write side has a few gray areas
that still need discussion and resolution.

Chuck Lever (1):
  NFSD: Implement NFSD_IO_DIRECT for NFS READ

Mike Snitzer (2):
  NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  NFSD: pass nfsd_file to nfsd_iter_read()

 fs/nfsd/debugfs.c       |  2 +
 fs/nfsd/filecache.c     | 34 +++++++++++++++++
 fs/nfsd/filecache.h     |  4 ++
 fs/nfsd/nfs4xdr.c       |  8 ++--
 fs/nfsd/nfsd.h          |  1 +
 fs/nfsd/nfsfh.c         |  4 ++
 fs/nfsd/trace.h         | 28 ++++++++++++++
 fs/nfsd/vfs.c           | 85 +++++++++++++++++++++++++++++++++++++++--
 fs/nfsd/vfs.h           |  2 +-
 include/trace/misc/fs.h | 22 +++++++++++
 10 files changed, 182 insertions(+), 8 deletions(-)

-- 
2.50.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 1/3] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  2025-09-09 19:05 [PATCH v1 0/3] NFSD direct I/O read Chuck Lever
@ 2025-09-09 19:05 ` Chuck Lever
  2025-09-09 23:07   ` NeilBrown
  2025-09-09 19:05 ` [PATCH v1 2/3] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Chuck Lever @ 2025-09-09 19:05 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Mike Snitzer

From: Mike Snitzer <snitzer@kernel.org>

Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get DIO alignment
attributes from the underlying filesystem and store them in the
associated nfsd_file. This is done when the nfsd_file is first
opened for each regular file.
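
For reference, these are the same attributes statx(2) exposes to
userspace. A minimal, illustrative sketch (it assumes uapi headers
new enough to define the newer STATX_DIO_READ_ALIGN mask and field,
alongside the older STATX_DIOALIGN ones):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <fcntl.h>
	#include <sys/stat.h>

	int main(int argc, char **argv)
	{
		struct statx stx;

		if (argc < 2)
			return 1;
		if (statx(AT_FDCWD, argv[1], 0,
			  STATX_DIOALIGN | STATX_DIO_READ_ALIGN, &stx) < 0) {
			perror("statx");
			return 1;
		}
		if (stx.stx_mask & STATX_DIOALIGN)
			printf("mem_align=%u offset_align=%u\n",
			       stx.stx_dio_mem_align, stx.stx_dio_offset_align);
		if (stx.stx_mask & STATX_DIO_READ_ALIGN)
			printf("read_offset_align=%u\n",
			       stx.stx_dio_read_offset_align);
		return 0;
	}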

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/filecache.c     | 34 ++++++++++++++++++++++++++++++++++
 fs/nfsd/filecache.h     |  4 ++++
 fs/nfsd/nfsfh.c         |  4 ++++
 fs/nfsd/trace.h         | 27 +++++++++++++++++++++++++++
 include/trace/misc/fs.h | 22 ++++++++++++++++++++++
 5 files changed, 91 insertions(+)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 75bc48031c07..80ef72ab7ef4 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -231,6 +231,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
 	refcount_set(&nf->nf_ref, 1);
 	nf->nf_may = need;
 	nf->nf_mark = NULL;
+	nf->nf_dio_mem_align = 0;
+	nf->nf_dio_offset_align = 0;
+	nf->nf_dio_read_offset_align = 0;
 	return nf;
 }
 
@@ -1048,6 +1051,35 @@ nfsd_file_is_cached(struct inode *inode)
 	return ret;
 }
 
+static __be32
+nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
+{
+	struct inode *inode = file_inode(nf->nf_file);
+	struct kstat stat;
+	__be32 status;
+
+	/* Currently only need to get DIO alignment info for regular files */
+	if (!S_ISREG(inode->i_mode))
+		return nfs_ok;
+
+	status = fh_getattr(fhp, &stat);
+	if (status != nfs_ok)
+		return status;
+
+	trace_nfsd_file_getattr(inode, &stat);
+
+	if (stat.result_mask & STATX_DIOALIGN) {
+		nf->nf_dio_mem_align = stat.dio_mem_align;
+		nf->nf_dio_offset_align = stat.dio_offset_align;
+	}
+	if (stat.result_mask & STATX_DIO_READ_ALIGN)
+		nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
+	else
+		nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
+
+	return nfs_ok;
+}
+
 static __be32
 nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
 		     struct svc_cred *cred,
@@ -1166,6 +1198,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
 			}
 			status = nfserrno(ret);
 			trace_nfsd_file_open(nf, status);
+			if (status == nfs_ok)
+				status = nfsd_file_getattr(fhp, nf);
 		}
 	} else
 		status = nfserr_jukebox;
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index 24ddf60e8434..e3d6ca2b6030 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -54,6 +54,10 @@ struct nfsd_file {
 	struct list_head	nf_gc;
 	struct rcu_head		nf_rcu;
 	ktime_t			nf_birthtime;
+
+	u32			nf_dio_mem_align;
+	u32			nf_dio_offset_align;
+	u32			nf_dio_read_offset_align;
 };
 
 int nfsd_file_cache_init(void);
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 3edccc38db42..3eb724ec9566 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -696,8 +696,12 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
 		.mnt		= fhp->fh_export->ex_path.mnt,
 		.dentry		= fhp->fh_dentry,
 	};
+	struct inode *inode = d_inode(p.dentry);
 	u32 request_mask = STATX_BASIC_STATS;
 
+	if (S_ISREG(inode->i_mode))
+		request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
+
 	if (fhp->fh_maxsize == NFS4_FHSIZE)
 		request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
 
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index a664fdf1161e..e5af0d058fd0 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -1133,6 +1133,33 @@ TRACE_EVENT(nfsd_file_alloc,
 	)
 );
 
+TRACE_EVENT(nfsd_file_getattr,
+	TP_PROTO(
+		const struct inode *inode,
+		const struct kstat *stat
+	),
+	TP_ARGS(inode, stat),
+	TP_STRUCT__entry(
+		__field(const void *, inode)
+		__field(unsigned long, mask)
+		__field(u32, mem_align)
+		__field(u32, offset_align)
+		__field(u32, read_offset_align)
+	),
+	TP_fast_assign(
+		__entry->inode = inode;
+		__entry->mask = stat->result_mask;
+		__entry->mem_align = stat->dio_mem_align;
+		__entry->offset_align = stat->dio_offset_align;
+		__entry->read_offset_align = stat->dio_read_offset_align;
+	),
+	TP_printk("inode=%p flags=%s mem_align=%u offset_align=%u read_offset_align=%u",
+		__entry->inode, show_statx_mask(__entry->mask),
+		__entry->mem_align, __entry->offset_align,
+		__entry->read_offset_align
+	)
+);
+
 TRACE_EVENT(nfsd_file_acquire,
 	TP_PROTO(
 		const struct svc_rqst *rqstp,
diff --git a/include/trace/misc/fs.h b/include/trace/misc/fs.h
index 0406ebe2a80a..7ead1c61f0cb 100644
--- a/include/trace/misc/fs.h
+++ b/include/trace/misc/fs.h
@@ -141,3 +141,25 @@
 		{ ATTR_TIMES_SET,	"TIMES_SET" },	\
 		{ ATTR_TOUCH,		"TOUCH"},	\
 		{ ATTR_DELEG,		"DELEG"})
+
+#define show_statx_mask(flags)					\
+	__print_flags(flags, "|",				\
+		{ STATX_TYPE,		"TYPE" },		\
+		{ STATX_MODE,		"MODE" },		\
+		{ STATX_NLINK,		"NLINK" },		\
+		{ STATX_UID,		"UID" },		\
+		{ STATX_GID,		"GID" },		\
+		{ STATX_ATIME,		"ATIME" },		\
+		{ STATX_MTIME,		"MTIME" },		\
+		{ STATX_CTIME,		"CTIME" },		\
+		{ STATX_INO,		"INO" },		\
+		{ STATX_SIZE,		"SIZE" },		\
+		{ STATX_BLOCKS,		"BLOCKS" },		\
+		{ STATX_BASIC_STATS,	"BASIC_STATS" },	\
+		{ STATX_BTIME,		"BTIME" },		\
+		{ STATX_MNT_ID,		"MNT_ID" },		\
+		{ STATX_DIOALIGN,	"DIOALIGN" },		\
+		{ STATX_MNT_ID_UNIQUE,	"MNT_ID_UNIQUE" },	\
+		{ STATX_SUBVOL,		"SUBVOL" },		\
+		{ STATX_WRITE_ATOMIC,	"WRITE_ATOMIC" },	\
+		{ STATX_DIO_READ_ALIGN,	"DIO_READ_ALIGN" })
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v1 2/3] NFSD: pass nfsd_file to nfsd_iter_read()
  2025-09-09 19:05 [PATCH v1 0/3] NFSD direct I/O read Chuck Lever
  2025-09-09 19:05 ` [PATCH v1 1/3] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
@ 2025-09-09 19:05 ` Chuck Lever
  2025-09-09 23:20   ` NeilBrown
  2025-09-09 19:05 ` [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
  2025-09-09 23:33 ` [PATCH 0/2] NFSD: continuation of NFSD DIRECT Mike Snitzer
  3 siblings, 1 reply; 23+ messages in thread
From: Chuck Lever @ 2025-09-09 19:05 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Mike Snitzer

From: Mike Snitzer <snitzer@kernel.org>

Prepares for nfsd_iter_read() to use DIO alignment stored in nfsd_file.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs4xdr.c | 8 ++++----
 fs/nfsd/vfs.c     | 7 ++++---
 fs/nfsd/vfs.h     | 2 +-
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index c0a3c6a7c8bb..cd3251340b5c 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4465,7 +4465,7 @@ static __be32 nfsd4_encode_splice_read(
 
 static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
 				 struct nfsd4_read *read,
-				 struct file *file, unsigned long maxcount)
+				 unsigned long maxcount)
 {
 	struct xdr_stream *xdr = resp->xdr;
 	unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
@@ -4476,7 +4476,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
 	if (xdr_reserve_space_vec(xdr, maxcount) < 0)
 		return nfserr_resource;
 
-	nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
+	nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
 				read->rd_offset, &maxcount, base,
 				&read->rd_eof);
 	read->rd_length = maxcount;
@@ -4523,7 +4523,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
 	if (file->f_op->splice_read && splice_ok)
 		nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
 	else
-		nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+		nfserr = nfsd4_encode_readv(resp, read, maxcount);
 	if (nfserr) {
 		xdr_truncate_encode(xdr, eof_offset);
 		return nfserr;
@@ -5419,7 +5419,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
 	if (file->f_op->splice_read && splice_ok)
 		nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
 	else
-		nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+		nfserr = nfsd4_encode_readv(resp, read, maxcount);
 	if (nfserr)
 		return nfserr;
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 714777c221ed..441267d877f9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1078,7 +1078,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  * nfsd_iter_read - Perform a VFS read using an iterator
  * @rqstp: RPC transaction context
  * @fhp: file handle of file to be read
- * @file: opened struct file of file to be read
+ * @nf: opened struct nfsd_file of file to be read
  * @offset: starting byte offset
  * @count: IN: requested number of bytes; OUT: number of bytes read
  * @base: offset in first page of read buffer
@@ -1091,9 +1091,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  * returned.
  */
 __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
-		      struct file *file, loff_t offset, unsigned long *count,
+		      struct nfsd_file *nf, loff_t offset, unsigned long *count,
 		      unsigned int base, u32 *eof)
 {
+	struct file *file = nf->nf_file;
 	unsigned long v, total;
 	struct iov_iter iter;
 	struct kiocb kiocb;
@@ -1334,7 +1335,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
 		err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
 	else
-		err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
+		err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
 
 	nfsd_file_put(nf);
 	trace_nfsd_read_done(rqstp, fhp, offset, *count);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 0c0292611c6d..fa46f8b5f132 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -121,7 +121,7 @@ __be32		nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				unsigned long *count,
 				u32 *eof);
 __be32		nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
-				struct file *file, loff_t offset,
+				struct nfsd_file *nf, loff_t offset,
 				unsigned long *count, unsigned int base,
 				u32 *eof);
 bool		nfsd_read_splice_ok(struct svc_rqst *rqstp);
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
  2025-09-09 19:05 [PATCH v1 0/3] NFSD direct I/O read Chuck Lever
  2025-09-09 19:05 ` [PATCH v1 1/3] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
  2025-09-09 19:05 ` [PATCH v1 2/3] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
@ 2025-09-09 19:05 ` Chuck Lever
  2025-09-09 23:16   ` Mike Snitzer
                     ` (2 more replies)
  2025-09-09 23:33 ` [PATCH 0/2] NFSD: continuation of NFSD DIRECT Mike Snitzer
  3 siblings, 3 replies; 23+ messages in thread
From: Chuck Lever @ 2025-09-09 19:05 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Mike Snitzer

From: Chuck Lever <chuck.lever@oracle.com>

Add an experimental option that forces NFS READ operations to use
direct I/O instead of reading through the NFS server's page cache.

There are already other layers of caching:
 - The page cache on NFS clients
 - The block device underlying the exported file system

The server's page cache, in many cases, is unlikely to provide
additional benefit. Some benchmarks have demonstrated that the
server's page cache is actively detrimental for workloads whose
working set is larger than the server's available physical memory.

For instance, on small NFS servers, cached NFS file content can
squeeze out local memory consumers. For large sequential workloads,
an enormous amount of data flows into and out of the page cache
and is consumed by NFS clients exactly once -- caching that data
is expensive to do and totally valueless.

For now this is a hidden option that can be enabled on test
systems for benchmarking. In the longer term, this option might
be enabled persistently or per-export. When the exported file
system does not support direct I/O, NFSD falls back to using
either DONTCACHE or buffered I/O to fulfill NFS READ requests.
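
To illustrate the mechanism with hypothetical numbers: assuming
dio_read_offset_align is 512, a READ of 3000 bytes at offset 1000 is
expanded with the kernel's round_down()/round_up() helpers like so
(a sketch only; the real code is in nfsd_direct_read() below):

	loff_t dio_start = round_down(1000, 512);	/* 512  */
	loff_t dio_end   = round_up(1000 + 3000, 512);	/* 4096 */
	unsigned int pad = 1000 - dio_start;		/* 488  */

	/* NFSD reads [512, 4096) with IOCB_DIRECT, then returns only
	 * the 3000 requested bytes, which begin at byte 488 (the pad)
	 * of the read buffer.
	 */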

Suggested-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/debugfs.c |  2 ++
 fs/nfsd/nfsd.h    |  1 +
 fs/nfsd/trace.h   |  1 +
 fs/nfsd/vfs.c     | 78 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 82 insertions(+)

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index ed2b9e066206..00eb1ecef6ac 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
  * Contents:
  *   %0: NFS READ will use buffered IO
  *   %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
+ *   %2: NFS READ will use direct IO
  *
  * This setting takes immediate effect for all NFS versions,
  * all exports, and in all NFSD net namespaces.
@@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
 		nfsd_io_cache_read = NFSD_IO_BUFFERED;
 		break;
 	case NFSD_IO_DONTCACHE:
+	case NFSD_IO_DIRECT:
 		/*
 		 * Must disable splice_read when enabling
 		 * NFSD_IO_DONTCACHE.
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index ea87b42894dd..bdb60ee1f1a4 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -157,6 +157,7 @@ enum {
 	/* Any new NFSD_IO enum value must be added at the end */
 	NFSD_IO_BUFFERED,
 	NFSD_IO_DONTCACHE,
+	NFSD_IO_DIRECT,
 };
 
 extern u64 nfsd_io_cache_read __read_mostly;
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index e5af0d058fd0..88901df5fbb1 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name,	\
 DEFINE_NFSD_IO_EVENT(read_start);
 DEFINE_NFSD_IO_EVENT(read_splice);
 DEFINE_NFSD_IO_EVENT(read_vector);
+DEFINE_NFSD_IO_EVENT(read_direct);
 DEFINE_NFSD_IO_EVENT(read_io_done);
 DEFINE_NFSD_IO_EVENT(read_done);
 DEFINE_NFSD_IO_EVENT(write_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 441267d877f9..9665454743eb 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1074,6 +1074,79 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
 }
 
+/*
+ * The byte range of the client's READ request is expanded on both
+ * ends until it meets the underlying file system's direct I/O
+ * alignment requirements. After the internal read is complete, the
+ * byte range of the NFS READ payload is reduced to the byte range
+ * that was originally requested.
+ *
+ * Note that a direct read can be done only when the xdr_buf
+ * containing the NFS READ reply does not already have contents in
+ * its .pages array. This is due to potentially restrictive
+ * alignment requirements on the read buffer. When .page_len and
+ * @base are zero, the .pages array is guaranteed to be page-
+ * aligned.
+ */
+static noinline_for_stack __be32
+nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		 struct nfsd_file *nf, loff_t offset, unsigned long *count,
+		 u32 *eof)
+{
+	loff_t dio_start, dio_end;
+	unsigned long v, total;
+	struct iov_iter iter;
+	struct kiocb kiocb;
+	ssize_t host_err;
+	size_t len;
+
+	init_sync_kiocb(&kiocb, nf->nf_file);
+	kiocb.ki_flags |= IOCB_DIRECT;
+
+	/* Read a properly-aligned region of bytes into rq_bvec */
+	dio_start = round_down(offset, nf->nf_dio_read_offset_align);
+	dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
+
+	kiocb.ki_pos = dio_start;
+
+	v = 0;
+	total = dio_end - dio_start;
+	while (total) {
+		len = min_t(size_t, total, PAGE_SIZE);
+		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
+			      len, 0);
+		total -= len;
+		++v;
+	}
+	WARN_ON_ONCE(v > rqstp->rq_maxpages);
+
+	trace_nfsd_read_direct(rqstp, fhp, offset, *count);
+	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, dio_end - dio_start);
+
+	host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
+	if (host_err >= 0) {
+		unsigned int pad = offset - dio_start;
+
+		/* The returned payload starts after the pad */
+		rqstp->rq_res.page_base = pad;
+
+		/* Compute the count of bytes to be returned */
+		if (host_err > pad + *count) {
+			host_err = *count;
+		} else if (host_err > pad) {
+			host_err -= pad;
+		} else {
+			host_err = 0;
+		}
+	} else if (unlikely(host_err == -EINVAL)) {
+		pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
+		host_err = -ESERVERFAULT;
+	}
+
+	return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
+				eof, host_err);
+}
+
 /**
  * nfsd_iter_read - Perform a VFS read using an iterator
  * @rqstp: RPC transaction context
@@ -1106,6 +1179,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	switch (nfsd_io_cache_read) {
 	case NFSD_IO_BUFFERED:
 		break;
+	case NFSD_IO_DIRECT:
+		if (nf->nf_dio_read_offset_align && !base)
+			return nfsd_direct_read(rqstp, fhp, nf, offset,
+						count, eof);
+		fallthrough;
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
 			kiocb.ki_flags = IOCB_DONTCACHE;
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 1/3] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  2025-09-09 19:05 ` [PATCH v1 1/3] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Chuck Lever
@ 2025-09-09 23:07   ` NeilBrown
  0 siblings, 0 replies; 23+ messages in thread
From: NeilBrown @ 2025-09-09 23:07 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Mike Snitzer

On Wed, 10 Sep 2025, Chuck Lever wrote:

> +static __be32
> +nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)

<bikeshed>
 should this function have a name which reflects the fact that it is
 getting the dio attrs, not all attrs?
 e.g. nfsd_file_get_dio_attr()
 ....

> @@ -1166,6 +1198,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
>  			}
>  			status = nfserrno(ret);
>  			trace_nfsd_file_open(nf, status);
> +			if (status == nfs_ok)
> +				status = nfsd_file_getattr(fhp, nf);
>  		}

I ask because this stanza looks strange. "Why are the file attrs being
got here?" I ask.  Oh, it is really getting the dio attrs - that makes
sense.

</bikeshed>

Thanks,
NeilBrown


>  	} else
>  		status = nfserr_jukebox;
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index 24ddf60e8434..e3d6ca2b6030 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -54,6 +54,10 @@ struct nfsd_file {
>  	struct list_head	nf_gc;
>  	struct rcu_head		nf_rcu;
>  	ktime_t			nf_birthtime;
> +
> +	u32			nf_dio_mem_align;
> +	u32			nf_dio_offset_align;
> +	u32			nf_dio_read_offset_align;
>  };
>  
>  int nfsd_file_cache_init(void);
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index 3edccc38db42..3eb724ec9566 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -696,8 +696,12 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
>  		.mnt		= fhp->fh_export->ex_path.mnt,
>  		.dentry		= fhp->fh_dentry,
>  	};
> +	struct inode *inode = d_inode(p.dentry);
>  	u32 request_mask = STATX_BASIC_STATS;
>  
> +	if (S_ISREG(inode->i_mode))
> +		request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
> +
>  	if (fhp->fh_maxsize == NFS4_FHSIZE)
>  		request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
>  
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index a664fdf1161e..e5af0d058fd0 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -1133,6 +1133,33 @@ TRACE_EVENT(nfsd_file_alloc,
>  	)
>  );
>  
> +TRACE_EVENT(nfsd_file_getattr,
> +	TP_PROTO(
> +		const struct inode *inode,
> +		const struct kstat *stat
> +	),
> +	TP_ARGS(inode, stat),
> +	TP_STRUCT__entry(
> +		__field(const void *, inode)
> +		__field(unsigned long, mask)
> +		__field(u32, mem_align)
> +		__field(u32, offset_align)
> +		__field(u32, read_offset_align)
> +	),
> +	TP_fast_assign(
> +		__entry->inode = inode;
> +		__entry->mask = stat->result_mask;
> +		__entry->mem_align = stat->dio_mem_align;
> +		__entry->offset_align = stat->dio_offset_align;
> +		__entry->read_offset_align = stat->dio_read_offset_align;
> +	),
> +	TP_printk("inode=%p flags=%s mem_align=%u offset_align=%u read_offset_align=%u",
> +		__entry->inode, show_statx_mask(__entry->mask),
> +		__entry->mem_align, __entry->offset_align,
> +		__entry->read_offset_align
> +	)
> +);
> +
>  TRACE_EVENT(nfsd_file_acquire,
>  	TP_PROTO(
>  		const struct svc_rqst *rqstp,
> diff --git a/include/trace/misc/fs.h b/include/trace/misc/fs.h
> index 0406ebe2a80a..7ead1c61f0cb 100644
> --- a/include/trace/misc/fs.h
> +++ b/include/trace/misc/fs.h
> @@ -141,3 +141,25 @@
>  		{ ATTR_TIMES_SET,	"TIMES_SET" },	\
>  		{ ATTR_TOUCH,		"TOUCH"},	\
>  		{ ATTR_DELEG,		"DELEG"})
> +
> +#define show_statx_mask(flags)					\
> +	__print_flags(flags, "|",				\
> +		{ STATX_TYPE,		"TYPE" },		\
> +		{ STATX_MODE,		"MODE" },		\
> +		{ STATX_NLINK,		"NLINK" },		\
> +		{ STATX_UID,		"UID" },		\
> +		{ STATX_GID,		"GID" },		\
> +		{ STATX_ATIME,		"ATIME" },		\
> +		{ STATX_MTIME,		"MTIME" },		\
> +		{ STATX_CTIME,		"CTIME" },		\
> +		{ STATX_INO,		"INO" },		\
> +		{ STATX_SIZE,		"SIZE" },		\
> +		{ STATX_BLOCKS,		"BLOCKS" },		\
> +		{ STATX_BASIC_STATS,	"BASIC_STATS" },	\
> +		{ STATX_BTIME,		"BTIME" },		\
> +		{ STATX_MNT_ID,		"MNT_ID" },		\
> +		{ STATX_DIOALIGN,	"DIOALIGN" },		\
> +		{ STATX_MNT_ID_UNIQUE,	"MNT_ID_UNIQUE" },	\
> +		{ STATX_SUBVOL,		"SUBVOL" },		\
> +		{ STATX_WRITE_ATOMIC,	"WRITE_ATOMIC" },	\
> +		{ STATX_DIO_READ_ALIGN,	"DIO_READ_ALIGN" })
> -- 
> 2.50.0
> 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
  2025-09-09 19:05 ` [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
@ 2025-09-09 23:16   ` Mike Snitzer
  2025-09-09 23:37   ` NeilBrown
  2025-09-10 11:37   ` Jeff Layton
  2 siblings, 0 replies; 23+ messages in thread
From: Mike Snitzer @ 2025-09-09 23:16 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On Tue, Sep 09, 2025 at 03:05:25PM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Add an experimental option that forces NFS READ operations to use
> direct I/O instead of reading through the NFS server's page cache.
> 
> There are already other layers of caching:
>  - The page cache on NFS clients
>  - The block device underlying the exported file system
> 
> The server's page cache, in many cases, is unlikely to provide
> additional benefit. Some benchmarks have demonstrated that the
> server's page cache is actively detrimental for workloads whose
> working set is larger than the server's available physical memory.
> 
> For instance, on small NFS servers, cached NFS file content can
> squeeze out local memory consumers. For large sequential workloads,
> an enormous amount of data flows into and out of the page cache
> and is consumed by NFS clients exactly once -- caching that data
> is expensive to do and totally valueless.
> 
> For now this is a hidden option that can be enabled on test
> systems for benchmarking. In the longer term, this option might
> be enabled persistently or per-export. When the exported file
> system does not support direct I/O, NFSD falls back to using
> either DONTCACHE or buffered I/O to fulfill NFS READ requests.
> 
> Suggested-by: Mike Snitzer <snitzer@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Reviewed-by: Mike Snitzer <snitzer@kernel.org>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 2/3] NFSD: pass nfsd_file to nfsd_iter_read()
  2025-09-09 19:05 ` [PATCH v1 2/3] NFSD: pass nfsd_file to nfsd_iter_read() Chuck Lever
@ 2025-09-09 23:20   ` NeilBrown
  0 siblings, 0 replies; 23+ messages in thread
From: NeilBrown @ 2025-09-09 23:20 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Mike Snitzer

On Wed, 10 Sep 2025, Chuck Lever wrote:
> From: Mike Snitzer <snitzer@kernel.org>
> 
> Prepares for nfsd_iter_read() to use DIO alignment stored in nfsd_file.

This is a particularly unhelpful commit description.  It would be
perfectly good as an introduction to a description which actually
gives some details, which could then be compared to the code.

   Prepare for nfsd_iter_read() to use the DIO alignment stored in
   nfsd_file by passing the nfsd_file to nfsd_iter_read() rather than
   just the file which is associated with the nfsd_file.

   This means nfsd4_encode_readv() now also needs the nfsd_file rather
   than the file.  Instead of changing the file arg to be the nfsd_file,
   we discard the file arg as the nfsd_file (and indeed the file) is
   already available via the "read" argument.


Now when I read the patch it makes perfect sense.

Reviewed-by: NeilBrown <neil@brown.name>

Thanks,
NeilBrown

> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/nfs4xdr.c | 8 ++++----
>  fs/nfsd/vfs.c     | 7 ++++---
>  fs/nfsd/vfs.h     | 2 +-
>  3 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index c0a3c6a7c8bb..cd3251340b5c 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -4465,7 +4465,7 @@ static __be32 nfsd4_encode_splice_read(
>  
>  static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
>  				 struct nfsd4_read *read,
> -				 struct file *file, unsigned long maxcount)
> +				 unsigned long maxcount)
>  {
>  	struct xdr_stream *xdr = resp->xdr;
>  	unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
> @@ -4476,7 +4476,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
>  	if (xdr_reserve_space_vec(xdr, maxcount) < 0)
>  		return nfserr_resource;
>  
> -	nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
> +	nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
>  				read->rd_offset, &maxcount, base,
>  				&read->rd_eof);
>  	read->rd_length = maxcount;
> @@ -4523,7 +4523,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
>  	if (file->f_op->splice_read && splice_ok)
>  		nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
>  	else
> -		nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
> +		nfserr = nfsd4_encode_readv(resp, read, maxcount);
>  	if (nfserr) {
>  		xdr_truncate_encode(xdr, eof_offset);
>  		return nfserr;
> @@ -5419,7 +5419,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
>  	if (file->f_op->splice_read && splice_ok)
>  		nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
>  	else
> -		nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
> +		nfserr = nfsd4_encode_readv(resp, read, maxcount);
>  	if (nfserr)
>  		return nfserr;
>  
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 714777c221ed..441267d877f9 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1078,7 +1078,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>   * nfsd_iter_read - Perform a VFS read using an iterator
>   * @rqstp: RPC transaction context
>   * @fhp: file handle of file to be read
> - * @file: opened struct file of file to be read
> + * @nf: opened struct nfsd_file of file to be read
>   * @offset: starting byte offset
>   * @count: IN: requested number of bytes; OUT: number of bytes read
>   * @base: offset in first page of read buffer
> @@ -1091,9 +1091,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>   * returned.
>   */
>  __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> -		      struct file *file, loff_t offset, unsigned long *count,
> +		      struct nfsd_file *nf, loff_t offset, unsigned long *count,
>  		      unsigned int base, u32 *eof)
>  {
> +	struct file *file = nf->nf_file;
>  	unsigned long v, total;
>  	struct iov_iter iter;
>  	struct kiocb kiocb;
> @@ -1334,7 +1335,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
>  		err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
>  	else
> -		err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
> +		err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
>  
>  	nfsd_file_put(nf);
>  	trace_nfsd_read_done(rqstp, fhp, offset, *count);
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index 0c0292611c6d..fa46f8b5f132 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -121,7 +121,7 @@ __be32		nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  				unsigned long *count,
>  				u32 *eof);
>  __be32		nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> -				struct file *file, loff_t offset,
> +				struct nfsd_file *nf, loff_t offset,
>  				unsigned long *count, unsigned int base,
>  				u32 *eof);
>  bool		nfsd_read_splice_ok(struct svc_rqst *rqstp);
> -- 
> 2.50.0
> 
> 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 0/2] NFSD: continuation of NFSD DIRECT
  2025-09-09 19:05 [PATCH v1 0/3] NFSD direct I/O read Chuck Lever
                   ` (2 preceding siblings ...)
  2025-09-09 19:05 ` [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
@ 2025-09-09 23:33 ` Mike Snitzer
  2025-09-09 23:33   ` [PATCH 1/2] sunrpc: add an extra reserve page to svc_serv_maxpages() Mike Snitzer
  2025-09-09 23:33   ` [PATCH 2/2] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Mike Snitzer
  3 siblings, 2 replies; 23+ messages in thread
From: Mike Snitzer @ 2025-09-09 23:33 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs

Hi Chuck,

These patches build ontop of your recent series, here:
https://lore.kernel.org/linux-nfs/20250909190525.7214-1-cel@kernel.org/

The first sunrpc patch is a carry-over from earlier iterations of my
NFSD DIRECT patchsets. I think it is needed to be sure we have
adequate pages to handle expanding a misaligned READ at the max
payload. But maybe not?

The 2nd "NFSD: Implement NFSD_IO_DIRECT for NFS WRITE" patch is a
refactored version of my earlier NFS WRITE support.  I tried to honor
the types of changes I saw in your rewrite of the NFSD_IO_DIRECT NFS
READ support. So I avoided needless maths and variables that were only
for the benefit of excessive tracepoint complexity (which I also
removed), while also establishing cleaner code that you can iterate on
further if you think something else needed. LOC isn't much smaller
but maybe you or others will see something that can be elided -- the
good news is this NFSD DIRECT WRITE code stands on its own like the
NFSD DIRECT READ does.

I'm not sure which NFSD DIRECT WRITE gray areas you think we need to
work through further in NFSD (though I could imagine there being room
for common Linux VFS code or other FS-specific improvements to make
mixing buffered and direct I/O safer). I know others have repeatedly
raised concerns, but in practice I haven't had any issues with this
NFSD DIRECT WRITE support. So I welcome further discussion on what
else is needed.

Thanks,
Mike

Mike Snitzer (2):
  sunrpc: add an extra reserve page to svc_serv_maxpages()
  NFSD: Implement NFSD_IO_DIRECT for NFS WRITE

 fs/nfsd/debugfs.c          |   1 +
 fs/nfsd/trace.h            |   1 +
 fs/nfsd/vfs.c              | 215 ++++++++++++++++++++++++++++++++++++-
 include/linux/sunrpc/svc.h |   5 +-
 4 files changed, 216 insertions(+), 6 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/2] sunrpc: add an extra reserve page to svc_serv_maxpages()
  2025-09-09 23:33 ` [PATCH 0/2] NFSD: continuation of NFSD DIRECT Mike Snitzer
@ 2025-09-09 23:33   ` Mike Snitzer
  2025-09-10 14:29     ` Chuck Lever
  2025-09-09 23:33   ` [PATCH 2/2] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Mike Snitzer
  1 sibling, 1 reply; 23+ messages in thread
From: Mike Snitzer @ 2025-09-09 23:33 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs

nfsd_iter_read() might need two extra pages when a READ payload is not
DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor() are
mutually exclusive, so the page already reserved for
nfsd_splice_actor() can be reused.
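
For example, assuming sv_max_mesg is 1 MiB on a system with 4 KiB
pages, svc_serv_maxpages() now returns DIV_ROUND_UP(1048576, 4096) +
2 + 1 + 1 = 260 pages; the final "+ 1" is the reserve page this patch
adds.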

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 include/linux/sunrpc/svc.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index e64ab444e0a7f..190c2667500e2 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
  * pages, one for the request, and one for the reply.
  * nfsd_splice_actor() might need an extra page when a READ payload
  * is not page-aligned.
+ * nfsd_iter_read() might need two extra pages when a READ payload
+ * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
+ * are mutually exclusive (so reuse the page reserved for nfsd_splice_actor).
  */
 static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
 {
-	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
+	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
 }
 
 /*
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/2] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-09-09 23:33 ` [PATCH 0/2] NFSD: continuation of NFSD DIRECT Mike Snitzer
  2025-09-09 23:33   ` [PATCH 1/2] sunrpc: add an extra reserve page to svc_serv_maxpages() Mike Snitzer
@ 2025-09-09 23:33   ` Mike Snitzer
  2025-10-08 18:59     ` [PATCH v2] " Mike Snitzer
  1 sibling, 1 reply; 23+ messages in thread
From: Mike Snitzer @ 2025-09-09 23:33 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs

If NFSD_IO_DIRECT is used, split any misaligned WRITE into start,
middle and end extents as needed. The large middle extent is
DIO-aligned, and the start and/or end extents are misaligned.
Buffered IO (with a preference for DONTCACHE) is used for the
misaligned extents, and O_DIRECT is used for the middle DIO-aligned
extent.

If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
invalidate the page cache on behalf of the DIO WRITE, then
nfsd_issue_write_dio() will fall back to using buffered IO.
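
To make the split concrete, here is a sketch with hypothetical values
(nf_dio_offset_align = 4096, offset = 1000, len = 10000), mirroring
the arithmetic in nfsd_is_write_dio_possible() below:

	loff_t start_end  = round_up(1000, 4096);	/* 4096  */
	loff_t orig_end   = 1000 + 10000;		/* 11000 */
	loff_t middle_end = round_down(11000, 4096);	/* 8192  */

	/* start  [1000, 4096):  3096 bytes, buffered */
	/* middle [4096, 8192):  4096 bytes, O_DIRECT */
	/* end    [8192, 11000): 2808 bytes, buffered */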

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/debugfs.c |   1 +
 fs/nfsd/trace.h   |   1 +
 fs/nfsd/vfs.c     | 215 ++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 212 insertions(+), 5 deletions(-)

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 00eb1ecef6ac1..7f44689e0a53e 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
 	switch (val) {
 	case NFSD_IO_BUFFERED:
 	case NFSD_IO_DONTCACHE:
+	case NFSD_IO_DIRECT:
 		nfsd_io_cache_write = val;
 		break;
 	default:
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 88901df5fbb11..4e560427f6e1e 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
 DEFINE_NFSD_IO_EVENT(read_done);
 DEFINE_NFSD_IO_EVENT(write_start);
 DEFINE_NFSD_IO_EVENT(write_opened);
+DEFINE_NFSD_IO_EVENT(write_direct);
 DEFINE_NFSD_IO_EVENT(write_io_done);
 DEFINE_NFSD_IO_EVENT(write_done);
 DEFINE_NFSD_IO_EVENT(commit_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 239266c241d2c..e2039feaf95c8 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1246,6 +1246,208 @@ static int wait_for_concurrent_writes(struct file *file)
 	return err;
 }
 
+struct nfsd_write_dio {
+	ssize_t	start_len;	/* Length for misaligned first extent */
+	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
+	ssize_t	end_len;	/* Length for misaligned last extent */
+};
+
+static bool
+nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
+		       struct nfsd_file *nf, struct nfsd_write_dio *write_dio)
+{
+	const u32 dio_blocksize = nf->nf_dio_offset_align;
+	loff_t start_end, orig_end, middle_end;
+
+	if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
+		return false;
+	if (unlikely(dio_blocksize > PAGE_SIZE))
+		return false;
+	if (unlikely(len < dio_blocksize))
+		return false;
+
+	start_end = round_up(offset, dio_blocksize);
+	orig_end = offset + len;
+	middle_end = round_down(orig_end, dio_blocksize);
+
+	write_dio->start_len = start_end - offset;
+	write_dio->middle_len = middle_end - start_end;
+	write_dio->end_len = orig_end - middle_end;
+
+	return true;
+}
+
+static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
+		unsigned int addr_mask, unsigned int len_mask)
+{
+	const struct bio_vec *bvec = i->bvec;
+	size_t skip = i->iov_offset;
+	size_t size = i->count;
+
+	if (size & len_mask)
+		return false;
+	do {
+		size_t len = bvec->bv_len;
+
+		if (len > size)
+			len = size;
+		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
+			return false;
+		bvec++;
+		size -= len;
+		skip = 0;
+	} while (size);
+
+	return true;
+}
+
+/*
+ * Set up as many as 3 iov_iters based on the extents described by @write_dio.
+ * @iterp: pointer to pointer to onstack array of 3 iov_iter structs from caller.
+ * @iter_is_dio_aligned: pointer to onstack array of 3 bools from caller.
+ * @rq_bvec: backing bio_vec used to setup all 3 iov_iter permutations.
+ * @nvecs: number of segments in @rq_bvec
+ * @cnt: size of the request in bytes
+ * @write_dio: nfsd_write_dio struct that describes start, middle and end extents.
+ *
+ * Returns the number of iov_iters that were set up.
+ */
+static int
+nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
+			   struct bio_vec *rq_bvec, unsigned int nvecs,
+			   unsigned long cnt, struct nfsd_write_dio *write_dio,
+			   struct nfsd_file *nf)
+{
+	int n_iters = 0;
+	struct iov_iter *iters = *iterp;
+
+	/* Setup misaligned start? */
+	if (write_dio->start_len) {
+		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+		iters[n_iters].count = write_dio->start_len;
+		iter_is_dio_aligned[n_iters] = false;
+		++n_iters;
+	}
+
+	/* Setup DIO-aligned middle */
+	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+	if (write_dio->start_len)
+		iov_iter_advance(&iters[n_iters], write_dio->start_len);
+	iters[n_iters].count -= write_dio->end_len;
+	iter_is_dio_aligned[n_iters] =
+		nfsd_iov_iter_aligned_bvec(&iters[n_iters],
+				nf->nf_dio_mem_align-1, nf->nf_dio_offset_align-1);
+	if (unlikely(!iter_is_dio_aligned[n_iters]))
+		return 0; /* no DIO-aligned IO possible */
+	++n_iters;
+
+	/* Setup misaligned end? */
+	if (write_dio->end_len) {
+		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+		iov_iter_advance(&iters[n_iters],
+				 write_dio->start_len + write_dio->middle_len);
+		iter_is_dio_aligned[n_iters] = false;
+		++n_iters;
+	}
+
+	return n_iters;
+}
+
+static int
+nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
+		    unsigned int nvecs, unsigned long *cnt,
+		    struct kiocb *kiocb)
+{
+	struct iov_iter iter;
+	int host_err;
+
+	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
+	if (host_err < 0)
+		return host_err;
+	*cnt = host_err;
+
+	return 0;
+}
+
+static int
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct nfsd_file *nf,
+		     loff_t offset, unsigned int nvecs,
+		     unsigned long *cnt, struct kiocb *kiocb,
+		     struct nfsd_write_dio *write_dio)
+{
+	struct file *file = nf->nf_file;
+	bool iter_is_dio_aligned[3];
+	struct iov_iter iter_stack[3];
+	struct iov_iter *iter = iter_stack;
+	unsigned int n_iters = 0;
+	unsigned long in_count = *cnt;
+	loff_t in_offset = kiocb->ki_pos;
+	ssize_t host_err;
+
+	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
+				rqstp->rq_bvec, nvecs, *cnt, write_dio, nf);
+	if (unlikely(!n_iters))
+		return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
+
+	*cnt = 0;
+	for (int i = 0; i < n_iters; i++) {
+		if (iter_is_dio_aligned[i])
+			kiocb->ki_flags |= IOCB_DIRECT;
+		else
+			kiocb->ki_flags &= ~IOCB_DIRECT;
+
+		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
+		if (host_err < 0) {
+			/* VFS will return -ENOTBLK if DIO WRITE fails to
+			 * invalidate the page cache. Retry using buffered IO.
+			 */
+			if (unlikely(host_err == -ENOTBLK)) {
+				kiocb->ki_flags &= ~IOCB_DIRECT;
+				*cnt = in_count;
+				kiocb->ki_pos = in_offset;
+				return nfsd_buffered_write(rqstp, file,
+							   nvecs, cnt, kiocb);
+			} else if (unlikely(host_err == -EINVAL)) {
+				pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
+				host_err = -ESERVERFAULT;
+			}
+			return host_err;
+		}
+		*cnt += host_err;
+		if (host_err < iter[i].count) /* partial write? */
+			break;
+	}
+
+	return 0;
+}
+
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  struct nfsd_file *nf, loff_t offset, unsigned int nvecs,
+		  unsigned long *cnt, struct kiocb *kiocb)
+{
+	struct nfsd_write_dio write_dio;
+
+	/* Any buffered IO issued here will be misaligned; use
+	 * IOCB_SYNC to ensure it has completed before returning.
+	 */
+	kiocb->ki_flags |= IOCB_SYNC;
+	/* Check if IOCB_DONTCACHE should be used when issuing buffered IO;
+	 * if so, it will be ignored for any DIO issued here.
+	 */
+	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+		kiocb->ki_flags |= IOCB_DONTCACHE;
+
+	if (nfsd_is_write_dio_possible(offset, *cnt, nf, &write_dio)) {
+		trace_nfsd_write_direct(rqstp, fhp, offset, *cnt);
+		return nfsd_issue_write_dio(rqstp, nf, offset, nvecs,
+					    cnt, kiocb, &write_dio);
+	}
+
+	return nfsd_buffered_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
+}
+
 /**
  * nfsd_vfs_write - write data to an already-open file
  * @rqstp: RPC execution context
@@ -1273,7 +1475,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	struct super_block	*sb = file_inode(file)->i_sb;
 	struct kiocb		kiocb;
 	struct svc_export	*exp;
-	struct iov_iter		iter;
 	errseq_t		since;
 	__be32			nfserr;
 	int			host_err;
@@ -1310,25 +1511,29 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		kiocb.ki_flags |= IOCB_DSYNC;
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
-	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
 	since = READ_ONCE(file->f_wb_err);
 	if (verf)
 		nfsd_copy_write_verifier(verf, nn);
 
 	switch (nfsd_io_cache_write) {
-	case NFSD_IO_BUFFERED:
+	case NFSD_IO_DIRECT:
+		host_err = nfsd_direct_write(rqstp, fhp, nf, offset,
+					     nvecs, cnt, &kiocb);
 		break;
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
 			kiocb.ki_flags |= IOCB_DONTCACHE;
+		fallthrough; /* must call nfsd_buffered_write */
+	case NFSD_IO_BUFFERED:
+		host_err = nfsd_buffered_write(rqstp, file,
+					       nvecs, cnt, &kiocb);
 		break;
 	}
-	host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
 	if (host_err < 0) {
 		commit_reset_write_verifier(nn, rqstp, host_err);
 		goto out_nfserr;
 	}
-	*cnt = host_err;
 	nfsd_stats_io_write_add(nn, exp, *cnt);
 	fsnotify_modify(file);
 	host_err = filemap_check_wb_err(file->f_mapping, since);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
  2025-09-09 19:05 ` [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
  2025-09-09 23:16   ` Mike Snitzer
@ 2025-09-09 23:37   ` NeilBrown
  2025-09-09 23:39     ` Chuck Lever
  2025-09-09 23:56     ` Mike Snitzer
  2025-09-10 11:37   ` Jeff Layton
  2 siblings, 2 replies; 23+ messages in thread
From: NeilBrown @ 2025-09-09 23:37 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Chuck Lever, Mike Snitzer

On Wed, 10 Sep 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Add an experimental option that forces NFS READ operations to use
> direct I/O instead of reading through the NFS server's page cache.
> 
> There are already other layers of caching:
>  - The page cache on NFS clients
>  - The block device underlying the exported file system
> 
> The server's page cache, in many cases, is unlikely to provide
> additional benefit. Some benchmarks have demonstrated that the
> server's page cache is actively detrimental for workloads whose
> working set is larger than the server's available physical memory.
> 
> For instance, on small NFS servers, cached NFS file content can
> squeeze out local memory consumers. For large sequential workloads,
> an enormous amount of data flows into and out of the page cache
> and is consumed by NFS clients exactly once -- caching that data
> is expensive to do and totally valueless.
> 
> For now this is a hidden option that can be enabled on test
> systems for benchmarking. In the longer term, this option might
> be enabled persistently or per-export. When the exported file
> system does not support direct I/O, NFSD falls back to using
> either DONTCACHE or buffered I/O to fulfill NFS READ requests.
> 
> Suggested-by: Mike Snitzer <snitzer@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/debugfs.c |  2 ++
>  fs/nfsd/nfsd.h    |  1 +
>  fs/nfsd/trace.h   |  1 +
>  fs/nfsd/vfs.c     | 78 +++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 82 insertions(+)
> 
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index ed2b9e066206..00eb1ecef6ac 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
>   * Contents:
>   *   %0: NFS READ will use buffered IO
>   *   %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> + *   %2: NFS READ will use direct IO
>   *
>   * This setting takes immediate effect for all NFS versions,
>   * all exports, and in all NFSD net namespaces.
> @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
>  		nfsd_io_cache_read = NFSD_IO_BUFFERED;
>  		break;
>  	case NFSD_IO_DONTCACHE:
> +	case NFSD_IO_DIRECT:
>  		/*
>  		 * Must disable splice_read when enabling
>  		 * NFSD_IO_DONTCACHE.
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index ea87b42894dd..bdb60ee1f1a4 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -157,6 +157,7 @@ enum {
>  	/* Any new NFSD_IO enum value must be added at the end */
>  	NFSD_IO_BUFFERED,
>  	NFSD_IO_DONTCACHE,
> +	NFSD_IO_DIRECT,
>  };
>  
>  extern u64 nfsd_io_cache_read __read_mostly;
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index e5af0d058fd0..88901df5fbb1 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name,	\
>  DEFINE_NFSD_IO_EVENT(read_start);
>  DEFINE_NFSD_IO_EVENT(read_splice);
>  DEFINE_NFSD_IO_EVENT(read_vector);
> +DEFINE_NFSD_IO_EVENT(read_direct);
>  DEFINE_NFSD_IO_EVENT(read_io_done);
>  DEFINE_NFSD_IO_EVENT(read_done);
>  DEFINE_NFSD_IO_EVENT(write_start);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 441267d877f9..9665454743eb 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1074,6 +1074,79 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>  }
>  
> +/*
> + * The byte range of the client's READ request is expanded on both
> + * ends until it meets the underlying file system's direct I/O
> + * alignment requirements. After the internal read is complete, the
> + * byte range of the NFS READ payload is reduced to the byte range
> + * that was originally requested.
> + *
> + * Note that a direct read can be done only when the xdr_buf
> + * containing the NFS READ reply does not already have contents in
> + * its .pages array. This is due to potentially restrictive
> + * alignment requirements on the read buffer. When .page_len and
> + * @base are zero, the .pages array is guaranteed to be page-
> + * aligned.

Where do we test that this condition is met?

I can see that nfsd_direct_read() is only called if "base" is zero, but
that means the current contents of the .pages array are page-aligned,
not that there are none.

> + */
> +static noinline_for_stack __be32
> +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		 struct nfsd_file *nf, loff_t offset, unsigned long *count,
> +		 u32 *eof)
> +{
> +	loff_t dio_start, dio_end;
> +	unsigned long v, total;
> +	struct iov_iter iter;
> +	struct kiocb kiocb;
> +	ssize_t host_err;
> +	size_t len;
> +
> +	init_sync_kiocb(&kiocb, nf->nf_file);
> +	kiocb.ki_flags |= IOCB_DIRECT;
> +
> +	/* Read a properly-aligned region of bytes into rq_bvec */
> +	dio_start = round_down(offset, nf->nf_dio_read_offset_align);
> +	dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
> +
> +	kiocb.ki_pos = dio_start;
> +
> +	v = 0;
> +	total = dio_end - dio_start;
> +	while (total) {
> +		len = min_t(size_t, total, PAGE_SIZE);
> +		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> +			      len, 0);
> +		total -= len;
> +		++v;
> +	}
> +	WARN_ON_ONCE(v > rqstp->rq_maxpages);

I would rather we had an early test rather than a late warn-on.
e.g.
  if (total > (rqstp->rq_maxpages << PAGE_SHIFT))
     return -EINVAL /* or whatever */;

Otherwise it seems to be making unstated assumptions about how big the
alignment requirements could be.

Thanks,
NeilBrown

> +
> +	trace_nfsd_read_direct(rqstp, fhp, offset, *count);
> +	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, dio_end - dio_start);
> +
> +	host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
> +	if (host_err >= 0) {
> +		unsigned int pad = offset - dio_start;
> +
> +		/* The returned payload starts after the pad */
> +		rqstp->rq_res.page_base = pad;
> +
> +		/* Compute the count of bytes to be returned */
> +		if (host_err > pad + *count) {
> +			host_err = *count;
> +		} else if (host_err > pad) {
> +			host_err -= pad;
> +		} else {
> +			host_err = 0;
> +		}
> +	} else if (unlikely(host_err == -EINVAL)) {
> +		pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
> +		host_err = -ESERVERFAULT;
> +	}
> +
> +	return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
> +				eof, host_err);
> +}
> +
>  /**
>   * nfsd_iter_read - Perform a VFS read using an iterator
>   * @rqstp: RPC transaction context
> @@ -1106,6 +1179,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	switch (nfsd_io_cache_read) {
>  	case NFSD_IO_BUFFERED:
>  		break;
> +	case NFSD_IO_DIRECT:
> +		if (nf->nf_dio_read_offset_align && !base)
> +			return nfsd_direct_read(rqstp, fhp, nf, offset,
> +						count, eof);
> +		fallthrough;
>  	case NFSD_IO_DONTCACHE:
>  		if (file->f_op->fop_flags & FOP_DONTCACHE)
>  			kiocb.ki_flags = IOCB_DONTCACHE;
> -- 
> 2.50.0
> 
> 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
  2025-09-09 23:37   ` NeilBrown
@ 2025-09-09 23:39     ` Chuck Lever
  2025-09-09 23:48       ` Chuck Lever
  2025-09-10  1:52       ` NeilBrown
  2025-09-09 23:56     ` Mike Snitzer
  1 sibling, 2 replies; 23+ messages in thread
From: Chuck Lever @ 2025-09-09 23:39 UTC (permalink / raw)
  To: neil
  Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Chuck Lever, Mike Snitzer

On 9/9/25 7:37 PM, NeilBrown wrote:
> On Wed, 10 Sep 2025, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Add an experimental option that forces NFS READ operations to use
>> direct I/O instead of reading through the NFS server's page cache.
>>
>> There are already other layers of caching:
>>  - The page cache on NFS clients
>>  - The block device underlying the exported file system
>>
>> The server's page cache, in many cases, is unlikely to provide
>> additional benefit. Some benchmarks have demonstrated that the
>> server's page cache is actively detrimental for workloads whose
>> working set is larger than the server's available physical memory.
>>
>> For instance, on small NFS servers, cached NFS file content can
>> squeeze out local memory consumers. For large sequential workloads,
>> an enormous amount of data flows into and out of the page cache
>> and is consumed by NFS clients exactly once -- caching that data
>> is expensive to do and totally valueless.
>>
>> For now this is a hidden option that can be enabled on test
>> systems for benchmarking. In the longer term, this option might
>> be enabled persistently or per-export. When the exported file
>> system does not support direct I/O, NFSD falls back to using
>> either DONTCACHE or buffered I/O to fulfill NFS READ requests.
>>
>> Suggested-by: Mike Snitzer <snitzer@kernel.org>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>  fs/nfsd/debugfs.c |  2 ++
>>  fs/nfsd/nfsd.h    |  1 +
>>  fs/nfsd/trace.h   |  1 +
>>  fs/nfsd/vfs.c     | 78 +++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 82 insertions(+)
>>
>> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
>> index ed2b9e066206..00eb1ecef6ac 100644
>> --- a/fs/nfsd/debugfs.c
>> +++ b/fs/nfsd/debugfs.c
>> @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
>>   * Contents:
>>   *   %0: NFS READ will use buffered IO
>>   *   %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
>> + *   %2: NFS READ will use direct IO
>>   *
>>   * This setting takes immediate effect for all NFS versions,
>>   * all exports, and in all NFSD net namespaces.
>> @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
>>  		nfsd_io_cache_read = NFSD_IO_BUFFERED;
>>  		break;
>>  	case NFSD_IO_DONTCACHE:
>> +	case NFSD_IO_DIRECT:
>>  		/*
>>  		 * Must disable splice_read when enabling
>>  		 * NFSD_IO_DONTCACHE.
>> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
>> index ea87b42894dd..bdb60ee1f1a4 100644
>> --- a/fs/nfsd/nfsd.h
>> +++ b/fs/nfsd/nfsd.h
>> @@ -157,6 +157,7 @@ enum {
>>  	/* Any new NFSD_IO enum value must be added at the end */
>>  	NFSD_IO_BUFFERED,
>>  	NFSD_IO_DONTCACHE,
>> +	NFSD_IO_DIRECT,
>>  };
>>  
>>  extern u64 nfsd_io_cache_read __read_mostly;
>> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
>> index e5af0d058fd0..88901df5fbb1 100644
>> --- a/fs/nfsd/trace.h
>> +++ b/fs/nfsd/trace.h
>> @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name,	\
>>  DEFINE_NFSD_IO_EVENT(read_start);
>>  DEFINE_NFSD_IO_EVENT(read_splice);
>>  DEFINE_NFSD_IO_EVENT(read_vector);
>> +DEFINE_NFSD_IO_EVENT(read_direct);
>>  DEFINE_NFSD_IO_EVENT(read_io_done);
>>  DEFINE_NFSD_IO_EVENT(read_done);
>>  DEFINE_NFSD_IO_EVENT(write_start);
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 441267d877f9..9665454743eb 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -1074,6 +1074,79 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>>  }
>>  
>> +/*
>> + * The byte range of the client's READ request is expanded on both
>> + * ends until it meets the underlying file system's direct I/O
>> + * alignment requirements. After the internal read is complete, the
>> + * byte range of the NFS READ payload is reduced to the byte range
>> + * that was originally requested.
>> + *
>> + * Note that a direct read can be done only when the xdr_buf
>> + * containing the NFS READ reply does not already have contents in
>> + * its .pages array. This is due to potentially restrictive
>> + * alignment requirements on the read buffer. When .page_len and
>> + * @base are zero, the .pages array is guaranteed to be page-
>> + * aligned.
> 
> Where do we test that this condition is met?
> 
> I can see that nfsd_direct_read() is only called if "base" is zero, but
> that means the current contents of the .pages array are page-aligned,
> not that there are none.
> 
>> + */
>> +static noinline_for_stack __be32
>> +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> +		 struct nfsd_file *nf, loff_t offset, unsigned long *count,
>> +		 u32 *eof)
>> +{
>> +	loff_t dio_start, dio_end;
>> +	unsigned long v, total;
>> +	struct iov_iter iter;
>> +	struct kiocb kiocb;
>> +	ssize_t host_err;
>> +	size_t len;
>> +
>> +	init_sync_kiocb(&kiocb, nf->nf_file);
>> +	kiocb.ki_flags |= IOCB_DIRECT;
>> +
>> +	/* Read a properly-aligned region of bytes into rq_bvec */
>> +	dio_start = round_down(offset, nf->nf_dio_read_offset_align);
>> +	dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
>> +
>> +	kiocb.ki_pos = dio_start;
>> +
>> +	v = 0;
>> +	total = dio_end - dio_start;
>> +	while (total) {
>> +		len = min_t(size_t, total, PAGE_SIZE);
>> +		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
>> +			      len, 0);
>> +		total -= len;
>> +		++v;
>> +	}
>> +	WARN_ON_ONCE(v > rqstp->rq_maxpages);
> 
> I would rather we had an early test rather than a late warn-on.
> e.g.
>   if (total > (rqstp->rq_maxpages << PAGE_SHIFT))
>      return -EINVAL /* or whatever */;
> 
> Otherwise it seems to be making unstated assumptions about how big the
> alignment requirements could be.

This is the same warn-on test that nfsd_iter_read does for buffered and
dontcache reads. It's done late because the final value of v is computed
here, not known before the loop.

I think we might be able to turn this into a short read, for all I/O
modes?
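
For illustration, a rough sketch of that short-read idea (untested, and
the clamping math is my assumption, not code from the posted series):

	/*
	 * Rough sketch: clamp the expanded range to the pages still
	 * available in rq_pages so that an oversized request becomes
	 * a short read rather than tripping the WARN_ON. (Rounding
	 * the clamped length back down to the DIO alignment is
	 * glossed over here.)
	 */
	unsigned long avail = (rqstp->rq_maxpages -
			(rqstp->rq_next_page - rqstp->rq_pages)) << PAGE_SHIFT;

	total = dio_end - dio_start;
	if (total > avail)
		total = avail;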


> Thanks,
> NeilBrown
> 
>> +
>> +	trace_nfsd_read_direct(rqstp, fhp, offset, *count);
>> +	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, dio_end - dio_start);
>> +
>> +	host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
>> +	if (host_err >= 0) {
>> +		unsigned int pad = offset - dio_start;
>> +
>> +		/* The returned payload starts after the pad */
>> +		rqstp->rq_res.page_base = pad;
>> +
>> +		/* Compute the count of bytes to be returned */
>> +		if (host_err > pad + *count) {
>> +			host_err = *count;
>> +		} else if (host_err > pad) {
>> +			host_err -= pad;
>> +		} else {
>> +			host_err = 0;
>> +		}
>> +	} else if (unlikely(host_err == -EINVAL)) {
>> +		pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
>> +		host_err = -ESERVERFAULT;
>> +	}
>> +
>> +	return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
>> +				eof, host_err);
>> +}
>> +
>>  /**
>>   * nfsd_iter_read - Perform a VFS read using an iterator
>>   * @rqstp: RPC transaction context
>> @@ -1106,6 +1179,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>  	switch (nfsd_io_cache_read) {
>>  	case NFSD_IO_BUFFERED:
>>  		break;
>> +	case NFSD_IO_DIRECT:
>> +		if (nf->nf_dio_read_offset_align && !base)
>> +			return nfsd_direct_read(rqstp, fhp, nf, offset,
>> +						count, eof);
>> +		fallthrough;
>>  	case NFSD_IO_DONTCACHE:
>>  		if (file->f_op->fop_flags & FOP_DONTCACHE)
>>  			kiocb.ki_flags = IOCB_DONTCACHE;
>> -- 
>> 2.50.0
>>
>>
>>
> 


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
  2025-09-09 23:39     ` Chuck Lever
@ 2025-09-09 23:48       ` Chuck Lever
  2025-09-10  1:54         ` NeilBrown
  2025-09-10  1:52       ` NeilBrown
  1 sibling, 1 reply; 23+ messages in thread
From: Chuck Lever @ 2025-09-09 23:48 UTC (permalink / raw)
  To: Chuck Lever, neil
  Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Mike Snitzer

On 9/9/25 7:39 PM, Chuck Lever wrote:
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 441267d877f9..9665454743eb 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -1074,6 +1074,79 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>>  }
>>  
>> +/*
>> + * The byte range of the client's READ request is expanded on both
>> + * ends until it meets the underlying file system's direct I/O
>> + * alignment requirements. After the internal read is complete, the
>> + * byte range of the NFS READ payload is reduced to the byte range
>> + * that was originally requested.
>> + *
>> + * Note that a direct read can be done only when the xdr_buf
>> + * containing the NFS READ reply does not already have contents in
>> + * its .pages array. This is due to potentially restrictive
>> + * alignment requirements on the read buffer. When .page_len and
>> + * @base are zero, the .pages array is guaranteed to be page-
>> + * aligned.
> Where do we test that this condition is met?
> 
> I can see that nfsd_direct_read() is only called if "base" is zero, but
> that means the current contents of the .pages array are page-aligned,
> not that there are none.

The above comment might be stale; I'm not sure the .page_len test is
necessary. As long as the payload starts on a page boundary, it should
meet most any plausible buffer alignment restriction.

However, if there are additional restrictions needed, checks can be
added in nfsd_iter_read() just before nfsd_direct_read() is called.
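
For instance, a minimal sketch of such a gate (the extra .page_len check
is hypothetical here -- it mirrors the comment above, it is not in the
posted patch):

	case NFSD_IO_DIRECT:
		/* Require an empty, page-aligned reply buffer */
		if (nf->nf_dio_read_offset_align && !base &&
		    !rqstp->rq_res.page_len)
			return nfsd_direct_read(rqstp, fhp, nf, offset,
						count, eof);
		fallthrough;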


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
  2025-09-09 23:37   ` NeilBrown
  2025-09-09 23:39     ` Chuck Lever
@ 2025-09-09 23:56     ` Mike Snitzer
  1 sibling, 0 replies; 23+ messages in thread
From: Mike Snitzer @ 2025-09-09 23:56 UTC (permalink / raw)
  To: NeilBrown
  Cc: Chuck Lever, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On Wed, Sep 10, 2025 at 09:37:27AM +1000, NeilBrown wrote:
> On Wed, 10 Sep 2025, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > Add an experimental option that forces NFS READ operations to use
> > direct I/O instead of reading through the NFS server's page cache.
> > 
> > There are already other layers of caching:
> >  - The page cache on NFS clients
> >  - The block device underlying the exported file system
> > 
> > The server's page cache, in many cases, is unlikely to provide
> > additional benefit. Some benchmarks have demonstrated that the
> > server's page cache is actively detrimental for workloads whose
> > working set is larger than the server's available physical memory.
> > 
> > For instance, on small NFS servers, cached NFS file content can
> > squeeze out local memory consumers. For large sequential workloads,
> > an enormous amount of data flows into and out of the page cache
> > and is consumed by NFS clients exactly once -- caching that data
> > is expensive to do and totally valueless.
> > 
> > For now this is a hidden option that can be enabled on test
> > systems for benchmarking. In the longer term, this option might
> > be enabled persistently or per-export. When the exported file
> > system does not support direct I/O, NFSD falls back to using
> > either DONTCACHE or buffered I/O to fulfill NFS READ requests.
> > 
> > Suggested-by: Mike Snitzer <snitzer@kernel.org>
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  fs/nfsd/debugfs.c |  2 ++
> >  fs/nfsd/nfsd.h    |  1 +
> >  fs/nfsd/trace.h   |  1 +
> >  fs/nfsd/vfs.c     | 78 +++++++++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 82 insertions(+)
> > 
> > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > index ed2b9e066206..00eb1ecef6ac 100644
> > --- a/fs/nfsd/debugfs.c
> > +++ b/fs/nfsd/debugfs.c
> > @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> >   * Contents:
> >   *   %0: NFS READ will use buffered IO
> >   *   %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> > + *   %2: NFS READ will use direct IO
> >   *
> >   * This setting takes immediate effect for all NFS versions,
> >   * all exports, and in all NFSD net namespaces.
> > @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> >  		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> >  		break;
> >  	case NFSD_IO_DONTCACHE:
> > +	case NFSD_IO_DIRECT:
> >  		/*
> >  		 * Must disable splice_read when enabling
> >  		 * NFSD_IO_DONTCACHE.
> > diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> > index ea87b42894dd..bdb60ee1f1a4 100644
> > --- a/fs/nfsd/nfsd.h
> > +++ b/fs/nfsd/nfsd.h
> > @@ -157,6 +157,7 @@ enum {
> >  	/* Any new NFSD_IO enum value must be added at the end */
> >  	NFSD_IO_BUFFERED,
> >  	NFSD_IO_DONTCACHE,
> > +	NFSD_IO_DIRECT,
> >  };
> >  
> >  extern u64 nfsd_io_cache_read __read_mostly;
> > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > index e5af0d058fd0..88901df5fbb1 100644
> > --- a/fs/nfsd/trace.h
> > +++ b/fs/nfsd/trace.h
> > @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name,	\
> >  DEFINE_NFSD_IO_EVENT(read_start);
> >  DEFINE_NFSD_IO_EVENT(read_splice);
> >  DEFINE_NFSD_IO_EVENT(read_vector);
> > +DEFINE_NFSD_IO_EVENT(read_direct);
> >  DEFINE_NFSD_IO_EVENT(read_io_done);
> >  DEFINE_NFSD_IO_EVENT(read_done);
> >  DEFINE_NFSD_IO_EVENT(write_start);
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 441267d877f9..9665454743eb 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -1074,6 +1074,79 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> >  }
> >  
> > +/*
> > + * The byte range of the client's READ request is expanded on both
> > + * ends until it meets the underlying file system's direct I/O
> > + * alignment requirements. After the internal read is complete, the
> > + * byte range of the NFS READ payload is reduced to the byte range
> > + * that was originally requested.
> > + *
> > + * Note that a direct read can be done only when the xdr_buf
> > + * containing the NFS READ reply does not already have contents in
> > + * its .pages array. This is due to potentially restrictive
> > + * alignment requirements on the read buffer. When .page_len and
> > + * @base are zero, the .pages array is guaranteed to be page-
> > + * aligned.
> 
> Where do we test that this condition is met?
> 
> I can see that nfsd_direct_read() is only called if "base" is zero, but
> that means the current contents of the .pages array are page-aligned,
> > not that there are none.
> 
> > + */
> > +static noinline_for_stack __be32
> > +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > +		 struct nfsd_file *nf, loff_t offset, unsigned long *count,
> > +		 u32 *eof)
> > +{
> > +	loff_t dio_start, dio_end;
> > +	unsigned long v, total;
> > +	struct iov_iter iter;
> > +	struct kiocb kiocb;
> > +	ssize_t host_err;
> > +	size_t len;
> > +
> > +	init_sync_kiocb(&kiocb, nf->nf_file);
> > +	kiocb.ki_flags |= IOCB_DIRECT;
> > +
> > +	/* Read a properly-aligned region of bytes into rq_bvec */
> > +	dio_start = round_down(offset, nf->nf_dio_read_offset_align);
> > +	dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
> > +
> > +	kiocb.ki_pos = dio_start;
> > +
> > +	v = 0;
> > +	total = dio_end - dio_start;
> > +	while (total) {
> > +		len = min_t(size_t, total, PAGE_SIZE);
> > +		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> > +			      len, 0);
> > +		total -= len;
> > +		++v;
> > +	}
> > +	WARN_ON_ONCE(v > rqstp->rq_maxpages);
> 
> I would rather we had an early test rather than a late warn-on.
> e.g.
>   if (total > (rqstp->rq_maxpages << PAGE_SHIFT))
>      return -EINVAL /* or whatever */;

-EINVAL is the devil. ;)

My follow-up patch should avoid any possibility of this WARN_ON_ONCE
ever triggering:
https://lore.kernel.org/linux-nfs/20250909233315.80318-2-snitzer@kernel.org/
(and should/could be folded into this 3/3 patch?)

> Otherwise it seems to be making unstated assumptions about how big the
> alignment requirements could be.

Just FYI, nfsd_direct_read's while loop code is nearly identical to
that which is in nfsd_iter_read().

So if an early return is warranted in nfsd_direct_read, it is also
warranted in nfsd_iter_read.

(I had an early return in nfsd_iter_read N iterations ago, but decided
it wasn't needed given it's not something we'll ever hit -- really just
a canary in the coal mine that offers companionship until one of us
dies of black lung)

Mike

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
  2025-09-09 23:39     ` Chuck Lever
  2025-09-09 23:48       ` Chuck Lever
@ 2025-09-10  1:52       ` NeilBrown
  2025-09-10 14:23         ` Chuck Lever
  1 sibling, 1 reply; 23+ messages in thread
From: NeilBrown @ 2025-09-10  1:52 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Chuck Lever, Mike Snitzer

On Wed, 10 Sep 2025, Chuck Lever wrote:
> On 9/9/25 7:37 PM, NeilBrown wrote:
> > On Wed, 10 Sep 2025, Chuck Lever wrote:
> >> From: Chuck Lever <chuck.lever@oracle.com>
> >>
> >> Add an experimental option that forces NFS READ operations to use
> >> direct I/O instead of reading through the NFS server's page cache.
> >>
> >> There are already other layers of caching:
> >>  - The page cache on NFS clients
> >>  - The block device underlying the exported file system
> >>
> >> The server's page cache, in many cases, is unlikely to provide
> >> additional benefit. Some benchmarks have demonstrated that the
> >> server's page cache is actively detrimental for workloads whose
> >> working set is larger than the server's available physical memory.
> >>
> >> For instance, on small NFS servers, cached NFS file content can
> >> squeeze out local memory consumers. For large sequential workloads,
> >> an enormous amount of data flows into and out of the page cache
> >> and is consumed by NFS clients exactly once -- caching that data
> >> is expensive to do and totally valueless.
> >>
> >> For now this is a hidden option that can be enabled on test
> >> systems for benchmarking. In the longer term, this option might
> >> be enabled persistently or per-export. When the exported file
> >> system does not support direct I/O, NFSD falls back to using
> >> either DONTCACHE or buffered I/O to fulfill NFS READ requests.
> >>
> >> Suggested-by: Mike Snitzer <snitzer@kernel.org>
> >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> >> ---
> >>  fs/nfsd/debugfs.c |  2 ++
> >>  fs/nfsd/nfsd.h    |  1 +
> >>  fs/nfsd/trace.h   |  1 +
> >>  fs/nfsd/vfs.c     | 78 +++++++++++++++++++++++++++++++++++++++++++++++
> >>  4 files changed, 82 insertions(+)
> >>
> >> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> >> index ed2b9e066206..00eb1ecef6ac 100644
> >> --- a/fs/nfsd/debugfs.c
> >> +++ b/fs/nfsd/debugfs.c
> >> @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> >>   * Contents:
> >>   *   %0: NFS READ will use buffered IO
> >>   *   %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> >> + *   %2: NFS READ will use direct IO
> >>   *
> >>   * This setting takes immediate effect for all NFS versions,
> >>   * all exports, and in all NFSD net namespaces.
> >> @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> >>  		nfsd_io_cache_read = NFSD_IO_BUFFERED;
> >>  		break;
> >>  	case NFSD_IO_DONTCACHE:
> >> +	case NFSD_IO_DIRECT:
> >>  		/*
> >>  		 * Must disable splice_read when enabling
> >>  		 * NFSD_IO_DONTCACHE.
> >> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> >> index ea87b42894dd..bdb60ee1f1a4 100644
> >> --- a/fs/nfsd/nfsd.h
> >> +++ b/fs/nfsd/nfsd.h
> >> @@ -157,6 +157,7 @@ enum {
> >>  	/* Any new NFSD_IO enum value must be added at the end */
> >>  	NFSD_IO_BUFFERED,
> >>  	NFSD_IO_DONTCACHE,
> >> +	NFSD_IO_DIRECT,
> >>  };
> >>  
> >>  extern u64 nfsd_io_cache_read __read_mostly;
> >> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> >> index e5af0d058fd0..88901df5fbb1 100644
> >> --- a/fs/nfsd/trace.h
> >> +++ b/fs/nfsd/trace.h
> >> @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name,	\
> >>  DEFINE_NFSD_IO_EVENT(read_start);
> >>  DEFINE_NFSD_IO_EVENT(read_splice);
> >>  DEFINE_NFSD_IO_EVENT(read_vector);
> >> +DEFINE_NFSD_IO_EVENT(read_direct);
> >>  DEFINE_NFSD_IO_EVENT(read_io_done);
> >>  DEFINE_NFSD_IO_EVENT(read_done);
> >>  DEFINE_NFSD_IO_EVENT(write_start);
> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> >> index 441267d877f9..9665454743eb 100644
> >> --- a/fs/nfsd/vfs.c
> >> +++ b/fs/nfsd/vfs.c
> >> @@ -1074,6 +1074,79 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> >>  }
> >>  
> >> +/*
> >> + * The byte range of the client's READ request is expanded on both
> >> + * ends until it meets the underlying file system's direct I/O
> >> + * alignment requirements. After the internal read is complete, the
> >> + * byte range of the NFS READ payload is reduced to the byte range
> >> + * that was originally requested.
> >> + *
> >> + * Note that a direct read can be done only when the xdr_buf
> >> + * containing the NFS READ reply does not already have contents in
> >> + * its .pages array. This is due to potentially restrictive
> >> + * alignment requirements on the read buffer. When .page_len and
> >> + * @base are zero, the .pages array is guaranteed to be page-
> >> + * aligned.
> > 
> > Where do we test that this condition is met?
> > 
> > I can see that nfsd_direct_read() is only called if "base" is zero, but
> > that means the current contents of the .pages array are page-aligned,
> > not that there are none.
> > 
> >> + */
> >> +static noinline_for_stack __be32
> >> +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >> +		 struct nfsd_file *nf, loff_t offset, unsigned long *count,
> >> +		 u32 *eof)
> >> +{
> >> +	loff_t dio_start, dio_end;
> >> +	unsigned long v, total;
> >> +	struct iov_iter iter;
> >> +	struct kiocb kiocb;
> >> +	ssize_t host_err;
> >> +	size_t len;
> >> +
> >> +	init_sync_kiocb(&kiocb, nf->nf_file);
> >> +	kiocb.ki_flags |= IOCB_DIRECT;
> >> +
> >> +	/* Read a properly-aligned region of bytes into rq_bvec */
> >> +	dio_start = round_down(offset, nf->nf_dio_read_offset_align);
> >> +	dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
> >> +
> >> +	kiocb.ki_pos = dio_start;
> >> +
> >> +	v = 0;
> >> +	total = dio_end - dio_start;
> >> +	while (total) {
> >> +		len = min_t(size_t, total, PAGE_SIZE);
> >> +		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> >> +			      len, 0);
> >> +		total -= len;
> >> +		++v;
> >> +	}
> >> +	WARN_ON_ONCE(v > rqstp->rq_maxpages);
> > 
> > I would rather we had an early test rather than a late warn-on.
> > e.g.
> >   if (total > (rqstp->rq_maxpages << PAGE_SHIFT))
> >      return -EINVAL /* or whatever */;
> > 
> > Otherwise it seems to be making unstated assumptions about how big the
> > alignment requirements could be.
> 
> This is the same warn-on test that nfsd_iter_read does for buffered and
> dontcache reads. It's done late because the final value of v is computed
> here, not known before the loop.

True, but in this case "total" could be larger than "*count" which was
size-checked in e.g.  nfsd4_encode_read.  So it could now be larger than
the available space.

> 
> I think we might be able to turn this into a short read, for all I/O
> modes?

Yes, that could be a clean way to handle the unlikely case that the
read doesn't fit any more.

Thanks,
NeilBrown

> 
> 
> > Thanks,
> > NeilBrown
> > 
> >> +
> >> +	trace_nfsd_read_direct(rqstp, fhp, offset, *count);
> >> +	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, dio_end - dio_start);
> >> +
> >> +	host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
> >> +	if (host_err >= 0) {
> >> +		unsigned int pad = offset - dio_start;
> >> +
> >> +		/* The returned payload starts after the pad */
> >> +		rqstp->rq_res.page_base = pad;
> >> +
> >> +		/* Compute the count of bytes to be returned */
> >> +		if (host_err > pad + *count) {
> >> +			host_err = *count;
> >> +		} else if (host_err > pad) {
> >> +			host_err -= pad;
> >> +		} else {
> >> +			host_err = 0;
> >> +		}
> >> +	} else if (unlikely(host_err == -EINVAL)) {
> >> +		pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
> >> +		host_err = -ESERVERFAULT;
> >> +	}
> >> +
> >> +	return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
> >> +				eof, host_err);
> >> +}
> >> +
> >>  /**
> >>   * nfsd_iter_read - Perform a VFS read using an iterator
> >>   * @rqstp: RPC transaction context
> >> @@ -1106,6 +1179,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >>  	switch (nfsd_io_cache_read) {
> >>  	case NFSD_IO_BUFFERED:
> >>  		break;
> >> +	case NFSD_IO_DIRECT:
> >> +		if (nf->nf_dio_read_offset_align && !base)
> >> +			return nfsd_direct_read(rqstp, fhp, nf, offset,
> >> +						count, eof);
> >> +		fallthrough;
> >>  	case NFSD_IO_DONTCACHE:
> >>  		if (file->f_op->fop_flags & FOP_DONTCACHE)
> >>  			kiocb.ki_flags = IOCB_DONTCACHE;
> >> -- 
> >> 2.50.0
> >>
> >>
> >>
> > 
> 
> 
> -- 
> Chuck Lever
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
  2025-09-09 23:48       ` Chuck Lever
@ 2025-09-10  1:54         ` NeilBrown
  0 siblings, 0 replies; 23+ messages in thread
From: NeilBrown @ 2025-09-10  1:54 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Chuck Lever, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Mike Snitzer

On Wed, 10 Sep 2025, Chuck Lever wrote:
> On 9/9/25 7:39 PM, Chuck Lever wrote:
> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> >> index 441267d877f9..9665454743eb 100644
> >> --- a/fs/nfsd/vfs.c
> >> +++ b/fs/nfsd/vfs.c
> >> @@ -1074,6 +1074,79 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> >>  }
> >>  
> >> +/*
> >> + * The byte range of the client's READ request is expanded on both
> >> + * ends until it meets the underlying file system's direct I/O
> >> + * alignment requirements. After the internal read is complete, the
> >> + * byte range of the NFS READ payload is reduced to the byte range
> >> + * that was originally requested.
> >> + *
> >> + * Note that a direct read can be done only when the xdr_buf
> >> + * containing the NFS READ reply does not already have contents in
> >> + * its .pages array. This is due to potentially restrictive
> >> + * alignment requirements on the read buffer. When .page_len and
> >> + * @base are zero, the .pages array is guaranteed to be page-
> >> + * aligned.
> > Where do we test that this condition is met?
> > 
> > I can see that nfsd_direct_read() is only called if "base" is zero, but
> > that means the current contents of the .pages array are page-aligned,
> > not that there are none.
> 
> The above comment might be stale; I'm not sure the .page_len test is
> necessary. As long as the payload starts on a page boundary, it should
> meet most any plausible buffer alignment restriction.
> 
> However, if there are additional restrictions needed, checks can be
> added in nfsd_iter_read() just before nfsd_direct_read() is called.

I think the current test against rq_maxpages (in the WARN_ON) only
makes sense if .page_len is zero.

NeilBrown


> 
> 
> -- 
> Chuck Lever
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
  2025-09-09 19:05 ` [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ Chuck Lever
  2025-09-09 23:16   ` Mike Snitzer
  2025-09-09 23:37   ` NeilBrown
@ 2025-09-10 11:37   ` Jeff Layton
  2 siblings, 0 replies; 23+ messages in thread
From: Jeff Layton @ 2025-09-10 11:37 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Mike Snitzer

On Tue, 2025-09-09 at 15:05 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Add an experimental option that forces NFS READ operations to use
> direct I/O instead of reading through the NFS server's page cache.
> 
> There are already other layers of caching:
>  - The page cache on NFS clients
>  - The block device underlying the exported file system
> 
> The server's page cache, in many cases, is unlikely to provide
> additional benefit. Some benchmarks have demonstrated that the
> server's page cache is actively detrimental for workloads whose
> working set is larger than the server's available physical memory.
> 
> For instance, on small NFS servers, cached NFS file content can
> squeeze out local memory consumers. For large sequential workloads,
> an enormous amount of data flows into and out of the page cache
> and is consumed by NFS clients exactly once -- caching that data
> is expensive to do and totally valueless.
> 
> For now this is a hidden option that can be enabled on test
> systems for benchmarking. In the longer term, this option might
> be enabled persistently or per-export. When the exported file
> system does not support direct I/O, NFSD falls back to using
> either DONTCACHE or buffered I/O to fulfill NFS READ requests.

> Suggested-by: Mike Snitzer <snitzer@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/debugfs.c |  2 ++
>  fs/nfsd/nfsd.h    |  1 +
>  fs/nfsd/trace.h   |  1 +
>  fs/nfsd/vfs.c     | 78 +++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 82 insertions(+)
> 
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index ed2b9e066206..00eb1ecef6ac 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -44,6 +44,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
>   * Contents:
>   *   %0: NFS READ will use buffered IO
>   *   %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> + *   %2: NFS READ will use direct IO
>   *
>   * This setting takes immediate effect for all NFS versions,
>   * all exports, and in all NFSD net namespaces.
> @@ -64,6 +65,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
>  		nfsd_io_cache_read = NFSD_IO_BUFFERED;
>  		break;
>  	case NFSD_IO_DONTCACHE:
> +	case NFSD_IO_DIRECT:
>  		/*
>  		 * Must disable splice_read when enabling
>  		 * NFSD_IO_DONTCACHE.
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index ea87b42894dd..bdb60ee1f1a4 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -157,6 +157,7 @@ enum {
>  	/* Any new NFSD_IO enum value must be added at the end */
>  	NFSD_IO_BUFFERED,
>  	NFSD_IO_DONTCACHE,
> +	NFSD_IO_DIRECT,
>  };
>  
>  extern u64 nfsd_io_cache_read __read_mostly;
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index e5af0d058fd0..88901df5fbb1 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -464,6 +464,7 @@ DEFINE_EVENT(nfsd_io_class, nfsd_##name,	\
>  DEFINE_NFSD_IO_EVENT(read_start);
>  DEFINE_NFSD_IO_EVENT(read_splice);
>  DEFINE_NFSD_IO_EVENT(read_vector);
> +DEFINE_NFSD_IO_EVENT(read_direct);
>  DEFINE_NFSD_IO_EVENT(read_io_done);
>  DEFINE_NFSD_IO_EVENT(read_done);
>  DEFINE_NFSD_IO_EVENT(write_start);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 441267d877f9..9665454743eb 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1074,6 +1074,79 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>  }
>  
> +/*
> + * The byte range of the client's READ request is expanded on both
> + * ends until it meets the underlying file system's direct I/O
> + * alignment requirements. After the internal read is complete, the
> + * byte range of the NFS READ payload is reduced to the byte range
> + * that was originally requested.
> + *
> + * Note that a direct read can be done only when the xdr_buf
> + * containing the NFS READ reply does not already have contents in
> + * its .pages array. This is due to potentially restrictive
> + * alignment requirements on the read buffer. When .page_len and
> + * @base are zero, the .pages array is guaranteed to be page-
> + * aligned.
> + */
> +static noinline_for_stack __be32
> +nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		 struct nfsd_file *nf, loff_t offset, unsigned long *count,
> +		 u32 *eof)
> +{
> +	loff_t dio_start, dio_end;
> +	unsigned long v, total;
> +	struct iov_iter iter;
> +	struct kiocb kiocb;
> +	ssize_t host_err;
> +	size_t len;
> +
> +	init_sync_kiocb(&kiocb, nf->nf_file);
> +	kiocb.ki_flags |= IOCB_DIRECT;
> +
> +	/* Read a properly-aligned region of bytes into rq_bvec */
> +	dio_start = round_down(offset, nf->nf_dio_read_offset_align);
> +	dio_end = round_up(offset + *count, nf->nf_dio_read_offset_align);
> +
> +	kiocb.ki_pos = dio_start;
> +
> +	v = 0;
> +	total = dio_end - dio_start;
> +	while (total) {
> +		len = min_t(size_t, total, PAGE_SIZE);
> +		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
> +			      len, 0);
> +		total -= len;
> +		++v;
> +	}
> +	WARN_ON_ONCE(v > rqstp->rq_maxpages);
> +
> +	trace_nfsd_read_direct(rqstp, fhp, offset, *count);
> +	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, dio_end - dio_start);
> +
> +	host_err = vfs_iocb_iter_read(nf->nf_file, &kiocb, &iter);
> +	if (host_err >= 0) {
> +		unsigned int pad = offset - dio_start;
> +
> +		/* The returned payload starts after the pad */
> +		rqstp->rq_res.page_base = pad;
> +
> +		/* Compute the count of bytes to be returned */
> +		if (host_err > pad + *count) {
> +			host_err = *count;
> +		} else if (host_err > pad) {
> +			host_err -= pad;
> +		} else {
> +			host_err = 0;
> +		}
> +	} else if (unlikely(host_err == -EINVAL)) {
> +		pr_info_ratelimited("nfsd: Unexpected direct I/O alignment failure\n");
> +		host_err = -ESERVERFAULT;
> +	}
> +
> +	return nfsd_finish_read(rqstp, fhp, nf->nf_file, offset, count,
> +				eof, host_err);
> +}
> +
>  /**
>   * nfsd_iter_read - Perform a VFS read using an iterator
>   * @rqstp: RPC transaction context
> @@ -1106,6 +1179,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	switch (nfsd_io_cache_read) {
>  	case NFSD_IO_BUFFERED:
>  		break;
> +	case NFSD_IO_DIRECT:
> +		if (nf->nf_dio_read_offset_align && !base)
> +			return nfsd_direct_read(rqstp, fhp, nf, offset,
> +						count, eof);
> +		fallthrough;
>  	case NFSD_IO_DONTCACHE:
>  		if (file->f_op->fop_flags & FOP_DONTCACHE)
>  			kiocb.ki_flags = IOCB_DONTCACHE;

This looks fine:

Reviewed-by: Jeff Layton <jlayton@kernel.org>

Since this is a "hidden" option, it may also be a good idea to add a
pr_info that pops when someone enables it. Then if we get a bug report
involving this later, that would at least show up in the log.
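
A minimal sketch of that, assuming it lives in nfsd_io_cache_read_set()
(this splits the shared case so the message fires only for direct I/O;
the message text is just an example):

	case NFSD_IO_DIRECT:
		pr_info("nfsd: experimental direct I/O for NFS READ enabled\n");
		fallthrough;
	case NFSD_IO_DONTCACHE:
		/* ... existing splice_read disabling and assignment ... */
		break;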

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 3/3] NFSD: Implement NFSD_IO_DIRECT for NFS READ
  2025-09-10  1:52       ` NeilBrown
@ 2025-09-10 14:23         ` Chuck Lever
  0 siblings, 0 replies; 23+ messages in thread
From: Chuck Lever @ 2025-09-10 14:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Chuck Lever, Mike Snitzer

On 9/9/25 9:52 PM, NeilBrown wrote:
>>>> +	v = 0;
>>>> +	total = dio_end - dio_start;
>>>> +	while (total) {
>>>> +		len = min_t(size_t, total, PAGE_SIZE);
>>>> +		bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
>>>> +			      len, 0);
>>>> +		total -= len;
>>>> +		++v;
>>>> +	}
>>>> +	WARN_ON_ONCE(v > rqstp->rq_maxpages);
>>> I would rather we had an early test rather than a late warn-on.
>>> e.g.
>>>   if (total > (rqstp->rq_maxpages << PAGE_SHIFT))
>>>      return -EINVAL /* or whatever */;
>>>
>>> Otherwise it seems to be making unstated assumptions about how big the
>>> alignment requirements could be.
>> This is the same warn-on test that nfsd_iter_read does for buffered and
>> dontcache reads. It's done late because the final value of v is computed
>> here, not known before the loop.
> True, but in this case "total" could be larger than "*count" which was
> size-checked in e.g.  nfsd4_encode_read.  So it could now be larger than
> the available space.

Expanding the byte range is constrained to the alignment parameters,
meaning the most the range can increase is by a single page (assuming
the needed alignment is always less than or equal to a page size, or
that we stipulate larger alignments are not yet supported).

Both rq_bvec and rq_pages have that extra page already.
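
To put numbers on that: with a 512-byte nf_dio_read_offset_align and
4 KiB pages, a READ at offset 1000 for 8192 bytes expands to
dio_start = round_down(1000, 512) = 512 and
dio_end = round_up(9192, 512) = 9216, i.e. 8704 bytes in total --
only 512 bytes more than was requested in this case, well within the
one extra page.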


>> I think we might be able to turn this into a short read, for all I/O
>> modes?
> Yes, that could be a clean way to handle the unlikely case that the
> reads doesn't fit any more.

It's probably best to not have the WARN_ON at all. Either convert the
failure to a short read, or prove formally that the condition cannot
happen and simply remove the WARN_ON. I have never seen it fire.

That should be done to nfsd_iter_read() /before/ this series. Then 3/3
can "copy" and use the improved loop logic.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] sunrpc: add an extra reserve page to svc_serv_maxpages()
  2025-09-09 23:33   ` [PATCH 1/2] sunrpc: add an extra reserve page to svc_serv_maxpages() Mike Snitzer
@ 2025-09-10 14:29     ` Chuck Lever
  0 siblings, 0 replies; 23+ messages in thread
From: Chuck Lever @ 2025-09-10 14:29 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs

On 9/9/25 7:33 PM, Mike Snitzer wrote:
> nfsd_iter_read() might need two extra pages when a READ payload is not
> DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor() are
> mutually exclusive (so reuse page reserved for nfsd_splice_actor).
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  include/linux/sunrpc/svc.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index e64ab444e0a7f..190c2667500e2 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>   * pages, one for the request, and one for the reply.
>   * nfsd_splice_actor() might need an extra page when a READ payload
>   * is not page-aligned.
> + * nfsd_iter_read() might need two extra pages when a READ payload
> + * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
> + * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
>   */
>  static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
>  {
> -	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
> +	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
>  }
>  
>  /*

I'm not convinced an additional page is necessary. The most the direct
I/O range can be is rsize + 1 page, AFAICT, and that seems to be covered
already.

If we want NFSD to support file systems / block devices with alignment
requirements larger than a page, then perhaps additional buffer space
will be needed... or NFSD can return a short read.

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-09-09 23:33   ` [PATCH 2/2] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Mike Snitzer
@ 2025-10-08 18:59     ` Mike Snitzer
  2025-10-09 15:04       ` Jeff Layton
  2025-10-09 17:46       ` Chuck Lever
  0 siblings, 2 replies; 23+ messages in thread
From: Mike Snitzer @ 2025-10-08 18:59 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Chuck Lever

If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
middle and end as needed. The large middle extent is DIO-aligned and
the start and/or end are misaligned. Buffered IO (with preference
towards using DONTCACHE) is used for the misaligned extents and
O_DIRECT is used for the middle DIO-aligned extent.
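
As a worked example (illustrative numbers only): with a 512-byte
nf_dio_offset_align, a 10000-byte WRITE at offset 1000 splits into a
24-byte misaligned start (bytes 1000-1023), a 9728-byte DIO-aligned
middle (1024-10751), and a 248-byte misaligned end (10752-10999).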

If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
invalidate the page cache on behalf of the DIO WRITE, then
nfsd_issue_write_dio() will fall back to using buffered IO.

These changes served as the original starting point for the NFS
client's misaligned O_DIRECT support that landed with commit
c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and
WRITE"). But NFSD's support is simpler because it currently doesn't
use AIO completion.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/debugfs.c |   1 +
 fs/nfsd/trace.h   |   1 +
 fs/nfsd/vfs.c     | 210 ++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 207 insertions(+), 5 deletions(-)

changes in v2:
- add improved -EINVAL logging that matches latest DIO READ support
- update patch header

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 00eb1ecef6ac..7f44689e0a53 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
 	switch (val) {
 	case NFSD_IO_BUFFERED:
 	case NFSD_IO_DONTCACHE:
+	case NFSD_IO_DIRECT:
 		nfsd_io_cache_write = val;
 		break;
 	default:
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index bfd41236aff2..ad74439d0105 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
 DEFINE_NFSD_IO_EVENT(read_done);
 DEFINE_NFSD_IO_EVENT(write_start);
 DEFINE_NFSD_IO_EVENT(write_opened);
+DEFINE_NFSD_IO_EVENT(write_direct);
 DEFINE_NFSD_IO_EVENT(write_io_done);
 DEFINE_NFSD_IO_EVENT(write_done);
 DEFINE_NFSD_IO_EVENT(commit_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 4ae08065debf..2420f568b378 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,203 @@ static int wait_for_concurrent_writes(struct file *file)
 	return err;
 }
 
+struct nfsd_write_dio {
+	ssize_t	start_len;	/* Length for misaligned first extent */
+	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
+	ssize_t	end_len;	/* Length for misaligned last extent */
+};
+
+static bool
+nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
+		       struct nfsd_file *nf, struct nfsd_write_dio *write_dio)
+{
+	const u32 dio_blocksize = nf->nf_dio_offset_align;
+	loff_t start_end, orig_end, middle_end;
+
+	if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
+		return false;
+	if (unlikely(dio_blocksize > PAGE_SIZE))
+		return false;
+	if (unlikely(len < dio_blocksize))
+		return false;
+
+	start_end = round_up(offset, dio_blocksize);
+	orig_end = offset + len;
+	middle_end = round_down(orig_end, dio_blocksize);
+
+	write_dio->start_len = start_end - offset;
+	write_dio->middle_len = middle_end - start_end;
+	write_dio->end_len = orig_end - middle_end;
+
+	return true;
+}
+
+static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
+		unsigned int addr_mask, unsigned int len_mask)
+{
+	const struct bio_vec *bvec = i->bvec;
+	size_t skip = i->iov_offset;
+	size_t size = i->count;
+
+	if (size & len_mask)
+		return false;
+	do {
+		size_t len = bvec->bv_len;
+
+		if (len > size)
+			len = size;
+		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
+			return false;
+		bvec++;
+		size -= len;
+		skip = 0;
+	} while (size);
+
+	return true;
+}
+
+/*
+ * Setup as many as 3 iov_iter based on extents described by @write_dio.
+ * Returns the number of iov_iter that were setup.
+ */
+static int
+nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
+			   struct bio_vec *rq_bvec, unsigned int nvecs,
+			   unsigned long cnt, struct nfsd_write_dio *write_dio,
+			   struct nfsd_file *nf)
+{
+	int n_iters = 0;
+	struct iov_iter *iters = *iterp;
+
+	/* Setup misaligned start? */
+	if (write_dio->start_len) {
+		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+		iters[n_iters].count = write_dio->start_len;
+		iter_is_dio_aligned[n_iters] = false;
+		++n_iters;
+	}
+
+	/* Setup DIO-aligned middle */
+	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+	if (write_dio->start_len)
+		iov_iter_advance(&iters[n_iters], write_dio->start_len);
+	iters[n_iters].count -= write_dio->end_len;
+	iter_is_dio_aligned[n_iters] =
+		nfsd_iov_iter_aligned_bvec(&iters[n_iters],
+				nf->nf_dio_mem_align-1, nf->nf_dio_offset_align-1);
+	if (unlikely(!iter_is_dio_aligned[n_iters]))
+		return 0; /* no DIO-aligned IO possible */
+	++n_iters;
+
+	/* Setup misaligned end? */
+	if (write_dio->end_len) {
+		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+		iov_iter_advance(&iters[n_iters],
+				 write_dio->start_len + write_dio->middle_len);
+		iter_is_dio_aligned[n_iters] = false;
+		++n_iters;
+	}
+
+	return n_iters;
+}
+
+static int
+nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
+		    unsigned int nvecs, unsigned long *cnt,
+		    struct kiocb *kiocb)
+{
+	struct iov_iter iter;
+	int host_err;
+
+	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
+	if (host_err < 0)
+		return host_err;
+	*cnt = host_err;
+
+	return 0;
+}
+
+static int
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
+		     loff_t offset, unsigned int nvecs, unsigned long *cnt,
+		     struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
+{
+	struct file *file = nf->nf_file;
+	bool iter_is_dio_aligned[3];
+	struct iov_iter iter_stack[3];
+	struct iov_iter *iter = iter_stack;
+	unsigned int n_iters = 0;
+	unsigned long in_count = *cnt;
+	loff_t in_offset = kiocb->ki_pos;
+	ssize_t host_err;
+
+	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
+				rqstp->rq_bvec, nvecs, *cnt, write_dio, nf);
+	if (unlikely(!n_iters))
+		return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
+
+	*cnt = 0;
+	for (int i = 0; i < n_iters; i++) {
+		if (iter_is_dio_aligned[i])
+			kiocb->ki_flags |= IOCB_DIRECT;
+		else
+			kiocb->ki_flags &= ~IOCB_DIRECT;
+
+		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
+		if (host_err < 0) {
+			/* VFS will return -ENOTBLK if DIO WRITE fails to
+			 * invalidate the page cache. Retry using buffered IO.
+			 */
+			if (unlikely(host_err == -ENOTBLK)) {
+				kiocb->ki_flags &= ~IOCB_DIRECT;
+				*cnt = in_count;
+				kiocb->ki_pos = in_offset;
+				return nfsd_buffered_write(rqstp, file,
+							   nvecs, cnt, kiocb);
+			} else if (unlikely(host_err == -EINVAL)) {
+				struct inode *inode = d_inode(fhp->fh_dentry);
+
+				pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
+						    inode->i_sb->s_id, inode->i_ino);
+				host_err = -ESERVERFAULT;
+			}
+			return host_err;
+		}
+		*cnt += host_err;
+		if (host_err < iter[i].count) /* partial write? */
+			break;
+	}
+
+	return 0;
+}
+
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  struct nfsd_file *nf, loff_t offset, unsigned int nvecs,
+		  unsigned long *cnt, struct kiocb *kiocb)
+{
+	struct nfsd_write_dio write_dio;
+
+	/* Any buffered IO issued here will be misaligned, use
+	 * IOCB_SYNC to ensure it has completed before returning.
+	 */
+	kiocb->ki_flags |= IOCB_SYNC;
+	/* Check if IOCB_DONTCACHE should be used when issuing buffered IO;
+	 * if so, it will be ignored for any DIO issued here.
+	 */
+	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+		kiocb->ki_flags |= IOCB_DONTCACHE;
+
+	if (nfsd_is_write_dio_possible(offset, *cnt, nf, &write_dio)) {
+		trace_nfsd_write_direct(rqstp, fhp, offset, *cnt);
+		return nfsd_issue_write_dio(rqstp, fhp, nf, offset, nvecs,
+					    cnt, kiocb, &write_dio);
+	}
+
+	return nfsd_buffered_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
+}
+
 /**
  * nfsd_vfs_write - write data to an already-open file
  * @rqstp: RPC execution context
@@ -1281,7 +1478,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	struct super_block	*sb = file_inode(file)->i_sb;
 	struct kiocb		kiocb;
 	struct svc_export	*exp;
-	struct iov_iter		iter;
 	errseq_t		since;
 	__be32			nfserr;
 	int			host_err;
@@ -1318,25 +1514,29 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		kiocb.ki_flags |= IOCB_DSYNC;
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
-	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
 	since = READ_ONCE(file->f_wb_err);
 	if (verf)
 		nfsd_copy_write_verifier(verf, nn);
 
 	switch (nfsd_io_cache_write) {
-	case NFSD_IO_BUFFERED:
+	case NFSD_IO_DIRECT:
+		host_err = nfsd_direct_write(rqstp, fhp, nf, offset,
+					     nvecs, cnt, &kiocb);
 		break;
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
 			kiocb.ki_flags |= IOCB_DONTCACHE;
+		fallthrough; /* must call nfsd_buffered_write */
+	case NFSD_IO_BUFFERED:
+		host_err = nfsd_buffered_write(rqstp, file,
+					       nvecs, cnt, &kiocb);
 		break;
 	}
-	host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
 	if (host_err < 0) {
 		commit_reset_write_verifier(nn, rqstp, host_err);
 		goto out_nfserr;
 	}
-	*cnt = host_err;
 	nfsd_stats_io_write_add(nn, exp, *cnt);
 	fsnotify_modify(file);
 	host_err = filemap_check_wb_err(file->f_mapping, since);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-08 18:59     ` [PATCH v2] " Mike Snitzer
@ 2025-10-09 15:04       ` Jeff Layton
  2025-10-09 17:46       ` Chuck Lever
  1 sibling, 0 replies; 23+ messages in thread
From: Jeff Layton @ 2025-10-09 15:04 UTC (permalink / raw)
  To: Mike Snitzer, Chuck Lever
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Chuck Lever

On Wed, 2025-10-08 at 14:59 -0400, Mike Snitzer wrote:
> If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> middle and end as needed. The large middle extent is DIO-aligned and
> the start and/or end are misaligned. Buffered IO (with preference
> towards using DONTCACHE) is used for the misaligned extents and
> O_DIRECT is used for the middle DIO-aligned extent.
> 
> If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> invalidate the page cache on behalf of the DIO WRITE, then
> nfsd_issue_write_dio() will fall back to using buffered IO.
> 
> These changes served as the original starting point for the NFS
> client's misaligned O_DIRECT support that landed with commit
> c817248fc831 ("nfs/localio: add proper O_DIRECT support for READ and
> WRITE"). But NFSD's support is simpler because it currently doesn't
> use AIO completion.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/debugfs.c |   1 +
>  fs/nfsd/trace.h   |   1 +
>  fs/nfsd/vfs.c     | 210 ++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 207 insertions(+), 5 deletions(-)
> 
> changes in v2:
> - add improved -EINVAL logging that matches latest DIO READ support
> - update patch header
> 
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 00eb1ecef6ac..7f44689e0a53 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
>  	switch (val) {
>  	case NFSD_IO_BUFFERED:
>  	case NFSD_IO_DONTCACHE:
> +	case NFSD_IO_DIRECT:
>  		nfsd_io_cache_write = val;
>  		break;
>  	default:
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index bfd41236aff2..ad74439d0105 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
>  DEFINE_NFSD_IO_EVENT(read_done);
>  DEFINE_NFSD_IO_EVENT(write_start);
>  DEFINE_NFSD_IO_EVENT(write_opened);
> +DEFINE_NFSD_IO_EVENT(write_direct);
>  DEFINE_NFSD_IO_EVENT(write_io_done);
>  DEFINE_NFSD_IO_EVENT(write_done);
>  DEFINE_NFSD_IO_EVENT(commit_start);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 4ae08065debf..2420f568b378 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1254,6 +1254,203 @@ static int wait_for_concurrent_writes(struct file *file)
>  	return err;
>  }
>  
> +struct nfsd_write_dio {
> +	ssize_t	start_len;	/* Length for misaligned first extent */
> +	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
> +	ssize_t	end_len;	/* Length for misaligned last extent */
> +};
> +
> +static bool
> +nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> +		       struct nfsd_file *nf, struct nfsd_write_dio *write_dio)
> +{
> +	const u32 dio_blocksize = nf->nf_dio_offset_align;
> +	loff_t start_end, orig_end, middle_end;
> +
> +	if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
> +		return false;
> +	if (unlikely(dio_blocksize > PAGE_SIZE))
> +		return false;
> +	if (unlikely(len < dio_blocksize))
> +		return false;
> +
> +	start_end = round_up(offset, dio_blocksize);
> +	orig_end = offset + len;
> +	middle_end = round_down(orig_end, dio_blocksize);
> +
> +	write_dio->start_len = start_end - offset;
> +	write_dio->middle_len = middle_end - start_end;
> +	write_dio->end_len = orig_end - middle_end;
> +
> +	return true;
> +}
> +
> +static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
> +		unsigned int addr_mask, unsigned int len_mask)
> +{
> +	const struct bio_vec *bvec = i->bvec;
> +	size_t skip = i->iov_offset;
> +	size_t size = i->count;
> +
> +	if (size & len_mask)
> +		return false;
> +	do {
> +		size_t len = bvec->bv_len;
> +
> +		if (len > size)
> +			len = size;
> +		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> +			return false;
> +		bvec++;
> +		size -= len;
> +		skip = 0;
> +	} while (size);
> +
> +	return true;
> +}
> +
> +/*
> + * Setup as many as 3 iov_iter based on extents described by @write_dio.
> + * Returns the number of iov_iter that were setup.
> + */
> +static int
> +nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
> +			   struct bio_vec *rq_bvec, unsigned int nvecs,
> +			   unsigned long cnt, struct nfsd_write_dio *write_dio,
> +			   struct nfsd_file *nf)
> +{
> +	int n_iters = 0;
> +	struct iov_iter *iters = *iterp;
> +
> +	/* Setup misaligned start? */
> +	if (write_dio->start_len) {
> +		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> +		iters[n_iters].count = write_dio->start_len;
> +		iter_is_dio_aligned[n_iters] = false;
> +		++n_iters;
> +	}
> +
> +	/* Setup DIO-aligned middle */
> +	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> +	if (write_dio->start_len)
> +		iov_iter_advance(&iters[n_iters], write_dio->start_len);
> +	iters[n_iters].count -= write_dio->end_len;
> +	iter_is_dio_aligned[n_iters] =
> +		nfsd_iov_iter_aligned_bvec(&iters[n_iters],
> +				nf->nf_dio_mem_align-1, nf->nf_dio_offset_align-1);
> +	if (unlikely(!iter_is_dio_aligned[n_iters]))
> +		return 0; /* no DIO-aligned IO possible */
> +	++n_iters;
> +
> +	/* Setup misaligned end? */
> +	if (write_dio->end_len) {
> +		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> +		iov_iter_advance(&iters[n_iters],
> +				 write_dio->start_len + write_dio->middle_len);
> +		iter_is_dio_aligned[n_iters] = false;
> +		++n_iters;
> +	}
> +
> +	return n_iters;
> +}
> +
> +static int
> +nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> +		    unsigned int nvecs, unsigned long *cnt,
> +		    struct kiocb *kiocb)
> +{
> +	struct iov_iter iter;
> +	int host_err;
> +
> +	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> +	if (host_err < 0)
> +		return host_err;
> +	*cnt = host_err;
> +
> +	return 0;
> +}
> +
> +static int
> +nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
> +		     loff_t offset, unsigned int nvecs, unsigned long *cnt,
> +		     struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
> +{
> +	struct file *file = nf->nf_file;
> +	bool iter_is_dio_aligned[3];
> +	struct iov_iter iter_stack[3];
> +	struct iov_iter *iter = iter_stack;
> +	unsigned int n_iters = 0;
> +	unsigned long in_count = *cnt;
> +	loff_t in_offset = kiocb->ki_pos;
> +	ssize_t host_err;
> +
> +	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
> +				rqstp->rq_bvec, nvecs, *cnt, write_dio, nf);
> +	if (unlikely(!n_iters))
> +		return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
> +
> +	*cnt = 0;
> +	for (int i = 0; i < n_iters; i++) {
> +		if (iter_is_dio_aligned[i])
> +			kiocb->ki_flags |= IOCB_DIRECT;
> +		else
> +			kiocb->ki_flags &= ~IOCB_DIRECT;
> +
> +		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> +		if (host_err < 0) {
> +			/* VFS will return -ENOTBLK if DIO WRITE fails to
> +			 * invalidate the page cache. Retry using buffered IO.
> +			 */
> +			if (unlikely(host_err == -ENOTBLK)) {
> +				kiocb->ki_flags &= ~IOCB_DIRECT;
> +				*cnt = in_count;
> +				kiocb->ki_pos = in_offset;
> +				return nfsd_buffered_write(rqstp, file,
> +							   nvecs, cnt, kiocb);
> +			} else if (unlikely(host_err == -EINVAL)) {
> +				struct inode *inode = d_inode(fhp->fh_dentry);
> +
> +				pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
> +						    inode->i_sb->s_id, inode->i_ino);
> +				host_err = -ESERVERFAULT;
> +			}
> +			return host_err;
> +		}
> +		*cnt += host_err;
> +		if (host_err < iter[i].count) /* partial write? */
> +			break;
> +	}
> +
> +	return 0;
> +}
> +
> +static noinline_for_stack int
> +nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		  struct nfsd_file *nf, loff_t offset, unsigned int nvecs,
> +		  unsigned long *cnt, struct kiocb *kiocb)
> +{
> +	struct nfsd_write_dio write_dio;
> +
> +	/* Any buffered IO issued here will be misaligned, use
> +	 * IOCB_SYNC to ensure it has completed before returning.
> +	 */
> +	kiocb->ki_flags |= IOCB_SYNC;
> +	/* Check if IOCB_DONTCACHE should be used when issuing buffered IO;
> +	 * if so, it will be ignored for any DIO issued here.
> +	 */
> +	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> +		kiocb->ki_flags |= IOCB_DONTCACHE;
> +
> +	if (nfsd_is_write_dio_possible(offset, *cnt, nf, &write_dio)) {
> +		trace_nfsd_write_direct(rqstp, fhp, offset, *cnt);
> +		return nfsd_issue_write_dio(rqstp, fhp, nf, offset, nvecs,
> +					    cnt, kiocb, &write_dio);
> +	}
> +
> +	return nfsd_buffered_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
> +}
> +
>  /**
>   * nfsd_vfs_write - write data to an already-open file
>   * @rqstp: RPC execution context
> @@ -1281,7 +1478,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	struct super_block	*sb = file_inode(file)->i_sb;
>  	struct kiocb		kiocb;
>  	struct svc_export	*exp;
> -	struct iov_iter		iter;
>  	errseq_t		since;
>  	__be32			nfserr;
>  	int			host_err;
> @@ -1318,25 +1514,29 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		kiocb.ki_flags |= IOCB_DSYNC;
>  
>  	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> -	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +
>  	since = READ_ONCE(file->f_wb_err);
>  	if (verf)
>  		nfsd_copy_write_verifier(verf, nn);
>  
>  	switch (nfsd_io_cache_write) {
> -	case NFSD_IO_BUFFERED:
> +	case NFSD_IO_DIRECT:
> +		host_err = nfsd_direct_write(rqstp, fhp, nf, offset,
> +					     nvecs, cnt, &kiocb);
>  		break;
>  	case NFSD_IO_DONTCACHE:
>  		if (file->f_op->fop_flags & FOP_DONTCACHE)
>  			kiocb.ki_flags |= IOCB_DONTCACHE;
> +		fallthrough; /* must call nfsd_buffered_write */
> +	case NFSD_IO_BUFFERED:
> +		host_err = nfsd_buffered_write(rqstp, file,
> +					       nvecs, cnt, &kiocb);
>  		break;
>  	}
> -	host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
>  	if (host_err < 0) {
>  		commit_reset_write_verifier(nn, rqstp, host_err);
>  		goto out_nfserr;
>  	}
> -	*cnt = host_err;
>  	nfsd_stats_io_write_add(nn, exp, *cnt);
>  	fsnotify_modify(file);
>  	host_err = filemap_check_wb_err(file->f_mapping, since);


Nice work, Mike!

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-08 18:59     ` [PATCH v2] " Mike Snitzer
  2025-10-09 15:04       ` Jeff Layton
@ 2025-10-09 17:46       ` Chuck Lever
  2025-10-13 15:41         ` Mike Snitzer
  1 sibling, 1 reply; 23+ messages in thread
From: Chuck Lever @ 2025-10-09 17:46 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Chuck Lever

On 10/8/25 2:59 PM, Mike Snitzer wrote:
> +
> +static noinline_for_stack int
> +nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		  struct nfsd_file *nf, loff_t offset, unsigned int nvecs,
> +		  unsigned long *cnt, struct kiocb *kiocb)
> +{
> +	struct nfsd_write_dio write_dio;
> +
> +	/* Any buffered IO issued here will be misaligned, use
> +	 * IOCB_SYNC to ensure it has completed before returning.
> +	 */
> +	kiocb->ki_flags |= IOCB_SYNC;
> +	/* Check if IOCB_DONTCACHE should be used when issuing buffered IO;
> +	 * if so, it will be ignored for any DIO issued here.
> +	 */
> +	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> +		kiocb->ki_flags |= IOCB_DONTCACHE;
> +
> +	if (nfsd_is_write_dio_possible(offset, *cnt, nf, &write_dio)) {
> +		trace_nfsd_write_direct(rqstp, fhp, offset, *cnt);
> +		return nfsd_issue_write_dio(rqstp, fhp, nf, offset, nvecs,
> +					    cnt, kiocb, &write_dio);
> +	}
> +
> +	return nfsd_buffered_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
> +}

Handful of initial comments:

The current NFSv3 code path, and perhaps NFSv4 too, doesn't support
changing a WRITE marked as UNSTABLE to FILE_SYNC. I'm not seeing that
rectified in this patch. I have a patch that enables changing the
"stable_how" setting, but it's on a system at home. I'll post it in
a couple of days and we can build on that.

In nfsd_direct_write() we set IOCB_SYNC early. When we fall back to
buffered I/O, that setting is preserved, so the fallback buffered path
is now always FILE_SYNC. We should discuss whether the fallback in this
case should always be FILE_SYNC or should allow UNSTABLE.

Nit: I'd rather use the normal kernel comment style here, where an
initial "/*" appears on a line by itself.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-09 17:46       ` Chuck Lever
@ 2025-10-13 15:41         ` Mike Snitzer
  0 siblings, 0 replies; 23+ messages in thread
From: Mike Snitzer @ 2025-10-13 15:41 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

Hi Chuck,

On Thu, Oct 09, 2025 at 01:46:34PM -0400, Chuck Lever wrote:
> On 10/8/25 2:59 PM, Mike Snitzer wrote:
> > +
> > +static noinline_for_stack int
> > +nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > +		  struct nfsd_file *nf, loff_t offset, unsigned int nvecs,
> > +		  unsigned long *cnt, struct kiocb *kiocb)
> > +{
> > +	struct nfsd_write_dio write_dio;
> > +
> > +	/* Any buffered IO issued here will be misaligned, use
> > +	 * IOCB_SYNC to ensure it has completed before returning.
> > +	 */
> > +	kiocb->ki_flags |= IOCB_SYNC;
> > +	/* Check if IOCB_DONTCACHE should be used when issuing buffered IO;
> > +	 * if so, it will be ignored for any DIO issued here.
> > +	 */
> > +	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> > +		kiocb->ki_flags |= IOCB_DONTCACHE;
> > +
> > +	if (nfsd_is_write_dio_possible(offset, *cnt, nf, &write_dio)) {
> > +		trace_nfsd_write_direct(rqstp, fhp, offset, *cnt);
> > +		return nfsd_issue_write_dio(rqstp, fhp, nf, offset, nvecs,
> > +					    cnt, kiocb, &write_dio);
> > +	}
> > +
> > +	return nfsd_buffered_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
> > +}
> 
> Handful of initial comments:
> 
> The current NFSv3 code path, and perhaps NFSv4 too, doesn't support
> changing a WRITE marked as UNSTABLE to FILE_SYNC. I'm not seeing that
> rectified in this patch. I have a patch that enables changing the
> "stable_how" setting, but it's on a system at home. I'll post it in
> a couple of days and we can build on that.

Just a quick follow-up to see if you might have time to share your
patch and/or rebase NFSD Direct WRITE on top of it?

> In nfsd_direct_write() we set IOCB_SYNC early. When we fall back to
> buffered I/O, that setting is preserved, so the fallback buffered path
> is now always FILE_SYNC. We should discuss whether the fallback in this
> case should always be FILE_SYNC or should allow UNSTABLE.

Probably makes sense to set IOCB_SYNC _and_ IOCB_DONTCACHE only if
nfsd_is_write_dio_possible() returns true.  So inverting
nfsd_direct_write()'s flow like so:

   if (!nfsd_is_write_dio_possible())
      return nfsd_buffered_write()

   // could push setting these flags into nfsd_issue_write_dio?
   kiocb->ki_flags |= IOCB_SYNC;
   if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
      kiocb->ki_flags |= IOCB_DONTCACHE;

   trace_nfsd_write_direct(rqstp, fhp, offset, *cnt);
   return nfsd_issue_write_dio()
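
Spelled out against the v2 helpers (still just a sketch, untested):

   static noinline_for_stack int
   nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
                     struct nfsd_file *nf, loff_t offset, unsigned int nvecs,
                     unsigned long *cnt, struct kiocb *kiocb)
   {
           struct nfsd_write_dio write_dio;

           if (!nfsd_is_write_dio_possible(offset, *cnt, nf, &write_dio))
                   return nfsd_buffered_write(rqstp, nf->nf_file,
                                              nvecs, cnt, kiocb);

           /*
            * Any buffered IO issued on fallback from the DIO path will
            * be misaligned; use IOCB_SYNC to ensure it has completed
            * before returning. IOCB_DONTCACHE only matters for that
            * buffered fallback and is ignored for any DIO issued here.
            */
           kiocb->ki_flags |= IOCB_SYNC;
           if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
                   kiocb->ki_flags |= IOCB_DONTCACHE;

           trace_nfsd_write_direct(rqstp, fhp, offset, *cnt);
           return nfsd_issue_write_dio(rqstp, fhp, nf, offset, nvecs,
                                       cnt, kiocb, &write_dio);
   }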

But the devil is in the details, especially given the need to rebase
on top of your change to pass stable_how by reference, etc.

> Nit: I'd rather use the normal kernel comment style here, where an
> initial "/*" appears on a line by itself.

OK, will use that style going forward.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 23+ messages in thread
