[PATCH v7 00/14] NFSD: Implement NFSD_IO

* [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
@ 2025-10-24 14:42 Chuck Lever
  2025-10-24 14:42 ` [PATCH v7 01/14] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
                   ` (13 more replies)
  0 siblings, 14 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:42 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Following on https://lore.kernel.org/linux-nfs/aPAci7O_XK1ljaum@kernel.org/
this series includes the patches needed to make NFSD Direct WRITE
work.

Since the review comments have resulted in many changes, I've split
out these modifications into individual patches so that everyone can
easily follow my work. They can then be rejected, modified again,
squashed, or retained as separate patches, as we see fit.

An additional simplification might be possible if fallback were
handled completely inline: just never set the "use_dio" boolean on
any of the request's buffer segments.

Changes since v6:
* Patches to address review comments have been split out
* Refactored the iter initialization code

Changes since v5:
* Add a patch to make FILE_SYNC WRITEs persist timestamps
* Address some of Christoph's review comments
* The svcrdma patch has been dropped until we actually need it

Changes since v4:
* Split out refactoring nfsd_buffered_write() into a separate patch
* Expand patch description of 1/4
* Don't set IOCB_SYNC flag

Changes since v3:
* Address checkpatch.pl nits in 2/3
* Add an untested patch to mark ingress RDMA Read chunks

Chuck Lever (12):
  NFSD: Make FILE_SYNC WRITEs comply with spec
  NFSD: Enable return of an updated stable_how to NFS clients
  NFSD: @stable for direct writes is always NFS_FILE_SYNC
  NFSD: Always set IOCB_SYNC in direct write path
  NFSD: Remove specific error handling
  NFSD: Remove alignment size checking
  NFSD: Remove the len_mask check
  NFSD: Clean up synopsis of nfsd_iov_iter_aligned_bvec()
  NFSD: Clean up struct nfsd_write_dio
  NFSD: Introduce struct nfsd_write_dio_seg
  NFSD: Clean up direct write fall back error flow
  NFSD: Initialize separate ki_flags

Mike Snitzer (2):
  NFSD: Refactor nfsd_vfs_write()
  NFSD: Implement NFSD_IO_DIRECT for NFS WRITE

 fs/nfsd/debugfs.c  |   1 +
 fs/nfsd/nfs3proc.c |   2 +-
 fs/nfsd/nfs4proc.c |   2 +-
 fs/nfsd/nfsproc.c  |   3 +-
 fs/nfsd/trace.h    |   1 +
 fs/nfsd/vfs.c      | 213 ++++++++++++++++++++++++++++++++++++++++++---
 fs/nfsd/vfs.h      |   6 +-
 fs/nfsd/xdr3.h     |   2 +-
 8 files changed, 212 insertions(+), 18 deletions(-)

-- 
2.51.0



* [PATCH v7 01/14] NFSD: Make FILE_SYNC WRITEs comply with spec
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-24 14:42 ` Chuck Lever
  2025-10-24 15:21   ` Jeff Layton
  2025-10-27  8:02   ` Christoph Hellwig
  2025-10-24 14:42 ` [PATCH v7 02/14] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:42 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Mike Snitzer

From: Chuck Lever <chuck.lever@oracle.com>

Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
does not also persist file time stamps. To wit, Section 18.32.3
of RFC 8881 mandates:

> The client specifies with the stable parameter the method of how
> the data is to be processed by the server. If stable is
> FILE_SYNC4, the server MUST commit the data written plus all file
> system metadata to stable storage before returning results. This
> corresponds to the NFSv2 protocol semantics. Any other behavior
> constitutes a protocol violation. If stable is DATA_SYNC4, then
> the server MUST commit all of the data to stable storage and
> enough of the metadata to retrieve the data before returning.

For many years, NFSD has used a "data sync only" optimization for
FILE_SYNC WRITEs, in violation of the above text (and previous
incarnations of the NFS standard). File time stamps haven't been
persisted as the mandate above requires.

This behavior dates from a time when file systems on rotational
media were too slow to handle writes with time stamp updates. With
the advent of UNSTABLE WRITE, the time stamp update is done by the
COMMIT, which amortizes the cost of one time stamp update over
possibly many WRITE requests.

The impact of this change will be felt only when a client explicitly
requests a FILE_SYNC WRITE on a shared file system backed by slow
storage. UNSTABLE and DATA_SYNC WRITEs should not be affected.
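
As a rough userspace illustration of the flag selection described
above (not kernel code): the MODEL_* names below are hypothetical
stand-ins for the stable_how values and for IOCB_DSYNC / IOCB_SYNC,
and the flag bit values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the NFS stable_how values. */
enum model_stable_how {
	MODEL_UNSTABLE  = 0,
	MODEL_DATA_SYNC = 1,
	MODEL_FILE_SYNC = 2,
};

/* Hypothetical flag bits; the real IOCB_* values differ. */
#define MODEL_IOCB_DSYNC	(1u << 0)
#define MODEL_IOCB_SYNC		(1u << 1)

/*
 * Pick the sync flags for one WRITE. Nothing is set for UNSTABLE
 * requests or when write gathering is in use; FILE_SYNC asks for
 * data plus time stamps, DATA_SYNC for data only.
 */
static unsigned int
model_write_sync_flags(enum model_stable_how stable, bool use_wgather)
{
	assert(stable <= MODEL_FILE_SYNC);

	if (stable == MODEL_UNSTABLE || use_wgather)
		return 0;
	if (stable == MODEL_FILE_SYNC)
		return MODEL_IOCB_DSYNC | MODEL_IOCB_SYNC;
	return MODEL_IOCB_DSYNC;
}
```

With these stand-ins, only a FILE_SYNC request without write
gathering gets both bits; DATA_SYNC keeps the previous data-only
behavior.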

Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f537a7b4ee01..5a9a2a69bc08 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1314,8 +1314,14 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		stable = NFS_UNSTABLE;
 	init_sync_kiocb(&kiocb, file);
 	kiocb.ki_pos = offset;
-	if (stable && !fhp->fh_use_wgather)
-		kiocb.ki_flags |= IOCB_DSYNC;
+	if (stable && !fhp->fh_use_wgather) {
+		if (stable == NFS_FILE_SYNC)
+			/* persist data and timestamps */
+			kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
+		else
+			/* persist data only */
+			kiocb.ki_flags |= IOCB_DSYNC;
+	}
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
 	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
-- 
2.51.0



* [PATCH v7 02/14] NFSD: Enable return of an updated stable_how to NFS clients
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-10-24 14:42 ` [PATCH v7 01/14] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
@ 2025-10-24 14:42 ` Chuck Lever
  2025-10-27  8:03   ` Christoph Hellwig
  2025-10-24 14:42 ` [PATCH v7 03/14] NFSD: Refactor nfsd_vfs_write() Chuck Lever
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:42 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

NFSv3 and newer protocols enable clients to perform a two-phase
WRITE. A client requests an UNSTABLE WRITE, which sends dirty data
to the NFS server, but does not persist it. The server replies that
it performed the UNSTABLE WRITE, and the client is then obligated to
follow up with a COMMIT request before it can remove the dirty data
from its own page cache. The COMMIT reply is the client's guarantee
that the written data has been persisted on the server.

The purpose of this protocol design is to enable clients to send
a large amount of data via multiple WRITE requests to a server, and
then wait for persistence just once. The server is able to start
persisting the data as soon as it gets it, to shorten the length of
time the client has to wait for the final COMMIT to complete.

It's also possible for the server to respond to an UNSTABLE WRITE
request in a way that indicates that the data was persisted anyway.
In that case, the client can skip the COMMIT and remove the dirty
data from its memory immediately. NetApp filers, for example, do
this because they have a battery-backed cache and can guarantee that
written data is persisted immediately.

NFSD has never implemented this kind of promotion. UNSTABLE WRITE
requests are unconditionally treated as UNSTABLE. However, in a
subsequent patch, nfsd_vfs_write() will be able to promote an
UNSTABLE WRITE to be a FILE_SYNC WRITE. This will be because NFSD
will handle some WRITE requests locally with O_DIRECT, which
persists written data immediately. The FILE_SYNC WRITE response
indicates to the client that no follow-up COMMIT is necessary.

This patch prepares for that change by enabling the modified
stable_how value to be passed along to NFSD's WRITE reply encoder.
No behavior change is expected.
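
To illustrate the in/out convention, here is a small userspace
sketch (not kernel code). The function names are hypothetical, and
the promotion rule (a direct write is reported as FILE_SYNC) is an
assumption taken from the description of the later patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the NFS stable_how values. */
enum { MODEL_UNSTABLE = 0, MODEL_DATA_SYNC = 1, MODEL_FILE_SYNC = 2 };

/*
 * Server side: @stable_how carries the client's request in and the
 * stability level actually achieved out.
 */
static void model_server_write(uint32_t *stable_how, bool wrote_direct)
{
	assert(stable_how != NULL);

	if (wrote_direct)
		*stable_how = MODEL_FILE_SYNC;	/* promoted: data is durable */
}

/* Client side: a COMMIT is needed only if the reply says UNSTABLE. */
static bool model_client_needs_commit(uint32_t committed)
{
	return committed == MODEL_UNSTABLE;
}
```

A caller passes a pointer to its copy of the requested value and
encodes whatever comes back into the WRITE reply.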

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs3proc.c |  2 +-
 fs/nfsd/nfs4proc.c |  2 +-
 fs/nfsd/nfsproc.c  |  3 ++-
 fs/nfsd/vfs.c      | 11 ++++++-----
 fs/nfsd/vfs.h      |  6 ++++--
 fs/nfsd/xdr3.h     |  2 +-
 6 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index b6d03e1ef5f7..ad14b34583bb 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -236,7 +236,7 @@ nfsd3_proc_write(struct svc_rqst *rqstp)
 	resp->committed = argp->stable;
 	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
 				  &argp->payload, &cnt,
-				  resp->committed, resp->verf);
+				  &resp->committed, resp->verf);
 	resp->count = cnt;
 	resp->status = nfsd3_map_status(resp->status);
 	return rpc_success;
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 7f7e6bb23a90..2222bb283baf 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1285,7 +1285,7 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	write->wr_how_written = write->wr_stable_how;
 	status = nfsd_vfs_write(rqstp, &cstate->current_fh, nf,
 				write->wr_offset, &write->wr_payload,
-				&cnt, write->wr_how_written,
+				&cnt, &write->wr_how_written,
 				(__be32 *)write->wr_verifier.data);
 	nfsd_file_put(nf);
 
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 8f71f5748c75..706401ed6f8d 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -251,6 +251,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
 	struct nfsd_writeargs *argp = rqstp->rq_argp;
 	struct nfsd_attrstat *resp = rqstp->rq_resp;
 	unsigned long cnt = argp->len;
+	u32 committed = NFS_DATA_SYNC;
 
 	dprintk("nfsd: WRITE    %s %u bytes at %d\n",
 		SVCFH_fmt(&argp->fh),
@@ -258,7 +259,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
 
 	fh_copy(&resp->fh, &argp->fh);
 	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
-				  &argp->payload, &cnt, NFS_DATA_SYNC, NULL);
+				  &argp->payload, &cnt, &committed, NULL);
 	if (resp->status == nfs_ok)
 		resp->status = fh_getattr(&resp->fh, &resp->stat);
 	else if (resp->status == nfserr_jukebox)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5a9a2a69bc08..dc98182a9048 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1262,7 +1262,7 @@ static int wait_for_concurrent_writes(struct file *file)
  * @offset: Byte offset of start
  * @payload: xdr_buf containing the write payload
  * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
  * @verf: NFS WRITE verifier
  *
  * Upon return, caller must invoke fh_put on @fhp.
@@ -1274,11 +1274,12 @@ __be32
 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	       struct nfsd_file *nf, loff_t offset,
 	       const struct xdr_buf *payload, unsigned long *cnt,
-	       int stable, __be32 *verf)
+	       u32 *stable_how, __be32 *verf)
 {
 	struct nfsd_net		*nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
 	struct file		*file = nf->nf_file;
 	struct super_block	*sb = file_inode(file)->i_sb;
+	u32			stable = *stable_how;
 	struct kiocb		kiocb;
 	struct svc_export	*exp;
 	struct iov_iter		iter;
@@ -1440,7 +1441,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  * @offset: Byte offset of start
  * @payload: xdr_buf containing the write payload
  * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
  * @verf: NFS WRITE verifier
  *
  * Upon return, caller must invoke fh_put on @fhp.
@@ -1450,7 +1451,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  */
 __be32
 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
-	   const struct xdr_buf *payload, unsigned long *cnt, int stable,
+	   const struct xdr_buf *payload, unsigned long *cnt, u32 *stable_how,
 	   __be32 *verf)
 {
 	struct nfsd_file *nf;
@@ -1463,7 +1464,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
 		goto out;
 
 	err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt,
-			     stable, verf);
+			     stable_how, verf);
 	nfsd_file_put(nf);
 out:
 	trace_nfsd_write_done(rqstp, fhp, offset, *cnt);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index fa46f8b5f132..c713ed0b04e0 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -130,11 +130,13 @@ __be32		nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				u32 *eof);
 __be32		nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				loff_t offset, const struct xdr_buf *payload,
-				unsigned long *cnt, int stable, __be32 *verf);
+				unsigned long *cnt, u32 *stable_how,
+				__be32 *verf);
 __be32		nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				struct nfsd_file *nf, loff_t offset,
 				const struct xdr_buf *payload,
-				unsigned long *cnt, int stable, __be32 *verf);
+				unsigned long *cnt, u32 *stable_how,
+				__be32 *verf);
 __be32		nfsd_readlink(struct svc_rqst *, struct svc_fh *,
 				char *, int *);
 __be32		nfsd_symlink(struct svc_rqst *, struct svc_fh *,
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 522067b7fd75..c0e443ef3a6b 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -152,7 +152,7 @@ struct nfsd3_writeres {
 	__be32			status;
 	struct svc_fh		fh;
 	unsigned long		count;
-	int			committed;
+	u32			committed;
 	__be32			verf[2];
 };
 
-- 
2.51.0



* [PATCH v7 03/14] NFSD: Refactor nfsd_vfs_write()
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-10-24 14:42 ` [PATCH v7 01/14] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
  2025-10-24 14:42 ` [PATCH v7 02/14] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
@ 2025-10-24 14:42 ` Chuck Lever
  2025-10-27  8:04   ` Christoph Hellwig
  2025-10-24 14:42 ` [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:42 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Mike Snitzer

From: Mike Snitzer <snitzer@kernel.org>

Extract the common code that is to be used in the buffered and
dontcache I/O modes. This common code will also be used as the
fallback when direct I/O is requested but cannot be used.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index dc98182a9048..6076821bb541 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,22 @@ static int wait_for_concurrent_writes(struct file *file)
 	return err;
 }
 
+static int
+nfsd_iocb_write(struct file *file, struct bio_vec *bvec, unsigned int nvecs,
+		unsigned long *cnt, struct kiocb *kiocb)
+{
+	struct iov_iter iter;
+	int host_err;
+
+	iov_iter_bvec(&iter, ITER_SOURCE, bvec, nvecs, *cnt);
+	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
+	if (host_err < 0)
+		return host_err;
+
+	*cnt = host_err;
+	return 0;
+}
+
 /**
  * nfsd_vfs_write - write data to an already-open file
  * @rqstp: RPC execution context
@@ -1282,7 +1298,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	u32			stable = *stable_how;
 	struct kiocb		kiocb;
 	struct svc_export	*exp;
-	struct iov_iter		iter;
 	errseq_t		since;
 	__be32			nfserr;
 	int			host_err;
@@ -1325,25 +1340,25 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	}
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
-	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
 	since = READ_ONCE(file->f_wb_err);
 	if (verf)
 		nfsd_copy_write_verifier(verf, nn);
 
 	switch (nfsd_io_cache_write) {
-	case NFSD_IO_BUFFERED:
-		break;
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
 			kiocb.ki_flags |= IOCB_DONTCACHE;
+		fallthrough;
+	case NFSD_IO_BUFFERED:
+		host_err = nfsd_iocb_write(file, rqstp->rq_bvec, nvecs, cnt,
+					   &kiocb);
 		break;
 	}
-	host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
 	if (host_err < 0) {
 		commit_reset_write_verifier(nn, rqstp, host_err);
 		goto out_nfserr;
 	}
-	*cnt = host_err;
 	nfsd_stats_io_write_add(nn, exp, *cnt);
 	fsnotify_modify(file);
 	host_err = filemap_check_wb_err(file->f_mapping, since);
-- 
2.51.0



* [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (2 preceding siblings ...)
  2025-10-24 14:42 ` [PATCH v7 03/14] NFSD: Refactor nfsd_vfs_write() Chuck Lever
@ 2025-10-24 14:42 ` Chuck Lever
  2025-10-24 17:12   ` Mike Snitzer
                     ` (2 more replies)
  2025-10-24 14:42 ` [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC Chuck Lever
                   ` (9 subsequent siblings)
  13 siblings, 3 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:42 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Mike Snitzer

From: Mike Snitzer <snitzer@kernel.org>

If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
middle and end as needed. The large middle extent is DIO-aligned and
the start and/or end are misaligned. Synchronous buffered IO (with
preference towards using DONTCACHE) is used for the misaligned extents
and O_DIRECT is used for the middle DIO-aligned extent.
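
The split arithmetic can be sketched in userspace as follows. This
is only a model of the start/middle/end computation: the names are
hypothetical, and it leaves out the memory alignment and PAGE_SIZE
checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the three extent lengths. */
struct model_write_dio {
	int64_t start_len;	/* misaligned first extent */
	int64_t middle_len;	/* DIO-aligned middle extent */
	int64_t end_len;	/* misaligned last extent */
};

/*
 * Split [offset, offset + len) at multiples of @blocksize.
 * Returns false when the write is shorter than one block.
 */
static bool
model_split_write(int64_t offset, uint64_t len, uint32_t blocksize,
		  struct model_write_dio *out)
{
	int64_t start_end, orig_end, middle_end;

	/* The alignment must be a non-zero power of two. */
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	if (len < blocksize)
		return false;

	/* Round the start up and the end down to the block size. */
	start_end = (offset + blocksize - 1) & ~((int64_t)blocksize - 1);
	orig_end = offset + (int64_t)len;
	middle_end = orig_end & ~((int64_t)blocksize - 1);

	out->start_len = start_end - offset;
	out->middle_len = middle_end - start_end;
	out->end_len = orig_end - middle_end;
	return true;
}
```

For example, a 10000-byte WRITE at offset 1000 with a 4096-byte
block size splits into 3096 + 4096 + 2808 bytes.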

nfsd_issue_write_dio() promotes @stable_how to NFS_FILE_SYNC, which
allows the client to drop its dirty data and avoid needing an extra
COMMIT operation.

If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
invalidate the page cache on behalf of the DIO WRITE, then
nfsd_issue_write_dio() will fall back to using buffered IO.
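
A minimal userspace sketch of that fallback, assuming a hypothetical
backend callback in place of vfs_iocb_iter_write(). It models only a
single-segment request and none of the kiocb state; other patches in
this series revisit this error handling.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical backend: returns bytes written or a negative errno. */
typedef long (*model_write_fn)(long pos, size_t len, bool direct);

/*
 * Try direct I/O first. On -ENOTBLK, retry the whole request as
 * buffered I/O from the original position.
 */
static long
model_write_with_fallback(model_write_fn write_fn, long pos, size_t len,
			  bool *used_direct)
{
	long ret;

	assert(write_fn != NULL && used_direct != NULL);

	*used_direct = true;
	ret = write_fn(pos, len, true);
	if (ret == -ENOTBLK) {
		*used_direct = false;
		ret = write_fn(pos, len, false);
	}
	return ret;
}

/* Example backend that always refuses direct I/O. */
static long model_backend_no_dio(long pos, size_t len, bool direct)
{
	(void)pos;
	return direct ? -ENOTBLK : (long)len;
}
```

Any other negative return value is passed back to the caller
unchanged.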

These changes served as the original starting point for the NFS
client's misaligned O_DIRECT support that landed with
commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
READ and WRITE"). But NFSD's support is simpler because it currently
doesn't use AIO completion.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/debugfs.c |   1 +
 fs/nfsd/trace.h   |   1 +
 fs/nfsd/vfs.c     | 197 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 199 insertions(+)

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 00eb1ecef6ac..7f44689e0a53 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
 	switch (val) {
 	case NFSD_IO_BUFFERED:
 	case NFSD_IO_DONTCACHE:
+	case NFSD_IO_DIRECT:
 		nfsd_io_cache_write = val;
 		break;
 	default:
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index bfd41236aff2..ad74439d0105 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
 DEFINE_NFSD_IO_EVENT(read_done);
 DEFINE_NFSD_IO_EVENT(write_start);
 DEFINE_NFSD_IO_EVENT(write_opened);
+DEFINE_NFSD_IO_EVENT(write_direct);
 DEFINE_NFSD_IO_EVENT(write_io_done);
 DEFINE_NFSD_IO_EVENT(write_done);
 DEFINE_NFSD_IO_EVENT(commit_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 6076821bb541..2832a66cda5b 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,109 @@ static int wait_for_concurrent_writes(struct file *file)
 	return err;
 }
 
+struct nfsd_write_dio {
+	ssize_t	start_len;	/* Length for misaligned first extent */
+	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
+	ssize_t	end_len;	/* Length for misaligned last extent */
+};
+
+static bool
+nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
+			   struct nfsd_file *nf,
+			   struct nfsd_write_dio *write_dio)
+{
+	const u32 dio_blocksize = nf->nf_dio_offset_align;
+	loff_t start_end, orig_end, middle_end;
+
+	if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
+		return false;
+	if (unlikely(dio_blocksize > PAGE_SIZE))
+		return false;
+	if (unlikely(len < dio_blocksize))
+		return false;
+
+	start_end = round_up(offset, dio_blocksize);
+	orig_end = offset + len;
+	middle_end = round_down(orig_end, dio_blocksize);
+
+	write_dio->start_len = start_end - offset;
+	write_dio->middle_len = middle_end - start_end;
+	write_dio->end_len = orig_end - middle_end;
+
+	return true;
+}
+
+static bool
+nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
+			   unsigned int len_mask)
+{
+	const struct bio_vec *bvec = i->bvec;
+	size_t skip = i->iov_offset;
+	size_t size = i->count;
+
+	if (size & len_mask)
+		return false;
+	do {
+		size_t len = bvec->bv_len;
+
+		if (len > size)
+			len = size;
+		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
+			return false;
+		bvec++;
+		size -= len;
+		skip = 0;
+	} while (size);
+
+	return true;
+}
+
+/*
+ * Setup as many as 3 iov_iter based on extents described by @write_dio.
+ * Returns the number of iov_iter that were setup.
+ */
+static int
+nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
+			   struct bio_vec *rq_bvec, unsigned int nvecs,
+			   unsigned long cnt, struct nfsd_write_dio *write_dio,
+			   struct nfsd_file *nf)
+{
+	int n_iters = 0;
+	struct iov_iter *iters = *iterp;
+
+	/* Setup misaligned start? */
+	if (write_dio->start_len) {
+		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+		iters[n_iters].count = write_dio->start_len;
+		iter_is_dio_aligned[n_iters] = false;
+		++n_iters;
+	}
+
+	/* Setup DIO-aligned middle */
+	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+	if (write_dio->start_len)
+		iov_iter_advance(&iters[n_iters], write_dio->start_len);
+	iters[n_iters].count -= write_dio->end_len;
+	iter_is_dio_aligned[n_iters] =
+		nfsd_iov_iter_aligned_bvec(&iters[n_iters],
+					   nf->nf_dio_mem_align - 1,
+					   nf->nf_dio_offset_align - 1);
+	if (unlikely(!iter_is_dio_aligned[n_iters]))
+		return 0; /* no DIO-aligned IO possible */
+	++n_iters;
+
+	/* Setup misaligned end? */
+	if (write_dio->end_len) {
+		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+		iov_iter_advance(&iters[n_iters],
+				 write_dio->start_len + write_dio->middle_len);
+		iter_is_dio_aligned[n_iters] = false;
+		++n_iters;
+	}
+
+	return n_iters;
+}
+
 static int
 nfsd_iocb_write(struct file *file, struct bio_vec *bvec, unsigned int nvecs,
 		unsigned long *cnt, struct kiocb *kiocb)
@@ -1270,6 +1373,95 @@ nfsd_iocb_write(struct file *file, struct bio_vec *bvec, unsigned int nvecs,
 	return 0;
 }
 
+static int
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
+		     u32 *stable_how, unsigned int nvecs, unsigned long *cnt,
+		     struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
+{
+	struct file *file = nf->nf_file;
+	bool iter_is_dio_aligned[3];
+	struct iov_iter iter_stack[3];
+	struct iov_iter *iter = iter_stack;
+	unsigned int n_iters = 0;
+	unsigned long in_count = *cnt;
+	loff_t in_offset = kiocb->ki_pos;
+	ssize_t host_err;
+
+	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
+					     rqstp->rq_bvec, nvecs, *cnt,
+					     write_dio, nf);
+	if (unlikely(!n_iters))
+		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
+				       cnt, kiocb);
+
+	trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
+
+	/*
+	 * Any buffered IO issued here will be misaligned, use
+	 * sync IO to ensure it has completed before returning.
+	 * Also update @stable_how to avoid need for COMMIT.
+	 */
+	kiocb->ki_flags |= IOCB_DSYNC;
+	*stable_how = NFS_FILE_SYNC;
+
+	*cnt = 0;
+	for (int i = 0; i < n_iters; i++) {
+		if (iter_is_dio_aligned[i])
+			kiocb->ki_flags |= IOCB_DIRECT;
+		else
+			kiocb->ki_flags &= ~IOCB_DIRECT;
+
+		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
+		if (host_err < 0) {
+			/*
+			 * VFS will return -ENOTBLK if DIO WRITE fails to
+			 * invalidate the page cache. Retry using buffered IO.
+			 */
+			if (unlikely(host_err == -ENOTBLK)) {
+				kiocb->ki_flags &= ~IOCB_DIRECT;
+				*cnt = in_count;
+				kiocb->ki_pos = in_offset;
+				return nfsd_iocb_write(file, rqstp->rq_bvec,
+						       nvecs, cnt, kiocb);
+			} else if (unlikely(host_err == -EINVAL)) {
+				struct inode *inode = d_inode(fhp->fh_dentry);
+
+				pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
+						    inode->i_sb->s_id, inode->i_ino);
+				host_err = -ESERVERFAULT;
+			}
+			return host_err;
+		}
+		*cnt += host_err;
+		if (host_err < iter[i].count) /* partial write? */
+			break;
+	}
+
+	return 0;
+}
+
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
+		  unsigned long *cnt, struct kiocb *kiocb)
+{
+	struct nfsd_write_dio write_dio;
+
+	/*
+	 * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
+	 * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
+	 * be ignored for any DIO issued here).
+	 */
+	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+		kiocb->ki_flags |= IOCB_DONTCACHE;
+
+	if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
+		return nfsd_issue_write_dio(rqstp, fhp, nf, stable_how, nvecs,
+					    cnt, kiocb, &write_dio);
+
+	return nfsd_iocb_write(nf->nf_file, rqstp->rq_bvec, nvecs, cnt, kiocb);
+}
+
 /**
  * nfsd_vfs_write - write data to an already-open file
  * @rqstp: RPC execution context
@@ -1346,6 +1538,11 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		nfsd_copy_write_verifier(verf, nn);
 
 	switch (nfsd_io_cache_write) {
+	case NFSD_IO_DIRECT:
+		host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
+					     nvecs, cnt, &kiocb);
+		stable = *stable_how;
+		break;
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
 			kiocb.ki_flags |= IOCB_DONTCACHE;
-- 
2.51.0



* [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (3 preceding siblings ...)
  2025-10-24 14:42 ` [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-24 14:42 ` Chuck Lever
  2025-10-24 15:22   ` Jeff Layton
  2025-10-27  8:05   ` Christoph Hellwig
  2025-10-24 14:42 ` [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path Chuck Lever
                   ` (8 subsequent siblings)
  13 siblings, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:42 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Clean up: The helpers in the nfsd_direct_write() code path never set
stable_how to anything but NFS_FILE_SYNC. All data writes in this
code path result in immediate durability.

Instead of passing it through the stack of functions, just set it
after the call is done.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 2832a66cda5b..cd2c99e450fb 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1374,9 +1374,10 @@ nfsd_iocb_write(struct file *file, struct bio_vec *bvec, unsigned int nvecs,
 }
 
 static int
-nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
-		     u32 *stable_how, unsigned int nvecs, unsigned long *cnt,
-		     struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		     struct nfsd_file *nf, unsigned int nvecs,
+		     unsigned long *cnt, struct kiocb *kiocb,
+		     struct nfsd_write_dio *write_dio)
 {
 	struct file *file = nf->nf_file;
 	bool iter_is_dio_aligned[3];
@@ -1399,10 +1400,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_fil
 	/*
 	 * Any buffered IO issued here will be misaligned, use
 	 * sync IO to ensure it has completed before returning.
-	 * Also update @stable_how to avoid need for COMMIT.
 	 */
 	kiocb->ki_flags |= IOCB_DSYNC;
-	*stable_how = NFS_FILE_SYNC;
 
 	*cnt = 0;
 	for (int i = 0; i < n_iters; i++) {
@@ -1442,7 +1441,7 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_fil
 
 static noinline_for_stack int
 nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
-		  struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
+		  struct nfsd_file *nf, unsigned int nvecs,
 		  unsigned long *cnt, struct kiocb *kiocb)
 {
 	struct nfsd_write_dio write_dio;
@@ -1456,8 +1455,8 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		kiocb->ki_flags |= IOCB_DONTCACHE;
 
 	if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
-		return nfsd_issue_write_dio(rqstp, fhp, nf, stable_how, nvecs,
-					    cnt, kiocb, &write_dio);
+		return nfsd_issue_write_dio(rqstp, fhp, nf, nvecs, cnt, kiocb,
+					    &write_dio);
 
 	return nfsd_iocb_write(nf->nf_file, rqstp->rq_bvec, nvecs, cnt, kiocb);
 }
@@ -1539,9 +1538,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 
 	switch (nfsd_io_cache_write) {
 	case NFSD_IO_DIRECT:
-		host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
-					     nvecs, cnt, &kiocb);
-		stable = *stable_how;
+		host_err = nfsd_direct_write(rqstp, fhp, nf, nvecs, cnt,
+					     &kiocb);
+		stable = *stable_how = NFS_FILE_SYNC;
 		break;
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
-- 
2.51.0



* [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (4 preceding siblings ...)
  2025-10-24 14:42 ` [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC Chuck Lever
@ 2025-10-24 14:42 ` Chuck Lever
  2025-10-24 15:22   ` Jeff Layton
  2025-10-27  8:08   ` Christoph Hellwig
  2025-10-24 14:42 ` [PATCH v7 07/14] NFSD: Remove specific error handling Chuck Lever
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:42 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The NFS specs mandate that, for an NFS_FILE_SYNC write, file
metadata (e.g., time stamps) is durable before the server sends the
response.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index cd2c99e450fb..74fcb12bf19c 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1397,12 +1397,6 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
 
 	trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
 
-	/*
-	 * Any buffered IO issued here will be misaligned, use
-	 * sync IO to ensure it has completed before returning.
-	 */
-	kiocb->ki_flags |= IOCB_DSYNC;
-
 	*cnt = 0;
 	for (int i = 0; i < n_iters; i++) {
 		if (iter_is_dio_aligned[i])
@@ -1454,6 +1448,17 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
 		kiocb->ki_flags |= IOCB_DONTCACHE;
 
+	/*
+	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should persist
+	 * both written data and dirty time stamps.
+	 *
+	 * When falling back to buffered I/O or handling the unaligned
+	 * first and last segments, the data and time stamps must be
+	 * durable before nfsd_vfs_write() returns to its caller, matching
+	 * the behavior of direct I/O.
+	 */
+	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
+
 	if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
 		return nfsd_issue_write_dio(rqstp, fhp, nf, nvecs, cnt, kiocb,
 					    &write_dio);
-- 
2.51.0



* [PATCH v7 07/14] NFSD: Remove specific error handling
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (5 preceding siblings ...)
  2025-10-24 14:42 ` [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path Chuck Lever
@ 2025-10-24 14:42 ` Chuck Lever
  2025-10-24 15:22   ` Jeff Layton
  2025-10-24 14:43 ` [PATCH v7 08/14] NFSD: Remove alignment size checking Chuck Lever
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:42 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Christoph Hellwig

From: Chuck Lever <chuck.lever@oracle.com>

1. Christoph notes that ENOTBLK is not supposed to leak out of
   file systems, so it's unlikely or impossible to see that error
   code here.

2. There are several ways to get EINVAL on a write, and the least
   likely of those is a dio alignment problem. The warning here
   would be misleading in those more common cases.

It's unlikely that an administrator can do anything about either
of these cases, should they appear on a production system.

The trace_nfsd_write_done event will be able to record these errnos.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 20 +-------------------
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 74fcb12bf19c..b50be92343e3 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1405,26 +1405,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
 			kiocb->ki_flags &= ~IOCB_DIRECT;
 
 		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
-		if (host_err < 0) {
-			/*
-			 * VFS will return -ENOTBLK if DIO WRITE fails to
-			 * invalidate the page cache. Retry using buffered IO.
-			 */
-			if (unlikely(host_err == -ENOTBLK)) {
-				kiocb->ki_flags &= ~IOCB_DIRECT;
-				*cnt = in_count;
-				kiocb->ki_pos = in_offset;
-				return nfsd_iocb_write(file, rqstp->rq_bvec,
-						       nvecs, cnt, kiocb);
-			} else if (unlikely(host_err == -EINVAL)) {
-				struct inode *inode = d_inode(fhp->fh_dentry);
-
-				pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
-						    inode->i_sb->s_id, inode->i_ino);
-				host_err = -ESERVERFAULT;
-			}
+		if (host_err < 0)
 			return host_err;
-		}
 		*cnt += host_err;
 		if (host_err < iter[i].count) /* partial write? */
 			break;
-- 
2.51.0



* [PATCH v7 08/14] NFSD: Remove alignment size checking
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (6 preceding siblings ...)
  2025-10-24 14:42 ` [PATCH v7 07/14] NFSD: Remove specific error handling Chuck Lever
@ 2025-10-24 14:43 ` Chuck Lever
  2025-10-24 15:22   ` Jeff Layton
  2025-10-27  8:09   ` Christoph Hellwig
  2025-10-24 14:43 ` [PATCH v7 09/14] NFSD: Remove the len_mask check Chuck Lever
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:43 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Christoph Hellwig

From: Chuck Lever <chuck.lever@oracle.com>

No in-tree file system currently supports a DIO alignment larger
than PAGE_SIZE, so this check is unnecessary.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index b50be92343e3..465d4d091f3d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1270,8 +1270,6 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
 
 	if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
 		return false;
-	if (unlikely(dio_blocksize > PAGE_SIZE))
-		return false;
 	if (unlikely(len < dio_blocksize))
 		return false;
 
-- 
2.51.0



* [PATCH v7 09/14] NFSD: Remove the len_mask check
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (7 preceding siblings ...)
  2025-10-24 14:43 ` [PATCH v7 08/14] NFSD: Remove alignment size checking Chuck Lever
@ 2025-10-24 14:43 ` Chuck Lever
  2025-10-24 15:23   ` Jeff Layton
  2025-10-24 17:16   ` Mike Snitzer
  2025-10-24 14:43 ` [PATCH v7 10/14] NFSD: Clean up synopsis of nfsd_iov_iter_aligned_bvec() Chuck Lever
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:43 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Mike Snitzer

From: Chuck Lever <chuck.lever@oracle.com>

Mike says:
> > Hey Mike, I'm trying to understand when nfsd_is_write_dio_possible()
> > would return true but nfsd_iov_iter_aligned_bvec() on the middle segment
> > would return false.
>
> It is always due to memory alignment (addr_mask check), never due to
> logical alignment (len_mask check).
>
> So we could remove the len_mask arg and the 'if (size & len_mask)'
> check from nfsd_iov_iter_aligned_bvec
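
To illustrate why the length check is redundant, here is a minimal
userspace sketch of the split arithmetic. It assumes the block size is
a nonzero power of two; the names split_write() and struct split are
hypothetical, and the bit masks stand in for the kernel's round_up()
and round_down(). Because both ends of the middle segment are rounded
to the block size, its length is always a multiple of the block size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the lengths computed by
 * nfsd_is_write_dio_possible(); not the kernel structure. */
struct split {
	uint64_t first_len;	/* misaligned head */
	uint64_t middle_len;	/* block-aligned middle */
	uint64_t last_len;	/* misaligned tail */
};

/* Assumes blocksize is a nonzero power of two and len >= blocksize. */
static struct split split_write(uint64_t offset, uint64_t len,
				uint64_t blocksize)
{
	uint64_t first_end, orig_end, middle_end;
	struct split s;

	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	assert(len >= blocksize);

	/* Equivalent of round_up(offset, blocksize) */
	first_end = (offset + blocksize - 1) & ~(blocksize - 1);
	orig_end = offset + len;
	/* Equivalent of round_down(orig_end, blocksize) */
	middle_end = orig_end & ~(blocksize - 1);

	s.first_len = first_end - offset;
	s.middle_len = middle_end - first_end;
	s.last_len = orig_end - middle_end;
	return s;
}
```

For example, an 8192-byte write at offset 100 with a 512-byte block
size splits into 412 + 7680 + 100 bytes, and 7680 is 15 whole blocks.
Only the memory alignment of the buffers remains to be checked.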

Suggested-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 465d4d091f3d..f6810630bb65 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1285,15 +1285,12 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
 }
 
 static bool
-nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
-			   unsigned int len_mask)
+nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask)
 {
 	const struct bio_vec *bvec = i->bvec;
 	size_t skip = i->iov_offset;
 	size_t size = i->count;
 
-	if (size & len_mask)
-		return false;
 	do {
 		size_t len = bvec->bv_len;
 
-- 
2.51.0



* [PATCH v7 10/14] NFSD: Clean up synopsis of nfsd_iov_iter_aligned_bvec()
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (8 preceding siblings ...)
  2025-10-24 14:43 ` [PATCH v7 09/14] NFSD: Remove the len_mask check Chuck Lever
@ 2025-10-24 14:43 ` Chuck Lever
  2025-10-24 15:24   ` Jeff Layton
  2025-10-24 14:43 ` [PATCH v7 11/14] NFSD: Clean up struct nfsd_write_dio Chuck Lever
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:43 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Clean up: Keep the specifics of the alignment checking inside of
nfsd_iov_iter_aligned_bvec(). Move the calculations of the alignment
parameters into the function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f6810630bb65..e7c3458bd178 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1285,8 +1285,9 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
 }
 
 static bool
-nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask)
+nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
 {
+	unsigned int addr_mask = nf->nf_dio_mem_align - 1;
 	const struct bio_vec *bvec = i->bvec;
 	size_t skip = i->iov_offset;
 	size_t size = i->count;
@@ -1333,9 +1334,7 @@ nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
 		iov_iter_advance(&iters[n_iters], write_dio->start_len);
 	iters[n_iters].count -= write_dio->end_len;
 	iter_is_dio_aligned[n_iters] =
-		nfsd_iov_iter_aligned_bvec(&iters[n_iters],
-					   nf->nf_dio_mem_align - 1,
-					   nf->nf_dio_offset_align - 1);
+		nfsd_iov_iter_aligned_bvec(nf, &iters[n_iters]);
 	if (unlikely(!iter_is_dio_aligned[n_iters]))
 		return 0; /* no DIO-aligned IO possible */
 	++n_iters;
-- 
2.51.0



* [PATCH v7 11/14] NFSD: Clean up struct nfsd_write_dio
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (9 preceding siblings ...)
  2025-10-24 14:43 ` [PATCH v7 10/14] NFSD: Clean up synopsis of nfsd_iov_iter_aligned_bvec() Chuck Lever
@ 2025-10-24 14:43 ` Chuck Lever
  2025-10-24 15:26   ` Jeff Layton
  2025-10-24 17:20   ` Mike Snitzer
  2025-10-24 14:43 ` [PATCH v7 12/14] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:43 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Prepare for moving more common arguments into the shared per-request
structure.

The first step is to move the target nfsd_file into that structure,
as it needs to be available in several functions.

As a clean-up, rename the structure to nfsd_write_dio_args, adopting
the common naming for a structure that carries the arguments for a
number of functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 61 ++++++++++++++++++++++++++-------------------------
 1 file changed, 31 insertions(+), 30 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index e7c3458bd178..429f5fc61ead 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,21 +1254,22 @@ static int wait_for_concurrent_writes(struct file *file)
 	return err;
 }
 
-struct nfsd_write_dio {
+struct nfsd_write_dio_args {
+	struct nfsd_file		*nf;
+
 	ssize_t	start_len;	/* Length for misaligned first extent */
 	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
 	ssize_t	end_len;	/* Length for misaligned last extent */
 };
 
 static bool
-nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
-			   struct nfsd_file *nf,
-			   struct nfsd_write_dio *write_dio)
+nfsd_is_write_dio_possible(struct nfsd_write_dio_args *args, loff_t offset,
+			   unsigned long len)
 {
-	const u32 dio_blocksize = nf->nf_dio_offset_align;
+	const u32 dio_blocksize = args->nf->nf_dio_offset_align;
 	loff_t start_end, orig_end, middle_end;
 
-	if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
+	if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
 		return false;
 	if (unlikely(len < dio_blocksize))
 		return false;
@@ -1277,9 +1278,9 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
 	orig_end = offset + len;
 	middle_end = round_down(orig_end, dio_blocksize);
 
-	write_dio->start_len = start_end - offset;
-	write_dio->middle_len = middle_end - start_end;
-	write_dio->end_len = orig_end - middle_end;
+	args->start_len = start_end - offset;
+	args->middle_len = middle_end - start_end;
+	args->end_len = orig_end - middle_end;
 
 	return true;
 }
@@ -1314,36 +1315,35 @@ nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
 static int
 nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
 			   struct bio_vec *rq_bvec, unsigned int nvecs,
-			   unsigned long cnt, struct nfsd_write_dio *write_dio,
-			   struct nfsd_file *nf)
+			   unsigned long cnt, struct nfsd_write_dio_args *args)
 {
 	int n_iters = 0;
 	struct iov_iter *iters = *iterp;
 
 	/* Setup misaligned start? */
-	if (write_dio->start_len) {
+	if (args->start_len) {
 		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
-		iters[n_iters].count = write_dio->start_len;
+		iters[n_iters].count = args->start_len;
 		iter_is_dio_aligned[n_iters] = false;
 		++n_iters;
 	}
 
 	/* Setup DIO-aligned middle */
 	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
-	if (write_dio->start_len)
-		iov_iter_advance(&iters[n_iters], write_dio->start_len);
-	iters[n_iters].count -= write_dio->end_len;
+	if (args->start_len)
+		iov_iter_advance(&iters[n_iters], args->start_len);
+	iters[n_iters].count -= args->end_len;
 	iter_is_dio_aligned[n_iters] =
-		nfsd_iov_iter_aligned_bvec(nf, &iters[n_iters]);
+		nfsd_iov_iter_aligned_bvec(args->nf, &iters[n_iters]);
 	if (unlikely(!iter_is_dio_aligned[n_iters]))
 		return 0; /* no DIO-aligned IO possible */
 	++n_iters;
 
 	/* Setup misaligned end? */
-	if (write_dio->end_len) {
+	if (args->end_len) {
 		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
 		iov_iter_advance(&iters[n_iters],
-				 write_dio->start_len + write_dio->middle_len);
+				 args->start_len + args->middle_len);
 		iter_is_dio_aligned[n_iters] = false;
 		++n_iters;
 	}
@@ -1369,11 +1369,10 @@ nfsd_iocb_write(struct file *file, struct bio_vec *bvec, unsigned int nvecs,
 
 static int
 nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
-		     struct nfsd_file *nf, unsigned int nvecs,
-		     unsigned long *cnt, struct kiocb *kiocb,
-		     struct nfsd_write_dio *write_dio)
+		     struct nfsd_write_dio_args *args, struct kiocb *kiocb,
+		     unsigned int nvecs, unsigned long *cnt)
 {
-	struct file *file = nf->nf_file;
+	struct file *file = args->nf->nf_file;
 	bool iter_is_dio_aligned[3];
 	struct iov_iter iter_stack[3];
 	struct iov_iter *iter = iter_stack;
@@ -1384,7 +1383,7 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
 
 	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
 					     rqstp->rq_bvec, nvecs, *cnt,
-					     write_dio, nf);
+					     args);
 	if (unlikely(!n_iters))
 		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
 				       cnt, kiocb);
@@ -1414,14 +1413,15 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		  struct nfsd_file *nf, unsigned int nvecs,
 		  unsigned long *cnt, struct kiocb *kiocb)
 {
-	struct nfsd_write_dio write_dio;
+	struct file *file = nf->nf_file;
+	struct nfsd_write_dio_args args;
 
 	/*
 	 * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
 	 * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
 	 * be ignored for any DIO issued here).
 	 */
-	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+	if (file->f_op->fop_flags & FOP_DONTCACHE)
 		kiocb->ki_flags |= IOCB_DONTCACHE;
 
 	/*
@@ -1435,11 +1435,12 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	 */
 	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
 
-	if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
-		return nfsd_issue_write_dio(rqstp, fhp, nf, nvecs, cnt, kiocb,
-					    &write_dio);
+	args.nf = nf;
+	if (nfsd_is_write_dio_possible(&args, kiocb->ki_pos, *cnt))
+		return nfsd_issue_write_dio(rqstp, fhp, &args, kiocb,
+					    nvecs, cnt);
 
-	return nfsd_iocb_write(nf->nf_file, rqstp->rq_bvec, nvecs, cnt, kiocb);
+	return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs, cnt, kiocb);
 }
 
 /**
-- 
2.51.0



* [PATCH v7 12/14] NFSD: Introduce struct nfsd_write_dio_seg
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (10 preceding siblings ...)
  2025-10-24 14:43 ` [PATCH v7 11/14] NFSD: Clean up struct nfsd_write_dio Chuck Lever
@ 2025-10-24 14:43 ` Chuck Lever
  2025-10-24 15:30   ` Jeff Layton
  2025-10-24 17:57   ` Mike Snitzer
  2025-10-24 14:43 ` [PATCH v7 13/14] NFSD: Clean up direct write fall back error flow Chuck Lever
  2025-10-24 14:43 ` [PATCH v7 14/14] NFSD: Initialize separate ki_flags Chuck Lever
  13 siblings, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:43 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Christoph Hellwig

From: Chuck Lever <chuck.lever@oracle.com>

Passing iter arrays by reference is a little risky. Instead, pass a
struct with a fixed-size array so bounds checking can be done.

Name each item in the array a "segment", as the term "extent"
generally refers to a set of blocks on storage, not to a buffer.
Each segment is processed via a single vfs_iocb_iter_write() call,
and is either IOCB_DIRECT or buffered.

Introduce a segment constructor function so each segment is
initialized identically.

Each segment has its own length. The loop that iterates over the
segment array can simply skip over the segments of zero length.
A count of segments is not needed.
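
The shape of that loop can be sketched in userspace as follows. This
is an illustration only: struct seg, struct write_args, and
issue_segments() are hypothetical stand-ins, and byte counters take
the place of the vfs_iocb_iter_write() calls. Indexing the fixed-size
array directly keeps the zero-length skip and the advance to the next
segment in one place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-in: only len and use_dio mirror the patch. */
struct seg {
	size_t len;
	bool use_dio;
};

struct write_args {
	struct seg segment[3];	/* first, middle, last */
	size_t direct_bytes;	/* bytes that would go via direct I/O */
	size_t buffered_bytes;	/* bytes that would go via the page cache */
};

/* Visit every slot in the fixed-size array and skip the empty ones.
 * Returns the number of writes that would be issued. */
static size_t issue_segments(struct write_args *args)
{
	size_t i, issued = 0;

	assert(args != NULL);
	args->direct_bytes = 0;
	args->buffered_bytes = 0;

	for (i = 0; i < ARRAY_SIZE(args->segment); i++) {
		const struct seg *segment = &args->segment[i];

		if (segment->len == 0)
			continue;
		if (segment->use_dio)
			args->direct_bytes += segment->len;
		else
			args->buffered_bytes += segment->len;
		issued++;
	}
	return issued;
}
```

With an empty first segment, a 4096-byte direct middle segment, and a
100-byte buffered last segment, the sketch issues two writes and needs
no separate segment count.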

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 121 ++++++++++++++++++++++++--------------------------
 1 file changed, 57 insertions(+), 64 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 429f5fc61ead..b7f217aa4994 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,12 +1254,15 @@ static int wait_for_concurrent_writes(struct file *file)
 	return err;
 }
 
+struct nfsd_write_dio_seg {
+	struct iov_iter			iter;
+	size_t				len;
+	bool				use_dio;
+};
+
 struct nfsd_write_dio_args {
 	struct nfsd_file		*nf;
-
-	ssize_t	start_len;	/* Length for misaligned first extent */
-	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
-	ssize_t	end_len;	/* Length for misaligned last extent */
+	struct nfsd_write_dio_seg	segment[3];
 };
 
 static bool
@@ -1267,21 +1270,19 @@ nfsd_is_write_dio_possible(struct nfsd_write_dio_args *args, loff_t offset,
 			   unsigned long len)
 {
 	const u32 dio_blocksize = args->nf->nf_dio_offset_align;
-	loff_t start_end, orig_end, middle_end;
+	loff_t first_end, orig_end, middle_end;
 
 	if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
 		return false;
 	if (unlikely(len < dio_blocksize))
 		return false;
 
-	start_end = round_up(offset, dio_blocksize);
+	first_end = round_up(offset, dio_blocksize);
 	orig_end = offset + len;
 	middle_end = round_down(orig_end, dio_blocksize);
-
-	args->start_len = start_end - offset;
-	args->middle_len = middle_end - start_end;
-	args->end_len = orig_end - middle_end;
-
+	args->segment[0].len = first_end - offset;	/* first segment */
+	args->segment[1].len = middle_end - first_end;	/* middle segment */
+	args->segment[2].len = orig_end - middle_end;	/* last segment */
 	return true;
 }
 
@@ -1308,47 +1309,42 @@ nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
 	return true;
 }
 
-/*
- * Setup as many as 3 iov_iter based on extents described by @write_dio.
- * Returns the number of iov_iter that were setup.
- */
-static int
-nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
-			   struct bio_vec *rq_bvec, unsigned int nvecs,
-			   unsigned long cnt, struct nfsd_write_dio_args *args)
+static void
+nfsd_setup_write_dio_seg(struct nfsd_write_dio_seg *segment,
+			 struct bio_vec *bvec, unsigned int nvecs,
+			 unsigned long total, size_t start)
 {
-	int n_iters = 0;
-	struct iov_iter *iters = *iterp;
+	iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
+	if (start)
+		iov_iter_advance(&segment->iter, start);
+	iov_iter_truncate(&segment->iter, segment->len);
+	segment->use_dio = false;
+}
 
-	/* Setup misaligned start? */
-	if (args->start_len) {
-		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
-		iters[n_iters].count = args->start_len;
-		iter_is_dio_aligned[n_iters] = false;
-		++n_iters;
-	}
+static bool
+nfsd_setup_write_dio_iters(struct nfsd_write_dio_args *args,
+			   struct bio_vec *bvec, unsigned int nvecs,
+			   unsigned long total)
+{
+	/* first segment */
+	if (args->segment[0].len)
+		nfsd_setup_write_dio_seg(&args->segment[0],
+					 bvec, nvecs, total, 0);
 
-	/* Setup DIO-aligned middle */
-	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
-	if (args->start_len)
-		iov_iter_advance(&iters[n_iters], args->start_len);
-	iters[n_iters].count -= args->end_len;
-	iter_is_dio_aligned[n_iters] =
-		nfsd_iov_iter_aligned_bvec(args->nf, &iters[n_iters]);
-	if (unlikely(!iter_is_dio_aligned[n_iters]))
-		return 0; /* no DIO-aligned IO possible */
-	++n_iters;
+	/* middle segment */
+	nfsd_setup_write_dio_seg(&args->segment[1], bvec, nvecs, total,
+				 args->segment[0].len);
+	if (!nfsd_iov_iter_aligned_bvec(args->nf, &args->segment[1].iter))
+		return false; /* no DIO-aligned IO possible */
+	args->segment[1].use_dio = true;
 
-	/* Setup misaligned end? */
-	if (args->end_len) {
-		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
-		iov_iter_advance(&iters[n_iters],
-				 args->start_len + args->middle_len);
-		iter_is_dio_aligned[n_iters] = false;
-		++n_iters;
-	}
+	/* last segment */
+	if (args->segment[2].len)
+		nfsd_setup_write_dio_seg(&args->segment[2], bvec, nvecs,
+					 total, args->segment[0].len +
+					 args->segment[1].len);
 
-	return n_iters;
+	return true;
 }
 
 static int
@@ -1373,36 +1369,33 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		     unsigned int nvecs, unsigned long *cnt)
 {
 	struct file *file = args->nf->nf_file;
-	bool iter_is_dio_aligned[3];
-	struct iov_iter iter_stack[3];
-	struct iov_iter *iter = iter_stack;
-	unsigned int n_iters = 0;
-	unsigned long in_count = *cnt;
-	loff_t in_offset = kiocb->ki_pos;
+	struct nfsd_write_dio_seg *segment;
 	ssize_t host_err;
+	size_t i;
 
-	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
-					     rqstp->rq_bvec, nvecs, *cnt,
-					     args);
-	if (unlikely(!n_iters))
+	if (!nfsd_setup_write_dio_iters(args, rqstp->rq_bvec, nvecs, *cnt))
 		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
 				       cnt, kiocb);
 
-	trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
-
 	*cnt = 0;
-	for (int i = 0; i < n_iters; i++) {
-		if (iter_is_dio_aligned[i])
+	segment = args->segment;
+	for (i = 0; i < ARRAY_SIZE(args->segment); i++) {
+		if (segment->len == 0)
+			continue;
+		if (segment->use_dio) {
 			kiocb->ki_flags |= IOCB_DIRECT;
-		else
+			trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
+						segment->len);
+		} else
 			kiocb->ki_flags &= ~IOCB_DIRECT;
 
-		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
+		host_err = vfs_iocb_iter_write(file, kiocb, &segment->iter);
 		if (host_err < 0)
 			return host_err;
 		*cnt += host_err;
-		if (host_err < iter[i].count) /* partial write? */
+		if (host_err < segment->iter.count)
 			break;
+		++segment;
 	}
 
 	return 0;
-- 
2.51.0



* [PATCH v7 13/14] NFSD: Clean up direct write fall back error flow
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (11 preceding siblings ...)
  2025-10-24 14:43 ` [PATCH v7 12/14] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
@ 2025-10-24 14:43 ` Chuck Lever
  2025-10-24 15:32   ` Jeff Layton
  2025-10-24 18:01   ` Mike Snitzer
  2025-10-24 14:43 ` [PATCH v7 14/14] NFSD: Initialize separate ki_flags Chuck Lever
  13 siblings, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:43 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Clean up: Use the usual error flow form:

	if (uncommon condition) {
		handle it;
		return;
	}
	do common thing;

in nfsd_direct_write(). Now there is a single place where the direct
write path falls back to a single cached write.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index b7f217aa4994..b0e4105e0075 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1373,10 +1373,6 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	ssize_t host_err;
 	size_t i;
 
-	if (!nfsd_setup_write_dio_iters(args, rqstp->rq_bvec, nvecs, *cnt))
-		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
-				       cnt, kiocb);
-
 	*cnt = 0;
 	segment = args->segment;
 	for (i = 0; i < ARRAY_SIZE(args->segment); i++) {
@@ -1388,7 +1384,6 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
 						segment->len);
 		} else
 			kiocb->ki_flags &= ~IOCB_DIRECT;
-
 		host_err = vfs_iocb_iter_write(file, kiocb, &segment->iter);
 		if (host_err < 0)
 			return host_err;
@@ -1429,11 +1424,13 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
 
 	args.nf = nf;
-	if (nfsd_is_write_dio_possible(&args, kiocb->ki_pos, *cnt))
-		return nfsd_issue_write_dio(rqstp, fhp, &args, kiocb,
-					    nvecs, cnt);
+	if (!nfsd_is_write_dio_possible(&args, kiocb->ki_pos, *cnt) ||
+	    !nfsd_setup_write_dio_iters(&args, rqstp->rq_bvec, nvecs, *cnt))
+		/* fall back to writing through the page cache */
+		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
+				       cnt, kiocb);
 
-	return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs, cnt, kiocb);
+	return nfsd_issue_write_dio(rqstp, fhp, &args, kiocb, nvecs, cnt);
 }
 
 /**
-- 
2.51.0



* [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (12 preceding siblings ...)
  2025-10-24 14:43 ` [PATCH v7 13/14] NFSD: Clean up direct write fall back error flow Chuck Lever
@ 2025-10-24 14:43 ` Chuck Lever
  2025-10-24 15:34   ` Jeff Layton
  2025-10-24 18:13   ` Mike Snitzer
  13 siblings, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 14:43 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Christoph Hellwig

From: Chuck Lever <chuck.lever@oracle.com>

Christoph says:
> > +	if (file->f_op->fop_flags & FOP_DONTCACHE)
> > +		kiocb->ki_flags |= IOCB_DONTCACHE;
> IOCB_DONTCACHE isn't defined for IOCB_DIRECT.  So this should
> move into a branch just for buffered I/O.

Instead, let's set up separate ki_flags for buffered I/O and for
direct I/O requests. Then we don't have to set a jumble of flag
bits in a single flags field.
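
As a rough userspace sketch of the idea, the two flag sets can be
derived once from a common base and then assigned whole. The X_*
constants below are illustrative bit values, not the kernel's IOCB_*
definitions, and build_write_flags() and struct write_flags are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; NOT the kernel's IOCB_* values. */
enum {
	X_DSYNC     = 1 << 0,
	X_SYNC      = 1 << 1,
	X_DIRECT    = 1 << 2,
	X_DONTCACHE = 1 << 3,
};

struct write_flags {
	int buffered;	/* for unaligned segments and fallback I/O */
	int direct;	/* for the aligned middle segment */
};

static struct write_flags
build_write_flags(int base, bool file_supports_dontcache)
{
	struct write_flags f;

	/* The base flags must not already select an I/O mode. */
	assert(!(base & (X_DIRECT | X_DONTCACHE)));

	/* Buffered writes persist data and time stamps, and drop the
	 * cached pages afterwards when the file system supports it. */
	f.buffered = base | X_SYNC | X_DSYNC;
	if (file_supports_dontcache)
		f.buffered |= X_DONTCACHE;

	/* Direct writes never carry the dontcache bit. */
	f.direct = base | X_SYNC | X_DIRECT;
	return f;
}
```

Each write then assigns one of the two precomputed values to the
flags field instead of setting and clearing individual bits, so the
dontcache bit can never accompany a direct write.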

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index b0e4105e0075..b7b9f8cf0452 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1262,6 +1262,8 @@ struct nfsd_write_dio_seg {
 
 struct nfsd_write_dio_args {
 	struct nfsd_file		*nf;
+	int				flags_buffered;
+	int				flags_direct;
 	struct nfsd_write_dio_seg	segment[3];
 };
 
@@ -1379,11 +1381,11 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		if (segment->len == 0)
 			continue;
 		if (segment->use_dio) {
-			kiocb->ki_flags |= IOCB_DIRECT;
+			kiocb->ki_flags = args->flags_direct;
 			trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
 						segment->len);
 		} else
-			kiocb->ki_flags &= ~IOCB_DIRECT;
+			kiocb->ki_flags = args->flags_buffered;
 		host_err = vfs_iocb_iter_write(file, kiocb, &segment->iter);
 		if (host_err < 0)
 			return host_err;
@@ -1405,30 +1407,27 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	struct nfsd_write_dio_args args;
 
 	/*
-	 * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
-	 * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
-	 * be ignored for any DIO issued here).
+	 * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
+	 * writing unaligned segments or handling fallback I/O.
 	 */
+	args.flags_buffered = kiocb->ki_flags | IOCB_SYNC | IOCB_DSYNC;
 	if (file->f_op->fop_flags & FOP_DONTCACHE)
-		kiocb->ki_flags |= IOCB_DONTCACHE;
+		args.flags_buffered |= IOCB_DONTCACHE;
 
 	/*
-	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should persist
-	 * both written data and dirty time stamps.
-	 *
-	 * When falling back to buffered I/O or handling the unaligned
-	 * first and last segments, the data and time stamps must be
-	 * durable before nfsd_vfs_write() returns to its caller, matching
-	 * the behavior of direct I/O.
+	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should
+	 * persist both written data and dirty time stamps.
 	 */
-	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
+	args.flags_direct = kiocb->ki_flags | IOCB_SYNC | IOCB_DIRECT;
 
 	args.nf = nf;
 	if (!nfsd_is_write_dio_possible(&args, kiocb->ki_pos, *cnt) ||
-	    !nfsd_setup_write_dio_iters(&args, rqstp->rq_bvec, nvecs, *cnt))
+	    !nfsd_setup_write_dio_iters(&args, rqstp->rq_bvec, nvecs, *cnt)) {
 		/* fall back to writing through the page cache */
+		kiocb->ki_flags = args.flags_buffered;
 		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
 				       cnt, kiocb);
+	}
 
 	return nfsd_issue_write_dio(rqstp, fhp, &args, kiocb, nvecs, cnt);
 }
-- 
2.51.0



* Re: [PATCH v7 01/14] NFSD: Make FILE_SYNC WRITEs comply with spec
  2025-10-24 14:42 ` [PATCH v7 01/14] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
@ 2025-10-24 15:21   ` Jeff Layton
  2025-10-27  8:02   ` Christoph Hellwig
  1 sibling, 0 replies; 87+ messages in thread
From: Jeff Layton @ 2025-10-24 15:21 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Mike Snitzer

On Fri, 2025-10-24 at 10:42 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
> does not also persist file time stamps. To wit, Section 18.32.3
> of RFC 8881 mandates:
> 
> > The client specifies with the stable parameter the method of how
> > the data is to be processed by the server. If stable is
> > FILE_SYNC4, the server MUST commit the data written plus all file
> > system metadata to stable storage before returning results. This
> > corresponds to the NFSv2 protocol semantics. Any other behavior
> > constitutes a protocol violation. If stable is DATA_SYNC4, then
> > the server MUST commit all of the data to stable storage and
> > enough of the metadata to retrieve the data before returning.
> 
> For many years, NFSD has used a "data sync only" optimization for
> FILE_SYNC WRITEs, in violation of the above text (and previous
> incarnations of the NFS standard). File time stamps haven't been
> persisted as the mandate above requires.
> 
> The purpose of this behavior is that, back in the day, file systems
> on rotational media were too slow to handle writes with time stamp
> updates. With the advent of UNSTABLE WRITE, the time stamp update is
> done by the COMMIT, which amortizes the cost of one time stamp
> update over possibly many WRITE requests.
> 
> The impact of this change will be felt only when a client explicitly
> requests a FILE_SYNC WRITE on a shared file system backed by slow
> storage. UNSTABLE and DATA_SYNC WRITEs should not be affected.
> 
> Reported-by: Mike Snitzer <snitzer@kernel.org>
> Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index f537a7b4ee01..5a9a2a69bc08 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1314,8 +1314,14 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		stable = NFS_UNSTABLE;
>  	init_sync_kiocb(&kiocb, file);
>  	kiocb.ki_pos = offset;
> -	if (stable && !fhp->fh_use_wgather)
> -		kiocb.ki_flags |= IOCB_DSYNC;
> +	if (stable && !fhp->fh_use_wgather) {
> +		if (stable == NFS_FILE_SYNC)
> +			/* persist data and timestamps */
> +			kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
> +		else
> +			/* persist data only */
> +			kiocb.ki_flags |= IOCB_DSYNC;
> +	}
>  
>  	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
>  	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);

Reviewed-by: Jeff Layton <jlayton@kernel.org>
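
The flag selection in the hunk above can be sketched outside the kernel. This is only an illustration: the SKETCH_* names and their numeric values are stand-ins invented for this example, not the kernel's definitions, and sketch_write_sync_flags() is a hypothetical helper that merely mirrors the patched branch.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in values for this sketch only; the kernel defines the real ones. */
enum sketch_stable_how {
	SKETCH_UNSTABLE = 0,
	SKETCH_DATA_SYNC = 1,
	SKETCH_FILE_SYNC = 2,
};

#define SKETCH_IOCB_DSYNC	(1u << 1)	/* persist data only */
#define SKETCH_IOCB_SYNC	(1u << 2)	/* also persist time stamps */

/* Mirror the patched branch: pick sync flags from the WRITE's stable_how. */
static unsigned int sketch_write_sync_flags(enum sketch_stable_how stable,
					    bool use_wgather)
{
	assert(stable <= SKETCH_FILE_SYNC);

	/* UNSTABLE writes and write gathering add no sync flags here. */
	if (stable == SKETCH_UNSTABLE || use_wgather)
		return 0;
	if (stable == SKETCH_FILE_SYNC)
		return SKETCH_IOCB_DSYNC | SKETCH_IOCB_SYNC;
	/* DATA_SYNC: data only */
	return SKETCH_IOCB_DSYNC;
}
```

Under these assumptions, only a FILE_SYNC request without write gathering yields both flags; DATA_SYNC keeps the older data-only behavior.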


* Re: [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC
  2025-10-24 14:42 ` [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC Chuck Lever
@ 2025-10-24 15:22   ` Jeff Layton
  2025-10-24 15:23     ` Chuck Lever
  2025-10-27  8:05   ` Christoph Hellwig
  1 sibling, 1 reply; 87+ messages in thread
From: Jeff Layton @ 2025-10-24 15:22 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

On Fri, 2025-10-24 at 10:42 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Clean up: The helpers in the nfsd_direct_write() code path don't set
> stable_how to anything else but NFS_FILE_SYNC. All data writes in
> this code path result in immediate durability.
> 
> Instead of passing it through the stack of functions, just set it
> after the call is done.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 21 ++++++++++-----------
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 2832a66cda5b..cd2c99e450fb 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1374,9 +1374,10 @@ nfsd_iocb_write(struct file *file, struct bio_vec *bvec, unsigned int nvecs,
>  }
>  
>  static int
> -nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
> -		     u32 *stable_how, unsigned int nvecs, unsigned long *cnt,
> -		     struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
> +nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		     struct nfsd_file *nf, unsigned int nvecs,
> +		     unsigned long *cnt, struct kiocb *kiocb,
> +		     struct nfsd_write_dio *write_dio)
>  {
>  	struct file *file = nf->nf_file;
>  	bool iter_is_dio_aligned[3];
> @@ -1399,10 +1400,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_fil
>  	/*
>  	 * Any buffered IO issued here will be misaligned, use
>  	 * sync IO to ensure it has completed before returning.
> -	 * Also update @stable_how to avoid need for COMMIT.
>  	 */
>  	kiocb->ki_flags |= IOCB_DSYNC;
> -	*stable_how = NFS_FILE_SYNC;
>  
>  	*cnt = 0;
>  	for (int i = 0; i < n_iters; i++) {
> @@ -1442,7 +1441,7 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_fil
>  
>  static noinline_for_stack int
>  nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> -		  struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
> +		  struct nfsd_file *nf, unsigned int nvecs,
>  		  unsigned long *cnt, struct kiocb *kiocb)
>  {
>  	struct nfsd_write_dio write_dio;
> @@ -1456,8 +1455,8 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		kiocb->ki_flags |= IOCB_DONTCACHE;
>  
>  	if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
> -		return nfsd_issue_write_dio(rqstp, fhp, nf, stable_how, nvecs,
> -					    cnt, kiocb, &write_dio);
> +		return nfsd_issue_write_dio(rqstp, fhp, nf, nvecs, cnt, kiocb,
> +					    &write_dio);
>  
>  	return nfsd_iocb_write(nf->nf_file, rqstp->rq_bvec, nvecs, cnt, kiocb);
>  }
> @@ -1539,9 +1538,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  
>  	switch (nfsd_io_cache_write) {
>  	case NFSD_IO_DIRECT:
> -		host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
> -					     nvecs, cnt, &kiocb);
> -		stable = *stable_how;
> +		host_err = nfsd_direct_write(rqstp, fhp, nf, nvecs, cnt,
> +					     &kiocb);
> +		stable = *stable_how = NFS_FILE_SYNC;
>  		break;
>  	case NFSD_IO_DONTCACHE:
>  		if (file->f_op->fop_flags & FOP_DONTCACHE)

I assume you're going to squash some of these changes into the original
patches?

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path
  2025-10-24 14:42 ` [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path Chuck Lever
@ 2025-10-24 15:22   ` Jeff Layton
  2025-10-27  8:08   ` Christoph Hellwig
  1 sibling, 0 replies; 87+ messages in thread
From: Jeff Layton @ 2025-10-24 15:22 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

On Fri, 2025-10-24 at 10:42 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> The NFS specs mandate that an NFS_FILE_SYNC write means that file
> metadata (e.g., time stamps) are durable before the server sends the
> response.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index cd2c99e450fb..74fcb12bf19c 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1397,12 +1397,6 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  
>  	trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
>  
> -	/*
> -	 * Any buffered IO issued here will be misaligned, use
> -	 * sync IO to ensure it has completed before returning.
> -	 */
> -	kiocb->ki_flags |= IOCB_DSYNC;
> -
>  	*cnt = 0;
>  	for (int i = 0; i < n_iters; i++) {
>  		if (iter_is_dio_aligned[i])
> @@ -1454,6 +1448,17 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
>  		kiocb->ki_flags |= IOCB_DONTCACHE;
>  
> +	/*
> +	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should persist
> +	 * both written data and dirty time stamps.
> +	 *
> +	 * When falling back to buffered I/O or handling the unaligned
> +	 * first and last segments, the data and time stamps must be
> +	 * durable before nfsd_vfs_write() returns to its caller, matching
> +	 * the behavior of direct I/O.
> +	 */
> +	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
> +
>  	if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
>  		return nfsd_issue_write_dio(rqstp, fhp, nf, nvecs, cnt, kiocb,
>  					    &write_dio);

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v7 07/14] NFSD: Remove specific error handling
  2025-10-24 14:42 ` [PATCH v7 07/14] NFSD: Remove specific error handling Chuck Lever
@ 2025-10-24 15:22   ` Jeff Layton
  0 siblings, 0 replies; 87+ messages in thread
From: Jeff Layton @ 2025-10-24 15:22 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Christoph Hellwig

On Fri, 2025-10-24 at 10:42 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> 1. Christoph notes that ENOTBLK is not supposed to leak out of
>    file systems, so it's unlikely or impossible to see that error
>    code here.
> 
> 2. There are several ways to get EINVAL on a write, and the least
>    likely of those is a dio alignment problem. The warning here
>    would be misleading in those more common cases.
> 
> It's unlikely that an administrator can do anything about either
> of these cases, should they appear on a production system.
> 
> The trace_nfsd_write_done event will be able to record these errnos.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 20 +-------------------
>  1 file changed, 1 insertion(+), 19 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 74fcb12bf19c..b50be92343e3 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1405,26 +1405,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  			kiocb->ki_flags &= ~IOCB_DIRECT;
>  
>  		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> -		if (host_err < 0) {
> -			/*
> -			 * VFS will return -ENOTBLK if DIO WRITE fails to
> -			 * invalidate the page cache. Retry using buffered IO.
> -			 */
> -			if (unlikely(host_err == -ENOTBLK)) {
> -				kiocb->ki_flags &= ~IOCB_DIRECT;
> -				*cnt = in_count;
> -				kiocb->ki_pos = in_offset;
> -				return nfsd_iocb_write(file, rqstp->rq_bvec,
> -						       nvecs, cnt, kiocb);
> -			} else if (unlikely(host_err == -EINVAL)) {
> -				struct inode *inode = d_inode(fhp->fh_dentry);
> -
> -				pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
> -						    inode->i_sb->s_id, inode->i_ino);
> -				host_err = -ESERVERFAULT;
> -			}
> +		if (host_err < 0)
>  			return host_err;
> -		}
>  		*cnt += host_err;
>  		if (host_err < iter[i].count) /* partial write? */
>  			break;

Reviewed-by: Jeff Layton <jlayton@kernel.org>
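
With the special cases gone, the per-segment loop reduces to three rules: return a negative error at once, accumulate the bytes written, and stop after a short write. A user-space sketch of that control flow follows. Everything named sketch_* is hypothetical, the writer callback stands in for vfs_iocb_iter_write(), and the short-write test compares against the requested length rather than an iterator count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical writer callback: returns bytes written or a negative errno. */
typedef long (*sketch_write_fn)(size_t seg_len, void *ctx);

/* Stub writer for illustration: writes until a byte budget runs out. */
struct sketch_budget {
	long remaining;		/* bytes the stub will still accept */
	long fail_errno;	/* if nonzero, every write fails with it */
};

static long sketch_budget_writer(size_t len, void *ctx)
{
	struct sketch_budget *b = ctx;
	long n;

	if (b->fail_errno)
		return -b->fail_errno;
	n = (long)len < b->remaining ? (long)len : b->remaining;
	b->remaining -= n;
	return n;
}

/* Issue each segment in order, as the simplified loop does. */
static long sketch_issue_segments(const size_t *seg_len, size_t nsegs,
				  sketch_write_fn write_seg, void *ctx,
				  size_t *written)
{
	assert(seg_len && write_seg && written);

	*written = 0;
	for (size_t i = 0; i < nsegs; i++) {
		long ret = write_seg(seg_len[i], ctx);

		if (ret < 0)
			return ret;		/* pass the error through */
		*written += (size_t)ret;
		if ((size_t)ret < seg_len[i])	/* partial write? */
			break;
	}
	return 0;
}
```

In this sketch a short write is not an error: the function returns 0 and the caller learns how much was written from *written, which matches how the patched loop reports progress through *cnt.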


* Re: [PATCH v7 08/14] NFSD: Remove alignment size checking
  2025-10-24 14:43 ` [PATCH v7 08/14] NFSD: Remove alignment size checking Chuck Lever
@ 2025-10-24 15:22   ` Jeff Layton
  2025-10-27  8:09   ` Christoph Hellwig
  1 sibling, 0 replies; 87+ messages in thread
From: Jeff Layton @ 2025-10-24 15:22 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Christoph Hellwig

On Fri, 2025-10-24 at 10:43 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> The current in-tree file systems do not support alignments
> larger than PAGE_SIZE, so this check is unnecessary.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index b50be92343e3..465d4d091f3d 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1270,8 +1270,6 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
>  
>  	if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
>  		return false;
> -	if (unlikely(dio_blocksize > PAGE_SIZE))
> -		return false;
>  	if (unlikely(len < dio_blocksize))
>  		return false;
>  

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v7 09/14] NFSD: Remove the len_mask check
  2025-10-24 14:43 ` [PATCH v7 09/14] NFSD: Remove the len_mask check Chuck Lever
@ 2025-10-24 15:23   ` Jeff Layton
  2025-10-24 17:16   ` Mike Snitzer
  1 sibling, 0 replies; 87+ messages in thread
From: Jeff Layton @ 2025-10-24 15:23 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Mike Snitzer

On Fri, 2025-10-24 at 10:43 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Mike says:
> > > Hey Mike, I'm trying to understand when nfsd_is_write_dio_possible()
> > > would return true but nfsd_iov_iter_aligned_bvec() on the middle segment
> > > would return false.
> > 
> > It is always due to memory alignment (addr_mask check), never due to
> > logical alignment (len_mask check).
> > 
> > So we could remove the len_mask arg and the 'if (size & len_mask)'
> > check from nfsd_iov_iter_aligned_bvec
> 
> Suggested-by: Mike Snitzer <snitzer@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 465d4d091f3d..f6810630bb65 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1285,15 +1285,12 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
>  }
>  
>  static bool
> -nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> -			   unsigned int len_mask)
> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask)
>  {
>  	const struct bio_vec *bvec = i->bvec;
>  	size_t skip = i->iov_offset;
>  	size_t size = i->count;
>  
> -	if (size & len_mask)
> -		return false;
>  	do {
>  		size_t len = bvec->bv_len;
>  

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC
  2025-10-24 15:22   ` Jeff Layton
@ 2025-10-24 15:23     ` Chuck Lever
  0 siblings, 0 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 15:23 UTC (permalink / raw)
  To: Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

On 10/24/25 11:22 AM, Jeff Layton wrote:
> On Fri, 2025-10-24 at 10:42 -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Clean up: The helpers in the nfsd_direct_write() code path don't set
>> stable_how to anything else but NFS_FILE_SYNC. All data writes in
>> this code path result in immediate durability.
>>
>> Instead of passing it through the stack of functions, just set it
>> after the call is done.
>>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>  fs/nfsd/vfs.c | 21 ++++++++++-----------
>>  1 file changed, 10 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 2832a66cda5b..cd2c99e450fb 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -1374,9 +1374,10 @@ nfsd_iocb_write(struct file *file, struct bio_vec *bvec, unsigned int nvecs,
>>  }
>>  
>>  static int
>> -nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>> -		     u32 *stable_how, unsigned int nvecs, unsigned long *cnt,
>> -		     struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
>> +nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> +		     struct nfsd_file *nf, unsigned int nvecs,
>> +		     unsigned long *cnt, struct kiocb *kiocb,
>> +		     struct nfsd_write_dio *write_dio)
>>  {
>>  	struct file *file = nf->nf_file;
>>  	bool iter_is_dio_aligned[3];
>> @@ -1399,10 +1400,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_fil
>>  	/*
>>  	 * Any buffered IO issued here will be misaligned, use
>>  	 * sync IO to ensure it has completed before returning.
>> -	 * Also update @stable_how to avoid need for COMMIT.
>>  	 */
>>  	kiocb->ki_flags |= IOCB_DSYNC;
>> -	*stable_how = NFS_FILE_SYNC;
>>  
>>  	*cnt = 0;
>>  	for (int i = 0; i < n_iters; i++) {
>> @@ -1442,7 +1441,7 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_fil
>>  
>>  static noinline_for_stack int
>>  nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>> -		  struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
>> +		  struct nfsd_file *nf, unsigned int nvecs,
>>  		  unsigned long *cnt, struct kiocb *kiocb)
>>  {
>>  	struct nfsd_write_dio write_dio;
>> @@ -1456,8 +1455,8 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>  		kiocb->ki_flags |= IOCB_DONTCACHE;
>>  
>>  	if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
>> -		return nfsd_issue_write_dio(rqstp, fhp, nf, stable_how, nvecs,
>> -					    cnt, kiocb, &write_dio);
>> +		return nfsd_issue_write_dio(rqstp, fhp, nf, nvecs, cnt, kiocb,
>> +					    &write_dio);
>>  
>>  	return nfsd_iocb_write(nf->nf_file, rqstp->rq_bvec, nvecs, cnt, kiocb);
>>  }
>> @@ -1539,9 +1538,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>  
>>  	switch (nfsd_io_cache_write) {
>>  	case NFSD_IO_DIRECT:
>> -		host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
>> -					     nvecs, cnt, &kiocb);
>> -		stable = *stable_how;
>> +		host_err = nfsd_direct_write(rqstp, fhp, nf, nvecs, cnt,
>> +					     &kiocb);
>> +		stable = *stable_how = NFS_FILE_SYNC;
>>  		break;
>>  	case NFSD_IO_DONTCACHE:
>>  		if (file->f_op->fop_flags & FOP_DONTCACHE)
> 
> I assume you're going to squash some of these changes into the original
> patches?

As the cover letter mentions, they can be squashed, rejected, or updated
individually... yes, squashing may occur. ;-)


> Reviewed-by: Jeff Layton <jlayton@kernel.org>


-- 
Chuck Lever


* Re: [PATCH v7 10/14] NFSD: Clean up synopsis of nfsd_iov_iter_aligned_bvec()
  2025-10-24 14:43 ` [PATCH v7 10/14] NFSD: Clean up synopsis of nfsd_iov_iter_aligned_bvec() Chuck Lever
@ 2025-10-24 15:24   ` Jeff Layton
  0 siblings, 0 replies; 87+ messages in thread
From: Jeff Layton @ 2025-10-24 15:24 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

On Fri, 2025-10-24 at 10:43 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Clean up: Keep the specifics of the alignment checking inside of
> nfsd_iov_iter_aligned_bvec(). Move the calculations of the alignment
> parameters into the function.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index f6810630bb65..e7c3458bd178 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1285,8 +1285,9 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
>  }
>  
>  static bool
> -nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask)
> +nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
>  {
> +	unsigned int addr_mask = nf->nf_dio_mem_align - 1;
>  	const struct bio_vec *bvec = i->bvec;
>  	size_t skip = i->iov_offset;
>  	size_t size = i->count;
> @@ -1333,9 +1334,7 @@ nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
>  		iov_iter_advance(&iters[n_iters], write_dio->start_len);
>  	iters[n_iters].count -= write_dio->end_len;
>  	iter_is_dio_aligned[n_iters] =
> -		nfsd_iov_iter_aligned_bvec(&iters[n_iters],
> -					   nf->nf_dio_mem_align - 1,
> -					   nf->nf_dio_offset_align - 1);
> +		nfsd_iov_iter_aligned_bvec(nf, &iters[n_iters]);
>  	if (unlikely(!iter_is_dio_aligned[n_iters]))
>  		return 0; /* no DIO-aligned IO possible */
>  	++n_iters;

Reviewed-by: Jeff Layton <jlayton@kernel.org>
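
After patches 09 and 10, the only thing the helper decides is memory alignment, with the mask derived from nf_dio_mem_align inside the function. A much-simplified sketch of such a check follows. It assumes a power-of-two alignment no larger than a page, ignores the iterator's starting skip, and uses a hypothetical sketch_vec in place of struct bio_vec, so it is not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio_vec: one piece of the write payload. */
struct sketch_vec {
	size_t offset;	/* byte offset of the data within its page */
	size_t len;	/* number of bytes in this piece */
};

/*
 * Return true if every piece starts on a mem_align boundary.
 * Pages are page-aligned, so for mem_align <= page size the
 * in-page offset alone decides address alignment.
 */
static bool sketch_vecs_mem_aligned(const struct sketch_vec *vecs,
				    size_t nvecs, size_t mem_align)
{
	size_t addr_mask;

	/* The mask trick only works for a nonzero power of two. */
	assert(mem_align != 0 && (mem_align & (mem_align - 1)) == 0);
	addr_mask = mem_align - 1;

	for (size_t i = 0; i < nvecs; i++)
		if (vecs[i].offset & addr_mask)
			return false;
	return true;
}
```

Computing the mask inside the helper, as patch 10 does, keeps callers from having to know that the check is mask-based at all.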


* Re: [PATCH v7 11/14] NFSD: Clean up struct nfsd_write_dio
  2025-10-24 14:43 ` [PATCH v7 11/14] NFSD: Clean up struct nfsd_write_dio Chuck Lever
@ 2025-10-24 15:26   ` Jeff Layton
  2025-10-24 17:20   ` Mike Snitzer
  1 sibling, 0 replies; 87+ messages in thread
From: Jeff Layton @ 2025-10-24 15:26 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

On Fri, 2025-10-24 at 10:43 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Prepare for moving more common arguments into the shared per-request
> structure.
> 
> First step is to move the target nfsd_file into that structure, as
> it needs to be available in several functions.
> 
> As a clean-up, adopt the common naming of a structure that carries
> the arguments for a number of functions.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 61 ++++++++++++++++++++++++++-------------------------
>  1 file changed, 31 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index e7c3458bd178..429f5fc61ead 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1254,21 +1254,22 @@ static int wait_for_concurrent_writes(struct file *file)
>  	return err;
>  }
>  
> -struct nfsd_write_dio {
> +struct nfsd_write_dio_args {
> +	struct nfsd_file		*nf;
> +
>  	ssize_t	start_len;	/* Length for misaligned first extent */
>  	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
>  	ssize_t	end_len;	/* Length for misaligned last extent */
>  };
>  
>  static bool
> -nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> -			   struct nfsd_file *nf,
> -			   struct nfsd_write_dio *write_dio)
> +nfsd_is_write_dio_possible(struct nfsd_write_dio_args *args, loff_t offset,
> +			   unsigned long len)
>  {
> -	const u32 dio_blocksize = nf->nf_dio_offset_align;
> +	const u32 dio_blocksize = args->nf->nf_dio_offset_align;
>  	loff_t start_end, orig_end, middle_end;
>  
> -	if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
> +	if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
>  		return false;
>  	if (unlikely(len < dio_blocksize))
>  		return false;
> @@ -1277,9 +1278,9 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
>  	orig_end = offset + len;
>  	middle_end = round_down(orig_end, dio_blocksize);
>  
> -	write_dio->start_len = start_end - offset;
> -	write_dio->middle_len = middle_end - start_end;
> -	write_dio->end_len = orig_end - middle_end;
> +	args->start_len = start_end - offset;
> +	args->middle_len = middle_end - start_end;
> +	args->end_len = orig_end - middle_end;
>  
>  	return true;
>  }
> @@ -1314,36 +1315,35 @@ nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
>  static int
>  nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
>  			   struct bio_vec *rq_bvec, unsigned int nvecs,
> -			   unsigned long cnt, struct nfsd_write_dio *write_dio,
> -			   struct nfsd_file *nf)
> +			   unsigned long cnt, struct nfsd_write_dio_args *args)
>  {
>  	int n_iters = 0;
>  	struct iov_iter *iters = *iterp;
>  
>  	/* Setup misaligned start? */
> -	if (write_dio->start_len) {
> +	if (args->start_len) {
>  		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> -		iters[n_iters].count = write_dio->start_len;
> +		iters[n_iters].count = args->start_len;
>  		iter_is_dio_aligned[n_iters] = false;
>  		++n_iters;
>  	}
>  
>  	/* Setup DIO-aligned middle */
>  	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> -	if (write_dio->start_len)
> -		iov_iter_advance(&iters[n_iters], write_dio->start_len);
> -	iters[n_iters].count -= write_dio->end_len;
> +	if (args->start_len)
> +		iov_iter_advance(&iters[n_iters], args->start_len);
> +	iters[n_iters].count -= args->end_len;
>  	iter_is_dio_aligned[n_iters] =
> -		nfsd_iov_iter_aligned_bvec(nf, &iters[n_iters]);
> +		nfsd_iov_iter_aligned_bvec(args->nf, &iters[n_iters]);
>  	if (unlikely(!iter_is_dio_aligned[n_iters]))
>  		return 0; /* no DIO-aligned IO possible */
>  	++n_iters;
>  
>  	/* Setup misaligned end? */
> -	if (write_dio->end_len) {
> +	if (args->end_len) {
>  		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
>  		iov_iter_advance(&iters[n_iters],
> -				 write_dio->start_len + write_dio->middle_len);
> +				 args->start_len + args->middle_len);
>  		iter_is_dio_aligned[n_iters] = false;
>  		++n_iters;
>  	}
> @@ -1369,11 +1369,10 @@ nfsd_iocb_write(struct file *file, struct bio_vec *bvec, unsigned int nvecs,
>  
>  static int
>  nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> -		     struct nfsd_file *nf, unsigned int nvecs,
> -		     unsigned long *cnt, struct kiocb *kiocb,
> -		     struct nfsd_write_dio *write_dio)
> +		     struct nfsd_write_dio_args *args, struct kiocb *kiocb,
> +		     unsigned int nvecs, unsigned long *cnt)
>  {
> -	struct file *file = nf->nf_file;
> +	struct file *file = args->nf->nf_file;
>  	bool iter_is_dio_aligned[3];
>  	struct iov_iter iter_stack[3];
>  	struct iov_iter *iter = iter_stack;
> @@ -1384,7 +1383,7 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  
>  	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
>  					     rqstp->rq_bvec, nvecs, *cnt,
> -					     write_dio, nf);
> +					     args);
>  	if (unlikely(!n_iters))
>  		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
>  				       cnt, kiocb);
> @@ -1414,14 +1413,15 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		  struct nfsd_file *nf, unsigned int nvecs,
>  		  unsigned long *cnt, struct kiocb *kiocb)
>  {
> -	struct nfsd_write_dio write_dio;
> +	struct file *file = nf->nf_file;
> +	struct nfsd_write_dio_args args;
>  
>  	/*
>  	 * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
>  	 * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
>  	 * be ignored for any DIO issued here).
>  	 */
> -	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> +	if (file->f_op->fop_flags & FOP_DONTCACHE)
>  		kiocb->ki_flags |= IOCB_DONTCACHE;
>  
>  	/*
> @@ -1435,11 +1435,12 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	 */
>  	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
>  
> -	if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
> -		return nfsd_issue_write_dio(rqstp, fhp, nf, nvecs, cnt, kiocb,
> -					    &write_dio);
> +	args.nf = nf;
> +	if (nfsd_is_write_dio_possible(&args, kiocb->ki_pos, *cnt))
> +		return nfsd_issue_write_dio(rqstp, fhp, &args, kiocb,
> +					    nvecs, cnt);
>  
> -	return nfsd_iocb_write(nf->nf_file, rqstp->rq_bvec, nvecs, cnt, kiocb);
> +	return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs, cnt, kiocb);
>  }
>  
>  /**

Reviewed-by: Jeff Layton <jlayton@kernel.org>
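
The arithmetic in nfsd_is_write_dio_possible() is easy to check in user space. The sketch below reproduces the round-up/round-down split with plain mask operations. sketch_split and sketch_split_write() are hypothetical names, and rejecting a block size that is not a power of two is an assumption added here so that the masks are valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three lengths computed by the split. */
struct sketch_split {
	uint64_t first_len;	/* misaligned head, written buffered */
	uint64_t middle_len;	/* block-aligned middle, eligible for DIO */
	uint64_t last_len;	/* misaligned tail, written buffered */
};

static bool sketch_split_write(uint64_t offset, uint64_t len,
			       uint32_t blocksize, struct sketch_split *out)
{
	uint64_t mask, first_end, orig_end, middle_end;

	/* A zero or non-power-of-two block size cannot be used with masks. */
	if (blocksize == 0 || (blocksize & (blocksize - 1)) != 0)
		return false;
	/* A write shorter than one block has no aligned middle. */
	if (len < blocksize)
		return false;

	mask = (uint64_t)blocksize - 1;
	first_end = (offset + mask) & ~mask;	/* round offset up */
	orig_end = offset + len;
	middle_end = orig_end & ~mask;		/* round end down */

	out->first_len = first_end - offset;
	out->middle_len = middle_end - first_end;
	out->last_len = orig_end - middle_end;
	return true;
}
```

For example, a 10000-byte write at offset 1000 with a 4096-byte block size splits into 3096 + 4096 + 2808 bytes; the three lengths always sum to the original length.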


* Re: [PATCH v7 12/14] NFSD: Introduce struct nfsd_write_dio_seg
  2025-10-24 14:43 ` [PATCH v7 12/14] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
@ 2025-10-24 15:30   ` Jeff Layton
  2025-10-24 15:37     ` Chuck Lever
  2025-10-24 17:57   ` Mike Snitzer
  1 sibling, 1 reply; 87+ messages in thread
From: Jeff Layton @ 2025-10-24 15:30 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Christoph Hellwig

On Fri, 2025-10-24 at 10:43 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Passing iter arrays by reference is a little risky. Instead, pass a
> struct with a fixed-size array so bounds checking can be done.
> 
> Name each item in the array a "segment", as the term "extent"
> generally refers to a set of blocks on storage, not to a buffer.
> Each segment is processed via a single vfs_iocb_iter_write() call,
> and is either IOCB_DIRECT or buffered.
> 
> Introduce a segment constructor function so each segment is
> initialized identically.
> 
> Each segment has its own length. The loop that iterates over the
> segment array can simply skip over the segments of zero length.
> A count of segments is not needed.
> 

True, but it's easy to get that sort of accounting wrong, and if we do
we're looking at a buffer overrun. Maybe it'd be reasonable to keep a
number of segments in this struct too, if only for defensive coding
reasons, and to enable proper bounds checking?


> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 121 ++++++++++++++++++++++++--------------------------
>  1 file changed, 57 insertions(+), 64 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 429f5fc61ead..b7f217aa4994 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1254,12 +1254,15 @@ static int wait_for_concurrent_writes(struct file *file)
>  	return err;
>  }
>  
> +struct nfsd_write_dio_seg {
> +	struct iov_iter			iter;
> +	size_t				len;
> +	bool				use_dio;
> +};
> +
>  struct nfsd_write_dio_args {
>  	struct nfsd_file		*nf;
> -
> -	ssize_t	start_len;	/* Length for misaligned first extent */
> -	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
> -	ssize_t	end_len;	/* Length for misaligned last extent */
> +	struct nfsd_write_dio_seg	segment[3];
>  };
>  
>  static bool
> @@ -1267,21 +1270,19 @@ nfsd_is_write_dio_possible(struct nfsd_write_dio_args *args, loff_t offset,
>  			   unsigned long len)
>  {
>  	const u32 dio_blocksize = args->nf->nf_dio_offset_align;
> -	loff_t start_end, orig_end, middle_end;
> +	loff_t first_end, orig_end, middle_end;
>  
>  	if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
>  		return false;
>  	if (unlikely(len < dio_blocksize))
>  		return false;
>  
> -	start_end = round_up(offset, dio_blocksize);
> +	first_end = round_up(offset, dio_blocksize);
>  	orig_end = offset + len;
>  	middle_end = round_down(orig_end, dio_blocksize);
> -
> -	args->start_len = start_end - offset;
> -	args->middle_len = middle_end - start_end;
> -	args->end_len = orig_end - middle_end;
> -
> +	args->segment[0].len = first_end - offset;	/* first segment */
> +	args->segment[1].len = middle_end - first_end;	/* middle segment */
> +	args->segment[2].len = orig_end - middle_end;	/* last segment */
>  	return true;
>  }
>  
> @@ -1308,47 +1309,42 @@ nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
>  	return true;
>  }
>  
> -/*
> - * Setup as many as 3 iov_iter based on extents described by @write_dio.
> - * Returns the number of iov_iter that were setup.
> - */
> -static int
> -nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
> -			   struct bio_vec *rq_bvec, unsigned int nvecs,
> -			   unsigned long cnt, struct nfsd_write_dio_args *args)
> +static void
> +nfsd_setup_write_dio_seg(struct nfsd_write_dio_seg *segment,
> +			 struct bio_vec *bvec, unsigned int nvecs,
> +			 unsigned long total, size_t start)
>  {
> -	int n_iters = 0;
> -	struct iov_iter *iters = *iterp;
> +	iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
> +	if (start)
> +		iov_iter_advance(&segment->iter, start);
> +	iov_iter_truncate(&segment->iter, segment->len);
> +	segment->use_dio = false;
> +}
>  
> -	/* Setup misaligned start? */
> -	if (args->start_len) {
> -		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> -		iters[n_iters].count = args->start_len;
> -		iter_is_dio_aligned[n_iters] = false;
> -		++n_iters;
> -	}
> +static bool
> +nfsd_setup_write_dio_iters(struct nfsd_write_dio_args *args,
> +			   struct bio_vec *bvec, unsigned int nvecs,
> +			   unsigned long total)
> +{
> +	/* first segment */
> +	if (args->segment[0].len)
> +		nfsd_setup_write_dio_seg(&args->segment[0],
> +					 bvec, nvecs, total, 0);
>  
> -	/* Setup DIO-aligned middle */
> -	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> -	if (args->start_len)
> -		iov_iter_advance(&iters[n_iters], args->start_len);
> -	iters[n_iters].count -= args->end_len;
> -	iter_is_dio_aligned[n_iters] =
> -		nfsd_iov_iter_aligned_bvec(args->nf, &iters[n_iters]);
> -	if (unlikely(!iter_is_dio_aligned[n_iters]))
> -		return 0; /* no DIO-aligned IO possible */
> -	++n_iters;
> +	/* middle segment */
> +	nfsd_setup_write_dio_seg(&args->segment[1], bvec, nvecs, total,
> +				 args->segment[0].len);
> +	if (!nfsd_iov_iter_aligned_bvec(args->nf, &args->segment[1].iter))
> +		return false; /* no DIO-aligned IO possible */
> +	args->segment[1].use_dio = true;
>  
> -	/* Setup misaligned end? */
> -	if (args->end_len) {
> -		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> -		iov_iter_advance(&iters[n_iters],
> -				 args->start_len + args->middle_len);
> -		iter_is_dio_aligned[n_iters] = false;
> -		++n_iters;
> -	}
> +	/* last segment */
> +	if (args->segment[2].len)
> +		nfsd_setup_write_dio_seg(&args->segment[2], bvec, nvecs,
> +					 total, args->segment[0].len +
> +					 args->segment[1].len);
>  
> -	return n_iters;
> +	return true;
>  }
>  
>  static int
> @@ -1373,36 +1369,33 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		     unsigned int nvecs, unsigned long *cnt)
>  {
>  	struct file *file = args->nf->nf_file;
> -	bool iter_is_dio_aligned[3];
> -	struct iov_iter iter_stack[3];
> -	struct iov_iter *iter = iter_stack;
> -	unsigned int n_iters = 0;
> -	unsigned long in_count = *cnt;
> -	loff_t in_offset = kiocb->ki_pos;
> +	struct nfsd_write_dio_seg *segment;
>  	ssize_t host_err;
> +	size_t i;
>  
> -	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
> -					     rqstp->rq_bvec, nvecs, *cnt,
> -					     args);
> -	if (unlikely(!n_iters))
> +	if (!nfsd_setup_write_dio_iters(args, rqstp->rq_bvec, nvecs, *cnt))
>  		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
>  				       cnt, kiocb);
>  
> -	trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
> -
>  	*cnt = 0;
> -	for (int i = 0; i < n_iters; i++) {
> -		if (iter_is_dio_aligned[i])
> +	segment = args->segment;
> +	for (i = 0; i < ARRAY_SIZE(args->segment); i++) {
> +		if (segment->len == 0)
> +			continue;
> +		if (segment->use_dio) {
>  			kiocb->ki_flags |= IOCB_DIRECT;
> -		else
> +			trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
> +						segment->len);
> +		} else
>  			kiocb->ki_flags &= ~IOCB_DIRECT;
>  
> -		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> +		host_err = vfs_iocb_iter_write(file, kiocb, &segment->iter);
>  		if (host_err < 0)
>  			return host_err;
>  		*cnt += host_err;
> -		if (host_err < iter[i].count) /* partial write? */
> +		if (host_err < segment->iter.count)
>  			break;
> +		++segment;
>  	}
>  
>  	return 0;
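
For readers following the constructor in the quoted hunk: the three
steps in nfsd_setup_write_dio_seg() (initialize over the whole payload,
advance to the segment start, truncate to the segment length) amount to
selecting a sub-range of the request payload. Below is a minimal
userspace sketch of that idea. It is not the kernel's iov_iter API:
`struct window`, `window_init()`, `window_advance()`,
`window_truncate()` and `segment_window()` are hypothetical stand-ins,
and a flat byte count replaces the bio_vec array.

```c
#include <stddef.h>

/* Hypothetical stand-in for an iterator over a flat payload. */
struct window {
	size_t pos;	/* offset of the next byte to consume */
	size_t count;	/* bytes remaining in the window */
};

static void window_init(struct window *w, size_t total)
{
	w->pos = 0;
	w->count = total;
}

/* Skip @bytes from the front, clamped to what remains. */
static void window_advance(struct window *w, size_t bytes)
{
	if (bytes > w->count)
		bytes = w->count;
	w->pos += bytes;
	w->count -= bytes;
}

/* Shrink the window to at most @len bytes; never grows it. */
static void window_truncate(struct window *w, size_t len)
{
	if (w->count > len)
		w->count = len;
}

/* Same init/advance/truncate order as the quoted constructor. */
static struct window segment_window(size_t total, size_t start, size_t len)
{
	struct window w;

	window_init(&w, total);
	if (start)
		window_advance(&w, start);
	window_truncate(&w, len);
	return w;
}
```

Because every segment is built by the same three steps, the first,
middle and last windows differ only in their (start, len) arguments.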

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v7 13/14] NFSD: Clean up direct write fall back error flow
  2025-10-24 14:43 ` [PATCH v7 13/14] NFSD: Clean up direct write fall back error flow Chuck Lever
@ 2025-10-24 15:32   ` Jeff Layton
  2025-10-24 18:01   ` Mike Snitzer
  1 sibling, 0 replies; 87+ messages in thread
From: Jeff Layton @ 2025-10-24 15:32 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

On Fri, 2025-10-24 at 10:43 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Clean up: Use the usual error flow form:
> 
> 	if (uncommon condition) {
> 		handle it;
> 		return;
> 	}
> 	do common thing;
> 
> in nfsd_direct_write(). Now there is a single place where the direct
> write path falls back to a single cached write.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index b7f217aa4994..b0e4105e0075 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1373,10 +1373,6 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	ssize_t host_err;
>  	size_t i;
>  
> -	if (!nfsd_setup_write_dio_iters(args, rqstp->rq_bvec, nvecs, *cnt))
> -		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
> -				       cnt, kiocb);
> -
>  	*cnt = 0;
>  	segment = args->segment;
>  	for (i = 0; i < ARRAY_SIZE(args->segment); i++) {
> @@ -1388,7 +1384,6 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  						segment->len);
>  		} else
>  			kiocb->ki_flags &= ~IOCB_DIRECT;
> -
>  		host_err = vfs_iocb_iter_write(file, kiocb, &segment->iter);
>  		if (host_err < 0)
>  			return host_err;
> @@ -1429,11 +1424,13 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
>  
>  	args.nf = nf;
> -	if (nfsd_is_write_dio_possible(&args, kiocb->ki_pos, *cnt))
> -		return nfsd_issue_write_dio(rqstp, fhp, &args, kiocb,
> -					    nvecs, cnt);
> +	if (!nfsd_is_write_dio_possible(&args, kiocb->ki_pos, *cnt) ||
> +	    !nfsd_setup_write_dio_iters(&args, rqstp->rq_bvec, nvecs, *cnt))
> +		/* fall back to writing through the page cache */
> +		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
> +				       cnt, kiocb);
>  
> -	return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs, cnt, kiocb);
> +	return nfsd_issue_write_dio(rqstp, fhp, &args, kiocb, nvecs, cnt);
>  }
>  
>  /**

Reviewed-by: Jeff Layton <jlayton@kernel.org>
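
The combined condition in the hunk above relies on `||` evaluating left
to right: the iterator setup runs only after the offset check has
succeeded and filled in the segment lengths. Below is a minimal
userspace sketch of that single fall-back point. All names (`struct
req`, `dio_possible()`, `setup_iters()`, `choose_path()`) are
hypothetical, and the booleans stand in for the real alignment checks.

```c
#include <stdbool.h>

enum write_path { WRITE_BUFFERED, WRITE_DIRECT };

/* Hypothetical request state; the first two fields are model inputs. */
struct req {
	bool offsets_ok;	/* outcome of the "is DIO possible" check */
	bool memory_ok;		/* outcome of the iterator alignment check */
	bool lens_valid;	/* set once segment lengths are computed */
	unsigned int setup_calls;
};

static bool dio_possible(struct req *r)
{
	if (!r->offsets_ok)
		return false;
	r->lens_valid = true;	/* segment lengths now exist */
	return true;
}

static bool setup_iters(struct req *r)
{
	r->setup_calls++;
	/* Only meaningful once the lengths have been computed. */
	return r->lens_valid && r->memory_ok;
}

static enum write_path choose_path(struct req *r)
{
	/*
	 * Uncommon case first. The || operator short-circuits, so
	 * setup_iters() runs only if dio_possible() returned true.
	 */
	if (!dio_possible(r) || !setup_iters(r))
		return WRITE_BUFFERED;	/* the one fall-back point */
	return WRITE_DIRECT;
}
```

In this model, swapping the two operands of `||` would call the setup
step before the lengths exist, which is why the order matters.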


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-24 14:43 ` [PATCH v7 14/14] NFSD: Initialize separate ki_flags Chuck Lever
@ 2025-10-24 15:34   ` Jeff Layton
  2025-10-24 18:13   ` Mike Snitzer
  1 sibling, 0 replies; 87+ messages in thread
From: Jeff Layton @ 2025-10-24 15:34 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Christoph Hellwig

On Fri, 2025-10-24 at 10:43 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Christoph says:
> > > +	if (file->f_op->fop_flags & FOP_DONTCACHE)
> > > +		kiocb->ki_flags |= IOCB_DONTCACHE;
> > IOCB_DONTCACHE isn't defined for IOCB_DIRECT.  So this should
> > move into a branch just for buffered I/O.
> 
> Instead, let's set up separate ki_flags for buffered I/O and for
> direct I/O requests. Then we don't have to set a jumble of flag
> bits in a single flags field.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 29 ++++++++++++++---------------
>  1 file changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index b0e4105e0075..b7b9f8cf0452 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1262,6 +1262,8 @@ struct nfsd_write_dio_seg {
>  
>  struct nfsd_write_dio_args {
>  	struct nfsd_file		*nf;
> +	int				flags_buffered;
> +	int				flags_direct;
>  	struct nfsd_write_dio_seg	segment[3];
>  };
>  
> @@ -1379,11 +1381,11 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		if (segment->len == 0)
>  			continue;
>  		if (segment->use_dio) {
> -			kiocb->ki_flags |= IOCB_DIRECT;
> +			kiocb->ki_flags = args->flags_direct;
>  			trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
>  						segment->len);
>  		} else
> -			kiocb->ki_flags &= ~IOCB_DIRECT;
> +			kiocb->ki_flags = args->flags_buffered;
>  		host_err = vfs_iocb_iter_write(file, kiocb, &segment->iter);
>  		if (host_err < 0)
>  			return host_err;
> @@ -1405,30 +1407,27 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	struct nfsd_write_dio_args args;
>  
>  	/*
> -	 * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
> -	 * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
> -	 * be ignored for any DIO issued here).
> +	 * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
> +	 * writing unaligned segments or handling fallback I/O.
>  	 */
> +	args.flags_buffered = kiocb->ki_flags | IOCB_SYNC | IOCB_DSYNC;
>  	if (file->f_op->fop_flags & FOP_DONTCACHE)
> -		kiocb->ki_flags |= IOCB_DONTCACHE;
> +		args.flags_buffered |= IOCB_DONTCACHE;
>  
>  	/*
> -	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should persist
> -	 * both written data and dirty time stamps.
> -	 *
> -	 * When falling back to buffered I/O or handling the unaligned
> -	 * first and last segments, the data and time stamps must be
> -	 * durable before nfsd_vfs_write() returns to its caller, matching
> -	 * the behavior of direct I/O.
> +	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should
> +	 * persist both written data and dirty time stamps.
>  	 */
> -	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
> +	args.flags_direct = kiocb->ki_flags | IOCB_SYNC | IOCB_DIRECT;
>  
>  	args.nf = nf;
>  	if (!nfsd_is_write_dio_possible(&args, kiocb->ki_pos, *cnt) ||
> -	    !nfsd_setup_write_dio_iters(&args, rqstp->rq_bvec, nvecs, *cnt))
> +	    !nfsd_setup_write_dio_iters(&args, rqstp->rq_bvec, nvecs, *cnt)) {
>  		/* fall back to writing through the page cache */
> +		kiocb->ki_flags = args.flags_buffered;
>  		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
>  				       cnt, kiocb);
> +	}
>  
>  	return nfsd_issue_write_dio(rqstp, fhp, &args, kiocb, nvecs, cnt);
>  }

Reviewed-by: Jeff Layton <jlayton@kernel.org>
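
To illustrate the two-flag-set approach in the hunk above: each set is
computed once from the incoming flags, and the write loop then assigns
a whole set instead of toggling individual bits. Below is a minimal
userspace sketch. The `X_IOCB_*` bit values are placeholders chosen for
illustration, not the kernel's IOCB_* definitions, and `struct
flag_sets` and `build_flag_sets()` are hypothetical names.

```c
/* Placeholder bit values for illustration only. */
#define X_IOCB_DSYNC		(1u << 0)
#define X_IOCB_SYNC		(1u << 1)
#define X_IOCB_DIRECT		(1u << 2)
#define X_IOCB_DONTCACHE	(1u << 3)

struct flag_sets {
	unsigned int buffered;	/* unaligned segments and fall back */
	unsigned int direct;	/* the DIO-aligned middle segment */
};

/*
 * Derive both sets from @base, following the assignments in the
 * quoted patch. @fs_supports_dontcache models the FOP_DONTCACHE test.
 */
static struct flag_sets
build_flag_sets(unsigned int base, int fs_supports_dontcache)
{
	struct flag_sets f;

	f.buffered = base | X_IOCB_SYNC | X_IOCB_DSYNC;
	if (fs_supports_dontcache)
		f.buffered |= X_IOCB_DONTCACHE;

	f.direct = base | X_IOCB_SYNC | X_IOCB_DIRECT;
	return f;
}
```

With this split, the DONTCACHE bit can never reach a direct request and
the DIRECT bit can never reach a buffered one, which is the property
Christoph's comment asks for.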


* Re: [PATCH v7 12/14] NFSD: Introduce struct nfsd_write_dio_seg
  2025-10-24 15:30   ` Jeff Layton
@ 2025-10-24 15:37     ` Chuck Lever
  0 siblings, 0 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 15:37 UTC (permalink / raw)
  To: Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Christoph Hellwig

On 10/24/25 11:30 AM, Jeff Layton wrote:
> On Fri, 2025-10-24 at 10:43 -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Passing iter arrays by reference is a little risky. Instead, pass a
>> struct with a fixed-size array so bounds checking can be done.
>>
>> Name each item in the array a "segment", as the term "extent"
>> generally refers to a set of blocks on storage, not to a buffer.
>> Each segment is processed via a single vfs_iocb_iter_write() call,
>> and is either IOCB_DIRECT or buffered.
>>
>> Introduce a segment constructor function so each segment is
>> initialized identically.
>>
>> Each segment has its own length. The loop that iterates over the
>> segment array can simply skip over the segments of zero length.
>> A count of segments is not needed.
>>
> 
> True, but it's easy to get that sort of accounting wrong, and if we do
> we're looking at a buffer overrun.

All three segment lengths are initialized for each request.


> Maybe it'd be reasonable to keep a
> number of segments in this struct too, if only for defensive coding
> reasons, and to enable proper bounds checking?

Christoph also suggested that. I'll look into it.
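
One way to get the defensive property discussed here without a separate
count is to index the fixed-size array directly, so the cursor and the
bound live in one variable. This matters for the loop quoted in this
patch: the `continue` taken for a zero-length segment appears to bypass
the `++segment` at the bottom of the loop body, so the pointer and the
loop index can fall out of step. Below is a minimal userspace sketch.
The names (`struct seg`, `struct seg_args`, `issue_segments()`,
`fake_write()`) are hypothetical, and the stub writer stands in for
vfs_iocb_iter_write().

```c
#include <stdbool.h>
#include <stddef.h>

#define NSEGS 3
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-in for struct nfsd_write_dio_seg. */
struct seg {
	size_t len;
	bool use_dio;
};

struct seg_args {
	struct seg segment[NSEGS];
};

/* Stub writer: pretends every byte of the segment was written. */
static long fake_write(const struct seg *s)
{
	return (long)s->len;
}

/*
 * Issue each non-empty segment in order. Indexing with @i means
 * "continue" cannot leave a stale cursor behind.
 * Returns the number of writes issued; *written gets the byte total.
 */
static unsigned int
issue_segments(const struct seg_args *args, size_t *written)
{
	unsigned int issued = 0;
	size_t i;

	*written = 0;
	for (i = 0; i < ARRAY_LEN(args->segment); i++) {
		const struct seg *s = &args->segment[i];
		long ret;

		if (s->len == 0)
			continue;	/* nothing to do for this slot */
		ret = fake_write(s);
		if (ret < 0)
			break;
		*written += (size_t)ret;
		issued++;
		if ((size_t)ret < s->len)
			break;		/* short write: stop early */
	}
	return issued;
}
```

The short-write test compares against the recorded segment length
rather than the iterator's remaining count; that is a choice made for
this sketch, not a statement about what the final patch should do.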


>> Suggested-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>  fs/nfsd/vfs.c | 121 ++++++++++++++++++++++++--------------------------
>>  1 file changed, 57 insertions(+), 64 deletions(-)
>>
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 429f5fc61ead..b7f217aa4994 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -1254,12 +1254,15 @@ static int wait_for_concurrent_writes(struct file *file)
>>  	return err;
>>  }
>>  
>> +struct nfsd_write_dio_seg {
>> +	struct iov_iter			iter;
>> +	size_t				len;
>> +	bool				use_dio;
>> +};
>> +
>>  struct nfsd_write_dio_args {
>>  	struct nfsd_file		*nf;
>> -
>> -	ssize_t	start_len;	/* Length for misaligned first extent */
>> -	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
>> -	ssize_t	end_len;	/* Length for misaligned last extent */
>> +	struct nfsd_write_dio_seg	segment[3];
>>  };
>>  
>>  static bool
>> @@ -1267,21 +1270,19 @@ nfsd_is_write_dio_possible(struct nfsd_write_dio_args *args, loff_t offset,
>>  			   unsigned long len)
>>  {
>>  	const u32 dio_blocksize = args->nf->nf_dio_offset_align;
>> -	loff_t start_end, orig_end, middle_end;
>> +	loff_t first_end, orig_end, middle_end;
>>  
>>  	if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
>>  		return false;
>>  	if (unlikely(len < dio_blocksize))
>>  		return false;
>>  
>> -	start_end = round_up(offset, dio_blocksize);
>> +	first_end = round_up(offset, dio_blocksize);
>>  	orig_end = offset + len;
>>  	middle_end = round_down(orig_end, dio_blocksize);
>> -
>> -	args->start_len = start_end - offset;
>> -	args->middle_len = middle_end - start_end;
>> -	args->end_len = orig_end - middle_end;
>> -
>> +	args->segment[0].len = first_end - offset;	/* first segment */
>> +	args->segment[1].len = middle_end - first_end;	/* middle segment */
>> +	args->segment[2].len = orig_end - middle_end;	/* last segment */
>>  	return true;
>>  }
>>  
>> @@ -1308,47 +1309,42 @@ nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
>>  	return true;
>>  }
>>  
>> -/*
>> - * Setup as many as 3 iov_iter based on extents described by @write_dio.
>> - * Returns the number of iov_iter that were setup.
>> - */
>> -static int
>> -nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
>> -			   struct bio_vec *rq_bvec, unsigned int nvecs,
>> -			   unsigned long cnt, struct nfsd_write_dio_args *args)
>> +static void
>> +nfsd_setup_write_dio_seg(struct nfsd_write_dio_seg *segment,
>> +			 struct bio_vec *bvec, unsigned int nvecs,
>> +			 unsigned long total, size_t start)
>>  {
>> -	int n_iters = 0;
>> -	struct iov_iter *iters = *iterp;
>> +	iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
>> +	if (start)
>> +		iov_iter_advance(&segment->iter, start);
>> +	iov_iter_truncate(&segment->iter, segment->len);
>> +	segment->use_dio = false;
>> +}
>>  
>> -	/* Setup misaligned start? */
>> -	if (args->start_len) {
>> -		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
>> -		iters[n_iters].count = args->start_len;
>> -		iter_is_dio_aligned[n_iters] = false;
>> -		++n_iters;
>> -	}
>> +static bool
>> +nfsd_setup_write_dio_iters(struct nfsd_write_dio_args *args,
>> +			   struct bio_vec *bvec, unsigned int nvecs,
>> +			   unsigned long total)
>> +{
>> +	/* first segment */
>> +	if (args->segment[0].len)
>> +		nfsd_setup_write_dio_seg(&args->segment[0],
>> +					 bvec, nvecs, total, 0);
>>  
>> -	/* Setup DIO-aligned middle */
>> -	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
>> -	if (args->start_len)
>> -		iov_iter_advance(&iters[n_iters], args->start_len);
>> -	iters[n_iters].count -= args->end_len;
>> -	iter_is_dio_aligned[n_iters] =
>> -		nfsd_iov_iter_aligned_bvec(args->nf, &iters[n_iters]);
>> -	if (unlikely(!iter_is_dio_aligned[n_iters]))
>> -		return 0; /* no DIO-aligned IO possible */
>> -	++n_iters;
>> +	/* middle segment */
>> +	nfsd_setup_write_dio_seg(&args->segment[1], bvec, nvecs, total,
>> +				 args->segment[0].len);
>> +	if (!nfsd_iov_iter_aligned_bvec(args->nf, &args->segment[1].iter))
>> +		return false; /* no DIO-aligned IO possible */
>> +	args->segment[1].use_dio = true;
>>  
>> -	/* Setup misaligned end? */
>> -	if (args->end_len) {
>> -		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
>> -		iov_iter_advance(&iters[n_iters],
>> -				 args->start_len + args->middle_len);
>> -		iter_is_dio_aligned[n_iters] = false;
>> -		++n_iters;
>> -	}
>> +	/* last segment */
>> +	if (args->segment[2].len)
>> +		nfsd_setup_write_dio_seg(&args->segment[2], bvec, nvecs,
>> +					 total, args->segment[0].len +
>> +					 args->segment[1].len);
>>  
>> -	return n_iters;
>> +	return true;
>>  }
>>  
>>  static int
>> @@ -1373,36 +1369,33 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>  		     unsigned int nvecs, unsigned long *cnt)
>>  {
>>  	struct file *file = args->nf->nf_file;
>> -	bool iter_is_dio_aligned[3];
>> -	struct iov_iter iter_stack[3];
>> -	struct iov_iter *iter = iter_stack;
>> -	unsigned int n_iters = 0;
>> -	unsigned long in_count = *cnt;
>> -	loff_t in_offset = kiocb->ki_pos;
>> +	struct nfsd_write_dio_seg *segment;
>>  	ssize_t host_err;
>> +	size_t i;
>>  
>> -	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
>> -					     rqstp->rq_bvec, nvecs, *cnt,
>> -					     args);
>> -	if (unlikely(!n_iters))
>> +	if (!nfsd_setup_write_dio_iters(args, rqstp->rq_bvec, nvecs, *cnt))
>>  		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
>>  				       cnt, kiocb);
>>  
>> -	trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
>> -
>>  	*cnt = 0;
>> -	for (int i = 0; i < n_iters; i++) {
>> -		if (iter_is_dio_aligned[i])
>> +	segment = args->segment;
>> +	for (i = 0; i < ARRAY_SIZE(args->segment); i++) {
>> +		if (segment->len == 0)
>> +			continue;
>> +		if (segment->use_dio) {
>>  			kiocb->ki_flags |= IOCB_DIRECT;
>> -		else
>> +			trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
>> +						segment->len);
>> +		} else
>>  			kiocb->ki_flags &= ~IOCB_DIRECT;
>>  
>> -		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
>> +		host_err = vfs_iocb_iter_write(file, kiocb, &segment->iter);
>>  		if (host_err < 0)
>>  			return host_err;
>>  		*cnt += host_err;
>> -		if (host_err < iter[i].count) /* partial write? */
>> +		if (host_err < segment->iter.count)
>>  			break;
>> +		++segment;
>>  	}
>>  
>>  	return 0;
> 


-- 
Chuck Lever


* Re: [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-24 14:42 ` [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-24 17:12   ` Mike Snitzer
  2025-10-24 17:24     ` Chuck Lever
  2025-10-26  0:03   ` kernel test robot
  2025-10-26  1:16   ` kernel test robot
  2 siblings, 1 reply; 87+ messages in thread
From: Mike Snitzer @ 2025-10-24 17:12 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs

On Fri, Oct 24, 2025 at 10:42:56AM -0400, Chuck Lever wrote:
> From: Mike Snitzer <snitzer@kernel.org>
> 
> If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> middle and end as needed. The large middle extent is DIO-aligned and
> the start and/or end are misaligned. Synchronous buffered IO (with
> preference towards using DONTCACHE) is used for the misaligned extents
> and O_DIRECT is used for the middle DIO-aligned extent.
> 
> nfsd_issue_write_dio() promotes @stable_how to NFS_FILE_SYNC, which
> allows the client to drop its dirty data and avoid needing an extra
> COMMIT operation.
> 
> If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> invalidate the page cache on behalf of the DIO WRITE, then
> nfsd_issue_write_dio() will fall back to using buffered IO.
> 
> These changes served as the original starting point for the NFS
> client's misaligned O_DIRECT support that landed with
> commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
> READ and WRITE"). But NFSD's support is simpler because it currently
> doesn't use AIO completion.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/debugfs.c |   1 +
>  fs/nfsd/trace.h   |   1 +
>  fs/nfsd/vfs.c     | 197 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 199 insertions(+)
> 
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 00eb1ecef6ac..7f44689e0a53 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
>  	switch (val) {
>  	case NFSD_IO_BUFFERED:
>  	case NFSD_IO_DONTCACHE:
> +	case NFSD_IO_DIRECT:
>  		nfsd_io_cache_write = val;
>  		break;
>  	default:
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index bfd41236aff2..ad74439d0105 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
>  DEFINE_NFSD_IO_EVENT(read_done);
>  DEFINE_NFSD_IO_EVENT(write_start);
>  DEFINE_NFSD_IO_EVENT(write_opened);
> +DEFINE_NFSD_IO_EVENT(write_direct);
>  DEFINE_NFSD_IO_EVENT(write_io_done);
>  DEFINE_NFSD_IO_EVENT(write_done);
>  DEFINE_NFSD_IO_EVENT(commit_start);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 6076821bb541..2832a66cda5b 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1254,6 +1254,109 @@ static int wait_for_concurrent_writes(struct file *file)
>  	return err;
>  }
>  
> +struct nfsd_write_dio {
> +	ssize_t	start_len;	/* Length for misaligned first extent */
> +	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
> +	ssize_t	end_len;	/* Length for misaligned last extent */
> +};
> +
> +static bool
> +nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> +			   struct nfsd_file *nf,
> +			   struct nfsd_write_dio *write_dio)
> +{
> +	const u32 dio_blocksize = nf->nf_dio_offset_align;
> +	loff_t start_end, orig_end, middle_end;
> +
> +	if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
> +		return false;
> +	if (unlikely(dio_blocksize > PAGE_SIZE))
> +		return false;
> +	if (unlikely(len < dio_blocksize))
> +		return false;
> +
> +	start_end = round_up(offset, dio_blocksize);
> +	orig_end = offset + len;
> +	middle_end = round_down(orig_end, dio_blocksize);
> +
> +	write_dio->start_len = start_end - offset;
> +	write_dio->middle_len = middle_end - start_end;
> +	write_dio->end_len = orig_end - middle_end;
> +
> +	return true;
> +}
> +
> +static bool
> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> +			   unsigned int len_mask)
> +{
> +	const struct bio_vec *bvec = i->bvec;
> +	size_t skip = i->iov_offset;
> +	size_t size = i->count;
> +
> +	if (size & len_mask)
> +		return false;
> +	do {
> +		size_t len = bvec->bv_len;
> +
> +		if (len > size)
> +			len = size;
> +		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> +			return false;
> +		bvec++;
> +		size -= len;
> +		skip = 0;
> +	} while (size);
> +
> +	return true;
> +}
> +
> +/*
> + * Setup as many as 3 iov_iter based on extents described by @write_dio.
> + * Returns the number of iov_iter that were setup.
> + */
> +static int
> +nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
> +			   struct bio_vec *rq_bvec, unsigned int nvecs,
> +			   unsigned long cnt, struct nfsd_write_dio *write_dio,
> +			   struct nfsd_file *nf)
> +{
> +	int n_iters = 0;
> +	struct iov_iter *iters = *iterp;
> +
> +	/* Setup misaligned start? */
> +	if (write_dio->start_len) {
> +		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> +		iters[n_iters].count = write_dio->start_len;
> +		iter_is_dio_aligned[n_iters] = false;
> +		++n_iters;
> +	}
> +
> +	/* Setup DIO-aligned middle */
> +	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> +	if (write_dio->start_len)
> +		iov_iter_advance(&iters[n_iters], write_dio->start_len);
> +	iters[n_iters].count -= write_dio->end_len;
> +	iter_is_dio_aligned[n_iters] =
> +		nfsd_iov_iter_aligned_bvec(&iters[n_iters],
> +					   nf->nf_dio_mem_align - 1,
> +					   nf->nf_dio_offset_align - 1);
> +	if (unlikely(!iter_is_dio_aligned[n_iters]))
> +		return 0; /* no DIO-aligned IO possible */
> +	++n_iters;
> +
> +	/* Setup misaligned end? */
> +	if (write_dio->end_len) {
> +		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> +		iov_iter_advance(&iters[n_iters],
> +				 write_dio->start_len + write_dio->middle_len);
> +		iter_is_dio_aligned[n_iters] = false;
> +		++n_iters;
> +	}
> +
> +	return n_iters;
> +}
> +
>  static int
>  nfsd_iocb_write(struct file *file, struct bio_vec *bvec, unsigned int nvecs,
>  		unsigned long *cnt, struct kiocb *kiocb)
> @@ -1270,6 +1373,95 @@ nfsd_iocb_write(struct file *file, struct bio_vec *bvec, unsigned int nvecs,
>  	return 0;
>  }
>  
> +static int
> +nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
> +		     u32 *stable_how, unsigned int nvecs, unsigned long *cnt,
> +		     struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
> +{
> +	struct file *file = nf->nf_file;
> +	bool iter_is_dio_aligned[3];
> +	struct iov_iter iter_stack[3];
> +	struct iov_iter *iter = iter_stack;
> +	unsigned int n_iters = 0;
> +	unsigned long in_count = *cnt;
> +	loff_t in_offset = kiocb->ki_pos;
> +	ssize_t host_err;
> +
> +	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
> +					     rqstp->rq_bvec, nvecs, *cnt,
> +					     write_dio, nf);
> +	if (unlikely(!n_iters))
> +		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
> +				       cnt, kiocb);
> +
> +	trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
> +
> +	/*
> +	 * Any buffered IO issued here will be misaligned, use
> +	 * sync IO to ensure it has completed before returning.
> +	 * Also update @stable_how to avoid need for COMMIT.
> +	 */
> +	kiocb->ki_flags |= IOCB_DSYNC;
> +	*stable_how = NFS_FILE_SYNC;

Patch 6 really should be folded into this patch. I originally
submitted my v3 (and you carried it in v4) with both
IOCB_DSYNC|IOCB_SYNC being set, see:
https://lore.kernel.org/linux-nfs/aPAci7O_XK1ljaum@kernel.org/

If you'd like your comment change and removal of parentheses factored
out (to patch 6 or whatever) that's up to you.

Thanks,
Mike


> +
> +	*cnt = 0;
> +	for (int i = 0; i < n_iters; i++) {
> +		if (iter_is_dio_aligned[i])
> +			kiocb->ki_flags |= IOCB_DIRECT;
> +		else
> +			kiocb->ki_flags &= ~IOCB_DIRECT;
> +
> +		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> +		if (host_err < 0) {
> +			/*
> +			 * VFS will return -ENOTBLK if DIO WRITE fails to
> +			 * invalidate the page cache. Retry using buffered IO.
> +			 */
> +			if (unlikely(host_err == -ENOTBLK)) {
> +				kiocb->ki_flags &= ~IOCB_DIRECT;
> +				*cnt = in_count;
> +				kiocb->ki_pos = in_offset;
> +				return nfsd_iocb_write(file, rqstp->rq_bvec,
> +						       nvecs, cnt, kiocb);
> +			} else if (unlikely(host_err == -EINVAL)) {
> +				struct inode *inode = d_inode(fhp->fh_dentry);
> +
> +				pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
> +						    inode->i_sb->s_id, inode->i_ino);
> +				host_err = -ESERVERFAULT;
> +			}
> +			return host_err;
> +		}
> +		*cnt += host_err;
> +		if (host_err < iter[i].count) /* partial write? */
> +			break;
> +	}
> +
> +	return 0;
> +}
> +
> +static noinline_for_stack int
> +nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> +		  struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
> +		  unsigned long *cnt, struct kiocb *kiocb)
> +{
> +	struct nfsd_write_dio write_dio;
> +
> +	/*
> +	 * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
> +	 * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
> +	 * be ignored for any DIO issued here).
> +	 */
> +	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> +		kiocb->ki_flags |= IOCB_DONTCACHE;
> +
> +	if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
> +		return nfsd_issue_write_dio(rqstp, fhp, nf, stable_how, nvecs,
> +					    cnt, kiocb, &write_dio);
> +
> +	return nfsd_iocb_write(nf->nf_file, rqstp->rq_bvec, nvecs, cnt, kiocb);
> +}
> +
>  /**
>   * nfsd_vfs_write - write data to an already-open file
>   * @rqstp: RPC execution context
> @@ -1346,6 +1538,11 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		nfsd_copy_write_verifier(verf, nn);
>  
>  	switch (nfsd_io_cache_write) {
> +	case NFSD_IO_DIRECT:
> +		host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
> +					     nvecs, cnt, &kiocb);
> +		stable = *stable_how;
> +		break;
>  	case NFSD_IO_DONTCACHE:
>  		if (file->f_op->fop_flags & FOP_DONTCACHE)
>  			kiocb.ki_flags |= IOCB_DONTCACHE;
> -- 
> 2.51.0
> 
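
The start/middle/end arithmetic in nfsd_is_write_dio_possible() can be
tried out in userspace. Below is a minimal sketch that follows the
quoted computation. `struct dio_split`, `split_write()` and the two
rounding helpers are hypothetical names, and the explicit power-of-two
test is an assumption added for the sketch (the kernel's round_up() and
round_down() macros expect a power-of-two argument).

```c
#include <stdbool.h>
#include <stdint.h>

struct dio_split {
	uint64_t start_len;	/* misaligned first part */
	uint64_t middle_len;	/* DIO-aligned middle part */
	uint64_t end_len;	/* misaligned last part */
};

/* Both helpers assume @bs is a nonzero power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t bs)
{
	return (x + bs - 1) & ~(bs - 1);
}

static uint64_t round_down_pow2(uint64_t x, uint64_t bs)
{
	return x & ~(bs - 1);
}

/* Returns false when no split is attempted (buffered I/O only). */
static bool split_write(uint64_t offset, uint64_t len,
			uint64_t blocksize, struct dio_split *out)
{
	uint64_t first_end, orig_end, middle_end;

	if (blocksize == 0 || (blocksize & (blocksize - 1)))
		return false;
	if (len < blocksize)
		return false;

	first_end = round_up_pow2(offset, blocksize);
	orig_end = offset + len;
	middle_end = round_down_pow2(orig_end, blocksize);

	out->start_len = first_end - offset;
	out->middle_len = middle_end - first_end;
	out->end_len = orig_end - middle_end;
	return true;
}
```

For example, offset 1000 and length 10000 with a 4096-byte block size
give a 3096-byte start, a 4096-byte middle and a 2808-byte end. Note
that a write can pass the length test and still have an empty middle:
offset 100 with length 4096 yields a middle length of zero.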


* Re: [PATCH v7 09/14] NFSD: Remove the len_mask check
  2025-10-24 14:43 ` [PATCH v7 09/14] NFSD: Remove the len_mask check Chuck Lever
  2025-10-24 15:23   ` Jeff Layton
@ 2025-10-24 17:16   ` Mike Snitzer
  2025-10-24 17:22     ` Chuck Lever
  1 sibling, 1 reply; 87+ messages in thread
From: Mike Snitzer @ 2025-10-24 17:16 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On Fri, Oct 24, 2025 at 10:43:01AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Mike says:
> > > Hey Mike, I'm trying to understand when nfsd_is_write_dio_possible()
> > > would return true but nfsd_iov_iter_aligned_bvec() on the middle segment
> > > would return false.
> >
> > It is always due to memory alignment (addr_mask check), never due to
> > logical alignment (len_mask check).
> >
> > So we could remove the len_mask arg and the 'if (size & len_mask)'
> > check from nfsd_iov_iter_aligned_bvec
> 
> Suggested-by: Mike Snitzer <snitzer@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 465d4d091f3d..f6810630bb65 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1285,15 +1285,12 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
>  }
>  
>  static bool
> -nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> -			   unsigned int len_mask)
> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask)
>  {
>  	const struct bio_vec *bvec = i->bvec;
>  	size_t skip = i->iov_offset;
>  	size_t size = i->count;
>  
> -	if (size & len_mask)
> -		return false;
>  	do {
>  		size_t len = bvec->bv_len;
>  
> -- 
> 2.51.0
> 
> 

Just a bisectability nit: the call to nfsd_iov_iter_aligned_bvec()
also needs its len_mask arg removed.

Otherwise:

Reviewed-by: Mike Snitzer <snitzer@kernel.org>
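
As Mike's quoted explanation says, once the segment lengths are derived
from the block size, only the memory-alignment test can still fail for
the middle segment. Below is a minimal userspace sketch of such a
check. It is a model rather than the kernel function: `struct vec` and
`vecs_mem_aligned()` are hypothetical names, the offsets are plain
numbers instead of page offsets, and the sketch assumes that @skip is
smaller than the first vector's length.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio_vec: start offset + length. */
struct vec {
	size_t offset;
	size_t len;
};

/*
 * Return true if every vector contributing to the first @size bytes
 * (after skipping @skip bytes of the first vector) starts on an
 * @align-byte boundary. @align must be a power of two.
 */
static bool vecs_mem_aligned(const struct vec *v, size_t nvecs,
			     size_t skip, size_t size, size_t align)
{
	size_t mask = align - 1;
	size_t i;

	for (i = 0; i < nvecs && size > 0; i++) {
		size_t avail = v[i].len - skip;
		size_t len = avail < size ? avail : size;

		if ((v[i].offset + skip) & mask)
			return false;	/* misaligned memory address */
		size -= len;
		skip = 0;		/* only the first vector is skipped into */
	}
	return size == 0;	/* false if the vectors ran out early */
}
```

Unlike the quoted kernel loop, this version also bounds the walk by the
number of vectors, so a count that exceeds the supplied data is
reported as a failure instead of reading past the array.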


* Re: [PATCH v7 11/14] NFSD: Clean up struct nfsd_write_dio
  2025-10-24 14:43 ` [PATCH v7 11/14] NFSD: Clean up struct nfsd_write_dio Chuck Lever
  2025-10-24 15:26   ` Jeff Layton
@ 2025-10-24 17:20   ` Mike Snitzer
  1 sibling, 0 replies; 87+ messages in thread
From: Mike Snitzer @ 2025-10-24 17:20 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On Fri, Oct 24, 2025 at 10:43:03AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Prepare for moving more common arguments into the shared per-request
> structure.
> 
> First step is to move the target nfsd_file into that structure, as
> it needs to be available in several functions.
> 
> As a clean-up, adopt the common naming of a structure that carries
> the arguments for a number of functions.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Reviewed-by: Mike Snitzer <snitzer@kernel.org>


* Re: [PATCH v7 09/14] NFSD: Remove the len_mask check
  2025-10-24 17:16   ` Mike Snitzer
@ 2025-10-24 17:22     ` Chuck Lever
  0 siblings, 0 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 17:22 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On 10/24/25 1:16 PM, Mike Snitzer wrote:
> On Fri, Oct 24, 2025 at 10:43:01AM -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Mike says:
>>>> Hey Mike, I'm trying to understand when nfsd_is_write_dio_possible()
>>>> would return true but nfsd_iov_iter_aligned_bvec() on the middle segment
>>>> would return false.
>>>
>>> It is always due to memory alignment (addr_mask check), never due to
>>> logical alignment (len_mask check).
>>>
>>> So we could remove the len_mask arg and the 'if (size & len_mask)'
>>> check from nfsd_iov_iter_aligned_bvec
>>
>> Suggested-by: Mike Snitzer <snitzer@kernel.org>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>  fs/nfsd/vfs.c | 5 +----
>>  1 file changed, 1 insertion(+), 4 deletions(-)
>>
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 465d4d091f3d..f6810630bb65 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -1285,15 +1285,12 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
>>  }
>>  
>>  static bool
>> -nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
>> -			   unsigned int len_mask)
>> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask)
>>  {
>>  	const struct bio_vec *bvec = i->bvec;
>>  	size_t skip = i->iov_offset;
>>  	size_t size = i->count;
>>  
>> -	if (size & len_mask)
>> -		return false;
>>  	do {
>>  		size_t len = bvec->bv_len;
>>  
>> -- 
>> 2.51.0
>>
>>
> 
> Just a bisectability nit: the call to nfsd_iov_iter_aligned_bvec()
> also needs its len_mask arg removed.

Fixed, thanks.


> 
> Otherwise:
> 
> Reviewed-by: Mike Snitzer <snitzer@kernel.org>




-- 
Chuck Lever
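
As an illustration of the distinction discussed in this message, here is a
minimal userspace sketch of a memory-alignment (addr_mask) check with no
length (len_mask) check. The names `struct buf_seg` and
`segs_addr_aligned()` are hypothetical stand-ins, not the kernel's
struct bio_vec or nfsd_iov_iter_aligned_bvec(), and the sketch assumes the
alignment is a power of two so that `alignment - 1` can serve as the mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one buffer segment (not struct bio_vec). */
struct buf_seg {
	uintptr_t addr;	/* starting address of the segment */
	size_t len;	/* length of the segment in bytes */
};

/*
 * Return true if every segment starts at an address that has none of
 * the bits in @addr_mask set. @addr_mask is (alignment - 1), where the
 * alignment is assumed to be a power of two. The total length is not
 * checked here; the caller is assumed to have validated it already.
 */
static bool segs_addr_aligned(const struct buf_seg *segs, size_t nsegs,
			      uintptr_t addr_mask)
{
	for (size_t i = 0; i < nsegs; i++) {
		if (segs[i].addr & addr_mask)
			return false;
	}
	return true;
}
```

With a 512-byte alignment the mask is 511, so a segment starting at 0x3000
passes and one starting at 0x3004 does not.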


* Re: [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-24 17:12   ` Mike Snitzer
@ 2025-10-24 17:24     ` Chuck Lever
  0 siblings, 0 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 17:24 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs

On 10/24/25 1:12 PM, Mike Snitzer wrote:
>> +	/*
>> +	 * Any buffered IO issued here will be misaligned, use
>> +	 * sync IO to ensure it has completed before returning.
>> +	 * Also update @stable_how to avoid need for COMMIT.
>> +	 */
>> +	kiocb->ki_flags |= IOCB_DSYNC;
>> +	*stable_how = NFS_FILE_SYNC;
> Patch 6 really should be folded into this patch, I originally
> submitted my v3 (and you carried it in v4) with both
> IOCB_DSYNC|IOCB_SYNC being set, see:
> https://lore.kernel.org/linux-nfs/aPAci7O_XK1ljaum@kernel.org/
> 
> If you'd like your comment change and removal of parentheses factored
> out (to patch 6 or whatever) that's up to you.

I'm thinking of rebasing the series on your v3. I'll figure something
out.


-- 
Chuck Lever


* Re: [PATCH v7 12/14] NFSD: Introduce struct nfsd_write_dio_seg
  2025-10-24 14:43 ` [PATCH v7 12/14] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
  2025-10-24 15:30   ` Jeff Layton
@ 2025-10-24 17:57   ` Mike Snitzer
  1 sibling, 0 replies; 87+ messages in thread
From: Mike Snitzer @ 2025-10-24 17:57 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever, Christoph Hellwig

On Fri, Oct 24, 2025 at 10:43:04AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Passing iter arrays by reference is a little risky. Instead, pass a
> struct with a fixed-size array so bounds checking can be done.
> 
> Name each item in the array a "segment", as the term "extent"
> generally refers to a set of blocks on storage, not to a buffer.

Well, both ondisk and memory are in play when setting up DIO and
checking alignment.  So the length of the associated ondisk region is
an extent.  This is just to say, this header might do well to not get
hung up on past naming.

Not a big deal either way, we're ensuring the DIO-aligned middle
_extent_ ondisk, and our additional alignment checking can be
constrained to focus purely on memory segment checking. So I agree
with your desire to flip the focus to "segment".

> Each segment is processed via a single vfs_iocb_iter_write() call,
> and is either IOCB_DIRECT or buffered.
> 
> Introduce a segment constructor function so each segment is
> initialized identically.
> 
> Each segment has its own length. The loop that iterates over the
> segment array can simply skip over the segments of zero length.
> A count of segments is not needed.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Code looks great, matches how I read Christoph's suggestion, nicely
done.  Thanks for running with this.

Reviewed-by: Mike Snitzer <snitzer@kernel.org>

ps. super small nit below.

> @@ -1373,36 +1369,33 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		     unsigned int nvecs, unsigned long *cnt)
>  {
>  	struct file *file = args->nf->nf_file;
> -	bool iter_is_dio_aligned[3];
> -	struct iov_iter iter_stack[3];
> -	struct iov_iter *iter = iter_stack;
> -	unsigned int n_iters = 0;
> -	unsigned long in_count = *cnt;
> -	loff_t in_offset = kiocb->ki_pos;
> +	struct nfsd_write_dio_seg *segment;
>  	ssize_t host_err;
> +	size_t i;
>  
> -	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
> -					     rqstp->rq_bvec, nvecs, *cnt,
> -					     args);
> -	if (unlikely(!n_iters))
> +	if (!nfsd_setup_write_dio_iters(args, rqstp->rq_bvec, nvecs, *cnt))
>  		return nfsd_iocb_write(file, rqstp->rq_bvec, nvecs,
>  				       cnt, kiocb);
>  
> -	trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
> -
>  	*cnt = 0;
> -	for (int i = 0; i < n_iters; i++) {
> -		if (iter_is_dio_aligned[i])
> +	segment = args->segment;
> +	for (i = 0; i < ARRAY_SIZE(args->segment); i++) {
> +		if (segment->len == 0)
> +			continue;
> +		if (segment->use_dio) {
>  			kiocb->ki_flags |= IOCB_DIRECT;
> -		else
> +			trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
> +						segment->len);
> +		} else
>  			kiocb->ki_flags &= ~IOCB_DIRECT;
>  
> -		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> +		host_err = vfs_iocb_iter_write(file, kiocb, &segment->iter);
>  		if (host_err < 0)
>  			return host_err;
>  		*cnt += host_err;
> -		if (host_err < iter[i].count) /* partial write? */
> +		if (host_err < segment->iter.count)
>  			break;
> +		++segment;
>  	}

I think the /* partial write? */ comment helps a bit. Maybe switch to:

 		if (host_err < segment->iter.count)
  			break; /* partial write */

Or not, up to you... ;)
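
As an illustration of the approach in this patch description (a
fixed-size segment array, a constructor, and a loop that skips empty
segments), here is a minimal userspace sketch. All names (`struct write_seg`,
`struct write_args`, `seg_init()`, `issue_segments()`, `write_all()`) are
hypothetical and only loosely modeled on the patch; the write callback
stands in for vfs_iocb_iter_write().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSEGS 3	/* head, middle, tail; mirrors segment[3] in the patch */

/* Hypothetical userspace analogue of struct nfsd_write_dio_seg. */
struct write_seg {
	size_t len;	/* zero means "segment not used" */
	bool use_dio;	/* true if this segment is eligible for direct I/O */
};

struct write_args {
	struct write_seg segment[NSEGS];
};

/* Constructor: every segment is initialized the same way. */
static void seg_init(struct write_seg *seg, size_t len, bool use_dio)
{
	seg->len = len;
	seg->use_dio = use_dio;
}

/* Hypothetical write callback: returns bytes written or a negative error. */
typedef long (*seg_write_fn)(const struct write_seg *seg);

/* Example callback that pretends every byte was written. */
static long write_all(const struct write_seg *seg)
{
	return (long)seg->len;
}

/*
 * Walk the fixed-size array. Empty segments are skipped, so no separate
 * segment count is needed. Stops early on an error or a partial write.
 */
static long issue_segments(const struct write_args *args, seg_write_fn fn,
			   size_t *written)
{
	*written = 0;
	for (size_t i = 0; i < NSEGS; i++) {
		const struct write_seg *seg = &args->segment[i];
		long ret;

		if (seg->len == 0)
			continue;
		ret = fn(seg);
		if (ret < 0)
			return ret;
		*written += (size_t)ret;
		if ((size_t)ret < seg->len)
			break;	/* partial write */
	}
	return 0;
}
```

Indexing the array by `i` on every pass means the `continue` path cannot
leave a separate cursor pointing at the segment that was skipped.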


* Re: [PATCH v7 13/14] NFSD: Clean up direct write fall back error flow
  2025-10-24 14:43 ` [PATCH v7 13/14] NFSD: Clean up direct write fall back error flow Chuck Lever
  2025-10-24 15:32   ` Jeff Layton
@ 2025-10-24 18:01   ` Mike Snitzer
  1 sibling, 0 replies; 87+ messages in thread
From: Mike Snitzer @ 2025-10-24 18:01 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On Fri, Oct 24, 2025 at 10:43:05AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Clean up: Use the usual error flow form:
> 
> 	if (uncommon condition) {
> 		handle it;
> 		return;
> 	}
> 	do common thing;
> 
> in nfsd_direct_write(). Now there is a single place where the direct
> write path falls back to a single cached write.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Definitely cleaner, thanks.

Reviewed-by: Mike Snitzer <snitzer@kernel.org>
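
As an illustration of the error flow form quoted above, here is a minimal
userspace sketch with a single fall-back site. The names `can_use_dio()`,
`do_write()`, and `enum write_path` are hypothetical and do not come from
nfsd_direct_write(); the eligibility test is an assumption made only to
give the early return something to check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes identifying which path handled the write. */
enum write_path { PATH_BUFFERED, PATH_DIRECT };

/*
 * Hypothetical check: direct I/O needs a nonzero, block-aligned length.
 * @blocksize is assumed to be nonzero.
 */
static bool can_use_dio(size_t len, size_t blocksize)
{
	return len != 0 && (len % blocksize) == 0;
}

static enum write_path do_write(size_t len, size_t blocksize)
{
	/* Uncommon condition: handle it and return early. */
	if (!can_use_dio(len, blocksize))
		return PATH_BUFFERED;	/* the single fall-back site */

	/* The common case continues at the function's main indentation. */
	return PATH_DIRECT;
}
```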


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-24 14:43 ` [PATCH v7 14/14] NFSD: Initialize separate ki_flags Chuck Lever
  2025-10-24 15:34   ` Jeff Layton
@ 2025-10-24 18:13   ` Mike Snitzer
  2025-10-24 19:34     ` Chuck Lever
  1 sibling, 1 reply; 87+ messages in thread
From: Mike Snitzer @ 2025-10-24 18:13 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever, Christoph Hellwig

On Fri, Oct 24, 2025 at 10:43:06AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Christoph says:
> > > +	if (file->f_op->fop_flags & FOP_DONTCACHE)
> > > +		kiocb->ki_flags |= IOCB_DONTCACHE;
> > IOCB_DONTCACHE isn't defined for IOCB_DIRECT.  So this should
> > move into a branch just for buffered I/O.
> 
> Instead, let's set up separate ki_flags for buffered I/O and for
> direct I/O requests. Then we don't have to set a jumble of flag
> bits in a single flags field.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 29 ++++++++++++++---------------
>  1 file changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index b0e4105e0075..b7b9f8cf0452 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1262,6 +1262,8 @@ struct nfsd_write_dio_seg {
>  
>  struct nfsd_write_dio_args {
>  	struct nfsd_file		*nf;
> +	int				flags_buffered;
> +	int				flags_direct;
>  	struct nfsd_write_dio_seg	segment[3];
>  };
>  
> @@ -1379,11 +1381,11 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		if (segment->len == 0)
>  			continue;
>  		if (segment->use_dio) {
> -			kiocb->ki_flags |= IOCB_DIRECT;
> +			kiocb->ki_flags = args->flags_direct;
>  			trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
>  						segment->len);
>  		} else
> -			kiocb->ki_flags &= ~IOCB_DIRECT;
> +			kiocb->ki_flags = args->flags_buffered;
>  		host_err = vfs_iocb_iter_write(file, kiocb, &segment->iter);
>  		if (host_err < 0)
>  			return host_err;
> @@ -1405,30 +1407,27 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	struct nfsd_write_dio_args args;
>  
>  	/*
> -	 * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
> -	 * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
> -	 * be ignored for any DIO issued here).
> +	 * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
> +	 * writing unaligned segments or handling fallback I/O.
>  	 */
> +	args.flags_buffered = kiocb->ki_flags | IOCB_SYNC | IOCB_DSYNC;
>  	if (file->f_op->fop_flags & FOP_DONTCACHE)
> -		kiocb->ki_flags |= IOCB_DONTCACHE;
> +		args.flags_buffered |= IOCB_DONTCACHE;
>  
>  	/*
> -	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should persist
> -	 * both written data and dirty time stamps.
> -	 *
> -	 * When falling back to buffered I/O or handling the unaligned
> -	 * first and last segments, the data and time stamps must be
> -	 * durable before nfsd_vfs_write() returns to its caller, matching
> -	 * the behavior of direct I/O.
> +	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should
> +	 * persist both written data and dirty time stamps.
>  	 */
> -	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
> +	args.flags_direct = kiocb->ki_flags | IOCB_SYNC | IOCB_DIRECT;

AFAIK we still need: IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC

IOCB_DIRECT | IOCB_DSYNC was recently put under a microscope relative
to XFS performance and the resulting improvement was merged for 6.18
with commit c91d38b57f ("xfs: rework datasync tracking and execution")

Otherwise, everything else looks good, thanks.

Reviewed-by: Mike Snitzer <snitzer@kernel.org>
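
As an illustration of keeping separate flag sets for buffered and direct
I/O, here is a minimal userspace sketch. The `X_*` bits are hypothetical
and are NOT the kernel's IOCB_* values; `struct write_flags` and
`build_write_flags()` are likewise made up for this example. The direct
set includes the data-sync bit, following the suggestion made in this
reply rather than the patch as posted.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration; NOT the kernel's IOCB_* values. */
#define X_DSYNC		(1u << 0)	/* persist written data */
#define X_SYNC		(1u << 1)	/* also persist timestamps */
#define X_DIRECT	(1u << 2)	/* bypass the page cache */
#define X_DONTCACHE	(1u << 3)	/* buffered-only cache hint */

struct write_flags {
	unsigned int buffered;
	unsigned int direct;
};

static struct write_flags build_write_flags(unsigned int base,
					    int fs_supports_dontcache)
{
	struct write_flags f;

	/* Buffered path: data and timestamps must be durable on return. */
	f.buffered = base | X_SYNC | X_DSYNC;
	if (fs_supports_dontcache)
		f.buffered |= X_DONTCACHE;	/* never set on the direct set */

	/* Direct path: X_DSYNC is included per the review suggestion. */
	f.direct = base | X_DIRECT | X_SYNC | X_DSYNC;
	return f;
}
```

Building both sets once up front means the per-segment loop only has to
assign one of them, instead of setting and clearing individual bits.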


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-24 18:13   ` Mike Snitzer
@ 2025-10-24 19:34     ` Chuck Lever
  2025-10-24 20:37       ` Mike Snitzer
  2025-10-27  8:12       ` Christoph Hellwig
  0 siblings, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 19:34 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever, Christoph Hellwig

On 10/24/25 2:13 PM, Mike Snitzer wrote:
>>  	/*
>> -	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should persist
>> -	 * both written data and dirty time stamps.
>> -	 *
>> -	 * When falling back to buffered I/O or handling the unaligned
>> -	 * first and last segments, the data and time stamps must be
>> -	 * durable before nfsd_vfs_write() returns to its caller, matching
>> -	 * the behavior of direct I/O.
>> +	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should
>> +	 * persist both written data and dirty time stamps.
>>  	 */
>> -	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
>> +	args.flags_direct = kiocb->ki_flags | IOCB_SYNC | IOCB_DIRECT;
> AFAIK we still need: IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC
> 
> IOCB_DIRECT | IOCB_DSYNC was recently put under a microscope relative
> to XFS performance and the resulting improvement was merged for 6.18
> with commit c91d38b57f ("xfs: rework datasync tracking and execution")

This looks like an xfs-specific fix. I'm reluctant to apply a fix for
a specific file system implementation in what's supposed to be generic
code.

If (IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC) /is/ correct for all file
systems, then it needs an explanatory code comment, which I'm not yet
qualified to write. I don't see any textual material in previous
incarnations of this code that might help get me started.


-- 
Chuck Lever


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-24 19:34     ` Chuck Lever
@ 2025-10-24 20:37       ` Mike Snitzer
  2025-10-24 21:16         ` Chuck Lever
  2025-10-27  8:14         ` Christoph Hellwig
  2025-10-27  8:12       ` Christoph Hellwig
  1 sibling, 2 replies; 87+ messages in thread
From: Mike Snitzer @ 2025-10-24 20:37 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever, Christoph Hellwig

On Fri, Oct 24, 2025 at 03:34:00PM -0400, Chuck Lever wrote:
> On 10/24/25 2:13 PM, Mike Snitzer wrote:
> >>  	/*
> >> -	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should persist
> >> -	 * both written data and dirty time stamps.
> >> -	 *
> >> -	 * When falling back to buffered I/O or handling the unaligned
> >> -	 * first and last segments, the data and time stamps must be
> >> -	 * durable before nfsd_vfs_write() returns to its caller, matching
> >> -	 * the behavior of direct I/O.
> >> +	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should
> >> +	 * persist both written data and dirty time stamps.
> >>  	 */
> >> -	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
> >> +	args.flags_direct = kiocb->ki_flags | IOCB_SYNC | IOCB_DIRECT;
> > AFAIK we still need: IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC
> > 
> > IOCB_DIRECT | IOCB_DSYNC was recently put under a microscope relative
> > to XFS performance and the resulting improvement was merged for 6.18
> > with commit c91d38b57f ("xfs: rework datasync tracking and execution")
> 
> This looks like an xfs-specific fix. I'm reluctant to apply a fix for
> a specific file system implementation in what's supposed to be generic
> code.
> 
> If (IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC) /is/ correct for all file
> systems, then it needs an explanatory code comment, which I'm not yet
> qualified to write. I don't see any textual material in previous
> incarnations of this code that might help get me started.

The XFS-specific performance improvement isn't the point.  The point
is that applications (I think DB2 is what started all this with
Jan Kara and the XFS filesystem) result in the use of
O_DIRECT+O_DSYNC.

It is a clear reality that other filesystems are catering to
O_DIRECT+O_DSYNC. And given our findings with Christoph that buffered
IO needs O_DSYNC+O_SYNC, I'd rather we not expose ourselves to not
having O_DSYNC set.

Particularly because any filesystem NFSD is writing to _can_ also
fall back to using buffered IO if O_DIRECT is set (NFSD is doing exactly
that). And we _know_ (from Christoph) that having O_DSYNC set is
important when we fall back to using buffered IO (like we do for the
misaligned head and/or tail).

Please let's not make the same mistake twice.


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-24 20:37       ` Mike Snitzer
@ 2025-10-24 21:16         ` Chuck Lever
  2025-10-24 23:56           ` Mike Snitzer
  2025-10-27  8:14         ` Christoph Hellwig
  1 sibling, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-24 21:16 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever, Christoph Hellwig

On 10/24/25 4:37 PM, Mike Snitzer wrote:
> On Fri, Oct 24, 2025 at 03:34:00PM -0400, Chuck Lever wrote:
>> On 10/24/25 2:13 PM, Mike Snitzer wrote:
>>>>  	/*
>>>> -	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should persist
>>>> -	 * both written data and dirty time stamps.
>>>> -	 *
>>>> -	 * When falling back to buffered I/O or handling the unaligned
>>>> -	 * first and last segments, the data and time stamps must be
>>>> -	 * durable before nfsd_vfs_write() returns to its caller, matching
>>>> -	 * the behavior of direct I/O.
>>>> +	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should
>>>> +	 * persist both written data and dirty time stamps.
>>>>  	 */
>>>> -	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
>>>> +	args.flags_direct = kiocb->ki_flags | IOCB_SYNC | IOCB_DIRECT;
>>> AFAIK we still need: IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC
>>>
>>> IOCB_DIRECT | IOCB_DSYNC was recently put under a microscope relative
>>> to XFS performance and the resulting improvement was merged for 6.18
>>> with commit c91d38b57f ("xfs: rework datasync tracking and execution")
>>
>> This looks like an xfs-specific fix. I'm reluctant to apply a fix for
>> a specific file system implementation in what's supposed to be generic
>> code.
>>
>> If (IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC) /is/ correct for all file
>> systems, then it needs an explanatory code comment, which I'm not yet
>> qualified to write. I don't see any textual material in previous
>> incarnations of this code that might help get me started.
> 
> The XFS-specific performance improvement isn't the point.  The point
> is that applications (I think DB2 is what started all this with
> Jan Kara and the XFS filesystem) result in the use of
> O_DIRECT+O_DSYNC.
> 
> It is a clear reality that other filesystems are catering to
> O_DIRECT+O_DSYNC. And given our findings with Christoph that buffered
> IO needs O_DSYNC+O_SYNC, I'd rather we not expose ourselves to not
> having O_DSYNC set.
> 
> Particularly because any filesystem NFSD is writing to _can_ also
> fall back to using buffered IO if O_DIRECT is set (NFSD is doing exactly
> that). And we _know_ (from Christoph) that having O_DSYNC set is
> important when we fall back to using buffered IO (like we do for the
> misaligned head and/or tail).
> 
> Please let's not make the same mistake twice.

To be clear, I'm not refusing to add IOCB_DSYNC with IOCB_DIRECT, I'm
just confused about why it is necessary.

Direct and buffered I/O in the direct write path now each have their own
set of ki_flags. The ki_flags used for buffered writes has SYNC and
DSYNC set. So, for fallback I/O and for the unaligned segments of the
buffer, both flags are set. I think we are in agreement on that.

You might be referring above to this email from Christoph:

> > I think IOCB_SYNC would be needed with O_DIRECT to force timestamp
> > updates. Otherwise, IOCB_SYNC is relevant only when the function is
> > forced to fall back to some form of write through the page cache.
>
> Well, IOCB_SYNC is only needed to commit timestamps.  O_DSYNC is
> always required if you want to commit to stable storage.  As said
> above I don't really understand from the patch why we want to do
> that, but IFF we want to do that, we need IOCB_DSYNC both for
> direct and buffered I/O.

He says "we need IOCB_DSYNC both... for direct and buffered I/O". Fair
enough, but why does IOCB_DIRECT, which is essentially a synchronous
write already, need to explicitly set IOCB_DSYNC? All I want is
something I can distill into a code comment. "Force a FUA after each
direct write" or something like that.

I'm really surprised that IOCB_DIRECT does not imply IOCB_DSYNC, and
there doesn't seem to be any clear documentation about the semantics
of these flags. Thus I believe a code comment here is warranted.


-- 
Chuck Lever
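
As a userspace illustration of the point raised in this message: open(2)
documents that O_DIRECT on its own does not give the durability
guarantees of the synchronous-write flags, so an application that wants
both combines them explicitly. The sketch below is an assumption-laden
example, not NFSD code; `open_durable_direct()` is a hypothetical helper,
and the retry without O_DIRECT assumes the file system reports EINVAL
when it cannot do direct I/O.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* exposes O_DIRECT on Linux/glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0	/* fallback so the sketch compiles without O_DIRECT */
#endif

/*
 * Open @path for writes that bypass the page cache where possible and
 * that are durable when write() returns. O_DSYNC is requested in both
 * attempts: with O_DIRECT it asks for the data to reach stable storage,
 * and without O_DIRECT it makes the buffered write synchronous.
 */
static int open_durable_direct(const char *path)
{
	int fd;

	fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0600);
	if (fd < 0 && errno == EINVAL)
		/* Direct I/O rejected: keep O_DSYNC for buffered I/O. */
		fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);
	return fd;
}
```

Writes issued on a descriptor opened with O_DIRECT still need suitably
aligned buffers, offsets, and lengths; that part is outside this sketch.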


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-24 21:16         ` Chuck Lever
@ 2025-10-24 23:56           ` Mike Snitzer
  2025-10-27  8:15             ` Christoph Hellwig
  0 siblings, 1 reply; 87+ messages in thread
From: Mike Snitzer @ 2025-10-24 23:56 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever, Christoph Hellwig

On Fri, Oct 24, 2025 at 05:16:52PM -0400, Chuck Lever wrote:
> On 10/24/25 4:37 PM, Mike Snitzer wrote:
> > On Fri, Oct 24, 2025 at 03:34:00PM -0400, Chuck Lever wrote:
> >> On 10/24/25 2:13 PM, Mike Snitzer wrote:
> >>>>  	/*
> >>>> -	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should persist
> >>>> -	 * both written data and dirty time stamps.
> >>>> -	 *
> >>>> -	 * When falling back to buffered I/O or handling the unaligned
> >>>> -	 * first and last segments, the data and time stamps must be
> >>>> -	 * durable before nfsd_vfs_write() returns to its caller, matching
> >>>> -	 * the behavior of direct I/O.
> >>>> +	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should
> >>>> +	 * persist both written data and dirty time stamps.
> >>>>  	 */
> >>>> -	kiocb->ki_flags |= IOCB_SYNC | IOCB_DSYNC;
> >>>> +	args.flags_direct = kiocb->ki_flags | IOCB_SYNC | IOCB_DIRECT;
> >>> AFAIK we still need: IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC
> >>>
> >>> IOCB_DIRECT | IOCB_DSYNC was recently put under a microscope relative
> >>> to XFS performance and the resulting improvement was merged for 6.18
> >>> with commit c91d38b57f ("xfs: rework datasync tracking and execution")
> >>
> >> This looks like an xfs-specific fix. I'm reluctant to apply a fix for
> >> a specific file system implementation in what's supposed to be generic
> >> code.
> >>
> >> If (IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC) /is/ correct for all file
> >> systems, then it needs an explanatory code comment, which I'm not yet
> >> qualified to write. I don't see any textual material in previous
> >> incarnations of this code that might help get me started.
> > 
> > The XFS-specific performance improvement isn't the point.  The point
> > is that applications (I think DB2 is what started all this with
> > Jan Kara and the XFS filesystem) result in the use of
> > O_DIRECT+O_DSYNC.
> > 
> > It is a clear reality that other filesystems are catering to
> > O_DIRECT+O_DSYNC. And given our findings with Christoph that buffered
> > IO needs O_DSYNC+O_SYNC, I'd rather we not expose ourselves to not
> > having O_DSYNC set.
> > 
> > Particularly because any filesystem NFSD is writing to _can_ also
> > fall back to using buffered IO if O_DIRECT is set (NFSD is doing exactly
> > that). And we _know_ (from Christoph) that having O_DSYNC set is
> > important when we fall back to using buffered IO (like we do for the
> > misaligned head and/or tail).
> > 
> > Please let's not make the same mistake twice.
> 
> To be clear, I'm not refusing to add IOCB_DSYNC with IOCB_DIRECT, I'm
> just confused about why it is necessary.
> 
> Direct and buffered I/O in the direct write path now each have their own
> set of ki_flags. The ki_flags used for buffered writes has SYNC and
> DSYNC set. So, for fallback I/O and for the unaligned segments of the
> buffer, both flags are set. I think we are in agreement on that.
> 
> You might be referring above to this email from Christoph:
> 
> > > I think IOCB_SYNC would be needed with O_DIRECT to force timestamp
> > > updates. Otherwise, IOCB_SYNC is relevant only when the function is
> > > forced to fall back to some form of write through the page cache.
> >
> > Well, IOCB_SYNC is only needed to commit timestamps.  O_DSYNC is
> > always required if you want to commit to stable storage.  As said
> > above I don't really understand from the patch why we want to do
> > that, but IFF we want to do that, we need IOCB_DSYNC both for
> > direct and buffered I/O.
> 
> He says "we need IOCB_DSYNC both... for direct and buffered I/O". Fair
> enough, but why does IOCB_DIRECT, which is essentially a synchronous
> write already, need to explicitly set IOCB_DSYNC? All I want is
> something I can distill into a code comment. "Force a FUA after each
> direct write" or something like that.

Christoph said here:
https://lore.kernel.org/linux-nfs/aPnChLocfNsu_UN7@infradead.org/

"Well, IOCB_SYNC is only needed to commit timestamps.  O_DSYNC is
always required if you want to commit to stable storage."

^ This should be all you need to say?

Why/how that is also the semantics of O_DIRECT is lost on me at the
moment...

Though this message from Christoph gets into why/how things got less
than ideal due to Linux's historic handling of these flags:
https://lore.kernel.org/linux-nfs/aPnBIGeFjrZLbxBG@infradead.org/

Christoph admits "this is all a bit odd".

> I'm really surprised that IOCB_DIRECT does not imply IOCB_DSYNC, and
> there doesn't seem to be any clear documentation about the semantics
> of these flags. Thus I believe a code comment here is warranted.

I'm with you, it is all "clear as mud" and unintuitive for me too.
But I'm just playing the cards dealt.

Really, even ignoring all the quirkiness of this: that O_DIRECT can
fall back to buffered, and we need IOCB_DSYNC|IOCB_SYNC for our use of
buffered IO when NFSD_IO_DIRECT is configured to ensure data has hit
stable storage -- that's enough justification.  A bit circular, but
compelling enough to prove the need, albeit wordy and a lot to unpack.

Mike


* Re: [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-24 14:42 ` [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-10-24 17:12   ` Mike Snitzer
@ 2025-10-26  0:03   ` kernel test robot
  2025-10-26  1:16   ` kernel test robot
  2 siblings, 0 replies; 87+ messages in thread
From: kernel test robot @ 2025-10-26  0:03 UTC (permalink / raw)
  To: Chuck Lever; +Cc: oe-kbuild-all

Hi Chuck,

kernel test robot noticed the following build errors:

[auto build test ERROR on v6.18-rc2]
[also build test ERROR on linus/master next-20251024]
[cannot apply to brauner-vfs/vfs.all]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Chuck-Lever/NFSD-Make-FILE_SYNC-WRITEs-comply-with-spec/20251024-224529
base:   v6.18-rc2
patch link:    https://lore.kernel.org/r/20251024144306.35652-5-cel%40kernel.org
patch subject: [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
config: parisc-defconfig (https://download.01.org/0day-ci/archive/20251026/202510260750.LeqZZZx8-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251026/202510260750.LeqZZZx8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510260750.LeqZZZx8-lkp@intel.com/

All errors (new ones prefixed by >>):

   fs/nfsd/vfs.c: In function 'nfsd_vfs_write':
>> fs/nfsd/vfs.c:1455:14: error: 'NFSD_IO_DIRECT' undeclared (first use in this function)
    1455 |         case NFSD_IO_DIRECT:
         |              ^~~~~~~~~~~~~~
   fs/nfsd/vfs.c:1455:14: note: each undeclared identifier is reported only once for each function it appears in
--
   fs/nfsd/debugfs.c: In function 'nfsd_io_cache_write_set':
>> fs/nfsd/debugfs.c:109:14: error: 'NFSD_IO_DIRECT' undeclared (first use in this function)
     109 |         case NFSD_IO_DIRECT:
         |              ^~~~~~~~~~~~~~
   fs/nfsd/debugfs.c:109:14: note: each undeclared identifier is reported only once for each function it appears in


vim +/NFSD_IO_DIRECT +1455 fs/nfsd/vfs.c

  1378	
  1379	/**
  1380	 * nfsd_vfs_write - write data to an already-open file
  1381	 * @rqstp: RPC execution context
  1382	 * @fhp: File handle of file to write into
  1383	 * @nf: An open file matching @fhp
  1384	 * @offset: Byte offset of start
  1385	 * @payload: xdr_buf containing the write payload
  1386	 * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
  1387	 * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
  1388	 * @verf: NFS WRITE verifier
  1389	 *
  1390	 * Upon return, caller must invoke fh_put on @fhp.
  1391	 *
  1392	 * Return values:
  1393	 *   An nfsstat value in network byte order.
  1394	 */
  1395	__be32
  1396	nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
  1397		       struct nfsd_file *nf, loff_t offset,
  1398		       const struct xdr_buf *payload, unsigned long *cnt,
  1399		       u32 *stable_how, __be32 *verf)
  1400	{
  1401		struct nfsd_net		*nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
  1402		struct file		*file = nf->nf_file;
  1403		struct super_block	*sb = file_inode(file)->i_sb;
  1404		u32			stable = *stable_how;
  1405		struct kiocb		kiocb;
  1406		struct svc_export	*exp;
  1407		errseq_t		since;
  1408		__be32			nfserr;
  1409		int			host_err;
  1410		unsigned long		exp_op_flags = 0;
  1411		unsigned int		pflags = current->flags;
  1412		bool			restore_flags = false;
  1413		unsigned int		nvecs;
  1414	
  1415		trace_nfsd_write_opened(rqstp, fhp, offset, *cnt);
  1416	
  1417		if (sb->s_export_op)
  1418			exp_op_flags = sb->s_export_op->flags;
  1419	
  1420		if (test_bit(RQ_LOCAL, &rqstp->rq_flags) &&
  1421		    !(exp_op_flags & EXPORT_OP_REMOTE_FS)) {
  1422			/*
  1423			 * We want throttling in balance_dirty_pages()
  1424			 * and shrink_inactive_list() to only consider
  1425			 * the backingdev we are writing to, so that nfs to
  1426			 * localhost doesn't cause nfsd to lock up due to all
  1427			 * the client's dirty pages or its congested queue.
  1428			 */
  1429			current->flags |= PF_LOCAL_THROTTLE;
  1430			restore_flags = true;
  1431		}
  1432	
  1433		exp = fhp->fh_export;
  1434	
  1435		if (!EX_ISSYNC(exp))
  1436			stable = NFS_UNSTABLE;
  1437		init_sync_kiocb(&kiocb, file);
  1438		kiocb.ki_pos = offset;
  1439		if (stable && !fhp->fh_use_wgather) {
  1440			if (stable == NFS_FILE_SYNC)
  1441				/* persist data and timestamps */
  1442				kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
  1443			else
  1444				/* persist data only */
  1445				kiocb.ki_flags |= IOCB_DSYNC;
  1446		}
  1447	
  1448		nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
  1449	
  1450		since = READ_ONCE(file->f_wb_err);
  1451		if (verf)
  1452			nfsd_copy_write_verifier(verf, nn);
  1453	
  1454		switch (nfsd_io_cache_write) {
> 1455		case NFSD_IO_DIRECT:
  1456			host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
  1457						     nvecs, cnt, &kiocb);
  1458			stable = *stable_how;
  1459			break;
  1460		case NFSD_IO_DONTCACHE:
  1461			if (file->f_op->fop_flags & FOP_DONTCACHE)
  1462				kiocb.ki_flags |= IOCB_DONTCACHE;
  1463			fallthrough;
  1464		case NFSD_IO_BUFFERED:
  1465			host_err = nfsd_iocb_write(file, rqstp->rq_bvec, nvecs, cnt,
  1466						   &kiocb);
  1467			break;
  1468		}
  1469		if (host_err < 0) {
  1470			commit_reset_write_verifier(nn, rqstp, host_err);
  1471			goto out_nfserr;
  1472		}
  1473		nfsd_stats_io_write_add(nn, exp, *cnt);
  1474		fsnotify_modify(file);
  1475		host_err = filemap_check_wb_err(file->f_mapping, since);
  1476		if (host_err < 0)
  1477			goto out_nfserr;
  1478	
  1479		if (stable && fhp->fh_use_wgather) {
  1480			host_err = wait_for_concurrent_writes(file);
  1481			if (host_err < 0)
  1482				commit_reset_write_verifier(nn, rqstp, host_err);
  1483		}
  1484	
  1485	out_nfserr:
  1486		if (host_err >= 0) {
  1487			trace_nfsd_write_io_done(rqstp, fhp, offset, *cnt);
  1488			nfserr = nfs_ok;
  1489		} else {
  1490			trace_nfsd_write_err(rqstp, fhp, offset, host_err);
  1491			nfserr = nfserrno(host_err);
  1492		}
  1493		if (restore_flags)
  1494			current_restore_flags(pflags, PF_LOCAL_THROTTLE);
  1495		return nfserr;
  1496	}
  1497	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-24 14:42 ` [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-10-24 17:12   ` Mike Snitzer
  2025-10-26  0:03   ` kernel test robot
@ 2025-10-26  1:16   ` kernel test robot
  2 siblings, 0 replies; 87+ messages in thread
From: kernel test robot @ 2025-10-26  1:16 UTC (permalink / raw)
  To: Chuck Lever; +Cc: llvm, oe-kbuild-all

Hi Chuck,

kernel test robot noticed the following build errors:

[auto build test ERROR on v6.18-rc2]
[also build test ERROR on linus/master next-20251024]
[cannot apply to brauner-vfs/vfs.all]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Chuck-Lever/NFSD-Make-FILE_SYNC-WRITEs-comply-with-spec/20251024-224529
base:   v6.18-rc2
patch link:    https://lore.kernel.org/r/20251024144306.35652-5-cel%40kernel.org
patch subject: [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20251026/202510260920.HUFBtu3X-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251026/202510260920.HUFBtu3X-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510260920.HUFBtu3X-lkp@intel.com/

All errors (new ones prefixed by >>):

>> fs/nfsd/vfs.c:1455:7: error: use of undeclared identifier 'NFSD_IO_DIRECT'
    1455 |         case NFSD_IO_DIRECT:
         |              ^
   1 error generated.
--
>> fs/nfsd/debugfs.c:109:7: error: use of undeclared identifier 'NFSD_IO_DIRECT'
     109 |         case NFSD_IO_DIRECT:
         |              ^
   1 error generated.


vim +/NFSD_IO_DIRECT +1455 fs/nfsd/vfs.c

  1378	
  1379	/**
  1380	 * nfsd_vfs_write - write data to an already-open file
  1381	 * @rqstp: RPC execution context
  1382	 * @fhp: File handle of file to write into
  1383	 * @nf: An open file matching @fhp
  1384	 * @offset: Byte offset of start
  1385	 * @payload: xdr_buf containing the write payload
  1386	 * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
  1387	 * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
  1388	 * @verf: NFS WRITE verifier
  1389	 *
  1390	 * Upon return, caller must invoke fh_put on @fhp.
  1391	 *
  1392	 * Return values:
  1393	 *   An nfsstat value in network byte order.
  1394	 */
  1395	__be32
  1396	nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
  1397		       struct nfsd_file *nf, loff_t offset,
  1398		       const struct xdr_buf *payload, unsigned long *cnt,
  1399		       u32 *stable_how, __be32 *verf)
  1400	{
  1401		struct nfsd_net		*nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
  1402		struct file		*file = nf->nf_file;
  1403		struct super_block	*sb = file_inode(file)->i_sb;
  1404		u32			stable = *stable_how;
  1405		struct kiocb		kiocb;
  1406		struct svc_export	*exp;
  1407		errseq_t		since;
  1408		__be32			nfserr;
  1409		int			host_err;
  1410		unsigned long		exp_op_flags = 0;
  1411		unsigned int		pflags = current->flags;
  1412		bool			restore_flags = false;
  1413		unsigned int		nvecs;
  1414	
  1415		trace_nfsd_write_opened(rqstp, fhp, offset, *cnt);
  1416	
  1417		if (sb->s_export_op)
  1418			exp_op_flags = sb->s_export_op->flags;
  1419	
  1420		if (test_bit(RQ_LOCAL, &rqstp->rq_flags) &&
  1421		    !(exp_op_flags & EXPORT_OP_REMOTE_FS)) {
  1422			/*
  1423			 * We want throttling in balance_dirty_pages()
  1424			 * and shrink_inactive_list() to only consider
  1425			 * the backingdev we are writing to, so that nfs to
  1426			 * localhost doesn't cause nfsd to lock up due to all
  1427			 * the client's dirty pages or its congested queue.
  1428			 */
  1429			current->flags |= PF_LOCAL_THROTTLE;
  1430			restore_flags = true;
  1431		}
  1432	
  1433		exp = fhp->fh_export;
  1434	
  1435		if (!EX_ISSYNC(exp))
  1436			stable = NFS_UNSTABLE;
  1437		init_sync_kiocb(&kiocb, file);
  1438		kiocb.ki_pos = offset;
  1439		if (stable && !fhp->fh_use_wgather) {
  1440			if (stable == NFS_FILE_SYNC)
  1441				/* persist data and timestamps */
  1442				kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
  1443			else
  1444				/* persist data only */
  1445				kiocb.ki_flags |= IOCB_DSYNC;
  1446		}
  1447	
  1448		nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
  1449	
  1450		since = READ_ONCE(file->f_wb_err);
  1451		if (verf)
  1452			nfsd_copy_write_verifier(verf, nn);
  1453	
  1454		switch (nfsd_io_cache_write) {
> 1455		case NFSD_IO_DIRECT:
  1456			host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
  1457						     nvecs, cnt, &kiocb);
  1458			stable = *stable_how;
  1459			break;
  1460		case NFSD_IO_DONTCACHE:
  1461			if (file->f_op->fop_flags & FOP_DONTCACHE)
  1462				kiocb.ki_flags |= IOCB_DONTCACHE;
  1463			fallthrough;
  1464		case NFSD_IO_BUFFERED:
  1465			host_err = nfsd_iocb_write(file, rqstp->rq_bvec, nvecs, cnt,
  1466						   &kiocb);
  1467			break;
  1468		}
  1469		if (host_err < 0) {
  1470			commit_reset_write_verifier(nn, rqstp, host_err);
  1471			goto out_nfserr;
  1472		}
  1473		nfsd_stats_io_write_add(nn, exp, *cnt);
  1474		fsnotify_modify(file);
  1475		host_err = filemap_check_wb_err(file->f_mapping, since);
  1476		if (host_err < 0)
  1477			goto out_nfserr;
  1478	
  1479		if (stable && fhp->fh_use_wgather) {
  1480			host_err = wait_for_concurrent_writes(file);
  1481			if (host_err < 0)
  1482				commit_reset_write_verifier(nn, rqstp, host_err);
  1483		}
  1484	
  1485	out_nfserr:
  1486		if (host_err >= 0) {
  1487			trace_nfsd_write_io_done(rqstp, fhp, offset, *cnt);
  1488			nfserr = nfs_ok;
  1489		} else {
  1490			trace_nfsd_write_err(rqstp, fhp, offset, host_err);
  1491			nfserr = nfserrno(host_err);
  1492		}
  1493		if (restore_flags)
  1494			current_restore_flags(pflags, PF_LOCAL_THROTTLE);
  1495		return nfserr;
  1496	}
  1497	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 01/14] NFSD: Make FILE_SYNC WRITEs comply with spec
  2025-10-24 14:42 ` [PATCH v7 01/14] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
  2025-10-24 15:21   ` Jeff Layton
@ 2025-10-27  8:02   ` Christoph Hellwig
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27  8:02 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever, Mike Snitzer

On Fri, Oct 24, 2025 at 10:42:53AM -0400, Chuck Lever wrote:
> For many years, NFSD has used a "data sync only" optimization for
> FILE_SYNC WRITEs, in violation of the above text (and previous
> incarnations of the NFS standard). File time stamps haven't been
> persisted as the mandate above requires.

Haven't been forced to be persisted.  Because for most (all?) Linux
file systems timestamps (and other metadata not needed to find the
data) are piggybacked on fdatasync-mandatory metadata updates, this
would only affect the case where previously fully allocated file data
is overwritten.  I think that's important to note here.

> The purpose of this behavior is that, back in the day, file systems
> on rotational media were too slow to handle writes with time stamp
> updates.

I really don't think that is true.  Doing an extra roundtrip is just
as (relatively) expensive on SSDs.  I also don't see any argument for
that in the commit history.  As far as I can tell this simply was
an oversight.

> The impact of this change will be felt only when a client explicitly
> requests a FILE_SYNC WRITE on a shared file system backed by slow
> storage. UNSTABLE and DATA_SYNC WRITEs should not be affected.

Again, nothing about slow storage.  Optimizing non-overwriting writes
using the FUA bit in Linux was specifically done for SSD storage.

> @@ -1314,8 +1314,14 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		stable = NFS_UNSTABLE;
>  	init_sync_kiocb(&kiocb, file);
>  	kiocb.ki_pos = offset;
> -	if (stable && !fhp->fh_use_wgather)
> -		kiocb.ki_flags |= IOCB_DSYNC;
> +	if (stable && !fhp->fh_use_wgather) {
> +		if (stable == NFS_FILE_SYNC)
> +			/* persist data and timestamps */
> +			kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
> +		else
> +			/* persist data only */
> +			kiocb.ki_flags |= IOCB_DSYNC;
> +	}

Maybe use a switch statement here to enumerate the valid cases?
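
One possible shape for such a switch, as a standalone userspace sketch
rather than a proposed hunk: the stable_how values follow the NFS
protocol numbering, but the IOCB_* values below are stand-ins chosen for
illustration (not the kernel's definitions), and write_sync_flags() is a
hypothetical helper that does not exist in fs/nfsd/vfs.c.

```c
#include <assert.h>

/* stable_how values as numbered by the NFS protocol. */
enum { NFS_UNSTABLE = 0, NFS_DATA_SYNC = 1, NFS_FILE_SYNC = 2 };

/* Stand-in flag values for illustration only; the kernel defines the real ones. */
#define IOCB_DSYNC 0x1u
#define IOCB_SYNC  0x2u

/*
 * Hypothetical helper: map the effective stable_how to kiocb sync flags.
 * When write gathering is in use the sync is deferred, so no flags are set.
 */
static unsigned int write_sync_flags(unsigned int stable, int use_wgather)
{
	assert(stable <= NFS_FILE_SYNC);
	if (use_wgather)
		return 0;

	switch (stable) {
	case NFS_FILE_SYNC:
		/* persist data and timestamps */
		return IOCB_DSYNC | IOCB_SYNC;
	case NFS_DATA_SYNC:
		/* persist data only */
		return IOCB_DSYNC;
	case NFS_UNSTABLE:
	default:
		return 0;
	}
}
```

Enumerating the three cases makes it visible that NFS_UNSTABLE sets no
sync flags, which the if/else form only implies.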

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 02/14] NFSD: Enable return of an updated stable_how to NFS clients
  2025-10-24 14:42 ` [PATCH v7 02/14] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
@ 2025-10-27  8:03   ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27  8:03 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On Fri, Oct 24, 2025 at 10:42:54AM -0400, Chuck Lever wrote:
> This patch prepares for that change by enabling the modified
> stable_how value to be passed along to NFSD's WRITE reply encoder.
> No behavior change is expected.
> 

Thanks for the detailed commit message!

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 03/14] NFSD: Refactor nfsd_vfs_write()
  2025-10-24 14:42 ` [PATCH v7 03/14] NFSD: Refactor nfsd_vfs_write() Chuck Lever
@ 2025-10-27  8:04   ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27  8:04 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Mike Snitzer

On Fri, Oct 24, 2025 at 10:42:55AM -0400, Chuck Lever wrote:
> From: Mike Snitzer <snitzer@kernel.org>
> 
> Extract the common code that is to be used in the buffered and
> dontcache I/O modes. This common code will also be used as the
> fallback when direct I/O is requested but cannot be used.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC
  2025-10-24 14:42 ` [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC Chuck Lever
  2025-10-24 15:22   ` Jeff Layton
@ 2025-10-27  8:05   ` Christoph Hellwig
  2025-10-27 13:23     ` Chuck Lever
  1 sibling, 1 reply; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27  8:05 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On Fri, Oct 24, 2025 at 10:42:57AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Clean up: The helpers in the nfsd_direct_write() code path don't set
> stable_how to anything else but NFS_FILE_SYNC. All data writes in
> this code path result in immediate durability.

No doubting the statement of fact for the current patch set, but this
is probably a bad idea.  Direct I/O still has to flush caches on devices
with a volatile write cache (aka consumer grade SSDs), and it still has
to commit a transaction to record metadata changes for most writes.
Being able to batch these in a commit is a good idea even for direct
I/O.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path
  2025-10-24 14:42 ` [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path Chuck Lever
  2025-10-24 15:22   ` Jeff Layton
@ 2025-10-27  8:08   ` Christoph Hellwig
  2025-10-27 10:38     ` Jeff Layton
  1 sibling, 1 reply; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27  8:08 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On Fri, Oct 24, 2025 at 10:42:58AM -0400, Chuck Lever wrote:
> +	/*
> +	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should persist
> +	 * both written data and dirty time stamps.
> +	 *
> +	 * When falling back to buffered I/O or handling the unaligned
> +	 * first and last segments, the data and time stamps must be
> +	 * durable before nfsd_vfs_write() returns to its caller, matching
> +	 * the behavior of direct I/O.

I still haven't understood why we need to force sync writes with direct
I/O.  The comments suggest it has something to do with the buffered write
fallback, but even for that I don't really understand it. But for pure
direct I/O writes that are properly aligned there definitively should be
no need.  If there is a need for the fallback we really need to explain
it, as it's non-obvious and a performance issue.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 08/14] NFSD: Remove alignment size checking
  2025-10-24 14:43 ` [PATCH v7 08/14] NFSD: Remove alignment size checking Chuck Lever
  2025-10-24 15:22   ` Jeff Layton
@ 2025-10-27  8:09   ` Christoph Hellwig
  2025-10-27 13:25     ` Chuck Lever
  1 sibling, 1 reply; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27  8:09 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever, Christoph Hellwig

On Fri, Oct 24, 2025 at 10:43:00AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> The current set of in-tree file systems do not support alignments
> larger than a PAGE, so this check is unnecessary.

XFS does, although you won't find production hardware with > 4k
blocks as far as I can tell.  The reason to drop this check was
to not arbitrarily exclude them for no reason.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-24 19:34     ` Chuck Lever
  2025-10-24 20:37       ` Mike Snitzer
@ 2025-10-27  8:12       ` Christoph Hellwig
  2025-10-27 13:27         ` Chuck Lever
  2025-10-27 14:11         ` Chuck Lever
  1 sibling, 2 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27  8:12 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Mike Snitzer, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On Fri, Oct 24, 2025 at 03:34:00PM -0400, Chuck Lever wrote:
> This looks like an xfs-specific fix. I'm reluctant to apply a fix for
> a specific file system implementation in what's supposed to be generic
> code.

It's not a fix, it is a performance optimization.

> If (IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC) /is/ correct for all file
> systems, then it needs an explanatory code comment, which I'm not yet
> qualified to write. I don't see any textual material in previous
> incarnations of this code that might help get me started.

IOCB_SYNC always needs IOCB_DSYNC as I explained three times now,
including a detailed analysis of all users (We really need to rename
IOCB_SYNC to __IOCB_SYNC to match __O_SYNC to make this more obvious
I guess..)  I still don't understand why we need sync behavior and
forced stable writes at all, though.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-24 20:37       ` Mike Snitzer
  2025-10-24 21:16         ` Chuck Lever
@ 2025-10-27  8:14         ` Christoph Hellwig
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27  8:14 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Chuck Lever, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On Fri, Oct 24, 2025 at 04:37:31PM -0400, Mike Snitzer wrote:
> Particularly because any filesystem NFSD is writing to _can_ also
> fallback to using buffered IO if O_DIRECT set (NFSD is doing exactly
> that). Which we _know_ (from Christoph) that having O_DSYNC set is
> important when we fallback to using buffered IO (like we do for the
> misaligned head and/or tail).

No, I still don't know why any sync behavior is needed.  I've been
explaining how the sync behavior works a lot lately, but I still
don't understand why we think we need it.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-24 23:56           ` Mike Snitzer
@ 2025-10-27  8:15             ` Christoph Hellwig
  2025-10-27 10:50               ` Jeff Layton
  0 siblings, 1 reply; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27  8:15 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Chuck Lever, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On Fri, Oct 24, 2025 at 07:56:59PM -0400, Mike Snitzer wrote:
> Really, even ignoring all the quirkiness of this: that O_DIRECT can
> fallback to buffered, and we need IOCB_DSYNC|IOCB_SYNC for our use of
> buffered IO when NFSD_IO_DIRECT configured to ensure data has hit
> stable storage -- that's enough justification.  Bit circular but
> compelling to prove the need.. albeit wordy and a lot to unpack.

You always need IOCB_DSYNC for data to hit stable storage, both for
buffered and direct I/O.  You need IOCB_SYNC in addition to also sync
out the timestamps, which I think we now agree we need.  I still don't
understand why using direct I/O implies that we want NFS stable writes
and not two-stage writes, though.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path
  2025-10-27  8:08   ` Christoph Hellwig
@ 2025-10-27 10:38     ` Jeff Layton
  2025-10-27 10:40       ` Christoph Hellwig
  0 siblings, 1 reply; 87+ messages in thread
From: Jeff Layton @ 2025-10-27 10:38 UTC (permalink / raw)
  To: Christoph Hellwig, Chuck Lever
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Chuck Lever

On Mon, 2025-10-27 at 01:08 -0700, Christoph Hellwig wrote:
> On Fri, Oct 24, 2025 at 10:42:58AM -0400, Chuck Lever wrote:
> > +	/*
> > +	 * IOCB_SYNC + IOCB_DIRECT requests that iter_write should persist
> > +	 * both written data and dirty time stamps.
> > +	 *
> > +	 * When falling back to buffered I/O or handling the unaligned
> > +	 * first and last segments, the data and time stamps must be
> > +	 * durable before nfsd_vfs_write() returns to its caller, matching
> > +	 * the behavior of direct I/O.
> 
> I still haven't understood why we need to force sync writes with direct
> I/O.  The comments suggest it has something to do with the buffered write
> fallback, but even for that I don't really understand it. But for pure
> direct I/O writes that are properly aligned there definitively should be
> no need.  If there is a need for the fallback we really need to explain
> it, as it's non-obvious and a performance issue.

The current DIO scheme has the server writing the aligned middle part
of the range using direct I/O, but any unaligned parts at the start or
end of the WRITE will use buffered I/O.

I don't see a need for sync writes with the DIO parts either, but the
unaligned buffered ends need to be synchronous so that we know they hit
the platter before we reply with NFS_FILE_SYNC. 
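
To make the split concrete, here is a rough userspace sketch of how a
WRITE range divides into an unaligned head, an aligned middle and an
unaligned tail. It models file-offset alignment only (the memory-buffer
alignment that direct I/O also needs is ignored), and struct write_split
and split_write() are hypothetical names, not code from this series.

```c
#include <assert.h>
#include <stdint.h>

struct write_split {
	uint64_t head;   /* unaligned leading bytes, written with buffered I/O */
	uint64_t middle; /* aligned bytes, eligible for direct I/O */
	uint64_t tail;   /* unaligned trailing bytes, written with buffered I/O */
};

/* Split [offset, offset + len) around boundaries of @align (a power of two). */
static struct write_split split_write(uint64_t offset, uint64_t len,
				      uint64_t align)
{
	struct write_split s = { 0, 0, 0 };
	uint64_t end = offset + len;
	uint64_t mid_start, mid_end;

	assert(align != 0 && (align & (align - 1)) == 0);

	mid_start = (offset + align - 1) & ~(align - 1);	/* round up */
	mid_end = end & ~(align - 1);				/* round down */

	if (mid_start >= mid_end) {
		/* No aligned block fits: treat the whole range as buffered. */
		s.head = len;
		return s;
	}
	s.head = mid_start - offset;
	s.middle = mid_end - mid_start;
	s.tail = end - mid_end;
	return s;
}
```

For example, a 10000-byte WRITE at offset 1000 with 4096-byte alignment
splits into a 3096-byte head, a 4096-byte middle and a 2808-byte tail.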
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path
  2025-10-27 10:38     ` Jeff Layton
@ 2025-10-27 10:40       ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27 10:40 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christoph Hellwig, Chuck Lever, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever

On Mon, Oct 27, 2025 at 06:38:36AM -0400, Jeff Layton wrote:
> > I still haven't understood why we need to force sync writes with direct
> > I/O.  The comments suggest it has something to do with the buffered write
> > fallback, but even for that I don't really understand it. But for pure
> > direct I/O writes that are properly aligned there definitively should be
> > no need.  If there is a need for the fallback we really need to explain
> > it, as it's non-obvious and a performance issue.
> 
> The current DIO scheme has the server writing the aligned middle part
> of the range using direct I/O, but any unaligned parts at the start or
> end of the WRITE will use buffered I/O.
> 
> I don't see a need for sync writes with the DIO parts either, but the
> unaligned buffered ends need to be synchronous so that we know they hit
> the platter before we reply with NFS_FILE_SYNC. 

You also need IOCB_DSYNC for direct I/O to hit the media if you want
to return NFS_FILE_SYNC.  But I still don't understand why we want or
need to return NFS_FILE_SYNC to start with.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27  8:15             ` Christoph Hellwig
@ 2025-10-27 10:50               ` Jeff Layton
  2025-10-27 10:55                 ` Christoph Hellwig
                                   ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Jeff Layton @ 2025-10-27 10:50 UTC (permalink / raw)
  To: Christoph Hellwig, Mike Snitzer
  Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever, Christoph Hellwig

On Mon, 2025-10-27 at 01:15 -0700, Christoph Hellwig wrote:
> On Fri, Oct 24, 2025 at 07:56:59PM -0400, Mike Snitzer wrote:
> > Really, even ignoring all the quirkiness of this: that O_DIRECT can
> > fallback to buffered, and we need IOCB_DSYNC|IOCB_SYNC for our use of
> > buffered IO when NFSD_IO_DIRECT configured to ensure data has hit
> > stable storage -- that's enough justification.  Bit circular but
> > compelling to prove the need.. albeit wordy and a lot to unpack.
> 
> You always need IOCB_DSYNC for data to hit stable storage, both for
> buffered and direct I/O.  You need IOCB_SYNC in addition to also sync
> out the timestamps, which I think we now agree we need.  I still don't
> understand why using direct I/O implies that we want NFS stable writes
> and not two-stage writes, though.

That's certainly a possibility too. Consider the case where we have a
WRITE with unaligned parts at both ends. This set so far just does the
ends as synchronous I/Os.

We could do the end bits as non-synchronous writes, and follow up with
a vfs_fsync_range() call before returning NFS_FILE_SYNC.

We could also just return NFS_FILE_UNSTABLE and let the client follow
up with a commit when the write is unaligned. That may be the most
efficient scheme if you have a client streaming unaligned writes to the
server without gaps.

My feeling is that if you're doing a lot of unaligned I/Os, you're
probably better off not enabling DIO support and just doing regular
buffered (or DONTCACHE) I/Os.

That said, we don't really know either way (which is why all of this is
behind debugfs switch instead of a more permanent interface).

> You also need IOCB_DSYNC for direct I/O to hit the media if you want
> to return NFS_FILE_SYNC.  But I still don't understand why we want or
> need to return NFS_FILE_SYNC to start with.

NFS_FILE_SYNC is not required here, but it's better if we can return
that. If the server returns NFS_FILE_SYNC there is no need for the
client to follow up with a COMMIT.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 10:50               ` Jeff Layton
@ 2025-10-27 10:55                 ` Christoph Hellwig
  2025-10-27 13:48                 ` Chuck Lever
  2025-10-27 16:05                 ` Mike Snitzer
  2 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27 10:55 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christoph Hellwig, Mike Snitzer, Chuck Lever, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever,
	Christoph Hellwig

On Mon, Oct 27, 2025 at 06:50:03AM -0400, Jeff Layton wrote:
> > buffered and direct I/O.  You need IOCB_SYNC in addition to also sync
> > out the timestamps, which I think we now agree we need.  I still don't
> > understand why using direct I/O implies that we want NFS stable writes
> > and not two-stage writes, though.
> 
> That's certainly a possibility too. Consider the case where we have a
> WRITE with unaligned parts at both ends. This set so far just does the
> ends as synchronous I/Os.
> 
> We could do the end bits as non-synchronous writes, and follow up with
> a vfs_fsync_range() call before returning NFS_FILE_SYNC.
> 
> We could also just return NFS_FILE_UNSTABLE and let the client follow
> up with a commit when the write is unaligned. That may be the most
> efficient scheme if you have a client streaming unaligned writes to the
> server without gaps.

It's also the most efficient use for most direct I/O writes.

I'm really confused about what the promote-to-stable-writes thing is
trying to solve.  If the client wants to do O_(D)SYNC writes it can ask
for STABLE writes itself, no need to do magic on the server.

> > You also need IOCB_DSYNC for direct I/O to hit the media if you want
> > to return NFS_FILE_SYNC.  But I still don't understand why we want or
> > need to return NFS_FILE_SYNC to start with.
> 
> NFS_FILE_SYNC is not required here, but it's better if we can return
> that. If the server returns NFS_FILE_SYNC there is no need for the
> client to follow up with a COMMIT.

Yes, but the server had to do a lot more expensive work for every
write.  And the client can just ask for stable if it wants that.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC
  2025-10-27  8:05   ` Christoph Hellwig
@ 2025-10-27 13:23     ` Chuck Lever
  2025-10-27 13:27       ` Christoph Hellwig
  0 siblings, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-27 13:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On 10/27/25 4:05 AM, Christoph Hellwig wrote:
> On Fri, Oct 24, 2025 at 10:42:57AM -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Clean up: The helpers in the nfsd_direct_write() code path don't set
>> stable_how to anything else but NFS_FILE_SYNC. All data writes in
>> this code path result in immediate durability.
> 
> No doubting the statement of fact for the current patch set, but this
> is probably a bad idea.  Direct I/O still has to flush caches on devices
> with a volatile write cache (aka consumer grade SSDs), and it still has
> to commit a transaction to record metadata changes for most writes.
> Being able to batch these in a commit is a good idea even for direct
> I/O.
> 

Promoting all NFSD_IO_DIRECT writes to FILE_SYNC was my idea, based on
the assumption that IOCB_DIRECT writes to local file systems left
nothing to be done by a later commit. My assumption is based on the
behavior of O_DIRECT on NFS files.

If that assumption is not true, then I agree there is no technical
reason to promote NFSD_IO_DIRECT writes to FILE_SYNC, and I can remove
that built-in assumption for v8 of this series.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 08/14] NFSD: Remove alignment size checking
  2025-10-27  8:09   ` Christoph Hellwig
@ 2025-10-27 13:25     ` Chuck Lever
  2025-10-27 13:30       ` Christoph Hellwig
  0 siblings, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-27 13:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever, Christoph Hellwig

On 10/27/25 4:09 AM, Christoph Hellwig wrote:
> On Fri, Oct 24, 2025 at 10:43:00AM -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> The current set of in-tree file systems do not support alignments
>> larger than a PAGE, so this check is unnecessary.
> 
> XFS does, although you won't find production hardware with > 4k
> blocks as far as I can tell.  The reason to drop this check was
> to not arbitrarily exclude them for no reason.

IIRC there is a reason: NFSD_IO_DIRECT writes won't be able to align
on larger than a page (for the memory buffers). But perhaps the
alignment constraint that was removed applied only to the file
offset.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27  8:12       ` Christoph Hellwig
@ 2025-10-27 13:27         ` Chuck Lever
  2025-10-27 13:30           ` Chuck Lever
  2025-10-27 14:11         ` Chuck Lever
  1 sibling, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-27 13:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Snitzer, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On 10/27/25 4:12 AM, Christoph Hellwig wrote:
> On Fri, Oct 24, 2025 at 03:34:00PM -0400, Chuck Lever wrote:
>> This looks like an xfs-specific fix. I'm reluctant to apply a fix for
>> a specific file system implementation in what's supposed to be generic
>> code.
> 
> It's not a fix, it is a performance optimization.
> 
>> If (IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC) /is/ correct for all file
>> systems, then it needs an explanatory code comment, which I'm not yet
>> qualified to write. I don't see any textual material in previous
>> incarnations of this code that might help get me started.
> 
> IOCB_SYNC always needs IOCB_DSYNC as I explained three times now,

It wasn't clear to me until now that:

a. The reason to add SYNC in this case was only for DSYNC

b. Even after an IOCB_DIRECT write returns, there is more work to be
done.

So I stand corrected.


> including a detailed analysis of all users (We really need to rename
> IOCB_SYNC to __IOCB_SYNC to match __O_SYNC to make this more obvious
> I guess..)  I still don't understand why we need sync behavior and
> forced stable writes at all, though.
> 


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC
  2025-10-27 13:23     ` Chuck Lever
@ 2025-10-27 13:27       ` Christoph Hellwig
  2025-10-27 14:31         ` Mike Snitzer
  0 siblings, 1 reply; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27 13:27 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever

On Mon, Oct 27, 2025 at 09:23:27AM -0400, Chuck Lever wrote:
> Promoting all NFSD_IO_DIRECT writes to FILE_SYNC was my idea, based on
> the assumption that IOCB_DIRECT writes to local file systems left
> nothing to be done by a later commit. My assumption is based on the
> behavior of O_DIRECT on NFS files.
> 
> If that assumption is not true, then I agree there is no technical
> reason to promote NFSD_IO_DIRECT writes to FILE_SYNC, and I can remove
> that built-in assumption for v8 of this series.

It is not true, or rather only true for a tiny subset of use cases
(which NFS can't even query ahead of time).

For devices that advertise a volatile write cache, commit has to flush
that.  High-end enough devices won't have one, but a lot of devices that
people NFS-export do.  For pure overwrites the file system could
optimize this away by using the FUA flag, and at least the iomap direct
I/O code does implement that optimization for that particular case.

For any write that is not purely an overwrite, commit has to write out
the metadata to track the newly allocated blocks.  Applications that
do overwrite fully allocated blocks typically do that using the O_DSYNC
flag to fully benefit from optimizations for that case.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 08/14] NFSD: Remove alignment size checking
  2025-10-27 13:25     ` Chuck Lever
@ 2025-10-27 13:30       ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27 13:30 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On Mon, Oct 27, 2025 at 09:25:49AM -0400, Chuck Lever wrote:
> > XFS does, although you won't find production hardware with > 4k
> > blocks as far as I can tell.  The reason to drop this check was
> > to not arbitrarily exclude them for no reason.
> 
> IIRC there is a reason: NFSD_IO_DIRECT writes won't be able to align
> on larger than a page (for the memory buffers).

That could be fixed with > 0 order folio allocations, but..

> But perhaps the
> alignment constraint that was removed applied only to the file
> offset.

I think so, but I need to double check.  If not we can add the check
back, but then please document exactly why it is there.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 13:27         ` Chuck Lever
@ 2025-10-27 13:30           ` Chuck Lever
  2025-10-27 13:31             ` Christoph Hellwig
  0 siblings, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-27 13:30 UTC (permalink / raw)
  To: Chuck Lever, Christoph Hellwig
  Cc: Mike Snitzer, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs, Christoph Hellwig

On 10/27/25 9:27 AM, Chuck Lever wrote:
> On 10/27/25 4:12 AM, Christoph Hellwig wrote:
>> On Fri, Oct 24, 2025 at 03:34:00PM -0400, Chuck Lever wrote:
>>> This looks like an xfs-specific fix. I'm reluctant to apply a fix for
>>> a specific file system implementation in what's supposed to be generic
>>> code.
>>
>> It's not a fix, it is a performance optimization.
>>
>>> If (IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC) /is/ correct for all file
>>> systems, then it needs an explanatory code comment, which I'm not yet
>>> qualified to write. I don't see any textual material in previous
>>> incarnations of this code that might help get me started.
>>
>> IOCB_SYNC always needs IOCB_DSYNC as I explained three times now,
> 
> It wasn't clear to me until now that:
> 
> a. The reason to add SYNC in this case was only for DSYNC

I might have gotten that backwards. The only reason to add DSYNC was
that SYNC doesn't make sense without it.


> b. Even after an IOCB_DIRECT write returns, there is more work to be
> done.
> 
> So I stand corrected.
> 
> 
>> including a detailed analysis of all users (We really need to rename
>> IOCB_SYNC to __IOCB_SYNC to match __O_SYNC to make this more obvious
>> I guess..)  I still don't understand why we need sync behavior and
>> forced stable writes at all, though.
>>
> 
> 


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 13:30           ` Chuck Lever
@ 2025-10-27 13:31             ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27 13:31 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Chuck Lever, Christoph Hellwig, Mike Snitzer, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Christoph Hellwig

On Mon, Oct 27, 2025 at 09:30:25AM -0400, Chuck Lever wrote:
> >>> If (IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC) /is/ correct for all file
> >>> systems, then it needs an explanatory code comment, which I'm not yet
> >>> qualified to write. I don't see any textual material in previous
> >>> incarnations of this code that might help get me started.
> >>
> >> IOCB_SYNC always needs IOCB_DSYNC as I explained three times now,
> > 
> > It wasn't clear to me until now that:
> > 
> > a. The reason to add SYNC in this case was only for DSYNC
> 
> I might have gotten that backwards. The only reason to add DSYNC was
> that SYNC doesn't make sense without it.

Yes.  That's something where the IOCB_ flags made the pre-existing
O_ flag magic even more awkward.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 10:50               ` Jeff Layton
  2025-10-27 10:55                 ` Christoph Hellwig
@ 2025-10-27 13:48                 ` Chuck Lever
  2025-10-27 13:49                   ` Christoph Hellwig
  2025-10-27 16:18                   ` Mike Snitzer
  2025-10-27 16:05                 ` Mike Snitzer
  2 siblings, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-27 13:48 UTC (permalink / raw)
  To: Jeff Layton, Christoph Hellwig, Mike Snitzer
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Chuck Lever, Christoph Hellwig

On 10/27/25 6:50 AM, Jeff Layton wrote:
> On Mon, 2025-10-27 at 01:15 -0700, Christoph Hellwig wrote:
>> On Fri, Oct 24, 2025 at 07:56:59PM -0400, Mike Snitzer wrote:
>>> Really, even ignoring all the quirkiness of this: that O_DIRECT can
>>> fall back to buffered, and we need IOCB_DSYNC|IOCB_SYNC for our use of
>>> buffered IO when NFSD_IO_DIRECT is configured to ensure data has hit
>>> stable storage -- that's enough justification.  Bit circular but
>>> compelling to prove the need.. albeit wordy and a lot to unpack.
>> You always need IOCB_DSYNC for data to hit stable storage, both for
>> buffered and direct I/O.  You need IOCB_SYNC in addition to also sync
>> out the timestamps, which I think we now agree we need.  I still don't
>> understand why using direct I/O implies that we want NFS stable writes
>> and not two-stage writes, though.
> That's certainly a possibility too. Consider the case where we have a
> WRITE with unaligned parts at both ends. This set so far just does the
> ends as synchronous I/Os.
> 
> We could do the end bits as non-synchronous writes, and follow up with
> a vfs_fsync_range() call before returning NFS_FILE_SYNC.

What concerns me a bit is that the code that handles unaligned ends
is careful to issue the vfs_iocb_iter_writes in file offset order. Are
we OK to use IOCB_DSYNC for the unaligned parts but IOCB_DIRECT +
subsequent COMMIT for the direct I/O middle segment?
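
To make the segments under discussion concrete, here is a minimal
userspace sketch of splitting one WRITE into an unaligned head, an
aligned middle, and an unaligned tail, returned in file offset order.
It is not the code from this series: the struct and function names are
hypothetical, and the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* One piece of a WRITE: [offset, offset + len).  len == 0 means absent. */
struct seg {
	uint64_t offset;
	uint64_t len;
};

struct write_split {
	struct seg head;   /* unaligned start */
	struct seg middle; /* aligned, candidate for direct I/O */
	struct seg tail;   /* unaligned end */
};

static struct write_split split_write(uint64_t offset, uint64_t len,
				      uint64_t align)
{
	struct write_split s = { {offset, 0}, {offset, 0}, {offset, 0} };
	uint64_t end = offset + len;
	uint64_t mid_start, mid_end;

	/* Assumption of this sketch: align is a non-zero power of two. */
	assert(align != 0 && (align & (align - 1)) == 0);

	mid_start = (offset + align - 1) & ~(align - 1); /* round up */
	mid_end = end & ~(align - 1);                    /* round down */

	if (mid_start >= mid_end) {
		/* No aligned block fits: treat the whole range as the head. */
		s.head.len = len;
		s.middle.offset = end;
		s.tail.offset = end;
		return s;
	}

	s.head.len = mid_start - offset;
	s.middle.offset = mid_start;
	s.middle.len = mid_end - mid_start;
	s.tail.offset = mid_end;
	s.tail.len = end - mid_end;
	return s;
}
```

The question above is then whether head and tail may carry IOCB_DSYNC
while the middle is written with IOCB_DIRECT alone and made durable by
a later COMMIT.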


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 13:48                 ` Chuck Lever
@ 2025-10-27 13:49                   ` Christoph Hellwig
  2025-10-27 16:18                   ` Mike Snitzer
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27 13:49 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Christoph Hellwig, Mike Snitzer, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever,
	Christoph Hellwig

On Mon, Oct 27, 2025 at 09:48:23AM -0400, Chuck Lever wrote:
> What concerns me a bit is that the code that handles unaligned ends
> is careful to issue the vfs_iocb_iter_writes in file offset order. Are
> we OK to use IOCB_DSYNC for the unaligned parts but IOCB_DIRECT +
> subsequent COMMIT for the direct I/O middle segment?

I would just expect a COMMIT for everything.  If the client wants a
stable write, it should ask for it.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27  8:12       ` Christoph Hellwig
  2025-10-27 13:27         ` Chuck Lever
@ 2025-10-27 14:11         ` Chuck Lever
  2025-10-27 14:45           ` Christoph Hellwig
  1 sibling, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-27 14:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Snitzer, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On 10/27/25 4:12 AM, Christoph Hellwig wrote:
> On Fri, Oct 24, 2025 at 03:34:00PM -0400, Chuck Lever wrote:
>> If (IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC) /is/ correct for all file
>> systems, then it needs an explanatory code comment, which I'm not yet
>> qualified to write. I don't see any textual material in previous
>> incarnations of this code that might help get me started.
> 
> IOCB_SYNC always needs IOCB_DSYNC as I explained three times now,
> including a detailed analysis of all users (We really need to rename
> IOCB_SYNC to __IOCB_SYNC to match __O_SYNC to make this more obvious
> I guess..)  I still don't understand why we need sync behavior and
> forced stable writes at all, though.
> 

Well the relationship between IOCB_SYNC and IOCB_DSYNC is absent
from the iocb_flags() helper, which you referred to in email a few
days ago.

What would be best, IMHO, would be actual API documentation, since
there are too many subtleties to expect adequate documentation from
smart source code alone. Hence my hue and cry for some text I can
stick in a comment.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC
  2025-10-27 13:27       ` Christoph Hellwig
@ 2025-10-27 14:31         ` Mike Snitzer
  2025-10-27 14:36           ` Christoph Hellwig
  0 siblings, 1 reply; 87+ messages in thread
From: Mike Snitzer @ 2025-10-27 14:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs, Chuck Lever

On Mon, Oct 27, 2025 at 06:27:47AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 27, 2025 at 09:23:27AM -0400, Chuck Lever wrote:
> > Promoting all NFSD_IO_DIRECT writes to FILE_SYNC was my idea, based on
> > the assumption that IOCB_DIRECT writes to local file systems left
> > nothing to be done by a later commit. My assumption is based on the
> > behavior of O_DIRECT on NFS files.
> > 
> > If that assumption is not true, then I agree there is no technical
> > reason to promote NFSD_IO_DIRECT writes to FILE_SYNC, and I can remove
> > that built-in assumption for v8 of this series.
> 
> It is not true, or rather only true for a tiny subset of use cases
> (which NFS can't even query ahead of time).
> 
> For devices that advertise a volatile write cache, commit has to flush
> that.  High-end enough devices won't have one, but a lot of devices that
> people NFS-export do.  For pure overwrites the file system could
> optimize this way by using the FUA flag, and at least the iomap direct
> I/O code does implement that optimization for that particular case.

NFSD_IO_DIRECT isn't meant to be uniformly better for all types of
storage.  Any storage that has a volatile write cache is probably best
served by existing NFSD default (NFSD_IO_BUFFERED).

> For any write that is not purely an overwrite, commit has to write out
> the metadata to track the newly allocated blocks.  Applications that
> do overwrite fully allocated blocks typically do that using the O_DSYNC
> flag to fully benefit from optimizations for that case.

Think there is a bit of disconnect here in that this new
NFSD_IO_DIRECT mode is admin configured purely on NFSD's side and
always-on once enabled.

The client doesn't control if/when NFSD would make use of O_DIRECT
(other than if it sends misaligned IO and NFSD must do what it can to
ensure it safely hits stable storage).

In addition, the use of NFSD_IO_DIRECT is intended to allow for
systems large _and_ small to get the advantage of lower memory
utilization.  Buffered IO is one extreme, but even using a model where
NFSD were to not impose NFS_FILE_SYNC would create a situation where
more memory is needed to batch IO and then wait for the client to send COMMIT.

The current approach of using IOCB_DSYNC|IOCB_SYNC has performed
really well on modern NVMe servers.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC
  2025-10-27 14:31         ` Mike Snitzer
@ 2025-10-27 14:36           ` Christoph Hellwig
  2025-10-27 14:58             ` Mike Snitzer
  0 siblings, 1 reply; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27 14:36 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Chuck Lever, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever

On Mon, Oct 27, 2025 at 10:31:55AM -0400, Mike Snitzer wrote:
> > that.  High-end enough devices won't have one, but a lot of devices that
> > people NFS-export do.  For pure overwrites the file system could
> > optimize this way by using the FUA flag, and at least the iomap direct
> > I/O code does implement that optimization for that particular case.
> 
> NFSD_IO_DIRECT isn't meant to be uniformly better for all types of
> storage.  Any storage that has a volatile write cache is probably best
> served by existing NFSD default (NFSD_IO_BUFFERED).

That's a very odd claim.  Also what does it have to do with the rest of
the discussion here?

> The client doesn't control if/when NFSD would make use of O_DIRECT
> (other than if it sends misaligned IO and NFSD must do what it can to
> ensure it safely hits stable storage).

Sure.

> In addition, the use of NFSD_IO_DIRECT is intended to allow for
> systems large _and_ small to get the advantage of lower memory
> utilization.  Buffered IO is one extreme, but even using a model where
> NFSD were to not impose NFS_FILE_SYNC would create a situation where
> more memory is needed to batch IO and then wait for the client to send COMMIT.

Why?

> The current approach of using IOCB_DSYNC|IOCB_SYNC has performed
> really well on modern NVMe servers.

NVMe does not implement a concept called servers.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 14:11         ` Chuck Lever
@ 2025-10-27 14:45           ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27 14:45 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Mike Snitzer, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever,
	Christoph Hellwig

On Mon, Oct 27, 2025 at 10:11:28AM -0400, Chuck Lever wrote:
> On 10/27/25 4:12 AM, Christoph Hellwig wrote:
> > On Fri, Oct 24, 2025 at 03:34:00PM -0400, Chuck Lever wrote:
> >> If (IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC) /is/ correct for all file
> >> systems, then it needs an explanatory code comment, which I'm not yet
> >> qualified to write. I don't see any textual material in previous
> >> incarnations of this code that might help get me started.
> > 
> > IOCB_SYNC always needs IOCB_DSYNC as I explained three times now,
> > including a detailed analysis of all users (We really need to rename
> > IOCB_SYNC to __IOCB_SYNC to match __O_SYNC to make this more obvious
> > I guess..)  I still don't understand why we need sync behavior and
> > forced stable writes at all, though.
> > 
> 
> Well the relationship between IOCB_SYNC and IOCB_DSYNC is absent
> from the iocb_flags() helper, which you referred to in email a few
> days ago.
> 
> What would be best, IMHO, would be actual API documentation, since
> there are too many subtleties to expect adequate documentation from
> smart source code alone. Hence my hue and cry for some text I can
> stick in a comment.

The O_DSYNC vs __O_SYNC text is applicable here 1:1.
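
For readers without that text to hand, the userspace side of the
relationship can be checked directly.  This is a minimal sketch,
assuming Linux, where O_SYNC is defined so that it contains the O_DSYNC
bit.  The SKETCH_ flag values and function names are hypothetical and
only model the rule stated in this thread that SYNC is never used
without DSYNC; this is not the kernel's iocb_flags() helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical stand-ins for IOCB_DSYNC and IOCB_SYNC; values are arbitrary. */
#define SKETCH_DSYNC (1u << 0) /* data integrity: data is on stable storage */
#define SKETCH_SYNC  (1u << 1) /* additionally sync metadata such as timestamps */

/* True if the open(2) flags request at least O_DSYNC semantics. */
static bool open_flags_want_dsync(int oflags)
{
	return (oflags & O_DSYNC) == O_DSYNC;
}

/* True if the open(2) flags request full O_SYNC semantics. */
static bool open_flags_want_sync(int oflags)
{
	return (oflags & O_SYNC) == O_SYNC;
}

/* Map open(2) flags to the sketch flags. */
static unsigned int sketch_sync_flags(int oflags)
{
	unsigned int flags = 0;

	if (open_flags_want_dsync(oflags))
		flags |= SKETCH_DSYNC;
	if (open_flags_want_sync(oflags))
		flags |= SKETCH_SYNC;

	/* Rule from this thread: SYNC only makes sense together with DSYNC. */
	assert(!(flags & SKETCH_SYNC) || (flags & SKETCH_DSYNC));
	return flags;
}
```

Because O_SYNC includes the O_DSYNC bit on Linux, a caller passing
O_SYNC gets both sketch flags, while O_DSYNC alone yields only
SKETCH_DSYNC.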


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC
  2025-10-27 14:36           ` Christoph Hellwig
@ 2025-10-27 14:58             ` Mike Snitzer
  2025-10-27 15:04               ` Chuck Lever
  2025-10-27 15:05               ` Christoph Hellwig
  0 siblings, 2 replies; 87+ messages in thread
From: Mike Snitzer @ 2025-10-27 14:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs, Chuck Lever

On Mon, Oct 27, 2025 at 07:36:18AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 27, 2025 at 10:31:55AM -0400, Mike Snitzer wrote:
> > > that.  High-end enough devices won't have one, but a lot of devices that
> > > people NFS-export do.  For pure overwrites the file system could
> > > optimize this way by using the FUA flag, and at least the iomap direct
> > > I/O code does implement that optimization for that particular case.
> > 
> > NFSD_IO_DIRECT isn't meant to be uniformly better for all types of
> > storage.  Any storage that has a volatile write cache is probably best
> > served by existing NFSD default (NFSD_IO_BUFFERED).
> 
> That's a very odd claim.  Also what does it have to do with the rest of
> the discussion here?

What's an odd claim?  That shitty storage shouldn't be the bad apple
that spoils the bunch?  You seem to be advocating for requiring
additional NFSD resources as part of the NFSD_IO_DIRECT
implementation.

But we could make NFSD's behavior tuneable such that it can signal to
NFS client NFS_FILE_SYNC or not.

The goal is to walk before expanding to running with other
permutations of how to tune NFSD's IO modes.

> > The client doesn't control if/when NFSD would make use of O_DIRECT
> > (other than if it sends misaligned IO and NFSD must do what it can to
> > ensure it safely hits stable storage).
> 
> Sure.
> 
> > In addition, the use of NFSD_IO_DIRECT is intended to allow for
> > systems large _and_ small to get the advantage of lower memory
> > utilization.  Buffered IO is one extreme, but even using a model where
> > NFSD were to not impose NFS_FILE_SYNC would create a situation where
> > > more memory is needed to batch IO and then wait for the client to send COMMIT.
> 
> Why?

Because NFSD will then need to hold the IO until the COMMIT operation.
That requires extra NFSD resources, right?

> > > The current approach of using IOCB_DSYNC|IOCB_SYNC has performed
> > really well on modern NVMe servers.
> 
> NVMe does not implement a concept called servers.

But you're aware that servers have NVMe devices in them.. that's all I
meant.  All of these NFSD_IO_DIRECT changes have been developed and
tested in modern servers with 8 NVMe, see:
https://www.youtube.com/watch?v=tpPFDu9Nuuw

(NOTE: results covered in this session did _not_ have the benefit
of NFSD responding to client with NFS_FILE_SYNC to avoid COMMIT, the
ability to do so was discussed at Bakeathon and was acted on with
these latest NFSD Direct patchsets).

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC
  2025-10-27 14:58             ` Mike Snitzer
@ 2025-10-27 15:04               ` Chuck Lever
  2025-10-27 15:19                 ` Mike Snitzer
  2025-10-27 15:05               ` Christoph Hellwig
  1 sibling, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Mike Snitzer, Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On 10/27/25 10:58 AM, Mike Snitzer wrote:
>>> The current approach of using IOCB_DSYNC|IOCB_SYNC has performed
>>> really well on modern NVMe servers.
>> NVMe does not implement a concept called servers.
> But you're aware that servers have NVMe devices in them.. that's all I
> meant.  All of these NFSD_IO_DIRECT changes have been developed and
> tested in modern servers with 8 NVMe, see:
> https://www.youtube.com/watch?v=tpPFDu9Nuuw
> 
> (NOTE: results covered in this session did _not_ have the benefit
> of NFSD responding to client with NFS_FILE_SYNC to avoid COMMIT, the
> ability to do so was discussed at Bakeathon and was acted on with
> these latest NFSD Direct patchsets).

This is why I suspect that leaving direct writes as UNSTABLE, rather
than promoting them to FILE_SYNC, is probably not going to make much
difference to Jonathan's benchmark results. That's just a guess though.

IOW the memory costs of sticking with UNSTABLE + COMMIT have already
been demonstrated, and it's low.

The v8 of this series will go back to that idea, and if you want, you
can benchmark it again. If it regresses, we can stick with FILE_SYNC.
It's just ones and zeroes, as they say.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC
  2025-10-27 14:58             ` Mike Snitzer
  2025-10-27 15:04               ` Chuck Lever
@ 2025-10-27 15:05               ` Christoph Hellwig
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-27 15:05 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Chuck Lever, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever

On Mon, Oct 27, 2025 at 10:58:19AM -0400, Mike Snitzer wrote:
> On Mon, Oct 27, 2025 at 07:36:18AM -0700, Christoph Hellwig wrote:
> > On Mon, Oct 27, 2025 at 10:31:55AM -0400, Mike Snitzer wrote:
> > > > that.  High-end enough devices won't have one, but a lot of devices that
> > > > people NFS-export do.  For pure overwrites the file system could
> > > > optimize this way by using the FUA flag, and at least the iomap direct
> > > > I/O code does implement that optimization for that particular case.
> > > 
> > > NFSD_IO_DIRECT isn't meant to be uniformly better for all types of
> > > storage.  Any storage that has a volatile write cache is probably best
> > > served by existing NFSD default (NFSD_IO_BUFFERED).
> > 
> > That's a very odd claim.  Also what does it have to do with the rest of
> > the discussion here?
> 
> What's an odd claim?  That shitty storage shouldn't be the bad apple
> that spoils the bunch?  You seem to be advocating for requiring
> additional NFSD resources as part of the NFSD_IO_DIRECT
> implementation.

What f***ing additional resources?   What "shitty" storage?

> But we could make NFSD's behavior tuneable such that it can signal to
> NFS client NFS_FILE_SYNC or not.

Sure.  If you can come up with a good argument for returning it
for writers that don't request it, because in general it's going
to make things slower.

> The goal is to walk before expanding to running with other
> permutations of how to tune NFSD's IO modes.

So don't throw in random other permutations and just concentrate on
direct I/O instead of opening up a totally new can of worms?

> > Why?
> 
> Because NFSD will then need to hold the IO until the COMMIT operation.
> That requires extra NFSD resources, right?

What I/O does it need to hold?  It just needs to implement a commit that
does a range fsync.  The client will have to still track these I/Os,
but that's fairly light weight and for most cases has huge benefits.
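
As a rough userspace analogy for this two-stage model, the sketch below
writes without any sync flag and makes the data durable in a separate
step.  It only illustrates the division of labour: the function names
are hypothetical, and because portable userspace code has no durable
range variant of fsync(), the sketch syncs the whole file where the
kernel would use a range fsync.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a file for the sketch; note: no O_DSYNC or O_SYNC is requested. */
static int sketch_open(const char *path)
{
	assert(path != NULL);
	return open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
}

/* Stage 1: plain write, comparable to replying UNSTABLE to the client. */
static ssize_t sketch_unstable_write(int fd, const void *buf, size_t len,
				     off_t off)
{
	assert(fd >= 0 && buf != NULL);
	return pwrite(fd, buf, len, off);
}

/* Stage 2: make earlier writes durable, comparable to handling COMMIT. */
static int sketch_commit(int fd)
{
	assert(fd >= 0);
	return fsync(fd);
}
```

Nothing is held in the writer between the two stages; the only state
the caller keeps is the knowledge that a commit is still outstanding.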

> 
> > > The current approach of using IOCB_DSYNC|IOCB_SYNC has performed
> > > really well on modern NVMe servers.
> > 
> > NVMe does not implement a concept called servers.
> 
> But you're aware that servers have NVMe devices in them.. that's all I
> meant.  All of these NFSD_IO_DIRECT changes have been developed and
> tested in modern servers with 8 NVMe, see:
> https://www.youtube.com/watch?v=tpPFDu9Nuuw

I'm not going to watch videos to reply to a mail thread, so if there
is anything important in that please summarize it.

> (NOTE: results covered in this session did _not_ have the benefit
> of NFSD responding to client with NFS_FILE_SYNC to avoid COMMIT, the
> ability to do so was discussed at Bakeathon and was acted on with
> these latest NFSD Direct patchsets).

Why do you think randomly returning NFS_FILE_SYNC is actually going
to improve things?  Except for one particular corner case I expect
it to make things worse, often much worse.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC
  2025-10-27 15:04               ` Chuck Lever
@ 2025-10-27 15:19                 ` Mike Snitzer
  0 siblings, 0 replies; 87+ messages in thread
From: Mike Snitzer @ 2025-10-27 15:19 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever

On Mon, Oct 27, 2025 at 11:04:47AM -0400, Chuck Lever wrote:
> On 10/27/25 10:58 AM, Mike Snitzer wrote:
> >>> The current approach of using IOCB_DSYNC|IOCB_SYNC has performed
> >>> really well on modern NVMe servers.
> >> NVMe does not implement a concept called servers.
> > But you're aware that servers have NVMe devices in them.. that's all I
> > meant.  All of these NFSD_IO_DIRECT changes have been developed and
> > tested in modern servers with 8 NVMe, see:
> > https://www.youtube.com/watch?v=tpPFDu9Nuuw
> > 
> > (NOTE: results covered in this session did _not_ have the benefit
> > of NFSD responding to client with NFS_FILE_SYNC to avoid COMMIT, the
> > ability to do so was discussed at Bakeathon and was acted on with
> > these latest NFSD Direct patchsets).
> 
> This is why I suspect that leaving direct writes as UNSTABLE, rather
> than promoting them to FILE_SYNC, is probably not going to make much
> difference to Jonathan's benchmark results. That's just a guess though.
> 
> IOW the memory costs of sticking with UNSTABLE + COMMIT have already
> been demonstrated, and it's low.
> 
> The v8 of this series will go back to that idea, and if you want, you
> can benchmark it again. If it regresses, we can stick with FILE_SYNC.
> It's just ones and zeroes, as they say.

OK.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 10:50               ` Jeff Layton
  2025-10-27 10:55                 ` Christoph Hellwig
  2025-10-27 13:48                 ` Chuck Lever
@ 2025-10-27 16:05                 ` Mike Snitzer
  2025-10-27 17:57                   ` Chuck Lever
  2 siblings, 1 reply; 87+ messages in thread
From: Mike Snitzer @ 2025-10-27 16:05 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christoph Hellwig, Chuck Lever, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On Mon, Oct 27, 2025 at 06:50:03AM -0400, Jeff Layton wrote:
> On Mon, 2025-10-27 at 01:15 -0700, Christoph Hellwig wrote:
> > On Fri, Oct 24, 2025 at 07:56:59PM -0400, Mike Snitzer wrote:
> > > Really, even ignoring all the quirkiness of this: that O_DIRECT can
> > > fall back to buffered, and we need IOCB_DSYNC|IOCB_SYNC for our use of
> > > buffered IO when NFSD_IO_DIRECT is configured to ensure data has hit
> > > stable storage -- that's enough justification.  Bit circular but
> > > compelling to prove the need.. albeit wordy and a lot to unpack.
> > 
> > You always need IOCB_DSYNC for data to hit stable storage, both for
> > buffered and direct I/O.  You need IOCB_SYNC in addition to also sync
> > out the timestamps, which I think we now agree we need.  I still don't
> > understand why using direct I/O implies that we want NFS stable writes
> > and not two-stage writes, though.
> 
> That's certainly a possibility too. Consider the case where we have a
> WRITE with unaligned parts at both ends. This set so far just does the
> ends as synchronous I/Os.
> 
> We could do the end bits as non-synchronous writes, and follow up with
> a vfs_fsync_range() call before returning NFS_FILE_SYNC.
> 
> We could also just return NFS_FILE_UNSTABLE and let the client follow
> up with a commit when the write is unaligned. That may be the most
> efficient scheme if you have a client streaming unaligned writes to the
> server without gaps.

Yes, that is how I handled streaming unaligned writes last I needed to
benchmark it (IO500's IOR_HARD benchmark uses IO size of 47008 bytes).

Use buffered IO for misaligned WRITE's head/tail.  In addition, I had
the heuristic to always use buffered IO for any READ less than 32K
(the misaligned head and tail included).  This allows leveraging the
page cache for the RMW needed to service the unaligned streaming WRITE.

> My feeling is that if you're doing a lot of unaligned I/Os, you're
> probably better off not enabling DIO support and just doing regular
> buffered (or DONTCACHE) I/Os.

DONTCACHE will not do well for RMW because it'll immediately drop IO
it just read.

But will be testing this further...

> That said, we don't really know either way (which is why all of this is
> behind debugfs switch instead of a more permanent interface).

Yeap.
 
> > You also need IOCB_DSYNC for direct I/O to hit the media if you want
> > to return NFS_FILE_SYNC.  But I still don't understand why we want or
> > need to return NFS_FILE_SYNC to start with.
> 
> NFS_FILE_SYNC is not required here, but it's better if we can return
> that. If the server returns NFS_FILE_SYNC there is no need for the
> client to follow up with a COMMIT.

Yes, which is why I'm confused by Chuck wanting to do away with
NFSD_IO_DIRECT setting NFS_FILE_SYNC "for now".  Not heard compelling
reason, but "it is what it is". ;)

Were we not all concerned about safety first (especially of mixing
buffered and DIO) and performance a secondary concern?  Using
IOCB_DSYNC|IOCB_SYNC for all WRITEs and returning NFS_FILE_SYNC is
really safe right?

And we already showed that doing so really isn't slow.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 13:48                 ` Chuck Lever
  2025-10-27 13:49                   ` Christoph Hellwig
@ 2025-10-27 16:18                   ` Mike Snitzer
  2025-10-27 16:59                     ` Mike Snitzer
  2025-10-29  7:20                     ` Christoph Hellwig
  1 sibling, 2 replies; 87+ messages in thread
From: Mike Snitzer @ 2025-10-27 16:18 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Christoph Hellwig, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On Mon, Oct 27, 2025 at 09:48:23AM -0400, Chuck Lever wrote:
> On 10/27/25 6:50 AM, Jeff Layton wrote:
> > On Mon, 2025-10-27 at 01:15 -0700, Christoph Hellwig wrote:
> >> On Fri, Oct 24, 2025 at 07:56:59PM -0400, Mike Snitzer wrote:
> >>> Really, even ignoring all the quirkiness of this: that O_DIRECT can
> >>> fall back to buffered, and we need IOCB_DSYNC|IOCB_SYNC for our use of
> >>> buffered IO when NFSD_IO_DIRECT is configured to ensure data has hit
> >>> stable storage -- that's enough justification.  Bit circular but
> >>> compelling to prove the need.. albeit wordy and a lot to unpack.
> >> You always need IOCB_DSYNC for data to hit stable storage, both for
> >> buffered and direct I/O.  You need IOCB_SYNC in addition to also sync
> >> out the timestamps, which I think we now agree we need.  I still don't
> >> understand why using direct I/O implies that we want NFS stable writes
> >> and not two-stage writes, though.
> > That's certainly a possibility too. Consider the case where we have a
> > WRITE with unaligned parts at both ends. This set so far just does the
> > ends as synchronous I/Os.
> > 
> > We could do the end bits as non-synchronous writes, and follow up with
> > a vfs_fsync_range() call before returning NFS_FILE_SYNC.
> 
> What concerns me a bit is that the code that handles unaligned ends
> is careful to issue the vfs_iocb_iter_writes in file offset order. Are
> we OK to use IOCB_DSYNC for the unaligned parts but IOCB_DIRECT +
> subsequent COMMIT for the direct I/O middle segment?

LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
middle (via AIO completion of that aligned middle).  So out of order
relative to file offset.

That has been working well with LOCALIO.. though I did just post a
patchset today that fixes some quirks of the implementation that got
flagged for KASAN use-after-free.

(But now that I look at LOCALIO misaligned DIO code, it actually isn't
even setting IOCB_DSYNC for the misaligned head/tail... needs fixing.)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 16:18                   ` Mike Snitzer
@ 2025-10-27 16:59                     ` Mike Snitzer
  2025-10-29  7:20                     ` Christoph Hellwig
  1 sibling, 0 replies; 87+ messages in thread
From: Mike Snitzer @ 2025-10-27 16:59 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Christoph Hellwig, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On Mon, Oct 27, 2025 at 12:18:30PM -0400, Mike Snitzer wrote:
> On Mon, Oct 27, 2025 at 09:48:23AM -0400, Chuck Lever wrote:
> > On 10/27/25 6:50 AM, Jeff Layton wrote:
> > > On Mon, 2025-10-27 at 01:15 -0700, Christoph Hellwig wrote:
> > >> On Fri, Oct 24, 2025 at 07:56:59PM -0400, Mike Snitzer wrote:
> > >>> Really, even ignoring all the quirkiness of this: that O_DIRECT can
> > >>> fallback to buffered, and we need IOCB_DSYNC|IOCB_SYNC for our use of
> > >>> buffered IO when NFSD_IO_DIRECT configured to ensure data has hit
> > >>> stable storage -- that's enough justification.  Bit circular but
> > >>> compelling to prove the need.. albeit wordy and a lot to unpack.
> > >> You always need IOCB_DSYNC for data to hit stable storage, both for
> > >> buffered and direct I/O.  You need IOCB_SYNC in addition to also sync
> > >> out the timestamps, which I think we now agree we need.  I still don't
> > >> understand why using direct I/O implies that we want NFS stable writes
> > >> and not two-stage writes, though.
> > > That's certainly a possibility too. Consider the case where we have a
> > > WRITE with unaligned parts at both ends. This set so far just does the
> > > ends as synchronous I/Os.
> > > 
> > > We could do the end bits as non-synchronous writes, and follow up with
> > > a vfs_fsync_range() call before returning NFS_FILE_SYNC.
> > 
> > What concerns me a bit is that the code that handles unaligned ends
> > is careful to issue the vfs_iocb_iter_writes in file offset order. Are
> > we OK to use IOCB_DSYNC for the unaligned parts but IOCB_DIRECT +
> > subsequent COMMIT for the direct I/O middle segment?
> 
> LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
> middle (via AIO completion of that aligned middle).  So out of order
> relative to file offset.

Doing things out of order does mean needing code to avoid advancing
the file offset when a misaligned end completes but the middle ends
up having a short write.

I think that'd be one can of worms NFSD would need to take on if you
do want to follow through with getting away from using
IOCB_DSYNC|IOCB_SYNC and returning NFS_FILE_SYNC back to the client.
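
To make that bookkeeping concrete, here is a minimal userspace sketch.
It is not the LOCALIO or NFSD code; `struct seg_result` and
`contiguous_bytes_written()` are hypothetical names used only for
illustration. The idea is that segments may complete in any order, but
the byte count reported back covers only the contiguous prefix in file
offset order, so a short middle write hides a fully written tail.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-segment outcome; not a kernel structure. */
struct seg_result {
	size_t  requested;	/* bytes this segment was asked to write */
	ssize_t written;	/* bytes actually written, or negative errno */
};

/*
 * segs[] is listed in file offset order (head, middle, tail), even if
 * the segments completed in a different order.  Count bytes only up to
 * the first failed or short segment, so the reported length never
 * skips over a hole.
 */
size_t contiguous_bytes_written(const struct seg_result *segs, size_t nsegs)
{
	size_t total = 0;

	for (size_t i = 0; i < nsegs; i++) {
		/* A segment can never write more than it was asked to. */
		assert(segs[i].written <= (ssize_t)segs[i].requested);

		if (segs[i].written < 0)
			break;
		total += (size_t)segs[i].written;
		if ((size_t)segs[i].written < segs[i].requested)
			break;
	}
	return total;
}
```

With a 100-byte head, a 4096-byte middle that only wrote 1024 bytes,
and a 50-byte tail that completed in full, this reports 1124 bytes:
the tail is not counted because it sits past the short middle.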

Mike


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 16:05                 ` Mike Snitzer
@ 2025-10-27 17:57                   ` Chuck Lever
  2025-10-28  3:26                     ` Mike Snitzer
  2025-10-29  7:25                     ` Christoph Hellwig
  0 siblings, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-27 17:57 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton
  Cc: Christoph Hellwig, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On 10/27/25 12:05 PM, Mike Snitzer wrote:
>>> You also need IOCB_DSYNC for direct I/O to hit the media if you want
>>> to return NFS_FILE_SYNC.  But I still don't understand why we want or
>>> need to return NFS_FILE_SYNC to start with.
>> NFS_FILE_SYNC is not required here, but it's better if we can return
>> that. If the server returns NFS_FILE_SYNC there is no need for the
>> client to follow up with a COMMIT.
> Yes, which is why I'm confused by Chuck wanting to do away with
> NFSD_IO_DIRECT setting NFS_FILE_SYNC "for now".  Not heard compelling
> reason, but "it is what it is". 😉

The compelling reason is that it's generally faster (or less work for
the NFS server and its storage) to sync the metadata after the client
has sent all of the data it wants to write. This amortizes the cost of
the metadata operations, and allows the server to get the written data
persisted (if it makes sense to do that) while waiting for the COMMIT.

Since your patch asserts IOCB_DSYNC for all direct write segments,
NFSD_IO_DIRECT (as it is implemented in your patch) does not need a
subsequent COMMIT operation. Forcing the client to COMMIT is totally
unnecessary. That's why I suggested promoting all NFSD_IO_DIRECT WRITEs
to FILE_SYNC.

So perhaps the issue here is that the rationale for using IOCB_DSYNC
for all NFSD_IO_DIRECT writes is hazy and needs to be clarified.

> Were we not all concerned about safety first (especially of mixing
> buffered and DIO) and performance a secondary concern?  Using
> IOCB_DSYNC|IOCB_SYNC for all WRITEs and returning NFS_FILE_SYNC is
> really safe right?

The client ensures that UNSTABLE + COMMIT is just as "safe" as
FILE_SYNC, generally, by preserving the dirty data in its own page cache
until the server indicates the written data is durable and is safe to
evict if needed.

If you want to make an argument about data integrity, let's
be as precise as we can about what we believe might be unsafe. The
data integrity doubt was with the unaligned ends, IIRC? If we need
that extra bit of integrity, then WRITEs with an unaligned portion
will need to be IOCB_DSYNC, and then promoted.

An NFS WRITE, even an UNSTABLE one, I believe makes the written data
visible to other readers. That might be an argument for using IOCB_DSYNC
with IOCB_DIRECT, so that subsequent NFS READs (and local applications)
see the written data as soon as NFSD generates a response to an NFS
WRITE backed by IOCB_DIRECT.

I think we also discussed that an NFS WRITE should make updates visible
in byte order, which is why the segments are handled low offset to high
offset.

Or, we might decide that, no, NFS WRITE has no data visibility mandate;
applications achieve data visibility explicitly using COMMIT and file
locking, so none of this matters.

> And we already showed that doing so really isn't slow.

Well we don't have a comparison with "IOCB_DIRECT without IOCB_DSYNC".
That might be faster than what you tested? Plus I think your test was
on esoteric enterprise NVMe devices, not on the significantly more
commonly deployed SSD devices.

-- 
Chuck Lever


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 17:57                   ` Chuck Lever
@ 2025-10-28  3:26                     ` Mike Snitzer
  2025-10-28 15:37                       ` Chuck Lever
  2025-10-29  7:32                       ` Christoph Hellwig
  2025-10-29  7:25                     ` Christoph Hellwig
  1 sibling, 2 replies; 87+ messages in thread
From: Mike Snitzer @ 2025-10-28  3:26 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Christoph Hellwig, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

[Apologies, I think I repeated myself 4 slightly different ways below]

On Mon, Oct 27, 2025 at 01:57:25PM -0400, Chuck Lever wrote:
> On 10/27/25 12:05 PM, Mike Snitzer wrote:
> >>> You also need IOCB_DSYNC for direct I/O to hit the media if you want
> >>> to return NFS_FILE_SYNC.  But I still don't understand why we want or
> >>> need to return NFS_FILE_SYNC to start with.
> >> NFS_FILE_SYNC is not required here, but it's better if we can return
> >> that. If the server returns NFS_FILE_SYNC there is no need for the
> >> client to follow up with a COMMIT.
> > Yes, which is why I'm confused by Chuck wanting to do away with
> > NFSD_IO_DIRECT setting NFS_FILE_SYNC "for now".  Not heard compelling
> > reason, but "it is what it is". 😉
> 
> The compelling reason is that it's generally faster (or less work for
> the NFS server and its storage) to sync the metadata after the client
> has sent all of the data it wants to write. This amortizes the cost of
> the metadata operations, and allows the server to get the written data
> persisted (if it makes sense to do that) while waiting for the COMMIT.

But none of that matters if the only safe way to implement mixing
buffered and direct IO is by waiting for the DIO to succeed and with
it any associated page cache invalidation (with any failure to
invalidate the page handled by the underlying filesystem falling back
to buffered IO).

Any buffered or direct IO associated with the misaligned DIO WRITE
handling, in terms of 3 segments, must use IOCB_DSYNC.

So can we please revisit your desire to eliminate the use of
IOCB_DSYNC for NFSD_IO_DIRECT WRITEs?

I contend that NFSD_IO_DIRECT should use IOCB_DSYNC|IOCB_SYNC for
all 3 segments of a misaligned DIO WRITE (so for both buffered and
direct IO).

> Since your patch asserts IOCB_DSYNC for all direct write segments,
> NFSD_IO_DIRECT (as it is implemented in your patch) does not need a
> subsequent COMMIT operation. Forcing the client to COMMIT is totally
> unnecessary. That's why I suggested promoting all NFSD_IO_DIRECT WRITEs
> to FILE_SYNC.

No, the COMMIT can only be elided (and NFS_FILE_SYNC returned to the
client) if both IOCB_DSYNC and IOCB_SYNC are set.

But yes, at Bakeathon, the intent/understanding was: if you're already
setting SYNC then you can avoid the COMMIT -- this nuance of DSYNC vs
SYNC wasn't on our radar at the time.
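
As a sketch of the rule being argued for here (FILE_SYNC only when both
flags were set on the I/O), assuming stand-in flag bits and a
hypothetical helper name: the `SK_IOCB_*` values below are invented for
this example and do not match the kernel's IOCB_* definitions, while
the stable_how values are the ones the NFSv3 protocol defines.

```c
#include <assert.h>

/* stable_how values as defined by the NFSv3 protocol (RFC 1813). */
enum stable_how { NFS_UNSTABLE = 0, NFS_DATA_SYNC = 1, NFS_FILE_SYNC = 2 };

/* Stand-in flag bits for this sketch only; not the kernel's values. */
#define SK_IOCB_DSYNC	(1u << 0)
#define SK_IOCB_SYNC	(1u << 1)
#define SK_IOCB_DIRECT	(1u << 2)

/*
 * Hypothetical helper: pick the stable_how to report in the WRITE
 * reply.  Promotion to NFS_FILE_SYNC happens only when the write was
 * issued with both DSYNC (data durable) and SYNC (timestamps durable);
 * otherwise the client's requested level is passed through unchanged.
 */
enum stable_how reply_stable_how(unsigned int ki_flags,
				 enum stable_how requested)
{
	const unsigned int full_sync = SK_IOCB_DSYNC | SK_IOCB_SYNC;

	assert(requested <= NFS_FILE_SYNC);

	if ((ki_flags & full_sync) == full_sync)
		return NFS_FILE_SYNC;
	return requested;
}
```

Under this policy an UNSTABLE WRITE issued with only
SK_IOCB_DIRECT | SK_IOCB_DSYNC is still answered as UNSTABLE, so the
client follows up with a COMMIT.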

> So perhaps the issue here is that the rationale for using IOCB_DSYNC
> for all NFSD_IO_DIRECT writes is hazy and needs to be clarified.

How is it still hazy?  We've had repeat discussion about the need for
IOCB_DSYNC (and IOCB_SYNC if we really want to honor intent of
NFS_FILE_SYNC).

Christoph has repeatedly said DSYNC is needed with O_DIRECT, yet you
keep removing it.

> > Were we not all concerned about safety first (especially of mixing
> > buffered and DIO) and performance a secondary concern?  Using
> > IOCB_DSYNC|IOCB_SYNC for all WRITEs and returning NFS_FILE_SYNC is
> > really safe right?
> 
> The client ensures that UNSTABLE + COMMIT is just as "safe" as
> FILE_SYNC, generally, by preserving the dirty data in its own page cache
> until the server indicates the written data is durable and is safe to
> evict if needed.

I'm aware UNSTABLE is safe thanks to COMMIT, etc.

But the entire intent behind NFSD's O_DIRECT support is to ensure IO
is on stable storage when it replies to the client.  The client isn't
meant to get involved with driving the correctness of NFSD's O_DIRECT
support (by requiring the client set NFS_FILE_SYNC for the benefit of
a feature it doesn't know is enabled in the server).

That cannot be what you're saying, NFSD_IO_DIRECT is entirely managed
by the server as a configurable implementation detail.  So we need to
make sure it is implemented safely without the client being involved.

> If you want to make an argument about data integrity, let's
> be as precise as we can about what we believe might be unsafe. The
> data integrity doubt was with the unaligned ends, IIRC? If we need
> that extra bit of integrity, then WRITEs with an unaligned portion
> will need to be IOCB_DSYNC, and then promoted.

The repeat _valid_ concern from Jeff and you was about the need to
ensure data integrity in the NFSD_IO_DIRECT's misaligned WRITE support
(mixes both buffered IO and direct IO for a single misaligned WRITE).

But now you no longer have that concern and have removed the
IOCB_DSYNC flag which is required for _both_ buffered and direct IO.

(IOCB_SYNC is only required if we'd like to return NFS_FILE_SYNC to
the client, maybe you meant to leave IOCB_DSYNC but remove IOCB_SYNC?
Oh man do I really hope so... ;) If so, please trim from here down)

> An NFS WRITE, even an UNSTABLE one, I believe makes the written data
> visible to other readers. That might be an argument for using IOCB_DSYNC
> with IOCB_DIRECT, so that subsequent NFS READs (and local applications)
> see the written data as soon as NFSD generates a response to an NFS
> WRITE backed by IOCB_DIRECT.

I'm confused, how is using IOCB_DSYNC with IOCB_DIRECT up for question
again?  It is needed for misaligned DIO WRITE (splitting into 3
segments, mixing buffered and direct IO).

From the start, NFSD_IO_DIRECT's WRITE support has been about ensuring
cache coherence on a per WRITE basis (once it's acknowledged back to
the client). I have tried to be careful to do that even when handling
misaligned DIO WRITEs.

> I think we also discussed that an NFS WRITE should make updates visible
> in byte order, which is why the segments are handled low offset to high
> offset.

Trond mentioned it as needed relative to ensuring consistent file
offset.  I mentioned that is the case, and with NFSD we get that
simply by issuing the IO in series, in file offset order -- but for
NFS client's LOCALIO I _do_ issue out-of-order segment IOs (head/tail
buffered and then DIO aligned middle, but with care to preserve file
offset integrity even with partial writes of any of the 3 segments).

Maybe it's fine to have a mode where NFSD allows O_DIRECT to be used
without setting DSYNC when using UNSTABLE, _but_ it really shouldn't
be the default.

> Or, we might decide that, no, NFS WRITE has no data visibility mandate;
> applications achieve data visibility explicitly using COMMIT and file
> locking, so none of this matters.

We don't have that freedom if we cannot preserve file offset integrity
(as would be the case if we removed IOCB_DSYNC when handling all 3
segments of a misaligned DIO WRITE).  Removing IOCB_DSYNC would
compromise misaligned DIO WRITE as implemented.

> > And we already showed that doing so really isn't slow.
> 
> Well we don't have a comparison with "IOCB_DIRECT without IOCB_DSYNC".
> That might be faster than what you tested? Plus I think your test was
> on esoteric enterprise NVMe devices, not on the significantly more
> commonly deployed SSD devices.

IOCB_DIRECT without IOCB_DSYNC isn't an option because we must ensure
the data is ondisk.

The only related bake-off would be:
1) IOCB_DIRECT | IOCB_DSYNC with UNSTABLE 
 vs
2) IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC with NFS_FILE_SYNC

(but on esoteric enterprise storage: there is no difference)

This thread and evolution of the threads before it is jarring:

Just recently we were surprised to find IOCB_DIRECT needs IOCB_DSYNC
to ensure data is ondisk; that you (and I too) thought O_DIRECT would
imply that -- I raised concern given what I saw in the XFS performance
improvement patch that was focused on combining O_DIRECT and O_DSYNC.
Because I couldn't see how to avoid setting DSYNC, and we've since
learned DSYNC is needed _to ensure data is ondisk_.

So I'm left confused... and I'm feeling sick and would like to get off
this merry-go-round now ;)

Thanks,
Mike


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-28  3:26                     ` Mike Snitzer
@ 2025-10-28 15:37                       ` Chuck Lever
  2025-10-28 16:04                         ` Mike Snitzer
  2025-10-29  7:37                         ` Christoph Hellwig
  2025-10-29  7:32                       ` Christoph Hellwig
  1 sibling, 2 replies; 87+ messages in thread
From: Chuck Lever @ 2025-10-28 15:37 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton, Christoph Hellwig
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Chuck Lever, Christoph Hellwig

On 10/27/25 11:26 PM, Mike Snitzer wrote:
> So can we please revisit your desire to eliminate the use of
> IOCB_DSYNC for NFSD_IO_DIRECT WRITEs?

Let's have a little less breathless panic, please. The whole point of
review is to revisit our decisions. And nothing I've done so far is
written in stone... even after merge, we can still apply patches. If
the consensus is we don't like v8 or some particular patch, I will
rewrite or replace it, as I've already said.

You might view me as a whimsical and authoritarian maintainer, but
actually, I am resisting the urge to merge a patch that the community
(that is now responsible for free long-term support of the patch) hasn't
yet fully learned about and carefully reviewed. I see my role as
enforcing a consensus process, and both learning and consensus takes
time.

>> So perhaps the issue here is that the rationale for using IOCB_DSYNC
>> for all NFSD_IO_DIRECT writes is hazy and needs to be clarified.

> How is it still hazy?  We've had repeat discussion about the need for
> IOCB_DSYNC (and IOCB_SYNC if we really want to honor intent of
> NFS_FILE_SYNC).

TL;DR: it's hazy for folks who were not in the room in Raleigh.

I don't see any comments that explain /why/ the unaligned ends need to
be durable along with the middle segment. It appears to be assumed that
everyone agrees it's necessary. Patch review has shown that is perhaps
not a valid assumption.

So far it's been discussed verbally, but we really /really/ want to have
this documented, because we're all going to forget the rationale in a
few months.

And please do not forget that this is open source code. The code has to
be able to be modified by developers outside our community. Right now it
isn't well enough documented for anyone who was outside of the room in
Raleigh two weeks ago to understand WTF is going on. The point being
that we cannot make final decisions in a closed room -- eventually they
need to face the larger community.

If we need to insist that NFSD_IO_DIRECT mode is always going to be
fully durable, that design choice needs to be explained in code comments
that are very close to the code that implements it.

> Christoph has repeatedly said DSYNC is needed with O_DIRECT, yet you
> keep removing it.

That's not what I read. Over the course of three email threads, he wrote
that:

- IOCB_DSYNC is always needed when IOCB_SYNC is set, whether or not
  we're using IOCB_DIRECT.

- in order to guarantee that a direct write is durable, we /either/ need
  IOCB_DSYNC + IOCB_DIRECT, /or/ IOCB_DIRECT by itself with a follow-up
  COMMIT.

- for some commonly deployed media types, IOCB_DSYNC with IOCB_DIRECT
  might be slower than IOCB_DIRECT followed up with COMMIT.

Therefore, we need to carefully justify why the current patches stick
with only IOCB_DSYNC + IOCB_DIRECT, or decide it's truly not necessary
to force all NFSD_IO_DIRECT writes to be IOCB_DSYNC.

Christoph and I (if I may put words in his mouth) both seem to be
interested in making NFSD_IO_DIRECT useful in contexts other than a very
specific enterprise-grade server with esoteric NVMe devices and ultra
high bandwidth networking. After all, one of the deep requirements for
merging something upstream is that it is likely to be useful to more
than a very narrow constituency (see recent discussions about merging
TernFS).

If we truly care about making NFSD_IO_DIRECT valuable for small memory
NFSD systems, then we need to acknowledge that their durable storage is
likely to be virtual or in some other way bandwidth-compromised, and not
a directly-attached real NVMe device.

>> Or, we might decide that, no, NFS WRITE has no data visibility mandate;
>> applications achieve data visibility explicitly using COMMIT and file
>> locking, so none of this matters.
> 
> We don't have that freedom if we cannot preserve file offset integrity
> (as would be the case if we removed IOCB_DSYNC when handling all 3
> segments of a misaligned DIO WRITE).  Removing IOCB_DSYNC would
> compromise misaligned DIO WRITE as implemented.

As I understand it, IOCB_DSYNC has nothing to do with whether the three
segments are directed to the correct file offsets. Serially initiating
the writes with the same iocb should be sufficient to ensure
correctness.

The concern about integrity is that in the multi-segment case, the
segments won't make it to durable storage at the same time, and an
intervening NFS READ might see an intermediate file state.

If the risk of a torn write is an actual problem for an application, it
should serialize its reads, writes, and flushes itself. I think there
are already plausible situations in today's non-DIRECT world where
incomplete writes are visible, so it might be sensible not to worry too
much about it here.

Jeff, please do clarify if I've misunderstood your concern.

>>> And we already showed that doing so really isn't slow.
>>
>> Well we don't have a comparison with "IOCB_DIRECT without IOCB_DSYNC".
>> That might be faster than what you tested? Plus I think your test was
>> on esoteric enterprise NVMe devices, not on the significantly more
>> commonly deployed SSD devices.
> 
> IOCB_DIRECT without IOCB_DSYNC isn't an option because we must ensure
> the data is ondisk.

You are stating the exact assumption that we are testing right now. It
is not 100% clear from code and comments in these patches why "we must
ensure the data is on disk".

-- 
Chuck Lever


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-28 15:37                       ` Chuck Lever
@ 2025-10-28 16:04                         ` Mike Snitzer
  2025-10-28 18:48                           ` Chuck Lever
  2025-10-29  7:37                         ` Christoph Hellwig
  1 sibling, 1 reply; 87+ messages in thread
From: Mike Snitzer @ 2025-10-28 16:04 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Christoph Hellwig, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On Tue, Oct 28, 2025 at 11:37:52AM -0400, Chuck Lever wrote:
> On 10/27/25 11:26 PM, Mike Snitzer wrote:
> > So can we please revisit your desire to eliminate the use of
> > IOCB_DSYNC for NFSD_IO_DIRECT WRITEs?
> 
> Let's have a little less breathless panic, please. The whole point of
> review is to revisit our decisions. And nothing I've done so far is
> written in stone... even after merge, we can still apply patches. If
> the consensus is we don't like v8 or some particular patch, I will
> rewrite or replace it, as I've already said.
> 
> You might view me as a whimsical and authoritarian maintainer, but
> actually, I am resisting the urge to merge a patch that the community
> (that is now responsible for free long-term support of the patch) hasn't
> yet fully learned about and carefully reviewed. I see my role as
> enforcing a consensus process, and both learning and consensus takes
> time.

Cool, thanks for level setting.  Helps!

> >> So perhaps the issue here is that the rationale for using IOCB_DSYNC
> >> for all NFSD_IO_DIRECT writes is hazy and needs to be clarified.
> 
> > How is it still hazy?  We've had repeat discussion about the need for
> > IOCB_DSYNC (and IOCB_SYNC if we really want to honor intent of
> > NFS_FILE_SYNC).
> 
> TL;DR: it's hazy for folks who were not in the room in Raleigh.
> 
> I don't see any comments that explain /why/ the unaligned ends need to
> be durable along with the middle segment. It appears to be assumed that
> everyone agrees it's necessary. Patch review has shown that is perhaps
> not a valid assumption.

How so?  I missed any patch review that called into question the need
to ensure all segments are on stable storage.  The only way to do that
for buffered and direct I/O is to set IOCB_DSYNC.
 
> So far it's been discussed verbally, but we really /really/ want to have
> this documented, because we're all going to forget the rationale in a
> few months.

But it isn't so complex.  DSYNC is needed if you want data to be
ondisk when the call returns.

> And please do not forget that this is open source code. The code has to
> be able to be modified by developers outside our community. Right now it
> isn't well enough documented for anyone who was outside of the room in
> Raleigh two weeks ago to understand WTF is going on. The point being
> that we cannot make final decisions in a closed room -- eventually they
> need to face the larger community.
> 
> If we need to insist that NFSD_IO_DIRECT mode is always going to be
> fully durable, that design choice needs to be explained in code comments
> that are very close to the code that implements it.
> 
> 
> > Christoph has repeatedly said DSYNC is needed with O_DIRECT, yet you
> > keep removing it.
> 
> That's not what I read. Over the course of three email threads, he wrote
> that:
> 
> - IOCB_DSYNC is always needed when IOCB_SYNC is set, whether or not
>   we're using IOCB_DIRECT.
> 
> - in order to guarantee that a direct write is durable, we /either/ need
>   IOCB_DSYNC + IOCB_DIRECT, /or/ IOCB_DIRECT by itself with a follow-up
>   COMMIT.
> 
> - for some commonly deployed media types, IOCB_DSYNC with IOCB_DIRECT
>   might be slower than IOCB_DIRECT followed up with COMMIT.

Hmm, I have other quotes that stand out to me.. but I'll spare you
(especially since I'm short on time to reply at the moment).

> Therefore, we need to carefully justify why the current patches stick
> with only IOCB_DSYNC + IOCB_DIRECT, or decide it's truly not necessary
> to force all NFSD_IO_DIRECT writes to be IOCB_DSYNC.

The misaligned DIO WRITE is what I'm concerned about given the 3
segments that make up the whole of a given NFS WRITE.

We do have the ability to know a misaligned DIO WRITE is occurring
(nsegs > 1).

I would _really_ appreciate it if we could ensure setting IOCB_DSYNC
for that case.
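
For readers who were not part of the earlier threads, the 3-segment
split can be pictured with the userspace sketch below. It is only an
illustration under stated assumptions: `struct sk_seg` and
`split_write()` are hypothetical names, the alignment is assumed to be
a power of two, and buffer memory alignment (which real direct I/O
also cares about) is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment descriptor; not the structure used by NFSD. */
struct sk_seg {
	uint64_t offset;	/* file offset of this segment */
	size_t   len;		/* length in bytes */
	int      use_dio;	/* 1: aligned middle, 0: buffered end */
};

/*
 * Split [offset, offset + len) into an unaligned head, an aligned
 * middle and an unaligned tail.  Returns the number of segments stored
 * in segs[], in file offset order.  More than one segment means the
 * WRITE was misaligned.
 */
size_t split_write(uint64_t offset, size_t len, uint32_t align,
		   struct sk_seg segs[3])
{
	assert(align != 0 && (align & (align - 1)) == 0);

	uint64_t mask = (uint64_t)align - 1;
	uint64_t end = offset + len;
	uint64_t mid_start = (offset + mask) & ~mask;	/* round up */
	uint64_t mid_end = end & ~mask;			/* round down */
	size_t n = 0;

	if (len == 0)
		return 0;

	/* No aligned block fits inside the range: one buffered segment. */
	if (mid_start >= mid_end) {
		segs[n++] = (struct sk_seg){ offset, len, 0 };
		return n;
	}

	if (offset < mid_start)
		segs[n++] = (struct sk_seg){ offset,
					     (size_t)(mid_start - offset), 0 };
	segs[n++] = (struct sk_seg){ mid_start,
				     (size_t)(mid_end - mid_start), 1 };
	if (mid_end < end)
		segs[n++] = (struct sk_seg){ mid_end,
					     (size_t)(end - mid_end), 0 };
	return n;
}
```

An 8192-byte WRITE at offset 100 with 4096-byte alignment yields three
segments (3996 buffered, 4096 direct, 100 buffered), which is the
nsegs > 1 case where the request is to keep IOCB_DSYNC set.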

> Christoph and I (if I may put words in his mouth) both seem to be
> interested in making NFSD_IO_DIRECT useful in contexts other than a very
> specific enterprise-grade server with esoteric NVMe devices and ultra
> high bandwidth networking. After all, one of the deep requirements for
> merging something upstream is that it is likely to be useful to more
> than a very narrow constituency (see recent discussions about merging
> TernFS).

You're fine to extend NFSD_IO_DIRECT to be broadly applicable, I
completely agree with that.  Accomplishing that by removing flags that
ensure data integrity for misaligned DIO WRITE isn't the way forward.

That's my only point.  It's typical to start with more conservative
protection, especially for something manifesting as a new optional
debug setting like NFSD_IO_DIRECT.

Relaxing the code in a sensible and safe manner is perfectly
acceptable.  I'm just saying it is _not_ in the misaligned DIO WRITE
case (I'll stop repeating that now, promise.. heh).

> If we truly care about making NFSD_IO_DIRECT valuable for small memory
> NFSD systems, then we need to acknowledge that their durable storage is
> likely to be virtual or in some other way bandwidth-compromised, and not
> a directly-attached real NVMe device.

Not always the case, but I agree that is one possibility.

> >> Or, we might decide that, no, NFS WRITE has no data visibility mandate;
> >> applications achieve data visibility explicitly using COMMIT and file
> >> locking, so none of this matters.
> > 
> > We don't have that freedom if we cannot preserve file offset integrity
> > (as would be the case if we removed IOCB_DSYNC when handling all 3
> > segments of a misaligned DIO WRITE).  Removing IOCB_DSYNC would
> > compromise misaligned DIO WRITE as implemented.
> 
> As I understand it, IOCB_DSYNC has nothing to do with whether the three
> segments are directed to the correct file offsets. Serially initiating
> the writes with the same iocb should be sufficient to ensure
> correctness.

IOCB_DSYNC just ensures that when they complete they are on-disk.  And
any failed or short WRITE is trapped and is immediate cause for early
return, thereby ensuring file offset integrity.

> The concern about integrity is that in the multi-segment case, the
> segments won't make it to durable storage at the same time, and an
> intervening NFS READ might see an intermediate file state.

Well that is one concern yes; but I'm also concerned about initiating
any page cache invalidation off the back of the DIO WRITE in the
middle (and its potential to fallback to buffered if that invalidation
fails... any associated buffered IO fallback or ends best have O_DSYNC
set to ensure everything is ondisk).

> If the risk of a torn write is an actual problem for an application, it
> should serialize its reads, writes, and flushes itself. I think there
> are already plausible situations in today's non-DIRECT world where
> incomplete writes are visible, so it might be sensible not to worry too
> much about it here.
> 
> Jeff, please do clarify if I've misunderstood your concern.

Yeah, applications really shouldn't expect perfection in the face of
contended buffered READ vs DIO WRITE.. I'm most concerned about us
trying to have misaligned DIO WRITE's IO be handled as if all 3
segments are a unit (_not_ atomic) that when written will still
preserve file offset integrity.  Kills 2 birds of "integrity" (data
and file offset).

> >>> And we already showed that doing so really isn't slow.
> >>
> >> Well we don't have a comparison with "IOCB_DIRECT without IOCB_DSYNC".
> >> That might be faster than what you tested? Plus I think your test was
> >> on esoteric enterprise NVMe devices, not on the significantly more
> >> commonly deployed SSD devices.
> > 
> > IOCB_DIRECT without IOCB_DSYNC isn't an option because we must ensure
> > the data is ondisk.
> 
> You are stating the exact assumption that we are testing right now. It
> is not 100% clear from code and comments in these patches why "we must
> ensure the data is on disk".

So misaligned DIO WRITE in terms of 3 segments works.

Thank you for your attention to this matter! ;)

Mike


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-28 16:04                         ` Mike Snitzer
@ 2025-10-28 18:48                           ` Chuck Lever
  2025-10-28 23:56                             ` Mike Snitzer
  0 siblings, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-28 18:48 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jeff Layton, Christoph Hellwig, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On 10/28/25 12:04 PM, Mike Snitzer wrote:
> I'm also concerned about initiating
> any page cache invalidation off the back of the DIO WRITE in the
> middle (and its potential to fallback to buffered if that invalidation
> fails... any associated buffered IO fallback or ends best have O_DSYNC
> set to ensure everything is ondisk).

My questions:

A. Which filesystem code can return -ENOTBLK?

- iomap-based DIO (fs/iomap/direct-io.c:715): XFS, ext4, f2fs, gfs2,
  zonefs, erofs
- Legacy DIO path (fs/direct-io.c:984): ext2, some ext4 code paths
- btrfs (has its own DIO implementation that uses -ENOTBLK)

Code auditing shows that vfs_iocb_iter_write() never returns -ENOTBLK
because all filesystems that could generate this error internally
convert it to 0 or another error code before returning from their
write_iter implementation.

Therefore NFSD itself does not need to know about or handle page cache
invalidation failure.
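
The pattern found by that audit can be sketched roughly as follows.
This is a simplified userspace model, not any filesystem's actual
write_iter: `sk_fs_write_iter()` and the `sk_dio_*` / `sk_buffered_ok`
stubs are hypothetical, and the real paths differ per filesystem. The
point it illustrates is only that -ENOTBLK is consumed below the
caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical low-level write paths: bytes written or -errno. */
typedef ssize_t (*sk_write_fn)(size_t len);

/* Stub back ends used to exercise the wrapper below. */
ssize_t sk_dio_ok(size_t len)      { return (ssize_t)len; }
ssize_t sk_dio_enotblk(size_t len) { (void)len; return -ENOTBLK; }
ssize_t sk_dio_eio(size_t len)     { (void)len; return -EIO; }
ssize_t sk_buffered_ok(size_t len) { return (ssize_t)len; }

/*
 * Model of a filesystem write entry point: try the direct path first
 * and, if it reports -ENOTBLK (for example because page cache
 * invalidation failed), redo the write through the buffered path.
 * Other errors pass through unchanged.  The caller never sees
 * -ENOTBLK.
 */
ssize_t sk_fs_write_iter(sk_write_fn direct, sk_write_fn buffered,
			 size_t len)
{
	ssize_t ret = direct(len);

	if (ret == -ENOTBLK)
		ret = buffered(len);

	/* Invariant claimed by the audit: -ENOTBLK does not escape. */
	assert(ret != -ENOTBLK);
	return ret;
}
```

A caller such as NFSD sits above `sk_fs_write_iter()` in this model, so
it only ever sees a byte count or an ordinary error such as -EIO.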

B. Even if NFSD had to recover from a failed page cache invalidation,
   does a fallback write need to set IOCB_DSYNC (not considering the NFS
   protocol requirements) ?

When invalidation fails, some pages remain in the cache (dirty or with
private data). The buffered fallback writes to those same pages,
updating them with the correct data. There's no torn state or corruption
hazard.

The fallback restarts the entire write from scratch, as buffered. Any
partial progress (like a buffered first segment) gets rewritten with
the same data. No orphaned blocks or inconsistent state.

If another process does a DIO read of this region, it will call
kiocb_write_and_wait() first (see mm/filemap.c:2912), which flushes any
dirty pages before reading. So they'll see correct data even if it's
still dirty in cache (Jeff's concern).

C. When setting up a three-segment write, is IOCB_DSYNC needed on the
   unaligned ends (not considering the NFS protocol requirements) ?

Before writing the DIO middle segment, the VFS must first invalidate the
page cache, so the first (buffered) segment is flushed to durability
anyway. The use of IOCB_DSYNC for the first segment is superfluous.

After writing the DIO middle segment, the unaligned end remains only in
the page cache if IOCB_DSYNC is not used. But a subsequent DIO read will
flush that to durability (via kiocb_write_and_wait) before satisfying
the read. So, IOCB_DSYNC is not needed there to meet putative file data
visibility mandates.

>> As I understand it, IOCB_DSYNC has nothing to do with whether the three
>> segments are directed to the correct file offsets. Serially initiating
>> the writes with the same iocb should be sufficient to ensure
>> correctness.
> 
> IOCB_DSYNC just ensures when they complete they are on-disk.  And any
> failure of, or short WRITE, is trapped and immediate cause for early
> return.  Whereby ensuring file offset integrity.

AFAICT, generic_perform_write() updates ki_pos synchronously whenever
it returns successfully, regardless of the IOCB_DSYNC or IOCB_DIRECT
flag settings. If the vfs_iocb_iter_write() call fails or returns fewer
bytes than expected, the loop terminates immediately and writes to
subsequent segments are not initiated. I don't see a data integrity
issue here.
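
A userspace sketch of that control flow, under stated assumptions:
struct fake_iocb, fake_write() and write_segments() are hypothetical
stand-ins, not kernel APIs. The fake backend advances the position by
exactly the number of bytes it accepted, and the caller stops at the
first error or short write, so later segments can never land at a
wrong offset regardless of any sync flag.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for a kiocb; only the position matters here. */
struct fake_iocb {
	long long pos;		/* plays the role of ki_pos */
	size_t device_room;	/* bytes the fake backend will still accept */
};

/* Fake backend: accepts up to device_room bytes, advances pos by that. */
ssize_t fake_write(struct fake_iocb *iocb, size_t len)
{
	size_t done = len < iocb->device_room ? len : iocb->device_room;

	if (done == 0 && len != 0)
		return -1;	/* stand-in for an error return */
	iocb->device_room -= done;
	iocb->pos += (long long)done;
	return (ssize_t)done;
}

/* Issue segments in order; stop at the first error or short write. */
ssize_t write_segments(struct fake_iocb *iocb, const size_t *lens, int nsegs)
{
	ssize_t total = 0;
	int i;

	assert(iocb != NULL && lens != NULL);
	for (i = 0; i < nsegs; i++) {
		ssize_t ret = fake_write(iocb, lens[i]);

		if (ret < 0)
			return ret;	/* error: report it, start nothing else */
		total += ret;
		if ((size_t)ret < lens[i])
			break;		/* short write: skip later segments */
	}
	return total;
}
```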

Thus I'm still unconvinced that IOCB_DSYNC is required if the client has
requested an UNSTABLE write. I'm open to reviewing actual evidence of a
failure mode where IOCB_DSYNC might prevent data corruption, but so far
I don't see how it can happen.

-- 
Chuck Lever

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-28 18:48                           ` Chuck Lever
@ 2025-10-28 23:56                             ` Mike Snitzer
  2025-10-29 15:22                               ` Chuck Lever
  0 siblings, 1 reply; 87+ messages in thread
From: Mike Snitzer @ 2025-10-28 23:56 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Christoph Hellwig, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On Tue, Oct 28, 2025 at 02:48:33PM -0400, Chuck Lever wrote:
> On 10/28/25 12:04 PM, Mike Snitzer wrote:
> > I'm also concerned about initiating
> > any page cache invalidation off the back of the DIO WRITE in the
> > middle (and its potential to fallback to buffered if that invalidation
> > fails... any associated buffered IO fallback or ends best have O_DSYNC
> set to ensure everything is ondisk).
>
> My questions:

All good questions, and I appreciate the time you spent analyzing
these aspects.

I'm all for NFSD's IO modes being able to work with all possible
stable_how.  Your analysis helps expand the scope of these IO modes to
really be pushed to their limits (specifically thinking of
NFSD_IO_DIRECT being paired with UNSTABLE and it _not_ using
IOCB_DSYNC, interesting combo but IMO quite a risky default given
we're only at the debugfs interface stage for NFSD_IO_DIRECT).

> A. Which filesystem code can return -ENOTBLK?
> 
> - iomap-based DIO (fs/iomap/direct-io.c:715): XFS, ext4, f2fs, gfs2,
>   zonefs, erofs
> - Legacy DIO path (fs/direct-io.c:984): ext2, some ext4 code paths
> - btrfs (has its own DIO implementation that uses -ENOTBLK)
> 
> Code auditing shows that vfs_iocb_iter_write() never returns -ENOTBLK
> because all filesystems that could generate this error internally
> convert it to 0 or another error code before returning from their
> write_iter implementation.
> 
> Therefore NFSD itself does not need to know about or handle page cache
> invalidation failure.

OK, good to hear; but have you reviewed if they are all using D_SYNC
as part of their internal handling of their buffered fallback?  NFSD
setting DSYNC just acknowledges it is required to ensure O_DIRECT
WRITEs are ondisk upon return to client -- something I don't want to
leave to the client to ensure via NFS_UNSTABLE's subsequent COMMIT.

Setting O_DSYNC gives us a baseline where we don't leave some other
subsystem to ensure data gets to disk as quickly as possible (O_DSYNC
offers NFS_DATA_SYNC, but O_DSYNC|SYNC's NFS_FILE_SYNC offers a win by
avoiding COMMIT, even more enticing).

So I need NFSD_IO_DIRECT to be able to require IOCB_DSYNC and possibly
IOCB_DSYNC|IOCB_SYNC.  And I'm most comfortable with the misaligned
DIO WRITE support using O_DSYNC for all 3 segments; so that WRITE's
data is ondisk before returning from the WRITE (NFS_DATA_SYNC), and
bonus if we can avoid extra COMMIT (NFS_FILE_SYNC).

> B. Even if NFSD had to recover from a failed page cache invalidation,
>    does a fallback write need to set IOCB_DSYNC (not considering the NFS
>    protocol requirements) ?
> 
> When invalidation fails, some pages remain in the cache (dirty or with
> private data). The buffered fallback writes to those same pages,
> updating them with the correct data. There's no torn state or corruption
> hazard.
> 
> The fallback restarts the entire write from scratch, as buffered. Any
> partial progress (like a buffered first segment) gets rewritten with
> the same data. No orphaned blocks or inconsistent state.
> 
> If another process does a DIO read of this region, it will call
> kiocb_write_and_wait() first (see mm/filemap.c:2912), which flushes any
> dirty pages before reading. So they'll see correct data even if it's
> still dirty in cache (Jeff's concern).

Yes, but the point is there is extra work that involves the page
cache by other layers.  And O_DIRECT is intended to work expecting the
IO has been flushed to disk.  O_DIRECT is aiming to reduce page cache
usage, to keep memory use low.  So ensuring the page cache invalidated
to disk by the subsequent write as a side-effect of O_DSYNC helps
(more on this below [0]).

> C. When setting up a three-segment write, is IOCB_DSYNC needed on the
>    unaligned ends (not considering the NFS protocol requirements) ?
> 
> Before writing the DIO middle segment, the VFS must first invalidate the
> page cache, so the first (buffered) segment is flushed to durability
> anyway. The use of IOCB_DSYNC for the first segment is superfluous.

No, the first segment reflects the misaligned head of the WRITE. The
DIO-aligned middle segment's DIO will just invalidate pages aligned on
the middle segment's alignment (which is page aligned).  So the DIO
from the middle segment won't have any side-effect on the first
segment's page.  First page needs to be written out with IOCB_DSYNC.

> After writing the DIO middle segment, the unaligned end remains only in
> the page cache if IOCB_DSYNC is not used. But a subsequent DIO read will
> flush that to durability (via kiocb_write_and_wait) before satisfying
> the read. So, IOCB_DSYNC is not needed there to meet putative file data
> visibility mandates.

The same applies to both the misaligned first (head) and third (tail)
segments, they both need O_DSYNC to ensure their contents are ondisk.

[0]: Again, my goal for NFSD_IO_DIRECT is to operate in terms of
O_DIRECT|O_DSYNC ondisk.. and not lean on the page cache more than
needed. This makes the VM subsystem more scalable as a side-effect of
it having more clean pages that are quickly dropped and/or reused if
something needs memory.

> >> As I understand it, IOCB_DSYNC has nothing to do with whether the three
> >> segments are directed to the correct file offsets. Serially initiating
> >> the writes with the same iocb should be sufficient to ensure
> >> correctness.
> > 
> > IOCB_DSYNC just ensures when they complete they are on-disk.  And any
> > failure of, or short WRITE, is trapped and immediate cause for early
> > return.  Whereby ensuring file offset integrity.
> 
> AFAICT, generic_perform_write() updates ki_pos synchronously whenever
> it returns successfully, regardless of the IOCB_DSYNC or IOCB_DIRECT
> flag settings. If the vfs_iocb_iter_write() call fails or returns fewer
> bytes than expected, the loop terminates immediately and writes to
> subsequent segments are not initiated. I don't see a data integrity
> issue here.

For ki_pos we're saying the same thing: yes ki_pos is updated (and
it'll reflect a short write if one happens, etc). IOCB_DSYNC doesn't
influence if ki_pos is advanced. But if IOCB_DSYNC isn't set then
ki_pos is advanced beyond what was written _to disk_. That obviously
doesn't matter if NFS_UNSTABLE, thanks to its COMMIT, but it does
increase latency (though it is worth benchmarking with various
workloads!).

Anyway, I've been working to ensure NFSD_IO_DIRECT acts as if at least
NFS_DATA_SYNC, but preferably NFS_FILE_SYNC, set.  I should have more
clearly stated that.  Bypassing all caches as much as possible allows for
enabling scalable cache coherence, e.g. intelligent applications that
don't rely on file locking, think multiple writers writing to their
own extent of a large file.

Having NFSD_IO_DIRECT be handled as UNSTABLE isn't adequate for my
needs (immediate ondisk requirements, lowest memory and CPU usage as
possible).  Would you be OK with adding 2 new debugfs knobs?:

/sys/kernel/debug/nfsd/io_cache_read_stable_how
/sys/kernel/debug/nfsd/io_cache_write_stable_how

- Each defaults to using the stable_how that the client requested.

- Each will serve as a floor, and only override default if they are
  greater/stricter than the client requested stable_how. (so if
  'io_cache_write_stable_how' set to NFS_FILE_SYNC it'd override
  client specified UNSTABLE).

- Each can override the stable_how used so each IO mode behaves
  accordingly (e.g. I can set NFSD_IO_DIRECT and NFS_FILE_SYNC to get
  the behaviour I'd like).
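
The "floor" semantics described in the list above could be sketched as
follows. This is only an illustration of the proposal, not existing
NFSD code: effective_stable_how() is a hypothetical helper, and the
enum simply reuses the stable_how values from the NFSv3 protocol
(UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2).

```c
#include <assert.h>

/* Values follow the stable_how enumeration in the NFSv3 protocol. */
enum stable_how {
	NFS_UNSTABLE = 0,
	NFS_DATA_SYNC = 1,
	NFS_FILE_SYNC = 2,
};

/*
 * Hypothetical helper: the administrative setting acts as a floor.
 * It can raise the durability the client asked for, never lower it.
 */
enum stable_how effective_stable_how(enum stable_how client,
				     enum stable_how floor)
{
	assert((int)client >= NFS_UNSTABLE && (int)client <= NFS_FILE_SYNC);
	assert((int)floor >= NFS_UNSTABLE && (int)floor <= NFS_FILE_SYNC);

	return client > floor ? client : floor;
}
```

With a floor of NFS_UNSTABLE the helper returns whatever the client
requested, which matches the proposed default.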

> Thus I'm still unconvinced that IOCB_DSYNC is required if the client has
> requested an UNSTABLE write. I'm open to reviewing actual evidence of a
> failure mode where IOCB_DSYNC might prevent data corruption, but so far
> I don't see how it can happen.

Yeah, I see your point now.  If io_cache_{read,write}_stable_how
debugfs controls are acceptable to you it'll give us maximum
flexibility for controlling how NFSD behaves in each NFSD_IO mode.

Thanks.
Mike

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 16:18                   ` Mike Snitzer
  2025-10-27 16:59                     ` Mike Snitzer
@ 2025-10-29  7:20                     ` Christoph Hellwig
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-29  7:20 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Chuck Lever, Jeff Layton, Christoph Hellwig, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever,
	Christoph Hellwig

On Mon, Oct 27, 2025 at 12:18:30PM -0400, Mike Snitzer wrote:
> LOCALIO's misaligned DIO will issue head/tail followed by O_DIRECT
> middle (via AIO completion of that aligned middle).  So out of order
> relative to file offset.

That's in general a really bad idea.  It will obviously work, but
both on SSDs and out of place write file systems it is a sure way
to increase your garbage collection overhead a lot down the line.


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-27 17:57                   ` Chuck Lever
  2025-10-28  3:26                     ` Mike Snitzer
@ 2025-10-29  7:25                     ` Christoph Hellwig
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-29  7:25 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Mike Snitzer, Jeff Layton, Christoph Hellwig, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever,
	Christoph Hellwig

On Mon, Oct 27, 2025 at 01:57:25PM -0400, Chuck Lever wrote:
> The compelling reason is that it's generally faster (or less work for
> the NFS server and its storage) to sync the metadata after the client
> has sent all of the data it wants to write. This amortizes the cost of
> the metadata operations, and allows the server to get the written data
> persisted (if it makes sense to do that) while waiting for the COMMIT.

Yes.  Also IFF the client does not want an extra COMMIT it can ask
for a stable write already.  And the (Linux) client already gets a
hint from the application that it doesn't want to do the extra commit
by using O_SYNC/O_DSYNC or RWF_SYNC.  So there really is no need for
the server to second guess the behavior here, and instead the client
should be changed to do this (from what I can tell it currently doesn't,
but I might be missing something).

> If you want to make an argument about data integrity, let's
> be as precise as we can about what we believe might be unsafe. The
> data integrity doubt was with the unaligned ends, IIRC? If we need
> that extra bit of integrity, then WRITEs with an unaligned portion
> will need to be IOCB_DSYNC, and then promoted.

IFF.  I've not really seen any argument for that.


* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-28  3:26                     ` Mike Snitzer
  2025-10-28 15:37                       ` Chuck Lever
@ 2025-10-29  7:32                       ` Christoph Hellwig
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-29  7:32 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Chuck Lever, Jeff Layton, Christoph Hellwig, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever,
	Christoph Hellwig

On Mon, Oct 27, 2025 at 11:26:58PM -0400, Mike Snitzer wrote:
> But none of that matters if the only safe way to implement mixing
> buffered and direct IO is by waiting for the DIO to succeed and with
> it any associated page cache invalidation (and associated possible
> failure to invalidate the page handled with using buffered IO fallback
> by underlying filesystem).

Why do you think it is the only safe way?  (The above seems to imply
that too, or maybe it is a question?)

> 
> Any buffered or direct IO associated with the misaligned DIO WRITE
> handling, in terms of 3 segments, must use IOCB_DSYNC.

Can you explain why exactly?  If that is indeed the case, it is a
very important point we clearly need to document.

> But the entire intent behind NFSD's O_DIRECT support is to ensure IO
> is on stable storage when it replies to the client.

Is it?  This is the first time I read that this is the "entire point".
I thought the main reason was to stop wasting memory on the server
and reduce memory copies.  If the entire intent is to commit to
stable storage only, there is no need for all the direct I/O games,
and you can just use RWF_DSYNC only.

> The client isn't
> meant to get involved with driving the correctness of NFSD's O_DIRECT
> support (by requiring the client set NFS_FILE_SYNC for the benefit of
> a feature it doesn't know is enabled in the server).

Of course the client should not care about the servers implementation
detail.  But I don't see how this is relevant here at all.

> IOCB_DIRECT without IOCB_DSYNC isn't an option because we must ensure
> the data is ondisk.

Why?

> The only related bake-off would be:
> 1) IOCB_DIRECT | IOCB_DSYNC with UNSTABLE 
>  vs
> 2) IOCB_DIRECT | IOCB_DSYNC | IOCB_SYNC with NFS_FILE_SYNC
> 
> (but on esoteric enterprise storage: there is no difference)

What are you talking about? With any Linux file system on any storage
it does make a huge difference except for the corner case of pure
overwrites of fully allocated ranges.

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-28 15:37                       ` Chuck Lever
  2025-10-28 16:04                         ` Mike Snitzer
@ 2025-10-29  7:37                         ` Christoph Hellwig
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2025-10-29  7:37 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Mike Snitzer, Jeff Layton, Christoph Hellwig, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever,
	Christoph Hellwig

On Tue, Oct 28, 2025 at 11:37:52AM -0400, Chuck Lever wrote:
> > Christoph has repeatedly said DSYNC is needed with O_DIRECT, yet you
> > keep removing it.
> 
> That's not what I read. Over the course of three email threads, he wrote
> that:
> 
> - IOCB_DSYNC is always needed when IOCB_SYNC is set, whether or not
>   we're using IOCB_DIRECT.

Yes.

> 
> - in order to guarantee that a direct write is durable, we /either/ need
>   IOCB_DSYNC + IOCB_DIRECT, /or/ IOCB_DIRECT by itself with a follow-up
>   COMMIT.

Yes - although at this level I'd talk about fsync/fdatasync instead of
COMMIT to be more clear.
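
The two points confirmed so far can be sketched as a flag-selection
helper. Everything here is an assumption made for illustration: the
SK_* constants are hypothetical stand-ins for IOCB_DIRECT, IOCB_DSYNC
and IOCB_SYNC, and sk_write_flags() is not a function in NFSD. It shows
that a sync request always carries the dsync bit as well, and that a
direct write without the dsync bit still owes a later fsync/fdatasync
(COMMIT, at the protocol level) before it is durable.

```c
#include <assert.h>

/* Hypothetical stand-ins for IOCB_DIRECT, IOCB_DSYNC and IOCB_SYNC. */
#define SK_DIRECT 0x1u
#define SK_DSYNC  0x2u
#define SK_SYNC   0x4u

/* Same ordering as the NFS stable_how values. */
enum sk_stable { SK_UNSTABLE = 0, SK_DATA_SYNC = 1, SK_FILE_SYNC = 2 };

/* Pick write flags for a direct write at the requested durability. */
unsigned int sk_write_flags(enum sk_stable how)
{
	unsigned int flags = SK_DIRECT;

	if (how == SK_DATA_SYNC)
		flags |= SK_DSYNC;
	else if (how == SK_FILE_SYNC)
		flags |= SK_DSYNC | SK_SYNC;

	/* A sync request without the dsync bit would be malformed. */
	assert(!(flags & SK_SYNC) || (flags & SK_DSYNC));
	return flags;
}

/*
 * Nonzero when the write is not durable on return and a follow-up
 * fsync/fdatasync (COMMIT on the wire) is still required.
 */
int sk_needs_followup_sync(unsigned int flags)
{
	return !(flags & SK_DSYNC);
}
```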

> - for some commonly deployed media types, IOCB_DSYNC with IOCB_DIRECT
>   might be slower than IOCB_DIRECT followed up with COMMIT.

This is not primarily about media types.  For any allocation write
(append, hole filling, conversion of unwritten extent, out of place
writes due to reflink or a log structured file system), we need to
commit metadata to make data durable.  Any batching of that is a huge
efficiency win.  For pure overwrites (file data written before, not
just preallocated, and not on a file system / file writing out of
place), on devices that do not have a volatile write cache,
using IOCB_DSYNC will usually be fast.  Maybe also on some devices
with a write cache if their REQ_FUA implementation is faster than
a full cache flush (which for cheaper SSDs generally is not the case).

> Therefore, we need to carefully justify why the current patches stick
> with only IOCB_DSYNC + IOCB_DIRECT, or decide it's truly not necessary
> to force all NFSD_IO_DIRECT writes to be IOCB_DSYNC.

Yes.  Especially as the client can explicitly ask for stable writes if
it thinks they are applicable, and the client is in a much better
position to decide that as the application tells it!

> Christoph and I (if I may put words in his mouth) both seem to be
> interested in making NFSD_IO_DIRECT useful in contexts other than a very
> specific enterprise-grade server with esoteric NVMe devices and ultra
> high bandwidth networking.

SSDs or hard disks with a non-volatile write cache aren't exactly
esoteric, they are just the more expensive tier.  But for most write
patterns that doesn't help you anyway.

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-28 23:56                             ` Mike Snitzer
@ 2025-10-29 15:22                               ` Chuck Lever
  2025-10-29 16:54                                 ` Mike Snitzer
  0 siblings, 1 reply; 87+ messages in thread
From: Chuck Lever @ 2025-10-29 15:22 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jeff Layton, Christoph Hellwig, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig

On 10/28/25 7:56 PM, Mike Snitzer wrote:
> On Tue, Oct 28, 2025 at 02:48:33PM -0400, Chuck Lever wrote:
>> On 10/28/25 12:04 PM, Mike Snitzer wrote:
>>> I'm also concerned about initiating
>>> any page cache invalidation off the back of the DIO WRITE in the
>>> middle (and its potential to fallback to buffered if that invalidation
>>> fails... any associated buffered IO fallback or ends best have O_DSYNC
>>> set to ensure everything is ondisk).
>>
>> My questions:
> 
> All good questions, and I appreciate the time you spent analyzing
> these aspects.
> 
> I'm all for NFSD's IO modes being able to work with all possible
> stable_how.  Your analysis helps expand the scope of these IO modes to
> really be pushed to their limits (specifically thinking of
> NFSD_IO_DIRECT being paired with UNSTABLE and it _not_ using
> IOCB_DSYNC, interesting combo but IMO quite a risky default given
> we're only at the debugfs interface stage for NFSD_IO_DIRECT).
>  
>> A. Which filesystem code can return -ENOTBLK?
>>
>> - iomap-based DIO (fs/iomap/direct-io.c:715): XFS, ext4, f2fs, gfs2,
>>   zonefs, erofs
>> - Legacy DIO path (fs/direct-io.c:984): ext2, some ext4 code paths
>> - btrfs (has its own DIO implementation that uses -ENOTBLK)
>>
>> Code auditing shows that vfs_iocb_iter_write() never returns -ENOTBLK
>> because all filesystems that could generate this error internally
>> convert it to 0 or another error code before returning from their
>> write_iter implementation.
>>
>> Therefore NFSD itself does not need to know about or handle page cache
>> invalidation failure.
> 
> OK, good to hear; but have you reviewed if they are all using D_SYNC
> as part of their internal handling of their buffered fallback?  NFSD
> setting DSYNC just acknowledges it is required to ensure O_DIRECT
> WRITEs are ondisk upon return to client -- something I don't want to
> leave to the client to ensure via NFS_UNSTABLE's subsequent COMMIT.

As Christoph said, the application runs on the client, and it tells
the client exactly what needs to happen. Then the client tells the
server.

Essentially what forcing all writes to be slow durable writes does
is make a special case a little faster (maybe, that has yet to be
demonstrated) at the expense of all other use cases that prefer to
write a lot and then commit.

And if I understand Jonathan's numbers correctly, this only truly
matters when the server's memory has been exhausted. The usual Linux
paradigm is to let workloads whose resident sets fit in memory go
just as fast as can be allowed. We need the additional durability
only when the server is tipping over.

The server could flush more aggressively when it recognizes it cannot
hold its working set in memory. But I think the issue there is that the
server is tipping over because it simply cannot flush fast enough to
keep up with clients.

I'd like someone to have a look at slowing down noisy clients when the
server starts to reach its limits. At least NFSD should get some
observability mechanisms to tell which clients are the troublesome ones.

> Setting O_DSYNC gives us a baseline where we don't leave some other
> subsystem to ensure data gets to disk as quickly as possible (O_DSYNC
> offers NFS_DATA_SYNC, but O_DSYNC|SYNC's NFS_FILE_SYNC offers a win by
> avoiding COMMIT, even more enticing).
> 
> So I need NFSD_IO_DIRECT to be able to require IOCB_DSYNC and possibly
> IOCB_DSYNC|IOCB_SYNC.  And I'm most comfortable with the misaligned
> DIO WRITE support using O_DSYNC for all 3 segments; so that WRITE's
> data is ondisk before returning from the WRITE (NFS_DATA_SYNC), and
> bonus if we can avoid extra COMMIT (NFS_FILE_SYNC).

This flies in the face of what the protocol mandates and what
applications have come to expect over the past 30 years.

The server is free to promote WRITE to higher durability, but that
/always/ comes at a cost. That cost is worth it, I can see, only when
the server cannot cache an UNSTABLE WRITE.

I suspect your true mission is to turn NFS into Lustre. ;-)

>> B. Even if NFSD had to recover from a failed page cache invalidation,
>>    does a fallback write need to set IOCB_DSYNC (not considering the NFS
>>    protocol requirements) ?
>>
>> When invalidation fails, some pages remain in the cache (dirty or with
>> private data). The buffered fallback writes to those same pages,
>> updating them with the correct data. There's no torn state or corruption
>> hazard.
>>
>> The fallback restarts the entire write from scratch, as buffered. Any
>> partial progress (like a buffered first segment) gets rewritten with
>> the same data. No orphaned blocks or inconsistent state.
>>
>> If another process does a DIO read of this region, it will call
>> kiocb_write_and_wait() first (see mm/filemap.c:2912), which flushes any
>> dirty pages before reading. So they'll see correct data even if it's
>> still dirty in cache (Jeff's concern).
> 
> Yes, but the point is there is extra work that involves the page
> cache by other layers.  And O_DIRECT is intended to work expecting the
> IO has been flushed to disk.  O_DIRECT is aiming to reduce page cache
> usage, to keep memory use low.  So ensuring the page cache invalidated
> to disk by the subsequent write as a side-effect of O_DSYNC helps
> (more on this below [0]).

Go look at what happens after an IOCB_DIRECT write returns to its
caller. The write buffer goes away almost immediately and there's
nothing left in memory.

The unaligned ends go through the page cache, but IOCB_DONTCACHE is set
on those, so those pages are immediately evictable.

>> C. When setting up a three-segment write, is IOCB_DSYNC needed on the
>>    unaligned ends (not considering the NFS protocol requirements) ?
>>
>> Before writing the DIO middle segment, the VFS must first invalidate the
>> page cache, so the first (buffered) segment is flushed to durability
>> anyway. The use of IOCB_DSYNC for the first segment is superfluous.
> 
> No, the first segment reflects the misaligned head of the WRITE. The
> DIO-aligned middle segment's DIO will just invalidate pages aligned on
> the middle segment's alignment (which is page aligned).  So the DIO
> from the middle segment won't have any side-effect on the first
> segment's page.  First page needs to be written out with IOCB_DSYNC.

You still haven't demonstrated why. What is the risk of not making
the first segment durable? I don't find the arguments about a little
extra memory usage at all convincing, especially if DONTCACHE is in
play.

So, have we decided that there is no data integrity risk with leaving
off IOCB_DSYNC for the burnt ends? The only purpose now is a potential
performance optimization?

>> After writing the DIO middle segment, the unaligned end remains only in
>> the page cache if IOCB_DSYNC is not used. But a subsequent DIO read will
>> flush that to durability (via kiocb_write_and_wait) before satisfying
>> the read. So, IOCB_DSYNC is not needed there to meet putative file data
>> visibility mandates.
> 
> The same applies to both the misaligned first (head) and third (tail)
> segments, they both need O_DSYNC to ensure their contents are ondisk.

A direct READ will write back any cached pages in the read byte range
before it reads from the media. Full stop. There is no need for NFSD
to add IOCB_DSYNC for reads to get the latest data.

> [0]: Again, my goal for NFSD_IO_DIRECT is to operate in terms of
> O_DIRECT|O_DSYNC ondisk.. and not lean on the page cache more than
> needed. This makes the VM subsystem more scalable as a side-effect of
> it having more clean pages that are quickly dropped and/or reused if
> something needs memory.

That's a giant claim, and it needs a lot of clear evidence to back it
up. I think when you have code we can review, we can begin to discuss
it further. For now, let's focus on the code to be merged now.

> Anyway, I've been working to ensure NFSD_IO_DIRECT acts as if at least
> NFS_DATA_SYNC, but preferably NFS_FILE_SYNC, set.

Clearly, but the rationale for this effort needs some work. It doesn't
reflect the mandates of the NFS protocol and I haven't seen a use case
for it.

> I should have more
> clearly stated that.  Bypassing all caches as much as possible allows for
> enabling scalable cache coherence, e.g. intelligent applications that
> don't rely on file locking, think multiple writers writing to their
> own extent of a large file.

That comes at the expense of applications that do not need or want that
extra bit of coherence and would pay for it with slower writes.

> Having NFSD_IO_DIRECT be handled as UNSTABLE isn't adequate for my
> needs (immediate ondisk requirements, lowest memory and CPU usage as
> possible).  Would you be OK with adding 2 new debugfs knobs?:
> 
> /sys/kernel/debug/nfsd/io_cache_read_stable_how

READs don't have a stable_how argument. I don't even know what a
stable_how argument on a READ would do.

> /sys/kernel/debug/nfsd/io_cache_write_stable_how
> 
> - Each defaults to using the stable_how that the client requested.
> 
> - Each will serve as a floor, and only override default if they are
>   greater/stricter than the client requested stable_how. (so if
>   'io_cache_write_stable_how' set to NFS_FILE_SYNC it'd override
>   client specified UNSTABLE).
> 
> - Each can override the stable_how used so each IO mode behaves
>   accordingly (e.g. I can set NFSD_IO_DIRECT and NFS_FILE_SYNC to get
>   the behaviour I'd like).

I'm loath to add another potential administrative setting for something
that should be determined automatically (or at least, based on the
workload conditions). But, why not add a fourth IO_MODE setting instead
of another debug setting?

But first, let's see the use cases and performance data. I think you're
getting way ahead of yourself, and none of this needs to be considered
to decide when the present series is ready to be merged.

Before anyone can consider these ideas, you need to create the patches
and show there's a value difference for some applications and
negligible cost to everyone else.

As they say on the cop shows, let's go where the evidence leads.

-- 
Chuck Lever

* Re: [PATCH v7 14/14] NFSD: Initialize separate ki_flags
  2025-10-29 15:22                               ` Chuck Lever
@ 2025-10-29 16:54                                 ` Mike Snitzer
  0 siblings, 0 replies; 87+ messages in thread
From: Mike Snitzer @ 2025-10-29 16:54 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Christoph Hellwig, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever, Christoph Hellwig,
	jonathan.flynn, trondmy

On Wed, Oct 29, 2025 at 11:22:13AM -0400, Chuck Lever wrote:
> On 10/28/25 7:56 PM, Mike Snitzer wrote:
> > On Tue, Oct 28, 2025 at 02:48:33PM -0400, Chuck Lever wrote:
> >> On 10/28/25 12:04 PM, Mike Snitzer wrote:
> >>> I'm also concerned about initiating
> >>> any page cache invalidation off the back of the DIO WRITE in the
> >>> middle (and its potential to fallback to buffered if that invalidation
> >>> fails... any associated buffered IO fallback or ends best have O_DSYNC
> >>> set to ensure everything is ondisk).
> >>
> >> My questions:
> > 
> > All good questions, and I appreciate the time you spent analyzing
> > these aspects.
> > 
> > I'm all for NFSD's IO modes being able to work with all possible
> > stable_how.  Your analysis helps expand the scope of these IO modes to
> > really be pushed to their limits (specifically thinking of
> > NFSD_IO_DIRECT being paired with UNSTABLE and it _not_ using
> > IOCB_DSYNC, interesting combo but IMO quite a risky default given
> > we're only at the debugfs interface stage for NFSD_IO_DIRECT).
> >  
> >> A. Which filesystem code can return -ENOTBLK?
> >>
> >> - iomap-based DIO (fs/iomap/direct-io.c:715): XFS, ext4, f2fs, gfs2,
> >>   zonefs, erofs
> >> - Legacy DIO path (fs/direct-io.c:984): ext2, some ext4 code paths
> >> - btrfs (has its own DIO implementation that uses -ENOTBLK)
> >>
> >> Code auditing shows that vfs_iocb_iter_write() never returns -ENOTBLK
> >> because all filesystems that could generate this error internally
> >> convert it to 0 or another error code before returning from their
> >> write_iter implementation.
> >>
> >> Therefore NFSD itself does not need to know about or handle page cache
> >> invalidation failure.
> > 
> > OK, good to hear; but have you reviewed if they are all using D_SYNC
> > as part of their internal handling of their buffered fallback?  NFSD
> > setting DSYNC just acknowledges it is required to ensure O_DIRECT
> > WRITEs are ondisk upon return to client -- something I don't want to
> > leave to the client to ensure via NFS_UNSTABLE's subsequent COMMIT.
> 
> As Christoph said, the application runs on the client, and it tells
> the client exactly what needs to happen. Then the client tells the
> server.
> 
> Essentially what forcing all writes to be slow durable writes does
> is make a special case a little faster (maybe, that has yet to be
> demonstrated) at the expense of all other use cases that prefer to
> write a lot and then commit.
> 
> And if I understand Jonathan's numbers correctly, this only truly
> matters when the server's memory has been exhausted. The usual Linux
> paradigm is to let workloads whose resident sets fit in memory go
> just as fast as can be allowed. We need the additional durability
> only when the server is tipping over.
> 
> The server could flush more aggressively when it recognizes it cannot
> hold its working set in memory. But I think the issue there is that the
> server is tipping over because it simply cannot flush fast enough to
> keep up with clients.

Yes, network faster than storage.  Clients that can basically fill the
NFS server's page cache within seconds.

Trond has pretty solid concerns about requiring the client to take
control of ensuring the NFSD server is able to make forward progress.

The NFSD writeback and associated page reclaim hangs/stalls due to
NFSD_IO_BUFFERED are quite bad if/when they hit (due to low memory,
kswapd and kcompactd working too hard, etc).

NFSD_IO_DIRECT is doing an amazing job of mitigating those MM issues.

> I'd like someone to have a look at slowing down noisy clients when the
> server starts to reach its limits. At least NFSD should get some
> observability mechanisms to tell which clients are the troublesome ones.

Yes, having NFSD be able to communicate congestion or "pressure" to
the client to backoff would be interesting (PSI is very applicable
here, we could tie it to PSI data and trigger at configurable
thresholds).

Given that NFSD is a shared resource, with N threads.. distilling the
system's PSI data down to a per client basis is likely only doable if
each NFSD thread were to have their own cgroup? :(

But even then it is messy to wrangle that data as the basis for
per-client PSI data. But worth considering as one way forward.

> > Setting O_DSYNC gives us a baseline where we don't leave some other
> > subsystem to ensure data gets to disk as quickly as possible (O_DSYNC
> > offers NFS_DATA_SYNC, but O_DSYNC|SYNC's NFS_FILE_SYNC offers a win by
> > avoiding COMMIT, even more enticing).
> > 
> > So I need NFSD_IO_DIRECT to be able to require IOCB_DSYNC and possibly
> > IOCB_DSYNC|IOCB_SYNC.  And I'm most comfortable with the misaligned
> > DIO WRITE support using O_DSYNC for all 3 segments; so that WRITE's
> > data is ondisk before returning from the WRITE (NFS_DATA_SYNC), and
> > bonus if we can avoid extra COMMIT (NFS_FILE_SYNC).
> 
> This flies in the face of what the protocol mandates and what
> applications have come to expect over the past 30 years.

Not looking to fly in the face of anything, like you've reminded me
relative to the current NFSD_IO_DIRECT code: nothing is written in
stone, etc.

My real "need" is to expose knobs that allow discovery of best
performance with NFSD given all options at our fingertips (obviously
NFSD_IO_DIRECT is the most promising so far).  But this _is_ all
"early days" in the grand scheme of things. Understanding what works
best for workload+config X, Y and Z. Discovery phase, iterative
testing/learning/feedback to inform implementation and minimal
required knobs to allow progress.

> The server is free to promote WRITE to higher durability, but that
> /always/ comes at a cost. That cost is worth it, I can see, only when
> the server cannot cache an UNSTABLE WRITE.
> 
> I suspect your true mission is to turn NFS into Lustre. ;-)

Lustre's "N writers to 1 file" (or "many-writers to 1 file") is
definitely a use case that is desirable to support.

> >> B. Even if NFSD had to recover from a failed page cache invalidation,
> >>    does a fallback write need to set IOCB_DSYNC (not considering the NFS
> >>    protocol requirements) ?
> >>
> >> When invalidation fails, some pages remain in the cache (dirty or with
> >> private data). The buffered fallback writes to those same pages,
> >> updating them with the correct data. There's no torn state or corruption
> >> hazard.
> >>
> >> The fallback restarts the entire write from scratch, as buffered. Any
> >> partial progress (like a buffered first segment) gets rewritten with
> >> the same data. No orphaned blocks or inconsistent state.
> >>
> >> If another process does a DIO read of this region, it will call
> >> kiocb_write_and_wait() first (see mm/filemap.c:2912), which flushes any
> >> dirty pages before reading. So they'll see correct data even if it's
> >> still dirty in cache (Jeff's concern).
> > 
> > Yes, but the point is there is extra work that involves the page
> > cache by other layers.  And O_DIRECT is intended to work expecting the
> > IO has been flushed to disk. O_DIRECT is aiming to reduce page cache
> > usage, to keep memory use low.  So ensuring the page cache invalidated
> > to disk by the subsequent write as a side-effect of O_DSYNC helps
> > (more on this below [0]).
> 
> Go look at what happens after an IOCB_DIRECT write returns to its
> caller. The write buffer goes away almost immediately and there's
> nothing left in memory.

But there would be if IOCB_DIRECT IO is delayed (as would be the case
with NFS_UNSTABLE), right?

> The unaligned ends go through the page cache, but IOCB_DONTCACHE is set
> on those, so those pages are immediately evictable.

That brings up a new thread of development that I have a patch for,
streaming misaligned IO, as I mentioned I was looking at again here:
https://lore.kernel.org/linux-nfs/aP-YV2i8Y9jsrPF9@kernel.org/

I revisited this workload (last optimized/touched it ~6 months ago for
ISC2025) because it is time for the every 6 months industry bakeoff
that is IO500.  IO500's nastiest test is IOR_HARD (streaming WRITE IO
of 47008 bytes each).
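
To illustrate why 47008-byte streaming WRITEs are misaligned, here is a
minimal sketch of splitting a WRITE into a misaligned head, a DIO-aligned
middle, and a misaligned tail. The struct and function names are
hypothetical, and the 4096-byte alignment used in the example is an
assumption; this is not the NFSD implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration only; not the code from this patch series. */
struct write_split {
	size_t head;	/* misaligned bytes before the first aligned boundary */
	size_t middle;	/* aligned bytes that could be issued as direct I/O */
	size_t tail;	/* misaligned bytes after the last aligned boundary */
};

/* align must be a nonzero power of two (4096 is assumed in the example). */
static struct write_split split_write(uint64_t offset, size_t len, size_t align)
{
	struct write_split s = { 0, 0, 0 };
	/* Distance from offset up to the next aligned boundary (0 if aligned). */
	size_t to_boundary = (size_t)((align - (offset & (align - 1))) & (align - 1));

	s.head = to_boundary < len ? to_boundary : len;
	/* Round the remainder down to a whole number of aligned blocks. */
	s.middle = (len - s.head) & ~(align - 1);
	s.tail = len - s.head - s.middle;
	return s;
}
```

With 4096-byte alignment, the first 47008-byte WRITE (offset 0) splits
into 0/45056/1952 bytes, and the second (offset 47008) into
2144/40960/3904 bytes.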

DONTCACHE is actively harmful for misaligned WRITEs due to it
immediately dropping pages from the page cache.  The pages associated
with the misaligned head/tail are best served by traditional
NFSD_IO_BUFFERED.  Using BUFFERED instead of DONTCACHE offers a HUGE
performance improvement.

I did also reinstate my old NFSD_IO_DIRECT change to "use buffered IO
if WRITE is less than 32K".

Changed 2 things at once and got fantastic performance.  Need to
decouple and retest without the latter before I post patches that
build on your latest code.

> >> C. When setting up a three-segment write, is IOCB_DSYNC needed on the
> >>    unaligned ends (not considering the NFS protocol requirements) ?
> >>
> >> Before writing the DIO middle segment, the VFS must first invalidate the
> >> page cache, so the first (buffered) segment is flushed to durability
> >> anyway. The use of IOCB_DSYNC for the first segment is superfluous.
> > 
> > No, the first segment reflects the misaligned head of the WRITE. The
> > DIO-aligned middle segment's DIO will just invalidate pages aligned on
> > the middle segment's alignment (which is page aligned).  So the DIO
> > from the middle segment won't have any side-effect on the first
> > segment's page.  First page needs to be written out with IOCB_DSYNC.
> 
> You still haven't demonstrated why. What is the risk of not making
> the first segment durable? I don't find the arguments about a little
> extra memory usage at all convincing, especially if DONTCACHE is in
> play.

DONTCACHE isn't in play for me at the moment.  I didn't push back when
you introduced DIRECT to BUFFERED fallback in terms of DONTCACHE but
it really is a problem.

Anyway, that aside: I just want to make sure the misaligned DIO case
is safe.  Jeff and you spooked me about your concerns, etc (and it was
Christoph who quite early mentioned that mixing buffered and direct is
risky due to invalidation races).  So maybe I'm conditioned to just be
excessive about ensuring the page cache usage doesn't cause
invalidation races and such.

This rope-a-dope is kind of interesting, flipping it such that there
is no concern now is quite the 180 switch.  But I went all in on
trying to ensure misaligned DIO in terms of 3 segments that mix
buffered and direct _safe_.  That safety comes at the cost of SYNC IO.
But sure, maybe it was all paranoid theatre and isn't needed...

> So, have we decided that there is no data integrity risk with leaving
> off IOCB_DSYNC for the burnt ends? The only purpose now is a potential
> performance optimization?

I think yes, IOCB_DIRECT with NFS_UNSTABLE will work with the COMMIT
that follows.  And having the ability to tune to raise the stable_how
used like I suggested in my previous email will allow us to fully
explore relative differences.

> >> After writing the DIO middle segment, the unaligned end remains only in
> >> the page cache if IOCB_DSYNC is not used. But a subsequent DIO read will
> >> flush that to durability (via kiocb_write_and_wait) before satisfying
> >> the read. So, IOCB_DSYNC is not needed there to meet putative file data
> >> visibility mandates.
> > 
> > The same applies to both the misaligned first (head) and third (tail)
> > segments, they both need O_DSYNC to ensure their contents are ondisk.
> 
> A direct READ will write back any cached pages in the read byte range
> before it reads from the media. Full stop. There is no need for NFSD
> to add IOCB_DSYNC for reads to get the latest data.

That seems like a non sequitur.. but OK (I can now see you're focused
on the READ aspect).

I was just pointing out that the misaligned head and tail are
identical in terms of them not being page aligned and needing to be
written to disk.

> > [0]: Again, my goal for NFSD_IO_DIRECT is to operate in terms of
> > O_DIRECT|O_DSYNC ondisk.. and not lean on the page cache more than
> > needed. This makes the VM subsystem more scalable as a side-effect of
> > it having more clean pages that are quickly dropped and/or reused if
> > something needs memory.
> 
> That's a giant claim, and it needs a lot of clear evidence to back it
> up. I think when you have code we can review, we can begin to discuss
> it further. For now, let's focus on the code to be merged now.

OK, but I've provided all the evidence in conjunction with Jonathan
Flynn's testing to scale the testing out.  So we have the data for
DSYNC+SYNC usage (first with NFS_UNSTABLE and later with NFSD server
responding NFS_FILE_SYNC to client), we can relax it to not use DSYNC
(or DSYNC+SYNC) at all and use NFS_UNSTABLE and see how it goes.
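
The correspondence described earlier in this thread (O_DSYNC offers
NFS_DATA_SYNC, O_DSYNC plus SYNC offers NFS_FILE_SYNC) can be sketched as
below. The SKETCH_* flag values are local stand-ins, not the kernel's
IOCB_* definitions, and the function name is hypothetical.

```c
/* Values match the NFS protocol's stable_how encoding. */
enum stable_how {
	NFS_UNSTABLE = 0,
	NFS_DATA_SYNC = 1,
	NFS_FILE_SYNC = 2,
};

/* Stand-in bits for illustration; the real IOCB_* values differ. */
#define SKETCH_IOCB_DSYNC 0x1u
#define SKETCH_IOCB_SYNC  0x2u

/* Pick the sync flags a write would need to honor a given stable_how. */
static unsigned int flags_for_stable_how(enum stable_how stable)
{
	switch (stable) {
	case NFS_FILE_SYNC:
		return SKETCH_IOCB_DSYNC | SKETCH_IOCB_SYNC;
	case NFS_DATA_SYNC:
		return SKETCH_IOCB_DSYNC;
	default:
		/* NFS_UNSTABLE: durability is left to a later COMMIT. */
		return 0;
	}
}
```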

> > Anyway, I've been working to ensure NFSD_IO_DIRECT acts as if at least
> > NFS_DATA_SYNC, but preferably NFS_FILE_SYNC, set.
> 
> Clearly, but the rationale for this effort needs some work. It doesn't
> reflect the mandates of the NFS protocol and I haven't seen a use case
> for it.

Fair, thanks for helping hone the justification and keeping me honest.

> > I should have more  
> > clearly stated that.  Bypassing all caches as much as possible allows for
> > enabling scalable cache coherence, e.g. intelligent applications that
> > don't rely on file locking, think multiple writers writing to their
> > own extent of a large file.
> 
> At the expense of applications that do not need or want that extra
> bit of coherence at the expense of slower writes.

Possibly, yes, devil is in the details (based on hardware config used,
etc).  But point taken.

> > Having NFSD_IO_DIRECT be handled as UNSTABLE isn't adequate for my
> > needs (immediate ondisk requirements, lowest memory and CPU usage as
> > possible).  Would you be OK with adding 2 new debugfs knobs?:
> > 
> > /sys/kernel/debug/nfsd/io_cache_read_stable_how
> 
> READs don't have a stable_how argument. I don't even know what a
> stable_how argument on a READ would do.

Sorry, yes... not sure why I had that; I was rushing to hit send
before racing to the airport to pick up my wife.

> > /sys/kernel/debug/nfsd/io_cache_write_stable_how
> > 
> > - Each defaults to using the stable_how that the client requested.
> > 
> > - Each will serve as a floor, and only override default if they are
> >   greater/stricter than the client requested stable_how. (so if
> >   'io_cache_write_stable_how' set to NFS_FILE_SYNC it'd override
> >   client specified UNSTABLE).
> > 
> > - Each can override the stable_how used so each IO mode behaves
> >   accordingly (e.g. I can set NFSD_IO_DIRECT and NFS_FILE_SYNC to get
> >   the behaviour I'd like).
> 
> I'm loath to add another potential administrative setting for something
> that should be determined automatically (or at least, based on the
> workload conditions). But, why not add a fourth IO_MODE setting instead
> of another debug setting?

Sure, that can work too, fewer knobs is good.
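
For reference, the "floor" semantics proposed for
io_cache_write_stable_how amount to taking the stricter of the client's
request and the configured value. A minimal sketch, with a hypothetical
function name and assuming the knob holds a stable_how value:

```c
/* Values match the NFS protocol's stable_how encoding. */
enum stable_how {
	NFS_UNSTABLE = 0,
	NFS_DATA_SYNC = 1,
	NFS_FILE_SYNC = 2,
};

/*
 * Hypothetical helper: the configured floor only overrides the client's
 * request when it is stricter; it never weakens what the client asked for.
 */
static enum stable_how effective_stable_how(enum stable_how requested,
					    enum stable_how floor)
{
	return requested > floor ? requested : floor;
}
```

So a floor of NFS_FILE_SYNC promotes a client's UNSTABLE WRITE to
FILE_SYNC, while a floor of NFS_UNSTABLE leaves every request as the
client sent it.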

> But first, let's see the use cases and performance data. I think you're
> getting way ahead of yourself

I think I might be ahead of you in understanding what has worked well,
I'm apt to try a thing and see what happens.  But it follows from me
having need and pushing forward.  You have already caught up and are
asking all the right questions and more.  Appreciate your time.

> and none of this needs to be considered
> to decide when the present series is ready to be merged.

That's fine, we can take it as it comes.

> Before anyone can consider these ideas, you need to create the patches
> and show there's a value difference for some applications and
> negligible cost to everyone else.
> 
> As they say on the cop shows, let's go where the evidence leads.

Yeap.

Thanks,
Mike


end of thread, other threads:[~2025-10-29 16:54 UTC | newest]

Thread overview: 87+ messages
2025-10-24 14:42 [PATCH v7 00/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-24 14:42 ` [PATCH v7 01/14] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-10-24 15:21   ` Jeff Layton
2025-10-27  8:02   ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 02/14] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-10-27  8:03   ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 03/14] NFSD: Refactor nfsd_vfs_write() Chuck Lever
2025-10-27  8:04   ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 04/14] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-24 17:12   ` Mike Snitzer
2025-10-24 17:24     ` Chuck Lever
2025-10-26  0:03   ` kernel test robot
2025-10-26  1:16   ` kernel test robot
2025-10-24 14:42 ` [PATCH v7 05/14] NFSD: @stable for direct writes is always NFS_FILE_SYNC Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-24 15:23     ` Chuck Lever
2025-10-27  8:05   ` Christoph Hellwig
2025-10-27 13:23     ` Chuck Lever
2025-10-27 13:27       ` Christoph Hellwig
2025-10-27 14:31         ` Mike Snitzer
2025-10-27 14:36           ` Christoph Hellwig
2025-10-27 14:58             ` Mike Snitzer
2025-10-27 15:04               ` Chuck Lever
2025-10-27 15:19                 ` Mike Snitzer
2025-10-27 15:05               ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 06/14] NFSD: Always set IOCB_SYNC in direct write path Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-27  8:08   ` Christoph Hellwig
2025-10-27 10:38     ` Jeff Layton
2025-10-27 10:40       ` Christoph Hellwig
2025-10-24 14:42 ` [PATCH v7 07/14] NFSD: Remove specific error handling Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-24 14:43 ` [PATCH v7 08/14] NFSD: Remove alignment size checking Chuck Lever
2025-10-24 15:22   ` Jeff Layton
2025-10-27  8:09   ` Christoph Hellwig
2025-10-27 13:25     ` Chuck Lever
2025-10-27 13:30       ` Christoph Hellwig
2025-10-24 14:43 ` [PATCH v7 09/14] NFSD: Remove the len_mask check Chuck Lever
2025-10-24 15:23   ` Jeff Layton
2025-10-24 17:16   ` Mike Snitzer
2025-10-24 17:22     ` Chuck Lever
2025-10-24 14:43 ` [PATCH v7 10/14] NFSD: Clean up synopsis of nfsd_iov_iter_aligned_bvec() Chuck Lever
2025-10-24 15:24   ` Jeff Layton
2025-10-24 14:43 ` [PATCH v7 11/14] NFSD: Clean up struct nfsd_write_dio Chuck Lever
2025-10-24 15:26   ` Jeff Layton
2025-10-24 17:20   ` Mike Snitzer
2025-10-24 14:43 ` [PATCH v7 12/14] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
2025-10-24 15:30   ` Jeff Layton
2025-10-24 15:37     ` Chuck Lever
2025-10-24 17:57   ` Mike Snitzer
2025-10-24 14:43 ` [PATCH v7 13/14] NFSD: Clean up direct write fall back error flow Chuck Lever
2025-10-24 15:32   ` Jeff Layton
2025-10-24 18:01   ` Mike Snitzer
2025-10-24 14:43 ` [PATCH v7 14/14] NFSD: Initialize separate ki_flags Chuck Lever
2025-10-24 15:34   ` Jeff Layton
2025-10-24 18:13   ` Mike Snitzer
2025-10-24 19:34     ` Chuck Lever
2025-10-24 20:37       ` Mike Snitzer
2025-10-24 21:16         ` Chuck Lever
2025-10-24 23:56           ` Mike Snitzer
2025-10-27  8:15             ` Christoph Hellwig
2025-10-27 10:50               ` Jeff Layton
2025-10-27 10:55                 ` Christoph Hellwig
2025-10-27 13:48                 ` Chuck Lever
2025-10-27 13:49                   ` Christoph Hellwig
2025-10-27 16:18                   ` Mike Snitzer
2025-10-27 16:59                     ` Mike Snitzer
2025-10-29  7:20                     ` Christoph Hellwig
2025-10-27 16:05                 ` Mike Snitzer
2025-10-27 17:57                   ` Chuck Lever
2025-10-28  3:26                     ` Mike Snitzer
2025-10-28 15:37                       ` Chuck Lever
2025-10-28 16:04                         ` Mike Snitzer
2025-10-28 18:48                           ` Chuck Lever
2025-10-28 23:56                             ` Mike Snitzer
2025-10-29 15:22                               ` Chuck Lever
2025-10-29 16:54                                 ` Mike Snitzer
2025-10-29  7:37                         ` Christoph Hellwig
2025-10-29  7:32                       ` Christoph Hellwig
2025-10-29  7:25                     ` Christoph Hellwig
2025-10-27  8:14         ` Christoph Hellwig
2025-10-27  8:12       ` Christoph Hellwig
2025-10-27 13:27         ` Chuck Lever
2025-10-27 13:30           ` Chuck Lever
2025-10-27 13:31             ` Christoph Hellwig
2025-10-27 14:11         ` Chuck Lever
2025-10-27 14:45           ` Christoph Hellwig
