linux-nfs.vger.kernel.org archive mirror
* [PATCH v4 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
@ 2025-10-18  0:54 Chuck Lever
  2025-10-18  0:54 ` [PATCH v4 1/3] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Chuck Lever @ 2025-10-18  0:54 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Following on from https://lore.kernel.org/linux-nfs/aPAci7O_XK1ljaum@kernel.org/,
this series applies to the nfsd-testing branch and includes
both patches needed to make NFSD Direct WRITE work.

Changes since v3:
* Address checkpatch.pl nits in 2/3
* Add an untested patch to mark ingress RDMA Read chunks

Chuck Lever (2):
  NFSD: Enable return of an updated stable_how to NFS clients
  svcrdma: Mark Read chunks

Mike Snitzer (1):
  NFSD: Implement NFSD_IO_DIRECT for NFS WRITE

 fs/nfsd/debugfs.c                       |   1 +
 fs/nfsd/nfs3proc.c                      |   2 +-
 fs/nfsd/nfs4proc.c                      |   2 +-
 fs/nfsd/nfsproc.c                       |   3 +-
 fs/nfsd/trace.h                         |   1 +
 fs/nfsd/vfs.c                           | 233 +++++++++++++++++++++++-
 fs/nfsd/vfs.h                           |   6 +-
 fs/nfsd/xdr3.h                          |   2 +-
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   5 +
 9 files changed, 239 insertions(+), 16 deletions(-)

-- 
2.51.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v4 1/3] NFSD: Enable return of an updated stable_how to NFS clients
  2025-10-18  0:54 [PATCH v4 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-18  0:54 ` Chuck Lever
  2025-10-20  7:02   ` Christoph Hellwig
  2025-10-18  0:54 ` [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-10-18  0:54 ` [PATCH v4 3/3] svcrdma: Mark Read chunks Chuck Lever
  2 siblings, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2025-10-18  0:54 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

In a subsequent patch, nfsd_vfs_write() will promote an UNSTABLE
WRITE to be a FILE_SYNC WRITE. This indicates that the client does
not need a subsequent COMMIT operation, saving a round trip and
allowing the client to dispense with cached dirty data as soon as
it receives the server's WRITE response.

This patch refactors nfsd_vfs_write() to return a possibly modified
stable_how value to its callers. No behavior change is expected.

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs3proc.c |  2 +-
 fs/nfsd/nfs4proc.c |  2 +-
 fs/nfsd/nfsproc.c  |  3 ++-
 fs/nfsd/vfs.c      | 11 ++++++-----
 fs/nfsd/vfs.h      |  6 ++++--
 fs/nfsd/xdr3.h     |  2 +-
 6 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index b6d03e1ef5f7..ad14b34583bb 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -236,7 +236,7 @@ nfsd3_proc_write(struct svc_rqst *rqstp)
 	resp->committed = argp->stable;
 	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
 				  &argp->payload, &cnt,
-				  resp->committed, resp->verf);
+				  &resp->committed, resp->verf);
 	resp->count = cnt;
 	resp->status = nfsd3_map_status(resp->status);
 	return rpc_success;
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 7f7e6bb23a90..2222bb283baf 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1285,7 +1285,7 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	write->wr_how_written = write->wr_stable_how;
 	status = nfsd_vfs_write(rqstp, &cstate->current_fh, nf,
 				write->wr_offset, &write->wr_payload,
-				&cnt, write->wr_how_written,
+				&cnt, &write->wr_how_written,
 				(__be32 *)write->wr_verifier.data);
 	nfsd_file_put(nf);
 
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 8f71f5748c75..706401ed6f8d 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -251,6 +251,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
 	struct nfsd_writeargs *argp = rqstp->rq_argp;
 	struct nfsd_attrstat *resp = rqstp->rq_resp;
 	unsigned long cnt = argp->len;
+	u32 committed = NFS_DATA_SYNC;
 
 	dprintk("nfsd: WRITE    %s %u bytes at %d\n",
 		SVCFH_fmt(&argp->fh),
@@ -258,7 +259,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
 
 	fh_copy(&resp->fh, &argp->fh);
 	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
-				  &argp->payload, &cnt, NFS_DATA_SYNC, NULL);
+				  &argp->payload, &cnt, &committed, NULL);
 	if (resp->status == nfs_ok)
 		resp->status = fh_getattr(&resp->fh, &resp->stat);
 	else if (resp->status == nfserr_jukebox)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f537a7b4ee01..8b2dc7a88aab 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1262,7 +1262,7 @@ static int wait_for_concurrent_writes(struct file *file)
  * @offset: Byte offset of start
  * @payload: xdr_buf containing the write payload
  * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
  * @verf: NFS WRITE verifier
  *
  * Upon return, caller must invoke fh_put on @fhp.
@@ -1274,11 +1274,12 @@ __be32
 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	       struct nfsd_file *nf, loff_t offset,
 	       const struct xdr_buf *payload, unsigned long *cnt,
-	       int stable, __be32 *verf)
+	       u32 *stable_how, __be32 *verf)
 {
 	struct nfsd_net		*nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
 	struct file		*file = nf->nf_file;
 	struct super_block	*sb = file_inode(file)->i_sb;
+	u32			stable = *stable_how;
 	struct kiocb		kiocb;
 	struct svc_export	*exp;
 	struct iov_iter		iter;
@@ -1434,7 +1435,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  * @offset: Byte offset of start
  * @payload: xdr_buf containing the write payload
  * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
  * @verf: NFS WRITE verifier
  *
  * Upon return, caller must invoke fh_put on @fhp.
@@ -1444,7 +1445,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  */
 __be32
 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
-	   const struct xdr_buf *payload, unsigned long *cnt, int stable,
+	   const struct xdr_buf *payload, unsigned long *cnt, u32 *stable_how,
 	   __be32 *verf)
 {
 	struct nfsd_file *nf;
@@ -1457,7 +1458,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
 		goto out;
 
 	err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt,
-			     stable, verf);
+			     stable_how, verf);
 	nfsd_file_put(nf);
 out:
 	trace_nfsd_write_done(rqstp, fhp, offset, *cnt);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index fa46f8b5f132..c713ed0b04e0 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -130,11 +130,13 @@ __be32		nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				u32 *eof);
 __be32		nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				loff_t offset, const struct xdr_buf *payload,
-				unsigned long *cnt, int stable, __be32 *verf);
+				unsigned long *cnt, u32 *stable_how,
+				__be32 *verf);
 __be32		nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				struct nfsd_file *nf, loff_t offset,
 				const struct xdr_buf *payload,
-				unsigned long *cnt, int stable, __be32 *verf);
+				unsigned long *cnt, u32 *stable_how,
+				__be32 *verf);
 __be32		nfsd_readlink(struct svc_rqst *, struct svc_fh *,
 				char *, int *);
 __be32		nfsd_symlink(struct svc_rqst *, struct svc_fh *,
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 522067b7fd75..c0e443ef3a6b 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -152,7 +152,7 @@ struct nfsd3_writeres {
 	__be32			status;
 	struct svc_fh		fh;
 	unsigned long		count;
-	int			committed;
+	u32			committed;
 	__be32			verf[2];
 };
 
-- 
2.51.0



* [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-18  0:54 [PATCH v4 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-10-18  0:54 ` [PATCH v4 1/3] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
@ 2025-10-18  0:54 ` Chuck Lever
  2025-10-20  7:19   ` Christoph Hellwig
  2025-10-18  0:54 ` [PATCH v4 3/3] svcrdma: Mark Read chunks Chuck Lever
  2 siblings, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2025-10-18  0:54 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Mike Snitzer

From: Mike Snitzer <snitzer@kernel.org>

If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
middle and end as needed. The large middle extent is DIO-aligned and
the start and/or end are misaligned. Synchronous buffered IO (with
preference towards using DONTCACHE) is used for the misaligned extents
and O_DIRECT is used for the middle DIO-aligned extent.

nfsd_issue_write_dio() promotes @stable_how to NFS_FILE_SYNC, which
allows the client to drop its dirty data and avoid needing an extra
COMMIT operation.

If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
invalidate the page cache on behalf of the DIO WRITE, then
nfsd_issue_write_dio() will fall back to using buffered IO.

These changes served as the original starting point for the NFS
client's misaligned O_DIRECT support that landed with
commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
READ and WRITE"). But NFSD's support is simpler because it currently
doesn't use AIO completion.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/debugfs.c |   1 +
 fs/nfsd/trace.h   |   1 +
 fs/nfsd/vfs.c     | 222 ++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 219 insertions(+), 5 deletions(-)

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 00eb1ecef6ac..7f44689e0a53 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
 	switch (val) {
 	case NFSD_IO_BUFFERED:
 	case NFSD_IO_DONTCACHE:
+	case NFSD_IO_DIRECT:
 		nfsd_io_cache_write = val;
 		break;
 	default:
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index bfd41236aff2..ad74439d0105 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
 DEFINE_NFSD_IO_EVENT(read_done);
 DEFINE_NFSD_IO_EVENT(write_start);
 DEFINE_NFSD_IO_EVENT(write_opened);
+DEFINE_NFSD_IO_EVENT(write_direct);
 DEFINE_NFSD_IO_EVENT(write_io_done);
 DEFINE_NFSD_IO_EVENT(write_done);
 DEFINE_NFSD_IO_EVENT(commit_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 8b2dc7a88aab..cb43d67ece10 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,214 @@ static int wait_for_concurrent_writes(struct file *file)
 	return err;
 }
 
+struct nfsd_write_dio {
+	ssize_t	start_len;	/* Length for misaligned first extent */
+	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
+	ssize_t	end_len;	/* Length for misaligned last extent */
+};
+
+static bool
+nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
+			   struct nfsd_file *nf,
+			   struct nfsd_write_dio *write_dio)
+{
+	const u32 dio_blocksize = nf->nf_dio_offset_align;
+	loff_t start_end, orig_end, middle_end;
+
+	if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
+		return false;
+	if (unlikely(dio_blocksize > PAGE_SIZE))
+		return false;
+	if (unlikely(len < dio_blocksize))
+		return false;
+
+	start_end = round_up(offset, dio_blocksize);
+	orig_end = offset + len;
+	middle_end = round_down(orig_end, dio_blocksize);
+
+	write_dio->start_len = start_end - offset;
+	write_dio->middle_len = middle_end - start_end;
+	write_dio->end_len = orig_end - middle_end;
+
+	return true;
+}
+
+static bool
+nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
+			   unsigned int len_mask)
+{
+	const struct bio_vec *bvec = i->bvec;
+	size_t skip = i->iov_offset;
+	size_t size = i->count;
+
+	if (size & len_mask)
+		return false;
+	do {
+		size_t len = bvec->bv_len;
+
+		if (len > size)
+			len = size;
+		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
+			return false;
+		bvec++;
+		size -= len;
+		skip = 0;
+	} while (size);
+
+	return true;
+}
+
+/*
+ * Setup as many as 3 iov_iter based on extents described by @write_dio.
+ * Returns the number of iov_iter that were setup.
+ */
+static int
+nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
+			   struct bio_vec *rq_bvec, unsigned int nvecs,
+			   unsigned long cnt, struct nfsd_write_dio *write_dio,
+			   struct nfsd_file *nf)
+{
+	int n_iters = 0;
+	struct iov_iter *iters = *iterp;
+
+	/* Setup misaligned start? */
+	if (write_dio->start_len) {
+		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+		iters[n_iters].count = write_dio->start_len;
+		iter_is_dio_aligned[n_iters] = false;
+		++n_iters;
+	}
+
+	/* Setup DIO-aligned middle */
+	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+	if (write_dio->start_len)
+		iov_iter_advance(&iters[n_iters], write_dio->start_len);
+	iters[n_iters].count -= write_dio->end_len;
+	iter_is_dio_aligned[n_iters] =
+		nfsd_iov_iter_aligned_bvec(&iters[n_iters],
+					   nf->nf_dio_mem_align - 1,
+					   nf->nf_dio_offset_align - 1);
+	if (unlikely(!iter_is_dio_aligned[n_iters]))
+		return 0; /* no DIO-aligned IO possible */
+	++n_iters;
+
+	/* Setup misaligned end? */
+	if (write_dio->end_len) {
+		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+		iov_iter_advance(&iters[n_iters],
+				 write_dio->start_len + write_dio->middle_len);
+		iter_is_dio_aligned[n_iters] = false;
+		++n_iters;
+	}
+
+	return n_iters;
+}
+
+static int
+nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
+		    unsigned int nvecs, unsigned long *cnt,
+		    struct kiocb *kiocb)
+{
+	struct iov_iter iter;
+	int host_err;
+
+	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
+	if (host_err < 0)
+		return host_err;
+	*cnt = host_err;
+
+	return 0;
+}
+
+static int
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
+		     u32 *stable_how, unsigned int nvecs, unsigned long *cnt,
+		     struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
+{
+	struct file *file = nf->nf_file;
+	bool iter_is_dio_aligned[3];
+	struct iov_iter iter_stack[3];
+	struct iov_iter *iter = iter_stack;
+	unsigned int n_iters = 0;
+	unsigned long in_count = *cnt;
+	loff_t in_offset = kiocb->ki_pos;
+	ssize_t host_err;
+
+	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
+					     rqstp->rq_bvec, nvecs, *cnt,
+					     write_dio, nf);
+	if (unlikely(!n_iters))
+		return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
+
+	trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
+
+	/*
+	 * Any buffered IO issued here will be misaligned, use
+	 * sync IO to ensure it has completed before returning.
+	 * Also update @stable_how to avoid need for COMMIT.
+	 */
+	kiocb->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+	*stable_how = NFS_FILE_SYNC;
+
+	*cnt = 0;
+	for (int i = 0; i < n_iters; i++) {
+		if (iter_is_dio_aligned[i])
+			kiocb->ki_flags |= IOCB_DIRECT;
+		else
+			kiocb->ki_flags &= ~IOCB_DIRECT;
+
+		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
+		if (host_err < 0) {
+			/*
+			 * VFS will return -ENOTBLK if DIO WRITE fails to
+			 * invalidate the page cache. Retry using buffered IO.
+			 */
+			if (unlikely(host_err == -ENOTBLK)) {
+				kiocb->ki_flags &= ~IOCB_DIRECT;
+				*cnt = in_count;
+				kiocb->ki_pos = in_offset;
+				return nfsd_buffered_write(rqstp, file,
+							   nvecs, cnt, kiocb);
+			} else if (unlikely(host_err == -EINVAL)) {
+				struct inode *inode = d_inode(fhp->fh_dentry);
+
+				pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
+						    inode->i_sb->s_id, inode->i_ino);
+				host_err = -ESERVERFAULT;
+			}
+			return host_err;
+		}
+		*cnt += host_err;
+		if (host_err < iter[i].count) /* partial write? */
+			break;
+	}
+
+	return 0;
+}
+
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
+		  unsigned long *cnt, struct kiocb *kiocb)
+{
+	struct nfsd_write_dio write_dio;
+
+	/*
+	 * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
+	 * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
+	 * be ignored for any DIO issued here).
+	 */
+	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+		kiocb->ki_flags |= IOCB_DONTCACHE;
+
+	if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
+		return nfsd_issue_write_dio(rqstp, fhp, nf, stable_how, nvecs,
+					    cnt, kiocb, &write_dio);
+
+	return nfsd_buffered_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
+}
+
 /**
  * nfsd_vfs_write - write data to an already-open file
  * @rqstp: RPC execution context
@@ -1282,7 +1490,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	u32			stable = *stable_how;
 	struct kiocb		kiocb;
 	struct svc_export	*exp;
-	struct iov_iter		iter;
 	errseq_t		since;
 	__be32			nfserr;
 	int			host_err;
@@ -1319,25 +1526,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		kiocb.ki_flags |= IOCB_DSYNC;
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
-	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
 	since = READ_ONCE(file->f_wb_err);
 	if (verf)
 		nfsd_copy_write_verifier(verf, nn);
 
 	switch (nfsd_io_cache_write) {
-	case NFSD_IO_BUFFERED:
+	case NFSD_IO_DIRECT:
+		host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
+					     nvecs, cnt, &kiocb);
+		stable = *stable_how;
 		break;
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
 			kiocb.ki_flags |= IOCB_DONTCACHE;
+		fallthrough;
+	case NFSD_IO_BUFFERED:
+		host_err = nfsd_buffered_write(rqstp, file,
+					       nvecs, cnt, &kiocb);
 		break;
 	}
-	host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
 	if (host_err < 0) {
 		commit_reset_write_verifier(nn, rqstp, host_err);
 		goto out_nfserr;
 	}
-	*cnt = host_err;
 	nfsd_stats_io_write_add(nn, exp, *cnt);
 	fsnotify_modify(file);
 	host_err = filemap_check_wb_err(file->f_mapping, since);
-- 
2.51.0



* [PATCH v4 3/3] svcrdma: Mark Read chunks
  2025-10-18  0:54 [PATCH v4 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-10-18  0:54 ` [PATCH v4 1/3] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
  2025-10-18  0:54 ` [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-18  0:54 ` Chuck Lever
  2 siblings, 0 replies; 21+ messages in thread
From: Chuck Lever @ 2025-10-18  0:54 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The upper layer may want to know when the receive buffer's .pages
array is guaranteed to contain only an opaque payload. This permits
the upper layer to optimize its buffer handling.

NB: Since svc_rdma_recvfrom.c is under net/, we use the comment
style that is preferred in the networking layer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index e7e4a39ca6c6..b1a0c72f73de 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -815,6 +815,11 @@ static void svc_rdma_read_complete_one(struct svc_rqst *rqstp,
 	buf->page_len = length;
 	buf->len += length;
 	buf->buflen += length;
+
+	/* Transport guarantees that only the chunk payload
+	 * appears in buf->pages.
+	 */
+	buf->flags |= XDRBUF_READ;
 }
 
 /* Finish constructing the RPC Call message in rqstp::rq_arg.
-- 
2.51.0



* Re: [PATCH v4 1/3] NFSD: Enable return of an updated stable_how to NFS clients
  2025-10-18  0:54 ` [PATCH v4 1/3] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
@ 2025-10-20  7:02   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-10-20  7:02 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On Fri, Oct 17, 2025 at 08:54:29PM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> In a subsequent patch, nfsd_vfs_write() will promote an UNSTABLE
> WRITE to be a FILE_SYNC WRITE. This indicates that the client does

I'd much prefer you'd mention why that subsequent patch needs to update
it instead of being so vague.  That really helps when trying to
understand the change when finding it in a bisect or blame.



* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-18  0:54 ` [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-20  7:19   ` Christoph Hellwig
  2025-10-20 13:56     ` Chuck Lever
                       ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-10-20  7:19 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Mike Snitzer

On Fri, Oct 17, 2025 at 08:54:30PM -0400, Chuck Lever wrote:
> From: Mike Snitzer <snitzer@kernel.org>
> 
> If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> middle and end as needed. The large middle extent is DIO-aligned and
> the start and/or end are misaligned. Synchronous buffered IO (with
> preference towards using DONTCACHE) is used for the misaligned extents
> and O_DIRECT is used for the middle DIO-aligned extent.

Can you define synchronous better here?  The term is unfortunately
overloaded between synchronous syscalls vs aio/io_uring and O_(D)SYNC
style I/O.  As of now I don't understand which one you mean, especially
with the DONTCACHE reference thrown in, but I guess I'll figure it out
reading the patch.

> If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> invalidate the page cache on behalf of the DIO WRITE, then
> nfsd_issue_write_dio() will fall back to using buffered IO.

Did you see -ENOTBLK leaking out of the file systems?  Because at
least for iomap it is supposed to be an indication that the
file system ->write_iter handler needs to retry using buffered
I/O and never leak to the caller.

> These changes served as the original starting point for the NFS
> client's misaligned O_DIRECT support that landed with
> commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
> READ and WRITE"). But NFSD's support is simpler because it currently
> doesn't use AIO completion.

I don't understand this paragraph.  What does starting point mean
here?  How does it matter for the patch description?

> +struct nfsd_write_dio {
> +     ssize_t start_len;      /* Length for misaligned first extent */
> +     ssize_t middle_len;     /* Length for DIO-aligned middle extent */
> +     ssize_t end_len;        /* Length for misaligned last extent */
> +};

Looking at how the code is structured later on, it seems like it would
work much better if each of these sections had its own object with
the len, iov_iter, flag if it's aligned, etc.  Otherwise we have this
structure and lots of arrays of three items passed around.

> +static bool
> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> +                        unsigned int len_mask)

Wouldn't it make sense to track the alignment when building the bio_vec
array instead of doing another walk here touching all cache lines?

> +	if (unlikely(dio_blocksize > PAGE_SIZE))
> +		return false;

Why does this matter?  Can you add a comment explaining it?

> +static int
> +nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> +		    unsigned int nvecs, unsigned long *cnt,
> +		    struct kiocb *kiocb)
> +{
> +	struct iov_iter iter;
> +	int host_err;
> +
> +	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> +	if (host_err < 0)
> +		return host_err;
> +	*cnt = host_err;
> +
> +	return 0;


Nothing really buffered here per se, it's just a small wrapper
around vfs_iocb_iter_write.

> +	/*
> +	 * Any buffered IO issued here will be misaligned, use
> +	 * sync IO to ensure it has completed before returning.
> +	 * Also update @stable_how to avoid need for COMMIT.
> +	 */
> +	kiocb->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);

What do you mean by "completed before returning"?  I guess you
mean writeback actually happening, right?  Why do you need that,
why do you also force it for the direct I/O?

Also IOCB_SYNC is wrong here, as the only thing it does over
IOCB_DSYNC is also forcing writeback of metadata not needed to find
data (aka timestamps), which I can't see any need for here.

> +	*stable_how = NFS_FILE_SYNC;
> +
> +	*cnt = 0;
> +	for (int i = 0; i < n_iters; i++) {
> +		if (iter_is_dio_aligned[i])
> +			kiocb->ki_flags |= IOCB_DIRECT;
> +		else
> +			kiocb->ki_flags &= ~IOCB_DIRECT;
> +
> +		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> +		if (host_err < 0) {
> +			/*
> +			 * VFS will return -ENOTBLK if DIO WRITE fails to
> +			 * invalidate the page cache. Retry using buffered IO.
> +			 */
> +			if (unlikely(host_err == -ENOTBLK)) {

The VFS certainly does not, and if it leaks out of a specific file
system we need to fix that.

> +			} else if (unlikely(host_err == -EINVAL)) {
> +				struct inode *inode = d_inode(fhp->fh_dentry);
> +
> +				pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
> +						    inode->i_sb->s_id, inode->i_ino);
> +				host_err = -ESERVERFAULT;

-EINVAL can be a lot more things than an alignment failure.  And more
importantly, alignment failures should not happen with the proper
checks in place.



* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-20  7:19   ` Christoph Hellwig
@ 2025-10-20 13:56     ` Chuck Lever
  2025-10-20 14:05       ` Christoph Hellwig
  2025-10-20 16:27     ` Mike Snitzer
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2025-10-20 13:56 UTC (permalink / raw)
  To: Christoph Hellwig, Mike Snitzer
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs

On 10/20/25 3:19 AM, Christoph Hellwig wrote:
> On Fri, Oct 17, 2025 at 08:54:30PM -0400, Chuck Lever wrote:
>> From: Mike Snitzer <snitzer@kernel.org>

>> +static int
>> +nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
>> +		    unsigned int nvecs, unsigned long *cnt,
>> +		    struct kiocb *kiocb)
>> +{
>> +	struct iov_iter iter;
>> +	int host_err;
>> +
>> +	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
>> +	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
>> +	if (host_err < 0)
>> +		return host_err;
>> +	*cnt = host_err;
>> +
>> +	return 0;
> 
> 
> Nothing really buffered here per se, it's just a small wrapper
> around vfs_iocb_iter_write.

This is the original NFSD_IO_BUFFERED code, refactored out of
nfsd_vfs_write(). IMO it might be more clear what's going on if the
refactoring was done in a separate prerequisite patch.


>> +			} else if (unlikely(host_err == -EINVAL)) {
>> +				struct inode *inode = d_inode(fhp->fh_dentry);
>> +
>> +				pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
>> +						    inode->i_sb->s_id, inode->i_ino);
>> +				host_err = -ESERVERFAULT;
> 
> -EINVAL can be lot more things than alignment failure.   And more
> importantly alignment failures should not happen with the proper
> checks in place.

I tend to agree; I expect an alignment failure will rarely, if ever,
happen if we've done our job well.


-- 
Chuck Lever


* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-20 13:56     ` Chuck Lever
@ 2025-10-20 14:05       ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-10-20 14:05 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Mike Snitzer, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs

On Mon, Oct 20, 2025 at 09:56:02AM -0400, Chuck Lever wrote:
> I tend to agree that I expect an alignment failure will rarely if ever
> happen if we've done our job well.

It should never happen if the job was done properly.  But when -EINVAL
happens it can just as likely be any other mismatched condition.



* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-20  7:19   ` Christoph Hellwig
  2025-10-20 13:56     ` Chuck Lever
@ 2025-10-20 16:27     ` Mike Snitzer
  2025-10-22  5:14       ` Christoph Hellwig
  2025-10-21 11:24     ` Jeff Layton
  2025-10-22 17:59     ` Chuck Lever
  3 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2025-10-20 16:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs

On Mon, Oct 20, 2025 at 12:19:35AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 17, 2025 at 08:54:30PM -0400, Chuck Lever wrote:
> > From: Mike Snitzer <snitzer@kernel.org>
> > 
> > If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> > middle and end as needed. The large middle extent is DIO-aligned and
> > the start and/or end are misaligned. Synchronous buffered IO (with
> > preference towards using DONTCACHE) is used for the misaligned extents
> > and O_DIRECT is used for the middle DIO-aligned extent.
> 
> Can you define synchronous better here?  The term is unfortunately
> overloaded between synchronous syscalls vs aio/io_uring and O_(D)SYNC
> style I/O.  As of now I don't understand which one you mean, especially
> with the DONTCACHE reference thrown in, but I guess I'll figure it out
> reading the patch.

It clearly talks about synchronous IO.  DONTCACHE just happens to be
the buffered IO that'll be used if supported.

I don't see anything that needs changing here.

> > If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> > invalidate the page cache on behalf of the DIO WRITE, then
> > nfsd_issue_write_dio() will fall back to using buffered IO.
> 
> Did you see -ENOTBLK leaking out of the file systems?  Because at
> least for iomap it is supposed to be an indication that the
> file system ->write_iter handler needs to retry using buffered
> I/O and never leak to the caller.

fs/iomap/direct-io.c:__iomap_dio_rw() will return -ENOTBLK on its own.

> > These changes served as the original starting point for the NFS
> > client's misaligned O_DIRECT support that landed with
> > commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
> > READ and WRITE"). But NFSD's support is simpler because it currently
> > doesn't use AIO completion.
> 
> I don't understand this paragraph.  What does starting point mean
> here?  How does it matter for the patch description?

This patch's NFSD changes were the starting point for NFS commit
c817248fc831.

How does it matter that this NFSD code served as the basis for NFS
code that has been merged?  I thought it useful context.

> > +struct nfsd_write_dio {
> > +     ssize_t start_len;      /* Length for misaligned first extent */
> > +     ssize_t middle_len;     /* Length for DIO-aligned middle extent */
> > +     ssize_t end_len;        /* Length for misaligned last extent */
> > +};
> 
> Looking at how the code is structured later on, it seems like it would
> work much better if each of these sections had its own object with
> the len, iov_iter, flag if it's aligned, etc.  Otherwise we have this
> structure and lots of arrays of three items passed around.

I did look at doing that; it really isn't much better. And 2 isn't lots.

But not opposed to revisiting it.

Would prefer that cleanup, if done, to happen with an incremental
follow-up patch.
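[For readers following along: the start/middle/end split being discussed
can be sketched in userspace terms. This is not the patch's code -- the
helper name is hypothetical and the struct fields merely mirror the
struct nfsd_write_dio quoted above.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration; field names mirror struct nfsd_write_dio. */
struct write_dio_split {
	size_t start_len;	/* misaligned first extent (buffered) */
	size_t middle_len;	/* DIO-aligned middle extent (O_DIRECT) */
	size_t end_len;		/* misaligned last extent (buffered) */
};

/*
 * Split a write of @len bytes at file offset @offset for a DIO
 * alignment of @dio_align bytes (assumed to be a power of two).
 */
static void split_write(size_t offset, size_t len, size_t dio_align,
			struct write_dio_split *s)
{
	size_t mask = dio_align - 1;
	size_t start = (dio_align - (offset & mask)) & mask;

	if (start >= len) {
		/* Too short to reach an aligned boundary: all buffered. */
		s->start_len = len;
		s->middle_len = 0;
		s->end_len = 0;
		return;
	}
	s->start_len = start;
	s->middle_len = (len - start) & ~mask;
	s->end_len = len - start - s->middle_len;
}
```

[The three lengths always sum to @len; either end may be zero when the
write happens to be aligned on that side.]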

> > +static bool
> > +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> > +                        unsigned int len_mask)
> 
> Wouldn't it make sense to track the alignment when building the bio_vec
> array instead of doing another walk here touching all cache lines?

Yes, that is the conventional wisdom that justified Keith removing
iov_iter_aligned.  But for NFSD's WRITE payload the bio_vec is built
outside of the fs/nfsd/vfs.c code.  Could be there is a natural way to
clean this up (to make the sunrpc layer conditionally care about
alignment) but I didn't confront that yet.

Room for follow-on improvement to be sure.
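[For reference, the walk nfsd_iov_iter_aligned_bvec() does amounts to
something like the following simplified userspace model -- hypothetical
names, and a plain array of segments standing in for the iov_iter's
bio_vec array.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct seg {
	uintptr_t addr;	/* start address of the segment */
	size_t len;	/* length in bytes */
};

/*
 * Simplified model of a bvec alignment check: every segment's start
 * address and length must satisfy the DIO masks for the whole set to
 * be eligible for O_DIRECT.
 */
static bool segs_dio_aligned(const struct seg *segs, int n,
			     uintptr_t addr_mask, size_t len_mask)
{
	for (int i = 0; i < n; i++) {
		if ((segs[i].addr & addr_mask) || (segs[i].len & len_mask))
			return false;
	}
	return true;
}
```

[Christoph's point is that this O(n) walk touches every cache line
again, which could be avoided by recording alignment while the bio_vec
is first built in the sunrpc layer.]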

> > +	if (unlikely(dio_blocksize > PAGE_SIZE))
> > +		return false;
> 
> Why does this matter?  Can you add a comment explaining it?

I did/do have concerns that a bug in the storage driver could expose
a dio_offset_align that is far too large, so I bounded dio_blocksize
to a conservative size.  What is a better constant to check?

But... for trusted storage and drivers, probably doesn't matter and
could be removed.

I split the check out:

        if (unlikely(dio_blocksize > PAGE_SIZE))
                return false;
        if (unlikely(len < dio_blocksize))
                return false;

could've been:

        if (unlikely(dio_blocksize > PAGE_SIZE || len < dio_blocksize))
                return false;

Physical block sizes larger than PAGE_SIZE are coming (or already
here?), but this initial NFSD O_DIRECT support isn't meant to concern
itself with such exotics at this point.

> > +static int
> > +nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> > +		    unsigned int nvecs, unsigned long *cnt,
> > +		    struct kiocb *kiocb)
> > +{
> > +	struct iov_iter iter;
> > +	int host_err;
> > +
> > +	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> > +	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> > +	if (host_err < 0)
> > +		return host_err;
> > +	*cnt = host_err;
> > +
> > +	return 0;
> 
> 
> Nothing really buffered here per se, it's just a small wrapper
> around vfs_iocb_iter_write.

It's factored-out code in a named function. The name fits its purpose.

> > +	/*
> > +	 * Any buffered IO issued here will be misaligned, use
> > +	 * sync IO to ensure it has completed before returning.
> > +	 * Also update @stable_how to avoid need for COMMIT.
> > +	 */
> > +	kiocb->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> 
> What do you mean with completed before returning?  I guess you
> mean writeback actually happening, right?  Why do you need that,
> why do you also force it for the direct I/O?

The IO needs to have reached stable storage.

Changing these flags when IOCB_DIRECT used doesn't seem needed.

Open to suggestions, could be I'm being too heavy but the goal is to
ensure the misaligned ends (head and tail) have hit the storage.

> Also IOCB_SYNC is wrong here, as the only thing it does over
> IOCB_DSYNC is also forcing back of metadata not needed to find
> data (aka timestamps), which I can't see any need for here.

Well, we're eliding any followup SYNC from the NFS client by setting
stable_how to NFS_FILE_SYNC, so I made sure to use SYNC:

> > +	*stable_how = NFS_FILE_SYNC;
> > +
> > +	*cnt = 0;
> > +	for (int i = 0; i < n_iters; i++) {
> > +		if (iter_is_dio_aligned[i])
> > +			kiocb->ki_flags |= IOCB_DIRECT;
> > +		else
> > +			kiocb->ki_flags &= ~IOCB_DIRECT;
> > +
> > +		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> > +		if (host_err < 0) {
> > +			/*
> > +			 * VFS will return -ENOTBLK if DIO WRITE fails to
> > +			 * invalidate the page cache. Retry using buffered IO.
> > +			 */
> > +			if (unlikely(host_err == -ENOTBLK)) {
> 
> The VFS certainly does not, and if it leaks out of a specific file
> system we need to fix that.

As I said above, fs/iomap/direct-io.c:__iomap_dio_rw() does.

> > +			} else if (unlikely(host_err == -EINVAL)) {
> > +				struct inode *inode = d_inode(fhp->fh_dentry);
> > +
> > +				pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
> > +						    inode->i_sb->s_id, inode->i_ino);
> > +				host_err = -ESERVERFAULT;
> 
> -EINVAL can be a lot more things than alignment failure.   And more
> importantly alignment failures should not happen with the proper
> checks in place.

Regardless, the most likely outcome from issuing what should be aligned
DIO only to get -EINVAL is exactly what is addressed with this
_unlikely_ error handling: misaligned IO. It shouldn't happen, but bugs
happen (or optimization happens, e.g. in the future folding alignment
checking into sunrpc's WRITE bio_vec creation, and in turn bugs could
happen).

Not submitting a new version of this patch until/unless there is clear
feedback there is something in this v4 that really cannot stand.

Thanks for your review!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-20  7:19   ` Christoph Hellwig
  2025-10-20 13:56     ` Chuck Lever
  2025-10-20 16:27     ` Mike Snitzer
@ 2025-10-21 11:24     ` Jeff Layton
  2025-10-22  5:16       ` Christoph Hellwig
  2025-10-22 17:59     ` Chuck Lever
  3 siblings, 1 reply; 21+ messages in thread
From: Jeff Layton @ 2025-10-21 11:24 UTC (permalink / raw)
  To: Christoph Hellwig, Chuck Lever
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Mike Snitzer

On Mon, 2025-10-20 at 00:19 -0700, Christoph Hellwig wrote:
> On Fri, Oct 17, 2025 at 08:54:30PM -0400, Chuck Lever wrote:
> > From: Mike Snitzer <snitzer@kernel.org>
> > 
> > If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> > middle and end as needed. The large middle extent is DIO-aligned and
> > the start and/or end are misaligned. Synchronous buffered IO (with
> > preference towards using DONTCACHE) is used for the misaligned extents
> > and O_DIRECT is used for the middle DIO-aligned extent.
> 
> Can you define synchronous better here?  The term is unfortunately
> overloaded between synchronous syscalls vs aio/io_uring and O_(D)SYNC
> style I/O.  As of now I don't understand which one you mean, especially
> with the DONTCACHE reference thrown in, but I guess I'll figure it out
> reading the patch.
> 
> > If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> > invalidate the page cache on behalf of the DIO WRITE, then
> > nfsd_issue_write_dio() will fall back to using buffered IO.
> 
> Did you see -ENOTBLK leaking out of the file systems?  Because at
> least for iomap it is supposed to be an indication that the
> file system ->write_iter handler needs to retry using buffered
> I/O and never leak to the caller.
> 
> > These changes served as the original starting point for the NFS
> > client's misaligned O_DIRECT support that landed with
> > commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
> > READ and WRITE"). But NFSD's support is simpler because it currently
> > doesn't use AIO completion.
> 
> I don't understand this paragraph.  What does starting point mean
> here?  How does it matter for the patch description?
> 
> > +struct nfsd_write_dio {
> > +     ssize_t start_len;      /* Length for misaligned first extent */
> > +     ssize_t middle_len;     /* Length for DIO-aligned middle extent */
> > +     ssize_t end_len;        /* Length for misaligned last extent */
> > +};
> 
> Looking at how the code is structured later on, it seems like it would
> work much better if each of these sections had its own object with
> the len, iov_iter, flag if it's aligned, etc.  Otherwise we have this
> structure and lots of arrays of three items passed around.
> 
> > +static bool
> > +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> > +                        unsigned int len_mask)
> 
> Wouldn't it make sense to track the alignment when building the bio_vec
> array instead of doing another walk here touching all cache lines?
> 
> > +	if (unlikely(dio_blocksize > PAGE_SIZE))
> > +		return false;
> 
> Why does this matter?  Can you add a comment explaining it?
> 
> > +static int
> > +nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> > +		    unsigned int nvecs, unsigned long *cnt,
> > +		    struct kiocb *kiocb)
> > +{
> > +	struct iov_iter iter;
> > +	int host_err;
> > +
> > +	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> > +	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> > +	if (host_err < 0)
> > +		return host_err;
> > +	*cnt = host_err;
> > +
> > +	return 0;
> 
> 
> Nothing really buffered here per se, it's just a small wrapper
> around vfs_iocb_iter_write.
> 
> > +	/*
> > +	 * Any buffered IO issued here will be misaligned, use
> > +	 * sync IO to ensure it has completed before returning.
> > +	 * Also update @stable_how to avoid need for COMMIT.
> > +	 */
> > +	kiocb->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> 
> What do you mean with completed before returning?  I guess you
> mean writeback actually happening, right?  Why do you need that,
> why do you also force it for the direct I/O?
> 
> Also IOCB_SYNC is wrong here, as the only thing it does over
> IOCB_DSYNC is also forcing back of metadata not needed to find
> data (aka timestamps), which I can't see any need for here.
> 

Responding to a WRITE with NFS_FILE_SYNC flag set means that the data
the client wrote is now on stable storage (and hence the client doesn't
need to follow up with a COMMIT). This patch is using DIO for the
aligned middle section but any unaligned ends use buffered I/O. If we
want to return NFS_FILE_SYNC here then all of the data and metadata
need to be on disk when the reply goes out.

Don't we need IOCB_SYNC here in this case? Otherwise, the server could
crash while the metadata is still in memory. When it comes back up, the
client could see stale timestamps. Maybe that's not fatal, but it seems
pretty sketchy.


> > +	*stable_how = NFS_FILE_SYNC;
> > +
> > +	*cnt = 0;
> > +	for (int i = 0; i < n_iters; i++) {
> > +		if (iter_is_dio_aligned[i])
> > +			kiocb->ki_flags |= IOCB_DIRECT;
> > +		else
> > +			kiocb->ki_flags &= ~IOCB_DIRECT;
> > +
> > +		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> > +		if (host_err < 0) {
> > +			/*
> > +			 * VFS will return -ENOTBLK if DIO WRITE fails to
> > +			 * invalidate the page cache. Retry using buffered IO.
> > +			 */
> > +			if (unlikely(host_err == -ENOTBLK)) {
> 
> The VFS certainly does not, and if it leaks out of a specific file
> system we need to fix that.
> 
> > +			} else if (unlikely(host_err == -EINVAL)) {
> > +				struct inode *inode = d_inode(fhp->fh_dentry);
> > +
> > +				pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
> > +						    inode->i_sb->s_id, inode->i_ino);
> > +				host_err = -ESERVERFAULT;
> 
> -EINVAL can be a lot more things than alignment failure.   And more
> importantly alignment failures should not happen with the proper
> checks in place.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-20 16:27     ` Mike Snitzer
@ 2025-10-22  5:14       ` Christoph Hellwig
  2025-10-22 14:37         ` Chuck Lever
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2025-10-22  5:14 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Chuck Lever, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs

On Mon, Oct 20, 2025 at 12:27:04PM -0400, Mike Snitzer wrote:
> > Can you define synchronous better here?  The term is unfortunately
> > overloaded between synchronous syscalls vs aio/io_uring and O_(D)SYNC
> > style I/O.  As of now I don't understand which one you mean, especially
> > with the DONTCACHE reference thrown in, but I guess I'll figure it out
> > reading the patch.
> 
> It clearly talks about synchronous IO.  DONTCACHE just happens to be
> the buffered IO that'll be used if supported.
> 
> I don't see anything that needs changing here.

Again, what synchronous?  O_(D)SYNC, or !async?

> 
> > > If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> > > invalidate the page cache on behalf of the DIO WRITE, then
> > > nfsd_issue_write_dio() will fall back to using buffered IO.
> > 
> > Did you see -ENOTBLK leaking out of the file systems?  Because at
> > least for iomap it is supposed to be an indication that the
> > file system ->write_iter handler needs to retry using buffered
> > I/O and never leak to the caller.
> 
> fs/iomap/direct-io.c:__iomap_dio_rw() will return -ENOTBLK on its own.

Yes, and the callers in the file system methods are supposed to handle
that and fall back to buffered I/O.  Take a look at xfs_file_write_iter.

If this leaks out of ->write_iter we have a problem that needs fixing
in the affected file system.
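[The fallback pattern described here can be sketched as a userspace
simulation -- not the xfs_file_write_iter() code itself, just its
shape: -ENOTBLK from the direct I/O helper is a private "retry
buffered" signal that never escapes to the caller.]

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for __iomap_dio_rw(): fails with -ENOTBLK when it cannot
 * invalidate the page cache over the write range. */
static int dio_write_sim(size_t len, int cache_invalidated)
{
	return cache_invalidated ? (int)len : -ENOTBLK;
}

static int buffered_write_sim(size_t len)
{
	return (int)len;	/* always succeeds in this toy model */
}

/* Modeled on the shape of xfs_file_write_iter(): the filesystem's
 * ->write_iter catches -ENOTBLK itself and retries buffered, so the
 * error never reaches callers such as nfsd_vfs_write(). */
static int file_write_iter_sim(size_t len, int cache_invalidated)
{
	int ret = dio_write_sim(len, cache_invalidated);

	if (ret == -ENOTBLK)
		ret = buffered_write_sim(len);
	return ret;
}
```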

> 
> > > These changes served as the original starting point for the NFS
> > > client's misaligned O_DIRECT support that landed with
> > > commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
> > > READ and WRITE"). But NFSD's support is simpler because it currently
> > > doesn't use AIO completion.
> > 
> > I don't understand this paragraph.  What does starting point mean
> > here?  How does it matter for the patch description?
> 
> This patch's NFSD changes were starting point for NFS commit
> c817248fc831

But that commit is already in?  This sentence really confuses me.

> Would prefer that cleanup, if done, to happen with an incremental
> follow-up patch.

Starting out with well structured code is generally a good idea :(

> 
> > > +static bool
> > > +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> > > +                        unsigned int len_mask)
> > 
> > Wouldn't it make sense to track the alignment when building the bio_vec
> > array instead of doing another walk here touching all cache lines?
> 
> Yes, that is the conventional wisdom that justified Keith removing
> iov_iter_aligned.  But for NFSD's WRITE payload the bio_vec is built
> outside of the fs/nfsd/vfs.c code.  Could be there is a natural way to
> clean this up (to make the sunrpc layer conditionally care about
> alignment) but I didn't confront that yet.

Well, for the block code it's also built outside the layer consuming it.
But Keith showed that you can easily communicate that information and
avoid extra loops touching the cache lines.

> 
> Room for follow-on improvement to be sure.
> 
> > > +	if (unlikely(dio_blocksize > PAGE_SIZE))
> > > +		return false;
> > 
> > Why does this matter?  Can you add a comment explaining it?
> 
> I did/do have concerns that a bug in the storage driver could expose
> dio_offset_align that is far too large and so bounded dio_blocksize to
> a conservative size.  What is a better constant to check?

I don't think there is a good upper limit; we now do support LBA sizes
larger than PAGE_SIZE, although I don't think there are any shipping
devices that do that.  But the most important thing is to always add
a comment with the rationale for such non-obvious checks so that someone
who has to lift it later can understand why it is done.

> > > +	 * Also update @stable_how to avoid need for COMMIT.
> > > +	 */
> > > +	kiocb->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> > 
> > What do you mean with completed before returning?  I guess you
> > mean writeback actually happening, right?  Why do you need that,
> > why do you also force it for the direct I/O?
> 
> The IO needs to have reached stable storage.

Please spell that out.  Because that's different from how completed
is generally used in file systems and storage protocols.

> > Also IOCB_SYNC is wrong here, as the only thing it does over
> > IOCB_DSYNC is also forcing back of metadata not needed to find
> > data (aka timestamps), which I can't see any need for here.
> 
> Well, we're eliding any followup SYNC from the NFS client by setting
> stable_how to NFS_FILE_SYNC, so I made sure to use SYNC:

Why do you care about timestamps reaching stable storage?  Which is the
only practical extra thing IOCB_SYNC will give you.  If there is a good
reason for that, add a comment because that's a bit unexpected (and
in general IOCB_SYNC / O_SYNC is just a sign that someone does not know
about SYNC vs DSYNC semantics).

> > The VFS certainly does not, and if it leaks out of a specific file
> > system we need to fix that.
> 
> As I said above, fs/iomap/direct-io.c:__iomap_dio_rw() does.

As stated above that's an internal helper and the code is supposed to
be handled by the file system and never leak out.

> Regardless, the most likely outcome from issuing what should be aligned
> DIO only to get -EINVAL is exactly what is addressed with this
> _unlikely_ error handling: misaligned IO..

No.  There is no reason why this is any more likely than any of the
other -EINVAL cases.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-21 11:24     ` Jeff Layton
@ 2025-10-22  5:16       ` Christoph Hellwig
  2025-10-22 10:15         ` Jeff Layton
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2025-10-22  5:16 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christoph Hellwig, Chuck Lever, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Mike Snitzer

On Tue, Oct 21, 2025 at 07:24:42AM -0400, Jeff Layton wrote:
> Responding to a WRITE with NFS_FILE_SYNC flag set means that the data
> the client wrote is now on stable storage (and hence the client doesn't
> need to follow up with a COMMIT). This patch is using DIO for the
> aligned middle section but any unaligned ends use buffered I/O. If we
> want to return NFS_FILE_SYNC here then all of the data and metadata
> need to be on disk when the reply goes out.
> 
> Don't we need IOCB_SYNC here in this case? Otherwise, the server could
> crash while the metadata is still in memory. When it comes back up, the
> client could see stale timestamps. Maybe that's not fatal, but it seems
> pretty sketchy.

Why do you care about the timestamps which in NFS AFAIK aren't tied
to the data writeback in any way?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22  5:16       ` Christoph Hellwig
@ 2025-10-22 10:15         ` Jeff Layton
  2025-10-22 11:17           ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Layton @ 2025-10-22 10:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Mike Snitzer

On Tue, 2025-10-21 at 22:16 -0700, Christoph Hellwig wrote:
> On Tue, Oct 21, 2025 at 07:24:42AM -0400, Jeff Layton wrote:
> > Responding to a WRITE with NFS_FILE_SYNC flag set means that the data
> > the client wrote is now on stable storage (and hence the client doesn't
> > need to follow up with a COMMIT). This patch is using DIO for the
> > aligned middle section but any unaligned ends use buffered I/O. If we
> > want to return NFS_FILE_SYNC here then all of the data and metadata
> > need to be on disk when the reply goes out.
> > 
> > Don't we need IOCB_SYNC here in this case? Otherwise, the server could
> > crash while the metadata is still in memory. When it comes back up, the
> > client could see stale timestamps. Maybe that's not fatal, but it seems
> > pretty sketchy.
> 
> Why do you care about the timestamps which in NFS AFAIK aren't tied
> to the data writeback in any way?

Cache coherency. If the timestamps roll back, some clients may not
notice that data has changed after a write.

It's not the end of the world -- non-evident timestamp changes were
typical prior to the multigrain timestamp patches going in, but it's
not ideal.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 10:15         ` Jeff Layton
@ 2025-10-22 11:17           ` Christoph Hellwig
  2025-10-22 11:30             ` Jeff Layton
  2025-10-22 13:31             ` Chuck Lever
  0 siblings, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-10-22 11:17 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christoph Hellwig, Chuck Lever, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Mike Snitzer

On Wed, Oct 22, 2025 at 06:15:44AM -0400, Jeff Layton wrote:
> Cache coherency. If the timestamps roll back, some clients may not
> notice that data has changed after a write.
> 
> It's not the end of the world -- non-evident timestamp changes were
> typical prior to the multigrain timestamp patches going in, but it's
> not ideal.

Well, in that case nfsd_vfs_write needs to use IOCB_SYNC as well.
And both that and this code need a big fat comment about it.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 11:17           ` Christoph Hellwig
@ 2025-10-22 11:30             ` Jeff Layton
  2025-10-22 13:31             ` Chuck Lever
  1 sibling, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2025-10-22 11:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Mike Snitzer

On Wed, 2025-10-22 at 04:17 -0700, Christoph Hellwig wrote:
> On Wed, Oct 22, 2025 at 06:15:44AM -0400, Jeff Layton wrote:
> > Cache coherency. If the timestamps roll back, some clients may not
> > notice that data has changed after a write.
> > 
> > It's not the end of the world -- non-evident timestamp changes were
> > typical prior to the multigrain timestamp patches going in, but it's
> > not ideal.
> 
> Well, in that case nfsd_vfs_write needs to use IOCB_SYNC as well.
> And both that and this code need a big fat comment about it.

That was my thinking too, but Chuck pointed out that nfsd has been
violating this basically since forever. Given that, I think using
IOCB_DSYNC is probably fine, and it would certainly be less costly. 
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 11:17           ` Christoph Hellwig
  2025-10-22 11:30             ` Jeff Layton
@ 2025-10-22 13:31             ` Chuck Lever
  2025-10-23  5:27               ` Christoph Hellwig
  1 sibling, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2025-10-22 13:31 UTC (permalink / raw)
  To: Christoph Hellwig, Jeff Layton
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Mike Snitzer

On 10/22/25 7:17 AM, Christoph Hellwig wrote:
> On Wed, Oct 22, 2025 at 06:15:44AM -0400, Jeff Layton wrote:
>> Cache coherency. If the timestamps roll back, some clients may not
>> notice that data has changed after a write.
>>
>> It's not the end of the world -- non-evident timestamp changes were
>> typical prior to the multigrain timestamp patches going in, but it's
>> not ideal.
> 
> Well, in that case nfsd_vfs_write needs to use IOCB_SYNC as well.
> And both that and this code need a big fat comment about it.

NFSD's historical behavior doesn't match the spec, so yes, I'd like a
comment or two that calls this out if we agree that's something we
do not want to change.

Otherwise, I'm not opposed to bringing NFSD into compliance. But that
might need patches that can be backported. Not technically difficult,
but some planning is necessary.

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22  5:14       ` Christoph Hellwig
@ 2025-10-22 14:37         ` Chuck Lever
  2025-10-23  5:46           ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2025-10-22 14:37 UTC (permalink / raw)
  To: Christoph Hellwig, Mike Snitzer
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs

On 10/22/25 1:14 AM, Christoph Hellwig wrote:
> On Mon, Oct 20, 2025 at 12:27:04PM -0400, Mike Snitzer wrote:
>>> Can you define synchronous better here?  The term is unfortunately
>>> overloaded between synchronous syscalls vs aio/io_uring and O_(D)SYNC
>>> style I/O.  As of now I don't understand which one you mean, especially
>>> with the DONTCACHE reference thrown in, but I guess I'll figure it out
>>> reading the patch.
>>
>> It clearly talks about synchronous IO.  DONTCACHE just happens to be
>> the buffered IO that'll be used if supported.
>>
>> I don't see anything that needs changing here.
> 
> Again, what synchronous?  O_(D)SYNC, or !async?
> 
>>
>>>> If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
>>>> invalidate the page cache on behalf of the DIO WRITE, then
>>>> nfsd_issue_write_dio() will fall back to using buffered IO.
>>>
>>> Did you see -ENOTBLK leaking out of the file systems?  Because at
>>> least for iomap it is supposed to be an indication that the
>>> file system ->write_iter handler needs to retry using buffered
>>> I/O and never leak to the caller.
>>
>> fs/iomap/direct-io.c:__iomap_dio_rw() will return -ENOTBLK on its own.
> 
> Yes, and the callers in the file system methods are supposed to handle
> that and fall back to buffered I/O.  Take a look at xfs_file_write_iter.
> 
> If this leaks out of ->write_iter we have a problem that needs fixing
> in the affected file system.

What I'm hearing is that nfsd_vfs_write() should never see -ENOTBLK, so
this check is essentially dead code.


>>>> These changes served as the original starting point for the NFS
>>>> client's misaligned O_DIRECT support that landed with
>>>> commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
>>>> READ and WRITE"). But NFSD's support is simpler because it currently
>>>> doesn't use AIO completion.
>>>
>>> I don't understand this paragraph.  What does starting point mean
>>> here?  How does it matter for the patch description?
>>
>> This patch's NFSD changes were starting point for NFS commit
>> c817248fc831
> 
> But that commit is already in?  This sentence really confuses me.

I'm not convinced this paragraph in the patch description is needed.
Can it be removed?


>> Would prefer that cleanup, if done, to happen with an incremental
>> follow-up patch.

I'm not following. What clean-up? And why push it out to a later
merge?


> Starting out with well structured code is generally a good idea :(

That is my preference as well, though I recognize that some subsystems
do not mind merging incomplete features and moving forward from there.

Let's remember, upstream does not target specific kernel releases when
developing new features. You might want your feature to hit the next LTS
kernel, and even that is frowned upon. Let's focus on getting this as
right as we know how to, rather than hitting a delivery date.


>>>> +static bool
>>>> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
>>>> +                        unsigned int len_mask)
>>>
>>> Wouldn't it make sense to track the alignment when building the bio_vec
>>> array instead of doing another walk here touching all cache lines?
>>
>> Yes, that is the conventional wisdom that justified Keith removing
>> iov_iter_aligned.  But for NFSD's WRITE payload the bio_vec is built
>> outside of the fs/nfsd/vfs.c code.  Could be there is a natural way to
>> clean this up (to make the sunrpc layer conditionally care about
>> alignment) but I didn't confront that yet.
> 
> Well, for the block code it's also built outside the layer consuming it.
> But Keith showed that you can easily communicate that information and
> avoid extra loops touching the cache lines.

Again, is there a reason to push this suggestion out to a later merge
window? It seems reasonable to tackle it now.

(Perhaps to get us started, Christoph, do you know of specific code that
shows how this code could be reorganized?)


>> Room for follow-on improvement to be sure.
>>
>>>> +	if (unlikely(dio_blocksize > PAGE_SIZE))
>>>> +		return false;
>>>
>>> Why does this matter?  Can you add a comment explaining it?
>>
>> I did/do have concerns that a bug in the storage driver could expose
>> dio_offset_align that is far too large and so bounded dio_blocksize to
>> a conservative size.  What is a better constant to check?
> 
> I don't think there is a good upper limit; we now do support LBA sizes
> larger than PAGE_SIZE, although I don't think there are any shipping
> devices that do that.  But the most important thing is to always add
> a comment with the rationale for such non-obvious checks so that someone
> who has to lift it later can understand why it is done.
> 
>>>> +	 * Also update @stable_how to avoid need for COMMIT.
>>>> +	 */
>>>> +	kiocb->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
>>>
>>> What do you mean with completed before returning?  I guess you
>>> mean writeback actually happening, right?  Why do you need that,
>>> why do you also force it for the direct I/O?
>>
>> The IO needs to have reached stable storage.
> 
> Please spell that out.  Because that's different from how completed
> is generally used in file systems and storage protocols.

It's sensible to add a comment that cites the NFS protocol compliance
mandate that states a data durability requirement.


>>> Also IOCB_SYNC is wrong here, as the only thing it does over
>>> IOCB_DSYNC is also forcing back of metadata not needed to find
>>> data (aka timestamps), which I can't see any need for here.
>>
>> Well, we're eliding any followup SYNC from the NFS client by setting
>> stable_how to NFS_FILE_SYNC, so I made sure to use SYNC:
> 
> Why do you care about timestamps reaching stable storage?  Which is the
> only practical extra thing IOCB_SYNC will give you.  If there is a good
> reason for that, add a comment because that's a bit unexpected (and
> in general IOCB_SYNC / O_SYNC is just a sign that someone does not know
> about SYNC vs DSYNC semantics).

If NFSD responds with NFS_FILE_SYNC here, then timestamps need to be
persisted as well as data. As discussed in other threads, currently
nfsd_vfs_write() seems to miss that mark.

So either: return NFS_DATA_SYNC (which means we've persisted only the
data) or make this path persist the timestamps too.

I haven't seen any existing code that convinces me that setting both
IOCB_SYNC and IOCB_DSYNC is necessary.


>>> The VFS certainly does not, and if it leaks out of a specific file
>>> system we need to fix that.
>>
>> As I said above, fs/iomap/direct-io.c:__iomap_dio_rw() does.
> 
> As stated above that's an internal helper and the code is supposed to
> be handled by the file system and never leak out.
> 
>> Regardless, the most likely outcome from issuing what should be aligned
>> DIO only to get -EINVAL is exactly what is addressed with this
>> _unlikely_ error handling: misaligned IO..
> 
> No.  There is no reason why this is any more likely than any of the
> other -EINVAL cases.

Agreed, the likelihood of getting an -EINVAL here due to a block
alignment issue is much lower than getting -EINVAL due to other
problems.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-20  7:19   ` Christoph Hellwig
                       ` (2 preceding siblings ...)
  2025-10-21 11:24     ` Jeff Layton
@ 2025-10-22 17:59     ` Chuck Lever
  2025-10-23  5:52       ` Christoph Hellwig
  3 siblings, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2025-10-22 17:59 UTC (permalink / raw)
  To: Christoph Hellwig, Mike Snitzer
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs

On 10/20/25 3:19 AM, Christoph Hellwig wrote:
>> +	/*
>> +	 * Any buffered IO issued here will be misaligned, use
>> +	 * sync IO to ensure it has completed before returning.
>> +	 * Also update @stable_how to avoid need for COMMIT.
>> +	 */
>> +	kiocb->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> What do you mean with completed before returning?  I guess you
> mean writeback actually happening, right?  Why do you need that,
> why do you also force it for the direct I/O?

This is the only comment where I'm not clear on what corrective
action to take.

I think IOCB_SYNC would be needed with O_DIRECT to force timestamp
updates. Otherwise, IOCB_SYNC is relevant only when the function is
forced to fall back to some form of write through the page cache.

Is that correct?


-- 
Chuck Lever


* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 13:31             ` Chuck Lever
@ 2025-10-23  5:27               ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-10-23  5:27 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Jeff Layton, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, linux-nfs, Mike Snitzer

On Wed, Oct 22, 2025 at 09:31:53AM -0400, Chuck Lever wrote:
> On 10/22/25 7:17 AM, Christoph Hellwig wrote:
> > On Wed, Oct 22, 2025 at 06:15:44AM -0400, Jeff Layton wrote:
> >> Cache coherency. If the timestamps roll back, some clients may not
> >> notice that data has changed after a write.
> >>
> >> It's not the end of the world -- non-evident timestamp changes were
> >> typical prior to the multigrain timestamp patches going in, but it's
> >> not ideal.
> > 
> > Well, in that case nfsd_vfs_write needs to use IOCB_SYNC as well.
> > And both that and this code need a big fat comment about it.
> NFSD's historical behavior doesn't match the spec, so yes, I'd like a
> comment or two that calls this out if we agree that's something we
> do not want to change.
> 
> Otherwise, I'm not opposed to bringing NFSD into compliance. But that
> might need patches that can be backported. Not technically difficult,
> but some planning is necessary.

It's basically just adding IOCB_SYNC in nfsd_vfs_write.  If you do
that before this series it should be fairly easy to backport.



* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 14:37         ` Chuck Lever
@ 2025-10-23  5:46           ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-10-23  5:46 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Mike Snitzer, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs

On Wed, Oct 22, 2025 at 10:37:35AM -0400, Chuck Lever wrote:
> What I'm hearing is that nfsd_vfs_write() should never see -ENOTBLK, so
> this check is essentially dead code.

That's how iomap_dio_rw is designed to work.  If it ever leaks out,
that is a file system bug because no other caller of ->write_iter
handles -ENOTBLK.

> >>>> These changes served as the original starting point for the NFS
> >>>> client's misaligned O_DIRECT support that landed with
> >>>> commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
> >>>> READ and WRITE"). But NFSD's support is simpler because it currently
> >>>> doesn't use AIO completion.
> >>>
> >>> I don't understand this paragraph.  What does starting point mean
> >>> here?  How does it matter for the patch description?
> >>
> >> This patch's NFSD changes were starting point for NFS commit
> >> c817248fc831
> > 
> > But that commit is already in?  This sentence really confuses me.
> 
> I'm not convinced this paragraph in the patch description is needed.
> Can it be removed?

I guess Mike wants to say the two are based on the same concept.  But
given that c817248fc831 is already in I'd expect to word it as "these
changes are similar to the direct I/O writes implemented for the NFS
client local I/O feature in commit c817248fc831 <insert subject line
here>".  But I'm not really sure there is much of a point in that.

> (Perhaps to get us started, Christoph, do you know of specific code that
> shows how this code could be reorganized?)
> 

This is what I had in mind:

https://lore.kernel.org/linux-block/20251014150456.2219261-1-kbusch@meta.com/T/#m3588b5824b7fa76f001d57e54664889b73471422

It records the alignment while building (or iterating for another
reason), and then just checks a mask later.  For nfsd this would be a
bit different as you'd also record the boundaries of the aligned
segments, but the idea is the same.

> If NFSD responds with NFS_FILE_SYNC here, then timestamps need to be
> persisted as well as data. As discussed in other threads, currently
> nfsd_vfs_write() seems to miss that mark.
> 
> So either: return NFS_DATA_SYNC (which means we've persisted only the
> data) or make this path persist the timestamps too.
> 
> I haven't seen any existing code that convinces me that setting both
> IOCB_SYNC and IOCB_DSYNC is necessary.

iocb_flags is the canonical place.  IOCB_DSYNC is the main path driving
the syncing, and IOCB_SYNC drives the additional syncing of the
timestamps.

This is all a bit odd, because Linux historically only supported the
O_SYNC flag, but used it for O_DSYNC semantics.  When I fixed that,
I had to do a few non-standard things to keep existing binaries
working as expected despite renaming the flags.  See commit
6b2f3d1f769b ("vfs: Implement proper O_SYNC semantics") for details.


* Re: [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 17:59     ` Chuck Lever
@ 2025-10-23  5:52       ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-10-23  5:52 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Mike Snitzer, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs

On Wed, Oct 22, 2025 at 01:59:04PM -0400, Chuck Lever wrote:
> On 10/20/25 3:19 AM, Christoph Hellwig wrote:
> >> +	/*
> >> +	 * Any buffered IO issued here will be misaligned, use
> >> +	 * sync IO to ensure it has completed before returning.
> >> +	 * Also update @stable_how to avoid need for COMMIT.
> >> +	 */
> >> +	kiocb->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> > What do you mean with completed before returning?  I guess you
> > mean writeback actually happening, right?  Why do you need that,
> > why do you also force it for the direct I/O?
> 
> This is the only comment where I'm not clear on what corrective
> action to take.

I have multiple issues with the comment.  One is that it really should
talk about committing data to stable storage instead of completing,
as that seems to be the main point (and is also what IOCB_DSYNC /
IOCB_SYNC does).  If we stick to using IOCB_SYNC it should also
mention metadata or more specifically timestamps.

But I also really don't get the "Any buffered IO issued here will
be misaligned, ..." part.  What does buffered I/O and/or misalignment
have to do with not wanting to do a COMMIT?

> I think IOCB_SYNC would be needed with O_DIRECT to force timestamp
> updates. Otherwise, IOCB_SYNC is relevant only when the function is
> forced to fall back to some form of write through the page cache.

Well, IOCB_SYNC is only needed to commit timestamps.  O_DSYNC is
always required if you want to commit to stable storage.  As said
above I don't really understand from the patch why we want to do
that, but IFF we want to do that, we need IOCB_DSYNC both for
direct and buffered I/O.



Thread overview: 21+ messages
2025-10-18  0:54 [PATCH v4 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-18  0:54 ` [PATCH v4 1/3] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-10-20  7:02   ` Christoph Hellwig
2025-10-18  0:54 ` [PATCH v4 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-20  7:19   ` Christoph Hellwig
2025-10-20 13:56     ` Chuck Lever
2025-10-20 14:05       ` Christoph Hellwig
2025-10-20 16:27     ` Mike Snitzer
2025-10-22  5:14       ` Christoph Hellwig
2025-10-22 14:37         ` Chuck Lever
2025-10-23  5:46           ` Christoph Hellwig
2025-10-21 11:24     ` Jeff Layton
2025-10-22  5:16       ` Christoph Hellwig
2025-10-22 10:15         ` Jeff Layton
2025-10-22 11:17           ` Christoph Hellwig
2025-10-22 11:30             ` Jeff Layton
2025-10-22 13:31             ` Chuck Lever
2025-10-23  5:27               ` Christoph Hellwig
2025-10-22 17:59     ` Chuck Lever
2025-10-23  5:52       ` Christoph Hellwig
2025-10-18  0:54 ` [PATCH v4 3/3] svcrdma: Mark Read chunks Chuck Lever
