[PATCH v6 0/5] NFSD: Implement NFSD_IO

linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
@ 2025-10-22 19:22 Chuck Lever
  2025-10-22 19:22 ` [PATCH v6 1/5] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Chuck Lever @ 2025-10-22 19:22 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Following on https://lore.kernel.org/linux-nfs/aPAci7O_XK1ljaum@kernel.org/
this series includes the patches needed to make NFSD Direct WRITE
work.

The "large" piece of work left to do:
> > > Wouldn't it make sense to track the alignment when building the bio_vec
> > > array instead of doing another walk here touching all cache lines?
> > 
> > Yes, that is the conventional wisdom that justified Keith removing
> > iov_iter_aligned.  But for NFSD's WRITE payload the bio_vec is built
> > outside of the fs/nfsd/vfs.c code.  Could be there is a natural way to
> > clean this up (to make the sunrpc layer to conditionally care about
> > alignment) but I didn't confront that yet.
> 
> Well, for the block code it's also build outside the layer consuming it.
> But Keith showed that you can easily communicate that information and
> avoid extra loops touching the cache lines.

Changes since v5:
* Add a patch to make FILE_SYNC WRITEs persist timestamps
* Address some of Christoph's review comments

Changes since v4:
* Split out refactoring nfsd_buffered_write() into a separate patch
* Expand patch description of 1/4
* Don't set IOCB_SYNC flag

Changes since v3:
* Address checkpatch.pl nits in 2/3
* Add an untested patch to mark ingress RDMA Read chunks

Chuck Lever (3):
  NFSD: Make FILE_SYNC WRITEs comply with spec
  NFSD: Enable return of an updated stable_how to NFS clients
  svcrdma: Mark Read chunks

Mike Snitzer (2):
  NFSD: Refactor nfsd_vfs_write()
  NFSD: Implement NFSD_IO_DIRECT for NFS WRITE

 fs/nfsd/debugfs.c                       |   1 +
 fs/nfsd/nfs3proc.c                      |   2 +-
 fs/nfsd/nfs4proc.c                      |   2 +-
 fs/nfsd/nfsproc.c                       |   3 +-
 fs/nfsd/trace.h                         |   1 +
 fs/nfsd/vfs.c                           | 219 ++++++++++++++++++++++--
 fs/nfsd/vfs.h                           |   6 +-
 fs/nfsd/xdr3.h                          |   2 +-
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   5 +
 9 files changed, 224 insertions(+), 17 deletions(-)

-- 
2.51.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v6 1/5] NFSD: Make FILE_SYNC WRITEs comply with spec
  2025-10-22 19:22 [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-22 19:22 ` Chuck Lever
  2025-10-22 19:22 ` [PATCH v6 2/5] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Chuck Lever @ 2025-10-22 19:22 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Mike Snitzer

From: Chuck Lever <chuck.lever@oracle.com>

Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
does not also persist file time stamps. To wit, Section 18.32.3
of RFC 8881 mandates:

> The client specifies with the stable parameter the method of how
> the data is to be processed by the server. If stable is
> FILE_SYNC4, the server MUST commit the data written plus all file
> system metadata to stable storage before returning results. This
> corresponds to the NFSv2 protocol semantics. Any other behavior
> constitutes a protocol violation. If stable is DATA_SYNC4, then
> the server MUST commit all of the data to stable storage and
> enough of the metadata to retrieve the data before returning.

For many years, NFSD has used a "data sync only" optimization for
FILE_SYNC WRITEs, in violation of the above text (and previous
incarnations of the NFS standard). File time stamps haven't been
persisted as the mandate above requires.

The purpose of this behavior is that, back in the day, file systems
on rotational media were too slow to handle writes with time stamp
updates. With the advent of UNSTABLE WRITE, the time stamp update is
done by the COMMIT, which amortizes the cost of one time stamp
update over possibly many WRITE requests.

The impact of this change will be felt only when a client explicitly
requests a FILE_SYNC WRITE on a shared file system backed by slow
storage. UNSTABLE and DATA_SYNC WRITEs should not be affected.

Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f537a7b4ee01..2c5d38f38454 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1315,7 +1315,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	init_sync_kiocb(&kiocb, file);
 	kiocb.ki_pos = offset;
 	if (stable && !fhp->fh_use_wgather)
-		kiocb.ki_flags |= IOCB_DSYNC;
+		kiocb.ki_flags |=
+			(stable == NFS_FILE_SYNC ? IOCB_SYNC : IOCB_DSYNC);

 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
 	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
-- 
2.51.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 2/5] NFSD: Enable return of an updated stable_how to NFS clients
  2025-10-22 19:22 [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-10-22 19:22 ` [PATCH v6 1/5] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
@ 2025-10-22 19:22 ` Chuck Lever
  2025-10-22 19:22 ` [PATCH v6 3/5] NFSD: Refactor nfsd_vfs_write() Chuck Lever
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Chuck Lever @ 2025-10-22 19:22 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

NFSv3 and newer protocols enable clients to perform a two-phase
WRITE. A client requests an UNSTABLE WRITE, which sends dirty data
to the NFS server, but does not persist it. The server replies that
it performed the UNSTABLE WRITE, and the client is then obligated to
follow up with a COMMIT request before it can remove the dirty data
from its own page cache. The COMMIT reply is the client's guarantee
that the written data has been persisted on the server.

The purpose of this protocol design is to enable clients to send
a large amount of data via multiple WRITE requests to a server, and
then wait for persistence just once. The server is able to start
persisting the data as soon as it gets it, to shorten the length of
time the client has to wait for the final COMMIT to complete.

It's also possible for the server to respond to an UNSTABLE WRITE
request in a way that indicates that the data was persisted anyway.
In that case, the client can skip the COMMIT and remove the dirty
data from its memory immediately. NetApp filers, for example, do
this because they have a battery-backed cache and can guarantee that
written data is persisted quickly and immediately.

NFSD has never implemented this kind of promotion. UNSTABLE WRITE
requests are unconditionally treated as UNSTABLE. However, in a
subsequent patch, nfsd_vfs_write() will be able to promote an
UNSTABLE WRITE to be a FILE_SYNC WRITE. This will be because NFSD
will handle some WRITE requests locally with O_DIRECT, which
persists written data immediately. The FILE_SYNC WRITE response
indicates to the client that no follow-up COMMIT is necessary.

This patch prepares for that change by enabling the modified
stable_how value to be passed along to NFSD's WRITE reply encoder.
No behavior change is expected.

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs3proc.c |  2 +-
 fs/nfsd/nfs4proc.c |  2 +-
 fs/nfsd/nfsproc.c  |  3 ++-
 fs/nfsd/vfs.c      | 11 ++++++-----
 fs/nfsd/vfs.h      |  6 ++++--
 fs/nfsd/xdr3.h     |  2 +-
 6 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index b6d03e1ef5f7..ad14b34583bb 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -236,7 +236,7 @@ nfsd3_proc_write(struct svc_rqst *rqstp)
 	resp->committed = argp->stable;
 	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
 				  &argp->payload, &cnt,
-				  resp->committed, resp->verf);
+				  &resp->committed, resp->verf);
 	resp->count = cnt;
 	resp->status = nfsd3_map_status(resp->status);
 	return rpc_success;
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 7f7e6bb23a90..2222bb283baf 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1285,7 +1285,7 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	write->wr_how_written = write->wr_stable_how;
 	status = nfsd_vfs_write(rqstp, &cstate->current_fh, nf,
 				write->wr_offset, &write->wr_payload,
-				&cnt, write->wr_how_written,
+				&cnt, &write->wr_how_written,
 				(__be32 *)write->wr_verifier.data);
 	nfsd_file_put(nf);
 
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 8f71f5748c75..706401ed6f8d 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -251,6 +251,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
 	struct nfsd_writeargs *argp = rqstp->rq_argp;
 	struct nfsd_attrstat *resp = rqstp->rq_resp;
 	unsigned long cnt = argp->len;
+	u32 committed = NFS_DATA_SYNC;
 
 	dprintk("nfsd: WRITE    %s %u bytes at %d\n",
 		SVCFH_fmt(&argp->fh),
@@ -258,7 +259,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
 
 	fh_copy(&resp->fh, &argp->fh);
 	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
-				  &argp->payload, &cnt, NFS_DATA_SYNC, NULL);
+				  &argp->payload, &cnt, &committed, NULL);
 	if (resp->status == nfs_ok)
 		resp->status = fh_getattr(&resp->fh, &resp->stat);
 	else if (resp->status == nfserr_jukebox)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 2c5d38f38454..2b238d07f1ba 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1262,7 +1262,7 @@ static int wait_for_concurrent_writes(struct file *file)
  * @offset: Byte offset of start
  * @payload: xdr_buf containing the write payload
  * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
  * @verf: NFS WRITE verifier
  *
  * Upon return, caller must invoke fh_put on @fhp.
@@ -1274,11 +1274,12 @@ __be32
 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	       struct nfsd_file *nf, loff_t offset,
 	       const struct xdr_buf *payload, unsigned long *cnt,
-	       int stable, __be32 *verf)
+	       u32 *stable_how, __be32 *verf)
 {
 	struct nfsd_net		*nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
 	struct file		*file = nf->nf_file;
 	struct super_block	*sb = file_inode(file)->i_sb;
+	u32			stable = *stable_how;
 	struct kiocb		kiocb;
 	struct svc_export	*exp;
 	struct iov_iter		iter;
@@ -1435,7 +1436,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  * @offset: Byte offset of start
  * @payload: xdr_buf containing the write payload
  * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
  * @verf: NFS WRITE verifier
  *
  * Upon return, caller must invoke fh_put on @fhp.
@@ -1445,7 +1446,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  */
 __be32
 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
-	   const struct xdr_buf *payload, unsigned long *cnt, int stable,
+	   const struct xdr_buf *payload, unsigned long *cnt, u32 *stable_how,
 	   __be32 *verf)
 {
 	struct nfsd_file *nf;
@@ -1458,7 +1459,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
 		goto out;
 
 	err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt,
-			     stable, verf);
+			     stable_how, verf);
 	nfsd_file_put(nf);
 out:
 	trace_nfsd_write_done(rqstp, fhp, offset, *cnt);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index fa46f8b5f132..c713ed0b04e0 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -130,11 +130,13 @@ __be32		nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				u32 *eof);
 __be32		nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				loff_t offset, const struct xdr_buf *payload,
-				unsigned long *cnt, int stable, __be32 *verf);
+				unsigned long *cnt, u32 *stable_how,
+				__be32 *verf);
 __be32		nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				struct nfsd_file *nf, loff_t offset,
 				const struct xdr_buf *payload,
-				unsigned long *cnt, int stable, __be32 *verf);
+				unsigned long *cnt, u32 *stable_how,
+				__be32 *verf);
 __be32		nfsd_readlink(struct svc_rqst *, struct svc_fh *,
 				char *, int *);
 __be32		nfsd_symlink(struct svc_rqst *, struct svc_fh *,
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 522067b7fd75..c0e443ef3a6b 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -152,7 +152,7 @@ struct nfsd3_writeres {
 	__be32			status;
 	struct svc_fh		fh;
 	unsigned long		count;
-	int			committed;
+	u32			committed;
 	__be32			verf[2];
 };
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 3/5] NFSD: Refactor nfsd_vfs_write()
  2025-10-22 19:22 [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-10-22 19:22 ` [PATCH v6 1/5] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
  2025-10-22 19:22 ` [PATCH v6 2/5] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
@ 2025-10-22 19:22 ` Chuck Lever
  2025-10-22 19:22 ` [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Chuck Lever @ 2025-10-22 19:22 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Mike Snitzer

From: Mike Snitzer <snitzer@kernel.org>

Extract the common code that is to be used in the buffered and
dontcache I/O modes. This common code will also be used as the
fallback when direct I/O is requested but cannot be used.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 2b238d07f1ba..41cd2b53d803 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,22 @@ static int wait_for_concurrent_writes(struct file *file)
 	return err;
 }
 
+static int
+nfsd_iocb_write(struct svc_rqst *rqstp, struct file *file, unsigned int nvecs,
+		unsigned long *cnt, struct kiocb *kiocb)
+{
+	struct iov_iter iter;
+	int host_err;
+
+	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+	host_err = vfs_iocb_iter_write(file, kiocb, &iter);
+	if (host_err < 0)
+		return host_err;
+	*cnt = host_err;
+
+	return 0;
+}
+
 /**
  * nfsd_vfs_write - write data to an already-open file
  * @rqstp: RPC execution context
@@ -1282,7 +1298,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	u32			stable = *stable_how;
 	struct kiocb		kiocb;
 	struct svc_export	*exp;
-	struct iov_iter		iter;
 	errseq_t		since;
 	__be32			nfserr;
 	int			host_err;
@@ -1320,25 +1335,24 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 			(stable == NFS_FILE_SYNC ? IOCB_SYNC : IOCB_DSYNC);
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
-	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
 	since = READ_ONCE(file->f_wb_err);
 	if (verf)
 		nfsd_copy_write_verifier(verf, nn);
 
 	switch (nfsd_io_cache_write) {
-	case NFSD_IO_BUFFERED:
-		break;
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
 			kiocb.ki_flags |= IOCB_DONTCACHE;
+		fallthrough;
+	case NFSD_IO_BUFFERED:
+		host_err = nfsd_iocb_write(rqstp, file, nvecs, cnt, &kiocb);
 		break;
 	}
-	host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
 	if (host_err < 0) {
 		commit_reset_write_verifier(nn, rqstp, host_err);
 		goto out_nfserr;
 	}
-	*cnt = host_err;
 	nfsd_stats_io_write_add(nn, exp, *cnt);
 	fsnotify_modify(file);
 	host_err = filemap_check_wb_err(file->f_mapping, since);
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 19:22 [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (2 preceding siblings ...)
  2025-10-22 19:22 ` [PATCH v6 3/5] NFSD: Refactor nfsd_vfs_write() Chuck Lever
@ 2025-10-22 19:22 ` Chuck Lever
  2025-10-22 19:27   ` Chuck Lever
                     ` (2 more replies)
  2025-10-22 19:22 ` [PATCH v6 5/5] svcrdma: Mark Read chunks Chuck Lever
  2025-10-22 21:56 ` [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Mike Snitzer
  5 siblings, 3 replies; 17+ messages in thread
From: Chuck Lever @ 2025-10-22 19:22 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Mike Snitzer

From: Mike Snitzer <snitzer@kernel.org>

If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
middle and end as needed. The large middle extent is DIO-aligned and
the start and/or end are misaligned. An O_SYNC buffered write (with
preference towards using DONTCACHE) is used for the misaligned extents
and O_DIRECT is used for the middle DIO-aligned extent.

nfsd_issue_write_dio() promotes @stable_how to NFS_FILE_SYNC, which
allows the client to drop its dirty data and avoid needing an extra
COMMIT operation.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/debugfs.c |   1 +
 fs/nfsd/trace.h   |   1 +
 fs/nfsd/vfs.c     | 181 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 183 insertions(+)

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 00eb1ecef6ac..7f44689e0a53 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
 	switch (val) {
 	case NFSD_IO_BUFFERED:
 	case NFSD_IO_DONTCACHE:
+	case NFSD_IO_DIRECT:
 		nfsd_io_cache_write = val;
 		break;
 	default:
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index bfd41236aff2..ad74439d0105 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
 DEFINE_NFSD_IO_EVENT(read_done);
 DEFINE_NFSD_IO_EVENT(write_start);
 DEFINE_NFSD_IO_EVENT(write_opened);
+DEFINE_NFSD_IO_EVENT(write_direct);
 DEFINE_NFSD_IO_EVENT(write_io_done);
 DEFINE_NFSD_IO_EVENT(write_done);
 DEFINE_NFSD_IO_EVENT(commit_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 41cd2b53d803..29c29a5111f8 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,105 @@ static int wait_for_concurrent_writes(struct file *file)
 	return err;
 }
 
+struct nfsd_write_dio {
+	ssize_t	start_len;	/* Length for misaligned first extent */
+	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
+	ssize_t	end_len;	/* Length for misaligned last extent */
+};
+
+static bool
+nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
+			   struct nfsd_file *nf,
+			   struct nfsd_write_dio *write_dio)
+{
+	const u32 dio_blocksize = nf->nf_dio_offset_align;
+	loff_t start_end, orig_end, middle_end;
+
+	if (unlikely(dio_blocksize > PAGE_SIZE || len < dio_blocksize))
+		return false;
+
+	start_end = round_up(offset, dio_blocksize);
+	orig_end = offset + len;
+	middle_end = round_down(orig_end, dio_blocksize);
+
+	write_dio->start_len = start_end - offset;
+	write_dio->middle_len = middle_end - start_end;
+	write_dio->end_len = orig_end - middle_end;
+
+	return true;
+}
+
+static bool
+nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
+			   unsigned int len_mask)
+{
+	const struct bio_vec *bvec = i->bvec;
+	size_t skip = i->iov_offset;
+	size_t size = i->count;
+
+	if (size & len_mask)
+		return false;
+	do {
+		size_t len = bvec->bv_len;
+
+		if (len > size)
+			len = size;
+		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
+			return false;
+		bvec++;
+		size -= len;
+		skip = 0;
+	} while (size);
+
+	return true;
+}
+
+/*
+ * Set up as many as 3 iov_iter based on extents described by @write_dio.
+ * Returns the number of iov_iter that were setup.
+ */
+static int
+nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
+			   struct bio_vec *rq_bvec, unsigned int nvecs,
+			   unsigned long cnt, struct nfsd_write_dio *write_dio,
+			   struct nfsd_file *nf)
+{
+	struct iov_iter *iters = *iterp;
+	int n_iters = 0;
+
+	/* Set up misaligned start? */
+	if (write_dio->start_len) {
+		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+		iters[n_iters].count = write_dio->start_len;
+		iter_is_dio_aligned[n_iters] = false;
+		++n_iters;
+	}
+
+	/* Set up DIO-aligned middle */
+	iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+	if (write_dio->start_len)
+		iov_iter_advance(&iters[n_iters], write_dio->start_len);
+	iters[n_iters].count -= write_dio->end_len;
+	iter_is_dio_aligned[n_iters] =
+		nfsd_iov_iter_aligned_bvec(&iters[n_iters],
+					   nf->nf_dio_mem_align - 1,
+					   nf->nf_dio_offset_align - 1);
+	if (unlikely(!iter_is_dio_aligned[n_iters]))
+		return 0; /* no DIO-aligned IO possible */
+	++n_iters;
+
+	/* Set up misaligned end? */
+	if (write_dio->end_len) {
+		iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+		iov_iter_advance(&iters[n_iters],
+				 write_dio->start_len + write_dio->middle_len);
+		iter_is_dio_aligned[n_iters] = false;
+		++n_iters;
+	}
+
+	return n_iters;
+}
+
 static int
 nfsd_iocb_write(struct svc_rqst *rqstp, struct file *file, unsigned int nvecs,
 		unsigned long *cnt, struct kiocb *kiocb)
@@ -1270,6 +1369,80 @@ nfsd_iocb_write(struct svc_rqst *rqstp, struct file *file, unsigned int nvecs,
 	return 0;
 }
 
+static int
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		     struct nfsd_file *nf, unsigned int nvecs,
+		     unsigned long *cnt, struct kiocb *kiocb,
+		     struct nfsd_write_dio *write_dio)
+{
+	struct iov_iter iter_stack[3];
+	struct iov_iter *iter = iter_stack;
+	loff_t in_offset = kiocb->ki_pos;
+	struct file *file = nf->nf_file;
+	unsigned long in_count = *cnt;
+	bool iter_is_dio_aligned[3];
+	unsigned int n_iters = 0;
+	ssize_t host_err;
+
+	n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
+					     rqstp->rq_bvec, nvecs, *cnt,
+					     write_dio, nf);
+	if (unlikely(!n_iters))
+		return nfsd_iocb_write(rqstp, file, nvecs, cnt, kiocb);
+
+	trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
+
+	/*
+	 * SYNC + DIRECT means the write should persist dirty time stamps.
+	 *
+	 * But for buffered writes done as fallback or to handle misaligned
+	 * payload segments, the data must be durable before nfsd_vfs_write
+	 * returns.
+	 */
+	kiocb->ki_flags &= ~IOCB_DSYNC;
+	kiocb->ki_flags |= IOCB_SYNC;
+
+	*cnt = 0;
+	for (int i = 0; i < n_iters; i++) {
+		if (iter_is_dio_aligned[i])
+			kiocb->ki_flags |= IOCB_DIRECT;
+		else
+			kiocb->ki_flags &= ~IOCB_DIRECT;
+
+		host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
+		if (host_err < 0)
+			return host_err;
+		*cnt += host_err;
+		if (host_err < iter[i].count) /* partial write? */
+			break;
+	}
+
+	return 0;
+}
+
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  struct nfsd_file *nf, unsigned int nvecs,
+		  unsigned long *cnt, struct kiocb *kiocb)
+{
+	struct nfsd_write_dio write_dio;
+
+	/*
+	 * Check if the underlying file system allows IOCB_DONTCACHE to be
+	 * used when issuing buffered IO. If so, request it to preserve the
+	 * intent of NFSD_IO_DIRECT. DONTCACHE will be ignored for direct
+	 * writes.
+	 */
+	if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+		kiocb->ki_flags |= IOCB_DONTCACHE;
+
+	if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
+		return nfsd_issue_write_dio(rqstp, fhp, nf, nvecs, cnt, kiocb,
+					    &write_dio);
+
+	return nfsd_iocb_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
+}
+
 /**
  * nfsd_vfs_write - write data to an already-open file
  * @rqstp: RPC execution context
@@ -1311,6 +1484,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	if (sb->s_export_op)
 		exp_op_flags = sb->s_export_op->flags;
 
+	/* cel: UNSTABLE buffered WRITEs might want some form of throttling
+	 *	as well to prevent clients from pushing writes to us faster
+	 *	than they can be flushed onto durable storage */
 	if (test_bit(RQ_LOCAL, &rqstp->rq_flags) &&
 	    !(exp_op_flags & EXPORT_OP_REMOTE_FS)) {
 		/*
@@ -1341,6 +1517,11 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		nfsd_copy_write_verifier(verf, nn);
 
 	switch (nfsd_io_cache_write) {
+	case NFSD_IO_DIRECT:
+		host_err = nfsd_direct_write(rqstp, fhp, nf, nvecs, cnt,
+					     &kiocb);
+		stable = *stable_how = NFS_FILE_SYNC;
+		break;
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
 			kiocb.ki_flags |= IOCB_DONTCACHE;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 5/5] svcrdma: Mark Read chunks
  2025-10-22 19:22 [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (3 preceding siblings ...)
  2025-10-22 19:22 ` [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-22 19:22 ` Chuck Lever
  2025-10-22 21:25   ` Mike Snitzer
  2025-10-22 21:56 ` [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Mike Snitzer
  5 siblings, 1 reply; 17+ messages in thread
From: Chuck Lever @ 2025-10-22 19:22 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The upper layer may want to know when the receive buffer's .pages
array is guaranteed to contain only an opaque payload. This permits
the upper layer to optimize its buffer handling.

NB: Since svc_rdma_recvfrom.c is under net/, we use the comment
style that is preferred in the networking layer.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index e7e4a39ca6c6..b1a0c72f73de 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -815,6 +815,11 @@ static void svc_rdma_read_complete_one(struct svc_rqst *rqstp,
 	buf->page_len = length;
 	buf->len += length;
 	buf->buflen += length;
+
+	/* Transport guarantees that only the chunk payload
+	 * appears in buf->pages.
+	 */
+	buf->flags |= XDRBUF_READ;
 }
 
 /* Finish constructing the RPC Call message in rqstp::rq_arg.
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 19:22 ` [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-22 19:27   ` Chuck Lever
  2025-10-22 21:09     ` Mike Snitzer
  2025-10-22 20:58   ` Mike Snitzer
  2025-10-23 19:37   ` Chuck Lever
  2 siblings, 1 reply; 17+ messages in thread
From: Chuck Lever @ 2025-10-22 19:27 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Mike Snitzer

On 10/22/25 3:22 PM, Chuck Lever wrote:
> @@ -1311,6 +1484,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	if (sb->s_export_op)
>  		exp_op_flags = sb->s_export_op->flags;
>  
> +	/* cel: UNSTABLE buffered WRITEs might want some form of throttling
> +	 *	as well to prevent clients from pushing writes to us faster
> +	 *	than they can be flushed onto durable storage */
>  	if (test_bit(RQ_LOCAL, &rqstp->rq_flags) &&
>  	    !(exp_op_flags & EXPORT_OP_REMOTE_FS)) {
>  		/*

Note: the above is an open question. I don't intend to merge this patch
with the above comment included.

-- 
Chuck Lever


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 19:22 ` [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-10-22 19:27   ` Chuck Lever
@ 2025-10-22 20:58   ` Mike Snitzer
  2025-10-22 21:09     ` Chuck Lever
  2025-10-23 19:37   ` Chuck Lever
  2 siblings, 1 reply; 17+ messages in thread
From: Mike Snitzer @ 2025-10-22 20:58 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs

On Wed, Oct 22, 2025 at 03:22:07PM -0400, Chuck Lever wrote:
> From: Mike Snitzer <snitzer@kernel.org>
> 
> If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> middle and end as needed. The large middle extent is DIO-aligned and
> the start and/or end are misaligned. An O_SYNC buffered write (with
> preference towards using DONTCACHE) is used for the misaligned extents
> and O_DIRECT is used for the middle DIO-aligned extent.
> 
> nfsd_issue_write_dio() promotes @stable_how to NFS_FILE_SYNC, which
> allows the client to drop its dirty data and avoid needing an extra
> COMMIT operation.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/debugfs.c |   1 +
>  fs/nfsd/trace.h   |   1 +
>  fs/nfsd/vfs.c     | 181 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 183 insertions(+)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 41cd2b53d803..29c29a5111f8 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1254,6 +1254,105 @@ static int wait_for_concurrent_writes(struct file *file)
>  	return err;
>  }
>  
> +struct nfsd_write_dio {
> +	ssize_t	start_len;	/* Length for misaligned first extent */
> +	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
> +	ssize_t	end_len;	/* Length for misaligned last extent */
> +};
> +
> +static bool
> +nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> +			   struct nfsd_file *nf,
> +			   struct nfsd_write_dio *write_dio)
> +{
> +	const u32 dio_blocksize = nf->nf_dio_offset_align;
> +	loff_t start_end, orig_end, middle_end;
> +

I see you removed this check:

-       if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
-               return false;

Curious why. Seems unsafe because they can be 0.

> +	if (unlikely(dio_blocksize > PAGE_SIZE || len < dio_blocksize))
> +		return false;
> +
> +	start_end = round_up(offset, dio_blocksize);
> +	orig_end = offset + len;
> +	middle_end = round_down(orig_end, dio_blocksize);
> +
> +	write_dio->start_len = start_end - offset;
> +	write_dio->middle_len = middle_end - start_end;
> +	write_dio->end_len = orig_end - middle_end;
> +
> +	return true;
> +}

Otherwise, your other changes all seem fine.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 19:27   ` Chuck Lever
@ 2025-10-22 21:09     ` Mike Snitzer
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Snitzer @ 2025-10-22 21:09 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs

On Wed, Oct 22, 2025 at 03:27:02PM -0400, Chuck Lever wrote:
> On 10/22/25 3:22 PM, Chuck Lever wrote:
> > @@ -1311,6 +1484,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  	if (sb->s_export_op)
> >  		exp_op_flags = sb->s_export_op->flags;
> >  
> > +	/* cel: UNSTABLE buffered WRITEs might want some form of throttling
> > +	 *	as well to prevent clients from pushing writes to us faster
> > +	 *	than they can be flushed onto durable storage */
> >  	if (test_bit(RQ_LOCAL, &rqstp->rq_flags) &&
> >  	    !(exp_op_flags & EXPORT_OP_REMOTE_FS)) {
> >  		/*
> 
> Note: the above is an open question. I don't intend to merge this patch
> with the above comment included.

Such a mechanism would benefit all NFSD_IO modes.

As-is, even NFSD_IO_DIRECT isn't sufficient to ensure client's forward
progress if NFS client application is using buffered IO -- because the
client itself can then enter page reclaim (when it exhausts its memory
because it doesn't get any back pressure).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 20:58   ` Mike Snitzer
@ 2025-10-22 21:09     ` Chuck Lever
  0 siblings, 0 replies; 17+ messages in thread
From: Chuck Lever @ 2025-10-22 21:09 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs

On 10/22/25 4:58 PM, Mike Snitzer wrote:
> On Wed, Oct 22, 2025 at 03:22:07PM -0400, Chuck Lever wrote:
>> From: Mike Snitzer <snitzer@kernel.org>
>>
>> If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
>> middle and end as needed. The large middle extent is DIO-aligned and
>> the start and/or end are misaligned. An O_SYNC buffered write (with
>> preference towards using DONTCACHE) is used for the misaligned extents
>> and O_DIRECT is used for the middle DIO-aligned extent.
>>
>> nfsd_issue_write_dio() promotes @stable_how to NFS_FILE_SYNC, which
>> allows the client to drop its dirty data and avoid needing an extra
>> COMMIT operation.
>>
>> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>> Reviewed-by: Jeff Layton <jlayton@kernel.org>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>  fs/nfsd/debugfs.c |   1 +
>>  fs/nfsd/trace.h   |   1 +
>>  fs/nfsd/vfs.c     | 181 ++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 183 insertions(+)
>>
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 41cd2b53d803..29c29a5111f8 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -1254,6 +1254,105 @@ static int wait_for_concurrent_writes(struct file *file)
>>  	return err;
>>  }
>>  
>> +struct nfsd_write_dio {
>> +	ssize_t	start_len;	/* Length for misaligned first extent */
>> +	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
>> +	ssize_t	end_len;	/* Length for misaligned last extent */
>> +};
>> +
>> +static bool
>> +nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
>> +			   struct nfsd_file *nf,
>> +			   struct nfsd_write_dio *write_dio)
>> +{
>> +	const u32 dio_blocksize = nf->nf_dio_offset_align;
>> +	loff_t start_end, orig_end, middle_end;
>> +
> 
> I see you removed this check:
> 
> -       if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
> -               return false;
> 
> Curious why. Seems unsafe because they can be 0.

Hm. I might have removed the wrong check. Will fix that up for the next
round.


> 
>> +	if (unlikely(dio_blocksize > PAGE_SIZE || len < dio_blocksize))
>> +		return false;
>> +
>> +	start_end = round_up(offset, dio_blocksize);
>> +	orig_end = offset + len;
>> +	middle_end = round_down(orig_end, dio_blocksize);
>> +
>> +	write_dio->start_len = start_end - offset;
>> +	write_dio->middle_len = middle_end - start_end;
>> +	write_dio->end_len = orig_end - middle_end;
>> +
>> +	return true;
>> +}
> 
> Otherwise, your other changes all seem fine.
> 
> Thanks,
> Mike


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 5/5] svcrdma: Mark Read chunks
  2025-10-22 19:22 ` [PATCH v6 5/5] svcrdma: Mark Read chunks Chuck Lever
@ 2025-10-22 21:25   ` Mike Snitzer
  2025-10-23 13:00     ` Chuck Lever
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Snitzer @ 2025-10-22 21:25 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On Wed, Oct 22, 2025 at 03:22:08PM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> The upper layer may want to know when the receive buffer's .pages
> array is guaranteed to contain only an opaque payload. This permits
> the upper layer to optimize its buffer handling.
> 
> NB: Since svc_rdma_recvfrom.c is under net/, we use the comment
> style that is preferred in the networking layer.
> 
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index e7e4a39ca6c6..b1a0c72f73de 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -815,6 +815,11 @@ static void svc_rdma_read_complete_one(struct svc_rqst *rqstp,
>  	buf->page_len = length;
>  	buf->len += length;
>  	buf->buflen += length;
> +
> +	/* Transport guarantees that only the chunk payload
> +	 * appears in buf->pages.
> +	 */
> +	buf->flags |= XDRBUF_READ;
>  }
>  
>  /* Finish constructing the RPC Call message in rqstp::rq_arg.
> -- 
> 2.51.0
> 
> 

This patch header leaves me unclear on your thinking for upper layer
optimization that would look for XDRBUF_READ.

I see rpcrdma_marshal_req() is checking XDRBUF_READ but...

I'm left hoping for a clearer payoff on this change's benefit (and
relation to the rest of the patchset).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 19:22 [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
                   ` (4 preceding siblings ...)
  2025-10-22 19:22 ` [PATCH v6 5/5] svcrdma: Mark Read chunks Chuck Lever
@ 2025-10-22 21:56 ` Mike Snitzer
  5 siblings, 0 replies; 17+ messages in thread
From: Mike Snitzer @ 2025-10-22 21:56 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On Wed, Oct 22, 2025 at 03:22:03PM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Following on https://lore.kernel.org/linux-nfs/aPAci7O_XK1ljaum@kernel.org/
> this series includes the patches needed to make NFSD Direct WRITE
> work.
> 
> The "large" piece of work left to do:
> > > > Wouldn't it make sense to track the alignment when building the bio_vec
> > > > array instead of doing another walk here touching all cache lines?
> > > 
> > > Yes, that is the conventional wisdom that justified Keith removing
> > > iov_iter_aligned.  But for NFSD's WRITE payload the bio_vec is built
> > > outside of the fs/nfsd/vfs.c code.  Could be there is a natural way to
> > > clean this up (to make the sunrpc layer to conditionally care about
> > > alignment) but I didn't confront that yet.
> > 
> > Well, for the block code it's also build outside the layer consuming it.
> > But Keith showed that you can easily communicate that information and
> > avoid extra loops touching the cache lines.

I had started to reply to the other thread relative to this point, but
I'll cherry pick it here:

On Wed, Oct 22, 2025 at 10:37:35AM -0400, Chuck Lever wrote:
> Again, is there a reason to push this suggestion out to a later merge
> window? It seems reasonable to tackle it now.

If you'd be OK with us tackling it as an incremental change attributed
to whoever takes the lead, then it can happen now or whenever someone
can get to it.

IMO it is worthy of incremental effort because it is genuinely
destabilizing line of development -- any -EINVAL that could result
from misaligned DIO being issued to the underlying filesystem can
cause serious problems with IO retry stalls.. at least in my
experience with pNFS flexfiles (as implemented by Hammerspace).
And that is why I was logging -EINVAL is important; now removed in
this v6. So you've removed checking that'd save us from bugs growing
to be way more painful _before_ optimization work that might expose us
to -EINVAL. To be clear: I am perfectly fine with removing that
seemingly heavy -EINVAL handling from nfsd_issue_write_dio like you
did _because_ nfsd_iov_iter_aligned_bvec() does a comprehensive job of
avoiding the possibility of misaligned DIO being issued to underlying
filesystem.

As cliche as "perfection is the enemy of good" is, it does genuinely
govern how I engineer solutions.

That doesn't mean I'm not seeking perfection, but for me: perfection
is accomplished iteratively by honing good to great to perfect.

Bounding each step towards perfection helps reach it.

Apologies if all that caused your eyes to roll into the back of your
head.  I have this additional optimization work on my TODO, I just
haven't been able to circle back to it.

> (Perhaps to get us started, Christoph, do you know of specific code that
> shows how this code could be reorganized?)

It is at the time of the write payload's bvec creation where alignment
would need to be assessed.

Keith's changes in these commits provide context (albeit opaque):

fec2e705729d block: check for valid bio while splitting
743bf2e0c49c block: add size alignment to bio_iov_iter_get_pages
20a0e6276edb block: align the bio after building it
5ff3f74e145a block: simplify direct io validity check
7eac33186957 iomap: simplify direct io validity check
9eab1d4e0d15 block: remove bdev_iter_is_aligned
69d7ed5b9ef6 blk-integrity: use simpler alignment check
b475272f03ca iov_iter: remove iov_iter_is_aligned

If you look at the entirety of those commits' changes with:
git diff fec2e705729d^..b475272f03ca

You can see that Keith removed calls to iov_iter_is_aligned and
bdev_iter_is_aligned by doing checking earlier at the time of creation
of "the thing" (for our case that'd be the bio_vec entries sunrpc adds
to the iov_iter that gets passed as write payload to vfs.c?).

Mike

> Changes since v5:
> * Add a patch to make FILE_SYNC WRITEs persist timestamps
> * Address some of Christoph's review comments
> 
> Changes since v4:
> * Split out refactoring nfsd_buffered_write() into a separate patch
> * Expand patch description of 1/4
> * Don't set IOCB_SYNC flag
> 
> Changes since v3:
> * Address checkpatch.pl nits in 2/3
> * Add an untested patch to mark ingress RDMA Read chunks
> 
> Chuck Lever (3):
>   NFSD: Make FILE_SYNC WRITEs comply with spec
>   NFSD: Enable return of an updated stable_how to NFS clients
>   svcrdma: Mark Read chunks
> 
> Mike Snitzer (2):
>   NFSD: Refactor nfsd_vfs_write()
>   NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
> 
>  fs/nfsd/debugfs.c                       |   1 +
>  fs/nfsd/nfs3proc.c                      |   2 +-
>  fs/nfsd/nfs4proc.c                      |   2 +-
>  fs/nfsd/nfsproc.c                       |   3 +-
>  fs/nfsd/trace.h                         |   1 +
>  fs/nfsd/vfs.c                           | 219 ++++++++++++++++++++++--
>  fs/nfsd/vfs.h                           |   6 +-
>  fs/nfsd/xdr3.h                          |   2 +-
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   5 +
>  9 files changed, 224 insertions(+), 17 deletions(-)
> 
> -- 
> 2.51.0
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 5/5] svcrdma: Mark Read chunks
  2025-10-22 21:25   ` Mike Snitzer
@ 2025-10-23 13:00     ` Chuck Lever
  0 siblings, 0 replies; 17+ messages in thread
From: Chuck Lever @ 2025-10-23 13:00 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On 10/22/25 5:25 PM, Mike Snitzer wrote:
> On Wed, Oct 22, 2025 at 03:22:08PM -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> The upper layer may want to know when the receive buffer's .pages
>> array is guaranteed to contain only an opaque payload. This permits
>> the upper layer to optimize its buffer handling.
>>
>> NB: Since svc_rdma_recvfrom.c is under net/, we use the comment
>> style that is preferred in the networking layer.
>>
>> Reviewed-by: Jeff Layton <jlayton@kernel.org>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> index e7e4a39ca6c6..b1a0c72f73de 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> @@ -815,6 +815,11 @@ static void svc_rdma_read_complete_one(struct svc_rqst *rqstp,
>>  	buf->page_len = length;
>>  	buf->len += length;
>>  	buf->buflen += length;
>> +
>> +	/* Transport guarantees that only the chunk payload
>> +	 * appears in buf->pages.
>> +	 */
>> +	buf->flags |= XDRBUF_READ;
>>  }
>>  
>>  /* Finish constructing the RPC Call message in rqstp::rq_arg.
>> -- 
>> 2.51.0
>>
>>
> 
> This patch header leaves me unclear on your thinking for upper layer
> optimization that would look for XDRBUF_READ.
> 
> I see rpcrdma_marshal_req() is checking XDRBUF_READ but...
> 
> I'm left hoping for a clearer payoff on this change's benefit (and
> relation to the rest of the patchset).

There's no pay-off yet. You asked me to provide a mechanism that
indicates when the transport is certain that buf.pages contains /only/
the write payload.

That's all that's happening here -- it paves the way for you to
construct the optimization in nfsd_vfs_write(). We don't have to commit
this change until you have a patch that can take advantage of it.

It could be that I misunderstood your request.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-22 19:22 ` [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-10-22 19:27   ` Chuck Lever
  2025-10-22 20:58   ` Mike Snitzer
@ 2025-10-23 19:37   ` Chuck Lever
  2025-10-23 21:23     ` Mike Snitzer
  2 siblings, 1 reply; 17+ messages in thread
From: Chuck Lever @ 2025-10-23 19:37 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: linux-nfs, Tom Talpey, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo

On 10/22/25 3:22 PM, Chuck Lever wrote:
> +struct nfsd_write_dio {
> +	ssize_t	start_len;	/* Length for misaligned first extent */
> +	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
> +	ssize_t	end_len;	/* Length for misaligned last extent */
> +};
> +
> +static bool
> +nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> +			   struct nfsd_file *nf,
> +			   struct nfsd_write_dio *write_dio)
> +{
> +	const u32 dio_blocksize = nf->nf_dio_offset_align;
> +	loff_t start_end, orig_end, middle_end;
> +
> +	if (unlikely(dio_blocksize > PAGE_SIZE || len < dio_blocksize))
> +		return false;
> +
> +	start_end = round_up(offset, dio_blocksize);
> +	orig_end = offset + len;
> +	middle_end = round_down(orig_end, dio_blocksize);
> +
> +	write_dio->start_len = start_end - offset;
> +	write_dio->middle_len = middle_end - start_end;
> +	write_dio->end_len = orig_end - middle_end;
> +
> +	return true;
> +}
> +
> +static bool
> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> +			   unsigned int len_mask)
> +{
> +	const struct bio_vec *bvec = i->bvec;
> +	size_t skip = i->iov_offset;
> +	size_t size = i->count;
> +
> +	if (size & len_mask)
> +		return false;
> +	do {
> +		size_t len = bvec->bv_len;
> +
> +		if (len > size)
> +			len = size;
> +		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> +			return false;
> +		bvec++;
> +		size -= len;
> +		skip = 0;
> +	} while (size);
> +
> +	return true;
> +}
> +

Hey Mike, I'm trying to understand when nfsd_is_write_dio_possible()
would return true but nfsd_iov_iter_aligned_bvec() on the middle segment
would return false.


-- 
Chuck Lever


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-23 19:37   ` Chuck Lever
@ 2025-10-23 21:23     ` Mike Snitzer
  2025-10-24 14:24       ` Chuck Lever
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Snitzer @ 2025-10-23 21:23 UTC (permalink / raw)
  To: Chuck Lever
  Cc: linux-nfs, Tom Talpey, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo

On Thu, Oct 23, 2025 at 03:37:36PM -0400, Chuck Lever wrote:
> On 10/22/25 3:22 PM, Chuck Lever wrote:
> > +struct nfsd_write_dio {
> > +	ssize_t	start_len;	/* Length for misaligned first extent */
> > +	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
> > +	ssize_t	end_len;	/* Length for misaligned last extent */
> > +};
> > +
> > +static bool
> > +nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> > +			   struct nfsd_file *nf,
> > +			   struct nfsd_write_dio *write_dio)
> > +{
> > +	const u32 dio_blocksize = nf->nf_dio_offset_align;
> > +	loff_t start_end, orig_end, middle_end;
> > +
> > +	if (unlikely(dio_blocksize > PAGE_SIZE || len < dio_blocksize))
> > +		return false;
> > +
> > +	start_end = round_up(offset, dio_blocksize);
> > +	orig_end = offset + len;
> > +	middle_end = round_down(orig_end, dio_blocksize);
> > +
> > +	write_dio->start_len = start_end - offset;
> > +	write_dio->middle_len = middle_end - start_end;
> > +	write_dio->end_len = orig_end - middle_end;
> > +
> > +	return true;
> > +}
> > +
> > +static bool
> > +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> > +			   unsigned int len_mask)
> > +{
> > +	const struct bio_vec *bvec = i->bvec;
> > +	size_t skip = i->iov_offset;
> > +	size_t size = i->count;
> > +
> > +	if (size & len_mask)
> > +		return false;
> > +	do {
> > +		size_t len = bvec->bv_len;
> > +
> > +		if (len > size)
> > +			len = size;
> > +		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> > +			return false;
> > +		bvec++;
> > +		size -= len;
> > +		skip = 0;
> > +	} while (size);
> > +
> > +	return true;
> > +}
> > +
> 
> Hey Mike, I'm trying to understand when nfsd_is_write_dio_possible()
> would return true but nfsd_iov_iter_aligned_bvec() on the middle segment
> would return false.

It is always due to memory alignment (addr_mask check), never due to
logical alignment (len_mask check).

So we could remove the len_mask arg and the 'if (size & len_mask)'
check from nfsd_iov_iter_aligned_bvec

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-23 21:23     ` Mike Snitzer
@ 2025-10-24 14:24       ` Chuck Lever
  2025-10-24 18:30         ` Mike Snitzer
  0 siblings, 1 reply; 17+ messages in thread
From: Chuck Lever @ 2025-10-24 14:24 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: linux-nfs, Tom Talpey, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo

On 10/23/25 5:23 PM, Mike Snitzer wrote:
> On Thu, Oct 23, 2025 at 03:37:36PM -0400, Chuck Lever wrote:
>> On 10/22/25 3:22 PM, Chuck Lever wrote:
>>> +struct nfsd_write_dio {
>>> +	ssize_t	start_len;	/* Length for misaligned first extent */
>>> +	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
>>> +	ssize_t	end_len;	/* Length for misaligned last extent */
>>> +};
>>> +
>>> +static bool
>>> +nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
>>> +			   struct nfsd_file *nf,
>>> +			   struct nfsd_write_dio *write_dio)
>>> +{
>>> +	const u32 dio_blocksize = nf->nf_dio_offset_align;
>>> +	loff_t start_end, orig_end, middle_end;
>>> +
>>> +	if (unlikely(dio_blocksize > PAGE_SIZE || len < dio_blocksize))
>>> +		return false;
>>> +
>>> +	start_end = round_up(offset, dio_blocksize);
>>> +	orig_end = offset + len;
>>> +	middle_end = round_down(orig_end, dio_blocksize);
>>> +
>>> +	write_dio->start_len = start_end - offset;
>>> +	write_dio->middle_len = middle_end - start_end;
>>> +	write_dio->end_len = orig_end - middle_end;
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static bool
>>> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
>>> +			   unsigned int len_mask)
>>> +{
>>> +	const struct bio_vec *bvec = i->bvec;
>>> +	size_t skip = i->iov_offset;
>>> +	size_t size = i->count;
>>> +
>>> +	if (size & len_mask)
>>> +		return false;
>>> +	do {
>>> +		size_t len = bvec->bv_len;
>>> +
>>> +		if (len > size)
>>> +			len = size;
>>> +		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
>>> +			return false;
>>> +		bvec++;
>>> +		size -= len;
>>> +		skip = 0;
>>> +	} while (size);
>>> +
>>> +	return true;
>>> +}
>>> +
>>
>> Hey Mike, I'm trying to understand when nfsd_is_write_dio_possible()
>> would return true but nfsd_iov_iter_aligned_bvec() on the middle segment
>> would return false.
> 
> It is always due to memory alignment (addr_mask check), never due to
> logical alignment (len_mask check).
> 
> So we could remove the len_mask arg and the 'if (size & len_mask)'
> check from nfsd_iov_iter_aligned_bvec

Fair enough.

Another question: why does nfsd_iov_iter_aligned_bvec need to walk the
entire bvec? For current NFSD and SunRPC, the elements of the bvec
array will be contiguous pages. After nfsd_is_write_dio_possible says
"OK", seems like you need to check only the first element in the bvec?

Or, maybe the check isn't necessary at all? I'm just wondering why
nfsd_is_write_dio_possible by itself isn't sufficient. Our network
transports aren't going to hand us arbitrary blobs in the bvec elements.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-10-24 14:24       ` Chuck Lever
@ 2025-10-24 18:30         ` Mike Snitzer
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Snitzer @ 2025-10-24 18:30 UTC (permalink / raw)
  To: Chuck Lever
  Cc: linux-nfs, Tom Talpey, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo

[-- Attachment #1: Type: text/plain, Size: 3740 bytes --]

On Fri, Oct 24, 2025 at 10:24:41AM -0400, Chuck Lever wrote:
> On 10/23/25 5:23 PM, Mike Snitzer wrote:
> > On Thu, Oct 23, 2025 at 03:37:36PM -0400, Chuck Lever wrote:
> >> On 10/22/25 3:22 PM, Chuck Lever wrote:
> >>> +struct nfsd_write_dio {
> >>> +	ssize_t	start_len;	/* Length for misaligned first extent */
> >>> +	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
> >>> +	ssize_t	end_len;	/* Length for misaligned last extent */
> >>> +};
> >>> +
> >>> +static bool
> >>> +nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> >>> +			   struct nfsd_file *nf,
> >>> +			   struct nfsd_write_dio *write_dio)
> >>> +{
> >>> +	const u32 dio_blocksize = nf->nf_dio_offset_align;
> >>> +	loff_t start_end, orig_end, middle_end;
> >>> +
> >>> +	if (unlikely(dio_blocksize > PAGE_SIZE || len < dio_blocksize))
> >>> +		return false;
> >>> +
> >>> +	start_end = round_up(offset, dio_blocksize);
> >>> +	orig_end = offset + len;
> >>> +	middle_end = round_down(orig_end, dio_blocksize);
> >>> +
> >>> +	write_dio->start_len = start_end - offset;
> >>> +	write_dio->middle_len = middle_end - start_end;
> >>> +	write_dio->end_len = orig_end - middle_end;
> >>> +
> >>> +	return true;
> >>> +}
> >>> +
> >>> +static bool
> >>> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> >>> +			   unsigned int len_mask)
> >>> +{
> >>> +	const struct bio_vec *bvec = i->bvec;
> >>> +	size_t skip = i->iov_offset;
> >>> +	size_t size = i->count;
> >>> +
> >>> +	if (size & len_mask)
> >>> +		return false;
> >>> +	do {
> >>> +		size_t len = bvec->bv_len;
> >>> +
> >>> +		if (len > size)
> >>> +			len = size;
> >>> +		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> >>> +			return false;
> >>> +		bvec++;
> >>> +		size -= len;
> >>> +		skip = 0;
> >>> +	} while (size);
> >>> +
> >>> +	return true;
> >>> +}
> >>> +
> >>
> >> Hey Mike, I'm trying to understand when nfsd_is_write_dio_possible()
> >> would return true but nfsd_iov_iter_aligned_bvec() on the middle segment
> >> would return false.
> > 
> > It is always due to memory alignment (addr_mask check), never due to
> > logical alignment (len_mask check).
> > 
> > So we could remove the len_mask arg and the 'if (size & len_mask)'
> > check from nfsd_iov_iter_aligned_bvec
> 
> Fair enough.
> 
> Another question: why does nfsd_iov_iter_aligned_bvec need to walk the
> entire bvec? For current NFSD and SunRPC, the elements of the bvec
> array will be contiguous pages. After nfsd_is_write_dio_possible says
> "OK", seems like you need to check only the first element in the bvec?

I haven't analyzed if simply checking the first element of the bvec is
sufficient, that'd be a very welcomed efficiency improvement!.

I just know the checking to important and so when
iov_iter_aligned_bvec() removed by Keith I knew we still needed the
benefit of that checking.

> Or, maybe the check isn't necessary at all?

No, the memory alignment checking is certainly needed.  I've attached
the python reproducer that Jonathan Flynn coded (with ChatGPT assist,
hence the emoji like output ;) ) that really exposes the need for
memory alignment checking.

This python code effectively is the preliminary work done by the
MLperf benchmark to extract npz files as executed using MLperf's
buffered IO mode at the client.  (MLperf _can_ be executed in DIO mode
and when that is used there is no resulting misaligned memory).

Run it with:
./mlperf_npz_tool.py --npz-path /mnt/share1/test.npz

> I'm just wondering why nfsd_is_write_dio_possible by itself isn't
> sufficient. Our network transports aren't going to hand us arbitrary
> blobs in the bvec elements.

Sure, would be great if simply checking the first element sufficient.

Thanks,
Mike

[-- Attachment #2: mlperf_npz_tool.py --]
[-- Type: text/plain, Size: 7770 bytes --]

#!/usr/bin/env python3
import argparse, math, os, struct, zlib
from pathlib import Path
import numpy as np
import zipfile

# -----------------------
# Defaults (from your YAML)
# -----------------------
DEFAULT_MEAN_BYTES  = 146_600_628
DEFAULT_STDEV_BYTES = 68_341_808
DEFAULT_RESIZE      = 2_097_152     # 2 MiB
DEFAULT_SAMPLES     = 1
DEFAULT_DTYPE       = "uint8"
DEFAULT_SEED        = 10

# -----------------------
# Helpers
# -----------------------
DTYPE_SIZES = {
    "uint8": 1, "int8": 1,
    "uint16": 2, "int16": 2,
    "uint32": 4, "int32": 4,
    "float32": 4, "float64": 8,
}

def choose_target_bytes(mean_b, stdev_b, resize, randomize):
    if randomize and stdev_b > 0:
        draw = int(np.random.normal(loc=mean_b, scale=stdev_b))
        draw = max(draw, 1)
    else:
        draw = int(mean_b)
    # Round to nearest multiple of resize
    return int(round(draw / resize) * resize)

def choose_hw_for_bytes(total_bytes, samples, dtype_size):
    """
    Choose H, W making H*W*samples*dtype_size == total_bytes.
    We factor total elements and spread powers of two across H and W
    to avoid super-skinny arrays.
    """
    total_elems = total_bytes // (dtype_size * samples)
    if total_elems == 0:
        raise ValueError("Total elements computed as 0; check inputs.")
    n = total_elems
    # Factor out powers of two
    exp2 = (n & -n).bit_length() - 1
    odd  = n >> exp2
    h = 1 << (exp2 // 2)
    w = (1 << (exp2 - exp2 // 2)) * odd
    return int(h), int(w)

def save_npz(out_path: Path, *, mean_bytes, stdev_bytes, resize_bytes,
             samples, dtype_name, seed, compress, randomize):
    dtype = getattr(np, dtype_name)
    dtype_size = DTYPE_SIZES[dtype_name]

    np.random.seed(seed)
    target_bytes = choose_target_bytes(mean_bytes, stdev_bytes, resize_bytes, randomize)
    # Ensure divisibility:
    elems_per_sample = target_bytes // dtype_size // samples
    if elems_per_sample * dtype_size * samples != target_bytes:
        raise ValueError("Target bytes not divisible by dtype_size*samples; adjust params.")

    h, w = choose_hw_for_bytes(target_bytes, samples, dtype_size)

    x = np.random.randint(255, size=(h, w, samples), dtype=dtype if dtype_name == "uint8" else np.uint8)
    if dtype_name != "uint8":
        x = x.astype(dtype, copy=False)
    y = np.zeros((samples,), dtype=np.uint8)  # matches DLIO NPZ generator convention

    out_path.parent.mkdir(parents=True, exist_ok=True)
    if compress:
        np.savez_compressed(out_path, x=x, y=y)
    else:
        np.savez(out_path, x=x, y=y)

    print(f"✅ Wrote {out_path}")
    try:
        sz = out_path.stat().st_size
        print(f"   size={sz} bytes, x.shape={x.shape}, dtype={x.dtype}, samples={samples}")
    except FileNotFoundError:
        pass

def list_and_crc(npz_path: Path, deep=False):
    print(f"📂 File: {npz_path}")
    with zipfile.ZipFile(npz_path, "r") as zf:
        names = zf.namelist()
        print(f"📦 Files in archive: {names}\n")
        for name in names:
            info = zf.getinfo(name)
            print(f"--- {name} ---")
            print(f"Stored CRC32       : 0x{info.CRC:08x}")
            print(f"Compressed Size    : {info.compress_size}")
            print(f"Uncompressed Size  : {info.file_size}")
            try:
                with zf.open(info) as f:
                    _ = f.read()                 # will raise if CRC mismatch
                print("✅ CRC verified by zipfile.\n")
            except zipfile.BadZipFile as e:
                print(f"⚠️ CRC error via zipfile: {e}")
                if deep:
                    ok = deep_crc_check(npz_path, info)
                    print("🔎 Deep check      :", "✅ OK\n" if ok else "❌ Mismatch\n")
                else:
                    print("ℹ️  Re-run with --deep-check to diagnose.\n")
            except Exception as e:
                print(f"❌ Unexpected error: {e}\n")

def deep_crc_check(npz_path: Path, info: zipfile.ZipInfo) -> bool:
    """
    Manual CRC of the *uncompressed* payload.
    Parse the local file header to find the compressed bytes, then
    decompress and compute CRC32 of the uncompressed stream.
    """
    with npz_path.open("rb") as fh:
        fh.seek(info.header_offset)
        local = fh.read(30)  # fixed part of local header
        # local file header sig 'PK\x03\x04'
        if local[:4] != b'PK\x03\x04':
            return False
        # filename length, extra length
        name_len, extra_len = struct.unpack("<HH", local[26:30])
        fh.seek(info.header_offset + 30 + name_len + extra_len)
        comp = fh.read(info.compress_size)

    # Decompress if needed
    if info.compress_type == zipfile.ZIP_STORED:
        data = comp
    elif info.compress_type == zipfile.ZIP_DEFLATED:
        # ZIP uses raw DEFLATE stream (no zlib header): wbits = -15
        try:
            data = zlib.decompress(comp, -15)
        except zlib.error:
            # Some writers include zlib headers; try normal
            data = zlib.decompress(comp)
    else:
        # Other methods not handled here
        return False

    crc = zlib.crc32(data) & 0xFFFFFFFF
    print(f"Computed CRC32     : 0x{crc:08x}")
    return crc == info.CRC

# -----------------------
# CLI
# -----------------------
def parse_args():
    p = argparse.ArgumentParser(description="MLPerf-style NPZ save & check tool")
    mode = p.add_mutually_exclusive_group()
    mode.add_argument("--save-only", action="store_true", help="Only save the NPZ")
    mode.add_argument("--check-only", action="store_true", help="Only verify/show the NPZ")

    p.add_argument("--npz-path", type=Path, help="Full output/check path to NPZ (overrides --out-dir/--name)")
    p.add_argument("--out-dir", type=Path, default=Path("/mnt/hs_test"), help="Directory for output NPZ")
    p.add_argument("--name", default="sample_000000.npz", help="Filename for output NPZ")
    p.add_argument("--compress", action="store_true", help="Use compressed NPZ (deflate)")

    # Size/dtype controls
    p.add_argument("--mean-bytes", type=int, default=DEFAULT_MEAN_BYTES, help="record_length_bytes")
    p.add_argument("--stdev-bytes", type=int, default=DEFAULT_STDEV_BYTES, help="record_length_bytes_stdev")
    p.add_argument("--resize-bytes", type=int, default=DEFAULT_RESIZE, help="record_length_bytes_resize multiple")
    p.add_argument("--randomize", action="store_true", help="Draw size from N(mean,stdev) before rounding")
    p.add_argument("--samples", type=int, default=DEFAULT_SAMPLES, help="num_samples_per_file")
    p.add_argument("--dtype", choices=DTYPE_SIZES.keys(), default=DEFAULT_DTYPE, help="Data dtype for 'x'")
    p.add_argument("--seed", type=int, default=DEFAULT_SEED, help="RNG seed for reproducibility")

    # Check controls
    p.add_argument("--deep-check", action="store_true", help="When checking, manually parse & CRC the member data")
    return p.parse_args()

def main():
    args = parse_args()
    out_path = args.npz_path or (args.out_dir / args.name)

    did_save = False
    if not args.check_only:
        save_npz(
            out_path=out_path,
            mean_bytes=args.mean_bytes,
            stdev_bytes=args.stdev_bytes,
            resize_bytes=args.resize_bytes,
            samples=args.samples,
            dtype_name=args.dtype,
            seed=args.seed,
            compress=args.compress,
            randomize=args.randomize,
        )
        did_save = True

    if not args.save_only:
        if not out_path.exists():
            raise SystemExit(f"File not found for check: {out_path}")
        list_and_crc(out_path, deep=args.deep_check)
    elif did_save:
        # echo path for easy piping
        print(out_path)

if __name__ == "__main__":
    main()


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-10-24 18:30 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-22 19:22 [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-22 19:22 ` [PATCH v6 1/5] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-10-22 19:22 ` [PATCH v6 2/5] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-10-22 19:22 ` [PATCH v6 3/5] NFSD: Refactor nfsd_vfs_write() Chuck Lever
2025-10-22 19:22 ` [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-22 19:27   ` Chuck Lever
2025-10-22 21:09     ` Mike Snitzer
2025-10-22 20:58   ` Mike Snitzer
2025-10-22 21:09     ` Chuck Lever
2025-10-23 19:37   ` Chuck Lever
2025-10-23 21:23     ` Mike Snitzer
2025-10-24 14:24       ` Chuck Lever
2025-10-24 18:30         ` Mike Snitzer
2025-10-22 19:22 ` [PATCH v6 5/5] svcrdma: Mark Read chunks Chuck Lever
2025-10-22 21:25   ` Mike Snitzer
2025-10-23 13:00     ` Chuck Lever
2025-10-22 21:56 ` [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Mike Snitzer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).