* [PATCH v5 1/4] NFSD: Enable return of an updated stable_how to NFS clients
2025-10-20 16:25 [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-20 16:25 ` Chuck Lever
2025-10-20 16:25 ` [PATCH v5 2/4] NFSD: Refactor nfsd_vfs_write() Chuck Lever
` (3 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2025-10-20 16:25 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
NFSv3 and newer protocols enable clients to perform a two-phase
WRITE. A client requests an UNSTABLE WRITE, which sends dirty data
to the NFS server, but does not persist it. The server replies that
it performed the UNSTABLE WRITE, and the client is then obligated to
follow up with a COMMIT request before it can remove the dirty data
from its own page cache. The COMMIT reply is the client's guarantee
that the written data has been persisted on the server.
The purpose of this protocol design is to enable clients to send
a large amount of data via multiple WRITE requests to a server, and
then wait for persistence just once. The server is able to start
persisting the data as soon as it gets it, to shorten the length of
time the client has to wait for the final COMMIT to complete.
It's also possible for the server to respond to an UNSTABLE WRITE
request in a way that indicates that the data was persisted anyway.
In that case, the client can skip the COMMIT and remove the dirty
data from its memory immediately. NetApp filers, for example, do
this because they have a battery-backed cache and can guarantee that
written data is persisted quickly and immediately.
NFSD has never implemented this kind of promotion. UNSTABLE WRITE
requests are unconditionally treated as UNSTABLE. However, in a
subsequent patch, nfsd_vfs_write() will be able to promote an
UNSTABLE WRITE to be a FILE_SYNC WRITE. This will be because NFSD
will handle some WRITE requests locally with O_DIRECT, which
persists written data immediately. The FILE_SYNC WRITE response
indicates to the client that no follow-up COMMIT is necessary.
This patch prepares for that change by enabling the modified
stable_how value to be passed along to NFSD's WRITE reply encoder.
No behavior change is expected.
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfs3proc.c | 2 +-
fs/nfsd/nfs4proc.c | 2 +-
fs/nfsd/nfsproc.c | 3 ++-
fs/nfsd/vfs.c | 11 ++++++-----
fs/nfsd/vfs.h | 6 ++++--
fs/nfsd/xdr3.h | 2 +-
6 files changed, 15 insertions(+), 11 deletions(-)
diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index b6d03e1ef5f7..ad14b34583bb 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -236,7 +236,7 @@ nfsd3_proc_write(struct svc_rqst *rqstp)
resp->committed = argp->stable;
resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
&argp->payload, &cnt,
- resp->committed, resp->verf);
+ &resp->committed, resp->verf);
resp->count = cnt;
resp->status = nfsd3_map_status(resp->status);
return rpc_success;
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 7f7e6bb23a90..2222bb283baf 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1285,7 +1285,7 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
write->wr_how_written = write->wr_stable_how;
status = nfsd_vfs_write(rqstp, &cstate->current_fh, nf,
write->wr_offset, &write->wr_payload,
- &cnt, write->wr_how_written,
+ &cnt, &write->wr_how_written,
(__be32 *)write->wr_verifier.data);
nfsd_file_put(nf);
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 8f71f5748c75..706401ed6f8d 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -251,6 +251,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
struct nfsd_writeargs *argp = rqstp->rq_argp;
struct nfsd_attrstat *resp = rqstp->rq_resp;
unsigned long cnt = argp->len;
+ u32 committed = NFS_DATA_SYNC;
dprintk("nfsd: WRITE %s %u bytes at %d\n",
SVCFH_fmt(&argp->fh),
@@ -258,7 +259,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
fh_copy(&resp->fh, &argp->fh);
resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
- &argp->payload, &cnt, NFS_DATA_SYNC, NULL);
+ &argp->payload, &cnt, &committed, NULL);
if (resp->status == nfs_ok)
resp->status = fh_getattr(&resp->fh, &resp->stat);
else if (resp->status == nfserr_jukebox)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f537a7b4ee01..8b2dc7a88aab 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1262,7 +1262,7 @@ static int wait_for_concurrent_writes(struct file *file)
* @offset: Byte offset of start
* @payload: xdr_buf containing the write payload
* @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
* @verf: NFS WRITE verifier
*
* Upon return, caller must invoke fh_put on @fhp.
@@ -1274,11 +1274,12 @@ __be32
nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct nfsd_file *nf, loff_t offset,
const struct xdr_buf *payload, unsigned long *cnt,
- int stable, __be32 *verf)
+ u32 *stable_how, __be32 *verf)
{
struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
struct file *file = nf->nf_file;
struct super_block *sb = file_inode(file)->i_sb;
+ u32 stable = *stable_how;
struct kiocb kiocb;
struct svc_export *exp;
struct iov_iter iter;
@@ -1434,7 +1435,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* @offset: Byte offset of start
* @payload: xdr_buf containing the write payload
* @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
* @verf: NFS WRITE verifier
*
* Upon return, caller must invoke fh_put on @fhp.
@@ -1444,7 +1445,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
*/
__be32
nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
- const struct xdr_buf *payload, unsigned long *cnt, int stable,
+ const struct xdr_buf *payload, unsigned long *cnt, u32 *stable_how,
__be32 *verf)
{
struct nfsd_file *nf;
@@ -1457,7 +1458,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
goto out;
err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt,
- stable, verf);
+ stable_how, verf);
nfsd_file_put(nf);
out:
trace_nfsd_write_done(rqstp, fhp, offset, *cnt);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index fa46f8b5f132..c713ed0b04e0 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -130,11 +130,13 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
u32 *eof);
__be32 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
loff_t offset, const struct xdr_buf *payload,
- unsigned long *cnt, int stable, __be32 *verf);
+ unsigned long *cnt, u32 *stable_how,
+ __be32 *verf);
__be32 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct nfsd_file *nf, loff_t offset,
const struct xdr_buf *payload,
- unsigned long *cnt, int stable, __be32 *verf);
+ unsigned long *cnt, u32 *stable_how,
+ __be32 *verf);
__be32 nfsd_readlink(struct svc_rqst *, struct svc_fh *,
char *, int *);
__be32 nfsd_symlink(struct svc_rqst *, struct svc_fh *,
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 522067b7fd75..c0e443ef3a6b 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -152,7 +152,7 @@ struct nfsd3_writeres {
__be32 status;
struct svc_fh fh;
unsigned long count;
- int committed;
+ u32 committed;
__be32 verf[2];
};
--
2.51.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [PATCH v5 2/4] NFSD: Refactor nfsd_vfs_write()
2025-10-20 16:25 [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-20 16:25 ` [PATCH v5 1/4] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
@ 2025-10-20 16:25 ` Chuck Lever
2025-10-20 16:32 ` Mike Snitzer
2025-10-21 10:51 ` Jeff Layton
2025-10-20 16:25 ` [PATCH v5 3/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (2 subsequent siblings)
4 siblings, 2 replies; 16+ messages in thread
From: Chuck Lever @ 2025-10-20 16:25 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Extract the common code that is to be used in the buffered and
dontcache I/O modes. This common code will also be used as the
fallback when direct I/O is requested but cannot be used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 8b2dc7a88aab..ae9c41f7374e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,22 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
+static int
+nfsd_iocb_write(struct svc_rqst *rqstp, struct file *file, unsigned int nvecs,
+ unsigned long *cnt, struct kiocb *kiocb)
+{
+ struct iov_iter iter;
+ int host_err;
+
+ iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+ host_err = vfs_iocb_iter_write(file, kiocb, &iter);
+ if (host_err < 0)
+ return host_err;
+ *cnt = host_err;
+
+ return 0;
+}
+
/**
* nfsd_vfs_write - write data to an already-open file
* @rqstp: RPC execution context
@@ -1282,7 +1298,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
u32 stable = *stable_how;
struct kiocb kiocb;
struct svc_export *exp;
- struct iov_iter iter;
errseq_t since;
__be32 nfserr;
int host_err;
@@ -1319,25 +1334,24 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
kiocb.ki_flags |= IOCB_DSYNC;
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
- iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
switch (nfsd_io_cache_write) {
- case NFSD_IO_BUFFERED:
- break;
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags |= IOCB_DONTCACHE;
+ fallthrough;
+ case NFSD_IO_BUFFERED:
+ host_err = nfsd_iocb_write(rqstp, file, nvecs, cnt, &kiocb);
break;
}
- host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
if (host_err < 0) {
commit_reset_write_verifier(nn, rqstp, host_err);
goto out_nfserr;
}
- *cnt = host_err;
nfsd_stats_io_write_add(nn, exp, *cnt);
fsnotify_modify(file);
host_err = filemap_check_wb_err(file->f_mapping, since);
--
2.51.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* Re: [PATCH v5 2/4] NFSD: Refactor nfsd_vfs_write()
2025-10-20 16:25 ` [PATCH v5 2/4] NFSD: Refactor nfsd_vfs_write() Chuck Lever
@ 2025-10-20 16:32 ` Mike Snitzer
2025-10-21 10:51 ` Jeff Layton
1 sibling, 0 replies; 16+ messages in thread
From: Mike Snitzer @ 2025-10-20 16:32 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On Mon, Oct 20, 2025 at 12:25:44PM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Extract the common code that is to be used in the buffered and
> dontcache I/O modes. This common code will also be used as the
> fallback when direct I/O is requested but cannot be used.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Looks good, but please keep it attributed to me. Feel free to add
Co-developed-by or something to reflect your assist?
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v5 2/4] NFSD: Refactor nfsd_vfs_write()
2025-10-20 16:25 ` [PATCH v5 2/4] NFSD: Refactor nfsd_vfs_write() Chuck Lever
2025-10-20 16:32 ` Mike Snitzer
@ 2025-10-21 10:51 ` Jeff Layton
1 sibling, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2025-10-21 10:51 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Mon, 2025-10-20 at 12:25 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Extract the common code that is to be used in the buffered and
> dontcache I/O modes. This common code will also be used as the
> fallback when direct I/O is requested but cannot be used.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 26 ++++++++++++++++++++------
> 1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 8b2dc7a88aab..ae9c41f7374e 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1254,6 +1254,22 @@ static int wait_for_concurrent_writes(struct file *file)
> return err;
> }
>
> +static int
> +nfsd_iocb_write(struct svc_rqst *rqstp, struct file *file, unsigned int nvecs,
> + unsigned long *cnt, struct kiocb *kiocb)
> +{
> + struct iov_iter iter;
> + int host_err;
> +
> + iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> + host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> + if (host_err < 0)
> + return host_err;
> + *cnt = host_err;
> +
> + return 0;
> +}
> +
> /**
> * nfsd_vfs_write - write data to an already-open file
> * @rqstp: RPC execution context
> @@ -1282,7 +1298,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> u32 stable = *stable_how;
> struct kiocb kiocb;
> struct svc_export *exp;
> - struct iov_iter iter;
> errseq_t since;
> __be32 nfserr;
> int host_err;
> @@ -1319,25 +1334,24 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> kiocb.ki_flags |= IOCB_DSYNC;
>
> nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> - iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +
> since = READ_ONCE(file->f_wb_err);
> if (verf)
> nfsd_copy_write_verifier(verf, nn);
>
> switch (nfsd_io_cache_write) {
> - case NFSD_IO_BUFFERED:
> - break;
> case NFSD_IO_DONTCACHE:
> if (file->f_op->fop_flags & FOP_DONTCACHE)
> kiocb.ki_flags |= IOCB_DONTCACHE;
> + fallthrough;
> + case NFSD_IO_BUFFERED:
> + host_err = nfsd_iocb_write(rqstp, file, nvecs, cnt, &kiocb);
> break;
> }
> - host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
> if (host_err < 0) {
> commit_reset_write_verifier(nn, rqstp, host_err);
> goto out_nfserr;
> }
> - *cnt = host_err;
> nfsd_stats_io_write_add(nn, exp, *cnt);
> fsnotify_modify(file);
> host_err = filemap_check_wb_err(file->f_mapping, since);
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v5 3/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-10-20 16:25 [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-20 16:25 ` [PATCH v5 1/4] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-10-20 16:25 ` [PATCH v5 2/4] NFSD: Refactor nfsd_vfs_write() Chuck Lever
@ 2025-10-20 16:25 ` Chuck Lever
2025-10-20 16:25 ` [PATCH v5 4/4] svcrdma: Mark Read chunks Chuck Lever
2025-10-20 16:33 ` [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Mike Snitzer
4 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2025-10-20 16:25 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer
From: Mike Snitzer <snitzer@kernel.org>
If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
middle and end as needed. The large middle extent is DIO-aligned and
the start and/or end are misaligned. Synchronous buffered IO (with
preference towards using DONTCACHE) is used for the misaligned extents
and O_DIRECT is used for the middle DIO-aligned extent.
nfsd_issue_write_dio() promotes @stable_how to NFS_FILE_SYNC, which
allows the client to drop its dirty data and avoid needing an extra
COMMIT operation.
If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
invalidate the page cache on behalf of the DIO WRITE, then
nfsd_issue_write_dio() will fall back to using buffered IO.
These changes served as the original starting point for the NFS
client's misaligned O_DIRECT support that landed with
commit c817248fc831 ("nfs/localio: add proper O_DIRECT support for
READ and WRITE"). But NFSD's support is simpler because it currently
doesn't use AIO completion.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/debugfs.c | 1 +
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 198 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 00eb1ecef6ac..7f44689e0a53 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
switch (val) {
case NFSD_IO_BUFFERED:
case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
nfsd_io_cache_write = val;
break;
default:
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index bfd41236aff2..ad74439d0105 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
DEFINE_NFSD_IO_EVENT(read_done);
DEFINE_NFSD_IO_EVENT(write_start);
DEFINE_NFSD_IO_EVENT(write_opened);
+DEFINE_NFSD_IO_EVENT(write_direct);
DEFINE_NFSD_IO_EVENT(write_io_done);
DEFINE_NFSD_IO_EVENT(write_done);
DEFINE_NFSD_IO_EVENT(commit_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ae9c41f7374e..6840fff37649 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,109 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
+struct nfsd_write_dio {
+ ssize_t start_len; /* Length for misaligned first extent */
+ ssize_t middle_len; /* Length for DIO-aligned middle extent */
+ ssize_t end_len; /* Length for misaligned last extent */
+};
+
+static bool
+nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
+ struct nfsd_file *nf,
+ struct nfsd_write_dio *write_dio)
+{
+ const u32 dio_blocksize = nf->nf_dio_offset_align;
+ loff_t start_end, orig_end, middle_end;
+
+ if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
+ return false;
+ if (unlikely(dio_blocksize > PAGE_SIZE))
+ return false;
+ if (unlikely(len < dio_blocksize))
+ return false;
+
+ start_end = round_up(offset, dio_blocksize);
+ orig_end = offset + len;
+ middle_end = round_down(orig_end, dio_blocksize);
+
+ write_dio->start_len = start_end - offset;
+ write_dio->middle_len = middle_end - start_end;
+ write_dio->end_len = orig_end - middle_end;
+
+ return true;
+}
+
+static bool
+nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
+ unsigned int len_mask)
+{
+ const struct bio_vec *bvec = i->bvec;
+ size_t skip = i->iov_offset;
+ size_t size = i->count;
+
+ if (size & len_mask)
+ return false;
+ do {
+ size_t len = bvec->bv_len;
+
+ if (len > size)
+ len = size;
+ if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
+ return false;
+ bvec++;
+ size -= len;
+ skip = 0;
+ } while (size);
+
+ return true;
+}
+
+/*
+ * Setup as many as 3 iov_iter based on extents described by @write_dio.
+ * Returns the number of iov_iter that were setup.
+ */
+static int
+nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
+ struct bio_vec *rq_bvec, unsigned int nvecs,
+ unsigned long cnt, struct nfsd_write_dio *write_dio,
+ struct nfsd_file *nf)
+{
+ int n_iters = 0;
+ struct iov_iter *iters = *iterp;
+
+ /* Setup misaligned start? */
+ if (write_dio->start_len) {
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ iters[n_iters].count = write_dio->start_len;
+ iter_is_dio_aligned[n_iters] = false;
+ ++n_iters;
+ }
+
+ /* Setup DIO-aligned middle */
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ if (write_dio->start_len)
+ iov_iter_advance(&iters[n_iters], write_dio->start_len);
+ iters[n_iters].count -= write_dio->end_len;
+ iter_is_dio_aligned[n_iters] =
+ nfsd_iov_iter_aligned_bvec(&iters[n_iters],
+ nf->nf_dio_mem_align - 1,
+ nf->nf_dio_offset_align - 1);
+ if (unlikely(!iter_is_dio_aligned[n_iters]))
+ return 0; /* no DIO-aligned IO possible */
+ ++n_iters;
+
+ /* Setup misaligned end? */
+ if (write_dio->end_len) {
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ iov_iter_advance(&iters[n_iters],
+ write_dio->start_len + write_dio->middle_len);
+ iter_is_dio_aligned[n_iters] = false;
+ ++n_iters;
+ }
+
+ return n_iters;
+}
+
static int
nfsd_iocb_write(struct svc_rqst *rqstp, struct file *file, unsigned int nvecs,
unsigned long *cnt, struct kiocb *kiocb)
@@ -1270,6 +1373,94 @@ nfsd_iocb_write(struct svc_rqst *rqstp, struct file *file, unsigned int nvecs,
return 0;
}
+static int
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
+ u32 *stable_how, unsigned int nvecs, unsigned long *cnt,
+ struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
+{
+ struct file *file = nf->nf_file;
+ bool iter_is_dio_aligned[3];
+ struct iov_iter iter_stack[3];
+ struct iov_iter *iter = iter_stack;
+ unsigned int n_iters = 0;
+ unsigned long in_count = *cnt;
+ loff_t in_offset = kiocb->ki_pos;
+ ssize_t host_err;
+
+ n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
+ rqstp->rq_bvec, nvecs, *cnt,
+ write_dio, nf);
+ if (unlikely(!n_iters))
+ return nfsd_iocb_write(rqstp, file, nvecs, cnt, kiocb);
+
+ trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
+
+ /*
+ * Any buffered IO issued here will be misaligned, use
+ * sync IO to ensure it has completed before returning.
+ * Also update @stable_how to avoid need for COMMIT.
+ */
+ kiocb->ki_flags |= IOCB_DSYNC;
+ *stable_how = NFS_FILE_SYNC;
+
+ *cnt = 0;
+ for (int i = 0; i < n_iters; i++) {
+ if (iter_is_dio_aligned[i])
+ kiocb->ki_flags |= IOCB_DIRECT;
+ else
+ kiocb->ki_flags &= ~IOCB_DIRECT;
+
+ host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
+ if (host_err < 0) {
+ /*
+ * VFS will return -ENOTBLK if DIO WRITE fails to
+ * invalidate the page cache. Retry using buffered IO.
+ */
+ if (unlikely(host_err == -ENOTBLK)) {
+ kiocb->ki_flags &= ~IOCB_DIRECT;
+ *cnt = in_count;
+ kiocb->ki_pos = in_offset;
+ return nfsd_iocb_write(rqstp, file, nvecs, cnt,
+ kiocb);
+ } else if (unlikely(host_err == -EINVAL)) {
+ struct inode *inode = d_inode(fhp->fh_dentry);
+
+ pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
+ inode->i_sb->s_id, inode->i_ino);
+ host_err = -ESERVERFAULT;
+ }
+ return host_err;
+ }
+ *cnt += host_err;
+ if (host_err < iter[i].count) /* partial write? */
+ break;
+ }
+
+ return 0;
+}
+
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
+ unsigned long *cnt, struct kiocb *kiocb)
+{
+ struct nfsd_write_dio write_dio;
+
+ /*
+ * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
+ * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
+ * be ignored for any DIO issued here).
+ */
+ if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+ kiocb->ki_flags |= IOCB_DONTCACHE;
+
+ if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
+ return nfsd_issue_write_dio(rqstp, fhp, nf, stable_how, nvecs,
+ cnt, kiocb, &write_dio);
+
+ return nfsd_iocb_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
+}
+
/**
* nfsd_vfs_write - write data to an already-open file
* @rqstp: RPC execution context
@@ -1340,6 +1531,11 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
nfsd_copy_write_verifier(verf, nn);
switch (nfsd_io_cache_write) {
+ case NFSD_IO_DIRECT:
+ host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
+ nvecs, cnt, &kiocb);
+ stable = *stable_how;
+ break;
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags |= IOCB_DONTCACHE;
--
2.51.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [PATCH v5 4/4] svcrdma: Mark Read chunks
2025-10-20 16:25 [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (2 preceding siblings ...)
2025-10-20 16:25 ` [PATCH v5 3/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-20 16:25 ` Chuck Lever
2025-10-21 11:07 ` Jeff Layton
2025-10-20 16:33 ` [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Mike Snitzer
4 siblings, 1 reply; 16+ messages in thread
From: Chuck Lever @ 2025-10-20 16:25 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
The upper layer may want to know when the receive buffer's .pages
array is guaranteed to contain only an opaque payload. This permits
the upper layer to optimize its buffer handling.
NB: Since svc_rdma_recvfrom.c is under net/, we use the comment
style that is preferred in the networking layer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index e7e4a39ca6c6..b1a0c72f73de 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -815,6 +815,11 @@ static void svc_rdma_read_complete_one(struct svc_rqst *rqstp,
buf->page_len = length;
buf->len += length;
buf->buflen += length;
+
+ /* Transport guarantees that only the chunk payload
+ * appears in buf->pages.
+ */
+ buf->flags |= XDRBUF_READ;
}
/* Finish constructing the RPC Call message in rqstp::rq_arg.
--
2.51.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* Re: [PATCH v5 4/4] svcrdma: Mark Read chunks
2025-10-20 16:25 ` [PATCH v5 4/4] svcrdma: Mark Read chunks Chuck Lever
@ 2025-10-21 11:07 ` Jeff Layton
0 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2025-10-21 11:07 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Mon, 2025-10-20 at 12:25 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> The upper layer may want to know when the receive buffer's .pages
> array is guaranteed to contain only an opaque payload. This permits
> the upper layer to optimize its buffer handling.
>
> NB: Since svc_rdma_recvfrom.c is under net/, we use the comment
> style that is preferred in the networking layer.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index e7e4a39ca6c6..b1a0c72f73de 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -815,6 +815,11 @@ static void svc_rdma_read_complete_one(struct svc_rqst *rqstp,
> buf->page_len = length;
> buf->len += length;
> buf->buflen += length;
> +
> + /* Transport guarantees that only the chunk payload
> + * appears in buf->pages.
> + */
> + buf->flags |= XDRBUF_READ;
> }
>
> /* Finish constructing the RPC Call message in rqstp::rq_arg.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-10-20 16:25 [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (3 preceding siblings ...)
2025-10-20 16:25 ` [PATCH v5 4/4] svcrdma: Mark Read chunks Chuck Lever
@ 2025-10-20 16:33 ` Mike Snitzer
2025-10-20 16:44 ` Chuck Lever
4 siblings, 1 reply; 16+ messages in thread
From: Mike Snitzer @ 2025-10-20 16:33 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On Mon, Oct 20, 2025 at 12:25:42PM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Following on https://lore.kernel.org/linux-nfs/aPAci7O_XK1ljaum@kernel.org/
> this series includes all patches needed to make NFSD Direct WRITE
> work, plus an experimental follow-on patch.
>
> Mike, just send along any responses to review comments as patch
> snippets and I will apply them as needed.
>
> Changes since v4:
> * Split out refactoring nfsd_buffered_write() into a separate patch
> * Expand patch description of 1/4
> * Don't set IOCB_SYNC flag
Sure thing. Just a bit concerned about removing IOCB_SYNC in that
we're altering stable_how to be NFS_FILE_SYNC.
Mike
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-10-20 16:33 ` [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Mike Snitzer
@ 2025-10-20 16:44 ` Chuck Lever
2025-10-20 18:20 ` Chuck Lever
2025-10-21 11:12 ` Jeff Layton
0 siblings, 2 replies; 16+ messages in thread
From: Chuck Lever @ 2025-10-20 16:44 UTC (permalink / raw)
To: Mike Snitzer
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On 10/20/25 12:33 PM, Mike Snitzer wrote:
> On Mon, Oct 20, 2025 at 12:25:42PM -0400, Chuck Lever wrote:
> Just a bit concerned about removing IOCB_SYNC in that
> we're altering stable_how to be NFS_FILE_SYNC.
Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") introduces
the first use of IOCB_ flags in NFSD's write path, and it uses
IOCB_DSYNC. The patch has Reviewed-by's from Christoph, Neil, and
Jeff.
Should we be concerned that IOCB_DSYNC does not persist time stamp
changes that might be lost during an unplanned server boot?
As a reminder to the thread, Section 3.3.7 of RFC 1813 says:
If stable is FILE_SYNC, the server must commit the data
written plus all file system metadata to stable storage
before returning results.
The text is a bit blurry about whether "file system metadata" means
all of the outstanding metadata changes for every file, or just the
metadata changes for the target file handle.
NFSD has historically treated DATA_SYNC and FILE_SYNC identically,
as the Linux NFS client does not use DATA_SYNC (IIRC).
--
Chuck Lever
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-10-20 16:44 ` Chuck Lever
@ 2025-10-20 18:20 ` Chuck Lever
2025-10-21 11:12 ` Jeff Layton
1 sibling, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2025-10-20 18:20 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Mike Snitzer
On 10/20/25 12:44 PM, Chuck Lever wrote:
> On 10/20/25 12:33 PM, Mike Snitzer wrote:
>> On Mon, Oct 20, 2025 at 12:25:42PM -0400, Chuck Lever wrote:
>> Just a bit concerned about removing IOCB_SYNC in that
>> we're altering stable_how to be NFS_FILE_SYNC.
> Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") introduces
> the first use of IOCB_ flags in NFSD's write path, and it uses
> IOCB_DSYNC. The patch has Reviewed-by's from Christoph, Neil, and
> Jeff.
>
> Should we be concerned that IOCB_DSYNC does not persist time stamp
> changes that might be lost during an unplanned server boot?
One reason to mention 3f3503adb332 is that if we decide IOCB_SYNC is
necessary to correctly implement NFS_FILE_SYNC, then a separate,
backport-able, patch will be needed to replace the flag that was used in
that patch.
--
Chuck Lever
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-10-20 16:44 ` Chuck Lever
2025-10-20 18:20 ` Chuck Lever
@ 2025-10-21 11:12 ` Jeff Layton
2025-10-21 13:35 ` Chuck Lever
1 sibling, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2025-10-21 11:12 UTC (permalink / raw)
To: Chuck Lever, Mike Snitzer
Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever
On Mon, 2025-10-20 at 12:44 -0400, Chuck Lever wrote:
> On 10/20/25 12:33 PM, Mike Snitzer wrote:
> > On Mon, Oct 20, 2025 at 12:25:42PM -0400, Chuck Lever wrote:
> > Just a bit concerned about removing IOCB_SYNC in that
> > we're altering stable_how to be NFS_FILE_SYNC.
> Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") introduces
> the first use of IOCB_ flags in NFSD's write path, and it uses
> IOCB_DSYNC. The patch has Reviewed-by's from Christoph, Neil, and
> Jeff.
>
> Should we be concerned that IOCB_DSYNC does not persist time stamp
> changes that might be lost during an unplanned server boot?
>
> As a reminder to the thread, Section 3.3.7 of RFC 1813 says:
>
> If stable is FILE_SYNC, the server must commit the data
> written plus all file system metadata to stable storage
> before returning results.
>
> The text is a bit blurry about whether "file system metadata" means
> all of the outstanding metadata changes for every file, or just the
> metadata changes for the target file handle.
>
> NFSD has historically treated DATA_SYNC and FILE_SYNC identically,
> as the Linux NFS client does not use DATA_SYNC (IIRC).
>
Surely it just meant for the one file. FILE_SYNC is only applicable to
WRITE/COMMIT operations and those only deal with a single file at a
time. If the client gets back FILE_SYNC on a write, it should _not_
assume that all outstanding dirty data to all files has been sync'ed.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-10-21 11:12 ` Jeff Layton
@ 2025-10-21 13:35 ` Chuck Lever
2025-10-21 13:45 ` Jeff Layton
2025-10-22 17:00 ` Mike Snitzer
0 siblings, 2 replies; 16+ messages in thread
From: Chuck Lever @ 2025-10-21 13:35 UTC (permalink / raw)
To: Jeff Layton, Mike Snitzer
Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever
On 10/21/25 7:12 AM, Jeff Layton wrote:
> On Mon, 2025-10-20 at 12:44 -0400, Chuck Lever wrote:
>> On 10/20/25 12:33 PM, Mike Snitzer wrote:
>>> On Mon, Oct 20, 2025 at 12:25:42PM -0400, Chuck Lever wrote:
>>> Just a bit concerned about removing IOCB_SYNC in that
>>> we're altering stable_how to be NFS_FILE_SYNC.
>> Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") introduces
>> the first use of IOCB_ flags in NFSD's write path, and it uses
>> IOCB_DSYNC. The patch has Reviewed-by's from Christoph, Neil, and
>> Jeff.
>>
>> Should we be concerned that IOCB_DSYNC does not persist time stamp
>> changes that might be lost during an unplanned server boot?
>>
>> As a reminder to the thread, Section 3.3.7 of RFC 1813 says:
>>
>> If stable is FILE_SYNC, the server must commit the data
>> written plus all file system metadata to stable storage
>> before returning results.
>>
>> The text is a bit blurry about whether "file system metadata" means
>> all of the outstanding metadata changes for every file, or just the
>> metadata changes for the target file handle.
>>
>> NFSD has historically treated DATA_SYNC and FILE_SYNC identically,
>> as the Linux NFS client does not use DATA_SYNC (IIRC).
>>
>
> Surely it just meant for the one file.
Well yes that is the traditional understanding. I'm merely pointing out
that the actual text is not quite as specific as what we've come to
understand.
> FILE_SYNC is only applicable to
> WRITE/COMMIT operations and those only deal with a single file at a
> time.
True but you may recall that NFSD's COMMIT used to ignore the range
arguments and flush the whole file. Some file systems used to flush
all dirty data in this case, IIRC.
There's always been a bit of a mismatch between the spec and what NFSD
has implemented.
> If the client gets back FILE_SYNC on a write, it should _not_
> assume that all outstanding dirty data to all files has been sync'ed.
Agreed.
But back to Mike's point.
- The spec says NFS_DATA_SYNC means persist file data.
- The spec says NFS_FILE_SYNC means persist file data and file
attributes.
- After consulting with the section describing COMMIT, I think that
COMMIT is supposed to persist both file data and attributes.
And my reading of the code in fs/nfsd/vfs.c is that NFSD does the
equivalent of NFS_DATA_SYNC in all of these cases, and has done for
as long as I cared to chase the commit log.
Moveover, commit 3f3503adb332 did not introduce this behavior.
Previous to that commit, nfsd_vfs_write() passed RWF_SYNC to
vfs_iter_write(). This API uses kiocb_set_rw_flags() to convert the RWF
flag into an IOCB flag. kiocb_set_rw_flags does this:
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;
And that's where I copied IOCB_DSYNC from. The use of RWF_SYNC was
introduced in 2016 by commit 24368aad47dc ("nfsd: use RWF_SYNC").
So we've tacitly agreed to let NFSD fall short of the specs in this
regard for some time. However I don't believe this is documented
anywhere.
Based on this reasoning, IOCB_DSYNC is historically correct for the
DIRECT WRITE path and its fallbacks. I'm guessing that an O_DIRECT WRITE
is going to persist the written data but won't persist file attribute
changes either.
I'm open to making NFSD adhere more strictly to the spec language, but
I bet there will be a performance impact. Maybe that impact will be
unnoticeable on modern storage devices.
--
Chuck Lever
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-10-21 13:35 ` Chuck Lever
@ 2025-10-21 13:45 ` Jeff Layton
2025-10-21 13:53 ` Chuck Lever
2025-10-22 17:00 ` Mike Snitzer
1 sibling, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2025-10-21 13:45 UTC (permalink / raw)
To: Chuck Lever, Mike Snitzer
Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever
On Tue, 2025-10-21 at 09:35 -0400, Chuck Lever wrote:
> On 10/21/25 7:12 AM, Jeff Layton wrote:
> > On Mon, 2025-10-20 at 12:44 -0400, Chuck Lever wrote:
> > > On 10/20/25 12:33 PM, Mike Snitzer wrote:
> > > > On Mon, Oct 20, 2025 at 12:25:42PM -0400, Chuck Lever wrote:
> > > > Just a bit concerned about removing IOCB_SYNC in that
> > > > we're altering stable_how to be NFS_FILE_SYNC.
> > > Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") introduces
> > > the first use of IOCB_ flags in NFSD's write path, and it uses
> > > IOCB_DSYNC. The patch has Reviewed-by's from Christoph, Neil, and
> > > Jeff.
> > >
> > > Should we be concerned that IOCB_DSYNC does not persist time stamp
> > > changes that might be lost during an unplanned server boot?
> > >
> > > As a reminder to the thread, Section 3.3.7 of RFC 1813 says:
> > >
> > > If stable is FILE_SYNC, the server must commit the data
> > > written plus all file system metadata to stable storage
> > > before returning results.
> > >
> > > The text is a bit blurry about whether "file system metadata" means
> > > all of the outstanding metadata changes for every file, or just the
> > > metadata changes for the target file handle.
> > >
> > > NFSD has historically treated DATA_SYNC and FILE_SYNC identically,
> > > as the Linux NFS client does not use DATA_SYNC (IIRC).
> > >
> >
> > Surely it just meant for the one file.
>
> Well yes that is the traditional understanding. I'm merely pointing out
> that the actual text is not quite as specific as what we've come to
> understand.
>
>
> > FILE_SYNC is only applicable to
> > WRITE/COMMIT operations and those only deal with a single file at a
> > time.
>
> True but you may recall that NFSD's COMMIT used to ignore the range
> arguments and flush the whole file. Some file systems used to flush
> all dirty data in this case, IIRC.
>
> There's always been a bit of a mismatch between the spec and what NFSD
> has implemented.
>
>
> > If the client gets back FILE_SYNC on a write, it should _not_
> > assume that all outstanding dirty data to all files has been sync'ed.
>
> Agreed.
>
> But back to Mike's point.
>
> - The spec says NFS_DATA_SYNC means persist file data.
>
> - The spec says NFS_FILE_SYNC means persist file data and file
> attributes.
>
> - After consulting with the section describing COMMIT, I think that
> COMMIT is supposed to persist both file data and attributes.
>
> And my reading of the code in fs/nfsd/vfs.c is that NFSD does the
> equivalent of NFS_DATA_SYNC in all of these cases, and has done for
> as long as I cared to chase the commit log.
>
> Moveover, commit 3f3503adb332 did not introduce this behavior.
>
> Previous to that commit, nfsd_vfs_write() passed RWF_SYNC to
> vfs_iter_write(). This API uses kiocb_set_rw_flags() to convert the RWF
> flag into an IOCB flag. kiocb_set_rw_flags does this:
>
> kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
> if (flags & RWF_SYNC)
> kiocb_flags |= IOCB_DSYNC;
>
> And that's where I copied IOCB_DSYNC from. The use of RWF_SYNC was
> introduced in 2016 by commit 24368aad47dc ("nfsd: use RWF_SYNC").
>
> So we've tacitly agreed to let NFSD fall short of the specs in this
> regard for some time. However I don't believe this is documented
> anywhere.
>
> Based on this reasoning, IOCB_DSYNC is historically correct for the
> DIRECT WRITE path and its fallbacks. I'm guessing that an O_DIRECT WRITE
> is going to persist the written data but won't persist file attribute
> changes either.
>
I think that's the case, generally.
> I'm open to making NFSD adhere more strictly to the spec language, but
> I bet there will be a performance impact. Maybe that impact will be
> unnoticeable on modern storage devices.
>
Ok, I had missed the context that we had been doing this all along
anyway. In that case, IOCB_DSYNC seems like it's probably acceptable.
We likely have bigger cache coherency problems outstanding than
potential timestamp rollbacks anyway.
Alternately, we could just return NFS_DATA_SYNC, but then we'd have to
deal with follow-on commits.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-10-21 13:45 ` Jeff Layton
@ 2025-10-21 13:53 ` Chuck Lever
0 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2025-10-21 13:53 UTC (permalink / raw)
To: Jeff Layton, Mike Snitzer
Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever
On 10/21/25 9:45 AM, Jeff Layton wrote:
> On Tue, 2025-10-21 at 09:35 -0400, Chuck Lever wrote:
>> On 10/21/25 7:12 AM, Jeff Layton wrote:
>>> On Mon, 2025-10-20 at 12:44 -0400, Chuck Lever wrote:
>>>> On 10/20/25 12:33 PM, Mike Snitzer wrote:
>>>>> On Mon, Oct 20, 2025 at 12:25:42PM -0400, Chuck Lever wrote:
>>>>> Just a bit concerned about removing IOCB_SYNC in that
>>>>> we're altering stable_how to be NFS_FILE_SYNC.
>>>> Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") introduces
>>>> the first use of IOCB_ flags in NFSD's write path, and it uses
>>>> IOCB_DSYNC. The patch has Reviewed-by's from Christoph, Neil, and
>>>> Jeff.
>>>>
>>>> Should we be concerned that IOCB_DSYNC does not persist time stamp
>>>> changes that might be lost during an unplanned server boot?
>>>>
>>>> As a reminder to the thread, Section 3.3.7 of RFC 1813 says:
>>>>
>>>> If stable is FILE_SYNC, the server must commit the data
>>>> written plus all file system metadata to stable storage
>>>> before returning results.
>>>>
>>>> The text is a bit blurry about whether "file system metadata" means
>>>> all of the outstanding metadata changes for every file, or just the
>>>> metadata changes for the target file handle.
>>>>
>>>> NFSD has historically treated DATA_SYNC and FILE_SYNC identically,
>>>> as the Linux NFS client does not use DATA_SYNC (IIRC).
>>>>
>>>
>>> Surely it just meant for the one file.
>>
>> Well yes that is the traditional understanding. I'm merely pointing out
>> that the actual text is not quite as specific as what we've come to
>> understand.
>>
>>
>>> FILE_SYNC is only applicable to
>>> WRITE/COMMIT operations and those only deal with a single file at a
>>> time.
>>
>> True but you may recall that NFSD's COMMIT used to ignore the range
>> arguments and flush the whole file. Some file systems used to flush
>> all dirty data in this case, IIRC.
>>
>> There's always been a bit of a mismatch between the spec and what NFSD
>> has implemented.
>>
>>
>>> If the client gets back FILE_SYNC on a write, it should _not_
>>> assume that all outstanding dirty data to all files has been sync'ed.
>>
>> Agreed.
>>
>> But back to Mike's point.
>>
>> - The spec says NFS_DATA_SYNC means persist file data.
>>
>> - The spec says NFS_FILE_SYNC means persist file data and file
>> attributes.
>>
>> - After consulting with the section describing COMMIT, I think that
>> COMMIT is supposed to persist both file data and attributes.
>>
>> And my reading of the code in fs/nfsd/vfs.c is that NFSD does the
>> equivalent of NFS_DATA_SYNC in all of these cases, and has done for
>> as long as I cared to chase the commit log.
>>
>> Moveover, commit 3f3503adb332 did not introduce this behavior.
>>
>> Previous to that commit, nfsd_vfs_write() passed RWF_SYNC to
>> vfs_iter_write(). This API uses kiocb_set_rw_flags() to convert the RWF
>> flag into an IOCB flag. kiocb_set_rw_flags does this:
>>
>> kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
>> if (flags & RWF_SYNC)
>> kiocb_flags |= IOCB_DSYNC;
>>
>> And that's where I copied IOCB_DSYNC from. The use of RWF_SYNC was
>> introduced in 2016 by commit 24368aad47dc ("nfsd: use RWF_SYNC").
>>
>> So we've tacitly agreed to let NFSD fall short of the specs in this
>> regard for some time. However I don't believe this is documented
>> anywhere.
>>
>> Based on this reasoning, IOCB_DSYNC is historically correct for the
>> DIRECT WRITE path and its fallbacks. I'm guessing that an O_DIRECT WRITE
>> is going to persist the written data but won't persist file attribute
>> changes either.
>>
>
> I think that's the case, generally.
>
>> I'm open to making NFSD adhere more strictly to the spec language, but
>> I bet there will be a performance impact. Maybe that impact will be
>> unnoticeable on modern storage devices.
>>
>
> Ok, I had missed the context that we had been doing this all along
> anyway. In that case, IOCB_DSYNC seems like it's probably acceptable.
> We likely have bigger cache coherency problems outstanding than
> potential timestamp rollbacks anyway.
>
> Alternately, we could just return NFS_DATA_SYNC, but then we'd have to
> deal with follow-on commits.
But what's the point of asking the client to send us a COMMIT if we're
still not going to sync the file attributes? ;-)
--
Chuck Lever
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v5 0/4] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-10-21 13:35 ` Chuck Lever
2025-10-21 13:45 ` Jeff Layton
@ 2025-10-22 17:00 ` Mike Snitzer
1 sibling, 0 replies; 16+ messages in thread
From: Mike Snitzer @ 2025-10-22 17:00 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever, jonathan.flynn
On Tue, Oct 21, 2025 at 09:35:32AM -0400, Chuck Lever wrote:
> On 10/21/25 7:12 AM, Jeff Layton wrote:
> > On Mon, 2025-10-20 at 12:44 -0400, Chuck Lever wrote:
> >> On 10/20/25 12:33 PM, Mike Snitzer wrote:
> >>> On Mon, Oct 20, 2025 at 12:25:42PM -0400, Chuck Lever wrote:
> >>> Just a bit concerned about removing IOCB_SYNC in that
> >>> we're altering stable_how to be NFS_FILE_SYNC.
> >> Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") introduces
> >> the first use of IOCB_ flags in NFSD's write path, and it uses
> >> IOCB_DSYNC. The patch has Reviewed-by's from Christoph, Neil, and
> >> Jeff.
> >>
> >> Should we be concerned that IOCB_DSYNC does not persist time stamp
> >> changes that might be lost during an unplanned server boot?
> >>
> >> As a reminder to the thread, Section 3.3.7 of RFC 1813 says:
> >>
> >> If stable is FILE_SYNC, the server must commit the data
> >> written plus all file system metadata to stable storage
> >> before returning results.
> >>
> >> The text is a bit blurry about whether "file system metadata" means
> >> all of the outstanding metadata changes for every file, or just the
> >> metadata changes for the target file handle.
> >>
> >> NFSD has historically treated DATA_SYNC and FILE_SYNC identically,
> >> as the Linux NFS client does not use DATA_SYNC (IIRC).
> >>
> >
> > Surely it just meant for the one file.
>
> Well yes that is the traditional understanding. I'm merely pointing out
> that the actual text is not quite as specific as what we've come to
> understand.
>
>
> > FILE_SYNC is only applicable to
> > WRITE/COMMIT operations and those only deal with a single file at a
> > time.
>
> True but you may recall that NFSD's COMMIT used to ignore the range
> arguments and flush the whole file. Some file systems used to flush
> all dirty data in this case, IIRC.
>
> There's always been a bit of a mismatch between the spec and what NFSD
> has implemented.
>
>
> > If the client gets back FILE_SYNC on a write, it should _not_
> > assume that all outstanding dirty data to all files has been sync'ed.
>
> Agreed.
>
> But back to Mike's point.
>
> - The spec says NFS_DATA_SYNC means persist file data.
>
> - The spec says NFS_FILE_SYNC means persist file data and file
> attributes.
>
> - After consulting with the section describing COMMIT, I think that
> COMMIT is supposed to persist both file data and attributes.
>
> And my reading of the code in fs/nfsd/vfs.c is that NFSD does the
> equivalent of NFS_DATA_SYNC in all of these cases, and has done for
> as long as I cared to chase the commit log.
>
> Moveover, commit 3f3503adb332 did not introduce this behavior.
>
> Previous to that commit, nfsd_vfs_write() passed RWF_SYNC to
> vfs_iter_write(). This API uses kiocb_set_rw_flags() to convert the RWF
> flag into an IOCB flag. kiocb_set_rw_flags does this:
>
> kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
> if (flags & RWF_SYNC)
> kiocb_flags |= IOCB_DSYNC;
>
> And that's where I copied IOCB_DSYNC from. The use of RWF_SYNC was
> introduced in 2016 by commit 24368aad47dc ("nfsd: use RWF_SYNC").
>
> So we've tacitly agreed to let NFSD fall short of the specs in this
> regard for some time. However I don't believe this is documented
> anywhere.
>
> Based on this reasoning, IOCB_DSYNC is historically correct for the
> DIRECT WRITE path and its fallbacks. I'm guessing that an O_DIRECT WRITE
> is going to persist the written data but won't persist file attribute
> changes either.
>
> I'm open to making NFSD adhere more strictly to the spec language, but
> I bet there will be a performance impact. Maybe that impact will be
> unnoticeable on modern storage devices.
I was able to get NFSD Direct's use of IOCB_DSYNC|IOCB_SYNC tested
with Jonathan Flynn last week, there was no performance difference on
his modern test cluster (8 "enterprise" NVMe devices in the server).
Was very comforting to learn IOCB_SYNC didn't cause any performance
loss.
Mike
^ permalink raw reply [flat|nested] 16+ messages in thread