* [PATCH v10 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
@ 2025-11-05 19:28 Chuck Lever
2025-11-05 19:28 ` [PATCH v10 1/5] NFSD: don't start nfsd if sv_permsocks is empty Chuck Lever
` (4 more replies)
0 siblings, 5 replies; 26+ messages in thread
From: Chuck Lever @ 2025-11-05 19:28 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Following on from https://lore.kernel.org/linux-nfs/aPAci7O_XK1ljaum@kernel.org/,
this series includes the patches needed to make NFSD Direct WRITE
operational.
As requested by several reviewers, I've combined all the review
patches into a single patch. I've removed R-b's for the resulting
patches.
The series compiles and passes static analysis checks but still
needs functional and performance testing.
Changes since v9:
* Unaligned segments no longer use IOCB_DONTCACHE
* Squashed all review patches into Mike's initial patch
* Squashed Mike's documentation update into the final patch
Changes since v8:
* Drop "NFSD: Handle both offset and memory alignment for direct I/O"
* Include the Sep 3 version of the Documentation update
Changes since v7:
* Rebase the series on Mike's original v3 patch
* Address more review comments
* Optimize the "when can NFSD use IOCB_DIRECT" logic
* Revert the "always promote to FILE_SYNC" logic
Changes since v6:
* Patches to address review comments have been split out
* Refactored the iter initialization code
Changes since v5:
* Add a patch to make FILE_SYNC WRITEs persist timestamps
* Address some of Christoph's review comments
* The svcrdma patch has been dropped until we actually need it
Changes since v4:
* Split out refactoring nfsd_buffered_write() into a separate patch
* Expand patch description of 1/4
* Don't set IOCB_SYNC flag
Changes since v3:
* Address checkpatch.pl nits in 2/3
* Add an untested patch to mark ingress RDMA Read chunks
Chuck Lever (2):
NFSD: Make FILE_SYNC WRITEs comply with spec
NFSD: Enable return of an updated stable_how to NFS clients
Mike Snitzer (2):
NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
Olga Kornievskaia (1):
NFSD: don't start nfsd if sv_permsocks is empty
.../filesystems/nfs/nfsd-io-modes.rst | 150 ++++++++++++++
fs/nfsd/debugfs.c | 1 +
fs/nfsd/nfs3proc.c | 2 +-
fs/nfsd/nfs4proc.c | 2 +-
fs/nfsd/nfsproc.c | 3 +-
fs/nfsd/nfssvc.c | 28 +--
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 195 +++++++++++++++++-
fs/nfsd/vfs.h | 6 +-
fs/nfsd/xdr3.h | 2 +-
10 files changed, 350 insertions(+), 40 deletions(-)
create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
--
2.51.0
* [PATCH v10 1/5] NFSD: don't start nfsd if sv_permsocks is empty
2025-11-05 19:28 [PATCH v10 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-11-05 19:28 ` Chuck Lever
2025-11-05 19:31 ` Chuck Lever
2025-11-05 19:28 ` [PATCH v10 2/5] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
` (3 subsequent siblings)
4 siblings, 1 reply; 26+ messages in thread
From: Chuck Lever @ 2025-11-05 19:28 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey; +Cc: linux-nfs
From: Olga Kornievskaia <okorniev@redhat.com>
Previously, while creating a server instance, if no listening
sockets were present, default UDP and TCP listeners were
created. It's unclear what the purpose of starting these
listeners was, or how this could have been triggered by the
userland setup. This patch instead ensures that we never end
up in a situation where nfsd threads are being created but no
listener sockets have been configured.
The problem it solves is: when nfs.conf only has tcp=n (and
nothing else for the choice of transports), nfsdctl would
still start the server and create udp and tcp listeners.
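For example, an nfs.conf that contains only:

	[nfsd]
	tcp=n

would previously still result in default udp and tcp listeners
being created when the server started.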
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfssvc.c | 28 +++++-----------------------
1 file changed, 5 insertions(+), 23 deletions(-)
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 7057ddd7a0a8..b08ae85d53ef 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -249,27 +249,6 @@ int nfsd_nrthreads(struct net *net)
return rv;
}
-static int nfsd_init_socks(struct net *net, const struct cred *cred)
-{
- int error;
- struct nfsd_net *nn = net_generic(net, nfsd_net_id);
-
- if (!list_empty(&nn->nfsd_serv->sv_permsocks))
- return 0;
-
- error = svc_xprt_create(nn->nfsd_serv, "udp", net, PF_INET, NFS_PORT,
- SVC_SOCK_DEFAULTS, cred);
- if (error < 0)
- return error;
-
- error = svc_xprt_create(nn->nfsd_serv, "tcp", net, PF_INET, NFS_PORT,
- SVC_SOCK_DEFAULTS, cred);
- if (error < 0)
- return error;
-
- return 0;
-}
-
static int nfsd_users = 0;
static int nfsd_startup_generic(void)
@@ -377,9 +356,12 @@ static int nfsd_startup_net(struct net *net, const struct cred *cred)
ret = nfsd_startup_generic();
if (ret)
return ret;
- ret = nfsd_init_socks(net, cred);
- if (ret)
+
+ if (list_empty(&nn->nfsd_serv->sv_permsocks)) {
+ pr_warn("NFSD: Failed to start, no listeners configured.\n");
+ ret = -EIO;
goto out_socks;
+ }
if (nfsd_needs_lockd(nn) && !nn->lockd_up) {
ret = lockd_up(net, cred);
--
2.51.0
* [PATCH v10 2/5] NFSD: Make FILE_SYNC WRITEs comply with spec
2025-11-05 19:28 [PATCH v10 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-05 19:28 ` [PATCH v10 1/5] NFSD: don't start nfsd if sv_permsocks is empty Chuck Lever
@ 2025-11-05 19:28 ` Chuck Lever
2025-11-06 0:55 ` NeilBrown
2025-11-06 13:05 ` Christoph Hellwig
2025-11-05 19:28 ` [PATCH v10 3/5] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
` (2 subsequent siblings)
4 siblings, 2 replies; 26+ messages in thread
From: Chuck Lever @ 2025-11-05 19:28 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Mike Snitzer, stable
From: Chuck Lever <chuck.lever@oracle.com>
Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
does not also persist file time stamps. To wit, Section 18.32.3
of RFC 8881 mandates:
> The client specifies with the stable parameter the method of how
> the data is to be processed by the server. If stable is
> FILE_SYNC4, the server MUST commit the data written plus all file
> system metadata to stable storage before returning results. This
> corresponds to the NFSv2 protocol semantics. Any other behavior
> constitutes a protocol violation. If stable is DATA_SYNC4, then
> the server MUST commit all of the data to stable storage and
> enough of the metadata to retrieve the data before returning.
Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") replaced:
- flags |= RWF_SYNC;
with:
+ kiocb.ki_flags |= IOCB_DSYNC;
which appears to be correct given:
if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;
in kiocb_set_rw_flags(). However, the author of that commit did not
appreciate that the previous line in kiocb_set_rw_flags() results
in IOCB_SYNC also being set:
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
RWF_SUPPORTED contains RWF_SYNC, and RWF_SYNC is the same bit as
IOCB_SYNC. Reviewers at the time did not catch the omission.
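In other words (an illustrative summary of the logic quoted above,
not a verbatim kernel excerpt), the pre-conversion RWF_SYNC path
effectively did:

	kiocb_flags |= IOCB_SYNC;	/* via flags & RWF_SUPPORTED */
	kiocb_flags |= IOCB_DSYNC;	/* via the RWF_SYNC check */

whereas the converted code sets only IOCB_DSYNC.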
Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()")
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f537a7b4ee01..5333d49910d9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1314,8 +1314,18 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
stable = NFS_UNSTABLE;
init_sync_kiocb(&kiocb, file);
kiocb.ki_pos = offset;
- if (stable && !fhp->fh_use_wgather)
- kiocb.ki_flags |= IOCB_DSYNC;
+ if (likely(!fhp->fh_use_wgather)) {
+ switch (stable) {
+ case NFS_FILE_SYNC:
+ /* persist data and timestamps */
+ kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
+ break;
+ case NFS_DATA_SYNC:
+ /* persist data only */
+ kiocb.ki_flags |= IOCB_DSYNC;
+ break;
+ }
+ }
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
--
2.51.0
* [PATCH v10 3/5] NFSD: Enable return of an updated stable_how to NFS clients
2025-11-05 19:28 [PATCH v10 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-05 19:28 ` [PATCH v10 1/5] NFSD: don't start nfsd if sv_permsocks is empty Chuck Lever
2025-11-05 19:28 ` [PATCH v10 2/5] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
@ 2025-11-05 19:28 ` Chuck Lever
2025-11-06 13:07 ` Christoph Hellwig
2025-11-05 19:28 ` [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-05 19:28 ` [PATCH v10 5/5] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
4 siblings, 1 reply; 26+ messages in thread
From: Chuck Lever @ 2025-11-05 19:28 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Christoph Hellwig
From: Chuck Lever <chuck.lever@oracle.com>
NFSv3 and newer protocols enable clients to perform a two-phase
WRITE. A client requests an UNSTABLE WRITE, which sends dirty data
to the NFS server, but does not persist it. The server replies that
it performed the UNSTABLE WRITE, and the client is then obligated to
follow up with a COMMIT request before it can remove the dirty data
from its own page cache. The COMMIT reply is the client's guarantee
that the written data has been persisted on the server.
The purpose of this protocol design is to enable clients to send
a large amount of data via multiple WRITE requests to a server, and
then wait for persistence just once. The server is able to start
persisting the data as soon as it gets it, to shorten the length of
time the client has to wait for the final COMMIT to complete.
It's also possible for the server to respond to an UNSTABLE WRITE
request in a way that indicates that the data was persisted anyway.
In that case, the client can skip the COMMIT and remove the dirty
data from its memory immediately. NetApp filers, for example, do
this because they have a battery-backed cache and can guarantee that
written data is persisted immediately.
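Schematically (an illustrative exchange, not wire format):

	Client                           Server
	WRITE(stable=UNSTABLE)   -->
	                         <--     reply: committed=UNSTABLE
	COMMIT                   -->
	                         <--     reply: verifier (data durable)

	versus a server that persists written data immediately:

	WRITE(stable=UNSTABLE)   -->
	                         <--     reply: committed=FILE_SYNC
	(no COMMIT needed)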
NFSD has never implemented this kind of promotion. UNSTABLE WRITE
requests are unconditionally treated as UNSTABLE. However, in a
subsequent patch, nfsd_vfs_write() will be able to promote an
UNSTABLE WRITE to be a FILE_SYNC WRITE. This will be because NFSD
will handle some WRITE requests locally with O_DIRECT, which
persists written data immediately. The FILE_SYNC WRITE response
indicates to the client that no follow-up COMMIT is necessary.
This patch prepares for that change by enabling the modified
stable_how value to be passed along to NFSD's WRITE reply encoder.
No behavior change is expected.
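For illustration, a minimal sketch of the new calling contract
(variable names are examples; the signature matches the change
below):

	u32 stable_how = NFS_UNSTABLE;	/* IN: client's requested value */

	status = nfsd_vfs_write(rqstp, fhp, nf, offset, payload,
				&cnt, &stable_how, verf);
	/* OUT: a later patch may promote this to NFS_FILE_SYNC */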
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfs3proc.c | 2 +-
fs/nfsd/nfs4proc.c | 2 +-
fs/nfsd/nfsproc.c | 3 ++-
fs/nfsd/vfs.c | 11 ++++++-----
fs/nfsd/vfs.h | 6 ++++--
fs/nfsd/xdr3.h | 2 +-
6 files changed, 15 insertions(+), 11 deletions(-)
diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index b6d03e1ef5f7..ad14b34583bb 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -236,7 +236,7 @@ nfsd3_proc_write(struct svc_rqst *rqstp)
resp->committed = argp->stable;
resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
&argp->payload, &cnt,
- resp->committed, resp->verf);
+ &resp->committed, resp->verf);
resp->count = cnt;
resp->status = nfsd3_map_status(resp->status);
return rpc_success;
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 7f7e6bb23a90..2222bb283baf 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1285,7 +1285,7 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
write->wr_how_written = write->wr_stable_how;
status = nfsd_vfs_write(rqstp, &cstate->current_fh, nf,
write->wr_offset, &write->wr_payload,
- &cnt, write->wr_how_written,
+ &cnt, &write->wr_how_written,
(__be32 *)write->wr_verifier.data);
nfsd_file_put(nf);
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 8f71f5748c75..706401ed6f8d 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -251,6 +251,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
struct nfsd_writeargs *argp = rqstp->rq_argp;
struct nfsd_attrstat *resp = rqstp->rq_resp;
unsigned long cnt = argp->len;
+ u32 committed = NFS_DATA_SYNC;
dprintk("nfsd: WRITE %s %u bytes at %d\n",
SVCFH_fmt(&argp->fh),
@@ -258,7 +259,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
fh_copy(&resp->fh, &argp->fh);
resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
- &argp->payload, &cnt, NFS_DATA_SYNC, NULL);
+ &argp->payload, &cnt, &committed, NULL);
if (resp->status == nfs_ok)
resp->status = fh_getattr(&resp->fh, &resp->stat);
else if (resp->status == nfserr_jukebox)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5333d49910d9..f3be36b960e5 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1262,7 +1262,7 @@ static int wait_for_concurrent_writes(struct file *file)
* @offset: Byte offset of start
* @payload: xdr_buf containing the write payload
* @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
* @verf: NFS WRITE verifier
*
* Upon return, caller must invoke fh_put on @fhp.
@@ -1274,11 +1274,12 @@ __be32
nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct nfsd_file *nf, loff_t offset,
const struct xdr_buf *payload, unsigned long *cnt,
- int stable, __be32 *verf)
+ u32 *stable_how, __be32 *verf)
{
struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
struct file *file = nf->nf_file;
struct super_block *sb = file_inode(file)->i_sb;
+ u32 stable = *stable_how;
struct kiocb kiocb;
struct svc_export *exp;
struct iov_iter iter;
@@ -1444,7 +1445,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* @offset: Byte offset of start
* @payload: xdr_buf containing the write payload
* @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
* @verf: NFS WRITE verifier
*
* Upon return, caller must invoke fh_put on @fhp.
@@ -1454,7 +1455,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
*/
__be32
nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
- const struct xdr_buf *payload, unsigned long *cnt, int stable,
+ const struct xdr_buf *payload, unsigned long *cnt, u32 *stable_how,
__be32 *verf)
{
struct nfsd_file *nf;
@@ -1467,7 +1468,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
goto out;
err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt,
- stable, verf);
+ stable_how, verf);
nfsd_file_put(nf);
out:
trace_nfsd_write_done(rqstp, fhp, offset, *cnt);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index fa46f8b5f132..c713ed0b04e0 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -130,11 +130,13 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
u32 *eof);
__be32 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
loff_t offset, const struct xdr_buf *payload,
- unsigned long *cnt, int stable, __be32 *verf);
+ unsigned long *cnt, u32 *stable_how,
+ __be32 *verf);
__be32 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct nfsd_file *nf, loff_t offset,
const struct xdr_buf *payload,
- unsigned long *cnt, int stable, __be32 *verf);
+ unsigned long *cnt, u32 *stable_how,
+ __be32 *verf);
__be32 nfsd_readlink(struct svc_rqst *, struct svc_fh *,
char *, int *);
__be32 nfsd_symlink(struct svc_rqst *, struct svc_fh *,
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 522067b7fd75..c0e443ef3a6b 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -152,7 +152,7 @@ struct nfsd3_writeres {
__be32 status;
struct svc_fh fh;
unsigned long count;
- int committed;
+ u32 committed;
__be32 verf[2];
};
--
2.51.0
* [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-05 19:28 [PATCH v10 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (2 preceding siblings ...)
2025-11-05 19:28 ` [PATCH v10 3/5] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
@ 2025-11-05 19:28 ` Chuck Lever
2025-11-06 10:11 ` NeilBrown
2025-11-05 19:28 ` [PATCH v10 5/5] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
4 siblings, 1 reply; 26+ messages in thread
From: Chuck Lever @ 2025-11-05 19:28 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer, Chuck Lever
From: Mike Snitzer <snitzer@kernel.org>
When NFSD_IO_DIRECT is selected via the
/sys/kernel/debug/nfsd/io_cache_write experimental tunable, split
incoming unaligned NFS WRITE requests into prefix, middle, and
suffix segments, as needed. The middle segment is then DIO-aligned,
and the prefix and/or suffix are unaligned. Synchronous buffered IO
is used for the unaligned segments, and IOCB_DIRECT is used for the
DIO-aligned middle segment.
Although IOCB_DIRECT avoids the use of the page cache, by itself it
doesn't guarantee data durability. For UNSTABLE WRITE requests,
durability is obtained by a subsequent NFS COMMIT request.
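As a worked example (alignment values assumed for illustration):
with nf_dio_offset_align = 4096, a 1048576-byte WRITE at offset 1000
is split as follows:

	prefix_end = round_up(1000, 4096)             = 4096
	middle_end = round_down(1000 + 1048576, 4096) = 1048576

	prefix = 3096 bytes    (buffered)
	middle = 1044480 bytes (IOCB_DIRECT)
	suffix = 1000 bytes    (buffered)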
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Co-developed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/debugfs.c | 1 +
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 170 ++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 168 insertions(+), 4 deletions(-)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 00eb1ecef6ac..7f44689e0a53 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
switch (val) {
case NFSD_IO_BUFFERED:
case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
nfsd_io_cache_write = val;
break;
default:
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index bfd41236aff2..ad74439d0105 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
DEFINE_NFSD_IO_EVENT(read_done);
DEFINE_NFSD_IO_EVENT(write_start);
DEFINE_NFSD_IO_EVENT(write_opened);
+DEFINE_NFSD_IO_EVENT(write_direct);
DEFINE_NFSD_IO_EVENT(write_io_done);
DEFINE_NFSD_IO_EVENT(write_done);
DEFINE_NFSD_IO_EVENT(commit_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f3be36b960e5..8158e129a560 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,161 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
+struct nfsd_write_dio_seg {
+ struct iov_iter iter;
+ bool use_dio;
+};
+
+struct nfsd_write_dio_args {
+ struct nfsd_file *nf;
+ int flags_buffered;
+ int flags_direct;
+ unsigned int nsegs;
+ struct nfsd_write_dio_seg segment[3];
+};
+
+/*
+ * Check if the bvec iterator is aligned for direct I/O.
+ *
+ * bvecs generated from RPC receive buffers are contiguous: After the first
+ * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
+ * Therefore, only the first bvec is checked.
+ */
+static bool
+nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
+{
+ unsigned int addr_mask = nf->nf_dio_mem_align - 1;
+ const struct bio_vec *bvec = i->bvec;
+
+ return ((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask) == 0;
+}
+
+static void
+nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
+ struct bio_vec *bvec, unsigned int nvecs,
+ unsigned long total, size_t start, size_t len)
+{
+ iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
+ if (start)
+ iov_iter_advance(&segment->iter, start);
+ iov_iter_truncate(&segment->iter, len);
+ segment->use_dio = false;
+}
+
+static void
+nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
+ loff_t offset, unsigned long total,
+ struct nfsd_write_dio_args *args)
+{
+ u32 offset_align = args->nf->nf_dio_offset_align;
+ u32 mem_align = args->nf->nf_dio_mem_align;
+ loff_t prefix_end, orig_end, middle_end;
+ size_t prefix, middle, suffix;
+
+ args->nsegs = 0;
+
+ /*
+ * Check if direct I/O is feasible for this write request.
+ * If alignments are not available, the write is too small,
+ * or no alignment can be found, fall back to buffered I/O.
+ */
+ if (unlikely(!mem_align || !offset_align) ||
+ unlikely(total < max(offset_align, mem_align)))
+ goto no_dio;
+
+ /* Calculate aligned segments */
+ prefix_end = round_up(offset, offset_align);
+ orig_end = offset + total;
+ middle_end = round_down(orig_end, offset_align);
+
+ prefix = prefix_end - offset;
+ middle = middle_end - prefix_end;
+ suffix = orig_end - middle_end;
+
+ if (!middle)
+ goto no_dio;
+
+ if (prefix) {
+ nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
+ nvecs, total, 0, prefix);
+ ++args->nsegs;
+ }
+
+ nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
+ total, prefix, middle);
+ if (!nfsd_iov_iter_aligned_bvec(args->nf,
+ &args->segment[args->nsegs].iter))
+ goto no_dio;
+ args->segment[args->nsegs].use_dio = true;
+ ++args->nsegs;
+
+ if (suffix) {
+ nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
+ nvecs, total, prefix + middle, suffix);
+ ++args->nsegs;
+ }
+
+ return;
+
+no_dio:
+ /*
+ * No DIO alignment possible - pack into single non-DIO segment.
+ * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
+ */
+ if (args->nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+ args->flags_buffered |= IOCB_DONTCACHE;
+ nfsd_write_dio_seg_init(&args->segment[0], bvec, nvecs, total,
+ 0, total);
+ args->nsegs = 1;
+}
+
+static int
+nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct kiocb *kiocb, unsigned int nvecs,
+ unsigned long *cnt, struct nfsd_write_dio_args *args)
+{
+ struct file *file = args->nf->nf_file;
+ ssize_t host_err;
+ unsigned int i;
+
+ nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
+ *cnt, args);
+
+ *cnt = 0;
+ for (i = 0; i < args->nsegs; i++) {
+ if (args->segment[i].use_dio) {
+ kiocb->ki_flags = args->flags_direct;
+ trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
+ args->segment[i].iter.count);
+ } else
+ kiocb->ki_flags = args->flags_buffered;
+
+ host_err = vfs_iocb_iter_write(file, kiocb,
+ &args->segment[i].iter);
+ if (host_err < 0)
+ return host_err;
+ *cnt += host_err;
+ if (host_err < args->segment[i].iter.count)
+ break; /* partial write */
+ }
+
+ return 0;
+}
+
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, unsigned int nvecs,
+ unsigned long *cnt, struct kiocb *kiocb)
+{
+ struct nfsd_write_dio_args args;
+
+ args.flags_direct = kiocb->ki_flags | IOCB_DIRECT;
+ args.flags_buffered = kiocb->ki_flags;
+ args.nf = nf;
+
+ return nfsd_issue_dio_write(rqstp, fhp, kiocb, nvecs, cnt, &args);
+}
+
/**
* nfsd_vfs_write - write data to an already-open file
* @rqstp: RPC execution context
@@ -1329,25 +1484,32 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
- iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
switch (nfsd_io_cache_write) {
- case NFSD_IO_BUFFERED:
+ case NFSD_IO_DIRECT:
+ host_err = nfsd_direct_write(rqstp, fhp, nf, nvecs,
+ cnt, &kiocb);
break;
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags |= IOCB_DONTCACHE;
+ fallthrough;
+ case NFSD_IO_BUFFERED:
+ iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+ host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
+ if (host_err < 0)
+ break;
+ *cnt = host_err;
break;
}
- host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
if (host_err < 0) {
commit_reset_write_verifier(nn, rqstp, host_err);
goto out_nfserr;
}
- *cnt = host_err;
nfsd_stats_io_write_add(nn, exp, *cnt);
fsnotify_modify(file);
host_err = filemap_check_wb_err(file->f_mapping, since);
--
2.51.0
* [PATCH v10 5/5] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
2025-11-05 19:28 [PATCH v10 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (3 preceding siblings ...)
2025-11-05 19:28 ` [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-11-05 19:28 ` Chuck Lever
2025-11-06 10:24 ` NeilBrown
4 siblings, 1 reply; 26+ messages in thread
From: Chuck Lever @ 2025-11-05 19:28 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer
From: Mike Snitzer <snitzer@kernel.org>
This document details the NFSD IO modes that are configurable using
NFSD's experimental debugfs interfaces:
/sys/kernel/debug/nfsd/io_cache_read
/sys/kernel/debug/nfsd/io_cache_write
This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
debugfs interfaces are replaced with per-export controls).
Future updates will provide more specific guidance and howto
information to help others use and evaluate NFSD's IO modes:
BUFFERED, DONTCACHE and DIRECT.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
.../filesystems/nfs/nfsd-io-modes.rst | 150 ++++++++++++++++++
1 file changed, 150 insertions(+)
create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
new file mode 100644
index 000000000000..29b84c9c9e25
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -0,0 +1,150 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+NFSD IO MODES
+=============
+
+Overview
+========
+
+NFSD has historically used buffered IO when servicing READ and
+WRITE operations. BUFFERED is NFSD's default IO mode, but it is
+possible to override that default to use either the DONTCACHE or
+DIRECT IO mode.
+
+Experimental NFSD debugfs interfaces are available to allow the NFSD IO
+mode used for READ and WRITE to be configured independently. See both:
+- /sys/kernel/debug/nfsd/io_cache_read
+- /sys/kernel/debug/nfsd/io_cache_write
+
+The default value for both io_cache_read and io_cache_write reflects
+NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
+
+Based on the configured settings, NFSD's IO will either be:
+- cached using page cache (NFSD_IO_BUFFERED=0)
+- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
+- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
+- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
+- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
+
+To set an NFSD IO mode, write a supported value (0 - 4) to the
+corresponding IO operation's debugfs interface, e.g.:
+ echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+ echo 4 > /sys/kernel/debug/nfsd/io_cache_write
+
+To check which IO mode NFSD is using for READ or WRITE, simply read the
+corresponding IO operation's debugfs interface, e.g.:
+ cat /sys/kernel/debug/nfsd/io_cache_read
+ cat /sys/kernel/debug/nfsd/io_cache_write
+
+NFSD DONTCACHE
+==============
+
+DONTCACHE is a hybrid approach to servicing IO that aims to offer
+the benefits of DIRECT IO without any of the strict alignment
+requirements that DIRECT IO imposes. To achieve this, buffered IO is
+used, but the IO is flagged to "drop behind" (meaning the associated
+pages are dropped from the page cache) when the IO completes.
+
+DONTCACHE aims to avoid what has proven to be a fairly significant
+limitation of Linux's memory management subsystem when large amounts
+of data are infrequently accessed (e.g. read once _or_ written once
+but not read until much later). Such use cases are particularly
+problematic because the page cache eventually becomes a bottleneck
+to servicing new IO requests.
+
+For more context on DONTCACHE, please see these Linux commit headers:
+- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
+ to take a struct kiocb")
+- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
+ RWF_DONTCACHE")
+- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
+
+If NFSD_IO_DONTCACHE is specified by writing 1 to NFSD's debugfs
+interfaces, FOP_DONTCACHE must be advertised as supported by the
+underlying filesystem (e.g. XFS), otherwise all IO flagged with
+RWF_DONTCACHE will fail with -EOPNOTSUPP.
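+
+As a userspace illustration (a minimal sketch, assuming a kernel and
+filesystem that support RWF_DONTCACHE; fd, buf, len, and offset are
+placeholders and error handling is omitted), the same drop-behind
+behavior is requested per-IO with:
+
+    #include <fcntl.h>
+    #include <sys/uio.h>
+
+    /* Buffered write whose pages are dropped from the page cache
+     * once writeback completes. */
+    struct iovec iov = { .iov_base = buf, .iov_len = len };
+    ssize_t ret = pwritev2(fd, &iov, 1, offset, RWF_DONTCACHE);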
+
+NFSD DIRECT
+===========
+
+DIRECT IO doesn't make use of the page cache; as such, it avoids
+the memory management subsystem's page reclaim scalability problems
+without resorting to the hybrid use of the page cache that DONTCACHE
+makes.
+
+Some workloads benefit from NFSD avoiding the page cache, particularly
+those with a working set that is significantly larger than available
+system memory. The pathological worst-case workload that NFSD DIRECT has
+proven to help most is: NFS client issuing large sequential IO to a file
+that is 2-3 times larger than the NFS server's available system memory.
+The reason for this improvement is that NFSD DIRECT eliminates much
+of the work that the memory management subsystem would otherwise be
+required to perform (e.g. page allocation, dirty writeback, page
+reclaim). When using NFSD DIRECT, kswapd and kcompactd no longer
+consume CPU time trying to find enough free pages so that forward IO
+progress can be made.
+
+The performance win associated with using NFSD DIRECT was previously
+discussed on linux-nfs, see:
+https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
+But in summary:
+- NFSD DIRECT can significantly reduce memory requirements
+- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
+- NFSD DIRECT can offer more deterministic IO performance
+
+As always, your mileage may vary, so it is important to carefully
+consider if/when it is beneficial to make use of NFSD DIRECT. When
+assessing the comparative performance of your workload, be sure to
+log relevant performance metrics during testing (e.g. memory usage,
+CPU usage, IO performance). Collecting perf data and generating a
+"flamegraph" of the work Linux performs on behalf of your test is a
+meaningful way to compare the relative health of the system and to
+see how switching NFSD's IO mode changes what is observed.
+
+If NFSD_IO_DIRECT is specified by writing 2 (or 3 or 4 for WRITE) to
+NFSD's debugfs interfaces, ideally the IO will be aligned relative to
+the underlying block device's logical_block_size. Also, the memory
+buffer used to store the READ or WRITE payload must be aligned
+relative to the underlying block device's dma_alignment.
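+
+As an illustration, these two alignment values can be inspected from
+userspace using the statx(2) STATX_DIOALIGN attribute (a minimal
+sketch; the path is a placeholder and error handling is omitted):
+
+    #include <fcntl.h>
+    #include <stdio.h>
+    #include <sys/stat.h>
+
+    struct statx stx;
+
+    if (statx(AT_FDCWD, "/export/file", 0, STATX_DIOALIGN, &stx) == 0)
+            printf("mem align %u, offset align %u\n",
+                   stx.stx_dio_mem_align, stx.stx_dio_offset_align);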
+
+But NFSD DIRECT handles misaligned IO as best it can:
+
+Misaligned READ:
+ If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
+ DIO-aligned block (on either end of the READ). The expanded READ is
+ verified for proper offset/length (logical_block_size) and
+ dma_alignment.
+
+ Any misaligned READ that is less than 32K won't be expanded to be
+ DIO-aligned (this heuristic just avoids excess work, like allocating
+ start_extra_page, for smaller IO that can generally already perform
+ well using buffered IO).
+
+Misaligned WRITE:
+ If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
+ middle and end segment, as needed. The large middle segment is
+ DIO-aligned and the start and/or end are misaligned. Buffered IO is
+ used for the misaligned segments and O_DIRECT is used for the middle
+ DIO-aligned segment. DONTCACHE buffered IO is _not_ used for the
+ misaligned segments because normal buffered IO offers a significant
+ RMW performance benefit when handling streaming misaligned WRITEs.
+
+Tracing:
+ The nfsd_read_direct trace event shows how NFSD expands any
+ misaligned READ to the next DIO-aligned block (on either end of the
+ original READ, as needed).
+
+ This combination of trace events is useful for READs:
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
+ echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
+
+ The nfsd_write_direct trace event shows how NFSD splits a given
+ misaligned WRITE into a DIO-aligned middle segment.
+
+ This combination of trace events is useful for WRITEs:
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
+ echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
--
2.51.0
* Re: [PATCH v10 1/5] NFSD: don't start nfsd if sv_permsocks is empty
2025-11-05 19:28 ` [PATCH v10 1/5] NFSD: don't start nfsd if sv_permsocks is empty Chuck Lever
@ 2025-11-05 19:31 ` Chuck Lever
0 siblings, 0 replies; 26+ messages in thread
From: Chuck Lever @ 2025-11-05 19:31 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey; +Cc: linux-nfs
On 11/5/25 2:28 PM, Chuck Lever wrote:
> From: Olga Kornievskaia <okorniev@redhat.com>
>
> Previously, while creating a server instance, if no listening
> sockets were present, default UDP and TCP listeners were
> created. It's unclear what the purpose of starting these
> listeners was, or how this could have been triggered by the
> userland setup. This patch instead ensures that we never end
> up in a situation where nfsd threads are being created but no
> listener sockets have been configured.
>
> The problem it solves is: when nfs.conf only has tcp=n (and
> nothing else for the choice of transports), nfsdctl would
> still start the server and create udp and tcp listeners.
>
> Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
> Reviewed-by: NeilBrown <neil@brown.name>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/nfssvc.c | 28 +++++-----------------------
> 1 file changed, 5 insertions(+), 23 deletions(-)
>
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 7057ddd7a0a8..b08ae85d53ef 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -249,27 +249,6 @@ int nfsd_nrthreads(struct net *net)
> return rv;
> }
>
> -static int nfsd_init_socks(struct net *net, const struct cred *cred)
> -{
> - int error;
> - struct nfsd_net *nn = net_generic(net, nfsd_net_id);
> -
> - if (!list_empty(&nn->nfsd_serv->sv_permsocks))
> - return 0;
> -
> - error = svc_xprt_create(nn->nfsd_serv, "udp", net, PF_INET, NFS_PORT,
> - SVC_SOCK_DEFAULTS, cred);
> - if (error < 0)
> - return error;
> -
> - error = svc_xprt_create(nn->nfsd_serv, "tcp", net, PF_INET, NFS_PORT,
> - SVC_SOCK_DEFAULTS, cred);
> - if (error < 0)
> - return error;
> -
> - return 0;
> -}
> -
> static int nfsd_users = 0;
>
> static int nfsd_startup_generic(void)
> @@ -377,9 +356,12 @@ static int nfsd_startup_net(struct net *net, const struct cred *cred)
> ret = nfsd_startup_generic();
> if (ret)
> return ret;
> - ret = nfsd_init_socks(net, cred);
> - if (ret)
> +
> + if (list_empty(&nn->nfsd_serv->sv_permsocks)) {
> + pr_warn("NFSD: Failed to start, no listeners configured.\n");
> + ret = -EIO;
> goto out_socks;
> + }
>
> if (nfsd_needs_lockd(nn) && !nn->lockd_up) {
> ret = lockd_up(net, cred);
Ignore this one. It's already applied.
--
Chuck Lever
* Re: [PATCH v10 2/5] NFSD: Make FILE_SYNC WRITEs comply with spec
2025-11-05 19:28 ` [PATCH v10 2/5] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
@ 2025-11-06 0:55 ` NeilBrown
2025-11-06 13:05 ` Christoph Hellwig
1 sibling, 0 replies; 26+ messages in thread
From: NeilBrown @ 2025-11-06 0:55 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever, Mike Snitzer, stable
On Thu, 06 Nov 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
> does not also persist file time stamps. To wit, Section 18.32.3
> of RFC 8881 mandates:
>
> > The client specifies with the stable parameter the method of how
> > the data is to be processed by the server. If stable is
> > FILE_SYNC4, the server MUST commit the data written plus all file
> > system metadata to stable storage before returning results. This
> > corresponds to the NFSv2 protocol semantics. Any other behavior
> > constitutes a protocol violation. If stable is DATA_SYNC4, then
> > the server MUST commit all of the data to stable storage and
> > enough of the metadata to retrieve the data before returning.
>
> Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") replaced:
>
> - flags |= RWF_SYNC;
>
> with:
>
> + kiocb.ki_flags |= IOCB_DSYNC;
>
> which appears to be correct given:
>
> if (flags & RWF_SYNC)
> kiocb_flags |= IOCB_DSYNC;
>
> in kiocb_set_rw_flags(). However, the author of that commit did not
"the author and reviewers of that commit"
Reviewed-by: NeilBrown <neil@brown.name>
Thanks,
NeilBrown
> appreciate that the previous line in kiocb_set_rw_flags() results
> in IOCB_SYNC also being set:
>
> kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
>
> RWF_SUPPORTED contains RWF_SYNC, and RWF_SYNC is the same bit as
> IOCB_SYNC. Reviewers at the time did not catch the omission.
>
> Reported-by: Mike Snitzer <snitzer@kernel.org>
> Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Fixes: 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index f537a7b4ee01..5333d49910d9 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1314,8 +1314,18 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> stable = NFS_UNSTABLE;
> init_sync_kiocb(&kiocb, file);
> kiocb.ki_pos = offset;
> - if (stable && !fhp->fh_use_wgather)
> - kiocb.ki_flags |= IOCB_DSYNC;
> + if (likely(!fhp->fh_use_wgather)) {
> + switch (stable) {
> + case NFS_FILE_SYNC:
> + /* persist data and timestamps */
> + kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
> + break;
> + case NFS_DATA_SYNC:
> + /* persist data only */
> + kiocb.ki_flags |= IOCB_DSYNC;
> + break;
> + }
> + }
>
> nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> --
> 2.51.0
>
>
* Re: [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-05 19:28 ` [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-11-06 10:11 ` NeilBrown
2025-11-06 13:15 ` Christoph Hellwig
0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2025-11-06 10:11 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Mike Snitzer, Chuck Lever
On Thu, 06 Nov 2025, Chuck Lever wrote:
> From: Mike Snitzer <snitzer@kernel.org>
>
> When NFSD_IO_DIRECT is selected via the
> /sys/kernel/debug/nfsd/io_cache_write experimental tunable, split
> incoming unaligned NFS WRITE requests into prefix, middle, and
> suffix segments, as needed. The middle segment is then DIO-aligned,
> and the prefix and/or suffix are unaligned. Synchronous buffered IO
> is used for the unaligned segments, and IOCB_DIRECT is used for the
> DIO-aligned middle segment.
>
> Although IOCB_DIRECT avoids the use of the page cache, by itself it
> doesn't guarantee data durability. For UNSTABLE WRITE requests,
> durability is obtained by a subsequent NFS COMMIT request.
>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> Co-developed-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/debugfs.c | 1 +
> fs/nfsd/trace.h | 1 +
> fs/nfsd/vfs.c | 170 ++++++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 168 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 00eb1ecef6ac..7f44689e0a53 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
> switch (val) {
> case NFSD_IO_BUFFERED:
> case NFSD_IO_DONTCACHE:
> + case NFSD_IO_DIRECT:
> nfsd_io_cache_write = val;
> break;
> default:
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index bfd41236aff2..ad74439d0105 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
> DEFINE_NFSD_IO_EVENT(read_done);
> DEFINE_NFSD_IO_EVENT(write_start);
> DEFINE_NFSD_IO_EVENT(write_opened);
> +DEFINE_NFSD_IO_EVENT(write_direct);
> DEFINE_NFSD_IO_EVENT(write_io_done);
> DEFINE_NFSD_IO_EVENT(write_done);
> DEFINE_NFSD_IO_EVENT(commit_start);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index f3be36b960e5..8158e129a560 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1254,6 +1254,161 @@ static int wait_for_concurrent_writes(struct file *file)
> return err;
> }
>
> +struct nfsd_write_dio_seg {
> + struct iov_iter iter;
> + bool use_dio;
This is only used to choose which flags to use.
I think it would be neater to have 'flags' here explicitly.
> +};
> +
> +struct nfsd_write_dio_args {
> + struct nfsd_file *nf;
> + int flags_buffered;
> + int flags_direct;
The difference between these two is that the latter has IOCB_DIRECT.
So we don't need both. Just have 'flags' and when we currently use
"flags_direct", use "flag | IOCB_DIRECT" instead.
> + unsigned int nsegs;
> + struct nfsd_write_dio_seg segment[3];
> +};
> +
> +/*
> + * Check if the bvec iterator is aligned for direct I/O.
> + *
> + * bvecs generated from RPC receive buffers are contiguous: After the first
> + * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
> + * Therefore, only the first bvec is checked.
> + */
> +static bool
> +nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
> +{
> + unsigned int addr_mask = nf->nf_dio_mem_align - 1;
> + const struct bio_vec *bvec = i->bvec;
> +
> + return ((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask) == 0;
> +}
> +
> +static void
> +nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
> + struct bio_vec *bvec, unsigned int nvecs,
> + unsigned long total, size_t start, size_t len)
If we passed 'args' in here then:
- we wouldn't need segment or bvec
- we could increment nsegs in one place
- we would have direct access to 'flags'.
possibly a 'direct' flag could be passed in which causes IOCB_DIRECT to
be set, and triggers failure if DIRECT isn't possible.
I'm not certain all the above changes are a win, but I ask you
to consider them and see if you think they make the code cleaner.
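For example, an untested sketch of the shape I mean:

	struct nfsd_write_dio_seg {
		struct iov_iter iter;
		int ki_flags;	/* IOCB_DIRECT only for the aligned middle */
	};

Then nfsd_issue_dio_write() would simply assign
kiocb->ki_flags = args->segment[i].ki_flags for every segment.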
> +{
> + iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
> + if (start)
> + iov_iter_advance(&segment->iter, start);
> + iov_iter_truncate(&segment->iter, len);
> + segment->use_dio = false;
> +}
> +
> +static void
> +nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
> + loff_t offset, unsigned long total,
> + struct nfsd_write_dio_args *args)
> +{
> + u32 offset_align = args->nf->nf_dio_offset_align;
> + u32 mem_align = args->nf->nf_dio_mem_align;
> + loff_t prefix_end, orig_end, middle_end;
> + size_t prefix, middle, suffix;
> +
> + args->nsegs = 0;
> +
> + /*
> + * Check if direct I/O is feasible for this write request.
> + * If alignments are not available, the write is too small,
> + * or no alignment can be found, fall back to buffered I/O.
> + */
> + if (unlikely(!mem_align || !offset_align) ||
> + unlikely(total < max(offset_align, mem_align)))
> + goto no_dio;
> +
> + /* Calculate aligned segments */
> + prefix_end = round_up(offset, offset_align);
> + orig_end = offset + total;
> + middle_end = round_down(orig_end, offset_align);
> +
> + prefix = prefix_end - offset;
> + middle = middle_end - prefix_end;
> + suffix = orig_end - middle_end;
> +
> + if (!middle)
> + goto no_dio;
> +
> + if (prefix) {
> + nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
> + nvecs, total, 0, prefix);
> + ++args->nsegs;
> + }
> +
> + nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
> + total, prefix, middle);
> + if (!nfsd_iov_iter_aligned_bvec(args->nf,
> + &args->segment[args->nsegs].iter))
> + goto no_dio;
> + args->segment[args->nsegs].use_dio = true;
> + ++args->nsegs;
> +
> + if (suffix) {
> + nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
> + nvecs, total, prefix + middle, suffix);
> + ++args->nsegs;
> + }
> +
> + return;
> +
> +no_dio:
> + /*
> + * No DIO alignment possible - pack into single non-DIO segment.
> + * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
> + */
> + if (args->nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> + args->flags_buffered |= IOCB_DONTCACHE;
> + nfsd_write_dio_seg_init(&args->segment[0], bvec, nvecs, total,
> + 0, total);
> + args->nsegs = 1;
> +}
> +
> +static int
> +nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + struct kiocb *kiocb, unsigned int nvecs,
> + unsigned long *cnt, struct nfsd_write_dio_args *args)
> +{
> + struct file *file = args->nf->nf_file;
> + ssize_t host_err;
> + unsigned int i;
> +
> + nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
> + *cnt, args);
> +
> + *cnt = 0;
> + for (i = 0; i < args->nsegs; i++) {
> + if (args->segment[i].use_dio) {
> + kiocb->ki_flags = args->flags_direct;
> + trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
> + args->segment[i].iter.count);
> + } else
> + kiocb->ki_flags = args->flags_buffered;
Why do we trace the direct write, but not the buffered write?
Thanks,
NeilBrown
> +
> + host_err = vfs_iocb_iter_write(file, kiocb,
> + &args->segment[i].iter);
> + if (host_err < 0)
> + return host_err;
> + *cnt += host_err;
> + if (host_err < args->segment[i].iter.count)
> + break; /* partial write */
> + }
> +
> + return 0;
> +}
> +
> +static noinline_for_stack int
> +nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + struct nfsd_file *nf, unsigned int nvecs,
> + unsigned long *cnt, struct kiocb *kiocb)
> +{
> + struct nfsd_write_dio_args args;
> +
> + args.flags_direct = kiocb->ki_flags | IOCB_DIRECT;
> + args.flags_buffered = kiocb->ki_flags;
> + args.nf = nf;
> +
> + return nfsd_issue_dio_write(rqstp, fhp, kiocb, nvecs, cnt, &args);
> +}
> +
> /**
> * nfsd_vfs_write - write data to an already-open file
> * @rqstp: RPC execution context
> @@ -1329,25 +1484,32 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> }
>
> nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> - iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> +
> since = READ_ONCE(file->f_wb_err);
> if (verf)
> nfsd_copy_write_verifier(verf, nn);
>
> switch (nfsd_io_cache_write) {
> - case NFSD_IO_BUFFERED:
> + case NFSD_IO_DIRECT:
> + host_err = nfsd_direct_write(rqstp, fhp, nf, nvecs,
> + cnt, &kiocb);
> break;
> case NFSD_IO_DONTCACHE:
> if (file->f_op->fop_flags & FOP_DONTCACHE)
> kiocb.ki_flags |= IOCB_DONTCACHE;
> + fallthrough;
> + case NFSD_IO_BUFFERED:
> + iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> + host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
> + if (host_err < 0)
> + break;
> + *cnt = host_err;
> break;
> }
> - host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
> if (host_err < 0) {
> commit_reset_write_verifier(nn, rqstp, host_err);
> goto out_nfserr;
> }
> - *cnt = host_err;
> nfsd_stats_io_write_add(nn, exp, *cnt);
> fsnotify_modify(file);
> host_err = filemap_check_wb_err(file->f_mapping, since);
> --
> 2.51.0
>
>
* Re: [PATCH v10 5/5] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
2025-11-05 19:28 ` [PATCH v10 5/5] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
@ 2025-11-06 10:24 ` NeilBrown
2025-11-06 15:46 ` Mike Snitzer
0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2025-11-06 10:24 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Mike Snitzer
On Thu, 06 Nov 2025, Chuck Lever wrote:
> From: Mike Snitzer <snitzer@kernel.org>
>
> This document details the NFSD IO modes that are configurable using
> NFSD's experimental debugfs interfaces:
>
> /sys/kernel/debug/nfsd/io_cache_read
> /sys/kernel/debug/nfsd/io_cache_write
>
> This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
> debugfs interfaces are replaced with per-export controls).
>
> Future updates will provide more specific guidance and howto
> information to help others use and evaluate NFSD's IO modes:
> BUFFERED, DONTCACHE and DIRECT.
>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> .../filesystems/nfs/nfsd-io-modes.rst | 150 ++++++++++++++++++
> 1 file changed, 150 insertions(+)
> create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
>
> diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> new file mode 100644
> index 000000000000..29b84c9c9e25
> --- /dev/null
> +++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> @@ -0,0 +1,150 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +NFSD IO MODES
> +=============
> +
> +Overview
> +========
> +
> +NFSD has historically used buffered IO when servicing READ and
> +WRITE operations. BUFFERED is NFSD's default IO mode, but it is
> +possible to override that default to use either the DONTCACHE or
> +DIRECT IO mode.
> +
> +Experimental NFSD debugfs interfaces are available to allow the NFSD IO
> +mode used for READ and WRITE to be configured independently. See both:
> +- /sys/kernel/debug/nfsd/io_cache_read
> +- /sys/kernel/debug/nfsd/io_cache_write
> +
> +The default value for both io_cache_read and io_cache_write reflects
> +NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
> +
> +Based on the configured settings, NFSD's IO will either be:
> +- cached using page cache (NFSD_IO_BUFFERED=0)
> +- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
> +- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
> +- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
> +- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
> +
> +To set an NFSD IO mode, write a supported value (0 - 4) to the
> +corresponding IO operation's debugfs interface, e.g.:
> + echo 2 > /sys/kernel/debug/nfsd/io_cache_read
> + echo 4 > /sys/kernel/debug/nfsd/io_cache_write
> +
> +To check which IO mode NFSD is using for READ or WRITE, simply read the
> +corresponding IO operation's debugfs interface, e.g.:
> + cat /sys/kernel/debug/nfsd/io_cache_read
> + cat /sys/kernel/debug/nfsd/io_cache_write
> +
> +NFSD DONTCACHE
> +==============
> +
> +DONTCACHE is a hybrid approach to servicing IO that aims to offer
> +the benefits of DIRECT IO without any of the strict alignment
> +requirements that DIRECT IO imposes. To achieve this, buffered IO is
> +used, but the IO is flagged to "drop behind" (meaning the associated
> +pages are dropped from the page cache) when the IO completes.
> +
> +DONTCACHE aims to avoid what has proven to be a fairly significant
> +limitation of Linux's memory management subsystem when large amounts
> +of data are infrequently accessed (e.g. read once _or_ written once
> +but not read until much later). Such use cases are particularly
> +problematic because the page cache eventually becomes a bottleneck
> +to servicing new IO requests.
> +
> +For more context on DONTCACHE, please see these Linux commit headers:
> +- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
> + to take a struct kiocb")
> +- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
> + RWF_DONTCACHE")
> +- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
> +
> +If NFSD_IO_DONTCACHE is specified by writing 1 to NFSD's debugfs
> +interfaces, FOP_DONTCACHE must be advertised as supported by the
> +underlying filesystem (e.g. XFS), otherwise all IO flagged with
> +RWF_DONTCACHE will fail with -EOPNOTSUPP.
If FOP_DONTCACHE isn't advertised, nfsd doesn't even try RWF_DONTCACHE,
so errors don't occur.
Maybe:
"NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the
underlying filesystem doesn't indicate support by setting
FOP_DONTCACHE."
> +
> +NFSD DIRECT
> +===========
> +
> +DIRECT IO doesn't make use of the page cache; as such, it avoids
> +the memory management subsystem's page reclaim scalability problems
> +without resorting to the hybrid use of the page cache that DONTCACHE
> +makes.
> +
> +Some workloads benefit from NFSD avoiding the page cache, particularly
> +those with a working set that is significantly larger than available
> +system memory. The pathological worst-case workload that NFSD DIRECT has
> +proven to help most is: NFS client issuing large sequential IO to a file
> +that is 2-3 times larger than the NFS server's available system memory.
> +The reason for this improvement is that NFSD DIRECT eliminates much
> +of the work that the memory management subsystem would otherwise be
> +required to perform (e.g. page allocation, dirty writeback, page
> +reclaim). When using NFSD DIRECT, kswapd and kcompactd no longer
> +consume CPU time trying to find enough free pages so that forward IO
> +progress can be made.
> +
> +The performance win associated with using NFSD DIRECT was previously
> +discussed on linux-nfs, see:
> +https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
> +But in summary:
> +- NFSD DIRECT can significantly reduce memory requirements
> +- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
> +- NFSD DIRECT can offer more deterministic IO performance
> +
> +As always, your mileage may vary, so it is important to carefully
> +consider if/when it is beneficial to make use of NFSD DIRECT. When
> +assessing the comparative performance of your workload, be sure to
> +log relevant performance metrics during testing (e.g. memory usage,
> +CPU usage, IO performance). Collecting perf data and generating a
> +"flamegraph" of the work Linux performs on behalf of your test is a
> +meaningful way to compare the relative health of the system and to
> +see how switching NFSD's IO mode changes what is observed.
> +
> +If NFSD_IO_DIRECT is specified by writing 2 (or 3 or 4 for WRITE) to
> +NFSD's debugfs interfaces, ideally the IO will be aligned relative to
> +the underlying block device's logical_block_size. Also, the memory
> +buffer used to store the READ or WRITE payload must be aligned
> +relative to the underlying block device's dma_alignment.
> +
> +But NFSD DIRECT handles misaligned IO as best it can:
> +
> +Misaligned READ:
> + If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
> + DIO-aligned block (on either end of the READ). The expanded READ is
> + verified for proper offset/length (logical_block_size) and
> + dma_alignment.
> +
> + Any misaligned READ that is less than 32K won't be expanded to be
> + DIO-aligned (this heuristic just avoids excess work, like allocating
> + start_extra_page, for smaller IO that can generally already perform
> + well using buffered IO).
I couldn't find this 32K in the code.
Do we want to say something like:
If you experiment with this on a recent kernel and have interesting
results, please report them to linux-nfs@vger.kernel.org
Thanks,
NeilBrown
> +
> +Misaligned WRITE:
> + If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> + middle and end segment, as needed. The large middle segment is
> + DIO-aligned and the start and/or end are misaligned. Buffered IO is
> + used for the misaligned segments and O_DIRECT is used for the middle
> + DIO-aligned segment. DONTCACHE buffered IO is _not_ used for the
> + misaligned segments because normal buffered IO offers a significant
> + RMW performance benefit when handling streaming misaligned WRITEs.
> +
> +Tracing:
> + The nfsd_read_direct trace event shows how NFSD expands any
> + misaligned READ to the next DIO-aligned block (on either end of the
> + original READ, as needed).
> +
> + This combination of trace events is useful for READs:
> + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
> + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
> + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
> + echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
> +
> + The nfsd_write_direct trace event shows how NFSD splits a given
> + misaligned WRITE into a DIO-aligned middle segment.
> +
> + This combination of trace events is useful for WRITEs:
> + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
> + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
> + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
> + echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
> --
> 2.51.0
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 2/5] NFSD: Make FILE_SYNC WRITEs comply with spec
2025-11-05 19:28 ` [PATCH v10 2/5] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-11-06 0:55 ` NeilBrown
@ 2025-11-06 13:05 ` Christoph Hellwig
1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2025-11-06 13:05 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever, Mike Snitzer, stable
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 3/5] NFSD: Enable return of an updated stable_how to NFS clients
2025-11-05 19:28 ` [PATCH v10 3/5] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
@ 2025-11-06 13:07 ` Christoph Hellwig
2025-11-06 16:30 ` Chuck Lever
0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2025-11-06 13:07 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
I don't think we need this any more with the current version, or am I
missing something?
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-06 10:11 ` NeilBrown
@ 2025-11-06 13:15 ` Christoph Hellwig
2025-11-06 13:51 ` Christoph Hellwig
0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2025-11-06 13:15 UTC (permalink / raw)
To: NeilBrown
Cc: Chuck Lever, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Mike Snitzer, Chuck Lever
On Thu, Nov 06, 2025 at 09:11:51PM +1100, NeilBrown wrote:
> > +struct nfsd_write_dio_seg {
> > + struct iov_iter iter;
> > + bool use_dio;
>
> This is only used to choose which flags to use.
> I think it would be neater to have 'flags' here explicitly.
Actually, looking at the grand unified patch now (thanks, this is so
much easier to review!), we can just do away with the struct entirely.
Just have nfsd_write_dio_iters_init return if direct I/O is possible
or not, and do a single vfs_iocb_iter_write on the original kiocb/iter
if not.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-06 13:15 ` Christoph Hellwig
@ 2025-11-06 13:51 ` Christoph Hellwig
2025-11-06 14:45 ` Chuck Lever
2025-11-06 16:48 ` Mike Snitzer
0 siblings, 2 replies; 26+ messages in thread
From: Christoph Hellwig @ 2025-11-06 13:51 UTC (permalink / raw)
To: NeilBrown
Cc: Chuck Lever, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Mike Snitzer, Chuck Lever
On Thu, Nov 06, 2025 at 05:15:45AM -0800, Christoph Hellwig wrote:
> On Thu, Nov 06, 2025 at 09:11:51PM +1100, NeilBrown wrote:
> > > +struct nfsd_write_dio_seg {
> > > + struct iov_iter iter;
> > > + bool use_dio;
> >
> > This is only used to choose which flags to use.
> > I think it would be neater to have 'flags' here explicitly.
>
> Actually, looking at the grand unified patch now (thanks, this is so
> much easier to review!), we can just do away with the struct entirely.
> Just have nfsd_write_dio_iters_init return if direct I/O is possible
> or not, and do a single vfs_iocb_iter_write on the original kiocb/iter
> if not.
That didn't work out too well, and indeed having flags here seems
saner.
Chuck, below is an untested incremental patch I did while reviewing
it. Besides this flags thing, it adds the actual NFSD_IO_DIRECT
definition that was missing, but otherwise mostly just folds things
so that we don't need the extra args structure and thus simplifies
things quite a bit.
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index ea87b42894dd..bdb60ee1f1a4 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -157,6 +157,7 @@ enum {
/* Any new NFSD_IO enum value must be added at the end */
NFSD_IO_BUFFERED,
NFSD_IO_DONTCACHE,
+ NFSD_IO_DIRECT,
};
extern u64 nfsd_io_cache_read __read_mostly;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index bb94da333d50..8038d8d038f3 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1170,56 +1170,38 @@ static int wait_for_concurrent_writes(struct file *file)
struct nfsd_write_dio_seg {
struct iov_iter iter;
- bool use_dio;
+ int flags;
};
-struct nfsd_write_dio_args {
- struct nfsd_file *nf;
- int flags_buffered;
- int flags_direct;
- unsigned int nsegs;
- struct nfsd_write_dio_seg segment[3];
-};
-
-/*
- * Check if the bvec iterator is aligned for direct I/O.
- *
- * bvecs generated from RPC receive buffers are contiguous: After the first
- * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
- * Therefore, only the first bvec is checked.
- */
-static bool
-nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
+static unsigned long iov_iter_bvec_offset(const struct iov_iter *iter)
{
- unsigned int addr_mask = nf->nf_dio_mem_align - 1;
- const struct bio_vec *bvec = i->bvec;
-
- return ((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask) == 0;
+ return (unsigned long)(iter->bvec->bv_offset + iter->iov_offset);
}
static void
nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
struct bio_vec *bvec, unsigned int nvecs,
- unsigned long total, size_t start, size_t len)
+ unsigned long total, size_t start, size_t len,
+ struct kiocb *iocb)
{
iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
if (start)
iov_iter_advance(&segment->iter, start);
iov_iter_truncate(&segment->iter, len);
- segment->use_dio = false;
+ segment->flags = iocb->ki_flags;
}
-static void
-nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
- loff_t offset, unsigned long total,
- struct nfsd_write_dio_args *args)
+static unsigned int
+nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
+ unsigned int nvecs, struct kiocb *iocb, unsigned long total,
+ struct nfsd_write_dio_seg segments[3])
{
- u32 offset_align = args->nf->nf_dio_offset_align;
- u32 mem_align = args->nf->nf_dio_mem_align;
+ u32 offset_align = nf->nf_dio_offset_align;
+ u32 mem_align = nf->nf_dio_mem_align;
+ loff_t offset = iocb->ki_pos;
loff_t prefix_end, orig_end, middle_end;
size_t prefix, middle, suffix;
-
- args->nsegs = 0;
+ unsigned int nsegs = 0;
/*
* Check if direct I/O is feasible for this write request.
@@ -1242,87 +1224,80 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
if (!middle)
goto no_dio;
- if (prefix) {
- nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
- nvecs, total, 0, prefix);
- ++args->nsegs;
- }
+ if (prefix)
+ nfsd_write_dio_seg_init(&segments[nsegs++], bvec,
+ nvecs, total, 0, prefix, iocb);
- nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
- total, prefix, middle);
- if (!nfsd_iov_iter_aligned_bvec(args->nf,
- &args->segment[args->nsegs].iter))
+ nfsd_write_dio_seg_init(&segments[nsegs], bvec, nvecs,
+ total, prefix, middle, iocb);
+
+ /*
+ * Check if the bvec iterator is aligned for direct I/O.
+ *
+ * bvecs generated from RPC receive buffers are contiguous: After the
+ * first bvec, all subsequent bvecs start at bv_offset zero
+ * (page-aligned).
+ * Therefore, only the first bvec is checked.
+ */
+ if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1))
goto no_dio;
- args->segment[args->nsegs].use_dio = true;
- ++args->nsegs;
+ segments[nsegs].flags |= IOCB_DIRECT;
+ nsegs++;
- if (suffix) {
- nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
- nvecs, total, prefix + middle, suffix);
- ++args->nsegs;
- }
+ if (suffix)
+ nfsd_write_dio_seg_init(&segments[nsegs++], bvec,
+ nvecs, total, prefix + middle, suffix,
+ iocb);
- return;
+ return nsegs;
no_dio:
/*
* No DIO alignment possible - pack into single non-DIO segment.
- * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
*/
- if (args->nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
- args->flags_buffered |= IOCB_DONTCACHE;
- nfsd_write_dio_seg_init(&args->segment[0], bvec, nvecs, total,
- 0, total);
- args->nsegs = 1;
+ nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total,
+ 0, total, iocb);
+ return 1;
}
-static int
-nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct kiocb *kiocb, unsigned int nvecs,
- unsigned long *cnt, struct nfsd_write_dio_args *args)
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, unsigned int nvecs,
+ unsigned long *cnt, struct kiocb *kiocb)
{
- struct file *file = args->nf->nf_file;
+ struct file *file = nf->nf_file;
+ struct nfsd_write_dio_seg segments[3];
+ unsigned int nsegs = 0, i;
ssize_t host_err;
- unsigned int i;
- nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
- *cnt, args);
+ nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
+ kiocb, *cnt, segments);
*cnt = 0;
- for (i = 0; i < args->nsegs; i++) {
- if (args->segment[i].use_dio) {
- kiocb->ki_flags = args->flags_direct;
+ for (i = 0; i < nsegs; i++) {
+ kiocb->ki_flags = segments[i].flags;
+ if (kiocb->ki_flags & IOCB_DIRECT) {
trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
- args->segment[i].iter.count);
- } else
- kiocb->ki_flags = args->flags_buffered;
+ segments[i].iter.count);
+ } else if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE) {
+ /*
+ * IOCB_DONTCACHE preserves the intent of
+ * NFSD_IO_DIRECT.
+ */
+ kiocb->ki_flags |= IOCB_DONTCACHE;
+ }
- host_err = vfs_iocb_iter_write(file, kiocb,
- &args->segment[i].iter);
+ host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
if (host_err < 0)
return host_err;
*cnt += host_err;
- if (host_err < args->segment[i].iter.count)
+ if (host_err < segments[i].iter.count)
break; /* partial write */
}
return 0;
}
-static noinline_for_stack int
-nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct nfsd_file *nf, unsigned int nvecs,
- unsigned long *cnt, struct kiocb *kiocb)
-{
- struct nfsd_write_dio_args args;
-
- args.flags_direct = kiocb->ki_flags | IOCB_DIRECT;
- args.flags_buffered = kiocb->ki_flags;
- args.nf = nf;
-
- return nfsd_issue_dio_write(rqstp, fhp, kiocb, nvecs, cnt, &args);
-}
-
/**
* nfsd_vfs_write - write data to an already-open file
* @rqstp: RPC execution context
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-06 13:51 ` Christoph Hellwig
@ 2025-11-06 14:45 ` Chuck Lever
2025-11-06 14:49 ` Christoph Hellwig
2025-11-06 16:48 ` Mike Snitzer
1 sibling, 1 reply; 26+ messages in thread
From: Chuck Lever @ 2025-11-06 14:45 UTC (permalink / raw)
To: Christoph Hellwig, NeilBrown
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Mike Snitzer, Chuck Lever
On 11/6/25 8:51 AM, Christoph Hellwig wrote:
> On Thu, Nov 06, 2025 at 05:15:45AM -0800, Christoph Hellwig wrote:
>> On Thu, Nov 06, 2025 at 09:11:51PM +1100, NeilBrown wrote:
>>>> +struct nfsd_write_dio_seg {
>>>> + struct iov_iter iter;
>>>> + bool use_dio;
>>>
>>> This is only used to choose which flags to use.
>>> I think it would be neater to have 'flags' here explicitly.
>>
>> Actually, looking at the grand unified patch now (thanks, this is so
>> much easier to review!), we can just do away with the struct entirely.
>> Just have nfsd_write_dio_iters_init return if direct I/O is possible
>> or not, and do a single vfs_iocb_iter_write on the original kiocb/iter
>> if not.
>
> That didn't work out too well, and indeed having flags here seems
> saner.
>
> Chuck, below is an untested incremental patch I did while reviewing
> it. Besides this flags thing, it adds the actual NFSD_IO_DIRECT
> definition that was missing,
It's not missing. This series applies on top of patches already in
the nfsd-testing branch, which has a patch that defines that constant.
I'm not sure my tooling enables me to specify a particular base commit
when posting a series. Maybe I can add it to the cover letter.
--
Chuck Lever
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-06 14:45 ` Chuck Lever
@ 2025-11-06 14:49 ` Christoph Hellwig
0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2025-11-06 14:49 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Mike Snitzer, Chuck Lever
On Thu, Nov 06, 2025 at 09:45:39AM -0500, Chuck Lever wrote:
> It's not missing. This series applies on top of patches already in
> the nfsd-testing branch, which has a patch that defines that constant.
Ah, sorry. It did apply and compile fine except for the define, though.
> I'm not sure my tooling enables me to specify a particular base commit
> when posting a series. Maybe I can add it to the cover letter.
Yeah, I usually just write that down manually.
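For example, with a reasonably recent git (2.9 or later), format-patch
can record it for you; the branch name here is just this thread's
example:

  git format-patch -v10 --cover-letter --base=nfsd-testing -5

The --base option appends a "base-commit:" trailer to the cover letter
(or to the first patch) so reviewers can check out the exact starting
point.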
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 5/5] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
2025-11-06 10:24 ` NeilBrown
@ 2025-11-06 15:46 ` Mike Snitzer
2025-11-06 15:52 ` Chuck Lever
0 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2025-11-06 15:46 UTC (permalink / raw)
To: NeilBrown
Cc: Chuck Lever, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs
On Thu, Nov 06, 2025 at 09:24:06PM +1100, NeilBrown wrote:
> On Thu, 06 Nov 2025, Chuck Lever wrote:
> > From: Mike Snitzer <snitzer@kernel.org>
> >
> > This document details the NFSD IO modes that are configurable using
> > NFSD's experimental debugfs interfaces:
> >
> > /sys/kernel/debug/nfsd/io_cache_read
> > /sys/kernel/debug/nfsd/io_cache_write
> >
> > This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
> > debugfs interfaces are replaced with per-export controls).
> >
> > Future updates will provide more specific guidance and howto
> > information to help others use and evaluate NFSD's IO modes:
> > BUFFERED, DONTCACHE and DIRECT.
> >
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> > .../filesystems/nfs/nfsd-io-modes.rst | 150 ++++++++++++++++++
> > 1 file changed, 150 insertions(+)
> > create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
> >
> > diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> > new file mode 100644
> > index 000000000000..29b84c9c9e25
> > --- /dev/null
> > +++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> > @@ -0,0 +1,150 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=============
> > +NFSD IO MODES
> > +=============
> > +
> > +Overview
> > +========
> > +
> > +NFSD has historically always used buffered IO when servicing READ and
> > +WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
> > +to override that default to use either DONTCACHE or DIRECT IO modes.
> > +
> > +Experimental NFSD debugfs interfaces are available to allow the NFSD IO
> > +mode used for READ and WRITE to be configured independently. See both:
> > +- /sys/kernel/debug/nfsd/io_cache_read
> > +- /sys/kernel/debug/nfsd/io_cache_write
> > +
> > +The default value for both io_cache_read and io_cache_write reflects
> > +NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
> > +
> > +Based on the configured settings, NFSD's IO will either be:
> > +- cached using page cache (NFSD_IO_BUFFERED=0)
> > +- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
> > +- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
> > +- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
> > +- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
> > +
> > +To set an NFSD IO mode, write a supported value (0 - 4) to the
> > +corresponding IO operation's debugfs interface, e.g.:
> > + echo 2 > /sys/kernel/debug/nfsd/io_cache_read
> > + echo 4 > /sys/kernel/debug/nfsd/io_cache_write
> > +
> > +To check which IO mode NFSD is using for READ or WRITE, simply read the
> > +corresponding IO operation's debugfs interface, e.g.:
> > + cat /sys/kernel/debug/nfsd/io_cache_read
> > + cat /sys/kernel/debug/nfsd/io_cache_write
> > +
> > +NFSD DONTCACHE
> > +==============
> > +
> > +DONTCACHE offers a hybrid approach to servicing IO that aims to offer
> > +the benefits of using DIRECT IO without any of the strict alignment
> > +requirements that DIRECT IO imposes. To achieve this buffered IO is used
> > +but the IO is flagged to "drop behind" (meaning associated pages are
> > +dropped from the page cache) when IO completes.
> > +
> > +DONTCACHE aims to avoid what has proven to be a fairly significant
> > +limitation of Linux's memory management subsystem if/when large amounts of
> > +data is infrequently accessed (e.g. read once _or_ written once but not
> > +read until much later). Such use-cases are particularly problematic
> > +because the page cache will eventually become a bottleneck to servicing
> > +new IO requests.
> > +
> > +For more context on DONTCACHE, please see these Linux commit headers:
> > +- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
> > + to take a struct kiocb")
> > +- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
> > + RWF_DONTCACHE")
> > +- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
> > +
> > +If NFSD_IO_DONTCACHE is specified by writing 1 to NFSD's debugfs
> > +interfaces, FOP_DONTCACHE must be advertised as supported by the
> > +underlying filesystem (e.g. XFS), otherwise all IO flagged with
> > +RWF_DONTCACHE will fail with -EOPNOTSUPP.
>
> If FOP_DONTCACHE isn't advertised, nfsd doesn't even try RWF_DONTCACHE,
> so errors don't occur.
>
> Maybe:
>
> "NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the
> underlying filesystem doesn't indicate support by setting
> FOP_DONTCACHE."
>
> > +
> > +NFSD DIRECT
> > +===========
> > +
> > +DIRECT IO doesn't make use of the page cache; as such, it is able to
> > +avoid the Linux memory management's page reclaim scalability problems
> > +without resorting to the hybrid use of page cache that DONTCACHE does.
> > +
> > +Some workloads benefit from NFSD avoiding the page cache, particularly
> > +those with a working set that is significantly larger than available
> > +system memory. The pathological worst-case workload that NFSD DIRECT has
> > +proven to help most is: NFS client issuing large sequential IO to a file
> > +that is 2-3 times larger than the NFS server's available system memory.
> > +The reason for such improvement is that NFSD DIRECT eliminates a lot of work
> > +that the memory management subsystem would otherwise be required to
> > +perform (e.g. page allocation, dirty writeback, page reclaim). When
> > +using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
> > +time trying to find adequate free pages so that forward IO progress can
> > +be made.
> > +
> > +The performance win associated with using NFSD DIRECT was previously
> > +discussed on linux-nfs, see:
> > +https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
> > +But in summary:
> > +- NFSD DIRECT can significantly reduce memory requirements
> > +- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
> > +- NFSD DIRECT can offer more deterministic IO performance
> > +
> > +As always, your mileage may vary and so it is important to carefully
> > +consider if/when it is beneficial to make use of NFSD DIRECT. When
> > +assessing comparative performance of your workload please be sure to log
> > +relevant performance metrics during testing (e.g. memory usage, cpu
> > +usage, IO performance). Using perf to collect perf data that may be used
> > +to generate a "flamegraph" for work Linux must perform on behalf of your
> > +test is a really meaningful way to compare the relative health of the
> > +system and how switching NFSD's IO mode changes what is observed.
> > +
> > +If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
> > +NFSD's debugfs interfaces, ideally the IO will be aligned relative to
> > +the underlying block device's logical_block_size. Also the memory buffer
> > +used to store the READ or WRITE payload must be aligned relative to the
> > +underlying block device's dma_alignment.
> > +
> > +But NFSD DIRECT does handle IO that is misaligned for O_DIRECT as best
> > +it can:
> > +
> > +Misaligned READ:
> > + If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
> > + DIO-aligned block (on either end of the READ). The expanded READ is
> > + verified to have proper offset/len (logical_block_size) and
> > + dma_alignment checking.
> > +
> > + Any misaligned READ that is less than 32K won't be expanded to be
> > + DIO-aligned (this heuristic just avoids excess work, like allocating
> > + start_extra_page, for smaller IO that can generally already perform
> > + well using buffered IO).
>
> I couldn't find this 32K in the code.
>
> Do we want to say something like:
>
> If you experiment with this on a recent kernel and have interesting
> results, please report them to linux-nfs@vger.kernel.org
>
> Thanks,
> NeilBrown
>
Thanks for the review, I clearly missed some clean up. Chuck, please
consider applying this incremental patch which should address Neil's
feedback and remove some stable_how related changes that aren't
relevant without my corresponding patch:
diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
index 29b84c9c9e25..e3a522d09766 100644
--- a/Documentation/filesystems/nfs/nfsd-io-modes.rst
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -23,19 +23,20 @@ Based on the configured settings, NFSD's IO will either be:
- cached using page cache (NFSD_IO_BUFFERED=0)
- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
-- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
-- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
-To set an NFSD IO mode, write a supported value (0 - 4) to the
+To set an NFSD IO mode, write a supported value (0 - 2) to the
corresponding IO operation's debugfs interface, e.g.:
echo 2 > /sys/kernel/debug/nfsd/io_cache_read
- echo 4 > /sys/kernel/debug/nfsd/io_cache_write
+ echo 2 > /sys/kernel/debug/nfsd/io_cache_write
To check which IO mode NFSD is using for READ or WRITE, simply read the
corresponding IO operation's debugfs interface, e.g.:
cat /sys/kernel/debug/nfsd/io_cache_read
cat /sys/kernel/debug/nfsd/io_cache_write
+If you experiment with NFSD's IO modes on a recent kernel and have
+interesting results, please report them to linux-nfs@vger.kernel.org
+
NFSD DONTCACHE
==============
@@ -59,10 +60,8 @@ For more context on DONTCACHE, please see these Linux commit headers:
RWF_DONTCACHE")
- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
-If NFSD_IO_DONTCACHE is specified by writing 1 to NFSD's debugfs
-interfaces, FOP_DONTCACHE must be advertised as supported by the
-underlying filesystem (e.g. XFS), otherwise all IO flagged with
-RWF_DONTCACHE will fail with -EOPNOTSUPP.
+NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
+filesystem doesn't indicate support by setting FOP_DONTCACHE.
NFSD DIRECT
===========
@@ -115,11 +114,6 @@ Misaligned READ:
verified to have proper offset/len (logical_block_size) and
dma_alignment checking.
- Any misaligned READ that is less than 32K won't be expanded to be
- DIO-aligned (this heuristic just avoids excess work, like allocating
- start_extra_page, for smaller IO that can generally already perform
- well using buffered IO).
-
Misaligned WRITE:
If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
middle and end as needed. The large middle segment is DIO-aligned
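
As an aside, the "Misaligned WRITE" split retained above is plain
round-up/round-down arithmetic. Here is a minimal userspace sketch, in
case it helps anyone reason about the segmenting; the helper names are
illustrative and not taken from the patch, only the arithmetic mirrors
what nfsd_write_dio_iters_init() computes:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROUND_UP(x, a)    ((((x) + (a) - 1) / (a)) * (a))
#define ROUND_DOWN(x, a)  (((x) / (a)) * (a))

/*
 * Split a WRITE of "total" bytes at "offset" on "align" boundaries.
 * Only the middle segment is eligible for O_DIRECT; the prefix and
 * suffix fall back to buffered IO.
 */
static void split_for_dio(uint64_t offset, size_t total, uint32_t align,
			  size_t *prefix, size_t *middle, size_t *suffix)
{
	uint64_t prefix_end = ROUND_UP(offset, align);
	uint64_t orig_end = offset + total;
	uint64_t middle_end = ROUND_DOWN(orig_end, align);

	if (middle_end <= prefix_end) {
		/* no DIO-aligned middle: a single buffered segment */
		*prefix = total;
		*middle = *suffix = 0;
		return;
	}
	*prefix = prefix_end - offset;		/* buffered */
	*middle = middle_end - prefix_end;	/* O_DIRECT */
	*suffix = orig_end - middle_end;	/* buffered */
}

int main(void)
{
	size_t p, m, s;

	/* a 1MB WRITE starting 100 bytes into a 4096-aligned region */
	split_for_dio(100, 1048576, 4096, &p, &m, &s);
	assert(p == 3996 && m == 1044480 && s == 100);
	return 0;
}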
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v10 5/5] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
2025-11-06 15:46 ` Mike Snitzer
@ 2025-11-06 15:52 ` Chuck Lever
0 siblings, 0 replies; 26+ messages in thread
From: Chuck Lever @ 2025-11-06 15:52 UTC (permalink / raw)
To: Mike Snitzer, NeilBrown
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs
On 11/6/25 10:46 AM, Mike Snitzer wrote:
> On Thu, Nov 06, 2025 at 09:24:06PM +1100, NeilBrown wrote:
>> On Thu, 06 Nov 2025, Chuck Lever wrote:
>>> From: Mike Snitzer <snitzer@kernel.org>
>>>
>>> This document details the NFSD IO modes that are configurable using
>>> NFSD's experimental debugfs interfaces:
>>>
>>> /sys/kernel/debug/nfsd/io_cache_read
>>> /sys/kernel/debug/nfsd/io_cache_write
>>>
>>> This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
>>> debugfs interfaces are replaced with per-export controls).
>>>
>>> Future updates will provide more specific guidance and howto
>>> information to help others use and evaluate NFSD's IO modes:
>>> BUFFERED, DONTCACHE and DIRECT.
>>>
>>> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>> ---
>>> .../filesystems/nfs/nfsd-io-modes.rst | 150 ++++++++++++++++++
>>> 1 file changed, 150 insertions(+)
>>> create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
>>>
>>> diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
>>> new file mode 100644
>>> index 000000000000..29b84c9c9e25
>>> --- /dev/null
>>> +++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
>>> @@ -0,0 +1,150 @@
>>> +.. SPDX-License-Identifier: GPL-2.0
>>> +
>>> +=============
>>> +NFSD IO MODES
>>> +=============
>>> +
>>> +Overview
>>> +========
>>> +
>>> +NFSD has historically always used buffered IO when servicing READ and
>>> +WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
>>> +to override that default to use either DONTCACHE or DIRECT IO modes.
>>> +
>>> +Experimental NFSD debugfs interfaces are available to allow the NFSD IO
>>> +mode used for READ and WRITE to be configured independently. See both:
>>> +- /sys/kernel/debug/nfsd/io_cache_read
>>> +- /sys/kernel/debug/nfsd/io_cache_write
>>> +
>>> +The default value for both io_cache_read and io_cache_write reflects
>>> +NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
>>> +
>>> +Based on the configured settings, NFSD's IO will either be:
>>> +- cached using page cache (NFSD_IO_BUFFERED=0)
>>> +- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
>>> +- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
>>> +- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
>>> +- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
>>> +
>>> +To set an NFSD IO mode, write a supported value (0 - 4) to the
>>> +corresponding IO operation's debugfs interface, e.g.:
>>> + echo 2 > /sys/kernel/debug/nfsd/io_cache_read
>>> + echo 4 > /sys/kernel/debug/nfsd/io_cache_write
>>> +
>>> +To check which IO mode NFSD is using for READ or WRITE, simply read the
>>> +corresponding IO operation's debugfs interface, e.g.:
>>> + cat /sys/kernel/debug/nfsd/io_cache_read
>>> + cat /sys/kernel/debug/nfsd/io_cache_write
>>> +
>>> +NFSD DONTCACHE
>>> +==============
>>> +
>>> +DONTCACHE offers a hybrid approach to servicing IO that aims to offer
>>> +the benefits of using DIRECT IO without any of the strict alignment
>>> +requirements that DIRECT IO imposes. To achieve this buffered IO is used
>>> +but the IO is flagged to "drop behind" (meaning associated pages are
>>> +dropped from the page cache) when IO completes.
>>> +
>>> +DONTCACHE aims to avoid what has proven to be a fairly significant
>>> +limitation of Linux's memory management subsystem if/when large amounts of
>>> +data is infrequently accessed (e.g. read once _or_ written once but not
>>> +read until much later). Such use-cases are particularly problematic
>>> +because the page cache will eventually become a bottleneck to servicing
>>> +new IO requests.
>>> +
>>> +For more context on DONTCACHE, please see these Linux commit headers:
>>> +- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
>>> + to take a struct kiocb")
>>> +- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
>>> + RWF_DONTCACHE")
>>> +- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
>>> +
>>> +If NFSD_IO_DONTCACHE is specified by writing 1 to NFSD's debugfs
>>> +interfaces, FOP_DONTCACHE must be advertised as supported by the
>>> +underlying filesystem (e.g. XFS), otherwise all IO flagged with
>>> +RWF_DONTCACHE will fail with -EOPNOTSUPP.
>>
>> If FOP_DONTCACHE isn't advertised, nfsd doesn't even try RWF_DONTCACHE,
>> so errors don't occur.
>>
>> Maybe:
>>
>> "NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the
>> underlying filesystem doesn't indicate support by setting
>> FOP_DONTCACHE."
>>
>>> +
>>> +NFSD DIRECT
>>> +===========
>>> +
>>> +DIRECT IO doesn't make use of the page cache; as such, it is able to
>>> +avoid the Linux memory management's page reclaim scalability problems
>>> +without resorting to the hybrid use of page cache that DONTCACHE does.
>>> +
>>> +Some workloads benefit from NFSD avoiding the page cache, particularly
>>> +those with a working set that is significantly larger than available
>>> +system memory. The pathological worst-case workload that NFSD DIRECT has
>>> +proven to help most is: NFS client issuing large sequential IO to a file
>>> +that is 2-3 times larger than the NFS server's available system memory.
>>> +The reason for such improvement is that NFSD DIRECT eliminates a lot of work
>>> +that the memory management subsystem would otherwise be required to
>>> +perform (e.g. page allocation, dirty writeback, page reclaim). When
>>> +using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
>>> +time trying to find adequate free pages so that forward IO progress can
>>> +be made.
>>> +
>>> +The performance win associated with using NFSD DIRECT was previously
>>> +discussed on linux-nfs, see:
>>> +https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
>>> +But in summary:
>>> +- NFSD DIRECT can significantly reduce memory requirements
>>> +- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
>>> +- NFSD DIRECT can offer more deterministic IO performance
>>> +
>>> +As always, your mileage may vary and so it is important to carefully
>>> +consider if/when it is beneficial to make use of NFSD DIRECT. When
>>> +assessing comparative performance of your workload please be sure to log
>>> +relevant performance metrics during testing (e.g. memory usage, cpu
>>> +usage, IO performance). Using perf to collect perf data that may be used
>>> +to generate a "flamegraph" for work Linux must perform on behalf of your
>>> +test is a really meaningful way to compare the relative health of the
>>> +system and how switching NFSD's IO mode changes what is observed.
>>> +
>>> +If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
>>> +NFSD's debugfs interfaces, ideally the IO will be aligned relative to
>>> +the underlying block device's logical_block_size. Also the memory buffer
>>> +used to store the READ or WRITE payload must be aligned relative to the
>>> +underlying block device's dma_alignment.
>>> +
>>> +But NFSD DIRECT does handle IO that is misaligned for O_DIRECT as best
>>> +it can:
>>> +
>>> +Misaligned READ:
>>> + If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
>>> + DIO-aligned block (on either end of the READ). The expanded READ is
>>> + verified to have proper offset/len (logical_block_size) and
>>> + dma_alignment checking.
>>> +
>>> + Any misaligned READ that is less than 32K won't be expanded to be
>>> + DIO-aligned (this heuristic just avoids excess work, like allocating
>>> + start_extra_page, for smaller IO that can generally already perform
>>> + well using buffered IO).
>>
>> I couldn't find this 32K in the code.
>>
>> Do we want to say something like:
>>
>> If you experiment with this on a recent kernel and have interesting
>> results, please report them to linux-nfs@vger.kernel.org
>>
>> Thanks,
>> NeilBrown
>>
>
> Thanks for the review, I clearly missed some clean up. Chuck, please
> consider applying this incremental patch which should address Neil's
> feedback and remove some stable_how related changes that aren't
> relevant without my corresponding patch:
>
> diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> index 29b84c9c9e25..e3a522d09766 100644
> --- a/Documentation/filesystems/nfs/nfsd-io-modes.rst
> +++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> @@ -23,19 +23,20 @@ Based on the configured settings, NFSD's IO will either be:
> - cached using page cache (NFSD_IO_BUFFERED=0)
> - cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
> - not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
> -- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
> -- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
>
> -To set an NFSD IO mode, write a supported value (0 - 4) to the
> +To set an NFSD IO mode, write a supported value (0 - 2) to the
> corresponding IO operation's debugfs interface, e.g.:
> echo 2 > /sys/kernel/debug/nfsd/io_cache_read
> - echo 4 > /sys/kernel/debug/nfsd/io_cache_write
> + echo 2 > /sys/kernel/debug/nfsd/io_cache_write
>
> To check which IO mode NFSD is using for READ or WRITE, simply read the
> corresponding IO operation's debugfs interface, e.g.:
> cat /sys/kernel/debug/nfsd/io_cache_read
> cat /sys/kernel/debug/nfsd/io_cache_write
>
> +If you experiment with NFSD's IO modes on a recent kernel and have
> +interesting results, please report them to linux-nfs@vger.kernel.org
> +
> NFSD DONTCACHE
> ==============
>
> @@ -59,10 +60,8 @@ For more context on DONTCACHE, please see these Linux commit headers:
> RWF_DONTCACHE")
> - for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
>
> -If NFSD_IO_DONTCACHE is specified by writing 1 to NFSD's debugfs
> -interfaces, FOP_DONTCACHE must be advertised as supported by the
> -underlying filesystem (e.g. XFS), otherwise all IO flagged with
> -RWF_DONTCACHE will fail with -EOPNOTSUPP.
> +NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
> +filesystem doesn't indicate support by setting FOP_DONTCACHE.
>
> NFSD DIRECT
> ===========
> @@ -115,11 +114,6 @@ Misaligned READ:
> verified to have proper offset/len (logical_block_size) and
> dma_alignment checking.
>
> - Any misaligned READ that is less than 32K won't be expanded to be
> - DIO-aligned (this heuristic just avoids excess work, like allocating
> - start_extra_page, for smaller IO that can generally already perform
> - well using buffered IO).
> -
> Misaligned WRITE:
> If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> middle and end as needed. The large middle segment is DIO-aligned
Thanks, that saves me some time!
--
Chuck Lever
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 3/5] NFSD: Enable return of an updated stable_how to NFS clients
2025-11-06 13:07 ` Christoph Hellwig
@ 2025-11-06 16:30 ` Chuck Lever
0 siblings, 0 replies; 26+ messages in thread
From: Chuck Lever @ 2025-11-06 16:30 UTC (permalink / raw)
To: Christoph Hellwig
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On 11/6/25 8:07 AM, Christoph Hellwig wrote:
> I don't think we need this any more with the current version, or am I
> missing something?
>
Correct, the updated version of 4/5 does not need this change.
I left it in because I agree with Neil's observation that it's a good
change to have anyway. However, the usual practice is to omit changes
that are not directly related to a series, so I will drop this one for
the next revision.
Let's keep a pin in it, though. It might be needed in the future.
--
Chuck Lever
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-06 13:51 ` Christoph Hellwig
2025-11-06 14:45 ` Chuck Lever
@ 2025-11-06 16:48 ` Mike Snitzer
2025-11-06 18:10 ` Chuck Lever
1 sibling, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2025-11-06 16:48 UTC (permalink / raw)
To: Christoph Hellwig
Cc: NeilBrown, Chuck Lever, Jeff Layton, Olga Kornievskaia, Dai Ngo,
Tom Talpey, linux-nfs, Chuck Lever
On Thu, Nov 06, 2025 at 05:51:55AM -0800, Christoph Hellwig wrote:
> On Thu, Nov 06, 2025 at 05:15:45AM -0800, Christoph Hellwig wrote:
> > On Thu, Nov 06, 2025 at 09:11:51PM +1100, NeilBrown wrote:
> > > > +struct nfsd_write_dio_seg {
> > > > + struct iov_iter iter;
> > > > + bool use_dio;
> > >
> > > This is only used to choose which flags to use.
> > > I think it would be neater to have 'flags' here explicitly.
> >
> > Actually, looking at the grand unified patch now (thanks, this is so
> > much easier to review!), we can just do away with the struct entirely.
> > Just have nfsd_write_dio_iters_init return if direct I/O is possible
> > or not, and do a single vfs_iocb_iter_write on the original kiocb/iter
> > if not.
>
> That didn't work out too well, and indeed having flags here seems
> saner.
>
> Chuck, below is an untested incremental patch I did while reviewing
> it. Besides this flags thing, it adds the actual NFSD_IO_DIRECT
> definition that was missing, but otherwise mostly just folds things
> so that we don't need the extra args structure and thus simplifies
> things quite a bit.
>
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index ea87b42894dd..bdb60ee1f1a4 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -157,6 +157,7 @@ enum {
> /* Any new NFSD_IO enum value must be added at the end */
> NFSD_IO_BUFFERED,
> NFSD_IO_DONTCACHE,
> + NFSD_IO_DIRECT,
> };
>
> extern u64 nfsd_io_cache_read __read_mostly;
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index bb94da333d50..8038d8d038f3 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1170,56 +1170,38 @@ static int wait_for_concurrent_writes(struct file *file)
>
> struct nfsd_write_dio_seg {
> struct iov_iter iter;
> - bool use_dio;
> + int flags;
> };
>
> -struct nfsd_write_dio_args {
> - struct nfsd_file *nf;
> - int flags_buffered;
> - int flags_direct;
> - unsigned int nsegs;
> - struct nfsd_write_dio_seg segment[3];
> -};
> -
> -/*
> - * Check if the bvec iterator is aligned for direct I/O.
> - *
> - * bvecs generated from RPC receive buffers are contiguous: After the first
> - * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
> - * Therefore, only the first bvec is checked.
> - */
> -static bool
> -nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
> +static unsigned long iov_iter_bvec_offset(const struct iov_iter *iter)
> {
> - unsigned int addr_mask = nf->nf_dio_mem_align - 1;
> - const struct bio_vec *bvec = i->bvec;
> -
> - return ((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask) == 0;
> + return (unsigned long)(iter->bvec->bv_offset + iter->iov_offset);
> }
>
This ^ being factored out and documented like it was before is better
than the result of this patch, which then spreads the associated
documentation into the caller.
> static void
> nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
> struct bio_vec *bvec, unsigned int nvecs,
> - unsigned long total, size_t start, size_t len)
> + unsigned long total, size_t start, size_t len,
> + struct kiocb *iocb)
> {
> iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
> if (start)
> iov_iter_advance(&segment->iter, start);
> iov_iter_truncate(&segment->iter, len);
> - segment->use_dio = false;
> + segment->flags = iocb->ki_flags;
> }
>
> -static void
> -nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
> - loff_t offset, unsigned long total,
> - struct nfsd_write_dio_args *args)
> +static unsigned int
> +nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
> + unsigned int nvecs, struct kiocb *iocb, unsigned long total,
> + struct nfsd_write_dio_seg segments[3])
> {
> - u32 offset_align = args->nf->nf_dio_offset_align;
> - u32 mem_align = args->nf->nf_dio_mem_align;
> + u32 offset_align = nf->nf_dio_offset_align;
> + u32 mem_align = nf->nf_dio_mem_align;
> + loff_t offset = iocb->ki_pos;
> loff_t prefix_end, orig_end, middle_end;
> size_t prefix, middle, suffix;
> -
> - args->nsegs = 0;
> + unsigned int nsegs = 0;
>
> /*
> * Check if direct I/O is feasible for this write request.
> @@ -1242,87 +1224,80 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
> if (!middle)
> goto no_dio;
>
> - if (prefix) {
> - nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
> - nvecs, total, 0, prefix);
> - ++args->nsegs;
> - }
> + if (prefix)
> + nfsd_write_dio_seg_init(&segments[nsegs++], bvec,
> + nvecs, total, 0, prefix, iocb);
>
> - nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
> - total, prefix, middle);
> - if (!nfsd_iov_iter_aligned_bvec(args->nf,
> - &args->segment[args->nsegs].iter))
> + nfsd_write_dio_seg_init(&segments[nsegs], bvec, nvecs,
> + total, prefix, middle, iocb);
> +
> + /*
> + * Check if the bvec iterator is aligned for direct I/O.
> + *
> + * bvecs generated from RPC receive buffers are contiguous: After the
> + * first bvec, all subsequent bvecs start at bv_offset zero
> + * (page-aligned).
> + * Therefore, only the first bvec is checked.
> + */
> + if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1))
> goto no_dio;
> - args->segment[args->nsegs].use_dio = true;
> - ++args->nsegs;
> + segments[nsegs].flags |= IOCB_DIRECT;
> + nsegs++;
>
> - if (suffix) {
> - nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
> - nvecs, total, prefix + middle, suffix);
> - ++args->nsegs;
> - }
> + if (suffix)
> + nfsd_write_dio_seg_init(&segments[nsegs++], bvec,
> + nvecs, total, prefix + middle, suffix,
> + iocb);
>
> - return;
> + return nsegs;
>
> no_dio:
> /*
> * No DIO alignment possible - pack into single non-DIO segment.
> - * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
> */
> - if (args->nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> - args->flags_buffered |= IOCB_DONTCACHE;
> - nfsd_write_dio_seg_init(&args->segment[0], bvec, nvecs, total,
> - 0, total);
> - args->nsegs = 1;
> + nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total,
> + 0, total, iocb);
> + return 1;
> }
Selectively pushing the flag twiddling out to nfsd_direct_write()
ignores that we don't want to use DONTCACHE for the unaligned
prefix/suffix. Chuck and I discussed this a fair bit 1-2 days ago.
> -static int
> -nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> - struct kiocb *kiocb, unsigned int nvecs,
> - unsigned long *cnt, struct nfsd_write_dio_args *args)
> +static noinline_for_stack int
> +nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + struct nfsd_file *nf, unsigned int nvecs,
> + unsigned long *cnt, struct kiocb *kiocb)
> {
> - struct file *file = args->nf->nf_file;
> + struct file *file = nf->nf_file;
> + struct nfsd_write_dio_seg segments[3];
> + unsigned int nsegs = 0, i;
> ssize_t host_err;
> - unsigned int i;
>
> - nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
> - *cnt, args);
> + nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
> + kiocb, *cnt, segments);
>
> *cnt = 0;
> - for (i = 0; i < args->nsegs; i++) {
> - if (args->segment[i].use_dio) {
> - kiocb->ki_flags = args->flags_direct;
> + for (i = 0; i < nsegs; i++) {
> + kiocb->ki_flags = segments[i].flags;
> + if (kiocb->ki_flags & IOCB_DIRECT) {
> trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
> - args->segment[i].iter.count);
> - } else
> - kiocb->ki_flags = args->flags_buffered;
> + segments[i].iter.count);
> + } else if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE) {
> + /*
> + * IOCB_DONTCACHE preserves the intent of
> + * NFSD_IO_DIRECT.
> + */
> + kiocb->ki_flags |= IOCB_DONTCACHE;
> + }
So this FOP_DONTCACHE branch needs to stay in the no_dio: section of
nfsd_write_dio_iters_init().
>
> - host_err = vfs_iocb_iter_write(file, kiocb,
> - &args->segment[i].iter);
> + host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
> if (host_err < 0)
> return host_err;
> *cnt += host_err;
> - if (host_err < args->segment[i].iter.count)
> + if (host_err < segments[i].iter.count)
> break; /* partial write */
> }
>
> return 0;
> }
>
> -static noinline_for_stack int
> -nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> - struct nfsd_file *nf, unsigned int nvecs,
> - unsigned long *cnt, struct kiocb *kiocb)
> -{
> - struct nfsd_write_dio_args args;
> -
> - args.flags_direct = kiocb->ki_flags | IOCB_DIRECT;
> - args.flags_buffered = kiocb->ki_flags;
> - args.nf = nf;
> -
> - return nfsd_issue_dio_write(rqstp, fhp, kiocb, nvecs, cnt, &args);
> -}
> -
> /**
> * nfsd_vfs_write - write data to an already-open file
> * @rqstp: RPC execution context
Combining nfsd_direct_write and nfsd_issue_dio_write is good. And I
like the intent of removing args, but this first attempt has issues
that can be resolved by keeping the flags setup in
nfsd_write_dio_iters_init().
Mike
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-06 16:48 ` Mike Snitzer
@ 2025-11-06 18:10 ` Chuck Lever
2025-11-06 19:02 ` Mike Snitzer
0 siblings, 1 reply; 26+ messages in thread
From: Chuck Lever @ 2025-11-06 18:10 UTC (permalink / raw)
To: Mike Snitzer, Christoph Hellwig
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On 11/6/25 11:48 AM, Mike Snitzer wrote:
>> no_dio:
>> /*
>> * No DIO alignment possible - pack into single non-DIO segment.
>> - * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
>> */
>> - if (args->nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
>> - args->flags_buffered |= IOCB_DONTCACHE;
>> - nfsd_write_dio_seg_init(&args->segment[0], bvec, nvecs, total,
>> - 0, total);
>> - args->nsegs = 1;
>> + nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total,
>> + 0, total, iocb);
>> + return 1;
>> }
> Selectively pushing the flag twiddling out to nfsd_direct_write()
> ignores that we don't want to use DONTCACHE for the unaligned
> prefix/suffix. Chuck and I discussed this a fair bit 1-2 days ago.
Agreed. I fixed this up after applying Christoph's patch.
--
Chuck Lever
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-06 18:10 ` Chuck Lever
@ 2025-11-06 19:02 ` Mike Snitzer
2025-11-07 13:24 ` Christoph Hellwig
0 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2025-11-06 19:02 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Thu, Nov 06, 2025 at 01:10:17PM -0500, Chuck Lever wrote:
> On 11/6/25 11:48 AM, Mike Snitzer wrote:
> >> no_dio:
> >> /*
> >> * No DIO alignment possible - pack into single non-DIO segment.
> >> - * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
> >> */
> >> - if (args->nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> >> - args->flags_buffered |= IOCB_DONTCACHE;
> >> - nfsd_write_dio_seg_init(&args->segment[0], bvec, nvecs, total,
> >> - 0, total);
> >> - args->nsegs = 1;
> >> + nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total,
> >> + 0, total, iocb);
> >> + return 1;
> >> }
> > Selectively pushing the flag twiddling out to nfsd_direct_write()
> > ignores that we don't want to use DONTCACHE for the unaligned
> > prefix/suffix. Chuck and I discussed this a fair bit 1-2 days ago.
>
> Agreed. I fixed this up after applying Christoph's patch.
OK, here is the incremental I just came up with, but sounds like
you're all set:
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 6fce4f08a4d4..db86b31e0c10 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1272,6 +1272,19 @@ static unsigned long iov_iter_bvec_offset(const struct iov_iter *iter)
return (unsigned long)(iter->bvec->bv_offset + iter->iov_offset);
}
+/*
+ * Check if the bvec iterator is aligned for direct I/O.
+ *
+ * bvecs generated from RPC receive buffers are contiguous: After the first
+ * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
+ * Therefore, only the first bvec is checked.
+ */
+static bool
+nfsd_iov_iter_aligned_bvec(const struct iov_iter *iter, unsigned int addr_mask)
+{
+ return (iov_iter_bvec_offset(iter) & addr_mask) == 0;
+}
+
static void
nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
struct bio_vec *bvec, unsigned int nvecs,
@@ -1324,16 +1337,8 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
nfsd_write_dio_seg_init(&segments[nsegs], bvec, nvecs,
total, prefix, middle, iocb);
-
- /*
- * Check if the bvec iterator is aligned for direct I/O.
- *
- * bvecs generated from RPC receive buffers are contiguous: After the
- * first bvec, all subsequent bvecs start at bv_offset zero
- * (page-aligned).
- * Therefore, only the first bvec is checked.
- */
- if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1))
+ if (!nfsd_iov_iter_aligned_bvec(&segments[nsegs].iter,
+ (mem_align - 1)))
goto no_dio;
segments[nsegs].flags |= IOCB_DIRECT;
nsegs++;
@@ -1348,9 +1353,12 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
no_dio:
/*
* No DIO alignment possible - pack into single non-DIO segment.
+ * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
*/
nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total,
0, total, iocb);
+ if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+ segments[0].flags |= IOCB_DONTCACHE;
return 1;
}
@@ -1370,16 +1378,9 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
*cnt = 0;
for (i = 0; i < nsegs; i++) {
kiocb->ki_flags = segments[i].flags;
- if (kiocb->ki_flags & IOCB_DIRECT) {
+ if (kiocb->ki_flags & IOCB_DIRECT)
trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
segments[i].iter.count);
- } else if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE) {
- /*
- * IOCB_DONTCACHE preserves the intent of
- * NFSD_IO_DIRECT.
- */
- kiocb->ki_flags |= IOCB_DONTCACHE;
- }
host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
if (host_err < 0)
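
And in case the open-coded mask test reads as magic, here is a tiny
userspace sketch of the same power-of-two alignment check; the names
are illustrative, and like the code above it assumes the alignment
value is a power of two (which is how it can be used as a mask):

#include <assert.h>
#include <stdbool.h>

static bool addr_is_aligned(unsigned long addr, unsigned int align)
{
	/* align must be a power of two */
	return (addr & ((unsigned long)align - 1)) == 0;
}

int main(void)
{
	assert(addr_is_aligned(4096, 512));
	assert(!addr_is_aligned(4100, 512));
	return 0;
}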
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-06 19:02 ` Mike Snitzer
@ 2025-11-07 13:24 ` Christoph Hellwig
2025-11-07 14:38 ` Chuck Lever
0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2025-11-07 13:24 UTC (permalink / raw)
To: Mike Snitzer
Cc: Chuck Lever, Christoph Hellwig, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Thu, Nov 06, 2025 at 02:02:22PM -0500, Mike Snitzer wrote:
> > > Selectively pushing the flag twiddling out to nfsd_direct_write()
> > > ignores that we don't want to use DONTCACHE for the unaligned
> > > prefix/suffix. Chuck and I discussed this a fair bit 1-2 days ago.
Please document why. Because honestly it makes zero sense to do that.
But if you insist, at least write a detailed comment why you don't want
that, so if people have to debug that odd decision in the future they
at least know why it was taken.
The rest looks ok to me.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 13:24 ` Christoph Hellwig
@ 2025-11-07 14:38 ` Chuck Lever
2025-11-07 15:24 ` Christoph Hellwig
0 siblings, 1 reply; 26+ messages in thread
From: Chuck Lever @ 2025-11-07 14:38 UTC (permalink / raw)
To: Christoph Hellwig, Mike Snitzer
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On 11/7/25 8:24 AM, Christoph Hellwig wrote:
> On Thu, Nov 06, 2025 at 02:02:22PM -0500, Mike Snitzer wrote:
>>>> Selectively pushing the flag twiddling out to nfsd_direct_write()
>>>> ignores that we don't want to use DONTCACHE for the unaligned
>>>> prefix/suffix. Chuck and I discussed this a fair bit 1-2 days ago.
>
> Please document why. Because honestly it makes zero sense to do that.
There is a slow-down. But it's not like using DONTCACHE on the unaligned
ends results in a regression, technically, since this is a brand new
feature.
So for an UNSTABLE unaligned WRITE in NFSD_IO_DIRECT mode, the unaligned
ends would go through the page cache, and would not be durable until
the subsequent COMMIT. Using DONTCACHE for the end segments adds a
penalty, and no durability; but what is the value we get for that?
> But if you insist, at least write a detailed comment why you don't want
> that, so if people have to debug that odd decision in the future they
> at least know why it was taken.
>
> The rest looks ok to me.
>
--
Chuck Lever
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 14:38 ` Chuck Lever
@ 2025-11-07 15:24 ` Christoph Hellwig
2025-11-07 15:26 ` Chuck Lever
0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2025-11-07 15:24 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, Mike Snitzer, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Fri, Nov 07, 2025 at 09:38:43AM -0500, Chuck Lever wrote:
> On 11/7/25 8:24 AM, Christoph Hellwig wrote:
> > On Thu, Nov 06, 2025 at 02:02:22PM -0500, Mike Snitzer wrote:
> >>>> Selectively pushing the flag twiddling out to nfsd_direct_write()
> >>>> ignores that we don't want to use DONTCACHE for the unaligned
> >>>> prefix/suffix. Chuck and I discussed this a fair bit 1-2 days ago.
> >
> > Please document why. Because honestly it makes zero sense to do that.
>
> There is a slow-down. But it's not like using DONTCACHE on the unaligned
> ends results in a regression, technically, since this is a brand new
> feature.
>
> So for an UNSTABLE unaligned WRITE in NFSD_IO_DIRECT mode, the unaligned
> ends would go through the page cache, and would not be durable until
> the subsequent COMMIT. Using DONTCACHE for the end segments adds a
> penalty, and no durability; but what is the value we get for that?
Less cache pollution, and especially fewer conflicts with other direct
I/O writes. Do you have a link to results with the slowdown? How much
is it?
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 15:24 ` Christoph Hellwig
@ 2025-11-07 15:26 ` Chuck Lever
0 siblings, 0 replies; 26+ messages in thread
From: Chuck Lever @ 2025-11-07 15:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mike Snitzer, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
Tom Talpey, linux-nfs, Chuck Lever
On 11/7/25 10:24 AM, Christoph Hellwig wrote:
> On Fri, Nov 07, 2025 at 09:38:43AM -0500, Chuck Lever wrote:
>> On 11/7/25 8:24 AM, Christoph Hellwig wrote:
>>> On Thu, Nov 06, 2025 at 02:02:22PM -0500, Mike Snitzer wrote:
>>>>>> Selectively pushing the flag twiddling out to nfsd_direct_write()
>>>>>> ignores that we don't want to use DONTCACHE for the unaligned
>>>>>> prefix/suffix. Chuck and I discussed this a fair bit 1-2 days ago.
>>>
>>> Please document why. Because honestly it makes zero sense to do that.
>>
>> There is a slow-down. But it's not like using DONTCACHE on the unaligned
>> ends results in a regression, technically, since this is a brand new
>> feature.
>>
>> So for an UNSTABLE unaligned WRITE in NFSD_IO_DIRECT mode, the unaligned
>> ends would go through the page cache, and would not be durable until
>> the subsequent COMMIT. Using DONTCACHE for the end segments adds a
>> penalty, and no durability; but what is the value we get for that?
>
> Less cache pollution, and especially less conflicts with other direct
> I/O writes. Do you have a link to results with the slowdown? How much
> is it?
>
https://lore.kernel.org/linux-nfs/aQrsWnHK1ny3Md9g@kernel.org/
--
Chuck Lever
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread

Thread overview: 26+ messages
2025-11-05 19:28 [PATCH v10 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-05 19:28 ` [PATCH v10 1/5] NFSD: don't start nfsd if sv_permsocks is empty Chuck Lever
2025-11-05 19:31 ` Chuck Lever
2025-11-05 19:28 ` [PATCH v10 2/5] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-11-06 0:55 ` NeilBrown
2025-11-06 13:05 ` Christoph Hellwig
2025-11-05 19:28 ` [PATCH v10 3/5] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-11-06 13:07 ` Christoph Hellwig
2025-11-06 16:30 ` Chuck Lever
2025-11-05 19:28 ` [PATCH v10 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-06 10:11 ` NeilBrown
2025-11-06 13:15 ` Christoph Hellwig
2025-11-06 13:51 ` Christoph Hellwig
2025-11-06 14:45 ` Chuck Lever
2025-11-06 14:49 ` Christoph Hellwig
2025-11-06 16:48 ` Mike Snitzer
2025-11-06 18:10 ` Chuck Lever
2025-11-06 19:02 ` Mike Snitzer
2025-11-07 13:24 ` Christoph Hellwig
2025-11-07 14:38 ` Chuck Lever
2025-11-07 15:24 ` Christoph Hellwig
2025-11-07 15:26 ` Chuck Lever
2025-11-05 19:28 ` [PATCH v10 5/5] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
2025-11-06 10:24 ` NeilBrown
2025-11-06 15:46 ` Mike Snitzer
2025-11-06 15:52 ` Chuck Lever