* [PATCH v9 01/12] NFSD: Make FILE_SYNC WRITEs comply with spec
2025-11-03 16:53 [PATCH v9 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-11-03 16:53 ` Chuck Lever
2025-11-03 22:14 ` NeilBrown
2025-11-03 16:53 ` [PATCH v9 02/12] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
` (10 subsequent siblings)
11 siblings, 1 reply; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 16:53 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Mike Snitzer
From: Chuck Lever <chuck.lever@oracle.com>
Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
does not also persist file time stamps. To wit, Section 18.32.3
of RFC 8881 mandates:
> The client specifies with the stable parameter the method of how
> the data is to be processed by the server. If stable is
> FILE_SYNC4, the server MUST commit the data written plus all file
> system metadata to stable storage before returning results. This
> corresponds to the NFSv2 protocol semantics. Any other behavior
> constitutes a protocol violation. If stable is DATA_SYNC4, then
> the server MUST commit all of the data to stable storage and
> enough of the metadata to retrieve the data before returning.
For many years, NFSD has used a "data sync only" optimization for
FILE_SYNC WRITEs, in violation of the above text (and previous
incarnations of the NFS standard). File time stamps haven't been
forced to be persisted as the mandate above requires.
For many in-tree Linux file systems, time stamps, as well as other
metadata not needed to find the data on disk, are piggy-backed on
fdatasync-mandatory metadata updates. Thus, this change should
affect only the case where previously fully-allocated file data
is overwritten.
Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f537a7b4ee01..5333d49910d9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1314,8 +1314,18 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
stable = NFS_UNSTABLE;
init_sync_kiocb(&kiocb, file);
kiocb.ki_pos = offset;
- if (stable && !fhp->fh_use_wgather)
- kiocb.ki_flags |= IOCB_DSYNC;
+ if (likely(!fhp->fh_use_wgather)) {
+ switch (stable) {
+ case NFS_FILE_SYNC:
+ /* persist data and timestamps */
+ kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
+ break;
+ case NFS_DATA_SYNC:
+ /* persist data only */
+ kiocb.ki_flags |= IOCB_DSYNC;
+ break;
+ }
+ }
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
--
2.51.0
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH v9 01/12] NFSD: Make FILE_SYNC WRITEs comply with spec
2025-11-03 16:53 ` [PATCH v9 01/12] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
@ 2025-11-03 22:14 ` NeilBrown
0 siblings, 0 replies; 37+ messages in thread
From: NeilBrown @ 2025-11-03 22:14 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever, Mike Snitzer
On Tue, 04 Nov 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
> does not also persist file time stamps. To wit, Section 18.32.3
> of RFC 8881 mandates:
>
> > The client specifies with the stable parameter the method of how
> > the data is to be processed by the server. If stable is
> > FILE_SYNC4, the server MUST commit the data written plus all file
> > system metadata to stable storage before returning results. This
> > corresponds to the NFSv2 protocol semantics. Any other behavior
> > constitutes a protocol violation. If stable is DATA_SYNC4, then
> > the server MUST commit all of the data to stable storage and
> > enough of the metadata to retrieve the data before returning.
>
> For many years, NFSD has used a "data sync only" optimization for
> FILE_SYNC WRITEs, in violation of the above text (and previous
> incarnations of the NFS standard). File time stamps haven't been
> forced to be persisted as the mandate above requires.
I think that is "For the last 5 months..."
Fixes: 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()")
That patch replaced
- flags |= RWF_SYNC;
with
+ kiocb.ki_flags |= IOCB_DSYNC;
which looks right given
if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;
in kiocb_set_rw_flags().
However that ignores the previous line
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
where RWF_SUPPORTED contains RWF_SYNC, and RWF_SYNC is the same bit
as IOCB_SYNC.
NeilBrown
>
> For many in-tree Linux file systems, time stamps, as well as other
> metadata not needed to find the data on disk, are piggy-backed on
> fdatasync-mandatory metadata updates. Thus, this change should
> affect only the case where previously fully-allocated file data
> is overwritten.
>
> Reported-by: Mike Snitzer <snitzer@kernel.org>
> Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index f537a7b4ee01..5333d49910d9 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1314,8 +1314,18 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> stable = NFS_UNSTABLE;
> init_sync_kiocb(&kiocb, file);
> kiocb.ki_pos = offset;
> - if (stable && !fhp->fh_use_wgather)
> - kiocb.ki_flags |= IOCB_DSYNC;
> + if (likely(!fhp->fh_use_wgather)) {
> + switch (stable) {
> + case NFS_FILE_SYNC:
> + /* persist data and timestamps */
> + kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
> + break;
> + case NFS_DATA_SYNC:
> + /* persist data only */
> + kiocb.ki_flags |= IOCB_DSYNC;
> + break;
> + }
> + }
>
> nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> --
> 2.51.0
>
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v9 02/12] NFSD: Enable return of an updated stable_how to NFS clients
2025-11-03 16:53 [PATCH v9 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-03 16:53 ` [PATCH v9 01/12] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
@ 2025-11-03 16:53 ` Chuck Lever
2025-11-03 16:53 ` [PATCH v9 03/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (9 subsequent siblings)
11 siblings, 0 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 16:53 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Christoph Hellwig
From: Chuck Lever <chuck.lever@oracle.com>
NFSv3 and newer protocols enable clients to perform a two-phase
WRITE. A client requests an UNSTABLE WRITE, which sends dirty data
to the NFS server, but does not persist it. The server replies that
it performed the UNSTABLE WRITE, and the client is then obligated to
follow up with a COMMIT request before it can remove the dirty data
from its own page cache. The COMMIT reply is the client's guarantee
that the written data has been persisted on the server.
The purpose of this protocol design is to enable clients to send
a large amount of data via multiple WRITE requests to a server, and
then wait for persistence just once. The server is able to start
persisting the data as soon as it gets it, to shorten the length of
time the client has to wait for the final COMMIT to complete.
It's also possible for the server to respond to an UNSTABLE WRITE
request in a way that indicates that the data was persisted anyway.
In that case, the client can skip the COMMIT and remove the dirty
data from its memory immediately. NetApp filers, for example, do
this because they have a battery-backed cache and can guarantee that
written data is persisted quickly and immediately.
NFSD has never implemented this kind of promotion. UNSTABLE WRITE
requests are unconditionally treated as UNSTABLE. However, in a
subsequent patch, nfsd_vfs_write() will be able to promote an
UNSTABLE WRITE to be a FILE_SYNC WRITE. This will be because NFSD
will handle some WRITE requests locally with O_DIRECT, which
persists written data immediately. The FILE_SYNC WRITE response
indicates to the client that no follow-up COMMIT is necessary.
This patch prepares for that change by enabling the modified
stable_how value to be passed along to NFSD's WRITE reply encoder.
No behavior change is expected.
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfs3proc.c | 2 +-
fs/nfsd/nfs4proc.c | 2 +-
fs/nfsd/nfsproc.c | 3 ++-
fs/nfsd/vfs.c | 11 ++++++-----
fs/nfsd/vfs.h | 6 ++++--
fs/nfsd/xdr3.h | 2 +-
6 files changed, 15 insertions(+), 11 deletions(-)
diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index b6d03e1ef5f7..ad14b34583bb 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -236,7 +236,7 @@ nfsd3_proc_write(struct svc_rqst *rqstp)
resp->committed = argp->stable;
resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
&argp->payload, &cnt,
- resp->committed, resp->verf);
+ &resp->committed, resp->verf);
resp->count = cnt;
resp->status = nfsd3_map_status(resp->status);
return rpc_success;
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 7f7e6bb23a90..2222bb283baf 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1285,7 +1285,7 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
write->wr_how_written = write->wr_stable_how;
status = nfsd_vfs_write(rqstp, &cstate->current_fh, nf,
write->wr_offset, &write->wr_payload,
- &cnt, write->wr_how_written,
+ &cnt, &write->wr_how_written,
(__be32 *)write->wr_verifier.data);
nfsd_file_put(nf);
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 8f71f5748c75..706401ed6f8d 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -251,6 +251,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
struct nfsd_writeargs *argp = rqstp->rq_argp;
struct nfsd_attrstat *resp = rqstp->rq_resp;
unsigned long cnt = argp->len;
+ u32 committed = NFS_DATA_SYNC;
dprintk("nfsd: WRITE %s %u bytes at %d\n",
SVCFH_fmt(&argp->fh),
@@ -258,7 +259,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
fh_copy(&resp->fh, &argp->fh);
resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
- &argp->payload, &cnt, NFS_DATA_SYNC, NULL);
+ &argp->payload, &cnt, &committed, NULL);
if (resp->status == nfs_ok)
resp->status = fh_getattr(&resp->fh, &resp->stat);
else if (resp->status == nfserr_jukebox)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5333d49910d9..f3be36b960e5 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1262,7 +1262,7 @@ static int wait_for_concurrent_writes(struct file *file)
* @offset: Byte offset of start
* @payload: xdr_buf containing the write payload
* @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
* @verf: NFS WRITE verifier
*
* Upon return, caller must invoke fh_put on @fhp.
@@ -1274,11 +1274,12 @@ __be32
nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct nfsd_file *nf, loff_t offset,
const struct xdr_buf *payload, unsigned long *cnt,
- int stable, __be32 *verf)
+ u32 *stable_how, __be32 *verf)
{
struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
struct file *file = nf->nf_file;
struct super_block *sb = file_inode(file)->i_sb;
+ u32 stable = *stable_how;
struct kiocb kiocb;
struct svc_export *exp;
struct iov_iter iter;
@@ -1444,7 +1445,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* @offset: Byte offset of start
* @payload: xdr_buf containing the write payload
* @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
* @verf: NFS WRITE verifier
*
* Upon return, caller must invoke fh_put on @fhp.
@@ -1454,7 +1455,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
*/
__be32
nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
- const struct xdr_buf *payload, unsigned long *cnt, int stable,
+ const struct xdr_buf *payload, unsigned long *cnt, u32 *stable_how,
__be32 *verf)
{
struct nfsd_file *nf;
@@ -1467,7 +1468,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
goto out;
err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt,
- stable, verf);
+ stable_how, verf);
nfsd_file_put(nf);
out:
trace_nfsd_write_done(rqstp, fhp, offset, *cnt);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index fa46f8b5f132..c713ed0b04e0 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -130,11 +130,13 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
u32 *eof);
__be32 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
loff_t offset, const struct xdr_buf *payload,
- unsigned long *cnt, int stable, __be32 *verf);
+ unsigned long *cnt, u32 *stable_how,
+ __be32 *verf);
__be32 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct nfsd_file *nf, loff_t offset,
const struct xdr_buf *payload,
- unsigned long *cnt, int stable, __be32 *verf);
+ unsigned long *cnt, u32 *stable_how,
+ __be32 *verf);
__be32 nfsd_readlink(struct svc_rqst *, struct svc_fh *,
char *, int *);
__be32 nfsd_symlink(struct svc_rqst *, struct svc_fh *,
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 522067b7fd75..c0e443ef3a6b 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -152,7 +152,7 @@ struct nfsd3_writeres {
__be32 status;
struct svc_fh fh;
unsigned long count;
- int committed;
+ u32 committed;
__be32 verf[2];
};
--
2.51.0
^ permalink raw reply related [flat|nested] 37+ messages in thread* [PATCH v9 03/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-03 16:53 [PATCH v9 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-03 16:53 ` [PATCH v9 01/12] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-11-03 16:53 ` [PATCH v9 02/12] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
@ 2025-11-03 16:53 ` Chuck Lever
2025-11-03 16:53 ` [PATCH v9 04/12] NFSD: Remove specific error handling Chuck Lever
` (8 subsequent siblings)
11 siblings, 0 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 16:53 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer
From: Mike Snitzer <snitzer@kernel.org>
If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
middle and end as needed. The large middle extent is DIO-aligned and
the start and/or end are misaligned. Synchronous buffered IO (with
preference towards using DONTCACHE) is used for the misaligned extents
and O_DIRECT is used for the middle DIO-aligned extent.
nfsd_issue_write_dio() promotes @stable_how to NFS_FILE_SYNC, which
allows the client to drop its dirty data and avoid needing an extra
COMMIT operation.
If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
invalidate the page cache on behalf of the DIO WRITE, then
nfsd_issue_write_dio() will fall back to using buffered IO.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/debugfs.c | 1 +
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 218 ++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 215 insertions(+), 5 deletions(-)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 00eb1ecef6ac..7f44689e0a53 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
switch (val) {
case NFSD_IO_BUFFERED:
case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
nfsd_io_cache_write = val;
break;
default:
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index bfd41236aff2..ad74439d0105 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
DEFINE_NFSD_IO_EVENT(read_done);
DEFINE_NFSD_IO_EVENT(write_start);
DEFINE_NFSD_IO_EVENT(write_opened);
+DEFINE_NFSD_IO_EVENT(write_direct);
DEFINE_NFSD_IO_EVENT(write_io_done);
DEFINE_NFSD_IO_EVENT(write_done);
DEFINE_NFSD_IO_EVENT(commit_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f3be36b960e5..6cb0efd8ecdd 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,210 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
+struct nfsd_write_dio {
+ ssize_t start_len; /* Length for misaligned first extent */
+ ssize_t middle_len; /* Length for DIO-aligned middle extent */
+ ssize_t end_len; /* Length for misaligned last extent */
+};
+
+static bool
+nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
+ struct nfsd_file *nf, struct nfsd_write_dio *write_dio)
+{
+ const u32 dio_blocksize = nf->nf_dio_offset_align;
+ loff_t start_end, orig_end, middle_end;
+
+ if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
+ return false;
+ if (unlikely(dio_blocksize > PAGE_SIZE))
+ return false;
+ if (unlikely(len < dio_blocksize))
+ return false;
+
+ start_end = round_up(offset, dio_blocksize);
+ orig_end = offset + len;
+ middle_end = round_down(orig_end, dio_blocksize);
+
+ write_dio->start_len = start_end - offset;
+ write_dio->middle_len = middle_end - start_end;
+ write_dio->end_len = orig_end - middle_end;
+
+ return true;
+}
+
+static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
+ unsigned int addr_mask, unsigned int len_mask)
+{
+ const struct bio_vec *bvec = i->bvec;
+ size_t skip = i->iov_offset;
+ size_t size = i->count;
+
+ if (size & len_mask)
+ return false;
+ do {
+ size_t len = bvec->bv_len;
+
+ if (len > size)
+ len = size;
+ if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
+ return false;
+ bvec++;
+ size -= len;
+ skip = 0;
+ } while (size);
+
+ return true;
+}
+
+/*
+ * Setup as many as 3 iov_iter based on extents described by @write_dio.
+ * Returns the number of iov_iter that were setup.
+ */
+static int
+nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
+ struct bio_vec *rq_bvec, unsigned int nvecs,
+ unsigned long cnt, struct nfsd_write_dio *write_dio,
+ struct nfsd_file *nf)
+{
+ int n_iters = 0;
+ struct iov_iter *iters = *iterp;
+
+ /* Setup misaligned start? */
+ if (write_dio->start_len) {
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ iters[n_iters].count = write_dio->start_len;
+ iter_is_dio_aligned[n_iters] = false;
+ ++n_iters;
+ }
+
+ /* Setup DIO-aligned middle */
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ if (write_dio->start_len)
+ iov_iter_advance(&iters[n_iters], write_dio->start_len);
+ iters[n_iters].count -= write_dio->end_len;
+ iter_is_dio_aligned[n_iters] =
+ nfsd_iov_iter_aligned_bvec(&iters[n_iters],
+ nf->nf_dio_mem_align-1, nf->nf_dio_offset_align-1);
+ if (unlikely(!iter_is_dio_aligned[n_iters]))
+ return 0; /* no DIO-aligned IO possible */
+ ++n_iters;
+
+ /* Setup misaligned end? */
+ if (write_dio->end_len) {
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ iov_iter_advance(&iters[n_iters],
+ write_dio->start_len + write_dio->middle_len);
+ iter_is_dio_aligned[n_iters] = false;
+ ++n_iters;
+ }
+
+ return n_iters;
+}
+
+static int
+nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
+ unsigned int nvecs, unsigned long *cnt,
+ struct kiocb *kiocb)
+{
+ struct iov_iter iter;
+ int host_err;
+
+ iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+ host_err = vfs_iocb_iter_write(file, kiocb, &iter);
+ if (host_err < 0)
+ return host_err;
+ *cnt = host_err;
+
+ return 0;
+}
+
+static int
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
+ u32 *stable_how, unsigned int nvecs, unsigned long *cnt,
+ struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
+{
+ struct file *file = nf->nf_file;
+ bool iter_is_dio_aligned[3];
+ struct iov_iter iter_stack[3];
+ struct iov_iter *iter = iter_stack;
+ unsigned int n_iters = 0;
+ unsigned long in_count = *cnt;
+ loff_t in_offset = kiocb->ki_pos;
+ ssize_t host_err;
+
+ n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
+ rqstp->rq_bvec, nvecs, *cnt, write_dio, nf);
+ if (unlikely(!n_iters))
+ return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
+
+ trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
+
+ /*
+ * Any buffered IO issued here will be misaligned, use
+ * sync IO to ensure it has completed before returning.
+ * Also update @stable_how to avoid need for COMMIT.
+ */
+ kiocb->ki_flags |= (IOCB_DSYNC|IOCB_SYNC);
+ *stable_how = NFS_FILE_SYNC;
+
+ *cnt = 0;
+ for (int i = 0; i < n_iters; i++) {
+ if (iter_is_dio_aligned[i])
+ kiocb->ki_flags |= IOCB_DIRECT;
+ else
+ kiocb->ki_flags &= ~IOCB_DIRECT;
+
+ host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
+ if (host_err < 0) {
+ /*
+ * VFS will return -ENOTBLK if DIO WRITE fails to
+ * invalidate the page cache. Retry using buffered IO.
+ */
+ if (unlikely(host_err == -ENOTBLK)) {
+ kiocb->ki_flags &= ~IOCB_DIRECT;
+ *cnt = in_count;
+ kiocb->ki_pos = in_offset;
+ return nfsd_buffered_write(rqstp, file,
+ nvecs, cnt, kiocb);
+ } else if (unlikely(host_err == -EINVAL)) {
+ struct inode *inode = d_inode(fhp->fh_dentry);
+
+ pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
+ inode->i_sb->s_id, inode->i_ino);
+ host_err = -ESERVERFAULT;
+ }
+ return host_err;
+ }
+ *cnt += host_err;
+ if (host_err < iter[i].count) /* partial write? */
+ break;
+ }
+
+ return 0;
+}
+
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
+ unsigned long *cnt, struct kiocb *kiocb)
+{
+ struct nfsd_write_dio write_dio;
+
+ /*
+ * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
+ * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
+ * be ignored for any DIO issued here).
+ */
+ if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+ kiocb->ki_flags |= IOCB_DONTCACHE;
+
+ if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
+ return nfsd_issue_write_dio(rqstp, fhp, nf, stable_how, nvecs,
+ cnt, kiocb, &write_dio);
+
+ return nfsd_buffered_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
+}
+
/**
* nfsd_vfs_write - write data to an already-open file
* @rqstp: RPC execution context
@@ -1282,7 +1486,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
u32 stable = *stable_how;
struct kiocb kiocb;
struct svc_export *exp;
- struct iov_iter iter;
errseq_t since;
__be32 nfserr;
int host_err;
@@ -1329,25 +1532,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
- iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
switch (nfsd_io_cache_write) {
- case NFSD_IO_BUFFERED:
+ case NFSD_IO_DIRECT:
+ host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
+ nvecs, cnt, &kiocb);
+ stable = *stable_how;
break;
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags |= IOCB_DONTCACHE;
+ fallthrough; /* must call nfsd_buffered_write */
+ case NFSD_IO_BUFFERED:
+ host_err = nfsd_buffered_write(rqstp, file,
+ nvecs, cnt, &kiocb);
break;
}
- host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
if (host_err < 0) {
commit_reset_write_verifier(nn, rqstp, host_err);
goto out_nfserr;
}
- *cnt = host_err;
nfsd_stats_io_write_add(nn, exp, *cnt);
fsnotify_modify(file);
host_err = filemap_check_wb_err(file->f_mapping, since);
--
2.51.0
^ permalink raw reply related [flat|nested] 37+ messages in thread* [PATCH v9 04/12] NFSD: Remove specific error handling
2025-11-03 16:53 [PATCH v9 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (2 preceding siblings ...)
2025-11-03 16:53 ` [PATCH v9 03/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-11-03 16:53 ` Chuck Lever
2025-11-03 21:17 ` Mike Snitzer
2025-11-03 16:53 ` [PATCH v9 05/12] NFSD: Remove alignment size checking Chuck Lever
` (7 subsequent siblings)
11 siblings, 1 reply; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 16:53 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Christoph Hellwig
From: Chuck Lever <chuck.lever@oracle.com>
1. Christoph notes that ENOTBLK is not supposed to leak out of file
systems, so it's unlikely or impossible to see that error code
here.
2. There are several ways to get EINVAL on a write. The least likely
of those is a dio alignment problem. The warning here would be
misleading in those more common cases.
It's unlikely that an administrator can do anything about either
of these cases, should they appear on a production system.
The trace_nfsd_write_done event will record these errnos when they
occur.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 20 +-------------------
1 file changed, 1 insertion(+), 19 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 6cb0efd8ecdd..30094d8f489e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1408,26 +1408,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_fil
kiocb->ki_flags &= ~IOCB_DIRECT;
host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
- if (host_err < 0) {
- /*
- * VFS will return -ENOTBLK if DIO WRITE fails to
- * invalidate the page cache. Retry using buffered IO.
- */
- if (unlikely(host_err == -ENOTBLK)) {
- kiocb->ki_flags &= ~IOCB_DIRECT;
- *cnt = in_count;
- kiocb->ki_pos = in_offset;
- return nfsd_buffered_write(rqstp, file,
- nvecs, cnt, kiocb);
- } else if (unlikely(host_err == -EINVAL)) {
- struct inode *inode = d_inode(fhp->fh_dentry);
-
- pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
- inode->i_sb->s_id, inode->i_ino);
- host_err = -ESERVERFAULT;
- }
+ if (host_err < 0)
return host_err;
- }
*cnt += host_err;
if (host_err < iter[i].count) /* partial write? */
break;
--
2.51.0
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH v9 04/12] NFSD: Remove specific error handling
2025-11-03 16:53 ` [PATCH v9 04/12] NFSD: Remove specific error handling Chuck Lever
@ 2025-11-03 21:17 ` Mike Snitzer
0 siblings, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2025-11-03 21:17 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever, Christoph Hellwig
On Mon, Nov 03, 2025 at 11:53:43AM -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> 1. Christoph notes that ENOTBLK is not supposed to leak out of file
> systems, so it's unlikely or impossible to see that error code
> here.
>
> 2. There are several ways to get EINVAL on a write. The least likely
> of those is a dio alignment problem. The warning here would be
> misleading in those more common cases.
>
> It's unlikely that an administrator can do anything about either
> of these cases, should they appear on a production system.
>
> The trace_nfsd_write_done event will record these errnos when they
> occur.
>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v9 05/12] NFSD: Remove alignment size checking
2025-11-03 16:53 [PATCH v9 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (3 preceding siblings ...)
2025-11-03 16:53 ` [PATCH v9 04/12] NFSD: Remove specific error handling Chuck Lever
@ 2025-11-03 16:53 ` Chuck Lever
2025-11-03 21:16 ` Mike Snitzer
2025-11-03 22:30 ` NeilBrown
2025-11-03 16:53 ` [PATCH v9 06/12] NFSD: Clean up struct nfsd_write_dio Chuck Lever
` (6 subsequent siblings)
11 siblings, 2 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 16:53 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Christoph Hellwig
From: Chuck Lever <chuck.lever@oracle.com>
Noted that the NFS client's LOCALIO code still retains this check.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 30094d8f489e..80bc105eb0b6 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1269,8 +1269,6 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
return false;
- if (unlikely(dio_blocksize > PAGE_SIZE))
- return false;
if (unlikely(len < dio_blocksize))
return false;
--
2.51.0
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH v9 05/12] NFSD: Remove alignment size checking
2025-11-03 16:53 ` [PATCH v9 05/12] NFSD: Remove alignment size checking Chuck Lever
@ 2025-11-03 21:16 ` Mike Snitzer
2025-11-03 22:30 ` NeilBrown
1 sibling, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2025-11-03 21:16 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever, Christoph Hellwig
On Mon, Nov 03, 2025 at 11:53:44AM -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Noted that the NFS client's LOCALIO code still retains this check.
>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v9 05/12] NFSD: Remove alignment size checking
2025-11-03 16:53 ` [PATCH v9 05/12] NFSD: Remove alignment size checking Chuck Lever
2025-11-03 21:16 ` Mike Snitzer
@ 2025-11-03 22:30 ` NeilBrown
2025-11-04 11:52 ` Christoph Hellwig
1 sibling, 1 reply; 37+ messages in thread
From: NeilBrown @ 2025-11-03 22:30 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever, Christoph Hellwig
On Tue, 04 Nov 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Noted that the NFS client's LOCALIO code still retains this check.
It might be good to capture here *why* the check is removed.
Is it because alignments never exceed PAGE_SIZE, or because the code is
quite capable of handling larger alignments
(I haven't been following the conversation closely..)
NeilBrown
>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 30094d8f489e..80bc105eb0b6 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1269,8 +1269,6 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
>
> if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
> return false;
> - if (unlikely(dio_blocksize > PAGE_SIZE))
> - return false;
> if (unlikely(len < dio_blocksize))
> return false;
>
> --
> 2.51.0
>
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v9 05/12] NFSD: Remove alignment size checking
2025-11-03 22:30 ` NeilBrown
@ 2025-11-04 11:52 ` Christoph Hellwig
2025-11-04 14:14 ` Chuck Lever
0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-04 11:52 UTC (permalink / raw)
To: NeilBrown
Cc: Chuck Lever, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On Tue, Nov 04, 2025 at 09:30:25AM +1100, NeilBrown wrote:
> On Tue, 04 Nov 2025, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> >
> > Noted that the NFS client's LOCALIO code still retains this check.
>
> It might be good to capture here *why* the check is removed.
> Is it because alignments never exceed PAGE_SIZE, or because the code is
> quite capable of handling larger alignments
> (I haven't been following the conversation closely..)
I'm still trying to understand why it was added in the first place :)
But I'm also completely lost in the maze of fixup patches.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v9 05/12] NFSD: Remove alignment size checking
2025-11-04 11:52 ` Christoph Hellwig
@ 2025-11-04 14:14 ` Chuck Lever
2025-11-04 15:54 ` Mike Snitzer
2025-11-05 12:52 ` Christoph Hellwig
0 siblings, 2 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-04 14:14 UTC (permalink / raw)
To: Christoph Hellwig, NeilBrown
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever
On 11/4/25 6:52 AM, Christoph Hellwig wrote:
> On Tue, Nov 04, 2025 at 09:30:25AM +1100, NeilBrown wrote:
>> On Tue, 04 Nov 2025, Chuck Lever wrote:
>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>
>>> Noted that the NFS client's LOCALIO code still retains this check.
>>
>> It might be good to capture here *why* the check is removed.
>> Is it because alignments never exceed PAGE_SIZE, or because the code is
>> quite capable of handling larger alignments
>> (I haven't been following the conversation closely..)
>
> I'm still trying to understand why it was added in the first place :)
I'm trying to understand what action you'd like me to take. Should I
drop this patch?
> But I'm also completely lost in the maze of fixup patches.
Several people have asked me to collapse the fix-ups into a single
patch. We would lose some history and attributions doing that. Does
anyone have other thoughts?
--
Chuck Lever
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v9 05/12] NFSD: Remove alignment size checking
2025-11-04 14:14 ` Chuck Lever
@ 2025-11-04 15:54 ` Mike Snitzer
2025-11-05 12:52 ` Christoph Hellwig
1 sibling, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2025-11-04 15:54 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Tue, Nov 04, 2025 at 09:14:09AM -0500, Chuck Lever wrote:
> On 11/4/25 6:52 AM, Christoph Hellwig wrote:
> > On Tue, Nov 04, 2025 at 09:30:25AM +1100, NeilBrown wrote:
> >> On Tue, 04 Nov 2025, Chuck Lever wrote:
> >>> From: Chuck Lever <chuck.lever@oracle.com>
> >>>
> >>> Noted that the NFS client's LOCALIO code still retains this check.
> >>
> >> It might be good to capture here *why* the check is removed.
> >> Is it because alignments never exceed PAGE_SIZE, or because the code is
> >> quite capable of handling larger alignments
> >> (I haven't been following the conversation closely..)
> >
> > I'm still trying to understand why it was added in the first place :)
>
> I'm trying to understand what action you'd like me to take. Should I
> drop this patch?
Maybe just fold it into Patch 3 and we can forget it ever happened?
The reason it was added is this NFSD_IO_DIRECT feature has had many
iterations, and in the earlier days on its development I was
allocating extra _pages_ to accomodate the misaligned head and tail
pages _outside_ of the sunrpc provided buffers. This started with
READ and copy-n-paste brought it to the WRITE side IIRC.
> > But I'm also completely lost in the maze of fixup patches.
> Several people have asked me to collapse the fix-ups into a single
> patch. We would lose some history and attributions doing that. Does
> anyone have other thoughts?
The series has a mix of renames that distract, but overall what
matters most is where the code ends up. I don't have a problem with
the various changes when I look at the result (and because I've
acclimated to the evolution). Probably best to just leave well enough
alone given Jeff, myself and Neil have provided various Reviewed-by
Mike
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v9 05/12] NFSD: Remove alignment size checking
2025-11-04 14:14 ` Chuck Lever
2025-11-04 15:54 ` Mike Snitzer
@ 2025-11-05 12:52 ` Christoph Hellwig
2025-11-05 14:38 ` Chuck Lever
1 sibling, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-05 12:52 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Tue, Nov 04, 2025 at 09:14:09AM -0500, Chuck Lever wrote:
> >> It might be good to capture here *why* the check is removed.
> >> Is it because alignments never exceed PAGE_SIZE, or because the code is
> >> quite capable of handling larger alignments
> >> (I haven't been following the conversation closely..)
> >
> > I'm still trying to understand why it was added in the first place :)
>
> I'm trying to understand what action you'd like me to take. Should I
> drop this patch?
With "it" I meant the check. І think Mike explain this was due to a
PAGE_SIZE bound buffer originally, and in that context it makes sense.
Without the explanation I don't understand the rationale for adding the
check in the first place.
>
> > But I'm also completely lost in the maze of fixup patches.
> Several people have asked me to collapse the fix-ups into a single
> patch. We would lose some history and attributions doing that. Does
> anyone have other thoughts?
The action I'd see is to collapse the series into reviewable chunks.
I.e., fold the addition of the direct I/O writes into a single patch
that has all the policy decisions and documents them, leaving only
clearly separate prep patches separate.
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v9 05/12] NFSD: Remove alignment size checking
2025-11-05 12:52 ` Christoph Hellwig
@ 2025-11-05 14:38 ` Chuck Lever
2025-11-05 14:48 ` Mike Snitzer
2025-11-05 14:55 ` Christoph Hellwig
0 siblings, 2 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-05 14:38 UTC (permalink / raw)
To: Christoph Hellwig, Mike Snitzer
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On 11/5/25 7:52 AM, Christoph Hellwig wrote:
> On Tue, Nov 04, 2025 at 09:14:09AM -0500, Chuck Lever wrote:
>>>> It might be good to capture here *why* the check is removed.
>>>> Is it because alignments never exceed PAGE_SIZE, or because the code is
>>>> quite capable of handling larger alignments
>>>> (I haven't been following the conversation closely..)
>>>
>>> I'm still trying to understand why it was added in the first place :)
>>
>> I'm trying to understand what action you'd like me to take. Should I
>> drop this patch?
>
> With "it" I meant the check. І think Mike explain this was due to a
> PAGE_SIZE bound buffer originally, and in that context it makes sense.
> Without the explanation I don't understand the rationale for adding the
> check in the first place.
Agreed, Mike's original patch has no explanatory comment, and there
needs to be one. Mike, can you suggest a one or two sentence comment
and I will replace this patch with one that adds the comment.
>>> But I'm also completely lost in the maze of fixup patches.
>> Several people have asked me to collapse the fix-ups into a single
>> patch. We would lose some history and attributions doing that. Does
>> anyone have other thoughts?
>
> The action I'd see is to collapse the series into reviewable chunks.
> I.e., fold the addition of the direct I/O writes into a single patch
> that has all the policy decisions and documents them, leaving only
> clearly separate prep patches separate.
Meaning: combine the patches from 3/12 to 12/12 into a single patch.
--
Chuck Lever
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v9 05/12] NFSD: Remove alignment size checking
2025-11-05 14:38 ` Chuck Lever
@ 2025-11-05 14:48 ` Mike Snitzer
2025-11-05 14:55 ` Christoph Hellwig
1 sibling, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2025-11-05 14:48 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Wed, Nov 05, 2025 at 09:38:33AM -0500, Chuck Lever wrote:
> On 11/5/25 7:52 AM, Christoph Hellwig wrote:
> > On Tue, Nov 04, 2025 at 09:14:09AM -0500, Chuck Lever wrote:
> >>>> It might be good to capture here *why* the check is removed.
> >>>> Is it because alignments never exceed PAGE_SIZE, or because the code is
> >>>> quite capable of handling larger alignments
> >>>> (I haven't been following the conversation closely..)
> >>>
> >>> I'm still trying to understand why it was added in the first place :)
> >>
> >> I'm trying to understand what action you'd like me to take. Should I
> >> drop this patch?
> >
> > With "it" I meant the check. І think Mike explain this was due to a
> > PAGE_SIZE bound buffer originally, and in that context it makes sense.
> > Without the explanation I don't understand the rationale for adding the
> > check in the first place.
>
> Agreed, Mike's original patch has no explanatory comment, and there
> needs to be one. Mike, can you suggest a one or two sentence comment
> and I will replace this patch with one that adds the comment.
I replied yesterday suggesting you just fold 5 into 3.
> >>> But I'm also completely lost in the maze of fixup patches.
> >> Several people have asked me to collapse the fix-ups into a single
> >> patch. We would lose some history and attributions doing that. Does
> >> anyone have other thoughts?
> >
> > The action I'd see is to collapse the series into reviewable chunks.
> > I.e., fold the addition of the direct I/O writes into a single patch
> > that has all the policy decisions and documents them, leaving only
> > clearly separate prep patches separate.
> Meaning: combine the patches from 3/12 to 12/12 into a single patch.
But if you elect to fold all of 3 to 12, definitely add your
Co-developed-by:
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v9 05/12] NFSD: Remove alignment size checking
2025-11-05 14:38 ` Chuck Lever
2025-11-05 14:48 ` Mike Snitzer
@ 2025-11-05 14:55 ` Christoph Hellwig
1 sibling, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-05 14:55 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, Mike Snitzer, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Wed, Nov 05, 2025 at 09:38:33AM -0500, Chuck Lever wrote:
> > The action I'd see is to collapse the series into reviewable chunks.
> > I.e., fold the addition of the direct I/O writes into a single patch
> > that has all the policy decisions and documents them, leaving only
> > clearly separate prep patches separate.
> Meaning: combine the patches from 3/12 to 12/12 into a single patch.
Yes. Unless the merging makes it clear that something could actually
be split out as a self-contained prep patch, but nothing sticks out to
me currently.
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v9 06/12] NFSD: Clean up struct nfsd_write_dio
2025-11-03 16:53 [PATCH v9 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (4 preceding siblings ...)
2025-11-03 16:53 ` [PATCH v9 05/12] NFSD: Remove alignment size checking Chuck Lever
@ 2025-11-03 16:53 ` Chuck Lever
2025-11-03 22:36 ` NeilBrown
2025-11-03 16:53 ` [PATCH v9 07/12] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
` (5 subsequent siblings)
11 siblings, 1 reply; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 16:53 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Mike Snitzer
From: Chuck Lever <chuck.lever@oracle.com>
Prepare for moving more common arguments into the shared per-request
structure.
First step is to move the target nfsd_file into that structure, as
it needs to be available in several functions.
As a clean-up, adopt the conventional naming of a structure that
carries the same arguments for a number of functions.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 67 +++++++++++++++++++++++++++------------------------
1 file changed, 36 insertions(+), 31 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 80bc105eb0b6..326c60eada65 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,7 +1254,9 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
-struct nfsd_write_dio {
+struct nfsd_write_dio_args {
+ struct nfsd_file *nf;
+
ssize_t start_len; /* Length for misaligned first extent */
ssize_t middle_len; /* Length for DIO-aligned middle extent */
ssize_t end_len; /* Length for misaligned last extent */
@@ -1262,12 +1264,12 @@ struct nfsd_write_dio {
static bool
nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
- struct nfsd_file *nf, struct nfsd_write_dio *write_dio)
+ struct nfsd_write_dio_args *args)
{
- const u32 dio_blocksize = nf->nf_dio_offset_align;
+ u32 dio_blocksize = args->nf->nf_dio_offset_align;
loff_t start_end, orig_end, middle_end;
- if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
+ if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
return false;
if (unlikely(len < dio_blocksize))
return false;
@@ -1276,16 +1278,18 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
orig_end = offset + len;
middle_end = round_down(orig_end, dio_blocksize);
- write_dio->start_len = start_end - offset;
- write_dio->middle_len = middle_end - start_end;
- write_dio->end_len = orig_end - middle_end;
+ args->start_len = start_end - offset;
+ args->middle_len = middle_end - start_end;
+ args->end_len = orig_end - middle_end;
return true;
}
-static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
- unsigned int addr_mask, unsigned int len_mask)
+static bool
+nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
{
+ unsigned int len_mask = nf->nf_dio_offset_align - 1;
+ unsigned int addr_mask = nf->nf_dio_mem_align - 1;
const struct bio_vec *bvec = i->bvec;
size_t skip = i->iov_offset;
size_t size = i->count;
@@ -1314,37 +1318,35 @@ static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
static int
nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
struct bio_vec *rq_bvec, unsigned int nvecs,
- unsigned long cnt, struct nfsd_write_dio *write_dio,
- struct nfsd_file *nf)
+ unsigned long cnt, struct nfsd_write_dio_args *args)
{
int n_iters = 0;
struct iov_iter *iters = *iterp;
/* Setup misaligned start? */
- if (write_dio->start_len) {
+ if (args->start_len) {
iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
- iters[n_iters].count = write_dio->start_len;
+ iters[n_iters].count = args->start_len;
iter_is_dio_aligned[n_iters] = false;
++n_iters;
}
/* Setup DIO-aligned middle */
iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
- if (write_dio->start_len)
- iov_iter_advance(&iters[n_iters], write_dio->start_len);
- iters[n_iters].count -= write_dio->end_len;
+ if (args->start_len)
+ iov_iter_advance(&iters[n_iters], args->start_len);
+ iters[n_iters].count -= args->end_len;
iter_is_dio_aligned[n_iters] =
- nfsd_iov_iter_aligned_bvec(&iters[n_iters],
- nf->nf_dio_mem_align-1, nf->nf_dio_offset_align-1);
+ nfsd_iov_iter_aligned_bvec(args->nf, &iters[n_iters]);
if (unlikely(!iter_is_dio_aligned[n_iters]))
return 0; /* no DIO-aligned IO possible */
++n_iters;
/* Setup misaligned end? */
- if (write_dio->end_len) {
+ if (args->end_len) {
iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
iov_iter_advance(&iters[n_iters],
- write_dio->start_len + write_dio->middle_len);
+ args->start_len + args->middle_len);
iter_is_dio_aligned[n_iters] = false;
++n_iters;
}
@@ -1370,11 +1372,11 @@ nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
}
static int
-nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
- u32 *stable_how, unsigned int nvecs, unsigned long *cnt,
- struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
+ struct kiocb *kiocb, unsigned int nvecs, unsigned long *cnt,
+ struct nfsd_write_dio_args *args)
{
- struct file *file = nf->nf_file;
+ struct file *file = args->nf->nf_file;
bool iter_is_dio_aligned[3];
struct iov_iter iter_stack[3];
struct iov_iter *iter = iter_stack;
@@ -1384,7 +1386,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_fil
ssize_t host_err;
n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
- rqstp->rq_bvec, nvecs, *cnt, write_dio, nf);
+ rqstp->rq_bvec, nvecs, *cnt,
+ args);
if (unlikely(!n_iters))
return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
@@ -1421,21 +1424,23 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
unsigned long *cnt, struct kiocb *kiocb)
{
- struct nfsd_write_dio write_dio;
+ struct nfsd_write_dio_args args;
+
+ args.nf = nf;
/*
* Check if IOCB_DONTCACHE can be used when issuing buffered IO;
* if so, set it to preserve intent of NFSD_IO_DIRECT (it will
* be ignored for any DIO issued here).
*/
- if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+ if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
kiocb->ki_flags |= IOCB_DONTCACHE;
- if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
- return nfsd_issue_write_dio(rqstp, fhp, nf, stable_how, nvecs,
- cnt, kiocb, &write_dio);
+ if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, &args))
+ return nfsd_issue_write_dio(rqstp, fhp, stable_how, kiocb,
+ nvecs, cnt, &args);
- return nfsd_buffered_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
+ return nfsd_buffered_write(rqstp, args.nf->nf_file, nvecs, cnt, kiocb);
}
/**
--
2.51.0
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH v9 06/12] NFSD: Clean up struct nfsd_write_dio
2025-11-03 16:53 ` [PATCH v9 06/12] NFSD: Clean up struct nfsd_write_dio Chuck Lever
@ 2025-11-03 22:36 ` NeilBrown
0 siblings, 0 replies; 37+ messages in thread
From: NeilBrown @ 2025-11-03 22:36 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever, Mike Snitzer
On Tue, 04 Nov 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Prepare for moving more common arguments into the shared per-request
> structure.
>
> First step is to move the target nfsd_file into that structure, as
> it needs to be available in several functions.
>
> As a clean-up, adopt the conventional naming of a structure that
> carries the same arguments for a number of functions.
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 67 +++++++++++++++++++++++++++------------------------
> 1 file changed, 36 insertions(+), 31 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 80bc105eb0b6..326c60eada65 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1254,7 +1254,9 @@ static int wait_for_concurrent_writes(struct file *file)
> return err;
> }
>
> -struct nfsd_write_dio {
> +struct nfsd_write_dio_args {
> + struct nfsd_file *nf;
> +
> ssize_t start_len; /* Length for misaligned first extent */
> ssize_t middle_len; /* Length for DIO-aligned middle extent */
> ssize_t end_len; /* Length for misaligned last extent */
> @@ -1262,12 +1264,12 @@ struct nfsd_write_dio {
>
> static bool
> nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> - struct nfsd_file *nf, struct nfsd_write_dio *write_dio)
> + struct nfsd_write_dio_args *args)
> {
> - const u32 dio_blocksize = nf->nf_dio_offset_align;
> + u32 dio_blocksize = args->nf->nf_dio_offset_align;
> loff_t start_end, orig_end, middle_end;
>
> - if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
> + if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
> return false;
> if (unlikely(len < dio_blocksize))
> return false;
> @@ -1276,16 +1278,18 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> orig_end = offset + len;
> middle_end = round_down(orig_end, dio_blocksize);
>
> - write_dio->start_len = start_end - offset;
> - write_dio->middle_len = middle_end - start_end;
> - write_dio->end_len = orig_end - middle_end;
> + args->start_len = start_end - offset;
> + args->middle_len = middle_end - start_end;
> + args->end_len = orig_end - middle_end;
>
> return true;
> }
>
> -static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
> - unsigned int addr_mask, unsigned int len_mask)
> +static bool
> +nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
> {
> + unsigned int len_mask = nf->nf_dio_offset_align - 1;
> + unsigned int addr_mask = nf->nf_dio_mem_align - 1;
> const struct bio_vec *bvec = i->bvec;
> size_t skip = i->iov_offset;
> size_t size = i->count;
> @@ -1314,37 +1318,35 @@ static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
> static int
> nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
> struct bio_vec *rq_bvec, unsigned int nvecs,
> - unsigned long cnt, struct nfsd_write_dio *write_dio,
> - struct nfsd_file *nf)
> + unsigned long cnt, struct nfsd_write_dio_args *args)
> {
> int n_iters = 0;
> struct iov_iter *iters = *iterp;
>
> /* Setup misaligned start? */
> - if (write_dio->start_len) {
> + if (args->start_len) {
> iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> - iters[n_iters].count = write_dio->start_len;
> + iters[n_iters].count = args->start_len;
> iter_is_dio_aligned[n_iters] = false;
> ++n_iters;
> }
>
> /* Setup DIO-aligned middle */
> iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> - if (write_dio->start_len)
> - iov_iter_advance(&iters[n_iters], write_dio->start_len);
> - iters[n_iters].count -= write_dio->end_len;
> + if (args->start_len)
> + iov_iter_advance(&iters[n_iters], args->start_len);
> + iters[n_iters].count -= args->end_len;
> iter_is_dio_aligned[n_iters] =
> - nfsd_iov_iter_aligned_bvec(&iters[n_iters],
> - nf->nf_dio_mem_align-1, nf->nf_dio_offset_align-1);
> + nfsd_iov_iter_aligned_bvec(args->nf, &iters[n_iters]);
> if (unlikely(!iter_is_dio_aligned[n_iters]))
> return 0; /* no DIO-aligned IO possible */
> ++n_iters;
>
> /* Setup misaligned end? */
> - if (write_dio->end_len) {
> + if (args->end_len) {
> iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> iov_iter_advance(&iters[n_iters],
> - write_dio->start_len + write_dio->middle_len);
> + args->start_len + args->middle_len);
> iter_is_dio_aligned[n_iters] = false;
> ++n_iters;
> }
> @@ -1370,11 +1372,11 @@ nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> }
>
> static int
> -nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
> - u32 *stable_how, unsigned int nvecs, unsigned long *cnt,
> - struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
> +nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
> + struct kiocb *kiocb, unsigned int nvecs, unsigned long *cnt,
> + struct nfsd_write_dio_args *args)
> {
> - struct file *file = nf->nf_file;
> + struct file *file = args->nf->nf_file;
> bool iter_is_dio_aligned[3];
> struct iov_iter iter_stack[3];
> struct iov_iter *iter = iter_stack;
> @@ -1384,7 +1386,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_fil
> ssize_t host_err;
>
> n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
> - rqstp->rq_bvec, nvecs, *cnt, write_dio, nf);
> + rqstp->rq_bvec, nvecs, *cnt,
> + args);
> if (unlikely(!n_iters))
> return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
>
> @@ -1421,21 +1424,23 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
> unsigned long *cnt, struct kiocb *kiocb)
> {
> - struct nfsd_write_dio write_dio;
> + struct nfsd_write_dio_args args;
> +
> + args.nf = nf;
>
> /*
> * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
> * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
> * be ignored for any DIO issued here).
> */
> - if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> + if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
This change from "nf->" to "args.nf->" is unnecessary and to my eye
looks weird. I prefer the original code.
> kiocb->ki_flags |= IOCB_DONTCACHE;
>
> - if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
> - return nfsd_issue_write_dio(rqstp, fhp, nf, stable_how, nvecs,
> - cnt, kiocb, &write_dio);
> + if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, &args))
> + return nfsd_issue_write_dio(rqstp, fhp, stable_how, kiocb,
> + nvecs, cnt, &args);
>
> - return nfsd_buffered_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
> + return nfsd_buffered_write(rqstp, args.nf->nf_file, nvecs, cnt, kiocb);
This one looks particularly weird. "args" is for dio, and here we have
decided not to use dio, but we are using args anyway.
But overall I like the change, so
Reviewed-by: NeilBrown <neil@brown.name>
with or without the two changes above that I don't like.
NeilBrown
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v9 07/12] NFSD: Introduce struct nfsd_write_dio_seg
2025-11-03 16:53 [PATCH v9 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (5 preceding siblings ...)
2025-11-03 16:53 ` [PATCH v9 06/12] NFSD: Clean up struct nfsd_write_dio Chuck Lever
@ 2025-11-03 16:53 ` Chuck Lever
2025-11-03 22:45 ` NeilBrown
2025-11-03 16:53 ` [PATCH v9 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec() Chuck Lever
` (4 subsequent siblings)
11 siblings, 1 reply; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 16:53 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Christoph Hellwig, Mike Snitzer
From: Chuck Lever <chuck.lever@oracle.com>
Passing iter arrays by reference is a little risky. Instead, pass a
struct with a fixed-size array so bounds checking can be done.
Name each item in the array a "segment", as the term "extent"
generally refers to a set of blocks on storage, not to a buffer.
Each segment is processed via a single vfs_iocb_iter_write() call,
and is either IOCB_DIRECT or buffered.
Introduce a segment constructor function so each segment is
initialized identically.
Consensus is that allowing the code to build segment arrays that
are smaller than 3 is better than the I/O loop unconditionally
visiting all 3 segments, skipping the zero-length ones.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 120 ++++++++++++++++++++++++--------------------------
1 file changed, 58 insertions(+), 62 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 326c60eada65..5d6efcceb8c9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,12 +1254,16 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
+struct nfsd_write_dio_seg {
+ struct iov_iter iter;
+ bool use_dio;
+};
+
struct nfsd_write_dio_args {
struct nfsd_file *nf;
-
- ssize_t start_len; /* Length for misaligned first extent */
- ssize_t middle_len; /* Length for DIO-aligned middle extent */
- ssize_t end_len; /* Length for misaligned last extent */
+ size_t first, middle, last;
+ unsigned int nsegs;
+ struct nfsd_write_dio_seg segment[3];
};
static bool
@@ -1267,21 +1271,20 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
struct nfsd_write_dio_args *args)
{
u32 dio_blocksize = args->nf->nf_dio_offset_align;
- loff_t start_end, orig_end, middle_end;
+ loff_t first_end, orig_end, middle_end;
if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
return false;
if (unlikely(len < dio_blocksize))
return false;
- start_end = round_up(offset, dio_blocksize);
+ first_end = round_up(offset, dio_blocksize);
orig_end = offset + len;
middle_end = round_down(orig_end, dio_blocksize);
- args->start_len = start_end - offset;
- args->middle_len = middle_end - start_end;
- args->end_len = orig_end - middle_end;
-
+ args->first = first_end - offset;
+ args->middle = middle_end - first_end;
+ args->last = orig_end - middle_end;
return true;
}
@@ -1311,47 +1314,47 @@ nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
return true;
}
-/*
- * Setup as many as 3 iov_iter based on extents described by @write_dio.
- * Returns the number of iov_iter that were setup.
- */
-static int
-nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
- struct bio_vec *rq_bvec, unsigned int nvecs,
- unsigned long cnt, struct nfsd_write_dio_args *args)
+static void
+nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
+ struct bio_vec *bvec, unsigned int nvecs,
+ unsigned long total, size_t start, size_t len)
{
- int n_iters = 0;
- struct iov_iter *iters = *iterp;
+ iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
+ if (start)
+ iov_iter_advance(&segment->iter, start);
+ iov_iter_truncate(&segment->iter, len);
+ segment->use_dio = false;
+}
- /* Setup misaligned start? */
- if (args->start_len) {
- iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
- iters[n_iters].count = args->start_len;
- iter_is_dio_aligned[n_iters] = false;
- ++n_iters;
+static bool
+nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
+ unsigned long total,
+ struct nfsd_write_dio_args *args)
+{
+ args->nsegs = 0;
+
+ if (args->first) {
+ nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
+ nvecs, total, 0, args->first);
+ ++args->nsegs;
}
- /* Setup DIO-aligned middle */
- iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
- if (args->start_len)
- iov_iter_advance(&iters[n_iters], args->start_len);
- iters[n_iters].count -= args->end_len;
- iter_is_dio_aligned[n_iters] =
- nfsd_iov_iter_aligned_bvec(args->nf, &iters[n_iters]);
- if (unlikely(!iter_is_dio_aligned[n_iters]))
- return 0; /* no DIO-aligned IO possible */
- ++n_iters;
+ nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
+ total, args->first, args->middle);
+ if (!nfsd_iov_iter_aligned_bvec(args->nf,
+ &args->segment[args->nsegs].iter))
+ return false; /* no DIO-aligned IO possible */
+ args->segment[args->nsegs].use_dio = true;
+ ++args->nsegs;
- /* Setup misaligned end? */
- if (args->end_len) {
- iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
- iov_iter_advance(&iters[n_iters],
- args->start_len + args->middle_len);
- iter_is_dio_aligned[n_iters] = false;
- ++n_iters;
+ if (args->last) {
+ nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
+ nvecs, total, args->first +
+ args->middle, args->last);
+ ++args->nsegs;
}
- return n_iters;
+ return true;
}
static int
@@ -1377,22 +1380,12 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
struct nfsd_write_dio_args *args)
{
struct file *file = args->nf->nf_file;
- bool iter_is_dio_aligned[3];
- struct iov_iter iter_stack[3];
- struct iov_iter *iter = iter_stack;
- unsigned int n_iters = 0;
- unsigned long in_count = *cnt;
- loff_t in_offset = kiocb->ki_pos;
ssize_t host_err;
+ unsigned int i;
- n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
- rqstp->rq_bvec, nvecs, *cnt,
- args);
- if (unlikely(!n_iters))
+ if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, *cnt, args))
return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
- trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
-
/*
* Any buffered IO issued here will be misaligned, use
* sync IO to ensure it has completed before returning.
@@ -1402,18 +1395,21 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
*stable_how = NFS_FILE_SYNC;
*cnt = 0;
- for (int i = 0; i < n_iters; i++) {
- if (iter_is_dio_aligned[i])
+ for (i = 0; i < args->nsegs; i++) {
+ if (args->segment[i].use_dio) {
kiocb->ki_flags |= IOCB_DIRECT;
- else
+ trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
+ args->segment[i].iter.count);
+ } else
kiocb->ki_flags &= ~IOCB_DIRECT;
- host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
+ host_err = vfs_iocb_iter_write(file, kiocb,
+ &args->segment[i].iter);
if (host_err < 0)
return host_err;
*cnt += host_err;
- if (host_err < iter[i].count) /* partial write? */
- break;
+ if (host_err < args->segment[i].iter.count)
+ break; /* partial write */
}
return 0;
--
2.51.0
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH v9 07/12] NFSD: Introduce struct nfsd_write_dio_seg
2025-11-03 16:53 ` [PATCH v9 07/12] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
@ 2025-11-03 22:45 ` NeilBrown
2025-11-03 22:48 ` Chuck Lever
0 siblings, 1 reply; 37+ messages in thread
From: NeilBrown @ 2025-11-03 22:45 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever, Christoph Hellwig, Mike Snitzer
On Tue, 04 Nov 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Passing iter arrays by reference is a little risky. Instead, pass a
> struct with a fixed-size array so bounds checking can be done.
>
> Name each item in the array a "segment", as the term "extent"
> generally refers to a set of blocks on storage, not to a buffer.
> Each segment is processed via a single vfs_iocb_iter_write() call,
> and is either IOCB_DIRECT or buffered.
>
> Introduce a segment constructor function so each segment is
> initialized identically.
>
> Consensus is that allowing the code to build segment arrays that
> are smaller than 3 is better than the I/O loop unconditionally
> visiting all 3 segments, skipping the zero-length ones.
>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 120 ++++++++++++++++++++++++--------------------------
> 1 file changed, 58 insertions(+), 62 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 326c60eada65..5d6efcceb8c9 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1254,12 +1254,16 @@ static int wait_for_concurrent_writes(struct file *file)
> return err;
> }
>
> +struct nfsd_write_dio_seg {
> + struct iov_iter iter;
> + bool use_dio;
> +};
> +
> struct nfsd_write_dio_args {
> struct nfsd_file *nf;
> -
> - ssize_t start_len; /* Length for misaligned first extent */
> - ssize_t middle_len; /* Length for DIO-aligned middle extent */
> - ssize_t end_len; /* Length for misaligned last extent */
> + size_t first, middle, last;
> + unsigned int nsegs;
> + struct nfsd_write_dio_seg segment[3];
> };
>
> static bool
> @@ -1267,21 +1271,20 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> struct nfsd_write_dio_args *args)
> {
> u32 dio_blocksize = args->nf->nf_dio_offset_align;
> - loff_t start_end, orig_end, middle_end;
> + loff_t first_end, orig_end, middle_end;
>
> if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
> return false;
> if (unlikely(len < dio_blocksize))
> return false;
>
> - start_end = round_up(offset, dio_blocksize);
> + first_end = round_up(offset, dio_blocksize);
> orig_end = offset + len;
> middle_end = round_down(orig_end, dio_blocksize);
>
> - args->start_len = start_end - offset;
> - args->middle_len = middle_end - start_end;
> - args->end_len = orig_end - middle_end;
> -
> + args->first = first_end - offset;
> + args->middle = middle_end - first_end;
> + args->last = orig_end - middle_end;
The commit message didn't warn that "start" is being renamed to "first"
and that "_len" is being dropped.
I can understand the former (though I'd go for "head", "body", "tail"
myself) but I would prefer to keep _len.
> return true;
> }
>
> @@ -1311,47 +1314,47 @@ nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
> return true;
> }
>
> -/*
> - * Setup as many as 3 iov_iter based on extents described by @write_dio.
> - * Returns the number of iov_iter that were setup.
> - */
> -static int
> -nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
> - struct bio_vec *rq_bvec, unsigned int nvecs,
> - unsigned long cnt, struct nfsd_write_dio_args *args)
> +static void
> +nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
> + struct bio_vec *bvec, unsigned int nvecs,
> + unsigned long total, size_t start, size_t len)
> {
> - int n_iters = 0;
> - struct iov_iter *iters = *iterp;
> + iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
> + if (start)
> + iov_iter_advance(&segment->iter, start);
To me, this would read better as
if (start_len)
iov_iter......
But I don't see anything actually wrong with the patch, so
Reviewed-by: NeilBrown <neil@brown.name>
NeilBrown
> + iov_iter_truncate(&segment->iter, len);
> + segment->use_dio = false;
> +}
>
> - /* Setup misaligned start? */
> - if (args->start_len) {
> - iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> - iters[n_iters].count = args->start_len;
> - iter_is_dio_aligned[n_iters] = false;
> - ++n_iters;
> +static bool
> +nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
> + unsigned long total,
> + struct nfsd_write_dio_args *args)
> +{
> + args->nsegs = 0;
> +
> + if (args->first) {
> + nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
> + nvecs, total, 0, args->first);
> + ++args->nsegs;
> }
>
> - /* Setup DIO-aligned middle */
> - iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> - if (args->start_len)
> - iov_iter_advance(&iters[n_iters], args->start_len);
> - iters[n_iters].count -= args->end_len;
> - iter_is_dio_aligned[n_iters] =
> - nfsd_iov_iter_aligned_bvec(args->nf, &iters[n_iters]);
> - if (unlikely(!iter_is_dio_aligned[n_iters]))
> - return 0; /* no DIO-aligned IO possible */
> - ++n_iters;
> + nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
> + total, args->first, args->middle);
> + if (!nfsd_iov_iter_aligned_bvec(args->nf,
> + &args->segment[args->nsegs].iter))
> + return false; /* no DIO-aligned IO possible */
> + args->segment[args->nsegs].use_dio = true;
> + ++args->nsegs;
>
> - /* Setup misaligned end? */
> - if (args->end_len) {
> - iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
> - iov_iter_advance(&iters[n_iters],
> - args->start_len + args->middle_len);
> - iter_is_dio_aligned[n_iters] = false;
> - ++n_iters;
> + if (args->last) {
> + nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
> + nvecs, total, args->first +
> + args->middle, args->last);
> + ++args->nsegs;
> }
>
> - return n_iters;
> + return true;
> }
>
> static int
> @@ -1377,22 +1380,12 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
> struct nfsd_write_dio_args *args)
> {
> struct file *file = args->nf->nf_file;
> - bool iter_is_dio_aligned[3];
> - struct iov_iter iter_stack[3];
> - struct iov_iter *iter = iter_stack;
> - unsigned int n_iters = 0;
> - unsigned long in_count = *cnt;
> - loff_t in_offset = kiocb->ki_pos;
> ssize_t host_err;
> + unsigned int i;
>
> - n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
> - rqstp->rq_bvec, nvecs, *cnt,
> - args);
> - if (unlikely(!n_iters))
> + if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, *cnt, args))
> return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
>
> - trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
> -
> /*
> * Any buffered IO issued here will be misaligned, use
> * sync IO to ensure it has completed before returning.
> @@ -1402,18 +1395,21 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
> *stable_how = NFS_FILE_SYNC;
>
> *cnt = 0;
> - for (int i = 0; i < n_iters; i++) {
> - if (iter_is_dio_aligned[i])
> + for (i = 0; i < args->nsegs; i++) {
> + if (args->segment[i].use_dio) {
> kiocb->ki_flags |= IOCB_DIRECT;
> - else
> + trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
> + args->segment[i].iter.count);
> + } else
> kiocb->ki_flags &= ~IOCB_DIRECT;
>
> - host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
> + host_err = vfs_iocb_iter_write(file, kiocb,
> + &args->segment[i].iter);
> if (host_err < 0)
> return host_err;
> *cnt += host_err;
> - if (host_err < iter[i].count) /* partial write? */
> - break;
> + if (host_err < args->segment[i].iter.count)
> + break; /* partial write */
> }
>
> return 0;
> --
> 2.51.0
>
>
^ permalink raw reply [flat|nested] 37+ messages in thread* Re: [PATCH v9 07/12] NFSD: Introduce struct nfsd_write_dio_seg
2025-11-03 22:45 ` NeilBrown
@ 2025-11-03 22:48 ` Chuck Lever
0 siblings, 0 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 22:48 UTC (permalink / raw)
To: NeilBrown
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever, Christoph Hellwig, Mike Snitzer
On 11/3/25 5:45 PM, NeilBrown wrote:
> On Tue, 04 Nov 2025, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Passing iter arrays by reference is a little risky. Instead, pass a
>> struct with a fixed-size array so bounds checking can be done.
>>
>> Name each item in the array a "segment", as the term "extent"
>> generally refers to a set of blocks on storage, not to a buffer.
>> Each segment is processed via a single vfs_iocb_iter_write() call,
>> and is either IOCB_DIRECT or buffered.
>>
>> Introduce a segment constructor function so each segment is
>> initialized identically.
>>
>> Consensus is that allowing the code to build segment arrays that
>> are smaller than 3 is better than the I/O loop unconditionally
>> visiting all 3 segments, skipping the zero-length ones.
>>
>> Suggested-by: Christoph Hellwig <hch@lst.de>
>> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> fs/nfsd/vfs.c | 120 ++++++++++++++++++++++++--------------------------
>> 1 file changed, 58 insertions(+), 62 deletions(-)
>>
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 326c60eada65..5d6efcceb8c9 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -1254,12 +1254,16 @@ static int wait_for_concurrent_writes(struct file *file)
>> return err;
>> }
>>
>> +struct nfsd_write_dio_seg {
>> + struct iov_iter iter;
>> + bool use_dio;
>> +};
>> +
>> struct nfsd_write_dio_args {
>> struct nfsd_file *nf;
>> -
>> - ssize_t start_len; /* Length for misaligned first extent */
>> - ssize_t middle_len; /* Length for DIO-aligned middle extent */
>> - ssize_t end_len; /* Length for misaligned last extent */
>> + size_t first, middle, last;
>> + unsigned int nsegs;
>> + struct nfsd_write_dio_seg segment[3];
>> };
>>
>> static bool
>> @@ -1267,21 +1271,20 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
>> struct nfsd_write_dio_args *args)
>> {
>> u32 dio_blocksize = args->nf->nf_dio_offset_align;
>> - loff_t start_end, orig_end, middle_end;
>> + loff_t first_end, orig_end, middle_end;
>>
>> if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
>> return false;
>> if (unlikely(len < dio_blocksize))
>> return false;
>>
>> - start_end = round_up(offset, dio_blocksize);
>> + first_end = round_up(offset, dio_blocksize);
>> orig_end = offset + len;
>> middle_end = round_down(orig_end, dio_blocksize);
>>
>> - args->start_len = start_end - offset;
>> - args->middle_len = middle_end - start_end;
>> - args->end_len = orig_end - middle_end;
>> -
>> + args->first = first_end - offset;
>> + args->middle = middle_end - first_end;
>> + args->last = orig_end - middle_end;
>
> The commit message didn't warn that "start" is being renamed to "first"
> and that "_len" is being dropped.
> I can understand the former (though I'd go for "head", "body", "tail"
> myself) but I would prefer to keep _len.
A subsequent patch renames these variables again to "prefix", "middle",
and "suffix". The series should probably just stick with one rename.
>> return true;
>> }
>>
>> @@ -1311,47 +1314,47 @@ nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
>> return true;
>> }
>>
>> -/*
>> - * Setup as many as 3 iov_iter based on extents described by @write_dio.
>> - * Returns the number of iov_iter that were setup.
>> - */
>> -static int
>> -nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
>> - struct bio_vec *rq_bvec, unsigned int nvecs,
>> - unsigned long cnt, struct nfsd_write_dio_args *args)
>> +static void
>> +nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
>> + struct bio_vec *bvec, unsigned int nvecs,
>> + unsigned long total, size_t start, size_t len)
>> {
>> - int n_iters = 0;
>> - struct iov_iter *iters = *iterp;
>> + iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
>> + if (start)
>> + iov_iter_advance(&segment->iter, start);
>
> To me, this would read better as
> if (start_len)
> iov_iter......
>
>
> But I don't see anything actually wrong with the patch, so
> Reviewed-by: NeilBrown <neil@brown.name>
>
> NeilBrown
>
>
>
>> + iov_iter_truncate(&segment->iter, len);
>> + segment->use_dio = false;
>> +}
>>
>> - /* Setup misaligned start? */
>> - if (args->start_len) {
>> - iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
>> - iters[n_iters].count = args->start_len;
>> - iter_is_dio_aligned[n_iters] = false;
>> - ++n_iters;
>> +static bool
>> +nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
>> + unsigned long total,
>> + struct nfsd_write_dio_args *args)
>> +{
>> + args->nsegs = 0;
>> +
>> + if (args->first) {
>> + nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
>> + nvecs, total, 0, args->first);
>> + ++args->nsegs;
>> }
>>
>> - /* Setup DIO-aligned middle */
>> - iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
>> - if (args->start_len)
>> - iov_iter_advance(&iters[n_iters], args->start_len);
>> - iters[n_iters].count -= args->end_len;
>> - iter_is_dio_aligned[n_iters] =
>> - nfsd_iov_iter_aligned_bvec(args->nf, &iters[n_iters]);
>> - if (unlikely(!iter_is_dio_aligned[n_iters]))
>> - return 0; /* no DIO-aligned IO possible */
>> - ++n_iters;
>> + nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
>> + total, args->first, args->middle);
>> + if (!nfsd_iov_iter_aligned_bvec(args->nf,
>> + &args->segment[args->nsegs].iter))
>> + return false; /* no DIO-aligned IO possible */
>> + args->segment[args->nsegs].use_dio = true;
>> + ++args->nsegs;
>>
>> - /* Setup misaligned end? */
>> - if (args->end_len) {
>> - iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
>> - iov_iter_advance(&iters[n_iters],
>> - args->start_len + args->middle_len);
>> - iter_is_dio_aligned[n_iters] = false;
>> - ++n_iters;
>> + if (args->last) {
>> + nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
>> + nvecs, total, args->first +
>> + args->middle, args->last);
>> + ++args->nsegs;
>> }
>>
>> - return n_iters;
>> + return true;
>> }
>>
>> static int
>> @@ -1377,22 +1380,12 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
>> struct nfsd_write_dio_args *args)
>> {
>> struct file *file = args->nf->nf_file;
>> - bool iter_is_dio_aligned[3];
>> - struct iov_iter iter_stack[3];
>> - struct iov_iter *iter = iter_stack;
>> - unsigned int n_iters = 0;
>> - unsigned long in_count = *cnt;
>> - loff_t in_offset = kiocb->ki_pos;
>> ssize_t host_err;
>> + unsigned int i;
>>
>> - n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
>> - rqstp->rq_bvec, nvecs, *cnt,
>> - args);
>> - if (unlikely(!n_iters))
>> + if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, *cnt, args))
>> return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
>>
>> - trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
>> -
>> /*
>> * Any buffered IO issued here will be misaligned, use
>> * sync IO to ensure it has completed before returning.
>> @@ -1402,18 +1395,21 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
>> *stable_how = NFS_FILE_SYNC;
>>
>> *cnt = 0;
>> - for (int i = 0; i < n_iters; i++) {
>> - if (iter_is_dio_aligned[i])
>> + for (i = 0; i < args->nsegs; i++) {
>> + if (args->segment[i].use_dio) {
>> kiocb->ki_flags |= IOCB_DIRECT;
>> - else
>> + trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
>> + args->segment[i].iter.count);
>> + } else
>> kiocb->ki_flags &= ~IOCB_DIRECT;
>>
>> - host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
>> + host_err = vfs_iocb_iter_write(file, kiocb,
>> + &args->segment[i].iter);
>> if (host_err < 0)
>> return host_err;
>> *cnt += host_err;
>> - if (host_err < iter[i].count) /* partial write? */
>> - break;
>> + if (host_err < args->segment[i].iter.count)
>> + break; /* partial write */
>> }
>>
>> return 0;
>> --
>> 2.51.0
>>
>>
>
--
Chuck Lever
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v9 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec()
2025-11-03 16:53 [PATCH v9 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (6 preceding siblings ...)
2025-11-03 16:53 ` [PATCH v9 07/12] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
@ 2025-11-03 16:53 ` Chuck Lever
2025-11-03 21:13 ` Mike Snitzer
2025-11-03 22:48 ` NeilBrown
2025-11-03 16:53 ` [PATCH v9 09/12] NFSD: Combine direct I/O feasibility check with iterator setup Chuck Lever
` (3 subsequent siblings)
11 siblings, 2 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 16:53 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Pages in the RPC receive buffer are contiguous. Practically
speaking, this means that, after the first bvec, the bv_offset
field is always zero.
The loop in nfsd_iov_iter_aligned_bvec() also sets "skip" to zero
after the first bvec is checked. Thus, for bvecs following the first
one, bv_offset = 0 and skip = 0, and the check becomes:
(0 + 0) & addr_mask = 0
This always passes regardless of the alignment mask.
Since all bvecs after the first one start at bv_offset zero (page-
aligned), they are inherently aligned for DIO. Therefore, only the
first bvec needs to be checked for DIO alignment.
Note that for RDMA transports, the incoming payload always starts on
page alignment, since svcrdma sets up the RDMA Read sink buffers
that way. For RDMA, this loop visits every bvec in each payload, and
never finds an unaligned bvec.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 26 ++++++++------------------
1 file changed, 8 insertions(+), 18 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5d6efcceb8c9..37353fb48d58 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1288,30 +1288,20 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
return true;
}
+/*
+ * Check if the bvec iterator is aligned for direct I/O.
+ *
+ * bvecs generated from RPC receive buffers are contiguous: After the first
+ * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
+ * Therefore, only the first bvec is checked.
+ */
static bool
nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
{
- unsigned int len_mask = nf->nf_dio_offset_align - 1;
unsigned int addr_mask = nf->nf_dio_mem_align - 1;
const struct bio_vec *bvec = i->bvec;
- size_t skip = i->iov_offset;
- size_t size = i->count;
- if (size & len_mask)
- return false;
- do {
- size_t len = bvec->bv_len;
-
- if (len > size)
- len = size;
- if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
- return false;
- bvec++;
- size -= len;
- skip = 0;
- } while (size);
-
- return true;
+ return !((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask);
}
static void
--
2.51.0
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH v9 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec()
2025-11-03 16:53 ` [PATCH v9 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec() Chuck Lever
@ 2025-11-03 21:13 ` Mike Snitzer
2025-11-03 22:48 ` NeilBrown
1 sibling, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2025-11-03 21:13 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On Mon, Nov 03, 2025 at 11:53:47AM -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Pages in the RPC receive buffer are contiguous. Practically
> speaking, this means that, after the first bvec, the bv_offset
> field is always zero.
>
> The loop in nfsd_iov_iter_aligned_bvec() also sets "skip" to zero
> after the first bvec is checked. Thus, for bvecs following the first
> one, bv_offset = 0 and skip = 0, and the check becomes:
>
> (0 + 0) & addr_mask = 0
>
> This always passes regardless of the alignment mask.
>
> Since all bvecs after the first one start at bv_offset zero (page-
> aligned), they are inherently aligned for DIO. Therefore, only the
> first bvec needs to be checked for DIO alignment.
>
> Note that for RDMA transports, the incoming payload always starts on
> page alignment, since svcrdma sets up the RDMA Read sink buffers
> that way. For RDMA, this loop visits every bvec in each payload, and
> never finds an unaligned bvec.
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Very welcome improvement!
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v9 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec()
2025-11-03 16:53 ` [PATCH v9 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec() Chuck Lever
2025-11-03 21:13 ` Mike Snitzer
@ 2025-11-03 22:48 ` NeilBrown
1 sibling, 0 replies; 37+ messages in thread
From: NeilBrown @ 2025-11-03 22:48 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever
On Tue, 04 Nov 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Pages in the RPC receive buffer are contiguous. Practically
> speaking, this means that, after the first bvec, the bv_offset
> field is always zero.
>
> The loop in nfsd_iov_iter_aligned_bvec() also sets "skip" to zero
> after the first bvec is checked. Thus, for bvecs following the first
> one, bv_offset = 0 and skip = 0, and the check becomes:
>
> (0 + 0) & addr_mask = 0
>
> This always passes regardless of the alignment mask.
>
> Since all bvecs after the first one start at bv_offset zero (page-
> aligned), they are inherently aligned for DIO. Therefore, only the
> first bvec needs to be checked for DIO alignment.
>
> Note that for RDMA transports, the incoming payload always starts on
> page alignment, since svcrdma sets up the RDMA Read sink buffers
> that way. For RDMA, this loop visits every bvec in each payload, and
> never finds an unaligned bvec.
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 26 ++++++++------------------
> 1 file changed, 8 insertions(+), 18 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 5d6efcceb8c9..37353fb48d58 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1288,30 +1288,20 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> return true;
> }
>
> +/*
> + * Check if the bvec iterator is aligned for direct I/O.
> + *
> + * bvecs generated from RPC receive buffers are contiguous: After the first
> + * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
> + * Therefore, only the first bvec is checked.
> + */
> static bool
> nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
> {
> - unsigned int len_mask = nf->nf_dio_offset_align - 1;
> unsigned int addr_mask = nf->nf_dio_mem_align - 1;
> const struct bio_vec *bvec = i->bvec;
> - size_t skip = i->iov_offset;
> - size_t size = i->count;
>
> - if (size & len_mask)
> - return false;
> - do {
> - size_t len = bvec->bv_len;
> -
> - if (len > size)
> - len = size;
> - if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> - return false;
> - bvec++;
> - size -= len;
> - skip = 0;
> - } while (size);
> -
> - return true;
> + return !((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask);
I would MUCH rather
return ((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask) == 0;
as it really is a number, not a bool, that is being tested.
I find the latter easy to read, and the former a challenge.
But this is a nice simplification.
Reviewed-by: NeilBrown <neil@brown.name>
NeilBrown
> }
>
> static void
> --
> 2.51.0
>
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v9 09/12] NFSD: Combine direct I/O feasibility check with iterator setup
2025-11-03 16:53 [PATCH v9 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (7 preceding siblings ...)
2025-11-03 16:53 ` [PATCH v9 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec() Chuck Lever
@ 2025-11-03 16:53 ` Chuck Lever
2025-11-03 21:19 ` Mike Snitzer
2025-11-03 22:55 ` NeilBrown
2025-11-03 16:53 ` [PATCH v9 10/12] NFSD: Handle kiocb->ki_flags correctly Chuck Lever
` (2 subsequent siblings)
11 siblings, 2 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 16:53 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
When direct I/O is not feasible (due to missing alignment info,
too-small writes, or no alignment possible), pack the entire
write payload into a single non-DIO segment and follow the usual
direct write I/O path.
This simplifies nfsd_direct_write() by eliminating the fallback path
and the separate nfsd_buffered_write() call - all writes now go
through nfsd_issue_write_dio() which handles both DIO and buffered
segments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 93 +++++++++++++++++++++++++++------------------------
1 file changed, 49 insertions(+), 44 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 37353fb48d58..1b1173de4f78 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1261,33 +1261,10 @@ struct nfsd_write_dio_seg {
struct nfsd_write_dio_args {
struct nfsd_file *nf;
- size_t first, middle, last;
unsigned int nsegs;
struct nfsd_write_dio_seg segment[3];
};
-static bool
-nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
- struct nfsd_write_dio_args *args)
-{
- u32 dio_blocksize = args->nf->nf_dio_offset_align;
- loff_t first_end, orig_end, middle_end;
-
- if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
- return false;
- if (unlikely(len < dio_blocksize))
- return false;
-
- first_end = round_up(offset, dio_blocksize);
- orig_end = offset + len;
- middle_end = round_down(orig_end, dio_blocksize);
-
- args->first = first_end - offset;
- args->middle = middle_end - first_end;
- args->last = orig_end - middle_end;
- return true;
-}
-
/*
* Check if the bvec iterator is aligned for direct I/O.
*
@@ -1316,35 +1293,66 @@ nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
segment->use_dio = false;
}
-static bool
-nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
- unsigned long total,
- struct nfsd_write_dio_args *args)
+static void
+nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
+ loff_t offset, unsigned long total,
+ struct nfsd_write_dio_args *args)
{
+ u32 offset_align = args->nf->nf_dio_offset_align;
+ u32 mem_align = args->nf->nf_dio_mem_align;
+ loff_t prefix_end, orig_end, middle_end;
+ size_t prefix, middle, suffix;
+
args->nsegs = 0;
- if (args->first) {
+ /*
+ * Check if direct I/O is feasible for this write request.
+ * If alignments are not available, the write is too small,
+ * or no alignment can be found, fall back to buffered I/O.
+ */
+ if (unlikely(!mem_align || !offset_align) ||
+ unlikely(total < max(offset_align, mem_align)))
+ goto no_dio;
+
+ /* Calculate aligned segments */
+ prefix_end = round_up(offset, offset_align);
+ orig_end = offset + total;
+ middle_end = round_down(orig_end, offset_align);
+
+ prefix = prefix_end - offset;
+ middle = middle_end - prefix_end;
+ suffix = orig_end - middle_end;
+
+ if (!middle)
+ goto no_dio;
+
+ if (prefix) {
nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
- nvecs, total, 0, args->first);
+ nvecs, total, 0, prefix);
++args->nsegs;
}
nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
- total, args->first, args->middle);
+ total, prefix, middle);
if (!nfsd_iov_iter_aligned_bvec(args->nf,
&args->segment[args->nsegs].iter))
- return false; /* no DIO-aligned IO possible */
+ goto no_dio;
args->segment[args->nsegs].use_dio = true;
++args->nsegs;
- if (args->last) {
+ if (suffix) {
nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
- nvecs, total, args->first +
- args->middle, args->last);
+ nvecs, total, prefix + middle, suffix);
++args->nsegs;
}
- return true;
+ return;
+
+no_dio:
+ /* No alignment possible - pack into single non-DIO segment */
+ nfsd_write_dio_seg_init(&args->segment[0], bvec, nvecs, total,
+ 0, total);
+ args->nsegs = 1;
}
static int
@@ -1365,7 +1373,7 @@ nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
}
static int
-nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
+nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
struct kiocb *kiocb, unsigned int nvecs, unsigned long *cnt,
struct nfsd_write_dio_args *args)
{
@@ -1373,9 +1381,6 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
ssize_t host_err;
unsigned int i;
- if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, *cnt, args))
- return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
-
/*
* Any buffered IO issued here will be misaligned, use
* sync IO to ensure it has completed before returning.
@@ -1384,6 +1389,9 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
kiocb->ki_flags |= (IOCB_DSYNC|IOCB_SYNC);
*stable_how = NFS_FILE_SYNC;
+ nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
+ *cnt, args);
+
*cnt = 0;
for (i = 0; i < args->nsegs; i++) {
if (args->segment[i].use_dio) {
@@ -1422,11 +1430,8 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
kiocb->ki_flags |= IOCB_DONTCACHE;
- if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, &args))
- return nfsd_issue_write_dio(rqstp, fhp, stable_how, kiocb,
- nvecs, cnt, &args);
-
- return nfsd_buffered_write(rqstp, args.nf->nf_file, nvecs, cnt, kiocb);
+ return nfsd_issue_dio_write(rqstp, fhp, stable_how, kiocb, nvecs,
+ cnt, &args);
}
/**
--
2.51.0
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH v9 09/12] NFSD: Combine direct I/O feasibility check with iterator setup
2025-11-03 16:53 ` [PATCH v9 09/12] NFSD: Combine direct I/O feasibility check with iterator setup Chuck Lever
@ 2025-11-03 21:19 ` Mike Snitzer
2025-11-03 22:55 ` NeilBrown
1 sibling, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2025-11-03 21:19 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On Mon, Nov 03, 2025 at 11:53:48AM -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> When direct I/O is not feasible (due to missing alignment info,
> too-small writes, or no alignment possible), pack the entire
> write payload into a single non-DIO segment and follow the usual
> direct write I/O path.
>
> This simplifies nfsd_direct_write() by eliminating the fallback path
> and the separate nfsd_buffered_write() call - all writes now go
> through nfsd_issue_write_dio() which handles both DIO and buffered
> segments.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v9 09/12] NFSD: Combine direct I/O feasibility check with iterator setup
2025-11-03 16:53 ` [PATCH v9 09/12] NFSD: Combine direct I/O feasibility check with iterator setup Chuck Lever
2025-11-03 21:19 ` Mike Snitzer
@ 2025-11-03 22:55 ` NeilBrown
1 sibling, 0 replies; 37+ messages in thread
From: NeilBrown @ 2025-11-03 22:55 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever
On Tue, 04 Nov 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> When direct I/O is not feasible (due to missing alignment info,
> too-small writes, or no alignment possible), pack the entire
> write payload into a single non-DIO segment and follow the usual
> direct write I/O path.
>
> This simplifies nfsd_direct_write() by eliminating the fallback path
> and the separate nfsd_buffered_write() call - all writes now go
> through nfsd_issue_write_dio() which handles both DIO and buffered
> segments.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 93 +++++++++++++++++++++++++++------------------------
> 1 file changed, 49 insertions(+), 44 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 37353fb48d58..1b1173de4f78 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1261,33 +1261,10 @@ struct nfsd_write_dio_seg {
>
> struct nfsd_write_dio_args {
> struct nfsd_file *nf;
> - size_t first, middle, last;
> unsigned int nsegs;
> struct nfsd_write_dio_seg segment[3];
> };
>
> -static bool
> -nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> - struct nfsd_write_dio_args *args)
> -{
> - u32 dio_blocksize = args->nf->nf_dio_offset_align;
> - loff_t first_end, orig_end, middle_end;
> -
> - if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
> - return false;
> - if (unlikely(len < dio_blocksize))
> - return false;
> -
> - first_end = round_up(offset, dio_blocksize);
> - orig_end = offset + len;
> - middle_end = round_down(orig_end, dio_blocksize);
> -
> - args->first = first_end - offset;
> - args->middle = middle_end - first_end;
> - args->last = orig_end - middle_end;
> - return true;
> -}
> -
> /*
> * Check if the bvec iterator is aligned for direct I/O.
> *
> @@ -1316,35 +1293,66 @@ nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
> segment->use_dio = false;
> }
>
> -static bool
> -nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
> - unsigned long total,
> - struct nfsd_write_dio_args *args)
> +static void
> +nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
> + loff_t offset, unsigned long total,
> + struct nfsd_write_dio_args *args)
> {
> + u32 offset_align = args->nf->nf_dio_offset_align;
> + u32 mem_align = args->nf->nf_dio_mem_align;
> + loff_t prefix_end, orig_end, middle_end;
> + size_t prefix, middle, suffix;
> +
> args->nsegs = 0;
>
> - if (args->first) {
> + /*
> + * Check if direct I/O is feasible for this write request.
> + * If alignments are not available, the write is too small,
> + * or no alignment can be found, fall back to buffered I/O.
> + */
> + if (unlikely(!mem_align || !offset_align) ||
> + unlikely(total < max(offset_align, mem_align)))
> + goto no_dio;
> +
> + /* Calculate aligned segments */
> + prefix_end = round_up(offset, offset_align);
> + orig_end = offset + total;
> + middle_end = round_down(orig_end, offset_align);
> +
> + prefix = prefix_end - offset;
> + middle = middle_end - prefix_end;
> + suffix = orig_end - middle_end;
So first, middle, last is now prefix, middle, suffix. I do like that
more though the constant change leaves me a bit dizzy :-)
Again, a nice simplification - fewer paths and special cases so easier
to follow.
Reviewed-by: NeilBrown <neil@brown.name>
NeilBrown
> +
> + if (!middle)
> + goto no_dio;
> +
> + if (prefix) {
> nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
> - nvecs, total, 0, args->first);
> + nvecs, total, 0, prefix);
> ++args->nsegs;
> }
>
> nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
> - total, args->first, args->middle);
> + total, prefix, middle);
> if (!nfsd_iov_iter_aligned_bvec(args->nf,
> &args->segment[args->nsegs].iter))
> - return false; /* no DIO-aligned IO possible */
> + goto no_dio;
> args->segment[args->nsegs].use_dio = true;
> ++args->nsegs;
>
> - if (args->last) {
> + if (suffix) {
> nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
> - nvecs, total, args->first +
> - args->middle, args->last);
> + nvecs, total, prefix + middle, suffix);
> ++args->nsegs;
> }
>
> - return true;
> + return;
> +
> +no_dio:
> + /* No alignment possible - pack into single non-DIO segment */
> + nfsd_write_dio_seg_init(&args->segment[0], bvec, nvecs, total,
> + 0, total);
> + args->nsegs = 1;
> }
>
> static int
> @@ -1365,7 +1373,7 @@ nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> }
>
> static int
> -nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
> +nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
> struct kiocb *kiocb, unsigned int nvecs, unsigned long *cnt,
> struct nfsd_write_dio_args *args)
> {
> @@ -1373,9 +1381,6 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
> ssize_t host_err;
> unsigned int i;
>
> - if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, *cnt, args))
> - return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
> -
> /*
> * Any buffered IO issued here will be misaligned, use
> * sync IO to ensure it has completed before returning.
> @@ -1384,6 +1389,9 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
> kiocb->ki_flags |= (IOCB_DSYNC|IOCB_SYNC);
> *stable_how = NFS_FILE_SYNC;
>
> + nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
> + *cnt, args);
> +
> *cnt = 0;
> for (i = 0; i < args->nsegs; i++) {
> if (args->segment[i].use_dio) {
> @@ -1422,11 +1430,8 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> kiocb->ki_flags |= IOCB_DONTCACHE;
>
> - if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, &args))
> - return nfsd_issue_write_dio(rqstp, fhp, stable_how, kiocb,
> - nvecs, cnt, &args);
> -
> - return nfsd_buffered_write(rqstp, args.nf->nf_file, nvecs, cnt, kiocb);
> + return nfsd_issue_dio_write(rqstp, fhp, stable_how, kiocb, nvecs,
> + cnt, &args);
> }
>
> /**
> --
> 2.51.0
>
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v9 10/12] NFSD: Handle kiocb->ki_flags correctly
2025-11-03 16:53 [PATCH v9 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (8 preceding siblings ...)
2025-11-03 16:53 ` [PATCH v9 09/12] NFSD: Combine direct I/O feasibility check with iterator setup Chuck Lever
@ 2025-11-03 16:53 ` Chuck Lever
2025-11-03 21:14 ` Mike Snitzer
2025-11-03 23:03 ` NeilBrown
2025-11-03 16:53 ` [PATCH v9 11/12] NFSD: Refactor nfsd_vfs_write Chuck Lever
2025-11-03 16:53 ` [PATCH v9 12/12] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
11 siblings, 2 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 16:53 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Christoph Hellwig
From: Chuck Lever <chuck.lever@oracle.com>
Christoph says:
> > + if (file->f_op->fop_flags & FOP_DONTCACHE)
> > + kiocb->ki_flags |= IOCB_DONTCACHE;
> IOCB_DONTCACHE isn't defined for IOCB_DIRECT. So this should
> move into a branch just for buffered I/O.
and
> > Promoting all NFSD_IO_DIRECT writes to FILE_SYNC was my idea,
> > based on the assumption that IOCB_DIRECT writes to local file
> > systems left nothing to be done by a later commit. My assumption
> > is based on the behavior of O_DIRECT on NFS files.
> >
> > If that assumption is not true, then I agree there is no
> > technical reason to promote NFSD_IO_DIRECT writes to FILE_SYNC,
> > and I can remove that built-in assumption for v8 of this series.
>
> It is not true, or rather only true for a tiny subset of use cases
> (which NFS can't even query a head of time).
So, observe the existing setting of ki_flags rather than forcing
persistence unconditionally, and ensure that DONTCACHE is not set
for IOCB_DIRECT writes.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1b1173de4f78..0e5e82b286f1 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1261,6 +1261,8 @@ struct nfsd_write_dio_seg {
struct nfsd_write_dio_args {
struct nfsd_file *nf;
+ int flags_buffered;
+ int flags_direct;
unsigned int nsegs;
struct nfsd_write_dio_seg segment[3];
};
@@ -1373,33 +1375,25 @@ nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
}
static int
-nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
- struct kiocb *kiocb, unsigned int nvecs, unsigned long *cnt,
- struct nfsd_write_dio_args *args)
+nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct kiocb *kiocb, unsigned int nvecs,
+ unsigned long *cnt, struct nfsd_write_dio_args *args)
{
struct file *file = args->nf->nf_file;
ssize_t host_err;
unsigned int i;
- /*
- * Any buffered IO issued here will be misaligned, use
- * sync IO to ensure it has completed before returning.
- * Also update @stable_how to avoid need for COMMIT.
- */
- kiocb->ki_flags |= (IOCB_DSYNC|IOCB_SYNC);
- *stable_how = NFS_FILE_SYNC;
-
nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
*cnt, args);
*cnt = 0;
for (i = 0; i < args->nsegs; i++) {
if (args->segment[i].use_dio) {
- kiocb->ki_flags |= IOCB_DIRECT;
+ kiocb->ki_flags = args->flags_direct;
trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
args->segment[i].iter.count);
} else
- kiocb->ki_flags &= ~IOCB_DIRECT;
+ kiocb->ki_flags = args->flags_buffered;
host_err = vfs_iocb_iter_write(file, kiocb,
&args->segment[i].iter);
@@ -1423,15 +1417,16 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
args.nf = nf;
/*
- * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
- * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
- * be ignored for any DIO issued here).
+ * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
+ * writing unaligned segments or handling fallback I/O.
*/
+ args.flags_buffered = kiocb->ki_flags;
if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
- kiocb->ki_flags |= IOCB_DONTCACHE;
+ args.flags_buffered |= IOCB_DONTCACHE;
- return nfsd_issue_dio_write(rqstp, fhp, stable_how, kiocb, nvecs,
- cnt, &args);
+ args.flags_direct = kiocb->ki_flags | IOCB_DIRECT;
+
+ return nfsd_issue_dio_write(rqstp, fhp, kiocb, nvecs, cnt, &args);
}
/**
--
2.51.0
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH v9 10/12] NFSD: Handle kiocb->ki_flags correctly
2025-11-03 16:53 ` [PATCH v9 10/12] NFSD: Handle kiocb->ki_flags correctly Chuck Lever
@ 2025-11-03 21:14 ` Mike Snitzer
2025-11-03 23:03 ` NeilBrown
1 sibling, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2025-11-03 21:14 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever, Christoph Hellwig
On Mon, Nov 03, 2025 at 11:53:49AM -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Christoph says:
> > > + if (file->f_op->fop_flags & FOP_DONTCACHE)
> > > + kiocb->ki_flags |= IOCB_DONTCACHE;
> > IOCB_DONTCACHE isn't defined for IOCB_DIRECT. So this should
> > move into a branch just for buffered I/O.
>
> and
>
> > > Promoting all NFSD_IO_DIRECT writes to FILE_SYNC was my idea,
> > > based on the assumption that IOCB_DIRECT writes to local file
> > > systems left nothing to be done by a later commit. My assumption
> > > is based on the behavior of O_DIRECT on NFS files.
> > >
> > > If that assumption is not true, then I agree there is no
> > > technical reason to promote NFSD_IO_DIRECT writes to FILE_SYNC,
> > > and I can remove that built-in assumption for v8 of this series.
> >
> > It is not true, or rather only true for a tiny subset of use cases
> > (which NFS can't even query a head of time).
>
> So, observe the existing setting of ki_flags rather than forcing
> persistence unconditionally, and ensure that DONTCACHE is not set
> for IOCB_DIRECT writes.
>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v9 10/12] NFSD: Handle kiocb->ki_flags correctly
2025-11-03 16:53 ` [PATCH v9 10/12] NFSD: Handle kiocb->ki_flags correctly Chuck Lever
2025-11-03 21:14 ` Mike Snitzer
@ 2025-11-03 23:03 ` NeilBrown
2025-11-04 11:55 ` Christoph Hellwig
1 sibling, 1 reply; 37+ messages in thread
From: NeilBrown @ 2025-11-03 23:03 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever, Christoph Hellwig
On Tue, 04 Nov 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Christoph says:
> > > + if (file->f_op->fop_flags & FOP_DONTCACHE)
> > > + kiocb->ki_flags |= IOCB_DONTCACHE;
> > IOCB_DONTCACHE isn't defined for IOCB_DIRECT. So this should
> > move into a branch just for buffered I/O.
>
> and
>
> > > Promoting all NFSD_IO_DIRECT writes to FILE_SYNC was my idea,
> > > based on the assumption that IOCB_DIRECT writes to local file
> > > systems left nothing to be done by a later commit. My assumption
> > > is based on the behavior of O_DIRECT on NFS files.
> > >
> > > If that assumption is not true, then I agree there is no
> > > technical reason to promote NFSD_IO_DIRECT writes to FILE_SYNC,
> > > and I can remove that built-in assumption for v8 of this series.
> >
> > It is not true, or rather only true for a tiny subset of use cases
> > (which NFS can't even query a head of time).
>
> So, observe the existing setting of ki_flags rather than forcing
> persistence unconditionally, and ensure that DONTCACHE is not set
> for IOCB_DIRECT writes.
Of the write has no prefix or suffix and so is entirely O_DIRECT, then
we get DATASYNC for free do we not?
In that case we should tell the client that DATASYNC has been provided.
I might have missed something in all the refactoring, but this patch
seems to suggest.
In fact, with this change as it stands, the earlier change to allow a
different stable_how to be returned is not used... but I think there
are cases where it could be used.
NeilBrown
>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 33 ++++++++++++++-------------------
> 1 file changed, 14 insertions(+), 19 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 1b1173de4f78..0e5e82b286f1 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1261,6 +1261,8 @@ struct nfsd_write_dio_seg {
>
> struct nfsd_write_dio_args {
> struct nfsd_file *nf;
> + int flags_buffered;
> + int flags_direct;
> unsigned int nsegs;
> struct nfsd_write_dio_seg segment[3];
> };
> @@ -1373,33 +1375,25 @@ nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> }
>
> static int
> -nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
> - struct kiocb *kiocb, unsigned int nvecs, unsigned long *cnt,
> - struct nfsd_write_dio_args *args)
> +nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + struct kiocb *kiocb, unsigned int nvecs,
> + unsigned long *cnt, struct nfsd_write_dio_args *args)
> {
> struct file *file = args->nf->nf_file;
> ssize_t host_err;
> unsigned int i;
>
> - /*
> - * Any buffered IO issued here will be misaligned, use
> - * sync IO to ensure it has completed before returning.
> - * Also update @stable_how to avoid need for COMMIT.
> - */
> - kiocb->ki_flags |= (IOCB_DSYNC|IOCB_SYNC);
> - *stable_how = NFS_FILE_SYNC;
> -
> nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
> *cnt, args);
>
> *cnt = 0;
> for (i = 0; i < args->nsegs; i++) {
> if (args->segment[i].use_dio) {
> - kiocb->ki_flags |= IOCB_DIRECT;
> + kiocb->ki_flags = args->flags_direct;
> trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
> args->segment[i].iter.count);
> } else
> - kiocb->ki_flags &= ~IOCB_DIRECT;
> + kiocb->ki_flags = args->flags_buffered;
>
> host_err = vfs_iocb_iter_write(file, kiocb,
> &args->segment[i].iter);
> @@ -1423,15 +1417,16 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> args.nf = nf;
>
> /*
> - * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
> - * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
> - * be ignored for any DIO issued here).
> + * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
> + * writing unaligned segments or handling fallback I/O.
> */
> + args.flags_buffered = kiocb->ki_flags;
> if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> - kiocb->ki_flags |= IOCB_DONTCACHE;
> + args.flags_buffered |= IOCB_DONTCACHE;
>
> - return nfsd_issue_dio_write(rqstp, fhp, stable_how, kiocb, nvecs,
> - cnt, &args);
> + args.flags_direct = kiocb->ki_flags | IOCB_DIRECT;
> +
> + return nfsd_issue_dio_write(rqstp, fhp, kiocb, nvecs, cnt, &args);
> }
>
> /**
> --
> 2.51.0
>
>
^ permalink raw reply [flat|nested] 37+ messages in thread* Re: [PATCH v9 10/12] NFSD: Handle kiocb->ki_flags correctly
2025-11-03 23:03 ` NeilBrown
@ 2025-11-04 11:55 ` Christoph Hellwig
0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2025-11-04 11:55 UTC (permalink / raw)
To: NeilBrown
Cc: Chuck Lever, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On Tue, Nov 04, 2025 at 10:03:19AM +1100, NeilBrown wrote:
> > So, observe the existing setting of ki_flags rather than forcing
> > persistence unconditionally, and ensure that DONTCACHE is not set
> > for IOCB_DIRECT writes.
>
> Of the write has no prefix or suffix and so is entirely O_DIRECT, then
> we get DATASYNC for free do we not?
No. O_DIRECT still needs IOCB_DSYNC for datasync semantics.
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v9 11/12] NFSD: Refactor nfsd_vfs_write
2025-11-03 16:53 [PATCH v9 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (9 preceding siblings ...)
2025-11-03 16:53 ` [PATCH v9 10/12] NFSD: Handle kiocb->ki_flags correctly Chuck Lever
@ 2025-11-03 16:53 ` Chuck Lever
2025-11-03 21:15 ` Mike Snitzer
2025-11-03 23:05 ` NeilBrown
2025-11-03 16:53 ` [PATCH v9 12/12] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
11 siblings, 2 replies; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 16:53 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
There is now only one caller of nfsd_buffered_write(), so it can
be folded back into nfsd_vfs_write().
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 27 +++++++--------------------
1 file changed, 7 insertions(+), 20 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0e5e82b286f1..8518fb619f56 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1357,23 +1357,6 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
args->nsegs = 1;
}
-static int
-nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
- unsigned int nvecs, unsigned long *cnt,
- struct kiocb *kiocb)
-{
- struct iov_iter iter;
- int host_err;
-
- iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
- host_err = vfs_iocb_iter_write(file, kiocb, &iter);
- if (host_err < 0)
- return host_err;
- *cnt = host_err;
-
- return 0;
-}
-
static int
nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct kiocb *kiocb, unsigned int nvecs,
@@ -1457,6 +1440,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
u32 stable = *stable_how;
struct kiocb kiocb;
struct svc_export *exp;
+ struct iov_iter iter;
errseq_t since;
__be32 nfserr;
int host_err;
@@ -1517,10 +1501,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags |= IOCB_DONTCACHE;
- fallthrough; /* must call nfsd_buffered_write */
+ fallthrough;
case NFSD_IO_BUFFERED:
- host_err = nfsd_buffered_write(rqstp, file,
- nvecs, cnt, &kiocb);
+ iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+ host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
+ if (host_err < 0)
+ break;
+ *cnt = host_err;
break;
}
if (host_err < 0) {
--
2.51.0
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH v9 11/12] NFSD: Refactor nfsd_vfs_write
2025-11-03 16:53 ` [PATCH v9 11/12] NFSD: Refactor nfsd_vfs_write Chuck Lever
@ 2025-11-03 21:15 ` Mike Snitzer
2025-11-03 23:05 ` NeilBrown
1 sibling, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2025-11-03 21:15 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On Mon, Nov 03, 2025 at 11:53:50AM -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> There is now only one caller of nfsd_buffered_write(), so it can
> be folded back into nfsd_vfs_write().
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v9 11/12] NFSD: Refactor nfsd_vfs_write
2025-11-03 16:53 ` [PATCH v9 11/12] NFSD: Refactor nfsd_vfs_write Chuck Lever
2025-11-03 21:15 ` Mike Snitzer
@ 2025-11-03 23:05 ` NeilBrown
1 sibling, 0 replies; 37+ messages in thread
From: NeilBrown @ 2025-11-03 23:05 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever
On Tue, 04 Nov 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> There is now only one caller of nfsd_buffered_write(), so it can
> be folded back into nfsd_vfs_write().
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Looks good
Reviewed-by: NeilBrown <neil@brown.name>
Thanks,
NeilBrown
> ---
> fs/nfsd/vfs.c | 27 +++++++--------------------
> 1 file changed, 7 insertions(+), 20 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 0e5e82b286f1..8518fb619f56 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1357,23 +1357,6 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
> args->nsegs = 1;
> }
>
> -static int
> -nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> - unsigned int nvecs, unsigned long *cnt,
> - struct kiocb *kiocb)
> -{
> - struct iov_iter iter;
> - int host_err;
> -
> - iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> - host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> - if (host_err < 0)
> - return host_err;
> - *cnt = host_err;
> -
> - return 0;
> -}
> -
> static int
> nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> struct kiocb *kiocb, unsigned int nvecs,
> @@ -1457,6 +1440,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> u32 stable = *stable_how;
> struct kiocb kiocb;
> struct svc_export *exp;
> + struct iov_iter iter;
> errseq_t since;
> __be32 nfserr;
> int host_err;
> @@ -1517,10 +1501,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> case NFSD_IO_DONTCACHE:
> if (file->f_op->fop_flags & FOP_DONTCACHE)
> kiocb.ki_flags |= IOCB_DONTCACHE;
> - fallthrough; /* must call nfsd_buffered_write */
> + fallthrough;
> case NFSD_IO_BUFFERED:
> - host_err = nfsd_buffered_write(rqstp, file,
> - nvecs, cnt, &kiocb);
> + iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> + host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
> + if (host_err < 0)
> + break;
> + *cnt = host_err;
> break;
> }
> if (host_err < 0) {
> --
> 2.51.0
>
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v9 12/12] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
2025-11-03 16:53 [PATCH v9 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (10 preceding siblings ...)
2025-11-03 16:53 ` [PATCH v9 11/12] NFSD: Refactor nfsd_vfs_write Chuck Lever
@ 2025-11-03 16:53 ` Chuck Lever
2025-11-03 21:25 ` Mike Snitzer
11 siblings, 1 reply; 37+ messages in thread
From: Chuck Lever @ 2025-11-03 16:53 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer
From: Mike Snitzer <snitzer@kernel.org>
This document details the NFSD IO modes that are configurable using
NFSD's experimental debugfs interfaces:
/sys/kernel/debug/nfsd/io_cache_read
/sys/kernel/debug/nfsd/io_cache_write
This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
debugfs interfaces are replaced with per-export controls).
Future updates will provide more specific guidance and howto
information to help others use and evaluate NFSD's IO modes:
BUFFERED, DONTCACHE and DIRECT.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
.../filesystems/nfs/nfsd-io-modes.rst | 144 ++++++++++++++++++
1 file changed, 144 insertions(+)
create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
new file mode 100644
index 000000000000..4863885c7035
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -0,0 +1,144 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+NFSD IO MODES
+=============
+
+Overview
+========
+
+NFSD has historically always used buffered IO when servicing READ and
+WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
+to override that default to use either DONTCACHE or DIRECT IO modes.
+
+Experimental NFSD debugfs interfaces are available to allow the NFSD IO
+mode used for READ and WRITE to be configured independently. See both:
+- /sys/kernel/debug/nfsd/io_cache_read
+- /sys/kernel/debug/nfsd/io_cache_write
+
+The default value for both io_cache_read and io_cache_write reflects
+NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
+
+Based on the configured settings, NFSD's IO will either be:
+- cached using page cache (NFSD_IO_BUFFERED=0)
+- cached but removed from the page cache upon completion
+ (NFSD_IO_DONTCACHE=1).
+- not cached (NFSD_IO_DIRECT=2)
+
+To set an NFSD IO mode, write a supported value (0, 1 or 2) to the
+corresponding IO operation's debugfs interface, e.g.:
+ echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+
+To check which IO mode NFSD is using for READ or WRITE, simply read the
+corresponding IO operation's debugfs interface, e.g.:
+ cat /sys/kernel/debug/nfsd/io_cache_read
+
+NFSD DONTCACHE
+==============
+
+DONTCACHE offers a hybrid approach to servicing IO that aims to offer
+the benefits of using DIRECT IO without any of the strict alignment
+requirements that DIRECT IO imposes. To achieve this buffered IO is used
+but the IO is flagged to "drop behind" (meaning associated pages are
+dropped from the page cache) when IO completes.
+
+DONTCACHE aims to avoid what has proven to be a fairly significant
+limition of Linux's memory management subsystem if/when large amounts of
+data is infrequently accessed (e.g. read once _or_ written once but not
+read until much later). Such use-cases are particularly problematic
+because the page cache will eventually become a bottleneck to surfacing
+new IO requests.
+
+For more context, please see these Linux commit headers:
+- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
+ to take a struct kiocb")
+- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
+ RWF_DONTCACHE")
+- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
+
+If NFSD_IO_DONTCACHE is specified by writing 1 to NFSD's debugfs
+interfaces, FOP_DONTCACHE must be advertised as supported by the
+underlying filesystem (e.g. XFS), otherwise all IO flagged with
+RWF_DONTCACHE will fail with -EOPNOTSUPP.
+
+NFSD DIRECT
+===========
+
+DIRECT IO doesn't make use of the page cache, as such it is able to
+avoid the Linux memory management's page reclaim scalability problems
+without resorting to the hybrid use of page cache that DONTCACHE does.
+
+Some workloads benefit from NFSD avoiding the page cache, particularly
+those with a working set that is significantly larger than available
+system memory. The pathological worst-case workload that NFSD DIRECT has
+proven to help most is: NFS client issuing large sequential IO to a file
+that is 2-3 times larger than the NFS server's available system memory.
+
+The performance win associated with using NFSD DIRECT was previously
+discussed on linux-nfs, see:
+https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
+But in summary:
+- NFSD DIRECT can signicantly reduce memory requirements
+- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
+- NFSD DIRECT can offer more deterministic IO performance
+
+As always, your mileage may vary and so it is important to carefully
+consider if/when it is beneficial to make use of NFSD DIRECT. When
+assessing comparative performance of your workload please be sure to log
+relevant performance metrics during testing (e.g. memory usage, cpu
+usage, IO performance). Using perf to collect perf data that may be used
+to generate a "flamegraph" for work Linux must perform on behalf of your
+test is a really meaningful way to compare the relative health of the
+system and how switching NFSD's IO mode changes what is observed.
+
+If NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
+interfaces, ideally the IO will be aligned relative to the underlying
+block device's logical_block_size. Also the memory buffer used to store
+the READ or WRITE payload must be aligned relative to the underlying
+block device's dma_alignment.
+
+But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
+it can:
+
+Misaligned READ:
+ If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
+ DIO-aligned block (on either end of the READ). The expanded READ is
+ verified to have proper offset/len (logical_block_size) and
+ dma_alignment checking.
+
+ Any misaligned READ that is less than 32K won't be expanded to be
+ DIO-aligned (this heuristic just avoids excess work, like allocating
+ start_extra_page, for smaller IO that can generally already perform
+ well using buffered IO).
+
+Misaligned WRITE:
+ If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
+ middle and end as needed. The large middle extent is DIO-aligned and
+ the start and/or end are misaligned. Buffered IO is used for the
+ misaligned extents and O_DIRECT is used for the middle DIO-aligned
+ extent.
+
+ If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
+ invalidate the page cache on behalf of the DIO WRITE, then
+ nfsd_issue_write_dio() will fall back to using buffered IO.
+
+Tracing:
+ The nfsd_analyze_read_dio trace event shows how NFSD expands any
+ misaligned READ to the next DIO-aligned block (on either end of the
+ original READ, as needed).
+
+ This combination of trace events is useful for READs:
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
+ echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
+
+ The nfsd_analyze_write_dio trace event shows how NFSD splits a given
+ misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
+ extent.
+
+ This combination of trace events is useful for WRITEs:
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
+ echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
--
2.51.0
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH v9 12/12] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
2025-11-03 16:53 ` [PATCH v9 12/12] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
@ 2025-11-03 21:25 ` Mike Snitzer
0 siblings, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2025-11-03 21:25 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs
On Mon, Nov 03, 2025 at 11:53:51AM -0500, Chuck Lever wrote:
> From: Mike Snitzer <snitzer@kernel.org>
>
> This document details the NFSD IO modes that are configurable using
> NFSD's experimental debugfs interfaces:
>
> /sys/kernel/debug/nfsd/io_cache_read
> /sys/kernel/debug/nfsd/io_cache_write
>
> This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
> debugfs interfaces are replaced with per-export controls).
>
> Future updates will provide more specific guidance and howto
> information to help others use and evaluate NFSD's IO modes:
> BUFFERED, DONTCACHE and DIRECT.
>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> .../filesystems/nfs/nfsd-io-modes.rst | 144 ++++++++++++++++++
> 1 file changed, 144 insertions(+)
> create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
Most of this document holds up, but the tracing section is stale:
nfsd_analyze_{read,write}_dio was removed.
nfsd_write_direct was added.
But I'll go over it all and backfill analysis like asked, will send an
incremental diff (probaby tomorrow).
Mike
^ permalink raw reply [flat|nested] 37+ messages in thread