* [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
@ 2025-10-27 15:46 Chuck Lever
2025-10-27 15:46 ` [PATCH v8 01/12] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
` (11 more replies)
0 siblings, 12 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Following on https://lore.kernel.org/linux-nfs/aPAci7O_XK1ljaum@kernel.org/
this series includes the patches needed to make NFSD Direct WRITE
operational.
I've reduced the amount of CPU effort needed to determine when
IOCB_DIRECT can be used. Because a receive buffer is comprised of
contiguous pages, the elements in the bvec after the first one
always have bv_offset == 0, thus they are always memory-aligned.
From there, further simplifications are possible. (This is an effort
to address Christoph's observation that the construction of the
bvec array was inefficient and duplicative).
We've also determined that there might be actual work in the storage
layer that might be better off left until a subsequent COMMIT. So
this series no longer promotes all NFSD_IO_DIRECT writes to be
FILE_SYNC, but instead, observes the client's stable_how argument.
If any of these changes results in a regression, I'll revisit.
Noted that one of the fstests triggers the NFS server's kernel
to emit this warning:
WARNING: CPU: 5 PID: 1348 at fs/iomap/buffered-io.c:1402
iomap_zero_iter+0x1a4/0x390
mlperf_npz_tool.py passes on NFSv3 with both TCP and RDMA, as does
the git regression test suite with jobs=24.
Changes since v7:
* Rebase the series on Mike's original v3 patch
* Address more review comments
* Optimize the "when can NFSD use IOCB_DIRECT" logic
* Revert the "always promote to FILE_SYNC" logic
Changes since v6:
* Patches to address review comments have been split out
* Refactored the iter initialization code
Changes since v5:
* Add a patch to make FILE_SYNC WRITEs persist timestamps
* Address some of Christoph's review comments
* The svcrdma patch has been dropped until we actually need it
Changes since v4:
* Split out refactoring nfsd_buffered_write() into a separate patch
* Expand patch description of 1/4
* Don't set IOCB_SYNC flag
Changes since v3:
* Address checkpatch.pl nits in 2/3
* Add an untested patch to mark ingress RDMA Read chunks
Chuck Lever (11):
NFSD: Make FILE_SYNC WRITEs comply with spec
NFSD: Enable return of an updated stable_how to NFS clients
NFSD: Remove specific error handling
NFSD: Remove alignment size checking
NFSD: Clean up struct nfsd_write_dio
NFSD: Introduce struct nfsd_write_dio_seg
NFSD: Simplify nfsd_iov_iter_aligned_bvec()
NFSD: Handle both offset and memory alignment for direct I/O
NFSD: Combine direct I/O feasibility check with iterator setup
NFSD: Handle kiocb->ki_flags correctly
NFSD: Refactor nfsd_vfs_write
Mike Snitzer (1):
NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
fs/nfsd/debugfs.c | 1 +
fs/nfsd/filecache.c | 5 +
fs/nfsd/filecache.h | 1 +
fs/nfsd/nfs3proc.c | 2 +-
fs/nfsd/nfs4proc.c | 2 +-
fs/nfsd/nfsproc.c | 3 +-
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 222 +++++++++++++++++++++++++++++++++++++++++---
fs/nfsd/vfs.h | 6 +-
fs/nfsd/xdr3.h | 2 +-
10 files changed, 228 insertions(+), 17 deletions(-)
--
2.51.0
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v8 01/12] NFSD: Make FILE_SYNC WRITEs comply with spec
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-27 15:46 ` Chuck Lever
2025-10-27 15:46 ` [PATCH v8 02/12] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
` (10 subsequent siblings)
11 siblings, 0 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever, Mike Snitzer
From: Chuck Lever <chuck.lever@oracle.com>
Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
does not also persist file time stamps. To wit, Section 18.32.3
of RFC 8881 mandates:
> The client specifies with the stable parameter the method of how
> the data is to be processed by the server. If stable is
> FILE_SYNC4, the server MUST commit the data written plus all file
> system metadata to stable storage before returning results. This
> corresponds to the NFSv2 protocol semantics. Any other behavior
> constitutes a protocol violation. If stable is DATA_SYNC4, then
> the server MUST commit all of the data to stable storage and
> enough of the metadata to retrieve the data before returning.
For many years, NFSD has used a "data sync only" optimization for
FILE_SYNC WRITEs, in violation of the above text (and previous
incarnations of the NFS standard). File time stamps haven't been
forced to be persisted as the mandate above requires.
For many in-tree Linux file systems, time stamps, as well as other
metadata not needed to find the data on disk, are piggy-backed on
fdatasync-mandatory metadata updates. Thus, this change should
affect only the case where previously fully-allocated file data
is overwritten.
Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f537a7b4ee01..5333d49910d9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1314,8 +1314,18 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
stable = NFS_UNSTABLE;
init_sync_kiocb(&kiocb, file);
kiocb.ki_pos = offset;
- if (stable && !fhp->fh_use_wgather)
- kiocb.ki_flags |= IOCB_DSYNC;
+ if (likely(!fhp->fh_use_wgather)) {
+ switch (stable) {
+ case NFS_FILE_SYNC:
+ /* persist data and timestamps */
+ kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
+ break;
+ case NFS_DATA_SYNC:
+ /* persist data only */
+ kiocb.ki_flags |= IOCB_DSYNC;
+ break;
+ }
+ }
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v8 02/12] NFSD: Enable return of an updated stable_how to NFS clients
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-27 15:46 ` [PATCH v8 01/12] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
@ 2025-10-27 15:46 ` Chuck Lever
2025-10-27 15:46 ` [PATCH v8 03/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (9 subsequent siblings)
11 siblings, 0 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
NFSv3 and newer protocols enable clients to perform a two-phase
WRITE. A client requests an UNSTABLE WRITE, which sends dirty data
to the NFS server, but does not persist it. The server replies that
it performed the UNSTABLE WRITE, and the client is then obligated to
follow up with a COMMIT request before it can remove the dirty data
from its own page cache. The COMMIT reply is the client's guarantee
that the written data has been persisted on the server.
The purpose of this protocol design is to enable clients to send
a large amount of data via multiple WRITE requests to a server, and
then wait for persistence just once. The server is able to start
persisting the data as soon as it gets it, to shorten the length of
time the client has to wait for the final COMMIT to complete.
It's also possible for the server to respond to an UNSTABLE WRITE
request in a way that indicates that the data was persisted anyway.
In that case, the client can skip the COMMIT and remove the dirty
data from its memory immediately. NetApp filers, for example, do
this because they have a battery-backed cache and can guarantee that
written data is persisted quickly and immediately.
NFSD has never implemented this kind of promotion. UNSTABLE WRITE
requests are unconditionally treated as UNSTABLE. However, in a
subsequent patch, nfsd_vfs_write() will be able to promote an
UNSTABLE WRITE to be a FILE_SYNC WRITE. This will be because NFSD
will handle some WRITE requests locally with O_DIRECT, which
persists written data immediately. The FILE_SYNC WRITE response
indicates to the client that no follow-up COMMIT is necessary.
This patch prepares for that change by enabling the modified
stable_how value to be passed along to NFSD's WRITE reply encoder.
No behavior change is expected.
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/nfs3proc.c | 2 +-
fs/nfsd/nfs4proc.c | 2 +-
fs/nfsd/nfsproc.c | 3 ++-
fs/nfsd/vfs.c | 11 ++++++-----
fs/nfsd/vfs.h | 6 ++++--
fs/nfsd/xdr3.h | 2 +-
6 files changed, 15 insertions(+), 11 deletions(-)
diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index b6d03e1ef5f7..ad14b34583bb 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -236,7 +236,7 @@ nfsd3_proc_write(struct svc_rqst *rqstp)
resp->committed = argp->stable;
resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
&argp->payload, &cnt,
- resp->committed, resp->verf);
+ &resp->committed, resp->verf);
resp->count = cnt;
resp->status = nfsd3_map_status(resp->status);
return rpc_success;
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 7f7e6bb23a90..2222bb283baf 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1285,7 +1285,7 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
write->wr_how_written = write->wr_stable_how;
status = nfsd_vfs_write(rqstp, &cstate->current_fh, nf,
write->wr_offset, &write->wr_payload,
- &cnt, write->wr_how_written,
+ &cnt, &write->wr_how_written,
(__be32 *)write->wr_verifier.data);
nfsd_file_put(nf);
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 8f71f5748c75..706401ed6f8d 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -251,6 +251,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
struct nfsd_writeargs *argp = rqstp->rq_argp;
struct nfsd_attrstat *resp = rqstp->rq_resp;
unsigned long cnt = argp->len;
+ u32 committed = NFS_DATA_SYNC;
dprintk("nfsd: WRITE %s %u bytes at %d\n",
SVCFH_fmt(&argp->fh),
@@ -258,7 +259,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
fh_copy(&resp->fh, &argp->fh);
resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
- &argp->payload, &cnt, NFS_DATA_SYNC, NULL);
+ &argp->payload, &cnt, &committed, NULL);
if (resp->status == nfs_ok)
resp->status = fh_getattr(&resp->fh, &resp->stat);
else if (resp->status == nfserr_jukebox)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5333d49910d9..f3be36b960e5 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1262,7 +1262,7 @@ static int wait_for_concurrent_writes(struct file *file)
* @offset: Byte offset of start
* @payload: xdr_buf containing the write payload
* @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
* @verf: NFS WRITE verifier
*
* Upon return, caller must invoke fh_put on @fhp.
@@ -1274,11 +1274,12 @@ __be32
nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct nfsd_file *nf, loff_t offset,
const struct xdr_buf *payload, unsigned long *cnt,
- int stable, __be32 *verf)
+ u32 *stable_how, __be32 *verf)
{
struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
struct file *file = nf->nf_file;
struct super_block *sb = file_inode(file)->i_sb;
+ u32 stable = *stable_how;
struct kiocb kiocb;
struct svc_export *exp;
struct iov_iter iter;
@@ -1444,7 +1445,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* @offset: Byte offset of start
* @payload: xdr_buf containing the write payload
* @cnt: IN: number of bytes to write, OUT: number of bytes actually written
- * @stable: An NFS stable_how value
+ * @stable_how: IN: Client's requested stable_how, OUT: Actual stable_how
* @verf: NFS WRITE verifier
*
* Upon return, caller must invoke fh_put on @fhp.
@@ -1454,7 +1455,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
*/
__be32
nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
- const struct xdr_buf *payload, unsigned long *cnt, int stable,
+ const struct xdr_buf *payload, unsigned long *cnt, u32 *stable_how,
__be32 *verf)
{
struct nfsd_file *nf;
@@ -1467,7 +1468,7 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
goto out;
err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt,
- stable, verf);
+ stable_how, verf);
nfsd_file_put(nf);
out:
trace_nfsd_write_done(rqstp, fhp, offset, *cnt);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index fa46f8b5f132..c713ed0b04e0 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -130,11 +130,13 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
u32 *eof);
__be32 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
loff_t offset, const struct xdr_buf *payload,
- unsigned long *cnt, int stable, __be32 *verf);
+ unsigned long *cnt, u32 *stable_how,
+ __be32 *verf);
__be32 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct nfsd_file *nf, loff_t offset,
const struct xdr_buf *payload,
- unsigned long *cnt, int stable, __be32 *verf);
+ unsigned long *cnt, u32 *stable_how,
+ __be32 *verf);
__be32 nfsd_readlink(struct svc_rqst *, struct svc_fh *,
char *, int *);
__be32 nfsd_symlink(struct svc_rqst *, struct svc_fh *,
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 522067b7fd75..c0e443ef3a6b 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -152,7 +152,7 @@ struct nfsd3_writeres {
__be32 status;
struct svc_fh fh;
unsigned long count;
- int committed;
+ u32 committed;
__be32 verf[2];
};
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v8 03/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-27 15:46 ` [PATCH v8 01/12] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-10-27 15:46 ` [PATCH v8 02/12] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
@ 2025-10-27 15:46 ` Chuck Lever
2025-10-27 15:46 ` [PATCH v8 04/12] NFSD: Remove specific error handling Chuck Lever
` (8 subsequent siblings)
11 siblings, 0 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Mike Snitzer
From: Mike Snitzer <snitzer@kernel.org>
If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
middle and end as needed. The large middle extent is DIO-aligned and
the start and/or end are misaligned. Synchronous buffered IO (with
preference towards using DONTCACHE) is used for the misaligned extents
and O_DIRECT is used for the middle DIO-aligned extent.
nfsd_issue_write_dio() promotes @stable_how to NFS_FILE_SYNC, which
allows the client to drop its dirty data and avoid needing an extra
COMMIT operation.
If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
invalidate the page cache on behalf of the DIO WRITE, then
nfsd_issue_write_dio() will fall back to using buffered IO.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/debugfs.c | 1 +
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 218 ++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 215 insertions(+), 5 deletions(-)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 00eb1ecef6ac..7f44689e0a53 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
switch (val) {
case NFSD_IO_BUFFERED:
case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
nfsd_io_cache_write = val;
break;
default:
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index bfd41236aff2..ad74439d0105 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
DEFINE_NFSD_IO_EVENT(read_done);
DEFINE_NFSD_IO_EVENT(write_start);
DEFINE_NFSD_IO_EVENT(write_opened);
+DEFINE_NFSD_IO_EVENT(write_direct);
DEFINE_NFSD_IO_EVENT(write_io_done);
DEFINE_NFSD_IO_EVENT(write_done);
DEFINE_NFSD_IO_EVENT(commit_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f3be36b960e5..6cb0efd8ecdd 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,210 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
+struct nfsd_write_dio {
+ ssize_t start_len; /* Length for misaligned first extent */
+ ssize_t middle_len; /* Length for DIO-aligned middle extent */
+ ssize_t end_len; /* Length for misaligned last extent */
+};
+
+static bool
+nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
+ struct nfsd_file *nf, struct nfsd_write_dio *write_dio)
+{
+ const u32 dio_blocksize = nf->nf_dio_offset_align;
+ loff_t start_end, orig_end, middle_end;
+
+ if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
+ return false;
+ if (unlikely(dio_blocksize > PAGE_SIZE))
+ return false;
+ if (unlikely(len < dio_blocksize))
+ return false;
+
+ start_end = round_up(offset, dio_blocksize);
+ orig_end = offset + len;
+ middle_end = round_down(orig_end, dio_blocksize);
+
+ write_dio->start_len = start_end - offset;
+ write_dio->middle_len = middle_end - start_end;
+ write_dio->end_len = orig_end - middle_end;
+
+ return true;
+}
+
+static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
+ unsigned int addr_mask, unsigned int len_mask)
+{
+ const struct bio_vec *bvec = i->bvec;
+ size_t skip = i->iov_offset;
+ size_t size = i->count;
+
+ if (size & len_mask)
+ return false;
+ do {
+ size_t len = bvec->bv_len;
+
+ if (len > size)
+ len = size;
+ if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
+ return false;
+ bvec++;
+ size -= len;
+ skip = 0;
+ } while (size);
+
+ return true;
+}
+
+/*
+ * Setup as many as 3 iov_iter based on extents described by @write_dio.
+ * Returns the number of iov_iter that were setup.
+ */
+static int
+nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
+ struct bio_vec *rq_bvec, unsigned int nvecs,
+ unsigned long cnt, struct nfsd_write_dio *write_dio,
+ struct nfsd_file *nf)
+{
+ int n_iters = 0;
+ struct iov_iter *iters = *iterp;
+
+ /* Setup misaligned start? */
+ if (write_dio->start_len) {
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ iters[n_iters].count = write_dio->start_len;
+ iter_is_dio_aligned[n_iters] = false;
+ ++n_iters;
+ }
+
+ /* Setup DIO-aligned middle */
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ if (write_dio->start_len)
+ iov_iter_advance(&iters[n_iters], write_dio->start_len);
+ iters[n_iters].count -= write_dio->end_len;
+ iter_is_dio_aligned[n_iters] =
+ nfsd_iov_iter_aligned_bvec(&iters[n_iters],
+ nf->nf_dio_mem_align-1, nf->nf_dio_offset_align-1);
+ if (unlikely(!iter_is_dio_aligned[n_iters]))
+ return 0; /* no DIO-aligned IO possible */
+ ++n_iters;
+
+ /* Setup misaligned end? */
+ if (write_dio->end_len) {
+ iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
+ iov_iter_advance(&iters[n_iters],
+ write_dio->start_len + write_dio->middle_len);
+ iter_is_dio_aligned[n_iters] = false;
+ ++n_iters;
+ }
+
+ return n_iters;
+}
+
+static int
+nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
+ unsigned int nvecs, unsigned long *cnt,
+ struct kiocb *kiocb)
+{
+ struct iov_iter iter;
+ int host_err;
+
+ iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+ host_err = vfs_iocb_iter_write(file, kiocb, &iter);
+ if (host_err < 0)
+ return host_err;
+ *cnt = host_err;
+
+ return 0;
+}
+
+static int
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
+ u32 *stable_how, unsigned int nvecs, unsigned long *cnt,
+ struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
+{
+ struct file *file = nf->nf_file;
+ bool iter_is_dio_aligned[3];
+ struct iov_iter iter_stack[3];
+ struct iov_iter *iter = iter_stack;
+ unsigned int n_iters = 0;
+ unsigned long in_count = *cnt;
+ loff_t in_offset = kiocb->ki_pos;
+ ssize_t host_err;
+
+ n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
+ rqstp->rq_bvec, nvecs, *cnt, write_dio, nf);
+ if (unlikely(!n_iters))
+ return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
+
+ trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
+
+ /*
+ * Any buffered IO issued here will be misaligned, use
+ * sync IO to ensure it has completed before returning.
+ * Also update @stable_how to avoid need for COMMIT.
+ */
+ kiocb->ki_flags |= (IOCB_DSYNC|IOCB_SYNC);
+ *stable_how = NFS_FILE_SYNC;
+
+ *cnt = 0;
+ for (int i = 0; i < n_iters; i++) {
+ if (iter_is_dio_aligned[i])
+ kiocb->ki_flags |= IOCB_DIRECT;
+ else
+ kiocb->ki_flags &= ~IOCB_DIRECT;
+
+ host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
+ if (host_err < 0) {
+ /*
+ * VFS will return -ENOTBLK if DIO WRITE fails to
+ * invalidate the page cache. Retry using buffered IO.
+ */
+ if (unlikely(host_err == -ENOTBLK)) {
+ kiocb->ki_flags &= ~IOCB_DIRECT;
+ *cnt = in_count;
+ kiocb->ki_pos = in_offset;
+ return nfsd_buffered_write(rqstp, file,
+ nvecs, cnt, kiocb);
+ } else if (unlikely(host_err == -EINVAL)) {
+ struct inode *inode = d_inode(fhp->fh_dentry);
+
+ pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
+ inode->i_sb->s_id, inode->i_ino);
+ host_err = -ESERVERFAULT;
+ }
+ return host_err;
+ }
+ *cnt += host_err;
+ if (host_err < iter[i].count) /* partial write? */
+ break;
+ }
+
+ return 0;
+}
+
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
+ unsigned long *cnt, struct kiocb *kiocb)
+{
+ struct nfsd_write_dio write_dio;
+
+ /*
+ * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
+ * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
+ * be ignored for any DIO issued here).
+ */
+ if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+ kiocb->ki_flags |= IOCB_DONTCACHE;
+
+ if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
+ return nfsd_issue_write_dio(rqstp, fhp, nf, stable_how, nvecs,
+ cnt, kiocb, &write_dio);
+
+ return nfsd_buffered_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
+}
+
/**
* nfsd_vfs_write - write data to an already-open file
* @rqstp: RPC execution context
@@ -1282,7 +1486,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
u32 stable = *stable_how;
struct kiocb kiocb;
struct svc_export *exp;
- struct iov_iter iter;
errseq_t since;
__be32 nfserr;
int host_err;
@@ -1329,25 +1532,30 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
- iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
switch (nfsd_io_cache_write) {
- case NFSD_IO_BUFFERED:
+ case NFSD_IO_DIRECT:
+ host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
+ nvecs, cnt, &kiocb);
+ stable = *stable_how;
break;
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags |= IOCB_DONTCACHE;
+ fallthrough; /* must call nfsd_buffered_write */
+ case NFSD_IO_BUFFERED:
+ host_err = nfsd_buffered_write(rqstp, file,
+ nvecs, cnt, &kiocb);
break;
}
- host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
if (host_err < 0) {
commit_reset_write_verifier(nn, rqstp, host_err);
goto out_nfserr;
}
- *cnt = host_err;
nfsd_stats_io_write_add(nn, exp, *cnt);
fsnotify_modify(file);
host_err = filemap_check_wb_err(file->f_mapping, since);
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v8 04/12] NFSD: Remove specific error handling
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (2 preceding siblings ...)
2025-10-27 15:46 ` [PATCH v8 03/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-10-27 15:46 ` Chuck Lever
2025-10-27 15:46 ` [PATCH v8 05/12] NFSD: Remove alignment size checking Chuck Lever
` (7 subsequent siblings)
11 siblings, 0 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
1. Christoph notes that ENOTBLK is not supposed to leak out of file
systems, so it's unlikely or impossible to see that error code
here.
2. There are several ways to get EINVAL on a write. The least likely
of those is a dio alignment problem. The warning here would be
misleading in those more common cases.
It's unlikely that an administrator can do anything about either
of these cases, should they appear on a production system.
The trace_nfsd_write_done event will record these errnos when they
occur.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 20 +-------------------
1 file changed, 1 insertion(+), 19 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 6cb0efd8ecdd..30094d8f489e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1408,26 +1408,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_fil
kiocb->ki_flags &= ~IOCB_DIRECT;
host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
- if (host_err < 0) {
- /*
- * VFS will return -ENOTBLK if DIO WRITE fails to
- * invalidate the page cache. Retry using buffered IO.
- */
- if (unlikely(host_err == -ENOTBLK)) {
- kiocb->ki_flags &= ~IOCB_DIRECT;
- *cnt = in_count;
- kiocb->ki_pos = in_offset;
- return nfsd_buffered_write(rqstp, file,
- nvecs, cnt, kiocb);
- } else if (unlikely(host_err == -EINVAL)) {
- struct inode *inode = d_inode(fhp->fh_dentry);
-
- pr_info_ratelimited("nfsd: Direct I/O alignment failure on %s/%ld\n",
- inode->i_sb->s_id, inode->i_ino);
- host_err = -ESERVERFAULT;
- }
+ if (host_err < 0)
return host_err;
- }
*cnt += host_err;
if (host_err < iter[i].count) /* partial write? */
break;
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v8 05/12] NFSD: Remove alignment size checking
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (3 preceding siblings ...)
2025-10-27 15:46 ` [PATCH v8 04/12] NFSD: Remove specific error handling Chuck Lever
@ 2025-10-27 15:46 ` Chuck Lever
2025-10-27 15:46 ` [PATCH v8 06/12] NFSD: Clean up struct nfsd_write_dio Chuck Lever
` (6 subsequent siblings)
11 siblings, 0 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Noted that the NFS client's LOCALIO code still retains this check.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 30094d8f489e..80bc105eb0b6 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1269,8 +1269,6 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
return false;
- if (unlikely(dio_blocksize > PAGE_SIZE))
- return false;
if (unlikely(len < dio_blocksize))
return false;
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v8 06/12] NFSD: Clean up struct nfsd_write_dio
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (4 preceding siblings ...)
2025-10-27 15:46 ` [PATCH v8 05/12] NFSD: Remove alignment size checking Chuck Lever
@ 2025-10-27 15:46 ` Chuck Lever
2025-10-27 15:46 ` [PATCH v8 07/12] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
` (5 subsequent siblings)
11 siblings, 0 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever, Mike Snitzer
From: Chuck Lever <chuck.lever@oracle.com>
Prepare for moving more common arguments into the shared per-request
structure.
First step is to move the target nfsd_file into that structure, as
it needs to be available in several functions.
As a clean-up, adopt the conventional naming of a structure that
carries the same arguments for a number of functions.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 67 +++++++++++++++++++++++++++------------------------
1 file changed, 36 insertions(+), 31 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 80bc105eb0b6..326c60eada65 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,7 +1254,9 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
-struct nfsd_write_dio {
+struct nfsd_write_dio_args {
+ struct nfsd_file *nf;
+
ssize_t start_len; /* Length for misaligned first extent */
ssize_t middle_len; /* Length for DIO-aligned middle extent */
ssize_t end_len; /* Length for misaligned last extent */
@@ -1262,12 +1264,12 @@ struct nfsd_write_dio {
static bool
nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
- struct nfsd_file *nf, struct nfsd_write_dio *write_dio)
+ struct nfsd_write_dio_args *args)
{
- const u32 dio_blocksize = nf->nf_dio_offset_align;
+ u32 dio_blocksize = args->nf->nf_dio_offset_align;
loff_t start_end, orig_end, middle_end;
- if (unlikely(!nf->nf_dio_mem_align || !dio_blocksize))
+ if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
return false;
if (unlikely(len < dio_blocksize))
return false;
@@ -1276,16 +1278,18 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
orig_end = offset + len;
middle_end = round_down(orig_end, dio_blocksize);
- write_dio->start_len = start_end - offset;
- write_dio->middle_len = middle_end - start_end;
- write_dio->end_len = orig_end - middle_end;
+ args->start_len = start_end - offset;
+ args->middle_len = middle_end - start_end;
+ args->end_len = orig_end - middle_end;
return true;
}
-static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
- unsigned int addr_mask, unsigned int len_mask)
+static bool
+nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
{
+ unsigned int len_mask = nf->nf_dio_offset_align - 1;
+ unsigned int addr_mask = nf->nf_dio_mem_align - 1;
const struct bio_vec *bvec = i->bvec;
size_t skip = i->iov_offset;
size_t size = i->count;
@@ -1314,37 +1318,35 @@ static bool nfsd_iov_iter_aligned_bvec(const struct iov_iter *i,
static int
nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
struct bio_vec *rq_bvec, unsigned int nvecs,
- unsigned long cnt, struct nfsd_write_dio *write_dio,
- struct nfsd_file *nf)
+ unsigned long cnt, struct nfsd_write_dio_args *args)
{
int n_iters = 0;
struct iov_iter *iters = *iterp;
/* Setup misaligned start? */
- if (write_dio->start_len) {
+ if (args->start_len) {
iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
- iters[n_iters].count = write_dio->start_len;
+ iters[n_iters].count = args->start_len;
iter_is_dio_aligned[n_iters] = false;
++n_iters;
}
/* Setup DIO-aligned middle */
iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
- if (write_dio->start_len)
- iov_iter_advance(&iters[n_iters], write_dio->start_len);
- iters[n_iters].count -= write_dio->end_len;
+ if (args->start_len)
+ iov_iter_advance(&iters[n_iters], args->start_len);
+ iters[n_iters].count -= args->end_len;
iter_is_dio_aligned[n_iters] =
- nfsd_iov_iter_aligned_bvec(&iters[n_iters],
- nf->nf_dio_mem_align-1, nf->nf_dio_offset_align-1);
+ nfsd_iov_iter_aligned_bvec(args->nf, &iters[n_iters]);
if (unlikely(!iter_is_dio_aligned[n_iters]))
return 0; /* no DIO-aligned IO possible */
++n_iters;
/* Setup misaligned end? */
- if (write_dio->end_len) {
+ if (args->end_len) {
iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
iov_iter_advance(&iters[n_iters],
- write_dio->start_len + write_dio->middle_len);
+ args->start_len + args->middle_len);
iter_is_dio_aligned[n_iters] = false;
++n_iters;
}
@@ -1370,11 +1372,11 @@ nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
}
static int
-nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
- u32 *stable_how, unsigned int nvecs, unsigned long *cnt,
- struct kiocb *kiocb, struct nfsd_write_dio *write_dio)
+nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
+ struct kiocb *kiocb, unsigned int nvecs, unsigned long *cnt,
+ struct nfsd_write_dio_args *args)
{
- struct file *file = nf->nf_file;
+ struct file *file = args->nf->nf_file;
bool iter_is_dio_aligned[3];
struct iov_iter iter_stack[3];
struct iov_iter *iter = iter_stack;
@@ -1384,7 +1386,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_fil
ssize_t host_err;
n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
- rqstp->rq_bvec, nvecs, *cnt, write_dio, nf);
+ rqstp->rq_bvec, nvecs, *cnt,
+ args);
if (unlikely(!n_iters))
return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
@@ -1421,21 +1424,23 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
unsigned long *cnt, struct kiocb *kiocb)
{
- struct nfsd_write_dio write_dio;
+ struct nfsd_write_dio_args args;
+
+ args.nf = nf;
/*
* Check if IOCB_DONTCACHE can be used when issuing buffered IO;
* if so, set it to preserve intent of NFSD_IO_DIRECT (it will
* be ignored for any DIO issued here).
*/
- if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+ if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
kiocb->ki_flags |= IOCB_DONTCACHE;
- if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, nf, &write_dio))
- return nfsd_issue_write_dio(rqstp, fhp, nf, stable_how, nvecs,
- cnt, kiocb, &write_dio);
+ if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, &args))
+ return nfsd_issue_write_dio(rqstp, fhp, stable_how, kiocb,
+ nvecs, cnt, &args);
- return nfsd_buffered_write(rqstp, nf->nf_file, nvecs, cnt, kiocb);
+ return nfsd_buffered_write(rqstp, args.nf->nf_file, nvecs, cnt, kiocb);
}
/**
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v8 07/12] NFSD: Introduce struct nfsd_write_dio_seg
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (5 preceding siblings ...)
2025-10-27 15:46 ` [PATCH v8 06/12] NFSD: Clean up struct nfsd_write_dio Chuck Lever
@ 2025-10-27 15:46 ` Chuck Lever
2025-10-27 15:46 ` [PATCH v8 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec() Chuck Lever
` (4 subsequent siblings)
11 siblings, 0 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever, Mike Snitzer
From: Chuck Lever <chuck.lever@oracle.com>
Passing iter arrays by reference is a little risky. Instead, pass a
struct with a fixed-size array so bounds checking can be done.
Name each item in the array a "segment", as the term "extent"
generally refers to a set of blocks on storage, not to a buffer.
Each segment is processed via a single vfs_iocb_iter_write() call,
and is either IOCB_DIRECT or buffered.
Introduce a segment constructor function so each segment is
initialized identically.
Consensus is that allowing the code to build segment arrays that
are smaller than 3 is better than the I/O loop unconditionally
visiting all 3 segments, skipping the zero-length ones.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 120 ++++++++++++++++++++++++--------------------------
1 file changed, 58 insertions(+), 62 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 326c60eada65..5d6efcceb8c9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,12 +1254,16 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
+struct nfsd_write_dio_seg {
+ struct iov_iter iter;
+ bool use_dio;
+};
+
struct nfsd_write_dio_args {
struct nfsd_file *nf;
-
- ssize_t start_len; /* Length for misaligned first extent */
- ssize_t middle_len; /* Length for DIO-aligned middle extent */
- ssize_t end_len; /* Length for misaligned last extent */
+ size_t first, middle, last;
+ unsigned int nsegs;
+ struct nfsd_write_dio_seg segment[3];
};
static bool
@@ -1267,21 +1271,20 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
struct nfsd_write_dio_args *args)
{
u32 dio_blocksize = args->nf->nf_dio_offset_align;
- loff_t start_end, orig_end, middle_end;
+ loff_t first_end, orig_end, middle_end;
if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
return false;
if (unlikely(len < dio_blocksize))
return false;
- start_end = round_up(offset, dio_blocksize);
+ first_end = round_up(offset, dio_blocksize);
orig_end = offset + len;
middle_end = round_down(orig_end, dio_blocksize);
- args->start_len = start_end - offset;
- args->middle_len = middle_end - start_end;
- args->end_len = orig_end - middle_end;
-
+ args->first = first_end - offset;
+ args->middle = middle_end - first_end;
+ args->last = orig_end - middle_end;
return true;
}
@@ -1311,47 +1314,47 @@ nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
return true;
}
-/*
- * Setup as many as 3 iov_iter based on extents described by @write_dio.
- * Returns the number of iov_iter that were setup.
- */
-static int
-nfsd_setup_write_dio_iters(struct iov_iter **iterp, bool *iter_is_dio_aligned,
- struct bio_vec *rq_bvec, unsigned int nvecs,
- unsigned long cnt, struct nfsd_write_dio_args *args)
+static void
+nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
+ struct bio_vec *bvec, unsigned int nvecs,
+ unsigned long total, size_t start, size_t len)
{
- int n_iters = 0;
- struct iov_iter *iters = *iterp;
+ iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
+ if (start)
+ iov_iter_advance(&segment->iter, start);
+ iov_iter_truncate(&segment->iter, len);
+ segment->use_dio = false;
+}
- /* Setup misaligned start? */
- if (args->start_len) {
- iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
- iters[n_iters].count = args->start_len;
- iter_is_dio_aligned[n_iters] = false;
- ++n_iters;
+static bool
+nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
+ unsigned long total,
+ struct nfsd_write_dio_args *args)
+{
+ args->nsegs = 0;
+
+ if (args->first) {
+ nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
+ nvecs, total, 0, args->first);
+ ++args->nsegs;
}
- /* Setup DIO-aligned middle */
- iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
- if (args->start_len)
- iov_iter_advance(&iters[n_iters], args->start_len);
- iters[n_iters].count -= args->end_len;
- iter_is_dio_aligned[n_iters] =
- nfsd_iov_iter_aligned_bvec(args->nf, &iters[n_iters]);
- if (unlikely(!iter_is_dio_aligned[n_iters]))
- return 0; /* no DIO-aligned IO possible */
- ++n_iters;
+ nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
+ total, args->first, args->middle);
+ if (!nfsd_iov_iter_aligned_bvec(args->nf,
+ &args->segment[args->nsegs].iter))
+ return false; /* no DIO-aligned IO possible */
+ args->segment[args->nsegs].use_dio = true;
+ ++args->nsegs;
- /* Setup misaligned end? */
- if (args->end_len) {
- iov_iter_bvec(&iters[n_iters], ITER_SOURCE, rq_bvec, nvecs, cnt);
- iov_iter_advance(&iters[n_iters],
- args->start_len + args->middle_len);
- iter_is_dio_aligned[n_iters] = false;
- ++n_iters;
+ if (args->last) {
+ nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
+ nvecs, total, args->first +
+ args->middle, args->last);
+ ++args->nsegs;
}
- return n_iters;
+ return true;
}
static int
@@ -1377,22 +1380,12 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
struct nfsd_write_dio_args *args)
{
struct file *file = args->nf->nf_file;
- bool iter_is_dio_aligned[3];
- struct iov_iter iter_stack[3];
- struct iov_iter *iter = iter_stack;
- unsigned int n_iters = 0;
- unsigned long in_count = *cnt;
- loff_t in_offset = kiocb->ki_pos;
ssize_t host_err;
+ unsigned int i;
- n_iters = nfsd_setup_write_dio_iters(&iter, iter_is_dio_aligned,
- rqstp->rq_bvec, nvecs, *cnt,
- args);
- if (unlikely(!n_iters))
+ if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, *cnt, args))
return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
- trace_nfsd_write_direct(rqstp, fhp, in_offset, in_count);
-
/*
* Any buffered IO issued here will be misaligned, use
* sync IO to ensure it has completed before returning.
@@ -1402,18 +1395,21 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
*stable_how = NFS_FILE_SYNC;
*cnt = 0;
- for (int i = 0; i < n_iters; i++) {
- if (iter_is_dio_aligned[i])
+ for (i = 0; i < args->nsegs; i++) {
+ if (args->segment[i].use_dio) {
kiocb->ki_flags |= IOCB_DIRECT;
- else
+ trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
+ args->segment[i].iter.count);
+ } else
kiocb->ki_flags &= ~IOCB_DIRECT;
- host_err = vfs_iocb_iter_write(file, kiocb, &iter[i]);
+ host_err = vfs_iocb_iter_write(file, kiocb,
+ &args->segment[i].iter);
if (host_err < 0)
return host_err;
*cnt += host_err;
- if (host_err < iter[i].count) /* partial write? */
- break;
+ if (host_err < args->segment[i].iter.count)
+ break; /* partial write */
}
return 0;
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v8 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec()
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (6 preceding siblings ...)
2025-10-27 15:46 ` [PATCH v8 07/12] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
@ 2025-10-27 15:46 ` Chuck Lever
2025-10-30 15:00 ` Jeff Layton
2025-10-31 13:16 ` Christoph Hellwig
2025-10-27 15:46 ` [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O Chuck Lever
` (3 subsequent siblings)
11 siblings, 2 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Pages in the RPC receive buffer are contiguous. Practically
speaking, this means that, after the first bvec, the bv_offset
field is always zero.
The loop in nfsd_iov_iter_aligned_bvec() also sets "skip" to zero
after the first bvec is checked. Thus, for bvecs following the first
one, bv_offset = 0 and skip = 0, and the check becomes:
(0 + 0) & addr_mask = 0
This always passes regardless of the alignment mask.
Since all bvecs after the first one start at bv_offset zero (page-
aligned), they are inherently aligned for DIO. Therefore, only the
first bvec needs to be checked for DIO alignment.
Note that for RDMA transports, the incoming payload always starts on
page alignment, since svcrdma sets up the RDMA Read sink buffers
that way. For RDMA, this loop visits every bvec in each payload, and
never finds an unaligned bvec.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 26 ++++++++------------------
1 file changed, 8 insertions(+), 18 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5d6efcceb8c9..37353fb48d58 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1288,30 +1288,20 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
return true;
}
+/*
+ * Check if the bvec iterator is aligned for direct I/O.
+ *
+ * bvecs generated from RPC receive buffers are contiguous: After the first
+ * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
+ * Therefore, only the first bvec is checked.
+ */
static bool
nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
{
- unsigned int len_mask = nf->nf_dio_offset_align - 1;
unsigned int addr_mask = nf->nf_dio_mem_align - 1;
const struct bio_vec *bvec = i->bvec;
- size_t skip = i->iov_offset;
- size_t size = i->count;
- if (size & len_mask)
- return false;
- do {
- size_t len = bvec->bv_len;
-
- if (len > size)
- len = size;
- if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
- return false;
- bvec++;
- size -= len;
- skip = 0;
- } while (size);
-
- return true;
+ return !((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask);
}
static void
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (7 preceding siblings ...)
2025-10-27 15:46 ` [PATCH v8 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec() Chuck Lever
@ 2025-10-27 15:46 ` Chuck Lever
2025-10-30 19:52 ` Jeff Layton
2025-10-31 13:19 ` Christoph Hellwig
2025-10-27 15:46 ` [PATCH v8 10/12] NFSD: Combine direct I/O feasibility check with iterator setup Chuck Lever
` (2 subsequent siblings)
11 siblings, 2 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Currently, nfsd_is_write_dio_possible() only considers file offset
alignment (nf_dio_offset_align) when splitting an NFS WRITE request
into segments. This leaves accounting for memory buffer alignment
(nf_dio_mem_align) until nfsd_setup_write_dio_iters(). If this
second check fails, the code falls back to cached I/O entirely,
wasting the opportunity to use direct I/O for the bulk of the
request.
Enhance the logic to find a beginning segment size that satisfies
both alignment constraints simultaneously. The search algorithm
starts with the file offset alignment requirement and steps through
multiples of offset_align, checking memory alignment at each step.
The search is bounded by lcm(offset_align, mem_align) to ensure that
it always terminates.
Signed-off-by: Chuck Lever <cel@kernel.org>
---
fs/nfsd/filecache.c | 5 ++
fs/nfsd/filecache.h | 1 +
fs/nfsd/vfs.c | 119 +++++++++++++++++++++++++++++---------------
3 files changed, 86 insertions(+), 39 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index a238b6725008..89adc4ab5b24 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -40,6 +40,7 @@
#include <linux/seq_file.h>
#include <linux/rhashtable.h>
#include <linux/nfslocalio.h>
+#include <linux/lcm.h>
#include "vfs.h"
#include "nfsd.h"
@@ -234,6 +235,7 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
nf->nf_dio_mem_align = 0;
nf->nf_dio_offset_align = 0;
nf->nf_dio_read_offset_align = 0;
+ nf->nf_dio_align_lcm = 0;
return nf;
}
@@ -1071,6 +1073,9 @@ nfsd_file_get_dio_attrs(const struct svc_fh *fhp, struct nfsd_file *nf)
if (stat.result_mask & STATX_DIOALIGN) {
nf->nf_dio_mem_align = stat.dio_mem_align;
nf->nf_dio_offset_align = stat.dio_offset_align;
+ if (stat.dio_mem_align && stat.dio_offset_align)
+ nf->nf_dio_align_lcm = lcm(stat.dio_mem_align,
+ stat.dio_offset_align);
}
if (stat.result_mask & STATX_DIO_READ_ALIGN)
nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index e3d6ca2b6030..2648aaab5a1b 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -58,6 +58,7 @@ struct nfsd_file {
u32 nf_dio_mem_align;
u32 nf_dio_offset_align;
u32 nf_dio_read_offset_align;
+ unsigned long nf_dio_align_lcm;
};
int nfsd_file_cache_init(void);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 37353fb48d58..a872be300c9f 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1261,49 +1261,73 @@ struct nfsd_write_dio_seg {
struct nfsd_write_dio_args {
struct nfsd_file *nf;
- size_t first, middle, last;
unsigned int nsegs;
struct nfsd_write_dio_seg segment[3];
};
+/*
+ * Find the minimum offset within the write request that aligns both
+ * the file offset and memory buffer for direct I/O.
+ *
+ * Returns the size of the unaligned prefix, or SIZE_MAX if no alignment
+ * is possible within reasonable bounds.
+ */
+static size_t
+nfsd_find_dio_aligned_offset(struct nfsd_file *nf, loff_t file_offset,
+ unsigned long mem_offset, size_t total_len)
+{
+ u32 offset_align = nf->nf_dio_offset_align;
+ u32 mem_align = nf->nf_dio_mem_align;
+ unsigned long search_limit;
+ size_t first;
+
+ /* Start with the file offset alignment requirement */
+ first = round_up(file_offset, offset_align) - file_offset;
+
+ /* Quick check: does this also satisfy memory alignment? */
+ if (((mem_offset + first) & (mem_align - 1)) == 0)
+ return first;
+
+ /*
+ * Search for a value that satisfies both constraints by stepping
+ * through multiples of offset_align. Limit search to one period
+ * of the LCM. We need to check up through the search_limit to
+ * cover all possible alignments within the LCM period.
+ */
+ search_limit = min_t(unsigned long, nf->nf_dio_align_lcm, total_len);
+
+ for (; first <= search_limit && first < total_len; first += offset_align) {
+ if (((mem_offset + first) & (mem_align - 1)) == 0)
+ return first;
+ }
+
+ return SIZE_MAX; /* No alignment found */
+}
+
+/*
+ * Check if the underlying file system implements direct I/O.
+ */
static bool
nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
struct nfsd_write_dio_args *args)
{
- u32 dio_blocksize = args->nf->nf_dio_offset_align;
- loff_t first_end, orig_end, middle_end;
+ u32 offset_align = args->nf->nf_dio_offset_align;
+ u32 mem_align = args->nf->nf_dio_mem_align;
- if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
- return false;
- if (unlikely(len < dio_blocksize))
+ if (unlikely(!mem_align || !offset_align))
return false;
- first_end = round_up(offset, dio_blocksize);
- orig_end = offset + len;
- middle_end = round_down(orig_end, dio_blocksize);
+ /*
+ * Need enough data to potentially find an aligned segment.
+ * In the worst case, we might need up to
+ * lcm(offset_align, mem_align) bytes for the prefix.
+ */
+ if (unlikely(len < max(offset_align, mem_align)))
+ return false;
- args->first = first_end - offset;
- args->middle = middle_end - first_end;
- args->last = orig_end - middle_end;
return true;
}
-/*
- * Check if the bvec iterator is aligned for direct I/O.
- *
- * bvecs generated from RPC receive buffers are contiguous: After the first
- * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
- * Therefore, only the first bvec is checked.
- */
-static bool
-nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
-{
- unsigned int addr_mask = nf->nf_dio_mem_align - 1;
- const struct bio_vec *bvec = i->bvec;
-
- return !((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask);
-}
-
static void
nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
struct bio_vec *bvec, unsigned int nvecs,
@@ -1318,29 +1342,45 @@ nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
static bool
nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
- unsigned long total,
+ loff_t offset, unsigned long total,
struct nfsd_write_dio_args *args)
{
+ u32 offset_align = args->nf->nf_dio_offset_align;
+ unsigned long mem_offset = bvec->bv_offset;
+ loff_t prefix_end, orig_end, middle_end;
+ size_t prefix, middle, suffix;
+
args->nsegs = 0;
- if (args->first) {
+ prefix = nfsd_find_dio_aligned_offset(args->nf, offset, mem_offset,
+ total);
+ if (prefix == SIZE_MAX)
+ return false; /* No alignment possible */
+
+ prefix_end = offset + prefix;
+ orig_end = offset + total;
+ middle_end = round_down(orig_end, offset_align);
+
+ middle = middle_end - prefix_end;
+ suffix = orig_end - middle_end;
+
+ if (prefix) {
nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
- nvecs, total, 0, args->first);
+ nvecs, total, 0, prefix);
++args->nsegs;
}
+ if (!middle)
+ return false; /* No aligned region for DIO */
+
nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
- total, args->first, args->middle);
- if (!nfsd_iov_iter_aligned_bvec(args->nf,
- &args->segment[args->nsegs].iter))
- return false; /* no DIO-aligned IO possible */
+ total, prefix, middle);
args->segment[args->nsegs].use_dio = true;
++args->nsegs;
- if (args->last) {
+ if (suffix) {
nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
- nvecs, total, args->first +
- args->middle, args->last);
+ nvecs, total, prefix + middle, suffix);
++args->nsegs;
}
@@ -1373,7 +1413,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
ssize_t host_err;
unsigned int i;
- if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, *cnt, args))
+ if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
+ *cnt, args))
return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
/*
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v8 10/12] NFSD: Combine direct I/O feasibility check with iterator setup
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (8 preceding siblings ...)
2025-10-27 15:46 ` [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O Chuck Lever
@ 2025-10-27 15:46 ` Chuck Lever
2025-10-30 19:59 ` Jeff Layton
2025-10-31 13:20 ` Christoph Hellwig
2025-10-27 15:46 ` [PATCH v8 11/12] NFSD: Handle kiocb->ki_flags correctly Chuck Lever
2025-10-27 15:46 ` [PATCH v8 12/12] NFSD: Refactor nfsd_vfs_write Chuck Lever
11 siblings, 2 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
When direct I/O is not feasible (due to missing alignment info,
too-small writes, or no alignment possible), pack the entire
write payload into a single non-DIO segment and follow the usual
direct write I/O path.
This simplifies nfsd_direct_write() by eliminating the fallback path
and the separate nfsd_buffered_write() call - all writes now go
through nfsd_issue_write_dio() which handles both DIO and buffered
segments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 69 +++++++++++++++++++++------------------------------
1 file changed, 28 insertions(+), 41 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index a872be300c9f..be0688f2ab3d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1304,30 +1304,6 @@ nfsd_find_dio_aligned_offset(struct nfsd_file *nf, loff_t file_offset,
return SIZE_MAX; /* No alignment found */
}
-/*
- * Check if the underlying file system implements direct I/O.
- */
-static bool
-nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
- struct nfsd_write_dio_args *args)
-{
- u32 offset_align = args->nf->nf_dio_offset_align;
- u32 mem_align = args->nf->nf_dio_mem_align;
-
- if (unlikely(!mem_align || !offset_align))
- return false;
-
- /*
- * Need enough data to potentially find an aligned segment.
- * In the worst case, we might need up to
- * lcm(offset_align, mem_align) bytes for the prefix.
- */
- if (unlikely(len < max(offset_align, mem_align)))
- return false;
-
- return true;
-}
-
static void
nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
struct bio_vec *bvec, unsigned int nvecs,
@@ -1340,22 +1316,31 @@ nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
segment->use_dio = false;
}
-static bool
-nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
- loff_t offset, unsigned long total,
- struct nfsd_write_dio_args *args)
+static void
+nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
+ loff_t offset, unsigned long total,
+ struct nfsd_write_dio_args *args)
{
u32 offset_align = args->nf->nf_dio_offset_align;
+ u32 mem_align = args->nf->nf_dio_mem_align;
unsigned long mem_offset = bvec->bv_offset;
loff_t prefix_end, orig_end, middle_end;
size_t prefix, middle, suffix;
args->nsegs = 0;
+ /*
+ * Check if direct I/O is feasible for this write request.
+ * If alignments are not available, the write is too small,
+ * or no alignment can be found, fall back to buffered I/O.
+ */
+ if (unlikely(!mem_align || !offset_align) ||
+ unlikely(total < max(offset_align, mem_align)))
+ goto no_dio;
prefix = nfsd_find_dio_aligned_offset(args->nf, offset, mem_offset,
total);
if (prefix == SIZE_MAX)
- return false; /* No alignment possible */
+ goto no_dio;
prefix_end = offset + prefix;
orig_end = offset + total;
@@ -1371,7 +1356,7 @@ nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
}
if (!middle)
- return false; /* No aligned region for DIO */
+ goto no_dio;
nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
total, prefix, middle);
@@ -1384,7 +1369,13 @@ nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
++args->nsegs;
}
- return true;
+ return;
+
+no_dio:
+ /* No alignment possible - pack into single non-DIO segment */
+ nfsd_write_dio_seg_init(&args->segment[0], bvec, nvecs, total,
+ 0, total);
+ args->nsegs = 1;
}
static int
@@ -1405,7 +1396,7 @@ nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
}
static int
-nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
+nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
struct kiocb *kiocb, unsigned int nvecs, unsigned long *cnt,
struct nfsd_write_dio_args *args)
{
@@ -1413,10 +1404,6 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
ssize_t host_err;
unsigned int i;
- if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
- *cnt, args))
- return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
-
/*
* Any buffered IO issued here will be misaligned, use
* sync IO to ensure it has completed before returning.
@@ -1425,6 +1412,9 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
kiocb->ki_flags |= (IOCB_DSYNC|IOCB_SYNC);
*stable_how = NFS_FILE_SYNC;
+ nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
+ *cnt, args);
+
*cnt = 0;
for (i = 0; i < args->nsegs; i++) {
if (args->segment[i].use_dio) {
@@ -1463,11 +1453,8 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
kiocb->ki_flags |= IOCB_DONTCACHE;
- if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, &args))
- return nfsd_issue_write_dio(rqstp, fhp, stable_how, kiocb,
- nvecs, cnt, &args);
-
- return nfsd_buffered_write(rqstp, args.nf->nf_file, nvecs, cnt, kiocb);
+ return nfsd_issue_dio_write(rqstp, fhp, stable_how, kiocb, nvecs,
+ cnt, &args);
}
/**
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v8 11/12] NFSD: Handle kiocb->ki_flags correctly
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (9 preceding siblings ...)
2025-10-27 15:46 ` [PATCH v8 10/12] NFSD: Combine direct I/O feasibility check with iterator setup Chuck Lever
@ 2025-10-27 15:46 ` Chuck Lever
2025-10-30 20:01 ` Jeff Layton
2025-10-31 13:21 ` Christoph Hellwig
2025-10-27 15:46 ` [PATCH v8 12/12] NFSD: Refactor nfsd_vfs_write Chuck Lever
11 siblings, 2 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Christoph says:
> > + if (file->f_op->fop_flags & FOP_DONTCACHE)
> > + kiocb->ki_flags |= IOCB_DONTCACHE;
> IOCB_DONTCACHE isn't defined for IOCB_DIRECT. So this should
> move into a branch just for buffered I/O.
and
> > Promoting all NFSD_IO_DIRECT writes to FILE_SYNC was my idea,
> > based on the assumption that IOCB_DIRECT writes to local file
> > systems left nothing to be done by a later commit. My assumption
> > is based on the behavior of O_DIRECT on NFS files.
> >
> > If that assumption is not true, then I agree there is no
> > technical reason to promote NFSD_IO_DIRECT writes to FILE_SYNC,
> > and I can remove that built-in assumption for v8 of this series.
>
> It is not true, or rather only true for a tiny subset of use cases
> (which NFS can't even query a head of time).
So, observe the existing setting of ki_flags rather than forcing
persistence unconditionally, and ensure that DONTCACHE is not set
for IOCB_DIRECT writes.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index be0688f2ab3d..3c78b3aeea4b 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1261,6 +1261,8 @@ struct nfsd_write_dio_seg {
struct nfsd_write_dio_args {
struct nfsd_file *nf;
+ int flags_buffered;
+ int flags_direct;
unsigned int nsegs;
struct nfsd_write_dio_seg segment[3];
};
@@ -1396,33 +1398,25 @@ nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
}
static int
-nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
- struct kiocb *kiocb, unsigned int nvecs, unsigned long *cnt,
- struct nfsd_write_dio_args *args)
+nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct kiocb *kiocb, unsigned int nvecs,
+ unsigned long *cnt, struct nfsd_write_dio_args *args)
{
struct file *file = args->nf->nf_file;
ssize_t host_err;
unsigned int i;
- /*
- * Any buffered IO issued here will be misaligned, use
- * sync IO to ensure it has completed before returning.
- * Also update @stable_how to avoid need for COMMIT.
- */
- kiocb->ki_flags |= (IOCB_DSYNC|IOCB_SYNC);
- *stable_how = NFS_FILE_SYNC;
-
nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
*cnt, args);
*cnt = 0;
for (i = 0; i < args->nsegs; i++) {
if (args->segment[i].use_dio) {
- kiocb->ki_flags |= IOCB_DIRECT;
+ kiocb->ki_flags = args->flags_direct;
trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
args->segment[i].iter.count);
} else
- kiocb->ki_flags &= ~IOCB_DIRECT;
+ kiocb->ki_flags = args->flags_buffered;
host_err = vfs_iocb_iter_write(file, kiocb,
&args->segment[i].iter);
@@ -1446,15 +1440,16 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
args.nf = nf;
/*
- * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
- * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
- * be ignored for any DIO issued here).
+ * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
+ * writing unaligned segments or handling fallback I/O.
*/
+ args.flags_buffered = kiocb->ki_flags;
if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
- kiocb->ki_flags |= IOCB_DONTCACHE;
+ args.flags_buffered |= IOCB_DONTCACHE;
- return nfsd_issue_dio_write(rqstp, fhp, stable_how, kiocb, nvecs,
- cnt, &args);
+ args.flags_direct = kiocb->ki_flags | IOCB_DIRECT;
+
+ return nfsd_issue_dio_write(rqstp, fhp, kiocb, nvecs, cnt, &args);
}
/**
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v8 12/12] NFSD: Refactor nfsd_vfs_write
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
` (10 preceding siblings ...)
2025-10-27 15:46 ` [PATCH v8 11/12] NFSD: Handle kiocb->ki_flags correctly Chuck Lever
@ 2025-10-27 15:46 ` Chuck Lever
2025-10-30 20:02 ` Jeff Layton
11 siblings, 1 reply; 27+ messages in thread
From: Chuck Lever @ 2025-10-27 15:46 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
There is now only one caller of nfsd_buffered_write(), so it can
be folded back into nfsd_vfs_write().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 27 +++++++--------------------
1 file changed, 7 insertions(+), 20 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 3c78b3aeea4b..934090e168c6 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1380,23 +1380,6 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
args->nsegs = 1;
}
-static int
-nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
- unsigned int nvecs, unsigned long *cnt,
- struct kiocb *kiocb)
-{
- struct iov_iter iter;
- int host_err;
-
- iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
- host_err = vfs_iocb_iter_write(file, kiocb, &iter);
- if (host_err < 0)
- return host_err;
- *cnt = host_err;
-
- return 0;
-}
-
static int
nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct kiocb *kiocb, unsigned int nvecs,
@@ -1480,6 +1463,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
u32 stable = *stable_how;
struct kiocb kiocb;
struct svc_export *exp;
+ struct iov_iter iter;
errseq_t since;
__be32 nfserr;
int host_err;
@@ -1540,10 +1524,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags |= IOCB_DONTCACHE;
- fallthrough; /* must call nfsd_buffered_write */
+ fallthrough;
case NFSD_IO_BUFFERED:
- host_err = nfsd_buffered_write(rqstp, file,
- nvecs, cnt, &kiocb);
+ iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+ host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
+ if (host_err < 0)
+ break;
+ *cnt = host_err;
break;
}
if (host_err < 0) {
--
2.51.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v8 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec()
2025-10-27 15:46 ` [PATCH v8 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec() Chuck Lever
@ 2025-10-30 15:00 ` Jeff Layton
2025-10-31 13:16 ` Christoph Hellwig
1 sibling, 0 replies; 27+ messages in thread
From: Jeff Layton @ 2025-10-30 15:00 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
On Mon, 2025-10-27 at 11:46 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Pages in the RPC receive buffer are contiguous. Practically
> speaking, this means that, after the first bvec, the bv_offset
> field is always zero.
>
> The loop in nfsd_iov_iter_aligned_bvec() also sets "skip" to zero
> after the first bvec is checked. Thus, for bvecs following the first
> one, bv_offset = 0 and skip = 0, and the check becomes:
>
> (0 + 0) & addr_mask = 0
>
> This always passes regardless of the alignment mask.
>
> Since all bvecs after the first one start at bv_offset zero (page-
> aligned), they are inherently aligned for DIO. Therefore, only the
> first bvec needs to be checked for DIO alignment.
>
> Note that for RDMA transports, the incoming payload always starts on
> page alignment, since svcrdma sets up the RDMA Read sink buffers
> that way. For RDMA, this loop visits every bvec in each payload, and
> never finds an unaligned bvec.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 26 ++++++++------------------
> 1 file changed, 8 insertions(+), 18 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 5d6efcceb8c9..37353fb48d58 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1288,30 +1288,20 @@ nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> return true;
> }
>
> +/*
> + * Check if the bvec iterator is aligned for direct I/O.
> + *
> + * bvecs generated from RPC receive buffers are contiguous: After the first
> + * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
> + * Therefore, only the first bvec is checked.
> + */
> static bool
> nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
> {
> - unsigned int len_mask = nf->nf_dio_offset_align - 1;
> unsigned int addr_mask = nf->nf_dio_mem_align - 1;
> const struct bio_vec *bvec = i->bvec;
> - size_t skip = i->iov_offset;
> - size_t size = i->count;
>
> - if (size & len_mask)
> - return false;
> - do {
> - size_t len = bvec->bv_len;
> -
> - if (len > size)
> - len = size;
> - if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> - return false;
> - bvec++;
> - size -= len;
> - skip = 0;
> - } while (size);
> -
> - return true;
> + return !((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask);
> }
>
> static void
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O
2025-10-27 15:46 ` [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O Chuck Lever
@ 2025-10-30 19:52 ` Jeff Layton
2025-10-30 19:55 ` Chuck Lever
2025-10-31 9:13 ` Christoph Hellwig
2025-10-31 13:19 ` Christoph Hellwig
1 sibling, 2 replies; 27+ messages in thread
From: Jeff Layton @ 2025-10-30 19:52 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
On Mon, 2025-10-27 at 11:46 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Currently, nfsd_is_write_dio_possible() only considers file offset
> alignment (nf_dio_offset_align) when splitting an NFS WRITE request
> into segments. This leaves accounting for memory buffer alignment
> (nf_dio_mem_align) until nfsd_setup_write_dio_iters(). If this
> second check fails, the code falls back to cached I/O entirely,
> wasting the opportunity to use direct I/O for the bulk of the
> request.
>
> Enhance the logic to find a beginning segment size that satisfies
> both alignment constraints simultaneously. The search algorithm
> starts with the file offset alignment requirement and steps through
> multiples of offset_align, checking memory alignment at each step.
> The search is bounded by lcm(offset_align, mem_align) to ensure that
> it always terminates.
>
> Signed-off-by: Chuck Lever <cel@kernel.org>
> ---
> fs/nfsd/filecache.c | 5 ++
> fs/nfsd/filecache.h | 1 +
> fs/nfsd/vfs.c | 119 +++++++++++++++++++++++++++++---------------
> 3 files changed, 86 insertions(+), 39 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index a238b6725008..89adc4ab5b24 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -40,6 +40,7 @@
> #include <linux/seq_file.h>
> #include <linux/rhashtable.h>
> #include <linux/nfslocalio.h>
> +#include <linux/lcm.h>
>
> #include "vfs.h"
> #include "nfsd.h"
> @@ -234,6 +235,7 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
> nf->nf_dio_mem_align = 0;
> nf->nf_dio_offset_align = 0;
> nf->nf_dio_read_offset_align = 0;
> + nf->nf_dio_align_lcm = 0;
> return nf;
> }
>
> @@ -1071,6 +1073,9 @@ nfsd_file_get_dio_attrs(const struct svc_fh *fhp, struct nfsd_file *nf)
> if (stat.result_mask & STATX_DIOALIGN) {
> nf->nf_dio_mem_align = stat.dio_mem_align;
> nf->nf_dio_offset_align = stat.dio_offset_align;
> + if (stat.dio_mem_align && stat.dio_offset_align)
> + nf->nf_dio_align_lcm = lcm(stat.dio_mem_align,
> + stat.dio_offset_align);
> }
> if (stat.result_mask & STATX_DIO_READ_ALIGN)
> nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index e3d6ca2b6030..2648aaab5a1b 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -58,6 +58,7 @@ struct nfsd_file {
> u32 nf_dio_mem_align;
> u32 nf_dio_offset_align;
> u32 nf_dio_read_offset_align;
> + unsigned long nf_dio_align_lcm;
> };
>
> int nfsd_file_cache_init(void);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 37353fb48d58..a872be300c9f 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1261,49 +1261,73 @@ struct nfsd_write_dio_seg {
>
> struct nfsd_write_dio_args {
> struct nfsd_file *nf;
> - size_t first, middle, last;
> unsigned int nsegs;
> struct nfsd_write_dio_seg segment[3];
> };
>
> +/*
> + * Find the minimum offset within the write request that aligns both
> + * the file offset and memory buffer for direct I/O.
> + *
> + * Returns the size of the unaligned prefix, or SIZE_MAX if no alignment
> + * is possible within reasonable bounds.
> + */
> +static size_t
> +nfsd_find_dio_aligned_offset(struct nfsd_file *nf, loff_t file_offset,
> + unsigned long mem_offset, size_t total_len)
> +{
> + u32 offset_align = nf->nf_dio_offset_align;
> + u32 mem_align = nf->nf_dio_mem_align;
> + unsigned long search_limit;
> + size_t first;
> +
> + /* Start with the file offset alignment requirement */
> + first = round_up(file_offset, offset_align) - file_offset;
> +
> + /* Quick check: does this also satisfy memory alignment? */
> + if (((mem_offset + first) & (mem_align - 1)) == 0)
> + return first;
> +
> + /*
> + * Search for a value that satisfies both constraints by stepping
> + * through multiples of offset_align. Limit search to one period
> + * of the LCM. We need to check up through the search_limit to
> + * cover all possible alignments within the LCM period.
> + */
> + search_limit = min_t(unsigned long, nf->nf_dio_align_lcm, total_len);
> +
> + for (; first <= search_limit && first < total_len; first += offset_align) {
> + if (((mem_offset + first) & (mem_align - 1)) == 0)
> + return first;
> + }
> +
> + return SIZE_MAX; /* No alignment found */
> +}
> +
> +/*
> + * Check if the underlying file system implements direct I/O.
> + */
> static bool
> nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> struct nfsd_write_dio_args *args)
> {
> - u32 dio_blocksize = args->nf->nf_dio_offset_align;
> - loff_t first_end, orig_end, middle_end;
> + u32 offset_align = args->nf->nf_dio_offset_align;
> + u32 mem_align = args->nf->nf_dio_mem_align;
>
> - if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
> - return false;
> - if (unlikely(len < dio_blocksize))
> + if (unlikely(!mem_align || !offset_align))
> return false;
>
> - first_end = round_up(offset, dio_blocksize);
> - orig_end = offset + len;
> - middle_end = round_down(orig_end, dio_blocksize);
> + /*
> + * Need enough data to potentially find an aligned segment.
> + * In the worst case, we might need up to
> + * lcm(offset_align, mem_align) bytes for the prefix.
> + */
> + if (unlikely(len < max(offset_align, mem_align)))
> + return false;
>
> - args->first = first_end - offset;
> - args->middle = middle_end - first_end;
> - args->last = orig_end - middle_end;
> return true;
> }
>
> -/*
> - * Check if the bvec iterator is aligned for direct I/O.
> - *
> - * bvecs generated from RPC receive buffers are contiguous: After the first
> - * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
> - * Therefore, only the first bvec is checked.
> - */
> -static bool
> -nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
> -{
> - unsigned int addr_mask = nf->nf_dio_mem_align - 1;
> - const struct bio_vec *bvec = i->bvec;
> -
> - return !((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask);
> -}
> -
> static void
> nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
> struct bio_vec *bvec, unsigned int nvecs,
> @@ -1318,29 +1342,45 @@ nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
>
> static bool
> nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
> - unsigned long total,
> + loff_t offset, unsigned long total,
> struct nfsd_write_dio_args *args)
> {
> + u32 offset_align = args->nf->nf_dio_offset_align;
> + unsigned long mem_offset = bvec->bv_offset;
> + loff_t prefix_end, orig_end, middle_end;
> + size_t prefix, middle, suffix;
> +
> args->nsegs = 0;
>
> - if (args->first) {
> + prefix = nfsd_find_dio_aligned_offset(args->nf, offset, mem_offset,
> + total);
> + if (prefix == SIZE_MAX)
> + return false; /* No alignment possible */
> +
> + prefix_end = offset + prefix;
> + orig_end = offset + total;
> + middle_end = round_down(orig_end, offset_align);
> +
> + middle = middle_end - prefix_end;
> + suffix = orig_end - middle_end;
> +
> + if (prefix) {
> nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
> - nvecs, total, 0, args->first);
> + nvecs, total, 0, prefix);
> ++args->nsegs;
> }
>
> + if (!middle)
> + return false; /* No aligned region for DIO */
> +
> nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
> - total, args->first, args->middle);
> - if (!nfsd_iov_iter_aligned_bvec(args->nf,
> - &args->segment[args->nsegs].iter))
> - return false; /* no DIO-aligned IO possible */
> + total, prefix, middle);
> args->segment[args->nsegs].use_dio = true;
> ++args->nsegs;
>
> - if (args->last) {
> + if (suffix) {
> nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
> - nvecs, total, args->first +
> - args->middle, args->last);
> + nvecs, total, prefix + middle, suffix);
> ++args->nsegs;
> }
>
> @@ -1373,7 +1413,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
> ssize_t host_err;
> unsigned int i;
>
> - if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, *cnt, args))
> + if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
> + *cnt, args))
> return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
>
> /*
Not an easy patch to follow, but I think I get it.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O
2025-10-30 19:52 ` Jeff Layton
@ 2025-10-30 19:55 ` Chuck Lever
2025-10-31 9:13 ` Christoph Hellwig
1 sibling, 0 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-30 19:55 UTC (permalink / raw)
To: Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
On 10/30/25 3:52 PM, Jeff Layton wrote:
> On Mon, 2025-10-27 at 11:46 -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Currently, nfsd_is_write_dio_possible() only considers file offset
>> alignment (nf_dio_offset_align) when splitting an NFS WRITE request
>> into segments. This leaves accounting for memory buffer alignment
>> (nf_dio_mem_align) until nfsd_setup_write_dio_iters(). If this
>> second check fails, the code falls back to cached I/O entirely,
>> wasting the opportunity to use direct I/O for the bulk of the
>> request.
>>
>> Enhance the logic to find a beginning segment size that satisfies
>> both alignment constraints simultaneously. The search algorithm
>> starts with the file offset alignment requirement and steps through
>> multiples of offset_align, checking memory alignment at each step.
>> The search is bounded by lcm(offset_align, mem_align) to ensure that
>> it always terminates.
>>
>> Signed-off-by: Chuck Lever <cel@kernel.org>
>> ---
>> fs/nfsd/filecache.c | 5 ++
>> fs/nfsd/filecache.h | 1 +
>> fs/nfsd/vfs.c | 119 +++++++++++++++++++++++++++++---------------
>> 3 files changed, 86 insertions(+), 39 deletions(-)
>>
>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>> index a238b6725008..89adc4ab5b24 100644
>> --- a/fs/nfsd/filecache.c
>> +++ b/fs/nfsd/filecache.c
>> @@ -40,6 +40,7 @@
>> #include <linux/seq_file.h>
>> #include <linux/rhashtable.h>
>> #include <linux/nfslocalio.h>
>> +#include <linux/lcm.h>
>>
>> #include "vfs.h"
>> #include "nfsd.h"
>> @@ -234,6 +235,7 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
>> nf->nf_dio_mem_align = 0;
>> nf->nf_dio_offset_align = 0;
>> nf->nf_dio_read_offset_align = 0;
>> + nf->nf_dio_align_lcm = 0;
>> return nf;
>> }
>>
>> @@ -1071,6 +1073,9 @@ nfsd_file_get_dio_attrs(const struct svc_fh *fhp, struct nfsd_file *nf)
>> if (stat.result_mask & STATX_DIOALIGN) {
>> nf->nf_dio_mem_align = stat.dio_mem_align;
>> nf->nf_dio_offset_align = stat.dio_offset_align;
>> + if (stat.dio_mem_align && stat.dio_offset_align)
>> + nf->nf_dio_align_lcm = lcm(stat.dio_mem_align,
>> + stat.dio_offset_align);
>> }
>> if (stat.result_mask & STATX_DIO_READ_ALIGN)
>> nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
>> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
>> index e3d6ca2b6030..2648aaab5a1b 100644
>> --- a/fs/nfsd/filecache.h
>> +++ b/fs/nfsd/filecache.h
>> @@ -58,6 +58,7 @@ struct nfsd_file {
>> u32 nf_dio_mem_align;
>> u32 nf_dio_offset_align;
>> u32 nf_dio_read_offset_align;
>> + unsigned long nf_dio_align_lcm;
>> };
>>
>> int nfsd_file_cache_init(void);
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 37353fb48d58..a872be300c9f 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -1261,49 +1261,73 @@ struct nfsd_write_dio_seg {
>>
>> struct nfsd_write_dio_args {
>> struct nfsd_file *nf;
>> - size_t first, middle, last;
>> unsigned int nsegs;
>> struct nfsd_write_dio_seg segment[3];
>> };
>>
>> +/*
>> + * Find the minimum offset within the write request that aligns both
>> + * the file offset and memory buffer for direct I/O.
>> + *
>> + * Returns the size of the unaligned prefix, or SIZE_MAX if no alignment
>> + * is possible within reasonable bounds.
>> + */
>> +static size_t
>> +nfsd_find_dio_aligned_offset(struct nfsd_file *nf, loff_t file_offset,
>> + unsigned long mem_offset, size_t total_len)
>> +{
>> + u32 offset_align = nf->nf_dio_offset_align;
>> + u32 mem_align = nf->nf_dio_mem_align;
>> + unsigned long search_limit;
>> + size_t first;
>> +
>> + /* Start with the file offset alignment requirement */
>> + first = round_up(file_offset, offset_align) - file_offset;
>> +
>> + /* Quick check: does this also satisfy memory alignment? */
>> + if (((mem_offset + first) & (mem_align - 1)) == 0)
>> + return first;
>> +
>> + /*
>> + * Search for a value that satisfies both constraints by stepping
>> + * through multiples of offset_align. Limit search to one period
>> + * of the LCM. We need to check up through the search_limit to
>> + * cover all possible alignments within the LCM period.
>> + */
>> + search_limit = min_t(unsigned long, nf->nf_dio_align_lcm, total_len);
>> +
>> + for (; first <= search_limit && first < total_len; first += offset_align) {
>> + if (((mem_offset + first) & (mem_align - 1)) == 0)
>> + return first;
>> + }
>> +
>> + return SIZE_MAX; /* No alignment found */
>> +}
>> +
>> +/*
>> + * Check if the underlying file system implements direct I/O.
>> + */
>> static bool
>> nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
>> struct nfsd_write_dio_args *args)
>> {
>> - u32 dio_blocksize = args->nf->nf_dio_offset_align;
>> - loff_t first_end, orig_end, middle_end;
>> + u32 offset_align = args->nf->nf_dio_offset_align;
>> + u32 mem_align = args->nf->nf_dio_mem_align;
>>
>> - if (unlikely(!args->nf->nf_dio_mem_align || !dio_blocksize))
>> - return false;
>> - if (unlikely(len < dio_blocksize))
>> + if (unlikely(!mem_align || !offset_align))
>> return false;
>>
>> - first_end = round_up(offset, dio_blocksize);
>> - orig_end = offset + len;
>> - middle_end = round_down(orig_end, dio_blocksize);
>> + /*
>> + * Need enough data to potentially find an aligned segment.
>> + * In the worst case, we might need up to
>> + * lcm(offset_align, mem_align) bytes for the prefix.
>> + */
>> + if (unlikely(len < max(offset_align, mem_align)))
>> + return false;
>>
>> - args->first = first_end - offset;
>> - args->middle = middle_end - first_end;
>> - args->last = orig_end - middle_end;
>> return true;
>> }
>>
>> -/*
>> - * Check if the bvec iterator is aligned for direct I/O.
>> - *
>> - * bvecs generated from RPC receive buffers are contiguous: After the first
>> - * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
>> - * Therefore, only the first bvec is checked.
>> - */
>> -static bool
>> -nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
>> -{
>> - unsigned int addr_mask = nf->nf_dio_mem_align - 1;
>> - const struct bio_vec *bvec = i->bvec;
>> -
>> - return !((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask);
>> -}
>> -
>> static void
>> nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
>> struct bio_vec *bvec, unsigned int nvecs,
>> @@ -1318,29 +1342,45 @@ nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
>>
>> static bool
>> nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
>> - unsigned long total,
>> + loff_t offset, unsigned long total,
>> struct nfsd_write_dio_args *args)
>> {
>> + u32 offset_align = args->nf->nf_dio_offset_align;
>> + unsigned long mem_offset = bvec->bv_offset;
>> + loff_t prefix_end, orig_end, middle_end;
>> + size_t prefix, middle, suffix;
>> +
>> args->nsegs = 0;
>>
>> - if (args->first) {
>> + prefix = nfsd_find_dio_aligned_offset(args->nf, offset, mem_offset,
>> + total);
>> + if (prefix == SIZE_MAX)
>> + return false; /* No alignment possible */
>> +
>> + prefix_end = offset + prefix;
>> + orig_end = offset + total;
>> + middle_end = round_down(orig_end, offset_align);
>> +
>> + middle = middle_end - prefix_end;
>> + suffix = orig_end - middle_end;
>> +
>> + if (prefix) {
>> nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
>> - nvecs, total, 0, args->first);
>> + nvecs, total, 0, prefix);
>> ++args->nsegs;
>> }
>>
>> + if (!middle)
>> + return false; /* No aligned region for DIO */
>> +
>> nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
>> - total, args->first, args->middle);
>> - if (!nfsd_iov_iter_aligned_bvec(args->nf,
>> - &args->segment[args->nsegs].iter))
>> - return false; /* no DIO-aligned IO possible */
>> + total, prefix, middle);
>> args->segment[args->nsegs].use_dio = true;
>> ++args->nsegs;
>>
>> - if (args->last) {
>> + if (suffix) {
>> nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec,
>> - nvecs, total, args->first +
>> - args->middle, args->last);
>> + nvecs, total, prefix + middle, suffix);
>> ++args->nsegs;
>> }
>>
>> @@ -1373,7 +1413,8 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
>> ssize_t host_err;
>> unsigned int i;
>>
>> - if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, *cnt, args))
>> + if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
>> + *cnt, args))
>> return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
>>
>> /*
>
> Not an easy patch to follow, but I think I get it.
Agreed, and that is certainly concerning.
"Kernighan's Law - Debugging is twice as hard as writing the code in the
first place. Therefore, if you write the code as cleverly as possible,
you are, by definition, not smart enough to debug it."
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
--
Chuck Lever
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 10/12] NFSD: Combine direct I/O feasibility check with iterator setup
2025-10-27 15:46 ` [PATCH v8 10/12] NFSD: Combine direct I/O feasibility check with iterator setup Chuck Lever
@ 2025-10-30 19:59 ` Jeff Layton
2025-10-31 13:20 ` Christoph Hellwig
1 sibling, 0 replies; 27+ messages in thread
From: Jeff Layton @ 2025-10-30 19:59 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
On Mon, 2025-10-27 at 11:46 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> When direct I/O is not feasible (due to missing alignment info,
> too-small writes, or no alignment possible), pack the entire
> write payload into a single non-DIO segment and follow the usual
> direct write I/O path.
>
> This simplifies nfsd_direct_write() by eliminating the fallback path
> and the separate nfsd_buffered_write() call - all writes now go
> through nfsd_issue_write_dio() which handles both DIO and buffered
> segments.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 69 +++++++++++++++++++++------------------------------
> 1 file changed, 28 insertions(+), 41 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index a872be300c9f..be0688f2ab3d 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1304,30 +1304,6 @@ nfsd_find_dio_aligned_offset(struct nfsd_file *nf, loff_t file_offset,
> return SIZE_MAX; /* No alignment found */
> }
>
> -/*
> - * Check if the underlying file system implements direct I/O.
> - */
> -static bool
> -nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> - struct nfsd_write_dio_args *args)
> -{
> - u32 offset_align = args->nf->nf_dio_offset_align;
> - u32 mem_align = args->nf->nf_dio_mem_align;
> -
> - if (unlikely(!mem_align || !offset_align))
> - return false;
> -
> - /*
> - * Need enough data to potentially find an aligned segment.
> - * In the worst case, we might need up to
> - * lcm(offset_align, mem_align) bytes for the prefix.
> - */
> - if (unlikely(len < max(offset_align, mem_align)))
> - return false;
> -
> - return true;
> -}
> -
> static void
> nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
> struct bio_vec *bvec, unsigned int nvecs,
> @@ -1340,22 +1316,31 @@ nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
> segment->use_dio = false;
> }
>
> -static bool
> -nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
> - loff_t offset, unsigned long total,
> - struct nfsd_write_dio_args *args)
> +static void
> +nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
> + loff_t offset, unsigned long total,
> + struct nfsd_write_dio_args *args)
> {
> u32 offset_align = args->nf->nf_dio_offset_align;
> + u32 mem_align = args->nf->nf_dio_mem_align;
> unsigned long mem_offset = bvec->bv_offset;
> loff_t prefix_end, orig_end, middle_end;
> size_t prefix, middle, suffix;
>
> args->nsegs = 0;
>
> + /*
> + * Check if direct I/O is feasible for this write request.
> + * If alignments are not available, the write is too small,
> + * or no alignment can be found, fall back to buffered I/O.
> + */
> + if (unlikely(!mem_align || !offset_align) ||
> + unlikely(total < max(offset_align, mem_align)))
> + goto no_dio;
> prefix = nfsd_find_dio_aligned_offset(args->nf, offset, mem_offset,
> total);
> if (prefix == SIZE_MAX)
> - return false; /* No alignment possible */
> + goto no_dio;
>
> prefix_end = offset + prefix;
> orig_end = offset + total;
> @@ -1371,7 +1356,7 @@ nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
> }
>
> if (!middle)
> - return false; /* No aligned region for DIO */
> + goto no_dio;
>
> nfsd_write_dio_seg_init(&args->segment[args->nsegs], bvec, nvecs,
> total, prefix, middle);
> @@ -1384,7 +1369,13 @@ nfsd_setup_write_dio_iters(struct bio_vec *bvec, unsigned int nvecs,
> ++args->nsegs;
> }
>
> - return true;
> + return;
> +
> +no_dio:
> + /* No alignment possible - pack into single non-DIO segment */
> + nfsd_write_dio_seg_init(&args->segment[0], bvec, nvecs, total,
> + 0, total);
> + args->nsegs = 1;
> }
>
> static int
> @@ -1405,7 +1396,7 @@ nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> }
>
> static int
> -nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
> +nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
> struct kiocb *kiocb, unsigned int nvecs, unsigned long *cnt,
> struct nfsd_write_dio_args *args)
> {
> @@ -1413,10 +1404,6 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
> ssize_t host_err;
> unsigned int i;
>
> - if (!nfsd_setup_write_dio_iters(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
> - *cnt, args))
> - return nfsd_buffered_write(rqstp, file, nvecs, cnt, kiocb);
> -
> /*
> * Any buffered IO issued here will be misaligned, use
> * sync IO to ensure it has completed before returning.
> @@ -1425,6 +1412,9 @@ nfsd_issue_write_dio(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how
> kiocb->ki_flags |= (IOCB_DSYNC|IOCB_SYNC);
> *stable_how = NFS_FILE_SYNC;
>
> + nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
> + *cnt, args);
> +
> *cnt = 0;
> for (i = 0; i < args->nsegs; i++) {
> if (args->segment[i].use_dio) {
> @@ -1463,11 +1453,8 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> kiocb->ki_flags |= IOCB_DONTCACHE;
>
> - if (nfsd_is_write_dio_possible(kiocb->ki_pos, *cnt, &args))
> - return nfsd_issue_write_dio(rqstp, fhp, stable_how, kiocb,
> - nvecs, cnt, &args);
> -
> - return nfsd_buffered_write(rqstp, args.nf->nf_file, nvecs, cnt, kiocb);
> + return nfsd_issue_dio_write(rqstp, fhp, stable_how, kiocb, nvecs,
> + cnt, &args);
> }
>
> /**
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 11/12] NFSD: Handle kiocb->ki_flags correctly
2025-10-27 15:46 ` [PATCH v8 11/12] NFSD: Handle kiocb->ki_flags correctly Chuck Lever
@ 2025-10-30 20:01 ` Jeff Layton
2025-10-31 13:21 ` Christoph Hellwig
1 sibling, 0 replies; 27+ messages in thread
From: Jeff Layton @ 2025-10-30 20:01 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
On Mon, 2025-10-27 at 11:46 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Christoph says:
> > > + if (file->f_op->fop_flags & FOP_DONTCACHE)
> > > + kiocb->ki_flags |= IOCB_DONTCACHE;
> > IOCB_DONTCACHE isn't defined for IOCB_DIRECT. So this should
> > move into a branch just for buffered I/O.
>
> and
>
> > > Promoting all NFSD_IO_DIRECT writes to FILE_SYNC was my idea,
> > > based on the assumption that IOCB_DIRECT writes to local file
> > > systems left nothing to be done by a later commit. My assumption
> > > is based on the behavior of O_DIRECT on NFS files.
> > >
> > > If that assumption is not true, then I agree there is no
> > > technical reason to promote NFSD_IO_DIRECT writes to FILE_SYNC,
> > > and I can remove that built-in assumption for v8 of this series.
> >
> > It is not true, or rather only true for a tiny subset of use cases
> > (which NFS can't even query a head of time).
>
> So, observe the existing setting of ki_flags rather than forcing
> persistence unconditionally, and ensure that DONTCACHE is not set
> for IOCB_DIRECT writes.
>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 33 ++++++++++++++-------------------
> 1 file changed, 14 insertions(+), 19 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index be0688f2ab3d..3c78b3aeea4b 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1261,6 +1261,8 @@ struct nfsd_write_dio_seg {
>
> struct nfsd_write_dio_args {
> struct nfsd_file *nf;
> + int flags_buffered;
> + int flags_direct;
> unsigned int nsegs;
> struct nfsd_write_dio_seg segment[3];
> };
> @@ -1396,33 +1398,25 @@ nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> }
>
> static int
> -nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp, u32 *stable_how,
> - struct kiocb *kiocb, unsigned int nvecs, unsigned long *cnt,
> - struct nfsd_write_dio_args *args)
> +nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + struct kiocb *kiocb, unsigned int nvecs,
> + unsigned long *cnt, struct nfsd_write_dio_args *args)
> {
> struct file *file = args->nf->nf_file;
> ssize_t host_err;
> unsigned int i;
>
> - /*
> - * Any buffered IO issued here will be misaligned, use
> - * sync IO to ensure it has completed before returning.
> - * Also update @stable_how to avoid need for COMMIT.
> - */
> - kiocb->ki_flags |= (IOCB_DSYNC|IOCB_SYNC);
> - *stable_how = NFS_FILE_SYNC;
> -
> nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
> *cnt, args);
>
> *cnt = 0;
> for (i = 0; i < args->nsegs; i++) {
> if (args->segment[i].use_dio) {
> - kiocb->ki_flags |= IOCB_DIRECT;
> + kiocb->ki_flags = args->flags_direct;
> trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
> args->segment[i].iter.count);
> } else
> - kiocb->ki_flags &= ~IOCB_DIRECT;
> + kiocb->ki_flags = args->flags_buffered;
>
> host_err = vfs_iocb_iter_write(file, kiocb,
> &args->segment[i].iter);
> @@ -1446,15 +1440,16 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> args.nf = nf;
>
> /*
> - * Check if IOCB_DONTCACHE can be used when issuing buffered IO;
> - * if so, set it to preserve intent of NFSD_IO_DIRECT (it will
> - * be ignored for any DIO issued here).
> + * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
> + * writing unaligned segments or handling fallback I/O.
> */
> + args.flags_buffered = kiocb->ki_flags;
> if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> - kiocb->ki_flags |= IOCB_DONTCACHE;
> + args.flags_buffered |= IOCB_DONTCACHE;
>
> - return nfsd_issue_dio_write(rqstp, fhp, stable_how, kiocb, nvecs,
> - cnt, &args);
> + args.flags_direct = kiocb->ki_flags | IOCB_DIRECT;
> +
> + return nfsd_issue_dio_write(rqstp, fhp, kiocb, nvecs, cnt, &args);
> }
>
> /**
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 12/12] NFSD: Refactor nfsd_vfs_write
2025-10-27 15:46 ` [PATCH v8 12/12] NFSD: Refactor nfsd_vfs_write Chuck Lever
@ 2025-10-30 20:02 ` Jeff Layton
0 siblings, 0 replies; 27+ messages in thread
From: Jeff Layton @ 2025-10-30 20:02 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Christoph Hellwig, Chuck Lever
On Mon, 2025-10-27 at 11:46 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> There is now only one caller of nfsd_buffered_write(), so it can
> be folded back into nfsd_vfs_write().
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/vfs.c | 27 +++++++--------------------
> 1 file changed, 7 insertions(+), 20 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 3c78b3aeea4b..934090e168c6 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1380,23 +1380,6 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
> args->nsegs = 1;
> }
>
> -static int
> -nfsd_buffered_write(struct svc_rqst *rqstp, struct file *file,
> - unsigned int nvecs, unsigned long *cnt,
> - struct kiocb *kiocb)
> -{
> - struct iov_iter iter;
> - int host_err;
> -
> - iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> - host_err = vfs_iocb_iter_write(file, kiocb, &iter);
> - if (host_err < 0)
> - return host_err;
> - *cnt = host_err;
> -
> - return 0;
> -}
> -
> static int
> nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> struct kiocb *kiocb, unsigned int nvecs,
> @@ -1480,6 +1463,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> u32 stable = *stable_how;
> struct kiocb kiocb;
> struct svc_export *exp;
> + struct iov_iter iter;
> errseq_t since;
> __be32 nfserr;
> int host_err;
> @@ -1540,10 +1524,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> case NFSD_IO_DONTCACHE:
> if (file->f_op->fop_flags & FOP_DONTCACHE)
> kiocb.ki_flags |= IOCB_DONTCACHE;
> - fallthrough; /* must call nfsd_buffered_write */
> + fallthrough;
> case NFSD_IO_BUFFERED:
> - host_err = nfsd_buffered_write(rqstp, file,
> - nvecs, cnt, &kiocb);
> + iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> + host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
> + if (host_err < 0)
> + break;
> + *cnt = host_err;
> break;
> }
> if (host_err < 0) {
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O
2025-10-30 19:52 ` Jeff Layton
2025-10-30 19:55 ` Chuck Lever
@ 2025-10-31 9:13 ` Christoph Hellwig
1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-31 9:13 UTC (permalink / raw)
To: Jeff Layton
Cc: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
Hi Jeff,
I've been trying to read your various replies, but could not find any
text after scrolling down a few pages. Please fix your mailer to avoid
fullquots, and until then I'll have to ignore them.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec()
2025-10-27 15:46 ` [PATCH v8 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec() Chuck Lever
2025-10-30 15:00 ` Jeff Layton
@ 2025-10-31 13:16 ` Christoph Hellwig
1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-31 13:16 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Christoph Hellwig, Chuck Lever
On Mon, Oct 27, 2025 at 11:46:26AM -0400, Chuck Lever wrote:
> +/*
> + * Check if the bvec iterator is aligned for direct I/O.
> + *
> + * bvecs generated from RPC receive buffers are contiguous: After the first
> + * bvec, all subsequent bvecs start at bv_offset zero (page-aligned).
> + * Therefore, only the first bvec is checked.
> + */
> static bool
> nfsd_iov_iter_aligned_bvec(const struct nfsd_file *nf, const struct iov_iter *i)
>
>
>
> {
> unsigned int addr_mask = nf->nf_dio_mem_align - 1;
> const struct bio_vec *bvec = i->bvec;
>
> + return !((unsigned long)(bvec->bv_offset + i->iov_offset) & addr_mask);
Nice simplification!
But this entirely helper feels a bit pointless now vs just open coding
two lines in the caller.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O
2025-10-27 15:46 ` [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O Chuck Lever
2025-10-30 19:52 ` Jeff Layton
@ 2025-10-31 13:19 ` Christoph Hellwig
2025-10-31 13:21 ` Chuck Lever
1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-31 13:19 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Christoph Hellwig, Chuck Lever
On Mon, Oct 27, 2025 at 11:46:27AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Currently, nfsd_is_write_dio_possible() only considers file offset
> alignment (nf_dio_offset_align) when splitting an NFS WRITE request
> into segments. This leaves accounting for memory buffer alignment
> (nf_dio_mem_align) until nfsd_setup_write_dio_iters(). If this
> second check fails, the code falls back to cached I/O entirely,
> wasting the opportunity to use direct I/O for the bulk of the
> request.
>
> Enhance the logic to find a beginning segment size that satisfies
> both alignment constraints simultaneously. The search algorithm
> starts with the file offset alignment requirement and steps through
> multiples of offset_align, checking memory alignment at each step.
> The search is bounded by lcm(offset_align, mem_align) to ensure that
> it always terminates.
How likely is that going to happen? After the first bvec the
alignment constrains won't change, so how are we going to succeed
then?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 10/12] NFSD: Combine direct I/O feasibility check with iterator setup
2025-10-27 15:46 ` [PATCH v8 10/12] NFSD: Combine direct I/O feasibility check with iterator setup Chuck Lever
2025-10-30 19:59 ` Jeff Layton
@ 2025-10-31 13:20 ` Christoph Hellwig
1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-31 13:20 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Christoph Hellwig, Chuck Lever
On Mon, Oct 27, 2025 at 11:46:28AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> When direct I/O is not feasible (due to missing alignment info,
> too-small writes, or no alignment possible), pack the entire
> write payload into a single non-DIO segment and follow the usual
> direct write I/O path.
>
> This simplifies nfsd_direct_write() by eliminating the fallback path
> and the separate nfsd_buffered_write() call - all writes now go
> through nfsd_issue_write_dio() which handles both DIO and buffered
> segments.
Yes, this sounds much more sensible. No real review as I've already
lost the context from 10 patches earlier touching the area
unfortunately.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O
2025-10-31 13:19 ` Christoph Hellwig
@ 2025-10-31 13:21 ` Chuck Lever
2025-10-31 13:23 ` Christoph Hellwig
2025-10-31 16:07 ` Mike Snitzer
0 siblings, 2 replies; 27+ messages in thread
From: Chuck Lever @ 2025-10-31 13:21 UTC (permalink / raw)
To: Christoph Hellwig
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Christoph Hellwig, Chuck Lever
On 10/31/25 9:19 AM, Christoph Hellwig wrote:
> On Mon, Oct 27, 2025 at 11:46:27AM -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Currently, nfsd_is_write_dio_possible() only considers file offset
>> alignment (nf_dio_offset_align) when splitting an NFS WRITE request
>> into segments. This leaves accounting for memory buffer alignment
>> (nf_dio_mem_align) until nfsd_setup_write_dio_iters(). If this
>> second check fails, the code falls back to cached I/O entirely,
>> wasting the opportunity to use direct I/O for the bulk of the
>> request.
>>
>> Enhance the logic to find a beginning segment size that satisfies
>> both alignment constraints simultaneously. The search algorithm
>> starts with the file offset alignment requirement and steps through
>> multiples of offset_align, checking memory alignment at each step.
>> The search is bounded by lcm(offset_align, mem_align) to ensure that
>> it always terminates.
>
> How likely is that going to happen? After the first bvec the
> alignment constrains won't change, so how are we going to succeed
> then?
>
I was hoping that this algorithm would improve the likelihood of
finding a middle segment alignment for NFS WRITE on TCP. I'm not
entirely sure it has been effective.
Given the complexity, I'm wondering if I want to keep this one.
--
Chuck Lever
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 11/12] NFSD: Handle kiocb->ki_flags correctly
2025-10-27 15:46 ` [PATCH v8 11/12] NFSD: Handle kiocb->ki_flags correctly Chuck Lever
2025-10-30 20:01 ` Jeff Layton
@ 2025-10-31 13:21 ` Christoph Hellwig
1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-31 13:21 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Christoph Hellwig, Chuck Lever
Thanks, this looks good.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O
2025-10-31 13:21 ` Chuck Lever
@ 2025-10-31 13:23 ` Christoph Hellwig
2025-10-31 16:07 ` Mike Snitzer
1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-31 13:23 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Christoph Hellwig, Chuck Lever
On Fri, Oct 31, 2025 at 09:21:27AM -0400, Chuck Lever wrote:
> > How likely is that going to happen? After the first bvec the
> > alignment constrains won't change, so how are we going to succeed
> > then?
> >
>
> I was hoping that this algorithm would improve the likelihood of
> finding a middle segment alignment for NFS WRITE on TCP. I'm not
> entirely sure it has been effective.
Maybe I misremember the nfsd buffer management, but didn't you just
rely on the fact that any but the first bvec is page aligned? And
given that we have no holes I don't really see how we'd get better
aligned later. But I might just be very confused here, sorry.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O
2025-10-31 13:21 ` Chuck Lever
2025-10-31 13:23 ` Christoph Hellwig
@ 2025-10-31 16:07 ` Mike Snitzer
1 sibling, 0 replies; 27+ messages in thread
From: Mike Snitzer @ 2025-10-31 16:07 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Christoph Hellwig, Chuck Lever
On Fri, Oct 31, 2025 at 09:21:27AM -0400, Chuck Lever wrote:
> On 10/31/25 9:19 AM, Christoph Hellwig wrote:
> > On Mon, Oct 27, 2025 at 11:46:27AM -0400, Chuck Lever wrote:
> >> From: Chuck Lever <chuck.lever@oracle.com>
> >>
> >> Currently, nfsd_is_write_dio_possible() only considers file offset
> >> alignment (nf_dio_offset_align) when splitting an NFS WRITE request
> >> into segments. This leaves accounting for memory buffer alignment
> >> (nf_dio_mem_align) until nfsd_setup_write_dio_iters(). If this
> >> second check fails, the code falls back to cached I/O entirely,
> >> wasting the opportunity to use direct I/O for the bulk of the
> >> request.
> >>
> >> Enhance the logic to find a beginning segment size that satisfies
> >> both alignment constraints simultaneously. The search algorithm
> >> starts with the file offset alignment requirement and steps through
> >> multiples of offset_align, checking memory alignment at each step.
> >> The search is bounded by lcm(offset_align, mem_align) to ensure that
> >> it always terminates.
> >
> > How likely is that going to happen? After the first bvec the
> > alignment constrains won't change, so how are we going to succeed
> > then?
> >
>
> I was hoping that this algorithm would improve the likelihood of
> finding a middle segment alignment for NFS WRITE on TCP. I'm not
> entirely sure it has been effective.
>
> Given the complexity, I'm wondering if I want to keep this one.
I don't think its worth the complexity. And I don't see how it'd help
make TCP's potentially unaligned buffer, for the NFS WRITE's payload,
able to work within the required ondisk logical offset granularity.
The mlperf_npz_tool.py I provided does a really good job of exposing
this worse case scenario, especially when the NFS client is forced to
use DIO (but even if just using buffered IO it has some issue just
that the NFS client's use of the page cache causes the IO that's sent
over the wire to NFSD to be much larger, so its head and tail of that
large IO are misaligned). But when the NFS client does use O_DIRECT
each and every 1MB IO is offset in memory such that every single
logical_block_size isn't aligned relative to associated page -- each
spans multiple pages.
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2025-10-31 16:07 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-27 15:46 [PATCH v8 00/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-27 15:46 ` [PATCH v8 01/12] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-10-27 15:46 ` [PATCH v8 02/12] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-10-27 15:46 ` [PATCH v8 03/12] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-27 15:46 ` [PATCH v8 04/12] NFSD: Remove specific error handling Chuck Lever
2025-10-27 15:46 ` [PATCH v8 05/12] NFSD: Remove alignment size checking Chuck Lever
2025-10-27 15:46 ` [PATCH v8 06/12] NFSD: Clean up struct nfsd_write_dio Chuck Lever
2025-10-27 15:46 ` [PATCH v8 07/12] NFSD: Introduce struct nfsd_write_dio_seg Chuck Lever
2025-10-27 15:46 ` [PATCH v8 08/12] NFSD: Simplify nfsd_iov_iter_aligned_bvec() Chuck Lever
2025-10-30 15:00 ` Jeff Layton
2025-10-31 13:16 ` Christoph Hellwig
2025-10-27 15:46 ` [PATCH v8 09/12] NFSD: Handle both offset and memory alignment for direct I/O Chuck Lever
2025-10-30 19:52 ` Jeff Layton
2025-10-30 19:55 ` Chuck Lever
2025-10-31 9:13 ` Christoph Hellwig
2025-10-31 13:19 ` Christoph Hellwig
2025-10-31 13:21 ` Chuck Lever
2025-10-31 13:23 ` Christoph Hellwig
2025-10-31 16:07 ` Mike Snitzer
2025-10-27 15:46 ` [PATCH v8 10/12] NFSD: Combine direct I/O feasibility check with iterator setup Chuck Lever
2025-10-30 19:59 ` Jeff Layton
2025-10-31 13:20 ` Christoph Hellwig
2025-10-27 15:46 ` [PATCH v8 11/12] NFSD: Handle kiocb->ki_flags correctly Chuck Lever
2025-10-30 20:01 ` Jeff Layton
2025-10-31 13:21 ` Christoph Hellwig
2025-10-27 15:46 ` [PATCH v8 12/12] NFSD: Refactor nfsd_vfs_write Chuck Lever
2025-10-30 20:02 ` Jeff Layton
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox