* [PATCH v11 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
@ 2025-11-07 15:34 Chuck Lever
2025-11-07 15:34 ` [PATCH v11 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
` (2 more replies)
0 siblings, 3 replies; 32+ messages in thread
From: Chuck Lever @ 2025-11-07 15:34 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Following on https://lore.kernel.org/linux-nfs/aPAci7O_XK1ljaum@kernel.org/
this series includes the patches needed to make NFSD Direct WRITE
operational.
After applying Christoph's patch, I looked at restoring the comment
block in front of iov_iter_bvec_offset(), but it seems to make
slightly more sense to leave it where it is in
nfsd_write_dio_iters_init().
I'm still looking into Neil's comment about adding a trace point
for unaligned segments.
One controversy remains: whether to set DONTCACHE for the unaligned
segments.
I still see this during fstests runs with NFSD_IO_DIRECT enabled:
WARNING: CPU: 5 PID: 1309 at fs/iomap/buffered-io.c:1402 iomap_zero_iter+0x1a4/0x390
No new test failures, but I need to narrow down which test is
triggering this message.
Applies on: 7f7d8421c7fa2588930146cb461e3e069658ced9
In the branch https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/log/?h=nfsd-testing
Changes since v10:
* Applied Christoph's clean-ups
* Applied Mike's documentation fixes
Changes since v9:
* Unaligned segments no longer use IOCB_DONTCACHE
* Squashed all review patches into Mike's initial patch
* Squashed Mike's documentation update into the final patch
Changes since v8:
* Drop "NFSD: Handle both offset and memory alignment for direct I/O"
* Include the Sep 3 version of the Documentation update
Changes since v7:
* Rebase the series on Mike's original v3 patch
* Address more review comments
* Optimize the "when can NFSD use IOCB_DIRECT" logic
* Revert the "always promote to FILE_SYNC" logic
Changes since v6:
* Patches to address review comments have been split out
* Refactored the iter initialization code
Changes since v5:
* Add a patch to make FILE_SYNC WRITEs persist timestamps
* Address some of Christoph's review comments
* The svcrdma patch has been dropped until we actually need it
Changes since v4:
* Split out refactoring nfsd_buffered_write() into a separate patch
* Expand patch description of 1/4
* Don't set IOCB_SYNC flag
Changes since v3:
* Address checkpatch.pl nits in 2/3
* Add an untested patch to mark ingress RDMA Read chunks
Chuck Lever (1):
NFSD: Make FILE_SYNC WRITEs comply with spec
Mike Snitzer (2):
NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
.../filesystems/nfs/nfsd-io-modes.rst | 144 ++++++++++++++++
fs/nfsd/debugfs.c | 1 +
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 154 +++++++++++++++++-
4 files changed, 294 insertions(+), 6 deletions(-)
create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
--
2.51.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v11 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec
2025-11-07 15:34 [PATCH v11 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-11-07 15:34 ` Chuck Lever
2025-11-07 15:34 ` [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:34 ` [PATCH v11 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
2 siblings, 0 replies; 32+ messages in thread
From: Chuck Lever @ 2025-11-07 15:34 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Mike Snitzer, stable, Christoph Hellwig
From: Chuck Lever <chuck.lever@oracle.com>
Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
does not also persist file time stamps. To wit, Section 18.32.3
of RFC 8881 mandates:
> The client specifies with the stable parameter the method of how
> the data is to be processed by the server. If stable is
> FILE_SYNC4, the server MUST commit the data written plus all file
> system metadata to stable storage before returning results. This
> corresponds to the NFSv2 protocol semantics. Any other behavior
> constitutes a protocol violation. If stable is DATA_SYNC4, then
> the server MUST commit all of the data to stable storage and
> enough of the metadata to retrieve the data before returning.
Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") replaced:
- flags |= RWF_SYNC;
with:
+ kiocb.ki_flags |= IOCB_DSYNC;
which appears to be correct given:
if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;
in kiocb_set_rw_flags(). However, the author of that commit did not
appreciate that the previous line in kiocb_set_rw_flags() results
in IOCB_SYNC also being set:
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
RWF_SUPPORTED contains RWF_SYNC, and RWF_SYNC is the same bit as
IOCB_SYNC. Reviewers at the time did not catch the omission.
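For illustration, here is a condensed sketch (paraphrased, not the
kernel source verbatim) of the mapping the old RWF_SYNC path relied
on, assuming the flag definitions in include/linux/fs.h:

    /*
     * IOCB_SYNC is defined as the same bit value as RWF_SYNC, so the
     * blanket copy of the supported RWF_* flags already sets
     * IOCB_SYNC; the explicit test below then adds IOCB_DSYNC on top.
     */
    kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
    if (flags & RWF_SYNC)
        kiocb_flags |= IOCB_DSYNC;

Thus an RWF_SYNC caller always got IOCB_SYNC | IOCB_DSYNC, and
open-coding only IOCB_DSYNC silently dropped the IOCB_SYNC half.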
Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
Fixes: 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f537a7b4ee01..5333d49910d9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1314,8 +1314,18 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
stable = NFS_UNSTABLE;
init_sync_kiocb(&kiocb, file);
kiocb.ki_pos = offset;
- if (stable && !fhp->fh_use_wgather)
- kiocb.ki_flags |= IOCB_DSYNC;
+ if (likely(!fhp->fh_use_wgather)) {
+ switch (stable) {
+ case NFS_FILE_SYNC:
+ /* persist data and timestamps */
+ kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
+ break;
+ case NFS_DATA_SYNC:
+ /* persist data only */
+ kiocb.ki_flags |= IOCB_DSYNC;
+ break;
+ }
+ }
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
--
2.51.0
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 15:34 [PATCH v11 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:34 ` [PATCH v11 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
@ 2025-11-07 15:34 ` Chuck Lever
2025-11-07 15:39 ` Christoph Hellwig
` (2 more replies)
2025-11-07 15:34 ` [PATCH v11 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
2 siblings, 3 replies; 32+ messages in thread
From: Chuck Lever @ 2025-11-07 15:34 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer, Chuck Lever
From: Mike Snitzer <snitzer@kernel.org>
When NFSD_IO_DIRECT is selected via the
/sys/kernel/debug/nfsd/io_cache_write experimental tunable, split
incoming unaligned NFS WRITE requests into a prefix, middle and
suffix segment, as needed. The middle segment is now DIO-aligned and
the prefix and/or suffix are unaligned. Synchronous buffered IO is
used for the unaligned segments, and IOCB_DIRECT is used for the
middle DIO-aligned extent.
Although IOCB_DIRECT avoids the use of the page cache, by itself it
doesn't guarantee data durability. For UNSTABLE WRITE requests,
durability is obtained by a subsequent NFS COMMIT request.
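As a worked example (offsets and alignment chosen for illustration
only, assuming nf_dio_offset_align == 4096): a 1 MiB WRITE arriving
at file offset 1000 is split as

    prefix_end = round_up(1000, 4096)      = 4096
    orig_end   = 1000 + 1048576            = 1049576
    middle_end = round_down(1049576, 4096) = 1048576

    prefix = 3096 bytes    (buffered)
    middle = 1044480 bytes (IOCB_DIRECT)
    suffix = 1000 bytes    (buffered)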
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Co-developed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/debugfs.c | 1 +
fs/nfsd/trace.h | 1 +
fs/nfsd/vfs.c | 140 ++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 138 insertions(+), 4 deletions(-)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 00eb1ecef6ac..7f44689e0a53 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
switch (val) {
case NFSD_IO_BUFFERED:
case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
nfsd_io_cache_write = val;
break;
default:
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 85a1521ad757..8047a6d97b81 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
DEFINE_NFSD_IO_EVENT(read_done);
DEFINE_NFSD_IO_EVENT(write_start);
DEFINE_NFSD_IO_EVENT(write_opened);
+DEFINE_NFSD_IO_EVENT(write_direct);
DEFINE_NFSD_IO_EVENT(write_io_done);
DEFINE_NFSD_IO_EVENT(write_done);
DEFINE_NFSD_IO_EVENT(commit_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5333d49910d9..7e56be190170 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,131 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
+struct nfsd_write_dio_seg {
+ struct iov_iter iter;
+ int flags;
+};
+
+static unsigned long
+iov_iter_bvec_offset(const struct iov_iter *iter)
+{
+ return (unsigned long)(iter->bvec->bv_offset + iter->iov_offset);
+}
+
+static void
+nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
+ struct bio_vec *bvec, unsigned int nvecs,
+ unsigned long total, size_t start, size_t len,
+ struct kiocb *iocb)
+{
+ iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
+ if (start)
+ iov_iter_advance(&segment->iter, start);
+ iov_iter_truncate(&segment->iter, len);
+ segment->flags = iocb->ki_flags;
+}
+
+static unsigned int
+nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
+ unsigned int nvecs, struct kiocb *iocb,
+ unsigned long total,
+ struct nfsd_write_dio_seg segments[3])
+{
+ u32 offset_align = nf->nf_dio_offset_align;
+ loff_t prefix_end, orig_end, middle_end;
+ u32 mem_align = nf->nf_dio_mem_align;
+ size_t prefix, middle, suffix;
+ loff_t offset = iocb->ki_pos;
+ unsigned int nsegs = 0;
+
+ /*
+ * Check if direct I/O is feasible for this write request.
+ * If alignments are not available, the write is too small,
+ * or no alignment can be found, fall back to buffered I/O.
+ */
+ if (unlikely(!mem_align || !offset_align) ||
+ unlikely(total < max(offset_align, mem_align)))
+ goto no_dio;
+
+ prefix_end = round_up(offset, offset_align);
+ orig_end = offset + total;
+ middle_end = round_down(orig_end, offset_align);
+
+ prefix = prefix_end - offset;
+ middle = middle_end - prefix_end;
+ suffix = orig_end - middle_end;
+
+ if (!middle)
+ goto no_dio;
+
+ if (prefix)
+ nfsd_write_dio_seg_init(&segments[nsegs++], bvec,
+ nvecs, total, 0, prefix, iocb);
+
+ nfsd_write_dio_seg_init(&segments[nsegs], bvec, nvecs,
+ total, prefix, middle, iocb);
+
+ /*
+ * Check if the bvec iterator is aligned for direct I/O.
+ *
+ * bvecs generated from RPC receive buffers are contiguous: After
+ * the first bvec, all subsequent bvecs start at bv_offset zero
+ * (page-aligned). Therefore, only the first bvec is checked.
+ */
+ if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1))
+ goto no_dio;
+ segments[nsegs].flags |= IOCB_DIRECT;
+ nsegs++;
+
+ if (suffix)
+ nfsd_write_dio_seg_init(&segments[nsegs++], bvec, nvecs, total,
+ prefix + middle, suffix, iocb);
+
+ return nsegs;
+
+no_dio:
+ /*
+ * No DIO alignment possible - pack into single non-DIO segment.
+ * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
+ */
+ nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total, 0,
+ total, iocb);
+ if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+ segments[nsegs].flags |= IOCB_DONTCACHE;
+ return 1;
+}
+
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, unsigned int nvecs,
+ unsigned long *cnt, struct kiocb *kiocb)
+{
+ struct nfsd_write_dio_seg segments[3];
+ struct file *file = nf->nf_file;
+ unsigned int nsegs, i;
+ ssize_t host_err;
+
+ nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
+ kiocb, *cnt, segments);
+
+ *cnt = 0;
+ for (i = 0; i < nsegs; i++) {
+ kiocb->ki_flags = segments[i].flags;
+ if (kiocb->ki_flags & IOCB_DIRECT)
+ trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
+ segments[i].iter.count);
+
+ host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
+ if (host_err < 0)
+ return host_err;
+ *cnt += host_err;
+ if (host_err < segments[i].iter.count)
+ break; /* partial write */
+ }
+
+ return 0;
+}
+
/**
* nfsd_vfs_write - write data to an already-open file
* @rqstp: RPC execution context
@@ -1328,25 +1453,32 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
- iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
switch (nfsd_io_cache_write) {
- case NFSD_IO_BUFFERED:
+ case NFSD_IO_DIRECT:
+ host_err = nfsd_direct_write(rqstp, fhp, nf, nvecs,
+ cnt, &kiocb);
break;
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags |= IOCB_DONTCACHE;
+ fallthrough;
+ case NFSD_IO_BUFFERED:
+ iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+ host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
+ if (host_err < 0)
+ break;
+ *cnt = host_err;
break;
}
- host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
if (host_err < 0) {
commit_reset_write_verifier(nn, rqstp, host_err);
goto out_nfserr;
}
- *cnt = host_err;
nfsd_stats_io_write_add(nn, exp, *cnt);
fsnotify_modify(file);
host_err = filemap_check_wb_err(file->f_mapping, since);
--
2.51.0
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v11 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
2025-11-07 15:34 [PATCH v11 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:34 ` [PATCH v11 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-11-07 15:34 ` [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-11-07 15:34 ` Chuck Lever
2 siblings, 0 replies; 32+ messages in thread
From: Chuck Lever @ 2025-11-07 15:34 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer
From: Mike Snitzer <snitzer@kernel.org>
This document details the NFSD IO modes that are configurable using
NFSD's experimental debugfs interfaces:
/sys/kernel/debug/nfsd/io_cache_read
/sys/kernel/debug/nfsd/io_cache_write
This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
debugfs interfaces are replaced with per-export controls).
Future updates will provide more specific guidance and howto
information to help others use and evaluate NFSD's IO modes:
BUFFERED, DONTCACHE and DIRECT.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
.../filesystems/nfs/nfsd-io-modes.rst | 144 ++++++++++++++++++
1 file changed, 144 insertions(+)
create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
new file mode 100644
index 000000000000..e3a522d09766
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -0,0 +1,144 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+NFSD IO MODES
+=============
+
+Overview
+========
+
+NFSD has historically always used buffered IO when servicing READ and
+WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
+to override that default to use either DONTCACHE or DIRECT IO modes.
+
+Experimental NFSD debugfs interfaces are available to allow the NFSD IO
+mode used for READ and WRITE to be configured independently. See both:
+- /sys/kernel/debug/nfsd/io_cache_read
+- /sys/kernel/debug/nfsd/io_cache_write
+
+The default value for both io_cache_read and io_cache_write reflects
+NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
+
+Based on the configured settings, NFSD's IO will either be:
+- cached using page cache (NFSD_IO_BUFFERED=0)
+- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
+- not cached (NFSD_IO_DIRECT=2)
+
+To set an NFSD IO mode, write a supported value (0 - 2) to the
+corresponding IO operation's debugfs interface, e.g.:
+ echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+ echo 2 > /sys/kernel/debug/nfsd/io_cache_write
+
+To check which IO mode NFSD is using for READ or WRITE, simply read the
+corresponding IO operation's debugfs interface, e.g.:
+ cat /sys/kernel/debug/nfsd/io_cache_read
+ cat /sys/kernel/debug/nfsd/io_cache_write
+
+If you experiment with NFSD's IO modes on a recent kernel and have
+interesting results, please report them to linux-nfs@vger.kernel.org
+
+NFSD DONTCACHE
+==============
+
+DONTCACHE offers a hybrid approach to servicing IO that aims to offer
+the benefits of using DIRECT IO without any of the strict alignment
+requirements that DIRECT IO imposes. To achieve this, buffered IO is used
+but the IO is flagged to "drop behind" (meaning associated pages are
+dropped from the page cache) when IO completes.
+
+DONTCACHE aims to avoid what has proven to be a fairly significant
+limitation of Linux's memory management subsystem if/when large amounts of
+data are infrequently accessed (e.g. read once _or_ written once but not
+read until much later). Such use-cases are particularly problematic
+because the page cache will eventually become a bottleneck to servicing
+new IO requests.
+
+For more context on DONTCACHE, please see these Linux commit headers:
+- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
+ to take a struct kiocb")
+- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
+ RWF_DONTCACHE")
+- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
+
+NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
+filesystem doesn't indicate support by setting FOP_DONTCACHE.
+
+NFSD DIRECT
+===========
+
+DIRECT IO doesn't make use of the page cache; as such, it is able to
+avoid the Linux memory management's page reclaim scalability problems
+without resorting to the hybrid use of page cache that DONTCACHE does.
+
+Some workloads benefit from NFSD avoiding the page cache, particularly
+those with a working set that is significantly larger than available
+system memory. The pathological worst-case workload that NFSD DIRECT has
+proven to help most is: NFS client issuing large sequential IO to a file
+that is 2-3 times larger than the NFS server's available system memory.
+The reason for such improvement is that NFSD DIRECT eliminates a lot of work
+that the memory management subsystem would otherwise be required to
+perform (e.g. page allocation, dirty writeback, page reclaim). When
+using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
+time trying to find adequate free pages so that forward IO progress can
+be made.
+
+The performance win associated with using NFSD DIRECT was previously
+discussed on linux-nfs, see:
+https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
+But in summary:
+- NFSD DIRECT can significantly reduce memory requirements
+- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
+- NFSD DIRECT can offer more deterministic IO performance
+
+As always, your mileage may vary and so it is important to carefully
+consider if/when it is beneficial to make use of NFSD DIRECT. When
+assessing the comparative performance of your workload, please be sure
+to log relevant performance metrics during testing (e.g. memory usage,
+cpu usage, IO performance). Using perf to collect profile data that can
+be used to generate a "flamegraph" of the work Linux must perform on
+behalf of your test is a meaningful way to compare the relative health
+of the system and how switching NFSD's IO mode changes what is observed.
+
+If NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
+interfaces, ideally the IO will be aligned relative to
+the underlying block device's logical_block_size. Also the memory buffer
+used to store the READ or WRITE payload must be aligned relative to the
+underlying block device's dma_alignment.
+
+But NFSD DIRECT does handle IO that is misaligned for O_DIRECT as best
+it can:
+
+Misaligned READ:
+ If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
+ DIO-aligned block (on either end of the READ). The expanded READ is
+ verified to have proper offset/len (logical_block_size) and
+ dma_alignment checking.
+
+Misaligned WRITE:
+ If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
+ middle and end as needed. The large middle segment is DIO-aligned
+ and the start and/or end are misaligned. Buffered IO is used for the
+ misaligned segments and O_DIRECT is used for the middle DIO-aligned
+ segment. DONTCACHE buffered IO is _not_ used for the misaligned
+ segments because using normal buffered IO offers significant RMW
+ performance benefit when handling streaming misaligned WRITEs.
+
+Tracing:
+ The nfsd_read_direct trace event shows how NFSD expands any
+ misaligned READ to the next DIO-aligned block (on either end of the
+ original READ, as needed).
+
+ This combination of trace events is useful for READs:
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
+ echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
+
+ The nfsd_write_direct trace event shows how NFSD splits a given
+ misaligned WRITE into a DIO-aligned middle segment.
+
+ This combination of trace events is useful for WRITEs:
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
+ echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
--
2.51.0
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 15:34 ` [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-11-07 15:39 ` Christoph Hellwig
2025-11-07 15:40 ` Chuck Lever
2025-11-07 20:28 ` Chuck Lever
2025-11-07 17:18 ` Mike Snitzer
2025-11-07 22:13 ` NeilBrown
2 siblings, 2 replies; 32+ messages in thread
From: Christoph Hellwig @ 2025-11-07 15:39 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Mike Snitzer, Chuck Lever
On Fri, Nov 07, 2025 at 10:34:21AM -0500, Chuck Lever wrote:
> +no_dio:
> + /*
> + * No DIO alignment possible - pack into single non-DIO segment.
> + * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
> + */
> + nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total, 0,
> + total, iocb);
I'd like to sort out the discussion on why to set IOCB_DONTCACHE when
nothing is aligned, but not for the non-aligned parts as that is
extremely counter-intuitive. From the other thread it might be because
the test case used to justify it is very unaligned and caching partial
pages is helpful, but if that is indeed the case the right check would
be for writes that are file offset unaligned vs the page or max folio
size and not about being able to do parts of it as direct I/O.
Either way a tweak to suddenly use cached buffered I/O when the mode
asks for direct should have a comment explaining the justification
and explain the rationale instead of rushing it in.
The rest of the patch looks good to me.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 15:39 ` Christoph Hellwig
@ 2025-11-07 15:40 ` Chuck Lever
2025-11-07 20:05 ` Mike Snitzer
2025-11-07 20:28 ` Chuck Lever
1 sibling, 1 reply; 32+ messages in thread
From: Chuck Lever @ 2025-11-07 15:40 UTC (permalink / raw)
To: Christoph Hellwig
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Mike Snitzer, Chuck Lever
On 11/7/25 10:39 AM, Christoph Hellwig wrote:
> On Fri, Nov 07, 2025 at 10:34:21AM -0500, Chuck Lever wrote:
>> +no_dio:
>> + /*
>> + * No DIO alignment possible - pack into single non-DIO segment.
>> + * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
>> + */
>> + nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total, 0,
>> + total, iocb);
>
> I'd like to sort out the discussion on why to set IOCB_DONTCACHE when
> nothing is aligned, but not for the non-aligned parts as that is
> extremely counter-intuitive. From the other thread it might be because
> the test case used to justify it is very unaligned and caching partial
> pages is helpful, but if that is indeed the case the right check would
> be for writes that are file offset unaligned vs the page or max folio
> size and not about being able to do parts of it as direct I/O.
>
> Either way a tweak to suddenly use cached buffered I/O when the mode
> asks for direct should have a comment explaining the justification
> and explain the rationale instead of rushing it in.
Agreed. The cover letter noted that this is still controversial.
--
Chuck Lever
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 15:34 ` [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:39 ` Christoph Hellwig
@ 2025-11-07 17:18 ` Mike Snitzer
2025-11-07 22:13 ` NeilBrown
2 siblings, 0 replies; 32+ messages in thread
From: Mike Snitzer @ 2025-11-07 17:18 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Chuck Lever
On Fri, Nov 07, 2025 at 10:34:21AM -0500, Chuck Lever wrote:
> From: Mike Snitzer <snitzer@kernel.org>
>
> When NFSD_IO_DIRECT is selected via the
> /sys/kernel/debug/nfsd/io_cache_write experimental tunable, split
> incoming unaligned NFS WRITE requests into a prefix, middle and
> suffix segment, as needed. The middle segment is now DIO-aligned and
> the prefix and/or suffix are unaligned. Synchronous buffered IO is
> used for the unaligned segments, and IOCB_DIRECT is used for the
> middle DIO-aligned extent.
>
> Although IOCB_DIRECT avoids the use of the page cache, by itself it
> doesn't guarantee data durability. For UNSTABLE WRITE requests,
> durability is obtained by a subsequent NFS COMMIT request.
>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> Co-developed-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/debugfs.c | 1 +
> fs/nfsd/trace.h | 1 +
> fs/nfsd/vfs.c | 140 ++++++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 138 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 00eb1ecef6ac..7f44689e0a53 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
> switch (val) {
> case NFSD_IO_BUFFERED:
> case NFSD_IO_DONTCACHE:
> + case NFSD_IO_DIRECT:
> nfsd_io_cache_write = val;
> break;
> default:
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index 85a1521ad757..8047a6d97b81 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -469,6 +469,7 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
> DEFINE_NFSD_IO_EVENT(read_done);
> DEFINE_NFSD_IO_EVENT(write_start);
> DEFINE_NFSD_IO_EVENT(write_opened);
> +DEFINE_NFSD_IO_EVENT(write_direct);
> DEFINE_NFSD_IO_EVENT(write_io_done);
> DEFINE_NFSD_IO_EVENT(write_done);
> DEFINE_NFSD_IO_EVENT(commit_start);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 5333d49910d9..7e56be190170 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1254,6 +1254,131 @@ static int wait_for_concurrent_writes(struct file *file)
> return err;
> }
>
> +struct nfsd_write_dio_seg {
> + struct iov_iter iter;
> + int flags;
> +};
> +
> +static unsigned long
> +iov_iter_bvec_offset(const struct iov_iter *iter)
> +{
> + return (unsigned long)(iter->bvec->bv_offset + iter->iov_offset);
> +}
> +
> +static void
> +nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
> + struct bio_vec *bvec, unsigned int nvecs,
> + unsigned long total, size_t start, size_t len,
> + struct kiocb *iocb)
> +{
> + iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
> + if (start)
> + iov_iter_advance(&segment->iter, start);
> + iov_iter_truncate(&segment->iter, len);
> + segment->flags = iocb->ki_flags;
> +}
> +
> +static unsigned int
> +nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
> + unsigned int nvecs, struct kiocb *iocb,
> + unsigned long total,
> + struct nfsd_write_dio_seg segments[3])
> +{
> + u32 offset_align = nf->nf_dio_offset_align;
> + loff_t prefix_end, orig_end, middle_end;
> + u32 mem_align = nf->nf_dio_mem_align;
> + size_t prefix, middle, suffix;
> + loff_t offset = iocb->ki_pos;
> + unsigned int nsegs = 0;
> +
> + /*
> + * Check if direct I/O is feasible for this write request.
> + * If alignments are not available, the write is too small,
> + * or no alignment can be found, fall back to buffered I/O.
> + */
> + if (unlikely(!mem_align || !offset_align) ||
> + unlikely(total < max(offset_align, mem_align)))
> + goto no_dio;
> +
> + prefix_end = round_up(offset, offset_align);
> + orig_end = offset + total;
> + middle_end = round_down(orig_end, offset_align);
> +
> + prefix = prefix_end - offset;
> + middle = middle_end - prefix_end;
> + suffix = orig_end - middle_end;
> +
> + if (!middle)
> + goto no_dio;
> +
> + if (prefix)
> + nfsd_write_dio_seg_init(&segments[nsegs++], bvec,
> + nvecs, total, 0, prefix, iocb);
> +
> + nfsd_write_dio_seg_init(&segments[nsegs], bvec, nvecs,
> + total, prefix, middle, iocb);
> +
> + /*
> + * Check if the bvec iterator is aligned for direct I/O.
> + *
> + * bvecs generated from RPC receive buffers are contiguous: After
> + * the first bvec, all subsequent bvecs start at bv_offset zero
> + * (page-aligned). Therefore, only the first bvec is checked.
> + */
> + if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1))
> + goto no_dio;
> + segments[nsegs].flags |= IOCB_DIRECT;
> + nsegs++;
> +
> + if (suffix)
> + nfsd_write_dio_seg_init(&segments[nsegs++], bvec, nvecs, total,
> + prefix + middle, suffix, iocb);
> +
> + return nsegs;
> +
> +no_dio:
> + /*
> + * No DIO alignment possible - pack into single non-DIO segment.
> + * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
> + */
> + nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total, 0,
> + total, iocb);
> + if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> + segments[nsegs].flags |= IOCB_DONTCACHE;
This needs to be: segments[0].flags |= IOCB_DONTCACHE;
Mike
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 15:40 ` Chuck Lever
@ 2025-11-07 20:05 ` Mike Snitzer
2025-11-07 20:08 ` Chuck Lever
0 siblings, 1 reply; 32+ messages in thread
From: Mike Snitzer @ 2025-11-07 20:05 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Fri, Nov 07, 2025 at 10:40:56AM -0500, Chuck Lever wrote:
> On 11/7/25 10:39 AM, Christoph Hellwig wrote:
> > On Fri, Nov 07, 2025 at 10:34:21AM -0500, Chuck Lever wrote:
> >> +no_dio:
> >> + /*
> >> + * No DIO alignment possible - pack into single non-DIO segment.
> >> + * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
> >> + */
> >> + nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total, 0,
> >> + total, iocb);
> >
> > I'd like to sort out the discussion on why to set IOCB_DONTCACHE when
> > nothing is aligned, but not for the non-aligned parts as that is
> > extremely counter-intuitive. From the other thread it might be because
> > the test case used to justify it is very unaligned and caching partial
> > pages is helpful, but if that is indeed the case the right check would
> > be for writes that are file offset unaligned vs the page or max folio
> > size and not about being able to do parts of it as direct I/O.
NFSD_IO_DIRECT's splitting an unaligned WRITE into 3 segments is a
hybrid IO approach that consciously (_from initial design_) leverages
the page cache for the unaligned ends.
1: The ends of an unaligned write are certainly file offset unaligned.
2: The ends are also less than a page, which implies they are
unaligned in memory.
Anyway, categorizing the misalignment further enables you to do what
differently? Any misalignment requires use of buffered IO.
At issue is whether we use DONTCACHE's drop-behind semantic or not.
That drop-behind is clearly unhelpful for the ends of each streaming
unaligned WRITE.
The page cache is beneficial for NFSD_IO_DIRECT's subpage writes on
either end of a DIO aligned middle (at most 2 pages left in page cache
for unaligned IO). So _not_ using DONTCACHE is a key point of the
hybrid IO model NFSD_IO_DIRECT provides for handling unaligned IO.
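To make that concrete, here is a hypothetical streaming pattern (all
offsets invented for illustration, offset_align = 4096, 4 KiB pages):

    /* WRITE 1: offset 1000,    len 1048576; buffered suffix covers
     *          bytes 1048576..1049575 of the file */
    /* WRITE 2: offset 1049576, len 1048576; buffered prefix covers
     *          bytes 1049576..1052671 of the file */

WRITE 2's prefix dirties the same page (1048576..1052671) that WRITE
1's suffix already brought into the page cache, so keeping that page
cached (no drop-behind) turns the second RMW into a cache hit instead
of a read from disk.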
> > Either way a tweak to suddenly use cached buffered I/O when the mode
> > asks for direct should have a comment explaining the justification
> > and explain the rationale instead of rushing it in.
Use of cached buffered IO wasn't sudden or rushed -- it restores a key
original design point of NFSD_IO_DIRECT (from back in July/August).
While I can appreciate your confusion, given NFSD_IO_DIRECT's intent
of avoiding caching, DONTCACHE's ability to preserve the general
"avoid caching" semantic of O_DIRECT is really beside the point
because... NFSD_IO_DIRECT does need to use the page cache for its
unaligned end segments _and_ it actually benefits from page caching.
And I completely agree this needs a comment. Hopefully adding
something like the collowing comment before "return nsegs" in
nfsd_write_dio_iters_init() helps:
/*
* Normal buffered IO is used for the unaligned prefix/suffix
* because these subpage writes can benefit from the page cache
* (e.g. for optimally handling streaming unaligned WRITEs).
*/
> Agreed. The cover letter noted that this is still controversial.
Only see this, must be missing what you're referring to:
Changes since v9:
* Unaligned segments no longer use IOCB_DONTCACHE
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 20:05 ` Mike Snitzer
@ 2025-11-07 20:08 ` Chuck Lever
2025-11-07 20:10 ` Mike Snitzer
0 siblings, 1 reply; 32+ messages in thread
From: Chuck Lever @ 2025-11-07 20:08 UTC (permalink / raw)
To: Mike Snitzer
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On 11/7/25 3:05 PM, Mike Snitzer wrote:
>> Agreed. The cover letter noted that this is still controversial.
> Only see this, must be missing what you're referring to:
>
> Changes since v9:
> * Unaligned segments no longer use IOCB_DONTCACHE
From the v11 cover letter:
> One controversy remains: Whether to set DONTCACHE for the unaligned
> segments.
--
Chuck Lever
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 20:08 ` Chuck Lever
@ 2025-11-07 20:10 ` Mike Snitzer
2025-11-07 21:58 ` NeilBrown
0 siblings, 1 reply; 32+ messages in thread
From: Mike Snitzer @ 2025-11-07 20:10 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Fri, Nov 07, 2025 at 03:08:11PM -0500, Chuck Lever wrote:
> On 11/7/25 3:05 PM, Mike Snitzer wrote:
> >> Agreed. The cover letter noted that this is still controversial.
> > Only see this, must be missing what you're referring to:
> >
> > Changes since v9:
> > * Unaligned segments no longer use IOCB_DONTCACHE
>
> From the v11 cover letter:
>
> > One controversy remains: Whether to set DONTCACHE for the unaligned
> > segments.
Ha, blind as a bat...
hopefully the rest of my reply helps dispel the controversy
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 15:39 ` Christoph Hellwig
2025-11-07 15:40 ` Chuck Lever
@ 2025-11-07 20:28 ` Chuck Lever
2025-11-07 22:16 ` Mike Snitzer
2025-11-10 9:17 ` Christoph Hellwig
1 sibling, 2 replies; 32+ messages in thread
From: Chuck Lever @ 2025-11-07 20:28 UTC (permalink / raw)
To: Christoph Hellwig
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Mike Snitzer, Chuck Lever
On 11/7/25 10:39 AM, Christoph Hellwig wrote:
> On Fri, Nov 07, 2025 at 10:34:21AM -0500, Chuck Lever wrote:
>> +no_dio:
>> + /*
>> + * No DIO alignment possible - pack into single non-DIO segment.
>> + * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
>> + */
>> + nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total, 0,
>> + total, iocb);
>
> I'd like to sort out the discussion on why to set IOCB_DONTCACHE when
> nothing is aligned, but not for the non-aligned parts as that is
> extremely counter-intuitive. From the other thread it might be because
> the test case used to justify it is very unaligned and caching partial
> pages is helpful, but if that is indeed the case the right check would
> be for writes that are file offset unaligned vs the page or max folio
> size and not about being able to do parts of it as direct I/O.
>
> Either way a tweak to suddenly use cached buffered I/O when the mode
> asks for direct should have a comment explaining the justification
> and explain the rationale instead of rushing it in.
+1 for an explanatory comment, but I'm not getting what is "counter-
intuitive" about leaving the content of the end segments in the page
cache. The end segments are small and simply cannot be handled by direct
I/O.
I raised a similar concern about whether NFSD needs to care about highly
unaligned NFS WRITE requests performing well. I'm not convinced that a
performance argument is an appropriate rationale for not using
DONTCACHE on the ends.
Upshot is I'm on the fence here; I can see both sides of this
controversy.
--
Chuck Lever
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 20:10 ` Mike Snitzer
@ 2025-11-07 21:58 ` NeilBrown
2025-11-07 22:24 ` Mike Snitzer
0 siblings, 1 reply; 32+ messages in thread
From: NeilBrown @ 2025-11-07 21:58 UTC (permalink / raw)
To: Mike Snitzer
Cc: Chuck Lever, Christoph Hellwig, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Sat, 08 Nov 2025, Mike Snitzer wrote:
> On Fri, Nov 07, 2025 at 03:08:11PM -0500, Chuck Lever wrote:
> > On 11/7/25 3:05 PM, Mike Snitzer wrote:
> > >> Agreed. The cover letter noted that this is still controversial.
> > > Only see this, must be missing what you're referring to:
> > >
> > > Changes since v9:
> > > * Unaligned segments no longer use IOCB_DONTCACHE
> >
> > From the v11 cover letter:
> >
> > > One controversy remains: Whether to set DONTCACHE for the unaligned
> > > segments.
>
> Ha, blind as a bat...
>
> hopefully the rest of my reply helps dispel the controversy
>
Unfortunately I don't think it does. I don't think it even addresses
it.
What Christoph said was:
I'd like to sort out the discussion on why to set IOCB_DONTCACHE when
nothing is aligned, but not for the non-aligned parts as that is
extremely counter-intuitive.
You gave a lengthy (and probably valid) description on why "not for the
non-aligned parts" but don't seem to address the "why to set
IOCB_DONTCACHE when nothing is aligned" bit.
I can see Christoph's point.
Can you please explain why all the arguments you have for avoiding
IOCB_DONTCACHE on the non-aligned head and tail do not equally apply
when the whole body is assessed as non-aligned. i.e. the no_dio label
in nfsd_write_dio_iters_init().
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 15:34 ` [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:39 ` Christoph Hellwig
2025-11-07 17:18 ` Mike Snitzer
@ 2025-11-07 22:13 ` NeilBrown
2 siblings, 0 replies; 32+ messages in thread
From: NeilBrown @ 2025-11-07 22:13 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Mike Snitzer, Chuck Lever
On Sat, 08 Nov 2025, Chuck Lever wrote:
> From: Mike Snitzer <snitzer@kernel.org>
>
> When NFSD_IO_DIRECT is selected via the
> /sys/kernel/debug/nfsd/io_cache_write experimental tunable, split
> incoming unaligned NFS WRITE requests into a prefix, middle and
> suffix segment, as needed. The middle segment is now DIO-aligned and
> the prefix and/or suffix are unaligned. Synchronous buffered IO is
> used for the unaligned segments, and IOCB_DIRECT is used for the
> middle DIO-aligned extent.
>
> Although IOCB_DIRECT avoids the use of the page cache, by itself it
> doesn't guarantee data durability. For UNSTABLE WRITE requests,
> durability is obtained by a subsequent NFS COMMIT request.
>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> Co-developed-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> fs/nfsd/debugfs.c | 1 +
> fs/nfsd/trace.h | 1 +
> fs/nfsd/vfs.c | 140 ++++++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 138 insertions(+), 4 deletions(-)
This all looks nice and clean and easy to follow now - thanks.
Excepting the caveats you mentioned in the introductory note:
Reviewed-by: NeilBrown <neil@brown.name>
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 20:28 ` Chuck Lever
@ 2025-11-07 22:16 ` Mike Snitzer
2025-11-10 9:12 ` Christoph Hellwig
2025-11-10 9:17 ` Christoph Hellwig
1 sibling, 1 reply; 32+ messages in thread
From: Mike Snitzer @ 2025-11-07 22:16 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Fri, Nov 07, 2025 at 03:28:06PM -0500, Chuck Lever wrote:
> On 11/7/25 10:39 AM, Christoph Hellwig wrote:
> > On Fri, Nov 07, 2025 at 10:34:21AM -0500, Chuck Lever wrote:
> >> +no_dio:
> >> + /*
> >> + * No DIO alignment possible - pack into single non-DIO segment.
> >> + * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT.
> >> + */
> >> + nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total, 0,
> >> + total, iocb);
> >
> > I'd like to sort out the discussion on why to set IOCB_DONTCACHE when
> > nothing is aligned, but not for the non-aligned parts as that is
> > extremely counter-intuitive. From the other thread it might be because
> > the test case used to justify it is very unaligned and caching partial
> > pages is helpful, but if that is indeed the case the right check would
> > be for writes that are file offset unaligned vs the page or max folio
> > size and not about being able to do parts of it as direct I/O.
> >
> > Either way a tweak to suddenly use cached buffered I/O when the mode
> > asks for direct should have a comment explaining the justification
> > and explain the rationale instead of rushing it in.
>
> +1 for an explanatory comment, but I'm not getting what is "counter-
> intuitive" about leaving the content of the end segments in the page
> cache. The end segments are small and simply cannot be handled by direct
> I/O.
>
> I raised a similar concern about whether NFSD needs to care about highly
> unaligned NFS WRITE requests performing well. I'm not convinced that a
> performance argument is an appropriate rationale for not using
> DONTCACHE on the ends.
Those ends lend themselves to benefitting from more capable RMW
if/when needed. All it takes to justify not using DONTCACHE is one
workload that benefits (even if suboptimal by nature) given there is
no apparent downside (other than requiring we document/comment the
behavior).
Or is this something we make tunable?
NFSD_IO_DIRECT_DONTCACHE_UNALIGNED?
I don't think it's needed, but maybe for ultra-small systems with next
to no memory that must avoid using memory at all costs (including
performance)?
Maybe we wait until someone asks for that? ;)
> Upshot is I'm on the fence here; I can see both sides of this
> controversy.
NFSD_IO_DIRECT using cached buffered IO in this subpage write case makes
enabling NFSD_IO_DIRECT by default more acceptable _because_ it
doesn't cause needless performance problems (that we know of).
Streaming misaligned WRITEs is the only workload I'm aware of where
NFSD_IO_DIRECT can be made noticeably worse in comparison to
NFSD_IO_BUFFERED (but I haven't done a bunch of small IO testing).
That's a pretty good place for NFSD_IO_DIRECT given the fix is a
really straightforward tradeoff decision.
The MM subsystem handles order-0 page allocation and reclaim really
well, making NFSD_IO_DIRECT's 3 segment hybrid IO model quite capable
even though we hope applications don't force NFSD to use it (meaning
the client application takes care to send DIO-aligned IO).
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 21:58 ` NeilBrown
@ 2025-11-07 22:24 ` Mike Snitzer
2025-11-07 23:42 ` NeilBrown
0 siblings, 1 reply; 32+ messages in thread
From: Mike Snitzer @ 2025-11-07 22:24 UTC (permalink / raw)
To: NeilBrown
Cc: Chuck Lever, Christoph Hellwig, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Sat, Nov 08, 2025 at 08:58:56AM +1100, NeilBrown wrote:
> On Sat, 08 Nov 2025, Mike Snitzer wrote:
> > On Fri, Nov 07, 2025 at 03:08:11PM -0500, Chuck Lever wrote:
> > > On 11/7/25 3:05 PM, Mike Snitzer wrote:
> > > >> Agreed. The cover letter noted that this is still controversial.
> > > > Only see this, must be missing what you're referring to:
> > > >
> > > > Changes since v9:
> > > > * Unaligned segments no longer use IOCB_DONTCACHE
> > >
> > > From the v11 cover letter:
> > >
> > > > One controversy remains: Whether to set DONTCACHE for the unaligned
> > > > segments.
> >
> > Ha, blind as a bat...
> >
> > hopefully the rest of my reply helps dispel the controversy
> >
>
> Unfortunately I don't think it does. I don't think it even addresses
> it.
>
> What Christoph said was:
>
> I'd like to sort out the discussion on why to set IOCB_DONTCACHE when
> nothing is aligned, but not for the non-aligned parts as that is
> extremely counter-intuitive.
>
> You gave a lengthy (and probably valid) description on why "not for the
> non-aligned parts" but don't seem to address the "why to set
> IOCB_DONTCACHE when nothing is aligned" bit.
Because if the entire IO is so poorly formed that it cannot be issued
with DIO, then at least we can provide the spirit of what was requested
by the admin who cared to enable NFSD_IO_DIRECT.
Those IOs could be quite large... full WRITE payload. I have a
reproducer if you'd like it!
Limiting extensive use of cached buffered IO is pretty important if
the admin enabled NFSD_IO_DIRECT.
A misaligned WRITE's non-aligned head and tail are _not_ extensive
amounts of memory. They are at most 2 individual pages.
> I can see Christoph's point.
>
> Can you please explain why all the arguments you have for avoiding
> IOCB_DONTCACHE on the non-aligned head and tail do not equally apply
> when the whole body is assessed as non-aligned. i.e. the no_dio label
> in nfsd_write_dio_iters_init().
Sure, hopefully I just did above.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 22:24 ` Mike Snitzer
@ 2025-11-07 23:42 ` NeilBrown
2025-11-08 2:01 ` Mike Snitzer
0 siblings, 1 reply; 32+ messages in thread
From: NeilBrown @ 2025-11-07 23:42 UTC (permalink / raw)
To: Mike Snitzer
Cc: Chuck Lever, Christoph Hellwig, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Sat, 08 Nov 2025, Mike Snitzer wrote:
> On Sat, Nov 08, 2025 at 08:58:56AM +1100, NeilBrown wrote:
> > On Sat, 08 Nov 2025, Mike Snitzer wrote:
> > > On Fri, Nov 07, 2025 at 03:08:11PM -0500, Chuck Lever wrote:
> > > > On 11/7/25 3:05 PM, Mike Snitzer wrote:
> > > > >> Agreed. The cover letter noted that this is still controversial.
> > > > > Only see this, must be missing what you're referring to:
> > > > >
> > > > > Changes since v9:
> > > > > * Unaligned segments no longer use IOCB_DONTCACHE
> > > >
> > > > From the v11 cover letter:
> > > >
> > > > > One controversy remains: Whether to set DONTCACHE for the unaligned
> > > > > segments.
> > >
> > > Ha, blind as a bat...
> > >
> > > hopefully the rest of my reply helps dispel the controversy
> > >
> >
> > Unfortunately I don't think it does. I don't think it even addresses
> > it.
> >
> > What Christoph said was:
> >
> > I'd like to sort out the discussion on why to set IOCB_DONTCACHE when
> > nothing is aligned, but not for the non-aligned parts as that is
> > extremely counter-intuitive.
> >
> > You gave a lengthy (and probably valid) description on why "not for the
> > non-aligned parts" but don't seem to address the "why to set
> > IOCB_DONTCACHE when nothing is aligned" bit.
>
> Because if the entire IO is so poorly formed that it cannot be issued
> with DIO then at least we can provide the spirit of what was requested
> from the admin having cared to enable NFSD_IO_DIRECT.
"so poorly formed"... How can IO be poorly formed in a way that is
worse than not being aligned?
>
> Those IOs could be quite large.. full WRITE payload. I have a
> reproducer if you'd like it!
Can they? How does that work?
From looking at the code I can only see that "Those IOs" are too small
to have any whole pages in them.
Maybe we need a clear description of the sort of IOs that are big enough
that they impose a performance cost on the page-cache, but that somehow
cannot be usefully split into a prefix/middle/suffix where prefix and
suffix are less than a page each.
Thanks,
NeilBrown
>
> Limiting extensive use of caching buffered IO is pretty important if
> the admin enabled NFSD_IO_DIRECT.
>
> A misaligned WRITE's non-aligned head and tail are _not_ extensive
> amounts of mmeory. They are at most 2 individial pages.
>
> > I can see Christoph's point.
> >
> > Can you please explain why all the arguments you have for avoiding
> > IOCB_DONTCACHE on the non-aligned head and tail do not equally apply
> > when the whole body is assessed as non-aligned. i.e. the no_dio label
> > in nfsd_write_dio_iters_init().
>
> Sure, hopefully I just did above.
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 23:42 ` NeilBrown
@ 2025-11-08 2:01 ` Mike Snitzer
2025-11-10 16:41 ` Chuck Lever
0 siblings, 1 reply; 32+ messages in thread
From: Mike Snitzer @ 2025-11-08 2:01 UTC (permalink / raw)
To: NeilBrown
Cc: Chuck Lever, Christoph Hellwig, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Sat, Nov 08, 2025 at 10:42:27AM +1100, NeilBrown wrote:
> On Sat, 08 Nov 2025, Mike Snitzer wrote:
> > On Sat, Nov 08, 2025 at 08:58:56AM +1100, NeilBrown wrote:
> > > On Sat, 08 Nov 2025, Mike Snitzer wrote:
> > > > On Fri, Nov 07, 2025 at 03:08:11PM -0500, Chuck Lever wrote:
> > > > > On 11/7/25 3:05 PM, Mike Snitzer wrote:
> > > > > >> Agreed. The cover letter noted that this is still controversial.
> > > > > > Only see this, must be missing what you're referring to:
> > > > > >
> > > > > > Changes since v9:
> > > > > > * Unaligned segments no longer use IOCB_DONTCACHE
> > > > >
> > > > > From the v11 cover letter:
> > > > >
> > > > > > One controversy remains: Whether to set DONTCACHE for the unaligned
> > > > > > segments.
> > > >
> > > > Ha, blind as a bat...
> > > >
> > > > hopefully the rest of my reply helps dispel the controversy
> > > >
> > >
> > > Unfortunately I don't think it does. I don't think it even addresses
> > > it.
> > >
> > > What Christoph said was:
> > >
> > > I'd like to sort out the discussion on why to set IOCB_DONTCACHE when
> > > nothing is aligned, but not for the non-aligned parts as that is
> > > extremely counter-intuitive.
> > >
> > > You gave a lengthy (and probably valid) description on why "not for the
> > > non-aligned parts" but don't seem to address the "why to set
> > > IOCB_DONTCACHE when nothing is aligned" bit.
> >
> > Because if the entire IO is so poorly formed that it cannot be issued
> > with DIO then at least we can provide the spirit of what was requested
> > from the admin having cared to enable NFSD_IO_DIRECT.
>
> "so poorly formed"... How can IO be poorly formed in a way that is
> worse than not being aligned?
I was talking about the no_dio case when iov_iter_bvec_offset() isn't
aligned (case 2 below). The entire WRITE must use buffered IO even
though NFSD_IO_DIRECT is configured.
> > Those IOs could be quite large.. full WRITE payload. I have a
> > reproducer if you'd like it!
>
> Can they? How does that work?
> From looking at the code I can only see that "Those IOs" are too small
> to have any whole pages in them.
Huh? You're clearly confused by the 2 cases of NFSD_IO_DIRECT's
unaligned WRITE handling (as evidenced by you solidly mixing them).
1: WRITE is split, has DIO-aligned middle and prefix and/or suffix.
prefix/suffix are the subpage IOs of a misaligned WRITE that _does_
have a DIO-aligned middle, I covered that in detail in this email:
https://lore.kernel.org/linux-nfs/aQ5Q99Kvw0ZE09Th@kernel.org/
it left you wanting, thinking I missed Christoph's point (I didn't).
2: WRITE cannot be split such that middle is DIO-aligned.
The case you wanted me to elaborate on, I did so here:
https://lore.kernel.org/linux-nfs/aQ5xkjIzf6uU_zLa@kernel.org/
when entire WRITE is misaligned it makes all the sense in the world to
use DONTCACHE. Chuck, Christoph and I all agree on this.
But just because it makes sense for case 2's (potentially large) WRITE
to not use cached buffered IO if NFSD_IO_DIRECT, it doesn't logically
follow that case 1's two individual pages for prefix/suffix must use
DONTCACHE buffered IO too.
Now do you see the so-called "controversy"?
Christoph expects case 1 to use DONTCACHE because it best reflects
O_DIRECT semantics (avoid using caching).
Here is another way to look at it:
Q: Case 2 uses DONTCACHE, so case 1 should too right?
A: NO. There is legit benefit to having case 1 use cached buffered IO
when issuing its 2 subpage IOs; and that benefit doesn't cause harm
because order-0 page management is not causing the MM problems that
NFSD_IO_DIRECT sets out to avoid (whereas higher-order cached buffered
IO is exactly what both DONTCACHE and NFSD_IO_DIRECT aim to avoid:
otherwise MM spins and spins trying to find adequate free pages, cannot
find them, and so does dirty writeback and reclaim, which causes kswapd
and kcompactd to burn cpu, etc).
NFSD_IO_DIRECT and DONTCACHE both want to avoid the pathological worst
cases of the MM subsystem when its coping with large IOs. But
NFSD_IO_DIRECT is doing so in terms of O_DIRECT whereas DONTCACHE is
approximating O_DIRECT (uses page cache with drop-behind semantics,
without concern for DIO alignment, etc). Both are effective in their
own way, but NFSD_IO_DIRECT only uniformly benefits from using
DONTCACHE buffered IO for case 2.
I'm advocating we make an informed decision for NFSD_IO_DIRECT to use
cached buffered IO for case 1's simple usage of order-0 allocations.
Let's please not get hung up on the intent of O_DIRECT because
NFSD_IO_DIRECT achieves that intent very well -- and by using cached
buffered IO it can avoid pathological performance problems (case 1's
handling of streaming misaligned WRITEs that _must_ do RMW).
> Maybe we need a clear description of the sort of IOs that are big enough
> that they impose a performance cost on the page-cache, but that somehow
> cannot be usefully split into a prefix/middle/suffix where prefix and
> suffix are less than a page each.
So defining case 2 again: entire WRITE payload is unaligned because
each aligned logical_block_size doesn't start on a dma_alignment
boundary in memory. (Also the definition of "the entire IO is so
poorly formed that it cannot be issued with DIO".)
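In code terms, case 2 is the check in nfsd_write_dio_iters_init()
(paraphrased from the patch, names as posted):

    /*
     * Even a WRITE whose file offset and length are block-aligned
     * cannot use IOCB_DIRECT if the payload's starting byte within
     * its first page is not a multiple of the device's dma_alignment.
     */
    if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1))
        goto no_dio;    /* the whole WRITE falls back to buffered IO */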
Mike
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 22:16 ` Mike Snitzer
@ 2025-11-10 9:12 ` Christoph Hellwig
2025-11-10 15:42 ` Mike Snitzer
0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2025-11-10 9:12 UTC (permalink / raw)
To: Mike Snitzer
Cc: Chuck Lever, Christoph Hellwig, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Fri, Nov 07, 2025 at 05:16:27PM -0500, Mike Snitzer wrote:
> Those ends lend themselves to benefitting from more capable RMW
> if/when needed.
How are the ends of a larger I/O inherently more likely to benefit
from cached RMW than, say, a single tiny I/O?
> All it takes to justify not using DONTCACHE is one
> workload that benefits (even if suboptimal by nature) given there is
> no apparent downside (other than requiring we document/comment the
> behavior).
Well, the entire reason for this patchset is the downside of cached
buffered I/O, isn't it? Now suddenly there are no downsides?
Also to go back to my previous mail: How can an I/O that is aligned
in file offset and length, but not in the backing memory, ever
benefit from this caching?
> Or is this something we make tunable?
> NFSD_IO_DIRECT_DONTCACHE_UNALIGNED?
Throwing in yet another tunable while not understanding the problem
at hand is about the worst possible outcome.
> Streaming misaligned WRITEs is the only workload I'm aware of where
> NFSD_IO_DIRECT can be made noticeably worse in comparison to
> NFSD_IO_BUFFERED (but I haven't done a bunch of small IO testing).
> That's a pretty good place for NFSD_IO_DIRECT given the fix is a
> really straightforward tradeoff decision.
Can we please stop talking about "misaligned" as a generic thing, when
there are two very clearly different cases of misalignment.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-07 20:28 ` Chuck Lever
2025-11-07 22:16 ` Mike Snitzer
@ 2025-11-10 9:17 ` Christoph Hellwig
2025-11-10 15:43 ` Mike Snitzer
1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2025-11-10 9:17 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, Mike Snitzer, Chuck Lever
On Fri, Nov 07, 2025 at 03:28:06PM -0500, Chuck Lever wrote:
> +1 for an explanatory comment, but I'm not getting what is "counter-
> intuitive" about leaving the content of the end segments in the page
> cache. The end segments are small and simply cannot be handled by direct
> I/O.
The downside is of course extra page cache usage / pollution. If these
segments are aligned to the file system minimum I/O size, but not to
the memory requirements, there is not going to be any RMW cycle for
which leaving them in the page cache would be useful.
Similarly, if the single segment is not aligned in the file's logical space,
chances of having another RMW come in are the same as for a large
one with unaligned head and/or tail.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-10 9:12 ` Christoph Hellwig
@ 2025-11-10 15:42 ` Mike Snitzer
2025-11-11 8:44 ` Christoph Hellwig
0 siblings, 1 reply; 32+ messages in thread
From: Mike Snitzer @ 2025-11-10 15:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chuck Lever, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
Tom Talpey, linux-nfs, Chuck Lever
On Mon, Nov 10, 2025 at 01:12:33AM -0800, Christoph Hellwig wrote:
> On Fri, Nov 07, 2025 at 05:16:27PM -0500, Mike Snitzer wrote:
> > Those ends lend themselves to benefitting from more capable RMW
> > if/when needed.
>
> How are the ends of a larger I/O inherently more likely to benefit
> from cached RMW than, say, a single tiny I/O?
Any unaligned WRITE, no matter the size of the WRITE, will benefit
from RMW of its subpage ends if the workload is streaming unaligned WRITEs.
I saw below that even saying "unaligned WRITE" leaves you annoyed.
Any time subpage ends are being issued by this code, they are part
of a larger "unaligned WRITE" that was able to be split and have up to
3 segments: unaligned subpage ends and a DIO-aligned middle.
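To illustrate the streaming case with a quick hypothetical client-side
sketch (not from the series): because the record size isn't a multiple
of the page size, the tail of WRITE N and the head of WRITE N+1 land in
the same server-side page, so keeping that page cached lets the next
RMW hit the cache:

#include <unistd.h>

#define RECORD_SIZE	1000000	/* deliberately not page-aligned */

/* Hypothetical streaming writer: every record boundary falls mid-page,
 * so consecutive WRITEs share a page at each boundary. */
static void stream_records(int fd, const char *buf, int nrecords)
{
	off_t pos = 0;
	int i;

	for (i = 0; i < nrecords; i++) {
		pwrite(fd, buf, RECORD_SIZE, pos);
		pos += RECORD_SIZE;
	}
}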
> > All it takes to justify not using DONTCACHE is one
> > workload that benefits (even if suboptimal by nature) given there is
> > no apparent downside (other than requiring we document/comment the
> > behavior).
>
> Well, the entire reason for this patchset is the downside of cached
> buffered I/O, isn't it? Now suddenly there are no downsides?
Every time I say stuff like "the entire reason for the patchset" I set
myself up for a rebuttal from you. You just did the same thing. I'm
going to try to be accommodating and cool, but please let's dial back
this weirdly caustic dynamic that is forming yet again.
Why is it that you cannot acknowledge that it isn't an all-or-nothing
situation? This isn't hard to see: there is clear benefit from using
cached buffered IO for the misaligned subpage ends of an unaligned WRITE
that can be split into 3 segments.
I provided the benchmark result that shows significant performance
improvement and Chuck pointed it out to you in another thread:
https://lore.kernel.org/linux-nfs/aQrsWnHK1ny3Md9g@kernel.org/
And I do understand Chuck's point of "but do we care about streaming
misaligned WRITE"? Given the code needs to simply use cached buffered
IO instead of DONTCACHE for at most 2 individual pages, I think it does
make sense to handle it.
Again, it's niche, but it's easily handled and there certainly is benefit
to be had.
Restating an important point underpinning my desire to handle this
niche unaligned streaming WRITE case efficiently: it removes the one
case I know of where NFSD_IO_DIRECT cannot compete with
NFSD_IO_BUFFERED, thereby making NFSD_IO_DIRECT a compelling
alternative default IO mode. To leave it using DONTCACHE would be
purposely compromising NFSD_IO_DIRECT to be less effective.
I'm also saying: there is very limited downside to using cached
buffered for these subpage end segments. But IF I'm proven wrong and
they do show themselves to be a problem in the future then a code
change will be needed.
> Also to go back to my previous mail: How can an I/O that is aligned
> in file offset and length, but not in the backing memory, ever
> benefit from this caching?
It wouldn't, because it would take the no_dio path and the entire IO
would be issued using DONTCACHE.
That I just had to point that out shows how absurd all this contrarian
hand-wringing has gotten.
> > Or is this something we make tunable?
> > NFSD_IO_DIRECT_DONTCACHE_UNALIGNED?
>
> Throwing in yet another tunable while not understanding the problem
> at hand is about the worst possible outcome.
Just because you don't understand something: that isn't my failing.
So your need to be the expert here is misplaced. Please show you
understand the problem before you project lack of understanding onto
me.
> > Streaming misaligned WRITEs is the only workload I'm aware of where
> > NFSD_IO_DIRECT can be made noticeably worse in comparison to
> > NFSD_IO_BUFFERED (but I haven't done a bunch of small IO testing).
> > That's a pretty good place for NFSD_IO_DIRECT given the fix is a
> > really straightforward tradeoff decision.
>
> Can we please stop talking about "misaligned" as a generic thing, when
> there are two very clearly different cases of misalignment.
Sure: a misaligned/unaligned WRITE is an NFS WRITE operation whose
payload is misaligned. Further categorization is done to
determine whether a given WRITE can benefit from being split.
If the WRITE can be split such that the middle is DIO-aligned, then it
benefits from using cached buffered IO for the subpage end segments.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-10 9:17 ` Christoph Hellwig
@ 2025-11-10 15:43 ` Mike Snitzer
0 siblings, 0 replies; 32+ messages in thread
From: Mike Snitzer @ 2025-11-10 15:43 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chuck Lever, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
Tom Talpey, linux-nfs, Chuck Lever
On Mon, Nov 10, 2025 at 01:17:19AM -0800, Christoph Hellwig wrote:
> On Fri, Nov 07, 2025 at 03:28:06PM -0500, Chuck Lever wrote:
> > +1 for an explanatory comment, but I'm not getting what is "counter-
> > intuitive" about leaving the content of the end segments in the page
> > cache. The end segments are small and simply cannot be handled by direct
> > I/O.
>
> The downside is of course extra page cache usage / pollution. If these
> segments are aligned to the file system minimum I/O size, but not to
> the memory requirements, there is not going to be any RMW cycle for
> which leaving them in the page cache would be useful.
Straw man: the entire IO would be issued using DONTCACHE (due to it
being "case 2"). You too keep conflating "case 1" and "case 2":
https://lore.kernel.org/linux-nfs/20251107153422.4373-1-cel@kernel.org/T/#mca19127d102dd7d24b2e44df4d9417b2cc61b340
On your broader point of "extra page cache usage / pollution": DONTCACHE
already uses the page cache; it just will (hopefully) drop-behind its
pages.
But for "case 1", yes if we use cached buffered IO for the subpage
ends _BUT_ the workload isn't ever going to actually need to RMW
(e.g. because it isn't part of a continuous stream of unaligned
WRITEs) then: Yes, those individual pages associated with the
unaligned ends will polute. BUT they are individual pages that MM is
quite good about handling.
> Similarly, if the single segment is not aligned in the file's logical space,
> chances of having another RMW come in are the same as for a large
> one with unaligned head and/or tail.
Not following you, please reword and be clearer.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-08 2:01 ` Mike Snitzer
@ 2025-11-10 16:41 ` Chuck Lever
2025-11-10 17:57 ` Mike Snitzer
2025-11-11 8:51 ` Christoph Hellwig
0 siblings, 2 replies; 32+ messages in thread
From: Chuck Lever @ 2025-11-10 16:41 UTC (permalink / raw)
To: Mike Snitzer, NeilBrown
Cc: Christoph Hellwig, Jeff Layton, Olga Kornievskaia, Dai Ngo,
Tom Talpey, linux-nfs, Chuck Lever
On 11/7/25 9:01 PM, Mike Snitzer wrote:
> Q: Case 2 uses DONTCACHE, so case 1 should too, right?
>
> A: NO. There is legitimate benefit to having case 1 use cached buffered IO
> when issuing its 2 subpage IOs; and that benefit doesn't cause harm,
> because order-0 page management does not cause the MM problems that
> NFSD_IO_DIRECT sets out to avoid. (Higher-order cached buffered
> IO is exactly what both DONTCACHE and NFSD_IO_DIRECT aim to avoid:
> otherwise MM spins and spins trying to find adequate free pages,
> cannot, and so falls into dirty writeback and reclaim, which causes
> kswapd and kcompactd to burn CPU, etc.)
Paraphrasing: Each unaligned end (case 1) is always smaller than a page,
thus it will stick with order-0 allocations (if that byte range is not
already in the page cache), allocations that are, practically speaking,
reliable.
However it might be a stretch to claim that an order-0 allocation
/never/ drives memory reclaim.
It still comes down to "it's faster and generally not harmful... and
clients have to issue WRITEs that are arbitrarily aligned, so let's help
them out."
What we still don't know is exactly what the extra cost of setting
DONTCACHE is, even on small writes. Maybe DONTCACHE should be cleared
for /all/ segments that are smaller than a page?
Sidebar: I resist calling such writes poorly formed or misaligned, as
there seems to be a little (unintended) moralism in those terms. Clients
must write only what data has changed. Aligning the payload byte ranges
(using RMW) is incredibly inefficient. So those writes are just as
correct and valid as writes that are optimally aligned.
When I hear "poorly formed" write, I think of an NFS WRITE that has
invalid arguments or corrupted XDR.
> Let's please not get hung up on the intent of O_DIRECT because
> NFSD_IO_DIRECT achieves that intent very well
I think we need to have a clear idea what that intent is, because it
is explicitly referenced in a code comment as the rationale for setting
DONTCACHE. It might be better to replace that comment with a reason for
using DONTCACHE that does not mention "the intent of using DIRECT".
Something like:
* Mark the I/O buffer as evict-able to reduce memory contention.
--
Chuck Lever
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-10 16:41 ` Chuck Lever
@ 2025-11-10 17:57 ` Mike Snitzer
2025-11-11 8:51 ` Christoph Hellwig
1 sibling, 0 replies; 32+ messages in thread
From: Mike Snitzer @ 2025-11-10 17:57 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Christoph Hellwig, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, axboe
On Mon, Nov 10, 2025 at 11:41:09AM -0500, Chuck Lever wrote:
> On 11/7/25 9:01 PM, Mike Snitzer wrote:
> > Q: Case 2 uses DONTCACHE, so case 1 should too, right?
> >
> > A: NO. There is legitimate benefit to having case 1 use cached buffered IO
> > when issuing its 2 subpage IOs; and that benefit doesn't cause harm,
> > because order-0 page management does not cause the MM problems that
> > NFSD_IO_DIRECT sets out to avoid. (Higher-order cached buffered
> > IO is exactly what both DONTCACHE and NFSD_IO_DIRECT aim to avoid:
> > otherwise MM spins and spins trying to find adequate free pages,
> > cannot, and so falls into dirty writeback and reclaim, which causes
> > kswapd and kcompactd to burn CPU, etc.)
>
> Paraphrasing: Each unaligned end (case 1) is always smaller than a page,
> thus it will stick with order-0 allocations (if that byte range is not
> already in the page cache), allocations that are, practically speaking,
> reliable.
>
> However it might be a stretch to claim that an order-0 allocation
> /never/ drives memory reclaim.
That's fair.
> It still comes down to "it's faster and generally not harmful... and
> clients have to issue WRITEs that are arbitrarily aligned, so let's help
> them out."
>
> What we still don't know is exactly what the extra cost of setting
> DONTCACHE is, even on small writes. Maybe DONTCACHE should be cleared
> for /all/ segments that are smaller than a page?
I think so. Maybe Jens has thoughts on this (now cc'd)?
> Sidebar: I resist calling such writes poorly formed or misaligned, as
> there seems to be a little (unintended) moralism in those terms. Clients
> must write only what data has changed. Aligning the payload byte ranges
> (using RMW) is incredibly inefficient. So those writes are just as
> correct and valid as writes that are optimally aligned.
>
> When I hear "poorly formed" write, I think of an NFS WRITE that has
> invalid arguments or corrupted XDR.
Very useful sidebar point.
NFSD is just trying to accommodate IO it has no choice but to service
from the NFS clients.
> > Let's please not get hung up on the intent of O_DIRECT because
> > NFSD_IO_DIRECT achieves that intent very well
>
> I think we need to have a clear idea what that intent is, because it
> is explicitly referenced in a code comment as the rationale for setting
> DONTCACHE.
Sure, the duality of:
DONTCACHE being useful for large buffered IO
versus
DONTCACHE being detrimental for subpage IO.
That duality just means using DONTCACHE isn't a silver bullet for when
we need to fall back to buffered IO (but our overall intent is to avoid
page cache contention).
> It might be better to replace that comment with a reason for
> using DONTCACHE that does not mention "the intent of using DIRECT".
> Something like:
>
> * Mark the I/O buffer as evict-able to reduce memory contention.
Sure, that'd work.
I'm really reassured by how well you understand all this, Chuck.
Restating for the benefit of others (Jens in particular):
My point was that even in the case where an IO is split into 3
segments (1: subpage head, 2: DIO-aligned middle, 3: subpage tail), the
intent of O_DIRECT (avoiding buffered IO's page cache use) is
met as best we can (ideally the IO is aligned so there are no subpage
end segments).
The other extreme this NFSD_IO_DIRECT mode must handle is when the
entire IO doesn't have any segment that is DIO-aligned, so it must be
issued with buffered IO; but we want to do so without opening ourselves
up to memory contention, and DONTCACHE gives us the best option for
those IOs.
Mike
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-10 15:42 ` Mike Snitzer
@ 2025-11-11 8:44 ` Christoph Hellwig
0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2025-11-11 8:44 UTC (permalink / raw)
To: Mike Snitzer
Cc: Christoph Hellwig, Chuck Lever, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever
On Mon, Nov 10, 2025 at 10:42:48AM -0500, Mike Snitzer wrote:
> On Mon, Nov 10, 2025 at 01:12:33AM -0800, Christoph Hellwig wrote:
> > On Fri, Nov 07, 2025 at 05:16:27PM -0500, Mike Snitzer wrote:
> > > Those ends lend themselves to benefitting from more capable RMW
> > > if/when needed.
> >
> > How are the ends of a larger I/O inherently more likely to benefit
> > from cached RMW than, say, a single tiny I/O?
>
> Any unaligned WRITE, no matter the size of the WRITE, will benefit
> from RMW of its subpage ends if the workload is streaming unaligned WRITEs.
Yes.
> I saw below that even saying "unaligned WRITE" leaves you annoyed.
No, it doesn't.
> Every time I say stuff like "the entire reason for the patchset" I set
> myself up for a rebuttal from you. You just did the same thing. I'm
> going to try to be accommodating and cool, but please let's dial back
> this weirdly caustic dynamic that is forming yet again.
>
> Why is it that you cannot acknowledge that it isn't an all-or-nothing
> situation?
I'm actually arguing for that, not you. You make claims of "the entire
reason", not me.
> Restating an important point underpinning my desire to handle this
> niche unaligned streaming WRITE case efficiently: it removes the one
> case I know of where NFSD_IO_DIRECT cannot compete with
> NFSD_IO_BUFFERED, thereby making NFSD_IO_DIRECT a compelling
> alternative default IO mode. To leave it using DONTCACHE would be
> purposely compromising NFSD_IO_DIRECT to be less effective.
>
> I'm also saying: there is very limited downside to using cached
> buffered for these subpage end segments. But IF I'm proven wrong and
> they do show themselves to be a problem in the future then a code
> change will be needed.
So spell them out in a comment, and leave nuggets for future readers.
And explain why they don't apply for a single segment that is not
page aligned, by either being smaller than a page or straddling two
pages.
>
> > Also to go back to my previous mail: How can an I/O that is aligned
> > in file offset and length, but not in the backing memory, ever
> > benefit from this caching?
>
> It wouldn't, because it would take the no_dio path and the entire IO
> would be issued using DONTCACHE.
Yes, I see that right now it does not. But I want to hear from you
why you think it is worth treating differently. Because there is no
rationale that a small unaligned write won't benefit from it. In fact
all conventional wisdom suggests it will benefit even more. And I've
been trying to explain that for a few days now, but somehow it doesn't
make it across.
>
> That I just had to point that out shows how absurd all this contrarian
> hand-wringing has gotten.
Can you please stop these attacks?
>
> > > Or is this something we make tunable?
> > > NFSD_IO_DIRECT_DONTCACHE_UNALIGNED?
> >
> > Throwing in yet another tunable while not understanding the problem
> > at hand is about the worst possible outcome.
>
> Just because you don't understand something: that isn't my failing.
> So your need to be the expert here is misplaced. Please show you
> understand the problem before you project lack of understanding onto
> me.
Well, if you do understand it you failed to explain it, which means
there is no _collective_ understanding. But to me this thread suggests
you don't understand it either, although I'd rather get to the bottom
of it than get stuck on who understands what.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-10 16:41 ` Chuck Lever
2025-11-10 17:57 ` Mike Snitzer
@ 2025-11-11 8:51 ` Christoph Hellwig
2025-11-11 14:20 ` Chuck Lever
2025-11-12 0:06 ` Mike Snitzer
1 sibling, 2 replies; 32+ messages in thread
From: Christoph Hellwig @ 2025-11-11 8:51 UTC (permalink / raw)
To: Chuck Lever
Cc: Mike Snitzer, NeilBrown, Christoph Hellwig, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs
On Mon, Nov 10, 2025 at 11:41:09AM -0500, Chuck Lever wrote:
> On 11/7/25 9:01 PM, Mike Snitzer wrote:
> > Q: Case 2 uses DONTCACHE, so case 1 should too, right?
> >
> > A: NO. There is legitimate benefit to having case 1 use cached buffered IO
> > when issuing its 2 subpage IOs; and that benefit doesn't cause harm,
> > because order-0 page management does not cause the MM problems that
> > NFSD_IO_DIRECT sets out to avoid. (Higher-order cached buffered
> > IO is exactly what both DONTCACHE and NFSD_IO_DIRECT aim to avoid:
> > otherwise MM spins and spins trying to find adequate free pages,
> > cannot, and so falls into dirty writeback and reclaim, which causes
> > kswapd and kcompactd to burn CPU, etc.)
>
> Paraphrasing: Each unaligned end (case 1) is always smaller than a page,
> thus it will stick with order-0 allocations (if that byte range is not
> already in the page cache), allocations that are, practically speaking,
> reliable.
>
> However it might be a stretch to claim that an order-0 allocation
> /never/ drives memory reclaim.
That is of course not true: order-0 pages do drive memory
reclaim. In fact, if you're on a file system without huge folios and
an application that doesn't use THP, the majority of reclaim is
driven by order-0 allocations.
> What we still don't know is exactly what the extra cost of setting
> DONTCACHE is, even on small writes. Maybe DONTCACHE should be cleared
> for /all/ segments that are smaller than a page?
I suspect the best initial tweak is for every segment or entire write
that is not page aligned in the file, as that is an indicator that
multiple RMW cycles are possible. At least if we're streaming, but
we don't have that information. That means all of these cases:
1) writes smaller than PAGE_SIZE
2) writes smaller than PAGE_SIZE * 2 but not aligned to PAGE_SIZE
3) unaligned end segments < PAGE_SIZE
If we want to fine tune, we'd probably expand case 2 a bit, as a single
page cache operation on an order-2 page is going to be faster than
three I/Os most of the time, but compared to the high level discussion
here that's minor details.
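Sketching that as a predicate (hypothetical and untested, helper name
made up; returns true when cached buffered I/O should be used instead
of DONTCACHE):

/* Hypothetical, untested predicate for the three cases above. */
static bool nfsd_write_wants_cached_buffered(loff_t offset, size_t len,
					     size_t end_seg_len)
{
	/* 1) writes smaller than PAGE_SIZE */
	if (len < PAGE_SIZE)
		return true;
	/* 2) writes smaller than PAGE_SIZE * 2, not aligned to PAGE_SIZE */
	if (len < PAGE_SIZE * 2 && (offset & (PAGE_SIZE - 1)))
		return true;
	/* 3) unaligned end segments smaller than PAGE_SIZE */
	return end_seg_len && end_seg_len < PAGE_SIZE;
}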
> Sidebar: I resist calling such writes poorly formed or misaligned, as
> there seems to be a little (unintended) moralism in those terms. Clients
> must write only what data has changed. Aligning the payload byte ranges
> (using RMW) is incredibly inefficient. So those writes are just as
> correct and valid as writes that are optimally aligned.
>
> When I hear "poorly formed" write, I think of an NFS WRITE that has
> invalid arguments or corrupted XDR.
On poorly formed I fully agree. Misaligned OTOH doesn't have such a
negative connotation to me. Of course "not page aligned vs the file
offset" is more descriptive, but it's also awfully long.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-11 8:51 ` Christoph Hellwig
@ 2025-11-11 14:20 ` Chuck Lever
2025-11-11 14:21 ` Christoph Hellwig
2025-11-12 0:06 ` Mike Snitzer
1 sibling, 1 reply; 32+ messages in thread
From: Chuck Lever @ 2025-11-11 14:20 UTC (permalink / raw)
To: Christoph Hellwig, Mike Snitzer
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs
On 11/11/25 3:51 AM, Christoph Hellwig wrote:
>> What we still don't know is exactly what the extra cost of setting
>> DONTCACHE is, even on small writes. Maybe DONTCACHE should be cleared
>> for /all/ segments that are smaller than a page?
> I suspect the best initial tweak is for every segment or entire write
> that is not page aligned in the file, as that is an indicator that
> multiple RMW cycles are possible. At least if we're streaming, but
> we don't have that information. That means all of these cases:
>
> 1) writes smaller than PAGE_SIZE
> 2) writes smaller than PAGE_SIZE * 2 but not aligned to PAGE_SIZE
> 3) unaligned end segments < PAGE_SIZE
>
> If we want to fine tune, we'd probably expand case 2 a bit, as a single
> page cache operation on an order-2 page is going to be faster than
> three I/Os most of the time, but compared to the high level discussion
> here that's minor details.
To move forward before it is too late to hit v6.19, what I'm thinking
is this:
* I'll reset the patch to leave DONTCACHE set for all unaligned
segments, as it was in the original logic
* I'll post a final version of the series to be merged
* We can resume this discussion after the series is merged, as an
opportunity for optimization
--
Chuck Lever
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-11 14:20 ` Chuck Lever
@ 2025-11-11 14:21 ` Christoph Hellwig
0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2025-11-11 14:21 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, Mike Snitzer, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs
On Tue, Nov 11, 2025 at 09:20:52AM -0500, Chuck Lever wrote:
> To move forward before it is too late to hit v6.19, what I'm thinking
> is this:
>
> * I'll reset the patch to leave DONTCACHE set for all unaligned
> segments, as it was in the original logic
>
> * I'll post a final version of the series to be merged
>
> * We can resume this discussion after the series is merged, as an
> opportunity for optimization
Sounds reasonable. And just to make it clear again: I'm not against
this optimization, I just want it to be consistent and well
understood.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-11 8:51 ` Christoph Hellwig
2025-11-11 14:20 ` Chuck Lever
@ 2025-11-12 0:06 ` Mike Snitzer
2025-11-12 15:02 ` Christoph Hellwig
1 sibling, 1 reply; 32+ messages in thread
From: Mike Snitzer @ 2025-11-12 0:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chuck Lever, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
Tom Talpey, linux-nfs
On Tue, Nov 11, 2025 at 12:51:28AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 10, 2025 at 11:41:09AM -0500, Chuck Lever wrote:
> > On 11/7/25 9:01 PM, Mike Snitzer wrote:
> > > Q: Case 2 uses DONTCACHE, so case 1 should too, right?
> > >
> > > A: NO. There is legitimate benefit to having case 1 use cached buffered IO
> > > when issuing its 2 subpage IOs; and that benefit doesn't cause harm,
> > > because order-0 page management does not cause the MM problems that
> > > NFSD_IO_DIRECT sets out to avoid. (Higher-order cached buffered
> > > IO is exactly what both DONTCACHE and NFSD_IO_DIRECT aim to avoid:
> > > otherwise MM spins and spins trying to find adequate free pages,
> > > cannot, and so falls into dirty writeback and reclaim, which causes
> > > kswapd and kcompactd to burn CPU, etc.)
> >
> > Paraphrasing: Each unaligned end (case 1) is always smaller than a page,
> > thus it will stick with order-0 allocations (if that byte range is not
> > already in the page cache), allocations that are, practically speaking,
> > reliable.
> >
> > However it might be a stretch to claim that an order-0 allocation
> > /never/ drives memory reclaim.
>
> That is of course not true: order-0 pages do drive memory
> reclaim. In fact, if you're on a file system without huge folios and
> an application that doesn't use THP, the majority of reclaim is
> driven by order-0 allocations.
Correct, but order-0 pages aren't the burden on MM that reclaim of
higher-order pages is.
To be clear, I didn't say what Chuck paraphrased with "/never/", etc.
But anyway I think we're all in agreement.
> > What we still don't know is exactly what the extra cost of setting
> > DONTCACHE is, even on small writes. Maybe DONTCACHE should be cleared
> > for /all/ segments that are smaller than a page?
>
> I suspect the best initial tweak is for every segment or entire write
> that is not page aligned in the file, as that is an indicator that
> multiple RMW cycles are possible. At least if we're streaming, but
> we don't have that information. That means all of these cases:
>
> 1) writes smaller than PAGE_SIZE
> 2) writes smaller than PAGE_SIZE * 2 but not aligned to PAGE_SIZE
> 3) unaligned end segments < PAGE_SIZE
>
> > If we want to fine tune, we'd probably expand case 2 a bit, as a single
> > page cache operation on an order-2 page is going to be faster than
> three I/Os most of the time, but compared to the high level discussion
> here that's minor details.
I agree, the following should accomplish this.
I'd prefer we get this heuristic agreed on so it can land along with
the NFSD_IO_DIRECT WRITE support for the 6.19 merge window.
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 02ff6bd48a18..ce162ae2f985 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1277,6 +1277,12 @@ nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
segment->flags = iocb->ki_flags;
}
+/*
+ * Unaligned IO that is smaller than order-2 is best
+ * handled using cached buffered IO rather than DONTCACHE.
+ */
+#define NFSD_DONTCACHE_MIN_BYTES ((PAGE_SIZE << 2) + 1)
+
static unsigned int
nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
unsigned int nvecs, struct kiocb *iocb,
@@ -1284,6 +1290,7 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
struct nfsd_write_dio_seg segments[3])
{
u32 offset_align = nf->nf_dio_offset_align;
+ bool use_cached_buffered_fallback = false;
loff_t prefix_end, orig_end, middle_end;
u32 mem_align = nf->nf_dio_mem_align;
size_t prefix, middle, suffix;
@@ -1295,6 +1302,8 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
* If alignments are not available, the write is too small,
* or no alignment can be found, fall back to buffered I/O.
*/
+ if (total < NFSD_DONTCACHE_MIN_BYTES)
+ use_cached_buffered_fallback = true;
if (unlikely(!mem_align || !offset_align) ||
unlikely(total < max(offset_align, mem_align)))
goto no_dio;
@@ -1333,12 +1342,29 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
nfsd_write_dio_seg_init(&segments[nsegs++], bvec, nvecs, total,
prefix + middle, suffix, iocb);
+ /*
+ * Unaligned end segments will always use cached buffered IO
+ * because they are less than PAGE_SIZE.
+ */
+ if ((prefix || suffix) && use_cached_buffered_fallback) {
+ /*
+ * This unaligned IO is small enough that it
+ * warrants fall back to a single cached buffered IO.
+ */
+ goto no_dio;
+ }
+
return nsegs;
no_dio:
/* No DIO alignment possible - pack into single non-DIO segment. */
nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total, 0,
total, iocb);
+ /* Mark the I/O buffer as evict-able to reduce memory contention. */
+ if (!use_cached_buffered_fallback &&
+ nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+ kiocb->ki_flags |= IOCB_DONTCACHE;
+
return 1;
}
@@ -1361,16 +1387,9 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (kiocb->ki_flags & IOCB_DIRECT)
trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
segments[i].iter.count);
- else {
+ else
trace_nfsd_write_vector(rqstp, fhp, kiocb->ki_pos,
segments[i].iter.count);
- /*
- * Mark the I/O buffer as evict-able to reduce
- * memory contention.
- */
- if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
- kiocb->ki_flags |= IOCB_DONTCACHE;
- }
host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
if (host_err < 0)
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-12 0:06 ` Mike Snitzer
@ 2025-11-12 15:02 ` Christoph Hellwig
2025-11-12 23:14 ` Mike Snitzer
0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2025-11-12 15:02 UTC (permalink / raw)
To: Mike Snitzer
Cc: Christoph Hellwig, Chuck Lever, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs
On Tue, Nov 11, 2025 at 07:06:34PM -0500, Mike Snitzer wrote:
> + /* Mark the I/O buffer as evict-able to reduce memory contention. */
> + if (!use_cached_buffered_fallback &&
> + nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> + kiocb->ki_flags |= IOCB_DONTCACHE;
This won't work, because ki_flags get overwritten by the flags in
the segment layer. I walked through this to understand all the cases,
and ended up with a slightly different version, including long comments
explaining them. This is also untested, but I'd love to have others
cross-check the logic in it:
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ab46301da4ae..4727235ef43f 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1292,12 +1292,23 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
unsigned int nsegs = 0;
/*
- * Check if direct I/O is feasible for this write request.
- * If alignments are not available, the write is too small,
- * or no alignment can be found, fall back to buffered I/O.
+ * If the file system doesn't advertise any alignment requirements,
+ * don't try to issue direct I/O. Fall back to uncached buffered
+ * I/O if possible because we'll assume it is not block based and
+ * doesn't need read-modify-write cycles.
*/
- if (unlikely(!mem_align || !offset_align) ||
- unlikely(total < max(offset_align, mem_align)))
+ if (unlikely(!mem_align || !offset_align)) {
+ if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+ segments[0].flags |= IOCB_DONTCACHE;
+ goto no_dio;
+ }
+
+ /*
+ * If the I/O is smaller than the larger of the memory and logical
+ * offset alignment, it is like to require read-modify-write cycles.
+ * Issue cached buffered I/O.
+ */
+ if (unlikely(total < max(offset_align, mem_align)))
goto no_dio;
prefix_end = round_up(offset, offset_align);
@@ -1308,7 +1319,17 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
middle = middle_end - prefix_end;
suffix = orig_end - middle_end;
- if (!middle)
+ /*
+ * If there is no aligned middle section, or aligned part is tiny,
+ * issue a single buffered I/O write instead of splitting up the
+ * write.
+ *
+ * Note: the middle section size here is a random constant. I suspect
+ * when benchmarking it we'd actually end up with a significantly larger
+ * number, with the details depending on hardware.
+ */
+ if (!middle ||
+ ((prefix || suffix) && middle < PAGE_SIZE * 2))
goto no_dio;
if (prefix)
@@ -1324,16 +1345,24 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
* bvecs generated from RPC receive buffers are contiguous: After
* the first bvec, all subsequent bvecs start at bv_offset zero
* (page-aligned). Therefore, only the first bvec is checked.
+ *
+ * If the memory is not aligned at all, but we have a large enough
+ * logical offset-aligned middle section, try to use uncached buffered
+ * I/O for that to avoid cache pollution. If not fall back to a single
+ * cached buffered I/O for the entire write.
*/
- if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1))
- goto no_dio;
- segments[nsegs].flags |= IOCB_DIRECT;
+ if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1)) {
+ if (!(nf->nf_file->f_op->fop_flags & FOP_DONTCACHE))
+ goto no_dio;
+ segments[nsegs].flags |= IOCB_DONTCACHE;
+ } else {
+ segments[nsegs].flags |= IOCB_DIRECT;
+ }
nsegs++;
if (suffix)
nfsd_write_dio_seg_init(&segments[nsegs++], bvec, nvecs, total,
prefix + middle, suffix, iocb);
-
return nsegs;
no_dio:
@@ -1362,16 +1391,9 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (kiocb->ki_flags & IOCB_DIRECT)
trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
segments[i].iter.count);
- else {
+ else
trace_nfsd_write_vector(rqstp, fhp, kiocb->ki_pos,
segments[i].iter.count);
- /*
- * Mark the I/O buffer as evict-able to reduce
- * memory contention.
- */
- if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
- kiocb->ki_flags |= IOCB_DONTCACHE;
- }
host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
if (host_err < 0)
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-12 15:02 ` Christoph Hellwig
@ 2025-11-12 23:14 ` Mike Snitzer
2025-11-13 8:13 ` Christoph Hellwig
0 siblings, 1 reply; 32+ messages in thread
From: Mike Snitzer @ 2025-11-12 23:14 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chuck Lever, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
Tom Talpey, linux-nfs, jonathan.flynn
On Wed, Nov 12, 2025 at 07:02:37AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 11, 2025 at 07:06:34PM -0500, Mike Snitzer wrote:
> > + /* Mark the I/O buffer as evict-able to reduce memory contention. */
> > + if (!use_cached_buffered_fallback &&
> > + nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> > + kiocb->ki_flags |= IOCB_DONTCACHE;
>
> This won't work, because ki_flags get overwritten by the flags in
> the segment layer.
Thanks, yeah that was a copy-n-paste bug, should've been:
segments[0].flags |= IOCB_DONTCACHE;
I should've stated that my patch wasn't tested and was intended as
an RFC.
> I walked through this to understand all the cases,
> and ended up with a slightly different version including long comments
> explaining them. This is also untested, but I'd love to have other
> cross check the logic in it:
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index ab46301da4ae..4727235ef43f 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1292,12 +1292,23 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
> unsigned int nsegs = 0;
>
> /*
> - * Check if direct I/O is feasible for this write request.
> - * If alignments are not available, the write is too small,
> - * or no alignment can be found, fall back to buffered I/O.
> + * If the file system doesn't advertise any alignment requirements,
> + * don't try to issue direct I/O. Fall back to uncached buffered
> + * I/O if possible because we'll assume it is not block based and
> + * doesn't need read-modify-write cycles.
> */
Not sure what is implied by "not block based" in this comment.
> - if (unlikely(!mem_align || !offset_align) ||
> - unlikely(total < max(offset_align, mem_align)))
> + if (unlikely(!mem_align || !offset_align)) {
> + if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> + segments[0].flags |= IOCB_DONTCACHE;
> + goto no_dio;
> + }
> +
> + /*
> + * If the I/O is smaller than the larger of the memory and logical
> + * offset alignment, it is like to require read-modify-write cycles.
> + * Issue cached buffered I/O.
> + */
Nit: s/like/likely/ typo.
> + if (unlikely(total < max(offset_align, mem_align)))
> goto no_dio;
>
> prefix_end = round_up(offset, offset_align);
> @@ -1308,7 +1319,17 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
> middle = middle_end - prefix_end;
> suffix = orig_end - middle_end;
>
> - if (!middle)
> + /*
> + * If there is no aligned middle section, or aligned part is tiny,
> + * issue a single buffered I/O write instead of splitting up the
> + * write.
> + *
> + * Note: the middle section size here is a random constant. I suspect
> + * when benchmarking it we'd actually end up with a significantly larger
> + * number, with the details depending on hardware.
> + */
> + if (!middle ||
> + ((prefix || suffix) && middle < PAGE_SIZE * 2))
> goto no_dio;
>
> if (prefix)
Expressing this threshold in terms of PAGE_SIZE might be a
problem for other archs like ARM (where using 64K pages is the norm for
performance).
The "Note:" portion of the above comment might do well to address the
goal being to avoid using cached buffered for higher order
allocations. You or Chuck may have solid ideas for refining wording.
But so you're aware, Jon Flynn (who kindly does the heavy lifting of
performance testing for me) agrees with you: larger will likely be
beneficial but it'll be hardware dependent. So I'm going to expose a
knob for Jon to try various values so we can see.
From Jon's results we might elect to take further action.
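For reference, the knob could be as simple as this (hypothetical,
untested sketch; the variable and file names are made up and not part
of the posted series):

/* Hypothetical debugfs knob so the DIO-aligned middle threshold can
 * be swept during benchmarking. */
static u64 nfsd_dio_middle_min_bytes = 2 * PAGE_SIZE;

static void nfsd_dio_debugfs_init(struct dentry *nfsd_debugfs_dir)
{
	debugfs_create_u64("dio_middle_min_bytes", 0644, nfsd_debugfs_dir,
			   &nfsd_dio_middle_min_bytes);
}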
> @@ -1324,16 +1345,24 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
> * bvecs generated from RPC receive buffers are contiguous: After
> * the first bvec, all subsequent bvecs start at bv_offset zero
> * (page-aligned). Therefore, only the first bvec is checked.
> + *
> + * If the memory is not aligned at all, but we have a large enough
> + * logical offset-aligned middle section, try to use uncached buffered
> + * I/O for that to avoid cache pollution. If not fall back to a single
> + * cached buffered I/O for the entire write.
> */
> - if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1))
> - goto no_dio;
> - segments[nsegs].flags |= IOCB_DIRECT;
> + if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1)) {
> + if (!(nf->nf_file->f_op->fop_flags & FOP_DONTCACHE))
> + goto no_dio;
> + segments[nsegs].flags |= IOCB_DONTCACHE;
This is a new one for me, but it's a nice catch: just because
IOCB_DIRECT isn't possible for the middle doesn't mean that we should
avoid issuing the subpage segments as buffered IO.
> + } else {
> + segments[nsegs].flags |= IOCB_DIRECT;
> + }
> nsegs++;
>
> if (suffix)
> nfsd_write_dio_seg_init(&segments[nsegs++], bvec, nvecs, total,
> prefix + middle, suffix, iocb);
> -
> return nsegs;
>
> no_dio:
> @@ -1362,16 +1391,9 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (kiocb->ki_flags & IOCB_DIRECT)
> trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
> segments[i].iter.count);
> - else {
> + else
> trace_nfsd_write_vector(rqstp, fhp, kiocb->ki_pos,
> segments[i].iter.count);
> - /*
> - * Mark the I/O buffer as evict-able to reduce
> - * memory contention.
> - */
> - if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> - kiocb->ki_flags |= IOCB_DONTCACHE;
> - }
>
> host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
> if (host_err < 0)
Overall this patch looks good. Would welcome seeing you submit a
formal patch. In parallel I'll work with Jon to get this tested on
Hammerspace's performance cluster to confirm all looks good.
Mike
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-12 23:14 ` Mike Snitzer
@ 2025-11-13 8:13 ` Christoph Hellwig
2025-11-13 21:45 ` Mike Snitzer
0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2025-11-13 8:13 UTC (permalink / raw)
To: Mike Snitzer
Cc: Christoph Hellwig, Chuck Lever, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, jonathan.flynn
On Wed, Nov 12, 2025 at 06:14:15PM -0500, Mike Snitzer wrote:
> > - * Check if direct I/O is feasible for this write request.
> > - * If alignments are not available, the write is too small,
> > - * or no alignment can be found, fall back to buffered I/O.
> > + * If the file system doesn't advertise any alignment requirements,
> > + * don't try to issue direct I/O. Fall back to uncached buffered
> > + * I/O if possible because we'll assume it is not block based and
> > + * doesn't need read-modify-write cycles.
> > */
>
> Not sure what is implied by "not block based" in this comment.
e.g. re-exporting another network file system or tmpfs.
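For instance, a quick userspace probe (hypothetical example, assumes a
glibc new enough to expose statx() and STATX_DIOALIGN) shows such file
systems reporting zero alignments, which is what the fallback above
keys off:

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>

/* Hypothetical probe: tmpfs or an NFS re-export leaves the
 * STATX_DIOALIGN fields zero because direct I/O has no meaningful
 * alignment constraints there. */
int main(int argc, char **argv)
{
	struct statx stx;

	if (argc < 2 || statx(AT_FDCWD, argv[1], 0, STATX_DIOALIGN, &stx))
		return 1;
	printf("mem_align=%u offset_align=%u\n",
	       stx.stx_dio_mem_align, stx.stx_dio_offset_align);
	return 0;
}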
> > + /*
> > + * If the I/O is smaller than the larger of the memory and logical
> > + * offset alignment, it is like to require read-modify-write cycles.
> > + * Issue cached buffered I/O.
> > + */
>
> Nit: s/like/likely/ typo.
Thanks.
> > + if (!middle ||
> > + ((prefix || suffix) && middle < PAGE_SIZE * 2))
> > goto no_dio;
> >
> > if (prefix)
>
> Expressing this threshold in terms of PAGE_SIZE might be a
> > problem for other archs like ARM (where using 64K pages is the norm for
> performance).
True.
> The "Note:" portion of the above comment might do well to address the
> > goal being to avoid using cached buffered IO for higher-order
> > allocations. You or Chuck may have solid ideas for refining the wording.
Honestly I'm not worried about higher order allocations at all. Yes,
they used to be very painful in the past, but if we get them here
it means the file system supports large folios, which means we're going
to churn through a lot of them anyway.
> But so you're aware, Jon Flynn (who kindly does the heavy lifting of
> performance testing for me) agrees with you: larger will likely be
> beneficial but it'll be hardware dependent. So I'm going to expose a
> knob for Jon to try various values so we can see.
That does sound useful, thanks!
> Overall this patch looks good. Would welcome seeing you submit a
> formal patch. In parallel I'll work with Jon to get this tested on
> Hammerspace's performance cluster to confirm all looks good.
I have no problem fixing up the nits and writing a commit message,
but I have no time to properly test this. I'm too busy with my
own projects and just stealing some time to help with whiteboard
coding here :) I'd also be perfectly fine with whoever actually
brings this across taking full credit.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
2025-11-13 8:13 ` Christoph Hellwig
@ 2025-11-13 21:45 ` Mike Snitzer
0 siblings, 0 replies; 32+ messages in thread
From: Mike Snitzer @ 2025-11-13 21:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chuck Lever, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
Tom Talpey, linux-nfs, jonathan.flynn
On Thu, Nov 13, 2025 at 12:13:09AM -0800, Christoph Hellwig wrote:
> On Wed, Nov 12, 2025 at 06:14:15PM -0500, Mike Snitzer wrote:
> > > - * Check if direct I/O is feasible for this write request.
> > > - * If alignments are not available, the write is too small,
> > > - * or no alignment can be found, fall back to buffered I/O.
> > > + * If the file system doesn't advertise any alignment requirements,
> > > + * don't try to issue direct I/O. Fall back to uncached buffered
> > > + * I/O if possible because we'll assume it is not block based and
> > > + * doesn't need read-modify-write cycles.
> > > */
> >
> > Not sure what is implied by "not block based" in this comment.
>
> e.g. re-exporting another network file system or tmpfs.
>
> > > + /*
> > > + * If the I/O is smaller than the larger of the memory and logical
> > > + * offset alignment, it is like to require read-modify-write cycles.
> > > + * Issue cached buffered I/O.
> > > + */
> >
> > Nit: s/like/likely/ typo.
>
> Thanks.
>
> > > + if (!middle ||
> > > + ((prefix || suffix) && middle < PAGE_SIZE * 2))
> > > goto no_dio;
> > >
> > > if (prefix)
> >
> > Expressing this threshold in terms of PAGE_SIZE might be a
> > problem for other archs like ARM (where using 64K pages is the norm for
> > performance).
>
> True.
>
> > The "Note:" portion of the above comment might do well to address the
> > goal being to avoid using cached buffered IO for higher-order
> > allocations. You or Chuck may have solid ideas for refining the wording.
>
> Honestly I'm not worried about higher order allocations at all. Yes,
> they used to be very painful in the past, but if we get them here
> it means the file system supports large folios, which means we're going
> to churn through a lot of them anyway.
>
> > But so you're aware, Jon Flynn (who kindly does the heavy lifting of
> > performance testing for me) agrees with you: larger will likely be
> > beneficial but it'll be hardware dependent. So I'm going to expose a
> > knob for Jon to try various values so we can see.
>
> That does sound useful, thanks!
>
> > Overall this patch looks good. Would welcome seeing you submit a
> > formal patch. In parallel I'll work with Jon to get this tested on
> > Hammerspace's performance cluster to confirm all looks good.
>
> I have no problem fixing up the nits and writing a commit message,
> but I have no time to properly test this. I'm too busy with my
> own projects and just stealing some time to help with whiteboard
> coding here :) I'd also be perfectly fine with whoever actually
> brings this across taking full credit.
I think it best to be attributed to you, so if you have time to polish
the nits and put a header on it, that'd establish a base that can be
tweaked (by Chuck) if needed.
You don't need to worry about testing, we'll get that covered for you
and report back.
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2025-11-13 21:45 UTC | newest]
Thread overview: 32+ messages
2025-11-07 15:34 [PATCH v11 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:34 ` [PATCH v11 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-11-07 15:34 ` [PATCH v11 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-11-07 15:39 ` Christoph Hellwig
2025-11-07 15:40 ` Chuck Lever
2025-11-07 20:05 ` Mike Snitzer
2025-11-07 20:08 ` Chuck Lever
2025-11-07 20:10 ` Mike Snitzer
2025-11-07 21:58 ` NeilBrown
2025-11-07 22:24 ` Mike Snitzer
2025-11-07 23:42 ` NeilBrown
2025-11-08 2:01 ` Mike Snitzer
2025-11-10 16:41 ` Chuck Lever
2025-11-10 17:57 ` Mike Snitzer
2025-11-11 8:51 ` Christoph Hellwig
2025-11-11 14:20 ` Chuck Lever
2025-11-11 14:21 ` Christoph Hellwig
2025-11-12 0:06 ` Mike Snitzer
2025-11-12 15:02 ` Christoph Hellwig
2025-11-12 23:14 ` Mike Snitzer
2025-11-13 8:13 ` Christoph Hellwig
2025-11-13 21:45 ` Mike Snitzer
2025-11-07 20:28 ` Chuck Lever
2025-11-07 22:16 ` Mike Snitzer
2025-11-10 9:12 ` Christoph Hellwig
2025-11-10 15:42 ` Mike Snitzer
2025-11-11 8:44 ` Christoph Hellwig
2025-11-10 9:17 ` Christoph Hellwig
2025-11-10 15:43 ` Mike Snitzer
2025-11-07 17:18 ` Mike Snitzer
2025-11-07 22:13 ` NeilBrown
2025-11-07 15:34 ` [PATCH v11 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever