* [PATCH v12 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
@ 2025-11-11 14:59 Chuck Lever
From: Chuck Lever @ 2025-11-11 14:59 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Following up on
https://lore.kernel.org/linux-nfs/aPAci7O_XK1ljaum@kernel.org/,
this series includes the patches needed to make NFSD Direct WRITE
operational.
I still see this during fstests runs with NFSD_IO_DIRECT enabled:
WARNING: CPU: 5 PID: 1309 at fs/iomap/buffered-io.c:1402 iomap_zero_iter+0x1a4/0x390
No new test failures, but I need to narrow down which test is
triggering this warning.
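One way to narrow it down (a rough sketch only; the exact fstests
invocation and device configuration will vary) is to watch the kernel
log during the run and note the last test announced before the splat:

  dmesg -w | grep -E 'run fstests|iomap_zero_iter' &
  ./check -nfs -g rw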
Applies on the branch:
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/log/?h=nfsd-next
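To try it out, the base can be fetched with something like the
following (commands are illustrative; adjust the remote and local
branch names to taste):

  git remote add cel https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
  git fetch cel nfsd-next
  git checkout -b nfsd-direct-write cel/nfsd-next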
Changes since v11:
* Added trace point to the unaligned write arm
* Reverted to use DONTCACHE on all unaligned writes
* Replaced the comment that mentioned the intent of IOCB_DIRECT
Changes since v10:
* Applied Christoph's clean-ups
* Applied Mike's documentation fixes
Changes since v9:
* Unaligned segments no longer use IOCB_DONTCACHE
* Squashed all review patches into Mike's initial patch
* Squashed Mike's documentation update into the final patch
Changes since v8:
* Drop "NFSD: Handle both offset and memory alignment for direct I/O"
* Include the Sep 3 version of the Documentation update
Changes since v7:
* Rebase the series on Mike's original v3 patch
* Address more review comments
* Optimize the "when can NFSD use IOCB_DIRECT" logic
* Revert the "always promote to FILE_SYNC" logic
Changes since v6:
* Patches to address review comments have been split out
* Refactored the iter initialization code
Changes since v5:
* Add a patch to make FILE_SYNC WRITEs persist timestamps
* Address some of Christoph's review comments
* The svcrdma patch has been dropped until we actually need it
Changes since v4:
* Split out refactoring nfsd_buffered_write() into a separate patch
* Expand patch description of 1/4
* Don't set IOCB_SYNC flag
Changes since v3:
* Address checkpatch.pl nits in 2/3
* Add an untested patch to mark ingress RDMA Read chunks
Chuck Lever (1):
NFSD: Make FILE_SYNC WRITEs comply with spec
Mike Snitzer (2):
NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
.../filesystems/nfs/nfsd-io-modes.rst | 144 ++++++++++++++++
fs/nfsd/debugfs.c | 1 +
fs/nfsd/trace.h | 2 +
fs/nfsd/vfs.c | 159 +++++++++++++++++-
4 files changed, 300 insertions(+), 6 deletions(-)
create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
--
2.51.0
* [PATCH v12 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec
From: Chuck Lever @ 2025-11-11 14:59 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Mike Snitzer, stable, Christoph Hellwig
From: Chuck Lever <chuck.lever@oracle.com>
Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
does not also persist file time stamps. To wit, Section 18.32.3
of RFC 8881 mandates:
> The client specifies with the stable parameter the method of how
> the data is to be processed by the server. If stable is
> FILE_SYNC4, the server MUST commit the data written plus all file
> system metadata to stable storage before returning results. This
> corresponds to the NFSv2 protocol semantics. Any other behavior
> constitutes a protocol violation. If stable is DATA_SYNC4, then
> the server MUST commit all of the data to stable storage and
> enough of the metadata to retrieve the data before returning.
Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") replaced:
- flags |= RWF_SYNC;
with:
+ kiocb.ki_flags |= IOCB_DSYNC;
which appears to be correct given:
if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;
in kiocb_set_rw_flags(). However, the author of that commit did not
appreciate that the previous line in kiocb_set_rw_flags() results
in IOCB_SYNC also being set:
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
RWF_SUPPORTED contains RWF_SYNC, and RWF_SYNC is the same bit as
IOCB_SYNC. Reviewers at the time did not catch the omission.
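Putting the two lines side by side makes the mapping clearer (a
simplified sketch of the kiocb_set_rw_flags() logic, not the exact
upstream code):

	/* RWF_SUPPORTED includes RWF_SYNC, and RWF_SYNC shares its
	 * bit value with IOCB_SYNC, so this line sets IOCB_SYNC ...
	 */
	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);

	/* ... and this line additionally sets IOCB_DSYNC. */
	if (flags & RWF_SYNC)
		kiocb_flags |= IOCB_DSYNC;

In other words, RWF_SYNC maps to IOCB_SYNC | IOCB_DSYNC, which is what
the hunk below now sets explicitly for NFS_FILE_SYNC.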
Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
Fixes: 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/vfs.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f537a7b4ee01..5333d49910d9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1314,8 +1314,18 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
stable = NFS_UNSTABLE;
init_sync_kiocb(&kiocb, file);
kiocb.ki_pos = offset;
- if (stable && !fhp->fh_use_wgather)
- kiocb.ki_flags |= IOCB_DSYNC;
+ if (likely(!fhp->fh_use_wgather)) {
+ switch (stable) {
+ case NFS_FILE_SYNC:
+ /* persist data and timestamps */
+ kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
+ break;
+ case NFS_DATA_SYNC:
+ /* persist data only */
+ kiocb.ki_flags |= IOCB_DSYNC;
+ break;
+ }
+ }
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
--
2.51.0
* [PATCH v12 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
From: Chuck Lever @ 2025-11-11 14:59 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer, Chuck Lever
From: Mike Snitzer <snitzer@kernel.org>
When NFSD_IO_DIRECT is selected via the
/sys/kernel/debug/nfsd/io_cache_write experimental tunable, split
incoming unaligned NFS WRITE requests into prefix, middle, and suffix
segments as needed. The middle segment is DIO-aligned, while the
prefix and/or suffix are unaligned. Synchronous buffered IO is used
for the unaligned segments, and IOCB_DIRECT is used for the
DIO-aligned middle extent.
Although IOCB_DIRECT avoids the use of the page cache, by itself it
doesn't guarantee data durability. For UNSTABLE WRITE requests,
durability is obtained by a subsequent NFS COMMIT request.
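As a worked example (hypothetical numbers, not from a real trace):
with offset_align = 4096, a WRITE of 10000 bytes at offset 1000 is
segmented as follows:

	prefix_end = round_up(1000, 4096)            = 4096
	middle_end = round_down(1000 + 10000, 4096)  = 8192

	prefix = 4096 - 1000   = 3096 bytes, buffered
	middle = 8192 - 4096   = 4096 bytes, IOCB_DIRECT
	suffix = 11000 - 8192  = 2808 bytes, buffered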
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Co-developed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfsd/debugfs.c | 1 +
fs/nfsd/trace.h | 2 +
fs/nfsd/vfs.c | 145 ++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 144 insertions(+), 4 deletions(-)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 00eb1ecef6ac..7f44689e0a53 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
switch (val) {
case NFSD_IO_BUFFERED:
case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
nfsd_io_cache_write = val;
break;
default:
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 85a1521ad757..5ae2a611e57f 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -469,6 +469,8 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
DEFINE_NFSD_IO_EVENT(read_done);
DEFINE_NFSD_IO_EVENT(write_start);
DEFINE_NFSD_IO_EVENT(write_opened);
+DEFINE_NFSD_IO_EVENT(write_direct);
+DEFINE_NFSD_IO_EVENT(write_vector);
DEFINE_NFSD_IO_EVENT(write_io_done);
DEFINE_NFSD_IO_EVENT(write_done);
DEFINE_NFSD_IO_EVENT(commit_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5333d49910d9..ab46301da4ae 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,136 @@ static int wait_for_concurrent_writes(struct file *file)
return err;
}
+struct nfsd_write_dio_seg {
+ struct iov_iter iter;
+ int flags;
+};
+
+static unsigned long
+iov_iter_bvec_offset(const struct iov_iter *iter)
+{
+ return (unsigned long)(iter->bvec->bv_offset + iter->iov_offset);
+}
+
+static void
+nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
+ struct bio_vec *bvec, unsigned int nvecs,
+ unsigned long total, size_t start, size_t len,
+ struct kiocb *iocb)
+{
+ iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
+ if (start)
+ iov_iter_advance(&segment->iter, start);
+ iov_iter_truncate(&segment->iter, len);
+ segment->flags = iocb->ki_flags;
+}
+
+static unsigned int
+nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
+ unsigned int nvecs, struct kiocb *iocb,
+ unsigned long total,
+ struct nfsd_write_dio_seg segments[3])
+{
+ u32 offset_align = nf->nf_dio_offset_align;
+ loff_t prefix_end, orig_end, middle_end;
+ u32 mem_align = nf->nf_dio_mem_align;
+ size_t prefix, middle, suffix;
+ loff_t offset = iocb->ki_pos;
+ unsigned int nsegs = 0;
+
+ /*
+ * Check if direct I/O is feasible for this write request.
+ * If alignments are not available, the write is too small,
+ * or no alignment can be found, fall back to buffered I/O.
+ */
+ if (unlikely(!mem_align || !offset_align) ||
+ unlikely(total < max(offset_align, mem_align)))
+ goto no_dio;
+
+ prefix_end = round_up(offset, offset_align);
+ orig_end = offset + total;
+ middle_end = round_down(orig_end, offset_align);
+
+ prefix = prefix_end - offset;
+ middle = middle_end - prefix_end;
+ suffix = orig_end - middle_end;
+
+ if (!middle)
+ goto no_dio;
+
+ if (prefix)
+ nfsd_write_dio_seg_init(&segments[nsegs++], bvec,
+ nvecs, total, 0, prefix, iocb);
+
+ nfsd_write_dio_seg_init(&segments[nsegs], bvec, nvecs,
+ total, prefix, middle, iocb);
+
+ /*
+ * Check if the bvec iterator is aligned for direct I/O.
+ *
+ * bvecs generated from RPC receive buffers are contiguous: After
+ * the first bvec, all subsequent bvecs start at bv_offset zero
+ * (page-aligned). Therefore, only the first bvec is checked.
+ */
+ if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1))
+ goto no_dio;
+ segments[nsegs].flags |= IOCB_DIRECT;
+ nsegs++;
+
+ if (suffix)
+ nfsd_write_dio_seg_init(&segments[nsegs++], bvec, nvecs, total,
+ prefix + middle, suffix, iocb);
+
+ return nsegs;
+
+no_dio:
+ /* No DIO alignment possible - pack into single non-DIO segment. */
+ nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total, 0,
+ total, iocb);
+ return 1;
+}
+
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ struct nfsd_file *nf, unsigned int nvecs,
+ unsigned long *cnt, struct kiocb *kiocb)
+{
+ struct nfsd_write_dio_seg segments[3];
+ struct file *file = nf->nf_file;
+ unsigned int nsegs, i;
+ ssize_t host_err;
+
+ nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
+ kiocb, *cnt, segments);
+
+ *cnt = 0;
+ for (i = 0; i < nsegs; i++) {
+ kiocb->ki_flags = segments[i].flags;
+ if (kiocb->ki_flags & IOCB_DIRECT)
+ trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
+ segments[i].iter.count);
+ else {
+ trace_nfsd_write_vector(rqstp, fhp, kiocb->ki_pos,
+ segments[i].iter.count);
+ /*
+ * Mark the I/O buffer as evict-able to reduce
+ * memory contention.
+ */
+ if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+ kiocb->ki_flags |= IOCB_DONTCACHE;
+ }
+
+ host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
+ if (host_err < 0)
+ return host_err;
+ *cnt += host_err;
+ if (host_err < segments[i].iter.count)
+ break; /* partial write */
+ }
+
+ return 0;
+}
+
/**
* nfsd_vfs_write - write data to an already-open file
* @rqstp: RPC execution context
@@ -1328,25 +1458,32 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
- iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
switch (nfsd_io_cache_write) {
- case NFSD_IO_BUFFERED:
+ case NFSD_IO_DIRECT:
+ host_err = nfsd_direct_write(rqstp, fhp, nf, nvecs,
+ cnt, &kiocb);
break;
case NFSD_IO_DONTCACHE:
if (file->f_op->fop_flags & FOP_DONTCACHE)
kiocb.ki_flags |= IOCB_DONTCACHE;
+ fallthrough;
+ case NFSD_IO_BUFFERED:
+ iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+ host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
+ if (host_err < 0)
+ break;
+ *cnt = host_err;
break;
}
- host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
if (host_err < 0) {
commit_reset_write_verifier(nn, rqstp, host_err);
goto out_nfserr;
}
- *cnt = host_err;
nfsd_stats_io_write_add(nn, exp, *cnt);
fsnotify_modify(file);
host_err = filemap_check_wb_err(file->f_mapping, since);
--
2.51.0
* [PATCH v12 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
From: Chuck Lever @ 2025-11-11 14:59 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Mike Snitzer
From: Mike Snitzer <snitzer@kernel.org>
This document details the NFSD IO modes that are configurable using
NFSD's experimental debugfs interfaces:
/sys/kernel/debug/nfsd/io_cache_read
/sys/kernel/debug/nfsd/io_cache_write
This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
debugfs interfaces are replaced with per-export controls).
Future updates will provide more specific guidance and howto
information to help others use and evaluate NFSD's IO modes:
BUFFERED, DONTCACHE and DIRECT.
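A quick way to confirm the currently configured modes (assuming
debugfs is mounted at /sys/kernel/debug):

  grep . /sys/kernel/debug/nfsd/io_cache_read \
         /sys/kernel/debug/nfsd/io_cache_write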
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
.../filesystems/nfs/nfsd-io-modes.rst | 144 ++++++++++++++++++
1 file changed, 144 insertions(+)
create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
new file mode 100644
index 000000000000..e3a522d09766
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -0,0 +1,144 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+NFSD IO MODES
+=============
+
+Overview
+========
+
+NFSD has historically always used buffered IO when servicing READ and
+WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
+to override that default to use either DONTCACHE or DIRECT IO modes.
+
+Experimental NFSD debugfs interfaces are available to allow the NFSD IO
+mode used for READ and WRITE to be configured independently. See both:
+- /sys/kernel/debug/nfsd/io_cache_read
+- /sys/kernel/debug/nfsd/io_cache_write
+
+The default value for both io_cache_read and io_cache_write reflects
+NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
+
+Based on the configured settings, NFSD's IO will either be:
+- cached using page cache (NFSD_IO_BUFFERED=0)
+- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
+- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
+
+To set an NFSD IO mode, write a supported value (0 - 2) to the
+corresponding IO operation's debugfs interface, e.g.:
+ echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+ echo 2 > /sys/kernel/debug/nfsd/io_cache_write
+
+To check which IO mode NFSD is using for READ or WRITE, simply read the
+corresponding IO operation's debugfs interface, e.g.:
+ cat /sys/kernel/debug/nfsd/io_cache_read
+ cat /sys/kernel/debug/nfsd/io_cache_write
+
+If you experiment with NFSD's IO modes on a recent kernel and have
+interesting results, please report them to linux-nfs@vger.kernel.org
+
+NFSD DONTCACHE
+==============
+
+DONTCACHE is a hybrid approach to servicing IO that aims to provide
+the benefits of using DIRECT IO without any of the strict alignment
+requirements that DIRECT IO imposes. To achieve this, buffered IO is
+used, but the IO is flagged to "drop behind" (meaning the associated
+pages are dropped from the page cache) when the IO completes.
+
+DONTCACHE aims to avoid what has proven to be a fairly significant
+limitation of Linux's memory management subsystem when large amounts of
+data are infrequently accessed (e.g. read once _or_ written once but not
+read until much later). Such use-cases are particularly problematic
+because the page cache will eventually become a bottleneck to servicing
+new IO requests.
+
+For more context on DONTCACHE, please see these Linux commit headers:
+- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
+ to take a struct kiocb")
+- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
+ RWF_DONTCACHE")
+- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
+
+NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
+filesystem doesn't indicate support by setting FOP_DONTCACHE.
+
+NFSD DIRECT
+===========
+
+DIRECT IO doesn't make use of the page cache; as such, it is able to
+avoid the Linux memory management's page reclaim scalability problems
+without resorting to the hybrid use of page cache that DONTCACHE does.
+
+Some workloads benefit from NFSD avoiding the page cache, particularly
+those with a working set that is significantly larger than available
+system memory. The pathological worst-case workload that NFSD DIRECT has
+proven to help most is: NFS client issuing large sequential IO to a file
+that is 2-3 times larger than the NFS server's available system memory.
+The reason for such improvement is that NFSD DIRECT eliminates a lot of
+work that the memory management subsystem would otherwise be required to
+perform (e.g. page allocation, dirty writeback, page reclaim). When
+using NFSD DIRECT, kswapd and kcompactd no longer consume CPU time
+trying to find adequate free pages so that forward IO progress can
+be made.
+
+The performance win associated with using NFSD DIRECT was previously
+discussed on linux-nfs, see:
+https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
+But in summary:
+- NFSD DIRECT can significantly reduce memory requirements
+- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
+- NFSD DIRECT can offer more deterministic IO performance
+
+As always, your mileage may vary, so it is important to carefully
+consider if/when it is beneficial to make use of NFSD DIRECT. When
+assessing the comparative performance of your workload, be sure to log
+relevant performance metrics during testing (e.g. memory usage, CPU
+usage, IO performance). Using perf to collect profile data that can be
+used to generate a "flamegraph" of the work Linux performs on behalf of
+your test is a meaningful way to compare the relative health of the
+system and to see how switching NFSD's IO mode changes what is observed.
+
+When NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
+interfaces, ideally the IO will be aligned relative to the underlying
+block device's logical_block_size. Also, the memory buffer used to
+store the READ or WRITE payload must be aligned relative to the
+underlying block device's dma_alignment.
+
+However, NFSD DIRECT handles IO that is misaligned with respect to
+O_DIRECT's requirements as best it can:
+
+Misaligned READ:
+ If NFSD_IO_DIRECT is used, any misaligned READ is expanded to the
+ nearest DIO-aligned boundaries (on either end of the READ). The
+ expanded READ is verified for proper offset/length alignment
+ (logical_block_size) and for dma_alignment of the memory buffer.
+
+Misaligned WRITE:
+ If NFSD_IO_DIRECT is used, any misaligned WRITE is split into a
+ start, middle and end as needed. The large middle segment is
+ DIO-aligned and the start and/or end are misaligned. Buffered IO is
+ used for the misaligned segments and O_DIRECT is used for the middle
+ DIO-aligned segment. When the underlying filesystem supports it, the
+ buffered IO for the misaligned segments is flagged DONTCACHE so the
+ associated pages are dropped from the page cache on completion.
+
+Tracing:
+ The nfsd_read_direct trace event shows how NFSD expands any
+ misaligned READ to the next DIO-aligned block (on either end of the
+ original READ, as needed).
+
+ This combination of trace events is useful for READs:
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
+ echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
+
+ The nfsd_write_direct trace event shows how NFSD splits a given
+ misaligned WRITE into a DIO-aligned middle segment.
+
+ This combination of trace events is useful for WRITEs:
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
+ echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
--
2.51.0
* Re: [PATCH v12 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
From: Christoph Hellwig @ 2025-11-11 15:10 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Mike Snitzer, Chuck Lever
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v12 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
From: Anton Gavriliuk @ 2025-11-21 9:20 UTC (permalink / raw)
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Mike Snitzer
> +If you experiment with NFSD's IO modes on a recent kernel and have
> +interesting results, please report them to linux-nfs@vger.kernel.org
Hello
There are two physical boxes - an NFS server and a client - directly
connected via 4x200 Gb/s ConnectX-7 ports.
Both run Rocky Linux 10; the NFS server runs kernel 6.18.0-rc6 and the
NFS client runs kernel 6.12.0-55.41.1.el10_0.x86_64.
Both boxes have 192 GB DRAM.
The NFS server has 6 x CM7-V 5.0 NVMe SSDs in an mdadm raid0.
Thanks to DMA, locally I'm able to read the file at 87 GB/s:
[root@memverge4 ~]# fio --name=local_test --ioengine=libaio --rw=read
--bs=3072K --numjobs=1 --direct=1 --filename=/mnt/testfile
--iodepth=32
local_test: (g=0): rw=read, bs=(R) 3072KiB-3072KiB, (W)
3072KiB-3072KiB, (T) 3072KiB-3072KiB, ioengine=libaio, iodepth=32
fio-3.41-41-gf5b2
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=81.2GiB/s][r=27.7k IOPS][eta 00m:00s]
local_test: (groupid=0, jobs=1): err= 0: pid=14009: Fri Nov 21 09:55:00 2025
read: IOPS=27.7k, BW=81.1GiB/s (87.1GB/s)(512GiB/6313msec)
slat (usec): min=4, max=1006, avg= 6.56, stdev= 5.34
clat (usec): min=275, max=4653, avg=1148.62, stdev=32.32
lat (usec): min=300, max=5659, avg=1155.19, stdev=33.71
On the NFS client, the share is mounted with the following options:
[root@memverge3 ~]# mount -t nfs -o proto=rdma,nconnect=16,vers=3
1.1.1.4:/mnt /mnt
All caches (on server and client) are cleared using "sync; echo 3 >
/proc/sys/vm/drop_caches".
On the NFS server, /sys/kernel/debug/nfsd/io_cache_read is at its
default value of 0.
[root@memverge3 ~]# fio --name=local_test --ioengine=libaio --rw=read
--bs=3072K --numjobs=1 --direct=1 --filename=/mnt/testfile
--iodepth=32
local_test: (g=0): rw=read, bs=(R) 3072KiB-3072KiB, (W)
3072KiB-3072KiB, (T) 3072KiB-3072KiB, ioengine=libaio, iodepth=32
fio-3.41-45-g7c8d
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=2865MiB/s][r=955 IOPS][eta 00m:00s]
local_test: (groupid=0, jobs=1): err= 0: pid=141930: Fri Nov 21 10:04:08 2025
read: IOPS=1723, BW=5170MiB/s (5421MB/s)(512GiB/101408msec)
slat (usec): min=84, max=528, avg=147.91, stdev=29.89
clat (usec): min=950, max=116978, avg=18419.36, stdev=11746.45
lat (usec): min=1082, max=117116, avg=18567.27, stdev=11729.90
All caches (on server and client) are cleared using "sync; echo 3 >
/proc/sys/vm/drop_caches".
On the NFS server, /sys/kernel/debug/nfsd/io_cache_read is now set to 2.
[root@memverge3 ~]# fio --name=local_test --ioengine=libaio --rw=read
--bs=3072K --numjobs=1 --direct=1 --filename=/mnt/testfile
--iodepth=32
local_test: (g=0): rw=read, bs=(R) 3072KiB-3072KiB, (W)
3072KiB-3072KiB, (T) 3072KiB-3072KiB, ioengine=libaio, iodepth=32
fio-3.41-45-g7c8d
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=16.9GiB/s][r=5770 IOPS][eta 00m:00s]
local_test: (groupid=0, jobs=1): err= 0: pid=142151: Fri Nov 21 10:07:36 2025
read: IOPS=5803, BW=17.0GiB/s (18.3GB/s)(512GiB/30111msec)
slat (usec): min=58, max=468, avg=171.72, stdev=15.26
clat (usec): min=264, max=9284, avg=5340.71, stdev=141.85
lat (usec): min=455, max=9748, avg=5512.42, stdev=145.54
More than a 3x improvement!
Now let's take a look at the NFS client side: why can't I exceed 20 GB/s?
It looks like the fio thread spends most of its on-CPU time executing
the following kernel functions:
HARDCLOCK entries
Count Pct State Function
384 38.40% SYS nfs_page_create_from_page
142 14.20% SYS nfs_get_lock_context
109 10.90% SYS __nfs_pageio_add_request
70 7.00% SYS refcount_dec_and_lock
64 6.40% SYS nfs_page_create
58 5.80% SYS nfs_direct_read_schedule_iovec
39 3.90% SYS nfs_pageio_add_request
38 3.80% SYS kmem_cache_alloc_noprof
13 1.30% SYS rpc_execute
12 1.20% SYS nfs_generic_pg_pgios
10 1.00% SYS get_partial_node.part.0
10 1.00% SYS rmqueue_bulk
10 1.00% SYS nfs_generic_pgio
8 0.80% SYS gup_fast_fallback
6 0.60% SYS xprt_iter_next_entry_roundrobin
4 0.40% SYS allocate_slab
4 0.40% SYS nfs_file_direct_read
3 0.30% SYS rpc_task_set_transport
2 0.20% SYS __get_random_u32_below
2 0.20% SYS nfs_pgheader_init
Count Pct HARDCLOCK Stack trace
============================================================
384 38.40% nfs_page_create_from_page
nfs_direct_read_schedule_iovec nfs_file_direct_read aio_read
io_submit_one __x64_sys_io_submit do_syscall_64
entry_SYSCALL_64_after_hwframe | syscall
141 14.10% nfs_get_lock_context nfs_page_create_from_page
nfs_direct_read_schedule_iovec nfs_file_direct_read aio_read
io_submit_one __x64_sys_io_submit do_syscall_64
entry_SYSCALL_64_after_hwframe | syscall
109 10.90% __nfs_pageio_add_request nfs_pageio_add_request
nfs_direct_read_schedule_iovec nfs_file_direct_read aio_read
io_submit_one __x64_sys_io_submit do_syscall_64
entry_SYSCALL_64_after_hwframe | syscall
70 7.00% refcount_dec_and_lock nfs_put_lock_context
nfs_page_create_from_page nfs_direct_read_schedule_iovec
nfs_file_direct_read aio_read io_submit_one __x64_sys_io_submit
do_syscall_64 entry_SYSCALL_64_after_hwframe | syscall
64 6.40% nfs_page_create nfs_page_create_from_page
nfs_direct_read_schedule_iovec nfs_file_direct_read aio_read
io_submit_one __x64_sys_io_submit do_syscall_64
entry_SYSCALL_64_after_hwframe | syscall
58 5.80% nfs_direct_read_schedule_iovec
nfs_file_direct_read aio_read io_submit_one __x64_sys_io_submit
do_syscall_64 entry_SYSCALL_64_after_hwframe | syscall
39 3.90% nfs_pageio_add_request
nfs_direct_read_schedule_iovec nfs_file_direct_read aio_read
io_submit_one __x64_sys_io_submit do_syscall_64
entry_SYSCALL_64_after_hwframe | syscall
36 3.60% kmem_cache_alloc_noprof nfs_page_create
nfs_page_create_from_page nfs_direct_read_schedule_iovec
nfs_file_direct_read aio_read io_submit_one __x64_sys_io_submit
do_syscall_64 entry_SYSCALL_64_after_hwframe | syscall
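(For reference, a roughly equivalent on-CPU profile can be captured
with perf - a hypothetical invocation, adjust the PID and duration as
needed:

  perf record -g -p $(pgrep -x fio) -- sleep 10
  perf report --stdio --sort symbol
)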
Even when I create and add 2M huge pages as fio's backing memory
(adding --mem=mmaphuge --hugepage-size=2m), the distribution is still
as shown above.
I might be wrong, but even with fio's huge pages, the standard
upstream NFS client creates a struct nfs_page for every 4K chunk
(page_size), regardless of the backing page size.
Could 2M huge pages be implemented for the NFS client? It would
improve performance even more when using NFSD direct reads.
Anton
* Re: [PATCH v12 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
From: Chuck Lever @ 2025-11-22 15:52 UTC (permalink / raw)
To: Anton Gavriliuk, Trond Myklebust, Anna Schumaker
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
linux-nfs, Mike Snitzer
Comments on client behavior go to Trond and Anna.
--
Chuck Lever