* [PATCH v12 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
@ 2025-11-11 14:59 Chuck Lever
  2025-11-11 14:59 ` [PATCH v12 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Chuck Lever @ 2025-11-11 14:59 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Following on https://lore.kernel.org/linux-nfs/aPAci7O_XK1ljaum@kernel.org/
this series includes the patches needed to make NFSD Direct WRITE
operational.

I still see this during fstests runs with NFSD_IO_DIRECT enabled:

WARNING: CPU: 5 PID: 1309 at fs/iomap/buffered-io.c:1402 iomap_zero_iter+0x1a4/0x390

No new test failures, but I need to narrow down which test is
triggering this warning.

Applies on the branch:
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/log/?h=nfsd-next


Changes since v11:
* Added trace point to the unaligned write arm
* Reverted to use DONTCACHE on all unaligned writes
* Replaced the comment that mentioned the intent of IOCB_DIRECT

Changes since v10:
* Applied Christoph's clean-ups
* Applied Mike's documentation fixes

Changes since v9:
* Unaligned segments no longer use IOCB_DONTCACHE
* Squashed all review patches into Mike's initial patch
* Squashed Mike's documentation update into the final patch

Changes since v8:
* Drop "NFSD: Handle both offset and memory alignment for direct I/O"
* Include the Sep 3 version of the Documentation update

Changes since v7:
* Rebase the series on Mike's original v3 patch
* Address more review comments
* Optimize the "when can NFSD use IOCB_DIRECT" logic
* Revert the "always promote to FILE_SYNC" logic

Changes since v6:
* Patches to address review comments have been split out
* Refactored the iter initialization code

Changes since v5:
* Add a patch to make FILE_SYNC WRITEs persist timestamps
* Address some of Christoph's review comments
* The svcrdma patch has been dropped until we actually need it

Changes since v4:
* Split out refactoring nfsd_buffered_write() into a separate patch
* Expand patch description of 1/4
* Don't set IOCB_SYNC flag

Changes since v3:
* Address checkpatch.pl nits in 2/3
* Add an untested patch to mark ingress RDMA Read chunks

Chuck Lever (1):
  NFSD: Make FILE_SYNC WRITEs comply with spec

Mike Snitzer (2):
  NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst

 .../filesystems/nfs/nfsd-io-modes.rst         | 144 ++++++++++++++++
 fs/nfsd/debugfs.c                             |   1 +
 fs/nfsd/trace.h                               |   2 +
 fs/nfsd/vfs.c                                 | 159 +++++++++++++++++-
 4 files changed, 300 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst

-- 
2.51.0


* [PATCH v12 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec
  2025-11-11 14:59 [PATCH v12 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-11-11 14:59 ` Chuck Lever
  2025-11-11 14:59 ` [PATCH v12 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-11-11 14:59 ` [PATCH v12 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
  2 siblings, 0 replies; 7+ messages in thread
From: Chuck Lever @ 2025-11-11 14:59 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Mike Snitzer, stable, Christoph Hellwig

From: Chuck Lever <chuck.lever@oracle.com>

Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
does not also persist file time stamps. To wit, Section 18.32.3
of RFC 8881 mandates:

> The client specifies with the stable parameter the method of how
> the data is to be processed by the server. If stable is
> FILE_SYNC4, the server MUST commit the data written plus all file
> system metadata to stable storage before returning results. This
> corresponds to the NFSv2 protocol semantics. Any other behavior
> constitutes a protocol violation. If stable is DATA_SYNC4, then
> the server MUST commit all of the data to stable storage and
> enough of the metadata to retrieve the data before returning.

Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") replaced:

-		flags |= RWF_SYNC;

with:

+		kiocb.ki_flags |= IOCB_DSYNC;

which appears to be correct given:

	if (flags & RWF_SYNC)
		kiocb_flags |= IOCB_DSYNC;

in kiocb_set_rw_flags(). However, the author of that commit did not
appreciate that the previous line in kiocb_set_rw_flags() results
in IOCB_SYNC also being set:

	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);

RWF_SUPPORTED contains RWF_SYNC, and RWF_SYNC is the same bit as
IOCB_SYNC. Reviewers at the time did not catch the omission.
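
Putting those two observations together, a condensed sketch of the
old RWF_SYNC path (paraphrasing kiocb_set_rw_flags(), not a verbatim
copy):

	/* RWF_SYNC shares its bit value with IOCB_SYNC */
	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
	if (flags & RWF_SYNC)
		kiocb_flags |= IOCB_DSYNC;

Thus an RWF_SYNC caller ended up with both IOCB_SYNC and IOCB_DSYNC
set, while the replacement code set only IOCB_DSYNC.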

Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
Fixes: 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f537a7b4ee01..5333d49910d9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1314,8 +1314,18 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		stable = NFS_UNSTABLE;
 	init_sync_kiocb(&kiocb, file);
 	kiocb.ki_pos = offset;
-	if (stable && !fhp->fh_use_wgather)
-		kiocb.ki_flags |= IOCB_DSYNC;
+	if (likely(!fhp->fh_use_wgather)) {
+		switch (stable) {
+		case NFS_FILE_SYNC:
+			/* persist data and timestamps */
+			kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
+			break;
+		case NFS_DATA_SYNC:
+			/* persist data only */
+			kiocb.ki_flags |= IOCB_DSYNC;
+			break;
+		}
+	}
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
 	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
-- 
2.51.0


* [PATCH v12 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-11-11 14:59 [PATCH v12 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-11-11 14:59 ` [PATCH v12 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
@ 2025-11-11 14:59 ` Chuck Lever
  2025-11-11 15:10   ` Christoph Hellwig
  2025-11-11 14:59 ` [PATCH v12 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
  2 siblings, 1 reply; 7+ messages in thread
From: Chuck Lever @ 2025-11-11 14:59 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Mike Snitzer, Chuck Lever

From: Mike Snitzer <snitzer@kernel.org>

When NFSD_IO_DIRECT is selected via the
/sys/kernel/debug/nfsd/io_cache_write experimental tunable, split
incoming unaligned NFS WRITE requests into prefix, middle, and suffix
segments as needed. The middle segment is then DIO-aligned while the
prefix and/or suffix remain unaligned. Synchronous buffered IO is used
for the unaligned segments, and IOCB_DIRECT is used for the DIO-aligned
middle segment.

Although IOCB_DIRECT avoids the use of the page cache, by itself it
doesn't guarantee data durability. For UNSTABLE WRITE requests,
durability is obtained by a subsequent NFS COMMIT request.
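
As an illustration (hypothetical numbers, assuming a 4096-byte
nf_dio_offset_align and a payload that satisfies nf_dio_mem_align):
a 70000-byte WRITE at offset 1000 is split into a 3096-byte buffered
prefix (offsets 1000-4095), a 65536-byte IOCB_DIRECT middle (offsets
4096-69631), and a 1368-byte buffered suffix (offsets 69632-70999).
A WRITE that is already aligned at both ends produces only the
single direct middle segment.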

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Co-developed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/debugfs.c |   1 +
 fs/nfsd/trace.h   |   2 +
 fs/nfsd/vfs.c     | 145 ++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 144 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 00eb1ecef6ac..7f44689e0a53 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -108,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
 	switch (val) {
 	case NFSD_IO_BUFFERED:
 	case NFSD_IO_DONTCACHE:
+	case NFSD_IO_DIRECT:
 		nfsd_io_cache_write = val;
 		break;
 	default:
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 85a1521ad757..5ae2a611e57f 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -469,6 +469,8 @@ DEFINE_NFSD_IO_EVENT(read_io_done);
 DEFINE_NFSD_IO_EVENT(read_done);
 DEFINE_NFSD_IO_EVENT(write_start);
 DEFINE_NFSD_IO_EVENT(write_opened);
+DEFINE_NFSD_IO_EVENT(write_direct);
+DEFINE_NFSD_IO_EVENT(write_vector);
 DEFINE_NFSD_IO_EVENT(write_io_done);
 DEFINE_NFSD_IO_EVENT(write_done);
 DEFINE_NFSD_IO_EVENT(commit_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5333d49910d9..ab46301da4ae 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1254,6 +1254,136 @@ static int wait_for_concurrent_writes(struct file *file)
 	return err;
 }
 
+struct nfsd_write_dio_seg {
+	struct iov_iter			iter;
+	int				flags;
+};
+
+static unsigned long
+iov_iter_bvec_offset(const struct iov_iter *iter)
+{
+	return (unsigned long)(iter->bvec->bv_offset + iter->iov_offset);
+}
+
+static void
+nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
+			struct bio_vec *bvec, unsigned int nvecs,
+			unsigned long total, size_t start, size_t len,
+			struct kiocb *iocb)
+{
+	iov_iter_bvec(&segment->iter, ITER_SOURCE, bvec, nvecs, total);
+	if (start)
+		iov_iter_advance(&segment->iter, start);
+	iov_iter_truncate(&segment->iter, len);
+	segment->flags = iocb->ki_flags;
+}
+
+static unsigned int
+nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
+			  unsigned int nvecs, struct kiocb *iocb,
+			  unsigned long total,
+			  struct nfsd_write_dio_seg segments[3])
+{
+	u32 offset_align = nf->nf_dio_offset_align;
+	loff_t prefix_end, orig_end, middle_end;
+	u32 mem_align = nf->nf_dio_mem_align;
+	size_t prefix, middle, suffix;
+	loff_t offset = iocb->ki_pos;
+	unsigned int nsegs = 0;
+
+	/*
+	 * Check if direct I/O is feasible for this write request.
+	 * If alignments are not available, the write is too small,
+	 * or no alignment can be found, fall back to buffered I/O.
+	 */
+	if (unlikely(!mem_align || !offset_align) ||
+	    unlikely(total < max(offset_align, mem_align)))
+		goto no_dio;
+
+	prefix_end = round_up(offset, offset_align);
+	orig_end = offset + total;
+	middle_end = round_down(orig_end, offset_align);
+
+	prefix = prefix_end - offset;
+	middle = middle_end - prefix_end;
+	suffix = orig_end - middle_end;
+
+	if (!middle)
+		goto no_dio;
+
+	if (prefix)
+		nfsd_write_dio_seg_init(&segments[nsegs++], bvec,
+					nvecs, total, 0, prefix, iocb);
+
+	nfsd_write_dio_seg_init(&segments[nsegs], bvec, nvecs,
+				total, prefix, middle, iocb);
+
+	/*
+	 * Check if the bvec iterator is aligned for direct I/O.
+	 *
+	 * bvecs generated from RPC receive buffers are contiguous: After
+	 * the first bvec, all subsequent bvecs start at bv_offset zero
+	 * (page-aligned). Therefore, only the first bvec is checked.
+	 */
+	if (iov_iter_bvec_offset(&segments[nsegs].iter) & (mem_align - 1))
+		goto no_dio;
+	segments[nsegs].flags |= IOCB_DIRECT;
+	nsegs++;
+
+	if (suffix)
+		nfsd_write_dio_seg_init(&segments[nsegs++], bvec, nvecs, total,
+					prefix + middle, suffix, iocb);
+
+	return nsegs;
+
+no_dio:
+	/* No DIO alignment possible - pack into single non-DIO segment. */
+	nfsd_write_dio_seg_init(&segments[0], bvec, nvecs, total, 0,
+				total, iocb);
+	return 1;
+}
+
+static noinline_for_stack int
+nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
+		  struct nfsd_file *nf, unsigned int nvecs,
+		  unsigned long *cnt, struct kiocb *kiocb)
+{
+	struct nfsd_write_dio_seg segments[3];
+	struct file *file = nf->nf_file;
+	unsigned int nsegs, i;
+	ssize_t host_err;
+
+	nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
+					  kiocb, *cnt, segments);
+
+	*cnt = 0;
+	for (i = 0; i < nsegs; i++) {
+		kiocb->ki_flags = segments[i].flags;
+		if (kiocb->ki_flags & IOCB_DIRECT)
+			trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
+						segments[i].iter.count);
+		else {
+			trace_nfsd_write_vector(rqstp, fhp, kiocb->ki_pos,
+						segments[i].iter.count);
+			/*
+			 * Mark the I/O buffer as evict-able to reduce
+			 * memory contention.
+			 */
+			if (nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+				kiocb->ki_flags |= IOCB_DONTCACHE;
+		}
+
+		host_err = vfs_iocb_iter_write(file, kiocb, &segments[i].iter);
+		if (host_err < 0)
+			return host_err;
+		*cnt += host_err;
+		if (host_err < segments[i].iter.count)
+			break;	/* partial write */
+	}
+
+	return 0;
+}
+
 /**
  * nfsd_vfs_write - write data to an already-open file
  * @rqstp: RPC execution context
@@ -1328,25 +1458,32 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	}
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
-	iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
 	since = READ_ONCE(file->f_wb_err);
 	if (verf)
 		nfsd_copy_write_verifier(verf, nn);
 
 	switch (nfsd_io_cache_write) {
-	case NFSD_IO_BUFFERED:
+	case NFSD_IO_DIRECT:
+		host_err = nfsd_direct_write(rqstp, fhp, nf, nvecs,
+					     cnt, &kiocb);
 		break;
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
 			kiocb.ki_flags |= IOCB_DONTCACHE;
+		fallthrough;
+	case NFSD_IO_BUFFERED:
+		iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+		host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
+		if (host_err < 0)
+			break;
+		*cnt = host_err;
 		break;
 	}
-	host_err = vfs_iocb_iter_write(file, &kiocb, &iter);
 	if (host_err < 0) {
 		commit_reset_write_verifier(nn, rqstp, host_err);
 		goto out_nfserr;
 	}
-	*cnt = host_err;
 	nfsd_stats_io_write_add(nn, exp, *cnt);
 	fsnotify_modify(file);
 	host_err = filemap_check_wb_err(file->f_mapping, since);
-- 
2.51.0


* [PATCH v12 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
  2025-11-11 14:59 [PATCH v12 0/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
  2025-11-11 14:59 ` [PATCH v12 1/3] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
  2025-11-11 14:59 ` [PATCH v12 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-11-11 14:59 ` Chuck Lever
  2025-11-21  9:20   ` Anton Gavriliuk
  2 siblings, 1 reply; 7+ messages in thread
From: Chuck Lever @ 2025-11-11 14:59 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Mike Snitzer

From: Mike Snitzer <snitzer@kernel.org>

This document details the NFSD IO modes that are configurable using
NFSD's experimental debugfs interfaces:

  /sys/kernel/debug/nfsd/io_cache_read
  /sys/kernel/debug/nfsd/io_cache_write

This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
debugfs interfaces are replaced with per-export controls).

Future updates will provide more specific guidance and howto
information to help others use and evaluate NFSD's IO modes:
BUFFERED, DONTCACHE and DIRECT.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 .../filesystems/nfs/nfsd-io-modes.rst         | 144 ++++++++++++++++++
 1 file changed, 144 insertions(+)
 create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst

diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
new file mode 100644
index 000000000000..e3a522d09766
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -0,0 +1,144 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+NFSD IO MODES
+=============
+
+Overview
+========
+
+NFSD has historically always used buffered IO when servicing READ and
+WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
+to override that default and use either the DONTCACHE or DIRECT IO mode.
+
+Experimental NFSD debugfs interfaces are available to allow the NFSD IO
+mode used for READ and WRITE to be configured independently. See both:
+- /sys/kernel/debug/nfsd/io_cache_read
+- /sys/kernel/debug/nfsd/io_cache_write
+
+The default value for both io_cache_read and io_cache_write reflects
+NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
+
+Based on the configured settings, NFSD's IO will either be:
+- cached using page cache (NFSD_IO_BUFFERED=0)
+- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
+- not cached, even for stable_how=NFS_UNSTABLE WRITEs (NFSD_IO_DIRECT=2)
+
+To set an NFSD IO mode, write a supported value (0 - 2) to the
+corresponding IO operation's debugfs interface, e.g.:
+  echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+  echo 2 > /sys/kernel/debug/nfsd/io_cache_write
+
+To check which IO mode NFSD is using for READ or WRITE, simply read the
+corresponding IO operation's debugfs interface, e.g.:
+  cat /sys/kernel/debug/nfsd/io_cache_read
+  cat /sys/kernel/debug/nfsd/io_cache_write
+
+If you experiment with NFSD's IO modes on a recent kernel and have
+interesting results, please report them to linux-nfs@vger.kernel.org
+
+NFSD DONTCACHE
+==============
+
+DONTCACHE offers a hybrid approach to servicing IO that aims to offer
+the benefits of using DIRECT IO without any of the strict alignment
+requirements that DIRECT IO imposes. To achieve this, buffered IO is
+used, but the IO is flagged to "drop behind" (meaning associated
+pages are dropped from the page cache) when the IO completes.
+
+DONTCACHE aims to avoid what has proven to be a fairly significant
+limitation of Linux's memory management subsystem if/when large amounts
+of data are infrequently accessed (e.g. read once _or_ written once but
+not read until much later). Such use-cases are particularly problematic
+because the page cache will eventually become a bottleneck to servicing
+new IO requests.
+
+For more context on DONTCACHE, please see these Linux commit headers:
+- Overview:  9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
+  to take a struct kiocb")
+- for READ:  8026e49bff9b1 ("mm/filemap: add read support for
+  RWF_DONTCACHE")
+- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
+
+NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
+filesystem doesn't indicate support by setting FOP_DONTCACHE.
+
+NFSD DIRECT
+===========
+
+DIRECT IO doesn't make use of the page cache; as such, it is able to
+avoid the Linux memory management's page reclaim scalability problems
+without resorting to the hybrid use of page cache that DONTCACHE does.
+
+Some workloads benefit from NFSD avoiding the page cache, particularly
+those with a working set that is significantly larger than available
+system memory. The pathological worst-case workload that NFSD DIRECT has
+proven to help most is an NFS client issuing large sequential IO to a
+file that is 2-3 times larger than the NFS server's available system
+memory. The improvement comes from NFSD DIRECT eliminating a lot of
+the work that the memory management subsystem would otherwise be
+required to perform (e.g. page allocation, dirty writeback, page
+reclaim). When using NFSD DIRECT, kswapd and kcompactd no longer burn
+CPU time trying to find adequate free pages so that forward IO
+progress can be made.
+
+The performance win associated with using NFSD DIRECT was previously
+discussed on linux-nfs, see:
+https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
+But in summary:
+- NFSD DIRECT can significantly reduce memory requirements
+- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
+- NFSD DIRECT can offer more deterministic IO performance
+
+As always, your mileage may vary and so it is important to carefully
+consider if/when it is beneficial to make use of NFSD DIRECT. When
+assessing the comparative performance of your workload, be sure to log
+relevant performance metrics during testing (e.g. memory usage, CPU
+usage, IO performance). Using perf to collect profile data and generate
+a "flamegraph" of the work Linux performs on behalf of your test is a
+meaningful way to compare the relative health of the system and to see
+how switching NFSD's IO mode changes what is observed.
+
+If NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
+interfaces, ideally the IO will be aligned relative to
+the underlying block device's logical_block_size. Also the memory buffer
+used to store the READ or WRITE payload must be aligned relative to the
+underlying block device's dma_alignment.
+
+But NFSD DIRECT does handle IO that does not meet O_DIRECT alignment
+requirements as best it can:
+
+Misaligned READ:
+    If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
+    DIO-aligned block (on either end of the READ). The expanded READ is
+    verified against the logical_block_size (offset/len) and
+    dma_alignment requirements.
+
+Misaligned WRITE:
+    If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
+    middle and end as needed. The large middle segment is DIO-aligned
+    and the start and/or end are misaligned. Buffered IO is used for the
+    misaligned segments and O_DIRECT is used for the middle DIO-aligned
+    segment. DONTCACHE buffered IO is _not_ used for the misaligned
+    segments because using normal buffered IO offers significant RMW
+    performance benefit when handling streaming misaligned WRITEs.
+
+Tracing:
+    The nfsd_read_direct trace event shows how NFSD expands any
+    misaligned READ to the next DIO-aligned block (on either end of the
+    original READ, as needed).
+
+    This combination of trace events is useful for READs:
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
+    echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
+
+    The nfsd_write_direct trace event shows how NFSD splits a given
+    misaligned WRITE into a DIO-aligned middle segment.
+
+    This combination of trace events is useful for WRITEs:
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
+    echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
-- 
2.51.0


* Re: [PATCH v12 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  2025-11-11 14:59 ` [PATCH v12 2/3] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
@ 2025-11-11 15:10   ` Christoph Hellwig
  0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2025-11-11 15:10 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Mike Snitzer, Chuck Lever

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v12 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
  2025-11-11 14:59 ` [PATCH v12 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever
@ 2025-11-21  9:20   ` Anton Gavriliuk
  2025-11-22 15:52     ` Chuck Lever
  0 siblings, 1 reply; 7+ messages in thread
From: Anton Gavriliuk @ 2025-11-21  9:20 UTC (permalink / raw)
  To: Chuck Lever
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Mike Snitzer

> +If you experiment with NFSD's IO modes on a recent kernel and have
> +interesting results, please report them to linux-nfs@vger.kernel.org

Hello

There are two physical boxes - NFS server and client, directly
connected via 4x200 Gb/s ConnectX-7 ports.
Rocky Linux 10 on both boxes, on NFS server kernel 6.18.0-rc6, on NFS
client kernel 6.12.0-55.41.1.el10_0.x86_64
Both boxes have 192 GB DRAM.
On the NFS server there are 6 x CM7-V 5.0 NVMe SSD in mdadm raid0.
Thanks to DMA, locally I'm able to read the file 87 GB/s,

[root@memverge4 ~]# fio --name=local_test --ioengine=libaio --rw=read
--bs=3072K --numjobs=1 --direct=1 --filename=/mnt/testfile
--iodepth=32
local_test: (g=0): rw=read, bs=(R) 3072KiB-3072KiB, (W)
3072KiB-3072KiB, (T) 3072KiB-3072KiB, ioengine=libaio, iodepth=32
fio-3.41-41-gf5b2
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=81.2GiB/s][r=27.7k IOPS][eta 00m:00s]
local_test: (groupid=0, jobs=1): err= 0: pid=14009: Fri Nov 21 09:55:00 2025
  read: IOPS=27.7k, BW=81.1GiB/s (87.1GB/s)(512GiB/6313msec)
    slat (usec): min=4, max=1006, avg= 6.56, stdev= 5.34
    clat (usec): min=275, max=4653, avg=1148.62, stdev=32.32
     lat (usec): min=300, max=5659, avg=1155.19, stdev=33.71

On the NFS client, the share is mounted with the following options:

[root@memverge3 ~]# mount -t nfs -o proto=rdma,nconnect=16,vers=3
1.1.1.4:/mnt /mnt

All caches (on server and client) are cleared using "sync; echo 3 >
/proc/sys/vm/drop_caches".
On the NFS server, /sys/kernel/debug/nfsd/io_cache_read is at its default value of 0.

[root@memverge3 ~]# fio --name=local_test --ioengine=libaio --rw=read
--bs=3072K --numjobs=1 --direct=1 --filename=/mnt/testfile
--iodepth=32
local_test: (g=0): rw=read, bs=(R) 3072KiB-3072KiB, (W)
3072KiB-3072KiB, (T) 3072KiB-3072KiB, ioengine=libaio, iodepth=32
fio-3.41-45-g7c8d
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=2865MiB/s][r=955 IOPS][eta 00m:00s]
local_test: (groupid=0, jobs=1): err= 0: pid=141930: Fri Nov 21 10:04:08 2025
  read: IOPS=1723, BW=5170MiB/s (5421MB/s)(512GiB/101408msec)
    slat (usec): min=84, max=528, avg=147.91, stdev=29.89
    clat (usec): min=950, max=116978, avg=18419.36, stdev=11746.45
     lat (usec): min=1082, max=117116, avg=18567.27, stdev=11729.90

All caches (on server and client) are cleared using "sync; echo 3 >
/proc/sys/vm/drop_caches".
On the NFS server, /sys/kernel/debug/nfsd/io_cache_read is now set to 2.

[root@memverge3 ~]# fio --name=local_test --ioengine=libaio --rw=read
--bs=3072K --numjobs=1 --direct=1 --filename=/mnt/testfile
--iodepth=32
local_test: (g=0): rw=read, bs=(R) 3072KiB-3072KiB, (W)
3072KiB-3072KiB, (T) 3072KiB-3072KiB, ioengine=libaio, iodepth=32
fio-3.41-45-g7c8d
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=16.9GiB/s][r=5770 IOPS][eta 00m:00s]
local_test: (groupid=0, jobs=1): err= 0: pid=142151: Fri Nov 21 10:07:36 2025
  read: IOPS=5803, BW=17.0GiB/s (18.3GB/s)(512GiB/30111msec)
    slat (usec): min=58, max=468, avg=171.72, stdev=15.26
    clat (usec): min=264, max=9284, avg=5340.71, stdev=141.85
     lat (usec): min=455, max=9748, avg=5512.42, stdev=145.54

A more than 3x improvement!!

Now let's take a look at the NFS client side: why can't I exceed 20 GB/s?

It looks like the fio thread spends most of its time on the CPU
executing the following kernel functions:

    HARDCLOCK entries
       Count     Pct  State  Function
         384  38.40%  SYS    nfs_page_create_from_page
         142  14.20%  SYS    nfs_get_lock_context
         109  10.90%  SYS    __nfs_pageio_add_request
          70   7.00%  SYS    refcount_dec_and_lock
          64   6.40%  SYS    nfs_page_create
          58   5.80%  SYS    nfs_direct_read_schedule_iovec
          39   3.90%  SYS    nfs_pageio_add_request
          38   3.80%  SYS    kmem_cache_alloc_noprof
          13   1.30%  SYS    rpc_execute
          12   1.20%  SYS    nfs_generic_pg_pgios
          10   1.00%  SYS    get_partial_node.part.0
          10   1.00%  SYS    rmqueue_bulk
          10   1.00%  SYS    nfs_generic_pgio
           8   0.80%  SYS    gup_fast_fallback
           6   0.60%  SYS    xprt_iter_next_entry_roundrobin
           4   0.40%  SYS    allocate_slab
           4   0.40%  SYS    nfs_file_direct_read
           3   0.30%  SYS    rpc_task_set_transport
           2   0.20%  SYS    __get_random_u32_below
           2   0.20%  SYS    nfs_pgheader_init

       Count     Pct  HARDCLOCK Stack trace
       ============================================================
         384  38.40%  nfs_page_create_from_page
nfs_direct_read_schedule_iovec  nfs_file_direct_read  aio_read
io_submit_one  __x64_sys_io_submit  do_syscall_64
entry_SYSCALL_64_after_hwframe  |  syscall
         141  14.10%  nfs_get_lock_context  nfs_page_create_from_page
nfs_direct_read_schedule_iovec  nfs_file_direct_read  aio_read
io_submit_one  __x64_sys_io_submit  do_syscall_64
entry_SYSCALL_64_after_hwframe  |  syscall
         109  10.90%  __nfs_pageio_add_request  nfs_pageio_add_request
 nfs_direct_read_schedule_iovec  nfs_file_direct_read  aio_read
io_submit_one  __x64_sys_io_submit  do_syscall_64
entry_SYSCALL_64_after_hwframe  |  syscall
          70   7.00%  refcount_dec_and_lock  nfs_put_lock_context
nfs_page_create_from_page  nfs_direct_read_schedule_iovec
nfs_file_direct_read  aio_read io_submit_one  __x64_sys_io_submit
do_syscall_64  entry_SYSCALL_64_after_hwframe  |  syscall
          64   6.40%  nfs_page_create  nfs_page_create_from_page
nfs_direct_read_schedule_iovec  nfs_file_direct_read  aio_read
io_submit_one  __x64_sys_io_submit  do_syscall_64
entry_SYSCALL_64_after_hwframe  |  syscall
          58   5.80%  nfs_direct_read_schedule_iovec
nfs_file_direct_read  aio_read  io_submit_one  __x64_sys_io_submit
do_syscall_64  entry_SYSCALL_64_after_hwframe  |  syscall
          39   3.90%  nfs_pageio_add_request
nfs_direct_read_schedule_iovec  nfs_file_direct_read  aio_read
io_submit_one  __x64_sys_io_submit  do_syscall_64
entry_SYSCALL_64_after_hwframe  |  syscall
          36   3.60%  kmem_cache_alloc_noprof  nfs_page_create
nfs_page_create_from_page  nfs_direct_read_schedule_iovec
nfs_file_direct_read  aio_read  io_submit_one  __x64_sys_io_submit
do_syscall_64  entry_SYSCALL_64_after_hwframe  |  syscall

Even when I create and use 2M huge pages as fio's backing memory
(adding --mem=mmaphuge --hugepage-size=2m), the distribution is still
as shown above.

I might be wrong, but even with fio's huge pages, the standard
upstream NFS client creates a struct nfs_page for every 4K chunk
(page_size), regardless of the backing page size.

Could 2M huge pages be implemented for the NFS client? That would
further improve performance when using NFSD direct reads.

Anton


* Re: [PATCH v12 3/3] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
  2025-11-21  9:20   ` Anton Gavriliuk
@ 2025-11-22 15:52     ` Chuck Lever
  0 siblings, 0 replies; 7+ messages in thread
From: Chuck Lever @ 2025-11-22 15:52 UTC (permalink / raw)
  To: Anton Gavriliuk, Trond Myklebust, Anna Schumaker
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Mike Snitzer

Comments on client behavior go to Trond and Anna.


-- 
Chuck Lever
