linux-nfs.vger.kernel.org archive mirror
* [PATCH v4 0/3] NFSD: additional NFSD Direct changes
@ 2025-11-05 17:42 Mike Snitzer
  2025-11-05 17:42 ` [PATCH v4 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE Mike Snitzer
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Mike Snitzer @ 2025-11-05 17:42 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Hi,

This series builds on top of what has been staged in nfsd-testing for
NFSD Direct.

For now, I elected to use io_cache_write to expose control over
stable_how when NFSD Direct is used, rather than a per-export control
that can work with any NFSD IO mode. It seemed best to approach it
this way until/unless there is a clear associated win.

But the benchmarking of the stable_how variants is still pending (the
performance cluster required to do that benchmarking is tied up with
higher-priority work at the moment, so I will need to circle back to
it).

Thanks,
Mike

Changes in v4:
- set flags_{buffered,direct} in nfsd_write_dio_iters_init()
- moving where flags_{buffered,direct} are set exposed that
  nfsd_issue_dio_write should just be folded into nfsd_direct_write
- patch 2 was just rebased on patch 1
- patch 3 is unchanged

Mike Snitzer (3):
  NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE
  NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst

 .../filesystems/nfs/nfsd-io-modes.rst         |  58 +++++-----
 fs/nfsd/debugfs.c                             |   7 +-
 fs/nfsd/nfsd.h                                |   2 +
 fs/nfsd/vfs.c                                 | 109 ++++++++++--------
 4 files changed, 100 insertions(+), 76 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v4 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE
  2025-11-05 17:42 [PATCH v4 0/3] NFSD: additional NFSD Direct changes Mike Snitzer
@ 2025-11-05 17:42 ` Mike Snitzer
  2025-11-05 18:47   ` Chuck Lever
  2025-11-07 15:29   ` Christoph Hellwig
  2025-11-05 17:42 ` [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how Mike Snitzer
  2025-11-05 17:42 ` [PATCH v4 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst Mike Snitzer
  2 siblings, 2 replies; 16+ messages in thread
From: Mike Snitzer @ 2025-11-05 17:42 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

NFSD_IO_DIRECT can easily improve streaming misaligned WRITE
performance if it uses buffered IO (without DONTCACHE) for the
misaligned end segment(s) and O_DIRECT for the aligned middle
segment's IO.

On one capable testbed, this commit improved streaming 47008-byte
write performance from 0.3433 GB/s to 1.26 GB/s.

This commit also merges nfsd_issue_dio_write into its only caller
(nfsd_direct_write).

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/vfs.c | 73 ++++++++++++++++++++++-----------------------------
 1 file changed, 31 insertions(+), 42 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 701dd261c252..a4700c917c72 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1296,12 +1296,13 @@ nfsd_write_dio_seg_init(struct nfsd_write_dio_seg *segment,
 
 static void
 nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
-			  loff_t offset, unsigned long total,
+			  struct kiocb *kiocb, unsigned long total,
 			  struct nfsd_write_dio_args *args)
 {
 	u32 offset_align = args->nf->nf_dio_offset_align;
 	u32 mem_align = args->nf->nf_dio_mem_align;
 	loff_t prefix_end, orig_end, middle_end;
+	loff_t offset = kiocb->ki_pos;
 	size_t prefix, middle, suffix;
 
 	args->nsegs = 0;
@@ -1347,6 +1348,8 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
 		++args->nsegs;
 	}
 
+	args->flags_buffered = kiocb->ki_flags;
+	args->flags_direct = kiocb->ki_flags | IOCB_DIRECT;
 	return;
 
 no_dio:
@@ -1354,39 +1357,14 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
 	nfsd_write_dio_seg_init(&args->segment[0], bvec, nvecs, total,
 				0, total);
 	args->nsegs = 1;
-}
 
-static int
-nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
-		     struct kiocb *kiocb, unsigned int nvecs,
-		     unsigned long *cnt, struct nfsd_write_dio_args *args)
-{
-	struct file *file = args->nf->nf_file;
-	ssize_t host_err;
-	unsigned int i;
-
-	nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb->ki_pos,
-				  *cnt, args);
-
-	*cnt = 0;
-	for (i = 0; i < args->nsegs; i++) {
-		if (args->segment[i].use_dio) {
-			kiocb->ki_flags = args->flags_direct;
-			trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
-						args->segment[i].iter.count);
-		} else
-			kiocb->ki_flags = args->flags_buffered;
-
-		host_err = vfs_iocb_iter_write(file, kiocb,
-					       &args->segment[i].iter);
-		if (host_err < 0)
-			return host_err;
-		*cnt += host_err;
-		if (host_err < args->segment[i].iter.count)
-			break;	/* partial write */
-	}
-
-	return 0;
+	/*
+	 * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
+	 * falling back to buffered IO if the entire WRITE is unaligned.
+	 */
+	args->flags_buffered = kiocb->ki_flags;
+	if (args->nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
+		args->flags_buffered |= IOCB_DONTCACHE;
 }
 
 static noinline_for_stack int
@@ -1395,20 +1373,31 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		  unsigned long *cnt, struct kiocb *kiocb)
 {
 	struct nfsd_write_dio_args args;
+	ssize_t host_err;
+	unsigned int i;
 
 	args.nf = nf;
+	nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb, *cnt, &args);
 
-	/*
-	 * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
-	 * writing unaligned segments or handling fallback I/O.
-	 */
-	args.flags_buffered = kiocb->ki_flags;
-	if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
-		args.flags_buffered |= IOCB_DONTCACHE;
+	*cnt = 0;
+	for (i = 0; i < args.nsegs; i++) {
+		if (args.segment[i].use_dio) {
+			kiocb->ki_flags = args.flags_direct;
+			trace_nfsd_write_direct(rqstp, fhp, kiocb->ki_pos,
+						args.segment[i].iter.count);
+		} else
+			kiocb->ki_flags = args.flags_buffered;
 
-	args.flags_direct = kiocb->ki_flags | IOCB_DIRECT;
+		host_err = vfs_iocb_iter_write(nf->nf_file, kiocb,
+					       &args.segment[i].iter);
+		if (host_err < 0)
+			return host_err;
+		*cnt += host_err;
+		if (host_err < args.segment[i].iter.count)
+			break;	/* partial write */
+	}
 
-	return nfsd_issue_dio_write(rqstp, fhp, kiocb, nvecs, cnt, &args);
+	return 0;
 }
 
 /**
-- 
2.44.0


* [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  2025-11-05 17:42 [PATCH v4 0/3] NFSD: additional NFSD Direct changes Mike Snitzer
  2025-11-05 17:42 ` [PATCH v4 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE Mike Snitzer
@ 2025-11-05 17:42 ` Mike Snitzer
  2025-11-05 18:49   ` Chuck Lever
  2025-11-07 15:30   ` Christoph Hellwig
  2025-11-05 17:42 ` [PATCH v4 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst Mike Snitzer
  2 siblings, 2 replies; 16+ messages in thread
From: Mike Snitzer @ 2025-11-05 17:42 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

NFSD_IO_DIRECT_WRITE_FILE_SYNC is direct IO with stable_how=NFS_FILE_SYNC.
NFSD_IO_DIRECT_WRITE_DATA_SYNC is direct IO with stable_how=NFS_DATA_SYNC.

The stable_how associated with each is a hint in the form of a "floor"
value for stable_how: if the client-provided stable_how is already a
higher value, it will not be changed.

These permutations of NFSD_IO_DIRECT allow experimenting with also
elevating stable_how and sending it back to the client, which for
NFSD_IO_DIRECT_WRITE_FILE_SYNC will cause the client to elide its
COMMIT.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/debugfs.c |  7 ++++++-
 fs/nfsd/nfsd.h    |  2 ++
 fs/nfsd/vfs.c     | 46 ++++++++++++++++++++++++++++++++++------------
 3 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 7f44689e0a53..8538e29ed2ab 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -68,7 +68,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
 	case NFSD_IO_DIRECT:
 		/*
 		 * Must disable splice_read when enabling
-		 * NFSD_IO_DONTCACHE.
+		 * NFSD_IO_DONTCACHE and NFSD_IO_DIRECT.
 		 */
 		nfsd_disable_splice_read = true;
 		nfsd_io_cache_read = val;
@@ -90,6 +90,9 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
  * Contents:
  *   %0: NFS WRITE will use buffered IO
  *   %1: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
+ *   %2: NFS WRITE will use direct IO with stable_how=NFS_UNSTABLE
+ *   %3: NFS WRITE will use direct IO with stable_how=NFS_DATA_SYNC
+ *   %4: NFS WRITE will use direct IO with stable_how=NFS_FILE_SYNC
  *
  * This setting takes immediate effect for all NFS versions,
  * all exports, and in all NFSD net namespaces.
@@ -109,6 +112,8 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
 	case NFSD_IO_BUFFERED:
 	case NFSD_IO_DONTCACHE:
 	case NFSD_IO_DIRECT:
+	case NFSD_IO_DIRECT_WRITE_DATA_SYNC:
+	case NFSD_IO_DIRECT_WRITE_FILE_SYNC:
 		nfsd_io_cache_write = val;
 		break;
 	default:
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index e4263326ca4a..10eca169392b 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -161,6 +161,8 @@ enum {
 	NFSD_IO_BUFFERED,
 	NFSD_IO_DONTCACHE,
 	NFSD_IO_DIRECT,
+	NFSD_IO_DIRECT_WRITE_DATA_SYNC,
+	NFSD_IO_DIRECT_WRITE_FILE_SYNC,
 };
 
 extern u64 nfsd_io_cache_read __read_mostly;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index a4700c917c72..1b61185e96a9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1367,15 +1367,45 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
 		args->flags_buffered |= IOCB_DONTCACHE;
 }
 
+static void
+nfsd_init_write_kiocb_from_stable(u32 stable_floor,
+				  struct kiocb *kiocb,
+				  u32 *stable_how)
+{
+	if (stable_floor < *stable_how)
+		return; /* stable_how already set higher */
+
+	*stable_how = stable_floor;
+
+	switch (stable_floor) {
+	case NFS_FILE_SYNC:
+		/* persist data and timestamps */
+		kiocb->ki_flags |= IOCB_DSYNC | IOCB_SYNC;
+		break;
+	case NFS_DATA_SYNC:
+		/* persist data only */
+		kiocb->ki_flags |= IOCB_DSYNC;
+		break;
+	}
+}
+
 static noinline_for_stack int
 nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		  struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
 		  unsigned long *cnt, struct kiocb *kiocb)
 {
+	u32 stable_floor = NFS_UNSTABLE;
 	struct nfsd_write_dio_args args;
 	ssize_t host_err;
 	unsigned int i;
 
+	if (nfsd_io_cache_write == NFSD_IO_DIRECT_WRITE_FILE_SYNC)
+		stable_floor = NFS_FILE_SYNC;
+	else if (nfsd_io_cache_write == NFSD_IO_DIRECT_WRITE_DATA_SYNC)
+		stable_floor = NFS_DATA_SYNC;
+	if (stable_floor != NFS_UNSTABLE)
+		nfsd_init_write_kiocb_from_stable(stable_floor, kiocb,
+						  stable_how);
 	args.nf = nf;
 	nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb, *cnt, &args);
 
@@ -1461,18 +1491,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		stable = NFS_UNSTABLE;
 	init_sync_kiocb(&kiocb, file);
 	kiocb.ki_pos = offset;
-	if (likely(!fhp->fh_use_wgather)) {
-		switch (stable) {
-		case NFS_FILE_SYNC:
-			/* persist data and timestamps */
-			kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
-			break;
-		case NFS_DATA_SYNC:
-			/* persist data only */
-			kiocb.ki_flags |= IOCB_DSYNC;
-			break;
-		}
-	}
+	if (likely(!fhp->fh_use_wgather))
+		nfsd_init_write_kiocb_from_stable(stable, &kiocb, stable_how);
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
 
@@ -1482,6 +1502,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 
 	switch (nfsd_io_cache_write) {
 	case NFSD_IO_DIRECT:
+	case NFSD_IO_DIRECT_WRITE_DATA_SYNC:
+	case NFSD_IO_DIRECT_WRITE_FILE_SYNC:
 		host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
 					     nvecs, cnt, &kiocb);
 		stable = *stable_how;
-- 
2.44.0


* [PATCH v4 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst
  2025-11-05 17:42 [PATCH v4 0/3] NFSD: additional NFSD Direct changes Mike Snitzer
  2025-11-05 17:42 ` [PATCH v4 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE Mike Snitzer
  2025-11-05 17:42 ` [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how Mike Snitzer
@ 2025-11-05 17:42 ` Mike Snitzer
  2025-11-05 18:50   ` Chuck Lever
  2 siblings, 1 reply; 16+ messages in thread
From: Mike Snitzer @ 2025-11-05 17:42 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Also fixed some typos.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 .../filesystems/nfs/nfsd-io-modes.rst         | 58 ++++++++++---------
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
index 4863885c7035..29b84c9c9e25 100644
--- a/Documentation/filesystems/nfs/nfsd-io-modes.rst
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -21,17 +21,20 @@ NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
 
 Based on the configured settings, NFSD's IO will either be:
 - cached using page cache (NFSD_IO_BUFFERED=0)
-- cached but removed from the page cache upon completion
-  (NFSD_IO_DONTCACHE=1).
-- not cached (NFSD_IO_DIRECT=2)
+- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
+- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
+- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
+- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
 
-To set an NFSD IO mode, write a supported value (0, 1 or 2) to the
+To set an NFSD IO mode, write a supported value (0 - 4) to the
 corresponding IO operation's debugfs interface, e.g.:
   echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+  echo 4 > /sys/kernel/debug/nfsd/io_cache_write
 
 To check which IO mode NFSD is using for READ or WRITE, simply read the
 corresponding IO operation's debugfs interface, e.g.:
   cat /sys/kernel/debug/nfsd/io_cache_read
+  cat /sys/kernel/debug/nfsd/io_cache_write
 
 NFSD DONTCACHE
 ==============
@@ -46,10 +49,10 @@ DONTCACHE aims to avoid what has proven to be a fairly significant
 limition of Linux's memory management subsystem if/when large amounts of
 data is infrequently accessed (e.g. read once _or_ written once but not
 read until much later). Such use-cases are particularly problematic
-because the page cache will eventually become a bottleneck to surfacing
+because the page cache will eventually become a bottleneck to servicing
 new IO requests.
 
-For more context, please see these Linux commit headers:
+For more context on DONTCACHE, please see these Linux commit headers:
 - Overview:  9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
   to take a struct kiocb")
 - for READ:  8026e49bff9b1 ("mm/filemap: add read support for
@@ -73,12 +76,18 @@ those with a working set that is significantly larger than available
 system memory. The pathological worst-case workload that NFSD DIRECT has
 proven to help most is: NFS client issuing large sequential IO to a file
 that is 2-3 times larger than the NFS server's available system memory.
+The reason for such improvement is NFSD DIRECT eliminates a lot of work
+that the memory management subsystem would otherwise be required to
+perform (e.g. page allocation, dirty writeback, page reclaim). When
+using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
+time trying to find adequate free pages so that forward IO progress can
+be made.
 
 The performance win associated with using NFSD DIRECT was previously
 discussed on linux-nfs, see:
 https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
 But in summary:
-- NFSD DIRECT can signicantly reduce memory requirements
+- NFSD DIRECT can significantly reduce memory requirements
 - NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
 - NFSD DIRECT can offer more deterministic IO performance
 
@@ -91,11 +100,11 @@ to generate a "flamegraph" for work Linux must perform on behalf of your
 test is a really meaningful way to compare the relative health of the
 system and how switching NFSD's IO mode changes what is observed.
 
-If NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
-interfaces, ideally the IO will be aligned relative to the underlying
-block device's logical_block_size. Also the memory buffer used to store
-the READ or WRITE payload must be aligned relative to the underlying
-block device's dma_alignment.
+If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
+NFSD's debugfs interfaces, ideally the IO will be aligned relative to
+the underlying block device's logical_block_size. Also the memory buffer
+used to store the READ or WRITE payload must be aligned relative to the
+underlying block device's dma_alignment.
 
 But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
 it can:
@@ -113,32 +122,29 @@ Misaligned READ:
 
 Misaligned WRITE:
     If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
-    middle and end as needed. The large middle extent is DIO-aligned and
-    the start and/or end are misaligned. Buffered IO is used for the
-    misaligned extents and O_DIRECT is used for the middle DIO-aligned
-    extent.
-
-    If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
-    invalidate the page cache on behalf of the DIO WRITE, then
-    nfsd_issue_write_dio() will fall back to using buffered IO.
+    middle and end as needed. The large middle segment is DIO-aligned
+    and the start and/or end are misaligned. Buffered IO is used for the
+    misaligned segments and O_DIRECT is used for the middle DIO-aligned
+    segment. DONTCACHE buffered IO is _not_ used for the misaligned
+    segments because using normal buffered IO offers significant RMW
+    performance benefit when handling streaming misaligned WRITEs.
 
 Tracing:
-    The nfsd_analyze_read_dio trace event shows how NFSD expands any
+    The nfsd_read_direct trace event shows how NFSD expands any
     misaligned READ to the next DIO-aligned block (on either end of the
     original READ, as needed).
 
     This combination of trace events is useful for READs:
     echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
-    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
     echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
     echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
 
-    The nfsd_analyze_write_dio trace event shows how NFSD splits a given
-    misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
-    extent.
+    The nfsd_write_direct trace event shows how NFSD splits a given
+    misaligned WRITE into a DIO-aligned middle segment.
 
     This combination of trace events is useful for WRITEs:
     echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
-    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
     echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
     echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
-- 
2.44.0


* Re: [PATCH v4 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE
  2025-11-05 17:42 ` [PATCH v4 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE Mike Snitzer
@ 2025-11-05 18:47   ` Chuck Lever
  2025-11-07 15:29   ` Christoph Hellwig
  1 sibling, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2025-11-05 18:47 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs

On 11/5/25 12:42 PM, Mike Snitzer wrote:
> NFSD_IO_DIRECT can easily improve streaming misaligned WRITE
> performance if it uses buffered IO (without DONTCACHE) for the
> misaligned end segment(s) and O_DIRECT for the aligned middle
> segment's IO.
> 
> On one capable testbed, this commit improved streaming 47008 byte
> write performance from 0.3433 GB/s to 1.26 GB/s.
> 
> This commit also merges nfsd_issue_dio_write into its only caller
> (nfsd_direct_write).
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> [ quoted patch trimmed ]


This is not at all what I had in mind. I'll fold something into the
jumbo patch roll-up for v10.


-- 
Chuck Lever

* Re: [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  2025-11-05 17:42 ` [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how Mike Snitzer
@ 2025-11-05 18:49   ` Chuck Lever
  2025-11-06 20:17     ` Mike Snitzer
  2025-11-07 15:30   ` Christoph Hellwig
  1 sibling, 1 reply; 16+ messages in thread
From: Chuck Lever @ 2025-11-05 18:49 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs

On 11/5/25 12:42 PM, Mike Snitzer wrote:
> NFSD_IO_DIRECT_WRITE_FILE_SYNC is direct IO with stable_how=NFS_FILE_SYNC.
> NFSD_IO_DIRECT_WRITE_DATA_SYNC is direct IO with stable_how=NFS_DATA_SYNC.
> 
> The stable_how associated with each is a hint in the form of a "floor"
> value for stable_how.  Meaning if the client provided stable_how is
> already of higher value it will not be changed.
> 
> These permutations of NFSD_IO_DIRECT allow to experiment with also
> elevating stable_how and sending it back to the client.  Which for
> NFSD_IO_DIRECT_WRITE_FILE_SYNC will cause the client to elide its
> COMMIT.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> [ quoted patch trimmed ]


I asked for the use of a file_sync export option because we need to test
the BUFFERED cache mode as well as DIRECT. So, continue to experiment
with this one, but I don't plan to merge it for now.


-- 
Chuck Lever

* Re: [PATCH v4 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst
  2025-11-05 17:42 ` [PATCH v4 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst Mike Snitzer
@ 2025-11-05 18:50   ` Chuck Lever
  0 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2025-11-05 18:50 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs

On 11/5/25 12:42 PM, Mike Snitzer wrote:
> Also fixed some typos.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  .../filesystems/nfs/nfsd-io-modes.rst         | 58 ++++++++++---------
>  1 file changed, 32 insertions(+), 26 deletions(-)
> 
> diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> index 4863885c7035..29b84c9c9e25 100644
> --- a/Documentation/filesystems/nfs/nfsd-io-modes.rst
> +++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> @@ -21,17 +21,20 @@ NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
>  
>  Based on the configured settings, NFSD's IO will either be:
>  - cached using page cache (NFSD_IO_BUFFERED=0)
> -- cached but removed from the page cache upon completion
> -  (NFSD_IO_DONTCACHE=1).
> -- not cached (NFSD_IO_DIRECT=2)
> +- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
> +- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
> +- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
> +- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
>  
> -To set an NFSD IO mode, write a supported value (0, 1 or 2) to the
> +To set an NFSD IO mode, write a supported value (0 - 4) to the
>  corresponding IO operation's debugfs interface, e.g.:
>    echo 2 > /sys/kernel/debug/nfsd/io_cache_read
> +  echo 4 > /sys/kernel/debug/nfsd/io_cache_write
>  
>  To check which IO mode NFSD is using for READ or WRITE, simply read the
>  corresponding IO operation's debugfs interface, e.g.:
>    cat /sys/kernel/debug/nfsd/io_cache_read
> +  cat /sys/kernel/debug/nfsd/io_cache_write
>  
>  NFSD DONTCACHE
>  ==============
> @@ -46,10 +49,10 @@ DONTCACHE aims to avoid what has proven to be a fairly significant
>  limition of Linux's memory management subsystem if/when large amounts of
>  data is infrequently accessed (e.g. read once _or_ written once but not
>  read until much later). Such use-cases are particularly problematic
> -because the page cache will eventually become a bottleneck to surfacing
> +because the page cache will eventually become a bottleneck to servicing
>  new IO requests.
>  
> -For more context, please see these Linux commit headers:
> +For more context on DONTCACHE, please see these Linux commit headers:
>  - Overview:  9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
>    to take a struct kiocb")
>  - for READ:  8026e49bff9b1 ("mm/filemap: add read support for
> @@ -73,12 +76,18 @@ those with a working set that is significantly larger than available
>  system memory. The pathological worst-case workload that NFSD DIRECT has
>  proven to help most is: NFS client issuing large sequential IO to a file
>  that is 2-3 times larger than the NFS server's available system memory.
> +The reason for such improvement is NFSD DIRECT eliminates a lot of work
> +that the memory management subsystem would otherwise be required to
> +perform (e.g. page allocation, dirty writeback, page reclaim). When
> +using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
> +time trying to find adequate free pages so that forward IO progress can
> +be made.
>  
>  The performance win associated with using NFSD DIRECT was previously
>  discussed on linux-nfs, see:
>  https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
>  But in summary:
> -- NFSD DIRECT can signicantly reduce memory requirements
> +- NFSD DIRECT can significantly reduce memory requirements
>  - NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
>  - NFSD DIRECT can offer more deterministic IO performance
>  
> @@ -91,11 +100,11 @@ to generate a "flamegraph" for work Linux must perform on behalf of your
>  test is a really meaningful way to compare the relative health of the
>  system and how switching NFSD's IO mode changes what is observed.
>  
> -If NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
> -interfaces, ideally the IO will be aligned relative to the underlying
> -block device's logical_block_size. Also the memory buffer used to store
> -the READ or WRITE payload must be aligned relative to the underlying
> -block device's dma_alignment.
> +If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
> +NFSD's debugfs interfaces, ideally the IO will be aligned relative to
> +the underlying block device's logical_block_size. Also the memory buffer
> +used to store the READ or WRITE payload must be aligned relative to the
> +underlying block device's dma_alignment.
>  
>  But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
>  it can:
> @@ -113,32 +122,29 @@ Misaligned READ:
>  
>  Misaligned WRITE:
>      If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> -    middle and end as needed. The large middle extent is DIO-aligned and
> -    the start and/or end are misaligned. Buffered IO is used for the
> -    misaligned extents and O_DIRECT is used for the middle DIO-aligned
> -    extent.
> -
> -    If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> -    invalidate the page cache on behalf of the DIO WRITE, then
> -    nfsd_issue_write_dio() will fall back to using buffered IO.
> +    middle and end as needed. The large middle segment is DIO-aligned
> +    and the start and/or end are misaligned. Buffered IO is used for the
> +    misaligned segments and O_DIRECT is used for the middle DIO-aligned
> +    segment. DONTCACHE buffered IO is _not_ used for the misaligned
> +    segments because using normal buffered IO offers significant RMW
> +    performance benefit when handling streaming misaligned WRITEs.
>  
>  Tracing:
> -    The nfsd_analyze_read_dio trace event shows how NFSD expands any
> +    The nfsd_read_direct trace event shows how NFSD expands any
>      misaligned READ to the next DIO-aligned block (on either end of the
>      original READ, as needed).
>  
>      This combination of trace events is useful for READs:
>      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
> -    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
> +    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
>      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
>      echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
>  
> -    The nfsd_analyze_write_dio trace event shows how NFSD splits a given
> -    misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
> -    extent.
> +    The nfsd_write_direct trace event shows how NFSD splits a given
> +    misaligned WRITE into a DIO-aligned middle segment.
>  
>      This combination of trace events is useful for WRITEs:
>      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
> -    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
> +    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
>      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
>      echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable

I've already squashed the previous version of this patch into my private
tree... if you confirm there were no changes, I'll leave this one for
now.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  2025-11-05 18:49   ` Chuck Lever
@ 2025-11-06 20:17     ` Mike Snitzer
  2025-11-06 20:35       ` Chuck Lever
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Snitzer @ 2025-11-06 20:17 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, linux-nfs

On Wed, Nov 05, 2025 at 01:49:29PM -0500, Chuck Lever wrote:
> On 11/5/25 12:42 PM, Mike Snitzer wrote:
> > NFSD_IO_DIRECT_WRITE_FILE_SYNC is direct IO with stable_how=NFS_FILE_SYNC.
> > NFSD_IO_DIRECT_WRITE_DATA_SYNC is direct IO with stable_how=NFS_DATA_SYNC.
> > 
> > The stable_how associated with each is a hint in the form of a "floor"
> > value for stable_how.  Meaning if the client provided stable_how is
> > already of higher value it will not be changed.
> > 
> > These permutations of NFSD_IO_DIRECT allow to experiment with also
> > elevating stable_how and sending it back to the client.  Which for
> > NFSD_IO_DIRECT_WRITE_FILE_SYNC will cause the client to elide its
> > COMMIT.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> >  fs/nfsd/debugfs.c |  7 ++++++-
> >  fs/nfsd/nfsd.h    |  2 ++
> >  fs/nfsd/vfs.c     | 46 ++++++++++++++++++++++++++++++++++------------
> >  3 files changed, 42 insertions(+), 13 deletions(-)
> > 
> > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > index 7f44689e0a53..8538e29ed2ab 100644
> > --- a/fs/nfsd/debugfs.c
> > +++ b/fs/nfsd/debugfs.c
> > @@ -68,7 +68,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
> >  	case NFSD_IO_DIRECT:
> >  		/*
> >  		 * Must disable splice_read when enabling
> > -		 * NFSD_IO_DONTCACHE.
> > +		 * NFSD_IO_DONTCACHE and NFSD_IO_DIRECT.
> >  		 */
> >  		nfsd_disable_splice_read = true;
> >  		nfsd_io_cache_read = val;
> > @@ -90,6 +90,9 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
> >   * Contents:
> >   *   %0: NFS WRITE will use buffered IO
> >   *   %1: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
> > + *   %2: NFS WRITE will use direct IO with stable_how=NFS_UNSTABLE
> > + *   %3: NFS WRITE will use direct IO with stable_how=NFS_DATA_SYNC
> > + *   %4: NFS WRITE will use direct IO with stable_how=NFS_FILE_SYNC
> >   *
> >   * This setting takes immediate effect for all NFS versions,
> >   * all exports, and in all NFSD net namespaces.
> > @@ -109,6 +112,8 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
> >  	case NFSD_IO_BUFFERED:
> >  	case NFSD_IO_DONTCACHE:
> >  	case NFSD_IO_DIRECT:
> > +	case NFSD_IO_DIRECT_WRITE_DATA_SYNC:
> > +	case NFSD_IO_DIRECT_WRITE_FILE_SYNC:
> >  		nfsd_io_cache_write = val;
> >  		break;
> >  	default:
> > diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> > index e4263326ca4a..10eca169392b 100644
> > --- a/fs/nfsd/nfsd.h
> > +++ b/fs/nfsd/nfsd.h
> > @@ -161,6 +161,8 @@ enum {
> >  	NFSD_IO_BUFFERED,
> >  	NFSD_IO_DONTCACHE,
> >  	NFSD_IO_DIRECT,
> > +	NFSD_IO_DIRECT_WRITE_DATA_SYNC,
> > +	NFSD_IO_DIRECT_WRITE_FILE_SYNC,
> >  };
> >  
> >  extern u64 nfsd_io_cache_read __read_mostly;
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index a4700c917c72..1b61185e96a9 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -1367,15 +1367,45 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
> >  		args->flags_buffered |= IOCB_DONTCACHE;
> >  }
> >  
> > +static void
> > +nfsd_init_write_kiocb_from_stable(u32 stable_floor,
> > +				  struct kiocb *kiocb,
> > +				  u32 *stable_how)
> > +{
> > +	if (stable_floor < *stable_how)
> > +		return; /* stable_how already set higher */
> > +
> > +	*stable_how = stable_floor;
> > +
> > +	switch (stable_floor) {
> > +	case NFS_FILE_SYNC:
> > +		/* persist data and timestamps */
> > +		kiocb->ki_flags |= IOCB_DSYNC | IOCB_SYNC;
> > +		break;
> > +	case NFS_DATA_SYNC:
> > +		/* persist data only */
> > +		kiocb->ki_flags |= IOCB_DSYNC;
> > +		break;
> > +	}
> > +}
> > +
> >  static noinline_for_stack int
> >  nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  		  struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
> >  		  unsigned long *cnt, struct kiocb *kiocb)
> >  {
> > +	u32 stable_floor = NFS_UNSTABLE;
> >  	struct nfsd_write_dio_args args;
> >  	ssize_t host_err;
> >  	unsigned int i;
> >  
> > +	if (nfsd_io_cache_write == NFSD_IO_DIRECT_WRITE_FILE_SYNC)
> > +		stable_floor = NFS_FILE_SYNC;
> > +	else if (nfsd_io_cache_write == NFSD_IO_DIRECT_WRITE_DATA_SYNC)
> > +		stable_floor = NFS_DATA_SYNC;
> > +	if (stable_floor != NFS_UNSTABLE)
> > +		nfsd_init_write_kiocb_from_stable(stable_floor, kiocb,
> > +						  stable_how);
> >  	args.nf = nf;
> >  	nfsd_write_dio_iters_init(rqstp->rq_bvec, nvecs, kiocb, *cnt, &args);
> >  
> > @@ -1461,18 +1491,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  		stable = NFS_UNSTABLE;
> >  	init_sync_kiocb(&kiocb, file);
> >  	kiocb.ki_pos = offset;
> > -	if (likely(!fhp->fh_use_wgather)) {
> > -		switch (stable) {
> > -		case NFS_FILE_SYNC:
> > -			/* persist data and timestamps */
> > -			kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
> > -			break;
> > -		case NFS_DATA_SYNC:
> > -			/* persist data only */
> > -			kiocb.ki_flags |= IOCB_DSYNC;
> > -			break;
> > -		}
> > -	}
> > +	if (likely(!fhp->fh_use_wgather))
> > +		nfsd_init_write_kiocb_from_stable(stable, &kiocb, stable_how);
> >  
> >  	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> >  
> > @@ -1482,6 +1502,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  
> >  	switch (nfsd_io_cache_write) {
> >  	case NFSD_IO_DIRECT:
> > +	case NFSD_IO_DIRECT_WRITE_DATA_SYNC:
> > +	case NFSD_IO_DIRECT_WRITE_FILE_SYNC:
> >  		host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
> >  					     nvecs, cnt, &kiocb);
> >  		stable = *stable_how;
> 
> 
> I asked for the use of a file_sync export option because we need to test
> the BUFFERED cache mode as well as DIRECT. So, continue to experiment
> with this one, but I don't plan to merge it for now.

Doesn't the client have the ability to control NFS_UNSTABLE,
NFS_DATA_SYNC and NFS_FILE_SYNC already?  What experiment are you
looking to run?

If just looking to compare NFS_FILE_SYNC performance of
NFSD_IO_BUFFERED versus NFSD_IO_DIRECT then using the client control
is fine right?

Anyway, maybe I'm just being overly concerned about the permanence of
an export option.  I thought it best to avoid export for now given we
do seem to have adequate controls for a NFS_FILE_SYNC performance
bakeoff.

Here is a rebased patch that applies on top of Christoph's cleanup and
my incremental Documentation patch.  I would appreciate us exposing
this NFSD stable_how "floor" control so others can try it.  But if this
still isn't OK, due to it being tied to the NFSD_IO_DIRECT debugfs
knobs, then I can pursue a generic export option that works for all
NFSD IO modes.

Thanks,
Mike

From: Mike Snitzer <snitzer@kernel.org>
Date: Thu, 30 Oct 2025 17:53:09 -0400
Subject: [PATCH v4 rebased] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how

NFSD_IO_DIRECT_WRITE_FILE_SYNC is direct IO with stable_how=NFS_FILE_SYNC.
NFSD_IO_DIRECT_WRITE_DATA_SYNC is direct IO with stable_how=NFS_DATA_SYNC.

The stable_how associated with each is a hint in the form of a "floor"
value for stable_how: if the client-provided stable_how is already of
higher value, it will not be changed.

These permutations of NFSD_IO_DIRECT allow experimenting with elevating
stable_how and sending it back to the client, which for
NFSD_IO_DIRECT_WRITE_FILE_SYNC will cause the client to elide its
COMMIT.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 .../filesystems/nfs/nfsd-io-modes.rst         |  6 ++-
 fs/nfsd/debugfs.c                             |  7 ++-
 fs/nfsd/nfsd.h                                |  2 +
 fs/nfsd/vfs.c                                 | 54 +++++++++++++------
 4 files changed, 51 insertions(+), 18 deletions(-)

diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
index e3a522d09766..a2194ec45c76 100644
--- a/Documentation/filesystems/nfs/nfsd-io-modes.rst
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -23,11 +23,13 @@ Based on the configured settings, NFSD's IO will either be:
 - cached using page cache (NFSD_IO_BUFFERED=0)
 - cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
 - not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
+- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
+- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
 
-To set an NFSD IO mode, write a supported value (0 - 2) to the
+To set an NFSD IO mode, write a supported value (0 - 4) to the
 corresponding IO operation's debugfs interface, e.g.:
   echo 2 > /sys/kernel/debug/nfsd/io_cache_read
-  echo 2 > /sys/kernel/debug/nfsd/io_cache_write
+  echo 4 > /sys/kernel/debug/nfsd/io_cache_write
 
 To check which IO mode NFSD is using for READ or WRITE, simply read the
 corresponding IO operation's debugfs interface, e.g.:
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 7f44689e0a53..8538e29ed2ab 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -68,7 +68,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
 	case NFSD_IO_DIRECT:
 		/*
 		 * Must disable splice_read when enabling
-		 * NFSD_IO_DONTCACHE.
+		 * NFSD_IO_DONTCACHE and NFSD_IO_DIRECT.
 		 */
 		nfsd_disable_splice_read = true;
 		nfsd_io_cache_read = val;
@@ -90,6 +90,9 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
  * Contents:
  *   %0: NFS WRITE will use buffered IO
  *   %1: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
+ *   %2: NFS WRITE will use direct IO with stable_how=NFS_UNSTABLE
+ *   %3: NFS WRITE will use direct IO with stable_how=NFS_DATA_SYNC
+ *   %4: NFS WRITE will use direct IO with stable_how=NFS_FILE_SYNC
  *
  * This setting takes immediate effect for all NFS versions,
  * all exports, and in all NFSD net namespaces.
@@ -109,6 +112,8 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
 	case NFSD_IO_BUFFERED:
 	case NFSD_IO_DONTCACHE:
 	case NFSD_IO_DIRECT:
+	case NFSD_IO_DIRECT_WRITE_DATA_SYNC:
+	case NFSD_IO_DIRECT_WRITE_FILE_SYNC:
 		nfsd_io_cache_write = val;
 		break;
 	default:
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index e4263326ca4a..10eca169392b 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -161,6 +161,8 @@ enum {
 	NFSD_IO_BUFFERED,
 	NFSD_IO_DONTCACHE,
 	NFSD_IO_DIRECT,
+	NFSD_IO_DIRECT_WRITE_DATA_SYNC,
+	NFSD_IO_DIRECT_WRITE_FILE_SYNC,
 };
 
 extern u64 nfsd_io_cache_read __read_mostly;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 326cf6f717b3..101c18d79208 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1353,16 +1353,47 @@ nfsd_write_dio_iters_init(struct nfsd_file *nf, struct bio_vec *bvec,
 	return 1;
 }
 
+static void
+nfsd_init_write_kiocb_from_stable(u32 stable_floor,
+				  struct kiocb *kiocb,
+				  u32 *stable_how)
+{
+	if (stable_floor < *stable_how)
+		return; /* stable_how already set higher */
+
+	*stable_how = stable_floor;
+
+	switch (stable_floor) {
+	case NFS_FILE_SYNC:
+		/* persist data and timestamps */
+		kiocb->ki_flags |= IOCB_DSYNC | IOCB_SYNC;
+		break;
+	case NFS_DATA_SYNC:
+		/* persist data only */
+		kiocb->ki_flags |= IOCB_DSYNC;
+		break;
+	}
+}
+
 static noinline_for_stack int
 nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
-		  struct nfsd_file *nf, unsigned int nvecs,
+		  struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
 		  unsigned long *cnt, struct kiocb *kiocb)
 {
+	u32 stable_floor = NFS_UNSTABLE;
 	struct file *file = nf->nf_file;
 	struct nfsd_write_dio_seg segments[3];
 	unsigned int nsegs = 0, i;
 	ssize_t host_err;
 
+	if (nfsd_io_cache_write == NFSD_IO_DIRECT_WRITE_FILE_SYNC)
+		stable_floor = NFS_FILE_SYNC;
+	else if (nfsd_io_cache_write == NFSD_IO_DIRECT_WRITE_DATA_SYNC)
+		stable_floor = NFS_DATA_SYNC;
+	if (stable_floor != NFS_UNSTABLE)
+		nfsd_init_write_kiocb_from_stable(stable_floor, kiocb,
+						  stable_how);
+
 	nsegs = nfsd_write_dio_iters_init(nf, rqstp->rq_bvec, nvecs,
 			kiocb, *cnt, segments);
 
@@ -1445,18 +1476,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		stable = NFS_UNSTABLE;
 	init_sync_kiocb(&kiocb, file);
 	kiocb.ki_pos = offset;
-	if (likely(!fhp->fh_use_wgather)) {
-		switch (stable) {
-		case NFS_FILE_SYNC:
-			/* persist data and timestamps */
-			kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
-			break;
-		case NFS_DATA_SYNC:
-			/* persist data only */
-			kiocb.ki_flags |= IOCB_DSYNC;
-			break;
-		}
-	}
+	if (likely(!fhp->fh_use_wgather))
+		nfsd_init_write_kiocb_from_stable(stable, &kiocb, stable_how);
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
 
@@ -1466,8 +1487,11 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 
 	switch (nfsd_io_cache_write) {
 	case NFSD_IO_DIRECT:
-		host_err = nfsd_direct_write(rqstp, fhp, nf, nvecs,
-					     cnt, &kiocb);
+	case NFSD_IO_DIRECT_WRITE_DATA_SYNC:
+	case NFSD_IO_DIRECT_WRITE_FILE_SYNC:
+		host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
+					     nvecs, cnt, &kiocb);
+		stable = *stable_how;
 		break;
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  2025-11-06 20:17     ` Mike Snitzer
@ 2025-11-06 20:35       ` Chuck Lever
  2025-11-06 22:56         ` Mike Snitzer
  0 siblings, 1 reply; 16+ messages in thread
From: Chuck Lever @ 2025-11-06 20:35 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jeff Layton, linux-nfs

On 11/6/25 3:17 PM, Mike Snitzer wrote:
>> I asked for the use of a file_sync export option because we need to test
>> the BUFFERED cache mode as well as DIRECT. So, continue to experiment
>> with this one, but I don't plan to merge it for now.
> Doesn't the client have the ability to control NFS_UNSTABLE,
> NFS_DATA_SYNC and NFS_FILE_SYNC already?  What experiment are you
> looking to run?
> 
> If just looking to compare NFS_FILE_SYNC performance of
> NFSD_IO_BUFFERED versus NFSD_IO_DIRECT then using the client control
> is fine right?

Not necessarily. You can mount with "sync" but for large application
write requests, that might still generate UNSTABLE + immediate COMMIT.


> Anyway, maybe I'm just being overly concerned about the permanence of
> an export option.  I thought it best to avoid export for now given we
> do seem to have adequate controls for a NFS_FILE_SYNC performance
> bakeoff.

I agree on not rushing another administrative API change. I was thinking
you would be prototyping some of this stuff first and playing with it
for a little while before it goes upstream.

It feels like it would be much more straightforward to implement an
export option that applies to all cache modes rather than gluing it to
DIRECT.

And as I said above, "no plan to merge it for now," meaning it's still
on the table for sometime down the road. I have some other ideas I'm
cooking up, such as using BDI congestion to control NFS WRITE
throttling.

But let's get the base direct WRITE stuff finished.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  2025-11-06 20:35       ` Chuck Lever
@ 2025-11-06 22:56         ` Mike Snitzer
  2025-11-07 14:48           ` Chuck Lever
  2025-11-07 15:34           ` Christoph Hellwig
  0 siblings, 2 replies; 16+ messages in thread
From: Mike Snitzer @ 2025-11-06 22:56 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, linux-nfs

On Thu, Nov 06, 2025 at 03:35:29PM -0500, Chuck Lever wrote:
> On 11/6/25 3:17 PM, Mike Snitzer wrote:
> >> I asked for the use of a file_sync export option because we need to test
> >> the BUFFERED cache mode as well as DIRECT. So, continue to experiment
> >> with this one, but I don't plan to merge it for now.
> > Doesn't the client have the ability to control NFS_UNSTABLE,
> > NFS_DATA_SYNC and NFS_FILE_SYNC already?  What experiment are you
> > looking to run?
> > 
> > If just looking to compare NFS_FILE_SYNC performance of
> > NFSD_IO_BUFFERED versus NFSD_IO_DIRECT then using the client control
> > is fine right?
> 
> Not necessarily. You can mount with "sync" but for large application
> write requests, that might still generate UNSTABLE + immediate COMMIT.

OK, I'll take a closer look at NFS client controls for stable_how,
because NFSD clearly handles NFS_DATA_SYNC and NFS_FILE_SYNC, so I just
assumed it's because the client does actually send them.

> > Anyway, maybe I'm just being overly concerned about the permanence of
> > an export option.  I thought it best to avoid export for now given we
> > do seem to have adequate controls for a NFS_FILE_SYNC performance
> > bakeoff.
> 
> I agree on not rushing another administrative API change. I was thinking
> you would be prototyping some of this stuff first and playing with it
> for a little while before it goes upstream.
> 
> It feels like it would be much more straightforward to implement an
> export option that applies to all cache modes rather than gluing it to
> DIRECT.

OK, I'll carry the patch in the Hammerspace kernel for the time being so
Jon Flynn can test whenever he has availability.

> And as I said above, "no plan to merge it for now," meaning it's still
> on the table for sometime down the road. I have some other ideas I'm
> cooking up, such as using BDI congestion to control NFS WRITE
> throttling.

Hmm, I thought the BDI congestion infra got killed (by Jan Kara and
others).. which made me sad because when it was first introduced it
was amazing at solving some complex deadlocks (but we're talking 20
years ago now).  So I haven't kept my finger on the pulse of what is
still available to us relative to BDI congestion.

Another possibility is the PSI data; that infra was put in place by
Meta to deal with exactly the situation we're confronting: take
action(s) when pressure builds to a particular threshold.

> But let's get the base direct WRITE stuff finished.

Yes.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  2025-11-06 22:56         ` Mike Snitzer
@ 2025-11-07 14:48           ` Chuck Lever
  2025-11-07 15:34           ` Christoph Hellwig
  1 sibling, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2025-11-07 14:48 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jeff Layton, linux-nfs

On 11/6/25 5:56 PM, Mike Snitzer wrote:
> On Thu, Nov 06, 2025 at 03:35:29PM -0500, Chuck Lever wrote:
>> On 11/6/25 3:17 PM, Mike Snitzer wrote:
>>>> I asked for the use of a file_sync export option because we need to test
>>>> the BUFFERED cache mode as well as DIRECT. So, continue to experiment
>>>> with this one, but I don't plan to merge it for now.
>>> Doesn't the client have the ability to control NFS_UNSTABLE,
>>> NFS_DATA_SYNC and NFS_FILE_SYNC already?  What experiment are you
>>> looking to run?
>>>
>>> If just looking to compare NFS_FILE_SYNC performance of
>>> NFSD_IO_BUFFERED versus NFSD_IO_DIRECT then using the client control
>>> is fine right?
>>
>> Not necessarily. You can mount with "sync" but for large application
>> write requests, that might still generate UNSTABLE + immediate COMMIT.
> 
> OK, I'll take a closer look at NFS client controls for stable_how,
> because NFSD clearly handles NFS_DATA_SYNC and NFS_FILE_SYNC so I just
> assumed its because the client does actually send them.

Clients do send them sometimes. The question here is what happens if
NFSD promotes all of them unconditionally. I'm not sure a client can
be made to send all FILE_SYNC (even the big writes) without code
changes.


>> And as I said above, "no plan to merge it for now," meaning it's still
>> on the table for sometime down the road. I have some other ideas I'm
>> cooking up, such as using BDI congestion to control NFS WRITE
>> throttling.
> 
> Hmm, I thought the BDI congestion infra got killed (by Jan Kara and
> others).. which made me sad because when it was first introduced it
> was amazing at solving some complex deadlocks (but we're talking 20
> years ago now).  So I haven't kept my finger on the pulse of what is
> still available to us relative to BDI congestion.

My (naive) audit suggests that, under memory pressure, the generic
write path will throttle writes by delaying the return from the
function (balance_dirty_pages and friends). What we might be seeing is
that this throttling is not aggressive enough to serve NFSD well.

When the base direct write patches are further along, I'll post my
RFC patches and copy fsdevel, and we can explore this more.



-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE
  2025-11-05 17:42 ` [PATCH v4 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE Mike Snitzer
  2025-11-05 18:47   ` Chuck Lever
@ 2025-11-07 15:29   ` Christoph Hellwig
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-11-07 15:29 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Chuck Lever, Jeff Layton, linux-nfs

On Wed, Nov 05, 2025 at 12:42:08PM -0500, Mike Snitzer wrote:
> NFSD_IO_DIRECT can easily improve streaming misaligned WRITE
> performance if it uses buffered IO (without DONTCACHE) for the
> misaligned end segment(s) and O_DIRECT for the aligned middle
> segment's IO.
> 
> On one capable testbed, this commit improved streaming 47008 byte
> write performance from 0.3433 GB/s to 1.26 GB/s.

When you say "improved streaming 47008 byte write performance" do you
mean the application on the client doing writes of 47008 bytes for
each syscall, probably synchronous?  Why is that a workload that matters?
How do the numbers look for normal writes that are aligned to a power
of two, and especially the preferred write size exposed by the client
in st_blksize?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  2025-11-05 17:42 ` [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how Mike Snitzer
  2025-11-05 18:49   ` Chuck Lever
@ 2025-11-07 15:30   ` Christoph Hellwig
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-11-07 15:30 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Chuck Lever, Jeff Layton, linux-nfs

On Wed, Nov 05, 2025 at 12:42:09PM -0500, Mike Snitzer wrote:
> NFSD_IO_DIRECT_WRITE_FILE_SYNC is direct IO with stable_how=NFS_FILE_SYNC.
> NFSD_IO_DIRECT_WRITE_DATA_SYNC is direct IO with stable_how=NFS_DATA_SYNC.
> 
> The stable_how associated with each is a hint in the form of a "floor"
> value for stable_how.  Meaning if the client provided stable_how is
> already of higher value it will not be changed.
> 
> These permutations of NFSD_IO_DIRECT allow to experiment with also
> elevating stable_how and sending it back to the client.  Which for
> NFSD_IO_DIRECT_WRITE_FILE_SYNC will cause the client to elide its
> COMMIT.

What is your use case?  Do you have performance numbers for them?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  2025-11-06 22:56         ` Mike Snitzer
  2025-11-07 14:48           ` Chuck Lever
@ 2025-11-07 15:34           ` Christoph Hellwig
  2025-11-07 15:35             ` Chuck Lever
  1 sibling, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2025-11-07 15:34 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Chuck Lever, Jeff Layton, linux-nfs

On Thu, Nov 06, 2025 at 05:56:32PM -0500, Mike Snitzer wrote:
> OK, I'll take a closer look at NFS client controls for stable_how,
> because NFSD clearly handles NFS_DATA_SYNC and NFS_FILE_SYNC, so I just
> assumed it's because the client does actually send them.

Asking for NFS_DATA_SYNC / NFS_FILE_SYNC for O_DSYNC/O_SYNC writes or
even fdatasync/fsync calls that translate to a single on-the-wire write
request would be a valuable client addition that should speed things up
with most if not all servers.

> > And as I said above, "no plan to merge it for now," meaning it's still
> > on the table for sometime down the road. I have some other ideas I'm
> > cooking up, such as using BDI congestion to control NFS WRITE
> > throttling.
> 
> Hmm, I thought the BDI congestion infra got killed (by Jan Kara and
> others).. which made me sad because when it was first introduced it
> was amazing at solving some complex deadlocks (but we're talking 20
> years ago now).  So I haven't kept my finger on the pulse of what is
> still available to us relative to BDI congestion.

Yes, BDI congestion is gone, mostly because it didn't really work.
I'm also not sure how it could have solved deadlocks.

I feel a little out of the loop, though - what are the NFS-level
write throttling needs to start with?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  2025-11-07 15:34           ` Christoph Hellwig
@ 2025-11-07 15:35             ` Chuck Lever
  2025-11-07 15:40               ` Christoph Hellwig
  0 siblings, 1 reply; 16+ messages in thread
From: Chuck Lever @ 2025-11-07 15:35 UTC (permalink / raw)
  To: Christoph Hellwig, Mike Snitzer; +Cc: Jeff Layton, linux-nfs

On 11/7/25 10:34 AM, Christoph Hellwig wrote:
> On Thu, Nov 06, 2025 at 05:56:32PM -0500, Mike Snitzer wrote:
>> OK, I'll take a closer look at NFS client controls for stable_how,
>> because NFSD clearly handles NFS_DATA_SYNC and NFS_FILE_SYNC so I just
>> assumed its because the client does actually send them.
> 
> Asking for NFS_DATA_SYNC / NFS_FILE_SYNC for O_DSYNC/O_SYNC writes or
> even fdatasync/fsync calls that translate to a single on-the-wire write
> request would be a valuable client addition that should speed things up
> with most if not all servers.
> 
>>> And as I said above, "no plan to merge it for now," meaning it's still
>>> on the table for sometime down the road. I have some other ideas I'm
>>> cooking up, such as using BDI congestion to control NFS WRITE
>>> throttling.
>>
>> Hmm, I thought the BDI congestion infra got killed (by Jan Kara and
>> others), which made me sad because when it was first introduced it
>> was amazing at solving some complex deadlocks (but we're talking 20
>> years ago now).  So I haven't kept my finger on the pulse of what is
>> still available to us relative to BDI congestion.
> 
> Yes, BDI congestion is gone, mostly because it didn't really work.
> I'm also not sure how it could have solved deadlocks.
> 
> I feel a little out of the loop, though - what are the NFS-level
> write throttling needs to start with?
> 

Let's not go down this path right now. I'd like to stay focused on
getting the direct WRITE work merged.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  2025-11-07 15:35             ` Chuck Lever
@ 2025-11-07 15:40               ` Christoph Hellwig
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-11-07 15:40 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Christoph Hellwig, Mike Snitzer, Jeff Layton, linux-nfs

On Fri, Nov 07, 2025 at 10:35:52AM -0500, Chuck Lever wrote:
> > Yes, BDI congestion is gone, mostly because it didn't really work.
> > I'm also not sure how it could have solved deadlocks.
> > 
> > I feel a little out of the loop, though - what are the NFS-level
> > write throttling needs to start with?
> > 
> 
> Let's not go down this path right now. I'd like to stay focused on
> getting the direct WRITE work merged.

I'm just curious what the discussion is about at all.


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-11-07 15:40 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-05 17:42 [PATCH v4 0/3] [PATCH 0/3] NFSD: additional NFSD Direct changes Mike Snitzer
2025-11-05 17:42 ` [PATCH v4 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE Mike Snitzer
2025-11-05 18:47   ` Chuck Lever
2025-11-07 15:29   ` Christoph Hellwig
2025-11-05 17:42 ` [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how Mike Snitzer
2025-11-05 18:49   ` Chuck Lever
2025-11-06 20:17     ` Mike Snitzer
2025-11-06 20:35       ` Chuck Lever
2025-11-06 22:56         ` Mike Snitzer
2025-11-07 14:48           ` Chuck Lever
2025-11-07 15:34           ` Christoph Hellwig
2025-11-07 15:35             ` Chuck Lever
2025-11-07 15:40               ` Christoph Hellwig
2025-11-07 15:30   ` Christoph Hellwig
2025-11-05 17:42 ` [PATCH v4 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst Mike Snitzer
2025-11-05 18:50   ` Chuck Lever
