[PATCH 0/3] NFSD: additional NFSD Direct changes

public inbox for linux-nfs@vger.kernel.org

* [PATCH 0/3] NFSD: additional NFSD Direct changes
@ 2025-11-04 16:42 Mike Snitzer
  2025-11-04 16:42 ` [PATCH 1/3] nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback Mike Snitzer
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-11-04 16:42 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Hi,

This series builds on top of what has been staged in nfsd-testing for
NFSD Direct.

For now, I elected to use io_cache_write to expose control over
stable_how when NFSD Direct is used, rather than a per-export control
that can work with any NFSD IO mode. It seemed best to approach it
this way until/unless there is a clear associated win.

Benchmarking of the stable_how variants is still pending: the
performance cluster required for it is tied up with higher-priority
work at the moment, so I will need to circle back to that.

Thanks,
Mike

Mike Snitzer (3):
  nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback
  NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst

 .../filesystems/nfs/nfsd-io-modes.rst         | 58 ++++++++-------
 fs/nfsd/debugfs.c                             |  7 +-
 fs/nfsd/nfsd.h                                |  2 +
 fs/nfsd/vfs.c                                 | 74 ++++++++++++++-----
 4 files changed, 97 insertions(+), 44 deletions(-)

-- 
2.44.0


* [PATCH 1/3] nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback
  2025-11-04 16:42 [PATCH 0/3] NFSD: additional NFSD Direct changes Mike Snitzer
@ 2025-11-04 16:42 ` Mike Snitzer
  2025-11-04 17:23   ` Chuck Lever
                     ` (2 more replies)
  2025-11-04 16:42 ` [PATCH 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how Mike Snitzer
  2025-11-04 16:42 ` [PATCH 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst Mike Snitzer
  2 siblings, 3 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-11-04 16:42 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Also, use buffered IO (without DONTCACHE) if a misaligned READ is less
than 32K. But do use DONTCACHE if an entire WRITE is misaligned; this
preserves the intent of NFSD_IO_DIRECT.
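
A minimal user-space sketch of the small-READ check described above,
assuming a power-of-two alignment such as nf_dio_read_offset_align would
provide. The helper names (align_down, align_up,
read_falls_back_to_buffered) are illustrative and not part of the patch;
only the 32K threshold and the start/end comparison mirror the
nfsd_direct_read hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 32K threshold, same value as (32 << 10) in the patch */
#define SMALL_READ_THRESHOLD (32u << 10)

/* Illustrative stand-ins for the kernel's round_down()/round_up();
 * both assume 'align' is a power of two. */
static uint64_t align_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

static uint64_t align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Returns true when a READ is both smaller than 32K and would need to
 * be expanded on either end to become DIO-aligned. */
static bool read_falls_back_to_buffered(uint64_t offset, uint64_t count,
					uint64_t align)
{
	uint64_t dio_start, dio_end;
	bool misaligned;

	assert(align != 0 && (align & (align - 1)) == 0);

	dio_start = align_down(offset, align);
	dio_end = align_up(offset + count, align);
	misaligned = (offset != dio_start) || (dio_end != offset + count);

	return count < SMALL_READ_THRESHOLD && misaligned;
}
```

With a 4096-byte alignment, a 4K READ at offset 512 takes the buffered
path, while an aligned 4K READ at offset 0 or a misaligned 64K READ
stays on the direct IO path.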

The misaligned ends of a misaligned DIO WRITE will use buffered IO
(without DONTCACHE) but the middle DIO-aligned segment will use direct
IO.  This provides ideal performance for streaming misaligned DIO
(e.g. IO500's IOR_HARD) because buffered IO is used to benefit RMW.
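
The start/middle/end split described above can be sketched in user
space as follows. This is an illustration under stated assumptions, not
the code in nfsd_write_dio_iters_init: struct write_split, split_write
and buffered_uses_dontcache are hypothetical names, only offset
alignment is considered (the real code also has to respect memory
buffer alignment), and the filesystem is assumed to support DONTCACHE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel's round_down()/round_up();
 * both assume 'align' is a power of two. */
static uint64_t align_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

static uint64_t align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

struct write_split {
	bool use_dio;        /* false: no aligned middle, whole WRITE is buffered */
	uint64_t start_len;  /* misaligned head, written with buffered IO */
	uint64_t middle_len; /* DIO-aligned middle, written with direct IO */
	uint64_t end_len;    /* misaligned tail, written with buffered IO */
};

static struct write_split split_write(uint64_t offset, uint64_t len,
				      uint64_t align)
{
	struct write_split s = { false, 0, 0, 0 };
	uint64_t end = offset + len;
	uint64_t mid_start, mid_end;

	assert(align != 0 && (align & (align - 1)) == 0);

	mid_start = align_up(offset, align);
	mid_end = align_down(end, align);
	if (mid_start >= mid_end)
		return s; /* nothing DIO-aligned: fall back to buffered IO */

	s.use_dio = true;
	s.start_len = mid_start - offset;
	s.middle_len = mid_end - mid_start;
	s.end_len = end - mid_end;
	return s;
}

/* Per the commit message: DONTCACHE is kept only when the entire WRITE
 * falls back to buffered IO; the misaligned ends of a split WRITE use
 * plain buffered IO so that RMW can benefit from the page cache. */
static bool buffered_uses_dontcache(const struct write_split *s)
{
	return !s->use_dio;
}
```

For example, a 1 MiB WRITE at offset 512 with 4096-byte alignment
splits into a 3584-byte buffered head, a 1044480-byte direct IO middle
and a 512-byte buffered tail.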

On one capable testbed, this commit improved IOR_HARD WRITE
performance from 0.3433GB/s to 1.26GB/s.

Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
---
 fs/nfsd/vfs.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 701dd261c252..9403ec8bb2da 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -104,6 +104,7 @@ nfserrno (int errno)
 		{ nfserr_perm, -ENOKEY },
 		{ nfserr_no_grace, -ENOGRACE},
 		{ nfserr_io, -EBADMSG },
+		{ nfserr_eagain, -ENOTBLK },
 	};
 	int	i;
 
@@ -1099,13 +1100,18 @@ nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	size_t len;
 
 	init_sync_kiocb(&kiocb, nf->nf_file);
-	kiocb.ki_flags |= IOCB_DIRECT;
 
 	/* Read a properly-aligned region of bytes into rq_bvec */
 	dio_start = round_down(offset, nf->nf_dio_read_offset_align);
 	dio_end = round_up((u64)offset + *count, nf->nf_dio_read_offset_align);
 
+	/* Don't use expanded DIO READ for IO less than 32K */
+	if ((*count < (32 << 10)) &&
+	    (((offset - dio_start) > 0) || ((dio_end - (offset + *count)) > 0)))
+		return nfserrno(-ENOTBLK); /* fallback to buffered */
+
 	kiocb.ki_pos = dio_start;
+	kiocb.ki_flags |= IOCB_DIRECT;
 
 	v = 0;
 	total = dio_end - dio_start;
@@ -1184,10 +1190,13 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		break;
 	case NFSD_IO_DIRECT:
 		/* When dio_read_offset_align is zero, dio is not supported */
-		if (nf->nf_dio_read_offset_align && !rqstp->rq_res.page_len)
-			return nfsd_direct_read(rqstp, fhp, nf, offset,
+		if (nf->nf_dio_read_offset_align && !rqstp->rq_res.page_len) {
+			__be32 nfserr = nfsd_direct_read(rqstp, fhp, nf, offset,
 						count, eof);
-		fallthrough;
+			if (nfserr != nfserr_eagain)
+				return nfserr;
+		}
+		break; /* fallback to buffered */
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
 			kiocb.ki_flags = IOCB_DONTCACHE;
@@ -1347,6 +1356,15 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
 		++args->nsegs;
 	}
 
+	/*
+	 * Don't use IOCB_DONTCACHE if misaligned DIO WRITE (args->nsegs > 1),
+	 * because it compromises unaligned segments' RMW IO being able to
+	 * benefit from buffered IO (especially important for streaming
+	 * misaligned DIO WRITE performance).
+	 */
+	if (args->nsegs > 1 && (args->flags_buffered & IOCB_DONTCACHE))
+		args->flags_buffered &= ~IOCB_DONTCACHE;
+
 	return;
 
 no_dio:
@@ -1400,7 +1418,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 
 	/*
 	 * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
-	 * writing unaligned segments or handling fallback I/O.
+	 * falling back to buffered IO if entire WRITE is unaligned.
 	 */
 	args.flags_buffered = kiocb->ki_flags;
 	if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
-- 
2.44.0



* [PATCH 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how
  2025-11-04 16:42 [PATCH 0/3] NFSD: additional NFSD Direct changes Mike Snitzer
  2025-11-04 16:42 ` [PATCH 1/3] nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback Mike Snitzer
@ 2025-11-04 16:42 ` Mike Snitzer
  2025-11-04 16:42 ` [PATCH 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst Mike Snitzer
  2 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-11-04 16:42 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

NFSD_IO_DIRECT_WRITE_FILE_SYNC is direct IO with stable_how=NFS_FILE_SYNC.
NFSD_IO_DIRECT_WRITE_DATA_SYNC is direct IO with stable_how=NFS_DATA_SYNC.

The stable_how associated with each is a hint in the form of a "floor"
value for stable_how.  That is, if the client-provided stable_how is
already higher, it will not be changed.

These variants of NFSD_IO_DIRECT allow experimenting with also
elevating stable_how and sending it back to the client, which for
NFSD_IO_DIRECT_WRITE_FILE_SYNC will cause the client to elide its
COMMIT.
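
The "floor" semantics described above can be sketched in user space as
follows. This is not the helper added by this patch:
apply_stable_floor and the SKETCH_* flag bits are hypothetical
stand-ins (for the kiocb IOCB_DSYNC/IOCB_SYNC flags), and the sketch
derives the flags from the resulting stable_how rather than from the
floor alone. The NFS_* values follow the protocol's stable_how
encoding.

```c
#include <assert.h>
#include <stdint.h>

enum {
	NFS_UNSTABLE  = 0,
	NFS_DATA_SYNC = 1,
	NFS_FILE_SYNC = 2,
};

/* Hypothetical stand-ins for IOCB_DSYNC and IOCB_SYNC */
#define SKETCH_DSYNC 0x1u
#define SKETCH_SYNC  0x2u

/* Raise *stable_how to at least 'floor' (never lower it) and return
 * the sync flags implied by the resulting stable_how. */
static uint32_t apply_stable_floor(uint32_t floor, uint32_t *stable_how)
{
	assert(floor <= NFS_FILE_SYNC && *stable_how <= NFS_FILE_SYNC);

	if (floor > *stable_how)
		*stable_how = floor;

	switch (*stable_how) {
	case NFS_FILE_SYNC:
		return SKETCH_DSYNC | SKETCH_SYNC; /* data and timestamps */
	case NFS_DATA_SYNC:
		return SKETCH_DSYNC;               /* data only */
	default:
		return 0;                          /* unstable: no sync flags */
	}
}
```

A client WRITE sent as NFS_UNSTABLE under a NFS_FILE_SYNC floor is
returned as NFS_FILE_SYNC, whereas a client WRITE already sent as
NFS_FILE_SYNC is left unchanged by a NFS_DATA_SYNC floor.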

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/debugfs.c |  7 ++++++-
 fs/nfsd/nfsd.h    |  2 ++
 fs/nfsd/vfs.c     | 46 ++++++++++++++++++++++++++++++++++------------
 3 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 7f44689e0a53..8538e29ed2ab 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -68,7 +68,7 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
 	case NFSD_IO_DIRECT:
 		/*
 		 * Must disable splice_read when enabling
-		 * NFSD_IO_DONTCACHE.
+		 * NFSD_IO_DONTCACHE and NFSD_IO_DIRECT.
 		 */
 		nfsd_disable_splice_read = true;
 		nfsd_io_cache_read = val;
@@ -90,6 +90,9 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
  * Contents:
  *   %0: NFS WRITE will use buffered IO
  *   %1: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
+ *   %2: NFS WRITE will use direct IO with stable_how=NFS_UNSTABLE
+ *   %3: NFS WRITE will use direct IO with stable_how=NFS_DATA_SYNC
+ *   %4: NFS WRITE will use direct IO with stable_how=NFS_FILE_SYNC
  *
  * This setting takes immediate effect for all NFS versions,
  * all exports, and in all NFSD net namespaces.
@@ -109,6 +112,8 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
 	case NFSD_IO_BUFFERED:
 	case NFSD_IO_DONTCACHE:
 	case NFSD_IO_DIRECT:
+	case NFSD_IO_DIRECT_WRITE_DATA_SYNC:
+	case NFSD_IO_DIRECT_WRITE_FILE_SYNC:
 		nfsd_io_cache_write = val;
 		break;
 	default:
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index e4263326ca4a..10eca169392b 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -161,6 +161,8 @@ enum {
 	NFSD_IO_BUFFERED,
 	NFSD_IO_DONTCACHE,
 	NFSD_IO_DIRECT,
+	NFSD_IO_DIRECT_WRITE_DATA_SYNC,
+	NFSD_IO_DIRECT_WRITE_FILE_SYNC,
 };
 
 extern u64 nfsd_io_cache_read __read_mostly;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 9403ec8bb2da..99c62340f58f 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1407,15 +1407,45 @@ nfsd_issue_dio_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	return 0;
 }
 
+static void
+nfsd_init_write_kiocb_from_stable(u32 stable_floor,
+				  struct kiocb *kiocb,
+				  u32 *stable_how)
+{
+	if (stable_floor < *stable_how)
+		return; /* stable_how already set higher */
+
+	*stable_how = stable_floor;
+
+	switch (stable_floor) {
+	case NFS_FILE_SYNC:
+		/* persist data and timestamps */
+		kiocb->ki_flags |= IOCB_DSYNC | IOCB_SYNC;
+		break;
+	case NFS_DATA_SYNC:
+		/* persist data only */
+		kiocb->ki_flags |= IOCB_DSYNC;
+		break;
+	}
+}
+
 static noinline_for_stack int
 nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		  struct nfsd_file *nf, u32 *stable_how, unsigned int nvecs,
 		  unsigned long *cnt, struct kiocb *kiocb)
 {
 	struct nfsd_write_dio_args args;
+	u32 stable_floor = NFS_UNSTABLE;
 
 	args.nf = nf;
 
+	if (nfsd_io_cache_write == NFSD_IO_DIRECT_WRITE_FILE_SYNC)
+		stable_floor = NFS_FILE_SYNC;
+	else if (nfsd_io_cache_write == NFSD_IO_DIRECT_WRITE_DATA_SYNC)
+		stable_floor = NFS_DATA_SYNC;
+	if (stable_floor != NFS_UNSTABLE)
+		nfsd_init_write_kiocb_from_stable(stable_floor, kiocb,
+						  stable_how);
 	/*
 	 * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
 	 * falling back to buffered IO if entire WRITE is unaligned.
@@ -1490,18 +1520,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		stable = NFS_UNSTABLE;
 	init_sync_kiocb(&kiocb, file);
 	kiocb.ki_pos = offset;
-	if (likely(!fhp->fh_use_wgather)) {
-		switch (stable) {
-		case NFS_FILE_SYNC:
-			/* persist data and timestamps */
-			kiocb.ki_flags |= IOCB_DSYNC | IOCB_SYNC;
-			break;
-		case NFS_DATA_SYNC:
-			/* persist data only */
-			kiocb.ki_flags |= IOCB_DSYNC;
-			break;
-		}
-	}
+	if (likely(!fhp->fh_use_wgather))
+		nfsd_init_write_kiocb_from_stable(stable, &kiocb, stable_how);
 
 	nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
 
@@ -1511,6 +1531,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 
 	switch (nfsd_io_cache_write) {
 	case NFSD_IO_DIRECT:
+	case NFSD_IO_DIRECT_WRITE_DATA_SYNC:
+	case NFSD_IO_DIRECT_WRITE_FILE_SYNC:
 		host_err = nfsd_direct_write(rqstp, fhp, nf, stable_how,
 					     nvecs, cnt, &kiocb);
 		stable = *stable_how;
-- 
2.44.0



* [PATCH 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst
  2025-11-04 16:42 [PATCH 0/3] NFSD: additional NFSD Direct changes Mike Snitzer
  2025-11-04 16:42 ` [PATCH 1/3] nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback Mike Snitzer
  2025-11-04 16:42 ` [PATCH 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how Mike Snitzer
@ 2025-11-04 16:42 ` Mike Snitzer
  2025-11-04 17:25   ` Chuck Lever
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Snitzer @ 2025-11-04 16:42 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Also fixed some typos.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 .../filesystems/nfs/nfsd-io-modes.rst         | 58 ++++++++++---------
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
index 4863885c7035..29b84c9c9e25 100644
--- a/Documentation/filesystems/nfs/nfsd-io-modes.rst
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -21,17 +21,20 @@ NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
 
 Based on the configured settings, NFSD's IO will either be:
 - cached using page cache (NFSD_IO_BUFFERED=0)
-- cached but removed from the page cache upon completion
-  (NFSD_IO_DONTCACHE=1).
-- not cached (NFSD_IO_DIRECT=2)
+- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
+- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
+- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
+- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
 
-To set an NFSD IO mode, write a supported value (0, 1 or 2) to the
+To set an NFSD IO mode, write a supported value (0 - 4) to the
 corresponding IO operation's debugfs interface, e.g.:
   echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+  echo 4 > /sys/kernel/debug/nfsd/io_cache_write
 
 To check which IO mode NFSD is using for READ or WRITE, simply read the
 corresponding IO operation's debugfs interface, e.g.:
   cat /sys/kernel/debug/nfsd/io_cache_read
+  cat /sys/kernel/debug/nfsd/io_cache_write
 
 NFSD DONTCACHE
 ==============
@@ -46,10 +49,10 @@ DONTCACHE aims to avoid what has proven to be a fairly significant
 limition of Linux's memory management subsystem if/when large amounts of
 data is infrequently accessed (e.g. read once _or_ written once but not
 read until much later). Such use-cases are particularly problematic
-because the page cache will eventually become a bottleneck to surfacing
+because the page cache will eventually become a bottleneck to servicing
 new IO requests.
 
-For more context, please see these Linux commit headers:
+For more context on DONTCACHE, please see these Linux commit headers:
 - Overview:  9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
   to take a struct kiocb")
 - for READ:  8026e49bff9b1 ("mm/filemap: add read support for
@@ -73,12 +76,18 @@ those with a working set that is significantly larger than available
 system memory. The pathological worst-case workload that NFSD DIRECT has
 proven to help most is: NFS client issuing large sequential IO to a file
 that is 2-3 times larger than the NFS server's available system memory.
+The reason for such improvement is NFSD DIRECT eliminates a lot of work
+that the memory management subsystem would otherwise be required to
+perform (e.g. page allocation, dirty writeback, page reclaim). When
+using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
+time trying to find adequate free pages so that forward IO progress can
+be made.
 
 The performance win associated with using NFSD DIRECT was previously
 discussed on linux-nfs, see:
 https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
 But in summary:
-- NFSD DIRECT can signicantly reduce memory requirements
+- NFSD DIRECT can significantly reduce memory requirements
 - NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
 - NFSD DIRECT can offer more deterministic IO performance
 
@@ -91,11 +100,11 @@ to generate a "flamegraph" for work Linux must perform on behalf of your
 test is a really meaningful way to compare the relative health of the
 system and how switching NFSD's IO mode changes what is observed.
 
-If NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
-interfaces, ideally the IO will be aligned relative to the underlying
-block device's logical_block_size. Also the memory buffer used to store
-the READ or WRITE payload must be aligned relative to the underlying
-block device's dma_alignment.
+If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
+NFSD's debugfs interfaces, ideally the IO will be aligned relative to
+the underlying block device's logical_block_size. Also the memory buffer
+used to store the READ or WRITE payload must be aligned relative to the
+underlying block device's dma_alignment.
 
 But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
 it can:
@@ -113,32 +122,29 @@ Misaligned READ:
 
 Misaligned WRITE:
     If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
-    middle and end as needed. The large middle extent is DIO-aligned and
-    the start and/or end are misaligned. Buffered IO is used for the
-    misaligned extents and O_DIRECT is used for the middle DIO-aligned
-    extent.
-
-    If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
-    invalidate the page cache on behalf of the DIO WRITE, then
-    nfsd_issue_write_dio() will fall back to using buffered IO.
+    middle and end as needed. The large middle segment is DIO-aligned
+    and the start and/or end are misaligned. Buffered IO is used for the
+    misaligned segments and O_DIRECT is used for the middle DIO-aligned
+    segment. DONTCACHE buffered IO is _not_ used for the misaligned
+    segments because using normal buffered IO offers significant RMW
+    performance benefit when handling streaming misaligned WRITEs.
 
 Tracing:
-    The nfsd_analyze_read_dio trace event shows how NFSD expands any
+    The nfsd_read_direct trace event shows how NFSD expands any
     misaligned READ to the next DIO-aligned block (on either end of the
     original READ, as needed).
 
     This combination of trace events is useful for READs:
     echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
-    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
     echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
     echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
 
-    The nfsd_analyze_write_dio trace event shows how NFSD splits a given
-    misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
-    extent.
+    The nfsd_write_direct trace event shows how NFSD splits a given
+    misaligned WRITE into a DIO-aligned middle segment.
 
     This combination of trace events is useful for WRITEs:
     echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
-    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
     echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
     echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
-- 
2.44.0



* Re: [PATCH 1/3] nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback
  2025-11-04 16:42 ` [PATCH 1/3] nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback Mike Snitzer
@ 2025-11-04 17:23   ` Chuck Lever
  2025-11-04 17:35     ` Mike Snitzer
  2025-11-04 18:11   ` [PATCH v2 " Mike Snitzer
  2025-11-05  6:19   ` [PATCH v3 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE Mike Snitzer
  2 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2025-11-04 17:23 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs

On 11/4/25 11:42 AM, Mike Snitzer wrote:
> Also, use buffered IO (without DONTCACHE) if READ is less than 32K.
> But do use DONTCACHE if an entire WRITE is misaligned, this preserves
> intent of NFSD_IO_DIRECT.

These two changes need to be separate patches.


> The misaligned ends of a misaligned DIO WRITE will use buffered IO
> (without DONTCACHE) but the middle DIO-aligned segment will use direct
> IO.  This provides ideal performance for streaming misaligned DIO
> (e.g. IO500's IOR_HARD) because buffered IO is used to benefit RMW.
> 
> On one capable testbed, this commit improved IOR_HARD WRITE
> performance from 0.3433GB/s to 1.26GB/s.
> 
> Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
> ---
>  fs/nfsd/vfs.c | 28 +++++++++++++++++++++++-----
>  1 file changed, 23 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 701dd261c252..9403ec8bb2da 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -104,6 +104,7 @@ nfserrno (int errno)
>  		{ nfserr_perm, -ENOKEY },
>  		{ nfserr_no_grace, -ENOGRACE},
>  		{ nfserr_io, -EBADMSG },
> +		{ nfserr_eagain, -ENOTBLK },
>  	};
>  	int	i;
>  
> @@ -1099,13 +1100,18 @@ nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	size_t len;
>  
>  	init_sync_kiocb(&kiocb, nf->nf_file);
> -	kiocb.ki_flags |= IOCB_DIRECT;
>  
>  	/* Read a properly-aligned region of bytes into rq_bvec */
>  	dio_start = round_down(offset, nf->nf_dio_read_offset_align);
>  	dio_end = round_up((u64)offset + *count, nf->nf_dio_read_offset_align);
>  
> +	/* Don't use expanded DIO READ for IO less than 32K */
> +	if ((*count < (32 << 10)) &&
> +	    (((offset - dio_start) > 0) || ((dio_end - (offset + *count)) > 0)))
> +		return nfserrno(-ENOTBLK); /* fallback to buffered */

Why not just return a specific nfserr code here? No need to go through
nfserrno.


> +
>  	kiocb.ki_pos = dio_start;
> +	kiocb.ki_flags |= IOCB_DIRECT;
>  
>  	v = 0;
>  	total = dio_end - dio_start;
> @@ -1184,10 +1190,13 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		break;
>  	case NFSD_IO_DIRECT:
>  		/* When dio_read_offset_align is zero, dio is not supported */
> -		if (nf->nf_dio_read_offset_align && !rqstp->rq_res.page_len)
> -			return nfsd_direct_read(rqstp, fhp, nf, offset,
> +		if (nf->nf_dio_read_offset_align && !rqstp->rq_res.page_len) {
> +			__be32 nfserr = nfsd_direct_read(rqstp, fhp, nf, offset,
>  						count, eof);
> -		fallthrough;
> +			if (nfserr != nfserr_eagain)
> +				return nfserr;
> +		}
> +		break; /* fallback to buffered */
>  	case NFSD_IO_DONTCACHE:
>  		if (file->f_op->fop_flags & FOP_DONTCACHE)
>  			kiocb.ki_flags = IOCB_DONTCACHE;
> @@ -1347,6 +1356,15 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
>  		++args->nsegs;
>  	}
>  
> +	/*
> +	 * Don't use IOCB_DONTCACHE if misaligned DIO WRITE (args->nsegs > 1),
> +	 * because it compromises unaligned segments' RMW IO being able to
> +	 * benefit from buffered IO (especially important for streaming
> +	 * misaligned DIO WRITE performance).
> +	 */
> +	if (args->nsegs > 1 && (args->flags_buffered & IOCB_DONTCACHE))
> +		args->flags_buffered &= ~IOCB_DONTCACHE;
> +
>  	return;
>  
>  no_dio:
> @@ -1400,7 +1418,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  
>  	/*
>  	 * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
> -	 * writing unaligned segments or handling fallback I/O.
> +	 * falling back to buffered IO if entire WRITE is unaligned.
>  	 */
>  	args.flags_buffered = kiocb->ki_flags;
>  	if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)


-- 
Chuck Lever


* Re: [PATCH 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst
  2025-11-04 16:42 ` [PATCH 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst Mike Snitzer
@ 2025-11-04 17:25   ` Chuck Lever
  0 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2025-11-04 17:25 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs

On 11/4/25 11:42 AM, Mike Snitzer wrote:
> Also fixed some typos.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  .../filesystems/nfs/nfsd-io-modes.rst         | 58 ++++++++++---------
>  1 file changed, 32 insertions(+), 26 deletions(-)
> 
> diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> index 4863885c7035..29b84c9c9e25 100644
> --- a/Documentation/filesystems/nfs/nfsd-io-modes.rst
> +++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> @@ -21,17 +21,20 @@ NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
>  
>  Based on the configured settings, NFSD's IO will either be:
>  - cached using page cache (NFSD_IO_BUFFERED=0)
> -- cached but removed from the page cache upon completion
> -  (NFSD_IO_DONTCACHE=1).
> -- not cached (NFSD_IO_DIRECT=2)
> +- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
> +- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
> +- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
> +- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
>  
> -To set an NFSD IO mode, write a supported value (0, 1 or 2) to the
> +To set an NFSD IO mode, write a supported value (0 - 4) to the
>  corresponding IO operation's debugfs interface, e.g.:
>    echo 2 > /sys/kernel/debug/nfsd/io_cache_read
> +  echo 4 > /sys/kernel/debug/nfsd/io_cache_write
>  
>  To check which IO mode NFSD is using for READ or WRITE, simply read the
>  corresponding IO operation's debugfs interface, e.g.:
>    cat /sys/kernel/debug/nfsd/io_cache_read
> +  cat /sys/kernel/debug/nfsd/io_cache_write
>  
>  NFSD DONTCACHE
>  ==============
> @@ -46,10 +49,10 @@ DONTCACHE aims to avoid what has proven to be a fairly significant
>  limition of Linux's memory management subsystem if/when large amounts of
>  data is infrequently accessed (e.g. read once _or_ written once but not
>  read until much later). Such use-cases are particularly problematic
> -because the page cache will eventually become a bottleneck to surfacing
> +because the page cache will eventually become a bottleneck to servicing
>  new IO requests.
>  
> -For more context, please see these Linux commit headers:
> +For more context on DONTCACHE, please see these Linux commit headers:
>  - Overview:  9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
>    to take a struct kiocb")
>  - for READ:  8026e49bff9b1 ("mm/filemap: add read support for
> @@ -73,12 +76,18 @@ those with a working set that is significantly larger than available
>  system memory. The pathological worst-case workload that NFSD DIRECT has
>  proven to help most is: NFS client issuing large sequential IO to a file
>  that is 2-3 times larger than the NFS server's available system memory.
> +The reason for such improvement is NFSD DIRECT eliminates a lot of work
> +that the memory management subsystem would otherwise be required to
> +perform (e.g. page allocation, dirty writeback, page reclaim). When
> +using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
> +time trying to find adequate free pages so that forward IO progress can
> +be made.
>  
>  The performance win associated with using NFSD DIRECT was previously
>  discussed on linux-nfs, see:
>  https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
>  But in summary:
> -- NFSD DIRECT can signicantly reduce memory requirements
> +- NFSD DIRECT can significantly reduce memory requirements
>  - NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
>  - NFSD DIRECT can offer more deterministic IO performance
>  
> @@ -91,11 +100,11 @@ to generate a "flamegraph" for work Linux must perform on behalf of your
>  test is a really meaningful way to compare the relative health of the
>  system and how switching NFSD's IO mode changes what is observed.
>  
> -If NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
> -interfaces, ideally the IO will be aligned relative to the underlying
> -block device's logical_block_size. Also the memory buffer used to store
> -the READ or WRITE payload must be aligned relative to the underlying
> -block device's dma_alignment.
> +If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
> +NFSD's debugfs interfaces, ideally the IO will be aligned relative to
> +the underlying block device's logical_block_size. Also the memory buffer
> +used to store the READ or WRITE payload must be aligned relative to the
> +underlying block device's dma_alignment.
>  
>  But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
>  it can:
> @@ -113,32 +122,29 @@ Misaligned READ:
>  
>  Misaligned WRITE:
>      If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> -    middle and end as needed. The large middle extent is DIO-aligned and
> -    the start and/or end are misaligned. Buffered IO is used for the
> -    misaligned extents and O_DIRECT is used for the middle DIO-aligned
> -    extent.
> -
> -    If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> -    invalidate the page cache on behalf of the DIO WRITE, then
> -    nfsd_issue_write_dio() will fall back to using buffered IO.
> +    middle and end as needed. The large middle segment is DIO-aligned
> +    and the start and/or end are misaligned. Buffered IO is used for the
> +    misaligned segments and O_DIRECT is used for the middle DIO-aligned
> +    segment. DONTCACHE buffered IO is _not_ used for the misaligned
> +    segments because using normal buffered IO offers significant RMW
> +    performance benefit when handling streaming misaligned WRITEs.
>  
>  Tracing:
> -    The nfsd_analyze_read_dio trace event shows how NFSD expands any
> +    The nfsd_read_direct trace event shows how NFSD expands any
>      misaligned READ to the next DIO-aligned block (on either end of the
>      original READ, as needed).
>  
>      This combination of trace events is useful for READs:
>      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
> -    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
> +    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
>      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
>      echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
>  
> -    The nfsd_analyze_write_dio trace event shows how NFSD splits a given
> -    misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
> -    extent.
> +    The nfsd_write_direct trace event shows how NFSD splits a given
> +    misaligned WRITE into a DIO-aligned middle segment.
>  
>      This combination of trace events is useful for WRITEs:
>      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
> -    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
> +    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
>      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
>      echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable

I'm thinking this one should be squashed into the existing patch from
Sep 3.


-- 
Chuck Lever


* Re: [PATCH 1/3] nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback
  2025-11-04 17:23   ` Chuck Lever
@ 2025-11-04 17:35     ` Mike Snitzer
  2025-11-04 19:33       ` Chuck Lever
  0 siblings, 1 reply; 12+ messages in thread
From: Mike Snitzer @ 2025-11-04 17:35 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, linux-nfs

On Tue, Nov 04, 2025 at 12:23:02PM -0500, Chuck Lever wrote:
> On 11/4/25 11:42 AM, Mike Snitzer wrote:
> > Also, use buffered IO (without DONTCACHE) if READ is less than 32K.
> > But do use DONTCACHE if an entire WRITE is misaligned, this preserves
> > intent of NFSD_IO_DIRECT.
> 
> These two changes need to be separate patches.

They are linked: otherwise, if READ uses DONTCACHE for the small IO,
it'll kill any benefit to RMW.

Unless I'm misunderstanding which two changes you're referring to?

The "But do use DONTCACHE if an entire WRITE is misaligned" just
amounts to a comment tweak in nfsd_direct_write (last hunk below).

> > The misaligned ends of a misaligned DIO WRITE will use buffered IO
> > (without DONTCACHE) but the middle DIO-aligned segment will use direct
> > IO.  This provides ideal performance for streaming misaligned DIO
> > (e.g. IO500's IOR_HARD) because buffered IO is used to benefit RMW.
> > 
> > On one capable testbed, this commit improved IOR_HARD WRITE
> > performance from 0.3433GB/s to 1.26GB/s.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
> > ---
> >  fs/nfsd/vfs.c | 28 +++++++++++++++++++++++-----
> >  1 file changed, 23 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 701dd261c252..9403ec8bb2da 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -104,6 +104,7 @@ nfserrno (int errno)
> >  		{ nfserr_perm, -ENOKEY },
> >  		{ nfserr_no_grace, -ENOGRACE},
> >  		{ nfserr_io, -EBADMSG },
> > +		{ nfserr_eagain, -ENOTBLK },
> >  	};
> >  	int	i;
> >  
> > @@ -1099,13 +1100,18 @@ nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  	size_t len;
> >  
> >  	init_sync_kiocb(&kiocb, nf->nf_file);
> > -	kiocb.ki_flags |= IOCB_DIRECT;
> >  
> >  	/* Read a properly-aligned region of bytes into rq_bvec */
> >  	dio_start = round_down(offset, nf->nf_dio_read_offset_align);
> >  	dio_end = round_up((u64)offset + *count, nf->nf_dio_read_offset_align);
> >  
> > +	/* Don't use expanded DIO READ for IO less than 32K */
> > +	if ((*count < (32 << 10)) &&
> > +	    (((offset - dio_start) > 0) || ((dio_end - (offset + *count)) > 0)))
> > +		return nfserrno(-ENOTBLK); /* fallback to buffered */
> 
> Why not just return a specific nfserr code here? No need to go through
> nfserrno.

I could. I only tied it to ENOTBLK because that errno has a history
elsewhere of signalling direct-to-buffered fallback.  But yes, it could
just as easily return nfserr_eagain (or some other better suggestion).

> > +
> >  	kiocb.ki_pos = dio_start;
> > +	kiocb.ki_flags |= IOCB_DIRECT;
> >  
> >  	v = 0;
> >  	total = dio_end - dio_start;
> > @@ -1184,10 +1190,13 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  		break;
> >  	case NFSD_IO_DIRECT:
> >  		/* When dio_read_offset_align is zero, dio is not supported */
> > -		if (nf->nf_dio_read_offset_align && !rqstp->rq_res.page_len)
> > -			return nfsd_direct_read(rqstp, fhp, nf, offset,
> > +		if (nf->nf_dio_read_offset_align && !rqstp->rq_res.page_len) {
> > +			__be32 nfserr = nfsd_direct_read(rqstp, fhp, nf, offset,
> >  						count, eof);
> > -		fallthrough;
> > +			if (nfserr != nfserr_eagain)
> > +				return nfserr;
> > +		}
> > +		break; /* fallback to buffered */
> >  	case NFSD_IO_DONTCACHE:
> >  		if (file->f_op->fop_flags & FOP_DONTCACHE)
> >  			kiocb.ki_flags = IOCB_DONTCACHE;
> > @@ -1347,6 +1356,15 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
> >  		++args->nsegs;
> >  	}
> >  
> > +	/*
> > +	 * Don't use IOCB_DONTCACHE if misaligned DIO WRITE (args->nsegs > 1),
> > +	 * because it compromises unaligned segments' RMW IO being able to
> > +	 * benefit from buffered IO (especially important for streaming
> > +	 * misaligned DIO WRITE performance).
> > +	 */
> > +	if (args->nsegs > 1 && (args->flags_buffered & IOCB_DONTCACHE))
> > +		args->flags_buffered &= ~IOCB_DONTCACHE;
> > +
> >  	return;
> >  
> >  no_dio:
> > @@ -1400,7 +1418,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  
> >  	/*
> >  	 * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
> > -	 * writing unaligned segments or handling fallback I/O.
> > +	 * falling back to buffered IO if entire WRITE is unaligned.
> >  	 */
> >  	args.flags_buffered = kiocb->ki_flags;
> >  	if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
> 
> 
> -- 
> Chuck Lever

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/3] nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback
  2025-11-04 16:42 ` [PATCH 1/3] nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback Mike Snitzer
  2025-11-04 17:23   ` Chuck Lever
@ 2025-11-04 18:11   ` Mike Snitzer
  2025-11-05  6:19   ` [PATCH v3 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE Mike Snitzer
  2 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-11-04 18:11 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Also, use buffered IO (without DONTCACHE) if READ is less than 32K.
But do use DONTCACHE if an entire WRITE is misaligned, this preserves
intent of NFSD_IO_DIRECT.

The misaligned ends of a misaligned DIO WRITE will use buffered IO
(without DONTCACHE) but the middle DIO-aligned segment will use direct
IO.  This provides ideal performance for streaming misaligned DIO
(e.g. IO500's IOR_HARD) because buffered IO is used to benefit RMW.

On one capable testbed, this commit improved IOR_HARD WRITE
performance from 0.3433GB/s to 1.26GB/s.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/vfs.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

v2: don't use nfserrno(), just return nfserr_eagain directly from nfsd_direct_read

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 701dd261c252..a87ee9736d66 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1099,13 +1099,18 @@ nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	size_t len;
 
 	init_sync_kiocb(&kiocb, nf->nf_file);
-	kiocb.ki_flags |= IOCB_DIRECT;
 
 	/* Read a properly-aligned region of bytes into rq_bvec */
 	dio_start = round_down(offset, nf->nf_dio_read_offset_align);
 	dio_end = round_up((u64)offset + *count, nf->nf_dio_read_offset_align);
 
+	/* Don't use expanded DIO READ for IO less than 32K */
+	if ((*count < (32 << 10)) &&
+	    (((offset - dio_start) > 0) || ((dio_end - (offset + *count)) > 0)))
+		return nfserr_eagain; /* fallback to buffered */
+
 	kiocb.ki_pos = dio_start;
+	kiocb.ki_flags |= IOCB_DIRECT;
 
 	v = 0;
 	total = dio_end - dio_start;
@@ -1184,10 +1189,13 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		break;
 	case NFSD_IO_DIRECT:
 		/* When dio_read_offset_align is zero, dio is not supported */
-		if (nf->nf_dio_read_offset_align && !rqstp->rq_res.page_len)
-			return nfsd_direct_read(rqstp, fhp, nf, offset,
-						count, eof);
-		fallthrough;
+		if (nf->nf_dio_read_offset_align && !rqstp->rq_res.page_len) {
+			__be32 nfserr = nfsd_direct_read(rqstp, fhp, nf, offset,
+							 count, eof);
+			if (nfserr != nfserr_eagain)
+				return nfserr;
+		}
+		break; /* fallback to buffered */
 	case NFSD_IO_DONTCACHE:
 		if (file->f_op->fop_flags & FOP_DONTCACHE)
 			kiocb.ki_flags = IOCB_DONTCACHE;
@@ -1347,6 +1355,15 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
 		++args->nsegs;
 	}
 
+	/*
+	 * Don't use IOCB_DONTCACHE if misaligned DIO WRITE (args->nsegs > 1),
+	 * because it compromises unaligned segments' RMW IO being able to
+	 * benefit from buffered IO (especially important for streaming
+	 * misaligned DIO WRITE performance).
+	 */
+	if (args->nsegs > 1 && (args->flags_buffered & IOCB_DONTCACHE))
+		args->flags_buffered &= ~IOCB_DONTCACHE;
+
 	return;
 
 no_dio:
@@ -1400,7 +1417,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 
 	/*
 	 * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
-	 * writing unaligned segments or handling fallback I/O.
+	 * falling back to buffered IO if entire WRITE is unaligned.
 	 */
 	args.flags_buffered = kiocb->ki_flags;
 	if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback
  2025-11-04 17:35     ` Mike Snitzer
@ 2025-11-04 19:33       ` Chuck Lever
  0 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2025-11-04 19:33 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jeff Layton, linux-nfs

On 11/4/25 12:35 PM, Mike Snitzer wrote:
> On Tue, Nov 04, 2025 at 12:23:02PM -0500, Chuck Lever wrote:
>> On 11/4/25 11:42 AM, Mike Snitzer wrote:
>>> Also, use buffered IO (without DONTCACHE) if READ is less than 32K.
>>> But do use DONTCACHE if an entire WRITE is misaligned, this preserves
>>> intent of NFSD_IO_DIRECT.
>> These two changes need to be separate patches.
> They are linked, otherwise if READ uses DONTCACHE for the small IO
> it'll kill any benefit to RMW.
> 
> Unless I'm misunderstanding which two changes you're referring to?

It's unclear to me why the READ and WRITE parts of this patch are
related to each other. So, if you believe these are inseparable changes,
the patch description must explain why.

I don't understand why performing READs with buffered I/O has any effect
on the RMW behavior of direct WRITEs, for instance. Unless, you mean
that it's not an RMW cycle during the server's write processing, but the
application and client are doing RMW? If so, the use of "RMW" in the
patch description and new comments is misleading... and I'm not sure we
care all that much about optimizing unfortunate application behavior.


> The "But do use DONTCACHE if an entire WRITE is misaligned" just
> amounts to a comment tweak in nfsd_direct_write (last hunk below)

Don't the last two hunks below both modify the direct write path?

The Subject: is misleading, then: it refers to both "misaligned I/O" and
"fallback". I think you mean only the unaligned ends should avoid
DONTCACHE?
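
To make the "unaligned ends" question concrete, here is a minimal
userspace sketch of how one WRITE might be split into a buffered
prefix, a direct middle and a buffered suffix. The names split_write()
and struct split, and the 4096-byte alignment used in the example, are
assumptions for illustration only; the real splitting is done in
nfsd_write_dio_iters_init() and may differ in detail.

```c
#include <stdint.h>

/* Hypothetical model of one WRITE split into up to three segments. */
struct split {
	uint64_t prefix_len;	/* misaligned head, sent via buffered IO */
	uint64_t middle_len;	/* aligned body, sent via direct IO */
	uint64_t suffix_len;	/* misaligned tail, sent via buffered IO */
};

/* align must be a nonzero power of two. */
struct split split_write(uint64_t offset, uint64_t len, uint64_t align)
{
	struct split s = { 0, 0, 0 };
	uint64_t end = offset + len;
	/* First aligned offset at or after the start of the WRITE. */
	uint64_t mid_start = (offset + align - 1) & ~(align - 1);
	/* Last aligned offset at or before the end of the WRITE. */
	uint64_t mid_end = end & ~(align - 1);

	if (mid_start >= mid_end) {
		/* No aligned middle: model the whole WRITE as buffered. */
		s.prefix_len = len;
		return s;
	}

	s.prefix_len = mid_start - offset;
	s.middle_len = mid_end - mid_start;
	s.suffix_len = end - mid_end;
	return s;
}
```

For a 47008-byte WRITE at offset 47008 (the second write of a 47008-byte
stream, the size quoted elsewhere in this thread) with an assumed
4096-byte alignment, split_write(47008, 47008, 4096) gives a 2144-byte
prefix, a 40960-byte middle and a 3904-byte suffix. Only the prefix and
suffix would be candidates for dropping DONTCACHE.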


> +	/* Don't use expanded DIO READ for IO less than 32K */
> +	if ((*count < (32 << 10)) &&
> +	    (((offset - dio_start) > 0) || ((dio_end - (offset + *count)) > 0)))
> +		return nfserrno(-ENOTBLK); /* fallback to buffered */

A few more comments on this hunk:

- What caught me up about "nfserrno(-ENOTBLK)" is that generally
  nfserrno() is for converting an errno that is not known at compile
  time. Here you are passing it a constant. I mean, that is already
  what you do in the caller with "if (nfserr != nfserr_eagain)".

- Why cap the buffered READs at exactly 32KB? That feels like you are
  using a number that is beneficial to a specific workload and hardware
  configuration. I'd like to have a warm, fuzzy feeling that you have
  looked at the impact of this change on other common workloads.

- The convention for new code added to fs/nfsd is to use symbolic
  constants rather than raw integers for such magic numbers. The
  definition of said constant would appear in an NFSD header.

- The new comment is repeating what the code does. A more useful comment
  might explain why this cap is necessary. But, such a comment might
  instead be placed in front of the definition of the symbolic constant
  mentioned above.
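
As a rough illustration of the last two points, the check could be
expressed along these lines. This is a userspace sketch, not proposed
kernel code: the constant name NFSD_DIO_READ_EXPAND_MIN and the helper
dio_read_should_fall_back() are hypothetical, and the comment above the
constant is only a placeholder for a rationale that this thread has not
yet settled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical name, not taken from the patch. A READ smaller than
 * this is not expanded to DIO alignment and falls back to buffered IO.
 * The justification for the specific value would be documented here.
 */
#define NFSD_DIO_READ_EXPAND_MIN	(32 << 10)

/* align must be a nonzero power of two. */
bool dio_read_should_fall_back(uint64_t offset, size_t count, uint32_t align)
{
	uint64_t mask = (uint64_t)align - 1;
	uint64_t end = offset + count;
	uint64_t dio_start = offset & ~mask;		/* round down */
	uint64_t dio_end = (end + mask) & ~mask;	/* round up */
	/* True when the DIO region is larger than the requested range. */
	bool expanded = dio_start != offset || dio_end != end;

	return count < NFSD_DIO_READ_EXPAND_MIN && expanded;
}
```

With a 4096-byte alignment assumed, a 1000-byte READ at offset 100
falls back to buffered IO, while a 64KB READ at the same offset or an
aligned 4096-byte READ at offset 0 does not.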


> On one capable testbed, this commit improved IOR_HARD WRITE
> performance from 0.3433GB/s to 1.26GB/s.

Can you elaborate a bit on what the benchmark is doing? Why make a code
change rather than simply recognize this is a workload that does not
perform well with NFSD_IO_DIRECT?


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE
  2025-11-04 16:42 ` [PATCH 1/3] nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback Mike Snitzer
  2025-11-04 17:23   ` Chuck Lever
  2025-11-04 18:11   ` [PATCH v2 " Mike Snitzer
@ 2025-11-05  6:19   ` Mike Snitzer
  2025-11-05 14:58     ` Chuck Lever
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Snitzer @ 2025-11-05  6:19 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

NFSD_IO_DIRECT can easily improve streaming misaligned WRITE
performance if it uses buffered IO (without DONTCACHE) for the
misaligned end segment(s) and O_DIRECT for the aligned middle
segment's IO.

On one capable testbed, this commit improved streaming 47008 byte
write performance from 0.3433 GB/s to 1.26 GB/s.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/vfs.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

v3: drop unrelated change to avoid DONTCACHE if READ is less than 32K

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 701dd261c252..075d7162eb2e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1347,6 +1347,14 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
 		++args->nsegs;
 	}
 
+	/*
+	 * Don't use IOCB_DONTCACHE if misaligned DIO (args->nsegs > 1),
+	 * because IO for misaligned segments can benefit from the page
+	 * cache (e.g. when handling streaming misaligned IO).
+	 */
+	if (args->nsegs > 1 && (args->flags_buffered & IOCB_DONTCACHE))
+		args->flags_buffered &= ~IOCB_DONTCACHE;
+
 	return;
 
 no_dio:
@@ -1400,7 +1408,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 
 	/*
 	 * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
-	 * writing unaligned segments or handling fallback I/O.
+	 * falling back to buffered IO if entire WRITE is unaligned.
 	 */
 	args.flags_buffered = kiocb->ki_flags;
 	if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE
  2025-11-05  6:19   ` [PATCH v3 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE Mike Snitzer
@ 2025-11-05 14:58     ` Chuck Lever
  2025-11-05 17:33       ` Mike Snitzer
  0 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2025-11-05 14:58 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton; +Cc: linux-nfs

On 11/5/25 1:19 AM, Mike Snitzer wrote:
> NFSD_IO_DIRECT can easily improve streaming misaligned WRITE
> performance if it uses buffered IO (without DONTCACHE) for the
> misaligned end segment(s) and O_DIRECT for the aligned middle
> segment's IO.
> 
> On one capable testbed, this commit improved streaming 47008 byte
> write performance from 0.3433 GB/s to 1.26 GB/s.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/vfs.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> v3: drop unrelated change to avoid DONTCACHE if READ is less than 32K

The direct code path now handles the "no support for direct I/O" case
in exactly one spot: the "no_dio" label in nfsd_write_dio_iters_init().

So it seems to me that it would be slightly friendlier overall to not
set DONTCACHE in nfsd_direct_write(), but then set it just after the
"no_dio" label. The nf pointer is available in the nfsd_write_dio_args
structure.
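
A minimal userspace model of that arrangement might look like the
following. Everything here is a stand-in: SK_IOCB_DONTCACHE, struct
sk_write_args and sk_iters_init() are hypothetical names, and the
assumption that the "no_dio" path builds a single buffered segment is
mine. The point is only that when DONTCACHE is set after the label, the
misaligned ends of a split WRITE never have the flag and no separate
clearing step is needed.

```c
#include <stdbool.h>

/* Stand-in flag bit for this sketch; not the kernel's IOCB_DONTCACHE. */
#define SK_IOCB_DONTCACHE	(1u << 0)

/* Hypothetical, trimmed-down analogue of nfsd_write_dio_args. */
struct sk_write_args {
	unsigned int flags_buffered;	/* flags for buffered segments */
	unsigned int nsegs;		/* number of segments built */
	bool fop_dontcache;		/* file supports DONTCACHE */
};

/*
 * dio_possible and nsegs_if_dio stand in for the alignment analysis
 * that the real function performs on the WRITE payload.
 */
void sk_iters_init(struct sk_write_args *args, bool dio_possible,
		   unsigned int nsegs_if_dio)
{
	if (!dio_possible)
		goto no_dio;

	/* Split WRITE: buffered ends keep the page cache (no DONTCACHE). */
	args->nsegs = nsegs_if_dio;
	return;

no_dio:
	/* Entire WRITE is buffered: DONTCACHE keeps the NFSD_IO_DIRECT intent. */
	args->nsegs = 1;
	if (args->fop_dontcache)
		args->flags_buffered |= SK_IOCB_DONTCACHE;
}
```

In this model nfsd_direct_write() would leave flags_buffered untouched,
so the only place the flag is decided is the fallback path.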


> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 701dd261c252..075d7162eb2e 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1347,6 +1347,14 @@ nfsd_write_dio_iters_init(struct bio_vec *bvec, unsigned int nvecs,
>  		++args->nsegs;
>  	}
>  
> +	/*
> +	 * Don't use IOCB_DONTCACHE if misaligned DIO (args->nsegs > 1),
> +	 * because IO for misaligned segments can benefit from the page
> +	 * cache (e.g. when handling streaming misaligned IO).
> +	 */
> +	if (args->nsegs > 1 && (args->flags_buffered & IOCB_DONTCACHE))
> +		args->flags_buffered &= ~IOCB_DONTCACHE;
> +
>  	return;
>  
>  no_dio:
> @@ -1400,7 +1408,7 @@ nfsd_direct_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  
>  	/*
>  	 * IOCB_DONTCACHE preserves the intent of NFSD_IO_DIRECT when
> -	 * writing unaligned segments or handling fallback I/O.
> +	 * falling back to buffered IO if entire WRITE is unaligned.
>  	 */
>  	args.flags_buffered = kiocb->ki_flags;
>  	if (args.nf->nf_file->f_op->fop_flags & FOP_DONTCACHE)


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE
  2025-11-05 14:58     ` Chuck Lever
@ 2025-11-05 17:33       ` Mike Snitzer
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-11-05 17:33 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, linux-nfs

On Wed, Nov 05, 2025 at 09:58:28AM -0500, Chuck Lever wrote:
> On 11/5/25 1:19 AM, Mike Snitzer wrote:
> > NFSD_IO_DIRECT can easily improve streaming misaligned WRITE
> > performance if it uses buffered IO (without DONTCACHE) for the
> > misaligned end segment(s) and O_DIRECT for the aligned middle
> > segment's IO.
> > 
> > On one capable testbed, this commit improved streaming 47008 byte
> > write performance from 0.3433 GB/s to 1.26 GB/s.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> >  fs/nfsd/vfs.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > v3: drop unrelated change to avoid DONTCACHE if READ is less than 32K
> 
> The direct code path now handles the "no support for direct I/O" case
> in exactly one spot: the "no_dio" label in nfsd_write_dio_iters_init().
> 
> So it seems to me that it would be slightly friendlier overall to not
> set DONTCACHE in nfsd_direct_write(), but then set it just after the
> "no_dio" label. The nf pointer is available in the nfsd_write_dio_args
> structure.

Sure, will post a v4 of this series.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-11-05 17:33 UTC | newest]

Thread overview: 12+ messages
2025-11-04 16:42 [PATCH 0/3] NFSD: additional NFSD Direct changes Mike Snitzer
2025-11-04 16:42 ` [PATCH 1/3] nfsd: avoid using DONTCACHE for misaligned DIO's buffered IO fallback Mike Snitzer
2025-11-04 17:23   ` Chuck Lever
2025-11-04 17:35     ` Mike Snitzer
2025-11-04 19:33       ` Chuck Lever
2025-11-04 18:11   ` [PATCH v2 " Mike Snitzer
2025-11-05  6:19   ` [PATCH v3 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE Mike Snitzer
2025-11-05 14:58     ` Chuck Lever
2025-11-05 17:33       ` Mike Snitzer
2025-11-04 16:42 ` [PATCH 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how Mike Snitzer
2025-11-04 16:42 ` [PATCH 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst Mike Snitzer
2025-11-04 17:25   ` Chuck Lever
