From: Chuck Lever <cel@kernel.org>
To: NeilBrown <neil@brown.name>, Jeff Layton <jlayton@kernel.org>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Cc: <linux-nfs@vger.kernel.org>, Mike Snitzer <snitzer@kernel.org>
Subject: [PATCH v9 12/12] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
Date: Mon,  3 Nov 2025 11:53:51 -0500
Message-ID: <20251103165351.10261-13-cel@kernel.org>
In-Reply-To: <20251103165351.10261-1-cel@kernel.org>

From: Mike Snitzer <snitzer@kernel.org>

This document details the NFSD IO modes that are configurable using
NFSD's experimental debugfs interfaces:

  /sys/kernel/debug/nfsd/io_cache_read
  /sys/kernel/debug/nfsd/io_cache_write

This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
debugfs interfaces are replaced with per-export controls).

Future updates will provide more specific guidance and howto
information to help others use and evaluate NFSD's IO modes:
BUFFERED, DONTCACHE and DIRECT.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
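
Illustration only, not part of the patch: the "Misaligned READ" rule
in the document added below can be sketched as plain offset arithmetic.
This is a minimal sketch, not NFSD's implementation. The function name
expand_read(), the 512-byte default block size and the returned tuple
are assumptions made for the example; only the 32K cutoff comes from
the document. The dma_alignment check on the memory buffer is not
modelled.

```python
def expand_read(offset, length, block_size=512, min_expand=32 * 1024):
    """Return the (start, end) byte range that would be read.

    Hypothetical helper: block_size stands in for the block device's
    logical_block_size, min_expand for the 32K cutoff in the document.
    """
    end = offset + length
    aligned = offset % block_size == 0 and end % block_size == 0
    # Aligned READs need no change; small misaligned READs are left
    # as they are rather than expanded.
    if aligned or length < min_expand:
        return offset, end
    start = offset - offset % block_size      # round start down
    end = -(-end // block_size) * block_size  # round end up
    return start, end
```

With these assumptions, a 64K READ at offset 100 grows to cover bytes
0 through 66048, while a 4K READ at the same offset is left untouched.
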
 .../filesystems/nfs/nfsd-io-modes.rst         | 144 ++++++++++++++++++
 1 file changed, 144 insertions(+)
 create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
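
Illustration only, not part of the patch: the "Misaligned WRITE" split
described in the document below can be sketched the same way. This is
a rough sketch under stated assumptions, not the code in
nfsd_issue_write_dio(). The name split_write(), the segment tuples and
the 512-byte default block size are invented for the example, and the
dma_alignment requirement on the memory buffer is ignored.

```python
def split_write(offset, length, block_size=512):
    """Split a WRITE into (mode, offset, length) segments.

    Hypothetical helper: "direct" marks the DIO-aligned middle,
    "buffered" marks a misaligned start or end.
    """
    end = offset + length
    mid_start = -(-offset // block_size) * block_size  # round up
    mid_end = end - end % block_size                   # round down
    # No aligned block fits inside the WRITE: keep it all buffered.
    if mid_start >= mid_end:
        return [("buffered", offset, length)]
    segments = []
    if offset < mid_start:
        segments.append(("buffered", offset, mid_start - offset))
    segments.append(("direct", mid_start, mid_end - mid_start))
    if mid_end < end:
        segments.append(("buffered", mid_end, end - mid_end))
    return segments
```

For example, a 2000-byte WRITE at offset 100 would become a 412-byte
buffered start, a 1536-byte direct middle and a 52-byte buffered end.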

diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
new file mode 100644
index 000000000000..4863885c7035
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -0,0 +1,144 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+NFSD IO MODES
+=============
+
+Overview
+========
+
+NFSD has historically always used buffered IO when servicing READ and
+WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
+to override that default to use either DONTCACHE or DIRECT IO modes.
+
+Experimental NFSD debugfs interfaces are available to allow the NFSD IO
+mode used for READ and WRITE to be configured independently. See both:
+- /sys/kernel/debug/nfsd/io_cache_read
+- /sys/kernel/debug/nfsd/io_cache_write
+
+The default value for both io_cache_read and io_cache_write reflects
+NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
+
+Based on the configured settings, NFSD's IO will either be:
+- cached using page cache (NFSD_IO_BUFFERED=0)
+- cached but removed from the page cache upon completion
+  (NFSD_IO_DONTCACHE=1)
+- not cached (NFSD_IO_DIRECT=2)
+
+To set an NFSD IO mode, write a supported value (0, 1 or 2) to the
+corresponding IO operation's debugfs interface, e.g.:
+  echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+
+To check which IO mode NFSD is using for READ or WRITE, simply read the
+corresponding IO operation's debugfs interface, e.g.:
+  cat /sys/kernel/debug/nfsd/io_cache_read
+
+NFSD DONTCACHE
+==============
+
+DONTCACHE offers a hybrid approach to servicing IO that aims to offer
+the benefits of using DIRECT IO without any of the strict alignment
+requirements that DIRECT IO imposes. To achieve this, buffered IO is
+used but the IO is flagged to "drop behind" (meaning associated pages
+are dropped from the page cache) when IO completes.
+
+DONTCACHE aims to avoid what has proven to be a fairly significant
+limitation of Linux's memory management subsystem if/when large amounts
+of data are infrequently accessed (e.g. read once _or_ written once but
+not read until much later). Such use-cases are particularly problematic
+because the page cache will eventually become a bottleneck to servicing
+new IO requests.
+
+For more context, please see these Linux commit headers:
+- Overview:  9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
+  to take a struct kiocb")
+- for READ:  8026e49bff9b1 ("mm/filemap: add read support for
+  RWF_DONTCACHE")
+- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
+
+If NFSD_IO_DONTCACHE is specified by writing 1 to NFSD's debugfs
+interfaces, FOP_DONTCACHE must be advertised as supported by the
+underlying filesystem (e.g. XFS), otherwise all IO flagged with
+RWF_DONTCACHE will fail with -EOPNOTSUPP.
+
+NFSD DIRECT
+===========
+
+DIRECT IO doesn't make use of the page cache; as such, it is able to
+avoid the Linux memory management's page reclaim scalability problems
+without resorting to the hybrid use of page cache that DONTCACHE does.
+
+Some workloads benefit from NFSD avoiding the page cache, particularly
+those with a working set that is significantly larger than available
+system memory. The pathological worst-case workload that NFSD DIRECT has
+proven to help most is: NFS client issuing large sequential IO to a file
+that is 2-3 times larger than the NFS server's available system memory.
+
+The performance win associated with using NFSD DIRECT was previously
+discussed on linux-nfs, see:
+https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
+But in summary:
+- NFSD DIRECT can significantly reduce memory requirements
+- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
+- NFSD DIRECT can offer more deterministic IO performance
+
+As always, your mileage may vary and so it is important to carefully
+consider if/when it is beneficial to make use of NFSD DIRECT. When
+assessing comparative performance of your workload please be sure to log
+relevant performance metrics during testing (e.g. memory usage, CPU
+usage, IO performance). Using perf to collect data that can be used to
+generate a "flamegraph" of the work Linux must perform on behalf of your
+test is a meaningful way to compare the relative health of the system
+and how switching NFSD's IO mode changes what is observed.
+
+If NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
+interfaces, ideally the IO will be aligned relative to the underlying
+block device's logical_block_size. Also the memory buffer used to store
+the READ or WRITE payload must be aligned relative to the underlying
+block device's dma_alignment.
+
+But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
+it can:
+
+Misaligned READ:
+    If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
+    DIO-aligned block (on either end of the READ). The expanded READ is
+    verified to have a proper offset/len (logical_block_size) and proper
+    memory buffer alignment (dma_alignment).
+
+    Any misaligned READ that is less than 32K won't be expanded to be
+    DIO-aligned (this heuristic just avoids excess work, like allocating
+    start_extra_page, for smaller IO that can generally already perform
+    well using buffered IO).
+
+Misaligned WRITE:
+    If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
+    middle and end as needed. The large middle extent is DIO-aligned and
+    the start and/or end are misaligned. Buffered IO is used for the
+    misaligned extents and O_DIRECT is used for the middle DIO-aligned
+    extent.
+
+    If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
+    invalidate the page cache on behalf of the DIO WRITE, then
+    nfsd_issue_write_dio() will fall back to using buffered IO.
+
+Tracing:
+    The nfsd_analyze_read_dio trace event shows how NFSD expands any
+    misaligned READ to the next DIO-aligned block (on either end of the
+    original READ, as needed).
+
+    This combination of trace events is useful for READs:
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
+    echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
+
+    The nfsd_analyze_write_dio trace event shows how NFSD splits a given
+    misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
+    extent.
+
+    This combination of trace events is useful for WRITEs:
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
+    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
+    echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
-- 
2.51.0

