From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <chuck.lever@oracle.com>, Jeff Layton <jlayton@kernel.org>
Cc: linux-nfs@vger.kernel.org
Subject: [PATCH v9 8/9] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
Date: Wed, 3 Sep 2025 16:51:20 -0400 [thread overview]
Message-ID: <20250903205121.41380-9-snitzer@kernel.org> (raw)
In-Reply-To: <20250903205121.41380-1-snitzer@kernel.org>
This document details the NFSD IO modes that are configurable using
NFSD's experimental debugfs interfaces:
/sys/kernel/debug/nfsd/io_cache_read
/sys/kernel/debug/nfsd/io_cache_write
This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
debugfs interfaces are replaced with per-export controls).
Future updates will provide more specific guidance and howto
information to help others use and evaluate NFSD's IO modes:
BUFFERED, DONTCACHE and DIRECT.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
.../filesystems/nfs/nfsd-io-modes.rst | 144 ++++++++++++++++++
1 file changed, 144 insertions(+)
create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
new file mode 100644
index 0000000000000..4863885c70352
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -0,0 +1,144 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+NFSD IO MODES
+=============
+
+Overview
+========
+
+NFSD has historically always used buffered IO when servicing READ and
+WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
+to override that default to use either DONTCACHE or DIRECT IO modes.
+
+Experimental NFSD debugfs interfaces are available to allow the NFSD IO
+mode used for READ and WRITE to be configured independently. See both:
+- /sys/kernel/debug/nfsd/io_cache_read
+- /sys/kernel/debug/nfsd/io_cache_write
+
+The default value for both io_cache_read and io_cache_write reflects
+NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
+
+Based on the configured settings, NFSD's IO will either be:
+- cached using page cache (NFSD_IO_BUFFERED=0)
+- cached but removed from the page cache upon completion
+ (NFSD_IO_DONTCACHE=1).
+- not cached (NFSD_IO_DIRECT=2)
+
+To set an NFSD IO mode, write a supported value (0, 1 or 2) to the
+corresponding IO operation's debugfs interface, e.g.:
+ echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+
+To check which IO mode NFSD is using for READ or WRITE, simply read the
+corresponding IO operation's debugfs interface, e.g.:
+ cat /sys/kernel/debug/nfsd/io_cache_read
+
+NFSD DONTCACHE
+==============
+
+DONTCACHE offers a hybrid approach to servicing IO that aims to offer
+the benefits of using DIRECT IO without any of the strict alignment
+requirements that DIRECT IO imposes. To achieve this buffered IO is used
+but the IO is flagged to "drop behind" (meaning associated pages are
+dropped from the page cache) when IO completes.
+
+DONTCACHE aims to avoid what has proven to be a fairly significant
+limition of Linux's memory management subsystem if/when large amounts of
+data is infrequently accessed (e.g. read once _or_ written once but not
+read until much later). Such use-cases are particularly problematic
+because the page cache will eventually become a bottleneck to surfacing
+new IO requests.
+
+For more context, please see these Linux commit headers:
+- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
+ to take a struct kiocb")
+- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
+ RWF_DONTCACHE")
+- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
+
+If NFSD_IO_DONTCACHE is specified by writing 1 to NFSD's debugfs
+interfaces, FOP_DONTCACHE must be advertised as supported by the
+underlying filesystem (e.g. XFS), otherwise all IO flagged with
+RWF_DONTCACHE will fail with -EOPNOTSUPP.
+
+NFSD DIRECT
+===========
+
+DIRECT IO doesn't make use of the page cache, as such it is able to
+avoid the Linux memory management's page reclaim scalability problems
+without resorting to the hybrid use of page cache that DONTCACHE does.
+
+Some workloads benefit from NFSD avoiding the page cache, particularly
+those with a working set that is significantly larger than available
+system memory. The pathological worst-case workload that NFSD DIRECT has
+proven to help most is: NFS client issuing large sequential IO to a file
+that is 2-3 times larger than the NFS server's available system memory.
+
+The performance win associated with using NFSD DIRECT was previously
+discussed on linux-nfs, see:
+https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
+But in summary:
+- NFSD DIRECT can signicantly reduce memory requirements
+- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
+- NFSD DIRECT can offer more deterministic IO performance
+
+As always, your mileage may vary and so it is important to carefully
+consider if/when it is beneficial to make use of NFSD DIRECT. When
+assessing comparative performance of your workload please be sure to log
+relevant performance metrics during testing (e.g. memory usage, cpu
+usage, IO performance). Using perf to collect perf data that may be used
+to generate a "flamegraph" for work Linux must perform on behalf of your
+test is a really meaningful way to compare the relative health of the
+system and how switching NFSD's IO mode changes what is observed.
+
+If NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
+interfaces, ideally the IO will be aligned relative to the underlying
+block device's logical_block_size. Also the memory buffer used to store
+the READ or WRITE payload must be aligned relative to the underlying
+block device's dma_alignment.
+
+But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
+it can:
+
+Misaligned READ:
+ If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
+ DIO-aligned block (on either end of the READ). The expanded READ is
+ verified to have proper offset/len (logical_block_size) and
+ dma_alignment checking.
+
+ Any misaligned READ that is less than 32K won't be expanded to be
+ DIO-aligned (this heuristic just avoids excess work, like allocating
+ start_extra_page, for smaller IO that can generally already perform
+ well using buffered IO).
+
+Misaligned WRITE:
+ If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
+ middle and end as needed. The large middle extent is DIO-aligned and
+ the start and/or end are misaligned. Buffered IO is used for the
+ misaligned extents and O_DIRECT is used for the middle DIO-aligned
+ extent.
+
+ If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
+ invalidate the page cache on behalf of the DIO WRITE, then
+ nfsd_issue_write_dio() will fall back to using buffered IO.
+
+Tracing:
+ The nfsd_analyze_read_dio trace event shows how NFSD expands any
+ misaligned READ to the next DIO-aligned block (on either end of the
+ original READ, as needed).
+
+ This combination of trace events is useful for READs:
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
+ echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
+
+ The nfsd_analyze_write_dio trace event shows how NFSD splits a given
+ misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
+ extent.
+
+ This combination of trace events is useful for WRITEs:
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
+ echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
--
2.44.0
next prev parent reply other threads:[~2025-09-03 20:51 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-03 20:51 [PATCH v9 0/9] NFSD: add "NFSD DIRECT" and "NFSD DONTCACHE" IO modes Mike Snitzer
2025-09-03 20:51 ` [PATCH v9 1/9] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-09-03 20:51 ` [PATCH v9 2/9] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-09-03 20:51 ` [PATCH v9 3/9] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
2025-09-03 20:51 ` [PATCH v9 4/9] NFSD: add io_cache_write " Mike Snitzer
2025-09-03 20:51 ` [PATCH v9 5/9] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-09-03 20:51 ` [PATCH v9 6/9] NFSD: issue WRITEs " Mike Snitzer
2025-09-03 20:51 ` [PATCH v9 7/9] NFSD: add nfsd_analyze_read_dio and nfsd_analyze_write_dio trace events Mike Snitzer
2025-09-03 20:51 ` Mike Snitzer [this message]
2025-09-03 20:51 ` [PATCH v9 9/9] NFSD: use /end/ of rq_pages for misaligned DIO READ's start_extra page Mike Snitzer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250903205121.41380-9-snitzer@kernel.org \
--to=snitzer@kernel.org \
--cc=chuck.lever@oracle.com \
--cc=jlayton@kernel.org \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox