From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <chuck.lever@oracle.com>, Jeff Layton <jlayton@kernel.org>
Cc: linux-nfs@vger.kernel.org
Subject: [PATCH v4 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst
Date: Wed, 5 Nov 2025 12:42:10 -0500
Message-ID: <20251105174210.54023-4-snitzer@kernel.org>
In-Reply-To: <20251105174210.54023-1-snitzer@kernel.org>
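Document the NFSD_IO_DIRECT_WRITE_DATA_SYNC and
NFSD_IO_DIRECT_WRITE_FILE_SYNC IO modes, why normal buffered IO is used
for misaligned WRITE segments, and the renamed nfsd_read_direct and
nfsd_write_direct trace events.
Also fix some typos.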
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
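Not for commit: a rough sketch of the alignment arithmetic described in
the "Misaligned WRITE" and "Tracing" hunks below, in case it helps
review. It is not the kernel implementation. The names Segment,
expand_read and split_write are hypothetical, a single "align" value
stands in for the device's DIO alignment, and treating a WRITE with no
aligned middle as fully buffered is an assumption.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    offset: int   # byte offset in the file
    length: int   # byte length of the segment
    direct: bool  # True if the segment is DIO-aligned (O_DIRECT candidate)


def expand_read(offset: int, length: int, align: int) -> Segment:
    """Expand a READ outward, on either end, to DIO-aligned boundaries."""
    start = offset - offset % align               # round start down
    end = -(-(offset + length) // align) * align  # round end up
    return Segment(start, end - start, True)


def split_write(offset: int, length: int, align: int) -> List[Segment]:
    """Split a WRITE into misaligned start, aligned middle, misaligned end."""
    end = offset + length
    mid_start = -(-offset // align) * align  # round start up
    mid_end = end - end % align              # round end down
    if mid_start >= mid_end:
        # Assumption: with no aligned middle, the whole WRITE is buffered.
        return [Segment(offset, length, False)]
    segments = [
        Segment(offset, mid_start - offset, False),     # buffered start
        Segment(mid_start, mid_end - mid_start, True),  # O_DIRECT middle
        Segment(mid_end, end - mid_end, False),         # buffered end
    ]
    # Drop empty start/end segments when that edge is already aligned.
    return [s for s in segments if s.length > 0]
```

With align=4096, split_write(1000, 10000, 4096) yields a 3096-byte
buffered start at offset 1000, a 4096-byte direct middle at offset 4096
and a 2808-byte buffered end at offset 8192, while
expand_read(1000, 10000, 4096) covers bytes 0 through 12287.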
.../filesystems/nfs/nfsd-io-modes.rst | 58 ++++++++++---------
1 file changed, 32 insertions(+), 26 deletions(-)
diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
index 4863885c7035..29b84c9c9e25 100644
--- a/Documentation/filesystems/nfs/nfsd-io-modes.rst
+++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
@@ -21,17 +21,20 @@ NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
Based on the configured settings, NFSD's IO will either be:
- cached using page cache (NFSD_IO_BUFFERED=0)
-- cached but removed from the page cache upon completion
- (NFSD_IO_DONTCACHE=1).
-- not cached (NFSD_IO_DIRECT=2)
+- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
+- not cached, stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
+- not cached, stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
+- not cached, stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
-To set an NFSD IO mode, write a supported value (0, 1 or 2) to the
+To set an NFSD IO mode, write a supported value (0 - 4) to the
corresponding IO operation's debugfs interface, e.g.:
echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+ echo 4 > /sys/kernel/debug/nfsd/io_cache_write
To check which IO mode NFSD is using for READ or WRITE, simply read the
corresponding IO operation's debugfs interface, e.g.:
cat /sys/kernel/debug/nfsd/io_cache_read
+ cat /sys/kernel/debug/nfsd/io_cache_write
NFSD DONTCACHE
==============
@@ -46,10 +49,10 @@ DONTCACHE aims to avoid what has proven to be a fairly significant
limition of Linux's memory management subsystem if/when large amounts of
data is infrequently accessed (e.g. read once _or_ written once but not
read until much later). Such use-cases are particularly problematic
-because the page cache will eventually become a bottleneck to surfacing
+because the page cache will eventually become a bottleneck to servicing
new IO requests.
-For more context, please see these Linux commit headers:
+For more context on DONTCACHE, please see these Linux commit headers:
- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
to take a struct kiocb")
- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
@@ -73,12 +76,18 @@ those with a working set that is significantly larger than available
system memory. The pathological worst-case workload that NFSD DIRECT has
proven to help most is: NFS client issuing large sequential IO to a file
that is 2-3 times larger than the NFS server's available system memory.
+The reason for such improvement is that NFSD DIRECT eliminates a lot of
+work that the memory management subsystem would otherwise be required
+to perform (e.g. page allocation, dirty writeback, page reclaim). When
+using NFSD DIRECT, kswapd and kcompactd no longer consume CPU time
+trying to find adequate free pages so that forward IO progress can be
+made.
The performance win associated with using NFSD DIRECT was previously
discussed on linux-nfs, see:
https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
But in summary:
-- NFSD DIRECT can signicantly reduce memory requirements
+- NFSD DIRECT can significantly reduce memory requirements
- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
- NFSD DIRECT can offer more deterministic IO performance
@@ -91,11 +100,11 @@ to generate a "flamegraph" for work Linux must perform on behalf of your
test is a really meaningful way to compare the relative health of the
system and how switching NFSD's IO mode changes what is observed.
-If NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
-interfaces, ideally the IO will be aligned relative to the underlying
-block device's logical_block_size. Also the memory buffer used to store
-the READ or WRITE payload must be aligned relative to the underlying
-block device's dma_alignment.
+If NFSD_IO_DIRECT is specified by writing 2 (or 3 or 4 for WRITE) to
+NFSD's debugfs interfaces, ideally the IO will be aligned relative to
+the underlying block device's logical_block_size. Also the memory buffer
+used to store the READ or WRITE payload must be aligned relative to the
+underlying block device's dma_alignment.
But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
it can:
@@ -113,32 +122,29 @@ Misaligned READ:
Misaligned WRITE:
If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
- middle and end as needed. The large middle extent is DIO-aligned and
- the start and/or end are misaligned. Buffered IO is used for the
- misaligned extents and O_DIRECT is used for the middle DIO-aligned
- extent.
-
- If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
- invalidate the page cache on behalf of the DIO WRITE, then
- nfsd_issue_write_dio() will fall back to using buffered IO.
+ middle and end as needed. The large middle segment is DIO-aligned
+ and the start and/or end are misaligned. Buffered IO is used for the
+ misaligned segments and O_DIRECT is used for the middle DIO-aligned
+ segment. DONTCACHE buffered IO is _not_ used for the misaligned
+ segments because normal buffered IO offers a significant RMW
+ performance benefit when handling streaming misaligned WRITEs.
Tracing:
- The nfsd_analyze_read_dio trace event shows how NFSD expands any
+ The nfsd_read_direct trace event shows how NFSD expands any
misaligned READ to the next DIO-aligned block (on either end of the
original READ, as needed).
This combination of trace events is useful for READs:
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
- echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
- The nfsd_analyze_write_dio trace event shows how NFSD splits a given
- misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
- extent.
+ The nfsd_write_direct trace event shows how NFSD splits a given
+ misaligned WRITE into a DIO-aligned middle segment.
This combination of trace events is useful for WRITEs:
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
- echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
+ echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
--
2.44.0
Thread overview: 16+ messages
2025-11-05 17:42 [PATCH v4 0/3] [PATCH 0/3] NFSD: additional NFSD Direct changes Mike Snitzer
2025-11-05 17:42 ` [PATCH v4 1/3] NFSD: avoid DONTCACHE for misaligned ends of misaligned DIO WRITE Mike Snitzer
2025-11-05 18:47 ` Chuck Lever
2025-11-07 15:29 ` Christoph Hellwig
2025-11-05 17:42 ` [PATCH v4 2/3] NFSD: add new NFSD_IO_DIRECT variants that may override stable_how Mike Snitzer
2025-11-05 18:49 ` Chuck Lever
2025-11-06 20:17 ` Mike Snitzer
2025-11-06 20:35 ` Chuck Lever
2025-11-06 22:56 ` Mike Snitzer
2025-11-07 14:48 ` Chuck Lever
2025-11-07 15:34 ` Christoph Hellwig
2025-11-07 15:35 ` Chuck Lever
2025-11-07 15:40 ` Christoph Hellwig
2025-11-07 15:30 ` Christoph Hellwig
2025-11-05 17:42 ` Mike Snitzer [this message]
2025-11-05 18:50 ` [PATCH v4 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst Chuck Lever