From: Chuck Lever <chuck.lever@oracle.com>
To: Mike Snitzer <snitzer@kernel.org>, Jeff Layton <jlayton@kernel.org>
Cc: linux-nfs@vger.kernel.org
Subject: Re: [PATCH v4 3/3] NFSD: update Documentation/filesystems/nfs/nfsd-io-modes.rst
Date: Wed, 5 Nov 2025 13:50:20 -0500
Message-ID: <65e3729d-8434-4bdd-8039-804782c20f95@oracle.com>
In-Reply-To: <20251105174210.54023-4-snitzer@kernel.org>

On 11/5/25 12:42 PM, Mike Snitzer wrote:
> Also fixed some typos.
>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
> .../filesystems/nfs/nfsd-io-modes.rst | 58 ++++++++++---------
> 1 file changed, 32 insertions(+), 26 deletions(-)
>
> diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> index 4863885c7035..29b84c9c9e25 100644
> --- a/Documentation/filesystems/nfs/nfsd-io-modes.rst
> +++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> @@ -21,17 +21,20 @@ NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
>
> Based on the configured settings, NFSD's IO will either be:
> - cached using page cache (NFSD_IO_BUFFERED=0)
> -- cached but removed from the page cache upon completion
> - (NFSD_IO_DONTCACHE=1).
> -- not cached (NFSD_IO_DIRECT=2)
> +- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
> +- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
> +- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
> +- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
>
> -To set an NFSD IO mode, write a supported value (0, 1 or 2) to the
> +To set an NFSD IO mode, write a supported value (0 - 4) to the
> corresponding IO operation's debugfs interface, e.g.:
> echo 2 > /sys/kernel/debug/nfsd/io_cache_read
> + echo 4 > /sys/kernel/debug/nfsd/io_cache_write
>
> To check which IO mode NFSD is using for READ or WRITE, simply read the
> corresponding IO operation's debugfs interface, e.g.:
> cat /sys/kernel/debug/nfsd/io_cache_read
> + cat /sys/kernel/debug/nfsd/io_cache_write
>
> NFSD DONTCACHE
> ==============
> @@ -46,10 +49,10 @@ DONTCACHE aims to avoid what has proven to be a fairly significant
> limition of Linux's memory management subsystem if/when large amounts of
> data is infrequently accessed (e.g. read once _or_ written once but not
> read until much later). Such use-cases are particularly problematic
> -because the page cache will eventually become a bottleneck to surfacing
> +because the page cache will eventually become a bottleneck to servicing
> new IO requests.
>
> -For more context, please see these Linux commit headers:
> +For more context on DONTCACHE, please see these Linux commit headers:
> - Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
> to take a struct kiocb")
> - for READ: 8026e49bff9b1 ("mm/filemap: add read support for
> @@ -73,12 +76,18 @@ those with a working set that is significantly larger than available
> system memory. The pathological worst-case workload that NFSD DIRECT has
> proven to help most is: NFS client issuing large sequential IO to a file
> that is 2-3 times larger than the NFS server's available system memory.
> +The reason for such improvement is NFSD DIRECT eliminates a lot of work
> +that the memory management subsystem would otherwise be required to
> +perform (e.g. page allocation, dirty writeback, page reclaim). When
> +using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
> +time trying to find adequate free pages so that forward IO progress can
> +be made.
>
> The performance win associated with using NFSD DIRECT was previously
> discussed on linux-nfs, see:
> https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
> But in summary:
> -- NFSD DIRECT can signicantly reduce memory requirements
> +- NFSD DIRECT can significantly reduce memory requirements
> - NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
> - NFSD DIRECT can offer more deterministic IO performance
>
> @@ -91,11 +100,11 @@ to generate a "flamegraph" for work Linux must perform on behalf of your
> test is a really meaningful way to compare the relative health of the
> system and how switching NFSD's IO mode changes what is observed.
>
> -If NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
> -interfaces, ideally the IO will be aligned relative to the underlying
> -block device's logical_block_size. Also the memory buffer used to store
> -the READ or WRITE payload must be aligned relative to the underlying
> -block device's dma_alignment.
> +If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
> +NFSD's debugfs interfaces, ideally the IO will be aligned relative to
> +the underlying block device's logical_block_size. Also the memory buffer
> +used to store the READ or WRITE payload must be aligned relative to the
> +underlying block device's dma_alignment.
>
> But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
> it can:
> @@ -113,32 +122,29 @@ Misaligned READ:
>
> Misaligned WRITE:
> If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
> - middle and end as needed. The large middle extent is DIO-aligned and
> - the start and/or end are misaligned. Buffered IO is used for the
> - misaligned extents and O_DIRECT is used for the middle DIO-aligned
> - extent.
> -
> - If vfs_iocb_iter_write() returns -ENOTBLK, due to its inability to
> - invalidate the page cache on behalf of the DIO WRITE, then
> - nfsd_issue_write_dio() will fall back to using buffered IO.
> + middle and end as needed. The large middle segment is DIO-aligned
> + and the start and/or end are misaligned. Buffered IO is used for the
> + misaligned segments and O_DIRECT is used for the middle DIO-aligned
> + segment. DONTCACHE buffered IO is _not_ used for the misaligned
> + segments because using normal buffered IO offers significant RMW
> + performance benefit when handling streaming misaligned WRITEs.
>
> Tracing:
> - The nfsd_analyze_read_dio trace event shows how NFSD expands any
> + The nfsd_read_direct trace event shows how NFSD expands any
> misaligned READ to the next DIO-aligned block (on either end of the
> original READ, as needed).
>
> This combination of trace events is useful for READs:
> echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
> - echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_read_dio/enable
> + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
> echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
> echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
>
> - The nfsd_analyze_write_dio trace event shows how NFSD splits a given
> - misaligned WRITE into a mix of misaligned extent(s) and a DIO-aligned
> - extent.
> + The nfsd_write_direct trace event shows how NFSD splits a given
> + misaligned WRITE into a DIO-aligned middle segment.
>
> This combination of trace events is useful for WRITEs:
> echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
> - echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_analyze_write_dio/enable
> + echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
> echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
> echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable

I've already squashed the previous version of this patch into my private
tree... if you confirm there were no changes, I'll leave this one for
now.

--
Chuck Lever
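
Note: the "Misaligned WRITE" text quoted above describes splitting a
WRITE into a start, middle and end, with only the middle segment being
DIO-aligned. The arithmetic can be modeled in a few lines of Python.
This is a minimal sketch under stated assumptions, not NFSD's
implementation: the function name split_misaligned_write and the single
align parameter (standing in for the block device's logical_block_size)
are made up for this example, and the separate dma_alignment requirement
on the memory buffer is ignored.

```python
def split_misaligned_write(offset, length, align):
    """Split the byte range [offset, offset + length) into three segments.

    Returns (start, middle, end), each an (offset, length) tuple. The
    middle segment begins and ends on a multiple of `align`; start and
    end hold whatever misaligned bytes remain on either side.
    Hypothetical helper for illustration only, not a kernel function.
    """
    if offset < 0 or length < 0 or align <= 0:
        raise ValueError("offset/length must be >= 0 and align must be > 0")

    end = offset + length
    # Round the start up and the end down to the nearest aligned boundary.
    mid_start = -(-offset // align) * align
    mid_end = (end // align) * align

    if mid_start >= mid_end:
        # No aligned middle exists: report the whole range as the start
        # segment, with empty middle and end segments.
        return (offset, length), (end, 0), (end, 0)

    start_seg = (offset, mid_start - offset)
    middle_seg = (mid_start, mid_end - mid_start)
    end_seg = (mid_end, end - mid_end)
    return start_seg, middle_seg, end_seg


if __name__ == "__main__":
    # A 10000-byte WRITE at offset 1000, assuming 4096-byte alignment.
    for name, seg in zip(("start", "middle", "end"),
                         split_misaligned_write(1000, 10000, 4096)):
        print(name, seg)
```

In terms of the quoted documentation, the start and end segments would
go through normal buffered IO and the middle segment through O_DIRECT; a
zero-length segment simply means there is nothing to issue for that part.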