linux-nfs.vger.kernel.org archive mirror
From: Chuck Lever <cel@kernel.org>
To: Mike Snitzer <snitzer@kernel.org>, NeilBrown <neilb@ownmail.net>
Cc: Jeff Layton <jlayton@kernel.org>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	linux-nfs@vger.kernel.org
Subject: Re: [PATCH v10 5/5] NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
Date: Thu, 6 Nov 2025 10:52:33 -0500
Message-ID: <9705aa01-eec8-471f-b18b-41017dbd7440@kernel.org>
In-Reply-To: <aQzC85evu-w3-apF@kernel.org>

On 11/6/25 10:46 AM, Mike Snitzer wrote:
> On Thu, Nov 06, 2025 at 09:24:06PM +1100, NeilBrown wrote:
>> On Thu, 06 Nov 2025, Chuck Lever wrote:
>>> From: Mike Snitzer <snitzer@kernel.org>
>>>
>>> This document details the NFSD IO modes that are configurable using
>>> NFSD's experimental debugfs interfaces:
>>>
>>>   /sys/kernel/debug/nfsd/io_cache_read
>>>   /sys/kernel/debug/nfsd/io_cache_write
>>>
>>> This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
>>> debugfs interfaces are replaced with per-export controls).
>>>
>>> Future updates will provide more specific guidance and howto
>>> information to help others use and evaluate NFSD's IO modes:
>>> BUFFERED, DONTCACHE and DIRECT.
>>>
>>> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>> ---
>>>  .../filesystems/nfs/nfsd-io-modes.rst         | 150 ++++++++++++++++++
>>>  1 file changed, 150 insertions(+)
>>>  create mode 100644 Documentation/filesystems/nfs/nfsd-io-modes.rst
>>>
>>> diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
>>> new file mode 100644
>>> index 000000000000..29b84c9c9e25
>>> --- /dev/null
>>> +++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
>>> @@ -0,0 +1,150 @@
>>> +.. SPDX-License-Identifier: GPL-2.0
>>> +
>>> +=============
>>> +NFSD IO MODES
>>> +=============
>>> +
>>> +Overview
>>> +========
>>> +
>>> +NFSD has historically always used buffered IO when servicing READ and
>>> +WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
>>> +to override that default to use either DONTCACHE or DIRECT IO modes.
>>> +
>>> +Experimental NFSD debugfs interfaces are available to allow the NFSD IO
>>> +mode used for READ and WRITE to be configured independently. See both:
>>> +- /sys/kernel/debug/nfsd/io_cache_read
>>> +- /sys/kernel/debug/nfsd/io_cache_write
>>> +
>>> +The default value for both io_cache_read and io_cache_write reflects
>>> +NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
>>> +
>>> +Based on the configured settings, NFSD's IO will either be:
>>> +- cached using page cache (NFSD_IO_BUFFERED=0)
>>> +- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
>>> +- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
>>> +- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
>>> +- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
>>> +
>>> +To set an NFSD IO mode, write a supported value (0 - 4) to the
>>> +corresponding IO operation's debugfs interface, e.g.:
>>> +  echo 2 > /sys/kernel/debug/nfsd/io_cache_read
>>> +  echo 4 > /sys/kernel/debug/nfsd/io_cache_write
>>> +
>>> +To check which IO mode NFSD is using for READ or WRITE, simply read the
>>> +corresponding IO operation's debugfs interface, e.g.:
>>> +  cat /sys/kernel/debug/nfsd/io_cache_read
>>> +  cat /sys/kernel/debug/nfsd/io_cache_write
>>> +
>>> +NFSD DONTCACHE
>>> +==============
>>> +
>>> +DONTCACHE offers a hybrid approach to servicing IO that aims to offer
>>> +the benefits of using DIRECT IO without any of the strict alignment
>>> +requirements that DIRECT IO imposes. To achieve this buffered IO is used
>>> +but the IO is flagged to "drop behind" (meaning associated pages are
>>> +dropped from the page cache) when IO completes.
>>> +
>>> +DONTCACHE aims to avoid what has proven to be a fairly significant
>>> +limitation of Linux's memory management subsystem if/when large amounts of
>>> +data are infrequently accessed (e.g. read once _or_ written once but not
>>> +read until much later). Such use-cases are particularly problematic
>>> +because the page cache will eventually become a bottleneck to servicing
>>> +new IO requests.
>>> +
>>> +For more context on DONTCACHE, please see these Linux commit headers:
>>> +- Overview:  9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
>>> +  to take a struct kiocb")
>>> +- for READ:  8026e49bff9b1 ("mm/filemap: add read support for
>>> +  RWF_DONTCACHE")
>>> +- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
>>> +
>>> +If NFSD_IO_DONTCACHE is specified by writing 1 to NFSD's debugfs
>>> +interfaces, FOP_DONTCACHE must be advertised as supported by the
>>> +underlying filesystem (e.g. XFS), otherwise all IO flagged with
>>> +RWF_DONTCACHE will fail with -EOPNOTSUPP.
>>
>> If FOP_DONTCACHE isn't advertised, nfsd doesn't even try RWF_DONTCACHE,
>> so errors don't occur.
>>
>> Maybe:
>>
>>   "NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the
>>   underlying filesystem doesn't indicate support by setting
>>   FOP_DONTCACHE."
>>
>>> +
>>> +NFSD DIRECT
>>> +===========
>>> +
>>> +DIRECT IO doesn't make use of the page cache, as such it is able to
>>> +avoid the Linux memory management's page reclaim scalability problems
>>> +without resorting to the hybrid use of page cache that DONTCACHE does.
>>> +
>>> +Some workloads benefit from NFSD avoiding the page cache, particularly
>>> +those with a working set that is significantly larger than available
>>> +system memory. The pathological worst-case workload that NFSD DIRECT has
>>> +proven to help most is: NFS client issuing large sequential IO to a file
>>> +that is 2-3 times larger than the NFS server's available system memory.
>>> +The reason for such improvement is NFSD DIRECT eliminates a lot of work
>>> +that the memory management subsystem would otherwise be required to
>>> +perform (e.g. page allocation, dirty writeback, page reclaim). When
>>> +using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
>>> +time trying to find adequate free pages so that forward IO progress can
>>> +be made.
>>> +
>>> +The performance win associated with using NFSD DIRECT was previously
>>> +discussed on linux-nfs, see:
>>> +https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
>>> +But in summary:
>>> +- NFSD DIRECT can significantly reduce memory requirements
>>> +- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
>>> +- NFSD DIRECT can offer more deterministic IO performance
>>> +
>>> +As always, your mileage may vary and so it is important to carefully
>>> +consider if/when it is beneficial to make use of NFSD DIRECT. When
>>> +assessing comparative performance of your workload please be sure to log
>>> +relevant performance metrics during testing (e.g. memory usage, cpu
>>> +usage, IO performance). Using perf to collect perf data that may be used
>>> +to generate a "flamegraph" for work Linux must perform on behalf of your
>>> +test is a really meaningful way to compare the relative health of the
>>> +system and how switching NFSD's IO mode changes what is observed.
>>> +
>>> +If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
>>> +NFSD's debugfs interfaces, ideally the IO will be aligned relative to
>>> +the underlying block device's logical_block_size. Also the memory buffer
>>> +used to store the READ or WRITE payload must be aligned relative to the
>>> +underlying block device's dma_alignment.
>>> +
>>> +But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
>>> +it can:
>>> +
>>> +Misaligned READ:
>>> +    If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
>>> +    DIO-aligned block (on either end of the READ). The expanded READ is
>>> +    verified to have proper offset/len (logical_block_size) and
>>> +    dma_alignment checking.
>>> +
>>> +    Any misaligned READ that is less than 32K won't be expanded to be
>>> +    DIO-aligned (this heuristic just avoids excess work, like allocating
>>> +    start_extra_page, for smaller IO that can generally already perform
>>> +    well using buffered IO).
>>
>> I couldn't find this 32K in the code.
>>
>> Do we want to say something like:
>>
>>   If you experiment with this on a recent kernel and have interesting
>>   results, please report them to linux-nfs@vger.kernel.org
>>
>> Thanks,
>> NeilBrown
>>
> 
> Thanks for the review, I clearly missed some clean up.  Chuck, please
> consider applying this incremental patch which should address Neil's
> feedback and remove some stable_how related changes that aren't
> relevant without my corresponding patch:
> 
> diff --git a/Documentation/filesystems/nfs/nfsd-io-modes.rst b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> index 29b84c9c9e25..e3a522d09766 100644
> --- a/Documentation/filesystems/nfs/nfsd-io-modes.rst
> +++ b/Documentation/filesystems/nfs/nfsd-io-modes.rst
> @@ -23,19 +23,20 @@ Based on the configured settings, NFSD's IO will either be:
>  - cached using page cache (NFSD_IO_BUFFERED=0)
>  - cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
>  - not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
> -- not cached stable_how=NFS_DATA_SYNC (NFSD_IO_DIRECT_WRITE_DATA_SYNC=3)
> -- not cached stable_how=NFS_FILE_SYNC (NFSD_IO_DIRECT_WRITE_FILE_SYNC=4)
>  
> -To set an NFSD IO mode, write a supported value (0 - 4) to the
> +To set an NFSD IO mode, write a supported value (0 - 2) to the
>  corresponding IO operation's debugfs interface, e.g.:
>    echo 2 > /sys/kernel/debug/nfsd/io_cache_read
> -  echo 4 > /sys/kernel/debug/nfsd/io_cache_write
> +  echo 2 > /sys/kernel/debug/nfsd/io_cache_write
>  
>  To check which IO mode NFSD is using for READ or WRITE, simply read the
>  corresponding IO operation's debugfs interface, e.g.:
>    cat /sys/kernel/debug/nfsd/io_cache_read
>    cat /sys/kernel/debug/nfsd/io_cache_write
>  
> +If you experiment with NFSD's IO modes on a recent kernel and have
> +interesting results, please report them to linux-nfs@vger.kernel.org
> +
>  NFSD DONTCACHE
>  ==============
>  
> @@ -59,10 +60,8 @@ For more context on DONTCACHE, please see these Linux commit headers:
>    RWF_DONTCACHE")
>  - for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
>  
> -If NFSD_IO_DONTCACHE is specified by writing 1 to NFSD's debugfs
> -interfaces, FOP_DONTCACHE must be advertised as supported by the
> -underlying filesystem (e.g. XFS), otherwise all IO flagged with
> -RWF_DONTCACHE will fail with -EOPNOTSUPP.
> +NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
> +filesystem doesn't indicate support by setting FOP_DONTCACHE.
>  
>  NFSD DIRECT
>  ===========
> @@ -115,11 +114,6 @@ Misaligned READ:
>      verified to have proper offset/len (logical_block_size) and
>      dma_alignment checking.
>  
> -    Any misaligned READ that is less than 32K won't be expanded to be
> -    DIO-aligned (this heuristic just avoids excess work, like allocating
> -    start_extra_page, for smaller IO that can generally already perform
> -    well using buffered IO).
> -
>  Misaligned WRITE:
>      If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
>      middle and end as needed. The large middle segment is DIO-aligned

Thanks, that saves me some time!


-- 
Chuck Lever
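
For readers following the thread, the misaligned READ and WRITE handling
described in the quoted document can be illustrated with a little offset
arithmetic. This is only a sketch under stated assumptions, not NFSD's
code: the function names, the segment labels, and the single-segment
fallback for a WRITE that contains no whole block are all hypothetical,
and block_size stands in for the block device's logical_block_size. The
dma_alignment requirement on the memory buffer is not modeled.

```python
from typing import List, Tuple


def _check_block_size(block_size: int) -> None:
    # Assumption: the block size is a positive power of two (e.g. 512, 4096).
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a positive power of two")


def expand_read(offset: int, length: int, block_size: int) -> Tuple[int, int]:
    """Return (start, end) of the smallest block-aligned byte range that
    covers [offset, offset + length), i.e. expanded on either end."""
    _check_block_size(block_size)
    mask = block_size - 1
    start = offset & ~mask                 # round the start down
    end = (offset + length + mask) & ~mask  # round the end up
    return start, end


def split_write(offset: int, length: int,
                block_size: int) -> List[Tuple[str, int, int]]:
    """Split [offset, offset + length) into (label, offset, length) pieces:
    an optional unaligned start, a block-aligned middle, an optional
    unaligned end. Hypothetical fallback: if no whole block fits, return
    one piece labeled "unaligned"."""
    _check_block_size(block_size)
    mask = block_size - 1
    end = offset + length
    mid_start = (offset + mask) & ~mask    # first aligned offset >= offset
    mid_end = end & ~mask                  # last aligned offset <= end
    if mid_start >= mid_end:
        return [("unaligned", offset, length)]
    pieces = []
    if offset < mid_start:
        pieces.append(("start", offset, mid_start - offset))
    pieces.append(("middle", mid_start, mid_end - mid_start))
    if mid_end < end:
        pieces.append(("end", mid_end, end - mid_end))
    return pieces
```

For example, with a 4096-byte block size, a 5000-byte READ at offset 1000
expands to the range 0..8192, and a 10000-byte WRITE at offset 1000 splits
into a 3096-byte start, a 4096-byte aligned middle at offset 4096, and a
2808-byte end. How the unaligned start and end pieces are actually issued
is not covered by this sketch.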
