From: Jeff Layton <jlayton@kernel.org>
To: Mike Snitzer <snitzer@kernel.org>, Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Jens Axboe <axboe@kernel.dk>
Subject: Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
Date: Wed, 11 Jun 2025 08:55:20 -0400	[thread overview]
Message-ID: <54acf3548634f5a46fa261fc2ab3fdbf86938c1c.camel@kernel.org> (raw)
In-Reply-To: <20250610205737.63343-1-snitzer@kernel.org>

On Tue, 2025-06-10 at 16:57 -0400, Mike Snitzer wrote:
> Hi,
> 
> This series introduces 'enable-dontcache' to NFSD's debugfs interface.
> Once enabled, NFSD will selectively make use of O_DIRECT when issuing
> read and write IO:
> - all READs will use O_DIRECT (both aligned and misaligned)
> - all DIO-aligned WRITEs will use O_DIRECT (useful for SUNRPC RDMA)
> - misaligned WRITEs currently continue to use normal buffered IO
> 
> Q: Why not actually use RWF_DONTCACHE (yet)?
> A: 
> If IO is properly DIO-aligned, or can be made to be, using
> O_DIRECT is preferred over DONTCACHE because of its reduced CPU and
> memory usage.  Relative to NFSD using RWF_DONTCACHE for misaligned
> WRITEs, I've briefly discussed with Jens that follow-on dontcache work
> is needed to justify falling back to actually using RWF_DONTCACHE.
> Specifically, Hammerspace benchmarking has confirmed, as Jeff Layton
> suggested at Bakeathon, that we need dontcache to be enhanced to not
> immediately dropbehind when IO completes -- because it works against
> us (due to RMW needing to read without benefit of cache), whereas
> buffered IO enables misaligned IO to be more performant. Jens thought
> that delayed dropbehind is certainly doable but that he needed to
> reason through it further (so timing on availability is TBD). As soon
> as it is possible I'll happily switch NFSD's misaligned write IO
> fallback from normal buffered IO to actually using RWF_DONTCACHE.
> 

To be clear, my concern with *_DONTCACHE is this bit in
generic_write_sync():

        } else if (iocb->ki_flags & IOCB_DONTCACHE) {
                struct address_space *mapping = iocb->ki_filp->f_mapping;

                filemap_fdatawrite_range_kick(mapping, iocb->ki_pos - count,
                                              iocb->ki_pos - 1);
        }
                                                                                                    
I understand why it was done, but it means that we're kicking off
writeback for small ranges after every write. I think we'd be better
served by allowing for a little batching, and just kick off writeback
(maybe even for the whole inode) after a short delay. IOW, I agree with
Dave Chinner that we need some sort of writebehind window.

The dropbehind part (where we drop it from the pagecache after
writeback completes) is fine, IMO.
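To make the batching idea concrete, here is a minimal userspace sketch of what a "kick writeback after a little accumulates" policy could look like. This is purely illustrative: `wb_batch`, `wb_note_write`, `wb_kick`, and the 1MiB threshold are all invented names and numbers, not kernel API; in the kernel the kick would be something like filemap_fdatawrite_range_kick() over the merged range (or the whole inode), possibly also driven by a short timer.

```c
/* Userspace sketch of the batching idea: instead of kicking writeback
 * for each small range (as generic_write_sync() does for IOCB_DONTCACHE),
 * accumulate dirty ranges and flush one merged span once enough has
 * built up.  All names here are illustrative, not kernel API.
 */
#include <assert.h>

struct wb_batch {
	long long start;	/* lowest dirty offset, -1 if empty */
	long long end;		/* one past the highest dirty offset */
	long long bytes;	/* bytes queued since the last kick */
	int kicks;		/* how many times writeback was started */
};

#define WB_BATCH_THRESHOLD (1 << 20)	/* kick after ~1MiB accumulates */

static void wb_kick(struct wb_batch *b)
{
	if (b->start < 0)
		return;
	/* In the kernel this would be filemap_fdatawrite_range_kick()
	 * over [start, end - 1], or perhaps the whole inode. */
	b->kicks++;
	b->start = -1;
	b->end = 0;
	b->bytes = 0;
}

static void wb_note_write(struct wb_batch *b, long long pos, long long count)
{
	if (b->start < 0) {
		b->start = pos;
		b->end = pos + count;
	} else {
		if (pos < b->start)
			b->start = pos;
		if (pos + count > b->end)
			b->end = pos + count;
	}
	b->bytes += count;
	if (b->bytes >= WB_BATCH_THRESHOLD)
		wb_kick(b);	/* a timer would also kick after a short delay */
}
```

With 256 sequential 8KiB writes this fires writeback twice instead of 256 times, which is the batching win being argued for; a real writebehind window would additionally bound how long dirty data can sit before the kick.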

> Continuing with what this patchset provides:
> 
> NFSD now uses STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get and store
> DIO alignment attributes from underlying filesystem in associated
> nfsd_file.  This is done when the nfsd_file is first opened for a
> regular file.
> 
> A new RWF_DIRECT flag is added to include/uapi/linux/fs.h to allow
> NFSD to use O_DIRECT on a per-IO basis.
> 
> If enable-dontcache=1 then RWF_DIRECT will be set for all READ IO
> (even if the IO is misaligned, thanks to expanding the read to be
> aligned for use with DIO, as suggested by Jeff and Chuck at the NFS
> Bakeathon held recently in Ann Arbor).
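The "expand the read to be aligned" step above can be sketched as pure arithmetic: round the requested offset down and the end up to the DIO offset alignment, then remember how many leading bytes to discard when copying the result back. The struct and helper name here are illustrative, not the NFSD implementation; the alignment is assumed to be a power of two (as STATX_DIOALIGN reports).

```c
/* Sketch of expanding a misaligned READ so it can be issued with
 * O_DIRECT: round the start down and the end up to the DIO offset
 * alignment, then skip the leading bytes after the read completes.
 * Names are illustrative, not the NFSD code; align must be a power
 * of two.
 */
#include <assert.h>

struct dio_read {
	unsigned long long start;	/* aligned offset to read from */
	unsigned long long len;		/* aligned length to read */
	unsigned long long skip;	/* bytes to skip to reach caller data */
};

static struct dio_read dio_expand_read(unsigned long long offset,
				       unsigned long long count,
				       unsigned long long align)
{
	struct dio_read r;

	r.start = offset & ~(align - 1);			/* round down */
	r.len = ((offset + count + align - 1) & ~(align - 1))	/* round up */
		- r.start;
	r.skip = offset - r.start;
	return r;
}
```

For example, the IOR "hard" 47008-byte IO at offset 47008 with 4096-byte alignment expands to a 49152-byte read starting at 45056, with 1952 leading bytes to skip.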
> 
> NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to
> DIO alignment (both page and disk alignment).  This works quite well
> for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it
> maps the WRITE payload into aligned pages. But more work is needed to
> be able to leverage O_DIRECT when SUNRPC's regular TCP transport is
> used. I spent quite a bit of time analyzing the existing xdr_buf code
> and NFSD's use of it.  Unfortunately, the WRITE payload gets stored in
> misaligned pages such that O_DIRECT isn't possible without a copy
> (completely defeating the point).  I'll reply to this cover letter to
> start a subthread to discuss how best to deal with misaligned write
> IO (by association with Hammerspace, I'm most interested in NFS v3).
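The per-WRITE alignment decision described above has two halves: the file offset and length must honor the disk alignment, and the payload buffer itself must honor the memory alignment, both as reported by STATX_DIOALIGN (stx_dio_offset_align and stx_dio_mem_align). A hedged sketch of that check, with an invented helper name:

```c
/* Sketch of the per-WRITE alignment test: O_DIRECT is only usable when
 * offset/length honor the disk alignment and the payload buffer honors
 * the memory alignment, both reported via STATX_DIOALIGN.  The helper
 * name is illustrative; alignments are powers of two, and zero means
 * the filesystem reported no DIO support.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool dio_write_aligned(uint64_t offset, uint64_t count,
			      const void *buf,
			      uint32_t mem_align, uint32_t off_align)
{
	if (mem_align == 0 || off_align == 0)
		return false;
	if ((offset | count) & (off_align - 1))
		return false;
	if ((uintptr_t)buf & (mem_align - 1))
		return false;
	return true;
}
```

This is why RDMA WRITEs pass easily (the payload lands in aligned pages) while TCP WRITEs fail the third test: the buffer-address check, not the offset math, is what rules out O_DIRECT without a copy.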
> 

Tricky problem. svc_tcp_recvfrom() just slurps the whole RPC into the
rq_pages array. To get alignment right, you'd probably have to do the
receive in a much more piecemeal way.

Basically, you'd need to decode as you receive chunks of the message,
and look out for WRITEs, and then set it up so that their payloads are
received with proper alignment.
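The shape of that piecemeal receive can be sketched in userspace: pull the header into scratch space first, decode enough to know a payload follows, then land the payload directly into page-aligned memory. The 16-byte fixed "header" below is a stand-in for real RPC record-marker/XDR decoding, and `recv_exact` is an invented helper; the point is only the two-phase receive into differently-aligned destinations.

```c
/* Userspace sketch of the "piecemeal receive" idea: read the RPC
 * header into scratch space, decode it, then receive the WRITE payload
 * into page-aligned memory so O_DIRECT stays possible.  The fixed
 * 16-byte "header" stands in for real RPC/XDR decoding.
 */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define FAKE_HDR_LEN 16
#define PAGE_ALIGN 4096

/* Loop until exactly len bytes arrive (TCP may deliver partial reads). */
static ssize_t recv_exact(int fd, void *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = read(fd, (char *)buf + done, len - done);
		if (n <= 0)
			return -1;
		done += n;
	}
	return (ssize_t)done;
}
```

A socketpair makes a convenient stand-in for the TCP transport: write header+payload from one end, then on the other end read the header into a stack buffer and the payload into posix_memalign()'d memory, verifying the payload address is page-aligned on arrival.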

Anyway, separate thread to discuss that sounds good.
 
> Performance benefits of using O_DIRECT in NFSD:
> 
> Hammerspace's testbed was 10 NFS servers connected via 800Gbit
> RDMA networking (mlx5_core), each with 1TB of memory, 48 cores (2 NUMA
> nodes) and 8 ScaleFlux NVMe devices, each with two 3.5TB namespaces
> (theoretical max for reads per NVMe device is 14GB/s, or ~7GB/s per
> namespace).
> 
> And 10 client systems each running 64 IO threads.
> 
> The O_DIRECT performance win is pretty fantastic thanks to reduced CPU
> and memory use, particularly for workloads with a working set that far
> exceeds the available memory of a given server.  This patchset's
> changes (through patch 5; patch 6 wasn't written until after the
> benchmarking was performed) enabled Hammerspace to improve its IO500.org
> benchmark result (as submitted for this week's ISC 2025 in Hamburg,
> Germany) by 25%.
> 
> That 25% improvement on IO500 is owed to NFS servers seeing:
> - reduced CPU usage from 100% to ~50%
>   O_DIRECT:
>   write: 51% idle, 25% system,   14% IO wait,   2% IRQ
>   read:  55% idle,  9% system, 32.5% IO wait, 1.5% IRQ
>   buffered:
>   write: 17.8% idle, 67.5% system,   8% IO wait,  2% IRQ
>   read:  3.29% idle, 94.2% system, 2.5% IO wait,  1% IRQ
> 
> - reduced memory usage from just under 100% (987GiB for reads, 978GiB
>   for writes) to only ~244 MB for cache+buffer use (for both reads and
>   writes).
>   - buffered would tip-over due to kswapd and kcompactd struggling to
>     find free memory during reclaim.
> 
> - increased NVMe throughput when comparing O_DIRECT vs buffered:
>   O_DIRECT: 8-10 GB/s for writes, 9-11.8 GB/s for reads
>   buffered: 8 GB/s for writes,    4-5 GB/s for reads
> 
> - ability to support more IO threads per client system (from 48 to 64)
> 
> The performance improvement highlight of the numerous individual tests
> in the IO500 collection of benchmarks was in the IOR "easy" test:
> 
> Write:
> O_DIRECT: [RESULT]      ior-easy-write     420.351599 GiB/s : time 869.650 seconds
> CACHED:   [RESULT]      ior-easy-write     368.268722 GiB/s : time 413.647 seconds
> 
> Read: 
> O_DIRECT: [RESULT]      ior-easy-read     446.790791 GiB/s : time 818.219 seconds
> CACHED:   [RESULT]      ior-easy-read     284.706196 GiB/s : time 534.950 seconds
> 

Wow!

> It is suspected that patch 6 in this patchset will improve IOR "hard"
> read results. The "hard" name comes from the fact that it performs all
> IO using a misaligned blocksize of 47008 bytes (which happens to be
> the IO size I showed ftrace output for in the 6th patch's header).
> 
> All review and discussion is welcome, thanks!
> Mike
> 
> Mike Snitzer (6):
>   NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
>   NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
>   NFSD: pass nfsd_file to nfsd_iter_read()
>   fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis
>   NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
>   NFSD: issue READs using O_DIRECT even if IO is misaligned
> 
>  fs/nfsd/debugfs.c          |  39 +++++++++++++
>  fs/nfsd/filecache.c        |  32 +++++++++++
>  fs/nfsd/filecache.h        |   4 ++
>  fs/nfsd/nfs4xdr.c          |   8 +--
>  fs/nfsd/nfsd.h             |   1 +
>  fs/nfsd/trace.h            |  37 +++++++++++++
>  fs/nfsd/vfs.c              | 111 ++++++++++++++++++++++++++++++++++---
>  fs/nfsd/vfs.h              |  17 +-----
>  include/linux/fs.h         |   2 +-
>  include/linux/sunrpc/svc.h |   5 +-
>  include/uapi/linux/fs.h    |   5 +-
>  11 files changed, 231 insertions(+), 30 deletions(-)

-- 
Jeff Layton <jlayton@kernel.org>

