linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
@ 2025-06-10 20:57 Mike Snitzer
  2025-06-10 20:57 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Mike Snitzer
                   ` (8 more replies)
  0 siblings, 9 replies; 75+ messages in thread
From: Mike Snitzer @ 2025-06-10 20:57 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs, linux-fsdevel, Jens Axboe

Hi,

This series introduces 'enable-dontcache' to NFSD's debugfs interface,
once enabled NFSD will selectively make use of O_DIRECT when issuing
read and write IO:
- all READs will use O_DIRECT (both aligned and misaligned)
- all DIO-aligned WRITEs will use O_DIRECT (useful for SUNRPC RDMA)
- misaligned WRITEs currently continue to use normal buffered IO

Q: Why not actually use RWF_DONTCACHE (yet)?
A: 
If IO can is properly DIO-aligned, or can be made to be, using
O_DIRECT is preferred over DONTCACHE because of its reduced CPU and
memory usage.  Relative to NFSD using RWF_DONTCACHE for misaligned
WRITEs, I've briefly discussed with Jens that follow-on dontcache work
is needed to justify falling back to actually using RWF_DONTCACHE.
Specifically, Hammerspace benchmarking has confirmed as Jeff Layton
suggested at Bakeathon, we need dontcache to be enhanced to not
immediately dropbehind when IO completes -- because it works against
us (due to RMW needing to read without benefit of cache), whereas
buffered IO enables misaligned IO to be more performant. Jens thought
that delayed dropbehind is certainly doable but that he needed to
reason through it further (so timing on availability is TBD). As soon
as it is possible I'll happily switch NFSD's misaligned write IO
fallback from normal buffered IO to actually using RWF_DONTCACHE.

Continuing with what this patchset provides:

NFSD now uses STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get and store
DIO alignment attributes from underlying filesystem in associated
nfsd_file.  This is done when the nfsd_file is first opened for a
regular file.

A new RWF_DIRECT flag is added to include/uapi/linux/fs.h to allow
NFSD to use O_DIRECT on a per-IO basis.

If enable-dontcache=1 then RWF_DIRECT will be set for all READ IO
(even if the IO is misaligned, thanks to expanding the read to be
aligned for use with DIO, as suggested by Jeff and Chuck at the NFS
Bakeathon held recently in Ann Arbor).

NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to
DIO alignment (both page and disk alignment).  This works quite well
for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it
maps the WRITE payload into aligned pages. But more work is needed to
be able to leverage O_DIRECT when SUNRPC's regular TCP transport is
used. I spent quite a bit of time analyzing the existing xdr_buf code
and NFSD's use of it.  Unfortunately, the WRITE payload gets stored in
misaligned pages such that O_DIRECT isn't possible without a copy
(completely defeating the point).  I'll reply to this cover letter to
start a subthread to discuss how best to deal with misaligned write
IO (by association with Hammerspace, I'm most interested in NFS v3).

Performance benefits of using O_DIRECT in NFSD:

Hammerspace's testbed was 10 NFS servers connected via 800Gbit
RDMA networking (mlx5_core), each with 1TB of memory, 48 cores (2 NUMA
nodes) and 8 ScaleFlux NVMe devices (each with two 3.5TB namespaces.
Theoretical max for reads per NVMe device is 14GB/s, or ~7GB/s per
namespace).

And 10 client systems each running 64 IO threads.

The O_DIRECT performance win is pretty fantastic thanks to reduced CPU
and memory use, particularly for workloads with a working set that far
exceeds the available memory of a given server.  This patchset's
changes (though patch 5, patch 6 wasn't written until after
benchmarking performed) enabled Hammerspace to improve its IO500.org
benchmark result (as submitted for this week's ISC 2025 in Hamburg,
Germany) by 25%.

That 25% improvement on IO500 is owed to NFS servers seeing:
- reduced CPU usage from 100% to ~50%
  O_DIRECT:
  write: 51% idle, 25% system,   14% IO wait,   2% IRQ
  read:  55% idle,  9% system, 32.5% IO wait, 1.5% IRQ
  buffered:
  write: 17.8% idle, 67.5% system,   8% IO wait,  2% IRQ
  read:  3.29% idle, 94.2% system, 2.5% IO wait,  1% IRQ

- reduced memory usage from just under 100% (987GiB for reads, 978GiB
  for writes) to only ~244 MB for cache+buffer use (for both reads and
  writes).
  - buffered would tip-over due to kswapd and kcompactd struggling to
    find free memory during reclaim.

- increased NVMe throughtput when comparing O_DIRECT vs buffered:
  O_DIRECT: 8-10 GB/s for writes, 9-11.8 GB/s for reads
  buffered: 8 GB/s for writes,    4-5 GB/s for reads

- abiliy to support more IO threads per client system (from 48 to 64)

The performance improvement highlight of the numerous individual tests
in the IO500 collection of benchamrks was in the IOR "easy" test:

Write:
O_DIRECT: [RESULT]      ior-easy-write     420.351599 GiB/s : time 869.650 seconds
CACHED:   [RESULT]      ior-easy-write     368.268722 GiB/s : time 413.647 seconds

Read: 
O_DIRECT: [RESULT]      ior-easy-read     446.790791 GiB/s : time 818.219 seconds
CACHED:   [RESULT]      ior-easy-read     284.706196 GiB/s : time 534.950 seconds

It is suspected that patch 6 in this patchset will improve IOR "hard"
read results. The "hard" name comes from the fact that it performs all
IO using a mislaigned blocksize of 47008 bytes (which happens to be
the IO size I showed ftrace output for in the 6th patch's header).

All review and discussion is welcome, thanks!
Mike

Mike Snitzer (6):
  NFSD: add the ability to enable use of RWF_DONTCACHE for all IO
  NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  NFSD: pass nfsd_file to nfsd_iter_read()
  fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis
  NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes
  NFSD: issue READs using O_DIRECT even if IO is misaligned

 fs/nfsd/debugfs.c          |  39 +++++++++++++
 fs/nfsd/filecache.c        |  32 +++++++++++
 fs/nfsd/filecache.h        |   4 ++
 fs/nfsd/nfs4xdr.c          |   8 +--
 fs/nfsd/nfsd.h             |   1 +
 fs/nfsd/trace.h            |  37 +++++++++++++
 fs/nfsd/vfs.c              | 111 ++++++++++++++++++++++++++++++++++---
 fs/nfsd/vfs.h              |  17 +-----
 include/linux/fs.h         |   2 +-
 include/linux/sunrpc/svc.h |   5 +-
 include/uapi/linux/fs.h    |   5 +-
 11 files changed, 231 insertions(+), 30 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2025-07-04 19:49 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-10 20:57 [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Mike Snitzer
2025-06-10 20:57 ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Mike Snitzer
2025-06-11  6:57   ` Christoph Hellwig
2025-06-11 10:44     ` Mike Snitzer
2025-06-11 13:04       ` Jeff Layton
2025-06-11 13:56     ` Chuck Lever
2025-06-11 14:31   ` Chuck Lever
2025-06-11 19:18     ` Mike Snitzer
2025-06-11 20:29       ` Jeff Layton
2025-06-11 21:36         ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] Mike Snitzer
2025-06-12 10:28           ` Jeff Layton
2025-06-12 11:28             ` Jeff Layton
2025-06-12 13:28             ` Chuck Lever
2025-06-12 14:17               ` Benjamin Coddington
2025-06-12 15:56                 ` Mike Snitzer
2025-06-12 15:58                   ` Chuck Lever
2025-06-12 16:12                     ` Mike Snitzer
2025-06-12 16:32                       ` Chuck Lever
2025-06-13  5:39                     ` Christoph Hellwig
2025-06-12 16:22               ` Jeff Layton
2025-06-13  5:46                 ` Christoph Hellwig
2025-06-13  9:23                   ` Mike Snitzer
2025-06-13 13:02                     ` Jeff Layton
2025-06-16 12:35                       ` Christoph Hellwig
2025-06-16 12:29                     ` Christoph Hellwig
2025-06-16 16:07                       ` Mike Snitzer
2025-06-17  4:37                         ` Christoph Hellwig
2025-06-17 20:26                           ` Mike Snitzer
2025-06-17 22:23                             ` [RFC PATCH] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec [was: Re: need SUNRPC TCP to receive into aligned pages] Mike Snitzer
2025-07-03  0:12             ` need SUNRPC TCP to receive into aligned pages [was: Re: [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO] NeilBrown
2025-06-12  7:13         ` [PATCH 1/6] NFSD: add the ability to enable use of RWF_DONTCACHE for all IO Christoph Hellwig
2025-06-12 13:15           ` Chuck Lever
2025-06-12 13:21       ` Chuck Lever
2025-06-12 16:00         ` Mike Snitzer
2025-06-16 13:32           ` Chuck Lever
2025-06-16 16:10             ` Mike Snitzer
2025-06-17 17:22               ` Mike Snitzer
2025-06-17 17:31                 ` Chuck Lever
2025-06-19 20:19                   ` Mike Snitzer
2025-06-30 14:50                     ` Chuck Lever
2025-07-04 19:46                       ` Mike Snitzer
2025-07-04 19:49                         ` Chuck Lever
2025-06-10 20:57 ` [PATCH 2/6] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-06-10 20:57 ` [PATCH 3/6] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-06-10 20:57 ` [PATCH 4/6] fs: introduce RWF_DIRECT to allow using O_DIRECT on a per-IO basis Mike Snitzer
2025-06-11  6:58   ` Christoph Hellwig
2025-06-11 10:51     ` Mike Snitzer
2025-06-11 14:17     ` Chuck Lever
2025-06-12  7:15       ` Christoph Hellwig
2025-06-10 20:57 ` [PATCH 5/6] NFSD: leverage DIO alignment to selectively issue O_DIRECT reads and writes Mike Snitzer
2025-06-11  7:00   ` Christoph Hellwig
2025-06-11 12:23     ` Mike Snitzer
2025-06-11 13:30       ` Jeff Layton
2025-06-12  7:22         ` Christoph Hellwig
2025-06-12  7:23       ` Christoph Hellwig
2025-06-11 14:42   ` Chuck Lever
2025-06-11 15:07     ` Jeff Layton
2025-06-11 15:11       ` Chuck Lever
2025-06-11 15:44         ` Jeff Layton
2025-06-11 20:51           ` Mike Snitzer
2025-06-12  7:32           ` Christoph Hellwig
2025-06-12  7:28         ` Christoph Hellwig
2025-06-12  7:25       ` Christoph Hellwig
2025-06-10 20:57 ` [PATCH 6/6] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-06-11 12:55 ` [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support Jeff Layton
2025-06-12  7:39   ` Christoph Hellwig
2025-06-12 20:37     ` Mike Snitzer
2025-06-13  5:31       ` Christoph Hellwig
2025-06-11 14:16 ` Chuck Lever
2025-06-11 18:02   ` Mike Snitzer
2025-06-11 19:06     ` Chuck Lever
2025-06-11 19:58       ` Mike Snitzer
2025-06-12 13:46 ` Chuck Lever
2025-06-12 19:08   ` Mike Snitzer
2025-06-12 20:17     ` Chuck Lever

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).