From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>,
	linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Jens Axboe <axboe@kernel.dk>, Dave Chinner <david@fromorbit.com>,
	willy@infradead.org, jonathan.flynn@hammerspace.com,
	keith.mannthey@hammerspace.com
Subject: Re: [PATCH 0/6] NFSD: add enable-dontcache and initially use it to add DIO support
Date: Thu, 12 Jun 2025 15:08:50 -0400	[thread overview]
Message-ID: <aEslwqa9iMeZjjlV@kernel.org> (raw)
In-Reply-To: <7944f6aa-d797-4323-98dc-7ef7e388c69f@oracle.com>

On Thu, Jun 12, 2025 at 09:46:12AM -0400, Chuck Lever wrote:
> On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > The O_DIRECT performance win is pretty fantastic thanks to reduced CPU
> > and memory use, particularly for workloads with a working set that far
> > exceeds the available memory of a given server.  This patchset's
> > changes (through patch 5; patch 6 wasn't written until after the
> > benchmarking was performed) enabled Hammerspace to improve its IO500.org
> > benchmark result (as submitted for this week's ISC 2025 in Hamburg,
> > Germany) by 25%.
> > 
> > That 25% improvement on IO500 is owed to NFS servers seeing:
> > - reduced CPU usage from 100% to ~50%

Apples: 10 servers, 10 clients, 64 PPN (Processes Per Node):

> >   O_DIRECT:
> >   write: 51% idle, 25% system,   14% IO wait,   2% IRQ
> >   read:  55% idle,  9% system, 32.5% IO wait, 1.5% IRQ

Oranges: 6 servers, 6 clients, 128 PPN:

> >   buffered:
> >   write: 17.8% idle, 67.5% system,   8% IO wait,  2% IRQ
> >   read:  3.29% idle, 94.2% system, 2.5% IO wait,  1% IRQ
> 
> The IO wait and IRQ numbers for the buffered results appear to be
> significantly better than for O_DIRECT. Can you help us understand
> that? Is device utilization better or worse with O_DIRECT?

It was a post-mortem data analysis fail: when I worked with others
(Jon and Keith, cc'd) to collect performance data for the cover
letter (quoted above), we didn't have buffered IO performance data
from the full IO500-scale benchmark (10 servers, 10 clients).  I
missed that the above married apples and oranges until you noticed
something was off...

Sorry about that.  We don't currently have the full 10 nodes
available, so Keith re-ran IOR "easy" testing with 6 nodes to collect
new data.

NOTE: his run used a much larger PPN (128 instead of 64) and fewer
client and server nodes (6 of each instead of 10).

Here is CPU usage for one of the server nodes while running IOR "easy" 
with 128 PPN on each of 6 clients, against 6 servers:

- reduced CPU usage from 100% to ~56% for read and ~71.6% for write
  O_DIRECT:
  write: 28.4% idle, 50% system,   13% IO wait,   2% IRQ
  read:    44% idle, 11% system,   39% IO wait, 1.8% IRQ
  buffered:
  write:   17% idle, 68.4% system, 8.5% IO wait,   2% IRQ
  read:  3.51% idle, 94.5% system,   0% IO wait, 0.6% IRQ

And associated NVMe performance:

- increased NVMe throughput when comparing O_DIRECT vs buffered:
  O_DIRECT: 10.5 GB/s for writes, 11.6 GB/s for reads
  buffered: 7.75-8 GB/s for writes; reads were 4 GB/s before tip-over but 800 MB/s after

("tipover" is when reclaim starts to dominate due to inability to
efficiently find free pages so kswapd and kcompactd burn a lot of
resources).

And again here is the associated 6 node IOR easy NVMe performance in
graph form: https://original.art/NFSD_direct_vs_buffered_IO.jpg

> > - reduced memory usage from just under 100% (987GiB for reads, 978GiB
> >   for writes) to only ~244 MB for cache+buffer use (for both reads and
> >   writes).
> >   - buffered would tip-over due to kswapd and kcompactd struggling to
> >     find free memory during reclaim.

This memory usage data still holds with the 6 server testbed.

> > - increased NVMe throughput when comparing O_DIRECT vs buffered:
> >   O_DIRECT: 8-10 GB/s for writes, 9-11.8 GB/s for reads
> >   buffered: 8 GB/s for writes,    4-5 GB/s for reads

And again, here is the end result for the IOR easy benchmark:

From Hammerspace's 10 node IO500 reported summary of IOR easy result:
Write:
O_DIRECT: [RESULT]      ior-easy-write     420.351599 GiB/s : time 869.650 seconds
CACHED:   [RESULT]      ior-easy-write     368.268722 GiB/s : time 413.647 seconds

Read: 
O_DIRECT: [RESULT]      ior-easy-read     446.790791 GiB/s : time 818.219 seconds
CACHED:   [RESULT]      ior-easy-read     284.706196 GiB/s : time 534.950 seconds

And from Hammerspace's 6 node run, in IOR's summary output format (as
opposed to the IO500-reported summary from the 10 node result above):

Write:
IO mode:  access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
--------  ------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
O_DIRECT  write     348132     348133     0.002035    278921216  1024.00    0.040978   600.89     384.04     600.90     0   
CACHED    write     295579     295579     0.002416    278921216  1024.00    0.051602   707.73     355.27     707.73     0   

IO mode:  access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
--------  ------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
O_DIRECT  read      347971     347973     0.001928    278921216  1024.00    0.017612   601.17     421.30     601.17     0   
CACHED    read      60653      60653      0.006894    278921216  1024.00    0.017279   3448.99    2975.23    3448.99    0   

> > - ability to support more IO threads per client system (from 48 to 64)
> 
> This last item: how do you measure the "ability to support more
> threads"? Is there a latency curve that is flatter? Do you see changes
> in the latency distribution and the number of latency outliers?

Mainly in the context of the IOR benchmark's result: we can see the
point at which increasing PPN becomes detrimental, because the score
stops improving or gets worse.

> My general comment here is kind of in the "related or future work"
> category. This is not an objection, just thinking out loud.
> 
> But, can we get more insight into specifically where the CPU
> utilization reduction comes from? Is it lock contention? Is it
> inefficient data structure traversal? Any improvement here benefits
> everyone, so that should be a focus of some study.

Buffered IO just commands more resources than O_DIRECT for workloads
with a working set that exceeds system memory.

Each of the 6 servers has 1TiB of memory.

So for the above 6 client, 128 PPN IOR "easy" run, each client thread
is writing and then reading 266 GiB.  That creates an aggregate
working set of 199.50 TiB (6 clients * 128 PPN * 266 GiB).

The 199.50 TiB working set dwarfs the servers' aggregate 6 TiB of
memory.  Being able to drive each of the 8 NVMe drives in each server
as efficiently as possible is critical.

As you can see from the NVMe performance numbers above, O_DIRECT is best.

> If the memory utilization is a problem, that sounds like an issue with
> kernel systems outside of NFSD, or perhaps some system tuning can be
> done to improve matters. Again, drilling into this and trying to improve
> it will benefit everyone.

Yeah, there is an extensive, iceberg-scale issue with buffered IO and
MM (reclaim's use of kswapd and kcompactd to find free pages) that
underpins the justification for RWF_DONTCACHE being developed and
merged.  I'm not the best person to speak to all the long-standing
challenges (Willy, Dave, Jens and others would be better).
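
(For anyone who hasn't played with it yet: RWF_DONTCACHE is a per-IO
flag to preadv2()/pwritev2() that asks for buffered IO whose pages are
dropped from the page cache once writeback completes, rather than left
for reclaim to chew through later.  Untested userspace sketch; it
assumes a kernel and filesystem new enough to support the flag, and
the fallback #define of the UAPI value is my assumption if your
headers don't carry it yet:

/* dontcache_write.c: sketch of a per-IO uncached buffered write. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080	/* assumed UAPI value if headers lack it */
#endif

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	static char buf[1 << 20];	/* 1 MiB of payload */
	memset(buf, 'x', sizeof(buf));
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

	/* Buffered write that asks the kernel to drop the pages from the
	 * page cache once writeback completes, instead of letting them
	 * pile up for kswapd/kcompactd to deal with later.  Fails with
	 * EOPNOTSUPP on kernels/filesystems without the support. */
	ssize_t ret = pwritev2(fd, &iov, 1, 0, RWF_DONTCACHE);
	if (ret < 0)
		perror("pwritev2(RWF_DONTCACHE)");
	else
		printf("wrote %zd bytes uncached\n", ret);

	close(fd);
	return 0;
}
)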

> These results do point to some problems, clearly. Whether NFSD using
> direct I/O is the best solution is not obvious to me yet.

All solutions are on the table.  O_DIRECT just happens to be the most
straightforward to work through at this point.

Dave Chinner's feeling that O_DIRECT is a much better solution than
RWF_DONTCACHE for NFSD certainly helped narrow my focus too, from:
https://lore.kernel.org/linux-nfs/aBrKbOoj4dgUvz8f@dread.disaster.area/

"The nfs client largely aligns all of the page caceh based IO, so I'd
think that O_DIRECT on the server side would be much more performant
than RWF_DONTCACHE. Especially as XFS will do concurrent O_DIRECT
writes all the way down to the storage....."

(Dave would be correct about NFSD's page alignment if RDMA is used,
but that is obviously not the case with TCP, because SUNRPC TCP
receives the WRITE payload into misaligned pages.)

Thanks,
Mike
